Time aggregation case study

Whenever possible, we will use case studies and experiments to illustrate general concepts and specific features of our framework. We will try to create such examples based on different types of data. In some cases, we will use data streams within the context of a related application domain (e.g. healthcare or security). In other cases, we will use them as generic data samples, without considering their sources or possible goals of analysis. This post belongs to the second group, as we will use a stream of stock prices to illustrate some general challenges related to the nature of the time dimension and the characteristics of time-oriented data.

In this post we look at time aggregation, which is a transformation of data from a higher-frequency time granularity into lower-frequency statistics. This is a very common and practical scenario, applicable whenever we need to calculate aggregated statistics over intervals, for example a sum of weekly sales or a count of server responses per minute. The stock price data we use are themselves a product of time aggregation - they are the last values for daily intervals (specifically, adjusted close prices). The chart with the input stock price stream before aggregation is presented in Figure 1 (with horizontal lines for the maximum, average and minimum values across the stream).

Figure 1: Input data stream with prices of MSFT stock since the beginning of 2015
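To make the mechanics concrete before we turn to the stock data, here is a minimal sketch in Python using pandas (our choice for illustration, not a part of the framework); the daily sales figures are hypothetical.

```python
import pandas as pd

# Hypothetical daily sales figures, used only to illustrate the shape
# of a time aggregation (they are not part of our stock price stream).
daily_sales = pd.Series(
    [120.0, 95.5, 130.25, 88.0, 142.75, 110.5, 99.0,
     105.0, 98.5, 121.0, 133.5, 87.25, 116.0, 140.0],
    index=pd.date_range("2015-01-05", periods=14, freq="D"),
)

# Transform the higher-frequency (daily) stream into lower-frequency
# (weekly) statistics: one sum per calendar week.
weekly_sales = daily_sales.resample("W").sum()
print(weekly_sales)
```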

Time aggregation is a simple transformation that can be executed with a single command or a few intuitive gestures. It is usually parametrized with an aggregation interval and a summary function (e.g. count, average or standard deviation). The chart in Figure 2 shows our original data stream against a background of new data series created by aggregating the stream by MONTH intervals with the MAX/AVG/MIN functions. It is worth noting that the data series in Figure 2 use both point (stock prices in navy) and interval (aggregated values in orange) time models. Since the base time granularity is determined by the stock prices (a day), we are using a stepped line for the monthly summaries.

Figure 2: Input data stream aggregated by month with MAX/AVG/MIN summary functions
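The same kind of aggregation can be sketched in pandas (again a stand-in for the framework; the random walk below merely imitates a price stream):

```python
import numpy as np
import pandas as pd

# A random walk standing in for the MSFT price stream (hypothetical data).
rng = np.random.default_rng(0)
days = pd.date_range("2015-01-01", "2015-06-30", freq="B")  # business days
prices = pd.Series(46.0 + rng.normal(0, 0.5, len(days)).cumsum(), index=days)

# The aggregation is parametrized with an interval (MONTH) and summary
# functions (MAX/AVG/MIN), as in Figure 2.
monthly = prices.resample("MS").agg(["max", "mean", "min"])
print(monthly)
```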

The first visible challenge with such a transformation is the different number of days in a month, and thus the variable duration of aggregation intervals. The mapping between different time granularities can also be problematic on other levels, for example in business scenarios where analysis is focused on workweeks that are affected by irregular holidays. Analysis scenarios become more complex still if we deal not only with the time dimension but also with locations. Aggregating multiple data streams from different geographical points (e.g. sales across the country) can be affected by time zones, especially if short intervals are selected (e.g. daily statistics).
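The time zone effect is easy to demonstrate with a sketch (again pandas, with made-up timestamps): the same four events produce different daily counts depending on which zone defines the interval boundaries.

```python
import pandas as pd

# Four hypothetical events timestamped in UTC, close to midnight.
events = pd.Series(
    1,
    index=pd.to_datetime(
        ["2015-03-31 22:30", "2015-03-31 23:45",
         "2015-04-01 00:15", "2015-04-01 01:30"],
        utc=True,
    ),
)

# Daily counts depend on the zone that defines the interval boundaries:
# in UTC the events split 2/2 across March 31 and April 1, while in
# New York local time all four fall on March 31.
print(events.resample("D").count())
print(events.tz_convert("America/New_York").resample("D").count())
```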

When using time aggregation, we need to pay attention not only to the time dimension but also to other data elements, especially when selecting summary functions. The applicability of these functions depends on the characteristics of the analyzed data, e.g. calculating an average will not work with values on nominal or ordinal scales. But it also depends on the context of the data. In our case it wouldn’t make sense to calculate a sum of prices per interval, since our data points describe a state (the price of an asset). However, if exactly the same values (including currency) were related to events, like sale transactions, calculating a sum per interval (total sales) could be very useful.
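A small sketch of this distinction (hypothetical numbers; the choice of pandas is ours, not the framework's):

```python
import pandas as pd

values = pd.Series(
    [46.5, 47.0, 46.8, 47.3, 46.9],
    index=pd.date_range("2015-01-05", periods=5, freq="D"),
)

# As a state (the price of an asset), a sum per interval is meaningless;
# summaries like "last" or "mean" preserve the semantics.
as_state = values.resample("W").last()

# If exactly the same numbers measured events (e.g. sale transactions),
# a sum per interval, i.e. total weekly sales, is a natural summary.
as_events = values.resample("W").sum()
print(as_state, as_events, sep="\n")
```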

We will discuss more detailed challenges related to analysis of time-oriented data in future posts. We will also show how technology can help with solving these problems.

It is about time

“Why do you limit yourselves to the time dimension?” That is something we often hear when we talk about our analysis framework for time-oriented data. We immediately reply that it is not a limitation. The longer answer that usually follows consists of three key messages: time is everywhere (1), time is the key to understanding reality (2), and time is unique from an analysis point of view (3).

1. Time is the most common dimension. We live in a world of time-oriented data. When we think about time in the context of data analysis, we usually imagine a chart with univariate time series like stock prices. But the time component is much more common - it is present in most models created for phenomena we are interested in. We can talk about time-oriented data even if a single variable of such a model is associated with the time dimension. This applies to structured and unstructured data, for example to a collection of pictures with timestamps in their EXIF metadata. And with data stored digitally, even if there is no specific time variable in a model, we usually get information stating when a record was created or updated.

2. We conduct data analysis because we are interested in reality. We want to learn from the past and use historical data to better understand the consequences of events and actions. We also look into the future to learn about threats and opportunities and to evaluate available decision options. Surprisingly, time can also be very relevant when explaining the present – in the case of non-trivial phenomena, it may be difficult to distinguish between strong and weak elements of a model while completely ignoring temporal context. The time component in data enables us to notice, analyze and predict changes. It creates opportunities to go beyond states, to understand how they are affected by events and to provide insight into the nature of underlying processes.

3. We are better at dealing with states than processes. In the context of data analysis, time can seem both familiar and confusing. Representation of time can be based on time points or can use an interval-based model. Time values can be expressed at different levels of granularity (e.g. months and weeks), with the mapping between them not always straightforward (e.g. days and months). And there are practical issues with autocorrelation, seasonality and outliers, but also time zones, daylight saving time or holidays (e.g. workweeks). There are many techniques, especially around time series, that can help address specific analysis challenges. Unfortunately, defining an analysis problem and selecting appropriate algorithms usually require in-depth expertise.
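One illustration of a non-trivial granularity mapping, sketched in Python with pandas (the snippet only demonstrates the underlying problem, not how our framework handles it): an ISO week can span two months, so weekly values cannot simply be rolled up into monthly ones.

```python
import pandas as pd

# ISO week 14 of 2015 runs from Monday, March 30 to Sunday, April 5:
# two of its days belong to March and five to April, so the week as a
# whole cannot be assigned to a single month.
week = pd.date_range("2015-03-30", "2015-04-05", freq="D")
print(week.isocalendar().week.unique())          # a single ISO week: [14]
print(week.to_series().dt.month.value_counts())  # months 3 and 4 both appear
```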

We believe in making analysis of time-oriented data more usable and available to individuals who need to solve their problems rather than become data analysis experts. We don’t see focusing on the time dimension as a limitation but rather as a unique opportunity, since time can become the shared dimension connecting data streams from different sources in an easy and natural way.

We simply change the way of looking at data – to always keep the time dimension in mind.