Everything is a stream

There is a nice consequence of our fixation on time-oriented data - we can treat everything as a data stream. Any input stream processed by our framework is required to include a time variable, from a simple timestamp field to more complex time structures. We use this variable not only to organize the data, but also to establish a shared dimension for connecting data elements from different sources. Such an approach lets us operate on multiple streams and implement interesting scenarios with the help of various stream transformations (e.g. time aggregation) and analysis methods.
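
To make this requirement concrete, below is a minimal sketch of what a stream element could look like. It is only an illustration in Python (the StreamEvent dataclass is our naming for this post, not an actual framework API): every element carries a required time variable plus an arbitrary payload of other variables.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict

@dataclass
class StreamEvent:
    """A single element of a data stream; the time variable is mandatory."""
    timestamp: datetime                                      # shared dimension across all streams
    payload: Dict[str, Any] = field(default_factory=dict)    # other variables describing the event

# Example element: a monthly rent payment together with its location.
event = StreamEvent(datetime(2020, 1, 31), {"rent": 1200, "location": "A"})
```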

We define data streams as sequences of events (or state snapshots/changes) that are time ordered, i.e. an event with a timestamp earlier than that of a previous one is flagged as a potential error. The simplest example of such a data stream is a univariate time series consisting of a timestamp and a single value, often with additional requirements, such as equal spacing between consecutive data points. More complex objects in a data stream can include several variables describing an event, often closely related (e.g. the time and distance of a running workout, enabling calculation of average speed). In many practical application scenarios, these objects can be models (e.g. snapshots of a system’s attack surface) or collections of unstructured data (e.g. articles published on a subject).
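
As a small illustration of these requirements, the sketch below (plain pandas, with a helper name of our own choosing) checks a univariate series for the two properties mentioned above: time ordering and, optionally, equal spacing between consecutive points.

```python
import pandas as pd

def check_univariate_series(series: pd.Series) -> list[str]:
    """Flag potential issues in a univariate time series indexed by timestamps."""
    issues = []
    # Time order: an event earlier than its predecessor is a potential error.
    if not series.index.is_monotonic_increasing:
        issues.append("out-of-order timestamps")
    # Optional requirement: equal spacing between consecutive data points.
    gaps = series.index.to_series().diff().dropna()
    if gaps.nunique() > 1:
        issues.append("unequal spacing between data points")
    return issues

ts = pd.Series([1.0, 2.0, 3.0],
               index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]))
print(check_univariate_series(ts))  # ['unequal spacing between data points']
```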

The beauty of a time variable being present in all our data is that it can become a shared dimension, enabling integration of streams from different sources and building a bigger (or just cleaner) picture from seemingly disconnected elements. Let’s illustrate this with a case study based on very simple historical rent data. The input stream includes monthly rent payments and additional variables: the location of the rented place and tags indicating lease renewal events. The interim visualization of the rent series is presented in Figure 1, with location and renewal events added as annotations (Point Marker artifacts).

Figure 1: Monthly rent payments with locations and renewal events as point annotations

The history of rent payments is an obvious example of time-oriented data, but in this case study the more interesting variables are the location and the lease renewal events, all connected by time (of monthly granularity). Let’s start with the location variable - we can detect when its value changes and automatically create an annotation (Time Marker artifact) indicating an event of moving between apartments. The lease renewal variable, on the other hand, can be used to identify the critical data points at which the rent is most likely to change, and to produce a new series with the percentage difference between consecutive renewals. Figure 2 includes the results of these operations.
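
The sketch below shows how these two operations could be expressed with plain pandas on a made-up sample of the rent stream (the numbers, locations and variable names are purely illustrative; the framework artifacts themselves, such as Time Markers, are not shown here):

```python
import pandas as pd

# Hypothetical sample of the rent stream: monthly payments, location and renewal tags.
rent = pd.DataFrame(
    {
        "rent": [1200, 1200, 1250, 1250, 1400, 1400, 1450],
        "location": ["A", "A", "A", "A", "B", "B", "B"],
        "renewal": [True, False, True, False, True, False, True],
    },
    index=pd.period_range("2020-01", periods=7, freq="M"),
)

# Moving events (Time Marker candidates): months where the location changes.
moves = rent.index[rent["location"] != rent["location"].shift()][1:]

# New series: percentage difference between consecutive lease renewals.
renewals = rent.loc[rent["renewal"], "rent"]
renewal_change = renewals.pct_change().mul(100).dropna()

print("moved in:", list(moves.astype(str)))
print(renewal_change)
```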

Figure 2: Monthly payments with markers for moving and differences between lease renewals

The requirement of a time variable is one of the foundations of the framework we are creating. Time gives order to a stream of events, but also provides a frame of reference for data analysis. We can use time variables to effectively manage multiple different streams, create useful streams based on multiple input sources or extract key information from streams of unstructured data. Obviously there are challenges related to time granularities, irregular samples or overlapping intervals and data points. All these problems, however, can be solved with the help of the right technology.

Time aggregation case study

Whenever possible, we will use case studies and experiments to illustrate general concepts and specific features of our framework. We will try to create such examples based on different types of data. In some cases, we will use data streams within the context of a related application domain (e.g. healthcare or security). In other cases, we will use them as generic data samples, without considering their sources or possible goals of analysis. This post belongs to the second group, as we will use a stream of stock prices to illustrate some general challenges related to the nature of the time dimension and the characteristics of time-oriented data.

In this post we look at time aggregation, which is a transformation of data from a higher-frequency time granularity into lower-frequency statistics. This is a very common and practical scenario, applicable when we need to calculate aggregated statistics over intervals, for example a sum of weekly sales or a count of server responses per minute. The stock price data we use are also a product of time aggregation - they are the last values for daily intervals (specifically the adjusted close). The chart with the input stock price stream before aggregation is presented in Figure 1 (with horizontal lines for the maximum, average and minimum values across the stream).
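
As a quick illustration of the "count of server responses per minute" example, a few lines of pandas are enough (the events below are made up; resample is just one possible way to express the operation):

```python
import pandas as pd

# Hypothetical event stream: one row per server response, with a timestamp.
responses = pd.DataFrame(
    {"status": [200, 200, 404, 200, 500, 200]},
    index=pd.to_datetime([
        "2024-01-01 12:00:05", "2024-01-01 12:00:40", "2024-01-01 12:01:10",
        "2024-01-01 12:01:30", "2024-01-01 12:02:02", "2024-01-01 12:02:55",
    ]),
)

# Time aggregation: from individual events to a count per one-minute interval.
per_minute = responses.resample("1min").size()
print(per_minute)
```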

Figure 1: Input data stream with prices of MSFT stock since beginning of 2015

Time aggregation is a simple transformation that can be executed with a single command or a few intuitive gestures. It is usually parametrized with an aggregation interval and a summary function (e.g. count, average or standard deviation). The chart in Figure 2 shows our original data stream against the background of new data series created by aggregating the stream by MONTH intervals with MAX/AVG/MIN functions. It is worth noting that the data series in Figure 2 use both point (stock prices in navy) and interval (aggregated values in orange) time models. Since the base time granularity is determined by the stock prices (a day), we are using a stepped line for the monthly summaries.
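
With a library like pandas, the operation behind Figure 2 could look roughly like this (the file name and column names are assumptions about the input data, not part of our framework):

```python
import pandas as pd

# Assumed input: a CSV with daily MSFT quotes, columns "Date" and "Adj Close".
prices = pd.read_csv("msft.csv", index_col="Date", parse_dates=True)["Adj Close"]

# Aggregate by MONTH intervals with MAX/AVG/MIN summary functions
# ("ME" = month end in recent pandas; older versions spell it "M").
monthly = prices.resample("ME").agg(["max", "mean", "min"])
print(monthly.head())
```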

Figure 2: Input data stream aggregated by month with MAX/AVG/MIN summary functions

The first visible challenge with such a transformation is the different number of days in a month, and thus a variable duration of aggregation intervals. The mapping between different time granularities can also be problematic on other levels, for example in business scenarios where analysis is focused on workweeks that are affected by irregular holidays. Analysis scenarios can become more complex if we deal not only with the time dimension but also with locations. Aggregating multiple data streams from different geographical points (e.g. sales across the country) can be affected by time zones, especially if short intervals are selected (e.g. daily statistics).
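
The sketch below (made-up sales events, arbitrarily chosen time zones) shows how the choice of clock changes daily aggregation results: the same UTC instant can belong to different days in New York and in Warsaw.

```python
import pandas as pd

# Hypothetical sales events from geographically distributed stores, recorded in UTC.
sales = pd.Series(
    [10.0, 25.0, 40.0],
    index=pd.to_datetime(
        ["2024-03-01 23:30", "2024-03-01 23:45", "2024-03-02 06:00"], utc=True
    ),
    name="amount",
)

# Daily totals depend on which clock defines "a day": 23:30 UTC is still
# 2024-03-01 in New York, but already 2024-03-02 in Warsaw.
utc_daily = sales.resample("1D").sum()
warsaw_daily = sales.tz_convert("Europe/Warsaw").resample("1D").sum()
print(utc_daily)
print(warsaw_daily)
```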

When using time aggregation, we need to pay attention not only to the time dimension, but also to the other data elements, especially when selecting summary functions. The applicability of these functions depends on the characteristics of the analyzed data, e.g. calculating an average will not work with values on nominal or ordinal scales. But it also depends on the context of the data. In our case it wouldn’t make sense to calculate a sum of prices per interval, since our data points are related to a state (the price of an asset). However, if exactly the same values (including currency) were related to events, like sale transactions, calculating a sum or an average price per interval could be very useful.
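
A tiny example of this distinction (made-up values, pandas only for illustration): the same numbers summarized as a state versus as events call for different summary functions.

```python
import pandas as pd

idx = pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"])
values = [100.0, 105.0, 103.0]

# Treated as a state (price of an asset): a sum per month is meaningless,
# but the last observed value per month is a sensible summary.
price = pd.Series(values, index=idx)
monthly_price = price.resample("ME").last()   # "ME" = month end, as above

# Treated as events (sale transactions): a sum per month (total revenue)
# is a perfectly reasonable summary.
transactions = pd.Series(values, index=idx)
monthly_revenue = transactions.resample("ME").sum()
```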

We will discuss more detailed challenges related to the analysis of time-oriented data in future posts. We will also show how technology can help solve these problems.