Processing multiple streams

Working in a startup is about extreme resource management (which may explain the frequency of updates on this blog), and effective prioritization of tasks is the key to gaining at least some control over the chaos. One practice we have found very useful in these efforts is using real-life data with new features even during their development, in order to get actionable feedback as early as possible. Often we simply start implementing the requested scenario that was the reason for selecting a specific feature. In other cases, we create small internal experiments, which we have found valuable not only for improving the features, but also for better understanding and explaining them. This is the beginning of a short series focused on the results of such experiments.

The framework we are building is designed to help with advanced data analysis, and one of the key requirements for effective analysis decision support is the automation of various tasks. Some of these tasks may be complex, for example selecting an analysis method or incorporating domain-specific knowledge, while others are focused on the simplicity, convenience, or usability of data management. In this post, we look at processing multiple streams of the same type. This scenario arises often in practice (e.g. sales data from different locations) when we want to build a big picture in order to analyze differences and relationships between individual streams and to detect global patterns or anomalies. For this we needed functionality to efficiently transform all, or selected, data streams in a set, as well as properly adapted analysis methods, starting with basic statistics.
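The framework's actual API is not shown in this post, but the idea can be sketched with a minimal, hypothetical "stream set" that supports transforming all or selected streams and computing pointwise statistics (such as the min, avg, and max used below). All names here (`StreamSet`, `transform`, `pointwise`) are illustrative assumptions, not the framework's real interface; the series are assumed to be aligned on the same time axis.

```python
# Hypothetical sketch of a set of data streams of the same type.
# Names and structure are illustrative only, not the framework's API.
from statistics import mean

class StreamSet:
    def __init__(self, streams):
        # streams: dict mapping stream name -> list of values,
        # assumed aligned on a common time axis
        self.streams = streams

    def transform(self, fn, names=None):
        """Apply fn to every stream (or only the selected names)
        and return a new StreamSet; other streams are copied as-is."""
        selected = set(names) if names is not None else set(self.streams)
        return StreamSet({n: fn(v) if n in selected else list(v)
                          for n, v in self.streams.items()})

    def pointwise(self, agg):
        """Aggregate across all streams at each time point,
        e.g. agg=min, mean, or max."""
        return [agg(values) for values in zip(*self.streams.values())]

# Two toy streams standing in for per-component price series
prices = StreamSet({"AAA": [10.0, 12.0, 11.0],
                    "BBB": [20.0, 18.0, 22.0]})
print(prices.pointwise(min))   # [10.0, 12.0, 11.0]
print(prices.pointwise(mean))  # [15.0, 15.0, 16.5]
print(prices.pointwise(max))   # [20.0, 18.0, 22.0]
```

The same `pointwise` call with `min`, `mean`, and `max` would produce the set-wide statistics plotted in Figure 1.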

In the experiments related to the development of these features we again used stock quote data, this time specifically the components of the Dow Jones Industrial Average. We treated these components only as a set of data streams; when we talk about the 'average', we mean the arithmetic mean, not the Dow Jones Average (which is calculated using the Dow Divisor). The chart in Figure 1 shows the daily close data series extracted from the streams for the Dow Jones components since 2006, with min, avg, and max statistics for the whole set.

Figure 1. History of Dow Jones components from 2006 with min, avg and max statistics calculated for the whole set


As mentioned above, we need to be able to perform transformations on all of these data series. An example of such an experiment is presented in Figure 2. In this case, the range of values for each data series in scope has been normalized to [0, 1] (see feature scaling). As a result, all the series have the same value range, which makes value changes within the given time frame much more visible. The visualization also turned out to be interesting because it nicely illustrates the impact of the 2008 financial crisis on the stock market (an automated annotation artifact was added at the time point of the lowest average for the whole set).
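The normalization applied in Figure 2 is standard min-max feature scaling. As a minimal sketch (the function name is illustrative, and a constant series is arbitrarily mapped to 0.0 to avoid division by zero):

```python
def normalize(series):
    """Min-max feature scaling: map a series onto the range [0, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # Constant series: the mapping is undefined, so pick 0.0
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]

print(normalize([50.0, 75.0, 100.0]))  # [0.0, 0.5, 1.0]
```

Applying this to each component's price series independently gives every stream the same [0, 1] range, which is what makes relative changes across streams comparable in the chart.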

Figure 2. Dow Jones components with value ranges normalized to [0,1] and annotation artifact indicating minimum value of average for the whole set


Sets of data streams (internally we often refer to them as data ponds) can obviously be subjects of further analysis. They are essential for the analysis templates we develop, which, due to their practical requirements, often need to process many streams that are dynamically updated (including their arrival in and departure from a pond). Having specialized methods for processing sets of streams simplifies the development of analysis templates, which rely heavily on transformations and the application of analysis methods. The big picture created through the visualization of multiple streams can also become a valuable background for the presentation of individual streams, improving the visual data analysis experience. We will talk about these scenarios in future posts in this series.