Processing multiple streams

Working in a startup is about extreme resource management (which may justify the frequency of updates on this blog), and effective prioritization of tasks is the key to gaining at least some control over the chaos. One practice we found very useful in these efforts is using real-life data with new features even during their development, in order to get actionable feedback as early as possible. Often we simply start by implementing the requested scenario that was the reason for selecting a specific feature. In other cases, we create small internal experiments, which we found very valuable not only for improving the features, but also for better understanding and explaining them. This is the beginning of a short series focused on the results of such experiments.

The framework we are building is designed to help with advanced data analysis, and one of the key requirements for effective analysis decision support is automation of different tasks. Some of these tasks may be complex, for example selecting an analysis method or incorporating domain-specific knowledge, while others may be focused on simplicity, convenience or usability of data management. In this post, we are looking at the processing of multiple streams of the same type. This scenario is often needed in practice (e.g. sales data from different locations) when we want to build a big picture to analyze differences and relationships between individual streams and to detect global patterns or anomalies. For this we needed functionality to effectively transform all, or selected, data streams in a set, as well as properly modified analysis methods, starting with basic statistics.

In the experiments related to the development of these features we were again using stock quote data; this time specifically the components of the Dow Jones Industrial Average. We considered these components only as a set of data streams, and when we talk about ‘average’ we refer to the arithmetic mean, not the Dow Jones Industrial Average (calculated using the Dow Divisor). The chart in Figure 1 includes the daily close data series extracted from the streams for Dow Jones components since 2006, with min, avg and max statistics for the whole set.

Figure 1. History of Dow Jones components from 2006 with min, avg and max statistics calculated for the whole set
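
As an illustration of how such set-level statistics could be computed, here is a minimal sketch in Python using pandas rather than our framework's own API; the file name and column layout are hypothetical.

```python
import pandas as pd

# Hypothetical input: one column of daily close prices per Dow Jones component,
# indexed by date (e.g. exported from the quote streams described above).
closes = pd.read_csv("djia_components_close.csv", index_col="date", parse_dates=True)

# Set-level statistics computed across all streams at every time point.
set_stats = pd.DataFrame({
    "min": closes.min(axis=1),   # lowest close among all components on a given day
    "avg": closes.mean(axis=1),  # arithmetic mean (not the Dow Jones Industrial Average)
    "max": closes.max(axis=1),   # highest close among all components on a given day
})

print(set_stats.head())
```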

As mentioned above, we need to be able to perform transformations on all these data series. An example of such an experiment is presented in Figure 2. In this case, the range of values for each of the data series in scope has been normalized to [0, 1] (see feature scaling). As a result, all the series have the same value range, which makes value changes in the given time frame much more visible. The visualization also turned out to be interesting because it nicely illustrates the impact of the 2008 financial crisis on the stock market (an automated annotation artifact was added at the time point of the lowest average for the whole set).

Figure 2. Dow Jones components with value ranges normalized to [0,1] and annotation artifact indicating minimum value of average for the whole set
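
A minimal sketch of the normalization step and of locating the annotation point, again in pandas rather than our framework's API; the input file is the same hypothetical one as in the previous sketch.

```python
import pandas as pd

# Same hypothetical input as in the previous sketch.
closes = pd.read_csv("djia_components_close.csv", index_col="date", parse_dates=True)

def normalize(series: pd.Series) -> pd.Series:
    """Min-max (feature) scaling of a single data series to the [0, 1] range."""
    return (series - series.min()) / (series.max() - series.min())

# Apply the transformation to every stream in the set.
normalized = closes.apply(normalize)

# Time point of the lowest average across the whole normalized set,
# i.e. the anchor for the annotation artifact mentioned above.
lowest_avg_date = normalized.mean(axis=1).idxmin()
print("Lowest set average at:", lowest_avg_date)
```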

Sets of data streams (internally we often refer to them as data ponds) can obviously be subjects of further analysis. They are essential for the analysis templates we develop, which due to their practical requirements often need to process many streams that are dynamically updated (including their arrival in and departure from a pond). Having specialized methods for processing sets of streams simplifies the development of analysis templates, which rely heavily on transformations and applications of analysis methods. The big picture created through visualization of multiple streams can also become a valuable background for the presentation of individual streams, improving the visual data analysis experience. We will talk about these scenarios in future posts of this series.

Big small data

We really like the concept of small data. It seems to be well suited for the data analysis revolution that we are currently in. A lot of data about ourselves, our behavior, interactions and environments are collected on a daily basis; and with the Internet of Everything there will only be more of them. As analysis techniques, from basic to highly specialized, become more available for applications, the new challenges are what to do with the data, what questions to ask and how to use the answers. Small data and related paradigms may indicate an interesting path towards turning available data into meaningful and actionable information that can be smoothly integrated into our daily lives.

Small data are relevant for their users

The concept of small data has been discussed for a few years and there are some interesting efforts to develop its definition. Small data are definitely not about size. They are also not necessarily about always being understood by a user, as the user may be not only an individual but also a group or an organization. Small data are, however, about being relevant for a user, immediately or potentially – i.e. after processing. Small data can be the output of an analysis (including big data solutions) as well as the input for the process. They can be human-sourced, process-mediated or machine-generated. They can include unstructured elements and imperfections. They are often personal, unique and subjectively valuable, from sleep and activity records, through streams of financial transactions, to family histories stored in digitized photographs. In every case small data exist in a well-defined context of users’ needs, requirements and preferences and are usually associated with some decision processes.

Despite the emphasis on the small aspect of data, this concept is not in opposition to big data. The practical approaches related to these concepts can actually be complementary. Big data solutions can lead to amazing results. With practically unlimited storage, bandwidth and computing power, they create opportunities for conducting advanced analysis on entire populations rather than only on limited subsets. But this also means that the goals of such efforts are usually “big” and not always close to the “small” goals of individual users. In some cases, they can even be in conflict, especially when data ownership, transparency or privacy are added to the equation. Small data can very often be smoothly integrated with big data solutions, but small data analysis scenarios should always be built around the user’s goals and priorities.

Small data analysis can solve big problems

These considerations may seem theoretical, but they can have very practical consequences, as focusing on small data can lead to new paradigms for designing data analysis solutions.

Small data analysis is about simplicity, focus on practical scenarios, and connection to problem domains, with well-defined goals and the expectation of inherently useful results. It is a bottom-up engineering approach that starts with available data, clear questions, and basic methods that can be quickly applied. In further steps, the process can be incrementally expanded as the context becomes better understood and more sophisticated methods can be selected for specific scenarios. Starting a data analysis project with advanced AI algorithms may be very tempting, but will not necessarily lead to the expected results. At the beginning, there are usually a lot of small treasures hidden in the available data, treasures which can be extracted using relatively simple tools. Advanced techniques are more useful at later stages, when more complex questions are identified and simple answers are no longer easy to find.

Small data are about providing value for their users. We find the concept especially useful as one of the foundation elements for defining personal analysis spaces, i.e. contexts for executing data analysis processes focused on individual goals, with full control over data sharing (or accepting external streams) and tailored experiences. Emphasis on the personal aspects of analysis obviously doesn’t imply isolation or any other limitations. Similarly, a focus on the simplicity of a solution’s design doesn’t mean restricting it to simple applications. On the contrary, small data analysis and the paradigm of incremental expansion seem to be very well suited to applications in complex domains, like digital healthcare. In this case, a system designed with a focus on the user and small data might enable brand new scenarios, based on data that users would not feel comfortable sharing and submitting to big data solutions.

Small data will need new rules and contracts

Small data analysis is, in a sense, an extremely user-centered approach to data processing. As such, it can create unique opportunities to better understand users’ needs, requirements and preferences. Understanding a user’s context can lead to improving the value of data analysis results, but also to more accurate specifications of technical requirements for the design of data analysis solutions, with special emphasis on data flows, dependencies and trust boundaries. Small data have great potential value, especially when it comes to data generated, directly or indirectly, by users themselves. Privacy concerns are only the beginning of the story. The rules and contracts for handling small data (and ultimately benefiting from them) are yet to be determined.

Healthcare and data analysis

Healthcare is one of three domains we selected for initial applications of our framework for analysis of time-oriented data (the others are security and finance). It is undoubtedly the most challenging domain, but also the one with the biggest opportunities for delivering meaningful changes and positive impact. Healthcare is currently going through a radical revolution. And due to the nature of this domain, the consequences of the changes will affect everybody. We believe that data analysis is one of the key elements of the next generations of healthcare. But this also works in reverse – requirements and scenarios from this domain are great driving forces for innovation in data analysis.

Healthcare on the eve of a revolution

Healthcare is going through a gale of creative destruction, which will fundamentally change the landscape of the life science industry. It is primarily caused by scientific and technical progress: the revolution in sensors, always-connected devices, and the capability to process and store massive amounts of data. But it is not only about the technologies themselves; it is also about how they have already changed users’ behaviors and expectations. Healthcare cannot be disconnected from the digital world that patients are used to. It is interesting to see that while many research efforts are still driven by the life science industry (e.g. genomics becoming more available), there are also strong initiatives originating from companies traditionally involved in information processing (IBM Watson, Microsoft Health or Apple CareKit). The revolution in healthcare is happening, even if there is significant resistance to change among medical professionals.

In our work we obviously focus on data analysis, which is actually very natural in the case of healthcare, as this domain is all about data. It always has been, long before the first signs of the digital revolution. Even when all doctor-patient interactions were direct and one-on-one, they were based on insightful observations and on the doctor’s knowledge and experience to process them, form a diagnosis and propose a therapy. Now we have more data, much more data, that can be available to patients and physicians. The data include genomics, anatomical imaging, physiological metrics, environmental records, and patients’ (and physicians’) behaviors, decisions and observations. There are many functional, technical and business challenges, but these data will eventually flow smoothly in networks of patients, doctors and AI systems. And then we’ll have to focus on different, but already familiar, challenges: what to do with the new data that are within our reach and how to use them to help solve old problems?

Data analysis will change healthcare

It should be clear by now that we’re not talking here about data analysis understood as a basic process with some data on the input and some results on the output. Data analysis in healthcare must be understood in a much broader scope, not limited to selected, even if very useful, tasks like analysis of anatomical images. Technical elements like data processing, extraction of information, creating models, learning from historical data, or providing practical decision support will always remain essential. But since data are expected to flow from multiple sources, and in many directions, data analysis should become one of the foundations on which the required goal-oriented tasks can be effectively executed. And like any technology designed for digital healthcare, it must be extremely user-centered.

The goals of medicine should always be focused on the well-being of the patient. But in the digital world data are processed automatically, so the scope of data analysis can be much broader without weakening that focus. More importantly, such expansion will actually increase the effectiveness of individual therapies. There are (at least) three levels at which data analysis should be considered in healthcare:

  • Individual healthcare will be radically changed by personalized medicine based on individual characteristics and situation. The context of decision making will be vastly expanded by technology, beyond the knowledge and experience of the medical professionals directly involved. On the other side, patients will become more active participants, also benefiting from decision support mechanisms delivering information in a highly usable form.
  • Healthcare relationships will become more widely recognized as essential for the healthcare experience. There are concerns about the reduced frequency of face-to-face interactions, but technology actually has the potential to help with establishing data flows and building strong trust-based connections. These should be the foundations for long-term relationships that are simple and convenient in good times, but remain very effective and natural in case of a medical emergency.
  • Population research will be redefined by the availability of detailed data about individual differences and the ability to select subsets of the population with a high degree of similarity. Data about individual treatment, after proper processing, will be submitted for population analysis and will help with the detection of trends and patterns on a local or global level. More importantly, however, results from population analysis will also be used in individual diagnosis and treatment.

Healthcare of the future will be very different, though the details still remain unknown. One thing, however, is certain: it will be based on various types of data automatically collected, shared, processed and analyzed in many ways we are not yet able to foresee.

A pattern case study

Data analysis is not only about big data, as there is also big value hidden in small data and simple methods that can be easily adapted to solve big problems. Figure 1 presents a very basic example of a visualization generated using our framework. This simple scenario integrates data streams of different types: physiological metrics of body temperature, and two streams covering medicine intake below the main chart: one with a predefined schedule (MED1), which could be supported by adaptive reminders on a mobile device, and a second with a basic record (MED2). The chart also includes analysis artifacts showing the target temperature range, notifications when the range was reached (for the first time, and continuously for 24 hours), and the expected path of temperature change within the selected timespan, assumed to be associated with the scheduled treatment.

Figure 1. Tracking temperature and medicine regime with associated analysis artifacts for target value range, expected change, and detected relevant events

This example is based on test data, but it is a useful illustration of an analysis & visualization pattern that could be applied in many medical scenarios. The pattern is based on streams with metrics of the current state, streams with records of selected actions (including a plan), and artifacts generated from the analysis of these data in the context of related models (possibly constructed based on analysis of similar cases). Healthcare is a very practical domain and it is fundamentally focused on change. It is about learning from the past, understanding current conditions, and planning for the future, with emphasis on available options and the probability of their outcomes. This applies to the context of an individual patient, relationships between patients and physicians, as well as populations of different scales. Healthcare is therefore not only about data in general, but about time-oriented data! This makes it a perfect application domain for our framework.
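
To make the pattern a bit more concrete, here is a minimal sketch in Python (pandas) of how the two notification artifacts from Figure 1 could be detected; the temperature readings, target range and hourly sampling are made-up assumptions for illustration, not part of our framework.

```python
import pandas as pd

# Hypothetical hourly body temperature readings (degrees Celsius).
temps = pd.Series(
    [38.9, 38.6, 38.2, 37.8, 37.4, 37.1, 36.9, 36.8] + [36.7] * 30,
    index=pd.date_range("2016-03-01", periods=38, freq="h"),
)

TARGET_LOW, TARGET_HIGH = 36.0, 37.2          # assumed target temperature range
in_range = temps.between(TARGET_LOW, TARGET_HIGH)

# Artifact 1: first time the temperature enters the target range.
first_in_range = in_range.idxmax() if in_range.any() else None

# Artifact 2: first time the temperature has stayed in range for a continuous 24 hours.
window = 24  # samples, since the readings are hourly
stable = in_range.astype(int).rolling(window).sum() == window
stable_for_24h = stable.idxmax() if stable.any() else None

print("First in range:", first_in_range)
print("In range for 24h at:", stable_for_24h)
```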

This is our first post about healthcare and it is the beginning of a longer story. We plan several more healthcare-related posts covering topics like practical security and privacy requirements, specific decision support scenarios, and design patterns for interactions between patients and physicians. All posts in this series will be tagged as healthcare. In the meantime, we are always interested in healthcare-related research data we could use for template development or experiments with our framework. If you have time-oriented data and you are interested in their analysis or visualization, please don’t hesitate to contact us.

If you want to learn more about the challenges and opportunities for modern healthcare, you may want to start with the great book The Creative Destruction of Medicine by Eric Topol.

Decision support in data analysis

The title of this post may seem reversed, as we usually talk about data analysis as the foundation for decision support systems, which are heavily based on data analysis methods, including statistical analysis, machine learning, data mining or visualization. This relation is, however, bi-directional - as modern analysis methods become more advanced, they also become more complex and difficult to use to their full potential. Decision support elements can redefine the data analysis experience in simple (a non-expert user working with basic algorithms) as well as advanced (an experienced user with advanced algorithms) usage scenarios.

We live in a data analysis world. More and more data are available about us, our surrounding systems and environments (our bodies, homes, environments; our activities in real life and online). Very often the question is no longer what else to measure, but what to do with the data that we already have or can easily access. Technical solutions follow, as collecting, storing and processing massive amounts of data gets continuously cheaper, while algorithms become more sophisticated and effective. Data analysis seems to be more available than ever and its applications are starting to touch each and every aspect of our lives, providing insight into the behavior of societies, organizations and individuals.

Data analysis is everywhere

Data analysis is no longer the domain of researchers. It has become a big business focused on extracting non-obvious information from available data (e.g. consumer preferences from shopping behaviors). On a daily basis we are overloaded with charts, reports, and direct or indirect recommendations. These are products of data analysis aimed at answering generic questions, usually disconnected from personal context, and often delivered as an addition to purchased hardware or services. It is a different story if we want to look for answers to our own questions. This still requires significant effort, access to tools, data in the right format and – most of all – data analysis skills (not only in the strictly technical sense).

Data analysis is essentially a decision process aimed at solving a problem through data exploration, transformations, application of analysis methods and, eventually, utilization of the results. Practical data analysis covers at least two areas: the domain of a problem (questions, measures, expected results) and the data analysis space (tools, methods, constraints). With the increasing complexity of both areas, it is more and more difficult to find a data analyst (or a team) with strong competencies on both sides. And finding both is critical, since a domain problem needs to be translated into a data analysis problem for analysis, and the results need to be brought back to the domain space.

Data analysis needs decision support

The main reason for adding decision support to the data analysis experience is to shift the focus from the analysis space to the problem domain. A user should spend most of their time working on the actual problem, selecting the right questions, controlling the process, and interpreting the results so they can be used in practice. Decision support is aimed at empowering new users to use data analysis effectively, starting with basic recommendations, through providing interactive assistance, and eventually giving contextual support also within the scope of a problem domain. And it can go further by enabling new analysis scenarios beyond the original configuration of an individual working with a dataset.

The core functionality of analysis decision support can be divided into four groups of goal-oriented tasks. The first group includes protecting the user from common mistakes, like applying an analysis method to a data stream of an incorrect type. The second group is about explaining and providing insight into data quality, process status or candidate results. The next group includes guiding the user through key decision points related to the selection of specific data analysis methods, or simply applying a suitable analysis template. The last group of tasks is aimed at automating the whole process, detecting interesting characteristics of the data, and eventually delivering not only answers, but also recommendations for the right questions.
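
As an illustration of the first group, here is a minimal sketch in Python of a guard that refuses to apply an analysis method to a stream of an incorrect type; the DataStream and AnalysisMethod classes are hypothetical and do not reflect our framework's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DataStream:
    name: str
    value_type: type            # e.g. float for numeric metrics, str for unstructured text
    values: Sequence

@dataclass
class AnalysisMethod:
    name: str
    required_type: type
    apply: Callable[[Sequence], object]

def guarded_apply(method: AnalysisMethod, stream: DataStream):
    """Refuse to run a method on a stream of an incompatible type,
    instead of failing somewhere deep inside the computation."""
    if stream.value_type is not method.required_type:
        raise TypeError(
            f"{method.name} expects {method.required_type.__name__} values, "
            f"but stream '{stream.name}' carries {stream.value_type.__name__}"
        )
    return method.apply(stream.values)

# A numeric method applied to a text stream is rejected up front.
mean = AnalysisMethod("mean", float, lambda xs: sum(xs) / len(xs))
notes = DataStream("clinical_notes", str, ["fever reported", "feeling better"])
try:
    guarded_apply(mean, notes)
except TypeError as e:
    print("Decision support:", e)
```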

Analysis decision support will enable new scenarios

Obviously, more advanced scenarios have stronger technical requirements, not only in the scope of analysis, but also in data management and human-computer interaction. Intelligent decision support for data analysis requires more information about input data streams, and effective mechanisms for managing data flows between a local context, shared spaces, and external systems (with emphasis on privacy and security requirements). The user experience must be built around the user, be customizable and adaptive, and be based on individual requirements and preferences (including accessibility). These requirements become even more interesting when we move from individual to group, organizational or research scenarios.

In the first post of this blog we mentioned "a new type of framework" we’re working on. After this post we can be a little more precise and describe this project as an analysis decision support framework. This is still not a complete description and we will keep expanding it in upcoming posts. It does, however, put a nice emphasis on the importance of intelligent decision support as one of the key elements of the framework we are implementing. It starts with the core engineering requirements, a focus on time-oriented data, and the data analysis experience. But with the application of decision support and other elements we can aim towards new and very exciting scenarios.

Everything is a stream

There is a nice consequence of our fixation on time-oriented data - we can consider everything as a data stream. Any input stream processed by our framework is required to include a time variable, from a simple timestamp field to more complex time structures. We use this variable to organize the data, but also to establish a shared dimension for connecting data elements from different sources. Such an approach gives us the opportunity to operate on multiple streams and implement interesting scenarios with the help of various stream transformations (e.g. time aggregation) and analysis methods.

We define data streams as sequences of events (or state snapshots/changes) that are time-ordered, i.e. an event with a timestamp earlier than the previous one is flagged as a potential error. The simplest example of such a data stream is a univariate time series consisting of a timestamp and a data point (a single value), often with some additional requirements, like equal spacing between consecutive data points. More complex objects in a data stream can include several variables describing an event, often closely related (e.g. the time and distance of a running workout, enabling calculation of average speed). In many practical application scenarios, these objects can be models (e.g. snapshots of a system’s attack surface) or collections of unstructured data (e.g. articles published on a subject).
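
A minimal sketch of that ordering check; the Event structure is a simplified assumption, not our framework's internal representation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Event:
    timestamp: datetime
    value: float

def flag_out_of_order(stream: List[Event]) -> List[int]:
    """Return indices of events whose timestamp is earlier than the previous
    event's - such events are flagged as potential errors, not silently dropped."""
    flagged = []
    for i in range(1, len(stream)):
        if stream[i].timestamp < stream[i - 1].timestamp:
            flagged.append(i)
    return flagged

events = [
    Event(datetime(2016, 5, 1, 10, 0), 1.0),
    Event(datetime(2016, 5, 1, 11, 0), 1.2),
    Event(datetime(2016, 5, 1, 10, 30), 1.1),  # arrives out of order - a potential error
]
print(flag_out_of_order(events))  # [2]
```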

The beauty of a time variable being present in all our data is that it can become a shared dimension, enabling integration of streams from different sources and building a bigger (or just cleaner) picture from seemingly disconnected elements. Let’s illustrate this using a case study based on very simple historical rent data. The input stream includes monthly rent payments and additional variables like the location of the rented place and tags indicating lease renewal events. The interim visualization of the rent series is presented in Figure 1, with location and renewal events added as annotations (Point Marker artifacts).

Figure 1: Monthly rent payments with locations and renewal events as point annotations

The history of rent payments is an obvious example of time-oriented data, but in this case study the more interesting variables are location and lease renewal events, all connected by time (of monthly granularity). Let’s start with the location variable - we can detect when its value changes and automatically create an annotation (Time Marker artifact) indicating the event of moving between apartments. The lease renewal variable, on the other hand, can be used to identify the critical data points when the rent is most likely to change, and to produce a new series with the percentage difference between consecutive renewals. Figure 2 includes the results of these operations.

Figure 2: Monthly payments with markers for moving and differences between lease renewals
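
A minimal sketch of these two operations in pandas (not our framework's API); the rent values, locations and renewal flags below are made-up test data.

```python
import pandas as pd

# Hypothetical monthly rent stream with location and lease renewal flags.
rent = pd.DataFrame({
    "month": pd.date_range("2012-01-01", periods=36, freq="MS"),
    "rent": [900] * 12 + [950] * 12 + [1200] * 12,
    "location": ["Downtown"] * 24 + ["Harbor"] * 12,
    "renewal": ([True] + [False] * 11) * 3,
}).set_index("month")

# Time Marker artifacts: months where the location value changes (moving events).
moves = rent.index[rent["location"] != rent["location"].shift()][1:]

# Derived series: percentage difference in rent between consecutive lease renewals.
renewal_rents = rent.loc[rent["renewal"], "rent"]
renewal_change_pct = renewal_rents.pct_change().dropna() * 100

print("Moves:", list(moves))
print(renewal_change_pct)
```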

The requirement of a time variable is one of the foundations of the framework we are creating. Time gives order to a stream of events, but also provides a frame of reference for data analysis. We can use time variables to effectively manage multiple different streams, create useful streams based on multiple input sources, or extract key information from streams of unstructured data. Obviously there are challenges related to time granularities, irregular samples, or overlapping intervals and data points. All these problems, however, can be solved with the help of the right technology.
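
As a closing illustration, here is a minimal sketch of one such transformation (time aggregation of two streams with different, irregular sampling to a shared daily granularity), using pandas; the streams and their values are hypothetical.

```python
import pandas as pd

# Two hypothetical streams with different, irregular sampling.
heart_rate = pd.Series(
    [62, 71, 68, 75, 64],
    index=pd.to_datetime([
        "2016-06-01 08:03", "2016-06-01 08:17", "2016-06-01 09:41",
        "2016-06-02 07:55", "2016-06-02 21:10",
    ]),
)
steps = pd.Series(
    [1200, 4300, 800, 5100],
    index=pd.to_datetime([
        "2016-06-01 09:00", "2016-06-01 18:00",
        "2016-06-02 08:00", "2016-06-02 19:00",
    ]),
)

# Aggregate both streams to a shared daily granularity and align them on time.
daily = pd.DataFrame({
    "avg_heart_rate": heart_rate.resample("D").mean(),
    "total_steps": steps.resample("D").sum(),
})
print(daily)
```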