Social context of decision process

Our primary motivations for building data analysis solutions are to help with real problems and to make a meaningful impact. Solving a problem is all about decisions: sometimes a single big one, often a sequence of small steps leading to a preferred outcome. Data analysis software should help make better decisions, based on available data, in a timely manner and through a natural experience. Some problems are isolated, and solving them requires an individual exploration of vast data spaces by a single user with a single set of needs, requirements and preferences. In our digital reality, however, this is rarely the case in practice, as decision-making processes usually occur in a social context. That context is based on a social structure of individuals (involved in solving a problem or affected by the solution), but it also includes other components such as data sources, available analysis methods and, more and more often, intelligent agents that can actively participate in the decision process.

The relevance of a social context in decision making is most visible in healthcare, which is currently going through a digital revolution. With all the new data that can be processed and the application of advanced algorithms, healthcare is becoming more data-driven at every stage, including disease prevention, diagnosis and treatment. Different forms of data analysis improve the effectiveness and efficiency of decision processes and are becoming key foundations for the next generations of healthcare. But with the successful automation of specific tasks, the importance of the human elements only increases. There is obviously a focus on the patient, as healthcare becomes more personalized, with customization of the process and of the medications (pharmacogenomics). A lot of attention is also given to physicians, due to the complexity and non-deterministic nature of this domain and the very high potential cost of an error. But that is still not enough, as success in healthcare critically depends on partnerships and collaborative relationships.

The social context in healthcare is built upon the relationship between patients and physicians. These relationships are no longer 1-to-1, nor symmetrical, as the social structures on both sides usually include multiple participants. On the patient’s side, there is primarily a social network providing support and influence, with dynamics that can easily become complex, especially in scenarios where patients cannot take control of their own health (e.g. children or elderly persons). On the physician’s side, there is a virtual team of medical professionals working with the patient; the physician should be the trusted point of contact, but the process can now expand beyond the knowledge and experience of any individual. The social context in healthcare is unique: for example, we may assume that all participants in the decision process share the same goal, i.e. the well-being of the patient. But this also means that there are unique functional requirements for these relationships to work: they must be designed for the long term, simple and convenient on a daily basis when there are no major problems, but also efficient and natural in the case of a serious medical condition or an emergency.

Figure 1 includes a simple social structure built around the traditional relationship between a patient and a physician as its core. This is just a proof of concept, but similar models for real case studies can be created in various ways: defined a priori (e.g. by roles in a team), constructed based on provided information (e.g. key actors), or automatically generated using records of interactions.

Figure 1. An example of a social structure from a context of decision processes in health care

The models for actual social structures and contexts are obviously dynamic and specific to a situation. That, in addition to their possible complexity, makes the functional requirements for the quality of the relationships very hard to meet. In order to be successful in domains like healthcare, technology must be designed and implemented for the social contexts of its applications. This starts with strong generic fundamentals like secure and reliable data processing, natural experiences, or smooth integration with external components. But this is only the beginning if we want to facilitate efficient cooperation (which can be more important than the actual data analysis itself), enable building trust and partnership, or help with challenges, emotional factors (e.g. fear) and certain behaviors (e.g. avoidance). Such scenarios require relationship management functionality, which means that social context must be taken into consideration at each and every stage of creating software; this is no longer just another feature, but rather one of the core fundamentals.

Let’s take a brief look at the seemingly straightforward requirement of keeping the participants of a decision process informed. This means, among other things, that the results of data analysis must be useful for the user. However, with a social context, we have multiple users, each with individual needs, requirements and preferences. Each of them needs a different type of story: even the same information should be presented differently to a physician (all the details, with analysis decision support) and to a patient (explanations, with the option of learning more or starting a conversation). One of the features we’re developing in our framework is designed to provide personalized views of a shared data space to the multiple users and roles of a social structure (for example a company). The personalized user experience is based on individual preferences, but also on analysis of the user’s role, profile (e.g. age for accessibility), the nature of the task or scenario, as well as any situational requirements (e.g. pressure due to an emergency). Figure 2 shows possible functional templates of personalized user experience for the key users in our example.
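
To make this more concrete, below is a minimal, hypothetical sketch of how such a view configuration could be derived from a user’s role, profile and situation. The roles, fields and rules are illustrative assumptions made for this example, not the framework’s actual API.

```python
from dataclasses import dataclass

@dataclass
class ViewConfig:
    detail_level: str            # "full" for professionals, "summary" for patients
    vocabulary: str              # "clinical" vs. "plain-language"
    font_scale: float            # accessibility adjustment, e.g. for older users
    decision_support: bool       # show analysis decision support panels

def personalize_view(role: str, age: int, emergency: bool) -> ViewConfig:
    """Derive a view configuration from the user's role, profile and situation."""
    if role == "physician":
        config = ViewConfig("full", "clinical", 1.0, decision_support=True)
    else:  # patient or supporting family member
        config = ViewConfig("summary", "plain-language", 1.0, decision_support=False)
    if age >= 70:
        config.font_scale = 1.4                # larger text for accessibility
    if emergency:
        config.detail_level = "critical-only"  # strip everything non-essential
    return config

print(personalize_view(role="patient", age=78, emergency=False))
```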

Figure 2. Personalized user experiences in a social structure of decision processes in health care

In this post, we used healthcare as the application domain. In this domain, we can see that technology has the potential to improve existing processes and practices but, at the same time, it will change them dramatically. Modern data analysis solutions will not replace physicians, but they will change the behavioral patterns of interactions between patients and physicians (and likely beyond). Obviously, healthcare is about people and relationships more than most other domains are. But since social context is so essential for our decision processes, we may expect similar changes in other domains affected by the democratization of data analysis. Data analysis is becoming social, following the path from data connecting users, through natural interactions and cooperation, to relationships focused on very specific challenges. Social contexts will become even more relevant as we start implementing scenarios involving intelligent software agents that can participate in our decision processes. With that change, we are no longer only adapting to a social context, we are actually trying to shape it.

Playing in the data pond

While talking about multiple data streams in the earlier posts of this series, we started using the term “data pond”. This is a concept we’re using internally in the context of processing sets of streams, of the same or different types, that are usually somehow related: by source (e.g. a specific user or organization), domain (e.g. records from different patients) or processing requirements (e.g. data that cannot be stored in the cloud). Data ponds are very useful for simplifying data management; for example, in a basic scenario, adding a new stream to a project may require only dropping a file at a specific location. They are, however, also essential for analysis templates: sequences of transformations and analysis methods (generic or domain-specific) that can be applied to the streams in a pond.
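
As a rough illustration of the idea (not the framework’s actual implementation), the sketch below models a data pond as a named collection of related streams whose membership comes from files found at a given location; the class, directory and column names are assumptions made for this example.

```python
from pathlib import Path
import pandas as pd

class DataPond:
    """A named collection of related data streams backed by files in one directory."""

    def __init__(self, name: str, directory: str):
        self.name = name
        self.directory = Path(directory)
        self.streams = {}   # maps stream name -> pandas DataFrame

    def refresh(self) -> None:
        """Pick up any CSV file dropped into the pond's directory as a new stream."""
        for path in self.directory.glob("*.csv"):
            if path.stem not in self.streams:
                self.streams[path.stem] = pd.read_csv(path, parse_dates=["date"])

    def apply(self, transform) -> None:
        """Apply the same transformation to every stream in the pond."""
        self.streams = {name: transform(df) for name, df in self.streams.items()}

# Dropping a file such as MSFT.csv into ./dow_pond and calling refresh()
# is enough to add that stream to the pond.
pond = DataPond("dow-jones", "./dow_pond")
pond.refresh()
```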

Figure 1 illustrates an example of streams automatically added to, and removed from, a data pond. Again, we’re using streams with daily close prices of Dow Jones components. In this case, information about changes to the stocks included in the Dow Jones is added to the definition of the pond, and our framework automatically includes the appropriate data streams, with applicable time constraints (so we don’t have to edit the streams directly). However, the scope of a pond doesn’t need to be predefined; it can also be determined automatically based on the availability of data streams in associated data sources. Monitoring the state of a pond can be further expanded with custom rules (e.g. tracking update frequency) that result in chart annotations or notifications from the framework.
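
A minimal sketch of how such time constraints might be derived from a stream of composition changes is shown below; the event records, symbols and column names are made up purely for illustration.

```python
import pandas as pd

# Hypothetical composition-change events for the pond (not real Dow Jones data).
changes = pd.DataFrame({
    "symbol": ["XYZ", "ABC", "XYZ"],
    "event":  ["added", "added", "removed"],
    "date":   pd.to_datetime(["2010-06-08", "2015-03-19", "2018-06-26"]),
})

def membership_window(symbol: str) -> tuple:
    """Return the (start, end) interval during which a stream belongs to the pond."""
    events = changes[changes["symbol"] == symbol].sort_values("date")
    start = events.loc[events["event"] == "added", "date"].min()
    removed = events.loc[events["event"] == "removed", "date"]
    end = removed.min() if not removed.empty else pd.Timestamp.max
    return start, end

start, end = membership_window("XYZ")
print(f"XYZ in scope from {start.date()} to {end.date()}")   # constrained time range
```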

Figure 1 Overview of changes in the list of Dow Jones components with automated change annotations (SVG)

Data ponds are not only useful for data management; they are also relevant for analysis templates, which can be executed on individual streams or on a data pond as a whole. Analysis templates can be applied by default during the import phase and include normalization, error detection or input validation. They may also be executed conditionally, based on specific events or the nature of the data streams. For example, the prices in Figure 1 were not processed, and the changes due to stock splits are clearly visible (see V or NKE). A stream with information about such events was added to the pond’s definition and used to trigger a template for all affected stocks. The result is a new series with split-adjusted prices, calculated for use in a chart with percentage changes (Figure 2).
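
One step such a template could perform is the split adjustment itself. Below is a minimal sketch under the usual convention (prices before each split date are divided by the split ratio), using entirely made-up prices and a hypothetical 2-for-1 split; it is an illustration, not the template actually used in the framework.

```python
import pandas as pd

# Raw daily close prices around a hypothetical 2-for-1 split effective 2021-03-03.
prices = pd.Series(
    [200.0, 202.0, 102.0, 101.0],
    index=pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-03", "2021-03-04"]),
)
splits = [(pd.Timestamp("2021-03-03"), 2.0)]   # (effective date, split ratio)

def split_adjust(series: pd.Series, events) -> pd.Series:
    """Divide prices before each split date by the split ratio,
    producing a continuous, split-adjusted series."""
    adjusted = series.copy()
    for date, ratio in events:
        before = adjusted.index < date
        adjusted.loc[before] = adjusted.loc[before] / ratio
    return adjusted

print(split_adjust(prices, splits))   # 100.0, 101.0, 102.0, 101.0
```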

Figure 2 Example of an analysis template automatically applied to calculating split adjusted stock price (SVG)

Data streams for Dow Jones components are obviously just a simple example, but this case study can easily be adapted to more practical applications like the analysis of an individual stock portfolio (with buys and sells defining the scope). We find data ponds, and visualizations based on them, useful for different scenarios and types of streams: records from multiple points of sale, results from repeated research experiments, and logs from hierarchically organized server nodes. Data ponds can be used to improve the management of input data, with detection of new streams and application of initial transformations, but also to give more control over the scope and context of data analysis. This is especially important for long-term or continuous projects (e.g. building more complex models) and enables interesting scenarios like private analysis spaces, where specific requirements, including security, need to be met.

Foreground vs background

In the previous blog post we looked at the processing of multiple data streams and at using the resulting sets of data (referred to as data ponds) as subjects of data analysis. This is often an effective approach to help with understanding specific phenomena, as a big picture created from a number of series can reveal trends and patterns. Such a big picture can, however, serve additional purposes, as it can also be used to establish a relevant context for the processing of an individual stream (which may, but doesn’t have to, be part of the data used to create this context). The results from analysis templates implementing such an approach can be effectively visualized, with the focus on the individual series as a clearly distinguished foreground and the context from multiple series presented as a background.

In the examples below we again use Dow Jones components, this time with the 5-year history of their daily close prices. Figure 1 includes data series for all stocks in the scope of the Dow Jones data pond, without any transformations applied and with the focus on Microsoft (MSFT).
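
The plotting pattern itself is simple; a minimal matplotlib sketch is shown below, assuming the pond is available as a dictionary mapping ticker symbols to pandas Series of daily close prices (an assumption made for this example, not the framework’s actual interface).

```python
import matplotlib.pyplot as plt

def plot_focus(pond: dict, focus: str) -> None:
    """Draw every series in the pond as a muted background line, with the focus series on top."""
    fig, ax = plt.subplots(figsize=(10, 4))
    for symbol, series in pond.items():
        if symbol != focus:
            ax.plot(series.index, series.values, color="0.85", linewidth=0.8)  # background
    ax.plot(pond[focus].index, pond[focus].values,
            color="tab:blue", linewidth=1.8, label=focus)                      # foreground
    ax.set_ylabel("daily close price")
    ax.legend(loc="upper left")
    plt.show()

# Usage (assuming `pond` was built elsewhere, e.g. from the Dow Jones data pond):
# plot_focus(pond, focus="MSFT")
```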

Figure 1. Five-year history of Dow Jones components with focus on Microsoft stock daily close prices (SVG)

This chart is not very useful, since the value range of the MSFT price is small compared to the value range of the chart (determined by all series), and thus the foreground series seems rather flat. This problem can be addressed by transforming all the series in the data pond, as illustrated in Figure 2, where the series’ value ranges were normalized to [0, 1] (we also used this transformation in the first post of the series).
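
For reference, the transformation is plain min-max feature scaling; a minimal sketch with made-up prices is shown below.

```python
import pandas as pd

def normalize(series: pd.Series) -> pd.Series:
    """Rescale a series so its minimum maps to 0 and its maximum maps to 1."""
    return (series - series.min()) / (series.max() - series.min())

example = pd.Series([250.0, 260.0, 255.0, 300.0])   # made-up close prices
print(normalize(example))                            # 0.0, 0.2, 0.1, 1.0
```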

Figure 2. Dow Jones background with value ranges normalized to [0,1] and Microsoft stock as the foreground (SVG)

Another type of transformation, often applied in practice, is based on calculating the change from a previous value, or from the value at a selected point in time. Figure 3 includes the results of such an experiment, with the change (percentage) calculated against the first data point in the time frame (5 years earlier). In addition to the MSFT stock, this chart also covers IBM, so that their performance can be easily compared.
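
A minimal sketch of this transformation, with made-up sample values, could look as follows.

```python
import pandas as pd

def percent_change_from_start(series: pd.Series) -> pd.Series:
    """Express every value as the percentage change relative to the first value."""
    return (series / series.iloc[0] - 1.0) * 100.0

msft = pd.Series([46.0, 60.0, 85.0, 110.0])    # hypothetical prices over 5 years
print(percent_change_from_start(msft))          # 0.0, ~30.4, ~84.8, ~139.1
```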

Figure 3. Price changes (%) of Microsoft and IBM stock over 5-year interval with Dow Jones components background (SVG)

In the examples above we focused on the visualization of individual series against the context built from multiple series. But obviously, the foreground-vs-background pattern can also be used for analysis, as the focus series can be analyzed in the context of all the others. Such analysis doesn’t have to be limited to a single series; it can focus on a subset, e.g. patients meeting specified criteria. The context built from multiple series may also be of different types: it can be personal (e.g. the latest workout metrics vs. results collected over time), local (e.g. sales from a specific location vs. the company aggregate) or even global (e.g. our performance in the competitive landscape). We’ll get to such scenarios, in different application domains, in future posts.

Processing multiple streams

Working in a startup is about extreme resource management (which may justify the frequency of updates on this blog), and effective prioritization of tasks is the key to gaining at least some control over the chaos. One practice we have found very useful in these efforts is using real-life data with new features, even during their development, in order to get actionable feedback as early as possible. Often we simply start implementing a requested scenario that was the reason for selecting a specific feature. In other cases, we create small internal experiments, which we have found very valuable not only for improving the features, but also for better understanding and explaining them. This is the beginning of a short series focused on the results of such experiments.

The framework we are building is designed to help with advanced data analysis, and one of the key requirements for effective analysis decision support is the automation of different tasks. Some of these tasks may be complex, for example selecting an analysis method or incorporating domain-specific knowledge, while others may be focused on simplicity, convenience or usability of data management. In this post, we are looking at the processing of multiple streams of the same type. This scenario is often needed in practice (e.g. sales data from different locations) when we want to build a big picture to analyze differences and relationships between individual streams and to detect global patterns or anomalies. For this we needed functionality to effectively transform all, or selected, data streams in a set, as well as properly modified analysis methods, starting with basic statistics.

In the experiments related to the development of these features we were again using stock quote data, this time specifically the components of the Dow Jones Industrial Average. We considered these components only as a set of data streams, and when we talk about ‘average’ we refer to the arithmetic mean, not the Dow Jones Average (which is calculated using the Dow Divisor). The chart in Figure 1 includes the daily close data series extracted from the streams for Dow Jones components since 2006, with min, avg and max statistics for the whole set.
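
Computing these set-wide statistics is straightforward once the streams are aligned by date; a minimal sketch with made-up values is shown below, assuming each column of a DataFrame holds one component’s daily close series.

```python
import pandas as pd

# Made-up daily close prices; one column per component, one row per trading day.
closes = pd.DataFrame(
    {"AAA": [30.0, 31.0, 29.0], "BBB": [55.0, 54.0, 58.0], "CCC": [12.0, 12.5, 13.0]},
    index=pd.to_datetime(["2006-01-03", "2006-01-04", "2006-01-05"]),
)

stats = pd.DataFrame({
    "min": closes.min(axis=1),    # lowest close across the set on each day
    "avg": closes.mean(axis=1),   # arithmetic mean (not the Dow Jones Average)
    "max": closes.max(axis=1),    # highest close across the set on each day
})
print(stats)
```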

Figure 1. History of Dow Jones components from 2006 with min, avg and max statistics calculated for the whole set (big)

As mentioned above, we need to be able to perform transformations on all these data series. An example of such an experiment is presented in Figure 2. In this case, the range of values for each of the data series in scope has been normalized to [0, 1] (see feature scaling). As a result, all the series have the same value range, which makes value changes in the given time frame much more visible. The visualization also turned out to be interesting because it nicely illustrates the impact of the 2008 financial crisis on the stock market (an automated annotation artifact was added at the time point of the lowest average for the whole set).
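
Locating that annotation point can be as simple as the sketch below, which assumes the same kind of DataFrame of close prices as in the earlier example: each column is normalized to [0, 1] and the date with the lowest set-wide average is returned.

```python
import pandas as pd

def lowest_average_date(closes: pd.DataFrame) -> pd.Timestamp:
    """Normalize each column to [0, 1] and return the date with the lowest set-wide average."""
    normalized = (closes - closes.min()) / (closes.max() - closes.min())  # per-column scaling
    return normalized.mean(axis=1).idxmin()

# annotate_at = lowest_average_date(closes)   # for the 2006+ data this lands in the 2008 crisis
```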

Figure 2. Dow Jones components with value ranges normalized to [0,1] and annotation artifact indicating minimum value of average for the whole set (big)

Sets of data streams (internally we often refer to them as data ponds) can obviously be subjects for further analysis. They are essential for the analysis templates we develop, which, due to their practical requirements, often need to process many streams that are dynamically updated (including their arrival in and departure from a pond). Having specialized methods for processing sets of streams simplifies the development of analysis templates, which rely heavily on transformations and the application of analysis methods. The big picture created through the visualization of multiple streams can also become a valuable background for the presentation of individual streams, improving the visual data analysis experience. We will talk about these scenarios in future posts of this series.

Big small data

We really like the concept of small data. It seems to be well suited for the data analysis revolution that we are currently in. A lot of data about ourselves, our behavior, interactions and environments are collected on a daily basis; and with the Internet of Everything there will only be more of them. As analysis techniques, from basic to highly specialized, become more available for applications, the new challenges are what to do with the data, what questions to ask and how to use the answers. Small data and related paradigms may indicate an interesting path towards turning available data into meaningful and actionable information that can be smoothly integrated into our daily lives.

Small data are relevant for their users

The concept of small data has been discussed for a few years and there are some interesting efforts to develop its definition. Small data are definitely not about size. They are also not necessarily about always being understood by a user, as the user may be not only an individual, but also a group or an organization. Small data are, however, about being relevant for a user, immediately or potentially, i.e. after processing. Small data can be the output of an analysis (including big data solutions) as well as the input for the process. They can be human-sourced, process-mediated or machine-generated. They can include unstructured elements and imperfections. They are often personal, unique and subjectively valuable, from sleep and activity records, through streams of financial transactions, to family histories stored in digitized photographs. In every case, small data exist in a well-defined context of users’ needs, requirements and preferences, and are usually associated with some decision processes.

Despite the emphasis on the small aspect of the data, this concept is not in opposition to big data. The practical approaches related to these concepts can actually be complementary. Big data solutions can lead to amazing results. With practically unlimited storage, bandwidth and computing power, they create opportunities for conducting advanced analysis on entire populations rather than only on limited subsets. But this also means that the goals of such efforts are usually “big” and not always close to the “small” goals of individual users. In some cases, they can even be in conflict, especially when data ownership, transparency or privacy are added to the equation. Small data can very often be smoothly integrated with big data solutions, but small data analysis scenarios should always be built around the user’s goals and priorities.

Small data analysis can solve big problems

These considerations may seem theoretical, but they may have very practical consequences, as focusing on small data can lead to new paradigms for designing data analysis solutions.

Small data analysis is about simplicity, a focus on practical scenarios, and a connection to problem domains, with well-defined goals and the expectation of inherently useful results. It is a bottom-up engineering approach that starts with available data, clear questions, and basic methods that can be quickly applied. In further steps, the process can be incrementally expanded as the context becomes better understood and more sophisticated methods can be selected for specific scenarios. Starting a data analysis project with advanced AI algorithms may be very tempting, but it will not necessarily lead to the expected results. At the beginning, there are usually a lot of small treasures hidden in the available data, treasures which can be extracted using relatively simple tools. Advanced techniques are more useful at later stages, when more complex questions have been identified and simple answers are no longer easy to find.

Small data are about providing value for their users. We find the concept especially useful as one of the foundation elements for defining personal analysis spaces, i.e. contexts for executing data analysis processes focused on individual goals, with full control over data sharing (or accepting external streams) and tailored experiences. The emphasis on the personal aspects of analysis obviously doesn’t imply isolation or any other limitations. Similarly, a focus on the simplicity of a solution’s design doesn’t mean being restricted to simple applications. On the contrary, small data analysis and the paradigm of incremental expansion seem to be very well suited to applications in complex domains, like digital healthcare. In this case, a system designed with a focus on the user and small data might enable brand new scenarios, based on data that users would not feel comfortable sharing with and submitting to big data solutions.

Small data will need new rules and contracts

Small data analysis is, in a sense, an extremely user-centered approach to data processing. As such, it can create unique opportunities to better understand users’ needs, requirements and preferences. Understanding the user’s context can lead to improving the value of data analysis results, but also to more accurate specifications of the technical requirements for the design of data analysis solutions, with special emphasis on data flows, dependencies and trust boundaries. Small data have great potential value, especially when it comes to data generated, directly or indirectly, by the users themselves. Privacy concerns are only the beginning of the story. The rules and contracts for handling small data (and ultimately benefiting from them) are yet to be determined.