Big small data

We really like the concept of small data. It seems well suited for the data analysis revolution that we are currently in. A lot of data about ourselves, our behavior, interactions and environments are collected on a daily basis, and with the Internet of Everything there will only be more of them. As analysis techniques, from basic to highly specialized, become more widely available for applications, the new challenges are what to do with the data, what questions to ask and how to use the answers. Small data and related paradigms may indicate an interesting path towards turning available data into meaningful and actionable information that can be smoothly integrated into our daily lives.

Small data are relevant for their users

The concept of small data has been discussed for a few years and there are some interesting efforts to develop its definition. Small data are definitely not about size. They are also not necessarily about always being understood by a user, since the user may be not only an individual but also a group or an organization. Small data are, however, about being relevant for a user, either immediately or potentially, i.e. after processing. Small data can be the output of an analysis (including big data solutions) as well as the input to such a process. They can be human-sourced, process-mediated or machine-generated. They can include unstructured elements and imperfections. They are often personal, unique and subjectively valuable, from sleep and activity records, through streams of financial transactions, to family histories stored in digitized photographs. In every case small data exist in a well-defined context of users’ needs, requirements and preferences, and are usually associated with some decision processes.

Despite the emphasis on the small aspect of data, this concept is not in opposition to big data. The practical approaches related to these concepts can actually be complementary. Big data solutions can lead to amazing results. With practically unlimited storage, bandwidth and computing power, they create opportunities for conducting advanced analysis on entire populations rather than only on limited subsets. But this also means that the goals of such efforts are usually “big” and not always close to the “small” goals of individual users. In some cases they can even be in conflict, especially when data ownership, transparency or privacy are added to the equation. Small data can very often be smoothly integrated with big data solutions, but small data analysis scenarios should always be built around the user’s goals and priorities.

Small data analysis can solve big problems

These considerations may seem theoretical, but they have very practical consequences, as focusing on small data can lead to new paradigms for designing data analysis solutions.

Small data analysis is about simplicity, a focus on practical scenarios, a connection to problem domains, well-defined goals and an expectation of inherently useful results. It is a bottom-up engineering approach that starts with available data, clear questions, and basic methods that can be applied quickly. In further steps, the process can be incrementally expanded as the context becomes better understood and more sophisticated methods can be selected for specific scenarios. Starting a data analysis project with advanced AI algorithms may be very tempting, but it will not necessarily lead to the expected results. At the beginning there are usually a lot of small treasures hidden in the available data, treasures which can be extracted using relatively simple tools. Advanced techniques are more useful at later stages, when more complex questions have been identified and simple answers are no longer easy to find.

Small data are about providing value for their users. We find the concept especially useful as one of the foundation elements for defining personal analysis spaces, i.e. contexts for executing data analysis processes focused on individual goals, with full control over data sharing (or accepting external streams) and tailored experiences. The emphasis on the personal aspects of analysis obviously doesn’t imply isolation or any other limitations. Similarly, a focus on the simplicity of a solution’s design doesn’t mean a restriction to simple applications. On the contrary, small data analysis and the paradigm of incremental expansion seem to be very well suited to applications in complex domains, like digital healthcare. In such a case, a system designed around the user and their small data might enable brand new scenarios, based on data that users would not feel comfortable sharing and submitting to big data solutions.

Small data will need new rules and contracts

Small data analysis is, in a sense, an extremely user-centered approach to data processing. As such it can create unique opportunities to better understand users’ needs, requirements and preferences. Understanding a user’s context can improve the value of data analysis results, but it can also lead to more accurate technical requirements for the design of data analysis solutions, with special emphasis on data flows, dependencies and trust boundaries. Small data have great potential value, especially when it comes to data generated, directly or indirectly, by users themselves. Privacy concerns are only the beginning of the story. The rules and contracts for handling small data (and ultimately benefiting from them) are yet to be determined.

Everything is a stream

There is a nice consequence of our fixation on time-oriented data - we can consider everything as a data stream. Any input stream processed by our framework is required to include a time variable, from a simple timestamp field to more complex time structures. We use this variable to organize the data, but also to establish a shared dimension for connecting data elements from different sources. Such an approach gives us the opportunity to operate on multiple streams and implement interesting scenarios with the help of various stream transformations (e.g. time aggregation) and analysis methods.
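As a rough illustration of the time aggregation idea, here is a minimal sketch in plain Python (not our framework’s actual API) that buckets a timestamped stream into monthly totals, so that streams of different granularities can later be compared along the shared time dimension; the stream contents are purely illustrative:

```python
from collections import defaultdict
from datetime import datetime

# A hypothetical input stream: (timestamp, value) events, e.g. daily expenses.
stream = [
    (datetime(2023, 1, 3), 12.5),
    (datetime(2023, 1, 17), 40.0),
    (datetime(2023, 2, 2), 7.25),
]

def aggregate_monthly(events):
    """Sum values per (year, month) bucket - a simple time aggregation."""
    buckets = defaultdict(float)
    for timestamp, value in events:
        buckets[(timestamp.year, timestamp.month)] += value
    # Return a new, time-ordered stream of monthly totals.
    return sorted(buckets.items())

print(aggregate_monthly(stream))
# [((2023, 1), 52.5), ((2023, 2), 7.25)]
```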

We define data streams as sequences of events (or state snapshots/changes) that are time ordered, i.e. an event with a timestamp earlier than a previous one is flagged as a potential error. The simplest example of such a data stream is a univariate time series consisting of a timestamp and a data point (a single value), often with some additional requirements, like equal spacing between two consecutive data points. More complex objects in a data stream can include several variables describing an event, often closely related (e.g. time and distance of a running workout, enabling calculation of average speed). In many practical application scenarios, these objects can be models (e.g. snapshots of a system’s attack surface) or collections of unstructured data (e.g. articles published on a subject). 
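To make the ordering requirement concrete, here is a minimal sketch (plain Python, illustrative only) that walks a stream of events and flags any event whose timestamp precedes the previous one as a potential error:

```python
from datetime import datetime

def flag_out_of_order(events):
    """Yield (event, is_suspect) pairs; an event with a timestamp earlier
    than the previous one is flagged as a potential error."""
    previous = None
    for event in events:
        timestamp = event["time"]
        suspect = previous is not None and timestamp < previous
        yield event, suspect
        previous = timestamp

stream = [
    {"time": datetime(2023, 5, 1), "value": 10},
    {"time": datetime(2023, 5, 3), "value": 12},
    {"time": datetime(2023, 5, 2), "value": 11},  # earlier than previous -> flagged
]

for event, suspect in flag_out_of_order(stream):
    print(event["time"].date(), "potential error" if suspect else "ok")
```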

The beauty of a time variable being present in all our data is that it can become a shared dimension enabling integration of streams from different sources and building a bigger (or just cleaner) picture from seemingly disconnected elements. Let’s illustrate this with a case study based on very simple historical rent data. The input stream includes monthly rent payments and additional variables, like the location of the rented place and tags indicating lease renewal events. The interim visualization of the rent series is presented in Figure 1, with the location and renewal events added as annotations (Point Marker artifacts).

Figure 1: Monthly rent payments with locations and renewal events as point annotations (SVG)
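For concreteness, the input records behind this case study could look roughly like the sketch below (plain Python with illustrative field names and values, not our actual schema); point annotations like those in Figure 1 are then simply derived from records that carry a renewal tag:

```python
from datetime import date

# Hypothetical monthly rent records (field names and values are illustrative).
rent_stream = [
    {"month": date(2022, 1, 1), "rent": 1200.0, "location": "Oak St", "renewal": False},
    {"month": date(2022, 2, 1), "rent": 1200.0, "location": "Oak St", "renewal": False},
    {"month": date(2022, 3, 1), "rent": 1260.0, "location": "Oak St", "renewal": True},
]

def point_annotations(stream):
    """Build simple point-marker annotations for months tagged as renewals."""
    return [
        {"at": record["month"], "label": f"renewal at {record['location']}"}
        for record in stream
        if record["renewal"]
    ]

print(point_annotations(rent_stream))
# [{'at': datetime.date(2022, 3, 1), 'label': 'renewal at Oak St'}]
```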

The history of rent payments is an obvious example of time-oriented data, but in this case study the more interesting variables are the location and the lease renewal events, all connected by time (at monthly granularity). Let’s start with the location variable - we can detect when its value changes and automatically create an annotation (Time Marker artifact) indicating an event of moving between apartments. The lease renewal variable, on the other hand, can be used to identify the critical data points where the rent is most likely to change, and to produce a new series with the percentage difference between consecutive renewals. Figure 2 includes the results of these operations.

Figure 2: Monthly payments with markers for moving and differences between lease renewals (SVG)
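A rough sketch of the two operations behind Figure 2, again in plain Python on the same illustrative record structure: detecting the months where the location value changes, and computing the percentage difference in rent between consecutive renewal events:

```python
def moving_markers(stream):
    """Emit a time-marker annotation whenever the location value changes."""
    markers, previous = [], None
    for record in stream:
        if previous is not None and record["location"] != previous:
            markers.append({"at": record["month"], "label": f"moved to {record['location']}"})
        previous = record["location"]
    return markers

def renewal_differences(stream):
    """Percentage change in rent between consecutive lease renewal events."""
    renewals = [record for record in stream if record["renewal"]]
    return [
        {"at": current["month"],
         "pct_change": (current["rent"] - prior["rent"]) / prior["rent"] * 100}
        for prior, current in zip(renewals, renewals[1:])
    ]

# Illustrative data: two renewals and one move between apartments.
stream = [
    {"month": "2022-01", "rent": 1200.0, "location": "Oak St", "renewal": True},
    {"month": "2022-07", "rent": 1200.0, "location": "Elm Ave", "renewal": False},
    {"month": "2023-01", "rent": 1290.0, "location": "Elm Ave", "renewal": True},
]
print(moving_markers(stream))       # move detected in 2022-07
print(renewal_differences(stream))  # +7.5% between the two renewals
```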

The requirement of a time variable is one of the foundations of the framework we are creating. Time gives order to a stream of events, but it also provides a frame of reference for data analysis. We can use time variables to effectively manage multiple different streams, create useful streams based on multiple input sources or extract key information from streams of unstructured data. Obviously there are challenges related to time granularities, irregular samples or overlapping intervals and data points. All these problems, however, can be solved with the help of the right technology.
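As one more illustrative sketch of the shared time dimension (plain Python, not our framework): once two streams from different sources are reduced to a common granularity, they can be joined on the time key, with gaps showing up as missing values rather than breaking the analysis:

```python
def join_on_time(stream_a, stream_b):
    """Merge two {time_key: value} streams into one stream keyed by time.
    Missing values on either side become None instead of breaking the join."""
    keys = sorted(set(stream_a) | set(stream_b))
    return [(key, stream_a.get(key), stream_b.get(key)) for key in keys]

# Hypothetical monthly aggregates coming from two different sources.
rent = {"2023-01": 1200.0, "2023-02": 1200.0, "2023-03": 1260.0}
utilities = {"2023-02": 95.0, "2023-03": 110.0, "2023-04": 90.0}

for month, rent_value, utilities_value in join_on_time(rent, utilities):
    print(month, rent_value, utilities_value)
```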