Threat modeling data analysis processes

In the previous post, we talked about our data driven decision processes taking place in socio-technical systems and becoming more dependent on the results from data analysis solutions. These underlying technical solutions can be attacked in order to disrupt the decision processes, including changing their outcomes. In order to have confidence in data driven decision-making, we need to understand the threats that the underlying data analysis processes are facing. Fortunately, we can use experience from information and software security to do that.

Most security problems result from complexity, unverified assumptions and/or dependencies on external entities. We need to understand a system in order to protect it. This is a common problem in information and software security, where we use threat modeling methodologies to evaluate the design of an information processing system in a security context. We look at the system from the attacker’s point of view and try to find ways in which security properties such as confidentiality, integrity or availability could be compromised. The resulting threat model includes a list of enumerated threats, for example in a format like “an adversary performs action A in order to achieve a specific goal X”. We look at each of these threats and check whether it is mitigated. If a mitigation is missing or incomplete, we have a potential design vulnerability. We can apply the same approach to the socio-technical systems used in data driven decision processes.
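The enumeration step described above can be sketched in a few lines of Python. This is only an illustration, not a formal methodology: the `Threat` class, its fields, the `unmitigated` helper and the two sample threats are all hypothetical names and examples introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Threat:
    """One enumerated threat: an adversary performs an action to achieve a goal."""
    action: str
    goal: str
    mitigations: List[str] = field(default_factory=list)

    def statement(self) -> str:
        return f"An adversary {self.action} in order to {self.goal}"


def unmitigated(threats: List[Threat]) -> List[Threat]:
    """Return threats without any recorded mitigation (potential design vulnerabilities)."""
    return [t for t in threats if not t.mitigations]


# Illustrative threats only; a real model is derived from the actual system design.
threats = [
    Threat("modifies records in an external data source",
           "change the outcome of a decision",
           mitigations=["verify data origin"]),
    Threat("floods an analysis service with requests",
           "delay a time-critical decision"),
]

gaps = unmitigated(threats)
for threat in gaps:
    print("Potential design vulnerability:", threat.statement())
```

A check like this only tells us that a mitigation has been recorded; whether the mitigation is complete still requires a review by people who understand the system.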

This is actually what we started to do in the previous post, when we were asking questions about entry points to our system, its dependencies and the related assumptions. Below we will briefly discuss possible mitigations, which can be technical, but can also be organizational or legal. Figure 1 includes the same model of a socio-technical system as in the previous post, but this time with examples of mitigations related to the critical components.

Figure 1. A model of a socio-technical system used in decision-making, with examples of possible mitigations.

We know mitigations for generic threats (e.g. how to store and transfer data securely). Mitigations for threats specific to data analysis scenarios, however, still need to be researched.

  • In the scope of external data sources, we want to know where the data are coming from, what transformations were applied and how missing values or outliers were handled. We can apply data quality metrics and techniques to verify data origin, but some scenarios will remain more difficult, for example when data are contributed by many anonymous users.
  • In the case of algorithms and models, we might need information like accuracy metrics, full configurations or summaries of training and test data sets. We may periodically test the models or compare the results from different providers. Still, in some cases, independent certifications or specific service level agreements, also covering analysis objectives and priorities, may be required.
  • Decision makers can be protected with user experiences designed for decision making scenarios (including domain characteristics or situational requirements). These are great opportunities for analysis or visual decision support. We need to be careful with new types of interfaces, like augmented reality, as they will be connected with new types of threats against our cognitive abilities.
  • New types of threats will also emerge from integration with AI agents in decision contexts. We don’t yet know the detailed applications, but we can already think about some potential mitigations focused on measuring, analyzing and controlling interactions. Some types of threats, like repudiation, will likely be much more important in such cooperation scenarios.
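As a small illustration of the data quality metrics mentioned in the first point, the sketch below computes the share of missing values and flags outliers in a numeric column from an external source. The interquartile range rule is just one common choice picked for this example, and the function names and sample values are hypothetical.

```python
import statistics


def missing_ratio(values):
    """Share of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)


def iqr_outliers(values, k=1.5):
    """Values lying more than k interquartile ranges outside the quartiles."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in present if v < low or v > high]


# Made-up sample column, for illustration only.
column = [10, 11, None, 12, 11, 10, 95, 12]

print("missing ratio:", missing_ratio(column))
print("outliers:", iqr_outliers(column))
```

Metrics like these describe the data we received; they do not tell us how the provider handled missing values or outliers before delivery, which is why information about origin and transformations is still needed.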

In order to have confidence in data driven decisions, we need to design our processes to be reliable, trustworthy and resistant to attacks. This requires a good understanding of the goals and assets of our decision-making; based on that, we can specify requirements for the underlying data analysis and make informed decisions about selecting specific data sources and analysis components. Threat modeling can be a great tool for that, but the methodologies must be adapted to the nature of socio-technical systems, which can be very dynamic and hard to model. There can also be new opportunities, as we could define new requirements related, for example, to transparency, accountability or independence. These requirements could be very useful for decisions with broad social impact or shared goals, which have to be agreed upon by multiple parties.

Security efforts are continuous in their nature. New technologies enable new scenarios, leading to new threats, which may require new or updated mitigations. We need to continuously think about threats and cannot focus only on the opportunities and benefits of new technologies and applications. If we do, we may soon find our decision processes to be very effective and accurate, but no longer compatible with our goals and priorities.

This series of posts is based on the presentation made at the Bloomberg Data for Good Exchange Conference on September 24th, 2017 (paper, slides).

Security of data driven decision-making

Our decision processes are becoming more data driven at individual, social and global scope. This is a good and natural trend, which gives us hope for more accurate and rational decisions. It is possible due to three major changes: we have much more data, algorithms and models that are useful in practice, and computing resources that are easily available. These changes are not disconnected; rather, we should consider them the foundation upon which our decision processes can be constructed. Data driven decision processes therefore take place in socio-technical systems and include at least two levels: the actual decision processes, with social and business dimensions, and the level of data analysis, with the software and other technical components. It is critical to have consistency between these two levels, as the outcome of decision-making depends on the results of the underlying data analysis.

In complex systems, many things can go wrong. Different elements of socio-technical systems are susceptible to failures caused by random errors or by intentional actions of third parties. In the second case, we can talk about threats against decision processes, which can be defined as any activities aimed at disrupting their execution or changing their outcome. It is interesting to note that even though the goals of an attacker are usually related to the decision processes (and their results), the actual attacks are more likely to be implemented at the data analysis level. This is simply where we have software components that can be effectively attacked. Decision processes can therefore be attacked indirectly, through the data analysis solutions upon which they depend.

Figure 1 includes a simple model of a socio-technical system used in decision-making, with examples of security-relevant questions that can be asked regarding its critical components. This system includes two human decision makers, with shared goals and priorities, who use internal data and analysis solutions. In addition, there are some external data sources and analysis services separated by the trust boundary. When looking at such a system from a security point of view, we would usually start with the inbound data flows, since all untrusted input data needs to be validated. If we are concerned about the privacy of our data, we should also look at the outbound data flows, to fully understand what data are exported to systems outside our control.

Figure 1. A model of a socio-technical system used in decision-making, with examples of security questions regarding its critical components
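The review order described above, inbound flows first and then outbound flows, can be sketched as a simple classification of data flows relative to the trust boundary. The component names, the `Flow` class and the `review` function are hypothetical and only loosely mirror the model in the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    source: str
    target: str


# Components assumed to sit inside the trust boundary (illustrative names).
TRUSTED = {"decision_maker_1", "decision_maker_2", "internal_data", "internal_analysis"}


def review(flows, trusted):
    """Split boundary-crossing flows into inbound (validate input) and outbound (check privacy)."""
    inbound = [f for f in flows if f.source not in trusted and f.target in trusted]
    outbound = [f for f in flows if f.source in trusted and f.target not in trusted]
    return inbound, outbound


flows = [
    Flow("external_data_source", "internal_analysis"),
    Flow("internal_data", "external_analysis_service"),
    Flow("external_analysis_service", "decision_maker_1"),
    Flow("internal_data", "internal_analysis"),
]

inbound, outbound = review(flows, TRUSTED)
print("validate:", [f.source for f in inbound])
print("check for privacy:", [f.target for f in outbound])
```

Flows that stay entirely inside the boundary are ignored here, which matches the starting point of the review but not necessarily its full scope.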

Such a review quickly becomes much more complex when we move to data analysis scenarios as the base for decision-making.

  • In the scope of external data sources, we are interested not only in the format of the data, but also in their quality, credibility and completeness. Can we trust the data to accurately represent the specific phenomena we’re interested in? Note that such questions apply not only to data we consume directly, but also to data used by any analysis service we interact with.
  • When it comes to external analysis services, there is a lot of discussion about algorithm bias or the practical quality of models. It doesn’t help that many algorithms and models are black boxes due to their proprietary nature or selected business models. And again, this brings us to the questions about trust – will we get the results that we need and expect?
  • The third group of key elements includes decision makers, who need to apply the results in the context of a specific problem domain and decision situation. Their roles, tasks and types of interactions depend on the specific application scenario, but they always operate under some constraints (e.g. time pressure) and with cognitive limitations that can be taken advantage of.
  • This model will get even more interesting with AI agents joining our decision processes and operating as frontends to external analysis services or as active participants. In interactive cooperation scenarios, it is harder to control what information we are sharing. The questions about operational objectives and priorities of the agents will become very relevant.

We cannot focus only on the benefits and opportunities of new technologies and scenarios; we also need to think about new threats and their implications. Security is critical for any practical application, and that obviously includes decision-making based on data analysis. We need to design these processes to be reliable, trustworthy and resistant to attacks. This applies even to basic scenarios, with seemingly simple decisions like selecting an external data source or trusting a provider of data analysis services with our data. In the following post, we will talk about using experience from information and software security to better understand our systems and make more informed decisions.

Data analysis in social context

In the previous blog post we talked about the social context of our decision-making processes. We used an example from the healthcare domain to show that decision making these days rarely occurs in isolation and that technical solutions aimed at supporting these processes need to become essentially social. In this post, we will take a step further and talk a bit about designing data analysis solutions to be effective and useful in social and business contexts. These contexts are dynamic and usually more complex than they might seem. They include multiple elements, roles, types of relationships and structures; they can be designed and constructed, or grown organically; they can exist continuously in the background (everybody has multiple ones) or have a short lifespan tied to a specific purpose or situation. Such diverse characteristics can result in completely different functional requirements, which means that data analysis solutions need to be very flexible and adaptable.

Data analysis in social context is about sharing, not only of data and results, but also of efforts, skills, experiences and, probably most important here, different points of view. There are some technical elements that are common to all such solutions, including efficient data exchange that enables natural and smooth interactions, navigation through complex data spaces, and management of relationships (sometimes of completely new types). We can also try to identify some higher-level principles that help with building effective and useful solutions for various social contexts:

  • Focus is on users as the centers of social contexts. This starts with a personal user experience and need for understanding individual requirements and preferences. But it can quickly get even more difficult, if we have multiple users with incompatible or conflicting goals. There is a need for clarity (do these agents really operate according to my priorities?) and transparency (who can access data or control the process?). In many situations, analysis decision support may include defining contract-based goals and rules of data analysis efforts (e.g. solving a specific problem).
  • Data analysis processes are distributed efforts. The scope of data analysis in social context expands from an individual to groups, communities and eventually societies. This requires effective interactions between multiple participants, both humans and agents, across shared data spaces. Here the requirements can be very different, and a solution must support various scenarios covering cooperation, negotiation or competition. There can also be challenges in integrating individual experiences (each with a possibly different presentation) into a consistent group communication system.
  • A data analysis process is usually part of a bigger system. Problems and contexts are unique; types of tasks, best practices, patterns and challenges are more general. A data analysis process can benefit from similar external projects (e.g. for a population-level big picture) and contribute to them (with anonymized data). There are opportunities for sharing competencies, efforts and solutions even externally, in open, research or commercial frameworks. However, integration scenarios require very clear, consistent rules and transparency regarding privacy, security and ownership of information.
  • Intelligent agents can be essential participants in data analysis. Interactions during an analysis or decision making process can take place in networks of human and non-human actors. Intelligent agents can be interactive participants, sharing information with users or performing specific tasks on request. They can also operate in the background, monitoring actions, conversations or external events, and acting when it is needed or useful. In group scenarios, they may take special roles, like optimizing efforts, balancing the structure, or mediating with an odd or even number of agents.

Let’s take a quick look at that last point, as it seems to be the clearest illustration of relationships between technology and social contexts. We will reuse the example from the healthcare domain, introduced in our previous blog post, which shows relationships between a patient’s context (family and friends) and the physician’s context (professional medical network). Figure 1 presents that structure, with the addition of new connections involving intelligent agents, some interactive and others operating in the background. Interactive agents can provide direct assistance and support to patients, their friends and families, along with connections to the medical side, where different types of agents can help with coordination of efforts and collaboration in medical analysis. Background agents can enable various scenarios, like continuous remote monitoring (not only in the scope of physiological metrics), integration with population efforts (connecting physicians working on similar cases) or automatic documentation of decision processes.

Figure 1. An example of a social structure in healthcare combining humans and intelligent agents

Such scenarios may seem distant, but they are already here, although usually in simpler configurations, with a bot or a digital assistant as a front-end to a realm of specific services. In the scope of data analysis, including a social context is a natural consequence of focusing on the user’s goals, needs and preferences. In our framework, this focus starts with personalized user experiences based on individual choices and activities. For group scenarios, it is expanded to also include the user’s role, relationships and the characteristics of a social or business context. At this point data analysis is no longer only about sharing, but also about communication and conversations embedded in a shared data space. Intelligent agents can fit in such spaces very naturally and become key participants. An agent can interact with users, change their behaviors or even become an active driver of interactions between different users and agents. The result is a completely new social structure: technology is not only capable of adapting to a social context, but may shape it or, in some cases, construct it.

Human elements will long remain fundamental in solving real problems, and there are great opportunities for solutions facilitating cooperation in complex scenarios. There are situations where enabling efficient cooperation may actually be more important than selecting the right algorithms and analysis techniques. Data analysis solutions must, however, be designed for social and business contexts, with clear rules and transparency, always close to users and actively addressing challenges like possible incompatibilities in priorities between individuals or between an individual and a group. Including social context in data analysis is becoming unavoidable, due in part to the increasing popularity of conversation-based interactions. And with the application of intelligent agents, social context is added to all data analysis projects, even those conducted by a single user.

Social context of decision process

Our primary motivations for building data analysis solutions are to help with real problems and to make meaningful impacts. Solving a problem is all about decisions, sometimes a single big one, often a sequence of small steps leading to a preferable outcome. Data analysis software should help make better decisions, based on available data, in a timely manner and using natural experience. Some problems are isolated and solving them requires an individual exploration of vast data spaces – by a single user and with a single set of needs, requirements and preferences. However, in our digital reality, this is rarely the case in practice, as decision making processes usually occur in a social context. That context is based on a social structure of individuals (involved in solving a problem or affected by the solution), but it also includes other components like data sources, available analysis methods and, more and more often, intelligent agents that can actively participate in the decision process.

The relevance of a social context in decision making is most visible in healthcare, which is currently going through a digital revolution. With all the new data that can be processed and the application of advanced algorithms, healthcare is becoming more data-driven at all stages, including disease prevention, diagnosis and treatment. Different forms of data analysis improve the effectiveness and efficiency of decision processes and become key foundations for the next generations of health care. But with successful automation of specific tasks, the importance of human elements only increases. There is obviously a focus on the patient, as health care becomes more personalized, with customization of the process and of the medications (pharmacogenomics). A lot of attention is also given to physicians, due to the complexity and non-deterministic nature of this domain and the very high potential cost of an error. But this is still not enough, as success in health care critically depends on partnerships and collaborative relationships.

Social context in health care is built upon the relationship between patients and physicians. These relationships are no longer 1-to-1, nor symmetrical, as social structures on both sides usually include multiple participants. On the patient’s side, this is primarily a social network providing support and influence, with dynamics that can easily become complex, especially in scenarios where patients cannot take control over their health (e.g. children or elderly persons). On the physician’s side, there is a virtual team of medical professionals working with the patient; the physician should be the trusted point of contact, but the process can now expand beyond the knowledge and experience of any individual. Social context in health care is unique: for example, we may assume that all participants in the decision process share the same goal, i.e. the well-being of the patient. But this means that there are also unique functional requirements for these relationships to work: they must be designed for the long term, simple and convenient on a daily basis when there are no major problems, but also efficient and natural in case of a serious medical condition or an emergency.

Figure 1 includes a simple social structure built around the traditional relationship between a patient and a physician as its core. This is just a proof of concept, but similar models for real case studies can be created in various ways: defined a priori (e.g. by roles in a team), constructed based on provided information (e.g. key actors), or automatically generated using records of interactions.

Figure 1. An example of a social structure from a context of decision processes in health care
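The last option mentioned above, generating a model automatically from records of interactions, can be sketched as follows. This is a minimal illustration that assumes interactions are available as simple pairs of participants; the `build_structure` function, the record format and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def build_structure(records):
    """Build an undirected social structure from (participant, participant) records.

    Returns the number of interactions per pair and the set of neighbors per participant.
    """
    weights = Counter()
    neighbors = defaultdict(set)
    for a, b in records:
        if a == b:
            continue  # ignore self-references
        weights[tuple(sorted((a, b)))] += 1
        neighbors[a].add(b)
        neighbors[b].add(a)
    return weights, neighbors


# Illustrative records only.
records = [
    ("patient", "physician"),
    ("physician", "patient"),
    ("patient", "family member"),
    ("physician", "specialist"),
    ("patient", "physician"),
]

weights, neighbors = build_structure(records)
print(weights.most_common(1))
print(sorted(neighbors["physician"]))
```

In this sample the patient and physician pair has the most interactions, which reflects the traditional relationship at the core of the structure.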

The models for actual social structures and contexts are obviously dynamic and specific to a situation. That, in addition to possible complexity, makes the functional requirements for the quality of the relationships very hard to meet. In order to be successful in domains like healthcare, technology must be designed and implemented for the social contexts of its applications. This starts with strong generic fundamentals like secure and reliable data processing, natural experiences, or smooth integration with external components. But this is only the beginning if we want to facilitate efficient cooperation (which can be more important than the actual data analysis itself), enable building trust and partnership, or help with challenges, emotional factors (e.g. fear) and certain behaviors (e.g. avoidance). Such scenarios require relationship management functionality, which means that social context must be taken into consideration at each and every stage of creating software: this is no longer just another feature, but rather one of the core fundamentals.

Let’s take a brief look at the seemingly straightforward requirement of keeping participants of a decision process informed. This means, among other things, that the results of data analysis must be useful for the user. However, with a social context, we have multiple users, with individual needs, requirements and preferences. Each of them needs a different type of story: even the same information should be presented differently to a physician (all the details with analysis decision support) and to a patient (explanations with the option of learning more or starting a conversation). One of the features we’re developing in our framework is designed to provide personalized views of a shared data space to multiple users and roles in a social structure (for example a company). Personalized user experience is based on individual preferences, but also on analysis of the user’s role, profile (e.g. age for accessibility), the nature of the task or scenario, as well as any situational requirements (e.g. pressure due to an emergency). Figure 2 shows possible functional templates of personalized user experience for key users in our example.

Figure 2. Personalized user experiences in a social structure of decision processes in health care
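The idea of role-based views over a shared data space can be illustrated with a short sketch. It is not the implementation used in our framework: the `ViewTemplate` class, the field names, the templates and the emergency rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ViewTemplate:
    fields: Tuple[str, ...]  # ordered from most to least important
    detail: str              # "full" for professionals, "explained" for patients


# Hypothetical templates for two key roles.
TEMPLATES = {
    "physician": ViewTemplate(("summary", "lab_results", "model_output"), "full"),
    "patient": ViewTemplate(("summary", "next_steps"), "explained"),
}


def personalized_view(shared, role, emergency=False):
    """Select the part of the shared data space that fits a role and situation."""
    fields = TEMPLATES[role].fields
    if emergency:
        fields = fields[:1]  # assumed rule: under time pressure, show only the top field
    return {name: shared[name] for name in fields if name in shared}


# Placeholder content, for illustration only.
shared_space = {
    "summary": "placeholder summary",
    "lab_results": "placeholder lab results",
    "model_output": "placeholder model output",
    "next_steps": "placeholder next steps",
}

print(personalized_view(shared_space, "patient"))
print(personalized_view(shared_space, "physician", emergency=True))
```

A real solution would also take individual preferences and the user’s profile into account; here only the role and one situational requirement are modeled.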

In this post, we used healthcare as the application domain. In this domain, we can see that technology has the potential to improve existing processes and practices but, at the same time, it will change them dramatically. Modern data analysis solutions will not replace physicians, but they will change behavioral patterns of interactions between patients and physicians (and likely beyond). Obviously, healthcare is about people and relationships more than other domains. But since social context is so essential for our decision processes, we may expect similar changes in other domains affected by the democratization of data analysis. Data analysis is becoming social, following the path from data connecting users, through natural interactions and cooperation, to relationships focused on very specific challenges. Social contexts will become even more relevant as we start implementing scenarios involving intelligent software agents that can participate in our decision processes. With that change, we are no longer only adapting to a social context; we are actually trying to shape it.