Threat modeling data analysis processes

In the previous post, we talked about how our data driven decision processes take place in socio-technical systems and become increasingly dependent on the results of data analysis solutions. These underlying technical solutions can be attacked in order to disrupt the decision processes, including changing their outcomes. In order to have confidence in data driven decision-making, we need to understand the threats that the underlying data analysis processes face. Fortunately, we can use experience from information and software security to do that.

Most security problems result from complexity, unverified assumptions and/or dependencies on external entities. We need to understand the system in order to protect it. This is a common problem in information and software security, where we use threat modeling methodologies to evaluate the design of an information processing system in a security context. We look at the system from the attacker's point of view and try to find ways in which security properties like confidentiality, integrity or availability could be compromised. The resulting threat model includes a list of enumerated threats, e.g. using a format like "an adversary performs action A in order to achieve a specific goal X". We look at each of these threats and check whether it is mitigated. If a mitigation is missing or incomplete, we can talk about a potential design vulnerability. We can apply the same approach to the socio-technical systems used in data driven decision processes.
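
As a minimal illustration (not tied to any specific methodology), such a threat list can be kept as structured data and checked for missing mitigations during a review. The threat entries below are hypothetical examples, sketched in Python:

```python
from dataclasses import dataclass, field

@dataclass
class Threat:
    """One enumerated threat: 'an adversary performs ACTION in order to achieve GOAL'."""
    action: str
    goal: str
    property_at_risk: str                  # e.g. "integrity", "confidentiality", "availability"
    mitigations: list[str] = field(default_factory=list)

    def is_mitigated(self) -> bool:
        return len(self.mitigations) > 0

# Hypothetical entries for a data analysis pipeline -- illustration only.
threats = [
    Threat("injects fabricated records into an external data source",
           "change the outcome of the decision", "integrity",
           mitigations=["data quality checks", "source verification"]),
    Threat("intercepts internal data sent to an external analysis service",
           "learn our goals and priorities", "confidentiality"),
]

# Any threat without a mitigation points to a potential design vulnerability.
for t in threats:
    if not t.is_mitigated():
        print(f"Unmitigated ({t.property_at_risk}): an adversary {t.action} "
              f"in order to {t.goal}")
```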

This is actually what we started to do in the previous post, when we were asking questions about entry points to our system, its dependencies and the related assumptions. Below we briefly discuss possible mitigations, which can be technical, but also organizational or legal. Figure 1 shows the same model of a socio-technical system as in the previous post, but this time with examples of mitigations related to the critical components.

Figure 1. A model of a socio-technical system used in decision-making, with examples of possible mitigations.

We know mitigations for generic threats (e.g. how to store and transfer data securely); however, mitigations for threats specific to data analysis scenarios still need to be researched.

  • In the scope of external data sources, we want to know where the data are coming from, what transformations were applied, or how missing values and outliers were handled. We can apply data quality metrics and techniques to verify data origin (see the first sketch after this list), but there are still scenarios that will be more difficult, for example when data are contributed by many anonymous users.
  • In the case of algorithms and models, we might need information like accuracy metrics, full configurations or summaries of training and test data sets. We may periodically test the models or evaluate the results from different providers (see the second sketch after this list). Still, in some cases, independent certifications or specific service level agreements, also covering analysis objectives and priorities, may be required.
  • Decision makers can be protected with user experiences designed for decision-making scenarios (taking into account domain characteristics or situational requirements). There are great opportunities here for analytical and visual decision support. We need to be careful with new types of interfaces, like augmented reality, as they will bring new types of threats against our cognitive abilities.
  • New types of threats will also emerge from the integration of AI agents into decision contexts. We don't know the detailed applications yet, but we can already think about potential mitigations focused on measuring, analyzing and controlling interactions. Some types of threats, like repudiation, will likely become much more important in such cooperation scenarios.
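
For the external data sources mentioned above, a very simple quality gate could look like the sketch below. This is a minimal example in Python with pandas; the file name, column name and thresholds are hypothetical, and real checks would be tailored to the specific source:

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame, value_col: str) -> dict:
    """Simple data quality signals for an external data source:
    completeness, share of missing values, and a crude outlier count."""
    series = df[value_col]
    mean, std = series.mean(), series.std()
    outliers = series[(series - mean).abs() > 3 * std]       # naive 3-sigma rule
    return {
        "rows": len(df),
        "missing_ratio": series.isna().mean(),
        "outlier_count": int(outliers.count()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Hypothetical usage: flag or reject a feed before it enters the analysis pipeline.
df = pd.read_csv("external_feed.csv")                         # assumed file name
report = basic_quality_report(df, value_col="measurement")    # assumed column name
if report["missing_ratio"] > 0.05:
    raise ValueError(f"External feed failed quality gate: {report}")
```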
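
Similarly, periodic testing of models or providers could be as simple as scoring each one on our own held-out data against an agreed threshold. A minimal sketch using scikit-learn, with synthetic data standing in for our test set and local models standing in for external providers:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for our own held-out data; in practice this would be data
# whose labels the external providers never see.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Local models standing in for external analysis providers.
providers = {
    "provider_a": LogisticRegression(max_iter=1000).fit(X_train, y_train),
    "provider_b": DecisionTreeClassifier(max_depth=3).fit(X_train, y_train),
}

MIN_ACCURACY = 0.85   # hypothetical agreed threshold, e.g. from a service level agreement
for name, model in providers.items():
    acc = accuracy_score(y_test, model.predict(X_test))
    status = "OK" if acc >= MIN_ACCURACY else "below agreed threshold"
    print(f"{name}: accuracy = {acc:.3f} ({status})")
```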

In order to have confidence in data driven decisions, we need to design our processes to be reliable, trustworthy and resistant to attacks. This requires a good understanding of the goals and assets of our decision-making; based on that, we can specify requirements for the underlying data analysis and make informed decisions about selecting specific data sources and analysis components. Threat modeling can be a great tool for that, but the methodologies must be adapted to the nature of socio-technical systems, which can be very dynamic and hard to model. There can also be new opportunities, as we could define new requirements related, for example, to transparency, accountability or independence. Such requirements could be very useful for decisions with broad social impact or shared goals, which have to be agreed upon between multiple parties.

Security efforts are continuous by nature. New technologies enable new scenarios, leading to new threats, which may require new or updated mitigations. We need to keep thinking about threats and cannot focus only on the opportunities and benefits of new technologies and applications. Otherwise, we may soon find our decision processes to be very effective and accurate, but no longer compatible with our goals and priorities.

This series of posts is based on the presentation given at the Bloomberg Data for Good Exchange conference on September 24, 2017 (paper, slides).

Security of data driven decision-making

Our decision processes are becoming more data driven, in individual, social and global scope. This is a good and natural trend, which gives us hope for more accurate and rational decisions. It is possible due to three major changes: we have much more data, algorithms and models that are useful in practice, and easily available computing resources. These changes are not disconnected; rather, we should consider them the foundation upon which our decision processes can be constructed. Data driven decision processes therefore take place in socio-technical systems and include at least two levels: the actual decision processes, with their social and business dimensions, and the level of data analysis, with software and other technical components. It is critical to have consistency between these two levels, as the outcome of decision-making depends on the results of the underlying data analysis.

In complex systems, many things can go wrong. Different elements of socio-technical systems are susceptible to failures caused by random errors or by intentional actions of third parties. In the second case, we can talk about threats against decision processes, which can be defined as any activities aimed at disrupting their execution or changing their outcome. It is interesting to note that even though the goals of an attacker are usually related to the decision processes (and their results), the actual attacks are more likely to be implemented at the data analysis level. This is simply where we have software components that can be effectively attacked. Decision processes can therefore be attacked indirectly, through the data analysis solutions upon which they depend.

Figure 1 includes a simple model of a socio-technical system used in decision-making, with examples of security-relevant questions that can be asked about its critical components. This system includes two human decision makers, with shared goals and priorities, who use internal data and analysis solutions. In addition, there are some external data sources and analysis services separated by a trust boundary. When looking at such a system from a security point of view, we would usually start with the inbound data flows, since all untrusted input data need to be validated. If we were concerned about the privacy of our data, we should also look at the outbound data flows, to fully understand what data are exported to systems outside our control.

Figure 1. A model of a socio-technical system used in decision-making, with examples of security questions regarding its critical components.
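
As one concrete example of handling inbound flows, here is a minimal sketch of validating a single record from an untrusted source before it enters the analysis pipeline. The field names, types and ranges are hypothetical; a real validation policy would depend on the data source:

```python
def validate_inbound_record(record: dict) -> list:
    """Check one record from an untrusted external source against a simple
    schema (expected fields, types and value range) before accepting it."""
    errors = []
    expected = {"source_id": str, "timestamp": str, "value": float}   # hypothetical schema
    for field, expected_type in expected.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    if not errors and not (0.0 <= record["value"] <= 1000.0):          # hypothetical range
        errors.append("value out of expected range")
    return errors

# Records failing validation are rejected (or quarantined) rather than analyzed.
suspect = {"source_id": "sensor-17", "value": 99999.0}
print(validate_inbound_record(suspect))   # -> ['missing field: timestamp']
```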

Such a review quickly becomes much more complex when we move to data analysis scenarios as the basis for decision-making.

  • In the scope of external data sources, we are not only interested in the format of the data, but also in their quality, credibility or completeness. Can we trust the data to accurately represent the specific phenomena we're interested in? Please note that questions like this apply not only to data we consume directly, but also to data used by any analysis service we interact with.
  • When it comes to external analysis services, there is a lot of discussion about algorithmic bias or the practical quality of models. It doesn't help that many algorithms and models are black boxes due to their proprietary nature or the selected business models. And again, this brings us to questions about trust – will we get the results that we need and expect?
  • The third group of key elements includes the decision makers, who need to apply the results to the context of a specific problem domain and decision situation. Their roles, tasks and types of interactions depend on the specific application scenario, but they are always operating under some constraints (e.g. time pressure) and with cognitive limitations that can be taken advantage of.
  • This model will get even more interesting with AI agents joining our decision processes, operating as frontends to external analysis services or as active participants. In interactive cooperation scenarios, it is harder to control what information we are sharing, and questions about the operational objectives and priorities of the agents will become very relevant.

We cannot focus only on the benefits and opportunities of new technologies and scenarios; we also need to think about new threats and their implications. Security is critical for any practical application, and that obviously includes decision-making based on data analysis. We need to design these processes to be reliable, trustworthy and resistant to attacks. This applies even to basic scenarios, with seemingly simple decisions like selecting an external data source or trusting a provider of data analysis services with our data. In the following post, we will talk about using experience from information and software security to better understand our systems and make more informed decisions.

This series of posts is based on the presentation given at the Bloomberg Data for Good Exchange conference on September 24, 2017 (paper, slides).