Securing Hadoop in the Enterprise: Building Your Threat Model

Beginning the "Securing Hadoop in the Enterprise" Series

This post, which covers general areas of concern when threat modeling Hadoop environments, is the first in what I expect to be a very long series of articles about architecting, designing, deploying, and operating Hadoop in moderate-to-high security enterprise environments. Hadoop is a complex topic in any environment, and I hope to help readers deploy it securely from the beginning instead of learning lessons the hard way, particularly in enterprises with their own security and operating standards. The series will start with higher-level concepts to be aware of when designing your deployment, then delve into greater technical detail as we work to configure your clusters.

Threat Modeling

While it is very easy to get wrapped up in the technology behind the Hadoop ecosystem, when developing and implementing a secure Hadoop deployment in the enterprise it is important to first understand your threat model. As with any system, you could spend all of your time analyzing every potential security problem in Hadoop and its related projects, investigating the security of other applications and how they will integrate with your cluster, and building custom workarounds or compensating controls for them. No matter how much time or energy you spend, though, no system is perfectly secure, and Hadoop is no different. In fact, the complexity of the Hadoop ecosystem belies the very notion of "perfect security" and makes a threat model the right tool for understanding where to focus your energy in securing your deployment.

There are many threat modeling methodologies, and many resources from which to learn them (I personally prefer the STRIDE framework), so this article does not cover the use of any particular framework. Instead, I will highlight some areas where I feel deeper discussion in your threat modeling sessions may be required due to the complexities of the Hadoop ecosystem.

Threat Actors

There are many generic lists of threat actors which can be used in threat modeling exercises, but to truly make the most of a threat modeling exercise for a Hadoop environment I believe there are some specific potential threat actors inside your environment which should be analyzed. When analyzing potential insider threat actors, it is important to think through both the possibility of an insider acting maliciously and the possibility of that person's account being compromised and used maliciously by an attacker.

Cluster Administrators

The first potential threat actors to analyze in any environment are those with administrative privileges of any sort. There are many ways to split administrative duties in a Hadoop environment, and how those privileges are split should be determined as a result of your threat modeling exercise. Put another way, the threat model should help determine how much separation of duties is required between administrative users to satisfy your corporate risk tolerance. When considering these administrators and the specific threat vectors they represent, you should consider a number of factors:

  1. What is the source of these administrators and how extensive is the vetting process for these personnel?
    • This question is particularly important if management of your Hadoop environment is being outsourced to a third-party service provider. While entire books can be written about how to perform proper risk management for outsourced service providers, it is important to explicitly consider the service provider's hiring model, background check process, and project staffing process, and whether the service provider will follow your corporate policies and procedures for information system management or will be given more latitude to service your environment as they see fit.
  2. Are these administrators trusted enough to have extensive access to data?
  3. Does your Hadoop environment contain data sensitive enough, or contribute to business processes sensitive enough, to require a strict separation of duties?
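
The third question can be made a little more concrete with a small check like the one below. This is only a minimal sketch in Python: the duty names, the pairs treated as conflicting, and the administrator names are all hypothetical placeholders, and the real privileges will depend on which Hadoop components you deploy and how each one handles authorization.

```python
# Hypothetical duty names and conflict pairs, for illustration only.
# Your threat model decides which combinations are actually unacceptable.
CONFLICTING_DUTIES = [
    frozenset({"manage_encryption_keys", "read_cluster_data"}),
    frozenset({"grant_permissions", "review_audit_logs"}),
]


def find_conflicts(assignments):
    """Map each administrator to the conflicting duty pairs they hold in full."""
    conflicts = {}
    for admin, duties in assignments.items():
        # A pair is a conflict only if the administrator holds both duties.
        held = [sorted(pair) for pair in CONFLICTING_DUTIES if pair <= duties]
        if held:
            conflicts[admin] = held
    return conflicts


# Hypothetical administrators and the duties assigned to each.
assignments = {
    "admin_a": {"manage_encryption_keys", "read_cluster_data"},
    "admin_b": {"grant_permissions"},
    "admin_c": {"review_audit_logs", "read_cluster_data"},
}

conflicts = find_conflicts(assignments)
for admin, pairs in conflicts.items():
    print(f"{admin} holds conflicting duties: {pairs}")
```

Keeping the conflict pairs in one list means the output of the threat modeling exercise (which combinations exceed your risk tolerance) lives in a single reviewable place.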

Keep in mind that each component of the Hadoop ecosystem is developed independently and many have different authorization mechanisms, meaning it is not always possible to enforce a truly strict separation of duties without implementing quite a few compensating controls. More details on this will be included in a future post.

Users


Depending on the use cases for which your Hadoop environment is being implemented (you do have use cases, right?), there can be many different types of users who need to access or use cluster services. The most important thing to remember about these users is that they will be focused on the data and getting their jobs done, not on the system's security controls. When analyzing how users may interact with your Hadoop environment, ensure you are addressing all the potential avenues by which a user may reach the system. What is the severity of the threat to the system if a specific population of users begins using the environment in an unexpected way or gains unauthorized access to a different part of the environment? Focus your security control design on the user populations or parts of the environment where that threat is highest.

As a very simple example, if you have a Hadoop cluster which stores both high sensitivity and low sensitivity information, and completely separate user populations access each type of information, you should focus the majority of your efforts on ensuring that users who should only be able to access the low sensitivity information cannot access the high sensitivity information. From a risk management and resource consumption perspective, this is far more important than ensuring that the users accessing high sensitivity information cannot access the low sensitivity information.
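
The prioritization in this example can be sketched as a simple ranking. The Python below is an illustration under stated assumptions, not a methodology: the numeric sensitivity scale, the population names, and the dataset names are all hypothetical, and a real exercise would use whatever data classification scheme your organization already has.

```python
from dataclasses import dataclass

# Hypothetical sensitivity scale; the labels and numbers are illustrative only.
SENSITIVITY = {"low": 1, "high": 3}


@dataclass(frozen=True)
class Population:
    name: str
    authorized_for: frozenset  # names of datasets this population may read


@dataclass(frozen=True)
class Dataset:
    name: str
    sensitivity: str  # key into SENSITIVITY


def rank_unauthorized_access(populations, datasets):
    """List every (population, dataset) pair that should be denied, worst first."""
    findings = []
    for pop in populations:
        for ds in datasets:
            if ds.name not in pop.authorized_for:
                # Severity here is simply the sensitivity of the data reached.
                findings.append((SENSITIVITY[ds.sensitivity], pop.name, ds.name))
    return sorted(findings, reverse=True)


# Hypothetical populations and datasets mirroring the example above.
populations = [
    Population("marketing_analysts", frozenset({"web_clickstream"})),
    Population("fraud_investigators", frozenset({"payment_records"})),
]
datasets = [
    Dataset("web_clickstream", "low"),
    Dataset("payment_records", "high"),
]

ranking = rank_unauthorized_access(populations, datasets)
for severity, pop, ds in ranking:
    print(f"severity={severity}: {pop} must not reach {ds}")
```

The low sensitivity population reaching high sensitivity data sorts to the top, which is where the bulk of the control design effort should go.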

External Attackers vs. Insider Threat

The topic of insider threat in the enterprise is getting a lot of attention right now, and rightly so, but I will just add one note here. If an external attacker performs a successful phishing attack against an employee, gains a beachhead inside the corporate network, and begins attacking the network using the employee's user context, what distinguishes this external attacker from an insider?

Access Avenues

The overall number of services within your Hadoop environment is definitely not going to be small, and it will almost certainly grow over time to accommodate proliferating business use cases. Since the flexibility of the Hadoop ecosystem relies heavily on an architecture of separate services with numerous integration points, there are many ways in which the system can be accessed. By examining the intended uses of each service and determining the threat posed by illicit access to each of these service interfaces, appropriate risk-based determinations can be made concerning the overall Hadoop environment architecture.

For example, HttpFS is a service which is meant to provide an interface to end users, so as long as strong authentication and access controls are in place there is little threat to the system in exposing this interface. However, the potential threat to the cluster in exposing the KMS broadly may be higher, since this service should only be available in a programmatic fashion to cluster services.
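
One way to keep track of this during a threat modeling session is a small inventory that compares who each interface is meant for with who can actually reach it. The following Python is a minimal sketch: the two-zone model, the zone names, and the exposure values are assumptions made up for illustration, and only the intended audiences for HttpFS and the KMS come from the example above.

```python
from dataclasses import dataclass

# Hypothetical network zones, ordered from narrowest to broadest reach.
ZONES = ["cluster", "corporate"]


@dataclass(frozen=True)
class ServiceInterface:
    name: str
    intended_zone: str  # broadest zone that legitimately needs the interface
    exposed_zone: str   # broadest zone that can currently reach it


def overexposed(services):
    """Return the names of services reachable more broadly than intended."""
    return [
        s.name
        for s in services
        if ZONES.index(s.exposed_zone) > ZONES.index(s.intended_zone)
    ]


# Exposure values below are invented for the sake of the example.
services = [
    ServiceInterface("HttpFS", intended_zone="corporate", exposed_zone="corporate"),
    ServiceInterface("KMS", intended_zone="cluster", exposed_zone="corporate"),
]

flagged = overexposed(services)
print("Review exposure of:", flagged)
```

An inventory like this does not replace authentication or access controls on any interface; it only highlights where exposure exceeds the intended audience so the discussion can focus there.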


How much effort you put into placing security controls on a specific Hadoop environment should depend heavily on the sensitivity of the data in the environment and the business processes which rely on your Hadoop deployment.