Challenges in Adopting a Big Data Strategy (Part 2 of 2)


Wednesday, February 18th, 2015 - 1:00
Adopting a big data strategy presents four challenges for public sector organizations. This is the second entry of a two-part blog post that identifies those challenges (talent management, interoperability, trust in the data, and cyber infrastructure) and poses a few solutions to help mitigate the risk these challenges present.


This entry is the third in a series of blog posts in which I discuss research designed to address the challenges that the government faces regarding big data. In the first blog post in this series, I described why data should be defined as "big" based on the complexity of the data, not volume alone.

The public sector faces four challenges in adopting big data technologies. In the previous entry, Challenges in Adopting a Big Data Strategy (Part 1 of 2), I addressed the first two challenges: talent management and interoperability. In this entry, I address building trust in the data and cyber infrastructure.

Challenge 3: Building Trust in the Security, Privacy, and Validity of Data
One reason organizations keep data locked behind firewalls and are leery of sharing it is a lack of trust. Owners of the data and lawmakers are concerned about security, privacy, and maintaining the validity of the data. One instructor at Big Data University observed: "Many are rushing to big data like the Wild West, but there is sensitive data that needs to be protected and retention policies need to be determined." My research showed that organizations can build trust in the data by conducting a risk assessment, anonymizing the data, and designing transparency into the data warehouse.

Solution 1: Conduct a Risk Assessment
Mr. Baitman was clear that there is always risk in sharing data. Mr. Paydos identified one dimension of trust that must be built: trust that the value of the data and analytics is worth the risk of compromise. Data has value: research value, analytic value, and monetary value. Of these, monetary value causes ethical dilemmas. Ms. Wigen likened the dilemma to the bio-ethics issues of recent decades, in which scientific advances in agriculture and medical research led to debates on issues such as stem cell research, cloning, and genetically modified food. The Big Data Senior Steering Group recognized that a new field is growing in the study of data ethics. The group also recommended that the Networking and Information Technology Research and Development Program form a steering group, branching from its Cyber Security Group, to discuss ethics and privacy issues. With these ethics and privacy issues in mind, Mr. Baitman emphasized that public sector leaders need to have an open dialogue on why accepting risk is important to public health and well-being, or the greater good. Leaders need to examine a cost-benefit model to assess the risk, and they need to be willing to accept a reasonable amount of risk.
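To make the cost-benefit idea concrete, here is a minimal sketch of the kind of calculation such a model implies. The function name and all dollar figures are hypothetical illustrations of the reasoning, not figures from my research:

```python
def risk_adjusted_value(benefit, breach_probability, breach_impact):
    """Net value of sharing a data set: expected benefit minus expected loss.

    A toy cost-benefit model; all inputs are hypothetical dollar estimates.
    """
    expected_loss = breach_probability * breach_impact
    return benefit - expected_loss

# Hypothetical example: sharing a health data set yields $5M in research
# value, with a 2% chance of a breach costing $50M.
net = risk_adjusted_value(5_000_000, 0.02, 50_000_000)
print(net)  # 4000000.0 -> positive, so the risk may be reasonable to accept
```

A real assessment would weigh many more factors (legal exposure, reputational harm, mission impact), but even this simple framing forces leaders to state the benefit and the risk in comparable terms.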

Solution 2: Anonymize the Data
One way to minimize the risk to privacy is to anonymize the data. Each person I interviewed mentioned that the first step in data privacy is making the data anonymous, or de-identifying it. Mr. Baitman felt that there needs to be more support for academic research into methods for anonymizing data. Most often, data is anonymized as it is placed in the data warehouse, either by removing personally identifiable information or by aggregating the data to a level that preserves privacy. Mr. Szakal added that organizations could apply privacy controls by developing a set of common data protection services that include encryption, obscuring or masking data sets, and controlling data access.
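A minimal sketch of these two techniques, removing identifiers and aggregating a quasi-identifier, might look like the following. The field names and the salt value are hypothetical; a real pipeline would manage secrets properly and apply far more rigorous de-identification standards:

```python
import hashlib

SALT = "replace-with-secret-salt"          # hypothetical; manage as a secret in practice
PII_FIELDS = {"name", "ssn", "address"}    # hypothetical direct identifiers

def anonymize(record):
    """Drop direct identifiers, pseudonymize the ID, and coarsen quasi-identifiers."""
    clean = {k: v for k, v in record.items() if k not in PII_FIELDS}
    # Pseudonymize with a salted one-way hash so records for the same
    # person can still be joined without exposing the identifier.
    clean["person_id"] = hashlib.sha256((SALT + record["ssn"]).encode()).hexdigest()[:16]
    # Aggregate exact age into a 10-year band to help preserve privacy.
    decade = record["age"] // 10 * 10
    clean["age_band"] = f"{decade}-{decade + 9}"
    del clean["age"]
    return clean

record = {"name": "Jane Doe", "ssn": "123-45-6789", "address": "1 Main St",
          "age": 34, "visits": 5}
print(anonymize(record))
```

Note that masking alone is not a guarantee: combinations of quasi-identifiers can still re-identify individuals, which is exactly why Mr. Baitman called for more academic research into anonymization methods.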

Solution 3: Design Transparency into the Data Warehouse
Lastly, designing transparency into the data warehouse is another solution for building trust in the data. Mr. Paydos identified the other two dimensions of trust that must be built: trust that the data itself is accurate and complete, and trust in the security of the data. Mr. Baitman added that an organization can build those dimensions of trust through transparency: by identifying who is using the data and for what reason, and by identifying the source of the data. He suggested using metadata as a means of providing this information about the data. Ms. Wigen also pointed out that data which is well structured and governed is more trustworthy, as it is designed for reuse. Recall that metadata was also a means to design a data governance program. Mr. Murrow defined this type of data governance as master data management, in that it provides the ability to know who is using the data and for what purpose. This additional use of metadata to provide transparency underscores its importance.
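As a sketch of what such transparency metadata could capture, the structure below records a data set's source and an access log of who used it and why. The class, field names, and example values are hypothetical illustrations, not a description of any particular master data management product:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetMetadata:
    name: str
    source: str                            # provenance: where the data came from
    access_log: list = field(default_factory=list)

    def record_access(self, user, purpose):
        """Log who used the data and for what reason, supporting transparency audits."""
        self.access_log.append({
            "user": user,
            "purpose": purpose,
            "at": datetime.now(timezone.utc).isoformat(),
        })

meta = DatasetMetadata(name="benefits_claims_2015", source="State intake system")
meta.record_access("analyst_01", "fraud trend analysis")
print(meta.access_log[0]["user"])
```

Keeping provenance and usage alongside the data itself is what lets an organization answer the trust questions above without manual investigation.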

Challenge 4: Development of an Adequate Cyber Infrastructure
The final challenge associated with big data in the public sector that I will address in this blog is the development of an adequate cyber infrastructure to collect, store, and make big data available. Ms. Wigen said that public agencies need help building manned data hubs with the tools and expertise to clean, manage, and store data so analysts can access it and reuse it. Development Operations (DevOps), open source software, and cloud computing are means of developing an adequate infrastructure.

Solution 1: Adopt an Iterative DevOps Development Methodology
The most common practice in adopting a big data capable cyber infrastructure is an iterative approach, which Mr. Szakal identified as DevOps. This approach includes design, development, deployment, and feedback on the value of the function provided. It acknowledges the interdependence between software development and information technology operations. The function provided most often in the adoption of big data technologies is data warehouse augmentation, which builds on an existing data warehouse infrastructure yet leverages big data technologies to augment the data's value. He added that DevOps tends to break down when security and controls need to be integrated. When this breakdown occurs, he suggested a roll-forward technique: continuing to operate the entire system in a degraded state, bringing down only the part of the system that is at risk or needs development. Historically, he said, the public sector falls back on a previous version, which breaks the iterative development cycle and its momentum.

Solution 2: Insist on Open Source Software
Recall from my previous blog post, which defined big data, that data is becoming increasingly complex and the amount of data produced each day is growing exponentially. The need to leverage a variety of data and optimize the warehouse infrastructure is driving data scientists to augment data warehouses with big data technologies such as the Hadoop Distributed File System (HDFS). Technologies such as HDFS allow analysts and software to access the desired data quickly, reliably, and efficiently. More importantly, Hadoop is open source software, and free courses on writing Hadoop code are available online. It is important that public organizations focus on using open innovation and open source software. This type of software is non-proprietary, so an organization doesn't get locked into a specific vendor. Additionally, the internet is full of information and forums where programmers share code and knowledge about open source software. R is an example of a very powerful, open source analytic software tool, and incidentally, the primary tool I used in graduate school. It is free, and there are hundreds of packages available online. If I ever got stuck, the answer and appropriate code were only one Google search away. Beyond R, other open source software tools are available; organizations should insist on their use in statements of work when possible.

Solution 3: Migrate to a Cloud-Based Architecture
Another solution for building an adequate cyber infrastructure is moving to a cloud-based data warehouse. Ms. Wigen pointed out that among the myriad benefits of adopting a cloud solution, for analysts it reduces the time and resources required to move and process massive amounts of data. With the data in one location where many analysts can process it, organizations can consolidate data storage, reduce software licenses, and reduce the number of data sets that need to be cleansed and processed. In addition to streamlining this process, it facilitates the creation of an authoritative data source, a desire I heard from several public sector decision makers.

Conclusion: Risk Mitigation is Possible in Big Data Analytics
Adopting a big data strategy presents a number of challenges and exposes an organization to a considerable amount of risk. Mr. Murrow proposed that it is riskier for organizations not to adapt to the changes. Organizations must be innovative to stay ahead of the competition, which is why many of these lessons learned come from the profit-driven private sector. He observed that public sector leaders tend to accept more risk on things that are related to the mission. Analysts and data scientists have a responsibility to show these leaders that in many cases, big data is critical to the mission of decision making. I will address how they can meet that responsibility in future blogs.

***The ideas and opinions presented in this paper are those of the author and do not represent an official statement by IBM, the U.S. Department of Defense, U.S. Army, or other government entity.***