If you’ve been paying attention to trends around Apache Hadoop, data lakes, big data analytics, and the cloud, you’ve probably noticed the see-saw hype around each of these. In 2012, there was no end in sight to what Hadoop could do, and organizations were beginning to build data lakes to augment or replace data warehouses to unlock new insights from data exhaust. Harvard Business Review declared in 2012 that data scientist was the sexiest job of the twenty-first century, and Microsoft announced their Azure Data Lake Store in 2015.
Hadoop and Data Lakes Enter the “Trough of Disillusionment”?
Fast forward five years, and the headlines sound much different. Claims I’ve heard include “Hadoop is dead,” “data lakes are data swamps,” and “AI is the new BI.” In a September 2017 research report titled “Derive Value from Data Lakes Using Analytics Design Patterns,” Gartner stated,
“Through 2018, 90% of deployed data lakes will be rendered useless, as they’re overwhelmed with information assets captured for uncertain use cases.”
Gartner has also said the term “big data” is going away, and its 2017 Hype Cycle for Data Management places Hadoop distributions in the trough of disillusionment.
These gloomy assertions don’t jibe with the optimistic anecdotes we’re hearing in the industry. As I’m about to share, recently collected statistics validate our observations.
New Primary Research on Data Lakes for Business Users Emerges
I’ve been in the big data market since 2008, and for at least the past three years I’ve been asking industry analysts for good primary research on the adoption and use of data lakes. No research firm that I’m aware of has done survey-based primary research specifically on the data lake design pattern (which I wrote about after the Gartner Data and Analytics Summit in March this year) and its adoption by business users.
So that’s what we did. Together with Eckerson Group and The Bloor Group, Arcadia Data sponsored a data lake assessment with the goal of finding out directly from end users via survey data how they are deploying and using data lakes.
This assessment was also designed as a benchmark for end users to grade their use of data lakes within their businesses across different end user types and to recommend ways to enhance the value of their data lake. The data lake assessment tool is still freely available for you to see how your use of data lakes compares to peers, but we took a snapshot of the data to share the state of data lake adoption from the research conducted. I’m very excited, and quite frankly a bit surprised, by the results we saw, so I want to summarize a few key points from the data.
Surprise #1: Hadoop Data Lakes Are Still the Norm
Given the growth of the cloud and providers such as Amazon, Microsoft, and Google, you might expect that many data lake deployments are on the cloud and use cloud object stores. Eckerson Group research showed, “The majority of respondents (62%) have deployed their data lake on Hadoop, while 15% and 16% have built it on a relational database or cloud object stores, respectively.”
Eckerson Group does note that there is a movement toward cloud object stores such as Amazon S3 and Azure Data Lake Store (ADLS). I have also heard some attendees at recent trade shows say things like, “I’m going straight to the cloud with our data lake.” Some people might interpret this as bypassing Hadoop, but you have to ask how that person defines Hadoop. Do they mean the various processing engines that sprang up around the Hadoop community or the Hadoop Distributed File System (HDFS) itself? My sense is they think of Hadoop as an on-prem deployment of HDFS, which leads to another surprise from the data lake survey data.
Surprise #2: On-Prem Still Exceeds Cloud Data Lake Deployments
Well, it may not be that surprising that the majority of deployed data lakes are on-premises, but you might have expected a much larger jump in more recent deployments to public cloud and hybrid cloud environments. As Eckerson Group points out, “Overall, the percentage of cloud deployments is only slightly greater for newly deployed data lakes than older ones.” This is what I found surprising.
There are some additional cuts of this data in the research report, which break down deployment by company size – five segments from “very small” (fewer than 100 employees) to “very large” (more than 10,000 employees) – from which you can deduce what types of companies are more likely to be cloud native vs. hybrid vs. on-prem.
Surprise #3: Data Lakes Are NOT Just for Data Scientists
The Gartner IT Glossary entry for data lake states, “The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse).”
I interpret this to mean data scientists are the only ones allowed to access Hadoop or the data lake for analytics, but the research results from Eckerson Group say something very different.
Eckerson Group states, “… about one-third of organizations (33%) have more than 250 business users accessing the data lake. Half of these users (50%) use a BI tool to query the data lake, while 25% use a language (e.g., Python), and 25% use SQL. This suggests that half are regular users and half are power users.”
Over 250 business users in one company using the data lake? That doesn’t sound like “the most highly skilled analysts” to me.
Moreover, Eckerson Group found companies have put powerful analytical functionality into the hands of business users.
“We were surprised by the high degree of analytical functionality available to data lake users, especially since about half are not data analysts or scientists. Almost two-thirds (63%) of respondents agreed that ‘business users can explore data (e.g., filter, drill) to get the views they want.’ Slightly less (61%) said ‘business users can author and edit reports and dashboards without coding.’”
There are a lot of other facts and figures from the 238 respondents with data lakes in production that I’d encourage you to check out for yourself in the research. The findings might help you justify getting started with your Hadoop, cloud, or data lake project, or help you take an existing project to the next level by giving more business users access to the system. The research answers questions like:
- Do data lakes provide fast performance for queries?
- Do users trust the accuracy of data from their Hadoop data lakes?
- What controls are administrators giving users within their data lake architecture?
- How do query tool usage, query performance, and business user access vary by company size?