Why Do I Need A Data Lake?

Bill Schmarzo

The data lake is gaining lots of momentum across the different customers to whom I talk.  Every, and I mean every, organization wants to learn why and how to implement a data lake.  But “because it is a cheaper way to store/manage data” is not a good reason to adopt a data lake.  The “Why do I need a data lake?” answer is much more powerful than just having the IT organization save some money.

The data lake is a powerful data architecture that leverages the economics of big data (where it is 20x to 50x cheaper to store, manage and analyze data as compared to traditional data warehouse technologies). And new big data processing and analytics capabilities help organizations address business and operational challenges that were difficult to address using conventional Business Intelligence and data warehousing technologies.

The data lake has the potential to transform the business by providing a singular repository of all the organization’s data (structured AND unstructured data; internal AND external data) that enables your business analysts and data science team to mine all of the organizational data that today is scattered across a multitude of operational systems, data warehouses, data marts and “spreadmarts”.

The value and power of a data lake are often not fully realized until we get into our second or third analytics use case.  Why is that?  Because it is at that point where the organization needs the ability to self-provision an analytics environment (compute nodes, data, analytic tools, permissions, data masking) and share data across traditional line-of-business silos (one singular location for all the organization’s data) in order to support the rapid exploration and discovery processes that the data science team uses to uncover variables and metrics that are better predictors of business performance.  The data lake enables the data science team to build the predictive and prescriptive analytics necessary to support the organization’s different business use cases and key business initiatives.

Joe Dossantos, the head of EMC Global Services Big Data Delivery team, termed this a “Hub and Spoke” analytics environment where the data lake is the “hub” that enables the data science teams to self-provision their own analytic sandboxes and facilitates the sharing of data, analytic tools and analytic best practices across the different parts of the organization (see figure 1).


Figure 1: Analytics Hub and Spoke Service Architecture

The hub of the “Hub and Spoke” architecture is the data lake.  The data lake has the following characteristics:

  • Centralized, singular, schema-less data store with raw (as-is) data as well as massaged data
  • Mechanism for rapid ingestion of data with appropriate latency
  • Ability to map data across sources and provide visibility and security to users
  • Catalog to find and retrieve data
  • Costing model of centralized service
  • Ability to manage security, permissions and data masking
  • Supports self-provisioning of compute nodes, data, and analytic tools without IT intervention
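
To make a few of these characteristics concrete, here is a minimal sketch of the hub's "store it as-is, then find it through a catalog" behavior.  It is a toy, in-memory stand-in and not a real data lake product; the `DataLake` and `CatalogEntry` names, the keys and the tags are all hypothetical and chosen only for illustration.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata recorded for each object so it can be found later."""
    key: str
    source: str
    tags: set
    ingested_at: float


@dataclass
class DataLake:
    """Toy, in-memory stand-in for the hub: raw storage plus a catalog."""
    objects: dict = field(default_factory=dict)  # key -> raw bytes, stored as-is
    catalog: dict = field(default_factory=dict)  # key -> CatalogEntry

    def ingest(self, key, raw, source, tags=()):
        # No schema is required up front; the payload is stored untouched.
        self.objects[key] = raw
        self.catalog[key] = CatalogEntry(key, source, set(tags), time.time())

    def find(self, tag):
        # Catalog lookup: return the keys of every object carrying the tag.
        return sorted(k for k, e in self.catalog.items() if tag in e.tags)

    def read(self, key):
        return self.objects[key]


lake = DataLake()
# Structured, unstructured and external data all land in the same store.
lake.ingest(
    "crm/2015-08-13.json",
    json.dumps({"customer": 42, "churned": False}).encode(),
    source="crm",
    tags=["customer", "structured"],
)
lake.ingest(
    "support/call-0001.txt",
    b"Customer called about a billing error.",
    source="call-center",
    tags=["customer", "unstructured"],
)
lake.ingest(
    "weather/sfo.csv",
    b"date,temp_f\n2015-08-13,68\n",
    source="external",
    tags=["external"],
)

customer_keys = lake.find("customer")
```

Here `lake.find("customer")` returns both the CRM extract and the call-center note, even though one is JSON and the other is free text.  A real implementation would add the security, masking and costing pieces from the list above.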

The spokes of the “Hub and Spoke” architecture are the resulting analytic use cases that have the following characteristics:

  • Ability to perform analytics (data scientist)
      ◦ Analytics sandbox (HDFS, Hadoop, Spark, Hive, HBase)
      ◦ Data engineering tools (Elasticsearch, MapReduce, YARN, HAWQ, SQL)
      ◦ Analytical tools (SAS, R, Mahout, MADlib, H2O)
      ◦ Visualization tools (Tableau, DataRPM, ggplot2)
  • Ability to exploit analytics (application development)
      ◦ 3rd platform application (mobile app development, web site app development)
      ◦ Analytics exposed as services to applications (APIs)
      ◦ Integration of in-memory and/or in-database scoring and recommendations into business processes and operational systems

The Analytics “Hub and Spoke” architecture enables the data science team to develop the predictive and prescriptive analytics that are necessary to optimize key business processes, provide a differentiated customer engagement and uncover new monetization opportunities.

You know something must be on the right track when the market incumbents are working so hard to either discredit it or spread confusion about it.  And that seems to be the case for the data lake.  Lots of vendors, press and analysts are trying to position the data lake as just an extension to the data warehouse; as data warehouse 2.0.  And with that sort of thinking, we risk repeating many of the fatal mistakes we made with data warehousing.

Confusion #1:  “Feed the Data Lake from the Data Warehouse.”

That’s ridiculous and is being pushed by traditional data warehouse vendors as the most appropriate use of the data lake.  Sorry, but that’s like inventing the jet engine and then saying that you’re going to pull it with a horse and buggy.

Loading data into a data warehouse means that someone has already made assumptions about what data, level of granularity and amount of history are important.  You have to make those assumptions in order to pre-build the data warehouse schema.  And that means that the raw data has already gone through data transformations (and content elimination) in order to get the data to fit into the data warehouse schema.  That is a lot of assumptions made a priori about what data, data granularity and data history are important when the only purpose of the data warehouse is to report on what happened!  That’s like going wine tasting and swabbing Vaseline on your tongue!  Many of the valuable nuances in the data have been removed in order to aggregate the data to fit into a reporting-centric data schema.


Figure 2: Data Lake Architecture

As you can see in figure 2, the data lake sits in front of the data warehouse to provide a data repository that can leverage the “economics of big data” (where it is 20x to 50x cheaper to store, manage and analyze data as compared to traditional data warehousing technologies) to store any and all data (structured AND unstructured; internal AND external) that the organization might want to leverage.  What are the benefits of having the data lake in front of the data warehouse?

  • Rapid ingest of data because the data lake captures data “as-is”; that is, it does not need to create a schema before capturing the data.
  • Un-handcuffing the data science team from having to try to do their analysis on the overly-expensive, overly-taxed data warehouse
  • Supporting the data science team’s need for rapid exploration, discovery, testing, failing, learning and refining of the predictive and prescriptive analytics that power the organization’s key business processes and enable new business models.
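
The difference between capturing data “as-is” and forcing it into a pre-built schema can be sketched in a few lines.  This is only an illustration under stated assumptions: the events, the field names and the two helper functions (`load_schema_on_write`, `read_with_schema`) are hypothetical, and the “warehouse” here is just an aggregation into a fixed reporting shape.

```python
import json

# Hypothetical raw events, as they might arrive from an operational system.
raw_events = [
    '{"store": "A", "sku": "x1", "qty": 2, "price": 5.0, "coupon": "SUMMER"}',
    '{"store": "A", "sku": "x2", "qty": 1, "price": 3.0}',
    '{"store": "B", "sku": "x1", "qty": 4, "price": 5.0, "coupon": "SUMMER"}',
]


def load_schema_on_write(events):
    """Warehouse-style load: aggregate to a fixed reporting schema (store, revenue)."""
    revenue = {}
    for line in events:
        e = json.loads(line)
        revenue[e["store"]] = revenue.get(e["store"], 0.0) + e["qty"] * e["price"]
    return revenue  # fields outside the schema (sku, coupon) are gone


def read_with_schema(events, fields):
    """Lake-style read: raw lines are kept, and a schema is applied per query."""
    for line in events:
        e = json.loads(line)
        yield {f: e.get(f) for f in fields}


warehouse = load_schema_on_write(raw_events)

# A later question nobody planned for: how many units were sold with a coupon?
coupon_units = sum(
    row["qty"]
    for row in read_with_schema(raw_events, ["qty", "coupon"])
    if row["coupon"] is not None
)
```

The aggregated `warehouse` result can report revenue by store, but it can no longer answer the coupon question because that field was eliminated at load time.  The raw events still can, which is the point of ingesting first and deciding on a schema later.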

The additional benefits of this architecture:

  • Provides an analytics environment where the data science team is free to explore new data sources and new analytic techniques in search of those variables and metrics that may be better predictors of business performance
  • Frees up expensive data warehouse resources and opens up SLA windows by off-loading the ETL processes from the data warehouse and putting those processes into the natively parallel, scale out, less expensive data lake

Clearly, having the data lake in front of the data warehouse is a win-win for both the data warehouse administrators and the data science organization.

Confusion #2:  “Create multiple data lakes.”

Oh, the creation of multiple data warehouses and multiple supporting data marts has worked out soooo well for the world of data warehousing.  Disparate, duplicated data warehouses and data marts are a debilitating problem in the world of data warehouses. Not only does this hinder the sharing of data across departments and lines of business, but more importantly it causes confusion and a lack of confidence by senior management in the data.  How can senior management be confident that they are dealing with the “right” data when every business unit or business function has created their own data warehouse?

The result: silo’ed data and no easy way (or willingness) to share data across the business units.

For the data lake to be effective, an organization should deploy only ONE data lake; a singular repository where all of the organization’s data – whether the organization knows what to do with that data or not – can be made available.  Organizations such as EMC are leveraging technologies such as virtualization to ensure that a single data lake repository can scale out and meet the growing analytic needs of the different business units – all from a single data lake.

Do me a big data favor and scold anyone who starts talking about data lakes (plural) instead of a data lake.

Confusion #3:  “Depend upon IT to manually allocate analytic sandboxes.”

Why insert a human-intensive IT intermediary into a process that can easily be managed, controlled and monitored by the system?  The data science team needs to be free to explore new data sources and new analytic techniques without adding a labor-intensive, middle step to have someone allocate the analytic sandbox environment.  IT as a Service, baby!  This seems more like a control issue than a technology issue, so fight IT’s urge to control the data science creative process.
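
What “managed, controlled and monitored by the system” might look like can be sketched as a policy check that grants a sandbox automatically.  Everything here is an assumption made for illustration: the `POLICY` table, the role name, the node limit, the `provision` function and the masking rule are hypothetical and do not describe any particular product.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical policy table: which roles may self-provision which datasets,
# and which columns must be masked before data lands in a sandbox.
POLICY = {
    "data_scientist": {"datasets": {"sales", "customers"}, "max_nodes": 8},
}
MASKED_COLUMNS = {"customers": {"email"}}


@dataclass
class Sandbox:
    owner: str
    nodes: int
    data: dict  # dataset name -> list of row dicts


def mask(value):
    # One-way hash: equal inputs still match each other, but the raw value is hidden.
    return hashlib.sha256(value.encode()).hexdigest()[:12]


def provision(owner, role, datasets, nodes, lake):
    """Grant a sandbox automatically when the request fits the policy."""
    rules = POLICY.get(role)
    if rules is None:
        raise PermissionError(f"unknown role: {role}")
    if nodes > rules["max_nodes"]:
        raise PermissionError("requested nodes exceed role limit")
    denied = set(datasets) - rules["datasets"]
    if denied:
        raise PermissionError(f"not permitted: {sorted(denied)}")
    data = {}
    for name in datasets:
        hidden = MASKED_COLUMNS.get(name, set())
        # Copy rows into the sandbox, masking sensitive columns on the way.
        data[name] = [
            {k: mask(v) if k in hidden else v for k, v in row.items()}
            for row in lake[name]
        ]
    return Sandbox(owner, nodes, data)


lake = {
    "sales": [{"customer_id": 1, "amount": 30.0}],
    "customers": [{"customer_id": 1, "email": "pat@example.com"}],
}
sandbox = provision("ana", "data_scientist", ["sales", "customers"], nodes=4, lake=lake)
```

IT still sets the rules (who can see what, how many nodes, which columns are masked), but no one has to hand-allocate each sandbox.  A request outside the policy, such as `nodes=16`, raises `PermissionError` instead of waiting in a ticket queue.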

The data lake is a game-changer not because it saves IT a whole bunch of money, but because the data lake can help the business make a whole bunch of money!  Do not get caught up in the ability to build a data lake; instead, focus on how the data lake can “Make me more money.”

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business”, is responsible for setting the strategy and defining the Big Data service line offerings and capabilities for the EMC Global Services organization. As part of Bill’s CTO charter, he is responsible for working with organizations to help them identify where and how to start their big data journeys. He has written several white papers, is an avid blogger and is a frequent speaker on the use of Big Data and advanced analytics to power an organization’s key business initiatives. He also teaches the “Big Data MBA” at the University of San Francisco School of Management.

Bill has nearly three decades of experience in data warehousing, BI and analytics. Bill authored EMC’s Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements, and co-authored with Ralph Kimball a series of articles on analytic applications. Bill has served on The Data Warehouse Institute’s faculty as the head of the analytic applications curriculum.

Previously, Bill was the Vice President of Advertiser Analytics at Yahoo and the Vice President of Analytic Applications at Business Objects.
