jason atchley
Why you can't buy a data lake
May 17, 2015
By Hemant Varma
Data lake has become one of the latest buzzwords in data management, and it is probably the most misunderstood concept. The reason is that it can mean something different for each organization, depending on how much data exists, what the bottlenecks are, and how the data is going to be used. The good news is that technology exists today that can enable a wide variety of use cases; a good example is the Hadoop ecosystem.
In pharmaceutical and biotechnology research organizations, the explosion of data arrived long before it hit other industries. The field went from sequencing a single human genome over multiple years to sequencing multiple genomes per week, generating terabytes of raw data that needed to be managed so it could be analyzed at least as fast as it was produced. With Hadoop-based solutions this is now somewhat a solved problem, but that was not the case when sequencing technologies started evolving more than ten years ago.
Then there are the experiments conducted across various departments, for example in the quest to characterize the functions of genes. Scientists need a flexible environment to store their results, record their insights, and quickly share those insights with other groups within the organization. The experimental data requires structured data management, but the insights are largely unstructured text that needs to be analyzed as such, which is where a data lake solution can provide value.
There are new technologies being introduced in laboratories all the time, and the data from these systems needs to be integrated quickly. Traditionally, the way to manage experimental protocols and data has been to develop large enterprise LIMS systems that take several years to build and are generally obsolete by the time they are deployed. A data lake in this context could be a re-imagined LIMS that does not require thousands of hours of programming to integrate new sources of experimental data. One can argue that you can still dump the output from experiments into a Hadoop environment, but managing samples through a laboratory workflow, seamlessly integrating instrumentation within that workflow, and tracking the details of each process are all needed for regulatory reasons.
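The idea of storing structured experimental results alongside unstructured insight text, and applying structure only when the data is read, can be sketched in a few lines. This is a minimal schema-on-read illustration in plain Python; the record fields (`type`, `gene`, `expression`, `text`) are hypothetical stand-ins, and the in-memory list stands in for a file in HDFS or object storage.

```python
import json

# Raw experiment records land in one store as-is (the "lake");
# structure is applied only when the data is read, not at ingest time.
raw_store = []  # stand-in for a file in HDFS or object storage

def ingest(record: dict) -> None:
    """Append a record verbatim, without enforcing any schema up front."""
    raw_store.append(json.dumps(record))

def read_assay_results(gene: str) -> list:
    """Schema-on-read: parse and filter only the records the query needs."""
    results = []
    for line in raw_store:
        rec = json.loads(line)
        if rec.get("type") == "assay" and rec.get("gene") == gene:
            results.append(rec)
    return results

# Structured assay output and unstructured insight text coexist in one store.
ingest({"type": "assay", "gene": "BRCA1", "expression": 2.7})
ingest({"type": "note", "text": "Knockdown reduced expression in cell line 3"})
ingest({"type": "assay", "gene": "TP53", "expression": 1.1})

print(len(read_assay_results("BRCA1")))  # → 1
```

The point of the sketch is that adding a new instrument or record type requires no schema migration; only the readers that care about the new type need to change.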
A data lake in a biotechnology research organization has the potential to accelerate productivity by removing bottlenecks in data movement through the organization, but it has to be designed for efficiency like any IT system. A high-performance data storage and analysis environment is needed for both exploratory analytical experiments and production data analysis processes, and both can potentially use variants of the data lake concept built on the Hadoop ecosystem.
In the magazine publishing industry, and in consumer retail generally, the understanding of the value of efficiently managing and utilizing large amounts of data is in its infancy compared to the biological sciences described above. In most cases, marketers are the key users of data in consumer retail. In magazine publishing, traditionally, they would take sales data, use segmentation to group their customers, perhaps refine those groups with demographic data from other sources, and then cross-sell, up-sell, or flat out ignore bad payers, for example. Very little statistical modeling was involved in this relatively simple model.
In the last decade, there has been an explosion of data that can potentially characterize these customers far better than purchase history alone. The challenge is to develop an analytical methodology that can extract the signal from the noise. A bigger challenge is knowing what signal you are looking for in the first place, and how you know it is significant. In such an environment, the role of the data lake is less data management and more a playground that enables data scientists to work with large data sets, run exploratory analyses, develop new methods, and test them rapidly as new sources of data become available. In this context, the data lake may be more of a data repository that ingests data from all sources, internal and external, structured and unstructured, and provides enough resources for data scientists. Most consumer retail organizations are not as mature as Google, Amazon, etc., so the data lake may be a more simplistic solution to get started. The key for such companies is to start small and carefully define the specific use cases to implement first as they launch the data lake journey.
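The traditional rule-based segmentation described above, the baseline that a data lake and statistical modeling aim to improve on, can be sketched in a few lines. The field names and thresholds here are invented for illustration; a real marketer's rules would come from their own sales data.

```python
# Hypothetical sketch of traditional rule-based customer segmentation:
# group customers by purchase history, flag cross-sell and up-sell
# targets, and flat out ignore bad payers.
customers = [
    {"id": 1, "orders": 12, "avg_spend": 40.0, "late_payments": 0},
    {"id": 2, "orders": 2,  "avg_spend": 15.0, "late_payments": 5},
    {"id": 3, "orders": 8,  "avg_spend": 90.0, "late_payments": 1},
]

def segment(c: dict) -> str:
    if c["late_payments"] >= 3:
        return "ignore"      # bad payer: not worth marketing to
    if c["avg_spend"] >= 50.0:
        return "up-sell"     # already spends heavily: pitch premium titles
    if c["orders"] >= 5:
        return "cross-sell"  # frequent buyer: pitch related titles
    return "retain"

segments = {c["id"]: segment(c) for c in customers}
print(segments)  # → {1: 'cross-sell', 2: 'ignore', 3: 'up-sell'}
```

Note that no statistics are involved: every customer falls through a fixed cascade of hand-picked thresholds, which is exactly why richer data sources demand a more rigorous analytical methodology.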
The current marketing materials from Hadoop vendors will try to convince you that they have the data lake strategy for you, but in fact every IT organization, in conjunction with the business, needs to define what its data lake is going to be. Managing and analyzing large amounts of data is now a key requirement for business success, but each business needs to define what that means for them. Technology continues to evolve at a rapid pace, driven by the power of open source. As new use cases for data and analytics are explored, new components are being added to the Hadoop ecosystem as we speak, which creates a lot of confusion and conflicting opinions about the Hadoop environment. An experience an organization had with Hadoop two years ago may no longer be relevant, as the solution set has evolved. Picking a vendor that stays reasonably current with technology trends is an important consideration as you launch your data lake strategy.
CIOs and C-level executives are enamored with the term data lake, but unfortunately it is not something you can just buy off the shelf and check a box: "we now have a data lake"!