Data and information are the lifeblood of organisations and businesses, providing the vital input for operational, tactical and strategic decisions. Storage for this information has scaled from the simple spreadsheet, through traditional databases and data warehouses, to, most recently, data lakes.
A data lake is a storage repository that holds huge quantities of raw data in its native format. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. It does not impose any structure. The data can be incomplete and can contain incorrect or missing information. Data stored in a lake may not have a designated purpose at the time of storage; it is simply kept until it may eventually be needed.
A data lake, in this sense, can be seen as a relaxed, unimposing and unconstrained collection of an organisation's large and highly diverse data. Data lakes are optimized for scaling to terabytes and petabytes of data, which typically comes from multiple heterogeneous sources and may be structured, semi-structured or unstructured. This relaxation of structure and tolerance of diversity bring many benefits.
Features and benefits
A data lake retains all data. This includes data currently seen as useful as well as data that currently serves no purpose, in the hope that it might be required in the future. Data is also kept for all time, so that users can go back to any point in time for analysis. This has only become possible recently, as commodity, off-the-shelf servers and cheap storage have made scaling to terabytes and petabytes fairly economical.
Data lakes support all data types. This includes traditional as well as non-traditional data sources such as web server logs, sensor data, social network activity, text and images. In the data lake, all data is retained regardless of its source and structure. Data is stored in its original raw form, and any required transforms are done at the point of use. This approach is known as ‘schema-on-read’, whereas traditional databases and data warehouses use a ‘schema-on-write’ approach.
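A minimal sketch of schema-on-read, assuming raw web-server events kept exactly as they arrived, one JSON document per line. The field names and records here are illustrative; the point is that no schema is imposed at storage time, and each consumer applies its own at read time:

```python
import json

# Hypothetical raw events stored in the lake with no upfront schema.
raw_events = [
    '{"ts": "2021-03-01T10:00:00", "user": "alice", "action": "login"}',
    '{"ts": "2021-03-01T10:05:12", "user": "bob", "action": "view", "page": "/home"}',
    '{"ts": "2021-03-01T10:07:45", "user": "alice"}',  # incomplete record
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: keep only the requested fields,
    filling in None where a raw record is missing a value."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two consumers impose two different schemas on the same raw data.
logins = list(read_with_schema(raw_events, ["ts", "user", "action"]))
pages = list(read_with_schema(raw_events, ["user", "page"]))
```

Under schema-on-write, the third, incomplete record might have been rejected at load time; here it is retained as-is and each reader decides how to handle the gap.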
With schema-on-write, designers need to think of all possible uses of the data in advance and define a schema that has something for everyone but a perfect fit for no one. With schema-on-read, structure is not predetermined, allowing data to be retrieved in the schema most relevant to the task at hand. The absence of a fixed schema is also useful when large databases are being consolidated.
Finally, using a schema-on-read approach means data can simply be stored and used immediately, with no time or effort spent on structural design. This is important when dealing with structured data, but even more so when dealing with semi-structured, poly-structured and unstructured data, which make up the vast majority by volume.
Data lakes support all users. A typical organisation has around 80 percent ‘operational’ users, who are interested in reports, key performance indicators (KPIs) or slices of the same set of data every day. The next 10 percent do more analysis, often drilling down into internal data and sometimes into external data.
The last few percent requires deep analysis. They may create totally new data sources based on research. They mash up many different types of data and come up with new insights, understandings and models. These users include the data scientists and they may use advanced analytic tools and capabilities such as statistical analysis, machine learning and predictive modelling.
A data lake is able to support all of these users efficiently. Data scientists are able to work with the very large and varied data sets they need while other users make use of more structured views of the data provided for their use.
A data lake readily adapts to changes and new requirements. This is a direct result of the lack of structure and storing of data in its raw form. Users can explore data in varied ways and if any result is seen as useful a more formal schema can be applied and automation and reusability used to help extend the results to a broader audience. If the result is not useful, it can simply be discarded as no changes to the data structures have been made and no resources have been consumed.
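The explore-then-formalise workflow described above can be sketched as follows. This is a hypothetical illustration, assuming raw order records pulled straight from the lake; the ad-hoc step costs nothing if discarded, while the useful result is promoted into a reusable, schema-backed view:

```python
from dataclasses import dataclass

# Hypothetical raw records from the lake, stored as-is, dirt and all.
raw = [
    {"order_id": 1, "amount": "19.99", "region": "EU"},
    {"order_id": 2, "amount": "5.00", "region": "US"},
    {"order_id": 3, "amount": "oops", "region": "EU"},  # malformed record
]

# Step 1: free-form exploration -- no schema, nothing to undo if discarded.
eu_total = sum(float(r["amount"]) for r in raw
               if r["region"] == "EU" and r["amount"].replace(".", "").isdigit())

# Step 2: the result proved useful, so a formal schema is applied and the
# transform made reusable for a broader audience.
@dataclass
class Order:
    order_id: int
    amount: float
    region: str

def curated_orders(records):
    """Reusable, schema-backed view that drops malformed rows."""
    for r in records:
        try:
            yield Order(int(r["order_id"]), float(r["amount"]), str(r["region"]))
        except ValueError:
            continue  # dirty data stays in the lake; the view stays clean

orders = list(curated_orders(raw))
```

Note that the raw data is never modified: the curated view is a layer on top, which is what makes experimentation in a lake cheap.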
Together, these advantages mean that data lakes can provide faster insights: they contain all data and data types, they let users access data before it has been transformed, cleansed and structured, and they support many flexible arrangements, so users get to their results faster than with the traditional data warehouse approach.
Pitfalls and problems
However, the relaxed structure and flexibility come with added complications and challenges.
Data lakes can easily become data swamps. A swamp is a dirty lake, in which it is hard or impossible to locate the required data. Because of the large volumes involved, and because data cannot be identified from its structural characteristics, it is vital to ensure adequate metadata (data about what the data represents) is available for the data in the lake. This metadata makes it possible to search, index and understand what the data in a lake actually represents.
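A minimal sketch of such a metadata catalogue, assuming each data set is registered with descriptive tags when it is ingested. The paths, field names and tags here are illustrative, not a real product's API:

```python
# Hypothetical catalogue: one metadata entry per data set in the lake.
catalog = {
    "s3://lake/web-logs/2021/": {
        "owner": "web-team",
        "format": "json-lines",
        "tags": {"clickstream", "raw", "web"},
        "description": "Raw web server access logs",
    },
    "s3://lake/sensors/plant-a/": {
        "owner": "iot-team",
        "format": "csv",
        "tags": {"sensor", "raw", "iot"},
        "description": "Temperature sensor readings, plant A",
    },
}

def find_datasets(catalog, *required_tags):
    """Locate data sets by tag -- without this, the lake is a swamp."""
    return [path for path, meta in catalog.items()
            if set(required_tags) <= meta["tags"]]

hits = find_datasets(catalog, "raw", "web")
```

The essential point is that the catalogue is maintained at ingestion time; retrofitting metadata onto petabytes of unlabelled files is what turns a lake into a swamp.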
While many technologies address aspects of the problem, the primary challenge is ensuring that a data set can be seen for what it is, and that the process of finding data (through the metadata catalogue) is connected to the process of collecting information about that data.
A wider group of users means a much wider set of skills and competencies is required. While the data scientists in the organisation may be equipped to search, filter, join, shape and prepare the data as needed, it is very unlikely that the rest of the business users can competently extract data from the lake unaided. The solution is to create simpler views and common reports that are readily accessible.
Data sensitivity is also a major issue. This includes, for example, confidential and proprietary information from a business perspective, as well as personally identifiable information (PII) from a legal perspective, access to which should be restricted. This is, however, a grey area.
While management may want to give the data scientists full access, the legal perspective dictates that they should not have access to customers’ full credit card numbers. Such situations require case-by-case study and custom filtering and restrictions.
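A hedged sketch of one such custom restriction: a hypothetical masking step applied when data leaves the lake, so that analysts see enough of a card number to join records but never the full number. The record format is invented for illustration:

```python
import re

# Match a 16-digit card number: 12 leading digits, then the last 4.
CARD_RE = re.compile(r"\b(\d{12})(\d{4})\b")

def mask_card_numbers(text):
    """Replace all but the last four digits of a 16-digit card number."""
    return CARD_RE.sub(lambda m: "*" * 12 + m.group(2), text)

record = "customer=42 card=4111111111111111 amount=19.99"
masked = mask_card_numbers(record)
# masked: "customer=42 card=************1111 amount=19.99"
```

In practice such filtering would sit in the access layer between the lake and each class of user, with the unmasked raw data remaining in the lake under stricter controls.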
Notably, the concept of governance over data lakes does not diminish the free-spirited exploration of data. While it will require some effort and resources it greatly enhances the utility of the data to the largest group of users and lowers the risk of data misuse.
Finally, it should be understood that a data lake is not a product but an approach an organisation uses to collect (and catalogue) its information for use. Machine learning and big data analytics are at the heart of insight and knowledge discovery from the data lake. However, a data lake can become a useless data swamp if good governance policies are not applied and constantly enforced.
While the future seems to be in data lakes, realizing the benefits requires a great deal of good old-fashioned human effort and care. Organisations must tread surely but knowledgably and carefully to reap its full benefits and not end up in data puddles or data swamps.
(The views and opinions expressed in this article are those of G.K. Kulatilleke (BSc Eng.(Computer), MSc. (Networking), MSc. (Data Science), ACMA, CGMA) and do not necessarily reflect the official policy or position of any institution)