Following the emergence of BI and the Data Warehouse, then Big Data, the world of data has continued to evolve, pushing companies to adapt their technology choices in order to remain competitive.
First, the Data Warehouse was created to centralise structured data. Then came the Data Lake, offering distributed storage of raw data in different formats (text, video, audio, etc.). Finally came the Lakehouse architecture, which brings the Data Warehouse and the Data Lake together under a single architecture to cover all data use cases: Streaming, BI, Machine Learning.
Data warehouse:
😊 Data warehouses are designed to store structured data that has been prepared and transformed by an ETL process for BI analytics.
🙁 But data warehouses are not suited to unstructured or semi-structured data, or to data with high variety, velocity, and volume.
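As a rough illustration of the ETL step mentioned above, here is a minimal sketch using only Python's standard library. The record layout, the `sales` table, and the helpers `transform`, `run_etl`, and `revenue_by_country` are hypothetical names chosen for this example, and in-memory SQLite merely stands in for a real warehouse engine.

```python
import sqlite3

# Hypothetical raw records, as they might come out of an operational system.
RAW_ORDERS = [
    {"id": "1", "country": " fr ", "amount": "120.50"},
    {"id": "2", "country": "FR", "amount": "79.50"},
    {"id": "3", "country": "de", "amount": "not-a-number"},  # rejected by transform()
    {"id": "4", "country": "DE", "amount": "200"},
]


def transform(record):
    """Clean one raw record; return None if it cannot be made to fit the schema."""
    try:
        return (
            int(record["id"]),
            record["country"].strip().upper(),
            float(record["amount"]),
        )
    except (KeyError, ValueError):
        return None


def run_etl(raw_records):
    """Extract, transform, then load into a structured table (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "id INTEGER PRIMARY KEY, country TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rows = [row for row in map(transform, raw_records) if row is not None]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def revenue_by_country(conn):
    """A typical BI-style aggregate over the prepared table."""
    cursor = conn.execute(
        "SELECT country, SUM(amount) FROM sales GROUP BY country ORDER BY country"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    print(revenue_by_country(run_etl(RAW_ORDERS)))
```

The point of the sketch is that the table only ever holds rows that fit a fixed schema, which suits BI queries but leaves no room for the malformed record, let alone for images or free text.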
Data Lake:
😊 Data lakes are repositories for raw data in a variety of formats, especially unstructured data (text, images, video, audio, …) that does not yet have an immediate use case.
🙁 However, data lakes have also made architectures more complex. Because of their limited performance, data has to be copied into dedicated analytics databases, and covering data science use cases requires adding numerous tools.
😦 Data lakes do not support transactions and do not enforce data quality, and their lack of consistency makes it almost impossible to mix appends and reads, or batch and streaming jobs.
Lakehouse:
==> The need for flexible systems covering SQL analytics, real-time monitoring, data science, and machine learning has given rise to the lakehouse architecture, which combines the best elements of data lakes and data warehouses.
AI innovation is best achieved with the lakehouse:
The lakehouse is a new data management architecture that greatly simplifies the enterprise data infrastructure and accelerates ML and AI innovation.
In the past, most enterprise data was structured data coming from operational systems, whereas today many products incorporate AI in the form of several kinds of models.
=> Using a lakehouse rather than a data lake offers real advantages for AI needs, because a lakehouse provides the data versioning, governance, security, and ACID properties that are needed for all types of data.
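To make the ideas of data versioning and atomic commits more concrete, here is a deliberately simplified, in-memory sketch. The class `ToyVersionedTable` and its methods are hypothetical and written only for this illustration; real lakehouse table formats are far more involved, and this toy keeps a full copy of the table per version, which no production system would do.

```python
import copy


class ToyVersionedTable:
    """Toy append-only table: every successful commit creates a new version."""

    def __init__(self, schema):
        self.schema = dict(schema)   # column name -> expected Python type
        self._versions = [[]]        # version 0 is the empty table

    def _validate(self, row):
        """Enforce the schema, a stand-in for data quality checks."""
        if set(row) != set(self.schema):
            raise ValueError(
                f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
            )
        for column, expected_type in self.schema.items():
            if not isinstance(row[column], expected_type):
                raise ValueError(
                    f"column {column!r} expects {expected_type.__name__}"
                )

    def commit(self, rows):
        """Append rows atomically: all of them become visible, or none do."""
        rows = [dict(row) for row in rows]
        for row in rows:
            self._validate(row)      # any failure aborts before state changes
        self._versions.append(self._versions[-1] + rows)
        return len(self._versions) - 1   # the new version number

    def read(self, version=None):
        """Return a consistent snapshot; defaults to the latest version."""
        index = len(self._versions) - 1 if version is None else version
        return copy.deepcopy(self._versions[index])


if __name__ == "__main__":
    table = ToyVersionedTable({"user": str, "score": float})
    table.commit([{"user": "a", "score": 0.9}])
    try:
        # The second row violates the schema, so the whole batch is rejected.
        table.commit([{"user": "b", "score": 0.5}, {"user": "c", "score": "bad"}])
    except ValueError as error:
        print("commit rejected:", error)
    print(table.read())    # latest snapshot
    print(table.read(0))   # "time travel" to the initial, empty version
```

Readers always see a complete version, never a half-written batch, and an earlier version can be re-read later, for example to reproduce the exact data a model was trained on.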
Links:
https://databricks.com/blog/2021/08/30/frequently-asked-questions-about-the-data-lakehouse.html