Basic concept of Data Lake
The infographic on the left illustrates the basic concept of a Data Lake, which uses the ELT approach (extract, load, then transform) instead of the traditional ETL process (extract, transform, then load).
The ETL process applies to traditional data warehousing systems, where data must follow a structured (row and column) format. By leveraging HDFS (Hadoop Distributed File System), we can build a Data Lake that stores data in any format for later processing and analysis. Data can be loaded into the lake directly, without transformation, and transformed later on demand. The Data Lake concept offers several advantages:
- Huge volumes of data can be stored in a distributed manner.
- Data format is not a constraint in a Data Lake. Any format can be stored: structured, semi-structured, or unstructured.
- In a traditional data warehousing system, by contrast, semi-structured and unstructured data can be stored only after mandatory pre-processing steps convert it into a structured format. These steps are expensive and time-consuming, and they carry a high risk of data loss or corruption.
- Commodity hardware can be used to build a Data Lake. It is also fault tolerant, because HDFS replicates data blocks across multiple nodes.
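To make the load-first, transform-later idea concrete, here is a minimal Python sketch of the ELT flow described above. It rests on some loud assumptions: a temporary local directory stands in for HDFS, and the file names, the helper functions `load_raw` and `transform_on_demand`, and the sample records are all hypothetical and invented for illustration. It is not how any particular Data Lake product works, only a small demonstration that raw files of different formats can be stored untouched and parsed only when needed.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def load_raw(lake_dir: Path, name: str, payload: bytes) -> Path:
    """Load step: store the bytes exactly as received, with no transformation."""
    target = lake_dir / name
    target.write_bytes(payload)
    return target


def transform_on_demand(lake_dir: Path) -> list:
    """Transform step: parse the raw files into rows only when a consumer asks."""
    rows = []
    for path in sorted(lake_dir.iterdir()):
        text = path.read_text(encoding="utf-8")
        if path.suffix == ".csv":
            # Structured data: one dict per CSV row (values stay strings).
            rows.extend(dict(r) for r in csv.DictReader(io.StringIO(text)))
        elif path.suffix == ".jsonl":
            # Semi-structured data: one JSON object per non-empty line.
            rows.extend(json.loads(line) for line in text.splitlines() if line.strip())
        else:
            # Unstructured data: keep the raw text alongside its source name.
            rows.append({"source": path.name, "text": text})
    return rows


def demo() -> list:
    # A temporary local directory stands in for HDFS (assumption for this sketch).
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        load_raw(lake, "orders.csv", b"id,amount\n1,9.50\n")
        load_raw(lake, "events.jsonl", b'{"id": 2, "type": "click"}\n')
        load_raw(lake, "note.txt", b"free-form text")
        return transform_on_demand(lake)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Note that `load_raw` never inspects the payload, which mirrors the point that format is not a constraint at load time. All format-specific logic lives in `transform_on_demand`, so a new format only requires a new branch there, not a change to how data enters the lake.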