Data Ingestion

April 11, 2024

Transferring real-time data stream processed by Apache Flink to Kafka to Druid for analysis

Businesses can react quickly and effectively to user behavior patterns by using real-time analytics. This allows them to take advantage of opportunities that might otherwise pass them by and prevent problems from getting worse. Apache Kafka, a popular event streaming platform, can be used for real-time ingestion of data/events generated from various sources across multiple verticals such as IoT, financial transactions, inventory, etc. This data can then be streamed into multiple downstream applications or engines for further processing and eventual...

By Gautam Goswami
Categories: Apache Druid, Apache Kafka, Big Data, Data Ingestion, Data Science
Tags: apache druid, Apache Flink, Apache Kafka, Kafka and Druid, kafka flink and druid, multi-broker kafka cluster, real time analytics with apache druid, real time stream processing with apache flink, real time streaming with kafka, Real-time data analytics, real-time data analytics with Apache Flink, sending streaming data from flink to kafka

October 13, 2023

Understanding Apache Druid Supervisor and its specification for real-time data ingestion from Apache Kafka

Although Apache Druid and Apache Kafka are both powerful open-source data processing tools, they serve different purposes. Druid is a high-performance, column-oriented, real-time analytical database, while Kafka is a distributed platform for event streaming. They can work together in a typical data pipeline, where Kafka acts as the messaging system that ingests and stores data/events and Druid performs real-time analytics on that data. In short, indexing is the process of loading data into Druid...

By Gautam Goswami
Categories: Apache Druid, Apache Kafka, Big Data, Data Engineering, Data Ingestion
Tags: analyze the data in real-time, apache druid, Apache Kafka, Apache Kafka Indexing Service, Apache Kafka supervisor, distributed platform for event streaming, druid supervisors, How to accept data streams from Apache Kafka, integration between Apache Kafka and Druid for real-time data intake and analytics, open-source data processing tools, Real time data ingestion from Apache Kafka, real-time analytical database, Supervisor and its specification in Apache Druid, Supervisor of Apache Druid, The data ingestion lifecycle, Understanding Apache Druid Supervisor and its specification for real-time data ingestion from Apache Kafka, Understanding Supervisor in Druid

August 25, 2022

Importance of Schema Registry on Kafka Based Data Streaming Pipelines

Apache Kafka delivers messages to both real-time and batch consumers without performance degradation, and it is gaining enormous momentum as a leading component of data streaming pipelines. Credit card fraud detection, predictive maintenance, real-time analytics, and streaming IoT platforms are examples of real-time use cases. For handling massive amounts of data ingestion, Apache Kafka is the cornerstone of a robust IoT data platform. A schema defines the structure of the data...

By Gautam Goswami
Categories: Apache Kafka, Architecture, Data Engineering, Data Ingestion
Tags: Apache Kafka, assign schema info in the schema registry, Avro, building streaming IoT platform, centralized schema management, Confluent Schema Registry, Credit card fraud detection, Data ingestion, data pipeline, Data Pipelines, Data Streaming, deserialized the messages, distributed storage layer for schemas, Hadoop, JSON Schema, Kafka based data pipeline, Kafka based data streaming pipelines, Kafka connect, Kafka producers and consumers, multi-broker Kafka topic, Multi-node kafka cluster, predictive maintenance, producer-consumer contract, Protobuf schemas, real-time analytics, schema change history, schema evolution, schema of registered data streams, schema registry, Schema Registry on Kafka Based Data Streaming Pipelines, Schema Registry on Kafka Streaming Pipelines, service layer for metadata, streaming applications, streaming data to Kafka topic

August 18, 2022

Why Kappa Architecture for processing of streaming data. Have competence to superseding Lambda Architecture?

Data is quickly becoming the new currency of the digital economy, but it is useless if it can’t be processed. Processing data is essential for subsequent decision-making or actions, whether by people or by devices and applications. There are two primary ways of processing data: batch processing and stream processing. Batch processing has typically been adopted for very large data sets and for projects that need deeper data analysis, on the...

By Gautam Goswami
Categories: Architecture, Data Engineering, Data Ingestion
Tags: Amazon Kinesis, Apache Flink, Apache Hadoop, Apache Kafka, Apache Samza, Apache Storm, batch processing layer, Big Data, Data Lake, data warehouse, event based Kappa architecture, event streaming platform, Hadoop Data Lake, HDFS, Kafka, Kafka Architecture Development, Kappa Architecture, Kappa Architecture for streaming data, Lambda Architecture, Map-Reduce framework, messaging engine, multiple stream processors, stream processing application, streaming computation system, streaming data analytics

January 19, 2021

iDropper – The Data Ingestion, Monitoring and Reporting Tool

In today’s complicated business world, the data that organizations own and how they use it set them apart, helping them to innovate, compete better, and stay ahead. That drives organizations to collect and process as much data as possible, transform it into information through data-driven discoveries, and deliver it to the end user in the right format for smart decision-making. Common Challenges/Concerns: Fetching the raw data files from the various data...

By Kislay Komal
Categories: Data Analysis, Data Ingestion
Tags: Big Data Analytics, Creating Data Lake, Data Absorption Software, Data Absorption Tool, Data Ingestion and Reporting, Data Ingestion and Reporting Software, Data Ingestion Software, Data Ingestion Tool, IDropper, Migrating data from multiple data sources, The Data Ingestion and Reporting Software

September 17, 2017

Steering number of mapper (MapReduce) in sqoop for parallelism of data ingestion into Hadoop Distributed File System (HDFS)

To import data from most data sources, such as an RDBMS, Sqoop internally uses mappers. Before delegating the work to the mappers, Sqoop performs a few initial operations in sequence once we execute the command on a terminal on any node in the Hadoop cluster. Ideally, in a production environment, Sqoop is installed on a separate node, and the .bashrc file is updated to include Sqoop's binary and configuration, which makes it possible to execute the sqoop command from anywhere in the multi-node cluster. Most of the...

By Gautam Goswami
Categories: Data Engineering, Data Ingestion
Tags: Data ingestion, Hadoop Distributed File System, HDFS, Map Reduce, parallelism of data ingestion, Sqoop, sqoop for parallelism of data ingestion into Hadoop Distributed File System (HDFS)

August 29, 2017

Data Ingestion phase for migrating enterprise data into Hadoop Data Lake

Big Data solutions help extract valuable information for making accurate strategic business decisions. The exponential growth of digitalization, social media, telecommunication, etc. is fueling enormous data generation everywhere. Before processing a huge volume of data, we need an efficient, distributed data storage mechanism that can hold any form of data, from structured to unstructured. The Hadoop Distributed File System (HDFS) can be leveraged efficiently as a data lake by installing it on a multi-node cluster....

By Gautam Goswami
Categories: Data Engineering, Data Ingestion
Tags: Apache software foundation, Apache Sqoop, ATG database, ATG database schema, cloud service providers, collecting Twitter streaming data, Couchbase, Data ingestion, Data Ingestion phase for migrating enterprise data into Hadoop Data Lake, Data Lake, data storage mechanism, DB2, Digitization, distributed storage, efficient data storage mechanism, ELT, enterprise data, export data from Kafka topic to HDFS, fault-tolerant, Flume, Hadoop, HADOOP Cluster, Hadoop Data Lake, Hadoop distributed file systems, Hadoop multi node cluster, HDFS, Hive, huge data reservoirs, huge volume of data, Ingestion, JDBC connector, JDBC protocol, Kafka, Kafka HDFS connector, Kafka to HDFS, Mainframe, mainframe dataset to HDFS, MapReduce, MapReduce distributed computing, migrating enterprise data, moving large amount of streaming data into HDFS, multi node cluster, multiple delimited text files, MySQL, Netezza, NoSql DB, NoSql Stores, Oracle, Oracle 11g Enterprise Edition, Oracle ATG Platform, parallel import process, parallel processing, pluggable mechanism, PostgreSQL, read the messages from Kafka topic, SQLServer, Sqoop, Sqoop installation, Strom, Using Kafka HDFS connector

May 29, 2017

Apache Kafka, The next Generation Distributed Messaging System

In a Big Data project, the main challenge is collecting an enormous volume of data. We need distributed, high-throughput messaging systems to overcome it, and Apache Kafka is designed to address that challenge. It was originally developed at LinkedIn Corporation and later became part of the Apache project. A messaging system is typically responsible for transferring data from one application to another. A message is simply a bundle of data/information. To ingest a huge volume of data into Hadoop...

By Gautam Goswami
Categories: Data Ingestion
Tags: Apache Kafka, Apache project, Big Data project, collect an enormous volume of data, distributed high throughput messaging systems, distributed messaging systems, ETL, Extraction, Hadoop Distributed File System, HDFS, high throughput, Kafka supports multi-subscribers, LinkedIn Corporation, Messaging System, multi-subscribers, next Generation Distributed Messaging System, transferring data from one application to another, Transformation and Loading

January 18, 2017

Establishment of Data Lake specific to multi-channel e-commerce application to understand customer’s buying pattern

Post-order-fulfillment data is becoming a very important asset for e-commerce vendors who want to understand customers' complete buying patterns, especially vendors who sell multiple products ranging from electronics to apparel. Extraction and transformation are time-consuming operations when partially structured data moves from various sources and finally lands in a relational data warehouse. Data extracted from social media is semi-structured (JSON or XML). As an example, Facebook provides information in JSON format through the Graph API and same...

By Gautam Goswami
Categories: Apache Hadoop, Data Engineering, Data Ingestion, Hadoop Eco System, Processing Engine, Storage Mechanism

January 17, 2017

Ingesting Big Data into HDFS

We are always talking about Big Data processing using Hadoop, and we know the basic definition of Big Data: a huge volume of data that cannot be stored in an existing traditional database or data repository. The interesting question is how we can import such a huge volume of data into the cluster of computers where Hadoop is installed. Using Flume, we can continuously collect a stream of data. For example, Twitter data can be collected for analysis of comments. Sqoop...

By Gautam Goswami
Categories: Data Engineering, Data Ingestion
