Big Data

June 4, 2025

Driving Streaming Intelligence On-Premises: Real-Time ML with Apache Kafka and Flink

Companies that want to make real-time decisions from big data are under pressure to find a suitable architecture for that data as quickly as possible. With many companies, including SaaS users, choosing to deploy and run their own infrastructure, the combination of Apache Flink and Kafka offers low-latency data pipelines that are built for reliability. Because of the financial and technical constraints this brings, small and medium-sized enterprises often have...

By Gautam Goswami | Categories: Apache Kafka, Artificial Intelligence, Big Data, Data Engineering, Machine Learning

May 26, 2025

Dark Data Demystified: The Role of Apache Iceberg

Lurking in the shadows of every organization is a silent giant: dark data. Undiscovered log files, unread emails, silent sensor readings, and decades-old documents collecting digital dust are all examples of the vast amount of data that companies unwittingly bury. These are not worthless artifacts. They are potential treasure troves that have been locked away because of antiquated systems, a lack of funding, or plain negligence. Whether or not this data is structured, it...

By Gautam Goswami | Categories: Big Data, Data Engineering, Storage Mechanism

February 24, 2025

The Role of Materialized Views in Modern Data Stream Processing Architectures + RisingWave

Incremental computation in data streaming means updating results as fresh data comes in, without redoing all calculations from the beginning. This method is essential for handling ever-changing information, like real-time sensor readings, social media streams, or stock market figures. In a traditional, non-incremental computation model, we need to process the entire dataset every time we get a new piece of data. This can be inefficient and slow. In incremental computation, only the part of the result affected by new...

By Gautam Goswami | Categories: Big Data, Data Analysis, Data Engineering, Data Science, Processing Engine, Storage Mechanism | Tags: Big Data, Data Streaming, datainmotion, risingwave, streaming data
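
As a rough illustration of the incremental idea in the excerpt above, here is a minimal Python sketch of a per-key running average that updates only the state touched by each new event, next to a full recompute for comparison. This is not RisingWave code; the class name `IncrementalAverage`, the helper `full_recompute`, and the sample sensor readings are all hypothetical.

```python
from collections import defaultdict


class IncrementalAverage:
    """Hypothetical per-key running average, updated one event at a time."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def update(self, key, value):
        # Only the state for the affected key changes; other keys are untouched.
        self._count[key] += 1
        self._total[key] += value
        return self._total[key] / self._count[key]

    def result(self, key):
        # Assumes at least one event has been seen for this key.
        return self._total[key] / self._count[key]


def full_recompute(events, key):
    """Non-incremental baseline: rescans every event on each call."""
    values = [v for k, v in events if k == key]
    return sum(values) / len(values)


# Made-up sample events: (key, value) pairs.
events = [("sensor-a", 20.0), ("sensor-b", 5.0), ("sensor-a", 22.0)]

view = IncrementalAverage()
for key, value in events:
    view.update(key, value)

# Both approaches agree; the incremental one does constant work per event.
assert view.result("sensor-a") == full_recompute(events, "sensor-a")
```

The trade-off is that the incremental version has to keep state (a count and a total per key), which is roughly the role a materialized view plays in a streaming database.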

April 11, 2024

Transferring real-time data stream processed by Apache Flink to Kafka to Druid for analysis

Businesses can react quickly and effectively to user behavior patterns by using real-time analytics. This allows them to take advantage of opportunities that might otherwise pass them by and prevent problems from getting worse. Apache Kafka, a popular event streaming platform, can be used for real-time ingestion of data/events generated from various sources across multiple verticals such as IoT, financial transactions, inventory, etc. This data can then be streamed into multiple downstream applications or engines for further processing and eventual...

By Gautam Goswami | Categories: Apache Druid, Apache Kafka, Big Data, Data Ingestion, Data Science | Tags: apache druid, Apache Flink, Apache Kafka, Kafka and Druid, kafka flink and druid, multi-broker kafka cluster, real time analytics with apache druid, real time stream processing with apache flink, real time streaming with kafka, Real-time data analytics, real-time data analytics with Apache Flink, sending streaming data from flink to kafka

January 23, 2024

Integrating rate-limiting and backpressure strategies synergistically to handle and alleviate consumer lag in Apache Kafka

Apache Kafka is a robust distributed streaming platform. However, as with any system, latency must be monitored and controlled for optimal performance. Kafka consumer lag is the difference between the offset of the most recent message in a Kafka topic and the offset of the last message processed by a consumer. This lag arises when the consumer cannot keep pace with the rate at which new messages are produced and appended to the topic. Consumer lag in Kafka may...

By Gautam Goswami | Categories: Apache Kafka, Big Data, Data Engineering | Tags: Apache Kafka, Data Streaming, kafka consumer
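
The definition of consumer lag in the excerpt above comes down to simple arithmetic per partition: the log end offset minus the consumer's committed offset. The sketch below shows that arithmetic in plain Python without a Kafka client. The function name `consumer_lag` and the offset values are made up for illustration, and treating a partition with no committed offset as offset 0 is a simplifying assumption.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Return per-partition lag: log end offset minus committed offset.

    Both arguments map partition number -> offset. A partition with no
    committed offset is treated as offset 0 (a simplifying assumption).
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero in case the two snapshots were taken at different times.
        lag[partition] = max(end - committed, 0)
    return lag


# Made-up offsets for a topic with three partitions.
end_offsets = {0: 100, 1: 250, 2: 40}
committed_offsets = {0: 90, 1: 250}

per_partition = consumer_lag(end_offsets, committed_offsets)
total_lag = sum(per_partition.values())
```

Here partition 0 is ten messages behind, partition 1 is caught up, and partition 2 has never been committed, so its whole log counts as lag under the stated assumption.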

October 13, 2023

Understanding Apache Druid Supervisor and its specification for real-time data ingestion from Apache Kafka

Although both Apache Druid and Apache Kafka are powerful open-source data processing tools, they serve different purposes. Druid is a high-performance, column-oriented, real-time analytical database, while Kafka is a distributed platform for event streaming. However, they can work together in a typical data pipeline scenario where Kafka is used as a messaging system to ingest and store data/events, and Druid is used to perform real-time analytics on that data. In short, indexing is the process of loading data in Druid...

By Gautam Goswami | Categories: Apache Druid, Apache Kafka, Big Data, Data Engineering, Data Ingestion | Tags: analyze the data in real-time, apache druid, Apache Kafka, Apache Kafka Indexing Service, Apache Kafka supervisor, distributed platform for event streaming, druid supervisors, How to accept data streams from Apache Kafka, integration between Apache Kafka and Druid for real-time data intake and analytics, open-source data processing tools, Real time data ingestion from Apache Kafka, real-time analytical database, Supervisor and its specification in Apache Druid, Supervisor of Apache Druid, The data ingestion lifecycle, Understanding Apache Druid Supervisor and its specification for real-time data ingestion from Apache Kafka, Understanding Supervisor in Druid
