Data Engineering

Overcome LEADER_NOT_AVAILABLE error on Multi-node Apache Kafka Cluster

Kafka Connect plays a significant role in streaming data between Apache Kafka and other data systems. As a tool, it provides a scalable and reliable way to move data into and out of Apache Kafka. Importing data from a database into Apache Kafka is perhaps the best-known use case for the JDBC Connector (Source & Sink) that belongs to Kafka Connect. This short article aims to elaborate on the steps on how we can...

Read more...

Crafting a Multi-Node Multi-Broker Kafka Cluster- A Weekend Project

Over the past couple of years, the adoption of Apache Kafka has grown enormously. Kafka is a scalable pub/sub system and, in a nutshell, is designed as a distributed multi-subscriber system in which data is persisted to disk. As a highlight, Kafka delivers messages to both real-time and batch consumers at the same time without performance degradation. Current users of Kafka include Uber, Twitter, Netflix, LinkedIn, Yahoo, Cisco, Goldman Sachs, and so forth....

Read more...

Orchestrating Multi-Brokers Kafka Cluster through CLI Commands

This short article highlights the commands for managing a running multi-broker, multi-topic Kafka cluster using the built-in scripts. These commands are helpful when the cluster is not integrated with a third-party administrative tool that offers GUI facilities to administer or control it on the fly. Of course, most of those tools are not free to use. You can refer here to set up a multi-broker Kafka cluster. By executing the built-in scripts available inside the bin...

Read more...

Why Kappa Architecture for Processing Streaming Data? Is It Capable of Superseding Lambda Architecture?

Data is quickly becoming the new currency of the digital economy, but it is useless if it can’t be processed. Processing data is essential for subsequent decision-making or executable actions, whether by the human brain or by various devices and applications. There are two primary ways of processing data: batch processing and stream processing. Typically, batch processing has been adopted for very large data sets and projects where there is a necessity for deeper data analysis, on the...

Read more...

iDropper – The Data Ingestion, Monitoring and Reporting Tool

In today’s complicated world of business, the data that organizations own, and how they use it, sets them apart from others, helping them to innovate, to compete better, and to stay ahead in the business. That’s the driving factor for organizations to collect and process as much data as possible, transform it into information with data-driven discoveries, and deliver it to the end user in the right format for smart decision-making. Common Challenges/Concerns: Fetching the raw data files from the various data...

Read more...

Error while batch processing of data at rest persisted in a basic Hadoop-based (HDFS) Data Lake: “Permission denied: user=dr.who, access=READ_EXECUTE, inode="/tmp":hdadmin:supergroup:drwx……..”

Typically, persisting unstructured data and subsequently batch processing it can be very costly and is not advisable for small organizations and startups, as cost is a prime factor for them. A Hadoop-based Data Lake using Map-Reduce fits perfectly in this scenario, as it is not only cost-effective but also scalable and easy to extend further. Though it may sound like a great option to have, we might face issues while setting it up, and one of the common issues is the error "Permission...

Read more...

Resolved – “Incompatible clusterIDs in…” in Multi Node Hadoop Cluster Setup

Currently, many startups and small companies, along with their customers, are working on Data Analytics, ML, AI, and related solutions. Due to budget constraints, some of them don't want to leverage Cloud-based storage. Alternatively, to process ingested data, they create a basic Data Lake using HDFS. During this process, they might encounter the exception "org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Incompatible clusterIDs in /home/...." while starting the Name Node or Master Node in a multi-node Hadoop Cluster. This may occur in the following scenarios: ...

Read more...

Network Topology To Create Multi Node Hybrid Cluster For Hadoop Installation

The aim of this article is to provide an outline for creating a network topology for Hadoop installation in a multi-node hybrid cluster with limited available hardware resources. This cluster would be beneficial for learning Hadoop and for processing lower volumes of unstructured data using various engines, etc. Before the cluster setup: We installed Hadoop on a single-node cluster running on Ubuntu 14.04 on top of Windows 10 using VMware Workstation Player. Later, we copied the .vmx file into multiple...

Read more...

Data Governance & Security Mechanism in Distributed Data Storage System

We are aware that traditional data storage mechanisms are incapable of holding the massive volume of data generated at lightning speed for further utilization, even if we perform vertical scaling. We also anticipate that a single fuel, nothing but DATA, will accelerate rapid growth across all sectors, from business to natural resources to medicine. But the question is how to persist this massive volume of data for processing. The answer is, storing the data...

Read more...