Data Engineering

Orchestrating Multi-Brokers Kafka Cluster through CLI Commands

This short article highlights the commands for managing a running multi-broker, multi-topic Kafka cluster using the built-in scripts. These commands are helpful when the cluster is not integrated with any third-party administrative tool that offers a GUI for administering or controlling it on the fly. Of course, most such tools are not free to use. You can refer here to set up a multi-broker Kafka cluster. By executing the built-in scripts available inside the bin...

Read more...
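As a rough illustration of the kind of built-in script invocation the Kafka excerpt above refers to, the following sketch assembles `kafka-topics.sh` command lines without running them. It is not taken from the article: the `bin` directory location, the broker address `localhost:9092`, the topic name `demo-topic` and the helper `kafka_topics_cmd` are all hypothetical, and the use of `--bootstrap-server` assumes a Kafka release in which `kafka-topics.sh` accepts that option (older releases used `--zookeeper`).

```python
from typing import List, Optional

# Assumption: scripts live in a "bin" directory relative to the Kafka install.
KAFKA_BIN = "bin"

# Actions of kafka-topics.sh covered by this sketch.
SUPPORTED_ACTIONS = {"list", "describe", "create", "delete"}


def kafka_topics_cmd(
    action: str,
    bootstrap_server: str,
    topic: Optional[str] = None,
    partitions: Optional[int] = None,
    replication_factor: Optional[int] = None,
) -> List[str]:
    """Build (but do not execute) an argument list for kafka-topics.sh."""
    if action not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")

    cmd = [
        f"{KAFKA_BIN}/kafka-topics.sh",
        "--bootstrap-server", bootstrap_server,
        f"--{action}",
    ]
    if topic is not None:
        cmd += ["--topic", topic]
    # Partition and replication settings only apply when creating a topic.
    if action == "create":
        if partitions is not None:
            cmd += ["--partitions", str(partitions)]
        if replication_factor is not None:
            cmd += ["--replication-factor", str(replication_factor)]
    return cmd


if __name__ == "__main__":
    # Hypothetical broker address and topic name, for illustration only.
    examples = [
        kafka_topics_cmd("list", "localhost:9092"),
        kafka_topics_cmd("describe", "localhost:9092", topic="demo-topic"),
        kafka_topics_cmd("create", "localhost:9092", topic="demo-topic",
                         partitions=3, replication_factor=2),
    ]
    for cmd in examples:
        print(" ".join(cmd))
```

The argument lists can be passed to `subprocess.run` on a machine where Kafka is installed; building them separately keeps the sketch checkable without a running cluster.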

Why Kappa Architecture for processing streaming data? Can it supersede Lambda Architecture?

Data is quickly becoming the new currency of the digital economy, but it is useless if it can’t be processed. The processing of data is essential for subsequent decision-making or executable actions, whether by the human brain or by various devices, applications etc. There are two primary ways of processing data: batch processing and stream processing. Typically, batch processing has been adopted for very large data sets and projects where deeper data analysis is necessary, on the...

Read more...

iDropper – The Data Ingestion, Monitoring and Reporting Tool

In today’s complicated world of business, the data that organizations own, and how they use it, sets them apart from others, helping them innovate, compete better and stay ahead in the business. That is the driving factor for organizations to collect and process as much data as possible, transform it into information through data-driven discoveries, and deliver it to the end user in the right format for smart decision-making. Common Challenges/Concerns: Fetching the raw data files from the various data...

Read more...

Error while batch processing of rest data persisted in Basic Hadoop based (HDFS) Data Lake: “Permission denied: user=dr.who, access=READ_EXECUTE, inode="/tmp":hdadmin:supergroup:drwx……..”

Typically, persisting unstructured data and then batch processing it can be very costly, which makes it inadvisable for small organizations and startups, for whom cost is the prime factor. A Hadoop-based Data Lake using MapReduce fits this scenario well, as it is not only cost-effective but also scalable and easy to extend. Though it may sound like a great option, we might face issues while setting it up, and one of the common issues is the error "Permission...

Read more...

Resolved – "Incompatible clusterIDs in…" in Multi Node Hadoop Cluster Setup

Currently, many startups and small companies, along with their customers, are working on Data Analytics, ML, AI and related solutions. Due to budget constraints, some of them don't want to use Cloud-based storage. Instead, to process ingested data, they create a basic Data Lake using HDFS. During this process, they might encounter the exception "org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Incompatible clusterIDs in /home/...." while starting the Name Node or Master Node in a multi-node Hadoop Cluster. This may occur in the following scenarios: ...

Read more...
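For the "Incompatible clusterIDs" entry above, a small diagnostic sketch may help make the error concrete. HDFS storage directories keep a properties-style `current/VERSION` file containing a `clusterID=` line, and the exception is raised when the IDs recorded on different nodes disagree. The code below only reads and compares those values; it does not reproduce the article's resolution, which the excerpt does not show. The function names and the example paths are hypothetical and depend on how `dfs.namenode.name.dir` and `dfs.datanode.data.dir` are configured.

```python
from pathlib import Path
from typing import Optional


def read_cluster_id(version_file: str) -> Optional[str]:
    """Return the clusterID recorded in an HDFS VERSION file, or None if absent."""
    for line in Path(version_file).read_text().splitlines():
        # VERSION is a Java-properties style file: skip comments and blank lines.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "clusterID":
            return value.strip()
    return None


def cluster_ids_match(namenode_version: str, datanode_version: str) -> bool:
    """True only when both files record the same, non-empty clusterID."""
    nn_id = read_cluster_id(namenode_version)
    dn_id = read_cluster_id(datanode_version)
    return nn_id is not None and nn_id == dn_id


if __name__ == "__main__":
    import sys

    # Usage (paths are site-specific, shown here only as placeholders):
    #   python check_cluster_id.py <namenode-dir>/current/VERSION <datanode-dir>/current/VERSION
    if len(sys.argv) == 3:
        nn_file, dn_file = sys.argv[1], sys.argv[2]
        print("NameNode clusterID:", read_cluster_id(nn_file))
        print("DataNode clusterID:", read_cluster_id(dn_file))
        print("match:", cluster_ids_match(nn_file, dn_file))
```

Running it against copies of the two VERSION files shows at a glance whether a mismatch is the cause before any storage directory is changed.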

Network Topology To Create Multi Node Hybrid Cluster For Hadoop Installation

The aim of this article is to provide an outline for creating the network topology for a Hadoop installation in a multi-node hybrid cluster with limited hardware resources. This cluster would be beneficial for learning Hadoop and for processing lower volumes of unstructured data using various engines etc. Before the cluster setup: We installed Hadoop on a single-node cluster running on Ubuntu 14.04 on top of Windows 10 using VMware Workstation Player. Later we copied the .vmx file into multiple...

Read more...

Data Governance & Security Mechanism in Distributed Data Storage System

We are aware that traditional data storage mechanisms cannot hold the massive volume of data generated at lightning speed for further use, even if we perform vertical scaling, and we anticipate that DATA alone will be the fuel accelerating rapid growth across all sectors, from business to natural resources to medicine. But the question is how to persist this massive volume of data for processing. The answer is, storing the data...

Read more...

Processing and Analysis of Big Telecom Data to minimize crime, combat terrorism, unsocial activities etc.

Telecom providers have a treasure trove of captive data - customer data, CDRs (call detail records), call center interactions, tower logs etc. - and are metaphorically “sitting on a gold mine”. Ideally, each category of the generated data has the following information.
⦁ Customer data consolidates customer ID, plan details, demographics, subscribed services and spending patterns
⦁ The service data category consolidates types of customer, customer history, complaint category, queries resolved etc.
⦁ Usually for the smart mobile phone subscriber,...

Read more...

Deleting Solr log files/folder from Standby NameNode could be a disaster when Primary NameNode is active in the HDP (Hortonworks Data Platform) Hadoop Cluster

Most of us know that Apache Ambari is used for managing, provisioning and monitoring the different components of a Hortonworks Hadoop cluster. We also know that Apache Ranger can be used as a centralized security administration solution for Hadoop that enables administrators to create and enforce security policies for HDFS and other Hadoop platform components. When the Ranger HDFS plugin is enabled, it writes the client interaction activity to Solr if it is configured. The default location of these Solr log files...

Read more...