Author - Gautam Goswami

Why Kappa Architecture for processing of streaming data. Have competence to superseding Lambda Architecture?

Data is quickly becoming the new currency of the digital economy, but it is useless if it can’t be processed. The processing of data is essential for subsequent decision-making or executable actions either by the human brain or various devices/applications etc. There are two primary ways of processing data namely batch processing and stream processing. Typically batch processing has been adopted for very large data sets and projects where there is a necessity for deeper data analysis, on the...

Read more...

Resolved – ” Incompatible clusterIds in… ” in Multi Node Hadoop Cluster Setup

Currently, there are many startups / small companies and their customers, working on Data Analytics, ML, AI and related solutions. Due to their budget constraints, some of them don't want to leverage Cloud-based storage.  Alternatively, to process ingested data, they create basic Data Lake using HDFS. During this process, they might encounter the exception of "org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException:  Incompatible  clusterIDs in /home/....". while starting the Name Node or Master Node in a  multi-node Hadoop Cluster. This may occur in the following scenarios: ...

Read more...

Network Topology To Create Multi Node Hybrid Cluster For Hadoop Installation

The aim of this article is to provide an outline for creating network topology for Hadoop installation in multi node hybrid cluster with limited available hardware resources.  This cluster would be beneficial for learning Hadoop, with lower volume of unstructured data processing using various engines etc. Before the cluster setup: We installed Hadoop on a single node cluster running on Ubuntu 14.04 on top of Windows 10 using VMware workstation player. Later we have copied the .vmx file into multiple...

Read more...

Data Governance & Security Mechanism in Distributed Data Storage System

We are aware that the traditional data storage mechanism is incapable to hold the massive volume of  data generated with lightning speed for further utilization even if we perform vertical scaling,  and we have anticipated only one fuel, nothing but DATA to accelerate the movement across all the sectors starting from business to natural resources including medical towards rapid growth. But the question is how to persist this massive volume of data for processing? The answer is, storing the data...

Read more...

Apache Flink – A 4G Data Processing Engine

Analyzing streaming data in large-scale systems is becoming a focal point day by day to take accurate business decisions due to mushrooming of digital data generation sources around the globe including social media. Real-Time analytics are becoming more attractive due to possibilities of getting insights from the time-value of data (in other words, when data is in motion). Apache Flink, an open source highly innovative stream processor engine has been grounded which helps to take advantage of stream-based approaches. Besides...

Read more...

Steering number of mapper (MapReduce) in sqoop for parallelism of data ingestion into Hadoop Distributed File System (HDFS)

To import data from most the data source like RDBMS, sqoop internally use mapper. Before delegating the responsibility to the mapper, sqoop performs few initial operations in a sequence once we execute the command on a terminal in any node in the Hadoop cluster. Ideally, in production environment, sqoop installed in the separate node and updated .bashrc file to append sqoop's binary and configuration which helps to execute sqoop command from anywhere in the multi-node cluster. Most of the...

Read more...

Transfer structured data from Oracle to Hadoop storage system

Using Apache's sqoop, we can transfer structured data from Relational Database Management System to Hadoop distributed file system (HDFS). Because of distributed storage mechanism in Hadoop Distributed File System (HDFS), we can store any format of data in huge volume in terms of capacity. In RDBMS, data persists in the row and column format (Known as Structured Data). In order to process the huge volume of enterprise data, we can leverage HDFS as a basic data lake. In this...

Read more...

Data Ingestion phase for migrating enterprise data into Hadoop Data Lake

The Big Data solutions helps to achieve valuable information to iron out the accurate strategic business decision. Exponential growth of digitalization, social media, telecommunication etc. are fueling enormous data generation everywhere. Prior to process of huge volume of data, we should have efficient data storage mechanism in a distributed manner to hold any form of data starting from structured to unstructured. Hadoop distributed file systems (HDFS) can be leveraged efficiently as data lake by installing on multi node cluster....

Read more...

Why Lambda Architecture in Big Data Processing

Due to the exponential growth of digitization, the entire globe is creating minimum 2.5 Quintilian 2500000000000 Million) bytes of data every day and that we can denote as Big Data. Data generation is happening from everywhere starting from social media sites, various sensors, satellite, purchase transaction, Mobile, GPS signals and much more. With the advancement of technology, there is no sign of slowing down of data generation, instead it will grow in massive volume. All the major organizations, retailers,...

Read more...