Author - Gautam Goswami

Data Ingestion phase for migrating enterprise data into Hadoop Data Lake

Big Data solutions help extract valuable information to support accurate strategic business decisions. The exponential growth of digitalization, social media, telecommunication, etc. is fueling enormous data generation everywhere. Before processing such huge volumes of data, we need an efficient, distributed data storage mechanism that can hold any form of data, from structured to unstructured. The Hadoop Distributed File System (HDFS), installed on a multi-node cluster, can be leveraged efficiently as a data lake...

Read more...

Why Lambda Architecture in Big Data Processing

Due to the exponential growth of digitization, the entire globe is creating a minimum of 2.5 quintillion (2,500,000,000,000 million) bytes of data every day, and that is what we denote as Big Data. Data generation happens everywhere: social media sites, various sensors, satellites, purchase transactions, mobile devices, GPS signals, and much more. With the advancement of technology, there is no sign of data generation slowing down; instead, it will grow in massive volume. All the major organizations, retailers,...

Read more...

Apache Kafka, the Next-Generation Distributed Messaging System

In a Big Data project, the main challenge is collecting an enormous volume of data, and we need distributed, high-throughput messaging systems to overcome it. Apache Kafka is designed to address this challenge. It was originally developed at LinkedIn Corporation and later became part of the Apache project. A messaging system is typically responsible for transferring data from one application to another; a message is nothing but a bunch of data/information. To ingest huge volumes of data into Hadoop...
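As a rough illustration of the publish-subscribe idea behind Kafka (not Kafka's actual API — a real producer would use a client library against a running broker), here is a toy in-memory sketch: producers append messages to a named topic log, and each consumer reads from its own offset, so the same data can be consumed independently.

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory broker illustrating the publish-subscribe model:
    each topic is an append-only log, and consumers track their own
    read offsets rather than removing messages from a shared queue."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only log

    def send(self, topic, message):
        """Producer side: append a message to the topic's log."""
        self.topics[topic].append(message)

    def poll(self, topic, offset):
        """Consumer side: return messages from `offset` onward,
        plus the next offset to resume from."""
        log = self.topics[topic]
        return log[offset:], len(log)

broker = MiniBroker()
broker.send("page-views", "user1:/home")
broker.send("page-views", "user2:/cart")

# A consumer starting at offset 0 sees the whole log.
messages, next_offset = broker.poll("page-views", 0)
print(messages)      # ['user1:/home', 'user2:/cart']
print(next_offset)   # 2
```

Because consuming never deletes messages, a second consumer (say, a Hadoop ingestion job) can poll the same topic from offset 0 and receive the identical stream.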

Read more...

Fog Computing

Fog computing is also referred to as edge computing. Cisco Systems introduced the term "Fog Computing", and it is not a replacement for cloud computing. Ideally, cloud computing refers to storing and accessing data and programs over the Internet instead of a local computer's hard drive or storage; the cloud is simply a metaphor for the Internet. In fog computing, data, processing, and applications are concentrated in devices at the network edge. Here, devices communicate peer-to-peer so that data storage and sharing...

Read more...

Basic concept of Data Lake

The infographic on the left represents the basic concept of a Data Lake, where we can use the approach of ELT (extraction, loading, and then transformation) as opposed to the traditional ETL process (extraction, transformation, and then loading). The ETL process applies to traditional data warehousing systems, where data follows a structured format (rows and columns). By leveraging HDFS (Hadoop Distributed File System), we can develop a data lake to store data in any format for processing and analysis. Data can be loaded directly into the lake...
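The ELT idea can be sketched in a few lines. In this simplified illustration (a plain list stands in for what would really be an HDFS directory), records of varying shape are loaded raw and untouched, and parsing/normalization is deferred until analysis time:

```python
import json

# ELT sketch: the "raw zone" accepts records as-is, with no schema
# enforced at load time (in a real lake this would be files in HDFS).
raw_zone = []

def load(record_line):
    """Extract + Load: store the raw line exactly as received."""
    raw_zone.append(record_line)

def transform():
    """Transform on demand: parse and normalize only when analysing,
    tolerating fields that are missing or typed inconsistently."""
    rows = []
    for line in raw_zone:
        rec = json.loads(line)
        rows.append({"user": rec.get("user", "unknown"),
                     "amount": float(rec.get("amount", 0))})
    return rows

load('{"user": "alice", "amount": "42.5"}')
load('{"amount": 7}')          # different shape; still accepted at load time
print(transform())
# [{'user': 'alice', 'amount': 42.5}, {'user': 'unknown', 'amount': 7.0}]
```

The contrast with ETL is that here nothing is rejected or reshaped on the way in; the schema is applied on read.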

Read more...

Real-time data analytics helps mobile service providers achieve competitive advantages

Smartphone usage has become an integral part of our daily routine. Setting aside phone calls and SMS, we are always engaged in many other activities, from entertainment to domestic shopping, social engagement, etc., by installing various types of mobile applications. Of course, mobile internet is mandatory to carry out the above. Mobile service providers are facing new and difficult challenges: due to the exponential growth of customers' expectations, they need to serve them with advanced mobile technology and handle...

Read more...

How Google News is able to group similar news together

Google News uses clustering, a machine learning technique, to group similar news articles together. Interestingly, they don't employ thousands of news editors; instead, they use clustering techniques to form groups of similar data based on common characteristics. Mahout is machine learning software from the Apache community that applications leverage to analyse large data sets. Before Mahout, it was too complex to analyse large data sets. Mahout extensively utilizes Apache Hadoop to...
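Mahout implements scalable clustering algorithms such as k-means on top of Hadoop; as a toy illustration of the underlying grouping idea only (not Mahout's actual algorithm or scale), headlines can be grouped by word overlap:

```python
def jaccard(a, b):
    """Similarity of two headlines: shared words / total distinct words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def cluster(headlines, threshold=0.3):
    """Greedy single-pass clustering: each headline joins the first
    existing group whose representative is similar enough, otherwise
    it starts a new group of its own."""
    groups = []
    for h in headlines:
        for g in groups:
            if jaccard(h, g[0]) >= threshold:
                g.append(h)
                break
        else:
            groups.append([h])
    return groups

news = [
    "stocks rally as markets surge",
    "global markets surge on stocks rally",
    "new smartphone launched today",
]
print(cluster(news))
# [['stocks rally as markets surge', 'global markets surge on stocks rally'],
#  ['new smartphone launched today']]
```

The two market headlines share most of their words and land in one group, while the unrelated headline forms its own; production systems replace word overlap with richer feature vectors, but the grouping principle is the same.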

Read more...

Essentials of Data Wrangling

To roll out a new software product commercially in the market, irrespective of domain, a 360-degree quality check with test data is mandatory. We can correlate this with the concept of a new vehicle: after manufacturing is complete, fuel has to be injected into the engine to make it operational. Once the vehicle starts moving, all the quality checks and tests begin, like brake performance, mileage, comfort, etc., along with thousands of other factors which are decided/concluded during...

Read more...

Semi-Structured Data

Semi-structured data lies between structured and unstructured data. Data stored in a traditional database system or an Excel sheet, organized in COLUMNS and ROWS, can be denoted as structured data. Unstructured data is any data or piece of information that can't be stored in databases/RDBMS; email, Facebook comments, newspapers, etc. are examples of unstructured data. Semi-structured data does not follow a strict data model structure and is neither raw data nor typed data in...
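JSON is a common example of semi-structured data, and a short sketch (with made-up sample records) shows the key property: each record is self-describing through its tags, yet the set of fields varies from record to record, so consumers cannot rely on a fixed row-and-column schema.

```python
import json

# Two records from the same hypothetical feed: both are valid JSON
# (field names give some structure), but the fields differ per record,
# which is the hallmark of semi-structured data.
records = [
    '{"id": 1, "name": "Alice", "email": "alice@example.com"}',
    '{"id": 2, "name": "Bob", "phones": ["555-0100", "555-0101"]}',
]

for line in records:
    rec = json.loads(line)
    # Consumers must tolerate missing or extra fields rather than
    # assuming every record has the same columns.
    contact = rec.get("email") or rec.get("phones", ["n/a"])[0]
    print(rec["name"], contact)
# Alice alice@example.com
# Bob 555-0100
```

A relational table would force both records into one schema (with NULLs or a redesign); the semi-structured form simply carries whichever fields each record has.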

Read more...

Why the omni-channel approach is becoming a focal point for retailers

In short, we can define retailing as selling goods or services through different types of channels. E-commerce falls under internet/electronic commerce, which is one of the channels in a multi-channel approach. Business-to-consumer (B2C) and business-to-business (B2B) transactions are part of electronic retailing if carried out over the internet. Due to the advancement of technology, a precise and appropriate customer engagement strategy is very important for a retailer's business growth if e-commerce is their prime channel. Even though they have adopted multi...

Read more...