Data Engineering

How Google News is able to group similar news together

Google News uses machine learning clustering techniques to group similar news stories or articles together. Interestingly, it does not rely on thousands of news editors; instead, it uses clustering techniques to form groups of similar data based on common characteristics. Mahout is machine learning software from the Apache community that applications leverage to analyse large sets of data. Before Mahout, analysing large data sets was considerably more complex. Mahout makes extensive use of Apache Hadoop to...

Read more...
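
The grouping idea in the teaser above can be illustrated with a toy sketch. This is not Google News's or Mahout's actual algorithm; it is a minimal greedy clustering that assumes headlines are "similar" when their word sets overlap enough (Jaccard similarity). The function names, the sample headlines, and the 0.3 threshold are all hypothetical choices made for illustration.

```python
import re

def tokens(text):
    """Lower-case a headline and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster_headlines(headlines, threshold=0.3):
    """Greedy single-pass clustering: put each headline in the first
    cluster whose seed headline is similar enough, else start a new one."""
    clusters = []  # each cluster is a list of headlines; item 0 is the seed
    for headline in headlines:
        words = tokens(headline)
        for cluster in clusters:
            if jaccard(words, tokens(cluster[0])) >= threshold:
                cluster.append(headline)
                break
        else:
            # No existing cluster matched, so this headline seeds a new one.
            clusters.append([headline])
    return clusters

# Hypothetical sample data, not taken from any real news feed.
SAMPLE_HEADLINES = [
    "Central bank raises interest rates again",
    "Interest rates raised by central bank",
    "Local team wins football championship",
    "Football championship won by local team",
]

GROUPS = cluster_headlines(SAMPLE_HEADLINES)
for group in GROUPS:
    print(group)
```

A production system would typically use weighted term vectors and a distributed clustering job rather than raw word overlap; the sketch only shows how shared characteristics can drive grouping without human editors.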

Essentiality of Data Wrangling

To roll out a new software product commercially, in any market domain, a 360-degree quality check with test data is mandatory. We can compare this to a new vehicle. Once manufacturing is complete, fuel has to be injected into the engine to make it operational. Once the vehicle starts moving, all the quality checks and tests begin, such as brake performance, mileage, and comfort, along with thousands of other factors which are decided/concluded during...

Read more...

Semi-Structured Data

Semi-structured data lies between structured and unstructured data. Data stored in a traditional database system or an Excel sheet is structured data, organized in columns and rows. Unstructured data is any data or piece of information that can't be stored in databases/RDBMS etc. Emails, Facebook comments, newspapers etc. are examples of unstructured data. Semi-structured data does not follow a strict data model structure and is neither raw data nor typed data in...

Read more...
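
As a small illustration of the distinction drawn in the teaser above, the sketch below parses a few JSON records. JSON is a common example of semi-structured data, although the excerpt itself does not name a format, so that choice is an assumption. The records and field names are hypothetical.

```python
import json

# Hypothetical records: each one is self-describing, but the fields vary.
RAW_RECORDS = """
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["555-0100", "555-0101"]},
  {"id": 3, "name": "Meera", "address": {"city": "Pune", "country": "India"}}
]
"""

def field_usage(records):
    """Count how many records carry each top-level field."""
    usage = {}
    for record in records:
        for field in record:
            usage[field] = usage.get(field, 0) + 1
    return usage

RECORDS = json.loads(RAW_RECORDS)
USAGE = field_usage(RECORDS)

# Fields present in every record would map cleanly to table columns;
# the rest are optional and differ from record to record.
COMMON_FIELDS = sorted(f for f, n in USAGE.items() if n == len(RECORDS))
OPTIONAL_FIELDS = sorted(f for f, n in USAGE.items() if n < len(RECORDS))

print("common:", COMMON_FIELDS)
print("optional:", OPTIONAL_FIELDS)
```

The records share some fields but not all, and one of them nests a further object, which is why such data does not fit a fixed set of columns and rows without extra modelling.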

Big Data Generation and its sources

Here are the sources that show how Big Data is being generated today, and from which we can anticipate how the entire globe will sink under oceans of data in the near future due to the exponential growth of digitization.
Media: Media and communications outlets (articles, audio, video, emails, blogs, podcasts).
Machine: Data generated by computers and machines, generally without human intervention (server logs, phone calls, sensors, business process logs etc.).
Social: Digital material created on social media such as Facebook, LinkedIn etc. (texts, photos, videos, tweets etc.).
Historical: Data about our environment (weather, traffic,...

Read more...

Data in Motion

Information security strategy is important and critical in today's business world. Information is nothing but data in various forms, such as flat files, audio and video files, multimedia etc. Data always exists in one of three states. Data at Rest: This is the state in which information or data is stored in databases (RDBMS), hard disks, pen drives, SD cards in smartphones etc. Organizations are bound to have additional layers of defense to protect sensitive data from intruders in the event that the...

Read more...

Importance of unstructured data

In today's world, the Internet plays a major role in generating and propagating information from various sources. Social media, email, WhatsApp, e-newspapers etc. play a crucial role in the creation and subsequent circulation of information. This type of information often includes text and multimedia content. Such information or data can't be methodically persisted in a database and is referred to as unstructured data. Due to advances in technology, 70-80% of growing data is unstructured, and it is increasing significantly faster than structured/semi-structured data....

Read more...

Hadoop Development Environment

We always talk about Big Data processing using the Hadoop framework. By leveraging distributed cluster computing, it is nowadays possible to process and analyse huge volumes of data, probably in exabytes or more. Giant cloud providers like Amazon and Microsoft Azure provide hosting as well as development environments with multi-node clusters, sized to hardware requirements, on a pay-per-use model. They usually charge on an hourly basis. But the main concern is, to open an account a...

Read more...

Technology Platform behind Aadhaar card implementation

Most of us are familiar with the Aadhaar card, which was rolled out as a first initiative in 2003. It's a 12-digit unique identification number issued by the Indian government to every individual resident of India, and it can be used to access a variety of services and benefits. Hadoop is an open-source big data processing framework that has been customized extensively by a company named MapR to boost performance. The Aadhaar card project has been developed using MapR's customized Hadoop and...

Read more...

Challenges in data analysis due to Demonetization

We are all aware of demonetization and its ongoing impact across the entire country. The government is trying hard to digitize the entire financial system. In a nutshell, the aim is first to reduce cash transactions before moving to completely cashless transactions. The use of different types of digital wallets, mobile banking, and debit and credit cards has started booming among the people of the country, and citizens have slowly started adopting them, mainly to avoid long queues at ATMs and banks and due to their advantages...

Read more...

Introduction to the Facebook Graph Application Programming Interface (API)

An API is a collection of protocols, routines, and tools for developing software applications. Facebook represents the relationships among entities as a "Social Graph". In Facebook, an entity can be a person, place, event or object that is relevant to a given system, and an attribute is a property or characteristic of the entity. As shown in the diagram, Location is an entity and country, state, location id are the attributes. Based on the registered user...

Read more...