Introduction to Hadoop


Hadoop was created by Doug Cutting and Mike Cafarella. Cutting, the creator of Apache Lucene, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project. Hadoop provides a reliable, shared storage and analysis system.

Hadoop


Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. 

This fault tolerance comes from Hadoop's ability to detect and handle failures at the application layer rather than relying on hardware.



The Apache Hadoop framework is composed of the following modules:



  1. Hadoop Common: It contains the libraries and utilities needed by other Hadoop modules.

  2. Hadoop Distributed File System (HDFS): A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

  3. Hadoop YARN: A resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.

  4. Hadoop MapReduce: A programming model for large-scale data processing (a small WordCount sketch follows this list).
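
To make the MapReduce programming model concrete, here is a minimal sketch of the classic WordCount job, written against the org.apache.hadoop.mapreduce API of Hadoop 2.x. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative, and the input and output paths are assumed to be passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job would typically be packaged into a JAR and submitted with something like hadoop jar wordcount.jar WordCount /input /output. The framework then splits the input, schedules map and reduce tasks across the cluster, and re-runs tasks that fail.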


Small Hadoop cluster:

  It includes a single master and multiple worker nodes.

 Master Node: The master node consists of a JobTracker, TaskTracker, NameNode and DataNode.

 Slave or Worker Node: It acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes (a small HDFS client sketch follows below).
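
To show how a client interacts with the NameNode and DataNodes, here is a small, illustrative sketch that writes and reads a file through Hadoop's Java FileSystem API. The hostname master and port 9000 are assumptions for a small cluster; replace them with the address of your NameNode (on older Hadoop 1.x releases the property is fs.default.name rather than fs.defaultFS).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Point the client at the NameNode running on the master node.
    // "master:9000" is an assumed address for a small cluster.
    conf.set("fs.defaultFS", "hdfs://master:9000");

    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS splits files into blocks and stores them on DataNodes.
    Path file = new Path("/user/demo/hello.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("Hello, HDFS!");
    }

    // Read it back; the NameNode tells the client which DataNodes hold each block.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}

The design point this illustrates is that the client asks the NameNode only for metadata (the file-system namespace and block locations) and then streams block data to and from the DataNodes.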

 What is Nutch?

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.

History of Hadoop:

  • Nutch was started in 2002, and a working crawler and search system quickly emerged. However, its developers realized that their architecture would not scale to the billions of pages on the web.

  • In 2003, Google published a paper describing the architecture of its distributed file system, called GFS. It would solve the storage needs for very large files, and it would free up time being spent on administrative tasks such as managing storage nodes.

  • In 2004, Google published a paper that introduced MapReduce to the world.

  • In 2005, the Nutch developers had a working MapReduce implementation in Nutch. Around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to run Hadoop.

  • In January 2008, Hadoop was made its own top-level project at Apache. By this time, Hadoop was being used by many companies, such as Facebook and The New York Times.

  • In April 2008, Hadoop became the fastest system to sort a terabyte of data, running on a 910-node cluster; it sorted one terabyte in 209 seconds. In November of the same year, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds. In 2009, a Yahoo! team announced that Hadoop had sorted one terabyte in 62 seconds.
