Welcome to Hadoop Users Group Chennai.

An open source community formed by three professionals (Senthil Kumar, Binish, Prasad) in Sep 2012 to discuss and share knowledge of big data technologies such as Apache Hadoop, Apache Spark, NoSQL stores (Cassandra, MongoDB, HBase), stream processing frameworks such as Storm and Spark Streaming, and other related big data technologies in Chennai.

Here the group discusses big data use cases and learnings. Any big data question can be answered by the big data specialists who belong to this group. Come learn and share your experiences with us.

So far we have conducted 10+ Chennai meetups at OrangeScape Technologies, Chennai. If anyone is interested in hosting the user group in Chennai, please reach us through the contact us form. With constant growth in the number of members, we are making a significant impact in the Chennai tech communities.

Overview of Hadoop EcoSystem

Chennai Hadoop EcoSystem Overview


Introduction to Hadoop EcoSystem
Apache Hadoop is an open source software framework, created by Doug Cutting and Mike Cafarella, to store and process large-scale datasets on clusters of commodity hardware. In other words, Hadoop provides distributed storage as well as distributed processing. The picture above shows an overview of the current Hadoop framework. Below are some of the layers and components in the Hadoop framework.
Storage – HDFS (Hadoop Distributed FileSystem) provides scalable, fault-tolerant storage by dividing the data into blocks and storing them across the machines.
- HBase (NoSQL) – Provides random read as well as write capabilities. It in turn stores its data in HDFS.
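
The block-based storage described above can be pictured with a small sketch. It is only an illustration, not HDFS code: the function name `plan_blocks` and the node names are hypothetical, and the round-robin placement is a deliberate simplification of the rack-aware policy that HDFS actually uses. The 128 MB block size and replication factor of 3 are the Hadoop 2.x defaults.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default dfs.blocksize in Hadoop 2.x
REPLICATION = 3                 # the default dfs.replication


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, nodes_holding_a_replica).

    Hypothetical helper for illustration only; it does not talk to HDFS.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(num_blocks):
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - i * block_size)
        # Simplified round-robin placement; real HDFS placement is rack-aware.
        holders = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        plan.append((i, length, holders))
    return plan


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical DataNode names
    for block in plan_blocks(300 * 1024 * 1024, nodes):
        print(block)
```

For a 300 MB file this plan contains three blocks (128 MB, 128 MB and 44 MB), each held by three different nodes, so losing any single machine still leaves two copies of every block.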

Resource Management - YARN (Yet Another Resource Negotiator) provides cluster resource management for any application in a Hadoop cluster. YARN became generally available (GA) with hadoop-2.2, but work on it started in hadoop-0.23.

Data Processing
MapReduce – The default framework and programming model for processing in Hadoop up to hadoop-1.x. MapReduce is still used in Hadoop-2.x. The model is based on Google's MapReduce paper.
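
To make the programming model concrete, here is a minimal word count sketch in plain Python. It runs in a single process and does not use the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and only mirror the three stages of a MapReduce job: map, shuffle and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data in Chennai", "big data with Hadoop"]))
```

In a real cluster the mappers and reducers run in parallel on different machines and the framework performs the shuffle over the network. The sketch keeps only the data flow.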

Tez – An application framework built atop YARN to serve both batch and interactive processing.

Apache Spark – A cluster computing system written in Scala that also supports iterative and interactive processing. Spark can also run independently of Hadoop. Spark provides a unified framework for big data processing, with high-level APIs in Java, Python and Scala, and soon in R as well. Apache Spark also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. Apache Spark officially set a new record in large-scale sorting, beating the previous record held by Apache Hadoop.

Apache Storm – A widely used stream processing framework developed by BackType (later acquired by Twitter). Other important stream processing frameworks are Samza, Spark Streaming and DataTorrent.

High-level abstractions for processing
Apache Hive – A widely used SQL-on-Hadoop solution started by Facebook. Hive currently supports MR and Tez as execution engines in a Hadoop cluster. Apache Spark will soon be GA as an execution engine in Apache Hive; the community is working on it.
Apache Pig – A script-based (Pig Latin) high-level abstraction for MR and Tez in Hadoop. There is also an open source version, “spork”, that supports Apache Spark as an execution engine in Apache Pig.