Hadoop Overview

From NJIT-ARCS HPC Wiki

Introduction

From the Apache Hadoop Website:

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
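The "simple programming models" mentioned above can be illustrated with word counting, the usual introductory MapReduce example. The following is a minimal single-process sketch in plain Python, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the grouping step that Hadoop performs across the cluster is imitated here with an in-memory dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key (hypothetical stand-in for the framework's shuffle)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data on hadoop", "hadoop stores big files"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

On a real cluster the map and reduce functions would run in parallel on many machines, with HDFS supplying the input splits and YARN scheduling the tasks; only the two user-written functions resemble what is shown here.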

Infrastructure

The Hadoop infrastructure is a virtual environment based on VMware Big Data Extensions (BDE). From the BDE Datasheet:

VMware introduced Big Data Extensions, or BDE, as a commercially supported version of Project Serengeti designed for enterprises seeking VMware support. BDE enables customers to run clustered, scale-out Hadoop applications on the vSphere platform, delivering all the benefits of virtualization to Hadoop users. BDE delivers operational simplicity with an easy-to-use interface, improved utilization through compute elasticity, and a scalable and flexible Big Data platform to satisfy changing business requirements. VMware has built BDE to support all major Hadoop distributions and associated Apache Hadoop projects such as Pig, Hive, and HBase.

Hardware

The hardware associated with BDE is as follows:

2 x IBM iDataPlex dx360 M4 nodes, each with:

  • 2 x Intel Xeon CPU E5-2680 (8 cores each)
  • 16 CPU cores @ 2.70 GHz
  • 32 logical processors with Hyper-Threading
  • 128 GB RAM
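The per-node figures follow from simple arithmetic, and the same arithmetic gives the physical capacity available to the whole virtual environment. The sketch below works through it, assuming that Hyper-Threading exposes two logical processors per core (consistent with 16 cores and 32 logical processors above) and that "128G" means 128 GB. The constant and function names are illustrative only.

```python
NODES = 2                # IBM iDataPlex dx360 M4 nodes
SOCKETS_PER_NODE = 2     # Intel Xeon E5-2680 CPUs per node
CORES_PER_SOCKET = 8
THREADS_PER_CORE = 2     # assumption: Hyper-Threading, 2 threads per core
RAM_GB_PER_NODE = 128    # assumption: "128G" means 128 GB


def node_totals() -> dict:
    """Physical cores, logical processors, and RAM for a single node."""
    cores = SOCKETS_PER_NODE * CORES_PER_SOCKET
    return {
        "cores": cores,
        "logical_processors": cores * THREADS_PER_CORE,
        "ram_gb": RAM_GB_PER_NODE,
    }


def cluster_totals() -> dict:
    """Multiply the per-node totals by the number of nodes."""
    return {name: value * NODES for name, value in node_totals().items()}


if __name__ == "__main__":
    print("per node:", node_totals())
    print("both nodes:", cluster_totals())
```

Both nodes together therefore provide 32 physical cores, 64 logical processors, and 256 GB of RAM. These are hardware totals; the resources actually usable by Hadoop virtual machines will be lower once hypervisor overhead is accounted for.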

Software

  • vSphere 5.5
  • Nodes are running ESXi 5.5.0
  • Big Data Extensions 2.3

Hadoop Distribution

BDE allows several different Hadoop distributions to be deployed, including Hortonworks (HDP), Cloudera, and MapR. The only completely open-source distribution is HDP, and it is the distribution ARCS/IST has chosen to deploy for the general-purpose Hadoop cluster.