Did extensive research on use and importance of Hadoop Map Reduce framework
- Notes
- Map/Reduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs.
- Map phase - split the input dataset into large number of fragements each of which is a part of an individual task.
- Reduce phase - Process each task with input as a list of fragements as they were split in the Map phase.
- Combine phase(optional) - Intermediate phase between Map and Reduce phases, does a reduce-like operation, where in each task can be preprocessed/combined with other tasks.
- Check pointing is done after each phase to achieve complete reliability.
- Integrated Distributed File System (known as HDFS) that takes care of reliably storing very large files and maintain their appropriate number of replicas. Well defined interface (shell + API) available for writing/reading files to this file system.
- Programs can be written in Java, C++, Python and Ruby.
- Extensive documentation and relevant examples available for Java
- Compile instructions missing for C++.
- Command line interface available for submitting jobs and communicating with HDFS (for i/o files)
- Quick Facts
- Mostly built with Java, to achieve platform independence, so only pre-requisites are
- Java >= 1.5.x
- ssh and sshd - For initiating hadoop daemons on slave nodes from master itself
- rsync - For replicating purposes (HDFS)
- Public key authentication set between master and all slave machines (blank passphrase)
- Big list of organisations who utilize hadoop
- Yahoo - More than 25,000 computers - Used to support research activities on Ad Systems and Web Search
- Used 6 CS machines to setup hadoop on them (in /tmp directory) and tested executing several examples.