Paresh Khatri Progress Report

Long Term Goals

  • Work on Master's project: redesigning and redeveloping the parallel wrapper around epanet (part of the Water Threat Management project) in a map/reduce way

Summer 2009

  • Week 11 (8/18/2009)
    • Got access to teragrid (I had not activated it earlier)
    • Helped Young Suk Moon carry out a series of test cases for the WTM MPI code
    • WTM code is working on oak, and its results match the output generated by the MPI pepanet
    • Gregor asked me to figure out how to do this on teragrid
      • Answer: Hadoop on demand (HOD). HOD uses the torque resource manager to start (via qsub) hadoop services on the allocated cluster.
      • Had a hard time figuring out how to execute HOD: it is still in the beta stage, so almost no documentation is available and it has many open bugs.
      • None of the cluster machines has Java 1.6, so I installed java in my own account and pointed JAVA_HOME at that NFS location. This seems to cause a heavy performance loss.
      • HOD executes but fails to start all the services. Searching google for the generated error log did not help. Tried every scenario I could think of.
        • Error resources: 1, 2
      • Tried to study the HOD python implementation to figure out the problem, but the code seemed never-ending.
      • HOD claims that it is better if Twisted Python is installed, but that it will still work without it at a small performance loss. Bug found: HOD actually fails to execute if Twisted Python is not found, apparently because this case was never covered by the developers' tests. After I installed Twisted Python in my NFS account and set the appropriate environment variable, HOD worked successfully.
    • Wrote a preliminary Makefile to compile the cpp code and to generate the jar for my hadoop WTM code using maven.
    • Code on gerenium
    • To do:
      • Test my hadoop WTM code on teragrid machines and gather performance results
      • Improve the makefile
      • Document exact steps on how to execute my code using hadoop and hadoop on demand
  • Week 1-3 (6/1/2009 - 6/21/2009)

    • [x] Reworked the parallel epanet version (non-MPI) of water threat management so that it fits the hadoop model.
    • [x] Worked on designing a skeleton for pepanet so that it can be solved the hadoop map/reduce way
    • [x] Spent a lot of time looking for compile instructions for running a C++ map/reduce program on hadoop, but failed due to lack of documentation.
      • Had this succeeded, it would have been possible to write a C++-only version of parallel epanet.
      • As an alternative, I now plan to put the executable and its input files in HDFS and execute it from a java map/reduce program, passing the relevant arguments to it.
    • [x] Wrote a java map/reduce program to manage I/O activities with HDFS for epanet
    • [x] Having minor issues with the exact path of the input files for the C++ program, as it will vary across the participating nodes.
      • The exact path can be determined by copying the input files from HDFS to the local file system. The relative path of the files there is not the same as it is within HDFS. Will address this issue by Sunday.

    • [ ] Make a presentation on hadoop that addresses my current design
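
The staging plan described under Week 1-3 (copy the input files out of HDFS to the local file system, then call the executable with the resulting local paths) can be sketched roughly as follows. This is only an illustration in Python, not the java map/reduce code from the report: `stage_locally`, `run_staged`, the file name `network.inp` and the stand-in command are all hypothetical, and `shutil.copy` merely takes the place of the HDFS-to-local copy so the sketch runs without a cluster.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def stage_locally(sources, workdir, fetch=shutil.copy):
    """Copy each source into workdir and return the absolute local paths.

    `fetch` is a stand-in (assumption) for the HDFS-to-local copy step;
    shutil.copy is used only so that the sketch runs without a cluster.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    local_paths = []
    for src in sources:
        dest = workdir / Path(src).name
        fetch(str(src), str(dest))
        # The absolute path is only known after the copy, per node.
        local_paths.append(dest.resolve())
    return local_paths


def run_staged(command, input_paths):
    """Run the external program with the staged absolute paths as arguments."""
    result = subprocess.run(
        list(command) + [str(p) for p in input_paths],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "source" / "network.inp"  # hypothetical input file
        src.parent.mkdir()
        src.write_text("demo input\n")
        staged = stage_locally([src], Path(tmp) / "task_0")
        # Stand-in for the real executable: a Python one-liner that prints
        # the length of the file it was given as its first argument.
        cmd = [sys.executable, "-c",
               "import sys; print(len(open(sys.argv[1]).read()))"]
        print(run_staged(cmd, staged))
```

Because the program only ever sees the paths returned by `stage_locally`, it does not matter that the local directory differs from node to node.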

Spring 2009

  • Look for internship opportunities, now that I am done with all the courses.
  • Spent time looking for alternatives for redesigning the current parallel version of epanet (pepanet) for the water threat management project
  • Did extensive research on the use and importance of the Hadoop Map/Reduce framework

    • Notes
      • Map/Reduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs.
      • Map phase - the input dataset is split into a large number of fragments, each of which is processed by an individual map task that emits intermediate key/value pairs.
      • Reduce phase - each reduce task receives a key together with the list of intermediate values emitted for it in the Map phase and processes that list.
      • Combine phase (optional) - an intermediate phase between the Map and Reduce phases that performs a reduce-like operation, in which the output of each map task can be pre-aggregated before it reaches the reducers.
      • Checkpointing is done after each phase for reliability.
      • Integrated distributed file system (known as HDFS) that takes care of reliably storing very large files and maintaining the appropriate number of replicas. A well-defined interface (shell + API) is available for reading and writing files in this file system.
      • Programs can be written in Java, C++, Python and Ruby.
        • Extensive documentation and relevant examples available for Java
        • Compile instructions missing for C++.
      • Command line interface available for submitting jobs and communicating with HDFS (for i/o files)
    • Quick Facts
      • Mostly built with Java to achieve platform independence, so the only prerequisites are
        • Java >= 1.5.x
        • ssh and sshd - For starting the hadoop daemons on the slave nodes from the master itself
        • rsync - For synchronizing the hadoop installation across the nodes (replication of HDFS data is handled by hadoop itself)
        • Public key authentication set up between the master and all slave machines (blank passphrase)
      • Long list of organisations that use hadoop
        • Yahoo - More than 25,000 computers - Used to support research activities on Ad Systems and Web Search
    • Set up hadoop on 6 CS machines (in the /tmp directory) and tested it by running several examples.
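
The Map, Combine and Reduce phases from the notes above can be illustrated with a small in-memory sketch. It is written in plain Python and does not use hadoop at all; the function names (`map_phase`, `combine_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for illustration, not part of the WTM code.

```python
from collections import defaultdict
from itertools import chain


def map_phase(fragment):
    """Map: turn one input fragment into (key, value) pairs."""
    return [(word.lower(), 1) for word in fragment.split()]


def combine_phase(pairs):
    """Combine (optional): pre-aggregate the pairs of a single map task."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def shuffle(all_pairs):
    """Group intermediate values by key, as a framework does before Reduce."""
    grouped = defaultdict(list)
    for key, value in all_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into one result."""
    return key, sum(values)


def run_job(fragments, use_combiner=True):
    """Run map, optional combine, shuffle and reduce over the fragments."""
    map_outputs = [map_phase(f) for f in fragments]
    if use_combiner:
        map_outputs = [combine_phase(p) for p in map_outputs]
    grouped = shuffle(chain.from_iterable(map_outputs))
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    fragments = ["pipe node pipe", "node tank", "pipe"]
    print(run_job(fragments))
```

Running the job with and without the combiner gives the same counts; the combiner only reduces the number of pairs that have to be moved between the two phases.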

Winter 2008-2009

  • [x] Studied the thesis and source code of the water threat management project developed by Sarat, along with relevant papers, to prepare for the Master's project seminar course
  • [x] Redesigned the code after removing all the MPI-specific parts.
  • [x] Code is not checked into google/sourceforge svn as it cannot be distributed
  • [ ] Andrew has created a private svn and set up a Trac wiki for me. Get the username/password and check the code into it.