RE: MapReduce with related data from disparate files

2008-03-24 Thread Nathan Wang
It's possible to do the whole thing in one round of map/reduce. The only requirement is being able to differentiate between the two types of input files, possibly by using different file name extensions. One of my coworkers wrote a smart InputFormat class that creates a different
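The message is cut off before the InputFormat details, but the pattern it describes — telling the two input types apart by file extension and tagging each record so the reducer knows its origin — can be sketched in plain Java. This is not the coworker's actual class; the extensions (`.users`, `.orders`) and tag letters are hypothetical:

```java
// Sketch of extension-based dispatch, as a smart InputFormat might do it.
// The file extensions and tag prefixes are made-up examples, not from the thread.
final class RecordTagger {
    // Tag each record so the reduce side can tell which file type it came from.
    static String tag(String fileName, String record) {
        if (fileName.endsWith(".users")) {
            return "U\t" + record;   // record came from a users file
        } else if (fileName.endsWith(".orders")) {
            return "O\t" + record;   // record came from an orders file
        }
        throw new IllegalArgumentException("unknown input type: " + fileName);
    }
}
```

In a real job this tagging would live inside the record reader returned per split, and the reducer would join records sharing a key by inspecting the tag.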

RE: 2 questions about hadoopifying

2008-03-19 Thread Nathan Wang
1. You can, if you copy the cfg file into HDFS. Otherwise it's local to one node and can't be accessed by map/reduce jobs running on other nodes. 2. You can write your own RecordReader/InputFormat classes and handle input files in any format of your own. Nathan -Original Message-
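On point 2, the core of a custom RecordReader is deciding where one record ends and the next begins. A minimal sketch of that idea in plain Java, assuming a hypothetical format where records are separated by semicolons instead of newlines (a real Hadoop RecordReader would wrap this logic in its next-record methods):

```java
// Sketch: record-splitting logic for a hypothetical ';'-delimited format.
// In Hadoop proper this would sit inside a RecordReader implementation.
import java.util.ArrayList;
import java.util.List;

final class SemicolonRecords {
    static List<String> split(String raw) {
        List<String> records = new ArrayList<>();
        for (String r : raw.split(";")) {
            if (!r.isEmpty()) {
                records.add(r.trim());   // one logical record per entry
            }
        }
        return records;
    }
}
```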

Problems with NFS share in dfs.name.dir

2008-02-22 Thread Nathan Wang
Hi, We're having problems while trying to deal with namenode failover by following the wiki http://wiki.apache.org/hadoop/NameNodeFailover If we point dfs.name.dir to two local directories, it works fine. But if one of the directories is NFS mounted, we're having these problems: 1)
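For context, the setup the wiki page recommends is a comma-separated list in dfs.name.dir so the namenode writes its metadata to both directories; the specific paths below are hypothetical examples:

```xml
<!-- hdfs-site.xml: namenode metadata is written to every listed directory.
     Paths are illustrative; the second one is the NFS mount in question. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/name,/mnt/nfs/hadoop/name</value>
</property>
```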

RE: Using jmx fails because of multiple port listeners

2008-02-15 Thread Nathan Wang
Right, you can't add that line globally; that would affect all processes. What you can do is modify this file: HADOOP_HOME/bin/hadoop. For each process, give a different port number. For example, for tasktracker, assign port 12345: ... elif [ $COMMAND = tasktracker ] ; then
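The quoted edit is truncated, but the shape of the change can be sketched as a standalone shell fragment: branch on the daemon name and append a per-daemon JMX port to HADOOP_OPTS instead of setting one port for every JVM. The exact port numbers are examples:

```shell
# Sketch of the per-daemon edit to HADOOP_HOME/bin/hadoop.
# Each daemon gets its own JMX port so the listeners don't collide.
COMMAND=tasktracker
HADOOP_OPTS=""
if [ "$COMMAND" = "tasktracker" ]; then
  HADOOP_OPTS="$HADOOP_OPTS -Dcom.sun.management.jmxremote.port=12345"
elif [ "$COMMAND" = "datanode" ]; then
  HADOOP_OPTS="$HADOOP_OPTS -Dcom.sun.management.jmxremote.port=12346"
fi
echo "$HADOOP_OPTS"
```

In the real script these branches already exist for dispatching each command; the change is only the extra HADOOP_OPTS line inside each branch.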

Re: Improving performance for large values in reduce

2008-02-07 Thread Nathan Wang
It depends on the uniqueness of your input data, and maybe on how you implemented concatenateValues. You're collecting twice for each line, on both subject and object, and then concatenating the original line twice again. If you have many rows with the same subjects and objects, you'll end up
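concatenateValues itself isn't shown in the thread, but one plausible source of the cost being described is repeated String concatenation, which recopies the accumulated value on every iteration and so grows quadratically; a StringBuilder keeps the same reduce-side concatenation roughly linear. A small illustrative comparison (not the poster's code):

```java
// Hypothetical illustration of the concatenation cost: '+' on String
// copies the whole accumulated value each pass (quadratic overall),
// while StringBuilder appends in amortized linear time.
import java.util.Iterator;

final class Concat {
    static String quadratic(Iterator<String> values) {
        String out = "";
        while (values.hasNext()) {
            out = out + values.next() + "\t";   // full copy every iteration
        }
        return out;
    }

    static String linear(Iterator<String> values) {
        StringBuilder out = new StringBuilder();
        while (values.hasNext()) {
            out.append(values.next()).append('\t');   // no recopying
        }
        return out.toString();
    }
}
```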