Re: Pregel
Hello, I don't have a background in CS, but does MS's Dryad ( http://research.microsoft.com/en-us/projects/Dryad/ ) fit in anywhere here? Regards Saptarshi On Fri, Jun 26, 2009 at 5:19 AM, Edward J. Yoon <edwardy...@apache.org> wrote: According to my understanding, I think Pregel is in the same layer as MR, not an MR-based language processor. I think the 'Collective Communication' of BSP seems to be the core of the problem. For example, this BFS problem (http://blog.udanax.org/2009/02/breadth-first-search-mapreduce.html) can be solved at once w/o MR iterations. On Fri, Jun 26, 2009 at 3:17 PM, Owen O'Malley <omal...@apache.org> wrote: On Jun 25, 2009, at 9:42 PM, Mark Kerzner wrote: my guess, as good as anybody's, is that Pregel is to large graphs what Hadoop is to large datasets. I think it is much more likely a language that allows you to easily define fixed-point algorithms. I would imagine a distributed version of something similar to Michal Young's GenSet. http://portal.acm.org/citation.cfm?doid=586094.586108 I've been trying to figure out how to justify working on a project like that for a couple of years, but haven't yet. (I have a background in program static analysis, so I've implemented similar stuff.) In other words, Pregel is the next natural step for massively scalable computations after Hadoop. I wonder if it uses map/reduce as a base or not. It would be easier to use map/reduce, but a direct implementation would be more performant. In either case, it is a new hammer. From what I see, it likely won't replace map/reduce, pig, or hive; but rather support a different class of applications much more directly than you can under map/reduce. -- Owen -- Best Regards, Edward J. Yoon @ NHN, corp. edwardy...@apache.org http://blog.udanax.org
Re: When is configure and close run
Thank you! Just to confirm: consider a JVM (that is being reused) that has to reduce K1,{V11,V12,V13,...} and K2,{V21,V22,V23,...}. Are the configure and close methods then called once each for both K1,{V11,...} and K2,{V21,...}? Is my understanding correct? Once again, there is no combiner, and it makes sense that it is not called. Thank you Saptarshi On Mon, Jun 22, 2009 at 10:55 PM, jason hadoop <jason.had...@gmail.com> wrote: configure and close are run for each task, mapper and reducer. The configure and close are NOT run on the combiner class. On Mon, Jun 22, 2009 at 9:23 AM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, In a mapreduce job, a given map JVM will run N map tasks. Are the configure and close methods executed for every one of these N tasks? Or is configure executed once when the JVM starts and the close method executed once when all N have been completed? I have the same question for the reduce task. Will configure be run before every reduce task? And is close run when all the values for a given key have been processed? We can assume there isn't a combiner. Regards Saptarshi -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
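A minimal sketch (not from the thread) of where configure() and close() sit in the 0.19-era org.apache.hadoop.mapred API, matching the answer above: they run once per task attempt, bracketing all the reduce() calls for that task's keys, not once per key group. Class and value types here are illustrative.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class LifecycleReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  @Override
  public void configure(JobConf job) {
    // Runs once per task attempt, before any key is reduced. With JVM reuse,
    // a reused JVM still calls this once for each task it hosts.
  }

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    out.collect(key, new LongWritable(sum)); // called once per key (K1, K2, ...)
  }

  @Override
  public void close() throws IOException {
    // Runs once per task attempt, after the last key has been reduced --
    // not once per key.
  }
}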
Re: EC2, Hadoop, copy file from CLUSTER_MASTER to CLUSTER, failing
Hello, Thank you. This is quite useful. Regards Saptarshi On Wed, Jun 24, 2009 at 6:16 AM, Tom Whitet...@cloudera.com wrote: Hi Saptarshi, The group permissions open the firewall ports to enable access, but there are no shared keys on the cluster by default. See https://issues.apache.org/jira/browse/HADOOP-4131 for a patch to the scripts that shares keys to allow SSH access between machines in the cluster. Cheers, Tom On Sat, Jun 20, 2009 at 7:09 AM, Saptarshi Guhasaptarshi.g...@gmail.com wrote: Hello, I have a cluster with 1 master and 1 slave (testing). In the EC2 scripts, in the hadoop-ec2-init-remote.sh file, I wish to copy a file from the MASTER to the CLUSTER i.e in the slave section scp $MASTER_HOST:/tmp/v /tmp/v However, this didnt work and when I logged in, ssh'd to the slave and tried the command, I got the following error: Permission denied (publickey,gssapi-with-mic) Yet, the group permissions appear to be valid i.e ec2-authorize $CLUSTER_MASTER -o $CLUSTER -u $AWS_ACCOUNT_ID ec2-authorize $CLUSTER -o $CLUSTER_MASTER -u $AWS_ACCOUNT_ID So I don't see why I can't ssh into the MASTER group from a slave. Any suggestion as to where I'm going wrong? Regards Saptarshi P.S I know I can copy a file from S3, but would like to know what is going on here.
EC2, Max tasks, under utilized?
Hello, I'm running a 90 node c1.xlarge cluster with no reducers and mapred.max.map.tasks=6 per machine. The AMI is my own and uses Hadoop 0.19.1. The dataset has 145K keys, and the processing time is huge. Now, when I set mapred.map.tasks=14,000, what ends up running is 49 map tasks across the machines. No machine is running more than 3 tasks; most are running 1, some are running 0. Looking at the map records read, it appears these 49 tasks correspond to the 145K records. Q) Why isn't the number of running tasks much higher? If each machine can run 6, then why not make this a higher number and run across the machines? This is under-utilization. So I set mapred.map.tasks=90. On the Hadoop machine list, all 90 machines are running at least 1 task, mostly 1, some 2, and a small few 3+ (max 4). At the job tracker page, only 23 are running, 48 pending (when I sent this email). With 90 machines (and a Map Task Capacity of 540), why aren't 90 running at one go? What should be set? What isn't set? Regards Saptarshi Guha
Re: EC2, Max tasks, under utilized?
Hello, I should also point out that I'm using a SequenceFileInputFormat. Regards Saptarshi Guha On Tue, Jun 23, 2009 at 10:43 AM, Saptarshi Guha saptarshi.g...@gmail.comwrote: Hello, I'm running a 90 node c1.xlarge cluster. No reducers, mapred.max.map.tasks=6 per machine. The AMI is own and uses Hadoop 0.19.1 The dataset has 145K keys, and the processing time is huge. Now, when set the mapred.map.tasks=14,000 what ends up running is 49 map tasks, across the machines. No machine is running more than 3 tasks most are running 1, some are running 0. Looking at the map records read, it appears these 49 tasks correspond to the 145k records. Q) Why? Why isn't the running tasks a much higher number? If each machine can run 6, then why not make this a higher number and run across the machines? This is under utilization So I set the mapred.map.tasks=90. At the hadoop machine list, all 90 machines are at least 1 task , mostly 1, some 2 and a small few 3+(max 4) At the job tracker page, only 23 are running, 48 pending (when i sent this email). With 90 machines(and Map Task Capacity of 540), why aren't 90 running at one go? What should be set? What isn't set? Regards Saptarshi Guha
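A sketch of how one might check what is actually driving the map-task count here, assuming an old-API (0.19) job reading SequenceFiles; the input path argument is a placeholder. mapred.map.tasks is only a hint passed to getSplits(), so the number of splits the input format produces is what the JobTracker really schedules:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class SplitCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SplitCount.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    conf.setNumMapTasks(14000); // only a hint passed to getSplits()
    SequenceFileInputFormat<?, ?> fmt = new SequenceFileInputFormat<Object, Object>();
    InputSplit[] splits = fmt.getSplits(conf, conf.getNumMapTasks());
    // The job runs one map task per split, regardless of the hint above.
    System.out.println("splits (actual map tasks): " + splits.length);
  }
}

If the 145K keys live in only a few large SequenceFiles, the split count (and hence the running-task count) will track the number of files and blocks rather than the mapred.map.tasks hint.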
EC2, Hadoop, copy file from CLUSTER_MASTER to CLUSTER, failing
Hello, I have a cluster with 1 master and 1 slave (testing). In the EC2 scripts, in the hadoop-ec2-init-remote.sh file, I wish to copy a file from the MASTER to the CLUSTER i.e in the slave section scp $MASTER_HOST:/tmp/v /tmp/v However, this didnt work and when I logged in, ssh'd to the slave and tried the command, I got the following error: Permission denied (publickey,gssapi-with-mic) Yet, the group permissions appear to be valid i.e ec2-authorize $CLUSTER_MASTER -o $CLUSTER -u $AWS_ACCOUNT_ID ec2-authorize $CLUSTER -o $CLUSTER_MASTER -u $AWS_ACCOUNT_ID So I don't see why I can't ssh into the MASTER group from a slave. Any suggestion as to where I'm going wrong? Regards Saptarshi P.S I know I can copy a file from S3, but would like to know what is going on here.
Tutorial on building an AMI
Hello, Is there a tutorial available to build an Hadoop AMI (like Cloudera's)? Cloudera has an 18.2 ami and for reasons I understand they can't provide(as of now) AMIs for higher Hadoop versions until they become stable. I would like to create an AMI for 19.2 - so was hoping if there is a guide for building one. Thank you Saptarshi Guha
Re: No reduce tasks running, yet 1 is pending
Interestingly, when I started other jobs, this one finished. I have no idea why. Saptarshi Guha On Tue, May 12, 2009 at 10:36 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, I mentioned this issue before for the case of map tasks. I have 43 reduce tasks, 42 completed, 1 pending and 0 running. This has been the case for the last 30 minutes. A picture (tiff) of the job tracker can be found here ( http://www.stat.purdue.edu/~sguha/mr.tiff ); since I haven't canceled the job, I can send logs and output if anyone requires. Regards Saptarshi
Heterogeneous cluster - quadcores/8 cores, Fairscheduler
Hello, Our unit has 5 quad-core machines running Hadoop. We have a dedicated Jobtracker/Namenode. Each machine has 32GB RAM. We have the option of buying an 8-core, 128GB machine, and the question is whether this would be useful as a Tasktracker. A) It can certainly be used as the JobTracker and Namenode. B) I also read about the FairScheduler, and mapred.fairscheduler.loadmanager appears to allow the administrator to let more jobs run on a given machine. Thus the new machine might be able to run double the number of maps and reduces. Otherwise, if the FairScheduler is not an option (or I misunderstood), then adding the new machine to the cluster would only underutilize it. Are my inferences correct? Regards Saptarshi Guha
R on the cloudera hadoop ami?
Hello, Thank you to Cloudera for providing an AMI for Hadoop. Would it be possible to include R in the Cloudera yum repositories or better still can the i386 and x86-64 AMIs be updated to have R pre-installed? If yes (thanks!) I think yum -y install R installs R built with --shared-libs (which allows programs to link with R) Thank you in advance Saptarshi
ANN: R and Hadoop = RHIPE 0.1
Hello, I'd like to announce the release of the 0.1 version of RHIPE - R and Hadoop Integrated Processing Environment. Using RHIPE, it is possible to write map-reduce algorithms using the R language and start them from within R. RHIPE is built on Hadoop and so benefits from Hadoop's fault tolerance, distributed file system and job scheduling features. For the R user, there is rhlapply which runs an lapply across the cluster. For the Hadoop user, there is rhmr which runs a general map-reduce program. The tired example of counting words:

m <- function(key, val){
  words <- strsplit(val, " +")[[1]]
  wc <- table(words)
  cln <- names(wc)
  return(sapply(1:length(wc), function(r) list(key=cln[r], value=wc[[r]]), simplify=F))
}
r <- function(key, value){
  value <- do.call(rbind, value)
  return(list(list(key=key, value=sum(value))))
}
rhmr(mapper=m, reduce=r, input.folder=X, output.folder=Y)

URL: http://ml.stat.purdue.edu/rhipe There are some downsides to RHIPE which are described at http://ml.stat.purdue.edu/rhipe/install.html#sec-5 Regards Saptarshi Guha
RuntimeException, could not obtain block causes trackers to get blacklisted
Hello, I have an intensive job running across 5 machines. During the map stage, each map emits 200 records, so effectively for 50,000,000 input reords, the map creates 200*50e6 records. However, after a long time, I see two trackers are blacklisted Caused by: java.lang.RuntimeException: Could not obtain block: blk_-5964245027287878843_92134 file=/tmp/8a5bc814-b4ff-4641-bc3a-abfeda9e7e33.mapfile/index at org.saptarshiguha.rhipe.hadoop.RHMR$RHMRCombiner.configure(RHMR.java:314) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:1110) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:989) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:401) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:886) The map file is very much present in that location and other trackers could read it. What could be the reason for these two machines not being able to read it? Now that they are blacklisted, how can I add them back to the computation (a Saptarshi Guha
File name to which mapper's key,value belongs to - is it available?
Hello, Is there a conf variable for getting the filename to which the current mapper's key/value belongs? I have dir/dirA/part-X and dir/dirB/part-X. I will process dir, but need to know whether the key/value is from a dirA/part-* file or from a dirB/part-* file. I'd much rather not implement my own inputformat, since I'd like the method to be inputformat agnostic. Much thanks Saptarshi Guha
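A sketch of one way to do this without a custom InputFormat, assuming the 0.19-era old API: for file-based inputs the framework sets the per-task property map.input.file to the file backing the current split, so it can be read in configure(). Class and output types below are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DirAwareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private boolean fromDirA;

  @Override
  public void configure(JobConf job) {
    // Set by the framework per task to the file backing the current split.
    String inputFile = job.get("map.input.file");
    fromDirA = inputFile != null && inputFile.contains("/dirA/");
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    out.collect(new Text(fromDirA ? "A" : "B"), value);
  }
}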
Re: Sometimes no map tasks are run - X are complete and N-X are pending, none running
I forgot to examine the logs; I will next time it happens. Thank you. BTW, no maps are running. Only a few are pending and the rest are complete. Saptarshi Guha On Fri, Apr 17, 2009 at 3:06 AM, Jothi Padmanabhan joth...@yahoo-inc.com wrote: On 4/17/09 12:26 PM, Sharad Agarwal shara...@yahoo-inc.com wrote: The last map task is forever in the pending queue - is this an issue with my setup/config or do others have the problem? Do you mean the left over maps are not at all scheduled? What do you see in the jobtracker logs? Also in the JT UI, please check on how many maps are marked as running, when this map is still pending?
Sometimes no map tasks are run - X are complete and N-X are pending, none running
Hello, I'm using 0.19.2-dev-core (checked out from cvs and built). With 51 maps, I have a case where 50 tasks have completed and 1 is pending, with about 1400 records left for this one to process. The completed map tasks have written out 18GB to the HDFS. The last map task is forever in the pending queue - is this an issue with my setup/config or do others have the problem? Thank you Saptarshi Guha
Is combiner and map in same JVM?
Hello, Suppose I have a Hadoop job and have set my combiner to the Reducer class. Does the map function and the combiner function run in the same JVM in different threads? or in different JVMs? I ask because I have to load a native library and if they are in the same JVM then the native library is loaded once and I have to take precautions. Thank you Saptarshi Guha
Re: Is combiner and map in same JVM?
Thanks. I am using 0.19, and to confirm, the map and combiner (in the map jvm) are run in *different* threads at the same time? My native library is not thread safe, so I would have to implement locks. Aaron's email gave me hope(since the map and combiner would then be running sequentially), but this appears to make things complicated. Saptarshi Guha On Tue, Apr 14, 2009 at 2:01 PM, Owen O'Malley omal...@apache.org wrote: On Apr 14, 2009, at 10:52 AM, Aaron Kimball wrote: They're in the same JVM, and I believe in the same thread. They are the same JVM. They *used* to be the same thread. In either 0.19 or 0.20, combiners are also called in the reduce JVM if spills are required. -- Owen
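Given that answer (same JVM, and the combiner may also run in the reduce JVM when spills require it), one defensive pattern is to load the native library in a static initializer, which runs at most once per JVM, and to serialize every call into it. A minimal sketch, with a hypothetical library and method name:

public final class NativeLib {
  private static final Object LOCK = new Object();

  static {
    // Runs at most once per JVM, no matter how many map/combine/reduce
    // tasks that JVM ends up hosting.
    System.loadLibrary("mynative"); // hypothetical library name
  }

  private NativeLib() {}

  // Serialize every call into the non-thread-safe library.
  public static byte[] process(byte[] input) {
    synchronized (LOCK) {
      return processNative(input);
    }
  }

  private static native byte[] processNative(byte[] input);
}

Every map, combine and reduce call that touches the library then goes through NativeLib.process(), so the non-thread-safe code is never entered concurrently.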
Building LZO on hadoop
I checked out hadoop-core-0.19 export CFLAGS=$CUSTROOT/include export LDFLAGS=$CUSTROOT/lib (they contain lzo which was built with --shared) ls $CUSTROOT/include/lzo/ lzo1a.h lzo1b.h lzo1c.h lzo1f.h lzo1.h lzo1x.h lzo1y.h lzo1z.h lzo2a.h lzo_asm.h lzoconf.h lzodefs.h lzoutil.h ls $CUSTROOT/lib/ liblzo2.so liblzo.a liblzo.la liblzo.so liblzo.so.1 liblzo.so.2 liblzo.so.2.0.0 I then run (from hadoop-core-0.19.1/) ant -Dcompile.native=true I get messages like : (many others like this) exec] configure: WARNING: lzo/lzo1x.h: accepted by the compiler, rejected by the preprocessor! [exec] configure: WARNING: lzo/lzo1x.h: proceeding with the compiler's result [exec] checking for lzo/lzo1x.h... yes [exec] checking Checking for the 'actual' dynamic-library for '-llzo2'... (cached) [exec] checking lzo/lzo1y.h usability... yes [exec] checking lzo/lzo1y.h presence... no [exec] configure: WARNING: lzo/lzo1y.h: accepted by the compiler, rejected by the preprocessor! [exec] configure: WARNING: lzo/lzo1y.h: proceeding with the compiler's result [exec] checking for lzo/lzo1y.h... yes [exec] checking Checking for the 'actual' dynamic-library for '-llzo2'... (cached) and finally, ive/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c -fPIC -DPIC -o .libs/LzoCompressor.o [exec] /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c: In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs': [exec] /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:137: error: expected expression before ',' token Any ideas? Saptarshi Guha
Re: Building LZO on hadoop
Fixed. In the configure script src/native/ change echo 'int main(int argc, char **argv){return 0;}' conftest.c if test -z `${CC} ${LDFLAGS} -o conftest conftest.c -llzo2 21`; then if test ! -z `which objdump | grep -v 'no objdump'`; then ac_cv_libname_lzo2=`objdump -p conftest | grep NEEDED | grep lzo2 | sed 's/\W*NEEDED\W*\(.*\)\W*$/\\1\/'` elif test ! -z `which ldd | grep -v 'no ldd'`; then ac_cv_libname_lzo2=`ldd conftest | grep lzo2 | sed 's/^[^A-Za-z0-9]*\([A-Za-z0-9\.]*\)[^A-Za-z0-9]*=.*$/\\1\/'` else { { echo $as_me:$LINENO: error: Can't find either 'objdump' or 'ldd' to compute the dynamic library for '-llzo2' 5 echo $as_me: error: Can't find either 'objdump' or 'ldd' to compute the dynamic library for '-llzo2' 2;} { (exit 1); exit 1; }; } fi else ac_cv_libname_lzo2=libnotfound.so fi rm -f conftest* lzo2 to lzo.so.2 (again this depends on what the user has), also set CFLAGS and LDFLAGS to include your lzo libs/incs Saptarshi Guha On Wed, Apr 1, 2009 at 2:29 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote: I checked out hadoop-core-0.19 export CFLAGS=$CUSTROOT/include export LDFLAGS=$CUSTROOT/lib (they contain lzo which was built with --shared) ls $CUSTROOT/include/lzo/ lzo1a.h lzo1b.h lzo1c.h lzo1f.h lzo1.h lzo1x.h lzo1y.h lzo1z.h lzo2a.h lzo_asm.h lzoconf.h lzodefs.h lzoutil.h ls $CUSTROOT/lib/ liblzo2.so liblzo.a liblzo.la liblzo.so liblzo.so.1 liblzo.so.2 liblzo.so.2.0.0 I then run (from hadoop-core-0.19.1/) ant -Dcompile.native=true I get messages like : (many others like this) exec] configure: WARNING: lzo/lzo1x.h: accepted by the compiler, rejected by the preprocessor! [exec] configure: WARNING: lzo/lzo1x.h: proceeding with the compiler's result [exec] checking for lzo/lzo1x.h... yes [exec] checking Checking for the 'actual' dynamic-library for '-llzo2'... (cached) [exec] checking lzo/lzo1y.h usability... yes [exec] checking lzo/lzo1y.h presence... no [exec] configure: WARNING: lzo/lzo1y.h: accepted by the compiler, rejected by the preprocessor! [exec] configure: WARNING: lzo/lzo1y.h: proceeding with the compiler's result [exec] checking for lzo/lzo1y.h... yes [exec] checking Checking for the 'actual' dynamic-library for '-llzo2'... (cached) and finally, ive/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c -fPIC -DPIC -o .libs/LzoCompressor.o [exec] /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c: In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs': [exec] /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:137: error: expected expression before ',' token Any ideas? Saptarshi Guha
Re: Building LZO on hadoop
Actually, if one installs the latest liblzo and sets CFLAGS, LDFLAGS and LFLAGS correctly, things work fine. Saptarshi Guha On Wed, Apr 1, 2009 at 3:55 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Fixed. In the configure script src/native/ change echo 'int main(int argc, char **argv){return 0;}' conftest.c if test -z `${CC} ${LDFLAGS} -o conftest conftest.c -llzo2 21`; then if test ! -z `which objdump | grep -v 'no objdump'`; then ac_cv_libname_lzo2=`objdump -p conftest | grep NEEDED | grep lzo2 | sed 's/\W*NEEDED\W*\(.*\)\W*$/\\1\/'` elif test ! -z `which ldd | grep -v 'no ldd'`; then ac_cv_libname_lzo2=`ldd conftest | grep lzo2 | sed 's/^[^A-Za-z0-9]*\([A-Za-z0-9\.]*\)[^A-Za-z0-9]*=.*$/\\1\/'` else { { echo $as_me:$LINENO: error: Can't find either 'objdump' or 'ldd' to compute the dynamic library for '-llzo2' 5 echo $as_me: error: Can't find either 'objdump' or 'ldd' to compute the dynamic library for '-llzo2' 2;} { (exit 1); exit 1; }; } fi else ac_cv_libname_lzo2=libnotfound.so fi rm -f conftest* lzo2 to lzo.so.2 (again this depends on what the user has), also set CFLAGS and LDFLAGS to include your lzo libs/incs Saptarshi Guha On Wed, Apr 1, 2009 at 2:29 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote: I checked out hadoop-core-0.19 export CFLAGS=$CUSTROOT/include export LDFLAGS=$CUSTROOT/lib (they contain lzo which was built with --shared) ls $CUSTROOT/include/lzo/ lzo1a.h lzo1b.h lzo1c.h lzo1f.h lzo1.h lzo1x.h lzo1y.h lzo1z.h lzo2a.h lzo_asm.h lzoconf.h lzodefs.h lzoutil.h ls $CUSTROOT/lib/ liblzo2.so liblzo.a liblzo.la liblzo.so liblzo.so.1 liblzo.so.2 liblzo.so.2.0.0 I then run (from hadoop-core-0.19.1/) ant -Dcompile.native=true I get messages like : (many others like this) exec] configure: WARNING: lzo/lzo1x.h: accepted by the compiler, rejected by the preprocessor! [exec] configure: WARNING: lzo/lzo1x.h: proceeding with the compiler's result [exec] checking for lzo/lzo1x.h... yes [exec] checking Checking for the 'actual' dynamic-library for '-llzo2'... (cached) [exec] checking lzo/lzo1y.h usability... yes [exec] checking lzo/lzo1y.h presence... no [exec] configure: WARNING: lzo/lzo1y.h: accepted by the compiler, rejected by the preprocessor! [exec] configure: WARNING: lzo/lzo1y.h: proceeding with the compiler's result [exec] checking for lzo/lzo1y.h... yes [exec] checking Checking for the 'actual' dynamic-library for '-llzo2'... (cached) and finally, ive/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c -fPIC -DPIC -o .libs/LzoCompressor.o [exec] /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c: In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs': [exec] /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:137: error: expected expression before ',' token Any ideas? Saptarshi Guha
Re: X not in slaves, yet still being used as a tasktracker
Aha,there isn't a tasktracker running on X, yet jps shows 4 children and when jobs fail i see failures on X too. So what exactly can I stop? jps output on X 20044 org.apache.hadoop.mapred.JobTracker 12871 org.apache.hadoop.mapred.Child 127.0.0.1 51590 attempt_200903302220_0036_m_20_0 -541073721 12819 org.apache.hadoop.mapred.Child 127.0.0.1 51590 attempt_200903302220_0036_m_03_0 1185151254 19768 org.apache.hadoop.hdfs.server.namenode.NameNode 13004 org.apache.hadoop.mapred.Child 127.0.0.1 51590 attempt_200903302220_0036_r_10_0 1044080068 11396 org.saptarshiguha.rhipe.rhipeserver.RHIPEMain --tsp=mimosa:8200 --listen=4445 --quiet --no-save --max-nsize=1G --max-ppsize=10 19961 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode 13131 sun.tools.jps.Jps -ml 22054 12931 org.apache.hadoop.mapred.Child 127.0.0.1 51590 attempt_200903302220_0036_m_32_0 -690673928 12962 org.apache.hadoop.mapred.Child 127.0.0.1 51590 attempt_200903302220_0036_m_33_0 203873369 Saptarshi Guha On Mon, Mar 30, 2009 at 11:37 AM, Saptarshi Guha saptarshi.g...@gmail.com wrote: thanks Saptarshi Guha On Sun, Mar 29, 2009 at 7:42 PM, Bill Au bill.w...@gmail.com wrote: The jobtracker does not have to be a tasktracker. Just stop and don't start the tasktracker process. Bill On Sun, Mar 29, 2009 at 12:00 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, A machine X which is the master: it is the jobtracker, namenode and secondary namenode. It is not in the slaves file and is not part of the HDFSHowever in the mapreduce web page, I notice it is being used as a tasktracker. Is the jobtracker always a tasktracker? I'd rather not place too much load on the one machine which plays so many roles. (Hadoop 0.19.0) Thanks Saptarshi
Re: JNI and calling Hadoop jar files
Delayed response. However, I stand corrected. I was using a package called rJava which integrates R and java. Maybe there was a classloader issue but once I rewrote my stuff using C and JNI, the issues disappeared. When I create my configuration, i add as a resource $HADOOP/conf/hadoop-{default,site},xml in that order. All problems disappeared. Sorry, I couldn't provide the information requested with the error causing approach - I stopped using that package. regards Saptarshi Guha P.S As an aside, If I launch my own java apps, which require the hadoop configuration etc, I have to manually add the {default,site}.xml files. On Tue, Mar 24, 2009 at 6:52 AM, jason hadoop jason.had...@gmail.com wrote: The exception reference to *org.apache.hadoop.hdfs.DistributedFileSystem*, implies strongly that a hadoop-default.xml file, or at least a job.xml file is present. Since hadoop-default.xml is bundled into the hadoop-0.X.Y-core.jar, the assumption is that the core jar is available. The class not found exception, the implication is that the hadoop-0.X.Y-core.jar is not available to jni. Given the above constraints, the two likely possibilities are that the -core jar is unavailable or damaged, or that somehow the classloader being used does not have access to the -core jar. A possible reason for the jar not being available is that the application is running on a different machine, or as a different user and the jar is not actually present or perhaps readable in the expected location. Which way is your JNI, java application calling into a native shared library, or a native application calling into a jvm that it instantiates via libjvm calls? Could you dump the classpath that is in effect before your failing jni call? System.getProperty( java.class.path), and for that matter, java.library.path, or getenv(CLASSPATH) and provide an ls -l of the core.jar from the class path, run as the user that owns the process, on the machine that the process is running on. !-- from hadoop-default.xml -- property namefs.hdfs.impl/name valueorg.apache.hadoop.hdfs.DistributedFileSystem/value descriptionThe FileSystem for hdfs: uris./description /property On Mon, Mar 23, 2009 at 9:47 PM, Jeff Eastman j...@windwardsolutions.comwrote: This looks somewhat similar to my Subtle Classloader Issue from yesterday. I'll be watching this thread too. Jeff Saptarshi Guha wrote: Hello, I'm using some JNI interfaces, via a R. My classpath contains all the jar files in $HADOOP_HOME and $HADOOP_HOME/lib My class is public SeqKeyList() throws Exception { config = new org.apache.hadoop.conf.Configuration(); config.addResource(new Path(System.getenv(HADOOP_CONF_DIR) +/hadoop-default.xml)); config.addResource(new Path(System.getenv(HADOOP_CONF_DIR) +/hadoop-site.xml)); System.out.println(C=+config); filesystem = FileSystem.get(config); System.out.println(C=+config+F= +filesystem); System.out.println(filesystem.getUri().getScheme()); } I am using a distributed filesystem (org.apache.hadoop.hdfs.DistributedFileSystem for fs.hdfs.impl). When run from the command line and this class is created everything works fine When called using jni I get java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem Is this a jni issue? How can it work from the commandline using the same classpath, yet throw this is exception when run via JNI? Saptarshi Guha -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
JNI and calling Hadoop jar files
Hello, I'm using some JNI interfaces, via R. My classpath contains all the jar files in $HADOOP_HOME and $HADOOP_HOME/lib. My class is:

public SeqKeyList() throws Exception {
  config = new org.apache.hadoop.conf.Configuration();
  config.addResource(new Path(System.getenv("HADOOP_CONF_DIR") + "/hadoop-default.xml"));
  config.addResource(new Path(System.getenv("HADOOP_CONF_DIR") + "/hadoop-site.xml"));
  System.out.println("C=" + config);
  filesystem = FileSystem.get(config);
  System.out.println("C=" + config + " F=" + filesystem);
  System.out.println(filesystem.getUri().getScheme());
}

I am using a distributed filesystem (org.apache.hadoop.hdfs.DistributedFileSystem for fs.hdfs.impl). When run from the command line and this class is created, everything works fine. When called using JNI I get java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem Is this a JNI issue? How can it work from the command line using the same classpath, yet throw this exception when run via JNI? Saptarshi Guha
Task Side Effect files and copying(getWorkOutputPath)
Hello, I would like to produce side-effect files which will later be copied to the output folder. I am using FileOutputFormat, and in the Map's close() method I copy files (from the local tmp/ folder) to FileOutputFormat.getWorkOutputPath(job):

void close() {
  if (shouldcopy) {
    ArrayList<Path> lop = new ArrayList<Path>();
    for (String ff : tempdir.list()) {
      lop.add(new Path(temppfx + ff));
    }
    dstFS.moveFromLocalFile(lop.toArray(new Path[]{}), dstPath);
  }
}

However, this throws an error: java.io.IOException: `hdfs://X:54310/tmp/testseq/_temporary/_attempt_200903160945_0010_m_00_0': specified destination directory does not exist I thought this was the right place to drop side-effect files. Prior to this I was copying to the output folder, but many were not copied - or in fact all were, but during the reduce output stage many were deleted - I am not sure (I used NullOutputFormat and all the files were present in the output folder). So I resorted to getWorkOutputPath, which threw the above exception. So if I'm using FileOutputFormat, and my maps and/or reduces produce side-effect files on the local FS: 1) when should I copy them to the DFS (e.g. the close method? or one at a time in the map/reduce method)? 2) Where should I copy them to? I am using Hadoop 0.19 and have set jobConf.setNumTasksToExecutePerJvm(-1); Also, each side-effect file produced has a unique name, i.e. there is no overwriting. Thank you Saptarshi Guha
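A sketch of an alternative worth trying here, assuming the 0.19-era API: instead of writing to the local FS and moving files in close(), write side-effect files directly under FileOutputFormat.getWorkOutputPath(job); the framework promotes that directory's contents to the job output folder when the task attempt commits. The helper below is illustrative, not from the thread.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SideEffectWriter {
  // Write a side-effect file under the task's temporary work directory.
  // Files written here are moved to the job output folder only when the
  // task attempt commits, so speculative or failed attempts don't collide.
  public static void writeSideFile(JobConf job, String name, byte[] data)
      throws IOException {
    Path workDir = FileOutputFormat.getWorkOutputPath(job);
    Path out = new Path(workDir, name); // e.g. a unique per-task file name
    FileSystem fs = out.getFileSystem(job);
    FSDataOutputStream s = fs.create(out);
    try {
      s.write(data);
    } finally {
      s.close();
    }
  }
}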
RecordReader and non thread safe JNI libraries
Hello, My RecordReader subclass reads from object X. To parse this object and emit records, I need to use a C library and a JNI wrapper.

public boolean next(LongWritable key, BytesWritable value) throws IOException {
  if (leftover == 0) return false;
  long wi = pos + split.getStart();
  key.set(wi);
  value.readFields(X.at(wi));
  pos++;
  leftover--;
  return true;
}

X.at uses the JNI lib to read record number wi. My question is: who is running this? 1) For a given job, is one instance of this running on each tasktracker, reading records and feeding them to the mappers on its machine? Or, 2) as I have mapred.tasktracker.map.tasks.maximum == 7, does each JVM launched have one RecordReader running, feeding records to the maps its JVM is running? If it's either (1) or (2), I guess I'm safe from threading issues. Please correct me if I'm totally wrong. Regards Saptarshi Guha
Re: RecordReader and non thread safe JNI libraries
Hello, I am quite confused and my email seems to prove it. My question is essentially, I need to use this non thread safe library in the Mapper, Reducer and RecordReader. assume, i do not create threads. Will I run into any thread safety issues? In a given JVM, the maps will run sequentially, so will the reduces, but will maps run alongside recorder reader? Hope this is clearer. Regards Saptarshi Guha On Sun, Mar 1, 2009 at 11:07 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, My RecordReader subclass reads from object X. To parse this object and emit records, i need the use of a C library and a JNI wrapper. public boolean next(LongWritable key, BytesWritable value) throws IOException { if (leftover == 0) return false; long wi = pos + split.getStart(); key.set(wi); value.readFields(X.at( wi); pos ++; leftover --; return true; } X.at uses the JNI lib to read a record number wi My question is who running this? 1) For a given job, is one instance of this running on each tasktracker? reading records and feeding to the mappers on its machine? Or, 2) as I have mapred.tasktracker.map.tasks.maximum == 7, does each jvm launched have one RecordReader running feeding records to the maps its jvm is running. If it's either (1) or (2), I guess I'm safe from threading issues. Please correct me if i'm totally wrong. Regards Saptarshi Guha
Re: Limit number of records or total size in combiner input using jobconf?
Thank you. On Fri, Feb 20, 2009 at 5:34 PM, Chris Douglas chri...@yahoo-inc.com wrote: So here are my questions: (1) is there a jobconf hint to limit the number of records in kviter? I can (and have) made a fix to my code that processes the values in a combiner step in batches (i.e takes N at a go,processes that and repeat), but was wondering if i could just set an option. Approximately and indirectly, yes. You can limit the amount of memory allocated to storing serialized records in memory (io.sort.mb) and the percentage of that space reserved for storing record metadata (io.sort.record.percent, IIRC). That can be used to limit the number of records in each spill, though you may also need to disable the combiner during the merge, where you may run into the same problem. You're almost certainly better off designing your combiner to scale well (as you have), since you'll hit this in the reduce, too. Since this occurred in the MapContext, changing the number of reducers wont help. (2) How does changing the number of reducers help at all? I have 7 machines, so I feel 11 (a prime close to 7, why a prime?) is good enough (some machines are 16GB others 32GB) Your combiner will look at all the records for a partition and only those records in a partition. If your partitioner distributes your records evenly in a particular spill, then increasing the total number of partitions will decrease the number of records your combiner considers in each call. For most partitioners, whether the number of reducers is prime should be irrelevant. -C
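For reference, a sketch of the knobs Chris mentions, assuming 0.19-era property names from hadoop-default.xml; the values are purely illustrative and only limit the combiner input indirectly, by forcing smaller and more frequent spills:

import org.apache.hadoop.mapred.JobConf;

public class SpillTuning {
  public static void tune(JobConf conf) {
    // Smaller serialization buffer => more frequent spills => fewer records
    // per combiner invocation (an indirect limit only, per the reply above).
    conf.setInt("io.sort.mb", 50);
    // Fraction of io.sort.mb reserved for record metadata (accounting info).
    conf.set("io.sort.record.percent", "0.10");
    // How full the buffer gets before a background spill starts.
    conf.set("io.sort.spill.percent", "0.70");
  }
}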
Limit number of records or total size in combiner input using jobconf?
Hello, Running a MR job on 7 machines failed when it came to processing 53GB. Browsing the errors: org.saptarshiguha.rhipe.GRMapreduce$GRCombiner.reduce(GRMapreduce.java:149) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:1106) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:979) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:391) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:876) The reason why my line failed is that there were too many records. I offload calculations to another program and it screamed out of memory. Looking at the source in sortAndSpill where this happened (Hadoop 0.19):

int spstart = spindex;
while (spindex < endPosition &&
       kvindices[kvoffsets[spindex % kvoffsets.length] + PARTITION] == i) {
  ++spindex;
}
// Note: we would like to avoid the combiner if we've fewer
// than some threshold of records for a partition
if (spstart != spindex) {
  combineCollector.setWriter(writer);
  RawKeyValueIterator kvIter = new MRResultIterator(spstart, spindex);
  combineAndSpill(kvIter, combineInputCounter);
}

So here are my questions: (1) Is there a jobconf hint to limit the number of records in kvIter? I can (and have) made a fix to my code that processes the values in a combiner step in batches (i.e. takes N at a go, processes that and repeats), but was wondering if I could just set an option. Since this occurred in the MapContext, changing the number of reducers won't help. (2) How does changing the number of reducers help at all? I have 7 machines, so I feel 11 (a prime close to 7, why a prime?) is good enough (some machines are 16GB, others 32GB). Regards Saptarshi -- Saptarshi Guha - saptarshi.g...@gmail.com
Very large file copied to cluster, and the copy fails. All blocks bad
Hello, I have a 42 GB file on the local fs (call the machine A) which I need to copy to an HDFS (replication 1); according to the HDFS webtracker it has 208GB across 7 machines. Note, the machine A has about 80 GB total, so there is no place to store copies of the file. Using the command bin/hadoop dfs -put /local/x /remote/tmp/ fails, with all blocks being bad. This is not surprising since the file is copied entirely to the HDFS region that resides on A. Had the file been copied across all machines, this would not have failed. I have more experience with mapreduce and not much with the hdfs side of things. Is there a configuration option I'm missing that forces the file to be split across the machines (when it is being copied)? -- Saptarshi Guha - saptarshi.g...@gmail.com
Re: Very large file copied to cluster, and the copy fails. All blocks bad
Did you run the copy command from machine A? Yes, exactly. I had to have the client doing the copy either on the master or on an off-cluster Thanks! I uploaded it from an off cluster (i.e not participating in the hdfs) and it worked splendidly. Regards Saptarshi On Thu, Feb 12, 2009 at 11:03 PM, TCK moonwatcher32...@yahoo.com wrote: I believe that if you do the copy from an hdfs client that is on the same machine as a data node, then for each block the primary copy always goes to that data node, and only the replicas get distributed among other data nodes. I ran into this issue once -- I had to have the client doing the copy either on the master or on an off-cluster node. -TCK --- On Thu, 2/12/09, Saptarshi Guha saptarshi.g...@gmail.com wrote: From: Saptarshi Guha saptarshi.g...@gmail.com Subject: Very large file copied to cluster, and the copy fails. All blocks bad To: core-user@hadoop.apache.org core-user@hadoop.apache.org Date: Thursday, February 12, 2009, 9:50 PM hello, I have a 42 GB file on the local fs(call the machine A) which i need to copy to a HDFS (replicattion 1), according the HDFS webtracker it has 208GB across 7 machines. Note, the machine A has about 80 GB total, so there is no place to store copies of the file. Using the command bin/hadoop dfs -put /local/x /remote/tmp/ fails, with all blocks being bad. This is not surprising since the file is copied entirely to the HDFS region that resides on A. Had the file been copied across all machines, this would not have failed. I have more experience with mapreduce and not much with the hdfs side of things. Is there a configuration option i'm missing that forces the file to be split across the machines(when it is being copied)? -- Saptarshi Guha - saptarshi.g...@gmail.com -- Saptarshi Guha - saptarshi.g...@gmail.com
FileInputFormat, FileSplit, and LineRecordReader: where are they run?
Hello All, In order to get a better understanding of Hadoop, i've started reading the source and have a question The FileInputFormat, reads in files, splits into splitsizes (which may be bigger than block size) and creates FileSplits. The FileSplits contain the start, length *and* the locations of the split. The LineRecordReader, receives a split and emits records. So far I think i'm correct(hopefully). Now, my questions Does the LineRecordReader run on a machine, in some sense, closest to the location of the splits? i.e Q1: If the split is less than the block size, then the split is located on one machine (apart from replicates): does the LineRecordReader run on the machine which contains the split? Or at least attempt to? Q2. If a split is greater than the block size, it spans multiple blocks which could reside on more than 1 machine. In this case, on which machine does the LineRecordReader run? The machine 'closest' to them? Please correct me if i'm wrong. Thank you Saptarshi -- Saptarshi Guha - saptarshi.g...@gmail.com
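A small sketch for inspecting this, assuming the 0.19-era old API and a TextInputFormat input; it prints the hosts each split reports (typically the hosts of the block at the start of the split), which are the locations the scheduler prefers when placing the map task. The LineRecordReader itself always runs inside the map task's JVM on whichever node the task lands; for a split spanning several blocks, it simply streams any non-local blocks over the network.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitLocations {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SplitLocations.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    TextInputFormat fmt = new TextInputFormat();
    fmt.configure(conf);
    // Each split reports the hosts holding its data; the JobTracker tries to
    // run the map task (and hence its RecordReader) on one of these hosts.
    for (InputSplit split : fmt.getSplits(conf, 1)) {
      System.out.println(split + " -> hosts: "
          + java.util.Arrays.toString(split.getLocations()));
    }
  }
}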
NLineInputFormat and very high number of maptasks
Hello, When I use NLineInputFormat and I output: System.out.println("mapred.map.tasks:" + jobConf.get("mapred.map.tasks")); I see 51, but on the jobtracker site, the number is 18114. Yet with TextInputFormat it shows 51. I'm using Hadoop 0.19. Any ideas why? Regards Saptarshi -- Saptarshi Guha - saptarshi.g...@gmail.com
Re: NLineInputFormat and very high number of maptasks
Sorry, I see - every line is now a map task - one split, one task (in this case N=1 line per split). Is that correct? Saptarshi On Jan 20, 2009, at 11:39 AM, Saptarshi Guha wrote: Hello, When I use NLineInputFormat and I output: System.out.println("mapred.map.tasks:" + jobConf.get("mapred.map.tasks")); I see 51, but on the jobtracker site, the number is 18114. Yet with TextInputFormat it shows 51. I'm using Hadoop 0.19. Any ideas why? Regards Saptarshi -- Saptarshi Guha - saptarshi.g...@gmail.com Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha If the church put in half the time on covetousness that it does on lust, this would be a better world. -- Garrison Keillor, Lake Wobegon Days
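That matches my understanding: NLineInputFormat makes one split per N input lines (N defaults to 1), so the map-task count is driven by the line count rather than the mapred.map.tasks hint. A minimal sketch of raising N, assuming the 0.19-era old API; the value 500 is illustrative:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineSetup {
  public static void configure(JobConf conf) {
    conf.setInputFormat(NLineInputFormat.class);
    // One split (and hence one map task) per N input lines. The default is 1,
    // which is why the task count can be far larger than the mapred.map.tasks hint.
    conf.setInt("mapred.line.input.format.linespermap", 500);
  }
}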
Sorting on several columns using KeyFieldSeparator and Paritioner
Hello, I have a file with n columns, some which are text and some numeric. Given a sequence of indices, i would like to sort on those indices i.e first on Index1, then within Index2 and so on. In the example code below, i have 3 columns, numeric, text, numeric, space separated. Sort on 2(reverse), then 1(reverse,numeric) and lastly 3 Though my code runs (and gives wrong results,col 2 is sorted in reverse, and within that col3 which is treated as tex and then col1 ) on the local, when distributed I get a merge error - my guess is fixing the latter fixes the former. This is the error: java.io.IOException: Final merge failed at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.createKVIterator(ReduceTask.java:2093) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$400(ReduceTask.java:457) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380) at org.apache.hadoop.mapred.Child.main(Child.java:155) Caused by: java.lang.ArrayIndexOutOfBoundsException: 562 at org.apache.hadoop.io.WritableComparator.compareBytes(WritableComparator.java:128) at org.apache.hadoop.mapred.lib.KeyFieldBasedComparator.compareByteSequence(KeyFieldBasedComparator.java:109) at org.apache.hadoop.mapred.lib.KeyFieldBasedComparator.compare(KeyFieldBasedComparator.java:85) at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:308) at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:144) at org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:103) at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:270) at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:285) at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:108) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.createKVIterator(ReduceTask.java:2087) ... 
3 more

Thanks for your time. And the code (not too big) is: ==CODE==

public class RMRSort extends Configured implements Tool {

  static class RMRSortMap extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      output.collect(value, value);
    }
  }

  static class RMRSortReduce extends MapReduceBase
      implements Reducer<Text, Text, NullWritable, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<NullWritable, Text> output, Reporter reporter)
        throws IOException {
      NullWritable n = NullWritable.get();
      while (values.hasNext()) output.collect(n, values.next());
    }
  }

  static JobConf createConf(String rserveport, String uid, String infolder, String outfolder) {
    Configuration defaults = new Configuration();
    JobConf jobConf = new JobConf(defaults, RMRSort.class);
    jobConf.setJobName("Sorter: " + uid);
    jobConf.addResource(new Path(System.getenv("HADOOP_CONF_DIR") + "/hadoop-site.xml"));
    // jobConf.set("mapred.job.tracker", "local");
    jobConf.setMapperClass(RMRSortMap.class);
    jobConf.setReducerClass(RMRSortReduce.class);
    jobConf.set("map.output.key.field.separator", fsep);
    jobConf.setPartitionerClass(KeyFieldBasedPartitioner.class);
    jobConf.set("mapred.text.key.partitioner.options", "-k2,2 -k1,1 -k3,3");
    jobConf.setOutputKeyComparatorClass(KeyFieldBasedComparator.class);
    jobConf.set("mapred.text.key.comparator.options", "-k2r,2r -k1rn,1rn -k3n,3n");
    // infolder, outfolder information removed
    jobConf.setMapOutputKeyClass(Text.class);
    jobConf.setMapOutputValueClass(Text.class);
    jobConf.setOutputKeyClass(NullWritable.class);
    return (jobConf);
  }

  public int run(String[] args) throws Exception {
    return (1);
  }
}

-- Saptarshi Guha - saptarshi.g...@gmail.com
A reporter thread during the reduce stage for a long running line
Hello, Sorry for the puzzling subject. I have a single long running /statement/ in my reduce method, so the framework might assume my reduce is not responding and kill it. I solved the problem in the map method by subclassing MapRunner and running a thread which calls reporter.progress() every minute or so. However, the same thread does not run during the reduce (I checked this by setting a status string in the thread, which did not appear (on the Jobtracker website) during the reduce stage but did appear during the map stage). Hadoop v0.20 appears to solve this by having separate run methods for both Map and Reduce; however, I'm using v0.19. I scanned the Streaming source and it only subclasses MapRunner, so I assume it too has the same limitation (probably wrong; if so can someone point me to the location?). Is there a way around this, /without/ starting a thread in the reduce function? Hadoop v0.19. Many thanks Saptarshi -- Saptarshi Guha - saptarshi.g...@gmail.com
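For completeness, a sketch of the thread-based workaround described above (the author would prefer to avoid starting a thread inside reduce(), but this is the same heartbeat idea used on the map side); names are illustrative:

import org.apache.hadoop.mapred.Reporter;

public class ProgressHeartbeat {
  // Call reporter.progress() periodically from a daemon thread while a
  // single long-running statement executes, so the framework does not
  // time the task out (mapred.task.timeout).
  public static Thread start(final Reporter reporter, final long intervalMs) {
    Thread t = new Thread(new Runnable() {
      public void run() {
        try {
          while (!Thread.currentThread().isInterrupted()) {
            reporter.progress();
            Thread.sleep(intervalMs);
          }
        } catch (InterruptedException e) {
          // fall through and exit
        }
      }
    });
    t.setDaemon(true);
    t.start();
    return t;
  }
}

Usage inside reduce() would be roughly: Thread hb = ProgressHeartbeat.start(reporter, 60000L); try { longRunningCall(); } finally { hb.interrupt(); }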
Re: Combiner run specification and questions
. So as long as the correctness of the computation doesn't rely on a transformation performed in the combiner, it should be OK. Right, I had the same thought. However, this restriction limits the scalability of your solution. It might be necessary to work around R's limitations by breaking up large computations into intermediate steps, possibly by explicitly instantiating and running the combiner in the reduce. So, I explicitly call the combiner? However, at times the reducer needs all the values, so calling the combiner would not always work here. However, if I recall correctly (from reading the Google paper), one does not get a **humongous** number of values for a single key. 1) I am guaranteed a reducer. So, The combiner, if defined, will run zero or more times on records emitted from the map, before being fed to the reduce. This zero case possibility worries me. However you mention that it occurs when the collector spills in the map. I have noticed this happening - what does 'spilling' mean? Records emitted from the map are serialized into a buffer, which is periodically written to disk when it is (sufficiently) full. Each of these batch writes is a spill. In casual usage, it refers to any time when records need to be written to disk. The merge of intermediate files into the final map output and merging in-memory segments to disk in the reduce are two examples. -C Thanks for the explanation. Regards Saptarshi
Re: Combiner run specification and questions
I agree with the requirement that the key does not change. Of course, the values can change. I am primarily worried that the combiner might not be run at all - I have 'successfully' integrated Hadoop and R i.e the user can provide map/reduce functions written in R. However, R is not great with memory management, and if I have N (N is huge) values for a given key K, then R will baulk when it comes to processing this. Thus the combiner. The combiner will process n values for K, and ultimately, a few values for K in the reducer . If the combiner where not to run, R would collapse under the load. 1) I am guaranteed a reducer. So, The combiner, if defined, will run zero or more times on records emitted from the map, before being fed to the reduce. This zero case possibility worries me. However you mention, that it occurs collector spills in the map I have noticed this happening - what 'spilling' mean? Thank you Saptarshi On Jan 5, 2009, at 10:22 PM, Chris Douglas wrote: The combiner, if defined, will run zero or more times on records emitted from the map, before being fed to the reduce. It is run when the collector spills in the map and in some merge cases. If the combiner transforms the key, it is illegal to change its type, the partition to which it is assigned, or its ordering. For example, if you emit a record (k,v) from your map and (k',v) from the combiner, your comparator is C(K,K) and your partitioner function is P(K), it must be the case that P(k) == P(k') and C(k,k') == 0. If either of these does not hold, the semantics to the reduce are broken. Clearly, if k is not transformed (as in true for most combiners), this holds trivially. As was mentioned earlier, the purpose of the combiner is to compress data pulled across the network and spilled to disk. It should not affect the correctness or, in most cases, the output of the job. -C On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote: Hello, I would just like to confirm, when does the Combiner run(since it might not be run at all,see below). I read somewhere that it is run, if there is at least one reduce (which in my case i can be sure of). I also read, that the combiner is an optimization. However, it is also a chance for a function to transform the key/value (keeping the class the same i.e the combiner semantics are not changed) and deal with a smaller set ( this could be done in the reducer but the number of values for a key might be relatively large). However, I guess it would be a mistake for reducer to expect its input coming from a combiner? E.g if there are only 10 value corresponding to a key(as outputted by the mapper), will these 10 values go straight to the reducer or to the reducer via the combiner? Here I am assuming my reduce operations does not need all the values for a key to work(so that a combiner can be used) i.e additive operations. Thank you Saptarshi On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org wrote: The Combiner may be called 0, 1, or many times on each key between the mapper and reducer. Combiners are just an application specific optimization that compress the intermediate output. They should not have side effects or transform the types. Unfortunately, since there isn't a separate interface for Combiners, there is isn't a great place to document this requirement. I've just filed HADOOP-4668 to improve the documentation. 
-- Saptarshi Guha - saptarshi.g...@gmail.com Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha The way of the world is to praise dead saints and prosecute live ones. -- Nathaniel Howe
Map input records(on JobTracker website) increasing and decreasing
Hello, When I check the job tracker web page, and look at the Map Input records read,the map input records goes up to say 1.4MN and then drops to 410K and then goes up again. The same happens with input/output bytes and output records. Why is this? Is there something wrong with the mapper code? In my map function, i assume I have received one line of input. The oscillatory behavior does not occur for tiny datasets, but for 1GB of data (tiny for others) i see this happening. Thank s Saptarshi -- Saptarshi Guha - saptarshi.g...@gmail.com
Re: Problem loading hadoop-site.xml - dumping parameters
Hello, I have set my HADOOP_CONF_DIR to the conf folder and still not loading. I have to manually set the options when I create my conf. Have you resolved this? Regards Saptarshi On Tue, Dec 30, 2008 at 5:25 PM, g00dn3ss g00dn...@gmail.com wrote: Hey all, I have a similar issue. I am specifically having problems with the config option mapred.child.java.opts. I set it to -Xmx1024m and it uses -Xmx200m regardless. I am running Hadoop 0.18.2 and I'm pretty sure this option was working in the previous versions of Hadoop I was using. I am not explicitly setting HADOOP_CONF_DIR. My site config is in ${HADOOP_HOME}/conf. Just to test things further, I wrote a small map task to print out the ENV values and it has the correct value for HADOOP_HOME, HADOOP_LOG_DIR, HADOOP_OPTS, etc... I also printed out the key/values in the JobConf passed to the mapper and it has my specified values for fs.default.name and mapred.job.tracker. Other settings like dfs.name.dir, dfs.data.dir, and mapred.child.java.opts do not have my values. Any suggestion where to look at next? Thanks! On Mon, Dec 29, 2008 at 10:27 PM, Amareshwari Sriramadasu amar...@yahoo-inc.com wrote: Saptarshi Guha wrote: Hello, I had previously emailed regarding heap size issue and have discovered that the hadoop-site.xml is not loading completely, i.e Configuration defaults = new Configuration(); JobConf jobConf = new JobConf(defaults, XYZ.class); System.out.println(1:+jobConf.get(mapred.child.java.opts)); System.out.println(2:+jobConf.get(mapred.map.tasks)); System.out.println(3:+jobConf.get(mapred.reduce.tasks)); System.out.println(3:+jobConf.get(mapred.tasktracker.reduce.tasks.maximum)); returns -Xmx200m, 2,1,2 respectively, even though the numbers in the hadoop-site.xml are very different. Is there a way for hadoop to dump the parameters read in from hadoop-site.xml and hadoop-default.xml? Is your hadoop-site.xml present in the conf (HADOOP_CONF_DIR) directory? http://hadoop.apache.org/core/docs/r0.19.0/cluster_setup.html#Configuration -Amareshwari -- Saptarshi Guha - saptarshi.g...@gmail.com
Re: Problem loading hadoop-site.xml - dumping parameters
For some strange reason, neither hadoop-default.xml nor hadoop-site.xml is loading. Both files are in hadoop-0.19.0/conf/ folder. HADOOP_CONF_DIR points to thi In some java code, (not a job), i did Configuration cf = new Configuration(); String[] xu= new String[]{io.file.buffer.size,io.sort.mb,io.sort.factor,mapred.tasktracker.map.tasks.maximum,hadoop.logfile.count,hadoop.logfile.size,hadoop.tmp.dir,dfs.replication,mapred.map.tasks,mapred.job.tracker,fs.default.name}; for(String x : xu){ LOG.info(x+: +cf.get(x)); } All the values returned were default values(as noted in hadoop-default.xml). Changing the value in hadoop-default.xml were not reflected here. Grepping the source reveal the defaults are hardcoded. So it seems neither is hadoop-default.xml not hadooop-site.xml loading. Also, Configuration.java mentionds that hadoop-site.xml is depecated (thought still loaded) Any suggestions? Regards Saptarshi On Mon, Jan 5, 2009 at 2:26 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, I have set my HADOOP_CONF_DIR to the conf folder and still not loading. I have to manually set the options when I create my conf. Have you resolved this? Regards Saptarshi On Tue, Dec 30, 2008 at 5:25 PM, g00dn3ss g00dn...@gmail.com wrote: Hey all, I have a similar issue. I am specifically having problems with the config option mapred.child.java.opts. I set it to -Xmx1024m and it uses -Xmx200m regardless. I am running Hadoop 0.18.2 and I'm pretty sure this option was working in the previous versions of Hadoop I was using. I am not explicitly setting HADOOP_CONF_DIR. My site config is in ${HADOOP_HOME}/conf. Just to test things further, I wrote a small map task to print out the ENV values and it has the correct value for HADOOP_HOME, HADOOP_LOG_DIR, HADOOP_OPTS, etc... I also printed out the key/values in the JobConf passed to the mapper and it has my specified values for fs.default.name and mapred.job.tracker. Other settings like dfs.name.dir, dfs.data.dir, and mapred.child.java.opts do not have my values. Any suggestion where to look at next? Thanks! On Mon, Dec 29, 2008 at 10:27 PM, Amareshwari Sriramadasu amar...@yahoo-inc.com wrote: Saptarshi Guha wrote: Hello, I had previously emailed regarding heap size issue and have discovered that the hadoop-site.xml is not loading completely, i.e Configuration defaults = new Configuration(); JobConf jobConf = new JobConf(defaults, XYZ.class); System.out.println(1:+jobConf.get(mapred.child.java.opts)); System.out.println(2:+jobConf.get(mapred.map.tasks)); System.out.println(3:+jobConf.get(mapred.reduce.tasks)); System.out.println(3:+jobConf.get(mapred.tasktracker.reduce.tasks.maximum)); returns -Xmx200m, 2,1,2 respectively, even though the numbers in the hadoop-site.xml are very different. Is there a way for hadoop to dump the parameters read in from hadoop-site.xml and hadoop-default.xml? Is your hadoop-site.xml present in the conf (HADOOP_CONF_DIR) directory? http://hadoop.apache.org/core/docs/r0.19.0/cluster_setup.html#Configuration -Amareshwari -- Saptarshi Guha - saptarshi.g...@gmail.com -- Saptarshi Guha - saptarshi.g...@gmail.com
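A small sketch for debugging which resources are actually being picked up, assuming the 0.19-era Configuration API; the property name is illustrative and HADOOP_CONF_DIR is assumed to be set. Configuration.toString() lists the resource files it managed to load, which quickly shows whether hadoop-site.xml is on the classpath at all:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ConfDump {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // toString() lists the resources actually found on the classpath
    // (e.g. hadoop-default.xml, hadoop-site.xml).
    System.out.println(conf);
    // Loading from an explicit filesystem path sidesteps the classpath entirely.
    conf.addResource(new Path(System.getenv("HADOOP_CONF_DIR"), "hadoop-site.xml"));
    System.out.println("mapred.child.java.opts = " + conf.get("mapred.child.java.opts"));
  }
}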
Re: Map input records(on JobTracker website) increasing and decreasing
True, i suspected that, however none died. (Nothing in the Tasks killed/Failed field) On Mon, Jan 5, 2009 at 4:04 PM, Doug Cutting cutt...@apache.org wrote: Values can drop if tasks die and must be re-run. Doug Aaron Kimball wrote: The actual number of input records is most likely steadily increasing. The counters on the web site are inaccurate until the job is complete; their values will fluctuate wildly. I'm not sure why this is. - Aaron On Mon, Jan 5, 2009 at 8:34 AM, Saptarshi Guha saptarshi.g...@gmail.comwrote: Hello, When I check the job tracker web page, and look at the Map Input records read,the map input records goes up to say 1.4MN and then drops to 410K and then goes up again. The same happens with input/output bytes and output records. Why is this? Is there something wrong with the mapper code? In my map function, i assume I have received one line of input. The oscillatory behavior does not occur for tiny datasets, but for 1GB of data (tiny for others) i see this happening. Thank s Saptarshi -- Saptarshi Guha - saptarshi.g...@gmail.com -- Saptarshi Guha - saptarshi.g...@gmail.com
Re: Problem loading hadoop-site.xml - dumping parameters
Possibly. When i force load the configuration Configuration cf = new Configuration(); LOG.info(Adding new resource); cf.addResource(hadoop-site.xml) It doesn't load, though hadoop-site.xml is present in hadoop-0.19.0/conf, and the hadoop-0.19 folder is replicated across all the machines. However if load it via cf.addResource(new Path(/home/godhuli/custom/hadoop-0.19.0/conf/hadoop-site.xml)); It works! So clearly, there is some other hadoop-site.xml in the classpath, but where? Should I add /home/godhuli/custom/hadoop-0.19.0/conf/ to the Hadoop classpath in hadoop-env.sh? Thanks Saptarshi On Mon, Jan 5, 2009 at 5:09 PM, Jason Venner ja...@attributor.com wrote: somehow you have alternate versions of the file earlier in the class path. Perhaps someone's empty copies are bundled into one of your application jar files. Or perhaps the configurationfiles are not distributed to the datanodes in the expected locations. Saptarshi Guha wrote: For some strange reason, neither hadoop-default.xml nor hadoop-site.xml is loading. Both files are in hadoop-0.19.0/conf/ folder. HADOOP_CONF_DIR points to thi In some java code, (not a job), i did Configuration cf = new Configuration(); String[] xu= new String[]{io.file.buffer.size,io.sort.mb,io.sort.factor,mapred.tasktracker.map.tasks.maximum,hadoop.logfile.count,hadoop.logfile.size,hadoop.tmp.dir,dfs.replication,mapred.map.tasks,mapred.job.tracker,fs.default.name}; for(String x : xu){ LOG.info(x+: +cf.get(x)); } All the values returned were default values(as noted in hadoop-default.xml). Changing the value in hadoop-default.xml were not reflected here. Grepping the source reveal the defaults are hardcoded. So it seems neither is hadoop-default.xml not hadooop-site.xml loading. Also, Configuration.java mentionds that hadoop-site.xml is depecated (thought still loaded) Any suggestions? Regards Saptarshi On Mon, Jan 5, 2009 at 2:26 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, I have set my HADOOP_CONF_DIR to the conf folder and still not loading. I have to manually set the options when I create my conf. Have you resolved this? Regards Saptarshi On Tue, Dec 30, 2008 at 5:25 PM, g00dn3ss g00dn...@gmail.com wrote: Hey all, I have a similar issue. I am specifically having problems with the config option mapred.child.java.opts. I set it to -Xmx1024m and it uses -Xmx200m regardless. I am running Hadoop 0.18.2 and I'm pretty sure this option was working in the previous versions of Hadoop I was using. I am not explicitly setting HADOOP_CONF_DIR. My site config is in ${HADOOP_HOME}/conf. Just to test things further, I wrote a small map task to print out the ENV values and it has the correct value for HADOOP_HOME, HADOOP_LOG_DIR, HADOOP_OPTS, etc... I also printed out the key/values in the JobConf passed to the mapper and it has my specified values for fs.default.name and mapred.job.tracker. Other settings like dfs.name.dir, dfs.data.dir, and mapred.child.java.opts do not have my values. Any suggestion where to look at next? Thanks! 
On Mon, Dec 29, 2008 at 10:27 PM, Amareshwari Sriramadasu amar...@yahoo-inc.com wrote: Saptarshi Guha wrote: Hello, I had previously emailed regarding heap size issue and have discovered that the hadoop-site.xml is not loading completely, i.e Configuration defaults = new Configuration(); JobConf jobConf = new JobConf(defaults, XYZ.class); System.out.println(1:+jobConf.get(mapred.child.java.opts)); System.out.println(2:+jobConf.get(mapred.map.tasks)); System.out.println(3:+jobConf.get(mapred.reduce.tasks)); System.out.println(3:+jobConf.get(mapred.tasktracker.reduce.tasks.maximum)); returns -Xmx200m, 2,1,2 respectively, even though the numbers in the hadoop-site.xml are very different. Is there a way for hadoop to dump the parameters read in from hadoop-site.xml and hadoop-default.xml? Is your hadoop-site.xml present in the conf (HADOOP_CONF_DIR) directory? http://hadoop.apache.org/core/docs/r0.19.0/cluster_setup.html#Configuration -Amareshwari -- Saptarshi Guha - saptarshi.g...@gmail.com -- Saptarshi Guha - saptarshi.g...@gmail.com
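A minimal sketch of the workaround discussed in this thread, assuming the conf directory is not on the client classpath: force-load the site file by absolute Path, since Configuration.addResource(String) only searches the classpath while addResource(Path) reads the local filesystem, and then dump a few keys to verify. The path and the property list below are just examples taken from the thread, not a prescription.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ConfDump {
  public static void main(String[] args) {
    Configuration cf = new Configuration();
    // Force-load the site file by absolute path; with the bare string form it
    // would only be found if the conf directory were on the classpath.
    cf.addResource(new Path("/home/godhuli/custom/hadoop-0.19.0/conf/hadoop-site.xml"));

    String[] keys = {"mapred.child.java.opts", "mapred.map.tasks",
                     "mapred.reduce.tasks", "fs.default.name", "mapred.job.tracker"};
    for (String k : keys) {
      // Prints the value actually seen by this client, or null if unset.
      System.out.println(k + ": " + cf.get(k));
    }
  }
}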
Combiner run specification and questions
Hello, I would just like to confirm: when does the Combiner run (since it might not be run at all; see below)? I read somewhere that it is run if there is at least one reduce (which in my case I can be sure of). I also read that the combiner is an optimization. However, it is also a chance for a function to transform the key/value (keeping the classes the same, i.e. the combiner semantics are not changed) and deal with a smaller set (this could be done in the reducer, but the number of values for a key might be relatively large). However, I guess it would be a mistake for the reducer to expect its input to come from a combiner? E.g. if there are only 10 values corresponding to a key (as output by the mapper), will these 10 values go straight to the reducer, or to the reducer via the combiner? Here I am assuming my reduce operation does not need all the values for a key to work (so that a combiner can be used), i.e. additive operations. Thank you Saptarshi On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org wrote: The Combiner may be called 0, 1, or many times on each key between the mapper and reducer. Combiners are just an application-specific optimization that compresses the intermediate output. They should not have side effects or transform the types. Unfortunately, since there isn't a separate interface for Combiners, there isn't a great place to document this requirement. I've just filed HADOOP-4668 to improve the documentation. -- Saptarshi Guha - saptarshi.g...@gmail.com
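Because the combiner may run zero, one, or many times, it is only safe for operations where re-combining already-combined output gives the same answer and the types are unchanged. A sketch of such an additive combiner, assuming Text keys and IntWritable counts; registering the same class as both combiner and reducer is the usual way to guarantee the semantics stay identical. The class name is illustrative.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums the counts for a key; applying it 0, 1, or many times gives the same
// result, and input/output types are identical, so it is safe as a combiner.
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
// In the driver: conf.setCombinerClass(SumReducer.class);
//                conf.setReducerClass(SumReducer.class);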
Problem loading hadoop-site.xml - dumping parameters
Hello, I had previously emailed regarding a heap size issue and have discovered that hadoop-site.xml is not loading completely, i.e. Configuration defaults = new Configuration(); JobConf jobConf = new JobConf(defaults, XYZ.class); System.out.println("1:" + jobConf.get("mapred.child.java.opts")); System.out.println("2:" + jobConf.get("mapred.map.tasks")); System.out.println("3:" + jobConf.get("mapred.reduce.tasks")); System.out.println("4:" + jobConf.get("mapred.tasktracker.reduce.tasks.maximum")); returns -Xmx200m, 2, 1, 2 respectively, even though the numbers in hadoop-site.xml are very different. Is there a way for Hadoop to dump the parameters read in from hadoop-site.xml and hadoop-default.xml? -- Saptarshi Guha - saptarshi.g...@gmail.com
OutofMemory Error, inspite of large amounts provided
Hello, I have work machines with 32GB and allocated 16GB to the heap size. ==hadoop-env.sh== export HADOOP_HEAPSIZE=16384 ==hadoop-site.xml== <property> <name>mapred.child.java.opts</name> <value>-Xmx16384m</value> </property> The same code runs when not being run through Hadoop, but it fails when run in a map task. Are there other places where I can specify the memory for the map tasks? Regards Saptarshi -- Saptarshi Guha - saptarshi.g...@gmail.com
Re: OutofMemory Error, inspite of large amounts provided
On Sun, Dec 28, 2008 at 4:33 PM, Brian Bockelman bbock...@cse.unl.edu wrote: Hey Saptarshi, Watch the running child process while using ps, top, or Ganglia monitoring. Does the map task actually use 16GB of memory, or is the memory not getting set properly? Brian I haven't figured out how to run Ganglia; also, the children quit before I can see their memory usage. The trackers all use 16GB (from the ps command). However, I noticed some use only 512MB (when I managed to catch them in time). Regards
Re: OutofMemory Error, inspite of large amounts provided
Caught it in action. Running ps -e -o 'vsz pid ruser args' |sort -nr|head -5 on a machine where the map task was running 04812 16962 sguha/home/godhuli/custom/jdk1.6.0_11/jre/bin/java -Djava.library.path=/home/godhuli/custom/hadoop/bin/../lib/native/Linux-amd64-64:/home/godhuli/custom/hdfs/mapred/local/taskTracker/jobcache/job_200812282102_0003/attempt_200812282102_0003_m_00_0/work -Xmx200m -Djava.io.tmpdir=/home/godhuli/custom/hdfs/mapred/local/taskTracker/jobcache/job_200812282102_0003/attempt_200812282102_0003_m_00_0/work/tmp -classpath /attempt_200812282102_0003_m_00_0/work -Dhadoop.log.dir=/home/godhuli/custom/hadoop/bin/../logs -Dhadoop.root.logger=INFO,TLA -Dhadoop.tasklog.taskid=attempt_200812282102_0003_m_00_0 -Dhadoop.tasklog.totalLogFileSize=0 org.apache.hadoop.mapred.Child 127.0.0.1 40443 attempt_200812282102_0003_m_00_0 1525207782 Also, the reducer only used 540mb. I notice -Xmx200m was passed, how to change it? Regards Saptarshi On Sun, Dec 28, 2008 at 10:19 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote: On Sun, Dec 28, 2008 at 4:33 PM, Brian Bockelman bbock...@cse.unl.edu wrote: Hey Saptarshi, Watch the running child process while using ps, top, or Ganglia monitoring. Does the map task actually use 16GB of memory, or is the memory not getting set properly? Brian I haven't figured out how to run ganglia, however, also the children quit before i can see their memory usage. The trackers all use 16GB.(from the ps command). However, i noticed some use 512MB only(when i manged to catch them in time) Regards -- Saptarshi Guha - saptarshi.g...@gmail.com
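One thing worth noting: HADOOP_HEAPSIZE only sizes the daemon JVMs (which is why the trackers show 16GB), not the forked task children, which take their heap from mapred.child.java.opts in the job configuration. Since the site file is evidently not being picked up here, a minimal sketch of setting the option directly on the job before submission; the class name and the heap value are placeholders, and this assumes the cluster does not mark the property as final.

import org.apache.hadoop.mapred.JobConf;

public class HeapExample {
  public static void main(String[] args) {
    // Sketch only: set the child JVM heap on the job itself, bypassing the
    // site file. The TaskTracker reads this per job when it forks the child.
    JobConf conf = new JobConf(HeapExample.class);
    conf.set("mapred.child.java.opts", "-Xmx4096m"); // example value
    System.out.println(conf.get("mapred.child.java.opts"));
  }
}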
A question about MultipleOutputFormat
Hello, MultipleOutputFormat is a very good idea. Thanks. I have a question. From the web page: "The reducer wants to write data to different files depending on the actual keys ... and values." Examining TestMultipleTextOutputFormat: class KeyBasedMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, Text>. In my implementation, will the key and value classes be the same as the ones given to the Reduce, or the Map? Thank you Saptarshi -- Saptarshi Guha - saptarshi.g...@gmail.com
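As far as I can tell, the generic parameters are the job's output key/value classes, i.e. what the reducer emits (they would match the map output classes only in a job with zero reduces, where map output goes straight to the OutputFormat). A sketch with a hypothetical class name, assuming Text/Text reducer output:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Sketch: the key and value seen here are the records the reducer collects.
public class KeyBasedOutput extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // Write each key's records under a directory named after the key;
    // 'name' is the default part-NNNNN leaf name.
    return key.toString() + "/" + name;
  }
}
// In the driver: conf.setOutputFormat(KeyBasedOutput.class);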
Classes Not Found even when classpath is mentioned (Starting mapreduce from another app)
Hello, I intend to start a mapreduce job from another java app, using ToolRunner.run method. This works fine on a local job. However when distributed i get java.lang.NoClassDefFoundError: org/rosuda/REngine/Rserve/RserveException at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2389) at java.lang.Class.getConstructor0(Class.java:2699) at java.lang.Class.getDeclaredConstructor(Class.java:1985) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:74) at org.apache.hadoop.mapred.JobConf.getInputFormat(JobConf.java:402) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) Now, i have a CLASSPATH entry in my .bashrc which contains the location of Rserve.jar (which contains the above classes), yet I still get the above error. Do I have to somehow emulate RunJar.main or is there a simpler way out? Thank you Saptarshi -- Saptarshi Guha - saptarshi.g...@gmail.com
Re: Classes Not Found even when classpath is mentioned (Starting mapreduce from another app)
Hello, What if each machine has the identical directory structure and Rserve.jar is present at the same place(i.e the entry in the classpath is identical across machines). The will the latter suggestion work? On Thu, Dec 18, 2008 at 9:02 PM, Zheng Shao zs...@facebook.com wrote: You either need to add the your Rserve.jar to -libjars, or put Rserve.jar in a cluster-accessible NFS mount (and add it to CLASSPATH/HADOOP_CLASSPATH). Zheng -Original Message- From: Saptarshi Guha [mailto:saptarshi.g...@gmail.com] Sent: Thursday, December 18, 2008 3:38 PM To: core-user@hadoop.apache.org Subject: Classes Not Found even when classpath is mentioned (Starting mapreduce from another app) Hello, I intend to start a mapreduce job from another java app, using ToolRunner.run method. This works fine on a local job. However when distributed i get java.lang.NoClassDefFoundError: org/rosuda/REngine/Rserve/RserveException at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2389) at java.lang.Class.getConstructor0(Class.java:2699) at java.lang.Class.getDeclaredConstructor(Class.java:1985) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:74) at org.apache.hadoop.mapred.JobConf.getInputFormat(JobConf.java:402) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) Now, i have a CLASSPATH entry in my .bashrc which contains the location of Rserve.jar (which contains the above classes), yet I still get the above error. Do I have to somehow emulate RunJar.main or is there a simpler way out? Thank you Saptarshi -- Saptarshi Guha - saptarshi.g...@gmail.com -- Saptarshi Guha - saptarshi.g...@gmail.com
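I believe an identical local path only helps if that path is also added to HADOOP_CLASSPATH in hadoop-env.sh on every node, because the child task JVMs inherit the TaskTracker's classpath, not the CLASSPATH of the shell that submitted the job. The -libjars route requires the driver to go through ToolRunner/GenericOptionsParser, roughly like the sketch below; the class name and paths are placeholders, and the job setup is elided.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch of a driver that lets GenericOptionsParser handle -libjars, e.g.
//   hadoop jar myjob.jar MyDriver -libjars /path/to/Rserve.jar <in> <out>
// The listed jar is then made available on each task's classpath.
public class MyDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), MyDriver.class);
    // ... set input/output paths, mapper, reducer, formats here ...
    JobClient.runJob(conf);
    return 0;
  }
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}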
Re: Lookup HashMap available within the Map
The more I use it, i realize Hadoop is not build around shared memory. For these type of things, use TSpaces (IBM), that way you can have a flag to load it once and allow for sharing. Regards Saptarshi On Tue, Nov 25, 2008 at 3:42 PM, Chris K Wensel [EMAIL PROTECTED] wrote: cool. If you need a hand with Cascading stuff, feel free to ping me on the mail list or #cascading irc. lots of other friendly folk there already. ckw On Nov 25, 2008, at 12:35 PM, tim robertson wrote: Thanks Chris, I have a different test running, then will implement that. Might give cascading a shot for what I am doing. Cheers Tim On Tue, Nov 25, 2008 at 9:24 PM, Chris K Wensel [EMAIL PROTECTED] wrote: Hey Tim The .configure() method is what you are looking for i believe. It is called once per task, which in the default case, is once per jvm. Note Jobs are broken into parallel tasks, each task handles a portion of the input data. So you may create your map 100 times, because there are 100 tasks, it will only be created once per jvm. I hope this makes sense. chris On Nov 25, 2008, at 11:46 AM, tim robertson wrote: Hi Doug, Thanks - it is not so much I want to run in a single JVM - I do want a bunch of machines doing the work, it is just I want them all to have this in-memory lookup index, that is configured once per job. Is there some hook somewhere that I can trigger a read from the distributed cache, or is a Mapper.configure() the best place for this? Can it be called multiple times per Job meaning I need to keep some static synchronised indicator flag? Thanks again, Tim On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED] wrote: tim robertson wrote: Thanks Alex - this will allow me to share the shapefile, but I need to one time only per job per jvm read it, parse it and store the objects in the index. Is the Mapper.configure() the best place to do this? E.g. will it only be called once per job? In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM. So, yes, you could access a static cache from Mapper.configure(). Doug -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/ -- Saptarshi Guha - [EMAIL PROTECTED]
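For completeness, the configure()-plus-static-guard pattern discussed earlier in this thread looks roughly like the sketch below. loadLookup() is a hypothetical helper that would read the cached file (shapefile, DistributedCache entry, etc.), and the guard keeps the table from being rebuilt when several tasks reuse one JVM.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: configure() runs once per task, so the static guard means the
// lookup table is built at most once per JVM.
public class LookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private static Map<String, String> lookup;

  public void configure(JobConf job) {
    synchronized (LookupMapper.class) {
      if (lookup == null) {
        lookup = loadLookup(job);   // hypothetical helper
      }
    }
  }

  private static Map<String, String> loadLookup(JobConf job) {
    // Placeholder: fill the map from the cached file here.
    return new HashMap<String, String>();
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String hit = lookup.get(value.toString());
    if (hit != null) {
      out.collect(value, new Text(hit));
    }
  }
}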
Creating a temp folder on the dfs?
Hello, Is there any method for creating a temporary folder on the job's DFS? jobconf.getWorkingDirectory() is actually my home folder (on the DFS); I was thinking of something random-looking (like mkdtemp). Regards Saptarshi -- Saptarshi Guha - [EMAIL PROTECTED]
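There does not appear to be a built-in mkdtemp equivalent, so a small helper that creates a randomly named directory under hadoop.tmp.dir is probably the simplest approach. A sketch; the "tmp-" prefix and the fallback base path are arbitrary choices, not anything Hadoop mandates.

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsTempDir {
  // Creates a randomly named directory under hadoop.tmp.dir and returns it.
  public static Path createTempDir(Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path base = new Path(conf.get("hadoop.tmp.dir", "/tmp"));
    Path dir = new Path(base, "tmp-" + Integer.toHexString(new Random().nextInt()));
    if (!fs.mkdirs(dir)) {
      throw new IOException("could not create " + dir);
    }
    return dir;
  }
}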
Problem building an InputFormat using 0.18.1
Hello, During the build of the InputFormat I get the following error: [javac] /home/sguha/distr/mapred/SListInputFormat.java:118: incompatible types [javac] found : java.lang.String [javac] required: org.apache.hadoop.io.Text [javac] slist_name = job.get("testfoo"); The method where this is happening is public void configure(JobConf job){ slist_name = job.get("testfoo"); } //slist_name is a String I notice that JobConf is a subclass of Configuration. I checked the sources of the latter and indeed there is the method public String get(String name) { return substituteVars(getProps().getProperty(name)); } So why do I get the above error? My Java is fresh (a few weeks old), and I know this is not the correct way to approach Hadoop programming, but I need to implement an idea as soon as possible. I appreciate your help. -- Saptarshi Guha - [EMAIL PROTECTED]
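The compiler message says the opposite of the comment: job.get() does return a java.lang.String, and "found String, required Text" means slist_name is actually declared as a Text field. A sketch of the two obvious fixes; the class name is hypothetical, and "testfoo" is taken from the snippet above.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class SListConfigExample extends MapReduceBase {
  private String slistName;   // option 1: declare the field as a plain String
  private Text slistAsText;   // option 2: keep a Text field and wrap the value

  public void configure(JobConf job) {
    slistName = job.get("testfoo");
    // Default to "" so a missing property does not cause a NullPointerException.
    slistAsText = new Text(job.get("testfoo", ""));
  }
}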
Re: A question about the combiner, reducer and the Output value class: can they be different?
On Nov 16, 2008, at 6:18 PM, Owen O'Malley wrote: On Sun, Nov 16, 2008 at 2:18 PM, Saptarshi Guha [EMAIL PROTECTED] wrote: Hello, If my understanding is correct, the combiner will read in values for a given key, process it, output it and then **all** values for a key are given to the reducer. Not quite. The flow looks like RecordReader - Mapper - Combiner * - Reducer - OutputFormat . Yes, i glossed over that bit. Thanks for the correction. The Combiner may be called 0, 1, or many times on each key between the mapper and reducer. Combiners are just an application specific optimization that compress the intermediate output. They should not have side effects or transform the types. Unfortunately, since there isn't a separate interface for Combiners, there is isn't a great place to document this requirement. I've just filed HADOOP-4668 to improve the documentation. Hmm, i had no idea that the combiner could be called 0 times. Thanks for the heads up Thank you Saptarshi
Empty source in map?
Hello, I have a JobControl with an unknown number of jobs. The way I execute it is add a job, start it and upon finnishing decide whether to add another job. When I add another a job and run it,I get the following(see below) What does the Unknown Source mean? Does it have anything do with 08/11/16 16:11:16 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized 08/11/16 16:16:41 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= [output of first job (job_1) skipped] 08/11/16 16:16:51 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized 08/11/16 16:16:51 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 08/11/16 16:16:51 INFO mapred.FileInputFormat: Total input paths to process : 1 08/11/16 16:16:51 INFO mapred.FileInputFormat: Total input paths to process : 1 08/11/16 16:16:51 INFO mapred.FileInputFormat: Total input paths to process : 1 08/11/16 16:16:51 INFO mapred.FileInputFormat: Total input paths to process : 1 08/11/16 16:16:51 INFO mapred.MapTask: numReduceTasks: 1 08/11/16 16:16:51 INFO mapred.MapTask: io.sort.mb = 200 08/11/16 16:16:52 INFO mapred.MapTask: data buffer = 159383552/199229440 08/11/16 16:16:52 INFO mapred.MapTask: record buffer = 524288/655360 08/11/16 16:16:52 WARN mapred.LocalJobRunner: job_local_0002 java.lang.NullPointerException at org.saptarshiguha.clusters.ClusterCenter $ClosestCenterMR.map(Unknown Source) at org.saptarshiguha.clusters.ClusterCenter $ClosestCenterMR.map(Unknown Source) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) at org.apache.hadoop.mapred.LocalJobRunner $Job.run(LocalJobRunner.java:157) 08/11/16 16:16:57 INFO mapred.LocalJobRunner: file:/tmp/input/ sample.data:0+308 The map function is not run at all. ===BEGIN_CODE JobConf jobConf_1 = ColumnRanges.createSetupJob(inPaths_1, outdir); Job job_1 = new Job(jobConf_1, null); JobControl theControl = new JobControl(Test); theControl.addJob(job_1); Thread theController = new Thread(theControl); theController.start(); while (!theControl.allFinished()) { System.out.println(Jobs in waiting state: + theControl.getWaitingJobs().size()); System.out.println(Jobs in ready state: + theControl.getReadyJobs().size()); System.out.println(Jobs in running state: + theControl.getRunningJobs().size()); System.out.println(Jobs in success state: + theControl.getSuccessfulJobs().size()); System.out.println(Jobs in failed state: + theControl.getFailedJobs().size()); System.out.println(\n); try { Thread.sleep(5000); } catch (Exception e) { System.out.println(Exception\n); } } theControl.stop(); if ( job_1.getState() == Job.SUCCESS) { System.out.println(Choosing initial centers); createSeedCenters(); FileUtil.fullyDelete(outdir.getFileSystem(jobConf_1 ),outdir); JobConf jobConf = ClusterCenter.createSetupJob(inPaths_1, outdir); Job job = new Job(jobConf, null); theControl.addJob(job); Thread threadControl = new Thread(theControl); threadControl.start(); while (!theControl.allFinished()){ try { Thread.sleep(5000); } catch (Exception e) { System.out.println(Exception\n); } } }
A question about the combiner, reducer and the Output value class: can they be different?
Hello, If my understanding is correct, the combiner will read in values for a given key, process it, output it and then **all** values for a key are given to the reducer. Then it ought to be possible for the combiner to be of the form public static class ClosestCenterCB extends MapReduceBase implements ReducerIntWritable, Text, IntWritable, BytesWritable{ public void reduce(IntWritable key, IteratorText values, OutputCollectorIntWritable, BytesWritable output, Reporter reporter) {...} } and the reducer: public static class ClosestCenterMR extends MapReduceBase implements MapperLongWritable, Text, IntWritable, Text, ReducerIntWritable, BytesWritable, IntWritable, Text{ public void reduce(IntWritable key, IteratorBytesWritable values, OutputCollectorIntWritable, Text output, Reporter reporter) throws IOException { ..} } However, when I set up the jobconf theJob.setOutputKeyClass(IntWritable.class); theJob.setOutputValueClass(Text.class); theJob.setReducerClass(ClosestCenterMR.class); theJob.setCombinerClass(ClosestCenterCB.class); The outputvalue is TextClass and so I get the following error: ava.io.IOException: wrong value class: class org.apache.hadoop.io.BytesWritable is not class org.apache.hadoop.io.Text at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:144) at org.apache.hadoop.mapred.Task $CombineOutputCollector.collect(Task.java:626) at org.saptarshiguha.clusters.ClusterCenter $ClosestCenterCB.reduce(Unknown Source) at org.saptarshiguha.clusters.ClusterCenter $ClosestCenterCB.reduce(Unknown Source) at org.apache.hadoop.mapred.MapTask $MapOutputBuffer.combineAndSpill(MapTask.java:904) at org.apache.hadoop.mapred.MapTask $MapOutputBuffer.sortAndSpill(MapTask.java:785) at org.apache.hadoop.mapred.MapTask $MapOutputBuffer.flush(MapTask.java:698) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228) at org.apache.hadoop.mapred.LocalJobRunner $Job.run(LocalJobRunner.java:157) 08/11/16 16:48:18 INFO mapred.LocalJobRunner: file:/tmp/input/ sample.data:0+308 Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha This is your fortune.
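The combiner writes back into the same intermediate (map-output) stream that the reducer later reads, so its key/value classes must match the map output classes, not the job's final output classes; with map output of <IntWritable, Text>, a combiner emitting BytesWritable is exactly what triggers the "wrong value class" IOException above. A sketch of the JobConf calls that make the distinction explicit, reusing the types and the ClusterCenter class from this thread; the helper method is illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class ClusterJobSetup {
  public static JobConf createJob() {
    JobConf theJob = new JobConf(ClusterCenter.class);
    // Intermediate classes: what the mapper emits and what the combiner must
    // both consume and emit.
    theJob.setMapOutputKeyClass(IntWritable.class);
    theJob.setMapOutputValueClass(Text.class);
    // Final classes: what the reducer emits to the OutputFormat.
    theJob.setOutputKeyClass(IntWritable.class);
    theJob.setOutputValueClass(Text.class);
    return theJob;
  }
}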
Re: Reply: Passing information from one job to the next in a JobControl
Hi Jerry, This actually makes a lot of sense. Hadn't seen it in this light. Thank you Saptarshi On Nov 12, 2008, at 3:07 AM, jerry ye wrote: Hi Saptarshi: Please refer to the following example code; I hope it can help you. JobConf grepJob = new JobConf(getConf(), Grep.class); try { grepJob.setJobName("search"); FileInputFormat.setInputPaths(grepJob, args[0]); … FileOutputFormat.setOutputPath(grepJob, tempDir); . JobClient.runJob(grepJob); JobConf sortJob = new JobConf(Grep.class); sortJob.setJobName("sort"); FileInputFormat.setInputPaths(sortJob, tempDir); . FileOutputFormat.setOutputPath(sortJob, new Path(args[1])); …….. JobClient.runJob(sortJob); --Jerry -----Original Message----- From: Saptarshi Guha [mailto:[EMAIL PROTECTED] Sent: November 11, 2008 12:06 To: core-user@hadoop.apache.org Subject: Passing information from one job to the next in a JobControl Hello, I am using JobControl to run a sequence of jobs (Job_1, Job_2, ..., Job_n) one after the other. Each job returns some information, e.g. key1 value1,value2 key2 value1,value2 and so on. This can be found in the outdir passed to the jar file. Is there a way for Job_1 to return some data (which can be passed on to Job_2), without my main program having to read the information from the file in the HDFS? I could use things like Linda Spaces; however, does MapReduce have a framework for this? Thanks Saptarshi -- Saptarshi Guha - [EMAIL PROTECTED] Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha Intel CPUs are not defective, they just act that way. -- Henry Spencer
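For small numeric summaries there is one more option besides chaining through a temp directory: counters travel back to the submitting program with the job status, so the driver can read them after Job_1 finishes and feed them into Job_2's configuration without touching HDFS. A sketch, assuming the jobs are run sequentially with JobClient.runJob rather than JobControl; the counter enum, the property key, and the class name are made up for illustration.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class ChainDriver {
  // Tasks would call reporter.incrCounter(StageCounters.RECORDS_KEPT, 1).
  public enum StageCounters { RECORDS_KEPT }

  public static void main(String[] args) throws Exception {
    JobConf first = new JobConf(ChainDriver.class);
    // ... configure mapper/reducer, input and output paths for Job_1 ...
    RunningJob rj = JobClient.runJob(first);   // blocks until the job is done
    long kept = rj.getCounters().getCounter(StageCounters.RECORDS_KEPT);

    JobConf second = new JobConf(ChainDriver.class);
    second.setLong("myapp.records.kept", kept); // hypothetical property name
    // ... configure and submit the second job, which reads the value in configure() ...
  }
}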
Passing information from one job to the next in a JobControl
Hello, I am using JobControl to run a sequence of jobs (Job_1, Job_2, ..., Job_n) one after the other. Each job returns some information, e.g. key1 value1,value2 key2 value1,value2 and so on. This can be found in the outdir passed to the jar file. Is there a way for Job_1 to return some data (which can be passed on to Job_2), without my main program having to read the information from the file in the HDFS? I could use things like Linda Spaces; however, does MapReduce have a framework for this? Thanks Saptarshi -- Saptarshi Guha - [EMAIL PROTECTED]
Keep jobcache files around
Hello, I wish to keep my jobcache files after the run. I'm using a program which can't read from STDIN (I'm using Hadoop streaming), so I've written a Python wrapper to create a file and pass the file to the program. However, though the Python file runs (and maybe the program), I'm not getting the desired results. Nothing fails, and even though I've set keep.failed.tasks=true (-jobconf mapred.reduce.tasks=0 -jobconf keep.failed.tasks.files=1 on the streaming command line), nothing is preserved, i.e. the jobcache folders (no attempt_200810091420_0004_m_03_3*** folders) are deleted from the task nodes. How can I keep them, even when nothing fails? Regards Saptarshi Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha
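If I remember the configuration names right, the property is keep.failed.task.files (not keep.failed.tasks.files), and it only preserves directories for tasks that actually fail; to keep the jobcache directories when nothing fails, the task ids have to match keep.task.files.pattern. Both are settable through the JobConf API, and for streaming the same names should work with -jobconf. A sketch, with an example pattern; the class name is a placeholder.

import org.apache.hadoop.mapred.JobConf;

public class KeepFilesExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(KeepFilesExample.class);
    // keep.failed.task.files: keep directories only for failed tasks.
    conf.setKeepFailedTaskFiles(true);
    // keep.task.files.pattern: keep directories for any task id matching the
    // regex, even on success; ".*" would keep everything (example pattern).
    conf.setKeepTaskFilesPattern(".*_m_000000_.*");
  }
}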
Re: Jobtracker config?
Hi, Thanks, it worked. Correct me if I'm wrong, but isn't this a configuration defect? E.g the location of sec namenode is in conf/master and if I run start- dfs.sh, the sec namenode starts it on B. Similarly, given that the Jobtracker is specified to run on C, shouldn't start-all.sh start the jobtracker on C? Regards Saptarshi On Sep 29, 2008, at 6:37 PM, Arun C Murthy wrote: On Sep 29, 2008, at 2:52 PM, Saptarshi Guha wrote: Setup: I am running the namenode on A, the sec. namenode on B and the jobtracker on C. The datanodes and tasktrackers are on Z1,Z2,Z3. Problem: However, the jobtracker is starting up on A. Here are my configs for Jobtracker This would happen if you ran 'start-all.sh' on A rather than start- dfs.sh on A and start-mapred.sh on B. Is that what you did? If not, please post the commands you used to start the HDFS and Map- Reduce clusters... Arun property namemapred.job.tracker/name valueC:54311/value descriptionThe host and port that the MapReduce job tracker runs at. If local, then jobs are run in-process as a single map and reduce task. /description /property property namemapred.job.tracker.http.address/name valueC:50030/value description The job tracker http server address and port the server will listen on. If the port is 0 then the server will start on a free port. /description /property Also, my masters contains on entry for B (so that the sec. name node starts on B) and my slaves file contains Z1,Z2,Z3. The config files are synchronized across all machines. Any help would be appreciated. Thank you Saptarshi Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha I think I'm schizophrenic. One half of me's paranoid and the other half's out to get him.
A quick question about the partitioner and reducer
Hello, I am slightly confused about the number of reducers executed and the size of data each receives. Setup: I have a setup of 5 task trackers. In my hadoop-site: (1) property namemapred.reduce.tasks/name value7/value descriptionThe default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is local. /description /property (2) property namemapred.tasktracker.map.tasks.maximum/name value7/value descriptionThe maximum number of map tasks that will be run simultaneously by a task tracker. /description /property (3) However from http://hadoop.apache.org/core/docs/r0.18.1/api/index.html The total number of partitions is the same as the number of reduce tasks for the job.. Q: So does that mean (from (1) (2)) , there will be a total of 7 reduce tasks distributed across 5 machines such that no machine receives more than 7 reduce jobs? If so, suppose i have millions of unique keys which need to be reduced(e.g urls/hashes), these will be partitioned into 7 groups (from (3)) and distributed across 5 machines? Which is equivalent to saying that the number of reduces tasks run across all machines will be equal to 7? Wouldn't that be too large a number of keys for each reduce task? Are these possible solutions: Solution: I) Fixed machines (5), but increase mapred.reduce.tasks (loss in performance?) 2) Increase number of machines (not possible for me, but a theoretical solution) and set mapred.reduce.tasks to a commensurate number Many thanks for your time Saptarshi Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha More people are flattered into virtue than bullied out of vice. -- R. S. Surtees
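On the key question: with mapred.reduce.tasks=7, every distinct map output key is hashed into exactly one of 7 partitions, so the 7 reduce tasks (spread over the 5 nodes, some nodes running more than one) each see roughly a seventh of the distinct keys. A reduce task does not hold all of its keys at once; it walks through them one group at a time, streaming through the values, so millions of keys per task is normal. A sketch of the default hash partitioning logic, written as its own Partitioner for illustration (it is equivalent to the stock HashPartitioner):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Each distinct key maps deterministically to one of numReduceTasks partitions.
public class HashLikePartitioner<K, V> implements Partitioner<K, V> {
  public void configure(JobConf job) { }

  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so negative hash codes still give a valid index.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}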
Jobtracker config?
Hello, Back to hadoop after several months. I searched a few thousand messages but the config for jobtracker escaped me. Setup: I am running the namenode on A, the sec. namenode on B and the jobtracker on C. The datanodes and tasktrackers are on Z1,Z2,Z3. Problem: However, the jobtracker is starting up on A. Here are my configs for Jobtracker property namemapred.job.tracker/name valueC:54311/value descriptionThe host and port that the MapReduce job tracker runs at. If local, then jobs are run in-process as a single map and reduce task. /description /property property namemapred.job.tracker.http.address/name valueC:50030/value description The job tracker http server address and port the server will listen on. If the port is 0 then the server will start on a free port. /description /property Also, my masters contains on entry for B (so that the sec. name node starts on B) and my slaves file contains Z1,Z2,Z3. The config files are synchronized across all machines. Any help would be appreciated. Thank you Saptarshi Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha
Data-local tasks
Hello, I recall asking this question, but this is in addition to what I've asked. Firstly, to recap my question and Arun's specific response: On May 20, 2008, at 9:03 AM, Saptarshi Guha wrote: Hello, Does the Data-local map tasks counter mean the number of tasks that had the input data already present on the machine they are running on? I.e. there wasn't a need to ship the data to them. Response from Arun: Yes. Your understanding is correct. More specifically it means that the map task got scheduled on a machine on which one of the replicas of its input-split block was present and was served by the datanode running on that machine. *smile* Arun Now, is Hadoop designed to schedule a map task on a machine which has one of the replicas of its input split block? Failing that, does it then assign the map task to a machine close to one that contains a replica of its input split block? Are there any performance metrics for this? Many thanks Saptarshi Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha
Re: Import path for hadoop streaming with python
I haven't done this using hadoop but before i 16.4 i had written my own distributed batch processor using HDFS as a common file storage and remote execution of python scripts. They all required a custom module which was copied to the remote temp folders (a primitive implementation of cacheFile) So this is what I did: just after #!/usr/bin/env python import sys sys.path.append('.') import mylib dostuff so that your module can be found in the current path. It should work thereafter Regards Saptarshi On May 22, 2008, at 7:39 PM, Martin Blom wrote: Hello all, I'm trying to stream a little python script on my small hadoop cluster, and it doesn't work like I thought it would. The script looks something like #!/usr/bin/env python import mylib dostuff where mylib is a small python library that I want included, and I launch the whole thing with something like bin/hadoop jar contrib/streaming/hadoop-0.16.4-streaming.jar -cacheFile hdfs://master:54310/user/hadoop/mylib.py#mylib.py -file scrpit.py -mapper script.py -input input -output output so it seems to me like the library should be available to the script. When I run the script locally on my machine everything works perfectly fine. However, when I run it it the script can't find the library. Does hadoop do anything strange to default paths? Am I missing something obvious? Any pointers or ideas on how to fix this would be great. Martin Blom Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha You love your home and want it to be beautiful. smime.p7s Description: S/MIME cryptographic signature
Meaning of Data-local map tasks in the web status gui to MapReduce
Hello, Does the Data-local map tasks counter mean the number of tasks that had the input data already present on the machine they are running on? I.e. there wasn't a need to ship the data to them. Thanks Saptarshi Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha
Large(Thousands) of files -fast
Hello, I have a similar scenario to jkupferman's situation - 1000's of files mostly ranging from Kb,some MBs and few of which GBs. I am not too familiar with java and am using hadoopstreaming with python. The mapper must work on individual files. I've placed the 1000's of files into the DFS. I've given the map job a manifest listing locations of the files, this is given to Hadoop which streams it to my python script which then copies the specified filename and processes it. I also tried tar-ring the files, converting them into a sequence file and then using SequenceFileAsTextInputFormat. The problem with this is that it sends the file contents as a string representation of the bytes, which i would have to convert. Q: Is there any way I can make it send me the data as BytesWritable(mentioned below), using the command line and python? Thanks for your time. Regards Saptarshi On May 18, 2008, at 10:54 PM, Brian Vargas wrote: Hi, You can realize a huge improvement by sticking them into a sequence file. With lots of small files, name lookups against the name node will be a big bottleneck. One easy approach is making the key be a Text of the filename that was loaded in, and the value be a BytesWritable, which is the contents of the file. Since they're relatively small files (or you wouldn't be having this problem), you won't have to worry about OOMing yourself. It worked really well for me, dealing with a few hundred thousand ~4MB files. Brian jkupferman wrote: Hi Everyone, I am working on a project which takes in data from a lot of text files, and although there are a lot of ways to do it, it is not clear to me which is the best/fastest. I am working on an EC2 cluster with approximately 20 machines. The data is currently spread across 20k text files (total of a ~3gb ), each of which needs to be treated as a whole (no splits within those files), but I am willing to change around the format if I can get increased speed. Using the regular TextInputFormat adjusted to take in entire files is pretty slow since each file takes a minimum of about ~3 seconds no matter how small it is. From what I have read the possible options to proceed with are as follows: 1. Use MultiFileInputSplit, it seems to be designed for this sort of situation, but I have yet to see an implementation of this, or a commentary on its performance increase over the regular input. 2. Read the data in, and output it as a Sequence File and use the sequence file as input from there on out. 3. Condense the files down to a small number of files (say ~100) and then delimit the files so each part gets a separate record reader. If anyone could give me guidance as to what will provide the best performance for this setup, I would greatly appreciate it. Thanks for your help Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha The typewriting machine, when played with expression, is no more annoying than the piano when played by a sister or near relation. -- Oscar Wilde smime.p7s Description: S/MIME cryptographic signature
Re: Large(Thousands) of files -fast
Aah, use org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat as the inputformat. Thanks Saptarshi On May 18, 2008, at 11:17 PM, Saptarshi Guha wrote: Hello, I have a similar scenario to jkupferman's situation - 1000's of files mostly ranging from Kb,some MBs and few of which GBs. I am not too familiar with java and am using hadoopstreaming with python. The mapper must work on individual files. I've placed the 1000's of files into the DFS. I've given the map job a manifest listing locations of the files, this is given to Hadoop which streams it to my python script which then copies the specified filename and processes it. I also tried tar-ring the files, converting them into a sequence file and then using SequenceFileAsTextInputFormat. The problem with this is that it sends the file contents as a string representation of the bytes, which i would have to convert. Q: Is there any way I can make it send me the data as BytesWritable(mentioned below), using the command line and python? Thanks for your time. Regards Saptarshi On May 18, 2008, at 10:54 PM, Brian Vargas wrote: Hi, You can realize a huge improvement by sticking them into a sequence file. With lots of small files, name lookups against the name node will be a big bottleneck. One easy approach is making the key be a Text of the filename that was loaded in, and the value be a BytesWritable, which is the contents of the file. Since they're relatively small files (or you wouldn't be having this problem), you won't have to worry about OOMing yourself. It worked really well for me, dealing with a few hundred thousand ~4MB files. Brian jkupferman wrote: Hi Everyone, I am working on a project which takes in data from a lot of text files, and although there are a lot of ways to do it, it is not clear to me which is the best/fastest. I am working on an EC2 cluster with approximately 20 machines. The data is currently spread across 20k text files (total of a ~3gb ), each of which needs to be treated as a whole (no splits within those files), but I am willing to change around the format if I can get increased speed. Using the regular TextInputFormat adjusted to take in entire files is pretty slow since each file takes a minimum of about ~3 seconds no matter how small it is. From what I have read the possible options to proceed with are as follows: 1. Use MultiFileInputSplit, it seems to be designed for this sort of situation, but I have yet to see an implementation of this, or a commentary on its performance increase over the regular input. 2. Read the data in, and output it as a Sequence File and use the sequence file as input from there on out. 3. Condense the files down to a small number of files (say ~100) and then delimit the files so each part gets a separate record reader. If anyone could give me guidance as to what will provide the best performance for this setup, I would greatly appreciate it. Thanks for your help Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha The typewriting machine, when played with expression, is no more annoying than the piano when played by a sister or near relation. -- Oscar Wilde Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha Back when I was a boy, it was 40 miles to everywhere, uphill both ways and it was always snowing. smime.p7s Description: S/MIME cryptographic signature
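A sketch of the packing step described in this thread: write one SequenceFile whose keys are the file names (Text) and whose values are the raw bytes (BytesWritable). The paths come from the command line and each whole file is read into memory, so this assumes the individual files are small; on the Java side the result can then be read back with SequenceFileInputFormat, and from streaming via the SequenceFileAsBinaryInputFormat mentioned above.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path(args[0]);   // directory of small files
    Path out = new Path(args[1]);  // the packed sequence file

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class);
    try {
      for (FileStatus st : fs.listStatus(in)) {
        // Read the whole file into a byte array (assumes small files).
        byte[] buf = new byte[(int) st.getLen()];
        FSDataInputStream is = fs.open(st.getPath());
        try {
          is.readFully(0, buf);
        } finally {
          is.close();
        }
        // Key: file name; value: raw file contents.
        writer.append(new Text(st.getPath().getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}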