Re: Job history logging
Thanks Harsh. That doesn't seem like a correct assumption, though? In the case where disk space is exhausted and the JT stops writing history logs, does it mean we require a JT restart for logs to be enabled again? In my case, I am seeing the JT trying to write logs with a different user than the superuser. I am not sure why this is happening either, but the attempt to write fails as the other user does not have permissions.

On Sep 14, 2012, at 7:11 PM, Harsh J ha...@cloudera.com wrote:

I guess the reason is that it assumes it can't write history files after that point, and skips the rest of the work?

On Sat, Sep 15, 2012 at 3:07 AM, Prashant Kommireddi prash1...@gmail.com wrote:

Hi All, I have a question about job history logging. It seems history logging is disabled if file creation fails; is there a reason this is done? The following snippet is from JobHistory.JobInfo.logSubmitted() - Hadoop 0.20.2:

      // Log the history meta info
      JobHistory.MetaInfoManager.logMetaInfo(writers);
      // add to writer as well
      JobHistory.log(writers, RecordTypes.Job,
          new Keys[]{ Keys.JOBID, Keys.JOBNAME, Keys.USER, Keys.SUBMIT_TIME, Keys.JOBCONF },
          new String[]{ jobId.toString(), jobName, user, String.valueOf(submitTime), jobConfPath });
    } catch (IOException e) {
      LOG.error("Failed creating job history log file, disabling history", e);
      disableHistory = true;
    }
  }

Thanks,

-- Harsh J
Re: IOException: too many length or distance symbols
Thanks Harsh. On digging some more it appears there was a data corruption issue with the file that caused the exception. After having regenerated the gzip file from source I no longer see the issue. On Jul 20, 2012, at 8:48 PM, Harsh J ha...@cloudera.com wrote: Prashant, Can you add in some context on how these files were written, etc.? Perhaps open a JIRA with a sample file and test-case to reproduce this? Other env stuff with info on version of hadoop, etc. would help too. On Sat, Jul 21, 2012 at 2:05 AM, Prashant Kommireddi prash1...@gmail.com wrote: I am seeing these exceptions, anyone know what they might be caused due to? Case of corrupt file? java.io.IOException: too many length or distance symbols at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method) at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221) at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80) at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74) at java.io.InputStream.read(InputStream.java:85) at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134) at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97) at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:109) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423) at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) Thanks, Prashant -- Harsh J
Re: EOFException at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)......
I have seen this issue with large file writes using the SequenceFile writer. I have not found the same issue when testing with writing fairly small files (< 1GB).

On Fri, May 25, 2012 at 10:33 PM, Kasi Subrahmanyam kasisubbu...@gmail.com wrote:

Hi, If you are using a custom writable object while passing data from the mapper to the reducer, make sure that the readFields and the write methods handle the same number of variables. It might be possible that you wrote data to a file using the custom writable but later modified the custom writable (like adding a new attribute to the writable) which the old data doesn't have. It might be a possibility; please check once.

On Friday, May 25, 2012, waqas latif wrote:

Hi Experts, I am fairly new to hadoop MapReduce and I was trying to run a matrix multiplication example presented by Mr. Norstadt under the following link http://www.norstad.org/matrix-multiply/index.html. I can run it successfully with hadoop 0.20.2, but when I try to run it with hadoop 1.0.3 I get the following error. Is it a problem with my hadoop configuration, or is it a compatibility problem in the code, which was written for hadoop 0.20 by the author? Also, please guide me on how I can fix this error in either case. Here is the error I am getting:

Exception in thread "main" java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:180)
 at java.io.DataInputStream.readFully(DataInputStream.java:152)
 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486)
 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475)
 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470)
 at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:60)
 at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:87)
 at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:112)
 at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:150)
 at TestMatrixMultiply.testRandom(TestMatrixMultiply.java:278)
 at TestMatrixMultiply.main(TestMatrixMultiply.java:308)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Thanks in advance. Regards, waqas
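For reference, a minimal sketch of the symmetry Kasi describes between write() and readFields() in a custom Writable; the class and field names here are hypothetical, not from the thread:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // readFields() must deserialize exactly the fields, in exactly the order,
    // that write() serialized. Adding a field to write() without updating
    // readFields() (or reading old data written before the field existed)
    // typically surfaces as an EOFException like the one above.
    public class MatrixCellWritable implements Writable {
      private int row;
      private int col;
      private double value;

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeInt(row);
        out.writeInt(col);
        out.writeDouble(value);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        row = in.readInt();      // same order as write()
        col = in.readInt();
        value = in.readDouble();
      }
    }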
Re: Namenode EOF Exception
Thanks, I agree I need to upgrade :) I was able to recover NN following your suggestions, and an additional hack was to sync the namespaceID across data nodes with the namenode. On May 14, 2012, at 11:48 AM, Harsh J ha...@cloudera.com wrote: True, I don't recall 0.20.2 (the original release that was a few years ago) carrying these fixes. You ought to upgrade that cluster to the current stable release for the many fixes you can benefit from :) On Mon, May 14, 2012 at 11:58 PM, Prashant Kommireddi prash1...@gmail.com wrote: Thanks Harsh. I am using 0.20.2, I see on the Jira this issue was fixed for 0.23? I will try out your suggestions and get back. On May 14, 2012, at 1:22 PM, Harsh J ha...@cloudera.com wrote: Your fsimage seems to have gone bad (is it 0-sized? I recall that as a known issue long since fixed). The easiest way is to fall back to the last available good checkpoint (From SNN). Or if you have multiple dfs.name.dirs, see if some of the other points have better/complete files on them, and re-spread them across after testing them out (and backing up the originals). Though what version are you running? Cause AFAIK most of the recent stable versions/distros include NN resource monitoring threads which should have placed your NN into safemode the moment all its disks ran near to out of space. On Mon, May 14, 2012 at 10:50 PM, Prashant Kommireddi prash1...@gmail.com wrote: Hi, I am seeing an issue where Namenode does not start due an EOFException. The disk was full and I cleared space up but I am unable to get past this exception. Any ideas on how this can be resolved? 2012-05-14 10:10:44,018 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=hadoop 2012-05-14 10:10:44,018 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=false 2012-05-14 10:10:44,023 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.file.FileContext 2012-05-14 10:10:44,024 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean 2012-05-14 10:10:44,047 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 205470 2012-05-14 10:10:44,844 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. 
java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106) at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:292) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:279) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965) 2012-05-14 10:10:44,845 INFO org.apache.hadoop.ipc.Server: Stopping server on 54310 2012-05-14 10:10:44,845 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106) at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:292) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:279) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965) 2012-05-14 10:10:44,846 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NameNode at gridforce-1
Re: Namenode EOF Exception
Thanks Harsh. I am using 0.20.2, I see on the Jira this issue was fixed for 0.23? I will try out your suggestions and get back. On May 14, 2012, at 1:22 PM, Harsh J ha...@cloudera.com wrote: Your fsimage seems to have gone bad (is it 0-sized? I recall that as a known issue long since fixed). The easiest way is to fall back to the last available good checkpoint (From SNN). Or if you have multiple dfs.name.dirs, see if some of the other points have better/complete files on them, and re-spread them across after testing them out (and backing up the originals). Though what version are you running? Cause AFAIK most of the recent stable versions/distros include NN resource monitoring threads which should have placed your NN into safemode the moment all its disks ran near to out of space. On Mon, May 14, 2012 at 10:50 PM, Prashant Kommireddi prash1...@gmail.com wrote: Hi, I am seeing an issue where Namenode does not start due an EOFException. The disk was full and I cleared space up but I am unable to get past this exception. Any ideas on how this can be resolved? 2012-05-14 10:10:44,018 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=hadoop 2012-05-14 10:10:44,018 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=false 2012-05-14 10:10:44,023 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.file.FileContext 2012-05-14 10:10:44,024 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean 2012-05-14 10:10:44,047 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 205470 2012-05-14 10:10:44,844 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. 
java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106) at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:292) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:279) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965) 2012-05-14 10:10:44,845 INFO org.apache.hadoop.ipc.Server: Stopping server on 54310 2012-05-14 10:10:44,845 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106) at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:292) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:279) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965) 2012-05-14 10:10:44,846 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NameNode at gridforce-1.internal.salesforce.com/10.0.201.159 / -- Harsh J
Re: java.io.IOException: Task process exit with nonzero status of 1
You might be running out of disk space. Check for that on your cluster nodes. -Prashant

On Fri, May 11, 2012 at 12:21 AM, JunYong Li lij...@gmail.com wrote:

Are there errors in the task output file? On jobtracker.jsp click the Jobid link - tasks link - Taskid link - Task logs link

2012/5/11 Mohit Kundra mohit@gmail.com

Hi, I am a new user to hadoop. I have installed hadoop 0.19.1 on a single windows machine. Its http://localhost:50030/jobtracker.jsp and http://localhost:50070/dfshealth.jsp pages are working fine, but when I execute bin/hadoop jar hadoop-0.19.1-examples.jar pi 5 100 it shows the following:

$ bin/hadoop jar hadoop-0.19.1-examples.jar pi 5 100
cygpath: cannot create short name of D:\hadoop-0.19.1\logs
Number of Maps = 5 Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Starting Job
12/05/11 12:07:26 INFO mapred.JobClient: Running job: job_20120513_0002
12/05/11 12:07:27 INFO mapred.JobClient: map 0% reduce 0%
12/05/11 12:07:35 INFO mapred.JobClient: Task Id : attempt_20120513_0002_m_06_ 0, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 1.
 at org.apache.hadoop.mapred.TaskRunner.run (TaskRunner.java:425)

Please tell me what the root cause is. Regards, Mohit

-- Regards Junyong
Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3
Seems like a matter of upgrade. I am not a Cloudera user so would not know much, but you might find some help moving this to Cloudera mailing list. On Thu, May 3, 2012 at 2:51 AM, Austin Chungath austi...@gmail.com wrote: There is only one cluster. I am not copying between clusters. Say I have a cluster running apache 0.20.205 with 10 TB storage capacity and has about 8 TB of data. Now how can I migrate the same cluster to use cdh3 and use that same 8 TB of data. I can't copy 8 TB of data using distcp because I have only 2 TB of free space On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar nitinpawar...@gmail.com wrote: you can actually look at the distcp http://hadoop.apache.org/common/docs/r0.20.0/distcp.html but this means that you have two different set of clusters available to do the migration On Thu, May 3, 2012 at 12:51 PM, Austin Chungath austi...@gmail.com wrote: Thanks for the suggestions, My concerns are that I can't actually copyToLocal from the dfs because the data is huge. Say if my hadoop was 0.20 and I am upgrading to 0.20.205 I can do a namenode upgrade. I don't have to copy data out of dfs. But here I am having Apache hadoop 0.20.205 and I want to use CDH3 now, which is based on 0.20 Now it is actually a downgrade as 0.20.205's namenode info has to be used by 0.20's namenode. Any idea how I can achieve what I am trying to do? Thanks. On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar nitinpawar...@gmail.com wrote: i can think of following options 1) write a simple get and put code which gets the data from DFS and loads it in dfs 2) see if the distcp between both versions are compatible 3) this is what I had done (and my data was hardly few hundred GB) .. did a dfs -copyToLocal and then in the new grid did a copyFromLocal On Thu, May 3, 2012 at 11:41 AM, Austin Chungath austi...@gmail.com wrote: Hi, I am migrating from Apache hadoop 0.20.205 to CDH3u3. I don't want to lose the data that is in the HDFS of Apache hadoop 0.20.205. How do I migrate to CDH3u3 but keep the data that I have on 0.20.205. What is the best practice/ techniques to do this? Thanks Regards, Austin -- Nitin Pawar -- Nitin Pawar
Re: Compressing map only output
Yes. These are Hadoop properties - using set is just a way for Pig to set those properties in your job conf.

On Mon, Apr 30, 2012 at 5:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

Is there a way to compress map-only jobs so that the map output that gets stored on hdfs as part-m-* files is compressed? In Pig I used the following. Would these work for plain map reduce jobs as well?

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
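As a rough sketch, the analogous switches for a plain map-only MapReduce job are the FileOutputFormat compression settings; the job name and codec choice below are illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyCompressedOutput {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "map-only with compressed output"); // illustrative job name
        job.setNumReduceTasks(0);                       // map-only: part-m-* files go straight to HDFS
        FileOutputFormat.setCompressOutput(job, true);  // compress the final output files
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        // ... set mapper, input/output paths, then job.waitForCompletion(true) ...
      }
    }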
Re: Distributing MapReduce on a computer cluster
Shailesh, there's a lot that goes into distributing work across tasks/nodes. It's not just distributing work but also fault tolerance, data locality etc. that come into play. It might be good to refer to the Hadoop Apache docs or Tom White's Definitive Guide. Sent from my iPhone

On Apr 23, 2012, at 11:03 AM, Shailesh Samudrala shailesh2...@gmail.com wrote:

Hello, I am trying to design my own MapReduce implementation and I want to know how hadoop is able to distribute its workload across multiple computers. Can anyone shed more light on this? Thanks!
Re: Jobtracker history logs missing
Anyone faced similar issue or knows what the issue might be? Thanks in advance. On Thu, Apr 5, 2012 at 10:52 AM, Prashant Kommireddi prash1...@gmail.comwrote: Thanks Nitin. I believe the config key you mentioned controls the task attempts logs that go under - ${hadoop.log.dir}/userlogs. The ones that I mentioned are the job history logs that go under - ${hadoop.log.dir}/history and are specified by the key hadoop.job.history.location. Are these cleaned up based on mapred.userlog.retain.hours too? Also, this is what I am seeing in history dir Available Conf files - Mar 3rd - April 5th Available Job files - Mar 3rd - April 3rd There is no job file present after the 3rd of April, but conf files continue to be written. Thanks, Prashant On Thu, Apr 5, 2012 at 3:22 AM, Nitin Khandelwal nitin.khandel...@germinait.com wrote: Hi Prashant, The userlogs for job are deleted after time specified by * mapred.userlog.retain.hours* property defined in mapred-site.xml (default is 24 Hrs). Thanks, Nitin On 5 April 2012 14:26, Prashant Kommireddi prash1...@gmail.com wrote: I am noticing something strange with JobTracker history logs on my cluster. I see configuration files (*_conf.xml) under /logs/history/ but none of the actual job logs. Anyone has ideas on what might be happening? Thanks, -- Nitin Khandelwal
Re: Data Node is not Started
Can you check the datanode logs? May its an incompatible namespace issue. On Apr 6, 2012, at 11:13 AM, Sujit Dhamale sujitdhamal...@gmail.com wrote: Hi all, my DataNode is not started . even after deleting hadoop*.pid file from /tmp , But still Data node is not started , Hadoop Version: hadoop-1.0.1.tar.gz Java version : java version 1.6.0_26 Operating System : Ubuntu 11.10 i did below procedure *hduser@sujit:~/Desktop/hadoop/bin$ jps* 11455 Jps *hduser@sujit:~/Desktop/hadoop/bin$ start-all.sh* Warning: $HADOOP_HOME is deprecated. starting namenode, logging to /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-namenode-sujit.out localhost: starting datanode, logging to /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-datanode-sujit.out localhost: starting secondarynamenode, logging to /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-secondarynamenode-sujit.out starting jobtracker, logging to /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-jobtracker-sujit.out localhost: starting tasktracker, logging to /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-tasktracker-sujit.out *hduser@sujit:~/Desktop/hadoop/bin$ jps* 11528 NameNode 12019 SecondaryNameNode 12355 TaskTracker 12115 JobTracker 12437 Jps *hduser@sujit:~/Desktop/hadoop/bin$ stop-all.sh* Warning: $HADOOP_HOME is deprecated. stopping jobtracker localhost: stopping tasktracker stopping namenode localhost: no datanode to stop localhost: stopping secondarynamenode *hduser@sujit:~/Desktop/hadoop/bin$ jps* 13127 Jps *hduser@sujit:~/Desktop/hadoop/bin$ ls /tmp* hadoop-hduser-datanode.pid hsperfdata_hduserkeyring-meecr7 ssh-JXYCAJsX1324 hadoop-hduser-jobtracker.pid hsperfdata_sujit plugtmp unity_support_test.0 hadoop-hduser-namenode.pid Jetty_0_0_0_0_50030_jobyn7qmkpulse-2L9K88eMlGn7 virtual-hduser.Q8j5nJ hadoop-hduser-secondarynamenode.pid Jetty_0_0_0_0_50070_hdfsw2cu08 pulse-Ob9vyJcXyHZz hadoop-hduser-tasktracker.pid Jetty_0_0_0_0_50090_secondaryy6aanv pulse-PKdhtXMmr18n *Deleted *.pid file :) hduser@sujit:~$ ls /tmp* hsperfdata_hduserpulse-2L9K88eMlGn7 hsperfdata_sujit pulse-Ob9vyJcXyHZz Jetty_0_0_0_0_50030_jobyn7qmkpulse-PKdhtXMmr18n Jetty_0_0_0_0_50070_hdfsw2cu08 ssh-JXYCAJsX1324 Jetty_0_0_0_0_50090_secondaryy6aanv unity_support_test.0 keyring-meecr7 virtual-hduser.Q8j5nJ plugtmp *hduser@sujit:~/Desktop/hadoop$ bin/hadoop namenode -format* Warning: $HADOOP_HOME is deprecated. 12/04/06 23:23:22 INFO namenode.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = sujit.(null)/127.0.1.1 STARTUP_MSG: args = [-format] STARTUP_MSG: version = 1.0.1 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1243785; compiled by 'hortonfo' on Tue Feb 14 08:15:38 UTC 2012 / Re-format filesystem in /app/hadoop/tmp/dfs/name ? 
(Y or N) Y 12/04/06 23:23:25 INFO util.GSet: VM type = 32-bit 12/04/06 23:23:25 INFO util.GSet: 2% max memory = 17.77875 MB 12/04/06 23:23:25 INFO util.GSet: capacity = 2^22 = 4194304 entries 12/04/06 23:23:25 INFO util.GSet: recommended=4194304, actual=4194304 12/04/06 23:23:25 INFO namenode.FSNamesystem: fsOwner=hduser 12/04/06 23:23:25 INFO namenode.FSNamesystem: supergroup=supergroup 12/04/06 23:23:25 INFO namenode.FSNamesystem: isPermissionEnabled=true 12/04/06 23:23:25 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100 12/04/06 23:23:25 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s) 12/04/06 23:23:25 INFO namenode.NameNode: Caching file names occuring more than 10 times 12/04/06 23:23:26 INFO common.Storage: Image file of size 112 saved in 0 seconds. 12/04/06 23:23:26 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted. 12/04/06 23:23:26 INFO namenode.NameNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NameNode at sujit.(null)/127.0.1.1 / hduser@sujit:~/Desktop/hadoop$ bin/start-all.sh Warning: $HADOOP_HOME is deprecated. starting namenode, logging to /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-namenode-sujit.out localhost: starting datanode, logging to /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-datanode-sujit.out localhost: starting secondarynamenode, logging to /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-secondarynamenode-sujit.out starting jobtracker, logging to
Re: Jobtracker history logs missing
Thanks Nitin. I believe the config key you mentioned controls the task attempts logs that go under - ${hadoop.log.dir}/userlogs. The ones that I mentioned are the job history logs that go under - ${hadoop.log.dir}/history and are specified by the key hadoop.job.history.location. Are these cleaned up based on mapred.userlog.retain.hours too? Also, this is what I am seeing in history dir Available Conf files - Mar 3rd - April 5th Available Job files - Mar 3rd - April 3rd There is no job file present after the 3rd of April, but conf files continue to be written. Thanks, Prashant On Thu, Apr 5, 2012 at 3:22 AM, Nitin Khandelwal nitin.khandel...@germinait.com wrote: Hi Prashant, The userlogs for job are deleted after time specified by * mapred.userlog.retain.hours* property defined in mapred-site.xml (default is 24 Hrs). Thanks, Nitin On 5 April 2012 14:26, Prashant Kommireddi prash1...@gmail.com wrote: I am noticing something strange with JobTracker history logs on my cluster. I see configuration files (*_conf.xml) under /logs/history/ but none of the actual job logs. Anyone has ideas on what might be happening? Thanks, -- Nitin Khandelwal
Re: Doubt from the book Definitive Guide
Answers inline. On Wed, Apr 4, 2012 at 4:56 PM, Mohit Anchlia mohitanch...@gmail.comwrote: I am going through the chapter How mapreduce works and have some confusion: 1) Below description of Mapper says that reducers get the output file using HTTP call. But the description under The Reduce Side doesn't specifically say if it's copied using HTTP. So first confusion, Is the output copied from mapper - reducer or from reducer - mapper? And second, Is the call http:// or hdfs:// Map output is written to local FS, not HDFS. 2) My understanding was that mapper output gets written to hdfs, since I've seen part-m-0 files in hdfs. If mapper output is written to HDFS then shouldn't reducers simply read it from hdfs instead of making http calls to tasktrackers location? Map output is sent to HDFS when reducer is not used. - from the book --- Mapper The output file’s partitions are made available to the reducers over HTTP. The number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property this setting is per tasktracker, not per map task slot. The default of 40 may need increasing for large clusters running large jobs.6.4.2. The Reduce Side Let’s turn now to the reduce part of the process. The map output file is sitting on the local disk of the tasktracker that ran the map task (note that although map outputs always get written to the local disk of the map tasktracker, reduce outputs may not be), but now it is needed by the tasktracker that is about to run the reduce task for the partition. Furthermore, the reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads, but this number can be changed by setting the mapred.reduce.parallel.copies property.
Re: Doubt from the book Definitive Guide
Hi Mohit, What would be the advantage? Reducers in most cases read data from all the mappers. In the case where mappers were to write to HDFS, a reducer would still require to read data from other datanodes across the cluster. Prashant On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote: On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote: Hi Mohit, On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am going through the chapter How mapreduce works and have some confusion: 1) Below description of Mapper says that reducers get the output file using HTTP call. But the description under The Reduce Side doesn't specifically say if it's copied using HTTP. So first confusion, Is the output copied from mapper - reducer or from reducer - mapper? And second, Is the call http:// or hdfs:// The flow is simple as this: 1. For M+R job, map completes its task after writing all partitions down into the tasktracker's local filesystem (under mapred.local.dir directories). 2. Reducers fetch completion locations from events at JobTracker, and query the TaskTracker there to provide it the specific partition it needs, which is done over the TaskTracker's HTTP service (50060). So to clear things up - map doesn't send it to reduce, nor does reduce ask the actual map task. It is the task tracker itself that makes the bridge here. Note however, that in Hadoop 2.0 the transfer via ShuffleHandler would be over Netty connections. This would be much more faster and reliable. 2) My understanding was that mapper output gets written to hdfs, since I've seen part-m-0 files in hdfs. If mapper output is written to HDFS then shouldn't reducers simply read it from hdfs instead of making http calls to tasktrackers location? A map-only job usually writes out to HDFS directly (no sorting done, cause no reducer is involved). If the job is a map+reduce one, the default output is collected to local filesystem for partitioning and sorting at map end, and eventually grouping at reduce end. Basically: Data you want to send to reducer from mapper goes to local FS for multiple actions to be performed on them, other data may directly go to HDFS. Reducers currently are scheduled pretty randomly but yes their scheduling can be improved for certain scenarios. However, if you are pointing that map partitions ought to be written to HDFS itself (with replication or without), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). To do that would need ability to spill frequently (cause we don't have infinite memory to do it all in RAM) and doing such a thing on HDFS would only mean slowdown. Thanks for clearing my doubts. In this case I was merely suggesting that if the mapper output (merged output in the end or the shuffle output) is stored in HDFS then reducers can just retrieve it from HDFS instead of asking tasktracker for it. Once reducer threads read it they can continue to work locally. I hope this helps clear some things up for you. -- Harsh J
Re: Using a combiner
It is a function of the number of spills on the map side, and I believe the default is 3. So for every 3 times data is spilled, the combiner is run. This number is configurable. Sent from my iPhone

On Mar 14, 2012, at 3:26 PM, Gayatri Rao rgayat...@gmail.com wrote:

Hi all, I have a quick query on using a combiner in a MR job. Is it true that the framework decides whether or not the combiner gets called? Can anyone please give more information on how this is done. Thanks, Gayatri
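A hedged sketch of registering a combiner and the spill-count knob mentioned above; min.num.spills.for.combine is the 0.20-era property name, and the identity Reducer below is only a stand-in for a real combiner class:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CombinerSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Number of map-side spill files that must accumulate before the
        // combiner is applied again while merging them (default 3).
        conf.setInt("min.num.spills.for.combine", 3);
        Job job = new Job(conf, "job with combiner");   // illustrative job name
        // A combiner is just a Reducer implementation; swap in your own class here.
        job.setCombinerClass(Reducer.class);
        // ... mapper, reducer, input/output paths, then job.waitForCompletion(true) ...
      }
    }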
Re: 100x slower mapreduce compared to pig
It would be great if we could take a look at what you are doing in the UDF vs the Mapper. 100x slower does not make sense for the same job/logic; it's either the Mapper code, or maybe the cluster was busy at the time you scheduled the MapReduce job? Thanks, Prashant

On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

I am comparing the runtime of similar logic. The entire logic is exactly the same, but surprisingly the map reduce job that I submit is 100x slower. For pig I use a udf, and for hadoop I use a mapper only with the same logic as pig. Even the splits on the admin page are the same. Not sure why it's so slow. I am submitting the job like:

java -classpath .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar com.services.dp.analytics.hadoop.mapred.FormMLProcessor /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq /examples/output1/

How should I go about finding the root cause of why it's so slow? Any suggestions would be really appreciated. One of the things I noticed is that on the admin page of the map task list I see the status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728 but for pig the status is blank.
Re: Adding mahout math jar to hadoop mapreduce execution
How are you building the mapreduce jar? Try not to include the Mahout dist while building MR jar, and include it only on -libjars option. On Mon, Jan 30, 2012 at 10:33 PM, Daniel Quach danqu...@cs.ucla.edu wrote: I have been compiling my mapreduce with the jars in the classpath, and I believe I need to also add the jars as an option to -libjars to hadoop. However, even when I do this, I still get an error complaining about missing classes at runtime. (Compilation works fine). Here is my command: hadoop jar makevector.jar org.myorg.MakeVector -libjars /usr/local/mahout/math/target/mahout-math-0.6-SNAPSHOT.jar input/ output/ This is the error I receive: Exception in thread main java.lang.NoClassDefFoundError: org/apache/mahout/math/DenseVector I wonder if I am using the GenericOptionsParser incorrectly? I'm not sure if there is a deeper problem here.
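One thing worth noting here: -libjars is only honored when the driver hands its arguments to GenericOptionsParser, which is what ToolRunner does for you. A hedged sketch of a driver structured that way (the actual job wiring is elided):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Implementing Tool lets ToolRunner strip generic options such as
    // -libjars and -D before run() sees args, and adds those jars to the
    // task classpath for you.
    public class MakeVector extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
        // getConf() already carries the settings parsed from -libjars / -D
        // ... build and submit the Job here ...
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MakeVector(), args));
      }
    }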
Re: Killing hadoop jobs automatically
You might want to take a look at the kill command : hadoop job -kill jobid. Prashant On Sun, Jan 29, 2012 at 11:06 PM, praveenesh kumar praveen...@gmail.comwrote: Is there anyway through which we can kill hadoop jobs that are taking enough time to execute ? What I want to achieve is - If some job is running more than _some_predefined_timeout_limit, it should be killed automatically. Is it possible to achieve this, through shell scripts or any other way ? Thanks, Praveenesh
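There is no built-in job timeout, but one hedged sketch of automating the kill from the client side with the 0.20-era JobClient API follows; the 4-hour limit and running it periodically (e.g. from cron) are assumptions, not part of the original answer:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobStatus;
    import org.apache.hadoop.mapred.RunningJob;

    // Kills any job that has been running longer than a chosen timeout;
    // equivalent to scripting "hadoop job -kill <jobid>".
    public class LongJobKiller {
      public static void main(String[] args) throws Exception {
        long timeoutMs = 4L * 60 * 60 * 1000;            // assumed 4-hour limit
        JobClient client = new JobClient(new JobConf());
        for (JobStatus status : client.jobsToComplete()) {   // pending + running jobs
          long runtime = System.currentTimeMillis() - status.getStartTime();
          if (runtime > timeoutMs) {
            RunningJob job = client.getJob(status.getJobID());
            if (job != null) {
              job.killJob();
            }
          }
        }
      }
    }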
Re: Parallel CSV loader
I am assuming you want to move data between Hadoop and database. Please take a look at Sqoop. Thanks, Prashant Sent from my iPhone On Jan 24, 2012, at 9:19 AM, Edmon Begoli ebeg...@gmail.com wrote: I am looking to use Hadoop for parallel loading of CSV file into a non-Hadoop, parallel database. Is there an existing utility that allows one to pick entries, row-by-row, synchronized and in parallel and load into a database? Thank you in advance, Edmon
Re: hadoop filesystem cache
You mean something different from the DistributedCache? Sent from my iPhone

On Jan 14, 2012, at 5:30 PM, Rita rmorgan...@gmail.com wrote:

After reading this article, http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I was wondering if there was a filesystem cache for hdfs. For example, if a large file (10 gigabytes) keeps getting accessed on the cluster, instead of fetching it from the network every time, why not store the content of the file locally on the client itself? A use case on the client would be like this:

<property>
  <name>dfs.client.cachedirectory</name>
  <value>/var/cache/hdfs</value>
</property>

<property>
  <name>dfs.client.cachesize</name>
  <description>in megabytes</description>
  <value>10</value>
</property>

Any thoughts on a feature like this?

-- --- Get your facts first, then you can distort them as you please.--
Re: increase number of map tasks
1. Performance tuning/optimization - any good suggestions or links?

Take a look at http://wiki.datameer.com/documentation/current/Hadoop+Cluster+Configuration+Tips

2. Logging - If I do any logging in a map/reduce class, where will the logging or System.out information be written?

Be careful while doing so, since on large amounts of data you can fill up disk on the datanodes very quickly. You can find the logs through the jobtracker page by clicking on specific map and reduce tasks.

3. How do we reuse the JVM? Map task creation takes time.

Look at mapred.job.reuse.jvm.num.tasks

4. Different types of spills - how do we avoid them?

Depends on what is causing the spills. You can have spills on both the Map and Reduce side; adjusting config properties such as io.sort.mb and io.sort.factor (and a few others on the Reduce side) helps. Tom White's book has a good explanation of these.

Thanks, Prashant Kommireddi

On Thu, Jan 12, 2012 at 8:10 AM, screen satish.se...@hcl.in wrote:

Thanks. Separate files for line items have created 10 map tasks, of which only some are in running state (given by max map reduce tasks); the rest are in wait. So if I have 8 CPUs and max_map_tasks is 7, 3 are in wait state. I can see 7 CPUs at 90-95% utilization.

1. Performance tuning/optimization - any good suggestions or links?
2. Logging - If I do any logging in a map/reduce class, where will the logging or System.out information be written?
3. How do we reuse the JVM? Map task creation takes time.
4. Different types of spills - how do we avoid them?

-- View this message in context: http://old.nabble.com/increase-number-of-map-tasks-tp33107775p33128748.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
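A hedged sketch of setting the JVM-reuse and spill-related properties mentioned above in a job configuration; the values are illustrative, not recommendations:

    import org.apache.hadoop.mapred.JobConf;

    public class TuningExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1); // -1 reuses the JVM for unlimited tasks of the same job
        conf.setInt("io.sort.mb", 200);                    // map-side sort buffer in MB (default 100)
        conf.setInt("io.sort.factor", 50);                 // streams merged at once during sorts (default 10)
        // ... the rest of the job setup would go here ...
      }
    }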
Re: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
Hi Hao, Ideally you would want to leave out a core each for the TaskTracker and DataNode processes on each node. The rest could be used for maps and reducers. Thanks, Prashant

2012/1/10 hao.wang hao.w...@ipinyou.com

Hi, Thanks for your help, your suggestion is very useful. I have another question: should the sum of maps and reduces equal the total number of cores? regards! 2012-01-10 hao.wang

From: Harsh J  Sent: 2012-01-10 16:44:07  To: common-user  Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Hello Hao, Am sorry if I confused you. By CPUs I meant the CPUs visible to your OS (/proc/cpuinfo), so yes, the total number of cores.

On 10-Jan-2012, at 12:39 PM, hao.wang wrote:

Hi, Thanks for your reply! According to your suggestion, maybe I can't apply it to our hadoop cluster, because each server in our hadoop cluster contains just 2 CPUs. So I think maybe you mean the core count and not the CPU count in each server? I am looking forward to your reply. regards! 2012-01-10 hao.wang

From: Harsh J  Sent: 2012-01-10 11:33:38  To: common-user  Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Hello again, Try a 4:3 ratio between maps and reduces, against a total # of available CPUs per node (minus one or two, for DN and HBase if you run those). Then tweak it as you go (more map-only loads or more map-reduce loads, that depends on your usage, and you can tweak the ratio accordingly over time -- changing those props does not need JobTracker restarts, just TaskTracker).

On 10-Jan-2012, at 8:17 AM, hao.wang wrote:

Hi, Thanks for your reply! I had already read those pages before; can you give me some more specific suggestions about how to choose the values of mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum according to our cluster configuration, if possible? regards! 2012-01-10 hao.wang

From: Harsh J  Sent: 2012-01-09 23:19:21  To: common-user  Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

Hi, Please read http://hadoop.apache.org/common/docs/current/single_node_setup.html to learn how to configure Hadoop using the various *-site.xml configuration files, and then follow http://hadoop.apache.org/common/docs/current/cluster_setup.html to achieve optimal configs for your cluster.

On 09-Jan-2012, at 5:50 PM, hao.wang wrote:

Hi all, Our hadoop cluster has 22 nodes, including one namenode, one jobtracker and 20 datanodes. Each node has 2 * 12 cores with 32G RAM. Could anyone tell me how to configure the following parameters: mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum regards! 2012-01-09 hao.wang
Re: Hadoop MySQL database access
By design, reduce would start only after all the maps finish. There is no way for the reduce to begin grouping/merging by key unless all the maps have finished. Sent from my iPhone

On Dec 28, 2011, at 8:53 AM, JAGANADH G jagana...@gmail.com wrote:

Hi All, I wrote a map reduce program to fetch data from MySQL and process the data (word count). The program executes successfully, but I noticed that the reduce task starts only after the map task finishes. Is there any way to run the map and reduce in parallel? The program fetches data from MySQL and writes the processed output to hdfs. I am using hadoop in pseudo-distributed mode.

-- ** JAGANADH G http://jaganadhg.in *ILUGCBE* http://ilugcbe.org.in
Re: Another newbie - problem with grep example
Seems like you do not have /user/MyId/input/conf on HDFS. Try this. cd $HADOOP_HOME_DIR (this should be your hadoop root dir) hadoop fs -put conf input/conf And then run the MR job again. -Prashant Kommireddi On Fri, Dec 23, 2011 at 3:40 PM, Pat Flaherty p...@well.com wrote: Hi, Installed 0.22.0 on CentOS 5.7. I can start dfs and mapred and see their processes. Ran the first grep example: bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'. It seems the correct jar name is hadoop-mapred-examples-0.22.0.jar - there are no other hadoop*examples*.jar files in HADOOP_HOME. Didn't work. Then found and tried pi (compute pi) - that works, so my installation is to some degree of approximation good. Back to grep. It fails with java.io.FileNotFoundException: File does not exist: /user/MyId/input/conf Found and ran bin/hadoop fs -ls. OK these directory names are internal to hadoop (I assume) because Linux has no idea of /user. And the directory is there - but the program is failing. Any suggestions; where to start; etc? Thanks - Pat
Re: Configure hadoop scheduler
I am guessing you are trying to use the FairScheduler but you have specified CapacityScheduler in your configuration. You need to change mapreduce.jobtracker.scheduler to FairScheduler. Sent from my iPhone On Dec 20, 2011, at 8:51 AM, Merto Mertek masmer...@gmail.com wrote: Hi, I am having problems with changing the default hadoop scheduler (i assume that the default scheduler is a FIFO scheduler). I am following the guide located in hadoop/docs directory however I am not able to run it. Link for scheduling administration returns an http error 404 ( http://localhost:50030/scheduler ). In the UI under scheduling information I can see only one queue named default. mapred-site.xml file is accessible because when changing a port for a jobtracker I can see a daemon running with a changed port. Variable $HADOOP_CONFIG_DIR was added to .bashrc, however that did not solve the problem. I tried to rebuild hadoop, manualy place the fair scheduler jar in hadoop/lib and changed the hadoop classpath in hadoop-env.sh to point to the lib folder, but without success. The only info of the scheduler that is seen in the jobtracker log is the folowing info: Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1) I am working on this several days and running out of ideas... I am wondering how to fix it and where to check currently active scheduler parameters? Config files: mapred-site.xml http://pastebin.com/HmDfWqE1 allocation.xml http://pastebin.com/Uexq7uHV Tried versions: 0.20.203 and 204 Thank you
Re: Regarding pointers for LZO compression in Hive and Hadoop
http://code.google.com/p/hadoop-gpl-packing/ Thanks, Prashant On Wed, Dec 14, 2011 at 11:32 AM, Abhishek Pratap Singh manu.i...@gmail.com wrote: Hi, I m looking for some useful docs on enabling LZO on hadoop cluster. I tried few of the blogs, but somehow its not working. Here is my requirement. I have a hadoop 0.20.2 and Hive 0.6. I have some tables with 1.5 TB of data, i want to compress them using LZO and enable LZO in hive as well as in hadoop. Let me know if you have any useful docs or pointers for the same. Regards, Abhishek
Re: More cores Vs More Nodes ?
Hi Brad, how many tasktrackers did you have on each node in both cases? Thanks, Prashant Sent from my iPhone

On Dec 13, 2011, at 9:42 AM, Brad Sarsfield b...@bing.com wrote:

Praveenesh, Your question is not naïve; in fact, optimal hardware design can ultimately be a very difficult question to answer on what would be better. If you made me pick one without much information I'd go for more machines. But... It all depends; and there is no right answer :)

More machines
+ May run your workload faster
+ Will give you a higher degree of reliability protection from node / hardware / hard drive failure.
+ More aggregate IO capabilities
- capex / opex may be higher than allocating more cores

More cores
+ May run your workload faster
+ More cores may allow for more tasks to run on the same machine
+ More cores/tasks may reduce network contention and increase task-to-task data flow performance.

Notice "May run your workload faster" is in both, as it can be very workload dependent.

My Experience: I did a recent experiment and found that, given the same number of cores (64) with the exact same network / machine configuration:
A: I had 8 machines with 8 cores
B: I had 28 machines with 2 cores (and 1x8 core head node)
B was able to outperform A by 2x using teragen and terasort. These machines were running in a virtualized environment, where some of the IO capabilities behind the scenes were being regulated to 400Mbps per node when running in the 2 core configuration vs 1Gbps on the 8 core. So I would expect the non-throttled scenario to work even better.

~Brad

-Original Message- From: praveenesh kumar [mailto:praveen...@gmail.com] Sent: Monday, December 12, 2011 8:51 PM To: common-user@hadoop.apache.org Subject: More cores Vs More Nodes ?

Hey Guys, So I have a very naive question in my mind regarding Hadoop cluster nodes: more cores or more nodes? Shall I spend money on going from 2-core to 4-core machines, or spend money on buying more nodes with fewer cores, e.g. 2 machines of 2 cores each? Thanks, Praveenesh
Re: Create a single output per each mapper
Take a look at cleanup() method on Mapper. Thanks, Prashant Sent from my iPhone On Dec 12, 2011, at 8:46 PM, Shi Yu sh...@uchicago.edu wrote: Hi, Suppose I have two mappers, each mapper is assigned 10 lines of data. I want to set a counter for each mapper, counting and accumulating, then output the counter value to the reducer when the mapper finishes processing all the assigned lines. So I want the mapper outputs values only when there is no further incoming data (when that mapper closes). Is this doable? How? thanks! Shi
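A minimal sketch of that cleanup() pattern with the new-API Mapper; the input/output types and the counting logic are illustrative only:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Accumulate across all records seen by this mapper and emit a single
    // value from cleanup(), which runs once after the last input record.
    public class CountingMapper
        extends Mapper<LongWritable, Text, NullWritable, LongWritable> {

      private long counter = 0;

      @Override
      protected void map(LongWritable key, Text value, Context context) {
        counter++;  // accumulate per record; nothing is emitted here
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        context.write(NullWritable.get(), new LongWritable(counter));
      }
    }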
Re: OOM Error Map output copy.
Arun, I faced the same issue, and increasing the # of reducers fixed the problem. I was initially under the impression the MR framework spills to disk if data is too huge to keep in memory; however, on extraordinarily large reduce inputs this was not the case, and the job failed while trying to assign the in-memory buffer:

    private MapOutput shuffleInMemory(MapOutputLocation mapOutputLoc,
                                      URLConnection connection,
                                      InputStream input,
                                      int mapOutputLength,
                                      int compressedLength)
        throws IOException, InterruptedException {
      // Reserve ram for the map-output
      ...
      // Copy map-output into an in-memory buffer
      byte[] shuffleData = new byte[mapOutputLength];

-Prashant Kommireddi

On Fri, Dec 9, 2011 at 10:29 AM, Arun C Murthy a...@hortonworks.com wrote:

Moving to mapreduce-user@, bcc common-user@. Please use project specific lists.

Niranjan, If you average 0.5G of output per map, it's 5000 maps * 0.5G = 2.5TB over 12 reduces, i.e. nearly 250G per reduce - compressed! If you think you have 4:1 compression you are doing nearly a terabyte per reducer... which is way too high! I'd recommend you bump to somewhere around 1000 reduces to get to 2.5G (compressed) per reducer for your job. If your compression ratio is 2:1, try 500 reduces and so on. If you are worried about other users, use the CapacityScheduler and submit your job to a queue with a small capacity and max-capacity to restrict your job to 10 or 20 concurrent reduces at a given point. Arun

On Dec 7, 2011, at 10:51 AM, Niranjan Balasubramanian wrote:

All, I am encountering the following out-of-memory error during the reduce phase of a large job.

Map output copy failure : java.lang.OutOfMemoryError: Java heap space
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1669)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1529)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1378)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1310)

I tried increasing the memory available using mapred.child.java.opts but that only helps a little. The reduce task eventually fails again. Here are some relevant job configuration details:

1. The input to the mappers is about 2.5 TB (LZO compressed). The mappers filter out a small percentage of the input (less than 1%).
2. I am currently using 12 reducers and I can't increase this count by much, to ensure availability of reduce slots for other users.
3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC
4. mapred.job.shuffle.input.buffer.percent -- 0.70
5. mapred.job.shuffle.merge.percent -- 0.66
6. mapred.inmem.merge.threshold -- 1000
7. I have nearly 5000 mappers which are supposed to produce LZO compressed outputs. The logs seem to indicate that the map outputs range between 0.3G and 0.8GB.

Does anything here seem amiss? I'd appreciate any input on what settings to try. I can try different reduced values for the input buffer percent and the merge percent. Given that the job runs for about 7-8 hours before crashing, I would like to make some informed choices if possible. Thanks. ~ Niranjan.
Re: Hadoop Comic
Here you go: https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1&hl=en_US&pli=1 Thanks, Prashant

On Wed, Dec 7, 2011 at 1:47 AM, shreya@cognizant.com wrote:

Hi, Can someone please send me the Hadoop comic? Saw references about it in the mailing list. Regards, Shreya
Re: how to integrate snappy into hadoop
I had to struggle a bit while building Snappy for Hadoop 0.20.2 on Ubuntu. However, I have now been able to install it on a 10 node cluster and it works great for map output compression. Please check these notes; they may help in addition to the official Hadoop-Snappy notes. Tom White (amongst a few others) is the best person to answer this since he is working directly on this integration project.

Installing Maven 3: http://www.discursive.com/blog/4636

Snappy build: http://code.google.com/p/hadoop-snappy/

Creating the symbolic link for BUILD to pass (THIS IS IMPORTANT):
sudo ln -s /home/pkommireddi/dev/tools/Linux/jdk/jdk1.6.0_21_x64/jre/lib/amd64/server/libjvm.so /usr/local/lib/

Additional build notes: http://shanky.org/2011/10/17/build-hadoop-from-source/

Install commands I issued on localhost; these were notes for myself. There are dependencies that the Snappy build needs, and you might NOT need all of the commands that I have issued below. Please refer to the official notes and install accordingly.

pkommireddi@pkommireddi-wsl:~$ history | grep install
 1678 sudo apt-get install python-software-properties
 1681 sudo apt-get install maven
 1750 make install
 1751 ./configure && make && sudo make install
 1752 sudo apt-get install zlibc zlib1g zlib1g-dev
 1763 sudo apt-get install subversion
 1776 ./configure && make && sudo make install
 1824 sudo apt-get install zlibc zlib1g zlib1g-dev
 1846 make install
 1847 sudo make install
 1858 make installcheck
 1861 sudo make install
 1862 sudo make installcheck
 1904 sudo apt-get install libtool
 1907 sudo apt-get install automake

Thanks, Prashant Kommireddi

On Wed, Dec 7, 2011 at 5:39 PM, Jinyan Xu jinyan...@exar.com wrote:

Hi, Does anyone else have experience integrating snappy into hadoop? Please help me with it. I find Google doesn't provide hadoop-snappy now: "Hadoop-snappy is integrated into Hadoop Common (JUN 2011). Hadoop-Snappy can be used as an add-on for recent (released) versions of Hadoop that do not provide Snappy Codec support yet. Hadoop-Snappy is being kept in synch with Hadoop Common." Thanks!
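For the map output compression mentioned at the top of this message, the 0.20-era job-level settings look roughly like this; it is a sketch and assumes the Snappy native libraries and codec jar built above are already installed on every node:

    import org.apache.hadoop.conf.Configuration;

    public class SnappyMapOutput {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Compress only the intermediate map output that is shuffled to reducers.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.set("mapred.map.output.compression.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");
        // ... build the Job/JobConf from this Configuration as usual ...
      }
    }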
Re: how to integrate snappy into hadoop
I have not tried it with HBase, and yes 0.20.2 is not compatible with it. What is the error you receive when you try compiling Snappy? I don't think compiling Snappy would be dependent on HBase. 2011/12/7 Jinyan Xu jinyan...@exar.com Hi Prashant Kommireddi, Last week, I read build-hadoop-from-source and follow it, but I failed to compile hbase with mvn compile -Dsnappy . Did you install HBase 0.90.2? According build-hadoop-from-source HBase 0.90.2 is incompatible with Hadoop0.20.2-release. -Original Message- From: Prashant Kommireddi [mailto:prash1...@gmail.com] Sent: 2011年12月8日 10:13 To: common-user@hadoop.apache.org Subject: Re: how to integrate snappy into hadoop I had to struggle a bit while building Snappy for Hadoop 0.20.2 on Ubuntu. However, I have now been able to install it on a 10 node cluster and it works great for map output compression. Please check these notes, may be it might help in addition to official Hadoop-Snappy notes. Tom White (amongst a few others) is the best person to answer this since he is working directly with this integration project. *Installing Maven 3* http://www.discursive.com/blog/4636 *Snappy build* http://code.google.com/p/hadoop-snappy/ *Creating the symbolic link for BUILD to pass:* THIS IS IMPORTANT sudo ln -s /home/pkommireddi/dev/tools/Linux/jdk/jdk1.6.0_21_x64/jre/lib/amd64/server/libjvm.so /usr/local/lib/ *Additional build notes:* http://shanky.org/2011/10/17/build-hadoop-from-source/ Install commands I issued on localhost, these were notes for myself. There are dependencies that Snappy build needs, and you might NOT need all of the commands that I have issues below. Please refer to official notes and install accordingly. pkommireddi@pkommireddi-wsl:~$ history | grep install 1678 sudo apt-get install python-software-properties 1681 sudo apt-get install maven 1750 make install 1751 ./configure make sudo make install 1752 sudo apt-get install zlibc zlib1g zlib1g-dev 1763 sudo apt-get install subversion 1776 ./configure make sudo make install 1824 sudo apt-get install zlibc zlib1g zlib1g-dev 1846 make install 1847 sudo make install 1858 make installcheck 1861 sudo make install 1862 sudo make installcheck 1904 sudo apt-get install libtool 1907 sudo apt-get install automake Thanks, Prashant Kommireddi On Wed, Dec 7, 2011 at 5:39 PM, Jinyan Xu jinyan...@exar.com wrote: Hi , Anyone else have the experience integrating snappy into hadoop ? help me with it I find google doesn't provide the hadoop-snappy now : Hadoop-snappy is integrated into Hadoop Common(JUN 2011). Hadoop-Snappy can be used as an add-on for recent (released) versions of Hadoop that do not provide Snappy Codec support yet. Hadoop-Snappy is being kept in synch with Hadoop Common. Thanks! The information and any attached documents contained in this message may be confidential and/or legally privileged. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, dissemination, or reproduction is strictly prohibited and may be unlawful. If you are not the intended recipient, please contact the sender immediately by return e-mail and destroy all copies of the original message. The information and any attached documents contained in this message may be confidential and/or legally privileged. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, dissemination, or reproduction is strictly prohibited and may be unlawful. 
Re: HDFS Explained as Comics
Thanks Maneesh. Quick question, does a client really need to know Block size and replication factor - A lot of times client has no control over these (set at cluster level) -Prashant Kommireddi On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges dejan.men...@gmail.comwrote: Hi Maneesh, Thanks a lot for this! Just distributed it over the team and comments are great :) Best regards, Dejan On Wed, Nov 30, 2011 at 9:28 PM, maneesh varshney mvarsh...@gmail.com wrote: For your reading pleasure! PDF 3.3MB uploaded at (the mailing list has a cap of 1MB attachments): https://docs.google.com/open?id=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1 Appreciate if you can spare some time to peruse this little experiment of mine to use Comics as a medium to explain computer science topics. This particular issue explains the protocols and internals of HDFS. I am eager to hear your opinions on the usefulness of this visual medium to teach complex protocols and algorithms. [My personal motivations: I have always found text descriptions to be too verbose as lot of effort is spent putting the concepts in proper time-space context (which can be easily avoided in a visual medium); sequence diagrams are unwieldy for non-trivial protocols, and they do not explain concepts; and finally, animations/videos happen too fast and do not offer self-paced learning experience.] All forms of criticisms, comments (and encouragements) welcome :) Thanks Maneesh
Re: HDFS Explained as Comics
Sure, it's just a case of how readers interpret it.

1. The client is required to specify block size and replication factor each time
2. The client does not need to worry about it, since an admin has set the properties in the default configuration files

A client would not be allowed to override the default configs if they are set final (well, there are ways to go around that as well, as you suggest, by using create() :) The information is great and helpful. Just want to make sure a beginner who wants to write a WordCount in MapReduce does not worry about specifying block size and replication factor in his code.

Thanks, Prashant

On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney mvarsh...@gmail.com wrote:

Hi Prashant, Others may correct me if I am wrong here.. The client (org.apache.hadoop.hdfs.DFSClient) has knowledge of block size and replication factor. In the source code, I see the following in the DFSClient constructor:

    defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);
    defaultReplication = (short) conf.getInt("dfs.replication", 3);

My understanding is that the client considers the following chain for the values:
1. Manual values (the long form constructor; when a user provides these values)
2. Configuration file values (these are cluster level defaults: dfs.block.size and dfs.replication)
3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3)

Moreover, in org.apache.hadoop.hdfs.protocol.ClientProtocol the API to create a file is

    void create(..., short replication, long blocksize);

I presume it means that the client already has knowledge of these values and passes them to the NameNode when creating a new file. Hope that helps. thanks -Maneesh

On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi prash1...@gmail.com wrote:

Thanks Maneesh. Quick question: does a client really need to know block size and replication factor? A lot of times the client has no control over these (set at cluster level). -Prashant Kommireddi

On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges dejan.men...@gmail.com wrote:

Hi Maneesh, Thanks a lot for this! Just distributed it over the team and comments are great :) Best regards, Dejan

On Wed, Nov 30, 2011 at 9:28 PM, maneesh varshney mvarsh...@gmail.com wrote:

For your reading pleasure! PDF 3.3MB uploaded at (the mailing list has a cap of 1MB attachments): https://docs.google.com/open?id=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1 Appreciate if you can spare some time to peruse this little experiment of mine to use Comics as a medium to explain computer science topics. This particular issue explains the protocols and internals of HDFS. I am eager to hear your opinions on the usefulness of this visual medium to teach complex protocols and algorithms. [My personal motivations: I have always found text descriptions to be too verbose, as a lot of effort is spent putting the concepts in proper time-space context (which can be easily avoided in a visual medium); sequence diagrams are unwieldy for non-trivial protocols, and they do not explain concepts; and finally, animations/videos happen too fast and do not offer a self-paced learning experience.] All forms of criticisms, comments (and encouragements) welcome :) Thanks Maneesh
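Coming back to whether a client can override block size and replication: a hedged sketch of creating one file with explicit values through the FileSystem API; the path and values are illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Per-file overrides; most jobs never call this overload and simply
    // inherit dfs.block.size / dfs.replication from the cluster config.
    public class CreateWithOverrides {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(
            new Path("/user/demo/example.dat"),  // hypothetical path
            true,                                 // overwrite if it exists
            4096,                                 // io buffer size
            (short) 2,                            // replication factor
            128L * 1024 * 1024);                  // block size: 128 MB
        out.writeUTF("hello");
        out.close();
      }
    }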
Re: Passing data files via the distributed cache
I believe you want to ship data to each node in your cluster before MR begins so the mappers can access files local to their machine. Hadoop tutorial on YDN has some good info on this. http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata -Prashant Kommireddi On Fri, Nov 25, 2011 at 1:05 AM, Andy Doddington a...@doddington.netwrote: I have a series of mappers that I would like to be passed data using the distributed cache mechanism. At the moment, I am using HDFS to pass the data, but this seems wasteful to me, since they are all reading the same data. Is there a piece of example code that shows how data files can be placed in the cache and accessed by mappers? Thanks, Andy Doddington
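A minimal sketch of the DistributedCache flow described in that tutorial, using the old 0.20 API; the class name and HDFS path are hypothetical:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheExample {
      // Driver side: register a file already in HDFS; the framework copies it
      // once to each node's local disk before any task on that node runs.
      public static void setUpCache(JobConf conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/user/demo/lookup.dat"), conf); // hypothetical path
      }

      // Task side (e.g. in Mapper.configure()): read the node-local copies
      // with plain java.io instead of going back over the network.
      public static Path[] localCopies(JobConf conf) throws Exception {
        return DistributedCache.getLocalCacheFiles(conf);
      }
    }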