Re: Job history logging

2012-09-17 Thread Prashant Kommireddi
Thanks Harsh.

That doesn't seem like a correct assumption, does it? In the case where disk
space is exhausted and the JT stops writing history logs, does that mean we
require a JT restart for logging to be enabled again?

In my case, I am seeing JT trying to write logs with a different user
than the superuser. I am not sure why this is happening either, but
the attempt to write fails as the other user does not have
permissions.

On Sep 14, 2012, at 7:11 PM, Harsh J ha...@cloudera.com wrote:

 I guess the reason is that it assumes it can't write history files
 after that point, and skips the rest of the work?

 On Sat, Sep 15, 2012 at 3:07 AM, Prashant Kommireddi
 prash1...@gmail.com wrote:
 Hi All,

 I have a question about job history logging. It seems history logging is
 disabled if file creation fails; is there a reason this is done?
 The following snippet is from JobHistory.JobInfo.logSubmitted() in
 Hadoop 0.20.2:


    // Log the history meta info
    JobHistory.MetaInfoManager.logMetaInfo(writers);

    // add to writer as well
    JobHistory.log(writers, RecordTypes.Job,
        new Keys[]{Keys.JOBID, Keys.JOBNAME, Keys.USER,
                   Keys.SUBMIT_TIME, Keys.JOBCONF},
        new String[]{jobId.toString(), jobName, user,
                     String.valueOf(submitTime),
                     jobConfPath}
    );

  } catch (IOException e) {
    LOG.error("Failed creating job history log file, disabling history", e);
    disableHistory = true;
  }
}


 Thanks,



 --
 Harsh J


Re: IOException: too many length or distance symbols

2012-07-29 Thread Prashant Kommireddi
Thanks Harsh.

On digging some more it appears there was a data corruption issue with
the file that caused the exception. After having regenerated the gzip
file from source I no longer see the issue.
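
For anyone hitting the same exception, a quick local sanity check before suspecting Hadoop is to stream the .gz file through a plain GZIPInputStream outside the cluster. This is just a sketch (the file path is hypothetical), not anything Hadoop-specific:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

// Local sanity check: fully decompress a gzip file and report corruption.
public class GzipCheck {
  public static void main(String[] args) throws IOException {
    String path = args.length > 0 ? args[0] : "/tmp/sample.gz"; // hypothetical path
    byte[] buf = new byte[64 * 1024];
    long total = 0;
    try (GZIPInputStream in = new GZIPInputStream(new FileInputStream(path))) {
      int n;
      while ((n = in.read(buf)) != -1) {
        total += n; // count decompressed bytes
      }
      System.out.println("OK, decompressed " + total + " bytes");
    } catch (IOException e) {
      // A corrupt stream typically fails here with an IOException/ZipException
      System.out.println("Corrupt gzip stream: " + e.getMessage());
    }
  }
}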


On Jul 20, 2012, at 8:48 PM, Harsh J ha...@cloudera.com wrote:

 Prashant,

 Can you add in some context on how these files were written, etc.?
 Perhaps open a JIRA with a sample file and test-case to reproduce
 this? Other env stuff with info on version of hadoop, etc. would help
 too.

 On Sat, Jul 21, 2012 at 2:05 AM, Prashant Kommireddi
 prash1...@gmail.com wrote:
 I am seeing these exceptions; does anyone know what might be causing them?
 A case of a corrupt file?

 java.io.IOException: too many length or distance symbols
at 
 org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native
 Method)
at 
 org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
at 
 org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
at 
 org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
at java.io.InputStream.read(InputStream.java:85)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at 
 org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:109)
at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
at 
 org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
at 
 org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)


 Thanks,
 Prashant



 --
 Harsh J


Re: EOFException at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)......

2012-05-27 Thread Prashant Kommireddi
I have seen this issue with large file writes using the SequenceFile writer.
I have not seen the same issue when testing with fairly small files (< 1 GB).
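
For reference, a minimal write/read sketch against the old 0.20 SequenceFile API (path and types are just examples). The main point is that the writer has to be close()d cleanly, since a truncated file is one common way to end up with an EOFException in the reader:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/example.seq"); // hypothetical path

    // Write: close() finishes the file; a writer that dies before close()
    // can leave a truncated file that readers fail on with EOFException.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < 1000; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
    } finally {
      writer.close();
    }

    // Read it back with the same key/value types.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      IntWritable key = new IntWritable();
      Text value = new Text();
      while (reader.next(key, value)) {
        // process key/value here
      }
    } finally {
      reader.close();
    }
  }
}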

On Fri, May 25, 2012 at 10:33 PM, Kasi Subrahmanyam
kasisubbu...@gmail.com wrote:

 Hi,
 If you are using a custom Writable object while passing data from the
 mapper to the reducer, make sure that readFields and write handle the
 same number of fields. It might be possible that you wrote data to a
 file using the custom Writable but later modified the custom Writable
 (like adding a new attribute to it) which the old data doesn't have.

 It might be a possibility, so please check once.

 On Friday, May 25, 2012, waqas latif wrote:

  Hi Experts,
 
  I am fairly new to Hadoop MapReduce and I was trying to run the matrix
  multiplication example presented by Mr. Norstadt at the following link:
  http://www.norstad.org/matrix-multiply/index.html. I can run it
  successfully with hadoop 0.20.2, but when I try to run it with hadoop 1.0.3
  I get the following error. Is it a problem with my hadoop configuration, or
  is it a compatibility problem in the code, which the author wrote for
  hadoop 0.20? Also, please guide me on how I can fix this error in either
  case. Here is the error I am getting.
 
  Exception in thread "main" java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:180)
 at java.io.DataInputStream.readFully(DataInputStream.java:152)
 at
  org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
 at
  org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486)
 at
  org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475)
 at
  org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470)
 at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:60)
 at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:87)
 at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:112)
 at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:150)
 at TestMatrixMultiply.testRandom(TestMatrixMultiply.java:278)
 at TestMatrixMultiply.main(TestMatrixMultiply.java:308)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 
  Thanks in advance
 
  Regards,
  waqas
 



Re: Namenode EOF Exception

2012-05-15 Thread Prashant Kommireddi
Thanks, I agree I need to upgrade :)

I was able to recover NN following your suggestions, and an additional
hack was to sync the namespaceID across data nodes with the namenode.

On May 14, 2012, at 11:48 AM, Harsh J ha...@cloudera.com wrote:

 True, I don't recall 0.20.2 (the original release that was a few years
 ago) carrying these fixes. You ought to upgrade that cluster to the
 current stable release for the many fixes you can benefit from :)

 On Mon, May 14, 2012 at 11:58 PM, Prashant Kommireddi
 prash1...@gmail.com wrote:
 Thanks Harsh. I am using 0.20.2, I see on the Jira this issue was
 fixed for 0.23?

 I will try out your suggestions and get back.

 On May 14, 2012, at 1:22 PM, Harsh J ha...@cloudera.com wrote:

 Your fsimage seems to have gone bad (is it 0-sized? I recall that as a
 known issue long since fixed).

 The easiest way is to fall back to the last available good checkpoint
 (From SNN). Or if you have multiple dfs.name.dirs, see if some of the
 other points have better/complete files on them, and re-spread them
 across after testing them out (and backing up the originals).

 Though what version are you running? Cause AFAIK most of the recent
 stable versions/distros include NN resource monitoring threads which
 should have placed your NN into safemode the moment all its disks ran
 near to out of space.

 On Mon, May 14, 2012 at 10:50 PM, Prashant Kommireddi
 prash1...@gmail.com wrote:
 Hi,

 I am seeing an issue where the Namenode does not start due to an EOFException. The
 disk was full and I cleared space up but I am unable to get past this
 exception. Any ideas on how this can be resolved?

 2012-05-14 10:10:44,018 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=hadoop
 2012-05-14 10:10:44,018 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 isPermissionEnabled=false
 2012-05-14 10:10:44,023 INFO
 org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
 Initializing FSNamesystemMetrics using context
 object:org.apache.hadoop.metrics.file.FileContext
 2012-05-14 10:10:44,024 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
 FSNamesystemStatusMBean
 2012-05-14 10:10:44,047 INFO org.apache.hadoop.hdfs.server.common.Storage:
 Number of files = 205470
 2012-05-14 10:10:44,844 ERROR
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
 initialization failed.
 java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:292)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:279)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
 2012-05-14 10:10:44,845 INFO org.apache.hadoop.ipc.Server: Stopping server
 on 54310
 2012-05-14 10:10:44,845 ERROR
 org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:292)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:279)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

 2012-05-14 10:10:44,846 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NameNode at
 gridforce-1

Re: Namenode EOF Exception

2012-05-14 Thread Prashant Kommireddi
Thanks Harsh. I am using 0.20.2, I see on the Jira this issue was
fixed for 0.23?

I will try out your suggestions and get back.

On May 14, 2012, at 1:22 PM, Harsh J ha...@cloudera.com wrote:

 Your fsimage seems to have gone bad (is it 0-sized? I recall that as a
 known issue long since fixed).

 The easiest way is to fall back to the last available good checkpoint
 (From SNN). Or if you have multiple dfs.name.dirs, see if some of the
 other points have better/complete files on them, and re-spread them
 across after testing them out (and backing up the originals).

 Though what version are you running? Cause AFAIK most of the recent
 stable versions/distros include NN resource monitoring threads which
 should have placed your NN into safemode the moment all its disks ran
 near to out of space.

 On Mon, May 14, 2012 at 10:50 PM, Prashant Kommireddi
 prash1...@gmail.com wrote:
 Hi,

 I am seeing an issue where the Namenode does not start due to an EOFException. The
 disk was full and I cleared space up but I am unable to get past this
 exception. Any ideas on how this can be resolved?

 2012-05-14 10:10:44,018 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=hadoop
 2012-05-14 10:10:44,018 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 isPermissionEnabled=false
 2012-05-14 10:10:44,023 INFO
 org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
 Initializing FSNamesystemMetrics using context
 object:org.apache.hadoop.metrics.file.FileContext
 2012-05-14 10:10:44,024 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
 FSNamesystemStatusMBean
 2012-05-14 10:10:44,047 INFO org.apache.hadoop.hdfs.server.common.Storage:
 Number of files = 205470
 2012-05-14 10:10:44,844 ERROR
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
 initialization failed.
 java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:292)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:279)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
 2012-05-14 10:10:44,845 INFO org.apache.hadoop.ipc.Server: Stopping server
 on 54310
 2012-05-14 10:10:44,845 ERROR
 org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1578)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:880)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
at
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:292)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:279)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

 2012-05-14 10:10:44,846 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NameNode at
 gridforce-1.internal.salesforce.com/10.0.201.159
 /



 --
 Harsh J


Re: java.io.IOException: Task process exit with nonzero status of 1

2012-05-11 Thread Prashant Kommireddi
You might be running out of disk space. Check for that on your cluster
nodes.

-Prashant

On Fri, May 11, 2012 at 12:21 AM, JunYong Li lij...@gmail.com wrote:

 Are there errors in the task output file?
 On jobtracker.jsp, click the Jobid link -> tasks link -> Taskid link ->
 Task logs link

 2012/5/11 Mohit Kundra mohit@gmail.com

  Hi ,
 
  I am a new user to hadoop. I have installed hadoop 0.19.1 on a single
  Windows machine.
  Its http://localhost:50030/jobtracker.jsp and
  http://localhost:50070/dfshealth.jsp pages are working fine but when i
 am
  executing  bin/hadoop jar hadoop-0.19.1-examples.jar pi 5 100
 
  It is showing below
 
  $ bin/hadoop jar hadoop-0.19.1-examples.jar pi 5 100
  cygpath: cannot create short name of D:\hadoop-0.19.1\logs
  Number of Maps = 5 Samples per Map = 100
  Wrote input for Map #0
  Wrote input for Map #1
  Wrote input for Map #2
  Wrote input for Map #3
  Wrote input for Map #4
  Starting Job
  12/05/11 12:07:26 INFO mapred.JobClient:
  Running job: job_20120513_0002
  12/05/11 12:07:27 INFO mapred.JobClient:  map 0% reduce 0%
  12/05/11 12:07:35 INFO mapred.JobClient: Task Id :
  attempt_20120513_0002_m_06_ 0, Status : FAILED
  java.io.IOException: Task process exit with nonzero status of 1.
  at org.apache.hadoop.mapred.TaskRunner.run (TaskRunner.java:425)
 
 
 
  Please tell me what is the root cause
 
  regards ,
  Mohit
 
 


 --
 Regards
 Junyong



Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-03 Thread Prashant Kommireddi
This seems like a matter of upgrading. I am not a Cloudera user so I would not know
much, but you might find more help by moving this to the Cloudera mailing list.

On Thu, May 3, 2012 at 2:51 AM, Austin Chungath austi...@gmail.com wrote:

 There is only one cluster. I am not copying between clusters.

 Say I have a cluster running apache 0.20.205 with 10 TB storage capacity
 and has about 8 TB of data.
 Now how can I migrate the same cluster to use cdh3 and use that same 8 TB
 of data.

 I can't copy 8 TB of data using distcp because I have only 2 TB of free
 space


 On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar nitinpawar...@gmail.com
 wrote:

  you can actually look at the distcp
 
  http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
 
  but this means that you have two different set of clusters available to
 do
  the migration
 
  On Thu, May 3, 2012 at 12:51 PM, Austin Chungath austi...@gmail.com
  wrote:
 
   Thanks for the suggestions,
   My concerns are that I can't actually copyToLocal from the dfs because
  the
   data is huge.
  
   Say if my hadoop was 0.20 and I am upgrading to 0.20.205 I can do a
   namenode upgrade. I don't have to copy data out of dfs.
  
   But here I am having Apache hadoop 0.20.205 and I want to use CDH3 now,
   which is based on 0.20
   Now it is actually a downgrade as 0.20.205's namenode info has to be
 used
   by 0.20's namenode.
  
   Any idea how I can achieve what I am trying to do?
  
   Thanks.
  
   On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar nitinpawar...@gmail.com
   wrote:
  
i can think of following options
   
1) write a simple get and put code which gets the data from DFS and
  loads
it in dfs
2) see if the distcp  between both versions are compatible
3) this is what I had done (and my data was hardly few hundred GB) ..
   did a
dfs -copyToLocal and then in the new grid did a copyFromLocal
   
On Thu, May 3, 2012 at 11:41 AM, Austin Chungath austi...@gmail.com
 
wrote:
   
 Hi,
 I am migrating from Apache hadoop 0.20.205 to CDH3u3.
 I don't want to lose the data that is in the HDFS of Apache hadoop
 0.20.205.
 How do I migrate to CDH3u3 but keep the data that I have on
 0.20.205.
 What is the best practice/ techniques to do this?

 Thanks & Regards,
 Austin

   
   
   
--
Nitin Pawar
   
  
 
 
 
  --
  Nitin Pawar
 



Re: Compressing map only output

2012-04-30 Thread Prashant Kommireddi
Yes. These are hadoop properties - using set is just a way for Pig to set
those properties in your job conf.
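
For a plain MapReduce job, a minimal sketch of the equivalent setup (new-API FileOutputFormat; the Gzip codec here is just an example choice):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyCompression {
  public static Job configure(Configuration conf) throws IOException {
    Job job = new Job(conf, "map-only-compressed"); // 0.20/1.x style constructor
    job.setNumReduceTasks(0);  // map-only: output is written straight to HDFS as part-m-* files

    // These calls set the job output compression properties for you
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); // codec choice is just an example

    return job;
  }
}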


On Mon, Apr 30, 2012 at 5:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 Is there a way in map-only jobs to compress the map output that gets
 stored on HDFS as part-m-* files? In Pig I used:

 Would these work form plain map reduce jobs as well?


 set output.compression.enabled true;

 set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;



Re: Distributing MapReduce on a computer cluster

2012-04-23 Thread Prashant Kommireddi
Shailesh, there's a lot that goes into distributing work across
tasks/nodes. It's not just about distributing work; fault tolerance,
data locality, etc. also come into play. It might be good to refer
to the Hadoop Apache docs or Tom White's definitive guide.

Sent from my iPhone

On Apr 23, 2012, at 11:03 AM, Shailesh Samudrala shailesh2...@gmail.com wrote:

 Hello,

 I am trying to design my own MapReduce Implementation and I want to know
 how hadoop is able to distribute its workload across multiple computers.
 Can anyone shed more light on this? thanks!


Re: Jobtracker history logs missing

2012-04-09 Thread Prashant Kommireddi
Anyone faced similar issue or knows what the issue might be?

Thanks in advance.

On Thu, Apr 5, 2012 at 10:52 AM, Prashant Kommireddi prash1...@gmail.com wrote:

 Thanks Nitin.

 I believe the config key you mentioned controls the task attempt logs
 that go under ${hadoop.log.dir}/userlogs.

 The ones that I mentioned are the job history logs that go under - 
 ${hadoop.log.dir}/history
 and are specified by the key hadoop.job.history.location. Are these
 cleaned up based on mapred.userlog.retain.hours too?

 Also, this is what I am seeing in history dir

 Available Conf files - Mar 3rd - April 5th
 Available Job files   - Mar 3rd - April 3rd

 There is no job file present after the 3rd of April, but conf files
 continue to be written.

 Thanks,
 Prashant




 On Thu, Apr 5, 2012 at 3:22 AM, Nitin Khandelwal 
 nitin.khandel...@germinait.com wrote:

 Hi Prashant,

 The userlogs for job are deleted after time specified by  *
 mapred.userlog.retain.hours*  property defined in mapred-site.xml
 (default
 is 24 Hrs).

 Thanks,
 Nitin

 On 5 April 2012 14:26, Prashant Kommireddi prash1...@gmail.com wrote:

  I am noticing something strange with JobTracker history logs on my
 cluster.
  I see configuration files (*_conf.xml) under /logs/history/ but none of
 the
  actual job logs. Anyone has ideas on what might be happening?
 
  Thanks,
 



 --


 Nitin Khandelwal





Re: Data Node is not Started

2012-04-06 Thread Prashant Kommireddi
Can you check the datanode logs? Maybe it's an incompatible namespace issue.

On Apr 6, 2012, at 11:13 AM, Sujit Dhamale sujitdhamal...@gmail.com wrote:

 Hi all,
 my DataNode is not starting, even after deleting the hadoop*.pid files
 from /tmp.


 Hadoop Version: hadoop-1.0.1.tar.gz
 Java version : java version 1.6.0_26
 Operating System : Ubuntu 11.10


 i did below procedure


 *hduser@sujit:~/Desktop/hadoop/bin$ jps*
 11455 Jps


 *hduser@sujit:~/Desktop/hadoop/bin$ start-all.sh*
 Warning: $HADOOP_HOME is deprecated.

 starting namenode, logging to
 /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-namenode-sujit.out
 localhost: starting datanode, logging to
 /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-datanode-sujit.out
 localhost: starting secondarynamenode, logging to
 /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-secondarynamenode-sujit.out
 starting jobtracker, logging to
 /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-jobtracker-sujit.out
 localhost: starting tasktracker, logging to
 /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-tasktracker-sujit.out

 *hduser@sujit:~/Desktop/hadoop/bin$ jps*
 11528 NameNode
 12019 SecondaryNameNode
 12355 TaskTracker
 12115 JobTracker
 12437 Jps


 *hduser@sujit:~/Desktop/hadoop/bin$ stop-all.sh*
 Warning: $HADOOP_HOME is deprecated.

 stopping jobtracker
 localhost: stopping tasktracker
 stopping namenode
 localhost: no datanode to stop
 localhost: stopping secondarynamenode


 *hduser@sujit:~/Desktop/hadoop/bin$ jps*
 13127 Jps


 *hduser@sujit:~/Desktop/hadoop/bin$ ls /tmp*
 hadoop-hduser-datanode.pid
 hsperfdata_hduserkeyring-meecr7
 ssh-JXYCAJsX1324
 hadoop-hduser-jobtracker.pid
 hsperfdata_sujit plugtmp
 unity_support_test.0
 hadoop-hduser-namenode.pid
 Jetty_0_0_0_0_50030_jobyn7qmkpulse-2L9K88eMlGn7
 virtual-hduser.Q8j5nJ
 hadoop-hduser-secondarynamenode.pid
 Jetty_0_0_0_0_50070_hdfsw2cu08   pulse-Ob9vyJcXyHZz
 hadoop-hduser-tasktracker.pid
 Jetty_0_0_0_0_50090_secondaryy6aanv  pulse-PKdhtXMmr18n

 *Deleted *.pid file :)

 hduser@sujit:~$ ls /tmp*
 hsperfdata_hduserpulse-2L9K88eMlGn7
 hsperfdata_sujit pulse-Ob9vyJcXyHZz
 Jetty_0_0_0_0_50030_jobyn7qmkpulse-PKdhtXMmr18n
 Jetty_0_0_0_0_50070_hdfsw2cu08   ssh-JXYCAJsX1324
 Jetty_0_0_0_0_50090_secondaryy6aanv  unity_support_test.0
 keyring-meecr7   virtual-hduser.Q8j5nJ
 plugtmp





 *hduser@sujit:~/Desktop/hadoop$ bin/hadoop namenode -format*
 Warning: $HADOOP_HOME is deprecated.

 12/04/06 23:23:22 INFO namenode.NameNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG:   host = sujit.(null)/127.0.1.1
 STARTUP_MSG:   args = [-format]
 STARTUP_MSG:   version = 1.0.1
 STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
 1243785; compiled by 'hortonfo' on Tue Feb 14 08:15:38 UTC 2012
 /
 Re-format filesystem in /app/hadoop/tmp/dfs/name ? (Y or N) Y
 12/04/06 23:23:25 INFO util.GSet: VM type   = 32-bit
 12/04/06 23:23:25 INFO util.GSet: 2% max memory = 17.77875 MB
 12/04/06 23:23:25 INFO util.GSet: capacity  = 2^22 = 4194304 entries
 12/04/06 23:23:25 INFO util.GSet: recommended=4194304, actual=4194304
 12/04/06 23:23:25 INFO namenode.FSNamesystem: fsOwner=hduser
 12/04/06 23:23:25 INFO namenode.FSNamesystem: supergroup=supergroup
 12/04/06 23:23:25 INFO namenode.FSNamesystem: isPermissionEnabled=true
 12/04/06 23:23:25 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
 12/04/06 23:23:25 INFO namenode.FSNamesystem: isAccessTokenEnabled=false
 accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
 12/04/06 23:23:25 INFO namenode.NameNode: Caching file names occuring more
 than 10 times
 12/04/06 23:23:26 INFO common.Storage: Image file of size 112 saved in 0
 seconds.
 12/04/06 23:23:26 INFO common.Storage: Storage directory
 /app/hadoop/tmp/dfs/name has been successfully formatted.
 12/04/06 23:23:26 INFO namenode.NameNode: SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NameNode at sujit.(null)/127.0.1.1
 /
 hduser@sujit:~/Desktop/hadoop$ bin/start-all.sh
 Warning: $HADOOP_HOME is deprecated.

 starting namenode, logging to
 /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-namenode-sujit.out
 localhost: starting datanode, logging to
 /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-datanode-sujit.out
 localhost: starting secondarynamenode, logging to
 /home/hduser/Desktop/hadoop/libexec/../logs/hadoop-hduser-secondarynamenode-sujit.out
 starting jobtracker, logging to
 

Re: Jobtracker history logs missing

2012-04-05 Thread Prashant Kommireddi
Thanks Nitin.

I believe the config key you mentioned controls the task attempt logs that
go under ${hadoop.log.dir}/userlogs.

The ones that I mentioned are the job history logs that go under -
${hadoop.log.dir}/history
and are specified by the key hadoop.job.history.location. Are these
cleaned up based on mapred.userlog.retain.hours too?

Also, this is what I am seeing in history dir

Available Conf files - Mar 3rd - April 5th
Available Job files   - Mar 3rd - April 3rd

There is no job file present after the 3rd of April, but conf files
continue to be written.
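
For reference, the two keys being discussed are separate settings. A tiny sketch that just prints what a node's mapred-site.xml resolves them to (the defaults shown are the ones mentioned in this thread):

import org.apache.hadoop.mapred.JobConf;

public class HistoryConfigCheck {
  public static void main(String[] args) {
    JobConf conf = new JobConf(); // loads mapred-site.xml from the classpath
    // Where the JobTracker writes per-job history files (per this thread, ${hadoop.log.dir}/history)
    System.out.println("hadoop.job.history.location = "
        + conf.get("hadoop.job.history.location", "${hadoop.log.dir}/history"));
    // Retention for task attempt logs under ${hadoop.log.dir}/userlogs, in hours
    System.out.println("mapred.userlog.retain.hours = "
        + conf.get("mapred.userlog.retain.hours", "24"));
  }
}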

Thanks,
Prashant



On Thu, Apr 5, 2012 at 3:22 AM, Nitin Khandelwal 
nitin.khandel...@germinait.com wrote:

 Hi Prashant,

 The userlogs for job are deleted after time specified by  *
 mapred.userlog.retain.hours*  property defined in mapred-site.xml (default
 is 24 Hrs).

 Thanks,
 Nitin

 On 5 April 2012 14:26, Prashant Kommireddi prash1...@gmail.com wrote:

  I am noticing something strange with JobTracker history logs on my
 cluster.
  I see configuration files (*_conf.xml) under /logs/history/ but none of
 the
  actual job logs. Anyone has ideas on what might be happening?
 
  Thanks,
 



 --


 Nitin Khandelwal



Re: Doubt from the book Definitive Guide

2012-04-04 Thread Prashant Kommireddi
Answers inline.

On Wed, Apr 4, 2012 at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I am going through the chapter How mapreduce works and have some
 confusion:

 1) The description of the Mapper below says that reducers get the output file
 using an HTTP call. But the description under The Reduce Side doesn't
 specifically say whether it's copied using HTTP. So, first confusion: is the
 output copied from mapper -> reducer or from reducer -> mapper? And second,
 is the call http:// or hdfs://?


Map output is written to local FS, not HDFS.


 2) My understanding was that mapper output gets written to HDFS, since I've
 seen part-m-0 files in HDFS. If mapper output is written to HDFS, then
 shouldn't reducers simply read it from HDFS instead of making HTTP calls to
 the tasktracker's location?

 Map output is sent to HDFS when reducer is not used.



 - from the book ---
 Mapper
 The output file’s partitions are made available to the reducers over HTTP.
 The number of worker threads used to serve the file partitions is
 controlled by the tasktracker.http.threads property; this setting is per
 tasktracker, not per map task slot. The default of 40 may need increasing
 for large clusters running large jobs.

 The Reduce Side
 Let’s turn now to the reduce part of the process. The map output file is
 sitting on the local disk of the tasktracker that ran the map task
 (note that although map outputs always get written to the local disk of the
 map tasktracker, reduce outputs may not be), but now it is needed by the
 tasktracker
 that is about to run the reduce task for the partition. Furthermore, the
 reduce task needs the map output for its particular partition from several
 map tasks across the cluster.
 The map tasks may finish at different times, so the reduce task starts
 copying their outputs as soon as each completes. This is known as the copy
 phase of the reduce task.
 The reduce task has a small number of copier threads so that it can fetch
 map outputs in parallel.
 The default is five threads, but this number can be changed by setting the
 mapred.reduce.parallel.copies property.



Re: Doubt from the book Definitive Guide

2012-04-04 Thread Prashant Kommireddi
Hi Mohit,

What would be the advantage? Reducers in most cases read data from all
the mappers. In the case where mappers were to write to HDFS, a
reducer would still need to read data from other datanodes across
the cluster.

Prashant

On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:

 Hi Mohit,

 On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 I am going through the chapter How mapreduce works and have some
 confusion:

 1) Below description of Mapper says that reducers get the output file
 using
 HTTP call. But the description under The Reduce Side doesn't
 specifically
 say if it's copied using HTTP. So first confusion, Is the output copied
 from mapper -> reducer or from reducer -> mapper? And second, is the call
 http:// or hdfs://

 The flow is simple as this:
 1. For M+R job, map completes its task after writing all partitions
 down into the tasktracker's local filesystem (under mapred.local.dir
 directories).
 2. Reducers fetch completion locations from events at JobTracker, and
 query the TaskTracker there to provide it the specific partition it
 needs, which is done over the TaskTracker's HTTP service (50060).

 So to clear things up - map doesn't send it to reduce, nor does reduce
 ask the actual map task. It is the task tracker itself that makes the
 bridge here.

 Note however, that in Hadoop 2.0 the transfer via ShuffleHandler would
 be over Netty connections. This would be much faster and more
 reliable.

 2) My understanding was that mapper output gets written to hdfs, since
 I've
 seen part-m-0 files in hdfs. If mapper output is written to HDFS then
 shouldn't reducers simply read it from hdfs instead of making http calls
 to
 tasktrackers location?

 A map-only job usually writes out to HDFS directly (no sorting done,
 cause no reducer is involved). If the job is a map+reduce one, the
 default output is collected to local filesystem for partitioning and
 sorting at map end, and eventually grouping at reduce end. Basically:
 Data you want to send to reducer from mapper goes to local FS for
 multiple actions to be performed on them, other data may directly go
 to HDFS.

 Reducers currently are scheduled pretty randomly but yes their
 scheduling can be improved for certain scenarios. However, if you are
 pointing that map partitions ought to be written to HDFS itself (with
 replication or without), I don't see performance improving. Note that
 the partitions aren't merely written but need to be sorted as well (at
 either end). To do that would need ability to spill frequently (cause
 we don't have infinite memory to do it all in RAM) and doing such a
 thing on HDFS would only mean slowdown.

 Thanks for clearing my doubts. In this case I was merely suggesting that
 if the mapper output (merged output in the end or the shuffle output) is
 stored in HDFS then reducers can just retrieve it from HDFS instead of
 asking tasktracker for it. Once reducer threads read it they can continue
 to work locally.



 I hope this helps clear some things up for you.

 --
 Harsh J



Re: Using a combiner

2012-03-14 Thread Prashant Kommireddi
It is a function of the number of spills on the map side, and I believe
the default is 3; so for every 3 spills, the combiner is run. This number
is configurable.
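
A minimal sketch of wiring a combiner in with the new API; the spill-threshold property named in the comment is my recollection for 0.20/1.x, so treat it as an assumption to verify:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinerSetup {

  // A sum reducer that is safe to reuse as a combiner: it is associative and
  // commutative, so the framework may run it zero, one, or many times.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static Job configure(Configuration conf) throws IOException {
    // I believe min.num.spills.for.combine (default 3) is the knob that controls
    // how many map-side spills must exist before the combiner runs during the
    // merge; treat the name as an assumption and verify for your version.
    conf.setInt("min.num.spills.for.combine", 3);

    Job job = new Job(conf, "job-with-combiner");
    job.setReducerClass(SumReducer.class);
    job.setCombinerClass(SumReducer.class); // the same class doubles as the combiner
    return job;
  }
}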

Sent from my iPhone

On Mar 14, 2012, at 3:26 PM, Gayatri Rao rgayat...@gmail.com wrote:

 Hi all,

 I have a quick query on using a combiner in an MR job. Is it true that the
 framework decides whether or not the combiner gets called?
 Can anyone please give more information on how this is done?

 Thanks,
 Gayatri


Re: 100x slower mapreduce compared to pig

2012-02-28 Thread Prashant Kommireddi
It would be great if we could take a look at what you are doing in the UDF vs
the Mapper.

100x slower does not make sense for the same job/logic; it's either the Mapper
code, or maybe the cluster was busy at the time you scheduled the MapReduce job.

Thanks,
Prashant

On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I am comparing runtime of similar logic. The entire logic is exactly same
 but surprisingly map reduce job that I submit is 100x slow. For pig I use
 udf and for hadoop I use mapper only and the logic same as pig. Even the
 splits on the admin page are same. Not sure why it's so slow. I am
 submitting job like:

 java -classpath

 .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
 com.services.dp.analytics.hadoop.mapred.FormMLProcessor

 /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
 /examples/output1/

 How should I go about looking the root cause of why it's so slow? Any
 suggestions would be really appreciated.



 One of the things I noticed is that on the admin page of map task list I
 see status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728 but
 for pig the status is blank.



Re: Adding mahout math jar to hadoop mapreduce execution

2012-01-30 Thread Prashant Kommireddi
How are you building the MapReduce jar? Try not to include the Mahout
distribution while building the MR jar, and include it only via the -libjars option.
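
One thing worth checking: -libjars is handled by GenericOptionsParser, which only runs if the driver goes through ToolRunner, and the generic options must come before the job arguments. A minimal driver sketch (class name and wiring are illustrative only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Driver that lets GenericOptionsParser consume -libjars, -D, -files, etc.
public class MakeVectorDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf(); // already has -libjars and -D options applied
    // ... set up and submit the Job here using conf ...
    System.out.println("remaining args: " + args.length);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // Invoked as: hadoop jar makevector.jar MakeVectorDriver \
    //   -libjars /path/to/mahout-math.jar input/ output/
    int rc = ToolRunner.run(new Configuration(), new MakeVectorDriver(), args);
    System.exit(rc);
  }
}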

On Mon, Jan 30, 2012 at 10:33 PM, Daniel Quach danqu...@cs.ucla.edu wrote:

 I have been compiling my mapreduce with the jars in the classpath, and I
 believe I need to also add the jars as an option to -libjars to hadoop.
 However, even when I do this, I still get an error complaining about
 missing classes at runtime. (Compilation works fine).

 Here is my command:
 hadoop jar makevector.jar org.myorg.MakeVector -libjars
 /usr/local/mahout/math/target/mahout-math-0.6-SNAPSHOT.jar input/ output/

 This is the error I receive:
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/mahout/math/DenseVector

 I wonder if I am using the GenericOptionsParser incorrectly? I'm not sure
 if there is a deeper problem here.



Re: Killing hadoop jobs automatically

2012-01-29 Thread Prashant Kommireddi
You might want to take a look at the kill command: hadoop job -kill <job-id>.
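
If you want the timeout enforced automatically, one option is a small watchdog around the old JobClient API. This is only a sketch from memory of the 0.20/1.x methods, so verify the calls against your version before relying on it:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

// Watchdog sketch: kill any job that has been running longer than a limit.
// Run it from cron or a loop; the 2-hour limit is just an example.
public class JobWatchdog {
  public static void main(String[] args) throws Exception {
    long maxRuntimeMs = 2L * 60 * 60 * 1000;
    JobClient client = new JobClient(new JobConf());

    for (JobStatus status : client.jobsToComplete()) {   // jobs still running or in prep
      long elapsed = System.currentTimeMillis() - status.getStartTime();
      if (elapsed > maxRuntimeMs) {
        RunningJob job = client.getJob(status.getJobID());
        if (job != null) {
          System.out.println("Killing " + status.getJobID() + " after " + elapsed + " ms");
          job.killJob();
        }
      }
    }
  }
}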

Prashant

On Sun, Jan 29, 2012 at 11:06 PM, praveenesh kumar praveen...@gmail.com wrote:

 Is there any way through which we can kill hadoop jobs that are taking
 too long to execute?

 What I want to achieve is - If some job is running more than
 _some_predefined_timeout_limit, it should be killed automatically.

 Is it possible to achieve this, through shell scripts or any other way ?

 Thanks,
 Praveenesh



Re: Parallel CSV loader

2012-01-24 Thread Prashant Kommireddi
I am assuming you want to move data between Hadoop and a database.
Please take a look at Sqoop.

Thanks,
Prashant

Sent from my iPhone

On Jan 24, 2012, at 9:19 AM, Edmon Begoli ebeg...@gmail.com wrote:

 I am looking to use Hadoop for parallel loading of CSV file into a
 non-Hadoop, parallel database.

 Is there an existing utility that allows one to pick entries,
 row-by-row, synchronized and in parallel and load into a database?

 Thank you in advance,
 Edmon


Re: hadoop filesystem cache

2012-01-14 Thread Prashant Kommireddi
You mean something different from the DistributedCache?
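
If the goal is making a file available locally to tasks, that is what the existing DistributedCache already does; a minimal sketch (the HDFS path is hypothetical), which copies the file to each tasktracker's local disk for the job rather than fetching it on every access:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class CacheExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Ship an HDFS file to every tasktracker's local disk for this job.
    // The path below is only an example.
    DistributedCache.addCacheFile(new URI("/data/lookup/big-reference-file"), conf);

    Job job = new Job(conf, "job-with-cached-file");
    // Inside a task, read the local copy via:
    //   DistributedCache.getLocalCacheFiles(context.getConfiguration())
    // ... set mapper/reducer, input/output paths, then job.waitForCompletion(true) ...
  }
}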

Sent from my iPhone

On Jan 14, 2012, at 5:30 PM, Rita rmorgan...@gmail.com wrote:

 After reading this article,
 http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I was
 wondering if there was a filesystem cache for HDFS. For example, if a large
 file (10 gigabytes) keeps getting accessed on the cluster, instead of
 fetching it from the network each time, why not store the content of the
 file locally on the client itself?  A use case on the client would be like this:



 <property>
  <name>dfs.client.cachedirectory</name>
  <value>/var/cache/hdfs</value>
 </property>


 <property>
 <name>dfs.client.cachesize</name>
 <description>in megabytes</description>
 <value>10</value>
 </property>


 Any thoughts of a feature like this?


 --
 --- Get your facts first, then you can distort them as you please.--


Re: increase number of map tasks

2012-01-12 Thread Prashant Kommireddi
1. Performance tuning/optimization: any good suggestions or links?
Take a look at
http://wiki.datameer.com/documentation/current/Hadoop+Cluster+Configuration+Tips

2. Logging: if I do any logging in a map/reduce class, where will the logging
or System.out information be written?
Be careful while doing so since on large amounts of data you can fill up
disk on datanodes very quickly. You can find the logs through the
jobtracker page by clicking on specific map and reduce tasks.

3. How do we reuse jvm? map tasks creation takes time.
Look at mapred.job.reuse.jvm.num.tasks

4. Different types of spills - how do we avoid them?
Depends on what is causing the spills. You can have spills on both the map and
reduce side, and adjusting config properties such as io.sort.mb,
io.sort.factor, and a few others on the reduce side can help. Tom White's book
has a good explanation of these.
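
For reference, a sketch of how those knobs would be set per job (the values are examples only; the right numbers depend on your data sizes and task heap):

import org.apache.hadoop.mapred.JobConf;

public class TuningSketch {
  public static JobConf tuned() {
    JobConf conf = new JobConf();

    // 3. Reuse the task JVM for an unlimited number of tasks (-1), or a fixed count.
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

    // 4. Map-side sort buffer and merge width; bigger values mean fewer spills,
    //    at the cost of task heap. Values here are examples, not recommendations.
    conf.setInt("io.sort.mb", 256);
    conf.setInt("io.sort.factor", 64);

    return conf;
  }
}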

Thanks,
Prashant Kommireddi


On Thu, Jan 12, 2012 at 8:10 AM, screen satish.se...@hcl.in wrote:


  Thanks. Separate files for line items have created 10 map tasks, out of which
  only some are in the running state (given by max map/reduce tasks); the rest
  are waiting. So if I have 8 CPUs and max_map_tasks is 7, then 3 are in the
  wait state.
  I can see 7 CPUs at 90-95% utilization.

  1. Performance tuning/optimization any good suggestions or links?

 2. Logging - If I do any logging in map/reduce class where will be logging
 or system.out information written?

 3. How do we reuse jvm? map tasks creation takes time.

 4. Different types of spills - how do we avoid them?

 --
 View this message in context:
 http://old.nabble.com/increase-number-of-map-tasks-tp33107775p33128748.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Re: how to set mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum

2012-01-10 Thread Prashant Kommireddi
Hi Hao,

Ideally you would want to leave out a core each for the TaskTracker and
DataNode processes on each node. The rest could be used for maps and
reducers.
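
As a rough worked example for the cluster described below (2 * 12 cores per node), using the 4:3 map-to-reduce split Harsh suggested; treat the numbers as a starting point, not a rule:

import org.apache.hadoop.mapred.JobConf;

public class SlotSizingSketch {
  public static void main(String[] args) {
    // Worked example for a node with 2 * 12 = 24 cores:
    int cores = 24;
    int reservedForDaemons = 2;              // one core each for TaskTracker and DataNode
    int slots = cores - reservedForDaemons;  // 22 slots to split between maps and reduces

    // A 4:3 map:reduce split of those 22 slots gives roughly 13 map and 9 reduce slots.
    int mapSlots = Math.round(slots * 4f / 7f);  // ~13
    int reduceSlots = slots - mapSlots;          // ~9

    // These are per-tasktracker settings, normally placed in mapred-site.xml on
    // each node rather than set in code; shown here only to tie the numbers together.
    JobConf conf = new JobConf();
    conf.setInt("mapred.tasktracker.map.tasks.maximum", mapSlots);
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", reduceSlots);
    System.out.println("map slots=" + mapSlots + ", reduce slots=" + reduceSlots);
  }
}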

Thanks,
Prashant

2012/1/10 hao.wang hao.w...@ipinyou.com

 Hi,
    Thanks for your help, your suggestion is very useful.
    I have another question, which is whether the sum of maps and reduces
  should equal the total number of cores.

 regards!


 2012-01-10



 hao.wang



 From: Harsh J
 Sent: 2012-01-10 16:44:07
 To: common-user
 Cc:
 Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and
 mapred.tasktracker.reduce.tasks.maximum

 Hello Hao,
 Am sorry if I confused you. By CPUs I meant the CPUs visible to your OS
 (/proc/cpuinfo), so yes the total number of cores.
 On 10-Jan-2012, at 12:39 PM, hao.wang wrote:
  Hi ,
 
  Thanks for your reply!
  According to your suggestion, maybe I can't apply it to our hadoop
 cluster, because each server in our hadoop cluster contains just 2 CPUs.
  So I think maybe you mean the core # and not the CPU # in each server?
  I am looking forward to your reply.
 
  regards!
 
 
  2012-01-10
 
 
 
  hao.wang
 
 
 
  From: Harsh J
  Sent: 2012-01-10 11:33:38
  To: common-user
  Cc:
  Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and
 mapred.tasktracker.reduce.tasks.maximum
 
  Hello again,
  Try a 4:3 ratio between maps and reduces, against a total # of available
 CPUs per node (minus one or two, for DN and HBase if you run those). Then
 tweak it as you go (more map-only loads or more map-reduce loads, that
 depends on your usage, and you can tweak the ratio accordingly over time --
 changing those props do not need JobTracker restarts, just TaskTracker).
  On 10-Jan-2012, at 8:17 AM, hao.wang wrote:
  Hi,
Thanks for your reply!
    I had already read the pages before. Can you give me some more
  specific suggestions about how to choose the values of
  mapred.tasktracker.map.tasks.maximum and
 mapred.tasktracker.reduce.tasks.maximum according to our cluster
 configuration if possible?
 
  regards!
 
 
  2012-01-10
 
 
 
  hao.wang
 
 
 
  From: Harsh J
  Sent: 2012-01-09 23:19:21
  To: common-user
  Cc:
  Subject: Re: how to set mapred.tasktracker.map.tasks.maximum and
 mapred.tasktracker.reduce.tasks.maximum
 
  Hi,
  Please read
 http://hadoop.apache.org/common/docs/current/single_node_setup.html to
 learn how to configure Hadoop using the various *-site.xml configuration
 files, and then follow
 http://hadoop.apache.org/common/docs/current/cluster_setup.html to
 achieve optimal configs for your cluster.
  On 09-Jan-2012, at 5:50 PM, hao.wang wrote:
  Hi all,
    Our hadoop cluster has 22 nodes, including one namenode, one
  jobtracker and 20 datanodes.
    Each node has 2 * 12 cores with 32G RAM.
    Could anyone tell me how to configure the following parameters:
   mapred.tasktracker.map.tasks.maximum
   mapred.tasktracker.reduce.tasks.maximum
 
  regards!
  2012-01-09
 
 
 
  hao.wang



Re: Hadoop MySQL database access

2011-12-28 Thread Prashant Kommireddi
By design reduce would start only after all the maps finish. There is
no way for the reduce to begin grouping/merging by key unless all the
maps have finished.

Sent from my iPhone

On Dec 28, 2011, at 8:53 AM, JAGANADH G jagana...@gmail.com wrote:

 Hi All,

 I wrote a map reduce program to fetch data from MySQL and process the
 data(word count).
 The program executes successfully, but I noticed that the reduce tasks
 start only after the map tasks finish.
 Is there any way to run the map and reduce in parallel.

 The program fetches data from MySQL and writes the processed output to
 hdfs.
 I am using hadoop in pseduo-distributed mode .
 --
 **
 JAGANADH G
 http://jaganadhg.in
 *ILUGCBE*
 http://ilugcbe.org.in


Re: Another newbie - problem with grep example

2011-12-23 Thread Prashant Kommireddi
Seems like you do not have /user/MyId/input/conf on HDFS.

Try this.

cd $HADOOP_HOME_DIR (this should be your hadoop root dir)
hadoop fs -put conf input/conf

And then run the MR job again.

-Prashant Kommireddi

On Fri, Dec 23, 2011 at 3:40 PM, Pat Flaherty p...@well.com wrote:

 Hi,

 Installed 0.22.0 on CentOS 5.7.  I can start dfs and mapred and see their
 processes.

 Ran the first grep example: bin/hadoop jar hadoop-*-examples.jar grep
 input output 'dfs[a-z.]+'.  It seems the correct jar name is
 hadoop-mapred-examples-0.22.0.jar - there are no other hadoop*examples*.jar
 files in HADOOP_HOME.

 Didn't work.  Then found and tried pi (compute pi) - that works, so my
 installation is to some degree of approximation good.

 Back to grep.  It fails with

  java.io.FileNotFoundException: File does not exist: /user/MyId/input/conf

 Found and ran bin/hadoop fs -ls.  OK these directory names are internal to
 hadoop (I assume) because Linux has no idea of /user.

 And the directory is there - but the program is failing.

 Any suggestions; where to start; etc?

 Thanks - Pat



Re: Configure hadoop scheduler

2011-12-20 Thread Prashant Kommireddi
I am guessing you are trying to use the FairScheduler but you have
specified the CapacityScheduler in your configuration. You need to change
mapred.jobtracker.taskScheduler to org.apache.hadoop.mapred.FairScheduler.

Sent from my iPhone

On Dec 20, 2011, at 8:51 AM, Merto Mertek masmer...@gmail.com wrote:

 Hi,

 I am having problems changing the default hadoop scheduler (I assume
 that the default scheduler is a FIFO scheduler).

 I am following the guide located in hadoop/docs directory however I am not
 able to run it.  Link for scheduling administration returns an http error
 404 ( http://localhost:50030/scheduler ). In the UI under scheduling
 information I can see only one queue named default. mapred-site.xml file
 is accessible because when changing a port for a jobtracker I can see a
 daemon running with a changed port. Variable $HADOOP_CONFIG_DIR was added
 to .bashrc, however that did not solve the problem. I tried to rebuild
 hadoop, manualy place the fair scheduler jar in hadoop/lib and changed the
 hadoop classpath in hadoop-env.sh to point to the lib folder, but without
 success. The only info of the scheduler that is seen in the jobtracker log
 is the folowing info:

 Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT,
 limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)


 I am working on this several days and running out of ideas... I am
 wondering how to fix it and where to check currently active scheduler
 parameters?

 Config files:
 mapred-site.xml http://pastebin.com/HmDfWqE1
 allocation.xml http://pastebin.com/Uexq7uHV
 Tried versions: 0.20.203 and 204

 Thank you


Re: Regarding pointers for LZO compression in Hive and Hadoop

2011-12-14 Thread Prashant Kommireddi
http://code.google.com/p/hadoop-gpl-packing/

Thanks,
Prashant

On Wed, Dec 14, 2011 at 11:32 AM, Abhishek Pratap Singh manu.i...@gmail.com
 wrote:

 Hi,

 I'm looking for some useful docs on enabling LZO on a hadoop cluster. I tried
 a few of the blogs, but somehow it's not working.
 Here is my requirement.

 I have a hadoop 0.20.2 and Hive 0.6. I have some tables with 1.5 TB of
 data, i want to compress them using LZO and enable LZO in hive as well as
 in hadoop.
 Let me know if you have any useful docs or pointers for the same.


 Regards,
 Abhishek



Re: More cores Vs More Nodes ?

2011-12-13 Thread Prashant Kommireddi
Hi Brad, how many tasktrackers did you have on each node in both cases?

Thanks,
Prashant

Sent from my iPhone

On Dec 13, 2011, at 9:42 AM, Brad Sarsfield b...@bing.com wrote:

 Praveenesh,

 Your question is not naïve; in fact, optimal hardware design can ultimately 
 be a very difficult question to answer on what would be better. If you made 
 me pick one without much information I'd go for more machines.  But...

 It all depends; and there is no right answer :)

 More machines
+May run your workload faster
+Will give you a higher degree of reliability protection from node / 
 hardware / hard drive failure.
+More aggregate IO capabilities
- capex / opex may be higher than allocating more cores
 More cores
+May run your workload faster
+More cores may allow for more tasks to run on the same machine
 +More cores/tasks may reduce network contention and increase
 task-to-task data flow performance.

 Notice "May run your workload faster" is in both, as it can be very workload
 dependent.

 My Experience:
 I did a recent experiment and found that given the same number of cores (64) 
 with the exact same network / machine configuration;
A: I had 8 machines with 8 cores
B: I had 28 machines with 2 cores (and 1x8 core head node)

 B was able to outperform A by 2x using teragen and terasort. These machines 
 were running in a virtualized environment; where some of the IO capabilities 
 behind the scenes were being regulated to 400Mbps per node when running in 
 the 2 core configuration vs 1Gbps on the 8 core.  So I would expect the 
 non-throttled scenario to work even better.

 ~Brad


 -Original Message-
 From: praveenesh kumar [mailto:praveen...@gmail.com]
 Sent: Monday, December 12, 2011 8:51 PM
 To: common-user@hadoop.apache.org
 Subject: More cores Vs More Nodes ?

 Hey Guys,

 So I have a very naive question in my mind regarding Hadoop cluster nodes ?

 More cores or more nodes: shall I spend money on going from 2- to 4-core
 machines, or spend money on buying more nodes with fewer cores, e.g. 2
 machines of 2 cores each?

 Thanks,
 Praveenesh



Re: Create a single output per each mapper

2011-12-12 Thread Prashant Kommireddi
Take a look at cleanup() method on Mapper.
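
A minimal sketch of that pattern with the new API (types and the marker key are just examples): accumulate in map() and emit a single record from cleanup(), which runs once after the mapper's last input record.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one record per map task: the number of input lines it processed.
// The output key ("line-count") is just an example marker.
public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private long counter = 0;

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    counter++;                // accumulate only; nothing is written per record
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Called once, after the last input record for this mapper.
    context.write(new Text("line-count"), new LongWritable(counter));
  }
}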

Thanks,
Prashant

Sent from my iPhone

On Dec 12, 2011, at 8:46 PM, Shi Yu sh...@uchicago.edu wrote:

 Hi,

 Suppose I have two mappers, each mapper is assigned 10 lines of
 data. I want to set a counter for each mapper, counting and
 accumulating, then output the counter value to the reducer when
  the mapper finishes processing all the assigned lines. So I
  want the mapper to output values only when there is no further
  incoming data (when that mapper closes). Is this doable? How?
 thanks!

 Shi


Re: OOM Error Map output copy.

2011-12-09 Thread Prashant Kommireddi
Arun, I faced the same issue and increasing the # of reducers fixed the
problem.

I was initially under the impression the MR framework spills to disk if the
data is too large to keep in memory; however, on extraordinarily large reduce
inputs this was not the case and the job failed while trying to allocate the
in-memory buffer:

private MapOutput shuffleInMemory(MapOutputLocation mapOutputLoc,
                                  URLConnection connection,
                                  InputStream input,
                                  int mapOutputLength,
                                  int compressedLength)
    throws IOException, InterruptedException {
  // Reserve ram for the map-output
  ...

  // Copy map-output into an in-memory buffer
  byte[] shuffleData = new byte[mapOutputLength];


-Prashant Kommireddi

On Fri, Dec 9, 2011 at 10:29 AM, Arun C Murthy a...@hortonworks.com wrote:

 Moving to mapreduce-user@, bcc common-user@. Please use project specific
 lists.

 Niranjan,

 If you average 0.5G output per-map, it's 5000 maps * 0.5G -> 2.5TB over
 12 reduces i.e. nearly 250G per reduce - compressed!

 If you think you have 4:1 compression you are doing nearly a Terabyte per
 reducer... which is way too high!

 I'd recommend you bump to somewhere along 1000 reduces to get to 2.5G
 (compressed) per reducer for your job. If your compression ratio is 2:1,
 try 500 reduces and so on.

 If you are worried about other users, use the CapacityScheduler and submit
 your job to a queue with a small capacity and max-capacity to restrict your
 job to 10 or 20 concurrent reduces at a given point.

 Arun

 On Dec 7, 2011, at 10:51 AM, Niranjan Balasubramanian wrote:

  All
 
  I am encountering the following out-of-memory error during the reduce
 phase of a large job.
 
  Map output copy failure : java.lang.OutOfMemoryError: Java heap space
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1669)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1529)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1378)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1310)
  I tried increasing the memory available using mapred.child.java.opts but
 that only helps a little. The reduce task eventually fails again. Here are
 some relevant job configuration details:
 
  1. The input to the mappers is about 2.5 TB (LZO compressed). The
 mappers filter out a small percentage of the input ( less than 1%).
 
  2. I am currently using 12 reducers and I can't increase this count by
 much to ensure availability of reduce slots for other users.
 
  3. mapred.child.java.opts -- -Xms512M -Xmx1536M -XX:+UseSerialGC
 
  4. mapred.job.shuffle.input.buffer.percent-- 0.70
 
  5. mapred.job.shuffle.merge.percent   -- 0.66
 
  6. mapred.inmem.merge.threshold   -- 1000
 
  7. I have nearly 5000 mappers which are supposed to produce LZO
 compressed outputs. The logs seem to indicate that the map outputs range
 between 0.3G to 0.8GB.
 
  Does anything here seem amiss? I'd appreciate any input of what settings
 to try. I can try different reduced values for the input buffer percent and
 the merge percent.  Given that the job runs for about 7-8 hours before
 crashing, I would like to make some informed choices if possible.
 
  Thanks.
  ~ Niranjan.
 
 
 




Re: Hadoop Comic

2011-12-07 Thread Prashant Kommireddi
Here you go
https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1&hl=en_US&pli=1

Thanks,
Prashant

On Wed, Dec 7, 2011 at 1:47 AM, shreya@cognizant.com wrote:

 Hi,



 Can someone please send me the Hadoop comic.

 Saw references about it in the mailing list.



 Regards,

 Shreya


 This e-mail and any files transmitted with it are for the sole use of the
 intended recipient(s) and may contain confidential and privileged
 information.
 If you are not the intended recipient, please contact the sender by reply
 e-mail and destroy all copies of the original message.
 Any unauthorized review, use, disclosure, dissemination, forwarding,
 printing or copying of this email or any action taken in reliance on this
 e-mail is strictly prohibited and may be unlawful.



Re: how to integrate snappy into hadoop

2011-12-07 Thread Prashant Kommireddi
I had to struggle a bit while building Snappy for Hadoop 0.20.2 on Ubuntu.
However, I have now been able to install it on a 10 node cluster and it
works great for map output compression. Please check these notes; maybe they
will help in addition to the official Hadoop-Snappy notes.

Tom White (amongst a few others) is the best person to answer this since he
is working directly with this integration project.

*Installing Maven 3*
http://www.discursive.com/blog/4636

*Snappy build*
http://code.google.com/p/hadoop-snappy/

*Creating the symbolic link for BUILD to pass:* THIS IS IMPORTANT
sudo ln -s
/home/pkommireddi/dev/tools/Linux/jdk/jdk1.6.0_21_x64/jre/lib/amd64/server/libjvm.so
/usr/local/lib/


*Additional build notes:*
http://shanky.org/2011/10/17/build-hadoop-from-source/

Install commands I issued on localhost, these were notes for myself. There
are dependencies that Snappy build needs, and you might NOT need all of the
commands that I have issued below. Please refer to the official notes and
install accordingly.

pkommireddi@pkommireddi-wsl:~$ history | grep install
 1678  sudo apt-get install python-software-properties
 1681  sudo apt-get install maven
 1750  make install
 1751  ./configure && make && sudo make install
 1752  sudo apt-get install zlibc zlib1g zlib1g-dev
 1763  sudo apt-get install subversion
 1776  ./configure && make && sudo make install
 1824  sudo apt-get install zlibc zlib1g zlib1g-dev
 1846  make install
 1847  sudo make install
 1858  make installcheck
 1861  sudo make install
 1862  sudo make installcheck
 1904  sudo apt-get install libtool
 1907  sudo apt-get install automake

Thanks,
Prashant Kommireddi

On Wed, Dec 7, 2011 at 5:39 PM, Jinyan Xu jinyan...@exar.com wrote:

 Hi ,


 Anyone else have the experience integrating snappy into hadoop ?  help me
 with it

 I find google doesn't provide the hadoop-snappy now :

 Hadoop-snappy is integrated into Hadoop Common(JUN 2011).

 Hadoop-Snappy can be used as an add-on for recent (released) versions of
 Hadoop that do not provide Snappy Codec support yet.

 Hadoop-Snappy is being kept in synch with Hadoop Common. 





 Thanks!


 



Re: how to integrate snappy into hadoop

2011-12-07 Thread Prashant Kommireddi
I have not tried it with HBase, and yes, 0.20.2 is not compatible with it.
What error do you receive when you try compiling Snappy? I don't think
compiling Snappy depends on HBase.

2011/12/7 Jinyan Xu jinyan...@exar.com

 Hi Prashant Kommireddi,

 Last week I read build-hadoop-from-source and followed it, but I failed to
 compile HBase with mvn compile -Dsnappy. Did you install HBase 0.90.2?
 According to build-hadoop-from-source, HBase 0.90.2 is incompatible with
 the Hadoop 0.20.2 release.


 -Original Message-
 From: Prashant Kommireddi [mailto:prash1...@gmail.com]
 Sent: December 8, 2011 10:13
 To: common-user@hadoop.apache.org
 Subject: Re: how to integrate snappy into hadoop

 I had to struggle a bit while building Snappy for Hadoop 0.20.2 on Ubuntu.
 However, I have now been able to install it on a 10 node cluster and it
 works great for map output compression. Please check these notes; they
 might help in addition to the official Hadoop-Snappy notes.

 Tom White (amongst a few others) is the best person to answer this since he
 is working directly with this integration project.

 *Installing Maven 3*
 http://www.discursive.com/blog/4636

 *Snappy build*
 http://code.google.com/p/hadoop-snappy/

 *Creating the symbolic link for BUILD to pass:* THIS IS IMPORTANT
 sudo ln -s

 /home/pkommireddi/dev/tools/Linux/jdk/jdk1.6.0_21_x64/jre/lib/amd64/server/libjvm.so
 /usr/local/lib/


 *Additional build notes:*
 http://shanky.org/2011/10/17/build-hadoop-from-source/

 Install commands I issued on localhost; these were notes for myself. There
 are dependencies that the Snappy build needs, and you might NOT need all of
 the commands I have issued below. Please refer to the official notes and
 install accordingly.

 pkommireddi@pkommireddi-wsl:~$ history | grep install
  1678  sudo apt-get install python-software-properties
  1681  sudo apt-get install maven
  1750  make install
  1751  ./configure && make && sudo make install
  1752  sudo apt-get install zlibc zlib1g zlib1g-dev
  1763  sudo apt-get install subversion
  1776  ./configure && make && sudo make install
  1824  sudo apt-get install zlibc zlib1g zlib1g-dev
  1846  make install
  1847  sudo make install
  1858  make installcheck
  1861  sudo make install
  1862  sudo make installcheck
  1904  sudo apt-get install libtool
  1907  sudo apt-get install automake

 Thanks,
 Prashant Kommireddi

 On Wed, Dec 7, 2011 at 5:39 PM, Jinyan Xu jinyan...@exar.com wrote:

  Hi ,
 
 
  Does anyone else have experience integrating Snappy into Hadoop? Please
  help me with it.
 
  I find Google doesn't provide hadoop-snappy now:
 
  Hadoop-snappy is integrated into Hadoop Common (JUN 2011).
 
  Hadoop-Snappy can be used as an add-on for recent (released) versions of
  Hadoop that do not provide Snappy Codec support yet.
 
  Hadoop-Snappy is being kept in synch with Hadoop Common. 
 
 
 
 
 
  Thanks!
 
 
  
 




Re: HDFS Explained as Comics

2011-11-30 Thread Prashant Kommireddi
Thanks Maneesh.

Quick question: does a client really need to know block size and
replication factor? A lot of times the client has no control over these
(they are set at the cluster level).

-Prashant Kommireddi

On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges dejan.men...@gmail.com wrote:

 Hi Maneesh,

 Thanks a lot for this! Just distributed it over the team and comments are
 great :)

 Best regards,
 Dejan

 On Wed, Nov 30, 2011 at 9:28 PM, maneesh varshney mvarsh...@gmail.com
 wrote:

  For your reading pleasure!
 
  PDF 3.3MB uploaded at (the mailing list has a cap of 1MB attachments):
 
 
 https://docs.google.com/open?id=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1
 
 
  Appreciate if you can spare some time to peruse this little experiment of
  mine to use Comics as a medium to explain computer science topics. This
  particular issue explains the protocols and internals of HDFS.
 
  I am eager to hear your opinions on the usefulness of this visual medium
 to
  teach complex protocols and algorithms.
 
  [My personal motivations: I have always found text descriptions to be too
  verbose as a lot of effort is spent putting the concepts in proper
 time-space
  context (which can be easily avoided in a visual medium); sequence
 diagrams
  are unwieldy for non-trivial protocols, and they do not explain concepts;
  and finally, animations/videos happen too fast and do not offer
  self-paced learning experience.]
 
  All forms of criticisms, comments (and encouragements) welcome :)
 
  Thanks
  Maneesh
 



Re: HDFS Explained as Comics

2011-11-30 Thread Prashant Kommireddi
Sure, it's just a case of how readers interpret it.

   1. The client is required to specify block size and replication factor
   each time
   2. The client does not need to worry about them, since an admin has set
   the properties in the default configuration files

A client would not be allowed to override the default configs if they are
set final (though there are ways around that too, for example by using
create() as you suggest :)
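
For instance, here is a minimal sketch of a client passing its own values
through the long form of create(). The path and the numbers are made up, and
whether the override actually takes effect depends on how the cluster and
client configs are set up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithOverridesSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Explicit buffer size, replication factor and block size for this file.
    FSDataOutputStream out = fs.create(new Path("/tmp/create-override-example"),
        true,               // overwrite if it already exists
        4096,               // io buffer size
        (short) 2,          // replication factor for this file only
        64L * 1024 * 1024); // block size in bytes (64 MB here)
    out.writeUTF("hello");
    out.close();
  }
}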

The information is great and helpful. I just want to make sure a beginner
who wants to write a WordCount in MapReduce does not worry about specifying
block size and replication factor in his code.

Thanks,
Prashant

On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney mvarsh...@gmail.com wrote:

 Hi Prashant

 Others may correct me if I am wrong here..

 The client (org.apache.hadoop.hdfs.DFSClient) has knowledge of block size
 and replication factor. In the source code, I see the following in the
 DFSClient constructor:

 defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);

 defaultReplication = (short) conf.getInt("dfs.replication", 3);

 My understanding is that the client considers the following chain for the
 values:
 1. Manual values (the long form constructor; when a user provides these
 values)
 2. Configuration file values (these are cluster level defaults:
 dfs.block.size and dfs.replication)
 3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3)

 Moreover, in org.apache.hadoop.hdfs.protocol.ClientProtocol the API to
 create a file is
 void create(..., short replication, long blocksize);

 I presume it means that the client already has knowledge of these values
 and passes them to the NameNode when creating a new file.
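
 Either way, here is a small sketch of a client setting those same keys on its
 own Configuration before opening the FileSystem, which is another way to
 influence the defaults the DFSClient picks up. The block size and replication
 values are just examples, and the overrides only apply if the loaded config
 files have not marked those properties final:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FSDataOutputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 public class ConfigOverrideSketch {
   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     // The same keys DFSClient reads above, set on the client side.
     conf.setLong("dfs.block.size", 64L * 1024 * 1024);  // 64 MB blocks
     conf.setInt("dfs.replication", 2);
     FileSystem fs = FileSystem.get(conf);
     FSDataOutputStream out = fs.create(new Path("/tmp/config-override-example"));
     out.writeUTF("hello");
     out.close();
   }
 }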

 Hope that helps.

 thanks
 -Maneesh

 On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi prash1...@gmail.com
 wrote:

  Thanks Maneesh.
 
  Quick question: does a client really need to know block size and
  replication factor? A lot of times the client has no control over these
  (they are set at the cluster level).
 
  -Prashant Kommireddi
 
  On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges dejan.men...@gmail.com
  wrote:
 
   Hi Maneesh,
  
   Thanks a lot for this! Just distributed it over the team and comments
 are
   great :)
  
   Best regards,
   Dejan
  
   On Wed, Nov 30, 2011 at 9:28 PM, maneesh varshney mvarsh...@gmail.com
   wrote:
  
For your reading pleasure!
   
PDF 3.3MB uploaded at (the mailing list has a cap of 1MB
 attachments):
   
   
  
 
 https://docs.google.com/open?id=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1
   
   
Appreciate if you can spare some time to peruse this little
 experiment
  of
mine to use Comics as a medium to explain computer science topics.
 This
particular issue explains the protocols and internals of HDFS.
   
I am eager to hear your opinions on the usefulness of this visual
  medium
   to
teach complex protocols and algorithms.
   
[My personal motivations: I have always found text descriptions to be
  too
verbose as a lot of effort is spent putting the concepts in proper
   time-space
context (which can be easily avoided in a visual medium); sequence
   diagrams
are unwieldy for non-trivial protocols, and they do not explain
  concepts;
and finally, animations/videos happen too fast and do not offer
self-paced learning experience.]
   
All forms of criticisms, comments (and encouragements) welcome :)
   
Thanks
Maneesh
   
  
 



Re: Passing data files via the distributed cache

2011-11-25 Thread Prashant Kommireddi
I believe you want to ship the data to each node in your cluster before the
MR job begins, so the mappers can access files local to their machine. The
Hadoop tutorial on YDN has some good info on this.

http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
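
Here is a minimal sketch of the old-API flow; the file name is made up, and
the file must already be in HDFS when the job is submitted:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheSketch.class);
    // Driver side: register an HDFS file so it is copied to every task node.
    DistributedCache.addCacheFile(new URI("/user/andy/lookup.dat"), conf);
    // ... set mapper/reducer classes and submit as usual.
  }

  // Task side (e.g. in Mapper.configure()): read the node-local copies.
  static Path[] localCopies(JobConf conf) throws java.io.IOException {
    return DistributedCache.getLocalCacheFiles(conf);
  }
}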

-Prashant Kommireddi

On Fri, Nov 25, 2011 at 1:05 AM, Andy Doddington a...@doddington.net wrote:

 I have a series of mappers that I would like to be passed data using the
 distributed cache mechanism. At the
 moment, I am using HDFS to pass the data, but this seems wasteful to me,
 since they are all reading the same data.

 Is there a piece of example code that shows how data files can be placed
 in the cache and accessed by mappers?

 Thanks,

Andy Doddington