Re: Pregel

2009-06-26 Thread Saptarshi Guha
Hello,
I don't have a  background in CS, but does MS's Dryad (
http://research.microsoft.com/en-us/projects/Dryad/ ) fit in anywhere
here?
Regards
Saptarshi


On Fri, Jun 26, 2009 at 5:19 AM, Edward J. Yoon edwardy...@apache.org wrote:
 According to my understanding, I think Pregel is in the same layer
 as MR, not an MR-based language processor.

 I think the 'Collective Communication' of BSP seems to be the core of the
 problem. For example, this BFS problem
 (http://blog.udanax.org/2009/02/breadth-first-search-mapreduce.html)
 can be solved at once w/o MR iterations.
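
For readers unfamiliar with the vertex-centric BSP model being contrasted with MR iterations here, a purely hypothetical, simplified sketch of a BFS superstep follows; none of these types come from Pregel's or Hama's actual API.

// Hypothetical vertex-centric BFS step in the BSP style discussed above.
import java.util.List;

interface VertexContext {
  void sendToNeighbors(int value);   // message delivery for the next superstep
  void voteToHalt();                 // deactivate this vertex until messaged
}

class BfsVertex {
  int distance = Integer.MAX_VALUE;

  // Called once per superstep with the messages received in the previous one.
  void compute(List<Integer> messages, VertexContext ctx) {
    int min = distance;
    for (int m : messages) {
      if (m < min) min = m;
    }
    if (min < distance) {
      distance = min;
      ctx.sendToNeighbors(min + 1);  // propagate the BFS frontier
    }
    ctx.voteToHalt();                // the job ends when all vertices halt
  }
}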

 On Fri, Jun 26, 2009 at 3:17 PM, Owen O'Malley omal...@apache.org wrote:

 On Jun 25, 2009, at 9:42 PM, Mark Kerzner wrote:

  my guess, as good as anybody's, is that Pregel is to large graphs what
 Hadoop is to large datasets.

 I think it is much more likely a language that allows you to easily define
 fixed point algorithms.  I would imagine a distributed version of something
 similar to Michal Young's GenSet.
 http://portal.acm.org/citation.cfm?doid=586094.586108

 I've been trying to figure out how to justify working on a project like that
 for a couple of years, but haven't yet. (I have a background in program
 static analysis, so I've implemented similar stuff.)

 In other words, Pregel is the next natural step
 for massively scalable computations after Hadoop.

 I wonder if it uses map/reduce as a base or not. It would be easier to use
 map/reduce, but a direct implementation would be more performant. In either
 case, it is a new hammer. From what I see, it likely won't replace
 map/reduce, pig, or hive; but rather support a different class of
 applications much more directly than you can under map/reduce.

 -- Owen





 --
 Best Regards, Edward J. Yoon @ NHN, corp.
 edwardy...@apache.org
 http://blog.udanax.org



Re: When is configure and close run

2009-06-24 Thread Saptarshi Guha
Thank you! Just to confirm: consider a JVM (that is being reused) that has
to reduce K1,{V11,V12,V13,...} and K2,{V21,V22,V23,...}. Then the
configure and close methods are called once each for both K1,{V11,...}
and K2,{V21,...}?

Is my understanding correct?

Once again, there is no combiner, and it makes sense that it is not called.

Thank you
Saptarshi


On Mon, Jun 22, 2009 at 10:55 PM, jason hadoop jason.had...@gmail.com wrote:
 configure and close are run for each task, mapper and reducer. The configure
 and close are NOT run on the combiner class.
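
To make that lifecycle concrete, a minimal sketch against the 0.19-era mapred API; configure(JobConf) and close() are the MapReduceBase hooks being discussed, and they run once per task attempt even when one JVM is reused for several tasks.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LifecycleMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  @Override
  public void configure(JobConf job) {
    // Runs once per task attempt, before the first map() call of that task,
    // even when mapred.job.reuse.jvm.num.tasks lets one JVM run several tasks.
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    out.collect(value, key);
  }

  @Override
  public void close() throws IOException {
    // Runs once per task attempt, after the last map() call of that task.
  }
}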

 On Mon, Jun 22, 2009 at 9:23 AM, Saptarshi Guha saptarshi.g...@gmail.com
 wrote:

 Hello,
 In a mapreduce job, a given map JVM will run N map tasks. Are the
 configure and close methods executed for every one of these N tasks?
 Or is configure executed once when the JVM starts and the close method
 executed once when all N have been completed?

 I have the same question for the reduce task. Will configure be run before
 every reduce task? And is close run when all the values for a
 given key have been processed?

 We can assume there isn't a combiner.

 Regards
 Saptarshi



 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals



Re: EC2, Hadoop, copy file from CLUSTER_MASTER to CLUSTER, failing

2009-06-24 Thread Saptarshi Guha
Hello,
Thank you. This is quite useful.

Regards
Saptarshi


On Wed, Jun 24, 2009 at 6:16 AM, Tom White t...@cloudera.com wrote:
 Hi Saptarshi,

 The group permissions open the firewall ports to enable access, but
 there are no shared keys on the cluster by default. See
 https://issues.apache.org/jira/browse/HADOOP-4131 for a patch to the
 scripts that shares keys to allow SSH access between machines in the
 cluster.

 Cheers,
 Tom

 On Sat, Jun 20, 2009 at 7:09 AM, Saptarshi Guha saptarshi.g...@gmail.com 
 wrote:
 Hello,
 I have a cluster with 1 master and 1 slave (testing). In the EC2
 scripts, in the hadoop-ec2-init-remote.sh file, I wish to copy a file
 from
 the MASTER to the CLUSTER i.e in the slave section

  scp $MASTER_HOST:/tmp/v  /tmp/v

 However, this didn't work, and when I logged in, ssh'd to the slave and
 tried the command, I got the following error:

 Permission denied (publickey,gssapi-with-mic)

 Yet, the group permissions appear to be valid i.e
  ec2-authorize $CLUSTER_MASTER -o $CLUSTER -u $AWS_ACCOUNT_ID
  ec2-authorize $CLUSTER -o $CLUSTER_MASTER -u $AWS_ACCOUNT_ID

 So I don't see why I can't ssh into the MASTER group from a slave.

 Any suggestion as to where I'm going wrong?
 Regards
 Saptarshi

 P.S I know I can copy a file from S3, but would like to know what is
 going on here.




EC2, Max tasks, under utilized?

2009-06-23 Thread Saptarshi Guha
Hello,
I'm running a 90-node c1.xlarge cluster. No reducers, mapred.max.map.tasks=6
per machine.
The AMI is our own and uses Hadoop 0.19.1.
The dataset has 145K keys, and the processing time is huge.

Now, when I set mapred.map.tasks=14,000, what ends up running is 49 map
tasks across the machines.
No machine is running more than 3 tasks; most are running 1, some are running
0.
Looking at the map records read, it appears these 49 tasks correspond to
the 145K records.
Q) Why? Why isn't the number of running tasks much higher? If each machine
can run 6, then why not make this a higher number and run across the
machines?
This is under-utilization.

So I set mapred.map.tasks=90.
On the Hadoop machine list, all 90 machines are running at least 1 task, mostly 1,
some 2 and a small few 3+ (max 4).
On the jobtracker page, only 23 are running, 48 pending (when I sent this
email).
With 90 machines (and a Map Task Capacity of 540), why aren't 90 running at
one go?

What should be set? What isn't set?

Regards
Saptarshi Guha


Re: EC2, Max tasks, under utilized?

2009-06-23 Thread Saptarshi Guha
Hello,
I should also point out that I'm using a SequenceFileInputFormat.

Regards
Saptarshi Guha
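
For context on this thread: in the 0.19 API the actual number of map tasks is whatever the InputFormat's getSplits() returns; mapred.map.tasks is only a hint to it, and each input file contributes at least one split. The per-node concurrency cap is a separate, tasktracker-side setting. A minimal sketch of where these knobs sit, with the values purely illustrative:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class MapTaskKnobs {
  // A sketch only; the numbers are illustrative, not recommendations.
  public static void configure(JobConf conf) {
    conf.setInputFormat(SequenceFileInputFormat.class);

    // A hint passed to InputFormat.getSplits(); the real task count is the
    // number of splits the InputFormat decides to return.
    conf.setNumMapTasks(540);

    // Names the per-tasktracker slot count, but it is read from each
    // tasktracker's own config at startup, not per job; shown here only
    // to identify the property.
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 6);
  }
}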


On Tue, Jun 23, 2009 at 10:43 AM, Saptarshi Guha
saptarshi.g...@gmail.com wrote:

 Hello,
 I'm running a 90-node c1.xlarge cluster. No reducers,
 mapred.max.map.tasks=6 per machine.
 The AMI is our own and uses Hadoop 0.19.1.
 The dataset has 145K keys, and the processing time is huge.

 Now, when I set mapred.map.tasks=14,000, what ends up running is 49 map
 tasks across the machines.
 No machine is running more than 3 tasks; most are running 1, some are
 running 0.
 Looking at the map records read, it appears these 49 tasks correspond to
 the 145K records.
 Q) Why? Why isn't the number of running tasks much higher? If each machine
 can run 6, then why not make this a higher number and run across the
 machines?
 This is under-utilization.

 So I set mapred.map.tasks=90.
 On the Hadoop machine list, all 90 machines are running at least 1 task, mostly 1,
 some 2 and a small few 3+ (max 4).
 On the jobtracker page, only 23 are running, 48 pending (when I sent this
 email).
 With 90 machines (and a Map Task Capacity of 540), why aren't 90 running at
 one go?

 What should be set? What isn't set?

 Regards
 Saptarshi Guha



EC2, Hadoop, copy file from CLUSTER_MASTER to CLUSTER, failing

2009-06-20 Thread Saptarshi Guha
Hello,
I have a cluster with 1 master and 1 slave (testing). In the EC2
scripts, in the hadoop-ec2-init-remote.sh file, I wish to copy a file
from
the MASTER to the CLUSTER i.e in the slave section

 scp $MASTER_HOST:/tmp/v  /tmp/v

However, this didn't work, and when I logged in, ssh'd to the slave and
tried the command, I got the following error:

Permission denied (publickey,gssapi-with-mic)

Yet, the group permissions appear to be valid i.e
 ec2-authorize $CLUSTER_MASTER -o $CLUSTER -u $AWS_ACCOUNT_ID
 ec2-authorize $CLUSTER -o $CLUSTER_MASTER -u $AWS_ACCOUNT_ID

So I don't see why I can't ssh into the MASTER group from a slave.

Any suggestion as to where I'm going wrong?
Regards
Saptarshi

P.S I know I can copy a file from S3, but would like to know what is
going on here.


Tutorial on building an AMI

2009-05-22 Thread Saptarshi Guha
Hello,
Is there a tutorial available for building a Hadoop AMI (like
Cloudera's)? Cloudera has an 18.2 AMI and, for reasons I understand,
they can't provide (as of now) AMIs for higher Hadoop versions until
they become stable.
I would like to create an AMI for 19.2 - so I was hoping there is a
guide for building one.


Thank you

Saptarshi Guha


Re: No reduce tasks running, yet 1 is pending

2009-05-12 Thread Saptarshi Guha
Interestingly, when I started other jobs, this one finished.
I have no idea why.

Saptarshi Guha



On Tue, May 12, 2009 at 10:36 PM, Saptarshi Guha
saptarshi.g...@gmail.com wrote:
 Hello,
 I mentioned this issue before for the case of map tasks. I have 43
 reduce tasks, 42 completed, 1 pending and 0 running.
 This is the case for the last 30 minutes. A picture (TIFF) of the job
 tracker can be found here ( http://www.stat.purdue.edu/~sguha/mr.tiff
 );
 since I haven't canceled the job, I can send logs and output if
 anyone requires.


 Regards
 Saptarshi



Heterogeneous cluster - quadcores/8 cores, Fairscheduler

2009-05-09 Thread Saptarshi Guha
Hello,
Our unit has 5 quad-core machines, running Hadoop. We have a dedicated
Jobtracker/Namenode. Each machine has 32GB ram.
We have the option of buying an 8-core, 128GB machine, and the question
is: would this be useful as a tasktracker?

A) It can certainly be used as the JobTracker and Namenode.
B) I also read about the FairScheduler, and mapred.fairscheduler.loadmanager
appears to allow the administrator to let more tasks run on a given
machine. Thus the new machine might be able to run double the number
of maps and reduces.

Otherwise, if the FairScheduler is not an option (or I misunderstood), then
adding the new machine to the cluster would only underutilize it.

Are my inferences correct?
Regards
Saptarshi Guha


R on the cloudera hadoop ami?

2009-04-28 Thread Saptarshi Guha
Hello,
Thank you to Cloudera for providing an AMI for Hadoop. Would it be
possible to include R in the Cloudera yum repositories or better still
can the i386 and x86-64 AMIs be updated to have R pre-installed?

If yes (thanks!) I think yum -y install R installs R built with
--shared-libs (which allows programs to link with R)

Thank you  in advance
Saptarshi


ANN: R and Hadoop = RHIPE 0.1

2009-04-27 Thread Saptarshi Guha
Hello,
I'd like to announce the release of the 0.1 version of RHIPE -R and
Hadoop Integrated Processing Environment. Using RHIPE, it is possible
to write map-reduce algorithms using the R language and start them
from within R.
RHIPE is built on Hadoop and so benefits from Hadoop's fault
tolerance, distributed file system and job scheduling features.
For the R user, there is rhlapply which runs an lapply across the cluster.
For the Hadoop user, there is rhmr which runs a general map-reduce program.

The tired example of counting words:

m <- function(key,val){
  # split the line into words
  words <- strsplit(val, " +")[[1]]
  wc <- table(words)
  cln <- names(wc)
  return(sapply(1:length(wc),function(r)
    list(key=cln[r],value=wc[[r]]),simplify=F))
}
r <- function(key,value){
  value <- do.call(rbind,value)
  return(list(list(key=key,value=sum(value))))
}
rhmr(mapper=m,reduce=r,input.folder="X",output.folder="Y")

URL: http://ml.stat.purdue.edu/rhipe

There are some downsides to RHIPE which are described at
http://ml.stat.purdue.edu/rhipe/install.html#sec-5

Regards
Saptarshi Guha


RuntimeException, could not obtain block causes trackers to get blacklisted

2009-04-25 Thread Saptarshi Guha
Hello,
I have an intensive job running across 5 machines. During the map
stage, each map emits 200 records, so effectively for 50,000,000 input
records, the map creates 200*50e6 records.

However, after a long time, I see two trackers are blacklisted
Caused by: java.lang.RuntimeException: Could not obtain block:
blk_-5964245027287878843_92134
file=/tmp/8a5bc814-b4ff-4641-bc3a-abfeda9e7e33.mapfile/index
at 
org.saptarshiguha.rhipe.hadoop.RHMR$RHMRCombiner.configure(RHMR.java:314)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:1110)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:989)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:401)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:886)


The map file is very much present in that location and other trackers
could read it. What could be the reason for these two machines not
being able to read it?
Now that they are blacklisted, how can I add them back to the computation (a
Saptarshi Guha


File name to which mapper's key,value belongs to - is it available?

2009-04-25 Thread Saptarshi Guha
Hello,
Is there a conf variable for getting the filename to which the current
mapper's key,value belongs?
I have dir/dirA/part-X and dir/dirB/part-X.
I will process dir, but need to know whether the key,value is from a
dirA/part-* file or from a dirB/part-* file.

I'd much rather not implement my own inputformat, since I'd like the
method to be inputformat agnostic.

Much thanks
Saptarshi Guha
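
One approach that works with the 0.19-era mapred API, without a custom InputFormat, is to read the map.input.file property that the framework sets for file-based input formats. A minimal sketch; the dirA check is only illustrative:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class SourceAwareMapperBase extends MapReduceBase {
  protected boolean fromDirA;

  @Override
  public void configure(JobConf job) {
    // For file-based input formats the framework records the path of the
    // file backing the current split under "map.input.file".
    String inputFile = job.get("map.input.file");
    fromDirA = inputFile != null && inputFile.contains("/dirA/");
  }
}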


Re: Sometimes no map tasks are run - X are complete and N-X are pending, none running

2009-04-17 Thread Saptarshi Guha
I forgot to examine the logs; I will next time it happens.
Thank you.
BTW, no maps are running. Only a few are pending and the rest are complete.
Saptarshi Guha


On Fri, Apr 17, 2009 at 3:06 AM, Jothi Padmanabhan joth...@yahoo-inc.com wrote:



 On 4/17/09 12:26 PM, Sharad Agarwal shara...@yahoo-inc.com wrote:

 
 
  The last map task is forever in the pending queue - is this an issue with my
  setup/config or do others have the problem?
  Do you mean the left over maps are not at all scheduled ? What do you see
 in
  jobtracker logs ?

 Also in the JT UI, please check on how many maps are marked as running,
 when
 this map is still pending?




Sometimes no map tasks are run - X are complete and N-X are pending, none running

2009-04-16 Thread Saptarshi Guha
Hello,
I'm using 0.19.2-dev-core (checked out from cvs and built). With 51 maps, I
have a case where 50 tasks have completed and 1 is pending, about 1400
records left for this one to process. The completed map tasks have written
out 18GB to the HDFS.
The last map task is forever in the pending queue - is this an issue with my
setup/config or do others have the problem?

Thank you

Saptarshi Guha


Is combiner and map in same JVM?

2009-04-14 Thread Saptarshi Guha
Hello,
Suppose I have a Hadoop job and have set my combiner to the Reducer class.
Do the map function and the combiner function run in the same JVM in
different threads, or in different JVMs?
I ask because I have to load a native library, and if they are in the same
JVM then the native library is loaded once and I have to take precautions.

Thank you
Saptarshi Guha


Re: Is combiner and map in same JVM?

2009-04-14 Thread Saptarshi Guha
Thanks. I am using 0.19, and to confirm: the map and combiner (in the map
JVM) are run in *different* threads at the same time?
My native library is not thread-safe, so I would have to implement locks.
Aaron's email gave me hope (since the map and combiner would then be running
sequentially), but this appears to make things complicated.


Saptarshi Guha


On Tue, Apr 14, 2009 at 2:01 PM, Owen O'Malley omal...@apache.org wrote:


 On Apr 14, 2009, at 10:52 AM, Aaron Kimball wrote:

  They're in the same JVM, and I believe in the same thread.


 They are the same JVM. They *used* to be the same thread. In either 0.19 or
 0.20, combiners are also called in the reduce JVM if spills are required.

 -- Owen
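
Not from the thread, but one common way to cope with a non-thread-safe native library when map and combiner may share a JVM is to load it once per JVM and funnel every call through a single lock. A rough sketch; the library name and the native parse method are hypothetical stand-ins:

public final class NativeGate {
  static {
    // Hypothetical library name; loaded at most once per JVM regardless of
    // how many map/combine/reduce tasks reuse this JVM.
    System.loadLibrary("myparser");
  }

  // Hypothetical JNI entry point into the non-thread-safe C library.
  private static native byte[] parseNative(byte[] record);

  private static final Object LOCK = new Object();

  // Every caller (mapper, combiner, reducer) funnels through this one lock,
  // so the native code never runs concurrently even if Hadoop uses threads.
  public static byte[] parse(byte[] record) {
    synchronized (LOCK) {
      return parseNative(record);
    }
  }
}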



Building LZO on hadoop

2009-04-01 Thread Saptarshi Guha
I checked out hadoop-core-0.19
export CFLAGS=$CUSTROOT/include
export LDFLAGS=$CUSTROOT/lib

(they contain lzo which was built with --shared)
ls $CUSTROOT/include/lzo/
lzo1a.h  lzo1b.h  lzo1c.h  lzo1f.h  lzo1.h  lzo1x.h  lzo1y.h  lzo1z.h
lzo2a.h  lzo_asm.h  lzoconf.h  lzodefs.h  lzoutil.h

ls $CUSTROOT/lib/
liblzo2.so  liblzo.a  liblzo.la  liblzo.so  liblzo.so.1  liblzo.so.2
liblzo.so.2.0.0

I then run (from hadoop-core-0.19.1/)
ant -Dcompile.native=true

I get messages like : (many others like this)
exec] configure: WARNING: lzo/lzo1x.h: accepted by the compiler,
rejected by the preprocessor!
 [exec] configure: WARNING: lzo/lzo1x.h: proceeding with the
compiler's result
 [exec] checking for lzo/lzo1x.h... yes
 [exec] checking Checking for the 'actual' dynamic-library for
'-llzo2'... (cached)
 [exec] checking lzo/lzo1y.h usability... yes
 [exec] checking lzo/lzo1y.h presence... no
 [exec] configure: WARNING: lzo/lzo1y.h: accepted by the compiler,
rejected by the preprocessor!
 [exec] configure: WARNING: lzo/lzo1y.h: proceeding with the
compiler's result
 [exec] checking for lzo/lzo1y.h... yes
 [exec] checking Checking for the 'actual' dynamic-library for
'-llzo2'... (cached)

and finally,
ive/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c  -fPIC -DPIC
-o .libs/LzoCompressor.o
 [exec] 
/ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:
In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
 [exec] 
/ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:137:
error: expected expression before ',' token


Any ideas?
Saptarshi Guha


Re: Building LZO on hadoop

2009-04-01 Thread Saptarshi Guha
Fixed. In the configure script src/native/
change
 echo 'int main(int argc, char **argv){return 0;}' > conftest.c
  if test -z "`${CC} ${LDFLAGS} -o conftest conftest.c -llzo2 2>&1`"; then
    if test ! -z "`which objdump | grep -v 'no objdump'`"; then
      ac_cv_libname_lzo2=`objdump -p conftest | grep NEEDED | grep
lzo2 | sed 's/\W*NEEDED\W*\(.*\)\W*$/\"\1\"/'`
    elif test ! -z "`which ldd | grep -v 'no ldd'`"; then
      ac_cv_libname_lzo2=`ldd conftest | grep lzo2 | sed
's/^[^A-Za-z0-9]*\([A-Za-z0-9\.]*\)[^A-Za-z0-9]*=>.*$/\"\1\"/'`
    else
      { { echo "$as_me:$LINENO: error: Can't find either 'objdump' or
'ldd' to compute the dynamic library for '-llzo2'" >&5
echo "$as_me: error: Can't find either 'objdump' or 'ldd' to compute
the dynamic library for '-llzo2'" >&2;}
   { (exit 1); exit 1; }; }
    fi
  else
    ac_cv_libname_lzo2=libnotfound.so
  fi
  rm -f conftest*

lzo2 to lzo.so.2 (again this depends on what the user has), also set
CFLAGS and LDFLAGS to include your lzo libs/incs



Saptarshi Guha



On Wed, Apr 1, 2009 at 2:29 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote:
 I checked out hadoop-core-0.19
 export CFLAGS=$CUSTROOT/include
 export LDFLAGS=$CUSTROOT/lib

 (they contain lzo which was built with --shared)
ls $CUSTROOT/include/lzo/
 lzo1a.h  lzo1b.h  lzo1c.h  lzo1f.h  lzo1.h  lzo1x.h  lzo1y.h  lzo1z.h
 lzo2a.h  lzo_asm.h  lzoconf.h  lzodefs.h  lzoutil.h

ls $CUSTROOT/lib/
 liblzo2.so  liblzo.a  liblzo.la  liblzo.so  liblzo.so.1  liblzo.so.2
 liblzo.so.2.0.0

 I then run (from hadoop-core-0.19.1/)
 ant -Dcompile.native=true

 I get messages like : (many others like this)
 exec] configure: WARNING: lzo/lzo1x.h: accepted by the compiler,
 rejected by the preprocessor!
     [exec] configure: WARNING: lzo/lzo1x.h: proceeding with the
 compiler's result
     [exec] checking for lzo/lzo1x.h... yes
     [exec] checking Checking for the 'actual' dynamic-library for
 '-llzo2'... (cached)
     [exec] checking lzo/lzo1y.h usability... yes
     [exec] checking lzo/lzo1y.h presence... no
     [exec] configure: WARNING: lzo/lzo1y.h: accepted by the compiler,
 rejected by the preprocessor!
     [exec] configure: WARNING: lzo/lzo1y.h: proceeding with the
 compiler's result
     [exec] checking for lzo/lzo1y.h... yes
     [exec] checking Checking for the 'actual' dynamic-library for
 '-llzo2'... (cached)

 and finally,
 ive/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c  -fPIC -DPIC
 -o .libs/LzoCompressor.o
     [exec] 
 /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:
 In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
     [exec] 
 /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:137:
 error: expected expression before ',' token


 Any ideas?
 Saptarshi Guha



Re: Building LZO on hadoop

2009-04-01 Thread Saptarshi Guha
Actually, if one installs the latest liblzo and sets CFLAGS, LDFLAGS
and LFLAGS correctly, things work fine.
Saptarshi Guha



On Wed, Apr 1, 2009 at 3:55 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote:
 Fixed. In the configure script src/native/
 change
  echo 'int main(int argc, char **argv){return 0;}'  conftest.c
  if test -z `${CC} ${LDFLAGS} -o conftest conftest.c -llzo2 21`; then
        if test ! -z `which objdump | grep -v 'no objdump'`; then
      ac_cv_libname_lzo2=`objdump -p conftest | grep NEEDED | grep
 lzo2 | sed 's/\W*NEEDED\W*\(.*\)\W*$/\\1\/'`
    elif test ! -z `which ldd | grep -v 'no ldd'`; then
      ac_cv_libname_lzo2=`ldd conftest | grep lzo2 | sed
 's/^[^A-Za-z0-9]*\([A-Za-z0-9\.]*\)[^A-Za-z0-9]*=.*$/\\1\/'`
    else
      { { echo $as_me:$LINENO: error: Can't find either 'objdump' or
 'ldd' to compute the dynamic library for '-llzo2' 5
 echo $as_me: error: Can't find either 'objdump' or 'ldd' to compute
 the dynamic library for '-llzo2' 2;}
   { (exit 1); exit 1; }; }
    fi
  else
    ac_cv_libname_lzo2=libnotfound.so
  fi
  rm -f conftest*

 lzo2 to lzo.so.2 (again this depends on what the user has), also set
 CFLAGS and LDFLAGS to include your lzo libs/incs



 Saptarshi Guha



 On Wed, Apr 1, 2009 at 2:29 PM, Saptarshi Guha saptarshi.g...@gmail.com 
 wrote:
 I checked out hadoop-core-0.19
 export CFLAGS=$CUSTROOT/include
 export LDFLAGS=$CUSTROOT/lib

 (they contain lzo which was built with --shared)
ls $CUSTROOT/include/lzo/
 lzo1a.h  lzo1b.h  lzo1c.h  lzo1f.h  lzo1.h  lzo1x.h  lzo1y.h  lzo1z.h
 lzo2a.h  lzo_asm.h  lzoconf.h  lzodefs.h  lzoutil.h

ls $CUSTROOT/lib/
 liblzo2.so  liblzo.a  liblzo.la  liblzo.so  liblzo.so.1  liblzo.so.2
 liblzo.so.2.0.0

 I then run (from hadoop-core-0.19.1/)
 ant -Dcompile.native=true

 I get messages like : (many others like this)
 exec] configure: WARNING: lzo/lzo1x.h: accepted by the compiler,
 rejected by the preprocessor!
     [exec] configure: WARNING: lzo/lzo1x.h: proceeding with the
 compiler's result
     [exec] checking for lzo/lzo1x.h... yes
     [exec] checking Checking for the 'actual' dynamic-library for
 '-llzo2'... (cached)
     [exec] checking lzo/lzo1y.h usability... yes
     [exec] checking lzo/lzo1y.h presence... no
     [exec] configure: WARNING: lzo/lzo1y.h: accepted by the compiler,
 rejected by the preprocessor!
     [exec] configure: WARNING: lzo/lzo1y.h: proceeding with the
 compiler's result
     [exec] checking for lzo/lzo1y.h... yes
     [exec] checking Checking for the 'actual' dynamic-library for
 '-llzo2'... (cached)

 and finally,
 ive/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c  -fPIC -DPIC
 -o .libs/LzoCompressor.o
     [exec] 
 /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:
 In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
     [exec] 
 /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:137:
 error: expected expression before ',' token


 Any ideas?
 Saptarshi Guha




Re: X not in slaves, yet still being used as a tasktracker

2009-03-31 Thread Saptarshi Guha
Aha, there isn't a tasktracker running on X, yet jps shows 4 children,
and when jobs fail I see failures on X too.
So what exactly can I stop?
jps output on X
20044 org.apache.hadoop.mapred.JobTracker
12871 org.apache.hadoop.mapred.Child 127.0.0.1 51590
attempt_200903302220_0036_m_20_0 -541073721
12819 org.apache.hadoop.mapred.Child 127.0.0.1 51590
attempt_200903302220_0036_m_03_0 1185151254
19768 org.apache.hadoop.hdfs.server.namenode.NameNode
13004 org.apache.hadoop.mapred.Child 127.0.0.1 51590
attempt_200903302220_0036_r_10_0 1044080068
11396 org.saptarshiguha.rhipe.rhipeserver.RHIPEMain --tsp=mimosa:8200
--listen=4445 --quiet --no-save --max-nsize=1G --max-ppsize=10
19961 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
13131 sun.tools.jps.Jps -ml
22054
12931 org.apache.hadoop.mapred.Child 127.0.0.1 51590
attempt_200903302220_0036_m_32_0 -690673928
12962 org.apache.hadoop.mapred.Child 127.0.0.1 51590
attempt_200903302220_0036_m_33_0 203873369

Saptarshi Guha



On Mon, Mar 30, 2009 at 11:37 AM, Saptarshi Guha
saptarshi.g...@gmail.com wrote:
 thanks
 Saptarshi Guha



 On Sun, Mar 29, 2009 at 7:42 PM, Bill Au bill.w...@gmail.com wrote:
 The jobtracker does not have to be a tasktracker.  Just stop and don't start
 the tasktracker process.
 Bill

 On Sun, Mar 29, 2009 at 12:00 PM, Saptarshi Guha saptarshi.g...@gmail.com
 wrote:

 Hello,
 A machine X which is the master: it is the jobtracker, namenode and
 secondary namenode.
 It is not in the slaves file and is not part of the HDFS. However, on
 the mapreduce web page, I notice it is being used as a tasktracker.
 Is the jobtracker always a tasktracker? I'd rather not place too much
 load on the one machine which plays so many roles.
 (Hadoop 0.19.0)

 Thanks
 Saptarshi





Re: JNI and calling Hadoop jar files

2009-03-27 Thread Saptarshi Guha
Delayed response. However, I stand corrected. I was using a package
called rJava which integrates R and java. Maybe there was a
classloader issue but once I rewrote my stuff using C and JNI, the
issues disappeared.
When I create my configuration, i add as a resource
$HADOOP/conf/hadoop-{default,site},xml in that order.
All problems disappeared.
Sorry, I couldn't provide the information requested with the error
causing approach - I stopped using that package.

regards
Saptarshi Guha

P.S. As an aside, if I launch my own Java apps, which require the
Hadoop configuration etc., I have to manually add the
{default,site}.xml files.
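
For that stand-alone case, a minimal sketch of loading the cluster configuration explicitly; it assumes HADOOP_CONF_DIR is set, and relies on hadoop-default.xml already being bundled in the core jar so only hadoop-site.xml needs adding:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StandaloneClient {
  public static FileSystem clusterFs() throws Exception {
    Configuration conf = new Configuration();
    String confDir = System.getenv("HADOOP_CONF_DIR");
    // hadoop-default.xml ships inside the core jar; hadoop-site.xml has to
    // be on the classpath or added explicitly, as done here.
    conf.addResource(new Path(confDir + "/hadoop-site.xml"));
    return FileSystem.get(conf);
  }
}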



On Tue, Mar 24, 2009 at 6:52 AM, jason hadoop jason.had...@gmail.com wrote:
 The exception's reference to *org.apache.hadoop.hdfs.DistributedFileSystem*
 strongly implies that a hadoop-default.xml file, or at least a job.xml file,
 is present.
 Since hadoop-default.xml is bundled into the hadoop-0.X.Y-core.jar, the
 assumption is that the core jar is available.
 Given the class not found exception, the implication is that the
 hadoop-0.X.Y-core.jar is not available to JNI.

 Given the above constraints, the two likely possibilities are that the -core
 jar is unavailable or damaged, or that somehow the classloader being used
 does not have access to the -core  jar.

 A possible reason for the jar not being available is that the application is
 running on a different machine, or as a different user and the jar is not
 actually present or perhaps readable in the expected location.





 Which way is your JNI: a Java application calling into a native shared
 library, or a native application calling into a JVM that it instantiates via
 libjvm calls?

 Could you dump the classpath that is in effect before your failing JNI call?
 System.getProperty("java.class.path"), and for that matter,
 "java.library.path", or getenv("CLASSPATH"),
 and provide an ls -l of the core.jar from the class path, run as the user
 that owns the process, on the machine that the process is running on.

 <!-- from hadoop-default.xml -->
 <property>
   <name>fs.hdfs.impl</name>
   <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
   <description>The FileSystem for hdfs: uris.</description>
 </property>



 On Mon, Mar 23, 2009 at 9:47 PM, Jeff Eastman 
 j...@windwardsolutions.com wrote:

 This looks somewhat similar to my Subtle Classloader Issue from yesterday.
 I'll be watching this thread too.

 Jeff


 Saptarshi Guha wrote:

 Hello,
 I'm using some JNI interfaces, via a R. My classpath contains all the
 jar files in $HADOOP_HOME and $HADOOP_HOME/lib
 My class is
     public SeqKeyList() throws Exception {

         config = new  org.apache.hadoop.conf.Configuration();
         config.addResource(new Path(System.getenv("HADOOP_CONF_DIR")
                                     +"/hadoop-default.xml"));
         config.addResource(new Path(System.getenv("HADOOP_CONF_DIR")
                                     +"/hadoop-site.xml"));

         System.out.println("C="+config);
         filesystem = FileSystem.get(config);
         System.out.println("C="+config+" F= "+filesystem);
         System.out.println(filesystem.getUri().getScheme());

     }

 I am using a distributed filesystem
 (org.apache.hadoop.hdfs.DistributedFileSystem for fs.hdfs.impl).
 When run from the command line and this class is created everything works
 fine
 When called using jni I get
  java.lang.ClassNotFoundException:
 org.apache.hadoop.hdfs.DistributedFileSystem

 Is this a JNI issue? How can it work from the command line using the
 same classpath, yet throw this exception when run via JNI?
 Saptarshi Guha








 --
 Alpha Chapters of my book on Hadoop are available
 http://www.apress.com/book/view/9781430219422



JNI and calling Hadoop jar files

2009-03-23 Thread Saptarshi Guha
Hello,
I'm using some JNI interfaces, via R. My classpath contains all the
jar files in $HADOOP_HOME and $HADOOP_HOME/lib
My class is
public SeqKeyList() throws Exception {

    config = new  org.apache.hadoop.conf.Configuration();
    config.addResource(new Path(System.getenv("HADOOP_CONF_DIR")
                                +"/hadoop-default.xml"));
    config.addResource(new Path(System.getenv("HADOOP_CONF_DIR")
                                +"/hadoop-site.xml"));

    System.out.println("C="+config);
    filesystem = FileSystem.get(config);
    System.out.println("C="+config+" F= "+filesystem);
    System.out.println(filesystem.getUri().getScheme());

}

I am using a distributed filesystem
(org.apache.hadoop.hdfs.DistributedFileSystem for fs.hdfs.impl).
When run from the command line and this class is created everything works fine
When called using jni I get
 java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem

Is this a JNI issue? How can it work from the command line using the
same classpath, yet throw this exception when run via JNI?
Saptarshi Guha


Task Side Effect files and copying(getWorkOutputPath)

2009-03-16 Thread Saptarshi Guha
Hello,
I would like to produce side-effect files which will later be copied
to the output folder.
I am using FileOutputFormat, and in the Map's close() method I copy
files (from the local tmp/ folder) to
FileOutputFormat.getWorkOutputPath(job);

void close()  {
    if (shouldcopy) {
        ArrayList<Path> lop = new ArrayList<Path>();
        for(String ff :  tempdir.list()){
            lop.add(new Path(temppfx+ff));
        }
        dstFS.moveFromLocalFile(lop.toArray(new Path[]{}), dstPath);
    }
}

However, this throws an error java.io.IOException:
`hdfs://X:54310/tmp/testseq/_temporary/_attempt_200903160945_0010_m_00_0':
specified destination directory doest not exist

I thought this was the right place to drop side-effect files. Prior
to this I was copying to the output folder, but many were not copied -
or in fact all were, but during the reduce output stage many were
deleted; I am not sure (I used NullOutputFormat and all the files were
present in the output folder). So I resorted to getWorkOutputPath,
which threw the above exception.

So if I'm using FileOutputFormat, and my maps and/or reduces produce
side-effect files on the local FS:
1) when should I copy them to the DFS (e.g. the close method? or one at
a time in the map/reduce method)?
2) where should I copy them to?

I am using Hadoop 0.19 and have set jobConf.setNumTasksToExecutePerJvm(-1);
Also, each side-effect file produced has a unique name, i.e. there is
no overwriting.

Thank you
Saptarshi Guha
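
One pattern that usually sidesteps the copy-in-close() step (a sketch only, assuming FileOutputFormat and the old mapred API) is to create side-effect files directly under getWorkOutputPath() on the task's FileSystem, so they are promoted to the job output on task commit along with the regular output:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SideEffectFiles {
  // Create the side-effect file straight in the task's work output directory;
  // the framework moves its contents to the job output when the task commits.
  public static FSDataOutputStream create(JobConf job, String name)
      throws IOException {
    Path workDir = FileOutputFormat.getWorkOutputPath(job);
    FileSystem fs = workDir.getFileSystem(job);
    return fs.create(new Path(workDir, name));
  }
}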


RecordReader and non thread safe JNI libraries

2009-03-01 Thread Saptarshi Guha
Hello,
My RecordReader subclass reads from object X. To parse this object and
emit records, I need to use a C library and a JNI wrapper.

public boolean next(LongWritable key, BytesWritable value) throws 
IOException {
if (leftover == 0) return false;
long wi = pos + split.getStart();
key.set(wi);
value.readFields(X.at(wi));
pos ++; leftover --;
return true;
}

X.at uses the JNI lib to read a record number wi

My question is: who is running this?
1) For a given job, is one instance of this running on each
tasktracker, reading records and feeding them to the mappers on its
machine?
Or,
2) as I have mapred.tasktracker.map.tasks.maximum == 7, does each JVM
launched have one RecordReader running, feeding records to the maps its
JVM is running?

If it's either (1) or (2), I guess I'm safe from threading issues.

Please correct me if i'm totally wrong.
Regards

Saptarshi Guha


Re: RecordReader and non thread safe JNI libraries

2009-03-01 Thread Saptarshi Guha
Hello,
I am quite confused and my email seems to prove it. My question is,
essentially: I need to use this non-thread-safe library in the Mapper,
Reducer and RecordReader. Assume I do not create threads.
Will I run into any thread-safety issues?

In a given JVM the maps will run sequentially, and so will the reduces,
but will maps run alongside the record reader?

Hope this is clearer.
Regards


Saptarshi Guha



On Sun, Mar 1, 2009 at 11:07 PM, Saptarshi Guha
saptarshi.g...@gmail.com wrote:
 Hello,
 My RecordReader subclass reads from object X. To parse this object and
 emit records, i need the use of a C library and a JNI wrapper.

        public boolean next(LongWritable key, BytesWritable value) throws 
 IOException {
            if (leftover == 0) return false;
            long wi = pos + split.getStart();
            key.set(wi);
            value.readFields(X.at( wi);
            pos ++; leftover --;
            return true;
        }

 X.at uses the JNI lib to read a record number wi

 My question is who running this?
 1) For a given job, is one instance of this running on each
 tasktracker? reading records and feeding to the mappers on its
 machine?
 Or,
 2) as I have mapred.tasktracker.map.tasks.maximum == 7, does each jvm
 launched have one RecordReader running feeding records to the maps its
 jvm is running.

 If it's either (1) or (2), I guess I'm safe from threading issues.

 Please correct me if i'm totally wrong.
 Regards

 Saptarshi Guha



Re: Limit number of records or total size in combiner input using jobconf?

2009-02-23 Thread Saptarshi Guha
Thank you.




On Fri, Feb 20, 2009 at 5:34 PM, Chris Douglas chri...@yahoo-inc.com wrote:
 So here are my questions:
 (1) is there a  jobconf hint to limit the number of records in kviter?
 I can (and have) made a fix to my code that processes the values in a
 combiner step in batches (i.e takes N at a go,processes that and
 repeat), but was wondering if i could just set an option.

 Approximately and indirectly, yes. You can limit the amount of memory
 allocated to storing serialized records in memory (io.sort.mb) and the
 percentage of that space reserved for storing record metadata
 (io.sort.record.percent, IIRC). That can be used to limit the number of
 records in each spill, though you may also need to disable the combiner
 during the merge, where you may run into the same problem.

 You're almost certainly better off designing your combiner to scale well (as
 you have), since you'll hit this in the reduce, too.

 Since this occurred in the MapContext, changing the number of reducers
 wont help.
 (2) How does changing the number of reducers help at all? I have 7
 machines, so I feel 11 (a prime close to 7, why a prime?) is good
 enough (some machines are 16GB others 32GB)

 Your combiner will look at all the records for a partition and only those
 records in a partition. If your partitioner distributes your records evenly
 in a particular spill, then increasing the total number of partitions will
 decrease the number of records your combiner considers in each call. For
 most partitioners, whether the number of reducers is prime should be
 irrelevant. -C
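
A sketch of the two knobs mentioned above as they would be set on a 0.19 JobConf; the values are illustrative, not recommendations:

import org.apache.hadoop.mapred.JobConf;

public class SpillTuning {
  public static void configure(JobConf conf) {
    // Buffer for serialized map output; a smaller buffer means more, smaller
    // spills, so each combiner invocation sees fewer records.
    conf.setInt("io.sort.mb", 64);

    // Fraction of that buffer reserved for per-record accounting metadata.
    conf.set("io.sort.record.percent", "0.10");
  }
}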



Limit number of records or total size in combiner input using jobconf?

2009-02-13 Thread Saptarshi Guha
Hello,
Running an MR job on 7 machines failed when it came to processing
53GB. Browsing the errors:
org.saptarshiguha.rhipe.GRMapreduce$GRCombiner.reduce(GRMapreduce.java:149)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:1106)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:979)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:391)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:876)


The reason my line failed is that there were too many records. I
offload calculations to another program and it complained of being out of
memory.
Looking at the source in sortAndSpill where this happened (Hadoop 0.19):
      int spstart = spindex;
      while (spindex < endPosition &&
          kvindices[kvoffsets[spindex % kvoffsets.length]
                    + PARTITION] == i) {
        ++spindex;
      }
      // Note: we would like to avoid the combiner if we've fewer
      // than some threshold of records for a partition
      if (spstart != spindex) {
        combineCollector.setWriter(writer);
        RawKeyValueIterator kvIter = new
            MRResultIterator(spstart, spindex);
        combineAndSpill(kvIter, combineInputCounter);
      }


So here are my questions:
(1) is there a  jobconf hint to limit the number of records in kviter?
I can (and have) made a fix to my code that processes the values in a
combiner step in batches (i.e takes N at a go,processes that and
repeat), but was wondering if i could just set an option.

Since this occurred in the MapContext, changing the number of reducers
wont help.
(2) How does changing the number of reducers help at all? I have 7
machines, so I feel 11 (a prime close to 7, why a prime?) is good
enough (some machines are 16GB others 32GB)

Regards
Saptarshi
-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Very large file copied to cluster, and the copy fails. All blocks bad

2009-02-12 Thread Saptarshi Guha
Hello,
I have a 42 GB file on the local fs (call the machine A) which I need
to copy to HDFS (replication 1); according to the HDFS webtracker it
has 208GB across 7 machines.
Note, machine A has about 80 GB total, so there is no place to
store copies of the file.
Using the command bin/hadoop dfs -put /local/x /remote/tmp/ fails,
with all blocks being bad. This is not surprising, since the file is
copied entirely to the HDFS region that resides on A. Had the file
been copied across all machines, this would not have failed.

I have more experience with mapreduce and not much with the HDFS side
of things.
Is there a configuration option I'm missing that forces the file to be
split across the machines (when it is being copied)?
-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Re: Very large file copied to cluster, and the copy fails. All blocks bad

2009-02-12 Thread Saptarshi Guha
 Did you run the copy command from machine A?
Yes, exactly.
 I had to have the client doing the copy either on the master or on an
 off-cluster node.
Thanks! I uploaded it from an off-cluster machine (i.e. not participating in
the HDFS) and it worked splendidly.

Regards
Saptarshi


On Thu, Feb 12, 2009 at 11:03 PM, TCK moonwatcher32...@yahoo.com wrote:

 I believe that if you do the copy from an hdfs client that is on the
same machine as a data node, then for each block the primary copy
always goes to that data node, and only the replicas get distributed
among other data nodes. I ran into this issue once -- I had to have
the client doing the copy either on the master or on an off-cluster
node.
 -TCK



 --- On Thu, 2/12/09, Saptarshi Guha saptarshi.g...@gmail.com wrote:
 From: Saptarshi Guha saptarshi.g...@gmail.com
 Subject: Very large file copied to cluster, and the copy fails. All blocks bad
 To: core-user@hadoop.apache.org core-user@hadoop.apache.org
 Date: Thursday, February 12, 2009, 9:50 PM

 hello,
 I have a 42 GB file on the local fs(call the machine A)  which i need
 to copy to a HDFS (replicattion 1), according the HDFS webtracker it
 has 208GB across 7 machines.
 Note, the machine A has about 80 GB total, so there is no place to
 store copies of the file.
 Using the command bin/hadoop dfs -put /local/x /remote/tmp/ fails,
 with all blocks being bad. This is not surprising since the file is
 copied entirely to the HDFS region that resides on A. Had the file
 been copied across all machines, this would not have failed.

 I have more experience with mapreduce and not much with the hdfs side
 of things.
 Is there a configuration option i'm missing that forces the file to be
 split across the machines(when it is being copied)?
 --
 Saptarshi Guha - saptarshi.g...@gmail.com







-- 
Saptarshi Guha - saptarshi.g...@gmail.com


FileInnputFormat, FileSplit, and LineRecorder: where are they run?

2009-02-05 Thread Saptarshi Guha
Hello All,
In order to get a better understanding of Hadoop, I've started reading
the source and have a question.
FileInputFormat reads in files, splits them into split-sized chunks (which may
be bigger than the block size) and creates FileSplits.
The FileSplits contain the start, length *and* the locations of the split.
The LineRecordReader receives a split and emits records.

So far I think I'm correct (hopefully). Now, my questions:
Does the LineRecordReader run on a machine that is, in some sense, closest to
the location of the split? I.e.:
Q1: If the split is less than the block size, then the split is
located on one machine (apart from replicas): does the
LineRecordReader run on the machine which contains the split? Or at
least attempt to?
Q2: If a split is greater than the block size, it spans multiple
blocks which could reside on more than 1 machine. In this case, on
which machine does the LineRecordReader run? The machine 'closest' to
them?

Please correct me if i'm wrong.
Thank you
Saptarshi
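
For reference, the locality information in question is carried on the split itself, and the scheduler treats it as a preference rather than a guarantee when placing the map task (and therefore its record reader). A small sketch against the old mapred API showing where those hosts come from:

import java.io.IOException;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitLocations {
  public static void print(JobConf job) throws IOException {
    TextInputFormat fmt = new TextInputFormat();
    fmt.configure(job);
    for (InputSplit split : fmt.getSplits(job, 1)) {
      FileSplit fileSplit = (FileSplit) split;
      // Hosts holding the data at the start of the split; the scheduler
      // prefers, but does not guarantee, running the map task on one of them.
      System.out.println(fileSplit.getPath() + " @ "
          + java.util.Arrays.toString(fileSplit.getLocations()));
    }
  }
}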


-- 
Saptarshi Guha - saptarshi.g...@gmail.com


NLineInputFormat and very high number of maptasks

2009-01-20 Thread Saptarshi Guha
Hello,
When I use NLineInputFormat, when I output:
System.out.println("mapred.map.tasks: "+jobConf.get("mapred.map.tasks"));
I see 51, but on the jobtracker site, the number is 18114. Yet with
TextInputFormat it shows 51.
I'm using Hadoop - 0.19

Any ideas why?
Regards
Saptarshi

-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Re: NLineInputFormat and very high number of maptasks

2009-01-20 Thread Saptarshi Guha
Sorry, I see - every line is now a map task - one split, one task (in
this case N=1 line per split).

Is that correct?
Saptarshi
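
That matches how NLineInputFormat behaves: one split per N lines, so the map-task count is lines/N regardless of the mapred.map.tasks hint. A sketch of raising N; the property name is the one the 0.19 class reads, and the value is illustrative:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineSetup {
  public static void configure(JobConf conf) {
    conf.setInputFormat(NLineInputFormat.class);
    // One split (and hence one map task) per 100 input lines instead of 1.
    conf.setInt("mapred.line.input.format.linespermap", 100);
  }
}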

On Jan 20, 2009, at 11:39 AM, Saptarshi Guha wrote:


Hello,
When I use NLineInputFormat, when I output:
System.out.println("mapred.map.tasks: "+jobConf.get("mapred.map.tasks"));

I see 51, but on the jobtracker site, the number is 18114. Yet with
TextInputFormat it shows 51.
I'm using Hadoop - 0.19

Any ideas why?
Regards
Saptarshi

--
Saptarshi Guha - saptarshi.g...@gmail.com


Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha
If the church put in half the time on covetousness that it does on lust,
this would be a better world.
-- Garrison Keillor, Lake Wobegon Days



Sorting on several columns using KeyFieldSeparator and Paritioner

2009-01-17 Thread Saptarshi Guha
Hello,
I have a file with n columns, some of which are text and some numeric.
Given a sequence of indices, I would like to sort on those indices, i.e.
first on Index1, then within Index2 and so on.
In the example code below, I have 3 columns - numeric, text, numeric -
space separated.
Sort on 2 (reverse), then 1 (reverse, numeric) and lastly 3.

Though my code runs (and gives wrong results: col 2 is sorted in
reverse, and within that col 3, which is treated as text, and then col 1)
locally, when distributed I get a merge error - my guess is that
fixing the latter fixes the former.

This is the error:
java.io.IOException: Final merge failed
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier.createKVIterator(ReduceTask.java:2093)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$400(ReduceTask.java:457)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380)
at org.apache.hadoop.mapred.Child.main(Child.java:155)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 562
at 
org.apache.hadoop.io.WritableComparator.compareBytes(WritableComparator.java:128)
at 
org.apache.hadoop.mapred.lib.KeyFieldBasedComparator.compareByteSequence(KeyFieldBasedComparator.java:109)
at 
org.apache.hadoop.mapred.lib.KeyFieldBasedComparator.compare(KeyFieldBasedComparator.java:85)
at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:308)
at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:144)
at 
org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:103)
at 
org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:270)
at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:285)
at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:108)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier.createKVIterator(ReduceTask.java:2087)
... 3 more


Thanks for your time

And the code (not too big) is
==CODE==

public class RMRSort extends Configured implements Tool {

  static class RMRSortMap extends MapReduceBase implements
      Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value, OutputCollector<Text,
        Text> output, Reporter reporter) throws IOException {
      output.collect(value,value);
    }
  }

  static class RMRSortReduce extends MapReduceBase implements
      Reducer<Text, Text, NullWritable, Text> {

    public void reduce(Text key, Iterator<Text>
        values, OutputCollector<NullWritable, Text> output, Reporter reporter)
        throws IOException {
      NullWritable n = NullWritable.get();
      while(values.hasNext())
        output.collect(n,values.next());
    }
  }


  static JobConf createConf(String rserveport, String uid, String
      infolder, String outfolder) {
    Configuration defaults = new Configuration();
    JobConf jobConf = new JobConf(defaults, RMRSort.class);
    jobConf.setJobName("Sorter: "+uid);
    jobConf.addResource(new
        Path(System.getenv("HADOOP_CONF_DIR")+"/hadoop-site.xml"));
    //  jobConf.set("mapred.job.tracker", "local");
    jobConf.setMapperClass(RMRSortMap.class);
    jobConf.setReducerClass(RMRSortReduce.class);
    jobConf.set("map.output.key.field.separator", fsep);
    jobConf.setPartitionerClass(KeyFieldBasedPartitioner.class);
    jobConf.set("mapred.text.key.partitioner.options", "-k2,2 -k1,1 -k3,3");
    jobConf.setOutputKeyComparatorClass(KeyFieldBasedComparator.class);
    jobConf.set("mapred.text.key.comparator.options", "-k2r,2r -k1rn,1rn -k3n,3n");
    // infolder, outfolder information removed
    jobConf.setMapOutputKeyClass(Text.class);
    jobConf.setMapOutputValueClass(Text.class);
    jobConf.setOutputKeyClass(NullWritable.class);
    return(jobConf);
  }

  public int run(String[] args) throws Exception {
    return(1);
  }

}




-- 
Saptarshi Guha - saptarshi.g...@gmail.com


A reporter thread during the reduce stage for a long running line

2009-01-09 Thread Saptarshi Guha
Hello,
  Sorry for the puzzling subject. I have a single long-running
/statement/ in my reduce method, so the framework might assume my
reduce is not responding and kill it.
I solved the problem in the map method by subclassing MapRunner and
running a thread which calls reporter.progress() every minute or so.
However, the same thread does not run during the reduce (I checked this
by setting a status string in the thread, which did not appear (on the
Jobtracker website) during the reduce stage but did appear during
the map stage).

Hadoop v0.20 appears to solve this by having separate run methods for
both Map and Reduce; however, I'm using v0.19.
I scanned the Streaming source and it only subclasses MapRunner, so I
assume it has the same limitation (probably wrong; if so, can
someone point me to the location?).

Is there a way around this, /without/ starting a thread in the reduce function?
Hadoop v 0.19

Many thanks
Saptarshi
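
For reference, a sketch of the kind of heartbeat thread described above for the map side; it only uses Reporter.progress(), and the same object could be started at the top of reduce() if no better hook exists in 0.19:

import org.apache.hadoop.mapred.Reporter;

public class Heartbeat implements Runnable {
  private final Reporter reporter;
  private volatile boolean running = true;

  public Heartbeat(Reporter reporter) { this.reporter = reporter; }

  public void run() {
    while (running) {
      reporter.progress();          // tell the framework the task is alive
      try {
        Thread.sleep(60 * 1000L);   // roughly once a minute, as described above
      } catch (InterruptedException e) {
        return;
      }
    }
  }

  public void stop() { running = false; }

  public static Heartbeat start(Reporter reporter) {
    Heartbeat h = new Heartbeat(reporter);
    Thread t = new Thread(h, "progress-heartbeat");
    t.setDaemon(true);              // never keeps the task JVM alive
    t.start();
    return h;
  }
}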




-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Re: Combiner run specification and questions

2009-01-07 Thread Saptarshi Guha
. So as long as the correctness of the computation doesn't
 rely on a transformation performed in the combiner, it should be OK. In

Right, i had the same thought.




 However, this restriction limits the scalability of your solution. It might
 be necessary to work around R's limitations by breaking up large
 computations into intermediate steps, possibly by explicitly instantiating
 and running the combiner in the reduce.


So, I explicitly call the combiner? However, at times the reducer
needs all the values, so calling the combiner would not always work
here. However, if I recall correctly (from reading the Google paper),
one does not get a **humongous** number of values for a single key.


 1) I am guaranteed a reducer.
 So,

 The combiner, if defined, will run zero or more times on records emitted
 from the map, before being fed to the reduce.


 This zero case possibility worries me. However you mention, that it occurs

 collector spills in the map

  I have noticed this happening - what does 'spilling' mean?

 Records emitted from the map are serialized into a buffer, which is
 periodically written to disk when it is (sufficiently) full. Each of these
 batch writes is a spill. In casual usage, it refers to any time when
 records need to be written to disk. The merge of intermediate files into the
 final map output and merging in-memory segments to disk in the reduce are
 two examples. -C


Thanks for the explanation.
Regards
Saptarshi


Re: Combiner run specification and questions

2009-01-06 Thread Saptarshi Guha
I agree with the requirement that the key does not change. Of course,
the values can change.
I am primarily worried that the combiner might not be run at all - I
have 'successfully' integrated Hadoop and R, i.e. the user can provide
map/reduce functions written in R. However, R is not great with memory
management, and if I have N (N is huge) values for a given key K, then
R will baulk when it comes to processing this.
Thus the combiner. The combiner will process n values for K, and
ultimately a few values for K in the reducer. If the combiner were
not to run, R would collapse under the load.


1) I am guaranteed a reducer.
So,
The combiner, if defined, will run zero or more times on records  
emitted from the map, before being fed to the reduce.



This zero-case possibility worries me. However, you mention that it
occurs when the

collector spills in the map


I have noticed this happening - what does 'spilling' mean?

Thank you
Saptarshi


On Jan 5, 2009, at 10:22 PM, Chris Douglas wrote:

The combiner, if defined, will run zero or more times on records  
emitted from the map, before being fed to the reduce. It is run when  
the collector spills in the map and in some merge cases. If the  
combiner transforms the key, it is illegal to change its type, the  
partition to which it is assigned, or its ordering.


For example, if you emit a record (k,v) from your map and (k',v)  
from the combiner, your comparator is C(K,K) and your partitioner  
function is P(K), it must be the case that P(k) == P(k') and C(k,k')  
== 0. If either of these does not hold, the semantics to the reduce  
are broken. Clearly, if k is not transformed (as in true for most  
combiners), this holds trivially.


As was mentioned earlier, the purpose of the combiner is to compress  
data pulled across the network and spilled to disk. It should not  
affect the correctness or, in most cases, the output of the job. -C


On Jan 2, 2009, at 9:57 AM, Saptarshi Guha wrote:


Hello,
I would just like to confirm, when does the Combiner run(since it
might not be run at all,see below). I read somewhere that it is run,
if there is at least one reduce (which in my case i can be sure of).
I also read, that the combiner is an optimization. However, it is  
also

a chance for a function to transform the key/value (keeping the class
the same i.e the combiner semantics are not changed) and deal with a
smaller set ( this could be done in the reducer but the number of
values for a key might be relatively large).

However, I guess it would be a mistake for reducer to expect its  
input

coming from a combiner? E.g if there are only 10 value corresponding
to a key(as outputted by the mapper), will these 10 values go  
straight

to the reducer or to the reducer via the combiner?

Here I am assuming my reduce operations does not need all the values
for a key to work(so that a combiner can be used) i.e additive
operations.

Thank you
Saptarshi


On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org  
wrote:
The Combiner may be called 0, 1, or many times on each key between  
the
mapper and reducer. Combiners are just an application specific  
optimization
that compress the intermediate output. They should not have side  
effects or
transform the types. Unfortunately, since there isn't a separate  
interface
for Combiners, there is isn't a great place to document this  
requirement.

I've just filed HADOOP-4668 to improve the documentation.




--
Saptarshi Guha - saptarshi.g...@gmail.com




Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha
The way of the world is to praise dead saints and prosecute live ones.
-- Nathaniel Howe



Map input records(on JobTracker website) increasing and decreasing

2009-01-05 Thread Saptarshi Guha
Hello,
When I check the job tracker web page and look at the Map Input
records read, the map input record count goes up to, say, 1.4MN and then drops
to 410K and then goes up again.
The same happens with input/output bytes and output records.

Why is this? Is there something wrong with the mapper code? In my map
function, I assume I have received one line of input.
The oscillatory behavior does not occur for tiny datasets, but for 1GB
of data (tiny for others) I see this happening.

Thanks
Saptarshi
-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Re: Problem loading hadoop-site.xml - dumping parameters

2009-01-05 Thread Saptarshi Guha
Hello,
I have set my HADOOP_CONF_DIR to the conf folder and it is still not
loading. I have to manually set the options when I create my conf.
Have you resolved this?

Regards
Saptarshi

On Tue, Dec 30, 2008 at 5:25 PM, g00dn3ss g00dn...@gmail.com wrote:
 Hey all,

 I have a similar issue.  I am specifically having problems with the config
 option mapred.child.java.opts.  I set it to -Xmx1024m and it uses -Xmx200m
 regardless.  I am running Hadoop 0.18.2 and I'm pretty sure this option was
 working in the previous versions of Hadoop I was using.

 I am not explicitly setting HADOOP_CONF_DIR.  My site config is in
 ${HADOOP_HOME}/conf.  Just to test things further, I wrote a small map task
 to print out the ENV values and it has the correct value for HADOOP_HOME,
 HADOOP_LOG_DIR, HADOOP_OPTS, etc...  I also printed out the key/values in
 the JobConf passed to the mapper and it has my specified values for
 fs.default.name and mapred.job.tracker.  Other settings like dfs.name.dir,
 dfs.data.dir, and mapred.child.java.opts do not have my values.

 Any suggestion where to look at next?

 Thanks!



 On Mon, Dec 29, 2008 at 10:27 PM, Amareshwari Sriramadasu 
 amar...@yahoo-inc.com wrote:

 Saptarshi Guha wrote:

 Hello,
 I had previously emailed regarding heap size issue and have discovered
 that the hadoop-site.xml is not loading completely, i.e
   Configuration defaults = new Configuration();
 JobConf jobConf = new JobConf(defaults, XYZ.class);
 System.out.println("1:"+jobConf.get("mapred.child.java.opts"));
 System.out.println("2:"+jobConf.get("mapred.map.tasks"));
 System.out.println("3:"+jobConf.get("mapred.reduce.tasks"));

 System.out.println("3:"+jobConf.get("mapred.tasktracker.reduce.tasks.maximum"));

 returns -Xmx200m, 2,1,2 respectively, even though the numbers in the
 hadoop-site.xml are very different.

 Is there a way for hadoop to dump the parameters read in from
 hadoop-site.xml and hadoop-default.xml?



 Is your hadoop-site.xml present in the conf (HADOOP_CONF_DIR) directory?
 http://hadoop.apache.org/core/docs/r0.19.0/cluster_setup.html#Configuration

 -Amareshwari





-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Re: Problem loading hadoop-site.xml - dumping parameters

2009-01-05 Thread Saptarshi Guha
For some strange reason, neither hadoop-default.xml nor
hadoop-site.xml is loading. Both files are in the hadoop-0.19.0/conf/
folder.
HADOOP_CONF_DIR points to this.
In some Java code (not a job), I did:
Configuration cf = new Configuration();
String[] xu = new
String[]{"io.file.buffer.size","io.sort.mb","io.sort.factor","mapred.tasktracker.map.tasks.maximum","hadoop.logfile.count","hadoop.logfile.size","hadoop.tmp.dir","dfs.replication","mapred.map.tasks","mapred.job.tracker","fs.default.name"};
for(String x : xu){
    LOG.info(x+": "+cf.get(x));
}

All the values returned were default values (as noted in
hadoop-default.xml). Changing the values in hadoop-default.xml was not
reflected here. Grepping the source reveals the defaults are
hardcoded. So it seems neither hadoop-default.xml nor
hadoop-site.xml is loading.
Also, Configuration.java mentions that hadoop-site.xml is deprecated
(though still loaded).

Any suggestions?
Regards
Saptarshi


On Mon, Jan 5, 2009 at 2:26 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote:
 Hello,
 I have set my HADOOP_CONF_DIR to the conf folder and it is still not
 loading. I have to manually set the options when I create my conf.
 Have you resolved this?

 Regards
 Saptarshi

 On Tue, Dec 30, 2008 at 5:25 PM, g00dn3ss g00dn...@gmail.com wrote:
 Hey all,

 I have a similar issue.  I am specifically having problems with the config
 option mapred.child.java.opts.  I set it to -Xmx1024m and it uses -Xmx200m
 regardless.  I am running Hadoop 0.18.2 and I'm pretty sure this option was
 working in the previous versions of Hadoop I was using.

 I am not explicitly setting HADOOP_CONF_DIR.  My site config is in
 ${HADOOP_HOME}/conf.  Just to test things further, I wrote a small map task
 to print out the ENV values and it has the correct value for HADOOP_HOME,
 HADOOP_LOG_DIR, HADOOP_OPTS, etc...  I also printed out the key/values in
 the JobConf passed to the mapper and it has my specified values for
 fs.default.name and mapred.job.tracker.  Other settings like dfs.name.dir,
 dfs.data.dir, and mapred.child.java.opts do not have my values.

 Any suggestion where to look at next?

 Thanks!



 On Mon, Dec 29, 2008 at 10:27 PM, Amareshwari Sriramadasu 
 amar...@yahoo-inc.com wrote:

 Saptarshi Guha wrote:

 Hello,
 I had previously emailed regarding heap size issue and have discovered
 that the hadoop-site.xml is not loading completely, i.e
  Configuration defaults = new Configuration();
JobConf jobConf = new JobConf(defaults, XYZ.class);
 System.out.println("1:"+jobConf.get("mapred.child.java.opts"));
 System.out.println("2:"+jobConf.get("mapred.map.tasks"));
 System.out.println("3:"+jobConf.get("mapred.reduce.tasks"));

 System.out.println("3:"+jobConf.get("mapred.tasktracker.reduce.tasks.maximum"));

 returns -Xmx200m, 2,1,2 respectively, even though the numbers in the
 hadoop-site.xml are very different.

 Is there a way for hadoop to dump the parameters read in from
 hadoop-site.xml and hadoop-default.xml?



 Is your hadoop-site.xml present in the conf (HADOOP_CONF_DIR) directory?
 http://hadoop.apache.org/core/docs/r0.19.0/cluster_setup.html#Configuration

 -Amareshwari





 --
 Saptarshi Guha - saptarshi.g...@gmail.com




-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Re: Map input records(on JobTracker website) increasing and decreasing

2009-01-05 Thread Saptarshi Guha
True, I suspected that; however, none died. (Nothing in the Tasks
killed/Failed field.)

On Mon, Jan 5, 2009 at 4:04 PM, Doug Cutting cutt...@apache.org wrote:
 Values can drop if tasks die and must be re-run.

 Doug

 Aaron Kimball wrote:

 The actual number of input records is most likely steadily increasing. The
 counters on the web site are inaccurate until the job is complete; their
 values will fluctuate wildly. I'm not sure why this is.

 - Aaron

 On Mon, Jan 5, 2009 at 8:34 AM, Saptarshi Guha
 saptarshi.g...@gmail.comwrote:

 Hello,
 When I check the job tracker web page, and look at the Map Input
 records read, the map input records go up to say 1.4MN and then drop
 to 410K and then go up again.
 The same happens with input/output bytes and output records.

 Why is this? Is there something wrong with the mapper code? In my map
 function, I assume I have received one line of input.
 The oscillatory behavior does not occur for tiny datasets, but for 1GB
 of data (tiny for others) I see this happening.

 Thanks
 Saptarshi
 --
 Saptarshi Guha - saptarshi.g...@gmail.com






-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Re: Problem loading hadoop-site.xml - dumping parameters

2009-01-05 Thread Saptarshi Guha
Possibly. When I force load the configuration
Configuration cf = new Configuration();
LOG.info("Adding new resource");
   cf.addResource("hadoop-site.xml");

It doesn't load, though hadoop-site.xml is present in
hadoop-0.19.0/conf, and the hadoop-0.19 folder is replicated across
all the machines. However, if I load it via
cf.addResource(new Path("/home/godhuli/custom/hadoop-0.19.0/conf/hadoop-site.xml"));
it works! So clearly, there is some other hadoop-site.xml in the
classpath, but where? Should I add
/home/godhuli/custom/hadoop-0.19.0/conf/ to the Hadoop classpath in
hadoop-env.sh?

Thanks
Saptarshi
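
[Editorial note: one quick way to see which hadoop-site.xml actually wins on the classpath is to ask the class loader directly; a minimal sketch in plain Java:]

    java.net.URL url = Thread.currentThread().getContextClassLoader()
        .getResource("hadoop-site.xml");
    System.out.println("hadoop-site.xml resolved to: " + url);  // null means it is not on the classpath at all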

On Mon, Jan 5, 2009 at 5:09 PM, Jason Venner ja...@attributor.com wrote:
 somehow you have alternate versions of the file earlier in the class path.

 Perhaps someone's empty copies are bundled into one of your application jar
 files.

 Or perhaps the configurationfiles are not distributed to the datanodes in
 the expected locations.

 Saptarshi Guha wrote:

 For some strange reason, neither hadoop-default.xml nor
 hadoop-site.xml is loading. Both files are in hadoop-0.19.0/conf/
 folder.
 HADOOP_CONF_DIR points to this.
 In some java code, (not a job), i did
Configuration cf = new Configuration();
String[] xu = new
 String[]{"io.file.buffer.size","io.sort.mb","io.sort.factor","mapred.tasktracker.map.tasks.maximum","hadoop.logfile.count","hadoop.logfile.size","hadoop.tmp.dir","dfs.replication","mapred.map.tasks","mapred.job.tracker","fs.default.name"};
for(String x : xu){
LOG.info(x + ": " + cf.get(x));
}

 All the values returned were the default values (as noted in
 hadoop-default.xml). Changing a value in hadoop-default.xml was not
 reflected here. Grepping the source reveals the defaults are
 hardcoded. So it seems neither hadoop-default.xml nor
 hadoop-site.xml is loading.
 Also, Configuration.java mentions that hadoop-site.xml is deprecated
 (though still loaded).

 Any suggestions?
 Regards
 Saptarshi


 On Mon, Jan 5, 2009 at 2:26 PM, Saptarshi Guha saptarshi.g...@gmail.com
 wrote:


 Hello,
 I have set my HADOOP_CONF_DIR to the conf folder and still not
 loading. I have to manually set the options when I create my conf.
 Have you resolved this?

 Regards
 Saptarshi

 On Tue, Dec 30, 2008 at 5:25 PM, g00dn3ss g00dn...@gmail.com wrote:


 Hey all,

 I have a similar issue.  I am specifically having problems with the
 config
 option mapred.child.java.opts.  I set it to -Xmx1024m and it uses
 -Xmx200m
 regardless.  I am running Hadoop 0.18.2 and I'm pretty sure this option
 was
 working in the previous versions of Hadoop I was using.

 I am not explicitly setting HADOOP_CONF_DIR.  My site config is in
 ${HADOOP_HOME}/conf.  Just to test things further, I wrote a small map
 task
 to print out the ENV values and it has the correct value for
 HADOOP_HOME,
 HADOOP_LOG_DIR, HADOOP_OPTS, etc...  I also printed out the key/values
 in
 the JobConf passed to the mapper and it has my specified values for
 fs.default.name and mapred.job.tracker.  Other settings like
 dfs.name.dir,
 dfs.data.dir, and mapred.child.java.opts do not have my values.

 Any suggestion where to look at next?

 Thanks!



 On Mon, Dec 29, 2008 at 10:27 PM, Amareshwari Sriramadasu 
 amar...@yahoo-inc.com wrote:



 Saptarshi Guha wrote:



 Hello,
 I had previously emailed regarding heap size issue and have discovered
 that the hadoop-site.xml is not loading completely, i.e
  Configuration defaults = new Configuration();
   JobConf jobConf = new JobConf(defaults, XYZ.class);
    System.out.println("1:"+jobConf.get("mapred.child.java.opts"));
    System.out.println("2:"+jobConf.get("mapred.map.tasks"));
    System.out.println("3:"+jobConf.get("mapred.reduce.tasks"));

 System.out.println("3:"+jobConf.get("mapred.tasktracker.reduce.tasks.maximum"));

 returns -Xmx200m, 2,1,2 respectively, even though the numbers in the
 hadoop-site.xml are very different.

 Is there a way for hadoop to dump the parameters read in from
 hadoop-site.xml and hadoop-default.xml?





 Is your hadoop-site.xml present in the conf (HADOOP_CONF_DIR)
 directory?

 http://hadoop.apache.org/core/docs/r0.19.0/cluster_setup.html#Configuration

 -Amareshwari



 --
 Saptarshi Guha - saptarshi.g...@gmail.com










-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Combiner run specification and questions

2009-01-02 Thread Saptarshi Guha
Hello,
I would just like to confirm: when does the Combiner run (since it
might not be run at all, see below)? I read somewhere that it is run
if there is at least one reducer (which in my case I can be sure of).
I also read that the combiner is an optimization. However, it is also
a chance for a function to transform the key/value pairs (keeping the
classes the same, i.e. the combiner semantics are not changed) and deal
with a smaller set (this could be done in the reducer, but the number of
values for a key might be relatively large).

However, I guess it would be a mistake for the reducer to expect its input
to come from a combiner? E.g. if there are only 10 values corresponding
to a key (as output by the mapper), will these 10 values go straight
to the reducer, or to the reducer via the combiner?

Here I am assuming my reduce operations does not need all the values
for a key to work(so that a combiner can be used) i.e additive
operations.

Thank you
Saptarshi


On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org wrote:
 The Combiner may be called 0, 1, or many times on each key between the
 mapper and reducer. Combiners are just an application specific optimization
 that compress the intermediate output. They should not have side effects or
 transform the types. Unfortunately, since there isn't a separate interface
 for Combiners, there isn't a great place to document this requirement.
 I've just filed HADOOP-4668 to improve the documentation.



-- 
Saptarshi Guha - saptarshi.g...@gmail.com
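
[Editorial note: a minimal sketch, using the 0.18-style mapred API and a word-count-like sum (types assumed), of a combiner that is safe under these rules: its input and output types match the map output types, and the sum is associative, so running it zero, one, or many times leaves the reducer's result unchanged.]

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class SumCombiner extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();   // partial sums combine safely
        }
        output.collect(key, new IntWritable(sum));
      }
    }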


Problem loading hadoop-site.xml - dumping parameters

2008-12-29 Thread Saptarshi Guha
Hello,
I had previously emailed regarding heap size issue and have discovered
that the hadoop-site.xml is not loading completely, i.e
 Configuration defaults = new Configuration();
JobConf jobConf = new JobConf(defaults, XYZ.class);
System.out.println("1:"+jobConf.get("mapred.child.java.opts"));
System.out.println("2:"+jobConf.get("mapred.map.tasks"));
System.out.println("3:"+jobConf.get("mapred.reduce.tasks"));

System.out.println("3:"+jobConf.get("mapred.tasktracker.reduce.tasks.maximum"));

returns -Xmx200m, 2,1,2 respectively, even though the numbers in the
hadoop-site.xml are very different.

Is there a way for hadoop to dump the parameters read in from
hadoop-site.xml and hadoop-default.xml?

-- 
Saptarshi Guha - saptarshi.g...@gmail.com
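
[Editorial note: a minimal sketch of dumping everything a Configuration actually loaded, assuming the 0.19 API where Configuration can be iterated over its key/value pairs:]

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;

    public class DumpConf {
      public static void main(String[] args) {
        Configuration conf = new Configuration();   // loads hadoop-default.xml, then hadoop-site.xml
        for (Map.Entry<String, String> e : conf) {
          System.out.println(e.getKey() + " = " + e.getValue());
        }
      }
    }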


OutofMemory Error, inspite of large amounts provided

2008-12-28 Thread Saptarshi Guha
Hello,
I have work machines with 32GB and allocated 16GB to the heap size
==hadoop-env.sh==
export HADOOP_HEAPSIZE=16384

==hadoop-site.xml==
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx16384m</value>
</property>

The same code runs when not being run through Hadoop, but it fails
when in a Maptask.
Are there other places where I can specify the memory to the maptasks?

Regards
Saptarshi

-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Re: OutofMemory Error, inspite of large amounts provided

2008-12-28 Thread Saptarshi Guha
On Sun, Dec 28, 2008 at 4:33 PM, Brian Bockelman bbock...@cse.unl.edu wrote:
 Hey Saptarshi,

 Watch the running child process while using ps, top, or Ganglia
 monitoring.  Does the map task actually use 16GB of memory, or is the memory
 not getting set properly?

 Brian

I haven't figured out how to run Ganglia; also, the children
quit before I can see their memory usage. The trackers all use
16GB (from the ps command). However, I noticed some use only 512MB
(when I managed to catch them in time).

Regards


Re: OutofMemory Error, inspite of large amounts provided

2008-12-28 Thread Saptarshi Guha
Caught it in action.
Running  ps -e -o 'vsz pid ruser args' |sort -nr|head -5
on a machine where the map task was running
04812 16962 sguha/home/godhuli/custom/jdk1.6.0_11/jre/bin/java
-Djava.library.path=/home/godhuli/custom/hadoop/bin/../lib/native/Linux-amd64-64:/home/godhuli/custom/hdfs/mapred/local/taskTracker/jobcache/job_200812282102_0003/attempt_200812282102_0003_m_00_0/work
-Xmx200m 
-Djava.io.tmpdir=/home/godhuli/custom/hdfs/mapred/local/taskTracker/jobcache/job_200812282102_0003/attempt_200812282102_0003_m_00_0/work/tmp
-classpath /attempt_200812282102_0003_m_00_0/work
-Dhadoop.log.dir=/home/godhuli/custom/hadoop/bin/../logs
-Dhadoop.root.logger=INFO,TLA
-Dhadoop.tasklog.taskid=attempt_200812282102_0003_m_00_0
-Dhadoop.tasklog.totalLogFileSize=0 org.apache.hadoop.mapred.Child
127.0.0.1 40443 attempt_200812282102_0003_m_00_0 1525207782

Also, the reducer only used 540MB. I notice -Xmx200m was passed; how
do I change it?
Regards
Saptarshi

On Sun, Dec 28, 2008 at 10:19 PM, Saptarshi Guha
saptarshi.g...@gmail.com wrote:
 On Sun, Dec 28, 2008 at 4:33 PM, Brian Bockelman bbock...@cse.unl.edu wrote:
 Hey Saptarshi,

 Watch the running child process while using ps, top, or Ganglia
 monitoring.  Does the map task actually use 16GB of memory, or is the memory
 not getting set properly?

 Brian

 I haven't figured out how to run Ganglia; also, the children
 quit before I can see their memory usage. The trackers all use
 16GB (from the ps command). However, I noticed some use only 512MB
 (when I managed to catch them in time).

 Regards




-- 
Saptarshi Guha - saptarshi.g...@gmail.com
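
[Editorial note: for reference, a minimal sketch of forcing the child heap from the job itself (0.18-style JobConf assumed; MyJob is a placeholder driver class). The tasks' -Xmx comes from mapred.child.java.opts on the submitted job, while HADOOP_HEAPSIZE only sizes the daemons, which is why the trackers show 16GB but the children still get -Xmx200m.]

    JobConf conf = new JobConf(MyJob.class);          // MyJob is a placeholder
    conf.set("mapred.child.java.opts", "-Xmx16384m"); // overrides the -Xmx200m default for map/reduce children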


A question about MultipleOutputFormat

2008-12-23 Thread Saptarshi Guha
Hello,
MultipleOutputFormat is a very good idea. Thanks. I have a question;
from the web page:
"The reducer wants to write data to different files depending on the
actual keys .. and values."
Examining TestMultipleTextOutputFormat:

class KeyBasedMultipleTextOutputFormat extends
MultipleTextOutputFormat<Text, Text>

In my implementation, will the key and value classes be the same as
the ones given to the Reduce, or the Map?

Thank you
Saptarshi

-- 
Saptarshi Guha - saptarshi.g...@gmail.com
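
[Editorial note: a minimal sketch, assuming the 0.19 org.apache.hadoop.mapred.lib API, of the kind of subclass the test implies. The key/value types it sees are the job's output types, i.e. what the reducer emits (or the map output if the job has no reducer).]

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class KeyBasedOutput extends MultipleTextOutputFormat<Text, Text> {
      @Override
      protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // "name" is the default part-NNNNN file; prefix it with the key so each
        // key's records land in their own file
        return key.toString() + "-" + name;
      }
    }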


Classes Not Found even when classpath is mentioned (Starting mapreduce from another app)

2008-12-18 Thread Saptarshi Guha
Hello,
I intend to start a mapreduce job from another Java app, using the
ToolRunner.run method. This works fine as a local job. However, when
distributed I get java.lang.NoClassDefFoundError:
org/rosuda/REngine/Rserve/RserveException
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2389)
at java.lang.Class.getConstructor0(Class.java:2699)
at java.lang.Class.getDeclaredConstructor(Class.java:1985)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:74)
at org.apache.hadoop.mapred.JobConf.getInputFormat(JobConf.java:402)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

Now, I have a CLASSPATH entry in my .bashrc which contains the
location of Rserve.jar (which contains the above classes), yet I still
get the above error.
Do I have to somehow emulate RunJar.main or is there a simpler way out?
Thank you
Saptarshi

-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Re: Classes Not Found even when classpath is mentioned (Starting mapreduce from another app)

2008-12-18 Thread Saptarshi Guha
Hello,
What if each machine has an identical directory structure and
Rserve.jar is present in the same place (i.e. the entry in the classpath
is identical across machines)? Then will the latter suggestion work?

On Thu, Dec 18, 2008 at 9:02 PM, Zheng Shao zs...@facebook.com wrote:
 You either need to add the your Rserve.jar to -libjars, or put Rserve.jar in 
 a cluster-accessible NFS mount (and add it to CLASSPATH/HADOOP_CLASSPATH).

 Zheng
 -Original Message-
 From: Saptarshi Guha [mailto:saptarshi.g...@gmail.com]
 Sent: Thursday, December 18, 2008 3:38 PM
 To: core-user@hadoop.apache.org
 Subject: Classes Not Found even when classpath is mentioned (Starting 
 mapreduce from another app)

 Hello,
 I intend to start a mapreduce job from another java app, using
 ToolRunner.run method. This works fine on a local job. However when
 distributed i get java.lang.NoClassDefFoundError:
 org/rosuda/REngine/Rserve/RserveException
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2389)
at java.lang.Class.getConstructor0(Class.java:2699)
at java.lang.Class.getDeclaredConstructor(Class.java:1985)
at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:74)
at org.apache.hadoop.mapred.JobConf.getInputFormat(JobConf.java:402)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

 Now, i have a CLASSPATH entry in my .bashrc which contains the
 location of Rserve.jar (which contains the above classes), yet I still
 get the above error.
 Do I have to somehow emulate RunJar.main or is there a simpler way out?
 Thank you
 Saptarshi

 --
 Saptarshi Guha - saptarshi.g...@gmail.com




-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Re: Lookup HashMap available within the Map

2008-11-28 Thread Saptarshi Guha
The more I use it, I realize Hadoop is not built around shared
memory. For these types of things, use TSpaces (IBM); that way you can
have a flag to load it once and allow for sharing.
Regards
Saptarshi


On Tue, Nov 25, 2008 at 3:42 PM, Chris K Wensel [EMAIL PROTECTED] wrote:
 cool. If you need a hand with Cascading stuff, feel free to ping me on the
 mail list or #cascading irc. lots of other friendly folk there already.

 ckw

 On Nov 25, 2008, at 12:35 PM, tim robertson wrote:

 Thanks Chris,

 I have a different test running, then will implement that.  Might give
 cascading a shot for what I am doing.

 Cheers

 Tim


 On Tue, Nov 25, 2008 at 9:24 PM, Chris K Wensel [EMAIL PROTECTED] wrote:

 Hey Tim

 The .configure() method is what you are looking for i believe.

 It is called once per task, which in the default case, is once per jvm.

 Note Jobs are broken into parallel tasks, each task handles a portion of
 the
 input data. So you may create your map 100 times, because there are 100
 tasks, it will only be created once per jvm.

 I hope this makes sense.

 chris

 On Nov 25, 2008, at 11:46 AM, tim robertson wrote:

 Hi Doug,

 Thanks - it is not so much I want to run in a single JVM - I do want a
 bunch of machines doing the work, it is just I want them all to have
 this in-memory lookup index, that is configured once per job.  Is
 there some hook somewhere that I can trigger a read from the
 distributed cache, or is a Mapper.configure() the best place for this?
 Can it be called multiple times per Job meaning I need to keep some
 static synchronised indicator flag?

 Thanks again,

 Tim


 On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED]
 wrote:

 tim robertson wrote:

 Thanks Alex - this will allow me to share the shapefile, but I need to
 one time only per job per jvm read it, parse it and store the
 objects in the index.
 Is the Mapper.configure() the best place to do this?  E.g. will it
 only be called once per job?

 In 0.19, with HADOOP-249, all tasks from a job can be run in a single
 JVM.
 So, yes, you could access a static cache from Mapper.configure().

 Doug



 --
 Chris K Wensel
 [EMAIL PROTECTED]
 http://chris.wensel.net/
 http://www.cascading.org/







-- 
Saptarshi Guha - [EMAIL PROTECTED]
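
[Editorial note: a minimal sketch of the lazy, once-per-JVM load discussed above, using a static field guarded in configure() (old mapred API; the index contents and loading code are placeholders):]

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      // shared by every task that reuses this JVM
      private static Map<String, String> index;

      public void configure(JobConf job) {
        synchronized (LookupMapper.class) {
          if (index == null) {
            index = new HashMap<String, String>();
            // load the shapefile / lookup data (e.g. from the DistributedCache) here
          }
        }
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // look records up against the in-memory index here
      }
    }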


Creating a temp folder on the dfs?

2008-11-28 Thread Saptarshi Guha
Hello,
Is there any method for creating a temporary folder on the job's DFS?
jobconf.getWorkingDirectory() is actually my home folder (on the DFS);
I was thinking of something random-looking (like mkdtemp).

Regards
Saptarshi


-- 
Saptarshi Guha - [EMAIL PROTECTED]
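
[Editorial note: as far as I know there is no mkdtemp equivalent in the 0.18/0.19 FileSystem API; a minimal sketch of rolling one by hand (the base property and the "scratch-" prefix are just illustrative choices):]

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class TempDirs {
      public static Path makeTempDir(JobConf job) throws IOException {
        FileSystem fs = FileSystem.get(job);
        // hadoop.tmp.dir plus a random suffix stands in for mkdtemp
        Path tmp = new Path(job.get("hadoop.tmp.dir", "/tmp"),
            "scratch-" + Integer.toHexString(new Random().nextInt()));
        if (!fs.mkdirs(tmp)) {
          throw new IOException("could not create " + tmp);
        }
        return tmp;
      }
    }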


Problem building a inputformat using 0.18.1

2008-11-27 Thread Saptarshi Guha
Hello,
During the build of my input format I get the following error:

 [javac] /home/sguha/distr/mapred/SListInputFormat.java:118: incompatible types
 [javac] found   : java.lang.String
 [javac] required: org.apache.hadoop.io.Text
 [javac]   slist_name = job.get("testfoo");

The method where this is happening is
public void configure(JobConf job){
  slist_name = job.get("testfoo");
  }
 //slist_name is a String

I notice that JobConf is a subclass of Configuration. I checked the
sources of the latter and indeed there is the method
public String get(String name) {
return substituteVars(getProps().getProperty(name));
  }

So why do I get the above error?
My Java is fresh (a few weeks old) and I know this is not the correct
way to approach Hadoop programming, but I need to implement an idea as
soon as possible.
I appreciate your help.
-- 
Saptarshi Guha - [EMAIL PROTECTED]
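
[Editorial note: the compiler message suggests slist_name is actually declared as Text rather than String; a minimal sketch of the two obvious fixes (the "testfoo" property name is kept from the post):]

    // option 1: keep the field a String, since JobConf.get() returns a String
    private String slist_name;
    public void configure(JobConf job) {
      slist_name = job.get("testfoo");
    }

    // option 2: keep the field a Text and wrap the returned String
    // private Text slist_name;
    // slist_name = new Text(job.get("testfoo"));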


Re: A question about the combiner, reducer and the Output value class: can they be different?

2008-11-17 Thread Saptarshi Guha


On Nov 16, 2008, at 6:18 PM, Owen O'Malley wrote:

On Sun, Nov 16, 2008 at 2:18 PM, Saptarshi Guha [EMAIL PROTECTED] 
wrote:



Hello,
  If my understanding is correct, the combiner will read in  
values for
a given key, process it, output it and then **all** values for a  
key are

given to the reducer.



Not quite. The flow looks like RecordReader -> Mapper -> Combiner* ->
Reducer -> OutputFormat.




Yes, i glossed over that bit. Thanks for the correction.


The Combiner may be called 0, 1, or many times on each key between the
mapper and reducer. Combiners are just an application specific  
optimization
that compress the intermediate output. They should not have side  
effects or
transform the types. Unfortunately, since there isn't a separate  
interface
for Combiners, there isn't a great place to document this
requirement.

I've just filed HADOOP-4668 to improve the documentation.



Hmm, I had no idea that the combiner could be called 0 times. Thanks
for the heads up.


Thank you
Saptarshi



Empty source in map?

2008-11-16 Thread Saptarshi Guha

Hello,
	I have a JobControl with an unknown number of jobs. The way I execute
it is to add a job, start it and, upon finishing,
decide whether to add another job.
When I add another job and run it, I get the following (see below).
	What does the "Unknown Source" mean? Does it have anything to do with
"08/11/16 16:11:16 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics
with processName=JobTracker, sessionId= - already initialized"?


	08/11/16 16:16:41 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=

	[output of first job (job_1) skipped]
	08/11/16 16:16:51 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
	08/11/16 16:16:51 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
	08/11/16 16:16:51 INFO mapred.FileInputFormat: Total input paths to process : 1
	08/11/16 16:16:51 INFO mapred.FileInputFormat: Total input paths to process : 1
	08/11/16 16:16:51 INFO mapred.FileInputFormat: Total input paths to process : 1
	08/11/16 16:16:51 INFO mapred.FileInputFormat: Total input paths to process : 1

	08/11/16 16:16:51 INFO mapred.MapTask: numReduceTasks: 1
	08/11/16 16:16:51 INFO mapred.MapTask: io.sort.mb = 200
	08/11/16 16:16:52 INFO mapred.MapTask: data buffer = 159383552/199229440
	08/11/16 16:16:52 INFO mapred.MapTask: record buffer = 524288/655360
	08/11/16 16:16:52 WARN mapred.LocalJobRunner: job_local_0002
	java.lang.NullPointerException
		at org.saptarshiguha.clusters.ClusterCenter$ClosestCenterMR.map(Unknown Source)
		at org.saptarshiguha.clusters.ClusterCenter$ClosestCenterMR.map(Unknown Source)
		at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
		at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
		at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
	08/11/16 16:16:57 INFO mapred.LocalJobRunner: file:/tmp/input/sample.data:0+308



The map function is not run at all.


===BEGIN_CODE
JobConf jobConf_1 = ColumnRanges.createSetupJob(inPaths_1, outdir);
Job job_1 = new Job(jobConf_1, null);
JobControl theControl = new JobControl("Test");
theControl.addJob(job_1);
Thread theController = new Thread(theControl);
theController.start();
while (!theControl.allFinished()) {
    System.out.println("Jobs in waiting state: " + theControl.getWaitingJobs().size());
    System.out.println("Jobs in ready state: " + theControl.getReadyJobs().size());
    System.out.println("Jobs in running state: " + theControl.getRunningJobs().size());
    System.out.println("Jobs in success state: " + theControl.getSuccessfulJobs().size());
    System.out.println("Jobs in failed state: " + theControl.getFailedJobs().size());
    System.out.println("\n");
    try {
        Thread.sleep(5000);
    } catch (Exception e) {
        System.out.println("Exception\n");
    }
}
theControl.stop();

if (job_1.getState() == Job.SUCCESS) {
    System.out.println("Choosing initial centers");
    createSeedCenters();

    FileUtil.fullyDelete(outdir.getFileSystem(jobConf_1), outdir);
    JobConf jobConf = ClusterCenter.createSetupJob(inPaths_1, outdir);
    Job job = new Job(jobConf, null);
    theControl.addJob(job);
    Thread threadControl = new Thread(theControl);
    threadControl.start();
    while (!theControl.allFinished()) {
        try {
            Thread.sleep(5000);
        } catch (Exception e) {
            System.out.println("Exception\n");
        }
    }
}





A question about the combiner, reducer and the Output value class: can they be different?

2008-11-16 Thread Saptarshi Guha

Hello,
	If my understanding is correct, the combiner will read in values for  
a given key, process it, output it and then **all** values for a key  
are given to the reducer.

Then it ought to be possible for the combiner to be of the form

	public static class ClosestCenterCB extends MapReduceBase implements
	Reducer<IntWritable, Text, IntWritable, BytesWritable> {
		public void reduce(IntWritable key, Iterator<Text> values,
		OutputCollector<IntWritable, BytesWritable> output, Reporter reporter)
		{...}
	}

and the reducer:
	public static class ClosestCenterMR extends MapReduceBase implements
	Mapper<LongWritable, Text, IntWritable, Text>,
	Reducer<IntWritable, BytesWritable, IntWritable, Text> {
		public void reduce(IntWritable key, Iterator<BytesWritable> values,
		OutputCollector<IntWritable, Text> output, Reporter reporter) throws
		IOException { .. }
	}

However, when I set up the jobconf
theJob.setOutputKeyClass(IntWritable.class);
theJob.setOutputValueClass(Text.class);
theJob.setReducerClass(ClosestCenterMR.class);
theJob.setCombinerClass(ClosestCenterCB.class);

The output value class is Text, and so I get the following error:
	java.io.IOException: wrong value class: class org.apache.hadoop.io.BytesWritable is not class org.apache.hadoop.io.Text
	at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:144)
	at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:626)
	at org.saptarshiguha.clusters.ClusterCenter$ClosestCenterCB.reduce(Unknown Source)
	at org.saptarshiguha.clusters.ClusterCenter$ClosestCenterCB.reduce(Unknown Source)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:904)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:785)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
	08/11/16 16:48:18 INFO mapred.LocalJobRunner: file:/tmp/input/sample.data:0+308


Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha
This is your fortune.
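
[Editorial note: a minimal sketch of the constraint the exception enforces, keeping the class names from the post: a combiner's input and output types must both equal the map output types (IntWritable, Text here), so it cannot emit BytesWritable. The pass-through body below is only to show the types; real combining logic would aggregate the values. The reducer's input types then also have to be (IntWritable, Text), matching the map output, whether or not the combiner actually ran. Imports are the usual java.util.Iterator, org.apache.hadoop.io.* and org.apache.hadoop.mapred.* classes already used above.]

    public static class ClosestCenterCB extends MapReduceBase
        implements Reducer<IntWritable, Text, IntWritable, Text> {
      public void reduce(IntWritable key, Iterator<Text> values,
          OutputCollector<IntWritable, Text> output, Reporter reporter)
          throws IOException {
        while (values.hasNext()) {
          // same (IntWritable, Text) types going out as came in from the map
          output.collect(key, values.next());
        }
      }
    }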



Re: Reply: Passing information from one job to the next in a JobControl

2008-11-12 Thread Saptarshi Guha

Hi Jerry,
This actually makes a lot of sense. Hadn't seen it in this light.
Thank you
Saptarshi

On Nov 12, 2008, at 3:07 AM, jerry ye wrote:


Hi Saptarshi:

Please refer the following example code, I wish it can help you.

JobConf grepJob = new JobConf(getConf(), Grep.class);

try {

  grepJob.setJobName("search");

  FileInputFormat.setInputPaths(grepJob, args[0]);
  …
  FileOutputFormat.setOutputPath(grepJob, tempDir);
  .
  JobClient.runJob(grepJob);

  JobConf sortJob = new JobConf(Grep.class);
  sortJob.setJobName("sort");
  FileInputFormat.setInputPaths(sortJob, tempDir);
  .
  FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
  ……..
  JobClient.runJob(sortJob);

--Jerry

-----Original Message-----
From: Saptarshi Guha [mailto:[EMAIL PROTECTED]
Sent: November 11, 2008 12:06
To: core-user@hadoop.apache.org
Subject: Passing information from one job to the next in a JobControl

Hello,
I am using JobControl to run a sequence of jobs (Job_1, Job_2, ..., Job_n)
one after the other. Each job returns some information
e.g
key1 value1,value2
key2 value1,value2

and so on. This can be found in the outdir passed to the jar file.
Is there a way for Job_1 to return some data (which can be passed onto
the Job_2), without my main program having to read the information
from the file in the HDFS?
I could use things like Linda Spaces, however does MapReduce have a
framework for this?

Thanks
Saptarshi
--
Saptarshi Guha - [EMAIL PROTECTED]


Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha
Intel CPUs are not defective, they just act that way.
-- Henry Spencer



Passing information from one job to the next in a JobControl

2008-11-10 Thread Saptarshi Guha
Hello,
I am using JobControl to run a sequence of jobs (Job_1, Job_2, ..., Job_n)
one after the other. Each job returns some information
e.g
key1 value1,value2
key2 value1,value2

and so on. This can be found in the outdir passed to the jar file.
Is there a way for Job_1 to return some data (which can be passed onto
the Job_2), without my main program having to read the information
from the file in the HDFS?
I could use things like Linda Spaces, however does MapReduce have a
framework for this?

Thanks
Saptarshi
-- 
Saptarshi Guha - [EMAIL PROTECTED]
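
[Editorial note: one built-in channel for small results is job counters; a minimal sketch, assuming the 0.18/0.19 API (the enum, the counter name and the myapp.* property are made up for illustration):]

    // a counter shared between the tasks of Job_1 and the driver
    public enum Stats { CENTER_COUNT }

    // inside a Job_1 map or reduce method, via the Reporter argument:
    reporter.incrCounter(Stats.CENTER_COUNT, 1);

    // in the driver, after Job_1 completes:
    RunningJob first = JobClient.runJob(jobConf_1);
    long centers = first.getCounters().getCounter(Stats.CENTER_COUNT);
    jobConf_2.setLong("myapp.center.count", centers);  // Job_2 tasks read it back in configure()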


Keep jobcache files around

2008-10-09 Thread Saptarshi Guha

Hello,
I wish to keep my jobcache files after the run. I'm using a program
which can't read from STDIN (I'm using Hadoop streaming),
so I've written a Python wrapper to create a file and pass the file to
the program.
However, though the Python script runs (and maybe the program), I'm not
getting the desired results.

Nothing fails, and even though I've kept keep.failed.tasks=true
(-jobconf mapred.reduce.tasks=0 -jobconf keep.failed.tasks.files=1 in
the streaming command line),
nothing is preserved, i.e. the jobcache folders (no
attempt_200810091420_0004_m_03_3*** folders) are deleted from the
task nodes.


How can I keep them, even when nothing fails?
Regards
Saptarshi



Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha



Re: Jobtracker config?

2008-09-30 Thread Saptarshi Guha

Hi,
Thanks, it worked. Correct me if I'm wrong, but isn't this a  
configuration defect?
E.g the location of sec namenode is in conf/master and if I run start- 
dfs.sh, the sec namenode starts it on B.


Similarly, given that the Jobtracker is specified to run on C,  
shouldn't start-all.sh start the jobtracker on C?


Regards
Saptarshi

On Sep 29, 2008, at 6:37 PM, Arun C Murthy wrote:



On Sep 29, 2008, at 2:52 PM, Saptarshi Guha wrote:

Setup:
I am running the namenode on A, the sec. namenode on B and the  
jobtracker on C. The datanodes and tasktrackers are on Z1,Z2,Z3.


Problem:
However, the jobtracker is starting up on A. Here are my configs  
for Jobtracker


This would happen if you ran 'start-all.sh' on A rather than start- 
dfs.sh on A and start-mapred.sh on B. Is that what you did?


If not, please post the commands you used to start the HDFS and Map- 
Reduce clusters...


Arun



<property>
<name>mapred.job.tracker</name>
<value>C:54311</value>
<description>The host and port that the MapReduce job tracker runs
at.  If local, then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>C:50030</value>
<description>
  The job tracker http server address and port the server will listen on.
  If the port is 0 then the server will start on a free port.
</description>
</property>

Also, my masters file contains an entry for B (so that the sec. name
node starts on B) and my slaves file contains Z1,Z2,Z3.

The config files are synchronized across all machines.

Any help would be appreciated.
Thank you
Saptarshi

Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha







Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha
I think I'm schizophrenic.  One half of me's
paranoid and the other half's out to get him.



A quick question about partioner and reducer

2008-09-30 Thread Saptarshi Guha

Hello,
I am slightly confused about the number of reducers executed and the  
size of data each receives.

Setup:
I have a setup of 5 task trackers.
In my hadoop-site:
(1)
<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is local.
  </description>
</property>
(2)
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

(3)
However, from http://hadoop.apache.org/core/docs/r0.18.1/api/index.html:
"The total number of partitions is the same as the number of reduce
tasks for the job."


Q:
So does that mean (from (1) and (2)) there will be a total of 7 reduce
tasks distributed across 5 machines, such that
no machine receives more than 7 reduce tasks?
If so, suppose I have millions of unique keys which need to be
reduced (e.g. urls/hashes); these will be partitioned into 7 groups
(from (3))
and distributed across 5 machines?
Which is equivalent to saying that the number of reduce tasks run
across all machines will be equal to 7?

Wouldn't that be too large a number of keys for each reduce task?

Are these possible solutions:
1) Fixed machines (5), but increase mapred.reduce.tasks (loss in
performance?)
2) Increase the number of machines (not possible for me, but a theoretical
solution) and set mapred.reduce.tasks to a commensurate number



Many thanks for your time
Saptarshi


Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha
More people are flattered into virtue than bullied out of vice.
-- R. S. Surtees
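
[Editorial note: on the partitioning question above, each of the 7 partitions really does receive a share of all the unique keys; the default HashPartitioner simply hashes every map output key into one of mapred.reduce.tasks buckets, roughly as in the sketch below (0.18-style API). Each reduce task then calls reduce() once per key in its partition, so a large key count means many sequential reduce() calls, not one huge group.]

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {
      public void configure(JobConf job) {}
      public int getPartition(K2 key, V2 value, int numReduceTasks) {
        // the same key always lands in the same reduce; keys are spread, not grouped per machine
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }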



Jobtracker config?

2008-09-29 Thread Saptarshi Guha

Hello,
Back to hadoop after several months. I searched a few thousand  
messages but the config for jobtracker escaped me.


Setup:
I am running the namenode on A, the sec. namenode on B and the  
jobtracker on C. The datanodes and tasktrackers are on Z1,Z2,Z3.


Problem:

However, the jobtracker is starting up on A. Here are my configs for  
Jobtracker

<property>
  <name>mapred.job.tracker</name>
  <value>C:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If local, then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
<property>
  <name>mapred.job.tracker.http.address</name>
  <value>C:50030</value>
  <description>
    The job tracker http server address and port the server will listen on.
    If the port is 0 then the server will start on a free port.
  </description>
</property>

Also, my masters file contains an entry for B (so that the sec. name node
starts on B) and my slaves file contains Z1,Z2,Z3.

The config files are synchronized across all machines.

Any help would be appreciated.
Thank you
Saptarshi

Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha





Data-local tasks

2008-06-30 Thread Saptarshi Guha

Hello,
	I recall asking this question, but this is in addition to what I've
asked.

Firstly, to recap my question and Arun's specific response:

--  On May 20, 2008, at 9:03 AM, Saptarshi Guha wrote:  Hello,
--	Does the Data-local map tasks counter mean the number of tasks
that had the input data already present on the machine they
are running on?

--  i.e. there wasn't a need to ship the data to them.

Response from Arun
--	Yes. Your understanding is correct. More specifically it means that
the map-task got scheduled on a machine on which one of the
--	replicas of its input-split-block was present and was served by
the datanode running on that machine. *smile* Arun



	Now, is Hadoop designed to schedule a map task on a machine which has
one of the replicas of its input split block?
	Failing that, does it then assign the map task to a machine close to one
that contains a replica of its input split block?

Are there any performance metrics for this?

Many thanks
Saptarshi


Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha





Re: Import path for hadoop streaming with python

2008-05-22 Thread Saptarshi Guha
I haven't done this using Hadoop, but before 0.16.4 I had written my
own distributed batch processor using HDFS as a common file storage
and remote execution of Python scripts.
They all required a custom module which was copied to the remote temp  
folders (a primitive implementation of cacheFile)


So this is what I did:  just after #!/usr/bin/env python

import sys
sys.path.append('.')
import mylib
dostuff

so that your module can be found in the current path.
It should work thereafter
Regards
Saptarshi

On May 22, 2008, at 7:39 PM, Martin Blom wrote:


Hello all,

I'm trying to stream a little python script on my small hadoop
cluster, and it doesn't work like I thought it would.

The script looks something like

#!/usr/bin/env python
import mylib
dostuff

where mylib is a small python library that I want included, and I
launch the whole thing with something like

bin/hadoop jar contrib/streaming/hadoop-0.16.4-streaming.jar
-cacheFile hdfs://master:54310/user/hadoop/mylib.py#mylib.py -file
script.py -mapper script.py -input input -output output







so it seems to me like the library should be available to the script.
When I run the script locally on my machine everything works perfectly
fine. However, when I run it it the script can't find the library.
Does hadoop do anything strange to default paths? Am I missing
something obvious? Any pointers or ideas on how to fix this would be
great.

Martin Blom


Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha
You love your home and want it to be beautiful.





Meaning of Data-local map tasks in the web status gui to MapReduce

2008-05-20 Thread Saptarshi Guha

Hello,
	Does the Data-local map tasks counter mean the number of tasks that
had the input data already present on the machine they are
running on? I.e. there wasn't a need to ship the data to them.

Thanks
Saptarshi

Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha





Large(Thousands) of files -fast

2008-05-18 Thread Saptarshi Guha

Hello,
	I have a similar scenario to jkupferman's situation - thousands of files,
mostly ranging from KBs, some MBs and a few GBs. I am not too
familiar with Java and am using
Hadoop streaming with Python. The mapper must work on individual files.
I've placed the thousands of files into the DFS.
	I've given the map job a manifest listing the locations of the files;
this is given to Hadoop, which streams it to my Python script, which
then copies the specified file and processes it.


	I also tried tar-ring the files, converting them into a sequence file  
and then using SequenceFileAsTextInputFormat.
	The problem with this is that it sends the file contents as a string
representation of the bytes, which I would have to convert.

	Q: Is there any way I can make it send me the data as
BytesWritable (mentioned below), using the command line and Python?


Thanks for your time.
Regards 
Saptarshi


On May 18, 2008, at 10:54 PM, Brian Vargas wrote:


Hi,

You can realize a huge improvement by sticking them into a sequence  
file.  With lots of small files, name lookups against the name node  
will be a big bottleneck.


One easy approach is making the key be a Text of the filename that  
was loaded in, and the value be a BytesWritable, which is the  
contents of the file.  Since they're relatively small files (or you  
wouldn't be having this problem), you won't have to worry about  
OOMing yourself.  It worked really well for me, dealing with a few  
hundred thousand ~4MB files.


Brian

jkupferman wrote:

Hi Everyone,
I am working on a project which takes in data from a lot of text  
files, and
although there are a lot of ways to do it, it is not clear to me  
which is
the best/fastest. I am working on an EC2 cluster with approximately  
20

machines.
The data is currently spread across 20k text files (total of a  
~3gb ), each
of which needs to be treated as a whole (no splits within those  
files), but
I am willing to change around the format if I can get increased  
speed. Using
the regular TextInputFormat adjusted to take in entire files is  
pretty slow
since each file takes a minimum of about ~3 seconds no matter how  
small it

is.
From what I have read the possible options to proceed with are as  
follows:

1. Use MultiFileInputSplit, it seems to be designed for this sort of
situation, but I have yet to see an implementation of this, or a  
commentary

on its performance increase over the regular input.
2. Read the data in, and output it as a Sequence File and use the  
sequence

file as input from there on out.
3. Condense the files down to a small number of files (say ~100)  
and then
delimit the files so each part gets a separate record reader. If  
anyone could give me guidance as to what will provide the best

performance for this setup, I would greatly appreciate it.
Thanks for your help




Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha
The typewriting machine, when played with expression, is no more
annoying than the piano when played by a sister or near relation.
-- Oscar Wilde
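
[Editorial note: for completeness, a minimal sketch of packing many small files into a SequenceFile with the filename as a Text key and the raw contents as a BytesWritable value, roughly as Brian describes above (paths come from the command line; 0.18-style API assumed):]

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);   // directory of small files
        Path outFile = new Path(args[1]);    // resulting sequence file
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, outFile, Text.class, BytesWritable.class);
        try {
          for (FileStatus stat : fs.listStatus(inputDir)) {
            byte[] contents = new byte[(int) stat.getLen()];
            FSDataInputStream in = fs.open(stat.getPath());
            try {
              in.readFully(contents);        // files are small, so one buffer is fine
            } finally {
              in.close();
            }
            writer.append(new Text(stat.getPath().getName()), new BytesWritable(contents));
          }
        } finally {
          writer.close();
        }
      }
    }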





Re: Large(Thousands) of files -fast

2008-05-18 Thread Saptarshi Guha
Aah, use org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat as  
the inputformat.

Thanks
Saptarshi

On May 18, 2008, at 11:17 PM, Saptarshi Guha wrote:


Hello,
	I have a similar scenario to jkupferman's situation - thousands of
files, mostly ranging from KBs, some MBs and a few GBs. I am
not too familiar with Java and am using
	Hadoop streaming with Python. The mapper must work on individual
files.

I've placed the thousands of files into the DFS.
	I've given the map job a manifest listing the locations of the files;
this is given to Hadoop, which streams it to my Python script, which
then copies the specified file and processes it.


	I also tried tar-ring the files, converting them into a sequence  
file and then using SequenceFileAsTextInputFormat.
	The problem with this is that it sends the file contents as a  
string representation of the bytes, which i would have to convert.


	Q: Is there any way I can  make it send me the data as  
BytesWritable(mentioned below), using the command line and python?


Thanks for your time.
Regards 
Saptarshi


On May 18, 2008, at 10:54 PM, Brian Vargas wrote:


Hi,

You can realize a huge improvement by sticking them into a sequence  
file.  With lots of small files, name lookups against the name node  
will be a big bottleneck.


One easy approach is making the key be a Text of the filename that  
was loaded in, and the value be a BytesWritable, which is the  
contents of the file.  Since they're relatively small files (or you  
wouldn't be having this problem), you won't have to worry about  
OOMing yourself.  It worked really well for me, dealing with a few  
hundred thousand ~4MB files.


Brian

jkupferman wrote:

Hi Everyone,
I am working on a project which takes in data from a lot of text  
files, and
although there are a lot of ways to do it, it is not clear to me  
which is
the best/fastest. I am working on an EC2 cluster with  
approximately 20

machines.
The data is currently spread across 20k text files (total of a  
~3gb ), each
of which needs to be treated as a whole (no splits within those  
files), but
I am willing to change around the format if I can get increased  
speed. Using
the regular TextInputFormat adjusted to take in entire files is  
pretty slow
since each file takes a minimum of about ~3 seconds no matter how  
small it

is.
From what I have read the possible options to proceed with are as  
follows:

1. Use MultiFileInputSplit, it seems to be designed for this sort of
situation, but I have yet to see an implementation of this, or a  
commentary

on its performance increase over the regular input.
2. Read the data in, and output it as a Sequence File and use the  
sequence

file as input from there on out.
3. Condense the files down to a small number of files (say ~100)  
and then
delimit the files so each part gets a separate record reader. If  
anyone could give me guidance as to what will provide the best

performance for this setup, I would greatly appreciate it.
Thanks for your help




Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha
The typewriting machine, when played with expression, is no more
annoying than the piano when played by a sister or near relation.
-- Oscar Wilde



Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha
Back when I was a boy, it was 40 miles to everywhere, uphill both ways
and it was always snowing.


