Hi,

Thank you for your answers. I have tried setting -Dmapred.max.split.size in the $MAHOUT_OPTS variable in the bin/mahout file, but it doesn't seem to make any difference. I tried setting it to 1MB to see if that would break the input into more pieces.
Right now I have it set up like this:

MAHOUT_OPTS="$MAHOUT_OPTS -Dhadoop.log.dir=$MAHOUT_LOG_DIR"
MAHOUT_OPTS="$MAHOUT_OPTS -Dhadoop.log.file=$MAHOUT_LOGFILE"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.min.split.size=1048576"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.max.split.size=1048576"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.child.java.opts=-Xmx4096m"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.child.java.opts=-Xmx4096m"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.output.compress=true"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.compress.map.output=true"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.tasks=18"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.tasks=1"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.sort.factor=30"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.sort.mb=1024"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.file.buffer.size=32786"

Should I instead set this up in Hadoop's mapred-site.xml configuration file?

Thank you,
Rafael Alfaro

On Thu, Jun 20, 2013 at 7:15 AM, Alan Gardner <[email protected]> wrote:

> You need to set the size of the input splits; by default the
> FileInputFormat will split on blocks. You can override this with
> mapred.max.split.size; if you want 10 map tasks from a 100MB file, set
> the flag -Dmapred.max.split.size=10485760 (10MB in bytes).
>
> Splitting the files themselves up is bad long-term because every block
> and file takes up memory in the NameNode. If you have a bunch of small
> files (or a bunch of files split into small blocks), you may run out of
> memory in the NameNode before you run out of disk space on your cluster.
> Of course, federation is supposed to take care of this, but it's still
> best practice to keep your files large.
>
> On Thu, Jun 20, 2013 at 3:30 AM, Dan Filimon <[email protected]> wrote:
>
>> Hi!
>>
>> I don't know the particular details of this job, but usually the number
>> of mappers being launched is a Hadoop matter, and Hadoop looks at the
>> number of input splits as its main hint.
>> So, if your matrices are split into multiple smaller files, you'll
>> likely get multiple mappers.
>>
>> Since I assume your matrices are SequenceFiles, maybe try out this:
>>
>> https://github.com/apache/mahout/blob/trunk/examples/src/main/java/org/apache/mahout/clustering/streaming/tools/ResplitSequenceFiles.java
>>
>> This tool is called "resplit" and it should work for any Writables.
>>
>> https://github.com/apache/mahout/blob/trunk/src/conf/driver.classes.default.props
>>
>> See if resplitting works. :)
>>
>> On Thu, Jun 20, 2013 at 9:18 AM, Rafa Alfaro <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > I'm trying to run the matrix multiplication of two relatively small
>> > matrices (4219*200)(200*54622), but it is taking too long because only
>> > a single mapper is launched. I'm running this on a 10-node cluster.
>> >
>> > I have tried changing the MAHOUT_OPTS in the mahout file:
>> >
>> > MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.tasks=18"
>> > MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.tasks=9"
>> >
>> > Also passing the options directly on the command line:
>> >
>> > mahout matrixmult -Dmapred.map.tasks=18 -Dmapred.reduce.tasks=9
>> >   --numRowsA 200 --numColsA 4819 --numRowsB 200 --numColsB 54622
>> >   --inputPathA matrixA --inputPathB matrixB
>> >
>> > But no luck with this either.
>> >
>> > My Hadoop mapred-site.xml looks like this:
>> >
>> > <configuration>
>> >   <property>
>> >     <name>mapred.job.tracker</name>
>> >     <value>serverX:54311</value>
>> >     <final>true</final>
>> >   </property>
>> >   <property>
>> >     <name>mapred.child.ulimit</name>
>> >     <value>unlimited</value>
>> >   </property>
>> >   <property>
>> >     <name>mapred.tasktracker.map.tasks.maximum</name>
>> >     <value>2</value>
>> >     <final>true</final>
>> >   </property>
>> >   <property>
>> >     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>> >     <value>2</value>
>> >     <final>true</final>
>> >   </property>
>> >   <property>
>> >     <name>mapred.child.java.opts</name>
>> >     <value>-Xmx2000m</value>
>> >   </property>
>> > </configuration>
>> >
>> > Am I missing something in the configuration?
>> >
>> > Right now, with 1 mapper, it is taking 4 minutes on average to advance
>> > the map task by 1%.
>> >
>> > Thank you,
>> > Rafael Alfaro

>
> --
> Alan Gardner
> Solutions Architect - CTO Office
>
> [email protected] | LinkedIn:
> http://www.linkedin.com/profile/view?id=65508699 |
> @alanctgardner <https://twitter.com/alanctgardner>
> Tel: +1 613 565 8696 x1218
> Mobile: +1 613 897 5655
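On the mapred-site.xml question: the same property can be set cluster-wide, though a per-job -D flag is usually the better fit, since the right split size depends on each job's input. As a sketch, a site-wide default would look something like this (the 10 MB value is illustrative only; also note that properties marked <final>true</final> in the site file, like several in your configuration above, cannot be overridden per job):

```xml
<property>
  <name>mapred.max.split.size</name>
  <value>10485760</value> <!-- 10 MB; illustrative value -->
</property>
```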
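To make the split-size arithmetic from Alan's suggestion concrete, here is a minimal shell sketch of picking a mapred.max.split.size for a target number of mappers. The file size and mapper count below are made-up illustration values, and the mahout invocation at the end is commented out because the exact flags depend on your job:

```shell
#!/bin/sh
# Choose a max split size so that an input file of a given size yields
# roughly TARGET_MAPPERS input splits (illustrative values only).
FILE_SIZE=104857600      # 100 MB input, in bytes
TARGET_MAPPERS=10

# Round up so we never end up with more splits than intended.
SPLIT_SIZE=$(( (FILE_SIZE + TARGET_MAPPERS - 1) / TARGET_MAPPERS ))
echo "$SPLIT_SIZE"       # 10485760 bytes = 10 MB per split

# Then pass it through per job, e.g. (hypothetical invocation):
# mahout matrixmult -Dmapred.max.split.size=$SPLIT_SIZE ...
```

Setting this per job (via -D) rather than cluster-wide keeps the choice tied to the file it was computed for.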
