Hi,

Thank you for your answers.  I have tried setting
-Dmapred.max.split.size in the $MAHOUT_OPTS variable in the bin/mahout
file, but it doesn't seem to change anything.  I set it to 1MB to see
whether that would break the input into more pieces.

Right now I have it set up like this:

MAHOUT_OPTS="$MAHOUT_OPTS -Dhadoop.log.dir=$MAHOUT_LOG_DIR"
MAHOUT_OPTS="$MAHOUT_OPTS -Dhadoop.log.file=$MAHOUT_LOGFILE"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.min.split.size=1048576"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.max.split.size=1048576"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.child.java.opts=-Xmx4096m"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.child.java.opts=-Xmx4096m"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.output.compress=true"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.compress.map.output=true"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.tasks=18"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.tasks=1"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.sort.factor=30"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.sort.mb=1024"
MAHOUT_OPTS="$MAHOUT_OPTS -Dio.file.buffer.size=32768"
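
In case it's a property-name issue: I read that on newer Hadoop (2.x/YARN)
the mapred.* names are deprecated in favor of mapreduce.* equivalents, so
maybe the split bounds need the newer names. Just a sketch of what I'd try
(the 1048576 value is my 1MB guess from above):

```shell
# Sketch: same 1MB split bounds, but with the newer (Hadoop 2.x)
# property names in case the mapred.* ones are silently ignored.
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapreduce.input.fileinputformat.split.minsize=1048576"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapreduce.input.fileinputformat.split.maxsize=1048576"

# Quick check that the flags actually made it into the variable:
echo "$MAHOUT_OPTS"
```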

Should I set this up in Hadoop's mapred-site.xml configuration file instead?
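
Something like this, maybe? (Just a sketch of what I'd add; I realize it
would apply the 1MB split bounds to every job on the cluster, not only to
the Mahout one.)

```xml
<!-- Sketch: cluster-wide split bounds in mapred-site.xml.
     1048576 bytes = 1MB; affects ALL jobs, not just Mahout. -->
<property>
  <name>mapred.min.split.size</name>
  <value>1048576</value>
</property>
<property>
  <name>mapred.max.split.size</name>
  <value>1048576</value>
</property>
```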

Thank you,
Rafael Alfaro

On Thu, Jun 20, 2013 at 7:15 AM, Alan Gardner <[email protected]> wrote:
> You need to set the size of the input splits; by default the
> FileInputFormat will split on blocks. You can override this with
> mapred.max.split.size; if you want 10 map jobs from a 100MB file, set the
> flag -Dmapred.max.split.size=10485760 (10MB in bytes).
>
> Splitting the files themselves up is bad long-term because every block and
> file takes up memory in the Namenode. If you have a bunch of small files
> (or a bunch of files split into small blocks), you may run out of memory in
> the Namenode before you run out of disk space on your cluster. Of course,
> federation is supposed to take care of this, but it's still best practice
> to keep your files large.
>
>
> On Thu, Jun 20, 2013 at 3:30 AM, Dan Filimon
> <[email protected]> wrote:
>
>> Hi!
>>
>> I don't know the particular details of this job, but the number of
>> mappers launched is usually up to Hadoop, and Hadoop's main hint is the
>> number of input splits.
>> So, if your matrices are split across multiple smaller files, you'll
>> likely get multiple mappers.
>>
>> Since I assume your matrices are SequenceFiles, maybe try out this:
>>
>> https://github.com/apache/mahout/blob/trunk/examples/src/main/java/org/apache/mahout/clustering/streaming/tools/ResplitSequenceFiles.java
>>
>> This tool is called "resplit" and it should work for any Writables.
>>
>> https://github.com/apache/mahout/blob/trunk/src/conf/driver.classes.default.props
>>
>> See if resplitting works. :)
>>
>>
>> On Thu, Jun 20, 2013 at 9:18 AM, Rafa Alfaro <[email protected]>
>> wrote:
>>
>> > Hi,
>> >
>> > I'm trying to run the matrix multiplication of two relatively small
>> > matrices (4219*200)(200*54622), but it is taking too long because only
>> > a single mapper is launched. I'm running this on a 10-node cluster.
>> >
>> > I have tried changing the MAHOUT_OPTS in the mahout file:
>> >
>> > MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.tasks=18"
>> > MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.tasks=9"
>> >
>> > I also tried passing the options directly on the command line:
>> >
>> > mahout matrixmult -Dmapred.map.tasks=18 -Dmapred.reduce.tasks=9
>> > --numRowsA 200 --numColsA 4819 --numRowsB 200 --numColsB 54622
>> > --inputPathA matrixA --inputPathB matrixB
>> >
>> > But no luck with this either.
>> >
>> > My Hadoop mapred-site.xml looks like this:
>> >
>> > <configuration>
>> >   <property>
>> >     <name>mapred.job.tracker</name>
>> >     <value>serverX:54311</value>
>> >     <final>true</final>
>> >   </property>
>> >   <property>
>> >     <name>mapred.child.ulimit</name>
>> >     <value>unlimited</value>
>> >   </property>
>> >   <property>
>> >     <name>mapred.tasktracker.map.tasks.maximum</name>
>> >     <value>2</value>
>> >     <final>true</final>
>> >   </property>
>> >   <property>
>> >     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>> >     <value>2</value>
>> >     <final>true</final>
>> >   </property>
>> >   <property>
>> >     <name>mapred.child.java.opts</name>
>> >     <value>-Xmx2000m</value>
>> >   </property>
>> > </configuration>
>> >
>> > Am I missing something on the configuration?
>> >
>> > Right now, with 1 mapper, it is taking 4 minutes on average to
>> > advance 1% of the map task.
>> >
>> > Thank you,
>> > Rafael Alfaro
>> >
>>
>
>
>
> --
> Alan Gardner
> Solutions Architect - CTO Office
>
> [email protected] | LinkedIn:
> http://www.linkedin.com/profile/view?id=65508699 |
> @alanctgardner<https://twitter.com/alanctgardner>
> Tel: +1 613 565 8696 x1218
> Mobile: +1 613 897 5655
>