Thanks Ashish,

So according to the link, if one uses CompositeInputFormat then each mapper takes an entire file as its input, without considering InputSplits or block size. If I understand it correctly, it suggests breaking the input up: [Original Input File] -> [file1, file2, ...].
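To make the difference concrete, here is a minimal sketch of how the number of map tasks would be estimated under the two behaviours discussed in this thread. It is plain arithmetic, not Hadoop code: the class and method names are hypothetical, and the size-based estimate is simplified (real Hadoop also tolerates a slightly oversized last split). The figures are the ones from this thread: a 104MB file with 64MB and 10MB split sizes.

```java
public class SplitCountSketch {

    // Size-based splitting: roughly one split per chunk of maxSplitBytes.
    // Simplified assumption: plain ceiling division, no Hadoop-specific slack.
    static long sizeBasedSplits(long fileBytes, long maxSplitBytes) {
        return (fileBytes + maxSplitBytes - 1) / maxSplitBytes;
    }

    // Whole-file behaviour as described for CompositeInputFormat in this thread:
    // one split, and hence one mapper, per input file, whatever its size.
    static long wholeFileSplits(int numberOfFiles) {
        return numberOfFiles;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("64MB splits:  " + sizeBasedSplits(104 * mb, 64 * mb));
        System.out.println("10MB splits:  " + sizeBasedSplits(104 * mb, 10 * mb));
        System.out.println("one big file: " + wholeFileSplits(1));
        System.out.println("11 part files: " + wholeFileSplits(11));
    }
}
```

Under the whole-file behaviour the split size setting makes no difference; only the number of input files does.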
So if my file is [/test/MatrixA] --> [/test/smallfiles/file1, /test/smallfiles/file2, /test/smallfiles/file3, ...], will the input path in MatrixMultiplicationJob now be the directory path /test/smallfiles? Will breaking up the file in this manner cause problems in the algorithmic execution of the MR job? I'm not sure the output will be correct.

-----Original Message-----
From: Ashish [mailto:[email protected]]
Sent: Wednesday, January 16, 2013 5:44 PM
To: [email protected]
Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?

MatrixMultiplicationJob internally sets InputFormat as CompositeInputFormat

    JobConf conf = new JobConf(initialConf, MatrixMultiplicationJob.class);
    conf.setInputFormat(CompositeInputFormat.class);

and AFAIK, CompositeInputFormat ignores the splits. See this
http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join

Unfortunately, I don't know any other alternative as of now.

On Wed, Jan 16, 2013 at 5:05 PM, Stuti Awasthi <[email protected]> wrote:

> The issue is that currently my matrix is of dimension (100x100k).
> Later it can be (1MX10M) or bigger.
>
> Even now my job is running with a single mapper for (100x100k)
> and it is not able to complete the job. As I mentioned, the map task just
> proceeds to 0.99% and starts spilling the map output. Hence I wanted
> to tune my job so that Mahout is able to complete the job and I can
> utilize my cluster resources.
>
> As MatrixMultiplicationJob is an MR job, it should be able to handle
> parallel map tasks. I am not sure if there are any algorithmic
> constraints due to which it runs only with a single mapper?
> I have taken the reference of this thread so that I can set the Configuration
> myself rather than getting it with getConf(), but did not have any success:
>
> http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers-and-Reducers-in-DistributedRowMatrix-Jobs-td888980.html
>
> Stuti
>
> -----Original Message-----
> From: Sean Owen [mailto:[email protected]]
> Sent: Wednesday, January 16, 2013 4:46 PM
> To: Mahout User List
> Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
>
> Why do you need multiple mappers? Is one too slow? Many are not
> necessarily faster for small input
>
> On Jan 16, 2013 10:46 AM, "Stuti Awasthi" <[email protected]> wrote:
>
> > Hi,
> > I tried to call it programmatically also but am facing the same issue: only
> > a single MapTask is running, and that too is spilling the map output continuously.
> > Hence I'm not able to generate the output for large matrix multiplication.
> >
> > Code Snippet :
> >
> >     DistributedRowMatrix a = new DistributedRowMatrix(
> >         new Path("/test/points/matrixA"), new Path("/test/temp"),
> >         Integer.parseInt("100"), Integer.parseInt("100000"));
> >     DistributedRowMatrix b = new DistributedRowMatrix(
> >         new Path("/test/points/matrixA"), new Path("tempDir"),
> >         Integer.parseInt("100"), Integer.parseInt("100000"));
> >     Configuration conf = new Configuration();
> >     conf.set("fs.default.name", "hdfs://DS-1078D24B4736:10818");
> >     conf.set("mapred.child.java.opts", "-Xmx2048m");
> >     conf.set("mapred.max.split.size", "10485760");
> >     a.setConf(conf);
> >     b.setConf(conf);
> >     a.times(b);
> >
> > Where am I going wrong? Any idea?
> >
> > Thanks
> > Stuti
> >
> > -----Original Message-----
> > From: Stuti Awasthi
> > Sent: Wednesday, January 16, 2013 2:55 PM
> > To: Mahout User List
> > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
> >
> > Hey Sean,
> > Thanks for the response.
> > MatrixMultiplicationJob help shows the usage like:
> >
> >     usage: <command> [Generic Options] [Job-Specific Options]
> >
> > Here Generic Options can be provided by -D <property=value>. Hence I
> > tried command-line -D options, but it seems that they are not
> > having any effect. This is also suggested in:
> >
> > https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/common/AbstractJob.html
> >
> > I noted one thing after your suggestion: currently I'm
> > passing arguments like -D<property=value> rather than -D
> > <property=value>. I tried with a space between -D and property=value
> > also, but then it gives an error like:
> >
> >     13/01/16 14:21:47 ERROR common.AbstractJob: Unexpected
> >     /test/points/matrixA while processing Job-Specific Options:
> >
> > No such error comes if I'm passing the arguments without a space after -D.
> >
> > From Hadoop: The Definitive Guide: "Do not confuse setting
> > Hadoop properties using the -D property=value option to
> > GenericOptionsParser (and ToolRunner) with setting JVM system
> > properties using the -Dproperty=value option to the java command.
> > The syntax for JVM system properties does not allow any whitespace
> > between the D and the property name, whereas GenericOptionsParser
> > requires them to be separated by whitespace."
> >
> > Hence I suppose that GenericOptions should be parsed as -D
> > property=value rather than -Dproperty=value.
> >
> > Additionally I tried -Dmapred.max.split.size=10485760 also through the
> > command line, but again only a single MapTask started.
> >
> > Please suggest.
> >
> > -----Original Message-----
> > From: Sean Owen [mailto:[email protected]]
> > Sent: Wednesday, January 16, 2013 1:23 PM
> > To: Mahout User List
> > Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
> >
> > It's up to Hadoop in the end.
> >
> > Try calling FileInputFormat.setMaxInputSplitSize() with a smallish
> > value, like your 10MB (10000000).
> >
> > I don't know if Hadoop params can be set as sys properties like that
> > anyway?
> >
> > On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi <[email protected]> wrote:
> > > Hi,
> > >
> > > I am trying to multiply dense matrices of size [100 x 100k]. The size of
> > > the file is 104MB, and with the default block size of 64MB only 2 blocks
> > > are getting created.
> > > So I reduced the block size to 10MB and now my file is divided into 11
> > > blocks across the cluster. Cluster size is 10 nodes with 1 NN/JT and
> > > 9 DN/TT.
> > >
> > > Every time I run Mahout MatrixMultiplicationJob through the command line,
> > > I can see on the JobTracker WebUI that only 1 map task is launched.
> > > According to my understanding of InputSplit, there should be 11 map
> > > tasks launched.
> > > Apart from this, the map task stays at 0.99% completion, and in the task
> > > logs I can see that the map task is spilling the map output.
> > >
> > > Mahout Command:
> > >
> > > mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M
> > > -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100 -Dio.sort.mb=200
> > > -Dio.file.buffer.size=131072 --inputPathA /test/matrixA --numRowsA 100
> > > --numColsA 100000 --inputPathB /test/matrixA --numRowsB 100
> > > --numColsB 100000 --tempDir /test/temp
> > >
> > > Now I want to know why only 1 map task is launched every time,
> > > and how I can performance tune the cluster so that I can perform
> > > dense matrix multiplication of the order [90K x 1 Million].
> > >
> > > Thanks
> > > Stuti
--
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
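On the question raised at the top of this thread (whether breaking the input into several smaller files changes the result), a small self-contained sketch may help. It assumes that the job forms the product A' * B by pairing row i of A with row i of B and summing the resulting outer products, which is my understanding of DistributedRowMatrix.times() but is not confirmed anywhere in this thread. The class and method names are hypothetical and the matrices are toy data. Under that assumption, the sum over all rows equals the sum of the partial sums over any partition of the rows, provided both inputs are partitioned at the same row boundaries.

```java
import java.util.Arrays;

public class RowPartitionSketch {

    // Sums the outer products of matching rows of a and b for rows [from, to).
    // Over all rows this equals transpose(a) * b.
    static double[][] partialProduct(double[][] a, double[][] b, int from, int to) {
        double[][] acc = new double[a[0].length][b[0].length];
        for (int r = from; r < to; r++) {
            for (int i = 0; i < a[r].length; i++) {
                for (int j = 0; j < b[r].length; j++) {
                    acc[i][j] += a[r][i] * b[r][j];
                }
            }
        }
        return acc;
    }

    // Element-wise sum of two matrices of equal shape (the "reduce" step).
    static double[][] add(double[][] x, double[][] y) {
        double[][] sum = new double[x.length][x[0].length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < x[i].length; j++) {
                sum[i][j] = x[i][j] + y[i][j];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}, {5, 6}};
        double[][] b = {{1, 0}, {0, 1}, {2, 2}};

        // One "file" holding all three rows.
        double[][] whole = partialProduct(a, b, 0, 3);

        // Two "files": row 0 in the first, rows 1 and 2 in the second.
        double[][] pieces = add(partialProduct(a, b, 0, 1), partialProduct(a, b, 1, 3));

        System.out.println(Arrays.deepToString(whole));
        System.out.println("same result: " + Arrays.deepEquals(whole, pieces));
    }
}
```

If the assumption holds, the practical constraint is on how the files are produced rather than on the algorithm: the join needs row i of A and row i of B to land in corresponding part files, so both matrices would have to be split at the same row boundaries and kept in the same sort order.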
