Thanks. Let me post a new thread on both lists with details.

Regards,
Mahmood
On Tuesday, March 11, 2014 10:24 PM, Andrew Musselman <[email protected]> wrote:

Mahmood, just an observation and reminder: Suneel's not the only one on the list. We're all here to help.

This may be a question for a Hadoop list, unless I'm misunderstanding. When you say "resume", what do you mean?

On Tue, Mar 11, 2014 at 11:46 AM, Mahmood Naderan <[email protected]> wrote:

> Suneel,
> One more thing.... Right now it has created 500 chunks, so 32GB out of
> 48GB (the original size of the XML file) has been processed. Is it
> possible to resume that?
>
> Regards,
> Mahmood
>
>
> On Tuesday, March 11, 2014 9:47 PM, Mahmood Naderan <[email protected]> wrote:
>
> Suneel,
> Is it possible to create some kind of parallelism in the process of
> creating the chunks, in order to divide the work into smaller pieces?
>
> Let me explain it this way. Assume one thread needs 20GB of heap and my
> system cannot afford that, so I divide the work among 10 threads, each
> needing 2GB. If my system supports 10GB of heap, then I feed 5 threads at
> a time. When the first 5 threads are done (the chunks), I feed the next 5
> threads, and so on.
>
> Regards,
> Mahmood
>
>
> On Monday, March 10, 2014 9:42 PM, Mahmood Naderan <[email protected]> wrote:
>
> UPDATE:
> I split another 5.4GB XML file with 4GB of RAM and -Xmx128m, and it took
> 5 minutes.
>
> Regards,
> Mahmood
>
>
> On Monday, March 10, 2014 7:16 PM, Mahmood Naderan <[email protected]> wrote:
>
> The extracted size is about 960MB (enwiki-latest-pages-articles10.xml).
> With 4GB of RAM set for the OS and -Xmx128m for Hadoop, it took 77
> seconds to create 64MB chunks. I was able to see 15 chunks with
> "hadoop dfs -ls".
>
> P.S: Whenever I modify the -Xmx value in mapred-site.xml, I run
> $HADOOP/sbin/stop-all.sh && $HADOOP/sbin/start-all.sh
> Is that necessary?
>
> Regards,
> Mahmood
>
>
> On Monday, March 10, 2014 5:30 PM, Suneel Marthi <[email protected]> wrote:
>
> Morning Mahmood,
>
> Please first try running this on a smaller dataset like
> 'enwiki-latest-pages-articles10.xml' as opposed to running on the entire
> English wikipedia.
>
>
> On Monday, March 10, 2014 2:59 AM, Mahmood Naderan <[email protected]> wrote:
>
> Thanks for the update. The thing is, when that command is running, I run
> 'top' in another terminal and see that the java process takes less than
> 1GB of memory. As another test, I increased the memory to 48GB (since I
> am working with VirtualBox) and set the heap size to -Xmx45000m. Still I
> get the heap error.
>
> I expect there should be a more meaningful error message saying *who*
> needs more heap: Hadoop, Mahout, Java, ...?
>
> Regards,
> Mahmood
>
>
> On Monday, March 10, 2014 1:31 AM, Suneel Marthi <[email protected]> wrote:
>
> Mahmood,
>
> Firstly, thanks for starting this email thread and for highlighting the
> issues with the wikipedia example. Since you raised this issue, I updated
> the new wikipedia examples page at
> http://mahout.apache.org/users/classification/wikipedia-bayes-example.html
> and also responded to a similar question on StackOverflow at
> http://stackoverflow.com/questions/19505422/mahout-error-when-try-out-wikipedia-examples/22286839#22286839
>
> I am assuming that you are running this locally on your machine and are
> just trying out the examples. Try out Sebastian's suggestion, or else try
> running the example on a much smaller dataset of wikipedia articles.
>
> Lastly, we do realize that you have been struggling with this for about
> 3 days now. Mahout presently lacks an entry for 'wikipediaXmlSplitter' in
> driver.classes.default.props. Not sure at what point in time and in which
> release that happened. Please file a JIRA for this and submit a patch.
>
>
> On Sunday, March 9, 2014 2:25 PM, Mahmood Naderan <[email protected]> wrote:
>
> Hi Suneel,
> Do you have any idea? Searching the web shows many questions regarding
> the heap size for wikipediaXMLSplitter. I have increased the memory size
> to 16GB and still get that error. I have to say that, using the 'top'
> command, I see only 1GB of memory in use, so I wonder why it reports such
> an error. Is this a problem with Java, Mahout, Hadoop, ...?
>
> Regards,
> Mahmood
>
>
> On Sunday, March 9, 2014 4:00 PM, Mahmood Naderan <[email protected]> wrote:
>
> Excuse me, I added the -Xmx option and restarted the Hadoop services using
> sbin/stop-all.sh && sbin/start-all.sh
> but I still get the heap size error. How can I find the correct heap size
> that is needed?
>
> Regards,
> Mahmood
>
>
> On Sunday, March 9, 2014 1:37 PM, Mahmood Naderan <[email protected]> wrote:
>
> OK, I found that I have to add this property to mapred-site.xml:
>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx2048m</value>
> </property>
>
> Regards,
> Mahmood
>
>
> On Sunday, March 9, 2014 11:39 AM, Mahmood Naderan <[email protected]> wrote:
>
> Hello,
> I ran this command
>
> ./bin/mahout wikipediaXMLSplitter -d examples/temp/enwiki-latest-pages-articles.xml -o wikipedia/chunks -c 64
>
> but got this error:
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>
> There are many web pages about this, and the suggested solution is to add
> "-Xmx2048M", for example. My question is, that option should be passed to
> the java command and not to Mahout; as a result, running
> "./bin/mahout -Xmx 2048M" shows that there is no such option. What should
> I do?
>
> Regards,
> Mahmood
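A minimal sketch of one way to hand a larger heap to the splitter itself rather than to the MapReduce tasks, assuming the 0.x-era bin/mahout script (which reads MAHOUT_HEAPSIZE, in MB) and a standard Hadoop client setup where HADOOP_CLIENT_OPTS sets the heap of the JVM that "hadoop jar" launches; the 2048 values are placeholders:

# -Xmx is a JVM flag, so it cannot be given to bin/mahout as a CLI option;
# instead, raise the heap of the client JVM through the environment.
export MAHOUT_HEAPSIZE=2048              # read by bin/mahout when it launches java directly
export HADOOP_CLIENT_OPTS="-Xmx2048m"    # read when bin/mahout delegates to "hadoop jar"

./bin/mahout wikipediaXMLSplitter \
  -d examples/temp/enwiki-latest-pages-articles.xml \
  -o wikipedia/chunks -c 64

Because the splitter appears to run entirely in that client JVM (it is not submitted as a MapReduce job), mapred.child.java.opts in mapred-site.xml only changes the map/reduce task JVMs, which would explain why editing it and restarting the daemons had no visible effect.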
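As for the JIRA Suneel asks for, the missing line in driver.classes.default.props would presumably follow the format of the existing entries ("fully.qualified.ClassName = shortName : description"); the class name below is an assumption about where the splitter lives in the source tree:

org.apache.mahout.text.wikipedia.WikipediaXmlSplitter = wikipediaXMLSplitter : wikipedia XML splitter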
