We used the Wikipedia splitter as a benchmark for our simulation on Hadoop 0.2. I am now trying to run it on the latest Hadoop to stay up to date and check for differences. For now, I have no other choice.

Regards,
Mahmood
On Thursday, March 13, 2014 10:12 PM, Andrew Musselman <[email protected]> wrote:

What's your larger goal here; are you putting Hadoop and Mahout through their paces as an exercise?

If your process is blowing through data quickly up to a certain point, there may be something happening with a common value, which is a "data bug". I don't know what this wikipedia splitter class does, but if you're interested in isolating the issue you could find out what is happening data-wise and see if there is some very large grouping on a pathologically frequent key, for instance.

On Thu, Mar 13, 2014 at 11:31 AM, Mahmood Naderan <[email protected]> wrote:

> I am pretty sure that there is something wrong with Hadoop/Mahout/Java.
> With any configuration, it gets stuck at chunk #571. Previous chunks are
> created rapidly, but it waits for about 30 minutes on 571, and that is
> the reason for the heap size error.
>
> I will try to submit a bug report.
>
> Regards,
> Mahmood
>
>
> On Thursday, March 13, 2014 2:31 PM, Mahmood Naderan <[email protected]>
> wrote:
>
> The strange thing is that whether I use -Xmx128m or -Xmx16384m, the process
> stops at chunk #571 (571*64 = 36.5GB).
> I still haven't figured out whether this is a problem with the JVM, Hadoop,
> or Mahout.
>
> I have tested various parameters on 16GB of RAM:
>
> <property>
>   <name>mapred.map.child.java.opts</name>
>   <value>-Xmx2048m</value>
> </property>
> <property>
>   <name>mapred.reduce.child.java.opts</name>
>   <value>-Xmx4096m</value>
> </property>
>
> Is there a relation between these parameters and the amount of available
> memory?
> I also see a HADOOP_HEAPSIZE setting in hadoop-env.sh which is commented
> out by default. What is that?
>
> Regards,
> Mahmood
>
>
> On Tuesday, March 11, 2014 11:57 PM, Mahmood Naderan <[email protected]>
> wrote:
>
> As I posted earlier, here is the result of a successful test:
>
> A 5.4GB XML file (which is larger than enwiki-latest-pages-articles10.xml)
> with 4GB of RAM and -Xmx128m took 5 minutes to complete.
>
> I didn't find a larger Wikipedia XML file. I need to test 10GB, 20GB, and
> 30GB files.
>
> Regards,
> Mahmood
>
>
> On Tuesday, March 11, 2014 11:41 PM, Andrew Musselman
> <[email protected]> wrote:
>
> Can you please try running this on a smaller file first, per Suneel's
> comment a while back:
>
> "Please first try running this on a smaller dataset like
> 'enwiki-latest-pages-articles10.xml' as opposed to running on the entire
> English Wikipedia."
>
>
> On Tue, Mar 11, 2014 at 12:56 PM, Mahmood Naderan <[email protected]> wrote:
>
> > Hi,
> > Recently I have faced a heap size error when I run
> >
> > $MAHOUT_HOME/bin/mahout wikipediaXMLSplitter -d
> > $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml -o
> > wikipedia/chunks -c 64
> >
> > Here are the specs:
> > 1- XML file size = 44GB
> > 2- System memory = 54GB (on VirtualBox)
> > 3- Heap size = 51GB (-Xmx51000m)
> >
> > At the time of failure, I see that 571 chunks have been created
> > (hadoop dfs -ls), so 36GB of the original file has been processed.
> > Now here are my questions:
> >
> > 1- Is there any way to resume the process? As stated before, 571 chunks
> > have been created, so by resuming it could create the rest of the chunks
> > (572~).
> >
> > 2- Is it possible to parallelize the process? Assume 100GB of heap is
> > required to process the XML file and my system cannot afford that. Then
> > we can create 20 threads, each requiring 5GB of heap.
> > Next, by feeding the first 10 threads we can use the available 50GB of
> > heap, and after they complete, we can feed the next set of threads.
> >
> > Regards,
> > Mahmood
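
For reference on the HADOOP_HEAPSIZE question: that setting in hadoop-env.sh sets the maximum heap, in MB, for the Hadoop daemons started by the launcher scripts (NameNode, DataNode, JobTracker, TaskTracker). It does not change the heap of the per-task child JVMs, which is what mapred.map.child.java.opts and mapred.reduce.child.java.opts control. A minimal sketch, assuming a Hadoop 1.x-style hadoop-env.sh; the 2000 MB value is only an example:

    # hadoop-env.sh: heap (in MB) for the Hadoop daemons themselves.
    # This does not affect the map/reduce child JVMs, which take their
    # -Xmx from the mapred.*.child.java.opts properties instead.
    export HADOOP_HEAPSIZE=2000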
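
On the "data bug" suggestion: each chunk is 64MB, so chunk #571 corresponds roughly to the 36.5GB mark of the input. The sketch below is one rough way to peek at that region of a local, uncompressed copy of the dump and see whether it contains an unusually large <page> element; the byte offset is only a back-of-the-envelope estimate, since the splitter counts parsed pages rather than raw bytes:

    # Read 64MB starting near where chunk #571 would begin (570 * 64MB = 36480MB)
    # and count how many pages start in that window; a very small count would
    # point to one pathologically large page around the stall point.
    dd if=enwiki-latest-pages-articles.xml bs=1M skip=36480 count=64 2>/dev/null | grep -c '<page>'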
