I also see that the parsing step (running ParseSegment on a new segment) takes forever. All documents are parsed very quickly; it's the mapred stage that goes extremely slow and eventually stalls. Log:

2010-07-30 16:43:36,035 INFO mapred.JobClient - map 100% reduce 69%
2010-07-30 16:43:38,676 INFO mapred.LocalJobRunner - reduce > reduce
2010-07-30 16:43:41,678 INFO mapred.LocalJobRunner - reduce > reduce
2010-07-30 16:43:44,680 INFO mapred.LocalJobRunner - reduce > reduce
2010-07-30 16:43:47,681 INFO mapred.LocalJobRunner - reduce > reduce
2010-07-30 16:43:50,683 INFO mapred.LocalJobRunner - reduce > reduce
2010-07-30 16:43:53,685 INFO mapred.LocalJobRunner - reduce > reduce
2010-07-30 16:43:56,687 INFO mapred.LocalJobRunner - reduce > reduce
I'm using one machine with the default hadoop and mapred configuration. Is there any config I can change to prevent this?

thanks,
-aj

On Fri, Jul 23, 2010 at 3:51 PM, brad <[email protected]> wrote:
> I'm continuing to have performance problems with parsing. I ran the fetch
> process with -noParsing and got great performance. If I run the same process
> with parsing left in, the fetching seems to be going great, but as the
> process continues to run, everything slows down to almost a dead stop.
>
> When I check the thread stack, I find that 1062 threads are blocked:
> java.lang.Thread.State: BLOCKED (on object monitor)
> at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>
> Apparently this is a known issue with Java, and a couple of articles have
> been written about it:
>
> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
> http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
>
> There is also a note in the Java bug database about scaling issues with the
> class:
> "Please also note that the current implementation of
> sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock and is
> called very often (e.g. by new String(byte[] data, String encoding)). This
> JVM-wide lock means that Java applications do not scale beyond 4 CPU cores."
>
> I noted the following in my stack at this particular point in time.
> The BLOCKED calls to charsetForName were generated by:
>
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)  378
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)  61
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)  19
> at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)  238
> at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)  133
> at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270)  8
> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)  47
> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247)  19
> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227)  2
> at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104)  7
> at org.apache.hadoop.io.Text$1.initialValue(Text.java:54)  88
> at org.apache.hadoop.io.Text.decode(Text.java:344)  2
> at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161)  12
> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192)  13
> at org.apache.pdfbox.cos.COSString.getString(COSString.java:245)  3
>
> Is this an issue that only I'm facing? Is it worth looking at the alternatives
> discussed in those articles? Or should I just limit the number of threads that
> are run? Right now it seems the blocking is causing problems unrelated to the
> general design and behavior of Nutch.
>
> Thoughts?
>
> Thanks,
> Brad

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
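[Editor's note] The workaround the linked articles point toward is to resolve the Charset object once and reuse it, since new String(byte[], Charset) bypasses the per-call charsetForName lookup (and, on the affected JVMs, its JVM-wide lock) that new String(byte[], String) triggers. A minimal sketch of the pattern, assuming Java 6+ (the class and method names here are illustrative, not from Nutch or Tika):

```java
import java.nio.charset.Charset;

public class CharsetCacheDemo {
    // Resolve the Charset once and keep the object around. Decoding with
    // new String(bytes, Charset) skips the name lookup entirely, so the
    // charsetForName lock is never touched on the hot path.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Lock-free hot path: no charset-name lookup per call.
    static String decodeCached(byte[] data) {
        return new String(data, UTF8);
    }

    // Name-based path: goes through charsetForName on every call,
    // which is where the BLOCKED threads in the trace above pile up.
    static String decodeByName(byte[] data)
            throws java.io.UnsupportedEncodingException {
        return new String(data, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello".getBytes(UTF8);
        System.out.println(decodeCached(data)); // prints: hello
        System.out.println(decodeByName(data)); // prints: hello
    }
}
```

Both methods produce the same result; only the cached variant avoids the shared lookup. In the Nutch/Tika case the name-based calls come from library code, so this is a fix one would have to apply (or patch) inside those libraries rather than in the crawl configuration.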

