I also see the parsing step (running ParseSegment on a new segment) take
forever. All documents are parsed very quickly; it's the mapred stage that
goes extremely slowly and eventually stalls. Log:
2010-07-30 16:43:36,035 INFO  mapred.JobClient -  map 100% reduce 69%
2010-07-30 16:43:38,676 INFO  mapred.LocalJobRunner - reduce > reduce
2010-07-30 16:43:41,678 INFO  mapred.LocalJobRunner - reduce > reduce
2010-07-30 16:43:44,680 INFO  mapred.LocalJobRunner - reduce > reduce
2010-07-30 16:43:47,681 INFO  mapred.LocalJobRunner - reduce > reduce
2010-07-30 16:43:50,683 INFO  mapred.LocalJobRunner - reduce > reduce
2010-07-30 16:43:53,685 INFO  mapred.LocalJobRunner - reduce > reduce
2010-07-30 16:43:56,687 INFO  mapred.LocalJobRunner - reduce > reduce

I'm using one machine with the default Hadoop and mapred configuration.  Is
there any config setting I can change to prevent this?
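For reference, the `mapred.LocalJobRunner` lines above mean the job is running in Hadoop's local, single-JVM mode (the default), where the whole reduce runs in one thread. A minimal sketch of switching to pseudo-distributed mode on one machine, assuming Hadoop 0.20-era property names (check the names against your Hadoop version):

```xml
<!-- mapred-site.xml: replace the default LocalJobRunner with a real
     JobTracker on this machine (pseudo-distributed mode). -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <!-- More than one reduce task, so a single slow reducer
         does not stall the whole job. -->
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
</configuration>
```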
thanks,
-aj


On Fri, Jul 23, 2010 at 3:51 PM, brad <[email protected]> wrote:

> I'm continuing to have performance problems with parsing.  I ran the fetch
> process with -noParsing and got great performance.  If I do the same
> process
> with parsing left in, the fetching seems to be going great, but as the
> process continues to run, everything slows down to almost a dead stop.
>
> When I check the thread stack, I find that 1062 threads are blocked:
> java.lang.Thread.State: BLOCKED (on object monitor)
>        at
> sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>
> Apparently this is a known issue with Java, and a couple of articles have
> been written about it:
>
> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
> http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
>
> There is also a note in the Java bug database about scaling issues with
> this class:
> Please also note that the current implementation of
> sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock and is
> called very often (e.g. by new String(byte[] data,String encoding)). This
> JVM-wide lock means that Java applications do not scale beyond 4 CPU cores.
>
> Looking at my stack at this particular point in time, I noted that the
> BLOCKED calls to charsetForName were generated by:
>
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)  378
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)  61
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)  19
> at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)  238
> at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)  133
> at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270)  8
> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)  47
> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247)  19
> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227)  2
> at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104)  7
> at org.apache.hadoop.io.Text$1.initialValue(Text.java:54)  88
> at org.apache.hadoop.io.Text.decode(Text.java:344)  2
> at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161)  12
> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192)  13
> at org.apache.pdfbox.cos.COSString.getString(COSString.java:245)  3
>
> Is this an issue that only I'm facing?  Is it worth looking at the
> alternatives discussed in the articles, or should I just limit the number
> of threads that are run?  Right now it seems like the blocking is causing
> problems unrelated to the general design and behavior of Nutch.
>
> Thoughts??
>
> Thanks
> Brad
>
>
>
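One common workaround for the lock Brad describes, sketched here as a hypothetical helper (this is not code from Nutch or Tika): resolve each encoding name to a `Charset` once, cache it, and decode with `new String(byte[], Charset)`, which skips the `charsetForName` lookup that `new String(byte[], String)` performs on every call.

```java
import java.nio.charset.Charset;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: hit the JVM-wide lock in
// sun.nio.cs.FastCharsetProvider.charsetForName at most once per
// distinct encoding name, instead of once per decoded document.
class CharsetCache {
    private static final ConcurrentHashMap<String, Charset> CACHE =
            new ConcurrentHashMap<String, Charset>();

    static Charset forName(String name) {
        Charset cs = CACHE.get(name);
        if (cs == null) {
            cs = Charset.forName(name);   // the only locking call
            CACHE.putIfAbsent(name, cs);
            cs = CACHE.get(name);         // keep one canonical instance
        }
        return cs;
    }

    // new String(byte[], Charset) bypasses the name lookup entirely,
    // unlike new String(byte[], String encoding).
    static String decode(byte[] data, String encoding) {
        return new String(data, forName(encoding));
    }
}
```

Under heavy multithreaded parsing, the cached lookup turns the contended global lock into a lock-free map read after the first resolution of each encoding name.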


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
