I'm continuing to have performance problems with parsing.  I ran the fetch
process with -noParsing and got great performance.  If I do the same process
with parsing left in, the fetching seems to be going great, but as the
process continues to run, everything slows down to almost a dead stop.

When I check the thread stack, I find that 1062 threads are blocked:
java.lang.Thread.State: BLOCKED (on object monitor)
        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)

Apparently this is a known issue with Java, and a couple of articles have
been written about it:
http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html

There is also a note in the Java bug database about scaling issues with
the class:

"Please also note that the current implementation of
sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock and is
called very often (e.g. by new String(byte[] data, String encoding)). This
JVM-wide lock means that Java applications do not scale beyond 4 CPU cores."
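The articles above suggest working around the lock by resolving each
charset name once and reusing the Charset object, since the Charset-taking
String constructor skips the name lookup entirely. Here is a minimal
sketch of that idea (CharsetCache is a hypothetical helper name, not
anything in Nutch or Tika):

```java
import java.nio.charset.Charset;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: look each charset name up once, then reuse the
// Charset object so hot decode paths never re-enter the JVM-wide lock
// inside sun.nio.cs.FastCharsetProvider.charsetForName().
public final class CharsetCache {
    private static final ConcurrentHashMap<String, Charset> CACHE =
            new ConcurrentHashMap<String, Charset>();

    private CharsetCache() {}

    public static Charset forName(String name) {
        Charset cs = CACHE.get(name);
        if (cs == null) {
            cs = Charset.forName(name);      // takes the lock, first call only
            CACHE.putIfAbsent(name, cs);
        }
        return cs;
    }

    // Decode via the Charset overload (Java 6+) instead of
    // new String(byte[], String), which re-resolves the name every call.
    public static String decode(byte[] data, String encoding) {
        return new String(data, forName(encoding));
    }
}
```

The same trick applies on the encode side with String.getBytes(Charset).
It doesn't remove the lock, but it keeps it off the per-document path.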

Looking at my stack dump at this particular point in time, the BLOCKED
calls to charsetForName were generated by the following frames (the number
after each frame is the count of blocked threads):

at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84) 378
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99) 61
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133) 19
at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86) 238
at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310) 133
at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270) 8
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253) 47
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247) 19
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227) 2
at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104) 7
at org.apache.hadoop.io.Text$1.initialValue(Text.java:54) 88
at org.apache.hadoop.io.Text.decode(Text.java:344) 2
at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161) 12
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192) 13
at org.apache.pdfbox.cos.COSString.getString(COSString.java:245) 3

Is this an issue that only I'm facing?  Is it worth looking at the
alternatives discussed in the articles?  Or should I just limit the number
of threads that are run?  Right now it seems like the blocking is causing
problems unrelated to the general design and behavior of Nutch.

Thoughts??

Thanks
Brad