Hi Brad,
Thanks for the nice write-up, and the refs.
I'll look into using a simple cache in Tika to avoid this type of
blocking. Feel free to comment on https://issues.apache.org/jira/browse/TIKA-471
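For what it's worth, the kind of cache I have in mind would look roughly like this. This is a minimal sketch, not actual Tika code; the CharsetCache class name is made up, and the idea is simply to memoize Charset lookups so hot paths stop contending on the global lock:

```java
import java.nio.charset.Charset;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (not real Tika code): memoize Charset lookups so
// that hot paths don't repeatedly hit the JVM-wide lock inside
// sun.nio.cs.FastCharsetProvider.charsetForName().
public class CharsetCache {
    private static final ConcurrentHashMap<String, Charset> CACHE =
            new ConcurrentHashMap<String, Charset>();

    public static Charset forName(String name) {
        Charset cs = CACHE.get(name);
        if (cs == null) {
            // May still contend on the global lock, but only once
            // per distinct charset name for the life of the JVM.
            cs = Charset.forName(name);
            CACHE.put(name, cs);
        }
        return cs;
    }
}
```

A real version would also want to handle IllegalCharsetNameException/UnsupportedCharsetException (possibly caching negative results), but the ConcurrentHashMap fast path is the part that avoids the blocking.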
Note that the Tika code base has changed from what you appear to be using (e.g. the switch from Neko to TagSoup happened quite a while ago).
-- Ken
On Jul 23, 2010, at 3:51pm, brad wrote:
I'm continuing to have performance problems with parsing. I ran the fetch process with -noParsing and got great performance. If I do the same process with parsing left in, the fetching seems to be going great, but as the process continues to run, everything slows down to almost a dead stop. When I check the thread stack, I find that 1062 threads are blocked:
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
Apparently this is a known issue with Java, and a couple of articles have been written about it:
http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
There is also a note in the Java bug database about scaling issues with the class:
Please also note that the current implementation of sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock and is called very often (e.g. by new String(byte[] data, String encoding)). This JVM-wide lock means that Java applications do not scale beyond 4 CPU cores.
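One mitigation mentioned in that discussion is to avoid the per-call name lookup entirely by resolving the Charset once and using the Charset overload of the String constructor (available since Java 6). A hedged sketch, with a made-up Utf8Decode class for illustration:

```java
import java.nio.charset.Charset;

// Illustrative class (name is made up): decode bytes without hitting
// the charset name lookup on every call.
public class Utf8Decode {
    // Resolve the Charset once; the name lookup (and the global lock
    // behind it) happens here a single time instead of per decode.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    public static String decode(byte[] data) {
        // new String(byte[], Charset) skips the per-call
        // charsetForName() lookup that new String(byte[], String) does.
        return new String(data, UTF8);
    }
}
```

Whether this helps in practice depends on whether the hot path is under your control; in Nutch/Tika the lookups shown in the stack below happen inside library code.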
Looking at my stack dump at this particular point in time, the BLOCKED calls to charsetForName were generated by:
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84) 378
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99) 61
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133) 19
at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86) 238
at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310) 133
at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270) 8
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253) 47
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247) 19
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227) 2
at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104) 7
at org.apache.hadoop.io.Text$1.initialValue(Text.java:54) 88
at org.apache.hadoop.io.Text.decode(Text.java:344) 2
at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161) 12
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192) 13
at org.apache.pdfbox.cos.COSString.getString(COSString.java:245) 3
Is this an issue that only I'm facing? Is it worth looking at the alternatives discussed in the articles? Or should I just limit the number of threads that are run? Right now it seems like the blocking is causing problems unrelated to the general design and behavior of Nutch.
Thoughts??
Thanks,
Brad
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g