Hi Ken,

Thanks for the info. I'm using Nutch 1.1, so I believe it is Tika 0.7? The jar files in my Nutch path are tika-core-0.7.jar and tika-parsers-0.7.jar. Is there a way to find out if it is actually pulling in something different when executing?
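One way to check which jar a class really comes from at run time is to ask the JVM for the class's code source. The following is a minimal sketch, not something posted in this thread: the class name WhichJar and the helper locationOf are made up for illustration, and the default class name is simply one that appears in the stack trace quoted further down. It would need to be run with the same -classpath that the Nutch command line uses.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical helper (not part of Nutch or Tika): reports where a class was loaded from.
final class WhichJar {

    private WhichJar() {}

    /** Returns the jar or directory a class was loaded from, or null if the JVM does not expose one. */
    static URL locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader (e.g. java.lang.String) have no code source.
        return src == null ? null : src.getLocation();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Assumed default: a Tika class named in the quoted stack trace.
        String name = args.length > 0 ? args[0] : "org.apache.tika.parser.html.HtmlParser";
        Class<?> cls = Class.forName(name);
        System.out.println(name + " loaded from " + locationOf(cls));
    }
}
```

Starting the JVM with the standard -verbose:class option is another route: it logs each class as it is loaded, together with the location it came from.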
The output of ps -ef | grep nutch includes
-classpath ...:/usr/local/nutch/lib/tika-core-0.7.jar:...
in the Nutch execution command line.

My server does have
/usr/local/solr/contrib/extraction/lib/tika-core-0.4.jar
/usr/local/solr/contrib/extraction/lib/tika-parsers-0.4.jar

But they are not in the classpath, nor are they in $PATH, so I doubt they are being picked up. Is there someplace else I should be looking?

Thanks
Brad

-----Original Message-----
From: Ken Krugler [mailto:[email protected]]
Sent: Friday, July 23, 2010 6:38 PM
To: [email protected]
Subject: Re: Parsing Performance - related to Java concurrency issue

Hi Brad,

Thanks for the nice write-up, and the refs.

I'll look into using a simple cache in Tika to avoid this type of blocking.

Feel free to comment on https://issues.apache.org/jira/browse/TIKA-471

Note that the Tika code base has changed from what it appears that you're using (e.g. the switch from Neko to TagSoup happened quite a while ago).

-- Ken

On Jul 23, 2010, at 3:51pm, brad wrote:

> I'm continuing to have performance problems with parsing. I ran the
> fetch process with -noParsing and got great performance. If I do the
> same process with parsing left in, the fetching seems to be going
> great, but as the process continues to run, everything slows down to
> almost a dead stop.
>
> When I check the thread stack, I find that 1062 threads are blocked:
> java.lang.Thread.State: BLOCKED (on object monitor)
> at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>
> Apparently this is a known issue with Java, and a couple articles are
> written about it:
> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
> http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
>
> There is also a note in java bug database about scaling issues with
> the class...
> Please also note that the current implementation of
> sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock
> and is called very often (e.g. by new String(byte[] data,String
> encoding)). This JVM-wide lock means that Java applications do not
> scale beyond 4 CPU cores.
>
> I noted that, in the case of my stack at this particular point in
> time, the BLOCKED calls to charsetForName were generated by:
>
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84) 378
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99) 61
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133) 19
> at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86) 238
> at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310) 133
> at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270) 8
> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253) 47
> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247) 19
> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227) 2
> at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104) 7
> at org.apache.hadoop.io.Text$1.initialValue(Text.java:54) 88
> at org.apache.hadoop.io.Text.decode(Text.java:344) 2
> at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161) 12
> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192) 13
> at org.apache.pdfbox.cos.COSString.getString(COSString.java:245) 3
>
> Is this an issue that only I'm facing? Is it worth looking at
> alternatives as talked about in the articles? Or, just limit the
> number of threads that are run? Right now it seems like the block is
> causing problems unrelated to the general design and behavior of Nutch.
>
> Thoughts??
>
> Thanks
> Brad

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
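
For readers following the thread, the "simple cache" idea mentioned above could look roughly like the sketch below. This is only an illustration under stated assumptions, not the change actually made for TIKA-471: the class name CharsetCache and its forName method are hypothetical. It keeps successfully resolved charsets in a ConcurrentHashMap so that repeated lookups of the same name are served by a map read instead of going through Charset.forName every time.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper (not Tika's or Nutch's code): caches Charset lookups by name.
final class CharsetCache {

    // Keyed by lower-cased name so "UTF-8" and "utf-8" share one entry.
    private static final ConcurrentMap<String, Charset> CACHE = new ConcurrentHashMap<>();

    private CharsetCache() {}

    /**
     * Returns the charset for the given name, consulting the cache first.
     * Unknown or illegal names still throw, exactly as Charset.forName does,
     * and are never cached.
     */
    static Charset forName(String name) {
        String key = name.toLowerCase(Locale.ROOT);
        Charset cs = CACHE.get(key);      // plain read on the hot path
        if (cs == null) {
            cs = Charset.forName(name);   // slow path, taken once per distinct name
            CACHE.putIfAbsent(key, cs);
        }
        return cs;
    }
}
```

A caller would then pass the Charset object itself, for example new String(bytes, CharsetCache.forName("UTF-8")), rather than the encoding name, so that the String constructor does not have to resolve the name again. Because only names that resolve successfully are stored, the map cannot grow beyond the set of charset names and aliases the JVM supports.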

