I'm just running Nutch as delivered. The information about org.apache.tika.parser.html.HtmlParser.getEncoding, etc. is from running jstack on the Nutch process when it slowed down to a crawl...
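[Editor's note] For anyone reproducing this kind of diagnosis: the sketch below summarizes a thread dump by counting threads blocked in charsetForName. The embedded sample dump is illustrative only; with a real process you would first capture the dump with `jstack <nutch-pid> > dump.txt`.

```shell
# Sketch: count threads blocked in FastCharsetProvider.charsetForName
# from a jstack thread dump. The sample dump below is fabricated for
# illustration; capture a real one with: jstack <nutch-pid> > dump.txt
cat > dump.txt <<'EOF'
"Fetcher-1" daemon prio=10 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
	at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
"Fetcher-2" daemon prio=10 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
	at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
"Fetcher-3" daemon prio=10 runnable
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.read(SocketInputStream.java:129)
EOF
# Count how many threads are sitting in the contended frame:
grep -c 'FastCharsetProvider.charsetForName' dump.txt
```

With a dump from a wedged Nutch fetcher, a count close to the total thread count is the signature of the contention described below.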
-----Original Message-----
From: Ken Krugler [mailto:[email protected]]
Sent: Friday, July 23, 2010 8:25 PM
To: [email protected]
Subject: Re: Parsing Performance - related to Java concurrency issue

Hi Brad,

On Jul 23, 2010, at 7:21pm, brad wrote:

> Hi Ken,
> Thanks for the info. I'm using Nutch 1.1, so I believe it is Tika 0.7?
> The jar files in my Nutch path are tika-core-0.7.jar and
> tika-parsers-0.7.jar.
> Is there a way to find out if it is actually pulling something
> different when executing?

Tika switched from Neko to TagSoup on 14/Oct/2009. Tika 0.7 was
released on April 3rd, 2010, so I would expect that if you are using
Tika 0.7, you'd be using TagSoup.

However, I see that the line referencing Neko is this one:

>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:

So it's the Nutch HtmlParser that's using Neko. Curious why you have
both Nutch and Tika HtmlParser refs in your file, e.g. I also see:

>> org.apache.tika.parser.html.HtmlParser.getEncoding

-- Ken

> The ps -ef | grep nutch output includes
> -classpath ...:/usr/local/nutch/lib/tika-core-0.7.jar:...
> in the Nutch execution command line.
>
> My server does have
> /usr/local/solr/contrib/extraction/lib/tika-core-0.4.jar
> /usr/local/solr/contrib/extraction/lib/tika-parsers-0.4.jar
>
> But they are not in the classpath, nor on the $PATH, so I doubt they
> are being picked up. Is there someplace else I should be looking?
>
> Thanks
> Brad
>
> -----Original Message-----
> From: Ken Krugler [mailto:[email protected]]
> Sent: Friday, July 23, 2010 6:38 PM
> To: [email protected]
> Subject: Re: Parsing Performance - related to Java concurrency issue
>
> Hi Brad,
>
> Thanks for the nice write-up, and the refs.
>
> I'll look into using a simple cache in Tika to avoid this type of
> blocking. Feel free to comment on
> https://issues.apache.org/jira/browse/TIKA-471
>
> Note that the Tika code base has changed from what it appears
> you're using (e.g. the switch from Neko to TagSoup happened quite a
> while ago).
>
> -- Ken
>
> On Jul 23, 2010, at 3:51pm, brad wrote:
>
>> I'm continuing to have performance problems with parsing. I ran the
>> fetch process with -noParsing and got great performance. If I do the
>> same process with parsing left in, the fetching seems to be going
>> great, but as the process continues to run, everything slows down to
>> almost a dead stop.
>>
>> When I check the thread stacks, I find that 1062 threads are blocked:
>>
>> java.lang.Thread.State: BLOCKED (on object monitor)
>>   at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>
>> Apparently this is a known issue with Java, and a couple of articles
>> have been written about it:
>> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
>> http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
>>
>> There is also a note in the Java bug database about scaling issues
>> with the class:
>>
>>   Please also note that the current implementation of
>>   sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide
>>   lock and is called very often (e.g. by
>>   new String(byte[] data, String encoding)). This JVM-wide lock
>>   means that Java applications do not scale beyond 4 CPU cores.
>>
>> I noted which frames, in my stacks at this particular point in time,
>> generated the BLOCKED calls to charsetForName:
>>
>>   at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)  378
>>   at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)  61
>>   at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)  19
>>   at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)  238
>>   at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)  133
>>   at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270)  8
>>   at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)  47
>>   at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247)  19
>>   at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227)  2
>>   at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104)  7
>>   at org.apache.hadoop.io.Text$1.initialValue(Text.java:54)  88
>>   at org.apache.hadoop.io.Text.decode(Text.java:344)  2
>>   at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161)  12
>>   at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192)  13
>>   at org.apache.pdfbox.cos.COSString.getString(COSString.java:245)  3
>>
>> Is this an issue that only I'm facing? Is it worth looking at the
>> alternatives talked about in the articles? Or should I just limit the
>> number of threads that are run? Right now it seems like the blocking
>> is causing problems unrelated to the general design and behavior of
>> Nutch.
>>
>> Thoughts??
>>
>> Thanks
>> Brad
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
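[Editor's note] The "simple cache" Ken mentions for TIKA-471 comes down to resolving each charset name through the locked provider once, then reusing the Charset object; decoding via new String(bytes, Charset) never calls sun.nio.cs.FastCharsetProvider.charsetForName(), so the JVM-wide lock is only touched on the first lookup of each name. A minimal sketch of the idea (CharsetCache is a hypothetical helper written for this note, not Tika's actual class, and the exact fix that landed in Tika differs in detail):

```java
import java.nio.charset.Charset;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Sketch of the caching idea discussed for TIKA-471 (hypothetical
 * helper, not Tika's actual class). Charset.forName(String) goes
 * through FastCharsetProvider.charsetForName(), which takes a
 * JVM-wide lock; caching the resolved Charset means that lock is
 * only contended on the first lookup of each name.
 */
public final class CharsetCache {

    private static final ConcurrentMap<String, Charset> CACHE =
            new ConcurrentHashMap<String, Charset>();

    private CharsetCache() {}

    /** Resolve a charset name once, then serve it from the cache. */
    public static Charset forName(String name) {
        Charset cs = CACHE.get(name);
        if (cs == null) {
            cs = Charset.forName(name);           // locked path, first time only
            Charset prev = CACHE.putIfAbsent(name, cs);
            if (prev != null) {
                cs = prev;                        // another thread won the race
            }
        }
        return cs;
    }

    /**
     * Decode with a Charset object. Unlike new String(bytes, "UTF-8"),
     * new String(bytes, Charset) does not call charsetForName().
     */
    public static String decode(byte[] data, String encoding) {
        return new String(data, forName(encoding));
    }

    public static void main(String[] args) {
        byte[] bytes = "caf\u00e9".getBytes(forName("UTF-8"));
        System.out.println(decode(bytes, "UTF-8"));
    }
}
```

Hot call sites like Nutch's encoding sniffing or Hadoop's Text.decode would only benefit if they were changed to route through such a cache; until then, limiting fetcher threads (as Brad suggests) is the other lever.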

