Hi guys,

Brad, thanks for sharing your observations with us, that's great. It looks like we could definitely do without the lock on the charset.
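
A minimal sketch of what doing without that lock could look like from the caller's side: resolve each charset name once and keep the result in a concurrent map, so repeated lookups of the same name no longer go through the synchronized `FastCharsetProvider.charsetForName` seen in the stack traces. `CharsetCache` and its method names are hypothetical (nothing here is taken from Nutch or Tika), the code assumes Java 8 or later, and it has not been measured against this workload.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical helper for illustration; not part of Nutch or Tika. */
final class CharsetCache {

    // Optional.empty() marks names already known to be illegal or unsupported,
    // so bad names coming from crawled pages are not looked up again either.
    // The map is unbounded: a real crawler would probably want to cap its size.
    private static final ConcurrentMap<String, Optional<Charset>> CACHE =
            new ConcurrentHashMap<>();

    private CharsetCache() {}

    /** Returns the charset for the given name, or null if it cannot be resolved. */
    static Charset forName(String name) {
        if (name == null) {
            return null;
        }
        String key = name.trim().toLowerCase(Locale.ROOT);
        // Plain get() first so that cache hits never take a lock.
        Optional<Charset> hit = CACHE.get(key);
        if (hit == null) {
            hit = CACHE.computeIfAbsent(key, CharsetCache::lookup);
        }
        return hit.orElse(null);
    }

    /** Cached stand-in for Charset.isSupported(name) that never throws. */
    static boolean isSupported(String name) {
        return forName(name) != null;
    }

    // The only place that still calls into the JDK lookup.
    private static Optional<Charset> lookup(String key) {
        try {
            return Optional.of(Charset.forName(key));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return Optional.empty();
        }
    }
}
```

Callers would then hand the resolved `Charset` object to the JDK, for example `new InputStreamReader(in, cs)` or `new String(bytes, cs)`, instead of passing the name, since the name-based overloads do the lookup again internally.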

The stack trace definitely shows that BOTH parse-html and parse-tika are used, which should not happen. I wonder whether they are both called on each document or alternately. I will have a look at it and see if I can find an explanation for this.

It would be interesting to see, for a fetched segment:
- how long it takes to parse it when both parse-(html|tika) are in plugin.includes
- same with only parse-tika
- same with only parse-html

Thanks

Jul

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com

On 24 July 2010 18:33, Ken Krugler <[email protected]> wrote:

> Hi Brad,
>
> On Jul 23, 2010, at 8:50pm, brad wrote:
>
>> The items listed in the original email were just the location of the
>> original call. Here is the actual jstack dump for the parseNeko and the
>> other. The parseNeko apparently being directly called by the Nutch parser,
>> not Tika...
>
> Thanks for the additional details.
>
> Maybe Julien can provide an explanation for why there are both Tika and
> Nutch HtmlParser references showing up - Julien?
>
> -- Ken
>
>> "Thread-177094" daemon prio=10 tid=0x00002aab281fb000 nid=0x5638 waiting for monitor entry [0x00002aab82a18000..0x00002aab82a18b90]
>>    java.lang.Thread.State: BLOCKED (on object monitor)
>>     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>>     at java.nio.charset.Charset.lookup2(Charset.java:468)
>>     at java.nio.charset.Charset.lookup(Charset.java:456)
>>     at java.nio.charset.Charset.isSupported(Charset.java:498)
>>     at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
>>     at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
>>     at org.cyberneko.html.HTMLScanner.setInputSource(HTMLScanner.java:774)
>>     at org.cyberneko.html.HTMLConfiguration.setInputSource(HTMLConfiguration.java:457)
>>     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:430)
>>     at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
>>     at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)
>>     at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:210)
>>     at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>     at java.lang.Thread.run(Thread.java:636)
>>
>> Here are the main ones listed in the jira case:
>>
>> "Thread-177096" daemon prio=10 tid=0x00002aab283e5400 nid=0x563a waiting for monitor entry [0x000000007318b000..0x000000007318bc90]
>>    java.lang.Thread.State: BLOCKED (on object monitor)
>>     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>>     at java.nio.charset.Charset.lookup2(Charset.java:468)
>>     at java.nio.charset.Charset.lookup(Charset.java:456)
>>     at java.nio.charset.Charset.isSupported(Charset.java:498)
>>     at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
>>     at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
>>     at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)
>>     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>     at java.lang.Thread.run(Thread.java:636)
>>
>> "Thread-177079" daemon prio=10 tid=0x00002aab1c149000 nid=0x5629 waiting for monitor entry [0x00002aab8200e000..0x00002aab8200ec10]
>>    java.lang.Thread.State: BLOCKED (on object monitor)
>>     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>>     at java.nio.charset.Charset.lookup2(Charset.java:468)
>>     at java.nio.charset.Charset.lookup(Charset.java:456)
>>     at java.nio.charset.Charset.isSupported(Charset.java:498)
>>     at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)
>>     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>     at java.lang.Thread.run(Thread.java:636)
>>
>> "Thread-177029" daemon prio=10 tid=0x0000000017a67400 nid=0x55f7 waiting for monitor entry [0x00002aab8806f000..0x00002aab8806fb90]
>>    java.lang.Thread.State: BLOCKED (on object monitor)
>>     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>>     at java.nio.charset.Charset.lookup2(Charset.java:468)
>>     at java.nio.charset.Charset.lookup(Charset.java:456)
>>     at java.nio.charset.Charset.isSupported(Charset.java:498)
>>     at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)
>>     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
>>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>     at java.lang.Thread.run(Thread.java:636)
>>
>> "Thread-177090" daemon prio=10 tid=0x00002aab48060c00 nid=0x5634 waiting for monitor entry [0x00002aab82f1d000..0x00002aab82f1dd90]
>>    java.lang.Thread.State: BLOCKED (on object monitor)
>>     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>>     at java.nio.charset.Charset.lookup2(Charset.java:468)
>>     at java.nio.charset.Charset.lookup(Charset.java:456)
>>     at java.nio.charset.Charset.forName(Charset.java:521)
>>     at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)
>>     at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:137)
>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>     at java.lang.Thread.run(Thread.java:636)
>>
>> The last one isn't a tika problem, it's nutch's issue...
>>
>> If there is anything else I can provide, please let me know.
>>
>> Thanks
>> Brad
>>
>> -----Original Message-----
>> From: brad [mailto:[email protected]]
>> Sent: Friday, July 23, 2010 8:37 PM
>> To: [email protected]
>> Subject: RE: Parsing Performance - related to Java concurrency issue
>>
>> I'm just running Nutch as delivered. The information about the
>> org.apache.tika.parser.html.HtmlParser.getEncoding, etc. is from running
>> jstack on the nutch process when it slowed down to a crawl...
>>
>> -----Original Message-----
>> From: Ken Krugler [mailto:[email protected]]
>> Sent: Friday, July 23, 2010 8:25 PM
>> To: [email protected]
>> Subject: Re: Parsing Performance - related to Java concurrency issue
>>
>> Hi Brad,
>>
>> On Jul 23, 2010, at 7:21pm, brad wrote:
>>
>>> Hi Ken,
>>> Thanks for the info. I'm using Nutch 1.1, so I believe it is Tika
>>> 0.7? The jar files in my Nutch path are tika-core-0.7.jar and
>>> tika-parsers-0.7.jar.
>>> Is there a way to find out if it is actually pulling something different
>>> when executing?
>>
>> Tika switched from Neko to TagSoup on 14/Oct/2009.
>>
>> Tika 0.7 was released on April 3rd, 2010 so I would expect that if you are
>> using Tika 0.7, you'd be using TagSoup.
>>
>> However I see the line that references Neko is this one:
>>
>>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:
>>
>> So it's the Nutch HtmlParser that's using Neko.
>>
>> Curious why you have both Nutch and Tika HtmlParser refs in your file,
>> e.g. I also see:
>>
>>>> org.apache.tika.parser.html.HtmlParser.getEncoding
>>
>> -- Ken
>>
>>> The ps -ef | grep nutch
>>> Includes -classpath ...:/usr/local/nutch/lib/tika-core-0.7.jar:...
>>> in the Nutch execution command line
>>>
>>> My server does have
>>> /usr/local/solr/contrib/extraction/lib/tika-core-0.4.jar
>>> /usr/local/solr/contrib/extraction/lib/tika-parsers-0.4.jar
>>>
>>> But they are not in the classpath, nor in the $PATH, so I doubt they
>>> are being picked up? Is there someplace else I should be looking?
>>>
>>> Thanks
>>> Brad
>>>
>>> -----Original Message-----
>>> From: Ken Krugler [mailto:[email protected]]
>>> Sent: Friday, July 23, 2010 6:38 PM
>>> To: [email protected]
>>> Subject: Re: Parsing Performance - related to Java concurrency issue
>>>
>>> Hi Brad,
>>>
>>> Thanks for the nice write-up, and the refs.
>>>
>>> I'll look into using a simple cache in Tika to avoid this type of
>>> blocking.
>>> Feel free to comment on https://issues.apache.org/jira/browse/TIKA-471
>>>
>>> Note that the Tika code base has changed from what it appears that
>>> you're using (e.g. the switch from Neko to TagSoup happened quite a
>>> while ago).
>>>
>>> -- Ken
>>>
>>> On Jul 23, 2010, at 3:51pm, brad wrote:
>>>
>>>> I'm continuing to have performance problems with parsing. I ran the
>>>> fetch process with -noParsing and got great performance. If I do the
>>>> same process with parsing left in, the fetching seems to be going
>>>> great, but as the process continues to run, everything slows down to
>>>> almost a dead stop.
>>>>
>>>> When I check the thread stack, I find that 1062 threads are blocked:
>>>> java.lang.Thread.State: BLOCKED (on object monitor)
>>>> at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>>>
>>>> Apparently this is a known issue with Java, and a couple articles are
>>>> written about it:
>>>> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
>>>> http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
>>>>
>>>> There is also a note in java bug database about scaling issues with
>>>> the class...
>>>> Please also note that the current implementation of
>>>> sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock
>>>> and is called very often (e.g. by new String(byte[] data,String
>>>> encoding)).
>>>> This JVM-wide lock means that Java applications do not scale beyond 4 CPU
>>>> cores.
>>>>
>>>> I noted in the case of my stack at this particular point in time.
>>>> The BLOCKED calls to charsetForName were generated by:
>>>>
>>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84) 378
>>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99) 61
>>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133) 19
>>>> at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86) 238
>>>> at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310) 133
>>>> at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270) 8
>>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253) 47
>>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247) 19
>>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227) 2
>>>> at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104) 7
>>>> at org.apache.hadoop.io.Text$1.initialValue(Text.java:54) 88
>>>> at org.apache.hadoop.io.Text.decode(Text.java:344) 2
>>>> at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161) 12
>>>> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192) 13
>>>> at org.apache.pdfbox.cos.COSString.getString(COSString.java:245) 3
>>>>
>>>> Is this an issue that only I'm facing? Is it worth looking at
>>>> alternatives as talked about in the articles? Or, just limit the
>>>> number of threads that are run? Right now it seems like the block is
>>>> causing problem unrelated to general design and behavior of Nutch.
>>>>
>>>> Thoughts??
>>>>
>>>> Thanks
>>>> Brad
>>>
>>> --------------------------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> e l a s t i c w e b m i n i n g
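
For anyone repeating Brad's measurement, the per-frame counts in the quoted message can be approximated from a saved jstack dump with a small tally. This is only a sketch under assumptions: the `BlockedFrameTally` class is hypothetical, the dump is assumed to be raw jstack output with one frame per line (not the mail-wrapped form quoted in this thread), and each BLOCKED thread is attributed to its first frame starting with a caller-chosen prefix such as `org.apache.`, which is a guess at how the original counts were grouped.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration; not part of Nutch, Tika or the JDK tools. */
final class BlockedFrameTally {

    private BlockedFrameTally() {}

    /**
     * Counts, per frame, how many BLOCKED threads have that frame as their
     * first stack frame starting with framePrefix.
     */
    static Map<String, Integer> tally(String dump, String framePrefix) {
        Map<String, Integer> counts = new TreeMap<>();
        boolean blocked = false;  // is the thread currently being read BLOCKED?
        boolean counted = false;  // has it already been attributed to a frame?
        for (String raw : dump.split("\\R")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // A quoted thread name starts a new thread entry.
                blocked = false;
                counted = false;
            } else if (line.startsWith("java.lang.Thread.State:")) {
                blocked = line.contains("BLOCKED");
            } else if (blocked && !counted && line.startsWith("at " + framePrefix)) {
                // Drop the leading "at " and count the frame once per thread.
                counts.merge(line.substring(3), 1, Integer::sum);
                counted = true;
            }
        }
        return counts;
    }
}
```

Calling `BlockedFrameTally.tally(dumpText, "org.apache.")` on the text of a dump returns a sorted map from frame to number of blocked threads, which can be compared between runs with different plugin.includes settings.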

