> since tika covers the same mime-types as parse-text and parse-html you
> probably don't need to include them. Can't remember why we kept them in the
> default (anyone?).
>

Of course, if you decide to use, say, parse-html rather than Tika for a given
mime-type but keep Tika as the default parser for the other types, you will
need to create a mapping in parse-plugins.xml.

> Regarding finding both the html and tika parsers in the stacks: the only case
> where it should happen is when both are loaded (as in your conf and the
> default) and the default parser (i.e. Tika), which is tried first, fails to
> return a result; the remaining parsers for the mime type are then used.
>
> Did you notice anything in the log about any errors during the parsing with
> Tika on HTML docs? That could explain why the html parser was tried.
>
> J.
>
> On 24 July 2010 19:45, brad <[email protected]> wrote:
>
>> Here is my plugin.includes from nutch-site.xml
>>
>> protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>>
>> Here is the one from nutch-default.xml
>>
>> protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>>
>> Do I need to change something on this?
>>
>> Thanks
>> Brad
>>
>>
>> -----Original Message-----
>> From: Julien Nioche [mailto:[email protected]]
>> Sent: Saturday, July 24, 2010 11:09 AM
>> To: [email protected]
>> Subject: Re: Parsing Performance - related to Java concurrency issue
>>
>> Hi guys,
>>
>> Brad, thanks for sharing your observations with us, that's great. Looks like
>> we could definitely do without the lock on the charset.
>>
>> The stack trace definitely shows that BOTH parse-html and parse-tika are
>> used, which should not happen. I wonder whether they are both called on each
>> document or alternately. I will have a look at it and see if I can find an
>> explanation for this.
>>
>> It would be interesting to see, for a fetched segment:
>> - how long it takes to parse it when both parse-(html|tika) are in
>>   plugin.includes
>> - same with only parse-tika
>> - same with only parse-html
>>
>> Thanks
>>
>> Jul
>>
>> --
>> DigitalPebble Ltd
>>
>> Open Source Solutions for Text Engineering
>> http://www.digitalpebble.com
>>
>>
>> On 24 July 2010 18:33, Ken Krugler <[email protected]> wrote:
>>
>> > Hi Brad,
>> >
>> > On Jul 23, 2010, at 8:50pm, brad wrote:
>> >
>> >> The items listed in the original email were just the location of the
>> >> original call. Here is the actual jstack dump for the parseNeko and
>> >> the other. parseNeko is apparently being called directly by the
>> >> Nutch parser, not Tika...
>> >>
>> >
>> > Thanks for the additional details.
>> >
>> > Maybe Julien can provide an explanation for why there are both Tika
>> > and Nutch HtmlParser references showing up - Julien?
>> >
>> > -- Ken
>> >
>> >
>> >> "Thread-177094" daemon prio=10 tid=0x00002aab281fb000 nid=0x5638 waiting for monitor entry [0x00002aab82a18000..0x00002aab82a18b90]
>> >>    java.lang.Thread.State: BLOCKED (on object monitor)
>> >>     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>> >>     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>> >>     at java.nio.charset.Charset.lookup2(Charset.java:468)
>> >>     at java.nio.charset.Charset.lookup(Charset.java:456)
>> >>     at java.nio.charset.Charset.isSupported(Charset.java:498)
>> >>     at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
>> >>     at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
>> >>     at org.cyberneko.html.HTMLScanner.setInputSource(HTMLScanner.java:774)
>> >>     at org.cyberneko.html.HTMLConfiguration.setInputSource(HTMLConfiguration.java:457)
>> >>     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:430)
>> >>     at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
>> >>     at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)
>> >>     at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:210)
>> >>     at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
>> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>> >>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> >>     at java.lang.Thread.run(Thread.java:636)
>> >>
>> >>
>> >> Here are the main ones listed in the jira case:
>> >>
>> >> "Thread-177096" daemon prio=10 tid=0x00002aab283e5400 nid=0x563a waiting for monitor entry [0x000000007318b000..0x000000007318bc90]
>> >>    java.lang.Thread.State: BLOCKED (on object monitor)
>> >>     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>> >>     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>> >>     at java.nio.charset.Charset.lookup2(Charset.java:468)
>> >>     at java.nio.charset.Charset.lookup(Charset.java:456)
>> >>     at java.nio.charset.Charset.isSupported(Charset.java:498)
>> >>     at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
>> >>     at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
>> >>     at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)
>> >>     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
>> >>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>> >>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> >>     at java.lang.Thread.run(Thread.java:636)
>> >>
>> >>
>> >> "Thread-177079" daemon prio=10 tid=0x00002aab1c149000 nid=0x5629 waiting for monitor entry [0x00002aab8200e000..0x00002aab8200ec10]
>> >>    java.lang.Thread.State: BLOCKED (on object monitor)
>> >>     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>> >>     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>> >>     at java.nio.charset.Charset.lookup2(Charset.java:468)
>> >>     at java.nio.charset.Charset.lookup(Charset.java:456)
>> >>     at java.nio.charset.Charset.isSupported(Charset.java:498)
>> >>     at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)
>> >>     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
>> >>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>> >>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> >>     at java.lang.Thread.run(Thread.java:636)
>> >>
>> >>
>> >> "Thread-177029" daemon prio=10 tid=0x0000000017a67400 nid=0x55f7 waiting for monitor entry [0x00002aab8806f000..0x00002aab8806fb90]
>> >>    java.lang.Thread.State: BLOCKED (on object monitor)
>> >>     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>> >>     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>> >>     at java.nio.charset.Charset.lookup2(Charset.java:468)
>> >>     at java.nio.charset.Charset.lookup(Charset.java:456)
>> >>     at java.nio.charset.Charset.isSupported(Charset.java:498)
>> >>     at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)
>> >>     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
>> >>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>> >>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> >>     at java.lang.Thread.run(Thread.java:636)
>> >>
>> >>
>> >> "Thread-177090" daemon prio=10 tid=0x00002aab48060c00 nid=0x5634 waiting for monitor entry [0x00002aab82f1d000..0x00002aab82f1dd90]
>> >>    java.lang.Thread.State: BLOCKED (on object monitor)
>> >>     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>> >>     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>> >>     at java.nio.charset.Charset.lookup2(Charset.java:468)
>> >>     at java.nio.charset.Charset.lookup(Charset.java:456)
>> >>     at java.nio.charset.Charset.forName(Charset.java:521)
>> >>     at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)
>> >>     at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:137)
>> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>> >>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>> >>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> >>     at java.lang.Thread.run(Thread.java:636)
>> >>
>> >> The last one isn't a tika problem, it's nutch's issue...
>> >>
>> >>
>> >> If there is anything else I can provide, please let me know.
>> >>
>> >> Thanks
>> >> Brad
>> >>
>> >> -----Original Message-----
>> >> From: brad [mailto:[email protected]]
>> >> Sent: Friday, July 23, 2010 8:37 PM
>> >> To: [email protected]
>> >> Subject: RE: Parsing Performance - related to Java concurrency issue
>> >>
>> >> I'm just running Nutch as delivered. The information about
>> >> org.apache.tika.parser.html.HtmlParser.getEncoding, etc. is from
>> >> running jstack on the nutch process when it slowed down to a crawl...
>> >>
>> >> -----Original Message-----
>> >> From: Ken Krugler [mailto:[email protected]]
>> >> Sent: Friday, July 23, 2010 8:25 PM
>> >> To: [email protected]
>> >> Subject: Re: Parsing Performance - related to Java concurrency issue
>> >>
>> >> Hi Brad,
>> >>
>> >> On Jul 23, 2010, at 7:21pm, brad wrote:
>> >>
>> >>> Hi Ken,
>> >>> Thanks for the info. I'm using Nutch 1.1, so I believe it is Tika
>> >>> 0.7? The jar files in my Nutch path are tika-core-0.7.jar and
>> >>> tika-parsers-0.7.jar.
>> >>> Is there a way to find out if it is actually pulling something
>> >>> different when executing?
>> >>>
>> >>
>> >> Tika switched from Neko to TagSoup on 14/Oct/2009.
>> >>
>> >> Tika 0.7 was released on April 3rd, 2010, so I would expect that if
>> >> you are using Tika 0.7, you'd be using TagSoup.
>> >>
>> >> However, I see the line that references Neko is this one:
>> >>
>> >>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:
>> >>>>
>> >>
>> >> So it's the Nutch HtmlParser that's using Neko.
>> >>
>> >> Curious why you have both Nutch and Tika HtmlParser refs in your
>> >> file, e.g. I also see:
>> >>
>> >>>> org.apache.tika.parser.html.HtmlParser.getEncoding
>> >>>>
>> >>
>> >> -- Ken
>> >>
>> >>> The ps -ef | grep nutch
>> >>> Includes -classpath ...:/usr/local/nutch/lib/tika-core-0.7.jar:...
>> >>> in the Nutch execution command line
>> >>>
>> >>> My server does have
>> >>> /usr/local/solr/contrib/extraction/lib/tika-core-0.4.jar
>> >>> /usr/local/solr/contrib/extraction/lib/tika-parsers-0.4.jar
>> >>>
>> >>> But they are not in the classpath, nor are they in the $PATH, so I
>> >>> doubt they are being picked up. Is there someplace else I should be
>> >>> looking?
>> >>>
>> >>> Thanks
>> >>> Brad
>> >>>
>> >>> -----Original Message-----
>> >>> From: Ken Krugler [mailto:[email protected]]
>> >>> Sent: Friday, July 23, 2010 6:38 PM
>> >>> To: [email protected]
>> >>> Subject: Re: Parsing Performance - related to Java concurrency issue
>> >>>
>> >>> Hi Brad,
>> >>>
>> >>> Thanks for the nice write-up, and the refs.
>> >>>
>> >>> I'll look into using a simple cache in Tika to avoid this type of
>> >>> blocking.
>> >>> Feel free to comment on
>> >>> https://issues.apache.org/jira/browse/TIKA-471
>> >>>
>> >>> Note that the Tika code base has changed from what it appears that
>> >>> you're using (e.g. the switch from Neko to TagSoup happened quite a
>> >>> while ago).
>> >>>
>> >>> -- Ken
>> >>>
>> >>> On Jul 23, 2010, at 3:51pm, brad wrote:
>> >>>
>> >>>> I'm continuing to have performance problems with parsing. I ran the
>> >>>> fetch process with -noParsing and got great performance. If I do
>> >>>> the same process with parsing left in, the fetching seems to be
>> >>>> going great, but as the process continues to run, everything slows
>> >>>> down to almost a dead stop.
>> >>>>
>> >>>> When I check the thread stack, I find that 1062 threads are blocked:
>> >>>>    java.lang.Thread.State: BLOCKED (on object monitor)
>> >>>>     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>> >>>>
>> >>>> Apparently this is a known issue with Java, and a couple of articles
>> >>>> have been written about it:
>> >>>> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
>> >>>> http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
>> >>>>
>> >>>> There is also a note in the Java bug database about scaling issues
>> >>>> with the class...
>> >>>> Please also note that the current implementation of
>> >>>> sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide
>> >>>> lock and is called very often (e.g. by new String(byte[]
>> >>>> data,String encoding)).
>> >>>> This JVM-wide lock means that Java applications do not scale beyond
>> >>>> 4 CPU cores.
>> >>>>
>> >>>> I noted in the case of my stack at this particular point in time.
>> >>>> The BLOCKED calls to charsetForName were generated by (frame,
>> >>>> followed by the number of blocked threads):
>> >>>>
>> >>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84) 378
>> >>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99) 61
>> >>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133) 19
>> >>>> at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86) 238
>> >>>> at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310) 133
>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270) 8
>> >>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253) 47
>> >>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247) 19
>> >>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227) 2
>> >>>> at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104) 7
>> >>>> at org.apache.hadoop.io.Text$1.initialValue(Text.java:54) 88
>> >>>> at org.apache.hadoop.io.Text.decode(Text.java:344) 2
>> >>>> at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161) 12
>> >>>> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192) 13
>> >>>> at org.apache.pdfbox.cos.COSString.getString(COSString.java:245) 3
>> >>>>
>> >>>> Is this an issue that only I'm facing? Is it worth looking at
>> >>>> alternatives as talked about in the articles? Or just limit the
>> >>>> number of threads that are run? Right now it seems like the block
>> >>>> is causing a problem unrelated to the general design and behavior
>> >>>> of Nutch.
>> >>>>
>> >>>> Thoughts??
>> >>>>
>> >>>> Thanks
>> >>>> Brad
>> >>>>
>> >>> --------------------------------------------
>> >>> Ken Krugler
>> >>> +1 530-210-6378
>> >>> http://bixolabs.com
>> >>> e l a s t i c   w e b   m i n i n g
>> >>>
>> >> --------------------------------------------
>> >> Ken Krugler
>> >> +1 530-210-6378
>> >> http://bixolabs.com
>> >> e l a s t i c   w e b   m i n i n g
>> >>
>> > --------------------------------------------
>> > Ken Krugler
>> > +1 530-210-6378
>> > http://bixolabs.com
>> > e l a s t i c   w e b   m i n i n g
>> >
>>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>

--
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
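
For reference, the "simple cache" idea raised in the quoted thread (keeping repeated charset lookups away from the JVM-wide lock in sun.nio.cs.FastCharsetProvider.charsetForName) could look roughly like the sketch below. This is only an illustration under stated assumptions: the class name CharsetCache and its methods are hypothetical, they are not part of Tika or Nutch, and this is not claimed to be the change made for TIKA-471. It relies only on standard JDK classes (Charset, ConcurrentHashMap, Optional) and needs Java 8 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical helper (not a Tika or Nutch class): caches charset lookups. */
final class CharsetCache {

    // Normalized charset name -> lookup result. An empty Optional records a
    // name that is illegal or unsupported, so that it is not looked up again.
    private static final ConcurrentMap<String, Optional<Charset>> CACHE =
            new ConcurrentHashMap<>();

    private CharsetCache() {
    }

    /** Returns the charset for the given name, or empty if it cannot be resolved. */
    static Optional<Charset> lookup(String name) {
        if (name == null) {
            return Optional.empty();
        }
        // Charset names are case-insensitive, so normalize before using as a key.
        String key = name.trim().toLowerCase(Locale.ROOT);
        Optional<Charset> cached = CACHE.get(key);
        if (cached != null) {
            return cached; // served from the map, no call into Charset.forName
        }
        Optional<Charset> resolved;
        try {
            // Only the first lookup of each distinct name reaches the JDK.
            resolved = Optional.of(Charset.forName(key));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            resolved = Optional.empty();
        }
        Optional<Charset> previous = CACHE.putIfAbsent(key, resolved);
        return previous != null ? previous : resolved;
    }

    /** Cached counterpart of Charset.isSupported that never throws. */
    static boolean isSupported(String name) {
        return lookup(name).isPresent();
    }
}
```

A caller would use CharsetCache.isSupported(encoding) or CharsetCache.lookup(encoding) where it currently calls Charset.isSupported or Charset.forName. Two caveats: the map is unbounded, so names taken from crawled pages would need a size limit in real use, and whether this actually removes the contention seen in the jstack output would have to be measured on the JVM in question.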

