> Since Tika covers the same MIME types as parse-text and parse-html, you
> probably don't need to include them. I can't remember why we kept them in the
> default (anyone?).
>

Of course, if you decide to use, say, parse-html rather than Tika for a given
MIME type but keep Tika as the default parser for the other types, you will
need to create a mapping for that MIME type in parse-plugins.xml.


>
> Regarding finding both the HTML and Tika parsers in the stacks: the only
> case where this should happen is when both are loaded (as in your conf and
> the default) and the default parser (i.e. Tika), which is tried first, fails
> to return a result; the remaining parsers for the MIME type are then used.
>
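
To make the fallback described above concrete, here is a minimal sketch of
"try the default parser first, then the remaining ones". It is only an
illustration: SketchParser and ParserChain are hypothetical names and this is
not the actual Nutch parsing code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only: these types are NOT Nutch's real classes.
interface SketchParser {
    String name();

    // Returns an empty Optional when the parser fails to produce a result.
    Optional<String> parse(String content);
}

final class ParserChain {
    // Ordered list: the default parser (e.g. Tika) comes first.
    private final List<SketchParser> parsers;

    ParserChain(List<SketchParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
    }

    // Tries each parser in order and returns the name of the first one
    // that yields a result, or an empty Optional if all of them fail.
    Optional<String> firstSuccessful(String content) {
        for (SketchParser p : parsers) {
            if (p.parse(content).isPresent()) {
                return Optional.of(p.name());
            }
        }
        return Optional.empty();
    }
}
```

With a chain like this, a second parser only shows up in a stack trace for
documents where the first one returned nothing, which is why the logs are
worth checking.
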
> Did you notice anything in the log about any errors during the parsing with
> Tika on HTML docs? That could explain why the HTML parser was tried.
>
> J.
>
>
>
> On 24 July 2010 19:45, brad <[email protected]> wrote:
>
>> Here is my plugin.includes from nutch-site.xml
>>
>> protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>>
>> Here is the one from nutch-default.xml
>>
>>
>> protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>>
>> Do I need to change something on this?
>>
>> Thanks
>> Brad
>>
>>
>> -----Original Message-----
>> From: Julien Nioche [mailto:[email protected]]
>> Sent: Saturday, July 24, 2010 11:09 AM
>> To: [email protected]
>> Subject: Re: Parsing Performance - related to Java concurrency issue
>>
>> Hi guys,
>>
>> Brad, thanks for sharing your observations with us, that's great. It looks
>> like we could definitely do without the lock on the charset.
>>
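
One possible way to avoid hitting the charset lock on every document is to
cache the Charset objects by name. The sketch below is only an assumption
about how such a cache could look; CharsetCache is a hypothetical helper and
is not part of Nutch or Tika.

```java
import java.nio.charset.Charset;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, NOT part of Nutch or Tika.
final class CharsetCache {
    private static final ConcurrentMap<String, Charset> CACHE =
            new ConcurrentHashMap<>();

    private CharsetCache() {}

    // Returns the Charset for a non-null name. Only cache misses go through
    // Charset.forName, which is where the synchronized lookup happens.
    // Illegal or unsupported names throw the same exceptions as
    // Charset.forName and are never cached.
    static Charset forName(String name) {
        Charset cs = CACHE.get(name);
        if (cs == null) {
            cs = Charset.forName(name);
            CACHE.putIfAbsent(name, cs);
        }
        return cs;
    }
}
```

A caller would then pass the Charset object itself, for example
new InputStreamReader(in, CharsetCache.forName("UTF-8")), rather than the
charset name, so that no further lookup by name is needed.
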
>> The stack trace definitely shows that BOTH parse-html and parse-tika are
>> used, which should not happen. I wonder whether they are both called on
>> each document or alternately. I will have a look at it and see if I can
>> find an explanation for this.
>>
>> It would be interesting to see, for a fetched segment:
>> - how long it takes to parse it when both parse-(html|tika) are in
>>   plugin.includes
>> - the same with only parse-tika
>> - the same with only parse-html
>>
>> Thanks
>>
>> Jul
>>
>> --
>> DigitalPebble Ltd
>>
>> Open Source Solutions for Text Engineering http://www.digitalpebble.com
>>
>>
>> On 24 July 2010 18:33, Ken Krugler <[email protected]> wrote:
>>
>> > Hi Brad,
>> >
>> >
>> > On Jul 23, 2010, at 8:50pm, brad wrote:
>> >
>> >> The items listed in the original email were just the location of the
>> >> original call.  Here is the actual jstack dump for the parseNeko and
>> >> the other.  parseNeko is apparently being called directly by the
>> >> Nutch parser, not Tika...
>> >>
>> >
>> > Thanks for the additional details.
>> >
>> > Maybe Julien can provide an explanation for why there are both Tika
>> > and Nutch HtmlParser references showing up - Julien?
>> >
>> > -- Ken
>> >
>> >
>> >> "Thread-177094" daemon prio=10 tid=0x00002aab281fb000 nid=0x5638
>> >> waiting for monitor entry [0x00002aab82a18000..0x00002aab82a18b90]
>> >>  java.lang.Thread.State: BLOCKED (on object monitor)
>> >>        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>> >>        - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>> >>        at java.nio.charset.Charset.lookup2(Charset.java:468)
>> >>        at java.nio.charset.Charset.lookup(Charset.java:456)
>> >>        at java.nio.charset.Charset.isSupported(Charset.java:498)
>> >>        at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
>> >>        at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
>> >>        at org.cyberneko.html.HTMLScanner.setInputSource(HTMLScanner.java:774)
>> >>        at org.cyberneko.html.HTMLConfiguration.setInputSource(HTMLConfiguration.java:457)
>> >>        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:430)
>> >>        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
>> >>        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)
>> >>        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:210)
>> >>        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
>> >>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>> >>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>> >>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> >>        at java.lang.Thread.run(Thread.java:636)
>> >>
>> >>
>> >> Here are the main ones listed in the jira case:
>> >>
>> >> "Thread-177096" daemon prio=10 tid=0x00002aab283e5400 nid=0x563a
>> >> waiting for monitor entry [0x000000007318b000..0x000000007318bc90]
>> >>  java.lang.Thread.State: BLOCKED (on object monitor)
>> >>        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>> >>        - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>> >>        at java.nio.charset.Charset.lookup2(Charset.java:468)
>> >>        at java.nio.charset.Charset.lookup(Charset.java:456)
>> >>        at java.nio.charset.Charset.isSupported(Charset.java:498)
>> >>        at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
>> >>        at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
>> >>        at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)
>> >>        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
>> >>        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>> >>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>> >>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>> >>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> >>        at java.lang.Thread.run(Thread.java:636)
>> >>
>> >>
>> >> "Thread-177079" daemon prio=10 tid=0x00002aab1c149000 nid=0x5629
>> >> waiting for monitor entry [0x00002aab8200e000..0x00002aab8200ec10]
>> >>  java.lang.Thread.State: BLOCKED (on object monitor)
>> >>        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>> >>        - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>> >>        at java.nio.charset.Charset.lookup2(Charset.java:468)
>> >>        at java.nio.charset.Charset.lookup(Charset.java:456)
>> >>        at java.nio.charset.Charset.isSupported(Charset.java:498)
>> >>        at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)
>> >>        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
>> >>        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>> >>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>> >>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>> >>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> >>        at java.lang.Thread.run(Thread.java:636)
>> >>
>> >>
>> >> "Thread-177029" daemon prio=10 tid=0x0000000017a67400 nid=0x55f7
>> >> waiting for monitor entry [0x00002aab8806f000..0x00002aab8806fb90]
>> >>  java.lang.Thread.State: BLOCKED (on object monitor)
>> >>        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>> >>        - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>> >>        at java.nio.charset.Charset.lookup2(Charset.java:468)
>> >>        at java.nio.charset.Charset.lookup(Charset.java:456)
>> >>        at java.nio.charset.Charset.isSupported(Charset.java:498)
>> >>        at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)
>> >>        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
>> >>        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>> >>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>> >>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>> >>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> >>        at java.lang.Thread.run(Thread.java:636)
>> >>
>> >>
>> >> "Thread-177090" daemon prio=10 tid=0x00002aab48060c00 nid=0x5634
>> >> waiting for monitor entry [0x00002aab82f1d000..0x00002aab82f1dd90]
>> >>  java.lang.Thread.State: BLOCKED (on object monitor)
>> >>        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>> >>        - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>> >>        at java.nio.charset.Charset.lookup2(Charset.java:468)
>> >>        at java.nio.charset.Charset.lookup(Charset.java:456)
>> >>        at java.nio.charset.Charset.forName(Charset.java:521)
>> >>        at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)
>> >>        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:137)
>> >>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>> >>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>> >>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> >>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> >>        at java.lang.Thread.run(Thread.java:636)
>> >>
>> >> The last one isn't a tika problem, it's nutch's issue...
>> >>
>> >>
>> >> If there is anything else I can provide, please let me know.
>> >>
>> >> Thanks
>> >> Brad
>> >>
>> >> -----Original Message-----
>> >> From: brad [mailto:[email protected]]
>> >> Sent: Friday, July 23, 2010 8:37 PM
>> >> To: [email protected]
>> >> Subject: RE: Parsing Performance - related to Java concurrency issue
>> >>
>> >> I'm just running Nutch as delivered.  The information about
>> >> org.apache.tika.parser.html.HtmlParser.getEncoding, etc. is from
>> >> running jstack on the Nutch process when it slowed down to a crawl...
>> >>
>> >> -----Original Message-----
>> >> From: Ken Krugler [mailto:[email protected]]
>> >> Sent: Friday, July 23, 2010 8:25 PM
>> >> To: [email protected]
>> >> Subject: Re: Parsing Performance - related to Java concurrency issue
>> >>
>> >> Hi Brad,
>> >>
>> >> On Jul 23, 2010, at 7:21pm, brad wrote:
>> >>
>> >>> Hi Ken,
>> >>> Thanks for the info.  I'm using Nutch 1.1, so I believe it is Tika
>> >>> 0.7?  The jar files in my Nutch path are tika-core-0.7.jar and
>> >>> tika-parsers-0.7.jar.
>> >>> Is there a way to find out if it is actually pulling in something
>> >>> different when executing?
>> >>>
>> >>
>> >> Tika switched from Neko to TagSoup on 14/Oct/2009.
>> >>
>> >> Tika 0.7 was released on April 3rd, 2010 so I would expect that if
>> >> you are using Tika 0.7, you'd be using TagSoup.
>> >>
>> >> However I see the line that references Neko is this one:
>> >>
>> >>  at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:
>> >>>>
>> >>>
>> >> So it's the Nutch HtmlParser that's using Neko.
>> >>
>> >> Curious why you have both Nutch and Tika HtmlParser refs in your
>> >> file, e.g.
>> >> I also see:
>> >>
>> >>  org.apache.tika.parser.html.HtmlParser.getEncoding
>> >>>>
>> >>>
>> >> -- Ken
>> >>
>> >>> The output of ps -ef | grep nutch includes
>> >>> -classpath ...:/usr/local/nutch/lib/tika-core-0.7.jar:...
>> >>> in the Nutch execution command line.
>> >>>
>> >>> My server does have
>> >>> /usr/local/solr/contrib/extraction/lib/tika-core-0.4.jar
>> >>> /usr/local/solr/contrib/extraction/lib/tika-parsers-0.4.jar
>> >>>
>> >>> But they are not in the classpath, nor are they in the $PATH, so I
>> >>> doubt they are being picked up.  Is there someplace else I should be
>> >>> looking?
>> >>>
>> >>> Thanks
>> >>> Brad
>> >>>
>> >>> -----Original Message-----
>> >>> From: Ken Krugler [mailto:[email protected]]
>> >>> Sent: Friday, July 23, 2010 6:38 PM
>> >>> To: [email protected]
>> >>> Subject: Re: Parsing Performance - related to Java concurrency issue
>> >>>
>> >>> Hi Brad,
>> >>>
>> >>> Thanks for the nice write-up, and the refs.
>> >>>
>> >>> I'll look into using a simple cache in Tika to avoid this type of
>> >>> blocking.
>> >>> Feel free to comment on
>> >>> https://issues.apache.org/jira/browse/TIKA-471
>> >>>
>> >>> Note that the Tika code base has changed from what it appears that
>> >>> you're using (e.g. the switch from Neko to TagSoup happened quite a
>> >>> while ago).
>> >>>
>> >>> -- Ken
>> >>>
>> >>> On Jul 23, 2010, at 3:51pm, brad wrote:
>> >>>
>> >>>> I'm continuing to have performance problems with parsing.  I ran
>> >>>> the fetch process with -noParsing and got great performance.  If I
>> >>>> do the same process with parsing left in, the fetching seems to be
>> >>>> going great, but as the process continues to run, everything slows
>> >>>> down to almost a dead stop.
>> >>>>
>> >>>> When I check the thread stack, I find that 1062 threads are blocked:
>> >>>> java.lang.Thread.State: BLOCKED (on object monitor)
>> >>>>        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>> >>>>
>> >>>> Apparently this is a known issue with Java, and a couple of articles
>> >>>> have been written about it:
>> >>>> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
>> >>>> http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
>> >>>>
>> >>>> There is also a note in the Java bug database about scaling issues
>> >>>> with the class:
>> >>>> "Please also note that the current implementation of
>> >>>> sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide
>> >>>> lock and is called very often (e.g. by new String(byte[]
>> >>>> data,String encoding)). This JVM-wide lock means that Java
>> >>>> applications do not scale beyond 4 CPU cores."
>> >>>>
>> >>>> Looking at my stack at this particular point in time, the BLOCKED
>> >>>> calls to charsetForName were generated by:
>> >>>>
>> >>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)  378
>> >>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)  61
>> >>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)  19
>> >>>> at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)  238
>> >>>> at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)  133
>> >>>> at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270)  8
>> >>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)  47
>> >>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247)  19
>> >>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227)  2
>> >>>> at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104)  7
>> >>>> at org.apache.hadoop.io.Text$1.initialValue(Text.java:54)  88
>> >>>> at org.apache.hadoop.io.Text.decode(Text.java:344)  2
>> >>>> at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161)  12
>> >>>> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192)  13
>> >>>> at org.apache.pdfbox.cos.COSString.getString(COSString.java:245)  3
>> >>>>
>> >>>> Is this an issue that only I'm facing?  Is it worth looking at
>> >>>> alternatives as talked about in the articles?  Or should I just limit
>> >>>> the number of threads that are run?  Right now it seems like the
>> >>>> blocking is causing a problem unrelated to the general design and
>> >>>> behavior of Nutch.
>> >>>>
>> >>>> Thoughts??
>> >>>>
>> >>>> Thanks
>> >>>> Brad
>> >>>>
>> >>>>
>> >>>>
>> >>> --------------------------------------------
>> >>> Ken Krugler
>> >>> +1 530-210-6378
>> >>> http://bixolabs.com
>> >>> e l a s t i c   w e b   m i n i n g
>> >>>



-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
