Hi guys,

Brad, thanks for sharing your observations with us, that's great. It looks like
we could definitely do without the lock on the charset.
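
For illustration, here is a minimal sketch of the kind of simple cache Ken
mentions below. CharsetCache is a hypothetical helper name, not an existing
Nutch or Tika class, and I have not measured it against the contention Brad
reports. The idea is only that repeated lookups of the same name are served
from a ConcurrentHashMap instead of going through Charset.forName each time.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical helper (not part of Nutch or Tika): caches resolved charsets by name. */
public final class CharsetCache {

    // Only successful lookups are stored, so the map is bounded by the
    // number of charset names and aliases the JVM supports.
    private static final ConcurrentMap<String, Charset> CACHE =
            new ConcurrentHashMap<String, Charset>();

    private CharsetCache() {
    }

    /**
     * Returns the charset for the given name, or null if the name is null,
     * illegal or unsupported. Unknown names are not cached, so they still
     * take the slow path on every call.
     */
    public static Charset lookup(String name) {
        if (name == null) {
            return null;
        }
        // Charset names are case-insensitive, so normalise the key.
        String key = name.trim().toLowerCase(Locale.ROOT);
        Charset cached = CACHE.get(key);
        if (cached != null) {
            return cached; // fast path: no call into Charset.forName
        }
        try {
            Charset resolved = Charset.forName(key); // slow path
            CACHE.putIfAbsent(key, resolved);
            return resolved;
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException.
            return null;
        }
    }
}
```

A caller that already holds a Charset can pass it to the
InputStreamReader(InputStream, Charset) constructor, e.g.
new InputStreamReader(stream, CharsetCache.lookup("utf-8")), rather than
the constructor that takes a charset name. That should avoid the lookup by
name seen in the traces, although I have not verified it on this setup.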

The stack trace definitely shows that BOTH parse-html and parse-tika are
used, which should not happen. I wonder whether both are called on each
document or whether they alternate between documents. I will have a look and
see if I can find an explanation for this.

It would be interesting to see, for a fetched segment:
- how long it takes to parse when both parse-(html|tika) are in
plugin.includes
- the same with only parse-tika
- the same with only parse-html
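
When comparing the three runs, it may also help to count the BLOCKED threads
per calling frame in a jstack dump, as Brad did by hand in his first mail.
Below is a rough sketch of such a tally. BlockedFrameTally is a hypothetical
helper, and it assumes the raw jstack output (one frame per line, threads
separated by blank lines), not the line-wrapped copy quoted in this thread.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: counts BLOCKED threads by their first non-JDK frame. */
public final class BlockedFrameTally {

    private BlockedFrameTally() {
    }

    /** Takes the full text of a jstack dump and returns frame -> number of blocked threads. */
    public static Map<String, Integer> tally(String dump) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        // jstack separates thread entries with a blank line.
        for (String block : dump.split("\\r?\\n\\s*\\r?\\n")) {
            if (!block.contains("java.lang.Thread.State: BLOCKED")) {
                continue;
            }
            for (String line : block.split("\\r?\\n")) {
                String trimmed = line.trim();
                if (!trimmed.startsWith("at ")) {
                    continue;
                }
                String frame = trimmed.substring(3);
                // Skip JDK frames to reach the library or application caller.
                if (frame.startsWith("java.") || frame.startsWith("sun.")) {
                    continue;
                }
                Integer old = counts.get(frame);
                counts.put(frame, old == null ? 1 : old + 1);
                break; // count each thread once, at its first non-JDK frame
            }
        }
        return counts;
    }
}
```

Running this on a dump from each configuration would show directly whether
frames from both org.apache.nutch.parse.html and org.apache.tika.parser.html
still appear.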

Thanks

Jul

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com


On 24 July 2010 18:33, Ken Krugler <[email protected]> wrote:

> Hi Brad,
>
>
> On Jul 23, 2010, at 8:50pm, brad wrote:
>
>  The items listed in the original email were just the location of the
>> original call.  Here is the actual jstack dump for the parseNeko and the
>> other.  The parseNeko is apparently being called directly by the Nutch
>> parser, not Tika...
>>
>
> Thanks for the additional details.
>
> Maybe Julien can provide an explanation for why there are both Tika and
> Nutch HtmlParser references showing up - Julien?
>
> -- Ken
>
>
>  "Thread-177094" daemon prio=10 tid=0x00002aab281fb000 nid=0x5638 waiting for monitor entry [0x00002aab82a18000..0x00002aab82a18b90]
>>  java.lang.Thread.State: BLOCKED (on object monitor)
>>        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>        - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>>        at java.nio.charset.Charset.lookup2(Charset.java:468)
>>        at java.nio.charset.Charset.lookup(Charset.java:456)
>>        at java.nio.charset.Charset.isSupported(Charset.java:498)
>>        at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
>>        at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
>>        at org.cyberneko.html.HTMLScanner.setInputSource(HTMLScanner.java:774)
>>        at org.cyberneko.html.HTMLConfiguration.setInputSource(HTMLConfiguration.java:457)
>>        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:430)
>>        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
>>        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)
>>        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:210)
>>        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
>>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>        at java.lang.Thread.run(Thread.java:636)
>>
>>
>> Here are the main ones listed in the jira case:
>>
>> "Thread-177096" daemon prio=10 tid=0x00002aab283e5400 nid=0x563a waiting for monitor entry [0x000000007318b000..0x000000007318bc90]
>>  java.lang.Thread.State: BLOCKED (on object monitor)
>>        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>        - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>>        at java.nio.charset.Charset.lookup2(Charset.java:468)
>>        at java.nio.charset.Charset.lookup(Charset.java:456)
>>        at java.nio.charset.Charset.isSupported(Charset.java:498)
>>        at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
>>        at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
>>        at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)
>>        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
>>        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>        at java.lang.Thread.run(Thread.java:636)
>>
>>
>> "Thread-177079" daemon prio=10 tid=0x00002aab1c149000 nid=0x5629 waiting for monitor entry [0x00002aab8200e000..0x00002aab8200ec10]
>>  java.lang.Thread.State: BLOCKED (on object monitor)
>>        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>        - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>>        at java.nio.charset.Charset.lookup2(Charset.java:468)
>>        at java.nio.charset.Charset.lookup(Charset.java:456)
>>        at java.nio.charset.Charset.isSupported(Charset.java:498)
>>        at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)
>>        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
>>        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>        at java.lang.Thread.run(Thread.java:636)
>>
>>
>> "Thread-177029" daemon prio=10 tid=0x0000000017a67400 nid=0x55f7 waiting for monitor entry [0x00002aab8806f000..0x00002aab8806fb90]
>>  java.lang.Thread.State: BLOCKED (on object monitor)
>>        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>        - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>>        at java.nio.charset.Charset.lookup2(Charset.java:468)
>>        at java.nio.charset.Charset.lookup(Charset.java:456)
>>        at java.nio.charset.Charset.isSupported(Charset.java:498)
>>        at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)
>>        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
>>        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>        at java.lang.Thread.run(Thread.java:636)
>>
>>
>> "Thread-177090" daemon prio=10 tid=0x00002aab48060c00 nid=0x5634 waiting for monitor entry [0x00002aab82f1d000..0x00002aab82f1dd90]
>>  java.lang.Thread.State: BLOCKED (on object monitor)
>>        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>        - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
>>        at java.nio.charset.Charset.lookup2(Charset.java:468)
>>        at java.nio.charset.Charset.lookup(Charset.java:456)
>>        at java.nio.charset.Charset.forName(Charset.java:521)
>>        at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)
>>        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:137)
>>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
>>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>        at java.lang.Thread.run(Thread.java:636)
>>
>> The last one isn't a tika problem, it's nutch's issue...
>>
>>
>> If there is anything else I can provide, please let me know.
>>
>> Thanks
>> Brad
>>
>> -----Original Message-----
>> From: brad [mailto:[email protected]]
>> Sent: Friday, July 23, 2010 8:37 PM
>> To: [email protected]
>> Subject: RE: Parsing Performance - related to Java concurrency issue
>>
>> I'm just running Nutch as delivered.  The information about
>> org.apache.tika.parser.html.HtmlParser.getEncoding, etc. is from running
>> jstack on the nutch process when it slowed down to a crawl...
>>
>> -----Original Message-----
>> From: Ken Krugler [mailto:[email protected]]
>> Sent: Friday, July 23, 2010 8:25 PM
>> To: [email protected]
>> Subject: Re: Parsing Performance - related to Java concurrency issue
>>
>> Hi Brad,
>>
>> On Jul 23, 2010, at 7:21pm, brad wrote:
>>
>>  Hi Ken,
>>> Thanks for the info.  I'm using Nutch 1.1, so I believe it is Tika
>>> 0.7?  The jar files in my Nutch path are tika-core-0.7.jar and
>>> tika-parsers-0.7.jar.
>>> Is there a way to find out if it is actually pulling something different
>>> when executing?
>>>
>>
>> Tika switched from Neko to TagSoup on 14/Oct/2009.
>>
>> Tika 0.7 was released on April 3rd, 2010 so I would expect that if you are
>> using Tika 0.7, you'd be using TagSoup.
>>
>> However I see the line that references Neko is this one:
>>
>>  at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:
>>>>
>>>
>> So it's the Nutch HtmlParser that's using Neko.
>>
>> Curious why you have both Nutch and Tika HtmlParser refs in your file,
>> e.g.
>> I also see:
>>
>>  org.apache.tika.parser.html.HtmlParser.getEncoding
>>>>
>>>
>> -- Ken
>>
>>  The output of ps -ef | grep nutch includes
>>> -classpath ...:/usr/local/nutch/lib/tika-core-0.7.jar:...
>>> in the Nutch execution command line.
>>>
>>> My server does have
>>> /usr/local/solr/contrib/extraction/lib/tika-core-0.4.jar
>>> /usr/local/solr/contrib/extraction/lib/tika-parsers-0.4.jar
>>>
>>> But they are not in the classpath, nor are they in the $PATH, so I doubt
>>> they are being picked up?  Is there someplace else I should be looking?
>>>
>>> Thanks
>>> Brad
>>>
>>> -----Original Message-----
>>> From: Ken Krugler [mailto:[email protected]]
>>> Sent: Friday, July 23, 2010 6:38 PM
>>> To: [email protected]
>>> Subject: Re: Parsing Performance - related to Java concurrency issue
>>>
>>> Hi Brad,
>>>
>>> Thanks for the nice write-up, and the refs.
>>>
>>> I'll look into using a simple cache in Tika to avoid this type of
>>> blocking.
>>> Feel free to comment on https://issues.apache.org/jira/browse/TIKA-471
>>>
>>> Note that the Tika code base has changed from what it appears that
>>> you're using (e.g. the switch from Neko to TagSoup happened quite a
>>> while ago).
>>>
>>> -- Ken
>>>
>>> On Jul 23, 2010, at 3:51pm, brad wrote:
>>>
>>>  I'm continuing to have performance problems with parsing.  I ran the
>>>> fetch process with -noParsing and got great performance.  If I do the
>>>> same process with parsing left in, the fetching seems to be going
>>>> great, but as the process continues to run, everything slows down to
>>>> almost a dead stop.
>>>>
>>>> When I check the thread stack, I find that 1062 threads are blocked:
>>>> java.lang.Thread.State: BLOCKED (on object monitor)
>>>>        at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>>>
>>>> Apparently this is a known issue with Java, and a couple of articles have
>>>> been written about it:
>>>> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
>>>> http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
>>>>
>>>> There is also a note in the Java bug database about scaling issues with
>>>> the class...
>>>> Please also note that the current implementation of
>>>> sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock
>>>> and is called very often (e.g. by new String(byte[] data,String
>>>> encoding)). This JVM-wide lock means that Java applications do not
>>>> scale beyond 4 CPU cores.
>>>>
>>>> In my stack dump at this particular point in time, the BLOCKED calls to
>>>> charsetForName were generated by (call site, then number of threads):
>>>>
>>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84) 378
>>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99) 61
>>>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133) 19
>>>> at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86) 238
>>>> at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310) 133
>>>> at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270) 8
>>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253) 47
>>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247) 19
>>>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227) 2
>>>> at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104) 7
>>>> at org.apache.hadoop.io.Text$1.initialValue(Text.java:54) 88
>>>> at org.apache.hadoop.io.Text.decode(Text.java:344) 2
>>>> at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161) 12
>>>> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192) 13
>>>> at org.apache.pdfbox.cos.COSString.getString(COSString.java:245) 3
>>>>
>>>> Is this an issue that only I'm facing?  Is it worth looking at
>>>> alternatives as talked about in the articles?  Or, just limit the
>>>> number of threads that are run?  Right now it seems like the blocking is
>>>> causing a problem unrelated to the general design and behavior of Nutch.
>>>>
>>>> Thoughts??
>>>>
>>>> Thanks
>>>> Brad
>>>>
>>>>
>>>>
>>> --------------------------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> e l a s t i c   w e b   m i n i n g
>>>
>>>
>>>
>>>
>>>
>>>
