The items listed in the original email were only the locations of the
originating calls. Here are the actual jstack dumps for parseNeko and the
others. parseNeko is apparently being called directly by the Nutch parser,
not Tika...
"Thread-177094" daemon prio=10 tid=0x00002aab281fb000 nid=0x5638 waiting for monitor entry [0x00002aab82a18000..0x00002aab82a18b90]
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
- waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
at java.nio.charset.Charset.lookup2(Charset.java:468)
at java.nio.charset.Charset.lookup(Charset.java:456)
at java.nio.charset.Charset.isSupported(Charset.java:498)
at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
at org.cyberneko.html.HTMLScanner.setInputSource(HTMLScanner.java:774)
at org.cyberneko.html.HTMLConfiguration.setInputSource(HTMLConfiguration.java:457)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:430)
at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:210)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.lang.Thread.run(Thread.java:636)
Here are the main ones listed in the jira case:
"Thread-177096" daemon prio=10 tid=0x00002aab283e5400 nid=0x563a waiting for monitor entry [0x000000007318b000..0x000000007318bc90]
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
- waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
at java.nio.charset.Charset.lookup2(Charset.java:468)
at java.nio.charset.Charset.lookup(Charset.java:456)
at java.nio.charset.Charset.isSupported(Charset.java:498)
at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.lang.Thread.run(Thread.java:636)
"Thread-177079" daemon prio=10 tid=0x00002aab1c149000 nid=0x5629 waiting for monitor entry [0x00002aab8200e000..0x00002aab8200ec10]
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
- waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
at java.nio.charset.Charset.lookup2(Charset.java:468)
at java.nio.charset.Charset.lookup(Charset.java:456)
at java.nio.charset.Charset.isSupported(Charset.java:498)
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.lang.Thread.run(Thread.java:636)
"Thread-177029" daemon prio=10 tid=0x0000000017a67400 nid=0x55f7 waiting for monitor entry [0x00002aab8806f000..0x00002aab8806fb90]
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
- waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
at java.nio.charset.Charset.lookup2(Charset.java:468)
at java.nio.charset.Charset.lookup(Charset.java:456)
at java.nio.charset.Charset.isSupported(Charset.java:498)
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.lang.Thread.run(Thread.java:636)
"Thread-177090" daemon prio=10 tid=0x00002aab48060c00 nid=0x5634 waiting for monitor entry [0x00002aab82f1d000..0x00002aab82f1dd90]
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
- waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
at java.nio.charset.Charset.lookup2(Charset.java:468)
at java.nio.charset.Charset.lookup(Charset.java:456)
at java.nio.charset.Charset.forName(Charset.java:521)
at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:137)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.lang.Thread.run(Thread.java:636)
The last one isn't a Tika problem; it's Nutch's issue...
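Since every thread in the dumps above is waiting on the same
sun.nio.cs.StandardCharsets monitor, one possible workaround is to remember
the result of each charset lookup so that repeated names never reach
charsetForName again. The following is only a rough sketch of that idea, not
code from Tika or Nutch: the class name CharsetCache and the method
forNameOrNull are made up for illustration, and it assumes Java 8 or later
(for Optional).

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not part of Tika or Nutch.
final class CharsetCache {
    // Optional.empty() marks a name that was looked up once and found
    // to be illegal or unsupported, so it is not looked up again.
    private static final ConcurrentMap<String, Optional<Charset>> CACHE =
            new ConcurrentHashMap<>();

    private CharsetCache() {}

    /** Returns the charset for the given name, or null if it is not usable. */
    static Charset forNameOrNull(String name) {
        if (name == null) {
            return null;
        }
        // Charset names are case-insensitive, so normalize the cache key.
        String key = name.trim().toLowerCase(Locale.ROOT);
        Optional<Charset> hit = CACHE.get(key);
        if (hit == null) {
            // Only a cache miss goes through the JVM-wide lookup.
            hit = Optional.ofNullable(lookup(key));
            Optional<Charset> previous = CACHE.putIfAbsent(key, hit);
            if (previous != null) {
                hit = previous;
            }
        }
        return hit.orElse(null);
    }

    private static Charset lookup(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and
            // UnsupportedCharsetException.
            return null;
        }
    }
}
```

A caller would then use CharsetCache.forNameOrNull(encoding) in place of
Charset.isSupported(encoding) followed by Charset.forName(encoding). Note
that the map is unbounded, so a crawler that sees many bogus encoding names
would probably want a size limit on it. It also does nothing for code paths
such as new InputStreamReader(stream, name) that do the lookup internally;
those would need to be handed a Charset object instead of a name.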
If there is anything else I can provide, please let me know.
Thanks
Brad
-----Original Message-----
From: brad [mailto:[email protected]]
Sent: Friday, July 23, 2010 8:37 PM
To: [email protected]
Subject: RE: Parsing Performance - related to Java concurrency issue
I'm just running Nutch as delivered. The information about the
org.apache.tika.parser.html.HtmlParser.getEncoding, etc is from running
jstack on the nutch process when it slowed down to a crawl...
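For anyone who wants to produce the same kind of per-call-site tally from
their own dump, here is a rough sketch of how it could be done. It is not the
exact procedure used for the numbers in this thread. It assumes the jstack
output has been saved unwrapped (one frame per line, as jstack prints it
rather than as the mail client wraps it), and it attributes each BLOCKED
thread to the first org.apache.* frame above charsetForName, which is a
guess at a grouping rule. The class name BlockedCallSites is made up.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for summarizing a saved jstack dump.
final class BlockedCallSites {
    private static final String LOCK_FRAME =
            "sun.nio.cs.FastCharsetProvider.charsetForName";
    // Assumption: the frame of interest is the first one in this namespace.
    private static final String APP_PREFIX = "org.apache.";

    private BlockedCallSites() {}

    /** Maps each call site to the number of BLOCKED threads waiting below it. */
    static Map<String, Integer> tally(List<String> jstackLines) {
        Map<String, Integer> counts = new TreeMap<>();
        boolean blocked = false;    // current thread is in state BLOCKED
        boolean inCharset = false;  // current thread is inside charsetForName
        for (String line : jstackLines) {
            String t = line.trim();
            if (t.startsWith("\"")) {
                // A quoted thread name starts a new thread entry.
                blocked = false;
                inCharset = false;
            } else if (t.startsWith("java.lang.Thread.State: BLOCKED")) {
                blocked = true;
            } else if (blocked && t.startsWith("at ")) {
                String frame = t.substring(3);
                if (frame.startsWith(LOCK_FRAME)) {
                    inCharset = true;
                } else if (inCharset && frame.startsWith(APP_PREFIX)) {
                    counts.merge(frame, 1, Integer::sum);
                    blocked = false;  // count each thread only once
                }
            }
        }
        return counts;
    }
}
```

The lines can be read with java.nio.file.Files.readAllLines on the saved
dump file and passed to BlockedCallSites.tally; printing the returned map
gives one call site and its count per entry.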
-----Original Message-----
From: Ken Krugler [mailto:[email protected]]
Sent: Friday, July 23, 2010 8:25 PM
To: [email protected]
Subject: Re: Parsing Performance - related to Java concurrency issue
Hi Brad,
On Jul 23, 2010, at 7:21pm, brad wrote:
> Hi Ken,
> Thanks for the info. I'm using Nutch 1.1, so I believe it is Tika
> 0.7? The jar files in my Nutch path are tika-core-0.7.jar and
> tika-parsers-0.7.jar.
> Is there a way to find out if it is actually pulling something different
> when executing?
Tika switched from Neko to TagSoup on 14/Oct/2009.
Tika 0.7 was released on April 3rd, 2010 so I would expect that if you are
using Tika 0.7, you'd be using TagSoup.
However I see the line that references Neko is this one:
>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:
So it's the Nutch HtmlParser that's using Neko.
Curious why you have both Nutch and Tika HtmlParser refs in your file, e.g.
I also see:
>> org.apache.tika.parser.html.HtmlParser.getEncoding
-- Ken
> The output of ps -ef | grep nutch includes
> -classpath ...:/usr/local/nutch/lib/tika-core-0.7.jar:...
> in the Nutch execution command line.
>
> My server does have
> /usr/local/solr/contrib/extraction/lib/tika-core-0.4.jar
> /usr/local/solr/contrib/extraction/lib/tika-parsers-0.4.jar
>
> But they are in neither the classpath nor the $PATH, so I doubt they
> are being picked up? Is there someplace else I should be looking?
>
> Thanks
> Brad
>
> -----Original Message-----
> From: Ken Krugler [mailto:[email protected]]
> Sent: Friday, July 23, 2010 6:38 PM
> To: [email protected]
> Subject: Re: Parsing Performance - related to Java concurrency issue
>
> Hi Brad,
>
> Thanks for the nice write-up, and the refs.
>
> I'll look into using a simple cache in Tika to avoid this type of
> blocking.
> Feel free to comment on https://issues.apache.org/jira/browse/TIKA-471
>
> Note that the Tika code base has changed from what it appears that
> you're using (e.g. the switch from Neko to TagSoup happened quite a
> while ago).
>
> -- Ken
>
> On Jul 23, 2010, at 3:51pm, brad wrote:
>
>> I'm continuing to have performance problems with parsing. I ran the
>> fetch process with -noParsing and got great performance. If I do the
>> same process with parsing left in, the fetching seems to be going
>> great, but as the process continues to run, everything slows down to
>> almost a dead stop.
>>
>> When I check the thread stack, I find that 1062 threads are blocked:
>> java.lang.Thread.State: BLOCKED (on object monitor)
>> at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>
>> Apparently this is a known issue with Java, and a couple articles are
>> written about it:
>> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
>> http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
>>
>> There is also a note in the Java bug database about scaling issues with
>> the class:
>> Please also note that the current implementation of
>> sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock
>> and is called very often (e.g. by new String(byte[] data,String
>> encoding)). This JVM-wide lock means that Java applications do not
>> scale beyond 4 CPU cores.
>>
>> In my stack at this particular point in time, the BLOCKED calls to
>> charsetForName were generated by the following call sites (the number
>> after each one is the count of blocked threads):
>>
>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84) 378
>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99) 61
>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133) 19
>> at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86) 238
>> at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310) 133
>> at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270) 8
>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253) 47
>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247) 19
>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227) 2
>> at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104) 7
>> at org.apache.hadoop.io.Text$1.initialValue(Text.java:54) 88
>> at org.apache.hadoop.io.Text.decode(Text.java:344) 2
>> at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161) 12
>> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192) 13
>> at org.apache.pdfbox.cos.COSString.getString(COSString.java:245) 3
>>
>> Is this an issue that only I'm facing? Is it worth looking at
>> alternatives as talked about in the articles? Or, just limit the
>> number of threads that are run? Right now it seems like the blocking is
>> causing problems unrelated to the general design and behavior of Nutch.
>>
>> Thoughts??
>>
>> Thanks
>> Brad
>>
>>
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g