Hi Brad & Julien,

Seems odd that nutch-default.xml has parse-html and parse-tika.

-- Ken

On Jul 24, 2010, at 11:45am, brad wrote:

Here is my plugin.includes from nutch-site.xml:

protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

Here is the one from nutch-default.xml:

protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

Do I need to change something on this?

Thanks
Brad
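
A minimal sketch of how one might check which parse plugins a plugin.includes value enables. It assumes the value is treated as a regular expression that must match a whole plugin id; the class name PluginIncludesCheck and the candidate id list are made up for illustration and are not part of Nutch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not a Nutch class.
final class PluginIncludesCheck {

    /** Returns the candidate plugin ids that the plugin.includes regex matches in full. */
    static List<String> enabled(String pluginIncludes, List<String> candidates) {
        Pattern includes = Pattern.compile(pluginIncludes);
        return candidates.stream()
                .filter(id -> includes.matcher(id).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The nutch-site.xml value quoted in this message, with the line wraps removed.
        String site = "protocol-http|urlfilter-regex|parse-(rss|text|html|js|tika)"
                + "|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)"
                + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";
        // Illustrative candidate ids; parse-pdf is included only as a non-matching example.
        List<String> candidates = Arrays.asList("parse-html", "parse-tika", "parse-pdf");
        System.out.println(enabled(site, candidates));
    }
}
```

Under that assumption both parse-html and parse-tika match, which is consistent with both parsers showing up in the stack traces. Dropping one of them from the parse-(...) group would be the way to run the single-parser comparisons.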


-----Original Message-----
From: Julien Nioche [mailto:[email protected]]
Sent: Saturday, July 24, 2010 11:09 AM
To: [email protected]
Subject: Re: Parsing Performance - related to Java concurrency issue

Hi guys,

Brad, thanks for sharing your observations with us, that's great. It looks
like we could definitely do without the lock on the charset.

The stack trace definitely shows that BOTH parse-html and parse-tika are
used, which should not happen. I wonder whether they are both called on
each document or alternately. I will have a look at it and see if I can
find an explanation for this.

It would be interesting to see, for a fetched segment:
- how long it takes to parse it when both parse-(html|tika) are in
plugin.includes
- the same with only parse-tika
- the same with only parse-html

Thanks

Jul

--
DigitalPebble Ltd

Open Source Solutions for Text Engineering http://www.digitalpebble.com


On 24 July 2010 18:33, Ken Krugler <[email protected]> wrote:

Hi Brad,


On Jul 23, 2010, at 8:50pm, brad wrote:

The items listed in the original email were just the location of the
original call.  Here is the actual jstack dump for the parseNeko and
the other.  The parseNeko apparently being directly called by the
Nutch parser, not Tika...


Thanks for the additional details.

Maybe Julien can provide an explanation for why there are both Tika
and Nutch HtmlParser references showing up - Julien?

-- Ken


"Thread-177094" daemon prio=10 tid=0x00002aab281fb000 nid=0x5638 waiting for monitor entry [0x00002aab82a18000..0x00002aab82a18b90]
   java.lang.Thread.State: BLOCKED (on object monitor)
     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
     at java.nio.charset.Charset.lookup2(Charset.java:468)
     at java.nio.charset.Charset.lookup(Charset.java:456)
     at java.nio.charset.Charset.isSupported(Charset.java:498)
     at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
     at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
     at org.cyberneko.html.HTMLScanner.setInputSource(HTMLScanner.java:774)
     at org.cyberneko.html.HTMLConfiguration.setInputSource(HTMLConfiguration.java:457)
     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:430)
     at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
     at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)
     at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:210)
     at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
     at java.lang.Thread.run(Thread.java:636)


Here are the main ones listed in the jira case:

"Thread-177096" daemon prio=10 tid=0x00002aab283e5400 nid=0x563a waiting for monitor entry [0x000000007318b000..0x000000007318bc90]
   java.lang.Thread.State: BLOCKED (on object monitor)
     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
     at java.nio.charset.Charset.lookup2(Charset.java:468)
     at java.nio.charset.Charset.lookup(Charset.java:456)
     at java.nio.charset.Charset.isSupported(Charset.java:498)
     at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:67)
     at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
     at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)
     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
     at java.lang.Thread.run(Thread.java:636)


"Thread-177079" daemon prio=10 tid=0x00002aab1c149000 nid=0x5629 waiting for monitor entry [0x00002aab8200e000..0x00002aab8200ec10]
   java.lang.Thread.State: BLOCKED (on object monitor)
     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
     at java.nio.charset.Charset.lookup2(Charset.java:468)
     at java.nio.charset.Charset.lookup(Charset.java:456)
     at java.nio.charset.Charset.isSupported(Charset.java:498)
     at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)
     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
     at java.lang.Thread.run(Thread.java:636)


"Thread-177029" daemon prio=10 tid=0x0000000017a67400 nid=0x55f7 waiting for monitor entry [0x00002aab8806f000..0x00002aab8806fb90]
   java.lang.Thread.State: BLOCKED (on object monitor)
     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
     at java.nio.charset.Charset.lookup2(Charset.java:468)
     at java.nio.charset.Charset.lookup(Charset.java:456)
     at java.nio.charset.Charset.isSupported(Charset.java:498)
     at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)
     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:181)
     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
     at java.lang.Thread.run(Thread.java:636)


"Thread-177090" daemon prio=10 tid=0x00002aab48060c00 nid=0x5634 waiting for monitor entry [0x00002aab82f1d000..0x00002aab82f1dd90]
   java.lang.Thread.State: BLOCKED (on object monitor)
     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
     - waiting to lock <0x00002aaace621488> (a sun.nio.cs.StandardCharsets)
     at java.nio.charset.Charset.lookup2(Charset.java:468)
     at java.nio.charset.Charset.lookup(Charset.java:456)
     at java.nio.charset.Charset.forName(Charset.java:521)
     at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)
     at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:137)
     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
     at java.lang.Thread.run(Thread.java:636)

The last one isn't a Tika problem, it's Nutch's issue...


If there is anything else I can provide, please let me know.

Thanks
Brad

-----Original Message-----
From: brad [mailto:[email protected]]
Sent: Friday, July 23, 2010 8:37 PM
To: [email protected]
Subject: RE: Parsing Performance - related to Java concurrency issue

I'm just running Nutch as delivered.  The information about
org.apache.tika.parser.html.HtmlParser.getEncoding, etc. is from
running jstack on the Nutch process when it slowed down to a crawl...

-----Original Message-----
From: Ken Krugler [mailto:[email protected]]
Sent: Friday, July 23, 2010 8:25 PM
To: [email protected]
Subject: Re: Parsing Performance - related to Java concurrency issue

Hi Brad,

On Jul 23, 2010, at 7:21pm, brad wrote:

Hi Ken,
Thanks for the info.  I'm using Nutch 1.1, so I believe it is Tika
0.7? The jar files in my Nutch path are tika-core-0.7.jar and
tika-parsers-0.7.jar.
Is there a way to find out if it is actually pulling in something
different when executing?


Tika switched from Neko to TagSoup on 14/Oct/2009.

Tika 0.7 was released on April 3rd, 2010 so I would expect that if
you are using Tika 0.7, you'd be using TagSoup.

However I see the line that references Neko is this one:

at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:


So it's the Nutch HtmlParser that's using Neko.

Curious why you have both Nutch and Tika HtmlParser refs in your
file, e.g.
I also see:

org.apache.tika.parser.html.HtmlParser.getEncoding


-- Ken

The output of ps -ef | grep nutch includes
-classpath ...:/usr/local/nutch/lib/tika-core-0.7.jar:...
in the Nutch execution command line.

My server does have
/usr/local/solr/contrib/extraction/lib/tika-core-0.4.jar
/usr/local/solr/contrib/extraction/lib/tika-parsers-0.4.jar

But they are not in the classpath, nor in the $PATH, so I doubt they
are being picked up?  Is there someplace else I should be looking?

Thanks
Brad

-----Original Message-----
From: Ken Krugler [mailto:[email protected]]
Sent: Friday, July 23, 2010 6:38 PM
To: [email protected]
Subject: Re: Parsing Performance - related to Java concurrency issue

Hi Brad,

Thanks for the nice write-up, and the refs.

I'll look into using a simple cache in Tika to avoid this type of
blocking.
Feel free to comment on
https://issues.apache.org/jira/browse/TIKA-471
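
A minimal sketch of the kind of simple cache mentioned above, not the actual Tika change. The class name CharsetCache and its methods are made up for illustration; the only JDK calls used are Charset.forName, the Charset-taking String constructor and ConcurrentHashMap. The idea is that each charset name goes through the JDK lookup (and its lock) once, and later lookups are served from a concurrent map.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not part of Tika or Nutch.
final class CharsetCache {

    // Keys are lower-cased names so "UTF-8" and "utf-8" share one entry.
    private static final ConcurrentMap<String, Charset> CACHE = new ConcurrentHashMap<>();

    /** Like Charset.forName, but a successful lookup is only done once per name. */
    static Charset forName(String name) {
        String key = name.toLowerCase(Locale.ROOT);
        Charset cached = CACHE.get(key);
        if (cached != null) {
            return cached;
        }
        // May contend on the JDK lock; throws for illegal or unsupported names.
        Charset resolved = Charset.forName(name);
        Charset previous = CACHE.putIfAbsent(key, resolved);
        return previous != null ? previous : resolved;
    }

    /**
     * Returns false for both unsupported and illegal names. This differs from
     * Charset.isSupported, which throws on illegal names. Failed lookups are
     * not cached in this sketch, so bad names still reach the JDK every time.
     */
    static boolean isSupported(String name) {
        try {
            forName(name);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = {72, 105}; // the bytes of "Hi" in ASCII/UTF-8
        // Passing a Charset object avoids the by-name lookup inside the constructor.
        String text = new String(data, forName("UTF-8"));
        System.out.println(text + " " + isSupported("UTF-8"));
    }
}
```

Callers would also need to pass the resulting Charset object to InputStreamReader and String rather than the name, since the name-based constructors are what lead to the lookups seen in the stack traces.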

Note that the Tika code base has changed from what it appears that
you're using (e.g. the switch from Neko to TagSoup happened quite a
while ago).

-- Ken

On Jul 23, 2010, at 3:51pm, brad wrote:

I'm continuing to have performance problems with parsing.  I ran the
fetch process with -noParsing and got great performance.  If I do
the same process with parsing left in, the fetching seems to be
going great, but as the process continues to run, everything slows
down to almost a dead stop.

When I check the thread stack, I find that 1062 threads are blocked:
   java.lang.Thread.State: BLOCKED (on object monitor)
     at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)

Apparently this is a known issue with Java, and a couple of articles
have been written about it:
http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html

There is also a note in the Java bug database about scaling issues with
the class:
Please also note that the current implementation of
sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock
and is called very often (e.g. by new String(byte[] data, String
encoding)). This JVM-wide lock means that Java applications do not
scale beyond 4 CPU cores.

In my stack dump at this particular point in time, the BLOCKED calls
to charsetForName were generated by the following frames (the number
after each frame is the count of blocked threads):

at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)  378
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)  61
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)  19
at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)  238
at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)  133
at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270)  8
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)  47
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247)  19
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227)  2
at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104)  7
at org.apache.hadoop.io.Text$1.initialValue(Text.java:54)  88
at org.apache.hadoop.io.Text.decode(Text.java:344)  2
at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161)  12
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192)  13
at org.apache.pdfbox.cos.COSString.getString(COSString.java:245)  3
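
A per-caller tally like the one above could be produced from a saved jstack dump with something along these lines. This is only a sketch: the class name BlockedCallerTally is made up, it expects each frame on a single "at ..." line, and it attributes every thread blocked in charsetForName to its first frame outside java.* and sun.*, which is not necessarily how the counts above were produced.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for reading a jstack dump saved to a text file.
final class BlockedCallerTally {

    /** Counts, per calling frame, the threads BLOCKED inside charsetForName. */
    static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        boolean blocked = false;    // current thread reported state BLOCKED
        boolean topFrame = false;   // the next "at" line is the thread's top frame
        boolean searching = false;  // still looking for the first non-JDK frame
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // A quoted thread name starts a new thread section.
                blocked = false;
                searching = false;
                topFrame = true;
            } else if (line.startsWith("java.lang.Thread.State:")) {
                blocked = line.contains("BLOCKED");
            } else if (line.startsWith("at ")) {
                String frame = line.substring(3).trim();
                if (topFrame) {
                    topFrame = false;
                    searching = blocked
                            && frame.startsWith("sun.nio.cs.FastCharsetProvider.charsetForName");
                } else if (searching && !frame.startsWith("java.") && !frame.startsWith("sun.")) {
                    counts.merge(frame, 1, Integer::sum);
                    searching = false;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BlockedCallerTally <jstack-dump.txt>");
            return;
        }
        Map<String, Integer> counts = tally(Files.readAllLines(Paths.get(args[0])));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "  " + e.getValue());
        }
    }
}
```

Grouping by the first non-JDK frame makes it easy to see which library code triggers the lookups, at the cost of hiding which JDK entry point (Charset.forName, Charset.isSupported or InputStreamReader) was used.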

Is this an issue that only I'm facing?  Is it worth looking at the
alternatives discussed in the articles?  Or should I just limit the
number of threads that are run?  Right now it seems like the blocking
is causing a problem unrelated to the general design and behavior of
Nutch.

Thoughts??

Thanks
Brad



--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





