Hi Brad,
On Jul 23, 2010, at 7:21pm, brad wrote:
Hi Ken,
Thanks for the info. I'm using Nutch 1.1, so I believe it is Tika 0.7? The jar files in my Nutch path are tika-core-0.7.jar and tika-parsers-0.7.jar. Is there a way to find out if it is actually pulling in something different when executing?
Tika switched from Neko to TagSoup on 14/Oct/2009.
Tika 0.7 was released on April 3rd, 2010, so I would expect that if you are using Tika 0.7, you'd be using TagSoup.
However, I see that the line referencing Neko is this one:
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:
So it's the Nutch HtmlParser that's using Neko.
I'm curious why you have both Nutch and Tika HtmlParser refs in your file; e.g. I also see:
org.apache.tika.parser.html.HtmlParser.getEncoding
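One quick way to answer the "which jar is it actually pulling in" question is to ask the JVM directly where a class was loaded from. A sketch using only the JDK (the Tika/Nutch class names from your stack trace are the interesting ones to pass in; java.util.HashMap is just a fallback demo class):

```java
// Prints the jar (or directory) a class was actually loaded from at runtime.
// Run it with the same classpath as the Nutch job, e.g.:
//   java -cp <nutch classpath> WhichJar org.apache.tika.parser.html.HtmlParser
public class WhichJar {

    // Returns a description of where the named class was loaded from.
    static String locationOf(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            java.security.CodeSource src =
                clazz.getProtectionDomain().getCodeSource();
            // Bootstrap/platform classes typically report no code source.
            return (src == null || src.getLocation() == null)
                ? "(bootstrap classpath)" : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath: " + className + ")";
        }
    }

    public static void main(String[] args) {
        String[] names = args.length > 0
            ? args : new String[] { "java.util.HashMap" };
        for (String name : names) {
            System.out.println(name + " -> " + locationOf(name));
        }
    }
}
```

Running that inside the Nutch job's classpath would show definitively which tika-core jar is winning.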
-- Ken
The ps -ef | grep nutch output includes -classpath ...:/usr/local/nutch/lib/tika-core-0.7.jar:... in the Nutch execution command line.
My server does have
/usr/local/solr/contrib/extraction/lib/tika-core-0.4.jar
/usr/local/solr/contrib/extraction/lib/tika-parsers-0.4.jar
But they are not in the classpath, nor in $PATH, so I doubt they are being picked up? Is there someplace else I should be looking?
Thanks
Brad
-----Original Message-----
From: Ken Krugler [mailto:[email protected]]
Sent: Friday, July 23, 2010 6:38 PM
To: [email protected]
Subject: Re: Parsing Performance - related to Java concurrency issue
Hi Brad,
Thanks for the nice write-up, and the refs.
I'll look into using a simple cache in Tika to avoid this type of blocking.
Feel free to comment on https://issues.apache.org/jira/browse/TIKA-471
Note that the Tika code base has changed from the version you appear to be using (e.g. the switch from Neko to TagSoup happened quite a while ago).
-- Ken
On Jul 23, 2010, at 3:51pm, brad wrote:
I'm continuing to have performance problems with parsing. I ran the
fetch process with -noParsing and got great performance. If I do the
same process with parsing left in, the fetching seems to be going
great, but as the process continues to run, everything slows down to
almost a dead stop.
When I check the thread stack, I find that 1062 threads are blocked:
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
Apparently this is a known issue with Java, and a couple of articles have been written about it:
http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
There is also a note in the Java bug database about scaling issues with the class:

"Please also note that the current implementation of sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock and is called very often (e.g. by new String(byte[] data, String encoding)). This JVM-wide lock means that Java applications do not scale beyond 4 CPU cores."
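As that note suggests, the by-name lookup can be avoided entirely on hot paths: the String constructor overload that takes a Charset object (available since Java 6) skips the name resolution, so the Charset can be resolved once up front and reused. A minimal sketch (names illustrative):

```java
import java.nio.charset.Charset;

// Decode bytes using a pre-resolved Charset object instead of a name,
// so the locked FastCharsetProvider.charsetForName() path is avoided
// on every decode.
public class CharsetLookup {

    // Resolve the charset once up front; reuse it for every decode.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    static String decode(byte[] data) {
        // new String(data, "UTF-8") would go through the by-name lookup
        // (and its JVM-wide lock); the Charset overload skips it.
        return new String(data, UTF8);
    }

    public static void main(String[] args) {
        byte[] data = "caf\u00e9".getBytes(UTF8);
        System.out.println(decode(data));
    }
}
```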
I noted that, in my stack at this particular point in time, the BLOCKED calls to charsetForName were generated by:
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84) 378
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99) 61
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133) 19
at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86) 238
at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310) 133
at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270) 8
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253) 47
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247) 19
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227) 2
at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104) 7
at org.apache.hadoop.io.Text$1.initialValue(Text.java:54) 88
at org.apache.hadoop.io.Text.decode(Text.java:344) 2
at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161) 12
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192) 13
at org.apache.pdfbox.cos.COSString.getString(COSString.java:245) 3
Is this an issue that only I'm facing? Is it worth looking at the alternatives discussed in the articles? Or should I just limit the number of threads that are run? Right now it seems like the blocking is causing problems unrelated to the general design and behavior of Nutch.
Thoughts?
Thanks
Brad
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
--------------------------------------------