Hi Brad,
On Jul 23, 2010, at 7:21pm, brad wrote:
Hi Ken,
Thanks for the info. I'm using Nutch 1.1, so I believe it is Tika 0.7? The jar files in my Nutch path are tika-core-0.7.jar and tika-parsers-0.7.jar. Is there a way to find out if it is actually pulling in something different when executing?
Tika switched from Neko to TagSoup on 14/Oct/2009.
Tika 0.7 was released on April 3rd, 2010, so I would expect that if you are using Tika 0.7, you'd be using TagSoup.
However, I see that the line referencing Neko is this one:
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:
So it's the Nutch HtmlParser that's using Neko.
I'm curious why you have both Nutch and Tika HtmlParser refs in your file; e.g. I also see:
org.apache.tika.parser.html.HtmlParser.getEncoding
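One quick way to answer the "which jar is it actually pulling in" question is to ask the JVM directly where a class was loaded from. A sketch using only the JDK (the Tika/Nutch class names from your stack trace are the interesting ones to pass in; java.util.HashMap is just a fallback demo class):

```java
// Prints the jar (or directory) a class was actually loaded from at runtime.
// Run it with the same classpath as the Nutch job, e.g.:
//   java -cp <nutch classpath> WhichJar org.apache.tika.parser.html.HtmlParser
public class WhichJar {

    // Returns a description of where the named class was loaded from.
    static String locationOf(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            java.security.CodeSource src =
                clazz.getProtectionDomain().getCodeSource();
            // Bootstrap/platform classes typically report no code source.
            return (src == null || src.getLocation() == null)
                ? "(bootstrap classpath)" : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath: " + className + ")";
        }
    }

    public static void main(String[] args) {
        String[] names = args.length > 0
            ? args : new String[] { "java.util.HashMap" };
        for (String name : names) {
            System.out.println(name + " -> " + locationOf(name));
        }
    }
}
```

Running that inside the Nutch job's classpath would show definitively which tika-core jar is winning.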
-- Ken
The ps -ef | grep nutch output includes -classpath ...:/usr/local/nutch/lib/tika-core-0.7.jar:... in the Nutch execution command line.
My server does have
/usr/local/solr/contrib/extraction/lib/tika-core-0.4.jar
/usr/local/solr/contrib/extraction/lib/tika-parsers-0.4.jar
But they are not in the classpath, nor in $PATH, so I doubt they are being picked up? Is there someplace else I should be looking?
Thanks
Brad
-----Original Message-----
From: Ken Krugler [mailto:[email protected]]
Sent: Friday, July 23, 2010 6:38 PM
To: [email protected]
Subject: Re: Parsing Performance - related to Java concurrency issue
Hi Brad,
Thanks for the nice write-up, and the refs.
I'll look into using a simple cache in Tika to avoid this type of blocking.
Feel free to comment on https://issues.apache.org/jira/browse/TIKA-471
Note that the Tika code base has changed from the version you appear to be using (e.g. the switch from Neko to TagSoup happened quite a while ago).
-- Ken
On Jul 23, 2010, at 3:51pm, brad wrote:
I'm continuing to have performance problems with parsing. I ran the
fetch process with -noParsing and got great performance. If I do the
same process with parsing left in, the fetching seems to be going
great, but as the process continues to run, everything slows down to
almost a dead stop.
When I check the thread stack, I find that 1062 threads are blocked:
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
Apparently this is a known issue with Java, and a couple of articles have been written about it:
http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
There is also a note in the Java bug database about scaling issues with the class:

"Please also note that the current implementation of sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock and is called very often (e.g. by new String(byte[] data, String encoding)). This JVM-wide lock means that Java applications do not scale beyond 4 CPU cores."
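As that note suggests, the by-name lookup can be avoided entirely on hot paths: the String constructor overload that takes a Charset object (available since Java 6) skips the name resolution, so the Charset can be resolved once up front and reused. A minimal sketch (names illustrative):

```java
import java.nio.charset.Charset;

// Decode bytes using a pre-resolved Charset object instead of a name,
// so the locked FastCharsetProvider.charsetForName() path is avoided
// on every decode.
public class CharsetLookup {

    // Resolve the charset once up front; reuse it for every decode.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    static String decode(byte[] data) {
        // new String(data, "UTF-8") would go through the by-name lookup
        // (and its JVM-wide lock); the Charset overload skips it.
        return new String(data, UTF8);
    }

    public static void main(String[] args) {
        byte[] data = "caf\u00e9".getBytes(UTF8);
        System.out.println(decode(data));
    }
}
```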
I noted that, in my stack at this particular point in time, the BLOCKED calls to charsetForName were generated by:
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84) 378
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99) 61
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133) 19
at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86) 238
at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310) 133
at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270) 8
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253) 47
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247) 19
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227) 2
at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104) 7
at org.apache.hadoop.io.Text$1.initialValue(Text.java:54) 88
at org.apache.hadoop.io.Text.decode(Text.java:344) 2
at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161) 12
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192) 13
at org.apache.pdfbox.cos.COSString.getString(COSString.java:245) 3
Is this an issue that only I'm facing? Is it worth looking at the alternatives discussed in the articles? Or should I just limit the number of threads that are run? Right now it seems like the blocking is causing problems unrelated to the general design and behavior of Nutch.
Thoughts?
Thanks
Brad
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
--------------------------------------------