Hi Ken,
Thanks for the info.  I'm using Nutch 1.1, so I believe it is Tika 0.7?  The
jar files in my Nutch path are tika-core-0.7.jar and tika-parsers-0.7.jar.
Is there a way to find out whether it is actually pulling in something
different when executing?

The output of ps -ef | grep nutch includes
-classpath ...:/usr/local/nutch/lib/tika-core-0.7.jar:... in the Nutch
execution command line.

My server does also have:
/usr/local/solr/contrib/extraction/lib/tika-core-0.4.jar
/usr/local/solr/contrib/extraction/lib/tika-parsers-0.4.jar

But they are neither in the classpath nor in $PATH, so I doubt they are
being picked up.  Is there someplace else I should be looking?
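
One way to check this from inside the JVM is to ask a class where it was
loaded from. The following is only a minimal sketch: the class name WhichJar
and the helper locationOf are made up for illustration, and the default class
name org.apache.tika.parser.html.HtmlParser is simply taken from the stack
trace quoted below. It assumes the Tika jars are on the classpath it is run
with.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical helper, not part of Nutch or Tika.
class WhichJar {
    /**
     * Returns the location (jar or directory) a class was loaded from,
     * or null when the JVM does not record one (e.g. core JDK classes).
     */
    static URL locationOf(Class<?> clazz) {
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        return source == null ? null : source.getLocation();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Default class name is an assumption; pass another one as the first argument.
        String name = args.length > 0
                ? args[0]
                : "org.apache.tika.parser.html.HtmlParser";
        // Throws ClassNotFoundException if the class is not on the classpath.
        Class<?> clazz = Class.forName(name);
        System.out.println(name + " loaded from " + locationOf(clazz));
    }
}
```

Running it with the same -classpath value that ps shows for Nutch should print
the jar that wins for that class. It only reports what that classpath resolves
to, so it would not see jars added by some other mechanism at run time.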

Thanks
Brad

-----Original Message-----
From: Ken Krugler [mailto:[email protected]] 
Sent: Friday, July 23, 2010 6:38 PM
To: [email protected]
Subject: Re: Parsing Performance - related to Java concurrency issue

Hi Brad,

Thanks for the nice write-up, and the refs.

I'll look into using a simple cache in Tika to avoid this type of blocking.
Feel free to comment on https://issues.apache.org/jira/browse/TIKA-471
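
To make the idea concrete, here is a rough sketch of what such a cache could
look like. It is an illustration only, not the change tracked in TIKA-471; the
class name CharsetCache and the method forName are hypothetical. It assumes
callers can pass a Charset object around instead of an encoding name.

```java
import java.nio.charset.Charset;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not an existing Tika or Nutch class.
class CharsetCache {
    private static final ConcurrentMap<String, Charset> CACHE =
            new ConcurrentHashMap<String, Charset>();

    /**
     * Looks up a charset by name, going through Charset.forName only the
     * first time a given name is seen. Unknown or illegal names still throw
     * the usual exceptions from Charset.forName and are not cached.
     */
    static Charset forName(String name) {
        Charset cached = CACHE.get(name);
        if (cached == null) {
            cached = Charset.forName(name);
            // Two threads may race here; both store an equivalent Charset,
            // so the duplicate put is harmless.
            CACHE.put(name, cached);
        }
        return cached;
    }
}
```

A caller would then write new String(bytes, CharsetCache.forName("UTF-8"))
rather than new String(bytes, "UTF-8"), so repeated decodes skip the lookup by
name. The key is the name exactly as given, so aliases and different spellings
of the same charset each get their own entry.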

Note that the Tika code base has changed from what it appears that you're
using (e.g. the switch from Neko to TagSoup happened quite a while ago).

-- Ken

On Jul 23, 2010, at 3:51pm, brad wrote:

> I'm continuing to have performance problems with parsing.  I ran the 
> fetch process with -noParsing and got great performance.  If I do the 
> same process with parsing left in, the fetching seems to be going 
> great, but as the process continues to run, everything slows down to 
> almost a dead stop.
>
> When I check the thread stack, I find that 1062 threads are blocked:
> java.lang.Thread.State: BLOCKED (on object monitor)
>       at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>
> Apparently this is a known issue with Java, and a couple of articles
> have been written about it:
> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
> http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
>
> There is also a note in the Java bug database about scaling issues with
> this class:
> "Please also note that the current implementation of
> sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock
> and is called very often (e.g. by new String(byte[] data, String
> encoding)). This JVM-wide lock means that Java applications do not
> scale beyond 4 CPU cores."
>
> In my stack dump at this particular point in time, the BLOCKED calls
> to charsetForName were generated by:
>
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84) 378
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99) 61
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133) 19
> at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86) 238
> at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310) 133
> at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270) 8
> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253) 47
> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247) 19
> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227) 2
> at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104) 7
> at org.apache.hadoop.io.Text$1.initialValue(Text.java:54) 88
> at org.apache.hadoop.io.Text.decode(Text.java:344) 2
> at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161) 12
> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192) 13
> at org.apache.pdfbox.cos.COSString.getString(COSString.java:245) 3
>
> Is this an issue that only I'm facing?  Is it worth looking at
> alternatives as discussed in the articles?  Or should I just limit the
> number of threads that are run?  Right now it seems like the blocking is
> causing problems unrelated to the general design and behavior of Nutch.
>
> Thoughts??
>
> Thanks
> Brad
>
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




