I'm just running Nutch as delivered.  The information about
org.apache.tika.parser.html.HtmlParser.getEncoding, etc. is from running
jstack on the Nutch process when it slowed down to a crawl...
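
In case it helps anyone reproduce the tally: this is roughly how I count
thread states from a dump (a sketch only -- the inline sample dump and file
name are stand-ins; capture a real dump with jstack <pid> > jstack.out and
adjust the patterns to your JVM's output format):

```shell
# Stand-in for a real dump captured with: jstack <pid> > jstack.out
cat > jstack.out <<'EOF'
   java.lang.Thread.State: BLOCKED (on object monitor)
	at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.read(SocketInputStream.java:129)
   java.lang.Thread.State: BLOCKED (on object monitor)
	at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
EOF

# Tally threads by state (BLOCKED, RUNNABLE, WAITING, ...):
grep 'java.lang.Thread.State' jstack.out | awk '{print $2}' | sort | uniq -c | sort -rn

# Count BLOCKED threads whose next frame is charsetForName
# (-A1 prints the state line plus the frame below it):
grep -A1 'BLOCKED' jstack.out | grep -c 'FastCharsetProvider.charsetForName'
```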

-----Original Message-----
From: Ken Krugler [mailto:[email protected]] 
Sent: Friday, July 23, 2010 8:25 PM
To: [email protected]
Subject: Re: Parsing Performance - related to Java concurrency issue

Hi Brad,

On Jul 23, 2010, at 7:21pm, brad wrote:

> Hi Ken,
> Thanks for the info.  I'm using Nutch 1.1, so I believe it is Tika 
> 0.7?  The jar files in my Nutch path are tika-core-0.7.jar and tika- 
> parsers-0.7.jar.
> Is there a way to find out if it is actually pulling something different 
> when executing?

Tika switched from Neko to TagSoup on 14/Oct/2009.

Tika 0.7 was released on April 3rd, 2010, so I would expect that if you are
using Tika 0.7, you'd be using TagSoup.

However, the line that references Neko is this one:

>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:

So it's the Nutch HtmlParser that's using Neko.

I'm curious why you have both Nutch and Tika HtmlParser refs in your file,
e.g. I also see:

>> org.apache.tika.parser.html.HtmlParser.getEncoding
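
By the way, the "simple cache" I mentioned for TIKA-471 would look
something like this -- just a sketch under my assumptions, not actual Tika
code, and the class/method names here are hypothetical:

```java
import java.nio.charset.Charset;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a charset lookup cache. Charset.forName() funnels
// into sun.nio.cs.FastCharsetProvider.charsetForName(), which takes a
// JVM-wide lock; putting a ConcurrentHashMap in front means each charset
// name only pays that cost once per JVM instead of once per document.
public class CharsetCache {
    private static final ConcurrentHashMap<String, Charset> CACHE =
            new ConcurrentHashMap<String, Charset>();

    public static Charset forName(String name) {
        Charset cs = CACHE.get(name);
        if (cs == null) {
            cs = Charset.forName(name);           // the contended call, done once
            Charset prev = CACHE.putIfAbsent(name, cs);
            if (prev != null) {
                cs = prev;                        // another thread won the race
            }
        }
        return cs;
    }
}
```

Callers would then decode with new String(bytes, CharsetCache.forName(enc));
the String(byte[], Charset) constructor (Java 6+) skips the provider lookup
entirely, so only cache misses ever touch the global lock.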

-- Ken

> The output of ps -ef | grep nutch
> includes -classpath ...:/usr/local/nutch/lib/tika-core-0.7.jar:...
> in the Nutch execution command line
>
> My server does have
> /usr/local/solr/contrib/extraction/lib/tika-core-0.4.jar
> /usr/local/solr/contrib/extraction/lib/tika-parsers-0.4.jar
>
> But, they are not in the classpath, nor in $PATH, so I doubt they are 
> being picked up.  Is there someplace else I should be looking?
>
> Thanks
> Brad
>
> -----Original Message-----
> From: Ken Krugler [mailto:[email protected]]
> Sent: Friday, July 23, 2010 6:38 PM
> To: [email protected]
> Subject: Re: Parsing Performance - related to Java concurrency issue
>
> Hi Brad,
>
> Thanks for the nice write-up, and the refs.
>
> I'll look into using a simple cache in Tika to avoid this type of 
> blocking.
> Feel free to comment on https://issues.apache.org/jira/browse/TIKA-471
>
> Note that the Tika code base has changed from what it appears that 
> you're using (e.g. the switch from Neko to TagSoup happened quite a 
> while ago).
>
> -- Ken
>
> On Jul 23, 2010, at 3:51pm, brad wrote:
>
>> I'm continuing to have performance problems with parsing.  I ran the 
>> fetch process with -noParsing and got great performance.  If I do the 
>> same process with parsing left in, the fetching seems to be going 
>> great, but as the process continues to run, everything slows down to 
>> almost a dead stop.
>>
>> When I check the thread stack, I find that 1062 threads are blocked:
>> java.lang.Thread.State: BLOCKED (on object monitor)
>>      at sun.nio.cs.FastCharsetProvider.charsetForName(FastCharsetProvider.java:135)
>>
>> Apparently this is a known issue with Java, and a couple of articles 
>> have been written about it:
>> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
>> http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
>>
>> There is also a note in the Java bug database about scaling issues with 
>> the class:
>> "Please also note that the current implementation of
>> sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock
>> and is called very often (e.g. by new String(byte[] data, String
>> encoding)). This JVM-wide lock means that Java applications do not
>> scale beyond 4 CPU cores."
>>
>> I noted, in the case of my stack at this particular point in time, that 
>> the BLOCKED calls to charsetForName were generated by the following 
>> call sites (trailing counts as in my dump):
>>
>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84)  378
>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99)  61
>> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133)  19
>> at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)  238
>> at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)  133
>> at org.apache.pdfbox.pdfparser.PDFParser.skipToNextObj(PDFParser.java:270)  8
>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:253)  47
>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:247)  19
>> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:227)  2
>> at org.apache.pdfbox.cos.COSDocument.<init>(COSDocument.java:104)  7
>> at org.apache.hadoop.io.Text$1.initialValue(Text.java:54)  88
>> at org.apache.hadoop.io.Text.decode(Text.java:344)  2
>> at org.apache.tika.parser.xml.XMLParser.getDefaultSAXParserFactory(XMLParser.java:161)  12
>> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:192)  13
>> at org.apache.pdfbox.cos.COSString.getString(COSString.java:245)  3
>>
>> Is this an issue that only I'm facing?  Is it worth looking at 
>> alternatives as discussed in the articles?  Or should I just limit the 
>> number of threads that are run?  Right now it seems like the block is 
>> causing problems unrelated to the general design and behavior of Nutch.
>>
>> Thoughts??
>>
>> Thanks
>> Brad
>>
>>
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




