I finally had a chance to test the Nutch HTML parsing without fetching,
per Julien's suggestion.  The results were pretty much the same as in my
previous tests:

                    parse-html    Tika-html
Elapsed Time:       04:21:47      08:55:57
Parse (Success):    150,634       150,615
Parse (Failed):     3,788         3,807

So, based on this test, parse-html is a little more than twice as fast as
Tika's HTML parsing.
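
As a quick check on that ratio, here is a minimal Python sketch that converts
the two elapsed times above to seconds and divides them. The helper name
to_seconds is only for illustration and is not part of Nutch; the document
totals are the success and failed counts from the table added together.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Figures copied from the results table above.
html_secs = to_seconds("04:21:47")
tika_secs = to_seconds("08:55:57")
html_docs = 150634 + 3788   # parse-html: success + failed
tika_docs = 150615 + 3807   # Tika-html: success + failed

# How many times longer the Tika run took than the parse-html run.
speedup = tika_secs / html_secs

print("parse-html: %.1f docs/s" % (html_docs / html_secs))
print("Tika-html:  %.1f docs/s" % (tika_docs / tika_secs))
print("speedup:    %.2fx" % speedup)
```

The ratio comes out at roughly 2.05, and both runs processed the same total
number of documents, so the elapsed times are directly comparable.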

This was done on Linux CentOS 5.5, 8 GB RAM, Intel Xeon CPU X3220 @ 2.40GHz.
Only Nutch-related processes were running on the server.
Nutch 1.2 - which now has the nice timings feature!

The data was retrieved using:
bin/nutch fetch <segment> -noParsing -threads 200

All data was parsed using:
bin/nutch parse <segment> -threads 200
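
For anyone who wants to repeat the parse-only run on a segment that has
already been parsed, the reset step Julien describes further down (delete
parse_text, parse_data and crawl_parse, then parse again) could be scripted
roughly like this. It is only a sketch: the function names are hypothetical,
it assumes the segment lives on the local filesystem rather than HDFS, and
the command line is simply the one shown above.

```python
import shutil
import subprocess
import time
from pathlib import Path

# Subdirectories written by the parse step, as listed in Julien's message.
PARSE_DIRS = ("parse_text", "parse_data", "crawl_parse")

def clear_parse_output(segment):
    """Delete existing parse output from a segment; return the names removed."""
    removed = []
    for name in PARSE_DIRS:
        target = Path(segment) / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed

def reparse(segment, nutch="bin/nutch"):
    """Clear old parse output, re-run the parse, and return elapsed seconds."""
    clear_parse_output(segment)
    start = time.monotonic()
    # Same arguments as the parse command quoted in this message.
    subprocess.run([nutch, "parse", str(segment), "-threads", "200"], check=True)
    return time.monotonic() - start
```

Calling reparse() once per plugin configuration on the same segment would
give two wall-clock figures to compare, with no fetching involved.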

Brad
 

-----Original Message-----
From: Ken Krugler [mailto:[email protected]] 
Sent: Wednesday, August 11, 2010 2:20 PM
To: [email protected]
Subject: Re: For HTML - is parse-html twice as fast as parse-tika

Hi Brad,

On Aug 2, 2010, at 9:26am, brad wrote:

> Hi Julien,
> I'll see if I can give it a try later this week.

[snip]

Were you able to try the parse-only approach that Julien suggested below?

I'm asking because (a) I do a fair amount of work with/on the Tika HTML
parsing support, and (b) I've also run into surprisingly slow parse
performance with Tika, though I didn't compare to Nutch's older parser (or
using NekoHTML instead of TagSoup).

Thanks,

-- Ken


> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Monday, August 02, 2010 5:11 AM
> To: [email protected]
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> Could you run and measure the parser independently of the fetching?
> That would remove any possible side effects due to caching, network
> issues, etc.
>
> All you need to do is remove the subdirectories parse_text, parse_data
> and crawl_parse, then run: nutch parse
>
> Thanks
>
> Julien
>
> PS: regarding parse-html being phased out : see Andrzej's JIRA from 
> this morning
>
>
> On 31 July 2010 22:43, brad <[email protected]> wrote:
>
>>> I have been experiencing some performance issues with Tika and 
>>> general parsing (see Parsing Performance - related to Java 
>>> concurrency issue)
>>>
>>> Ken pointed out that both the Tika and Nutch HtmlParser show up in my
>>> jstack list using the delivered configuration.
>>>
>>> Julien suggested checking parsing with only parse-tika (html) and 
>>> then with parse-html.
>>>
>>> So here is what I did.
>>>
>>> Option 1) parse-tika
>>>          parse-(rss|text|js|tika)
>>>          parse-plugin.xml as delivered
>>>          tika-mimetypes.xml as delivered
>>>
>>> Option 2) parse-html
>>>          parse-(rss|text|html|js|tika)
>>>          parse-plugin.xml turned ON <plugin id="parse-html" />
>>>          tika-mimetypes.xml commented out <mime-type type="text/html">
>>>
>>> Using the same generated crawl, I ran fetch with parse for each of
>>> the options for 2 hours.
>>> All other configurations and settings were identical.
>>>
>>> Results:
>>> Parse-tika
>>> INFO  mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s
>>>
>>> Parse-html
>>> INFO  mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s,
>>>
>>>
>>> The results:
>>> Parse-html is 116% faster than parse-tika for HTML over the same
>>> period of time and the same URLs.
>>>
>>> The error rate was about the same: parse-html 3%, parse-tika 3.3%.
>>> Most of the errors are read timeouts.
>>>
>>>
>>> So is parse-html better?  It appears to be faster.  But, is the data 
>>> as good?
>>> Other considerations?  Is parse-html really going to be phased out?
>>>
>>> Brad
>>>
>>>
>>>
>>
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




