I finally had a chance to test the Nutch HTML parsing without fetching, per
Julien's suggestion. The results were pretty much the same as in my previous
tests:
                  parse-html   Tika-html
Elapsed Time:     04:21:47     08:55:57
Parse (Success):  150,634      150,615
Parse (failed):   3,788        3,807
So, based on this test, parse-html is a little more than twice as fast as
Tika's HTML parsing.
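
For anyone who wants to check the arithmetic, here is a minimal Python sketch that derives the speedup and failure rates from the figures above. The names (to_seconds, runs, summarize) are made up for illustration and are not part of Nutch or Tika.

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS elapsed-time string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures copied from the parse-only test results above.
runs = {
    "parse-html": {"elapsed": "04:21:47", "ok": 150_634, "failed": 3_788},
    "Tika-html": {"elapsed": "08:55:57", "ok": 150_615, "failed": 3_807},
}


def summarize(run: dict) -> dict:
    """Derive elapsed seconds, documents per second and failure rate."""
    seconds = to_seconds(run["elapsed"])
    total = run["ok"] + run["failed"]
    return {
        "seconds": seconds,
        "total": total,
        "docs_per_sec": total / seconds,
        "failure_rate": run["failed"] / total,
    }


summary = {name: summarize(run) for name, run in runs.items()}

# Ratio of elapsed times: how many times longer the Tika run took.
speedup = summary["Tika-html"]["seconds"] / summary["parse-html"]["seconds"]

if __name__ == "__main__":
    for name, stats in summary.items():
        print(f"{name}: {stats['docs_per_sec']:.2f} docs/s, "
              f"{stats['failure_rate']:.2%} failed")
    print(f"Tika-html took {speedup:.2f}x as long as parse-html")
```

Both runs cover the same number of documents (success plus failed), so the ratio of elapsed times is a fair comparison.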
This was done on Linux CentOS 5.5, 8 GB RAM, Intel Xeon CPU X3220 @ 2.40GHz.
Only Nutch-related processes were running on the server.
Nutch 1.2 - which now has the nice timings feature!
The data was retrieved using:
bin/nutch fetch <segment> -noParsing -threads 200
All data was parsed using:
bin/nutch parse <segment> -threads 200
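
To repeat the parse step on the same segment with a different parser, the old parse output has to be cleared first (the subdirectories Julien names below: parse_text, parse_data and crawl_parse). A rough Python sketch of that cleanup, assuming the segment sits on the local filesystem rather than HDFS; clear_parse_output is a hypothetical helper, not a Nutch command:

```python
import shutil
from pathlib import Path

# Parse output subdirectories named in Julien's instructions.
PARSE_DIRS = ("parse_text", "parse_data", "crawl_parse")


def clear_parse_output(segment) -> list:
    """Delete existing parse output from a segment directory.

    Assumes a segment on the local filesystem (hypothetical helper).
    Returns the names of the subdirectories that were removed; fetched
    content and other subdirectories are left untouched.
    """
    removed = []
    for name in PARSE_DIRS:
        path = Path(segment) / name
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(name)
    return removed
```

After clearing, the parse command above can be run again on the same segment.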
Brad
-----Original Message-----
From: Ken Krugler [mailto:[email protected]]
Sent: Wednesday, August 11, 2010 2:20 PM
To: [email protected]
Subject: Re: For HTML - is parse-html twice as fast as parse-tika
Hi Brad,
On Aug 2, 2010, at 9:26am, brad wrote:
> Hi Julien,
> I'll see if I can give a try later this week.
[snip]
Were you able to try the parse-only approach that Julien suggested below?
I'm asking because (a) I do a fair amount of work with/on the Tika HTML
parsing support, and (b) I've also run into surprisingly slow parse
performance with Tika, though I didn't compare to Nutch's older parser (or
using NekoHTML instead of TagSoup).
Thanks,
-- Ken
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Monday, August 02, 2010 5:11 AM
> To: [email protected]
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> Could you run and measure the parser independently of the fetching? That
> would remove any possible side effect due to caching, network issues
> etc...
>
> All you need to do is remove the subdirectories parse_text, parse_data
> and crawl_parse then run : nutch parse
>
> Thanks
>
> Julien
>
> PS: regarding parse-html being phased out : see Andrzej's JIRA from
> this morning
>
>
> On 31 July 2010 22:43, brad <[email protected]> wrote:
>
>>> I have been experiencing some performance issues with Tika and
>>> general parsing (see Parsing Performance - related to Java
>>> concurrency issue)
>>>
>>> Ken pointed out that both the Tika and Nutch HtmlParser show up in my
>>> jstack list using the delivered configuration.
>>>
>>> Julien suggested checking parsing with only parse-tika (html) and
>>> then with parse-html.
>>>
>>> So here is what I did.
>>>
>>> Option 1) parse-tika
>>> parse-(rss|text|js|tika)
>>> parse-plugin.xml as delivered
>>> tika-mimetypes.xml as delivered
>>>
>>> Option 2) parse-html
>>> parse-(rss|text|html|js|tika)
>>> parse-plugin.xml turned ON <plugin id="parse-html" />
>>> tika-mimetypes.xml commented out <mime-type type="text/html">
>>>
>>> Using the same generated crawl, ran fetch with parse for each of the
>>> options for 2 hours.
>>> All other configurations and settings are identical
>>>
>>> Results:
>>> Parse-tika
>>> INFO mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors,
>>> 27.8 pages/s, 12916 kb/s
>>>
>>> Parse-html
>>> INFO mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors,
>>> 60.1 pages/s, 27980 kb/s,
>>>
>>>
>>> The results:
>>> Parse-html is 116% faster than parse-tika for html for the same
>>> period of time and same URLs
>>>
>>> The error rate was about the same: parse-html 3%, parse-tika 3.3%.
>>> Most of the errors are read timeouts.
>>>
>>>
>>> So is parse-html better? It appears to be faster. But, is the data
>>> as good?
>>> Other considerations? Is parse-html really going to be phased out?
>>>
>>> Brad
>>>
>>>
>>>
>>
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g