RE: For HTML - is parse-html twice as fast as parse-tika

brad Wed, 11 Aug 2010 14:38:22 -0700

Hi Ken,
I haven't had a chance yet.  I'm working on some compression issues.  I'll
put it on my calendar for next week.


Even though the results may not have been as accurate because the parse
included the fetch, I felt pretty comfortable with the numbers.  I switched
my configuration from the Tika HTML parser to the Nutch HTML parser and all
of the fetch/parse runs have been faster.  I have also replaced Tika's
commons-compress-1.0.jar with the pre-release commons-compress-1.1.jar, which
has helped.

Brad


-----Original Message-----
From: Ken Krugler [mailto:[email protected]] 
Sent: Wednesday, August 11, 2010 2:20 PM
To: [email protected]
Subject: Re: For HTML - is parse-html twice as fast as parse-tika

Hi Brad,

On Aug 2, 2010, at 9:26am, brad wrote:

> Hi Julien,
> I'll see if I can give a try later this week.

[snip]

Were you able to try the parse-only approach that Julien suggested below?

I'm asking because (a) I do a fair amount of work with/on the Tika HTML
parsing support, and (b) I've also run into surprisingly slow parse
performance with Tika, though I didn't compare it to Nutch's older parser
(or to using NekoHTML instead of TagSoup).

Thanks,

-- Ken
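
The parse-only run that Julien describes in the quoted message below could be scripted roughly as follows. This is only a sketch under assumptions: the helper names, the `bin/nutch` path and the segment directory are hypothetical, and it assumes the segment has already been fetched so that only the parse step is being timed.

```python
import shutil
import subprocess
import time
from pathlib import Path

# Parse output subdirectories named in Julien's message.
PARSE_SUBDIRS = ("parse_text", "parse_data", "crawl_parse")


def clear_parse_output(segment_dir):
    """Delete existing parse output from one segment; return the names removed."""
    removed = []
    for name in PARSE_SUBDIRS:
        path = Path(segment_dir) / name
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(name)
    return removed


def time_command(command):
    """Run a command to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


def time_parse_only(segment_dir, nutch_bin="bin/nutch"):
    """Re-parse an already fetched segment and return the elapsed seconds.

    Assumption: nutch_bin points at the Nutch launcher script and
    segment_dir is a fetched segment; both are placeholders here.
    """
    clear_parse_output(segment_dir)
    return time_command([nutch_bin, "parse", str(segment_dir)])
```

Running `time_parse_only` once with each parser configuration on the same segment would give two timings that are not affected by network or caching effects.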


> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Monday, August 02, 2010 5:11 AM
> To: [email protected]
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> Could you run and measure the parser independently of the fetching?  That
> would remove any possible side effect due to caching, network issues, etc.
>
> All you need to do is remove the subdirectories parse_text, parse_data and
> crawl_parse, then run: nutch parse
>
> Thanks
>
> Julien
>
> PS: regarding parse-html being phased out: see Andrzej's JIRA from this
> morning
>
>
> On 31 July 2010 22:43, brad <[email protected]> wrote:
>
>>> I have been experiencing some performance issues with Tika and 
>>> general parsing (see Parsing Performance - related to Java 
>>> concurrency issue)
>>>
>>> Ken pointed out that both the Tika and Nutch HtmlParser show up in my
>>> jstack list using the delivered configuration.
>>>
>>> Julien suggested checking parsing with only parse-tika (html) and 
>>> then with parse-html.
>>>
>>> So here is what I did.
>>>
>>> Option 1) parse-tika
>>>          parse-(rss|text|js|tika)
>>>          parse-plugin.xml as delivered
>>>          tika-mimetypes.xml as delivered
>>>
>>> Option 2) parse-html
>>>          parse-(rss|text|html|js|tika)
>>>          parse-plugin.xml turned ON <plugin id="parse-html" />
>>>          tika-mimetypes.xml commented out <mime-type type="text/html">
>>>
>>> Using the same generated crawl, I ran fetch with parse for each of the
>>> options for 2 hours.
>>> All other configurations and settings were identical.
>>>
>>> Results:
>>> Parse-tika
>>> INFO  mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s
>>>
>>> Parse-html
>>> INFO  mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s,
>>>
>>>
>>> The results:
>>> Parse-html is 116% faster than parse-tika for HTML over the same
>>> period of time and the same URLs.
>>>
>>> The error rate was about the same: parse-html 3%, parse-tika 3.3%.
>>> Most of the errors are read timeouts.
>>>
>>>
>>> So is parse-html better?  It appears to be faster.  But, is the data 
>>> as good?
>>> Other considerations?  Is parse-html really going to be phased out?
>>>
>>> Brad
>>>
>>>
>>>
>>
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
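
For anyone wanting to re-check the figures quoted in this thread, the arithmetic can be reproduced from the two LocalJobRunner status lines. The sketch below is not from the original messages: the function names are made up for illustration, and it assumes the error rate is simply errors divided by pages. With those assumptions the throughput gain comes out at about 116%, and the error rates at about 3.4% (parse-tika) and 3.1% (parse-html), close to the rounded figures Brad gives.

```python
def percent_faster(baseline, candidate):
    """Relative throughput gain of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


def error_rate_percent(errors, pages):
    """Errors as a percentage of pages (assumed definition, see note above)."""
    return errors / pages * 100.0


# Values copied from the two status lines quoted in the thread.
runs = {
    "parse-tika": {"pages": 200370, "errors": 6756, "pages_per_s": 27.8},
    "parse-html": {"pages": 433738, "errors": 13360, "pages_per_s": 60.1},
}

speedup = percent_faster(
    runs["parse-tika"]["pages_per_s"], runs["parse-html"]["pages_per_s"]
)
print(f"parse-html throughput gain: {speedup:.0f}%")

for name, run in runs.items():
    rate = error_rate_percent(run["errors"], run["pages"])
    print(f"{name} error rate: {rate:.1f}%")
```

Since these numbers come from runs where fetch and parse were combined, they measure the whole fetch/parse pipeline rather than parser speed alone.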
