Hi Brad,
On Aug 2, 2010, at 9:26am, brad wrote:
Hi Julien,
I'll see if I can give it a try later this week.
[snip]
Were you able to try the parse-only approach that Julien suggested
below?
I'm asking because (a) I do a fair amount of work with/on the Tika
HTML parsing support, and (b) I've also run into surprisingly slow
parse performance with Tika, though I didn't compare to Nutch's older
parser (or using NekoHTML instead of TagSoup).
Thanks,
-- Ken
-----Original Message-----
From: Julien Nioche [mailto:[email protected]]
Sent: Monday, August 02, 2010 5:11 AM
To: [email protected]
Subject: Re: For HTML - is parse-html twice as fast as parse-tika
Hi Brad,
Could you run and measure the parser independently of the fetching?
That would remove any possible side effect due to caching, network
issues, etc.
All you need to do is remove the subdirectories parse_text,
parse_data and crawl_parse, then run: nutch parse
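
The parse-only run described above can be scripted so that both plugin configurations are timed the same way. This is only a rough sketch: the segment path, the `bin/nutch` location, and passing the segment directory as the argument to `nutch parse` are assumptions to adjust for the local install, and the helper names are made up for illustration.

```python
import shutil
import subprocess
import time
from pathlib import Path
from typing import List

# Parse output directories named in the message above.
PARSE_DIRS = ("parse_text", "parse_data", "crawl_parse")


def clear_parse_output(segment: Path) -> List[str]:
    """Delete existing parse output from one segment; return the names removed."""
    removed = []
    for name in PARSE_DIRS:
        target = segment / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed


def time_parse(segment: Path, nutch: str = "bin/nutch") -> float:
    """Run the parse step on its own and return elapsed wall-clock seconds.

    Assumes the command takes the segment directory as its only argument.
    """
    start = time.perf_counter()
    subprocess.run([nutch, "parse", str(segment)], check=True)
    return time.perf_counter() - start
```

For each configuration, one would call `clear_parse_output(segment)` and then `time_parse(segment)` on the same already-fetched segment, so that network and caching effects drop out of the comparison.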
Thanks
Julien
PS: regarding parse-html being phased out: see Andrzej's JIRA from
this morning
On 31 July 2010 22:43, brad <[email protected]> wrote:
I have been experiencing some performance issues with Tika and
general parsing (see "Parsing Performance - related to Java
concurrency issue").
Ken pointed out that both the Tika and Nutch HtmlParser show up in
my jstack list using the delivered configuration.
Julien suggested checking parsing with only parse-tika (html) and
then with parse-html.
So here is what I did.
Option 1) parse-tika
  parse-(rss|text|js|tika)
  parse-plugin.xml as delivered
  tika-mimetypes.xml as delivered
Option 2) parse-html
  parse-(rss|text|html|js|tika)
  parse-plugin.xml turned ON <plugin id="parse-html" />
  tika-mimetypes.xml commented out <mime-type type="text/html">
Using the same generated crawl, I ran fetch with parsing enabled for
each of the options for 2 hours.
All other configurations and settings were identical.
Results:
Parse-tika
INFO mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s
Parse-html
INFO mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s
The results:
Parse-html is 116% faster than parse-tika for HTML over the same
period of time and the same URLs (60.1 vs. 27.8 pages/s).
The error rates were about the same: parse-html 3%, parse-tika 3.3%.
Most of the errors are read timeouts.
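
For anyone wanting to check the arithmetic, here is a small sketch that derives the speedup and error rates from the two LocalJobRunner lines above. The function names are made up for illustration; the numbers are copied from the log output. The speedup comes out to roughly 116%, and the unrounded error rates are about 3.1% for parse-html and 3.4% for parse-tika.

```python
def error_rate(pages: int, errors: int) -> float:
    """Errors as a percentage of pages processed."""
    return 100.0 * errors / pages


def percent_faster(fast: float, slow: float) -> float:
    """How much faster `fast` is than `slow`, as a percentage."""
    return 100.0 * (fast / slow - 1.0)


# Figures copied from the two log lines quoted above.
tika = {"pages": 200370, "errors": 6756, "pages_per_s": 27.8}
html = {"pages": 433738, "errors": 13360, "pages_per_s": 60.1}

speedup = percent_faster(html["pages_per_s"], tika["pages_per_s"])
print(f"parse-html vs parse-tika: {speedup:.0f}% faster")
print(f"parse-tika error rate: {error_rate(tika['pages'], tika['errors']):.1f}%")
print(f"parse-html error rate: {error_rate(html['pages'], html['errors']):.1f}%")
```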
So is parse-html better? It appears to be faster. But, is the data
as good?
Other considerations? Is parse-html really going to be phased out?
Brad
--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g