RE: For HTML - is parse-html twice as fast as parse-tika

brad Mon, 02 Aug 2010 09:27:01 -0700

Hi Julien,
I'll see if I can give it a try later this week.

I'm having a problem in the mapred.LocalJobRunner - reduce > reduce portion,
right after the actual URL fetch/parse portion is complete.  I don't know
how long this portion is supposed to take, but I have had fetches run for
12 hours and the map-reduce portion run for 36 hours without completing.
I ended up killing the job.


Right now, I'm running a fetch on 1 million URLs.  The parse and fetch
portion took less than 7 hours, but the map-reduce has been running for 11
hours now and I'm going to wait and see if it completes.

It started at the completion of fetcher.Fetcher:
2010-08-01 22:06:43,479 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2010-08-01 22:06:44,368 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2010-08-01 22:06:44,369 INFO  fetcher.Fetcher - -activeThreads=0
2010-08-01 22:06:44,369 INFO  mapred.MapTask - Starting flush of map output
2010-08-01 22:06:45,129 INFO  mapred.LocalJobRunner - 0 threads, 853809 pages, 18772 errors, 35.4 pages/s, 16989 kb/s,

The issue appears to start with this line:
2010-08-01 23:22:22,174 INFO  mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 31012166567 bytes

Since then the process has been cycling on this message for 10 hours:
INFO  mapred.LocalJobRunner - reduce > reduce

I'm running Nutch on a single server.
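
As a rough sanity check on the log figures above, the following sketch converts the reported page count and average rate into elapsed fetch/parse time, and the size of the last merge segment into GiB. The variable names are illustrative; only the numbers come from the log lines.

```python
# Figures copied from the log lines above; variable names are illustrative.
pages_fetched = 853809        # "853809 pages"
pages_per_second = 35.4       # "35.4 pages/s"
merge_bytes = 31012166567     # total size of the last merge segment

# Elapsed fetch/parse time implied by the average rate.
fetch_hours = pages_fetched / pages_per_second / 3600

# Size of the single segment that the reduce phase has to work through.
merge_gib = merge_bytes / 2**30

print(f"fetch/parse time: {fetch_hours:.1f} h")
print(f"final merge segment: {merge_gib:.1f} GiB")
```

The implied fetch/parse time comes out just under 7 hours, which agrees with the figure quoted in the message, and the reduce phase is working on a single segment of roughly 29 GiB.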

Thanks
Brad


-----Original Message-----
From: Julien Nioche [mailto:[email protected]] 
Sent: Monday, August 02, 2010 5:11 AM
To: [email protected]
Subject: Re: For HTML - is parse-html twice as fast as parse-tika

Hi Brad,

Could you run and measure the parser independently of the fetching? That
would remove any possible side effects due to caching, network issues, etc.

All you need to do is remove the subdirectories parse_text, parse_data and
crawl_parse, then run: nutch parse
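
A minimal sketch of that procedure, assuming the three subdirectories live inside a single segment directory and that the parse command takes that directory as its argument. The function names and the `nutch_cmd` parameter are hypothetical helpers for illustration, not part of Nutch.

```python
import shutil
import subprocess
import time
from pathlib import Path

# Subdirectories named in the message above; they hold earlier parse output.
PARSE_SUBDIRS = ("parse_text", "parse_data", "crawl_parse")


def clear_parse_output(segment_dir):
    """Delete existing parse output from one segment; return the names removed."""
    removed = []
    for name in PARSE_SUBDIRS:
        target = Path(segment_dir) / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed


def time_parse(nutch_cmd, segment_dir):
    """Run '<nutch_cmd> parse <segment_dir>' and return the wall-clock seconds.

    Assumption: nutch_cmd is the path to the local nutch launcher script.
    """
    start = time.perf_counter()
    subprocess.run([nutch_cmd, "parse", str(segment_dir)], check=True)
    return time.perf_counter() - start


# Example use (the segment path is a placeholder, not taken from the thread):
#   clear_parse_output("crawl/segments/SEGMENT")
#   seconds = time_parse("bin/nutch", "crawl/segments/SEGMENT")
```

Running this once per parser configuration on the same segment would give parse timings that are not affected by the network.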

Thanks

Julien

PS: regarding parse-html being phased out, see Andrzej's JIRA from this
morning.


On 31 July 2010 22:43, brad <[email protected]> wrote:

> > I have been experiencing some performance issues with Tika and 
> > general parsing (see Parsing Performance - related to Java 
> > concurrency issue)
> >
> > Ken pointed out that both the Tika and Nutch HtmlParser show up in my
> > jstack list using the delivered configuration.
> >
> > Julien suggested checking parsing with only parse-tika (html) and 
> > then with parse-html.
> >
> > So here is what I did.
> >
> > Option 1) parse-tika
> >           parse-(rss|text|js|tika)
> >           parse-plugins.xml as delivered
> >           tika-mimetypes.xml as delivered
> >
> > Option 2) parse-html
> >           parse-(rss|text|html|js|tika)
> >           parse-plugins.xml turned ON <plugin id="parse-html" />
> >           tika-mimetypes.xml commented out <mime-type type="text/html">
> >
> > Using the same generated crawl, I ran fetch with parse for each of the
> > options for 2 hours.
> > All other configurations and settings are identical.
> >
> > Results:
> > Parse-tika
> > INFO  mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s
> >
> > Parse-html
> > INFO  mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s,
> >
> >
> > Summary:
> > Parse-html is 116% faster than parse-tika (about 2.2x the throughput)
> > for HTML over the same period of time and the same URLs.
> >
> > The error rate was about the same: parse-html 3.1%, parse-tika 3.4%.
> > Most of the errors are read timeouts.
> >
> >
> > So is parse-html better?  It appears to be faster.  But, is the data 
> > as good?
> > Other considerations?  Is parse-html really going to be phased out?
> >
> > Brad
> >
> >
> >
>
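
The comparison in the quoted message can be re-derived from its two LocalJobRunner status lines. The sketch below does that arithmetic; the dictionary layout and names are illustrative, and only the page and error counts come from the quoted logs.

```python
# Figures copied from the two quoted LocalJobRunner lines; names are illustrative.
runs = {
    "parse-tika": {"pages": 200370, "errors": 6756},
    "parse-html": {"pages": 433738, "errors": 13360},
}

# Both runs lasted the same 2 hours, so the page ratio is the throughput ratio.
speedup = runs["parse-html"]["pages"] / runs["parse-tika"]["pages"]
percent_faster = (speedup - 1) * 100

# Errors as a percentage of pages processed in each run.
error_rates = {name: 100 * r["errors"] / r["pages"] for name, r in runs.items()}

print(f"parse-html handled {speedup:.2f}x the pages ({percent_faster:.0f}% more)")
for name, rate in error_rates.items():
    print(f"{name}: {rate:.1f}% errors")
```

Because the fetch and parse ran together in both cases, this ratio reflects the whole fetch-plus-parse pipeline rather than the parser in isolation.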



--
DigitalPebble Ltd

Open Source Solutions for Text Engineering http://www.digitalpebble.com
