Hi,

Try running jstack against the pid of the task to get a better idea of what
it is doing. My bet is that the normalisation of some long URLs is taking
ages, but it could be a lot of other things.
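
A single thread dump can be misleading, so it may help to take a few dumps
some seconds apart and see which frames keep showing up on top. Below is a
minimal sketch of that idea, not a definitive tool. It assumes jstack (from
the JDK) is on the PATH and that the dump has the usual layout: a quoted
thread name on a header line, followed by "at ..." frame lines. The function
names top_frames and sample are made up for this illustration.

```python
import subprocess
from collections import Counter


def top_frames(dump: str) -> Counter:
    """Count the topmost stack frame of every thread in a jstack dump."""
    counts = Counter()
    expect_frame = False
    for line in dump.splitlines():
        stripped = line.strip()
        if stripped.startswith('"'):
            # Thread header line: the next "at ..." line is its top frame.
            expect_frame = True
        elif expect_frame and stripped.startswith("at "):
            counts[stripped[3:]] += 1
            expect_frame = False
    return counts


def sample(pid: int) -> Counter:
    """Run jstack once against the given JVM pid and summarise the dump."""
    result = subprocess.run(
        ["jstack", str(pid)], capture_output=True, text=True, check=True
    )
    return top_frames(result.stdout)


# Usage (the pid is whatever `jps` or `ps` reports for the Nutch JVM):
#   for frame, n in sample(12345).most_common(5):
#       print(n, frame)
```

If the same frame stays on top across several samples, that is a reasonable
hint about where the reduce step is spending its time.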

J.

On 2 August 2010 17:26, brad <[email protected]> wrote:

> Hi Julien,
> I'll see if I can give a try later this week.
>
> I'm having a problem in the mapred.LocalJobRunner - reduce > reduce portion
> right after the actual URL fetch/parse portion is complete.  I don't know
> how long this portion is supposed to take, but I have had fetches run for
> 12 hours and the map-reduce portion run for 36 hours and still not
> complete.  I ended up killing the job.
>
> Right now, I'm running a fetch on 1 million URLs.  The parse and fetch
> portion took less than 7 hours, but the map-reduce has been running for 11
> hours now and I'm going to wait and see if it completes.
>
> It started at the completion of fetcher.Fetcher:
> 2010-08-01 22:06:43,479 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=0
> 2010-08-01 22:06:44,368 INFO  fetcher.Fetcher - -activeThreads=0,
> spinWaiting=0, fetchQueues.totalSize=0
> 2010-08-01 22:06:44,369 INFO  fetcher.Fetcher - -activeThreads=0
> 2010-08-01 22:06:44,369 INFO  mapred.MapTask - Starting flush of map output
> 2010-08-01 22:06:45,129 INFO  mapred.LocalJobRunner - 0 threads, 853809
> pages, 18772 errors, 35.4 pages/s, 16989 kb/s,
>
> The issue appears to start with
> 2010-08-01 23:22:22,174 INFO  mapred.Merger - Down to the last merge-pass,
> with 1 segments left of total size: 31012166567 bytes
>
> Now the process has been cycling on for 10 hours:
> INFO  mapred.LocalJobRunner - reduce > reduce
>
> I'm running Nutch on a single server.
>
> Thanks
> Brad
>
>
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Monday, August 02, 2010 5:11 AM
> To: [email protected]
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> Could you run and measure the parser independently of the fetching? That
> would remove any possible side effect due to caching, network issues etc...
>
> All you need to do is remove the subdirectories parse_text, parse_data and
> crawl_parse from the segment, then run: nutch parse
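
The re-parse step described above could be scripted roughly as follows, so
that the parse time is measured on its own. This is only a sketch under a few
assumptions: the segment path is passed in by the caller, the launcher is
reachable as bin/nutch, and the helper names clear_parse_output and
timed_parse are invented for this example.

```python
import shutil
import subprocess
import time
from pathlib import Path

# The three parse output subdirectories named in the message above.
PARSE_DIRS = ("parse_text", "parse_data", "crawl_parse")


def clear_parse_output(segment: Path) -> list:
    """Delete existing parse output from a segment; return what was removed."""
    removed = []
    for name in PARSE_DIRS:
        target = segment / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed


def timed_parse(segment: Path, nutch: str = "bin/nutch") -> float:
    """Re-run the parse on one segment and return the elapsed seconds."""
    clear_parse_output(segment)
    start = time.monotonic()
    # Assumed invocation: `nutch parse <segment>`.
    subprocess.run([nutch, "parse", str(segment)], check=True)
    return time.monotonic() - start
```

Running timed_parse once per parser configuration on a copy of the same
segment would give two numbers that are not affected by the network.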
>
> Thanks
>
> Julien
>
> PS: regarding parse-html being phased out : see Andrzej's JIRA from this
> morning
>
>
> On 31 July 2010 22:43, brad <[email protected]> wrote:
>
> > > I have been experiencing some performance issues with Tika and
> > > general parsing (see Parsing Performance - related to Java
> > > concurrency issue)
> > >
> > > > Ken pointed out that both the Tika and Nutch HtmlParser show up in
> > > > my jstack list using the delivered configuration.
> > >
> > > Julien suggested checking parsing with only parse-tika (html) and
> > > then with parse-html.
> > >
> > > So here is what I did.
> > >
> > > > Option 1) parse-tika
> > > >           parse-(rss|text|js|tika)
> > > >           parse-plugins.xml as delivered
> > > >           tika-mimetypes.xml as delivered
> > > >
> > > > Option 2) parse-html
> > > >           parse-(rss|text|html|js|tika)
> > > >           parse-plugins.xml turned ON <plugin id="parse-html" />
> > > >           tika-mimetypes.xml commented out <mime-type type="text/html">
> > > > Using the same generated crawl, I ran fetch with parse for each of
> > > > the options for 2 hours.
> > > > All other configurations and settings were identical.
> > >
> > > Results:
> > > > Parse-tika
> > > > INFO  mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors,
> > > > 27.8 pages/s, 12916 kb/s
> > > >
> > > > Parse-html
> > > > INFO  mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors,
> > > > 60.1 pages/s, 27980 kb/s,
> > >
> > >
> > > > Summary:
> > > > Parse-html is 116% faster than parse-tika for html for the same
> > > > period of time and the same URLs.
> > > >
> > > > The error rate was about the same: parse-html 3%, parse-tika 3.3%.
> > > > Most of the errors are read timeouts.
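
The figures quoted above can be re-derived from the two LocalJobRunner lines.
The short check below is only an illustration. It assumes that "faster" means
the ratio of pages processed in the same two hours, and that the error rate
is errors divided by pages, which the original post does not state
explicitly. The function names are made up for this sketch.

```python
def percent_faster(new_pages: int, old_pages: int) -> float:
    """How much more work was done in the same time, as a percentage."""
    return (new_pages / old_pages - 1.0) * 100.0


def error_rate(errors: int, pages: int) -> float:
    """Errors as a percentage of pages (assumed definition)."""
    return errors / pages * 100.0


# Counts copied from the two log lines in the message.
tika_pages, tika_errors = 200370, 6756
html_pages, html_errors = 433738, 13360

speedup = percent_faster(html_pages, tika_pages)
print(f"parse-html vs parse-tika: {speedup:.1f}% faster")
print(f"parse-tika error rate: {error_rate(tika_errors, tika_pages):.2f}%")
print(f"parse-html error rate: {error_rate(html_errors, html_pages):.2f}%")
```

Under those assumptions the page counts agree with the pages/s figures in the
logs (60.1 / 27.8 gives roughly the same ratio).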
> > >
> > >
> > > So is parse-html better?  It appears to be faster.  But, is the data
> > > as good?
> > > Other considerations?  Is parse-html really going to be phased out?
> > >
> > > Brad
> > >
> > >
> > >
> >
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering http://www.digitalpebble.com
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
