Guys,

There are certainly overheads in using the distributed mode (communication
with the servers, moving the job file around, unpacking it, etc.),
but before we start talking about optimisation and efficiency you need
to first get an understanding of what actually takes the time, by
profiling / jstack-ing. It could depend on the backend used, the number
of tasks running in parallel, and so on.
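
As a rough illustration of the jstack approach (a sketch, not Nutch-specific:
the "Child" pattern assumes Hadoop 0.20.x/1.x task JVMs, and the output file
names are made up):

```shell
#!/bin/sh
# Sketch: grab a few thread dumps from running Hadoop task JVMs so you can
# see where the generate phase actually spends its time. Assumes the JDK's
# jps/jstack tools are on the PATH; "Child" is the task JVM main class in
# Hadoop 0.20.x/1.x (an assumption -- adjust the pattern for your version).
dumps=0
for pid in $(jps 2>/dev/null | awk '/Child/ {print $1}'); do
  # Take 3 dumps per task, a few seconds apart, so recurring hot stacks
  # stand out from transient ones.
  for i in 1 2 3; do
    jstack "$pid" > "jstack-$pid-$i.txt" 2>/dev/null && dumps=$((dumps + 1))
    sleep 5
  done
done
echo "captured $dumps thread dump(s)"
```

Stacks that show up in most dumps (e.g. stuck in datastore I/O vs. actual
CPU work) tell you whether the time goes to the backend or to the mapred
overhead itself.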

Julien


On 2 October 2012 09:52, Lewis John Mcgibbney <[email protected]> wrote:

> Hi,
>
> Yeah, with 2.x head, generating most certainly takes a good deal longer
> on a 2-core machine (with Hadoop 1.0.1) in pseudo-distributed mode than
> on 1 core in local mode. I don't have concrete stats, however; these are
> just my manual observations. I notice this regardless of the size of the
> list to be generated, e.g. I still see a significant increase in CPU
> whether I'm generating fetchlists from a small list of injected urls
> (10, for example) or generating larger lists from iterative crawl
> cycles (several hundred/thousand).
>
> Do you have any ideas or suggestions about mitigating this, Markus,
> in an attempt to drive efficiency during the generate phase?
>
> Thanks
>
> Lewis
>
> On Tue, Oct 2, 2012 at 8:30 AM, Markus Jelsma
> <[email protected]> wrote:
> > Hi - I don't know 2.0, but Hadoop's Mapred is likely just taking
> advantage of multiple CPU cores.
> >
> > -----Original message-----
> >> From:[email protected] <[email protected]>
> >> Sent: Tue 02-Oct-2012 04:15
> >> To: [email protected]
> >> Subject: nutch-2.0  generate in  deploy mode
> >>
> >> Hello,
> >>
> >> I use nutch-2.0 with hadoop-0.20.2. The bin/nutch generate command
> takes 87% of CPU in deploy mode versus 18% in local mode.
> >> Any ideas how to fix this issue?
> >>
> >> Thanks.
> >> Alex.
> >>
>
>
>
> --
> Lewis
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble