On Thu, Dec 15, 2011 at 12:47 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> This is overwhelmingly weighted towards Hadoop configuration.
>
> There are some guidance notes on the Nutch wiki for performance issues
> so you may wish to give them a try first.
> --
>  Lewis
>

I'm assuming you're referring to this page?
http://wiki.apache.org/nutch/OptimizingCrawls


On Thu, Dec 15, 2011 at 2:01 PM, Markus Jelsma
<[email protected]> wrote:

> Well, if performance is low it's likely not a Hadoop issue. Hadoop tuning is
> only required if you start pushing it to limits.
>
> I would indeed check the Nutch wiki. Settings such as threads, queues etc.
> are very important.
>
>
I did end up tweaking some of the Hadoop settings, as it looked like it was
thrashing the disk because the map tasks were not being spread out.


On Thu, Dec 15, 2011 at 3:00 PM, Julien Nioche <
[email protected]> wrote:

>
> Having beefy machines is not going to be very useful for the fetching step
> which is IO bound and usually takes most of the time.
> How big is your crawldb?  How long do the generate / parse and update steps
> take? Having more than one machine won't make a massive difference if your
> crawldb or segments are small.
>
> Julien
>
>
The machines were all I had handy to make the cluster with.


I'm looking at the timings for a recent job, and here's what I'm seeing.  This
is with 12k URLs queued by domain, with a max of 50 URLs per domain.
I know why the fetcher takes so long.  Most of the fetcher map tasks finish
in 3-4 minutes, but 1-2 always end up stuck on a single site and take an
additional ten minutes to work through the remaining URLs.  I'm not
sure how to fix that.
The crawldb had around 1.2 million URLs in it when I looked this afternoon.
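
As a rough illustration of why a single site can hold up a whole fetch task: if all of one domain's URLs land in the same queue and requests to it are spaced by a fixed politeness delay, the queue cannot drain faster than the delay allows. The sketch below is only back-of-the-envelope arithmetic. The function name and the 5-second delay are assumptions made for illustration, not values taken from this cluster, and server response time is ignored.

```python
def min_queue_seconds(urls_in_queue, delay_seconds):
    """Lower bound on the time to drain one per-site queue.

    Assumes a single thread per queue and a fixed delay between
    consecutive requests (hypothetical model, response time ignored).
    """
    if urls_in_queue <= 0:
        return 0.0
    # The first request goes out immediately; each later one waits for the delay.
    return (urls_in_queue - 1) * float(delay_seconds)


# 50 URLs per domain (from the run described above), assumed 5 s delay.
print(min_queue_seconds(50, 5.0))
```

With a slower server or a longer delay the figure grows in proportion, which would be consistent with one or two tasks running well past the others.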

(All times are Thu Dec 15 2011, EST.)

Job                                                Status     Start     Finish
nutch-1.4.job                                      SUCCEEDED  16:14:30  16:14:44
generate: select from crawl/crawldb                SUCCEEDED  16:14:45  16:16:17
generate: partition crawl/segments/20111215161618  SUCCEEDED  16:16:19  16:16:42
fetch crawl/segments/20111215161618                SUCCEEDED  16:16:44  16:33:29
parse crawl/segments/20111215161618                SUCCEEDED  16:33:30  16:35:11
crawldb crawl/crawldb                              SUCCEEDED  16:35:12  16:36:37
linkdb crawl/linkdb                                SUCCEEDED  16:36:38  16:36:58
linkdb merge crawl/linkdb                          SUCCEEDED  16:36:59  16:38:27
index-solr http://solr:8080/solr                   SUCCEEDED  16:38:28  16:38:56
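
To make the split between steps easier to see, here is a small sketch that turns the start and finish times from the job list above into per-step durations and shares of the wall-clock time. The short step labels and helper names are my own shorthand, not anything Nutch or Hadoop produces.

```python
from datetime import datetime

# Start and finish times copied from the job list above (Thu Dec 15 2011, EST).
JOBS = [
    ("nutch-1.4.job", "16:14:30", "16:14:44"),
    ("generate: select", "16:14:45", "16:16:17"),
    ("generate: partition", "16:16:19", "16:16:42"),
    ("fetch", "16:16:44", "16:33:29"),
    ("parse", "16:33:30", "16:35:11"),
    ("crawldb", "16:35:12", "16:36:37"),
    ("linkdb", "16:36:38", "16:36:58"),
    ("linkdb merge", "16:36:59", "16:38:27"),
    ("index-solr", "16:38:28", "16:38:56"),
]


def seconds_between(start, finish):
    """Elapsed seconds between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(finish, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def step_durations(jobs):
    """Map each step label to its duration in seconds."""
    return {name: seconds_between(start, finish) for name, start, finish in jobs}


durations = step_durations(JOBS)
# Wall clock runs from the first job's start to the last job's finish.
wall_clock = seconds_between(JOBS[0][1], JOBS[-1][2])

for name, secs in durations.items():
    print(f"{name:<20} {secs:>5d} s  {100 * secs / wall_clock:5.1f}%")
```

On these numbers the fetch step accounts for 1005 of the 1466 seconds of wall-clock time, a little over two thirds of the run.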
