Hi,

-----Original message-----
> From: kiran chitturi <[email protected]>
> Sent: Sun 18-Nov-2012 18:38
> To: [email protected]
> Subject: Best practices for running Nutch
>
> Hi!
>
> I have been running crawls using Nutch for 13,000 documents (protocol
> http) on a single machine and it takes 2-3 days to finish. I am using
> the 2.x version of Nutch.
>
> I use a depth of 20 and topN of 1000 (2000) when I initiate 'sh
> bin/nutch crawl -depth 20 -topN 1000'.
>
> I keep running into exceptions after one day. Sometimes it's:
>
> - Memory Exception: Heap Space (after the parsing of the documents)
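On the heap-space error: you can usually give the local Nutch jobs more memory without editing the scripts. A minimal sketch, assuming a local (non-distributed) run and that your copy of bin/nutch honors the NUTCH_HEAPSIZE environment variable (value in megabytes) -- verify that against your own script before relying on it:

```shell
# Sketch: raise the JVM heap for local Nutch jobs.
# NUTCH_HEAPSIZE is read by bin/nutch in many releases (MB, default 1000);
# check your bin/nutch to confirm the variable name.
export NUTCH_HEAPSIZE=4000   # ~4 GB instead of the default ~1 GB

sh bin/nutch crawl urls -depth 20 -topN 1000
```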
After parsing the documents? That should be during updatedb, but are you
sure? That job hardly ever runs out of memory.

> - MySQL Connection Error (because the crawler went on to fetch 10,000
>   documents after the command 'sh bin/nutch crawl -continue -depth 10
>   -topN 700', as the crawl had failed)
>
> I increased the heap space and increased the timeout.
>
> I am wondering what the best practices are for running Nutch crawls.
> Is a full crawl a good thing to do, or should I do it in steps
> (generate, fetch, parse, updatedb)?

Separate steps are good for debugging and give you more control.

> Also, how do I choose the values of the parameters? Even if I give
> topN as 700, the fetcher goes on to fetch 3000 documents. What
> parameters have a high impact on the running time of the crawl?

Are you sure? The generator (at least in trunk) honors the topN parameter
and will not generate more than specified. Keep in mind that by using the
crawl script with the depth parameter you're multiplying topN by depth.

> All these options might be system-specific and need not have general
> values that work for everyone.
>
> I am wondering what Nutch users and developers here do when running
> big crawls?

What is a big crawl? 13,000 documents are very easy to manage on a very
small machine running locally. If you're downloading from one or a few
hosts, it's expected to take a very long time due to crawler politeness:
don't download faster than one page every 5 seconds unless you own the
host or are allowed to fetch faster. If you do, you can lower the fetch
delay or increase the number of threads per queue (host, domain or IP).

> Some of the exceptions come after 1 or 2 days of running the crawler,
> so it's getting hard to know how to fix them beforehand.

I'm not sure this applies to you because I don't know what you mean by
`running the crawler`; never run the fetcher for longer than an hour or so.

> Are there any common exceptions that Nutch can run into frequently?
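To make the politeness settings concrete: the fetch delay and per-queue parallelism are ordinary nutch-site.xml properties. A sketch, assuming the standard property names (fetcher.server.delay, fetcher.threads.per.queue, fetcher.queue.mode) -- check your nutch-default.xml for the exact names and defaults, and only raise the limits for hosts you own or have permission to hit harder:

```xml
<!-- nutch-site.xml sketch: fetcher politeness (verify against nutch-default.xml) -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>      <!-- seconds between requests to the same queue -->
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>        <!-- raise only for hosts you own or control -->
</property>
<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value>   <!-- queue by host, domain, or IP -->
</property>
```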
The usual exceptions are network errors.

> Is there any documentation on Nutch practices? I have seen people's
> crawls go on for a long time, sometimes because of the filtering.

I'm not sure, but the best thing to do on this list is not to talk about
"the crawl" (e.g. my crawl fails or takes too long) but about the separate
jobs. We can't tell what's wrong when someone says a crawl is taking long,
because the crawl consists of the separate steps.

> Sorry for the long email.
>
> Thank you,
> --
> Kiran Chitturi
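To illustrate the "separate jobs" advice, a crawl cycle run step by step might look like the sketch below. This assumes the Nutch 2.x command set (inject, generate, fetch, parse, updatedb, with -all selecting the current batch); exact flags differ between 1.x and 2.x, so check the usage output of your own bin/nutch first:

```shell
#!/bin/sh
# Sketch: one-step-at-a-time crawl cycle, Nutch 2.x style (verify flags locally).

bin/nutch inject urls/              # seed the web table once

for i in 1 2 3; do                  # "depth" is just the number of cycles
  bin/nutch generate -topN 1000     # select up to 1000 URLs to fetch
  bin/nutch fetch -all              # fetch the generated batch
  bin/nutch parse -all              # parse the fetched pages
  bin/nutch updatedb                # merge parse results, discover new links
done
```

Run this way, a failure points at one specific job, and you can re-run just that step instead of restarting the whole crawl.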

