Hi Siddharth,

Check out the bin/crawl script. There you can set the number of slave
nodes, as well as the topN for your crawl (fetchlist size * number of
slaves), which you'll want to be 700,000+.
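For concreteness, here's a hedged sketch of those knobs. The variable names follow the Nutch 1.x bin/crawl script, but check your copy, since they may differ by version:

```shell
# Assumed excerpt mirroring bin/crawl (Nutch 1.x); adjust to your version.
numSlaves=3                             # number of slave nodes in the cluster
sizeFetchlist=$((numSlaves * 250000))   # passed to the generate step as -topN
echo "topN per round: $sizeFetchlist"
```

With 3 slaves, 250,000 pages per node gives a topN of 750,000, which covers the full seed list in a single round.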

If you tell the bin/crawl script to execute 1 round of 700,000+ pages, you
will fetch your entire seed list. You'd really only want to do it this way
if you plan to crawl the pages once and aren't interested in any of the
outlinks. If you run another crawl against the same crawldb, you will end
up following the outlinks collected in the initial crawl, unless you've
excluded everything but your desired pages in regex-urlfilter.txt.
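For example, to restrict the crawl to a single domain in conf/regex-urlfilter.txt (example.com is a placeholder here, swap in your real domain; the final `-.` rule rejects any URL that didn't match an earlier rule):

```
# Accept only pages on example.com (and its subdomains)
+^https?://([a-z0-9.-]*\.)?example\.com/
# Reject everything else
-.
```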

Hope that helps.

jce

On Tue, Mar 10, 2015 at 8:14 AM, Siddharth Shah <[email protected]> wrote:

> Hello All,
>               I have a question regarding running Nutch on Hadoop. The
> current setup is as follows
>
>    - Hadoop 1.0.3 cluster on AWS's EMR (1 Master - Medium Instance + 3
>    Slave Nodes Small Instance)
>    - Nutch 1.7
>    - Apart from default hadoop config only mapred.map.tasks set to 3
>    - On Nutch, I've updated nutch-site.xml with a proper agent name
>
> I have a seed list of about 700,000 pages from a single domain. So my
> questions are
>
>    - What setting do I need to update so that fetcher works on all 3 nodes
>    as opposed to single node?
>    - What would be appropriate settings for depth and topN values? (I am
>    assuming them to be 1 and 700000 respectively)
>
> Thank you,
> Sidharth
>



-- 
Jonathan Cooper-Ellis
Field Enablement Engineer
<http://www.cloudera.com>
