Hi Jonathan,
                    Apologies for my delayed response. Thank you for the
pointer; the crawl worked as expected once I tweaked the regex filtering.

Thank you once again,
Sidharth

On Wed, Mar 11, 2015 at 4:46 AM, Jonathan Cooper-Ellis <
[email protected]> wrote:

> Hi Siddharth,
>
> Check out the bin/crawl script. There you can set the number of slave
> nodes, as well as the topN for your crawl (fetchlist size * number of
> slaves), which you want to be 700,000+; see the sketch below.
>
> If you tell the bin/crawl script to execute 1 round of 700,000+ pages, you
> will get your entire seed list. You'd really only want to do it this way
> if you plan to crawl the pages once and aren't interested in any of the
> outlinks. If you run another crawl using the same crawl db, you will end
> up following the outlinks collected in the initial crawl, unless you've
> excluded everything but your desired pages in conf/regex-urlfilter.txt.
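>
> For example, a single round over the whole seed list would be invoked
> along these lines (argument order per the 1.x script; the Solr URL is
> only used if you're indexing):
>
>     bin/crawl urls/ mycrawl http://localhost:8983/solr/ 1
>
> and a minimal conf/regex-urlfilter.txt that keeps the crawl on a single
> domain might be (example.com standing in for your actual domain):
>
>     # accept anything under the target domain
>     +^https?://([a-z0-9-]+\.)*example\.com/
>     # reject everything else
>     -.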
>
> Hope that helps.
>
> jce
>
> On Tue, Mar 10, 2015 at 8:14 AM, Siddharth Shah <[email protected]> wrote:
>
> > Hello All,
> >               I have a question regarding running Nutch on Hadoop. The
> > current setup is as follows:
> >
> >    - Hadoop 1.0.3 cluster on AWS EMR (1 master on a medium instance + 3
> >    slave nodes on small instances)
> >    - Nutch 1.7
> >    - Apart from the default Hadoop config, only mapred.map.tasks is set
> >    to 3
> >    - On Nutch, I've updated nutch-site.xml with a proper agent name
> >    (snippet below)
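> >
> > The nutch-site.xml change is just the agent-name property, along these
> > lines (the value shown is illustrative):
> >
> >     <property>
> >       <name>http.agent.name</name>
> >       <value>mycrawler</value>
> >     </property>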
> >
> > I have a seed list of about 700,000 pages from a single domain. So my
> > questions are:
> >
> >    - What setting do I need to update so that the fetcher runs on all 3
> >    nodes as opposed to a single node?
> >    - What would be appropriate depth and topN values? (I am assuming 1
> >    and 700,000 respectively.)
> >
> > Thank you,
> > Sidharth
> >
>
>
>
> --
> Jonathan Cooper-Ellis
> Field Enablement Engineer
> <http://www.cloudera.com>
>
