Hi C.B.,

To fetch all the pages in your two domains, I would start with a crawl depth of one, i.e. -depth 1. That way, after every crawl you can evaluate how your crawldb and linkdb look with readdb and readlinkdb respectively, and operate in an incremental manner. It also lets you use Luke to check the quality of your index after each round.
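Concretely, a first round might look something like this (a sketch based on the command in your message; -topN 1000 and the linkdump output directory are just example values):

bin/nutch crawl urls -dir crawl -depth 1 -topN 1000
bin/nutch readdb crawl/crawldb -stats
bin/nutch readlinkdb crawl/linkdb -dump crawl/linkdump

readdb -stats prints the counts per status (db_fetched, db_unfetched, and so on), which is a quick sanity check that your filters are keeping the crawl where you want it before you increase -depth.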
Please note that you can use URL filters to filter out the domains you do not want to crawl. Also have a look at the redirect properties, as well as db.ignore.external.links and db.ignore.internal.links in nutch-site.xml. Once you read the descriptions of these properties you will get an idea of which configuration settings will give the best results; a small example is sketched below the quoted message.

On Wed, Jul 6, 2011 at 5:09 AM, Cam Bazz <[email protected]> wrote:
> Hello,
>
> I am running nutch with bin/nutch crawl urls -dir crawl -depth 3 -topN 3
>
> and in my urls/sites file, I have two sites like:
>
> http://www.mysite.com
> http://www.mysite2.com
>
> I would like to crawl those two sites to infinite depth, and just
> index all the pages in these sites. But I don't want it to go to remote
> sites, like facebook, if there is a link from those sites.
>
> How do I do it? I know this is a primitive question, but I have looked
> at all the documentation but could not figure it out.
>
> Best Regards,
> C.B.

--
*Lewis*
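For example, a minimal nutch-site.xml sketch (just an assumed starting point for a "stay on these two sites" crawl; do check the property descriptions in nutch-default.xml before copying):

<?xml version="1.0"?>
<configuration>
  <!-- Outlinks that lead to a different host (e.g. facebook) are dropped,
       so the crawl stays on the two seed sites. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>

If you prefer the URL filter route instead, the stock conf/regex-urlfilter.txt ends with an accept-everything line ("+."); replacing it with host-specific rules such as the following (adjust the patterns to your real hostnames) keeps the fetcher on your domains:

+^http://www\.mysite\.com/
+^http://www\.mysite2\.com/
-.

Either approach stops the crawl from wandering off to external sites; db.ignore.external.links is usually the least fiddly.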

