Hello Lewis,

Thank you very much. I have indeed figured it out, and now Nutch is
indexing the way I want.

Best,
C.B.

On Wed, Jul 6, 2011 at 6:32 PM, lewis john mcgibbney
<[email protected]> wrote:
> Hi C.B.,
>
> To fetch all the pages in your two domains, I would start with a breadth of
> search equal to one, i.e. -depth 1. This way, after every crawl you can
> evaluate how your crawldb and linkdb are looking with readdb and readlinkdb
> respectively, and operate in an incremental manner. This will also let you
> use Luke, so you can check the quality of your index.
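>
> For example, a minimal incremental cycle might look something like the
> following (only a sketch; the -topN value and the crawl/linkdump paths are
> placeholders you would adjust for your own setup):
>
>   # run one round of crawling, one level deep
>   bin/nutch crawl urls -dir crawl -depth 1 -topN 50
>   # inspect the crawldb statistics after the round
>   bin/nutch readdb crawl/crawldb -stats
>   # dump the linkdb to a directory for inspection
>   bin/nutch readlinkdb crawl/linkdb -dump linkdump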
>
> Please note that you can use URL filters to filter out domains you do not
> want to crawl. Also have a look at the redirect properties, as well as
> db.ignore.external.links and db.ignore.internal.links in nutch-site.xml.
> Once you read the descriptions of these properties you will get an idea of
> which configuration settings will give the best results.
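>
> As a rough sketch (assuming the stock conf/nutch-site.xml layout), keeping
> the crawl inside your seed domains could be as simple as setting:
>
>   <!-- in conf/nutch-site.xml: do not follow links that point to other hosts -->
>   <property>
>     <name>db.ignore.external.links</name>
>     <value>true</value>
>   </property>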
>
> On Wed, Jul 6, 2011 at 5:09 AM, Cam Bazz <[email protected]> wrote:
>
>> Hello,
>>
>> I am running nutch with bin/nutch crawl urls -dir crawl -depth 3 -topN 3
>>
>> and in my urls/sites file, I have two sites like:
>>
>> http://www.mysite.com
>> http://www.mysite2.com
>>
>> I would like to crawl those two sites to infinite depth and index all of
>> their pages, but I don't want the crawler to follow links to remote
>> sites, such as Facebook, if those sites link to them.
>>
>> How do I do it? I know this is a basic question, but I have looked
>> through all the documentation and could not figure it out.
>>
>> Best Regards,
>> C.B.
>>
>
>
>
> --
> *Lewis*
>
