db.ignore.external.links will not necessarily keep you within one domain when 
it comes to redirections. Try the domain filter or the URL filter instead.
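
For example, with the regex URL filter plugin (urlfilter-regex) enabled, a
conf/regex-urlfilter.txt along these lines should keep the crawl on the two
sites from the question below. This is only a rough sketch: the domain names
are the ones from the question, and the default catch-all "+." rule has to
be removed, otherwise everything is still accepted:

  # accept only the two sites (and their subdomains)
  +^http://([a-z0-9-]+\.)*mysite\.com/
  +^http://([a-z0-9-]+\.)*mysite2\.com/
  # reject everything else
  -.

The domain filter (urlfilter-domain plugin) does roughly the same with less
regex: enable it in plugin.includes and list the bare domains, one per line,
in the file pointed to by urlfilter.domain.file (domain-urlfilter.txt by
default, if I remember correctly):

  mysite.com
  mysite2.com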

On Wednesday 06 July 2011 17:32:53 lewis john mcgibbney wrote:
> Hi C.B.,
> 
> To fetch all the pages in your two domains, I would start with a crawl
> depth of one, i.e. -depth 1. That way, after every crawl, you can evaluate
> how your crawldb and linkdb are looking with readdb and readlinkdb
> respectively, and operate in an incremental manner. This will also let you
> use Luke to check the quality of your index.
> 
> Please note that you can use URL filters to filter out domains you do not
> want to crawl. Also have a look at the redirect properties, as well as
> db.ignore.external.links and db.ignore.internal.links, in nutch-site.xml.
> Once you read the descriptions of these properties you will get an idea of
> which configuration settings will give the best results.
> 
> On Wed, Jul 6, 2011 at 5:09 AM, Cam Bazz <[email protected]> wrote:
> > Hello,
> > 
> > I am running nutch with bin/nutch crawl urls -dir crawl -depth 3 -topN 3
> > 
> > and in my urls/sites file, I have two sites like:
> > 
> > http://www.mysite.com
> > http://www.mysite2.com
> > 
> > I would like to crawl those two sites to infinite depth and index all
> > the pages on them. But I don't want the crawler to go to remote sites,
> > like Facebook, if there is a link from those sites.
> > 
> > How do I do it? I know this is a primitive question, but I have looked
> > through all the documentation and could not figure it out.
> > 
> > Best Regards,
> > C.B.
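
For the db.ignore.* properties mentioned above, a minimal nutch-site.xml
sketch would look like this (the description is paraphrased; check
conf/nutch-default.xml for the authoritative wording and defaults):

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>If true, outlinks pointing to a different host than the
    page they were found on are ignored at db update time.</description>
  </property>

But again, as noted at the top: this does not stop redirects to external
hosts, so the URL or domain filters are the safer way to stay on the two
sites.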

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
