Re: Site being crawled even when the URL is removed from seed.txt

Rajani Maski Wed, 26 Dec 2012 04:04:49 -0800

Hi Lewis,

   I think there is something wrong in the configuration on my end.

But I have yet to find the reason why the crawl is still running on old
links that are not listed in the seed text file, even though the old
crawl db created in /home/ubuntu/nutch_new_setup/testcrawl/crawldb no
longer exists. Did you mean that same crawldb, or does Nutch create a tmp
folder somewhere else that needs to be cleared?

Please find the screenshots at this link
<http://rajinimaski.blogspot.in/2012/12/nutch-learning.html>, taken
during setup and while the crawl was executing. They show the detailed
configuration steps I followed.

Thanks & Regards,
Rajani Maski

On Wed, Dec 19, 2012 at 6:35 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Rajani,
>
>
>
> On Wed, Dec 19, 2012 at 10:33 AM, Rajani Maski <[email protected]
> >wrote:
>
> >
> >    Now I wanted to do a fresh crawl, so after the above crawling
> > process completed, I followed the steps below:
> >
> >    - Changed the URL in seed.txt to service.sony.com.in,
> >
>
> Did you inject the above URL into the new crawl database?
>
>
> >    - In the regexurlfilter.txt I just gave "+." [I know that "+."
> >    means to accept anything, but does "anything" also include URLs
> >    that are not in seed.txt?]
> >
>
> Nutch will follow out/in links for any given URL (depending on your
> configuration). The crawler cannot magically jump to undiscovered URLs;
> there needs to be a graph linking the nodes.
>
>
> >
> > What I observe is that crawling of the site
> > http://viterbi.usc.edu/admission/
> > is still taking place even though the URL is not in seed.txt and the
> > old crawldb (nutchcrawldb) no longer exists.
> >
>
> If you have totally deleted the old crawl database, this should be
> impossible. The crawl database tracks URLs along with lots of metadata;
> once it is deleted, this information is lost and you will need to create
> your crawl database from scratch.
>
> Lewis
>
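
The two points in Lewis's reply (the crawler only reaches URLs linked from
what it already knows, and a deleted crawl database takes that knowledge
with it) can be illustrated with a toy model. This is only a sketch in
plain Python, not Nutch code: the `crawl` function, the `LINKS` table and
the outlink paths are hypothetical and stand in for the real inject,
generate, fetch and updatedb cycle.

```python
from collections import deque

# Hypothetical link graph: each URL maps to the URLs it links to.
# The outlink paths are made up for illustration only.
LINKS = {
    "http://service.sony.com.in/": ["http://service.sony.com.in/support"],
    "http://viterbi.usc.edu/admission/": ["http://viterbi.usc.edu/admission/apply"],
}


def crawl(seeds, links, crawldb=None, depth=2):
    """Toy breadth-first crawl (an assumption-laden model, not Nutch itself).

    seeds   -- URLs injected for this run
    links   -- dict mapping a URL to its outlinks
    crawldb -- URLs already known from earlier runs, or None if deleted
    depth   -- how many link hops to follow from the known URLs
    """
    known = set(crawldb or ())  # state left over from earlier runs
    known.update(seeds)         # "inject": add the seeds to the known URLs
    seen = set(known)
    frontier = deque((url, 0) for url in sorted(known))
    fetched = []
    while frontier:
        url, level = frontier.popleft()
        fetched.append(url)
        if level >= depth:
            continue
        # Follow outlinks; a URL nobody links to is never discovered.
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                frontier.append((out, level + 1))
    return fetched


# Fresh crawl database: only the new seed and pages reachable from it.
fresh = crawl(["http://service.sony.com.in/"], LINKS)

# Leftover crawl database: the old URL is still known, so it is crawled
# again even though it is no longer in the seed list.
stale = crawl(
    ["http://service.sony.com.in/"],
    LINKS,
    crawldb={"http://viterbi.usc.edu/admission/"},
)

print(fresh)
print(stale)
```

In this model the old site shows up only when leftover state is passed in,
which is why it is worth checking whether the crawl is really starting
from an empty crawl database.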
