Re: Site being crawled even when the URL is removed from seed.txt

Lewis John Mcgibbney Wed, 19 Dec 2012 05:05:55 -0800

Hi Rajani,

On Wed, Dec 19, 2012 at 10:33 AM, Rajani Maski <[email protected]> wrote:

>
>    Now I wanted to do a fresh crawl, so after the above crawling process
> completed, I followed the steps below:
>
>    - Changed the URL in seed.txt to service.sony.com.in,
>

Did you inject the above URL into the new crawl database?

>    - in the regexurlfilter.txt I just gave "+." [I know that this "+."
>    means to accept anything, but does "anything" also include URLs that
>    are not in seed.txt?]
>

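Nutch will follow out/in links for any given URL (depending on your
configuration). The "+." rule only decides which discovered URLs are
accepted; it does not discover anything by itself. The crawler cannot
magically jump to undiscovered URLs: there needs to be a chain of links
from a URL it already knows about.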
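
To illustrate the point about the link graph, here is a minimal Python
sketch. It is a toy model, not Nutch code: the crawl() function, the
accept callback (standing in for a "+." URL filter) and the link graph
are all hypothetical. It only shows that an accept-everything filter
still yields nothing beyond what is reachable from the seeds.

```python
from collections import deque


def crawl(seeds, links, accept=lambda url: True):
    """Return URLs reachable from the seeds, in breadth-first order.

    links maps a URL to the list of its outlinks (hypothetical data).
    accept plays the role of a URL filter; the default accepts anything.
    """
    visited = set()
    queue = deque()
    for url in seeds:
        if accept(url) and url not in visited:
            visited.add(url)
            queue.append(url)

    reached = []
    while queue:
        url = queue.popleft()
        reached.append(url)
        # Only URLs linked from an already known URL can be discovered.
        for out in links.get(url, []):
            if out not in visited and accept(out):
                visited.add(out)
                queue.append(out)
    return reached


# Hypothetical link graph: the two sites do not link to each other.
links = {
    "http://service.sony.com.in/": ["http://service.sony.com.in/support"],
    "http://viterbi.usc.edu/admission/": ["http://viterbi.usc.edu/admission/apply"],
}

reached = crawl(["http://service.sony.com.in/"], links)
```

With only the Sony URL as seed, the viterbi.usc.edu pages never show up
in reached, even though the filter accepts everything.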

>
> What I observe is that crawling for the site
> http://viterbi.usc.edu/admission/
> is still taking place even though the URL is not in seed.txt and the
> old crawldb (nutchcrawldb) no longer exists.
>

If you have totally deleted the old crawl database, this should be
impossible. The crawl database tracks URLs along with lots of metadata.
Once it is deleted, this information is lost and you will need to create
your crawl database from scratch.
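
A rough sketch of why the state of the crawl database matters, again in
Python and again a toy model rather than Nutch's real data structures:
the dict, the "unfetched" status string and the inject()/generate()
helpers are simplifications I am assuming for illustration. The idea is
that injecting a new seed into an existing database adds to it, so old
entries are still there to be fetched, whereas a fresh database only
knows the new seed.

```python
def inject(crawldb, seeds):
    """Add seeds to the crawl database; existing entries are left as they are."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")
    return crawldb


def generate(crawldb):
    """Return the URLs that are still due to be fetched, sorted for readability."""
    return sorted(url for url, status in crawldb.items() if status == "unfetched")


# Hypothetical leftover state from the earlier crawl.
old_db = {"http://viterbi.usc.edu/admission/": "unfetched"}

# Case 1: the old database is reused, so the old URL is still scheduled.
reused = generate(inject(dict(old_db), ["http://service.sony.com.in/"]))

# Case 2: the database is created from scratch, so only the new seed is known.
fresh = generate(inject({}, ["http://service.sony.com.in/"]))
```

If the viterbi.usc.edu URL is still being fetched, case 1 is the more
likely situation, so it is worth checking which crawldb directory the
crawl is actually pointing at.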

Lewis
