Re: Site being crawled even when the URL is removed from seed.txt

Tejas Patil Wed, 26 Dec 2012 10:07:23 -0800

Hi Rajani,

As per screenshot #1, the seed URL
(http://localhost:8080/nutch-test-site/chi.html) was saved in a file named
"seeds.txt". But when the crawl is run (screenshot #3), this file is not
passed as an argument to the crawl command. Instead, a different file named
"urls" is passed. I suspect that it contains the links from sony.com and
usc.edu.
Please pass the correct seed file to the crawl command and run a fresh
crawl.
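
To check what a given seed path would actually feed into the crawl, a quick
script along these lines may help. It is only a rough sketch, not part of
Nutch: the function name seed_hosts is made up for illustration, and it
assumes one URL per line, with blank lines and "#" comment lines skipped and
anything after a tab treated as metadata. It accepts either a single seed
file or a directory of seed files and groups the URLs by host.

```python
from collections import defaultdict
from pathlib import Path
from urllib.parse import urlparse


def seed_hosts(seed_path):
    """Group the seed URLs under seed_path (a file or a directory) by host.

    Hypothetical helper for inspecting seed lists; it does not call Nutch.
    """
    path = Path(seed_path)
    if path.is_dir():
        # Read every regular file directly inside the directory.
        files = sorted(p for p in path.iterdir() if p.is_file())
    else:
        files = [path]

    hosts = defaultdict(list)
    for seed_file in files:
        for line in seed_file.read_text(encoding="utf-8").splitlines():
            # Assumption: anything after the first tab is per-URL metadata.
            url = line.split("\t")[0].strip()
            if not url or url.startswith("#"):
                continue
            # URLs written without a scheme have no netloc; use them as-is.
            host = urlparse(url).netloc or url
            hosts[host].append(url)
    return dict(hosts)
```

Calling seed_hosts("urls") and seed_hosts("seeds.txt") and comparing the
keys of the two results should show right away whether sony.com or usc.edu
entries are hiding in the path that was passed to the crawl command.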


Thanks,
Tejas Patil


On Wed, Dec 26, 2012 at 4:04 AM, Rajani Maski <[email protected]> wrote:

> Hi Lewis,
>
>    I think there is something wrong in the configuration from my end.
>
> But I have yet to find out why the crawl still covers old links that are
> not mentioned in the seed text file, even though the old crawl db created in
> /home/ubuntu/nutch_new_setup/testcrawl/crawldb no longer exists.
> Did you mean that same crawldb, or does Nutch create a tmp folder somewhere
> else that needs to be cleared?
>
> Please find the screenshots at this
> link <http://rajinimaski.blogspot.in/2012/12/nutch-learning.html>, taken
> during setup and while the crawl was executed. They show the detailed
> configuration steps followed.
>
>
> Thanks & Regards,
> Rajani Maski
>
>
>
> On Wed, Dec 19, 2012 at 6:35 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
> > Hi Rajani,
> >
> >
> >
> > On Wed, Dec 19, 2012 at 10:33 AM, Rajani Maski <[email protected]
> > >wrote:
> >
> > >
> > >    Now I wanted to do a fresh crawl, so after the above crawling
> > > process completed, I followed the steps below:
> > >
> > >    - Changed the URL in seed.txt to service.sony.com.in,
> > >
> >
> > Did you inject the above URL into the new crawl database?
> >
> >
> > >    - In regexurlfilter.txt I just gave "+." [I know that "+." means
> > >    accept anything, but does "anything" also include URLs that are
> > >    not in seed.txt?]
> > >
> >
> > Nutch will follow out/in links for any given URL (depending on your
> > configuration). The crawler cannot magically jump to undiscovered URLs;
> > there needs to be a graph linking the nodes.
> >
> >
> > >
> > > What I observe is that crawling of the site
> > > http://viterbi.usc.edu/admission/
> > > is still taking place even though the URL is not in seed.txt and the
> > > old crawldb (nutchcrawldb) no longer exists.
> > >
> >
> > If you have totally deleted the old crawl database, this should be
> > impossible. The crawl database tracks URLs along with lots of metadata;
> > once it is deleted, this information is lost and you will need to create
> > your crawl database from scratch.
> >
> > Lewis
> >
>
