Hi Lewis,

I think there is something wrong with the configuration on my end.
However, I have yet to find out why the crawl is still fetching links that are neither listed in the seed text file nor held in the old crawldb. The old crawldb created in /home/ubuntu/nutch_new_setup/testcrawl/crawldb no longer exists. Did you mean that same crawldb, or does Nutch create a tmp folder somewhere else that needs to be cleared?

Please find the screenshots at this link <http://rajinimaski.blogspot.in/2012/12/nutch-learning.html>, taken during setup and while the crawl was running. They show the detailed configuration steps I followed.

Thanks & Regards,
Rajani Maski

On Wed, Dec 19, 2012 at 6:35 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Rajani,
>
> On Wed, Dec 19, 2012 at 10:33 AM, Rajani Maski <[email protected]> wrote:
>
> > Now I wanted to do a fresh new crawl, so after the completion of the above
> > crawling process, I followed the below steps:
> >
> > - Changed the URL in seed.txt to service.sony.com.in,
>
> Did you inject the above URL into the new crawl database?
>
> > - in the regexurlfilter.txt I just gave "+." [I know that this "+."
> > means to accept anything, but does "anything" mean any URLs that are
> > not there in seed.txt too?]
>
> Nutch will follow out/in links for any given URL (depending on your
> configuration). The crawler cannot magically jump to undiscovered URLs,
> there needs to be a graph linking nodes.
>
> > What I observe is crawling for the site:
> > http://viterbi.usc.edu/admission/
> > is still taking place even when the URL does not exist in seed.txt nor
> > the old crawldb (nutchcrawldb) exists.
>
> If you have totally deleted the old crawl database this should be
> impossible. The crawl database tracks URLs along with lots of metadata,
> once it is deleted this information is lost and you will need to create
> your crawl database from scratch.
>
> Lewis

