Hi Rajani,

As per screenshot #1, the seed URL (http://localhost:8080/nutch-test-site/chi.html) was saved in a file named "seeds.txt". But when the crawl is run (screenshot #3), this file is not passed as an argument to the crawl command. A different file named "urls" is passed instead. I suspect that "urls" contains the links from sony.com and usc.edu, which would explain why those sites are still being crawled. Please pass the correct seed file to the crawl command and run a fresh crawl.
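One quick way to check this suspicion before re-running the crawl is to list the hosts named in whatever path is handed to the crawl command. The following is only a minimal sketch, not part of Nutch: the helper name seed_hosts is hypothetical, and it assumes the seed path is either a plain text file or a directory of plain text files with one URL per line, where each URL carries a scheme such as http://.

```python
from pathlib import Path
from urllib.parse import urlparse


def seed_hosts(seed_path):
    """Return the set of hosts named in a seed file or a directory of seed files.

    Hypothetical helper for illustration; it only reads text files and does
    not call Nutch in any way.
    """
    path = Path(seed_path)
    if path.is_file():
        files = [path]
    else:
        files = sorted(p for p in path.iterdir() if p.is_file())

    hosts = set()
    for seed_file in files:
        for line in seed_file.read_text().splitlines():
            line = line.strip()
            # Skip blank lines and comment lines.
            if not line or line.startswith("#"):
                continue
            # Keep only the first whitespace-separated field (the URL itself).
            url = line.split()[0]
            host = urlparse(url).netloc
            # URLs written without a scheme have an empty netloc and are skipped.
            if host:
                hosts.add(host)
    return hosts


if __name__ == "__main__":
    import sys

    for host in sorted(seed_hosts(sys.argv[1])):
        print(host)
```

Running it against the "urls" path from screenshot #3 and against "seeds.txt" would show whether the two name different hosts. If sony.com or usc.edu hosts appear for "urls", that path is the stale seed list.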
Thanks,
Tejas Patil

On Wed, Dec 26, 2012 at 4:04 AM, Rajani Maski <[email protected]> wrote:

> Hi Lewis,
>
> I think there is something wrong in the configuration from my end.
>
> But I am yet to find the reason for crawl that is taking place on the
> history of links that are not mentioned in the seed text file nor the old
> crawl db created in /home/ubuntu/nutch_new_setup/testcrawl/crawldb exists.
> Did you mean the same crawldb or does it create tmp folder somewhere else
> that need to be cleared?
>
> Please find the screens shots in this
> link<http://rajinimaski.blogspot.in/2012/12/nutch-learning.html> taken
> during set up and while crawl is executed. It shows the detailed
> configuration steps followed.
>
> Thanks & Regards,
> Rajani Maski
>
> On Wed, Dec 19, 2012 at 6:35 PM, Lewis John Mcgibbney
> <[email protected]> wrote:
>
> > Hi Rajani,
> >
> > On Wed, Dec 19, 2012 at 10:33 AM, Rajani Maski <[email protected]>
> > wrote:
> >
> > > Now I wanted to do fresh new crawl, So after the completion of above
> > > crawling process, i followed the below steps:
> > >
> > >    - Changed the URL in seed.txt to service.sony.com.in,
> >
> > Did you inject the above URL into the new crawl database?
> >
> > >    - in the regexurlfilter.txt I just gave "+." [I know that this "+."
> > >    means to accept anything , But "anything" does it mean any ULRS that
> > >    is not there in seed.txt too? ]
> >
> > Nutch will follow out/in links for any given URL (depending on your
> > configuration). The crawler cannot magically jump to undiscovered URLs,
> > there needs to be a graph linking nodes.
> >
> > > *What I observe is crawling for the site :
> > > http://viterbi.usc.edu/admission/
> > > is still taking place even when the url does not exist in seed.txt nor
> > > the old crawldb(nutchcrawldb) exists.
> >
> > If you have totally deleted the old crawl database this should be
> > impossible. The crawl database tracks URLs along with lots of metadata,
> > once it is deleted this information is lost and you will need to create
> > your crawl database from scratch.
> >
> > Lewis

