This might be the reason: You are using GEdit to edit the seeds file. It creates a backup of the old version of the file when changes are made to it. The backup file is hidden.
Check the contents of the urls directory with this command, run from NUTCH_HOME (in your setup it's ~/nutch_new_setup):

*ls -a urls*

This might give you:

*. .. seed.txt seed.txt~*

seed.txt, the updated version, will have http://localhost:8080/nutch-test-site/chi.html, while the backup version, seed.txt~, will have the sony.com and usc.edu urls. The second file is the backup; file managers typically hide names ending in ~, which is why it is easy to miss.

Nutch scans the "urls" directory and reads *all* the files inside it, so both files get picked up by Nutch and hence you see the old urls too.

Delete the backup file urls/seed.txt~ and try a fresh crawl.

Thanks,
Tejas Patil

On Wed, Dec 26, 2012 at 8:54 PM, Rajani Maski <[email protected]> wrote:
> http://localhost:8080/nutch-test-site/chi.html
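If it helps, here is a rough sketch of the check and cleanup described above, reproduced in a scratch directory so it cannot touch a real crawl. The variable demo_home is a hypothetical stand-in for NUTCH_HOME, and the two old seed URLs are placeholders, since I am only assuming what the sony.com and usc.edu entries looked like.

```shell
# Scratch directory standing in for NUTCH_HOME (hypothetical name).
demo_home=$(mktemp -d)
mkdir "$demo_home/urls"

# The current seed file, as edited.
printf '%s\n' 'http://localhost:8080/nutch-test-site/chi.html' > "$demo_home/urls/seed.txt"

# Simulate the editor's backup copy holding the old seeds (placeholder URLs).
printf '%s\n' 'http://www.sony.com/' 'http://www.usc.edu/' > "$demo_home/urls/seed.txt~"

# List everything in urls/, including the backup.
ls -a "$demo_home/urls"

# Remove only files whose names end in "~"; -f keeps rm quiet if none exist.
rm -f -- "$demo_home"/urls/*~

# What is left is what a crawl would read as seeds.
remaining=$(ls -A "$demo_home/urls")
echo "$remaining"
```

In a real setup the same rm line would be run from NUTCH_HOME against urls/*~ instead of the scratch path.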

