This might be the reason: you are using gedit to edit the seed file. gedit
creates a backup of the old version of a file whenever changes are saved to
it. The backup file is hidden.

Check the contents of the urls directory using this command: *ls -a urls*
(to be executed from NUTCH_HOME; in your setup it's ~/nutch_new_setup).

This might give you:
*.  ..  seed.txt  seed.txt~*

seed.txt, the updated version, will contain
http://localhost:8080/nutch-test-site/chi.html, while the backup version,
seed.txt~, will contain the sony.com and usc.edu URLs. The second file is a
hidden file.

Nutch scans the "urls" directory and reads *all* the files inside it, so both
files are picked up by Nutch and hence you see the old URLs too.
Delete the hidden file urls/seed.txt~ and try a fresh crawl.
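
If it helps, the cleanup can be sketched as a small shell function. This is only a rough sketch: the function name clean_seed_dir is made up for illustration, the default directory "urls" is the one from this thread, and it assumes a find that supports -maxdepth (GNU and BSD find both do).

```shell
# Hypothetical helper (not part of Nutch): list a seed directory, then delete
# editor backup files, i.e. regular files whose names end in "~".
clean_seed_dir() {
    dir="${1:-urls}"    # assumed default: the "urls" directory under NUTCH_HOME

    # Show every entry, including hidden ones, before deleting anything.
    ls -a "$dir"

    # Delete only backup files that sit directly inside the directory.
    find "$dir" -maxdepth 1 -type f -name '*~' -exec rm -f {} +
}
```

Run it from NUTCH_HOME as *clean_seed_dir urls* and then start the crawl again. It only touches files ending in "~", so seed.txt itself is left alone.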

Thanks,
 Tejas Patil

On Wed, Dec 26, 2012 at 8:54 PM, Rajani Maski <[email protected]> wrote:

>  http://localhost:8080/nutch-test-site/chi.html
>
