Re: 'crawled already exists' - how do I recrawl?

2008-01-15 Thread nghianghesi
The script works well (Nutch 0.9). However, I have some concerns. Judging by the log on screen and a review of the code, the script re-indexes the whole database --> low speed (it takes as long as building the index from scratch) --> is there any way to re-index only the changed pages? The generate step is slow as well --> can it be improved? The db.default.fetch.in
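The property name above is cut off. A likely candidate, going by the stock Nutch 0.9 `conf/nutch-default.xml`, is `db.default.fetch.interval`, which controls how many days pass before a page is considered due for refetching; that is an assumption about what the writer meant. A sketch of overriding it in `conf/nutch-site.xml`:

```xml
<!-- Assumption: the truncated property is db.default.fetch.interval.
     Lowering it makes pages eligible for refetch (and re-indexing) sooner. -->
<property>
  <name>db.default.fetch.interval</name>
  <value>30</value> <!-- days; 30 is the Nutch 0.9 stock default -->
</property>
```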

Re: 'crawled already exists' - how do I recrawl?

2008-01-12 Thread Susam Pal
The script creates a 'crawl' directory in the present working directory. Where is your Nutch directory, and from where are you running the script? I usually change to the top-level Nutch directory, put the script in the 'bin' directory, chmod a+x bin/crawl, and then run it as bin/crawl. So, as
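The layout Susam describes can be sketched as below. The install path and script body are stand-ins (the real script is the one from the wiki); the point is that everything is invoked relative to the top-level Nutch directory, so the script's relative 'crawl' output lands in a predictable place.

```shell
#!/bin/sh
# Sketch of the described setup; 'nutch-0.9' here is a stand-in directory
# and the one-line script below is a placeholder for the wiki crawl script.
mkdir -p nutch-0.9/bin && cd nutch-0.9              # top-level Nutch directory
printf '#!/bin/sh\necho recrawl-ok\n' > bin/crawl   # stand-in for bin/crawl
chmod a+x bin/crawl                                 # make it executable
bin/crawl                                           # run relative to the top level
```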

Re: 'crawled already exists' - how do I recrawl?

2008-01-12 Thread Manoj Bist
Thanks for the response. I tried this with Nutch 0.9. The script seems to be accessing non-existent files/directories: Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /user/nutch/-threads/crawl_generate at org.apache.hadoop.mapred.InputFormatBase.validateInput( I
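The stray `-threads` fragment in that path suggests a shell variable in the crawl script expanded to nothing, letting the `-threads` flag collapse into the segment path. A minimal sketch of guarding against that (the variable names here are hypothetical, not necessarily the script's own):

```shell
#!/bin/sh
# Sketch: default crawl-script variables so an unset value cannot collapse
# into a malformed path like /user/nutch/-threads/crawl_generate.
# 'depth' and 'threads' are illustrative names, not the script's exact ones.
depth="${depth:-5}"       # fall back to a default when unset
threads="${threads:-10}"  # likewise; ${threads:?msg} would abort instead
echo "crawl settings: depth=$depth threads=$threads"
```

Using `${var:?message}` instead of `${var:-default}` makes the script fail fast with a clear error rather than fetching a bogus path.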

Re: 'crawled already exists' - how do I recrawl?

2008-01-12 Thread Susam Pal
You can try the crawl script: http://wiki.apache.org/nutch/Crawl

Regards,
Susam Pal

On Jan 13, 2008 8:36 AM, Manoj Bist <[EMAIL PROTECTED]> wrote:
> Hi,
>
> When I run crawl the second time, it always complains that 'crawled' already
> exists. I always need to remove this directory using 'hadoop

'crawled already exists' - how do I recrawl?

2008-01-12 Thread Manoj Bist
Hi, When I run crawl the second time, it always complains that 'crawled' already exists. I always need to remove this directory using 'hadoop dfs -rm crawled' to get going. Is there some way to avoid this error and tell Nutch that it's a recrawl? bin/nutch crawl urls -dir crawled -depth 1 2>&1 |
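The manual workaround can be folded into a small wrapper: delete the stale output, then crawl. This is a sketch, not the wiki script; the local `rm -rf` stands in for the DFS delete when Nutch runs against the local filesystem, and the `hadoop dfs` line (Hadoop 0.x syntax) is the cluster equivalent.

```shell
#!/bin/sh
# Sketch: wipe the previous crawl output before recrawling, so 'bin/nutch
# crawl' never hits the 'crawled already exists' error.
CRAWL_DIR="crawled"                  # same value as the -dir argument
rm -rf "$CRAWL_DIR"                  # local-filesystem run
# bin/hadoop dfs -rmr "$CRAWL_DIR"   # HDFS run: recursive delete (Hadoop 0.x)
if [ -x bin/nutch ]; then            # only crawl when run from a Nutch tree
    bin/nutch crawl urls -dir "$CRAWL_DIR" -depth 1
fi
```

The longer-term answer, as the replies note, is the wiki's recrawl script, which reuses the existing crawl db instead of starting over.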