The script works well (Nutch 0.9).
However, I have some concerns:
From the log output and a review of the code, the script re-indexes the whole
database --> low speed (takes as long as indexing from scratch)
--> Is there any way to re-index only the changed pages?
The generate step is also long
--> Can it be improved?
The db.default.fetch.in
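For what it's worth, an incremental recrawl with the individual Nutch 0.9 commands looks roughly like the sketch below. CRAWL_DIR, DEPTH, and the `<latest>` placeholder are my assumptions, not anything from the script, and `run` only prints each command so the sketch can be dry-run safely:

```shell
#!/bin/sh
# Incremental recrawl sketch for Nutch 0.9. CRAWL_DIR and DEPTH are
# assumptions; adjust to your layout. 'run' only echoes each command
# so this can be dry-run; remove it to execute for real.
run() { echo "$@"; }

CRAWL_DIR=crawl
DEPTH=2

i=1
while [ "$i" -le "$DEPTH" ]; do
  # generate selects only URLs whose fetch interval has expired, so
  # pages not yet due for a re-fetch are skipped; -topN can cap the
  # batch size if the generate step itself is too slow
  run bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments -topN 1000
  # <latest> stands in for the newest directory under segments/
  run bin/nutch fetch "$CRAWL_DIR/segments/<latest>"
  run bin/nutch updatedb $CRAWL_DIR/crawldb "$CRAWL_DIR/segments/<latest>"
  i=$((i + 1))
done

# rebuild the link db and index the segments afterwards
run bin/nutch invertlinks $CRAWL_DIR/linkdb "$CRAWL_DIR/segments/*"
run bin/nutch index $CRAWL_DIR/indexes $CRAWL_DIR/crawldb \
  $CRAWL_DIR/linkdb "$CRAWL_DIR/segments/*"
```

A full recrawl script also deduplicates and merges the new index with the old one; this sketch leaves those steps out.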
The script creates a 'crawl' directory in the present working directory.
Where is your Nutch directory, and where are you running the script from? I
usually change directory to the top-level Nutch directory, put the
script in the 'bin' directory, chmod a+x bin/crawl, and then run it as
bin/crawl. So, as
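Concretely, that setup looks like the following; NUTCH_HOME is only a placeholder for your actual top-level Nutch directory, and the empty file stands in for the script saved from the wiki page:

```shell
#!/bin/sh
# Setup steps from the message above. NUTCH_HOME is a placeholder; a
# temp directory is used here only so the sketch is self-contained.
NUTCH_HOME=${NUTCH_HOME:-$(mktemp -d)}
mkdir -p "$NUTCH_HOME/bin"
cd "$NUTCH_HOME"

: > bin/crawl        # stands in for saving the wiki script as bin/crawl
chmod a+x bin/crawl  # make it executable
# then run it from the top-level Nutch directory:
#   bin/crawl
# so that relative paths (conf/, the generated crawl/ dir) resolve here
```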
Thanks for the response.
I tried this with nutch-0.9. The script seems to be accessing non-existent
files/directories.
Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt
exist : /user/nutch/-threads/crawl_generate
at org.apache.hadoop.mapred.InputFormatBase.validateInput(
I
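The path /user/nutch/-threads/crawl_generate suggests the script's segment variable expanded to empty, so the fetcher took the literal -threads option as its segment path. A guard along these lines makes that failure visible; the variable names here are my assumption, not necessarily the script's:

```shell
#!/bin/sh
# If $segment is empty, "bin/nutch fetch $segment -threads $threads"
# collapses to "bin/nutch fetch -threads 10", and the fetcher reads
# "-threads" as the segment path -- hence /user/nutch/-threads/...
# Variable names are assumptions about the wiki script.
threads=10
segment=""   # normally the newest entry under crawl/segments/; empty here to show the guard

if [ -z "$segment" ]; then
  echo "no segment generated; skipping fetch" >&2
  fetched=no
else
  bin/nutch fetch "$segment" -threads "$threads"
  fetched=yes
fi
```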
You can try the crawl script: http://wiki.apache.org/nutch/Crawl
Regards,
Susam Pal
On Jan 13, 2008 8:36 AM, Manoj Bist <[EMAIL PROTECTED]> wrote:
> Hi,
>
> When I run crawl the second time, it always complains that 'crawled' already
> exists. I always need to remove this directory using 'hadoop
Hi,
When I run crawl the second time, it always complains that 'crawled' already
exists. I always need to remove this directory using 'hadoop dfs -rm
crawled' to get going.
Is there some way to avoid this error and tell Nutch that it's a recrawl?
bin/nutch crawl urls -dir crawled -depth 1 2>&1 |
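Since bin/nutch crawl refuses to write into an existing output directory, the two usual options are to clear it first (what you are doing by hand) or to recrawl with the individual commands against the existing crawldb. A sketch of both; `run` only prints each command so this is safe to dry-run, and -rmr is the recursive remove for a directory:

```shell
#!/bin/sh
# Recrawl sketch. 'run' only echoes each command so this can be
# dry-run; remove it to execute for real. Directory names are the
# ones from the message above.
run() { echo "$@"; }
DIR=crawled

# Option 1: clear the previous output first (-rmr removes recursively)
run hadoop dfs -rmr $DIR
run bin/nutch crawl urls -dir $DIR -depth 1

# Option 2: keep $DIR and re-use its crawldb with the individual
# commands, which is what the wiki recrawl script automates;
# <latest> stands in for the newest directory under segments/
run bin/nutch generate $DIR/crawldb $DIR/segments
run bin/nutch fetch "$DIR/segments/<latest>"
run bin/nutch updatedb $DIR/crawldb "$DIR/segments/<latest>"
```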