Could you dump the entry for this URL in the crawldb with
bin/nutch readdb crawl/crawldb -url <url>
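For example (the URL here is just a placeholder for one of your failing URLs):

bin/nutch readdb crawl/crawldb -url http://www.example.com/failing-page.html

The dumped CrawlDatum should show, among other things, the status, the next fetch time and the retry counter, which is what we want to look at here.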
It also depends on your configuration: what are the values of
db.fetch.interval.default and db.fetch.schedule.class?
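If I remember correctly, the defaults in nutch-default.xml look roughly like this (your nutch-site.xml may override them):

<property>
  <name>db.fetch.interval.default</name>
  <!-- 30 days, in seconds -->
  <value>2592000</value>
</property>
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
</property>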
A guess (I have not done any crawling with Nutch for a while, so I'm not sure): the URL ends up with CrawlDatum.STATUS_FETCH_RETRY during the crawl, and the crawldb update then calls

schedule.setPageRetrySchedule((Text)key, result, prevFetchTime,
    prevModifiedTime, fetch.getFetchTime());
In AbstractFetchSchedule you can read:
/**
 * This method adjusts the fetch schedule if fetching needs to be
 * re-tried due to transient errors. The default implementation
 * sets the next fetch time 1 day in the future and increases
 * the retry counter.
 * @param url URL of the page
 * @param datum page information
 * @param prevFetchTime previous fetch time
 * @param prevModifiedTime previous modified time
 * @param fetchTime current fetch time
 * @return adjusted page information, including all original information.
 * NOTE: this may be a different instance than {@param datum}, but
 * implementations should make sure that it contains at least all
 * information from {@param datum}.
 */
public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
    long prevFetchTime, long prevModifiedTime, long fetchTime) {
  datum.setFetchTime(fetchTime + (long)SECONDS_PER_DAY*1000);
  datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
  return datum;
}
You have set adddays to 5 days, while this retry schedule pushes the next fetch only 1 day into the future, so on every generate run the failed URLs are eligible again and get refetched over and over. Try adddays with a value of 0, or change this line in AbstractFetchSchedule to use a longer retry delay:

datum.setFetchTime(fetchTime + (long)SECONDS_PER_DAY*1000);
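If you prefer not to patch the Nutch sources, here is a minimal sketch of what I mean, as a custom schedule you could point db.fetch.schedule.class at (the class name, package and the 7-day delay are just made up for illustration):

package org.example.nutch; // hypothetical package, pick your own

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.DefaultFetchSchedule;

// Sketch: same behaviour as DefaultFetchSchedule, except that retries
// after transient errors are pushed 7 days into the future instead of 1,
// so generate -adddays 5 no longer picks them up again immediately.
public class LongRetryFetchSchedule extends DefaultFetchSchedule {

  private static final long RETRY_DELAY_DAYS = 7; // made-up value

  @Override
  public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime, long fetchTime) {
    datum.setFetchTime(fetchTime
        + RETRY_DELAY_DAYS * (long) SECONDS_PER_DAY * 1000L);
    datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
    return datum;
  }
}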
[email protected] wrote:
> Hi,
>
> I am trying to use Nutch 1.1 to build a complete index of our corporate web
> sites. I am using a script based on this one:
>
> http://wiki.apache.org/nutch/Crawl
>
> The text of my script is included below. I set the crawling depth to 100 to
> make sure that everything is indexed, expecting that the process will stop
> after about 20 iterations. The problem is that the Generator keeps
> re-scheduling fetching of failed URLs. The process stopped because the max
> number of iterations (the depth) was reached. After a few iterations, only
> failing URLs were being repeatedly fetched. I checked the log for one of
> them. It failed with code 403 and was re-scheduled for fetching 94 times.
> This does not seem right.
>
> Another problem that I've noticed is that sometimes the Generator schedules
> fetching of just one URL per iteration. This happened once and I did not try
> to repeat this effect, but this does not seem right either.
>
> Here is my script text:
>
> -----------------------------------------------
> #!/bin/bash
>
> depth=100
> threads=10
> adddays=5
> topN=50000
>
> NUTCH_HOME=/data/HORUS_1/nutch1.1
> NUTCH_HEAPSIZE=8000
> nutch=$NUTCH_HOME
> JAVA_HOME=/usr/lib/jvm/java-6-sun
> export NUTCH_HOME
> export JAVA_HOME
> export NUTCH_HEAPSIZE
>
> steps=6
> echo "----- Inject (Step 1 of $steps) -----"
> $nutch/bin/nutch inject $nutch/crawl/crawldb $nutch/crawl/seed
>
> echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
> for((i=0; i < $depth; i++))
> do
> echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
> $nutch/bin/nutch generate $nutch/crawl/crawldb $nutch/crawl/segments $topN -adddays $adddays
> if [ $? -ne 0 ]
> then
> echo "Stopping at depth $depth. No more URLs to fetch."
> break
> fi
> segment=`ls -d $nutch/crawl/segments/* | tail -1`
>
> $nutch/bin/nutch fetch $segment -threads $threads
> if [ $? -ne 0 ]
> then
> echo "fetch $segment at depth $depth failed. Deleting it."
> rm -rf $segment
> continue
> fi
>
> # echo "--- Parsing Segment $segment ---"
> # $nutch/bin/nutch parse $segment
>
> $nutch/bin/nutch updatedb $nutch/crawl/crawldb $segment
> done
>
> echo "----- Invert Links (Step 3 of $steps) -----"
> $nutch/bin/nutch invertlinks $nutch/crawl/linkdb $nutch/crawl/segments/*
>
> echo "----- Index (Step 4 of $steps) -----"
> $nutch/bin/nutch index $nutch/crawl/preIndex $nutch/crawl/crawldb $nutch/crawl/linkdb $nutch/crawl/segments/*
>
> echo "----- Dedup (Step 5 of $steps) -----"
> $nutch/bin/nutch dedup $nutch/crawl/preIndex
>
> echo "----- Merge Indexes (Step 6 of $steps) -----"
> $nutch/bin/nutch merge $nutch/crawl/index $nutch/crawl/preIndex
>
> # in nutch-site, hadoop.tmp.dir points to crawl/tmp
> rm -rf $nutch/crawl/tmp/*
> -----------------------------------------------
>
> Is anyone experiencing same problems? Is there anything wrong in what I am
> doing?
>
> Regards,
>
> Arkadi