Hi all,

I am using the script at the bottom of this email to do my re-crawl. It is basically a slightly modified version of the script found here: http://wiki.apache.org/nutch/Crawl

I have a small site that I would like to crawl with this script, maybe 3 times a day, on a Windows server, by scheduling the script through the Windows Scheduled Tasks feature. The actual plan is to create a Windows batch file that calls the Cygwin bash tool and supplies my script to it, as sketched below.
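(The batch file is only a sketch of what I have in mind; it assumes Cygwin is installed under C:\cygwin and that "nutchrecrawl" sits in the Cygwin user's home directory, and the file name nutch-recrawl.bat is just a placeholder.)

@echo off
rem nutch-recrawl.bat - run the recrawl script through a Cygwin login shell
C:\cygwin\bin\bash.exe -l nutchrecrawl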
Here "nutchrecrawl" is the file containing the script pasted at the bottom of this email. Is the overall approach I am taking correct for achieving my objective?

How does Nutch determine that there are old documents that have been updated and therefore need to be re-crawled? I have read that the db.fetch.interval.default property controls how and when Nutch decides to re-fetch an existing document, and that the default is 30 days. Let us say I change db.fetch.interval.default from 30 days to 20 minutes (1200 seconds), with a snippet like the one below in conf/nutch-site.xml.
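(Again just a sketch of what I mean; db.fetch.interval.default takes its value in seconds, so 1200 is 20 minutes, versus the shipped default of 2592000, i.e. 30 days.)

<property>
  <name>db.fetch.interval.default</name>
  <value>1200</value>
</property>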
Then I run the script the first time and index the results into the Solr/Lucene index. Next I go to my site and make a text change to an existing page (one already indexed during the first run), then immediately re-run the script and index the results again. Assuming these steps all happened within 20 minutes, I should not yet see the change I made to the page in the index. If I run the script a third time, after the 20 minutes have passed, then I should see my change in the index. Is my understanding correct?

I also read about the -adddays argument that can be added to the 'generate' step. How does this option work?

Sorry for the long email, but I wanted to make sure I provide all the information needed to make the issue easy to understand. I really appreciate your help.

Thanks,
Raj

**************************************************************
depth=2
threads=50
adddays=0
# topN=15   # Uncomment to cap the number of URLs generated per round

# Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=/cygdrive/c/users/rnemani.turner/nutch
  cd /cygdrive/c/users/rnemani.turner/nutch
  echo "runbot: $0 could not find environment variable NUTCH_HOME"
  echo "runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME"
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=7

echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  # Pass the topN and adddays settings defined at the top of the script
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/segments
else
  rm $RMARGS crawl/BACKUPsegments
  mv $MVARGS crawl/segments crawl/BACKUPsegments
fi
mv $MVARGS crawl/MERGEDsegments crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
  crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes

#echo "----- Loading New Index (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/NEWindexes
  rm $RMARGS crawl/index
else
  rm $RMARGS crawl/BACKUPindexes
  rm $RMARGS crawl/BACKUPindex
  mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
  mv $MVARGS crawl/index crawl/BACKUPindex
fi
mv $MVARGS crawl/NEWindex crawl/index

echo "runbot: FINISHED: Crawl completed!"
echo ""
******************************************************************
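P.S. For completeness, this is roughly how I intend to register the batch file with the Windows scheduler (again only a sketch; the task name and the path to nutch-recrawl.bat are placeholders, and "every 8 hours" is what I mean by 3 times a day):

schtasks /create /tn "NutchRecrawl" /tr "C:\scripts\nutch-recrawl.bat" /sc hourly /mo 8 /st 00:00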