Hi all,

I am using the script at the bottom of this email to do my re-crawl. It is basically a slightly modified version of the script found here: http://wiki.apache.org/nutch/Crawl

I have a small site that I would like to crawl with this script, maybe 3 times a day, on a Windows server, by scheduling the script through the Windows Scheduled Tasks feature. The actual plan is to create a Windows batch file that calls the Cygwin bash tool and supplies my script to it, as sketched below.
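(The batch file is only a sketch of what I have in mind; it assumes Cygwin is installed under C:\cygwin and that "nutchrecrawl" sits in the Cygwin user's home directory, and the file name nutch-recrawl.bat is just a placeholder.)

@echo off
rem nutch-recrawl.bat - run the recrawl script through a Cygwin login shell
C:\cygwin\bin\bash.exe -l nutchrecrawl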
Here "nutchrecrawl" is the file containing the script pasted at the bottom of this email. Is the overall approach I am taking correct for achieving my objective?

How does Nutch determine that there are old documents that have been updated and therefore need to be re-crawled? I have read that the db.fetch.interval.default property controls how and when Nutch decides to re-fetch an existing document, and that the default is 30 days. Let us say I change db.fetch.interval.default from 30 days to 20 minutes (1200 seconds), with a snippet like the one below in conf/nutch-site.xml.
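(Again just a sketch of what I mean; db.fetch.interval.default takes its value in seconds, so 1200 is 20 minutes, versus the shipped default of 2592000, i.e. 30 days.)

<property>
  <name>db.fetch.interval.default</name>
  <value>1200</value>
</property>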
Then I run the script the first time and index the results into the Solr/Lucene index. Next I go to my site and make a text change to an existing page (one already indexed during the first run), then immediately re-run the script and index the results again. Assuming these steps all happened within 20 minutes, I should not yet see the change I made to the page in the index. If I run the script a third time, after the 20 minutes have passed, then I should see my change in the index. Is my understanding correct?

I also read about the -adddays argument that can be added to the 'generate' step. How does this option work?

Sorry for the long email, but I wanted to make sure I provide all the information needed to make the issue easy to understand. I really appreciate your help.

Thanks,
Raj

**************************************************************
depth=2
threads=50
adddays=0
# topN=15   # Uncomment to cap the number of URLs generated per round

# Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=/cygdrive/c/users/rnemani.turner/nutch
  cd /cygdrive/c/users/rnemani.turner/nutch
  echo "runbot: $0 could not find environment variable NUTCH_HOME"
  echo "runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME"
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=7

echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  # Pass the topN and adddays settings defined at the top of the script
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/segments
else
  rm $RMARGS crawl/BACKUPsegments
  mv $MVARGS crawl/segments crawl/BACKUPsegments
fi
mv $MVARGS crawl/MERGEDsegments crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
  crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes

#echo "----- Loading New Index (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/NEWindexes
  rm $RMARGS crawl/index
else
  rm $RMARGS crawl/BACKUPindexes
  rm $RMARGS crawl/BACKUPindex
  mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
  mv $MVARGS crawl/index crawl/BACKUPindex
fi
mv $MVARGS crawl/NEWindex crawl/index

echo "runbot: FINISHED: Crawl completed!"
echo ""
******************************************************************
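P.S. For completeness, this is roughly how I intend to register the batch file with the Windows scheduler (again only a sketch; the task name and the path to nutch-recrawl.bat are placeholders, and "every 8 hours" is what I mean by 3 times a day):

schtasks /create /tn "NutchRecrawl" /tr "C:\scripts\nutch-recrawl.bat" /sc hourly /mo 8 /st 00:00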