Hi Lewis,
No, I don't use the crawl command. I use an adapted step-by-step script from
the Wiki, and Nutch runs locally on a single server. The attached script
does not include merging and indexing, which are separate steps in my
workflow. My (fetch) workflow is:
- inject
- generate
- fetch
- updatedb
Please see my complete (fetch-)script:
#!/bin/bash
RUN_HOME=/home/tsegge
crawl=$RUN_HOME/crawl
nutch=$RUN_HOME/nutch-1.2/bin/nutch
urls=$RUN_HOME/urls/seed.txt
depth=6
topN=500
threads=10
adddays=30
echo "----- Inject (Step 1) -----"
$nutch inject $crawl/crawldb $urls
echo "----- Generate, Fetch, Parse, Update (Step 2) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $nutch generate $crawl/crawldb $crawl/segments -topN $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth `expr $i + 1`. No more URLs to fetch."
    break
  fi
  segment=`ls -d $crawl/segments/* | tail -1`
  echo "--- fetch at depth `expr $i + 1` of $depth ---"
  $nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch of $segment at depth `expr $i + 1` failed. Deleting it."
    rm -rf $segment
    continue
  fi
  echo "--- updatedb at depth `expr $i + 1` of $depth ---"
  $nutch updatedb $crawl/crawldb $segment
done
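One note on the segment selection: the `ls -d $crawl/segments/* | tail -1` line works because generate names each segment after the current timestamp (YYYYMMDDHHMMSS), so lexical sort order matches creation order. A minimal standalone sketch (with made-up segment names) showing that tail -1 picks the newest segment:

```shell
#!/bin/sh
# Segments are named YYYYMMDDHHMMSS, so plain lexical sorting by ls
# puts the newest one last (segment names below are made up).
dir=$(mktemp -d)
mkdir "$dir/20110630235959" "$dir/20110701120000" "$dir/20110708093015"
segment=$(ls -d "$dir"/* | tail -1)
echo "$segment"      # path of the most recently "generated" segment
basename "$segment"  # prints 20110708093015
rm -rf "$dir"
```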
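For completeness, since merging and indexing happen in a separate script: a rough sketch of what that step could look like with the Nutch 1.2 command line. The MERGEDsegments directory name is illustrative only, and the nutch() stub just echoes each command so the sketch dry-runs anywhere; drop the stub and point $nutch at the real binary to run it for real.

```shell
#!/bin/sh
# Dry-run sketch of a separate merge/index step (Nutch 1.2 commands).
# The nutch() stub echoes instead of executing, so no Nutch install is needed.
nutch() { echo "nutch $*"; }
crawl=/home/tsegge/crawl

# Merge all per-depth segments into one (output dir name is illustrative)
nutch mergesegs $crawl/MERGEDsegments -dir $crawl/segments

# Build the link database from the merged segment
nutch invertlinks $crawl/linkdb -dir $crawl/MERGEDsegments

# Create the Lucene index from crawldb, linkdb, and the merged segment
nutch index $crawl/indexes $crawl/crawldb $crawl/linkdb $crawl/MERGEDsegments/*
```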
Kind regards
Thomas Eggebrecht
2011/7/8 lewis john mcgibbney <[email protected]>
> [...]
> Can you explain more about your crawling operation? Are you executing a
> crawl command? If so what arguements are you passing?
>
> If not can you give more detail of the job you are running
> [...]
> --
> *Lewis*
>