Hi Lewis,
No, I don't use the crawl command. I use an adapted step-by-step script from
the Wiki, and Nutch runs locally on a single server. The attached script
does not include merging and indexing, which are separate steps in my
workflow. My (fetch) workflow is:
- inject
- generate
- fetch
- updatedb
Please see my complete (fetch-)script:
#!/bin/bash
RUN_HOME=/home/tsegge
crawl=$RUN_HOME/crawl
nutch=$RUN_HOME/nutch-1.2/bin/nutch
urls=$RUN_HOME/urls/seed.txt
depth=6
topN=500
threads=10
adddays=30
echo "----- Inject (Step 1) -----"
$nutch inject $crawl/crawldb $urls
echo "----- Generate, Fetch, Parse, Update (Step 2) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $nutch generate $crawl/crawldb $crawl/segments -topN $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth `expr $i + 1`. No more URLs to fetch."
    break
  fi
  segment=`ls -d $crawl/segments/* | tail -1`
  echo "--- fetch at depth `expr $i + 1` of $depth ---"
  $nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch of $segment at depth `expr $i + 1` failed. Deleting it."
    rm -rf $segment
    continue
  fi
  echo "--- updatedb at depth `expr $i + 1` of $depth ---"
  $nutch updatedb $crawl/crawldb $segment
done
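One note on the segment selection: the `ls -d $crawl/segments/* | tail -1` line works because generate names each segment after the current timestamp (YYYYMMDDHHMMSS), so lexical sort order matches creation order. A minimal standalone sketch (with made-up segment names) showing that tail -1 picks the newest segment:

```shell
#!/bin/sh
# Segments are named YYYYMMDDHHMMSS, so plain lexical sorting by ls
# puts the newest one last (segment names below are made up).
dir=$(mktemp -d)
mkdir "$dir/20110630235959" "$dir/20110701120000" "$dir/20110708093015"
segment=$(ls -d "$dir"/* | tail -1)
echo "$segment"      # path of the most recently "generated" segment
basename "$segment"  # prints 20110708093015
rm -rf "$dir"
```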
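For completeness, since merging and indexing happen in a separate script: a rough sketch of what that step could look like with the Nutch 1.2 command line. The MERGEDsegments directory name is illustrative only, and the nutch() stub just echoes each command so the sketch dry-runs anywhere; drop the stub and point $nutch at the real binary to run it for real.

```shell
#!/bin/sh
# Dry-run sketch of a separate merge/index step (Nutch 1.2 commands).
# The nutch() stub echoes instead of executing, so no Nutch install is needed.
nutch() { echo "nutch $*"; }
crawl=/home/tsegge/crawl

# Merge all per-depth segments into one (output dir name is illustrative)
nutch mergesegs $crawl/MERGEDsegments -dir $crawl/segments

# Build the link database from the merged segment
nutch invertlinks $crawl/linkdb -dir $crawl/MERGEDsegments

# Create the Lucene index from crawldb, linkdb, and the merged segment
nutch index $crawl/indexes $crawl/crawldb $crawl/linkdb $crawl/MERGEDsegments/*
```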
Kind regards
Thomas Eggebrecht
2011/7/8 lewis john mcgibbney <[email protected]>
> [...]
> Can you explain more about your crawling operation? Are you executing a
> crawl command? If so what arguements are you passing?
>
> If not can you give more detail of the job you are running
> [...]
> --
> *Lewis*
>