What was the original fetch interval between successive crawls? Your script looks fine, which also suggests that crawling itself is not the problem. You mentioned that the domain which is being fetched more than the others seems to receive a higher score than the other sites; how did you ascertain this? I know that this is a simple suggestion, but could it possibly be the case that -topN 500 exceeds the number of pages in the domains which are not being fetched at subsequent recrawls?
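One way to check the per-domain scores (a sketch, assuming the Nutch 1.x CrawlDb layout and the paths from your script below) would be to inspect the CrawlDb with readdb:

```shell
#!/bin/sh
# Sketch: inspect CrawlDb scores (assumes Nutch 1.x and the paths from the script below)
RUN_HOME=/home/tsegge
crawl=$RUN_HOME/crawl
nutch=$RUN_HOME/nutch-1.2/bin/nutch

# Aggregate statistics, including min/avg/max score and counts by fetch status
$nutch readdb $crawl/crawldb -stats

# Dump the individual entries so per-URL scores can be compared across domains
$nutch readdb $crawl/crawldb -dump $RUN_HOME/crawldb-dump
```

Grepping the dump output for each domain would then show whether one of them really is accumulating higher scores than the others.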
On Mon, Jul 11, 2011 at 2:14 PM, Thomas Eggebrecht <[email protected]> wrote:
> Hi Lewis,
> No, I don't use the crawl command. I use an adapted step-by-step script from
> the Wiki and Nutch is locally running on a single server. The attached
> script is without merging and indexing, which is a separate step in my
> workflow. My (fetch-)workflow is:
> - inject
> - generate
> - fetch
> - updatedb
>
> Please see my complete (fetch-)script:
>
> #!/bin/sh
> RUN_HOME=/home/tsegge
> crawl=$RUN_HOME/crawl
> nutch=$RUN_HOME/nutch-1.2/bin/nutch
> urls=$RUN_HOME/urls/seed.txt
>
> depth=6
> topN=500
> threads=10
> adddays=30
>
> echo "----- Inject (Step 1) -----"
> $nutch inject $crawl/crawldb $urls
> echo "----- Generate, Fetch, Parse, Update (Step 2) -----"
> for ((i=0; i < $depth; i++))
> do
>   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>   $nutch generate $crawl/crawldb $crawl/segments -topN $topN -adddays $adddays
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>     break
>   fi
>   segment=`ls -d $crawl/segments/* | tail -1`
>   echo "--- fetch at depth `expr $i + 1` of $depth ---"
>   $nutch fetch $segment -threads $threads
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: fetch $segment at depth $depth failed. Deleting it."
>     rm -rf $segment
>     continue
>   fi
>   echo "--- updatedb at depth `expr $i + 1` of $depth ---"
>   $nutch updatedb $crawl/crawldb $segment
> done
>
> Kind regards
> Thomas Eggebrecht
>
>
> 2011/7/8 lewis john mcgibbney <[email protected]>
>
> > [...]
> > Can you explain more about your crawling operation? Are you executing a
> > crawl command? If so what arguments are you passing?
> >
> > If not can you give more detail of the job you are running
> > [...]
> > --
> > *Lewis*
>

--
*Lewis*

