What was the original fetch interval between successive crawls? Your script looks fine, which also suggests that crawling itself is not the problem. You mentioned that the domain which is being fetched more than the others seems to receive a higher score than the other sites; how did you ascertain this? I know that this is a simple suggestion, but could it possibly be the case that -topN 500 exceeds the number of pages in the domains which are not being fetched at subsequent recrawls?
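One way to check the per-domain scores (a sketch, assuming the Nutch 1.x CrawlDb layout and the paths from your script below) would be to inspect the CrawlDb with readdb:

```shell
#!/bin/sh
# Sketch: inspect CrawlDb scores (assumes Nutch 1.x and the paths from the script below)
RUN_HOME=/home/tsegge
crawl=$RUN_HOME/crawl
nutch=$RUN_HOME/nutch-1.2/bin/nutch

# Aggregate statistics, including min/avg/max score and counts by fetch status
$nutch readdb $crawl/crawldb -stats

# Dump the individual entries so per-URL scores can be compared across domains
$nutch readdb $crawl/crawldb -dump $RUN_HOME/crawldb-dump
```

Grepping the dump output for each domain would then show whether one of them really is accumulating higher scores than the others.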
On Mon, Jul 11, 2011 at 2:14 PM, Thomas Eggebrecht <[email protected]> wrote:
> Hi Lewis,
> No, I don't use the crawl command. I use an adapted step-by-step script from
> the Wiki and Nutch is locally running on a single server. The attached
> script is without merging and indexing, which is a separate step in my
> workflow. My (fetch-)workflow is:
> - inject
> - generate
> - fetch
> - updatedb
>
> Please see my complete (fetch-)script:
>
> #!/bin/sh
> RUN_HOME=/home/tsegge
> crawl=$RUN_HOME/crawl
> nutch=$RUN_HOME/nutch-1.2/bin/nutch
> urls=$RUN_HOME/urls/seed.txt
>
> depth=6
> topN=500
> threads=10
> adddays=30
>
> echo "----- Inject (Step 1) -----"
> $nutch inject $crawl/crawldb $urls
> echo "----- Generate, Fetch, Parse, Update (Step 2) -----"
> for ((i=0; i < $depth; i++))
> do
>   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>   $nutch generate $crawl/crawldb $crawl/segments -topN $topN -adddays $adddays
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>     break
>   fi
>   segment=`ls -d $crawl/segments/* | tail -1`
>   echo "--- fetch at depth `expr $i + 1` of $depth ---"
>   $nutch fetch $segment -threads $threads
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: fetch $segment at depth $depth failed. Deleting it."
>     rm -rf $segment
>     continue
>   fi
>   echo "--- updatedb at depth `expr $i + 1` of $depth ---"
>   $nutch updatedb $crawl/crawldb $segment
> done
>
> Kind regards
> Thomas Eggebrecht
>
>
> 2011/7/8 lewis john mcgibbney <[email protected]>
>
> > [...]
> > Can you explain more about your crawling operation? Are you executing a
> > crawl command? If so what arguments are you passing?
> >
> > If not can you give more detail of the job you are running
> > [...]
> > --
> > *Lewis*
>

--
*Lewis*

