Hi Sebastian,
Thanks for the very useful hints. Will work on it.
Best regards
José-Marcio
On 07/06/2016 09:50 AM, Sebastian Nagel wrote:
Hi José-Marcio,
it's possible to do the indexing at the end, or somewhere in the middle,
indexing multiple segments in one turn. The same applies to LinkDb
updates (for anchor texts) and, optionally, the link rank calculation.
If there is no need to update the index as soon as possible
(updating right away is what most users probably want), you could change
the crawl script: keep the fetched segments in a list and pass them to the
"invertlinks" (if desired) and "index" tools.
If the crawl runs only once and is started from scratch the next time,
the "-dir" argument allows you to index all segments in one turn.
Cheers,
Sebastian
On 07/03/2016 09:49 AM, Jose Marcio Martins da Cruz wrote:
Hello
The bin/crawl algorithm looks something like this:
*******************************
# prepare
inject

while ...
do
    # crawl
    generate
    fetch
    parse

    # post-processing
    updatedb
    invertlinks
    dedup

    # do index
    if $DoIndex
    then
        index
        clean
    fi

    # do webgraph
    if $DoWebgraph
    then
        webgraph
        linkrank
        scoreupdater
        nodedumper
    fi
done
*******************************
Is there a reason to keep the "index" and "webgraph" parts inside the loop?
What happens if I put them outside the loop and run them after all rounds?
What about the "post-processing" part?
Note: I'm crawling in small rounds (30 minutes) because the "Crawl-delay"
values of the sites I'm crawling are heterogeneous, and doing multiple
small rounds is more efficient than a single long round.
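In case it's useful, each round is bounded roughly like this;
"fetcher.timelimit.mins" is the standard Nutch property for capping a
fetch round, and the thread count here is just an example:
*******************************
# stop fetching after ~30 minutes, whatever the sites' Crawl-delay
bin/nutch fetch -D fetcher.timelimit.mins=30 "$SEGMENT" -threads 50
*******************************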
Regards
José-Marcio