Hi José-Marcio,

it's possible to do the indexing at the end, or somewhere in the middle, indexing multiple segments in one turn. The same applies to LinkDb updates (for anchor texts) and, optionally, the link rank calculation.
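Both tools accept a list of segments in a single call, so "one turn" means something like the following (the segment names here are made-up placeholders):

  bin/nutch invertlinks crawl/linkdb crawl/segments/20160703121500 crawl/segments/20160703150000
  bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/20160703121500 crawl/segments/20160703150000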
If there is no need to update the index as soon as possible (immediate updates are probably what most users want), you could change the crawl script: keep the fetched segments in a list and pass them to the "invertlinks" (if desired) and "index" tools; a rough sketch follows below the quoted message. If the crawl runs only once and is started from scratch the next time, the "-dir" argument allows you to index all segments in one turn.

Cheers,
Sebastian

On 07/03/2016 09:49 AM, Jose Marcio Martins da Cruz wrote:
>
> Hello
>
> The bin/crawl algorithm looks something like:
>
> *******************************
> # prepare
> inject
>
> while ...
>   # crawl
>   generate
>   fetch
>   parse
>
>   # post-processing
>   updatedb
>   invertlinks
>   dedup
>
>   # do index
>   if $DoIndex
>   then
>     index
>     clean
>   endif
>
>   # do webgraph
>   if $DoWebgraph
>   then
>     webgraph
>     linkrank
>     scoreupdater
>     nodedumper
>   endif
> done
> ************************
>
> Is there a reason to keep the "index" and "webgraph" parts inside the loop?
>
> What happens if I put them outside the loop and run them after all rounds?
> What about the "post-processing" part?
>
> OBS: I'm crawling in small rounds (30 minutes) because the "Crawl-delay"
> values of the sites I'm crawling are heterogeneous, and doing multiple
> small rounds is more efficient than a single long round.
>
> Regards
>
> José-Marcio
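As a concrete illustration of the change suggested above, here is a minimal sketch; CRAWL_PATH and LIMIT are placeholder names, and the segment-selection line should be adapted to your version of the script:

*******************************
SEGMENTS=""

for ((i = 1; i <= LIMIT; i++)); do
  # generate / fetch / parse / updatedb / dedup as before ...

  # remember the segment fetched in this round
  # (segment directories are named by timestamp, so the newest sorts last)
  SEGMENT=$(ls -d "$CRAWL_PATH"/segments/* | sort | tail -1)
  SEGMENTS="$SEGMENTS $SEGMENT"
done

# after all rounds: one LinkDb update and one indexing run over all segments
bin/nutch invertlinks "$CRAWL_PATH"/linkdb $SEGMENTS
bin/nutch index "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb $SEGMENTS
*******************************

If the segments directory contains only this crawl's segments, the last two commands can use "-dir $CRAWL_PATH/segments" instead of the explicit list.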

