Hi Sebastian,
Thanks for the very useful hints. Will work on it.
Best regards
José-Marcio
On 07/06/2016 09:50 AM, Sebastian Nagel wrote:
Hi José-Marcio,
it's possible to do the indexing at the end, or somewhere in the middle,
indexing multiple segments in one turn. The same applies to LinkDb
updates (for anchor texts) and, optionally, the link rank calculation.
If there is no need to update the index as soon as possible
(updating right away is what most users probably want), you could change
the crawl script: keep the fetched segments in a list and pass them to the
"invertlinks" (if desired) and "index" tools.
If the crawl runs only once and is started from scratch the next time,
the "-dir" argument allows you to index all segments in one turn.
Cheers,
Sebastian
On 07/03/2016 09:49 AM, Jose Marcio Martins da Cruz wrote:
Hello
The bin/crawl algorithm looks something like this:
*******************************
# prepare
inject

while ...
do
    # crawl
    generate
    fetch
    parse

    # post-processing
    updatedb
    invertlinks
    dedup

    # do index
    if $DoIndex
    then
        index
        clean
    fi

    # do webgraph
    if $DoWebgraph
    then
        webgraph
        linkrank
        scoreupdater
        nodedumper
    fi
done
*******************************
Is there a reason to keep the "index" and "webgraph" parts inside the loop?
What happens if I put them outside the loop and run them after all rounds?
What about the "post-processing" part?
Note: I'm crawling in small rounds (30 minutes) because the "Crawl-delay"
values of the sites I'm crawling are heterogeneous, and doing multiple
small rounds is more efficient than a single long round.
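In case it's useful, each round is bounded roughly like this;
"fetcher.timelimit.mins" is the standard Nutch property for capping a
fetch round, and the thread count here is just an example:
*******************************
# stop fetching after ~30 minutes, whatever the sites' Crawl-delay
bin/nutch fetch -D fetcher.timelimit.mins=30 "$SEGMENT" -threads 50
*******************************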
Regards
José-Marcio