Hello,

We are working on a very large-scale crawl (many billions of web pages)
that needs to make use of link/page rank.  Because the page rank of a page P
changes as more links to page P are discovered, one really ought to
periodically update the rank of a previously indexed page P.

This is not a problem for small crawls, but at large scale it becomes one
if you simply reindex previously indexed pages - reindexing is not cheap,
and if you've indexed hundreds of millions or billions of pages,
reindexing them all takes a long time and a lot of resources.

How do people normally handle that with Solr or Elasticsearch at large
scale?

With Solr, do people stick the rank in an ExternalFileField, for example?
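(For context, a minimal ExternalFileField setup might look like the sketch below; the field and file names are illustrative, and keyField must point at the schema's unique key:)

```
<!-- schema.xml: rank lives outside the index, keyed by document id -->
<fieldType name="externalRank" class="solr.ExternalFileField"
           keyField="id" defVal="0"/>
<field name="pagerank" type="externalRank" indexed="false" stored="false"/>
```

The ranks themselves would sit in a file named external_pagerank in the index data directory, one "docKey=rank" line per page, and could be used at query time via a function query boost such as bf=field(pagerank).  Updating ranks then means rewriting that file and committing, not reindexing documents.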

With Elasticsearch, do people store pageID => pageRank info in an external
store (e.g. Redis) and pull it from there to use when scoring search
results?  Or maybe that, too, would be too slow when the number of matches
is high?  Elasticsearch rescore to the rescue?
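(To make the rescore idea concrete, here is a rough sketch, assuming the ranks are bulk-updated into a numeric pagerank field rather than fetched from Redis at query time; the index and field names are illustrative:)

```
POST /pages/_search
{
  "query": { "match": { "body": "some query" } },
  "rescore": {
    "window_size": 500,
    "query": {
      "rescore_query": {
        "function_score": {
          "field_value_factor": { "field": "pagerank", "missing": 0.1 }
        }
      },
      "query_weight": 1.0,
      "rescore_query_weight": 2.0
    }
  }
}
```

That way only the top N hits per shard pay the rank-lookup cost, which sidesteps the too-many-matches problem - but it still assumes the rank field gets refreshed in bulk somehow.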

Or are there better, more scalable ways to handle this?

Thanks,
Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
