RE: Handling large scale incremental PageRank updates

Markus Jelsma Mon, 18 Jan 2016 07:11:51 -0800

Hello - please see inline.
M.
 
-----Original message-----
> From:Otis Gospodnetić <[email protected]>
> Sent: Friday 15th January 2016 22:05
> To: Nutch User List <[email protected]>
> Subject: Handling large scale incremental PageRank updates
> 
> Hello,
> 
> We are working on a very large scale crawl (many billions of web pages)
> that needs to make use of link/page rank.  Because page rank for a page P
> changes as more links to page P are discovered, one really ought to
> periodically update the rank of the previously indexed page P.


I don't think changing rank is going to be the big problem. The graph will 
change quickly only at first; the scores are quite stable for long-term crawls 
and recrawls. Also, if you intend to calculate linkrank frequently, you are 
going to need lots of hardware. It is CPU intensive and it needs several runs 
for very large crawls.
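
To illustrate why the computation needs several passes over the graph, here 
is a minimal power-iteration sketch of a PageRank-style score. It is not 
Nutch's linkrank code; the function name, the damping value and the in-memory 
dict of links are assumptions made for the example only.

```python
def pagerank(links, damping=0.85, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes an equal share of its rank to each page it links to.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank
```

Each iteration touches every link once, so on a graph of billions of pages 
every extra pass is a full distributed job.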

> 
> This is not a problem for small crawls, but for large ones this is a
> problem if one tries to just reindex previously existing pages - reindexing
> is not cheap and if you've indexed hundreds of millions or billions of
> pages, reindexing them will take a long time and require a lot of resources.

Yes, but if you plan for large scale, your search engine is going to be large 
scale too, right? And are you not going to recrawl periodically? 

> 
> How do people normally handle that with Solr or Elasticsearch at large
> scale?
> 
> With Solr, do people stick the rank in the External File Field, for example?

Yes, you can do that. It is very efficient but you must take care of the 
sharding yourself. Solr won't take a big file and send it hashed to various 
shards.
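
A rough sketch of that splitting step, assuming the external file uses 
key=value lines: the scores are partitioned into one list of lines per shard. 
The CRC32 routing below is only a stand-in and is not how Solr routes 
documents; it has to be replaced by whatever routing the collection really 
uses, and the function names are made up for the example.

```python
import zlib

def shard_for(doc_id, num_shards):
    # Stand-in routing: CRC32 modulo shard count. This is NOT Solr's router;
    # replace it with the routing your collection really uses.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def split_scores(scores, num_shards):
    """Group 'key=value' lines per shard so each shard gets only its own keys."""
    shards = [[] for _ in range(num_shards)]
    for doc_id, score in scores.items():
        shards[shard_for(doc_id, num_shards)].append(f"{doc_id}={score}")
    return shards
```

Each list would then be written to the external file of the matching shard.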

> 
> With Elasticsearch, do people store pageID => pageRank info in an external
> store (e.g. Redis) and pull it from there to use when scoring search
> results?  Or maybe that, too, would be too slow when the number of matches
> is high?  Elasticsearch rescore to the rescue?

That should not be a problem. Solr can also do query reranking. If you can 
request a batch of URL scores via a single call, it should be quite efficient 
and would be the approach I would begin with.
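
A minimal sketch of that idea, assuming only the top-N hits are rescored: 
the ranks for all of them are fetched in one batched lookup and then combined 
with the text score. The dict stands in for the external store, and the 
multiplicative formula and the weight parameter are assumptions for the 
example, not something Solr or Elasticsearch prescribes.

```python
def fetch_ranks(store, urls):
    # One batched lookup for all URLs; with Redis this would be a single MGET.
    return [store.get(u, 0.0) for u in urls]

def rerank(hits, store, weight=1.0):
    """hits is a list of (url, text_score) for the top-N results only."""
    ranks = fetch_ranks(store, [url for url, _ in hits])
    # Boost each text score by the page's rank; unknown pages get no boost.
    rescored = [(url, score * (1.0 + weight * r))
                for (url, score), r in zip(hits, ranks)]
    return sorted(rescored, key=lambda h: h[1], reverse=True)
```

Because only the top-N window is rescored, the cost does not grow with the 
total number of matches.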

> 
> Or are there better, more scalable ways to handle this?
> 
> Thanks,
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
