Hello - please see inline.

M.

-----Original message-----
> From: Otis Gospodnetić <[email protected]>
> Sent: Friday 15th January 2016 22:05
> To: Nutch User List <[email protected]>
> Subject: Handling large scale incremental PageRank updates
>
> Hello,
>
> We are working on a very large scale crawl (many billions of web pages)
> that needs to make use of link/page rank. Because page rank for a page P
> changes as more links to page P are discovered, one really ought to
> periodically update the rank of the previously indexed page P.

I don't think changing rank is going to be the big problem. The graph changes quickly only at first; the scores are quite stable for long-term crawls and recrawls. Also, if you intend to calculate linkrank frequently, you are going to need lots of hardware: it is CPU intensive and it needs several runs for very large crawls.

>
> This is not a problem for small crawls, but for large ones this is a
> problem if one tries to just reindex previously existing pages - reindexing
> is not cheap and if you've indexed hundreds of millions or billions of
> pages, reindexing them will take a long time and require a lot of resources.

Yes, but if you plan for large scale, your search engine is going to be large scale too, right? And are you not going to recrawl periodically?

>
> How do people normally handle that with Solr or Elasticsearch at large
> scale?
>
> With Solr, do people stick the rank in the External File Field, for example?

Yes, you can do that. It is very efficient, but you must take care of the sharding yourself. Solr won't take a big file and send it hashed to various shards.

>
> With Elasticsearch, do people store pageID => pageRank info in an external
> store (e.g. Redis) and pull it from there to use when scoring search
> results? Or maybe that, too, would be too slow when the number of matches
> is high? Elasticsearch rescore to the rescue?

That should not be a problem. Solr can also do query reranking. If you can request a batch of URL scores via a single call, it should be quite efficient and would be the approach I would begin with.

>
> Or are there better, more scalable ways to handle this?
>
> Thanks,
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
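
To make the batch lookup idea concrete, here is a minimal Python sketch of reranking the top hits with externally stored link scores. Everything in it is an assumption for illustration: the names `rerank`, `fetch_from_dict` and `RANK_STORE` are hypothetical, the in-memory dict only stands in for an external store such as Redis (where a single multi-key read like MGET would play the same role), and adding the weighted rank to the text score is just one possible way to combine the two.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (url, text relevance score as returned by the search engine)
Hit = Tuple[str, float]

# Hypothetical stand-in for an external url => link score store.
RANK_STORE: Dict[str, float] = {
    "http://example.org/a": 0.1,
    "http://example.org/b": 0.5,
}


def fetch_from_dict(urls: Sequence[str]) -> Dict[str, float]:
    """Return the scores for all requested URLs in one batched lookup."""
    return {u: RANK_STORE[u] for u in urls if u in RANK_STORE}


def rerank(
    hits: Sequence[Hit],
    fetch_scores: Callable[[Sequence[str]], Dict[str, float]],
    weight: float = 1.0,
    default_rank: float = 0.0,
) -> List[Hit]:
    """Rescore only the top hits, using a single call to the score store.

    The combination (text score + weight * link score) is an assumption;
    URLs missing from the store fall back to default_rank.
    """
    urls = [url for url, _ in hits]
    ranks = fetch_scores(urls)  # one round trip for the whole batch
    rescored = [
        (url, score + weight * ranks.get(url, default_rank))
        for url, score in hits
    ]
    # Highest combined score first.
    return sorted(rescored, key=lambda hit: hit[1], reverse=True)
```

Called on the top N hits of a query, e.g. `rerank(top_hits, fetch_from_dict)`, this keeps the cost at one lookup per query instead of one per matching document, which is why a large number of matches need not be a problem.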

