Thank you for the answer - we will try that. As a follow-up, if we were to use a SOLR back-end, are the PageRank calculation results stored as an index-time boost in the SOLR index? If so, are these boost values scaled to a fixed range, or can they be?
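To make the scaling question concrete, here is a minimal sketch of the kind of rescaling we have in mind. It is purely illustrative: the function name `scale_to_range`, the example URLs, and the target range of 1.0 to 10.0 are our own assumptions, and nothing here is a Nutch or SOLR API.

```python
def scale_to_range(scores, lo=1.0, hi=10.0):
    """Linearly rescale raw link scores into the range [lo, hi].

    `scores` maps a URL to its raw link-analysis score. The names and
    the default range are hypothetical, chosen only for illustration.
    """
    if not scores:
        return {}
    smin = min(scores.values())
    smax = max(scores.values())
    if smax == smin:
        # All scores are equal: map every page to the midpoint of the range.
        return {url: (lo + hi) / 2.0 for url in scores}
    span = smax - smin
    # Min-max scaling: smin maps to lo, smax maps to hi.
    return {url: lo + (s - smin) * (hi - lo) / span for url, s in scores.items()}


if __name__ == "__main__":
    raw = {
        "http://intranet.example/": 10.0,
        "http://intranet.example/hr/": 5.0,
        "http://intranet.example/hr/form.html": 0.0,
    }
    for url, boost in scale_to_range(raw, lo=1.0, hi=3.0).items():
        print(url, boost)
```

A linear mapping like this keeps the relative ordering of pages; whether a linear or a logarithmic mapping is more appropriate for boosts is part of what we are asking.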
Thanks.

On 7/2/10, Andrzej Bialecki <[email protected]> wrote:
> On 2010-07-02 12:04, dc tech wrote:
>> We are a fairly big SOLR shop for most things except web crawling (intranet),
>> where we use commercial software. Generally, the PageRank algorithm does a
>> good job of finding the top pages (these tend to be the home pages of
>> sites/subsites).
>> A simple solr/lucene index doesn't yield great results because many pages
>> have similar content, hence we are looking to see if we can use Nutch for
>> crawling the intranet.
>>
>> Does Nutch 1.1 support a PageRank/LinkRank type of model (I understand that
>> would be the OPIC algorithm?)
>> Can we use the NewScoring with 1.1?
>> http://wiki.apache.org/nutch/NewScoring
>>
>
> Although the OPIC scoring is still default in Nutch 1.1, you should use
> the new LinkGraph scoring described on that page. OPIC implementation in
> Nutch is likely broken (still), and even if it were properly implemented
> in my opinion the OPIC algorithm itself is unstable in presence of a
> changing webgraph and incremental crawling. The original paper tries to
> solve this by smoothing over a history of past scores, but IMHO it's a
> kludge. The LinkGraph tools don't suffer from this problem, because they
> do the classical multiple iterations over a fixed version of link graph.
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>

--
Sent from my mobile device

