Thank you for the answer - we will try that. As a follow-up, if we
were to use a SOLR back-end, are the pagerank calculation results
stored as an index-time boost in the SOLR index? If so, are these
boost values scaled to a fixed range, or can they be?

Thanks.


On 7/2/10, Andrzej Bialecki <[email protected]> wrote:
> On 2010-07-02 12:04, dc tech wrote:
>> We are a fairly big SOLR shop for most things except web crawling (intranet),
>> where we use commercial software. Generally, the PageRank algorithm does a
>> good job of finding the top pages (which tend to be the home pages of
>> sites/subsites).
>> A simple solr/lucene index doesn't yield great results because many
>> pages have similar content, hence we are looking to see if we can use
>> Nutch for crawling the intranet.
>>
>> Does Nutch 1.1 support PageRank/LinkRank type of model (I understand that
>> would be the OPIC algorithm?)
>> Can we use the NewScoring with 1.1?
>> http://wiki.apache.org/nutch/NewScoring
>>
>
> Although OPIC scoring is still the default in Nutch 1.1, you should use
> the new LinkGraph scoring described on that page. The OPIC implementation in
> Nutch is likely broken (still), and even if it were properly implemented,
> in my opinion the OPIC algorithm itself is unstable in the presence of a
> changing webgraph and incremental crawling. The original paper tries to
> solve this by smoothing over a history of past scores, but IMHO it's a
> kludge. The LinkGraph tools don't suffer from this problem, because they
> do the classical multiple iterations over a fixed version of the link graph.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
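
For reference, the iterate-over-a-fixed-link-graph approach mentioned in
the quoted reply can be sketched as a plain PageRank power iteration. This
is only an illustrative sketch in Python, not Nutch's LinkGraph code: the
function name, the damping factor of 0.85, the iteration count, and the
handling of pages without outlinks are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Power-iteration PageRank over a fixed link graph (illustrative only).

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution of score.
    scores = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the same small baseline share each round.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped score evenly among its outlinks.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # Assumption: a page with no outlinks spreads its score
                # evenly over all pages so the total stays constant.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical three-page intranet: both subpages link back to the home page.
example = {"home": ["a", "b"], "a": ["home"], "b": ["home"]}
ranks = pagerank(example, iterations=50)
```

Because the graph does not change between iterations, the scores settle
toward a fixed point, and the home page ends up ranked highest in this
small example.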

-- 
Sent from my mobile device
