On 2010-07-02 12:04, dc tech wrote:
> We are a fairly big Solr shop for most things except (intranet) web crawling,
> where we use commercial software. Generally, the PageRank algorithm does a
> good job of finding the top pages (these tend to be the home pages of
> sites/subsites). A simple Solr/Lucene index does not yield great results
> because many pages have similar content, so we are looking to see if we can
> use Nutch for crawling the intranet.
>
> Does Nutch 1.1 support a PageRank/LinkRank type of model (I understand that
> would be the OPIC algorithm)?
> Can we use the NewScoring with 1.1?
> http://wiki.apache.org/nutch/NewScoring

Although OPIC scoring is still the default in Nutch 1.1, you should use
the new LinkGraph scoring described on that page. The OPIC implementation
in Nutch is likely still broken, and even if it were properly implemented,
in my opinion the OPIC algorithm itself is unstable in the presence of a
changing webgraph and incremental crawling. The original paper tries to
solve this by smoothing over a history of past scores, but IMHO it's a
kludge. The LinkGraph tools don't suffer from this problem, because they
do the classical multiple iterations over a fixed version of the link graph.
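
To make "multiple iterations over a fixed version of the link graph" concrete, here is a minimal Python sketch of PageRank-style power iteration. It is not the Nutch code: the function name link_rank, the damping factor of 0.85, the iteration count, and the toy graph are all assumptions chosen for illustration.

```python
def link_rank(graph, damping=0.85, iterations=20):
    """Iterate PageRank-style scores over a fixed link graph.

    graph maps each page to the list of pages it links to.
    The graph is never modified while scores are being computed.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    scores = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Score held by pages without outlinks is spread evenly over all pages.
        dangling = sum(scores[page] for page in pages if not graph.get(page))
        base = (1.0 - damping) / n + damping * dangling / n
        new_scores = {page: base for page in pages}

        # Each page passes an equal share of its score along every outlink.
        for page, targets in graph.items():
            if not targets:
                continue
            share = damping * scores[page] / len(targets)
            for target in targets:
                new_scores[target] += share

        scores = new_scores
    return scores


# Hypothetical toy intranet: every subpage links back to the home page.
toy_graph = {
    "home": ["about", "docs"],
    "about": ["home"],
    "docs": ["home", "about"],
    "orphan": ["home"],
}
ranks = link_rank(toy_graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Because every pass reads the same graph snapshot, the scores settle toward a fixed point instead of depending on the order in which pages happened to be fetched. In this toy graph the home page ends up with the highest score and the page with no inlinks ends up with the lowest.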
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com