Note that in the Nutch 2 branch only OPIC is implemented. If you want to move
to it in the future, that might be a problem.
On Fri, 21 Feb 2014, 16:52:35 Markus Jelsma wrote:
Hi - you can safely forget about OPIC, it is useless in continuous crawls.
LinkRank, however, only works well on very large crawls, with many hosts. It
can work for single hosts (do not ignore internal links) but the graph will
become very dense; that's where the IO and CPU time comes from. We don't use the
LinkRank score in Solr at all because results are already very relevant due to
other (less costly) measures.
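
To illustrate why a dense graph is expensive, here is a minimal sketch of a generic PageRank-style power iteration. It is a toy, not Nutch's LinkRank code: the function name, the damping value and the iteration count are all assumptions chosen for illustration. The point is only that every iteration walks every edge once, so the work grows with the number of links, not just the number of pages.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Toy power-iteration PageRank over a dict {page: [outlinked pages]}.

    Returns (scores, edge_visits). Illustration only, not Nutch's LinkRank.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    scores = {p: 1.0 / n for p in pages}
    edge_visits = 0  # counts how many edges were touched in total

    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                # Dangling page: spread its score evenly over all pages.
                share = damping * scores[page] / n
                for p in pages:
                    new[p] += share
                continue
            share = damping * scores[page] / len(targets)
            for target in targets:
                new[target] += share
                edge_visits += 1
        scores = new
    return scores, edge_visits


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    scores, visits = pagerank(graph)
    print(scores, visits)
```

With internal links kept, a single host can contribute a very large number of edges per page, and each of them is visited in every iteration.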
You can do it, but you will need some serious hardware. Also, there is the
problem of frequently changing scores: since you are not frequently updating all
documents in Solr, using ExternalFileField may help.
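
As a rough sketch of the ExternalFileField route: Solr reads the values for such a field from a plain text file named external_<fieldname> in the index data directory, one key=value pair per line, so scores can be refreshed without reindexing the documents. The helper below only writes that file. The field name "linkrank", the function name and the way scores are obtained are assumptions for illustration; the field still has to be declared in schema.xml with a matching key field.

```python
import os
import tempfile


def write_external_scores(scores, data_dir, field_name="linkrank"):
    """Write {doc_key: score} as key=value lines to external_<field_name>.

    `field_name` is a hypothetical field; adjust it to your schema.
    """
    path = os.path.join(data_dir, "external_" + field_name)
    # Write to a temporary file first and rename it afterwards, so a
    # half-written file is never visible under the final name.
    fd, tmp_path = tempfile.mkstemp(dir=data_dir, prefix=".tmp_scores_")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for key in sorted(scores):
            out.write("%s=%s\n" % (key, scores[key]))
    os.replace(tmp_path, path)
    return path


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    print(write_external_scores({"doc1": 1.25, "doc2": 0.5}, demo_dir))
```

The new values are picked up when Solr opens a new searcher, for example after a commit.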
-----Original message-----
From: Tobias Marx<[email protected]>
Sent: Friday 21st February 2014 16:29
To: [email protected]
Subject: PageRank or Opic?
Hi!
We're using Nutch (1.7) and Solr 3.6 to index about 80k pages on several
hundred different hosts.
This works quite well, but there is still room for improvement in search result ranking
and "relevancy".
When using Nutch and Solr there are basically two values that influence the score of a
query result (correct me if I'm wrong): the score from Nutch, which becomes the
"boost" value in Solr, and the boost value from Solr, which is e.g. calculated
at query time.
The score in Nutch is calculated either by the "scoring-opic" plugin or with the "webgraph"
toolchain described here: http://wiki.apache.org/nutch/NewScoringIndexingExample
which gives the PageRank/LinkRank (btw. what about
the "scoring-link" plugin? Does it do anything at all? What is its role in this?).
We've been playing around with PageRank lately and its scores look a little
better than with OPIC, but on the downside, the calculation takes very long
and is very CPU intensive.
Well, to cut a long story short, what is your opinion on this? Which ranking do
you use? Is PageRank worth the trouble? How do you boost solr queries (if you
use solr at all)?
BR,
--
Tobias Marx
Zentrum für Informations- und Medienverarbeitung - ZIM
Bergische Universität Wuppertal
Büro: T.11.08
+49 202 439 2237
[email protected]