On 19.07.2011 15:16, Markus Jelsma wrote:
On Tuesday 19 July 2011 15:14:31 Marek Bachmann wrote:
On 19.07.2011 15:03, Markus Jelsma wrote:
That's a familiar problem with OPIC scoring. Maybe you can migrate to using
WebGraph? It's really powerful, and the scores are recalculated each time.
Thanks for the very quick reply. I really don't have a favourite for
page scoring in Nutch. Actually, I'm not even aware of the pros and
cons of the different scoring types. (I should read up on that :) )
All I know is that the scores it produces help me find the
pages that are more popular because many other sites link to them, am I
right?
Yes. WebGraph can do this.
I guess I have to change the scoring plugin somewhere in
nutch-site.xml? I'll have a look. Is there anything I have to watch out
for when I replace the scoring?
Not quite. You'll need separate programs to do this. Luckily they are bundled
with Nutch. Check the wiki:
http://wiki.apache.org/nutch/NewScoring
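For the archives: the separate programs mentioned above are runnable through the bin/nutch script in Nutch 1.x. A rough sketch of one WebGraph scoring pass follows; the crawl/ paths are placeholders for your own crawl directory layout, so adjust them before running:

```shell
# Sketch of a WebGraph scoring pass (Nutch 1.x).
# The crawl/ paths below are placeholders, not defaults.

# 1. Build/update the web graph from the fetched segments
bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb

# 2. Run the LinkRank analysis over the graph
bin/nutch linkrank -webgraphdb crawl/webgraphdb

# 3. Write the new scores back into the crawldb
bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb
```

Unlike OPIC, this recomputes scores over the whole link graph on each pass, which is why repeated runs don't inflate them.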
That looks really nice. How can I avoid the OPIC scoring while parsing?
Sounds like I won't need it anymore.
Thank you once again :)
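On avoiding OPIC during parsing: as far as I understand, OPIC lives in the scoring-opic plugin, so removing it from plugin.includes in nutch-site.xml should stop it from updating scores. A sketch of the override, assuming a typical Nutch 1.x default plugin list (your actual value may differ, so copy yours from nutch-default.xml and edit it):

```xml
<!-- Sketch, not a drop-in value: start from your own nutch-default.xml
     plugin.includes, drop scoring-opic, and (optionally) add
     scoring-link for use with WebGraph. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-link|urlnormalizer-(pass|regex|basic)</value>
</property>
```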
On Tuesday 19 July 2011 15:04:01 Marek Bachmann wrote:
Hi List,
while crawling a set of 2000 pages a couple of times, I noticed
that the page scores get higher and higher every time a crawl
cycle finishes. (No new pages are discovered; only known pages are
recrawled.)
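The inflation can be sketched with a toy model (this is not Nutch's actual code, just an illustration of the accumulation idea behind OPIC-style scoring): each cycle, every page donates its score, split evenly, to its outlinks, and the contributions are added on top of the old scores with no normalization. On a fixed link graph the scores can therefore only grow:

```python
# Toy model of OPIC-style score accumulation on recrawl.
# NOT Nutch's implementation -- just an illustration of why
# adding inlink contributions without normalization inflates
# scores on every cycle over a fixed set of pages.

def opic_cycle(scores, outlinks):
    """One recrawl cycle: each page donates score/outdegree to its
    outlinks; contributions are accumulated onto the old scores."""
    contrib = {p: 0.0 for p in scores}
    for page, targets in outlinks.items():
        if not targets:
            continue
        share = scores[page] / len(targets)
        for t in targets:
            contrib[t] += share
    # Accumulate instead of replace -> scores only ever grow.
    return {p: scores[p] + contrib[p] for p in scores}

if __name__ == "__main__":
    links = {"a": ["b"], "b": ["a"]}   # two pages linking to each other
    scores = {"a": 1.0, "b": 1.0}
    for cycle in range(3):
        scores = opic_cycle(scores, links)
        print(cycle, scores)           # scores double every cycle
```

With no new pages in the set, each recrawl still re-adds the inlink contributions, which matches the ever-growing scores described above; WebGraph/LinkRank avoids this by recomputing from scratch instead of accumulating.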
Is that behaviour correct?
Thanks,
Marek