Thanks for that, Sebastian. Following your hint, I'm trying to generate
the scores using this example:
https://wiki.apache.org/nutch/NewScoringIndexingExample
But when it gets to the LinkRank step I get:
2014-03-26 02:57:14,208 INFO webgraph.LinkRank - Analysis: starting at 2014-03-26 02:57:14
2014-03-26 02:57:14,913 INFO webgraph.LinkRank - Starting link counter job
2014-03-26 02:57:17,927 INFO webgraph.LinkRank - Finished link counter job
2014-03-26 02:57:17,928 INFO webgraph.LinkRank - Reading numlinks temp file
2014-03-26 02:57:17,932 ERROR webgraph.LinkRank - LinkAnalysis: java.io.IOException: No links to process, is the webgra$
    at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:132)
    at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:622)
    at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:702)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:668)
I can see the webgraph directory got created and there are directories and
files in there, but I'm guessing something is not getting populated
correctly. Any clue what I may be doing wrong?
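
To check that I understood your point about a link graph seeded entirely
with 0.0 scores, I wrote a toy sketch. It is not Nutch's OPIC or LinkRank
code: the propagate() function, the page names, and the "split the score
evenly among outlinks" rule are all my own simplifying assumptions.

```python
# Toy model only: NOT Nutch's OPIC or LinkRank implementation.
# Assumption: each round, a page hands its whole score to its outlinks
# in equal shares; a page without outlinks keeps its score.

def propagate(scores, links, rounds=10):
    """Redistribute scores along links for a fixed number of rounds."""
    for _ in range(rounds):
        nxt = {page: 0.0 for page in scores}
        for page, score in scores.items():
            targets = links.get(page, [])
            if not targets:
                nxt[page] += score  # dangling page keeps its score
                continue
            share = score / len(targets)
            for target in targets:
                nxt[target] += share
        scores = nxt
    return scores


# Hypothetical three-page graph.
links = {"seed": ["a", "b"], "a": ["b"], "b": ["seed"]}

# All pages start at 0.0, as when every URL comes from FreeGenerator.
zero_seeded = propagate({"seed": 0.0, "a": 0.0, "b": 0.0}, links)

# One page starts at 1.0, as with Injector's db.score.injected default.
unit_seeded = propagate({"seed": 1.0, "a": 0.0, "b": 0.0}, links)

print("total with 0.0 seeds:", sum(zero_seeded.values()))
print("total with one 1.0 seed:", sum(unit_seeded.values()))
```

In this toy model the total score is only moved around, never created, so
the all-zero start stays at zero on every page, while a single seed at 1.0
gets spread over the linked pages.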
On Tue, Mar 25, 2014 at 4:15 PM, Sebastian Nagel <[email protected]> wrote:
> Hi John,
>
> FreeGenerator, unlike Injector, does not use db.score.injected (default = 1.0)
> but sets the initial score to 0.0. If all URLs stem from FreeGenerator, the
> total score in the link graph is also 0.0, and no linked documents can get a
> score higher than 0.0.
> Possible solutions:
> - use FreeGenerator with an initial score > 0.0
>   (but don't put in thousands of URLs with a score of 1.0:
>   if the total score is too high, some pages may get unreasonably
>   high scores)
> - use linkrank (https://wiki.apache.org/nutch/NewScoring) to get the scores:
>   the default scoring, OPIC, has the advantage of calculating scores online
>   while following links. It gives good and plausible scores if the crawl is
>   started from a few authoritative seeds. But sometimes, esp. in continuous
>   crawls, OPIC scores run out of control.
>
> Sebastian
>
> On 03/25/2014 08:31 PM, John Lafitte wrote:
> > I set up a script that uses freegen to manually index new/updated URLs. I
> > thought it was working great, but now I'm just realizing that Solr returns
> > a score of 0 for these new documents. I thought the score was calculated
> > independently of what Nutch does, using just the content and other metadata,
> > but that doesn't seem to be the case. Does anyone have a clue what might
> > be causing this? The content and other metadata look normal, and I
> > reloaded the core to no avail.
> >