Thanks Sebastian. That did work when I set both of those to false, but now the URL I'm inserting has an abnormally high score. You mentioned two options; the first was to use FreeGenerator with an initial score, but I cannot find any documentation on how to do that. The only parameters I see are normalize and filter, and they don't take values. Can you point me in the right direction for that?

On Wed, Mar 26, 2014 at 6:59 AM, Sebastian Nagel <[email protected]> wrote:

> There may be no relevant links if all documents are from one single host
> (or domain) and
>   (link.ignore.internal.host == true)
> resp.
>   (link.ignore.internal.domain == true)
> cf. explanations about that in the wiki.
>
>
> 2014-03-26 4:09 GMT+01:00 John Lafitte <[email protected]>:
>
> > Thanks for that Sebastian. So given the hint you've given me, I'm trying
> > to generate the scoring using this example:
> > https://wiki.apache.org/nutch/NewScoringIndexingExample
> >
> > But when it gets to the LinkRank part I get:
> >
> > 2014-03-26 02:57:14,208 INFO webgraph.LinkRank - Analysis: starting at 2014-03-26 02:57:14
> > 2014-03-26 02:57:14,913 INFO webgraph.LinkRank - Starting link counter job
> > 2014-03-26 02:57:17,927 INFO webgraph.LinkRank - Finished link counter job
> > 2014-03-26 02:57:17,928 INFO webgraph.LinkRank - Reading numlinks temp file
> > 2014-03-26 02:57:17,932 ERROR webgraph.LinkRank - LinkAnalysis: java.io.IOException: No links to process, is the webgra$
> > at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:132)
> > at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:622)
> > at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:702)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:668)
> >
> > I can see the webgraph directory got created and there are directories and
> > files in there, but I'm guessing something is not getting populated
> > correctly. Any clue what I may be doing wrong?
> >
> >
> > On Tue, Mar 25, 2014 at 4:15 PM, Sebastian Nagel <[email protected]> wrote:
> >
> > > Hi John,
> > >
> > > FreeGenerator unlike Injector does not use db.score.injected (default = 1.0)
> > > but sets the initial score to 0.0. If all URLs stem from FreeGenerator the
> > > total score in the link graph is also 0.0, and no linked documents can get a
> > > higher score that 0.0
> > > As possible solutions:
> > > - use FreeGenerator with a initial score > 0.0
> > >   (but don't put thousands URLs with a score of 1.0:
> > >   if the total score is too high some pages may get unreasonable
> > >   high scores)
> > > - use linkrank (https://wiki.apache.org/nutch/NewScoring) to get the scores:
> > >   the default scoring OPIC has the advantage of calculating scores online
> > >   while following links. It gives good and plausible scores if crawl is
> > >   started from few authoritative seeds. But sometimes, esp. in continuous
> > >   crawls, OPIC scores run out of control.
> > >
> > > Sebastian
> > >
> > > On 03/25/2014 08:31 PM, John Lafitte wrote:
> > > > I setup a script that uses freegen to manually index new/updated URLs. I
> > > > thought it was working great, but now I'm just realizing that Solr returns
> > > > a score of 0 for these new documents. I thought the score was calculated
> > > > independent from what Nutch does, just uses the content and other metadata
> > > > to calculate it, however that doesn't seem to be the case. Anyone have a
> > > > clue what might be causing this? The content and other metadata look
> > > > normal and I reloaded the core to no avail.
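
To make the quoted point about a 0.0 initial score concrete, here is a toy sketch of score propagation over a tiny link graph. It is not Nutch's OPIC or LinkRank code; the function `propagate`, the graph, and the even-split rule are all assumptions chosen only to illustrate why seeds that start at 0.0 cannot pass any score on to the pages they link to.

```python
def propagate(links, seed_scores):
    """Toy one-pass propagation (hypothetical, not Nutch's algorithm):
    each page splits its current score evenly among its outlinks."""
    scores = dict(seed_scores)
    # Visit pages in insertion order so a page is scored before it passes score on.
    for page, outlinks in links.items():
        scores.setdefault(page, 0.0)
        if not outlinks:
            continue
        share = scores[page] / len(outlinks)
        for target in outlinks:
            scores[target] = scores.get(target, 0.0) + share
    return scores


# Hypothetical three-page graph: the seed links to a and b, and a links to b.
links = {"seed": ["a", "b"], "a": ["b"], "b": []}

zero_seeded = propagate(links, {"seed": 0.0})  # seed starts like a FreeGenerator URL
one_seeded = propagate(links, {"seed": 1.0})   # seed starts like an injected URL

print(zero_seeded)
print(one_seeded)
```

With the seed at 0.0, every share is 0.0 and so is every score downstream; with the seed at 1.0, the linked pages receive a positive score. The same sketch also hints at the caveat in the quoted reply: the total score in the graph scales with the number of seeds and their initial score.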

