Hi,

It is true that you can only use the score as a relative measure. Because
the default scorer (OPIC) is not normalized, it is very difficult to give it
a specific weight when combining it with custom scorers. A crude method is
to introduce a threshold (cutoff point) in your custom filter (which runs
after OPIC). The difficult part is determining the threshold. The ideal
value would be the maximum score: in your case, sometimes ~45, sometimes
~1286. When you choose a threshold below the maximum, all scores above it
must be clamped to the threshold value, which loses some OPIC information
(every clamped page ends up with the same score). See the example code
below.

float threshold = ...; // e.g. the maximum score you expect for the crawl
float normalizedScore = score; // score is the unnormalized OPIC score
if (normalizedScore > threshold) {
  normalizedScore = threshold; // clamp: detail above the threshold is lost
}
normalizedScore /= threshold; // now in the range [0, 1]

At this point normalizedScore is (reasonably) normalized between 0 and 1.
I'm not sure if there is a better way to solve this problem.
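To make the clamp-and-scale idea concrete, here is a small self-contained
sketch (not Nutch API; the class name, the threshold of 45, and the 2:1
weighting are all assumptions for illustration) that normalizes a raw OPIC
score and then blends it with a custom score at twice the weight:

```java
// Hypothetical helper: clamp a raw OPIC score at `threshold`, scale it
// into [0, 1], then combine it with a custom score using fixed weights.
public class ScoreBlend {

    // Clamp the raw score at `threshold`, then scale into [0, 1].
    static float normalize(float score, float threshold) {
        if (score > threshold) {
            score = threshold; // everything above the threshold becomes 1.0
        }
        return score / threshold;
    }

    public static void main(String[] args) {
        float threshold = 45.0f; // e.g. an observed per-crawl max score

        System.out.println(normalize(10.0f, threshold));     // below threshold
        System.out.println(normalize(1285.653f, threshold)); // clamped to 1.0

        // Giving a custom score twice the weight of the normalized OPIC
        // score (weights 1 and 2, divided by their sum to stay in [0, 1]):
        float opic = normalize(30.0f, threshold);
        float custom = 0.9f; // assume the custom filter already emits [0, 1]
        float blended = (1.0f * opic + 2.0f * custom) / 3.0f;
        System.out.println(blended);
    }
}
```

Note that the quality of the result depends entirely on how close the
chosen threshold is to the real maximum score of the crawl.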

Ferdy.

On Thu, Jun 28, 2012 at 2:42 PM, Safdar Kureishy
<[email protected]> wrote:

> Hi,
>
> I'm trying to understand something about page scoring in Nutch, and
> couldn't find a relevant response elsewhere. Hopefully someone can offer
> some detailed insight into this, or point me to a link that does the
> same...
>
> I've seen some crawls where the scores range from 0.0 to ~45. But for a
> crawl of about 12 million pages, the pages had a max score of ~1286. So,
> there doesn't seem to be a fixed range for scores (i.e., they are not
> normalized). For example, here's the output from readdb -stats on my 12
> million page crawl that had been updated with the results of a LinkRank
> analysis (using all default scoring filters):
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: Statistics for CrawlDb:
> crawl/crawldb
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: TOTAL urls: 103357799
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: retry 0:    102997265
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: retry 1:    173249
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: retry 2:    102601
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: retry 3:    32048
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: retry 4:    26634
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: retry 5:    22105
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: retry 6:    3895
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: retry 7:    2
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: min score:  *0.0*
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: avg score:  *0.030191684*
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: max score:  *1285.653*
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: status 1 (db_unfetched):
> 89270292
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: status 2 (db_fetched):
> 12337926
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: status 3 (db_gone): 752754
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):
> 361060
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):
> 635687
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: status 6 (db_notmodified):  80
> 12/06/28 14:53:01 INFO crawl.CrawlDbReader: CrawlDb statistics: done
>
> The score range appears to have no upper bound and depends on the number of
> URLs processed. Additionally, it appears that the range of scores assigned
> by a given scoring filter determines the "weight" of that scoring filter
> relative to the other scoring filters in the chain that are applied before
> or after it.
>
> So, here are my questions:
> a) The only information I have with Nutch scores is the relative importance
> of one page over another page whose score I also know, but I cannot say
> where a given page *ranks* across ALL pages, even if I know the max score.
> Is that an accurate assessment?
> b) I have a custom scoring filter that needs to carry a higher weight than
> all other filters (e.g., it needs to carry twice the weight of all other
> scoring filters combined). However, I may not know the "magnitude" at which
> the other scoring filters are operating. For instance, filter A might set
> scores within a 0 to 1 range, whereas filter B might use a 0 to 0.00005
> range, thereby giving filter A about 50000 times more weight than filter B!
> Given this lack of information about other scoring filters, how do I decide
> what quantitative impact my custom scoring filter should have on each page,
> to achieve its target weighting?
>
> Regards,
> Safdar
>
