[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-11 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12379072 ] 

Andrzej Bialecki  commented on NUTCH-267:
-

Hmm, resetting the score to 0 is also dubious - it's as if we didn't want it to 
be re-crawled if we can't find any inlinks to it... I believe it should be 
reset to the following value:

newScore = initialScore - sum(distributedScoreM) + sum(incomingScoreN)

where initialScore is the score we got from previous iterations (or 
injectedScore), sum(distributedScoreM) is what we have distributed to M 
outlinks from that page, and sum(incomingScoreN) is what is contributed by N 
inlinks. The current formula omits the sum(distributedScoreM) term; it also 
provides no way to sponsor pages with no incoming links so that they don't 
go broke (the concept of virtual nodes I mentioned above).
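
The proposed update can be sketched in a few lines of Java. This is only an illustration of the formula above: the class and method names (RecrawlScore, newScore) are hypothetical and do not exist in Nutch, and scores are modelled as plain floats.

```java
/** Hypothetical helper illustrating the proposed re-crawl score update. */
final class RecrawlScore {
    private RecrawlScore() {}

    /**
     * newScore = initialScore - sum(distributedScoreM) + sum(incomingScoreN)
     *
     * @param initialScore score from previous iterations (or the injected score)
     * @param distributed  amounts given away to each of the M outlinks
     * @param incoming     amounts contributed by each of the N inlinks
     */
    static float newScore(float initialScore, float[] distributed, float[] incoming) {
        float given = 0f;
        for (float d : distributed) {
            given += d;
        }
        float received = 0f;
        for (float i : incoming) {
            received += i;
        }
        return initialScore - given + received;
    }

    public static void main(String[] args) {
        // Page with score 1.0 that gives 0.25 to each of 4 outlinks
        // and receives 0.5 and 0.3 from 2 inlinks.
        float[] out = {0.25f, 0.25f, 0.25f, 0.25f};
        float[] in = {0.5f, 0.3f};
        System.out.println(newScore(1.0f, out, in));

        // Page with no inlinks: it gives everything away and receives nothing,
        // which is the case virtual nodes are meant to cover.
        System.out.println(newScore(1.0f, new float[] {0.5f, 0.5f}, new float[0]));
    }
}
```

The second call shows why a page without inlinks ends up with nothing unless some other source (such as a virtual node) tops it up.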

Re: summing logs: yes, but then why use sqrt(opic) * docSimilarity instead of 
log(opic * docSimilarity)?

 Indexer doesn't consider linkdb when calculating boost value
 

  Key: NUTCH-267
  URL: http://issues.apache.org/jira/browse/NUTCH-267
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Chris Schneider
 Priority: Minor


 Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if 
 indexer.boost.by.link.count was true, the indexer boost value was scaled 
 based on the log of the # of inbound links:
 if (boostByLinkCount)
   res *= (float)Math.log(Math.E + linkCount);
 This is no longer true (even before Andrzej implemented scoring filters). 
 Instead, the boost value is just the square root (or some other scorePower) 
 of the page score. Shouldn't the invertlinks command, which creates the 
 linkdb, have some effect on the boost value calculated during indexing 
 (either via the OPICScoringFilter or some other built-in filter)?
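
 For comparison, the two boost calculations described above can be written side by side. This is a sketch only: the class and method names are hypothetical, and the second method simply assumes that the boost is the page score raised to scorePower, as the description states.

```java
/** Hypothetical side-by-side view of the two boost calculations. */
final class BoostComparison {
    private BoostComparison() {}

    /** Older behaviour: scale by the log of the inbound link count. */
    static float linkCountBoost(float res, int linkCount) {
        return res * (float) Math.log(Math.E + linkCount);
    }

    /** Later behaviour as described: scale by score^scorePower (0.5 = square root). */
    static float scorePowerBoost(float res, float pageScore, float scorePower) {
        return res * (float) Math.pow(pageScore, scorePower);
    }

    public static void main(String[] args) {
        System.out.println(linkCountBoost(1.0f, 0));         // no inlinks: log(e)
        System.out.println(linkCountBoost(1.0f, 100));       // 100 inlinks
        System.out.println(scorePowerBoost(1.0f, 4.0f, 0.5f)); // sqrt of page score 4
    }
}
```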

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-11 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12379116 ] 

Doug Cutting commented on NUTCH-267:


re: it's as if we didn't want it to be re-crawled if we can't find any inlinks 
to it

We prioritize crawling based on the number of pages we've crawled that link to 
it since we've last crawled it.  Assuming it had links to it that caused it to 
be crawled the first time, and that some of those will also be re-crawled, then 
its score will again increase.  But if no one links to it anymore, it will 
languish, and not be crawled again unless there are no higher-scoring pages.  
That sounds right to me, and I think it's what's suggested in the OPIC paper 
(if I skimmed it correctly).

Perhaps it should not be reset to zero, but one, since that's where pages start 
out.

re: why use sqrt(opic) * docSimilarity instead of log(opic * docSimilarity)

Wrapping log() around things changes the score value but not the ranking.  So 
the question is really, why use sqrt(opic)*docSimilarity and not just 
opic*docSimilarity?  The answer is simply that I tried a few queries and sqrt 
seemed to be required for OPIC to not overly dominate scoring.  It was a 
seat-of-the-pants calculation, trying to balance the strength of anchor 
matches, opic scoring and title, url and body matching, etc.  One can disable 
this by changing the score power parameter.
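
A small Java sketch of the point about log() and ranking. Everything here is made up for illustration: the class name, the three documents and their opic and docSimilarity values are hypothetical, and the combination is just pow(opic, scorePower) * docSimilarity as described above.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical demo: log() changes score values but not the ranking; the power does. */
final class RankingDemo {
    private RankingDemo() {}

    /** Combines opic and similarity as opic^scorePower * sim, optionally wrapped in log(). */
    static double[] combine(double[] opic, double[] sim, double scorePower, boolean takeLog) {
        double[] out = new double[opic.length];
        for (int i = 0; i < opic.length; i++) {
            double v = Math.pow(opic[i], scorePower) * sim[i];
            out[i] = takeLog ? Math.log(v) : v;
        }
        return out;
    }

    /** Returns document indices ordered from highest to lowest score. */
    static Integer[] rank(double[] scores) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> -scores[i]));
        return idx;
    }

    public static void main(String[] args) {
        double[] opic = {100.0, 4.0, 1.0};
        double[] sim = {0.1, 0.9, 1.5};

        // Same power, with and without log(): the ordering is identical.
        System.out.println(Arrays.toString(rank(combine(opic, sim, 0.5, false))));
        System.out.println(Arrays.toString(rank(combine(opic, sim, 0.5, true))));

        // Power 1.0 (plain opic * sim): the high-opic document moves to the top.
        System.out.println(Arrays.toString(rank(combine(opic, sim, 1.0, false))));
    }
}
```

With these made-up numbers the square root keeps the document with the very large opic value from outranking the better textual matches, which is the balancing effect described above.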




[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-09 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378755 ] 

Andrzej Bialecki  commented on NUTCH-267:
-

I would argue that what Nutch implements now shouldn't be called OPIC, because 
it has little to do with the algorithm described in the OPIC paper. Either we 
fix it, or we should rename it. Let me explain:

* the paper uses a cash flow concept, where nodes not only receive score 
contributions, but also give them away, thus _reducing_ their available score. 
This is not implemented in Nutch, which leads to scores growing without bound. 
This also makes the score dependent on the number of fetch cycles, i.e. the 
scores of two pages with exactly the same inlinks will be different if one of 
them underwent more refresh cycles than the other. So, the fundamental premise 
of the algorithm - that scores would converge to certain values as a result of 
cash flow balance - is not retained.

* the paper uses a concept of virtual nodes that give away cash to 
disconnected nodes in the current graph. In reality, these nodes are probably 
connected, but the current graph is not complete enough to track it. The Nutch 
implementation doesn't use this, but only because it doesn't give away cash.

* finally, the paper argues that the OPIC score and other scores should be 
combined as a sum of logarithms, i.e. log(opic) + log(docSimilarity). 
Nutch uses the formula sqrt(opic) * docSimilarity (through document boosting).
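
To make the first point concrete, here is a toy Java simulation. It is a deliberately simplified, synchronous model and not the Nutch code or the algorithm from the paper: the class name, the 3-page graph and the cycle() method are all hypothetical. It only contrasts "receive but never give away" with "give away what you distribute".

```java
/** Hypothetical toy model contrasting score accumulation with cash flow. */
final class CashFlowDemo {
    private CashFlowDemo() {}

    // Made-up graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
    static final int[][] OUTLINKS = { {1, 2}, {2}, {0} };

    /**
     * Runs one cycle in which every page splits its score evenly over its outlinks.
     * If giveAway is true the page's own cash is spent (it starts the next cycle
     * with only what it received); otherwise it keeps its score and adds to it.
     */
    static double[] cycle(double[] score, boolean giveAway) {
        double[] next = giveAway ? new double[score.length] : score.clone();
        for (int p = 0; p < score.length; p++) {
            double share = score[p] / OUTLINKS[p].length;
            for (int q : OUTLINKS[p]) {
                next[q] += share;
            }
        }
        return next;
    }

    /** Sum of all scores after the given number of cycles, starting from 1.0 per page. */
    static double totalAfter(int cycles, boolean giveAway) {
        double[] score = {1.0, 1.0, 1.0};
        for (int i = 0; i < cycles; i++) {
            score = cycle(score, giveAway);
        }
        double total = 0.0;
        for (double s : score) {
            total += s;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("keep and receive:      " + totalAfter(10, false));
        System.out.println("give away and receive: " + totalAfter(10, true));
    }
}
```

In this toy model the total doubles on every cycle when pages keep their score, so it depends on the number of cycles, while with cash flow the total stays at the initial amount.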

I'm going to commit the scoring API soon; this should make it easier to 
experiment with different scoring models.




[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-09 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378765 ] 

Doug Cutting commented on NUTCH-267:


Andrzej: your analysis is correct, but it mostly only applies when re-crawling. 
In an initial crawl, where each URL is fetched only once, I think we implement 
the OPIC Greedy strategy.  The question of what to do when re-crawling has 
not been adequately answered, but, glancing at the paper, it seems that 
resetting a URL's score to zero each time it is fetched might be the best thing 
to do, so that it can start accumulating more cash.

When ranking, summing logs is the same as multiplying, no?




[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-08 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378560 ] 

Doug Cutting commented on NUTCH-267:


The OPIC score is much like a count of incoming links, but a bit more refined.  
OPIC(P) is one plus the sum of the OPIC contributions of all links to page P.  
The OPIC contribution of a link from page Q is OPIC(Q) / numOutLinks(Q).
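
A minimal Java sketch of that definition for a single pass. It assumes each page is fetched exactly once, in index order, and that a page hands out its contributions at the moment it is fetched; the class name, method name and example graph are hypothetical, not taken from Nutch.

```java
import java.util.Arrays;

/** Hypothetical single-pass illustration of the OPIC score described above. */
final class SinglePassOpic {
    private SinglePassOpic() {}

    /**
     * outlinks[p] lists the pages that page p links to. Pages are "fetched" in
     * index order; each starts at one and, when fetched, adds
     * score / numOutLinks to every page it links to.
     */
    static float[] scores(int[][] outlinks) {
        float[] opic = new float[outlinks.length];
        Arrays.fill(opic, 1.0f); // every page starts out at one
        for (int p = 0; p < outlinks.length; p++) {
            if (outlinks[p].length == 0) {
                continue; // nothing to contribute to
            }
            float contribution = opic[p] / outlinks[p].length;
            for (int q : outlinks[p]) {
                opic[q] += contribution;
            }
        }
        return opic;
    }

    public static void main(String[] args) {
        // Made-up graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links nowhere.
        int[][] outlinks = { {1, 2}, {1 + 1}, {} };
        System.out.println(Arrays.toString(scores(outlinks)));
    }
}
```

Page 2 has two inlinks, but its score is not simply 1 + 2: the link from page 1 is worth more than the link from page 0 because page 1 has a higher score and fewer outlinks, which is the refinement over a plain link count.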
