Re: OPIC scoring differences

2007-07-11 Thread Doğacan Güney

On 7/9/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Carl Cerecke wrote:
 Hi,

 The docs for the OPICScoringFilter mention that the plugin implements a
 variant of OPIC from Abiteboul et al.'s paper. What exactly is different?
 How does the difference affect the scores?

As it is now, the implementation doesn't preserve the total cash value
in the system, and also there is almost no smoothing between the
iterations (Abiteboul's history).

As a consequence, scores may (and do) vary dramatically between
iterations, and they don't converge to stable values, i.e. they always
increase. For pages that get a lot of score contributions from other
pages this leads to an explosive increase into the range of thousands or
eventually millions. This means that the scores produced by the OPIC
plugin exaggerate score differences between pages more and more, even if
the web graph that you crawl is stable.

In a sense, to follow the cash analogy, our implementation of OPIC
illustrates a runaway economy - galloping inflation, rich get richer and
poor get poorer ;)


 Also, there's a comment in the code:

 // XXX (ab) no adjustment? I think this is contrary to the algorithm descr.
 // XXX in the paper, where page loses its score if it's distributed to
 // XXX linked pages...

 Is this something that will be looked at eventually or is the scoring
 good enough at the moment without some adjustment.

Yes, I'll start working on it when I get back from vacation. I did some
simulations that show how to fix it (see
http://wiki.apache.org/nutch/FixingOpicScoring - bottom of the page).


Andrzej, nice to see you working on this.

There is one thing that I don't understand about your presentation.
Assume that page A is the only url in our crawldb and it contains n
outlinks.

t = 0 - Generate runs, A is generated.

t = 1 - Page A is fetched and its cash is distributed to its outlinks.

t = 2 - Generate runs, pages P0-Pn are generated.

t = 3 - P0 - Pn are fetched and their cash is distributed to their outlinks.
- At this time, it is possible that page Pk links to page A.
So, now page A's cash > 0.

t = 4 - Generate runs, page A is considered but is not generated
(since its next fetch time is later than current time).
- Won't page A become a temporary sink? Time between
subsequent fetches may be as large as 30 days in default
configuration. So, page A will accumulate cash for a long time without
distributing it.
- I don't see how we can achieve that, but, IMO, if a page is
considered but not generated, nutch should distribute its cash to the
outlinks that are stored in its parse data. (I know that this is
incredibly hard, if not impossible, to do.)

Or am I missing something here?



--
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





--
Doğacan Güney


Re: OPIC scoring differences

2007-07-11 Thread Andrzej Bialecki

Doğacan Güney wrote:


Andrzej, nice to see you working on this.

There is one thing that I don't understand about your presentation.
Assume that page A is the only url in our crawldb and it contains n
outlinks.

t = 0 - Generate runs, A is generated.

t = 1 - Page A is fetched and its cash is distributed to its outlinks.

t = 2 - Generate runs, pages P0-Pn are generated.

t = 3 - P0 - Pn are fetched and their cash is distributed to their 
outlinks.

- At this time, it is possible that page Pk links to page A.
So, now page A's cash > 0.

t = 4 - Generate runs, page A is considered but is not generated
(since its next fetch time is later than current time).
- Won't page A become a temporary sink? Time between
subsequent fetches may be as large as 30 days in default
configuration. So, page A will accumulate cash for a long time without
distributing it.


Yes. That's why Abiteboul used a history (several cycles long) to smooth 
out temporary imbalances in cash redistribution. The history component 
described in the paper can span either several cycles or a specific 
period of time.


In our case I think the history for rarely updated pages should span the 
db.max.interval period plus some margin, and for frequently updated pages 
it should span several cycles.
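
As an illustration only (not the actual Nutch or Abiteboul code, and the 
class name is made up), such a history could be a sliding window of the 
cash collected per cycle, with the reported score being the window 
average rather than the raw per-cycle value:

  import java.util.ArrayDeque;
  import java.util.Deque;

  /** Hypothetical smoothing of per-cycle cash over a fixed history window. */
  class SmoothedScore {
    private final int window;                     // history length in cycles, e.g. 4
    private final Deque<Float> history = new ArrayDeque<Float>();

    SmoothedScore(int window) { this.window = window; }

    /** Record the cash collected during the cycle that just finished. */
    void endCycle(float cashCollected) {
      history.addLast(cashCollected);
      if (history.size() > window) history.removeFirst();
    }

    /** Smoothed score: average cash per cycle over the history window. */
    float score() {
      if (history.isEmpty()) return 0f;
      float sum = 0f;
      for (float c : history) sum += c;
      return sum / history.size();
    }
  }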



- I don't see how we can achieve that, but, IMO, if a page is
considered but not generated, nutch should distribute its cash to the
outlinks that are stored in its parse data. (I know that this is
incredibly hard, if not impossible, to do.)


Actually we store outlinks in two places - one place is obviously the 
segments. The other less obvious place is the linkdb - the data is 
there, it just needs to be inverted (again).


So, theoretically, we could modify the updatedb process to consider the 
complete webgraph, i.e. all link information collected so far - but the 
main attraction of OPIC is that it's incremental, so you don't have to 
process the whole webgraph for small incremental updates.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: OPIC scoring differences

2007-07-09 Thread Doğacan Güney

Hi,

On 7/9/07, Carl Cerecke [EMAIL PROTECTED] wrote:

Hi,

The docs for the OPICScoringFilter mention that the plugin implements a
variant of OPIC from Abiteboul et al.'s paper. What exactly is different?
How does the difference affect the scores?

Also, there's a comment in the code:

// XXX (ab) no adjustment? I think this is contrary to the algorithm descr.
// XXX in the paper, where page loses its score if it's distributed to
// XXX linked pages...

Is this something that will be looked at eventually or is the scoring
good enough at the moment without some adjustment.


I certainly hope that this is something that will be looked at
eventually. IMHO, the scoring is not good enough, but it doesn't bother
anyone enough for them to fix it.

Also, see Andrzej's comments in NUTCH-267 about why the scoring-opic
plugin is not really OPIC. It is basically a glorified link counter.



Cheers,
Carl.




--
Doğacan Güney


Re: OPIC scoring differences

2007-07-09 Thread Andrzej Bialecki

Carl Cerecke wrote:

Hi,

The docs for the OPICScoringFilter mention that the plugin implements a 
variant of OPIC from Abiteboul et al.'s paper. What exactly is different? 
How does the difference affect the scores?


As it is now, the implementation doesn't preserve the total cash value 
in the system, and also there is almost no smoothing between the 
iterations (Abiteboul's history).


As a consequence, scores may (and do) vary dramatically between 
iterations, and they don't converge to stable values, i.e. they always 
increase. For pages that get a lot of score contributions from other 
pages this leads to an explosive increase into the range of thousands or 
eventually millions. This means that the scores produced by the OPIC 
plugin exaggerate score differences between pages more and more, even if 
the web graph that you crawl is stable.


In a sense, to follow the cash analogy, our implementation of OPIC 
illustrates a runaway economy - galloping inflation, rich get richer and 
poor get poorer ;)
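
To make the "always increasing" behaviour concrete, here is a toy sketch 
(illustrative only, not the plugin code): three pages in a cycle, one 
outlink each, with the LINKED contribution added but the contributor's 
cash never reset. Every score doubles each round:

  import java.util.Arrays;

  /** Toy illustration of why non-conserved cash diverges. */
  public class RunawayOpic {
    public static void main(String[] args) {
      float[] score = {1f, 1f, 1f};            // pages A, B, C in a cycle: A->B->C->A
      for (int round = 0; round < 20; round++) {
        float[] next = score.clone();
        for (int i = 0; i < 3; i++) {
          next[(i + 1) % 3] += score[i];       // contribution is added to the target...
          // ...but score[i] is never reduced, so total cash grows every round
        }
        score = next;
      }
      System.out.println(Arrays.toString(score)); // about 2^20, i.e. roughly a million per page
    }
  }

A cash-conserving variant would move each contributor's cash into its 
history and reset it to zero, keeping the system total constant at 3.0.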




Also, there's a comment in the code:

// XXX (ab) no adjustment? I think this is contrary to the algorithm descr.
// XXX in the paper, where page loses its score if it's distributed to
// XXX linked pages...

Is this something that will be looked at eventually or is the scoring 
good enough at the moment without some adjustment.


Yes, I'll start working on it when I get back from vacation. I did some 
simulations that show how to fix it (see 
http://wiki.apache.org/nutch/FixingOpicScoring - bottom of the page).


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: OPIC score calculation issues

2006-03-14 Thread Andrzej Bialecki

(Better late than never... I forgot that I hadn't yet responded to your posting.)

Doug Cutting wrote:
I think all that you're saying is that we should not run two CrawlDB 
updates at once, right?  But there are lots of reasons we cannot do 
that besides the OPIC calculation.


When we used WebDB it was possible to overlap generate / fetch / update 
cycles, because we would lock pages selected by FetchListTool for a 
period of time.


Now we don't do this. The advantage is that we don't have to rewrite 
CrawlDB. But operations on CrawlDB are considerably faster than on 
WebDB; perhaps we should consider going back to this method?




Also, the cash value of those outlinks that point to URLs not in 
the current fetchlist will be dropped, because they won't be 
collected anywhere.


No, every cash value is used.  The input to a crawl db update includes a
CrawlDatum for every known url, including those just linked to.  If 
the only CrawlDatum for a url is a new outlink from a page crawled, 
then the  score for the page is 1.0 + the score of that outlink.


Of course, you are right, I missed this.


And a final note: CrawlDB.update() uses the initial score value 
recorded in the segment, and NOT the value that is actually found in 
CrawlDB at the time of the update. This means that if there was 
another update in the meantime, your new score in CrawlDB will be 
overwritten with the score based on an older initial value. This is 
counter-intuitive - I think CrawlDB.update() should always use the 
latest score value found in the current CrawlDB. I.e. in 
CrawlDBReducer instead of doing:


 result.setScore(result.getScore() + scoreIncrement);

we should do:

 result.setScore(old.getScore() + scoreIncrement);


The change is not quite that simple, since 'old' is sometimes null.  
So perhaps we need to add a 'score' variable that is set to old.score 
when old!=null and to 1.0 otherwise (for newly linked pages).


The reason I didn't do it that way was to permit the Fetcher to modify 
scores, since I was thinking of the Fetcher as the actor whose actions 
are being processed here, and of the CrawlDb as the passive thing 
acted on.  But indeed, if you have another process that's updating a 
CrawlDb while a Fetcher is running, this may not be the case.  So if 
we want to switch things so that the Fetcher is not permitted to 
adjust scores, then this seems like a reasonable change.


I would vote for implementing this change. The reason is that the active 
actor that computes new scores is CrawlDb.update(). Fetcher may provide 
additional information to affect the score, but IMHO the logic to 
calculate new scores should be concentrated in the update() method.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: OPIC score calculation issues

2006-03-14 Thread Doug Cutting

Andrzej Bialecki wrote:
When we used WebDB it was possible to overlap generate / fetch / update 
cycles, because we would lock pages selected by FetchListTool for a 
period of time.


Now we don't do this. The advantage is that we don't have to rewrite 
CrawlDB. But operations on CrawlDB are considerably faster than on 
WebDB; perhaps we should consider going back to this method?


Yes, this would be a good addition.

Ideally we should change Crawl.java to overlap these too.  When -topN is 
specified and substantially smaller than the total size of the crawldb, 
then we can generate, start a fetch job, then generate again.  As each 
fetch completes, we can start the next, then run an update and generate 
based on the just-completed fetch, so that we're constantly fetching.


This could be implemented by: (a) adding a status for generated crawl 
data; (b) adding an option to updatedb to include the generated output 
from some segments.  Then, in the above algorithm, the first time we'd 
update with only the generator output, but, after that, we can combine 
the updates with fetcher and generator output.  This way, in the course 
of a crawl, we only re-write the crawldb one additional time, rather 
than twice as many times.  Does this make sense?
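
A rough sketch of that loop (hypothetical; the method names are stand-ins 
for the corresponding Nutch jobs, not real APIs):

  import java.util.concurrent.CompletableFuture;

  /** Hypothetical overlapped generate/fetch/update loop. */
  public class OverlappedCrawl {
    interface Jobs {
      String generate();                                // produce the next segment to fetch
      CompletableFuture<String> fetch(String segment);  // fetch asynchronously, return the fetched segment
      void updateDb(String fetched, String generated);  // fold fetcher + generator output into the crawldb
    }

    static void run(Jobs jobs, int rounds) throws Exception {
      String current = jobs.generate();                      // round 0
      for (int i = 0; i < rounds; i++) {
        CompletableFuture<String> fetching = jobs.fetch(current);
        String next = jobs.generate();                       // generate round i+1 while round i is fetching
        // one updatedb pass combines fetcher output (round i) with
        // generator output (round i+1), as in (a)/(b) above
        jobs.updateDb(fetching.get(), next);
        current = next;
      }
    }
  }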


And a final note: CrawlDB.update() uses the initial score value 
recorded in the segment, and NOT the value that is actually found in 
CrawlDB at the time of the update. This means that if there was 
another update in the meantime, your new score in CrawlDB will be 
overwritten with the score based on an older initial value. This is 
counter-intuitive - I think CrawlDB.update() should always use the 
latest score value found in the current CrawlDB. I.e. in 
CrawlDBReducer instead of doing:


 result.setScore(result.getScore() + scoreIncrement);

we should do:

 result.setScore(old.getScore() + scoreIncrement);



The change is not quite that simple, since 'old' is sometimes null.  
So perhaps we need to add a 'score' variable that is set to old.score 
when old!=null and to 1.0 otherwise (for newly linked pages).


The reason I didn't do it that way was to permit the Fetcher to modify 
scores, since I was thinking of the Fetcher as the actor whose actions 
are being processed here, and of the CrawlDb as the passive thing 
acted on.  But indeed, if you have another process that's updating a 
CrawlDb while a Fetcher is running, this may not be the case.  So if 
we want to switch things so that the Fetcher is not permitted to 
adjust scores, then this seems like a reasonable change.


I would vote for implementing this change. The reason is that the active 
actor that computes new scores is CrawlDb.update(). Fetcher may provide 
additional information to affect the score, but IMHO the logic to 
calculate new scores should be concentrated in the update() method.


I agree: +1.  I was just trying to explain the existing logic.  I think 
this would provide a significant improvement, with little lost.


Doug


Re: OPIC score calculation issues

2006-02-28 Thread Doug Cutting

Andrzej Bialecki wrote:
* CrawlDBReducer (used by CrawlDB.update()) collects all CrawlDatum-s 
from crawl_parse with the same URL, which means that we get:


   * the original CrawlDatum
   * (optionally a CrawlDatum that contains just a Signature)
   * all CrawlDatum.LINKED entries pointing to our URL, generated by 
     outlinks from other pages.

Based on this information, a new score is calculated by adding the 
original score and all scores from incoming links.

HOWEVER... and here's where I suspect the current code is wrong: since 
we are processing just one segment, the incoming link information is very 
incomplete because it comes only from the outlinks discovered by 
fetching this segment's fetchlist, and not the complete LinkDB.


I think the code is correct.  OPIC is an incremental algorithm, designed 
to be calculated while crawling.  As each new link is seen, it 
increments the score of the page it links to.  OPIC is thus much simpler 
and faster to calculate than PageRank.  (It also provides a good 
approximation of PageRank, but prioritizes better when crawling than 
PageRank.  Crawling using an incrementally calculated PageRank is not as 
good as OPIC at crawling higher PageRank pages sooner.)
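
For readers new to the algorithm, a minimal sketch of the incremental idea 
(illustrative only, not the Nutch implementation): each fetched page hands 
its current cash to its outlinks and accumulates what it has distributed 
into a history, so the total cash in the system stays constant.

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  /** Toy sketch of incremental OPIC-style cash distribution. */
  public class OpicSketch {
    static class Page {
      float cash = 1.0f;     // cash to distribute on the next fetch
      float history = 0.0f;  // cash already distributed; the long-term importance estimate
    }

    final Map<String, Page> db = new HashMap<String, Page>();

    /** Called when a page has been fetched and its outlinks parsed. */
    void onFetch(String url, List<String> outlinks) {
      Page page = db.computeIfAbsent(url, u -> new Page());
      if (outlinks.isEmpty()) return;              // sinks need special handling in the real algorithm
      float share = page.cash / outlinks.size();   // split the page's cash among its outlinks
      for (String out : outlinks) {
        db.computeIfAbsent(out, u -> new Page()).cash += share;
      }
      page.history += page.cash;                   // remember what was distributed
      page.cash = 0.0f;                            // conserve total cash in the system
    }
  }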


One mitigating factor could be that we already accounted for incoming 
links from other segments when processing those other segments - so our 
initial score already includes the inlink information from other 
segments. But this assumes that we never generate and process more than 
1 segment in parallel, i.e. that we finish updating from all previous 
segments before we update from the current segment (otherwise we 
wouldn't know the updated initial score).


I think all that you're saying is that we should not run two CrawlDB 
updates at once, right?  But there are lots of reasons we cannot do that 
besides the OPIC calculation.


Also, the cash value of those outlinks that point to URLs not in the 
current fetchlist will be dropped, because they won't be collected 
anywhere.


No, every cash value is used.  The input to a crawl db update includes a
CrawlDatum for every known url, including those just linked to.  If the 
only CrawlDatum for a url is a new outlink from a page crawled, then the 
 score for the page is 1.0 + the score of that outlink.


I think a better option would be to add the LinkDB as an input dir to 
CrawlDB.update(), so that we have access to all previously collected 
inlinks.


That would be a lot slower, and it would not compute OPIC.

And a final note: CrawlDB.update() uses the initial score value recorded 
in the segment, and NOT the value that is actually found in CrawlDB at 
the time of the update. This means that if there was another update in 
the meantime, your new score in CrawlDB will be overwritten with the 
score based on an older initial value. This is counter-intuitive - I 
think CrawlDB.update() should always use the latest score value found in 
the current CrawlDB. I.e. in CrawlDBReducer instead of doing:


 result.setScore(result.getScore() + scoreIncrement);

we should do:

 result.setScore(old.getScore() + scoreIncrement);


The change is not quite that simple, since 'old' is sometimes null.  So 
perhaps we need to add a 'score' variable that is set to old.score when 
old!=null and to 1.0 otherwise (for newly linked pages).
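
A minimal sketch of the variant being discussed (illustrative; the Datum 
class below is a simplified stand-in, not the real CrawlDatum or 
CrawlDbReducer):

  /** Stand-in for the proposed score update in the reducer. */
  public class ScoreUpdateSketch {
    static class Datum {
      private float score = 1.0f;
      float getScore() { return score; }
      void setScore(float s) { this.score = s; }
    }

    /**
     * @param old            the datum currently in the crawldb, or null for a newly linked url
     * @param result         the datum being emitted by the reducer
     * @param scoreIncrement the sum of the LINKED contributions for this url
     */
    static void applyIncrement(Datum old, Datum result, float scoreIncrement) {
      // always start from the latest db score (1.0 for new pages),
      // not from the score recorded in the segment
      float base = (old != null) ? old.getScore() : 1.0f;
      result.setScore(base + scoreIncrement);
    }
  }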


The reason I didn't do it that way was to permit the Fetcher to modify 
scores, since I was thinking of the Fetcher as the actor whose actions 
are being processed here, and of the CrawlDb as the passive thing acted 
on.  But indeed, if you have another process that's updating a CrawlDb 
while a Fetcher is running, this may not be the case.  So if we want to 
switch things so that the Fetcher is not permitted to adjust scores, 
then this seems like a reasonable change.


Doug


Re: OPIC

2005-10-21 Thread Andrzej Bialecki

Massimo Miccoli wrote:

Sorry Andrzej,

I meant in DeleteDuplicates.java, not at runtime. Is that the correct 
place to integrate something like shingling or n-grams?


Yes. But there is a small issue of high dimensionality to solve, 
otherwise it will be very inefficient...


Both shingling and n-gram based methods (word n-gram or character 
n-gram) produce a profile of a document, which can be compared to other 
profiles, one by one. So, this seems to be appropriate to detect near 
duplicates - you create a profile for each document (in IndexDoc), and 
sort them... but here's where the problems start.


Usually such profiles take a lot of space (e.g. a list of 100 top 
n-grams), and comparing them takes a lot of resources - and several 
comparison operations are needed per item to sort the signatures. This 
is currently done by HashScore.


(BTW, HashScore is missing the fetchTime, which the original dedup 
algorithm also took into account when comparing pages...).


So, you need to reduce the number of dimensions in a signature to 
decrease the complexity of compare operations. This can be done using 
purely numeric signatures (e.g. Nilsimsa - but this particular approach 
brings numerous problems with quantization noise).
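
For concreteness, a toy sketch of the shingling idea (illustrative only, 
and it deliberately ignores the dimensionality problem described above): 
each document is reduced to a set of k-word shingles, and two documents 
are near duplicates when the Jaccard similarity of their shingle sets is 
close to 1.

  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Locale;
  import java.util.Set;

  /** Toy word-shingle profiles and Jaccard similarity for near-duplicate detection. */
  public class ShingleSketch {

    /** Build the set of k-word shingles of an (already normalized) text. */
    static Set<String> shingles(String text, int k) {
      String[] words = text.toLowerCase(Locale.ROOT).split("\\s+");
      Set<String> result = new HashSet<String>();
      for (int i = 0; i + k <= words.length; i++) {
        result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
      }
      return result;
    }

    /** Jaccard similarity of two shingle sets; values near 1.0 indicate near duplicates. */
    static double jaccard(Set<String> a, Set<String> b) {
      if (a.isEmpty() && b.isEmpty()) return 1.0;
      Set<String> inter = new HashSet<String>(a);
      inter.retainAll(b);
      Set<String> union = new HashSet<String>(a);
      union.addAll(b);
      return (double) inter.size() / union.size();
    }
  }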


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: OPIC

2005-10-20 Thread Massimo Miccoli

Hi Doug,

Many thanks for your patch. I will try it now. I'm also thinking of 
integrating some algorithm for near-duplicate url detection, something 
like shingling.

Is dedup the best place to integrate the algorithm?

Thanks,

Massimo

Doug Cutting wrote:

Here is a patch that implements this.  I'm still testing it.  If it 
appears to work well, I will commit it.


Doug Cutting wrote:


Massimo Miccoli wrote:

Any news about the integration of OPIC in mapred? I have time to 
develop OPIC on Nutch mapred. Can you help me get started?
From the email by Carlos Alberto-Alejandro CASTILLO-Ocaranza, it seems 
that this was the best way to integrate OPIC into the old webdb - is 
this way also valid for CrawlDb in mapred?




Yes.  I think the way to implement this in the mapred branch is:

1. In CrawlDatum.java, replace 'int linkCount' with 'float score'.  
The default value of this should be 1.0f.  This will require changes 
to accessors, write, readFields, compareTo etc.  A constructor which 
specifies the score should be added.  The comparator should sort by 
decreasing score.


2. In crawl/Fetcher.java, add the score to the Content's metadata:

  public static String SCORE_KEY = "org.apache.nutch.crawl.score";
  ...
  private void output(...) {
    ...
    content.getMetadata().setProperty(SCORE_KEY, datum.getScore());
    ...
  }


3. In ParseOutputFormat.java, when writing the CrawlDatum for each 
outlink (line 77), set the score of the link CrawlDatum to be the 
score of the page:


    float score =
      Float.valueOf(parse.getData().get(Fetcher.SCORE_KEY));
    score /= links.length;
    for (int i = 0; i < links.length; ...) {
      ...
      new CrawlDatum(CrawlDatum.STATUS_LINKED,
                     interval, score);
      ...
    }

4. In CrawlDbReducer.java, remove linkCount calculations.  Replace 
these with something like:


  float scoreIncrement = 0.0f;
  while (values.next()) {
    ...
    switch (datum.getStatus()) {
    ...
    case CrawlDatum.STATUS_LINKED:
      scoreIncrement += datum.getScore();
      break;
    ...
    }
  }
  ...
  result.setScore(result.getScore() + scoreIncrement);

I think that should do it, no?

Doug




Index: conf/crawl-tool.xml
===
--- conf/crawl-tool.xml (revision 326624)
+++ conf/crawl-tool.xml (working copy)
@@ -15,13 +15,6 @@
 </property>
 
 <property>
-  <name>indexer.boost.by.link.count</name>
-  <value>true</value>
-  <description>When true scores for a page are multipled by the log of
-  the number of incoming links to the page.</description>
-</property>
-
-<property>
   <name>db.ignore.internal.links</name>
   <value>false</value>
   <description>If true, when adding new links to a page, links from
Index: conf/nutch-default.xml
===
--- conf/nutch-default.xml  (revision 326624)
+++ conf/nutch-default.xml  (working copy)
@@ -440,24 +440,6 @@
 <!-- indexer properties -->
 
 <property>
-  <name>indexer.score.power</name>
-  <value>0.5</value>
-  <description>Determines the power of link analyis scores.  Each
-  pages's boost is set to <i>score<sup>scorePower</sup></i> where
-  <i>score</i> is its link analysis score and <i>scorePower</i> is the
-  value of this parameter.  This is compiled into indexes, so, when
-  this is changed, pages must be re-indexed for it to take
-  effect.</description>
-</property>
-
-<property>
-  <name>indexer.boost.by.link.count</name>
-  <value>true</value>
-  <description>When true scores for a page are multipled by the log of
-  the number of incoming links to the page.</description>
-</property>
-
-<property>
   <name>indexer.max.title.length</name>
   <value>100</value>
   <description>The maximum number of characters of a title that are indexed.
Index: src/java/org/apache/nutch/crawl/CrawlDatum.java
===
--- src/java/org/apache/nutch/crawl/CrawlDatum.java (revision 326624)
+++ src/java/org/apache/nutch/crawl/CrawlDatum.java (working copy)
@@ -31,7 +31,7 @@
   public static final String FETCH_DIR_NAME = "crawl_fetch";
   public static final String PARSE_DIR_NAME = "crawl_parse";

-  private final static byte CUR_VERSION = 1;
+  private final static byte CUR_VERSION = 2;

  public static final byte STATUS_DB_UNFETCHED = 1;
  public static final byte STATUS_DB_FETCHED = 2;
@@ -47,17 +47,20 @@
  private long fetchTime = System.currentTimeMillis();
  private byte retries;
  private float fetchInterval;
-  private int linkCount;
+  private float score = 1.0f;

  public CrawlDatum() {}

  public CrawlDatum(int status, float fetchInterval) {
this.status = (byte)status;
this.fetchInterval = fetchInterval;
-if (status == STATUS_LINKED)
-  linkCount = 1;
  }

+  public CrawlDatum(int status, float fetchInterval, float score) {
+this(status, fetchInterval);
+this.score = score;
+  }
+
  //
  // accessor methods
  //
@@ -80,8 +83,8 @@
this.fetchInterval = fetchInterval;
  }

-  public 

Re: OPIC

2005-10-20 Thread Andrzej Bialecki

Massimo Miccoli wrote:

Hi Doug,

Many thanks for your patch. I will try it now. I'm also thinking of 
integrating some algorithm for near-duplicate url detection, something 
like shingling.

Is dedup the best place to integrate the algorithm?


That would be lovely. Dedup is the place to start, but certainly not the 
place to stop... ;-)


I think we should introduce a separate dedup field for each page in 
the DB. The reason is that if we re-use the md5 (or change its semantics 
to mean "near duplicates covered by this value") then we run a risk of 
losing a lot of legitimate unique urls from the DB.


Shingling, if you know how to implement it efficiently, would certainly 
be nice - but we could start by just passing a normalized text to md5. 
By normalized text I mean all lowercase, stopwords removed, 
punctuation removed, any consecutive whitespace replaced with exactly 1 
space character. We could also use an n-gram profile (either word-level 
or character level) with coarse quantization.
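
As a rough illustration of the normalized-text signature idea (a sketch 
only, not actual Nutch code; stopword removal is omitted for brevity):

  import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;
  import java.util.Locale;

  /** Toy sketch: hash a normalized version of the page text to catch near-identical pages. */
  public class NormalizedSignature {

    /** Lowercase, strip punctuation, collapse consecutive whitespace. */
    static String normalize(String text) {
      return text.toLowerCase(Locale.ROOT)
                 .replaceAll("[^\\p{L}\\p{Nd}\\s]", " ")  // drop punctuation
                 .replaceAll("\\s+", " ")                 // collapse runs of whitespace
                 .trim();
    }

    static byte[] signature(String text) throws Exception {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      return md5.digest(normalize(text).getBytes(StandardCharsets.UTF_8));
    }
  }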


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: OPIC

2005-10-19 Thread Doug Cutting
Here is a patch that implements this.  I'm still testing it.  If it 
appears to work well, I will commit it.


Doug Cutting wrote:

Massimo Miccoli wrote:

Any news about the integration of OPIC in mapred? I have time to 
develop OPIC on Nutch mapred. Can you help me get started?
From the email by Carlos Alberto-Alejandro CASTILLO-Ocaranza, it seems 
that this was the best way to integrate OPIC into the old webdb - is 
this way also valid for CrawlDb in mapred?



Yes.  I think the way to implement this in the mapred branch is:

1. In CrawlDatum.java, replace 'int linkCount' with 'float score'.  The 
default value of this should be 1.0f.  This will require changes to 
accessors, write, readFields, compareTo etc.  A constructor which 
specifies the score should be added.  The comparator should sort by 
decreasing score.


2. In crawl/Fetcher.java, add the score to the Content's metadata:

  public static String SCORE_KEY = "org.apache.nutch.crawl.score";
  ...
  private void output(...) {
    ...
    content.getMetadata().setProperty(SCORE_KEY, datum.getScore());
    ...
  }


3. In ParseOutputFormat.java, when writing the CrawlDatum for each 
outlink (line 77), set the score of the link CrawlDatum to be the score 
of the page:


    float score =
      Float.valueOf(parse.getData().get(Fetcher.SCORE_KEY));
    score /= links.length;
    for (int i = 0; i < links.length; ...) {
      ...
      new CrawlDatum(CrawlDatum.STATUS_LINKED,
                     interval, score);
      ...
    }

4. In CrawlDbReducer.java, remove linkCount calculations.  Replace these 
with something like:


  float scoreIncrement = 0.0f;
  while (values.next()) {
    ...
    switch (datum.getStatus()) {
    ...
    case CrawlDatum.STATUS_LINKED:
      scoreIncrement += datum.getScore();
      break;
    ...
    }
  }
  ...
  result.setScore(result.getScore() + scoreIncrement);

I think that should do it, no?

Doug
Index: conf/crawl-tool.xml
===
--- conf/crawl-tool.xml	(revision 326624)
+++ conf/crawl-tool.xml	(working copy)
@@ -15,13 +15,6 @@
 </property>
 
 <property>
-  <name>indexer.boost.by.link.count</name>
-  <value>true</value>
-  <description>When true scores for a page are multipled by the log of
-  the number of incoming links to the page.</description>
-</property>
-
-<property>
   <name>db.ignore.internal.links</name>
   <value>false</value>
   <description>If true, when adding new links to a page, links from
Index: conf/nutch-default.xml
===
--- conf/nutch-default.xml	(revision 326624)
+++ conf/nutch-default.xml	(working copy)
@@ -440,24 +440,6 @@
 <!-- indexer properties -->
 
 <property>
-  <name>indexer.score.power</name>
-  <value>0.5</value>
-  <description>Determines the power of link analyis scores.  Each
-  pages's boost is set to <i>score<sup>scorePower</sup></i> where
-  <i>score</i> is its link analysis score and <i>scorePower</i> is the
-  value of this parameter.  This is compiled into indexes, so, when
-  this is changed, pages must be re-indexed for it to take
-  effect.</description>
-</property>
-
-<property>
-  <name>indexer.boost.by.link.count</name>
-  <value>true</value>
-  <description>When true scores for a page are multipled by the log of
-  the number of incoming links to the page.</description>
-</property>
-
-<property>
   <name>indexer.max.title.length</name>
   <value>100</value>
   <description>The maximum number of characters of a title that are indexed.
Index: src/java/org/apache/nutch/crawl/CrawlDatum.java
===
--- src/java/org/apache/nutch/crawl/CrawlDatum.java	(revision 326624)
+++ src/java/org/apache/nutch/crawl/CrawlDatum.java	(working copy)
@@ -31,7 +31,7 @@
   public static final String FETCH_DIR_NAME = "crawl_fetch";
   public static final String PARSE_DIR_NAME = "crawl_parse";
 
-  private final static byte CUR_VERSION = 1;
+  private final static byte CUR_VERSION = 2;
 
   public static final byte STATUS_DB_UNFETCHED = 1;
   public static final byte STATUS_DB_FETCHED = 2;
@@ -47,17 +47,20 @@
   private long fetchTime = System.currentTimeMillis();
   private byte retries;
   private float fetchInterval;
-  private int linkCount;
+  private float score = 1.0f;
 
   public CrawlDatum() {}
 
   public CrawlDatum(int status, float fetchInterval) {
 this.status = (byte)status;
 this.fetchInterval = fetchInterval;
-if (status == STATUS_LINKED)
-  linkCount = 1;
   }
 
+  public CrawlDatum(int status, float fetchInterval, float score) {
+this(status, fetchInterval);
+this.score = score;
+  }
+
   //
   // accessor methods
   //
@@ -80,8 +83,8 @@
 this.fetchInterval = fetchInterval;
   }
 
-  public int getLinkCount() { return linkCount; }
-  public void setLinkCount(int linkCount) { this.linkCount = linkCount; }
+  public float getScore() { return score; }
+  public void setScore(float score) { this.score = score; }
 
   //
   // writable methods
@@ -96,18 +99,18 @@
 
   public void readFields(DataInput