Hi:

I've been looking into the ScoringFilter interface and I have a question. The 
distributeScoreToOutlinks function receives a parameter called targets, which 
is a collection of URL and CrawlDatum pairs corresponding to the outlinks of the 
URL that is currently being analyzed. The filter function of the 
IndexingFilter interface also receives a CrawlDatum object, which 
corresponds only to the URL -> NutchDocument that is about to be indexed. My 
question is whether the CrawlDatum object passed to a ScoringFilter as an outlink 
is the same one that the IndexingFilter receives when that particular outlink is 
about to be indexed. I've done some tests locally and it is, but I'm not sure 
whether this still holds in the distributed case.

For instance, I have this:

test.html has 2 outlinks:
test.html ----> test2.html
          ----> test3.html

So, when any scoring plugin implementing ScoringFilter is called on test.html, 
the targets parameter has one item in the collection for every outlink in 
test.html, so I can modify some data in the CrawlDatum objects inside the targets 
collection. But when test2.html is indexed, will those changes be passed to 
the indexing filters? I've done this locally and it works, but in a distributed 
environment running Nutch on top of Hadoop, will the behavior be the same?
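
To make the scenario concrete, here is a minimal sketch of what I mean. It 
uses stand-in classes instead of the real Nutch types: the Datum class, the 
tagOutlinks and readTag methods, and the "found.on" key are only for 
illustration and are not part of Nutch. It only shows the single-JVM case, 
which is the one I have already verified; it says nothing about what happens 
on a cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

public class OutlinkDatumSketch {

    // Hypothetical stand-in for CrawlDatum: only a metadata map is modeled.
    static class Datum {
        final Map<String, String> meta = new HashMap<>();
    }

    // Plays the role of distributeScoreToOutlinks: tags the datum of every outlink.
    static void tagOutlinks(String fromUrl, Collection<Entry<String, Datum>> targets) {
        for (Entry<String, Datum> target : targets) {
            target.getValue().meta.put("found.on", fromUrl);
        }
    }

    // Plays the role of the indexing filter: reads the tag back, or null if absent.
    static String readTag(Datum datum) {
        return datum.meta.get("found.on");
    }

    public static void main(String[] args) {
        List<Entry<String, Datum>> targets = new ArrayList<>();
        targets.add(new SimpleEntry<>("test2.html", new Datum()));
        targets.add(new SimpleEntry<>("test3.html", new Datum()));

        tagOutlinks("test.html", targets);

        // Within one JVM the same objects are read back, so the tag is visible.
        // Whether the equivalent change survives on Hadoop is the open question.
        for (Entry<String, Datum> target : targets) {
            System.out.println(target.getKey() + " -> " + readTag(target.getValue()));
        }
    }
}
```

In the real plugin the change would be made on the CrawlDatum of each entry 
in targets, and read back from the CrawlDatum handed to the indexing filter.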

Greetings! 
