Hi:
I've been looking into the ScoringFilter interface and I have a question. The
distributeScoreToOutlinks function receives a parameter called targets, which
is a collection of (URL, CrawlDatum) pairs corresponding to the outlinks of the
URL currently being analyzed. The filter function of the IndexingFilter
interface also receives a CrawlDatum object, but that one corresponds only to
the URL -> NutchDocument that is about to be indexed. My question is whether
the CrawlDatum object passed to a ScoringFilter as an outlink is the same one
that the IndexingFilter receives when that particular outlink is about to be
indexed. In my local tests it is, but I'm not sure whether this still holds in
the distributed case.
For instance, suppose I have this:
test.html has 2 outlinks:
test.html ----> test2.html
          ----> test3.html
So, when any scoring plugin implementing ScoringFilter is called on test.html,
the targets parameter has one item in the collection for every outlink in
test.html, and I can modify something in the CrawlDatum objects inside the
targets collection. When test2.html is later indexed, will those changes be
passed to the indexing filters? I've tried this locally and it works, but will
the behavior be the same in a distributed environment running Nutch on top of
Hadoop?
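
To make the scenario concrete, here is a minimal sketch using simplified
stand-in types instead of the real Nutch classes. Datum, tagOutlinks, readTag
and the "x-source-page" key are all hypothetical names chosen for illustration;
they are not part of the Nutch API. The sketch only reproduces the single-JVM
case, where the same object is read back, and says nothing about what happens
when the datum is serialized between Hadoop jobs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

public class OutlinkMetadataSketch {

    // Hypothetical stand-in for CrawlDatum: it carries only a metadata map.
    static class Datum {
        final Map<String, String> metaData = new HashMap<>();
    }

    // Plays the role of distributeScoreToOutlinks: tag the datum of every outlink.
    static void tagOutlinks(String fromUrl, Collection<Entry<String, Datum>> targets) {
        for (Entry<String, Datum> target : targets) {
            target.getValue().metaData.put("x-source-page", fromUrl);
        }
    }

    // Plays the role of the indexing filter: read the tag back from the datum.
    static String readTag(Datum datum) {
        return datum.metaData.get("x-source-page");
    }

    public static void main(String[] args) {
        // test.html has two outlinks, as in the example above.
        List<Entry<String, Datum>> targets = new ArrayList<>();
        targets.add(new SimpleEntry<>("test2.html", new Datum()));
        targets.add(new SimpleEntry<>("test3.html", new Datum()));

        tagOutlinks("test.html", targets);

        // Same object, same JVM: the change made while scoring is visible here.
        System.out.println(readTag(targets.get(0).getValue()));
    }
}
```

What I would like to know is whether the real CrawlDatum behaves like this
in-memory object once the scoring and indexing steps run as separate jobs.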
Greetings!