Hi Maciek,

> The concept behind it is to prevent a given URL from being refetched in
> the future based on text content analysis.

> extending ScoringFilter

Yes, it's the right plugin type to implement such a feature.

> keeping URLs in a HashSet defined in my ScoringFilter and then updating
> CrawlDatum in updateDbScore, but it seems that the HashSet is not persistent
> throughout the parsing and scoring process.

Indeed. Everything that should be persistent needs to be stored in Nutch
data structures. Assuming the "text content analysis" is done during
parsing, the flag or score needs to be passed forward via
 - passScoreAfterParsing
 - distributeScoreToOutlinks
   (in addition to passing information to the outlinks, you can also
    "adjust" the CrawlDatum of the page being processed)
 - updateDbScore
   - here you would modify the next fetch time of the
     page, and possibly also the retry interval
   - if necessary you can store additional information in the CrawlDatum's
     metadata
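To illustrate the chain above, here is a minimal sketch of such a scoring
filter. The class name, the metadata key and the fetch interval value are
made-up examples, and it assumes the AbstractScoringFilter convenience base
class of recent Nutch 1.x versions; please verify the method signatures
against the ScoringFilter interface of your Nutch version:

```java
package org.example.nutch;

import java.util.Collection;
import java.util.List;
import java.util.Map.Entry;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.scoring.AbstractScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

/**
 * Illustrative sketch (not a tested implementation): marks a page during
 * parsing and pushes its next fetch far into the future when the CrawlDb
 * is updated. "norefetch" and NoRefetchScoringFilter are made-up names.
 */
public class NoRefetchScoringFilter extends AbstractScoringFilter {

  private static final Text FLAG_KEY = new Text("norefetch");

  /** Analyze the parsed text and record the result in the parse metadata. */
  @Override
  public void passScoreAfterParsing(Text url, Content content, Parse parse) {
    if (shouldNotRefetch(parse.getText())) {
      parse.getData().getParseMeta().set(FLAG_KEY.toString(), "true");
    }
  }

  /** Copy the flag from the parse metadata into the "adjust" CrawlDatum,
   *  which is written back for the page being processed. */
  @Override
  public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
      ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets,
      CrawlDatum adjust, int allCount) throws ScoringFilterException {
    if ("true".equals(parseData.getParseMeta().get(FLAG_KEY.toString()))) {
      if (adjust == null) {
        adjust = new CrawlDatum(CrawlDatum.STATUS_LINKED, 0);
      }
      adjust.getMetaData().put(FLAG_KEY, new Text("true"));
    }
    return adjust;
  }

  /** On CrawlDb update, look for the flag and stretch the fetch interval.
   *  Depending on the Nutch version, the adjust datum may arrive via the
   *  inlinked list rather than on the merged datum itself. */
  @Override
  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
      List<CrawlDatum> inlinked) {
    boolean flagged = datum.getMetaData().containsKey(FLAG_KEY);
    if (!flagged && inlinked != null) {
      for (CrawlDatum d : inlinked) {
        if (d.getMetaData().containsKey(FLAG_KEY)) {
          flagged = true;
          break;
        }
      }
    }
    if (flagged) {
      // roughly ten years in seconds -- effectively "never refetch"
      datum.setFetchInterval(10 * 365 * 24 * 3600);
    }
  }

  /** Placeholder for the actual text content analysis. */
  private boolean shouldNotRefetch(String text) {
    return text != null && text.contains("some-marker"); // illustrative only
  }
}
```

The plugin would still need the usual plugin.xml declaring the extension
point and an entry in plugin.includes to be activated.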


> As the documentation is very modest,

I agree. The wiki page [1] definitely needs an overhaul.

Best,
Sebastian


[1] https://cwiki.apache.org/confluence/display/nutch/NutchScoring


On 12/10/24 12:15, Maciek Puzianowski wrote:
Hi,
I am trying to make a Nutch plugin.
I was wondering if it is possible to mark URLs based on the content of a
fetched page.
The concept behind it is to prevent a given URL from being refetched in
the future based on text content analysis.

What I have tried so far is extending ScoringFilter and keeping URLs in a
HashSet defined in my ScoringFilter and then updating CrawlDatum in
updateDbScore, but it seems that the HashSet is not persistent throughout
the parsing and scoring process.

As the documentation is very modest, I would like to ask the community
what I can do about this problem.

Kind regards

