Hi Maciek,

> The concept behind it is to prevent a given URL from refetching in the
> future based on text content analysis.
> extending ScoringFilter

Yes, it's the right plugin type to implement such a feature.

> keeping URLs in a HashSet defined in my ScoringFilter and then updating
> CrawlDatum in updateDbScore, but it seems that the HashSet is not
> persistent throughout the parsing and scoring process.

Indeed. Everything that should be persistent needs to be stored in Nutch
data structures. Assuming the "text content analysis" is done during
parsing, the flag or score needs to be passed forward via
- passScoreAfterParsing
- distributeScoreToOutlinks (in addition to passing information to the
  outlinks, you can "adjust" the CrawlDatum of the page being processed)
- updateDbScore - here you would modify the next fetch time of the page,
  and possibly also the retry interval
- if necessary, you can store additional information in the CrawlDatum's
  metadata

> As the documentation is very modest,

I agree. The wiki page [1] certainly needs an overhaul.

Best,
Sebastian

[1] https://cwiki.apache.org/confluence/display/nutch/NutchScoring

On 12/10/24 12:15, Maciek Puzianowski wrote:
> Hi,
>
> I am trying to make a Nutch plugin. I was wondering if it is possible to
> mark URLs based on the content of a fetched page. The concept behind it
> is to prevent a given URL from refetching in the future based on text
> content analysis.
>
> What I have tried so far is extending ScoringFilter and keeping URLs in
> a HashSet defined in my ScoringFilter and then updating CrawlDatum in
> updateDbScore, but it seems that the HashSet is not persistent
> throughout the parsing and scoring process.
>
> As the documentation is very modest, I would like to ask the community
> what I can do about this problem.
>
> Kind regards
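P.S. To make the metadata-based approach concrete, here is a rough,
self-contained sketch of the data flow. Nutch's CrawlDatum and parse
metadata are replaced by trivial stand-in classes so the snippet compiles
on its own, and the metadata key "noRefetch" is made up; a real plugin
would implement org.apache.nutch.scoring.ScoringFilter and use the actual
Nutch types instead:

```java
import java.util.HashMap;
import java.util.Map;

// Trivial stand-in for Nutch's metadata containers, for illustration only.
class Metadata {
    private final Map<String, String> map = new HashMap<>();
    void set(String k, String v) { map.put(k, v); }
    String get(String k) { return map.get(k); }
}

// Trivial stand-in for org.apache.nutch.crawl.CrawlDatum.
class CrawlDatum {
    Metadata metaData = new Metadata();
    long fetchTime;      // epoch millis of the next scheduled fetch
    int fetchInterval;   // seconds between fetches
}

public class NoRefetchSketch {
    static final String FLAG = "noRefetch";        // made-up metadata key
    static final int VERY_LONG = 365 * 24 * 3600;  // one year, in seconds

    // Analogous to passScoreAfterParsing: once the text analysis decides
    // the page should not be refetched, record that decision in parse
    // metadata, which Nutch persists (unlike a HashSet field in the
    // filter instance).
    static void passScoreAfterParsing(Metadata parseMeta, boolean dontRefetch) {
        if (dontRefetch) parseMeta.set(FLAG, "true");
    }

    // Analogous to updateDbScore: if the flag reached the CrawlDatum's
    // metadata, push the next fetch time far into the future and widen
    // the fetch interval.
    static void updateDbScore(CrawlDatum datum, long now) {
        if ("true".equals(datum.metaData.get(FLAG))) {
            datum.fetchInterval = VERY_LONG;
            datum.fetchTime = now + VERY_LONG * 1000L;
        }
    }

    public static void main(String[] args) {
        Metadata parseMeta = new Metadata();
        passScoreAfterParsing(parseMeta, true);

        // Between the two steps Nutch carries the metadata over to the
        // CrawlDatum (e.g. via the "adjust" datum mentioned above);
        // simulated here by a plain copy.
        CrawlDatum datum = new CrawlDatum();
        datum.metaData.set(FLAG, parseMeta.get(FLAG));

        updateDbScore(datum, System.currentTimeMillis());
        System.out.println("next fetch in " + datum.fetchInterval + " s");
    }
}
```

The HashSet approach fails because each MapReduce task gets its own
filter instance; the sketch above keeps the decision inside the records
Nutch itself serializes between phases.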