In the first 2-3 crawls the CrawlDb contains URLs with either db_fetched or db_unfetched status. Only db_fetched URLs go through the IndexingFilter, and even though all of them have my custom tag in their metadata (checked in a CrawlDb dump), some of them, seemingly at random, throw an NPE when I try to read that tag. It is not that bad, though, because I can pass my tag to the IndexingFilter via ParseData, and then there is no problem.
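The workaround can be sketched roughly as follows: read the tag defensively from the datum and fall back to the value carried in the parse metadata. This is only an illustration of the pattern, not the actual plugin code. DatumStub and the plain Map are hypothetical stand-ins for CrawlDatum and the Nutch metadata classes, and the key value is made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum: only the metadata map matters here.
class DatumStub {
    private final Map<String, String> metaData = new HashMap<>();

    Map<String, String> getMetaData() {
        return metaData;
    }
}

class TagLookup {
    // Assumed key value; the real constant is not shown in this thread.
    static final String SHOULD_REFETCH_AND_INDEX = "shouldRefetchAndIndex";

    // Returns the tag from the datum if present, otherwise from the parse
    // metadata, otherwise null. Never dereferences a missing datum or value.
    static String readTag(DatumStub datum, Map<String, String> parseMeta) {
        if (datum != null) {
            String fromDatum = datum.getMetaData().get(SHOULD_REFETCH_AND_INDEX);
            if (fromDatum != null) {
                return fromDatum;
            }
        }
        return parseMeta == null ? null : parseMeta.get(SHOULD_REFETCH_AND_INDEX);
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        parseMeta.put(SHOULD_REFETCH_AND_INDEX, "true");

        // The datum has no tag, so the value comes from the parse metadata.
        System.out.println(readTag(new DatumStub(), parseMeta));
        // Neither source has the tag: the caller gets null instead of an NPE.
        System.out.println(readTag(null, null));
    }
}
```

In a real IndexingFilter the caller would still have to decide what a null result means, for example skipping the document or indexing it by default.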
Unless I am making some silly mistake, I think it is worth a Jira issue.

Best,
Maciek

On Fri, 13 Dec 2024 at 00:23, Sebastian Nagel <wastl.na...@googlemail.com.invalid> wrote:

> Hi Maciek,
>
> > However, sometimes this gets me a NullPointerException and it is
> > kind of weird to me, because I have double checked and dumped CrawlDb and
> > these URLs have this tag in metadata.
>
> Are there other URLs / items in the CrawlDb as well? I'd especially look at
> unfetched ones, as these may not have the Metadata initialized.
>
> Otherwise difficult to tell...
>
> Best,
> Sebastian
>
> On 12/12/24 13:06, Maciek Puzianowski wrote:
> > Sebastian,
> > thank you very much for your help. I have come to the exact same solution
> > following Nutch code and it works like a charm. Although, I have one more
> > concern. I wanted to use CrawlDatum metadata that I have updated in
> > distributeScoreToOutlinks method in IndexFilter. I would do (in IndexFilter)
> >
> > Text shouldIndex = (Text) crawlDatum.getMetaData().get(new Text(SHOULD_REFETCH_AND_INDEX));
> >
> > that is the metadata tag that I have put in distributeScoreToOutlinks
> > method. However, sometimes this gets me a NullPointerException and it is
> > kind of weird to me, because I have double checked and dumped CrawlDb and
> > these URLs have this tag in metadata.
> > Any hints on that?
> >
> > Again, thank you Sebastian for your response.
> >
> > Best regards
> > Maciek
> >
> > On Thu, 12 Dec 2024 at 10:42, Sebastian Nagel
> > <wastl.na...@googlemail.com.invalid> wrote:
> >
> >> Hi Maciek,
> >>
> >> > The concept behind it is to prevent given URL from refetching in the future
> >> > based on text content analysis.
> >>
> >> > extending ScoringFilter
> >>
> >> Yes, it's the right plugin type to implement such a feature.
> >>
> >> > keeping urls in a HashSet defined in my ScoringFilter and then updating
> >> > CrawlDatum in updateDbScore, but it seems that the HashSet is not persistent
> >> > throughout parsing and scoring process.
> >>
> >> Indeed. Everything which should be persistent needs to be stored in Nutch
> >> data structures. Assumed the "text content analysis" is done during the
> >> parsing, the flag or score needs to be passed forward via
> >> - passScoreAfterParsing
> >> - distributeScoreToOutlinks
> >>   (in addition to passing stuff to outlinks but you can "adjust" the
> >>   CrawlDatum of the page being processed)
> >> - updateDbScore
> >>   - here you would modify the next fetch time of the
> >>     page, eventually also the retry interval
> >>   - if necessary you can store additional information in the CrawlDatum's
> >>     metadata
> >>
> >> > As the documentation is very modest,
> >>
> >> I agree. The wiki page [1] needs for sure an overhaul.
> >>
> >> Best,
> >> Sebastian
> >>
> >> [1] https://cwiki.apache.org/confluence/display/nutch/NutchScoring
> >>
> >> On 12/10/24 12:15, Maciek Puzianowski wrote:
> >>> Hi,
> >>> I am trying to make a Nutch plugin.
> >>> I was wondering if it is possible to mark URLs based on content of a
> >>> fetched page.
> >>> The concept behind it is to prevent given URL from refetching in the future
> >>> based on text content analysis.
> >>>
> >>> What I have tried so far is extending ScoringFilter and keeping urls in a
> >>> HashSet defined in my ScoringFilter and then updating CrawlDatum in
> >>> updateDbScore, but it seems that the HashSet is not persistent throughout
> >>> parsing and scoring process.
> >>>
> >>> As the documentation is very modest, I would like to ask community about
> >>> what can I do with this problem.
> >>>
> >>> Kind regards