In the first 2-3 crawls there are URLs in the CrawlDb with either db_fetched
or db_unfetched status. Only db_fetched URLs go through the IndexingFilter,
and even though all of these URLs have my custom tag in their metadata
(verified in a dumped CrawlDb), some of them throw an NPE seemingly at
random when I try to read that tag.
However, it is not that bad, because I can pass my tag to the IndexingFilter
via ParseData instead, and then there is no problem.
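
For reference, the ParseData route that works for me looks roughly like
this (simplified sketch; shouldSkipIndexing() stands in for my actual
content analysis):

// In the ScoringFilter, flag the page right after parsing:
public void passScoreAfterParsing(Text url, Content content, Parse parse)
    throws ScoringFilterException {
  if (shouldSkipIndexing(parse.getText())) {
    parse.getData().getParseMeta().set(SHOULD_REFETCH_AND_INDEX, "false");
  }
}

// In the IndexingFilter, read the flag back from the ParseData:
public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    CrawlDatum datum, Inlinks inlinks) throws IndexingException {
  if ("false".equals(parse.getData().getParseMeta()
      .get(SHOULD_REFETCH_AND_INDEX))) {
    return null; // returning null drops the document from indexing
  }
  return doc;
}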

Unless I am making some silly mistake, I think this is worth a Jira issue.

Best,
Maciek

Fri, 13 Dec 2024 at 00:23 Sebastian Nagel <wastl.na...@googlemail.com.invalid>
wrote:

> Hi Maciek,
>
>  > However, sometimes this gets me a NullPointerException, which is
>  > strange, because I have double-checked a dumped CrawlDb and these
>  > URLs do have this tag in their metadata.
>
> Are there other URLs / items in the CrawlDb as well? I'd especially look
> at the unfetched ones, as these may not have the metadata initialized.
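>
> If you read it in the IndexingFilter, a defensive lookup would at least
> avoid the NPE, e.g. (just a sketch, using your constant):
>
> Writable w = crawlDatum.getMetaData()
>     .get(new Text(SHOULD_REFETCH_AND_INDEX));
> // treat a missing entry as "no decision" instead of dereferencing null
> Text shouldIndex = (w instanceof Text) ? (Text) w : null;
> if (shouldIndex == null) {
>   // e.g. fall back to the default indexing behavior
> }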
>
> Otherwise difficult to tell...
>
> Best,
> Sebastian
>
> On 12/12/24 13:06, Maciek Puzianowski wrote:
> > Sebastian,
> > thank you very much for your help. I have come to the exact same solution
> > by following the Nutch code, and it works like a charm. However, I have
> > one more concern. I want to use the CrawlDatum metadata that I updated in
> > the distributeScoreToOutlinks method in my IndexingFilter. There I would do:
> >
> > Text shouldIndex = (Text) crawlDatum.getMetaData()
> >     .get(new Text(SHOULD_REFETCH_AND_INDEX));
> >
> > where SHOULD_REFETCH_AND_INDEX is the metadata tag that I put in the
> > distributeScoreToOutlinks method. However, sometimes this gets me a
> > NullPointerException, which is strange, because I have double-checked a
> > dumped CrawlDb and these URLs do have this tag in their metadata.
> > Any hints on that?
> >
> > Again, thank you Sebastian for your response.
> >
> > Best regards
> > Maciek
> >
> > Thu, 12 Dec 2024 at 10:42 Sebastian Nagel
> > <wastl.na...@googlemail.com.invalid> wrote:
> >
> >> Hi Maciek,
> >>
> >>   > The concept behind it is to prevent a given URL from being refetched
> >>   > in the future based on text content analysis.
> >>
> >>   > extending ScoringFilter
> >>
> >> Yes, it's the right plugin type to implement such a feature.
> >>
> >>   > keeping URLs in a HashSet defined in my ScoringFilter and then
> >>   > updating the CrawlDatum in updateDbScore, but it seems that the
> >>   > HashSet is not persistent throughout the parsing and scoring process.
> >>
> >> Indeed. Everything which should be persistent needs to be stored in
> >> Nutch data structures. Assuming the "text content analysis" is done
> >> during parsing, the flag or score needs to be passed forward via
> >>    - passScoreAfterParsing
> >>    - distributeScoreToOutlinks
> >>      (in addition to passing stuff to outlinks, you can "adjust" the
> >>       CrawlDatum of the page being processed)
> >>    - updateDbScore
> >>      - here you would modify the next fetch time of the
> >>        page, possibly also the retry interval
> >>      - if necessary you can store additional information in the
> >>        CrawlDatum's metadata (see the sketch below)
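> >>
> >> A rough, untested sketch of these two hooks (the metadata key and the
> >> one-year interval are just placeholders):
> >>
> >> private static final String FLAG = "myplugin.noRefetch"; // placeholder
> >>
> >> public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
> >>     ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets,
> >>     CrawlDatum adjust, int allCount) throws ScoringFilterException {
> >>   // copy the flag set during parsing into the "adjust" datum,
> >>   // which is applied to the page's own CrawlDatum at updatedb time
> >>   if ("true".equals(parseData.getParseMeta().get(FLAG))) {
> >>     if (adjust == null)
> >>       adjust = new CrawlDatum(CrawlDatum.STATUS_LINKED, 0);
> >>     adjust.getMetaData().put(new Text(FLAG), new Text("true"));
> >>   }
> >>   return adjust;
> >> }
> >>
> >> public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
> >>     List<CrawlDatum> inlinked) throws ScoringFilterException {
> >>   Writable flag = datum.getMetaData().get(new Text(FLAG));
> >>   if (flag != null && "true".equals(flag.toString())) {
> >>     // push the next fetch far into the future, e.g. one year
> >>     datum.setFetchInterval(365 * 24 * 60 * 60);
> >>   }
> >> }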
> >>
> >>
> >>   > As the documentation is very modest,
> >>
> >> I agree. The wiki page [1] certainly needs an overhaul.
> >>
> >> Best,
> >> Sebastian
> >>
> >>
> >> [1] https://cwiki.apache.org/confluence/display/nutch/NutchScoring
> >>
> >>
> >> On 12/10/24 12:15, Maciek Puzianowski wrote:
> >>> Hi,
> >>> I am trying to write a Nutch plugin.
> >>> I was wondering if it is possible to mark URLs based on the content of
> >>> a fetched page.
> >>> The concept behind it is to prevent a given URL from being refetched
> >>> in the future based on text content analysis.
> >>>
> >>> What I have tried so far is extending ScoringFilter,
> >>> keeping URLs in a HashSet defined in my ScoringFilter and then
> >>> updating the CrawlDatum in updateDbScore, but it seems that the
> >>> HashSet is not persistent throughout the parsing and scoring process.
> >>>
> >>> As the documentation is very modest, I would like to ask the community
> >>> what I can do about this problem.
> >>>
> >>> Kind regards
> >>>
> >>
> >>
> >
>
>
