Hi Sebastian,

thank you for trying to reproduce the problem!

> The parse-metatags plugin only duplicates the "robots" metatag,
> adding it also as "metatag.robots" but keeps the original "robots".

This confused me for a minute because you are absolutely right: they are
both there. So I checked again to see how I ended up with only
"metatag.robots" but not "robots", and the problem only seems to occur when
parse-tika is used instead of parse-html. With this minimal plugin.includes
setup

protocol-httpclient|parse-(tika|metatags)|index-(metadata)

the parse metadata only contains "metatag.robots" while with this setup

protocol-httpclient|parse-(html|metatags)|index-(metadata)

the parse metadata contains both "metatag.robots" and "robots".
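
(For reference, I am inspecting the parse metadata with the parsechecker
tool, roughly like this; the URL is just a placeholder for my test page:

    bin/nutch parsechecker http://www.example.com/testpage.html

and comparing the "Parse Metadata" section of the output between the two
plugin.includes values above.)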

Felix


> From: Sebastian Nagel
>
> Hi Felix,
> 
> I tried to reproduce the problem. The parse-metatags plugin only duplicates
> the "robots" metatag, adding it also as "metatag.robots" but keeps the
> original "robots".
> 
> That is the case using the current master:
> 
> - with parse-metatags and metatags.names="robots" the ParseData object
> contains:
> 
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
> generator=WordPress 3.1
> robots=noindex,nofollow
> 
> metatag.robots is even added twice, but most importantly "robots" is still
> present.
> 
> - without:
> 
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> generator=WordPress 3.1
> robots=noindex,nofollow
> 
> 
> Deleting the robots=noindex documents works as expected for both settings.
> 
> > Or is it a bug, and should I file a report/patch?
> 
> Yes, please open an issue to fix this on
>     https://issues.apache.org/jira/projects/NUTCH
> 
> Could be that there is some additional condition which I didn't hit.
> 
> Can you also share the document for which it does not work?
> 
> 
> Thanks,
> Sebastian
> 
> 
> 
> On 5/13/19 11:34 AM, Felix von Zadow wrote:
> > Hi all!
> >
> > So I was trying to use the option indexer.delete.robots.noindex (exclude
> > the page when <meta name="robots" content="noindex"> is encountered).
> >
> > However, the page I'm testing with is still being indexed. I have
> > parse-metatags and index-metadata activated and
> > indexer.delete.robots.noindex=true, metatags.names="robots" and
> > index.parse.md="metatag.robots".
> >
> > Looking at IndexerMapReduce.java (line 257) [1], the field that is being
> > checked is "robots" and not "metatag.robots". It does work as expected when
> > I change it to "metatag.robots":
> >
> > Before:
> > Indexing 3/3 documents
> > Deleting 0 documents
> > Indexer: number of documents indexed, deleted, or skipped:
> > Indexer:      3  indexed (add/update)
> >
> > After:
> > Indexing 2/2 documents
> > Deleting 0 documents
> > Indexer: number of documents indexed, deleted, or skipped:
> > Indexer:      1  deleted (robots=noindex)
> > Indexer:      2  indexed (add/update)
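> >
> > To make that concrete, here is a simplified sketch of the kind of check
> > involved (plain Java with a Map standing in for the parse metadata; this
> > is not the actual IndexerMapReduce code). My local change simply swapped
> > the key; falling back to the prefixed key would be another option:
> >
> >   import java.util.HashMap;
> >   import java.util.Locale;
> >   import java.util.Map;
> >
> >   public class RobotsNoindexCheck {
> >
> >     /** True if the parse metadata carries a robots "noindex" directive. */
> >     static boolean deleteByRobotsNoindex(Map<String, String> parseMeta) {
> >       // today only the plain "robots" key is consulted ...
> >       String robots = parseMeta.get("robots");
> >       // ... but in my setup the value is only present under the prefixed
> >       // key, so also fall back to it
> >       if (robots == null) {
> >         robots = parseMeta.get("metatag.robots");
> >       }
> >       return robots != null
> >           && robots.toLowerCase(Locale.ROOT).contains("noindex");
> >     }
> >
> >     public static void main(String[] args) {
> >       Map<String, String> meta = new HashMap<>();
> >       meta.put("metatag.robots", "noindex,nofollow");
> >       System.out.println(deleteByRobotsNoindex(meta)); // prints: true
> >     }
> >   }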
> >
> >
> > Am I missing something, and is this actually not a bug but rather some
> > misconfiguration on my part?
> > Or is it a bug, and should I file a report/patch?
> >
> > Thanks!
> > Felix
> >
> >
> > [1] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
> >
> >
