Hi Sebastian, thank you for trying to reproduce the problem!

> The parse-metatags plugin only duplicates the "robots" metatags,
> adding it also as "metatag.robots" but keep the original "robots".

This got me confused for a minute because you are absolutely right, they're
both there. So I checked again to see how I got to a point where I only had
"metatag.robots" but not "robots", and the problem only seems to occur when
parse-tika is used instead of parse-html.

So with this minimal setup

  protocol-httpclient|parse-(tika|metatags)|index-(metadata)

the parse metadata only contains "metatag.robots", while with this setup

  protocol-httpclient|parse-(html|metatags)|index-(metadata)

the parse metadata contains both "metatag.robots" and "robots".

Felix

> From: Sebastian Nagel
>
> Hi Felix,
>
> I tried to reproduce the problem. The parse-metatags plugin only duplicates
> the "robots" metatags, adding it also as "metatag.robots" but keep the
> original "robots".
>
> That is the case using the current master:
>
> - with parse-metatags and metatags.names="robots" the ParseData object
>   contains:
>
>   Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>   metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
>   generator=WordPress 3.1
>   robots=noindex,nofollow
>
>   metatag.robots is even added twice, but most important "robots" is still
>   present
>
> - without:
>
>   Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>   generator=WordPress 3.1
>   robots=noindex,nofollow
>
> Deleting the robots=noindex documents works as expected for both settings.
>
> > Or is it and I should file a report/patch?
>
> Yes, please open an issue to fix this on
> https://issues.apache.org/jira/projects/NUTCH
>
> Could be that there is some additional condition which I didn't hit.
>
> Can you also share the document for which it does not work?
>
> Thanks,
> Sebastian
>
> On 5/13/19 11:34 AM, Felix von Zadow wrote:
> > Hi all!
> >
> > So I was trying to use the option indexer.delete.robots.noindex (exclude
> > page when <meta robots="noindex"> is encountered).
> >
> > However, the page I'm testing with is still being indexed. I have
> > parse-metatags and index-metadata activated and
> > indexer.delete.robots.noindex=true, metatags.names="robots" and
> > index.parse.md="metatag.robots".
> >
> > Looking at IndexerMapReduce.java (#257) [1], the field that is being
> > checked is "robots" and not "metatag.robots". It does work as expected
> > when I change it to "metatag.robots":
> >
> > Before:
> > Indexing 3/3 documents
> > Deleting 0 documents
> > Indexer: number of documents indexed, deleted, or skipped:
> > Indexer: 3 indexed (add/update)
> >
> > After:
> > Indexing 2/2 documents
> > Deleting 0 documents
> > Indexer: number of documents indexed, deleted, or skipped:
> > Indexer: 1 deleted (robots=noindex)
> > Indexer: 2 indexed (add/update)
> >
> > Am I missing something and this is not actually a bug but rather some
> > misconfiguration on my part?
> > Or is it and I should file a report/patch?
> >
> > Thanks!
> > Felix
> >
> > [1] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
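
For reference, the key mismatch described in this thread can be illustrated with a small, self-contained sketch. It is not the actual IndexerMapReduce code: the class RobotsNoindexSketch, the method hasNoindex, and the plain Map standing in for the parse metadata are all hypothetical names chosen for illustration. It only shows why a check on "robots" alone misses a document whose parse metadata carries just "metatag.robots" (the parse-tika case reported above), and how checking both keys would catch it.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the indexer's robots check; NOT Nutch code.
class RobotsNoindexSketch {

    // Keys under which the robots directive shows up in the parse metadata
    // according to this thread: "robots" and "metatag.robots".
    static final List<String> ROBOTS_KEYS = List.of("robots", "metatag.robots");

    // Returns true if any of the known keys carries a "noindex" directive.
    static boolean hasNoindex(Map<String, String> parseMeta) {
        for (String key : ROBOTS_KEYS) {
            String value = parseMeta.get(key);
            if (value != null && value.toLowerCase(Locale.ROOT).contains("noindex")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // parse-tika + parse-metatags: only "metatag.robots" is present.
        Map<String, String> tika = Map.of("metatag.robots", "noindex,nofollow");
        // parse-html + parse-metatags: both keys are present.
        Map<String, String> html = Map.of(
                "robots", "noindex,nofollow",
                "metatag.robots", "noindex,nofollow");
        // No robots directive at all.
        Map<String, String> plain = Map.of("generator", "WordPress 3.1");

        System.out.println(hasNoindex(tika));   // true
        System.out.println(hasNoindex(html));   // true
        System.out.println(hasNoindex(plain));  // false
    }
}
```

Whether the real fix belongs in the indexer check or in what parse-tika writes to the parse metadata is a question for the JIRA issue; the sketch only makes the mismatch concrete.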

