Hi all!

So I was trying to use the option indexer.delete.robots.noindex (exclude a page 
when <meta name="robots" content="noindex"> is encountered).

However, the page I'm testing with is still being indexed. I have the 
parse-metatags and index-metadata plugins activated, with 
indexer.delete.robots.noindex=true, metatags.names="robots" and 
index.parse.md="metatag.robots".
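
For completeness, the settings above would look like this in conf/nutch-site.xml (property names exactly as listed; the plugin.includes value is abbreviated, since the rest of my plugin list isn't relevant here):

```xml
<!-- sketch of the relevant nutch-site.xml entries; plugin.includes is
     abbreviated: the real value also contains the other active plugins -->
<property>
  <name>plugin.includes</name>
  <value>...|parse-metatags|index-metadata|...</value>
</property>
<property>
  <name>metatags.names</name>
  <value>robots</value>
</property>
<property>
  <name>index.parse.md</name>
  <value>metatag.robots</value>
</property>
<property>
  <name>indexer.delete.robots.noindex</name>
  <value>true</value>
</property>
```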

Looking at IndexerMapReduce.java, line 257 [1], the field being checked is 
"robots", not "metatag.robots". It does work as expected once I change it to 
"metatag.robots":

Before:
Indexing 3/3 documents
Deleting 0 documents
Indexer: number of documents indexed, deleted, or skipped:
Indexer:      3  indexed (add/update)

After:
Indexing 2/2 documents
Deleting 0 documents
Indexer: number of documents indexed, deleted, or skipped:
Indexer:      1  deleted (robots=noindex)
Indexer:      2  indexed (add/update)
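
To illustrate the mismatch (this is a standalone sketch, not the actual Nutch code; the field map is a hypothetical stand-in for the indexed document's fields):

```java
import java.util.HashMap;
import java.util.Map;

public class RobotsNoindexCheck {

    // Illustrative stand-in for the indexer's noindex check: look up the
    // robots value under the given field name and test for "noindex".
    static boolean shouldDelete(Map<String, String> fields, String robotsField) {
        String value = fields.get(robotsField);
        return value != null && value.toLowerCase().contains("noindex");
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        // With parse-metatags + index.parse.md=metatag.robots, the robots
        // value ends up under "metatag.robots", not "robots".
        doc.put("metatag.robots", "noindex,nofollow");

        System.out.println(shouldDelete(doc, "robots"));          // false: field name mismatch
        System.out.println(shouldDelete(doc, "metatag.robots"));  // true: page gets deleted
    }
}
```

So the page carrying the noindex directive is only deleted when the check reads the field the metatags plugins actually wrote.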


Am I missing something, i.e., is this not actually a bug but some 
misconfiguration on my part? Or is it a bug, in which case should I file a 
report/patch?

Thanks!
Felix


[1] 
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
