Hi Felix,
I tried to reproduce the problem. The parse-metatags plugin only duplicates the
"robots" metatag: it adds it as "metatag.robots" but keeps the original
"robots".
That is the case using the current master:
- with parse-metatags and metatags.names="robots" the ParseData object contains:
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
generator=WordPress 3.1
robots=noindex,nofollow
metatag.robots is even added twice, but most importantly, "robots" is still present.
- without parse-metatags:
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
generator=WordPress 3.1
robots=noindex,nofollow
Deleting the robots=noindex documents works as expected for both settings.
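To illustrate why both settings behave the same here, below is a minimal sketch of a "robots" lookup on the parse metadata. It is not the actual IndexerMapReduce code: it uses a plain map instead of the Nutch Metadata class, the class and method names are hypothetical, and it assumes the check is a case-insensitive search for "noindex" in the value stored under the given key.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only, not Nutch code.
class RobotsNoIndexSketch {

  // Returns true if the value stored under `key` contains "noindex"
  // (assumed to be matched case-insensitively).
  static boolean hasNoIndex(Map<String, String> parseMeta, String key) {
    String value = parseMeta.get(key);
    return value != null && value.toLowerCase(Locale.ROOT).contains("noindex");
  }

  public static void main(String[] args) {
    // Parse metadata as observed with parse-metatags enabled.
    Map<String, String> withPlugin = new HashMap<>();
    withPlugin.put("robots", "noindex,nofollow");
    withPlugin.put("metatag.robots", "noindex,nofollow");

    // Parse metadata as observed without the plugin.
    Map<String, String> withoutPlugin = new HashMap<>();
    withoutPlugin.put("robots", "noindex,nofollow");

    System.out.println(hasNoIndex(withPlugin, "robots"));            // true
    System.out.println(hasNoIndex(withoutPlugin, "robots"));         // true
    System.out.println(hasNoIndex(withoutPlugin, "metatag.robots")); // false
  }
}
```

As long as the original "robots" key is present, a lookup on "robots" matches in both cases. A lookup on "metatag.robots" would only match when parse-metatags is active.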
> Or is it and I should file a report/patch?
Yes, please open an issue to fix this on
https://issues.apache.org/jira/projects/NUTCH
There may be some additional condition that I didn't hit.
Can you also share the document for which it does not work?
Thanks,
Sebastian
On 5/13/19 11:34 AM, Felix von Zadow wrote:
> Hi all!
>
> So I was trying to use the option indexer.delete.robots.noindex (exclude page
> when <meta robots="noindex"> is encountered).
>
> However, the page I'm testing with is still being indexed. I have
> parse-metatags and index-metadata activated and
> indexer.delete.robots.noindex=true, metatags.names="robots" and
> index.parse.md="metatag.robots".
>
> Looking at IndexerMapReduce.java (#257) [1], the field that is being checked
> is "robots" and not "metatag.robots". It does work as expected when I change
> it to "metatag.robots":
>
> Before:
> Indexing 3/3 documents
> Deleting 0 documents
> Indexer: number of documents indexed, deleted, or skipped:
> Indexer: 3 indexed (add/update)
>
> After:
> Indexing 2/2 documents
> Deleting 0 documents
> Indexer: number of documents indexed, deleted, or skipped:
> Indexer: 1 deleted (robots=noindex)
> Indexer: 2 indexed (add/update)
>
>
> Am I missing something and this is not actually a bug but rather some
> misconfiguration on my part?
> Or is it and I should file a report/patch?
>
> Thanks!
> Felix
>
>
> [1]
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
>
>