Hi Felix,

I've also checked parse-tika, but "robots=noindex" is also present in the parse metadata there, at least for the following test document:

% cat /var/www/html/nutch/noindex.html
<html>
<head>
<title>test</title>
<meta name='robots' content='noindex,nofollow'>
</head>
<body>
test for robots=noindex
</body>
</html>

The test page is hosted via Apache httpd on http://localhost/nutch/noindex.html:

- using parse-html

% bin/nutch parsechecker \
    -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" \
    -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
    http://localhost/nutch/noindex.html
...
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
  metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
  robots=noindex,nofollow

- using parse-tika

% bin/nutch parsechecker \
    -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" \
    -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
    http://localhost/nutch/noindex.html
...
Parse Metadata: metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
  dc:title=test Content-Encoding=ISO-8859-1
  robots=noindex,nofollow
  Content-Type=text/html; charset=ISO-8859-1

But I believe you. There can be multiple other reasons. Could you share the HTML
or a snippet of it which makes the issue reproducible?

Thanks,
Sebastian

On 5/23/19 10:01 AM, Felix von Zadow wrote:
>
> Hi Sebastian,
>
> thank you for trying to reproduce the problem!
>
>> The parse-metatags plugin only duplicates the "robots" metatags,
>> adding it also as "metatag.robots" but keep the original "robots".
>
> This got me confused for a minute because you are absolutely right, they're
> both there. So I checked again to see how I got to a point where I only had
> "metatag.robots" but not "robots" and the problem only seems to occur when
> parse-tika is used instead of parse-html.
> So with this minimal setup
>
> protocol-httpclient|parse-(tika|metatags)|index-(metadata)
>
> the parse metadata only contains "metatag.robots" while with this setup
>
> protocol-httpclient|parse-(html|metatags)|index-(metadata)
>
> the parse metadata contains both "metatag.robots" and "robots".
>
> Felix
>
>
>> From: Sebastian Nagel
>>
>> Hi Felix,
>>
>> I tried to reproduce the problem. The parse-metatags plugin only duplicates
>> the "robots" metatags,
>> adding it also as "metatag.robots" but keep the original "robots".
>>
>> That is the case using the current master:
>>
>> - with parse-metatags and metatags.names="robots" the ParseData object
>> contains:
>>
>> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>> metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
>> generator=WordPress 3.1
>> robots=noindex,nofollow
>>
>> metatag.robots is even added twice, but most important "robots" is still
>> present
>>
>> - without:
>>
>> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>> generator=WordPress 3.1
>> robots=noindex,nofollow
>>
>>
>> Deleting the robots=noindex documents works as expected for both settings.
>>
>>> Or is it and I should file a report/patch?
>>
>> Yes, please open an issue to fix this on
>> https://issues.apache.org/jira/projects/NUTCH
>>
>> Could be that there is some additional condition which I didn't hit.
>>
>> Can you also share the document for which it does not work?
>>
>>
>> Thanks,
>> Sebastian
>>
>>
>>
>> On 5/13/19 11:34 AM, Felix von Zadow wrote:
>>> Hi all!
>>>
>>> So I was trying to use the option indexer.delete.robots.noindex (exclude
>>> page when <meta robots="noindex"> is encountered).
>>>
>>> However, the page I'm testing with is still being indexed. I have
>>> parse-metatags and index-metadata activated and
>>> indexer.delete.robots.noindex=true, metatags.names="robots" and
>>> index.parse.md="metatag.robots".
>>>
>>> Looking at IndexerMapReduce.java (#257) [1], the field that is being
>>> checked is "robots" and not "metatag.robots". It does work as expected
>>> when I change it to "metatag.robots":
>>>
>>> Before:
>>> Indexing 3/3 documents
>>> Deleting 0 documents
>>> Indexer: number of documents indexed, deleted, or skipped:
>>> Indexer: 3 indexed (add/update)
>>>
>>> After:
>>> Indexing 2/2 documents
>>> Deleting 0 documents
>>> Indexer: number of documents indexed, deleted, or skipped:
>>> Indexer: 1 deleted (robots=noindex)
>>> Indexer: 2 indexed (add/update)
>>>
>>>
>>> Am I missing something and this is not actually a bug but rather some
>>> misconfiguration on my part?
>>> Or is it and I should file a report/patch?
>>>
>>> Thanks!
>>> Felix
>>>
>>>
>>> [1] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
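
For illustration only, here is a minimal Java sketch of why the metadata key matters in the behaviour discussed in this thread. It is not the Nutch code: the class `RobotsNoIndexSketch`, the helper `hasNoIndex`, and the use of a plain `Map` in place of Nutch's parse metadata are all assumptions made for the example, as is the exact matching rule (split on commas, compare case-insensitively). It only shows that a check against the key "robots" cannot fire when the parse metadata holds the value solely under "metatag.robots".

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class RobotsNoIndexSketch {

    // Hypothetical helper, not part of Nutch: returns true if any value stored
    // under the given metadata key contains the "noindex" directive.
    static boolean hasNoIndex(Map<String, List<String>> parseMeta, String key) {
        List<String> values = parseMeta.get(key);
        if (values == null) {
            return false; // key absent: nothing to act on
        }
        for (String value : values) {
            for (String directive : value.toLowerCase(Locale.ROOT).split(",")) {
                if (directive.trim().equals("noindex")) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Both keys present, as in the parsechecker output shown in this thread.
        Map<String, List<String>> both = Map.of(
            "robots", List.of("noindex,nofollow"),
            "metatag.robots", List.of("noindex,nofollow", "noindex,nofollow"));

        // Only the prefixed key present, as in the failing setup described.
        Map<String, List<String>> onlyMetatag = Map.of(
            "metatag.robots", List.of("noindex,nofollow"));

        System.out.println(hasNoIndex(both, "robots"));                // true
        System.out.println(hasNoIndex(onlyMetatag, "robots"));         // false
        System.out.println(hasNoIndex(onlyMetatag, "metatag.robots")); // true
    }
}
```

The second call corresponds to the "3 indexed" result and the third to the "1 deleted (robots=noindex)" result quoted above, under the assumption that the indexer looks up exactly one key.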

