Hi Sebastian,

thank you so much for checking again. With your test document I get the same result as you. Guess what the difference to my document was... Mine has the robots tag capitalized:

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

Apparently someone thought this tag was particularly important. Or they came from the 90s where this was common practice [1].

Anyway, this leads to the following results:

parse-html:

bin/nutch parsechecker \
  -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" \
  -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
  http://localhost:8080/noindex_caps.html

Parse Metadata: CharEncodingForConversion=windows-1252
  OriginalCharEncoding=windows-1252
  metatag.robots=noindex,nofollow
  robots=noindex,nofollow

parse-tika:

bin/nutch parsechecker \
  -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" \
  -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
  http://localhost:8080/noindex_caps.html

Parse Metadata: metatag.robots=NOINDEX,NOFOLLOW
  dc:title=test
  Content-Encoding=ISO-8859-1
  ROBOTS=NOINDEX,NOFOLLOW
  Content-Type=text/html; charset=ISO-8859-1

Note that parse-tika keeps the capitalization of the "ROBOTS" key while parse-html lowercases it. So my guess is that in [2] parseData.getMeta("robots") returns null because the key lookup is case-sensitive, and the document is then indexed instead of deleted.

There are plenty of resources online [3] suggesting a capitalized ROBOTS meta tag, and I can't seem to find any that say it MUST be in lower case. So I guess this can still be considered a bug.

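To illustrate the failure mode I suspect, here is a minimal sketch. It is not the Nutch code: a plain HashMap stands in for the parse metadata, and the class and method names (RobotsMetaLookup, exactLookup, caseInsensitiveLookup, isNoIndex) are made up for this example. It only shows why an exact-case lookup of "robots" misses a key stored as "ROBOTS", and what a case-insensitive lookup could look like.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the metadata check; not part of Nutch.
class RobotsMetaLookup {

    // Exact-case lookup: misses the entry if the key is stored as "ROBOTS".
    static String exactLookup(Map<String, String> parseMeta) {
        return parseMeta.get("robots");
    }

    // Case-insensitive lookup: compares each key to "robots" ignoring case.
    static String caseInsensitiveLookup(Map<String, String> parseMeta) {
        for (Map.Entry<String, String> entry : parseMeta.entrySet()) {
            if ("robots".equalsIgnoreCase(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    // True if the robots value contains "noindex", ignoring case.
    static boolean isNoIndex(String robotsValue) {
        return robotsValue != null
            && robotsValue.toLowerCase(Locale.ROOT).contains("noindex");
    }

    public static void main(String[] args) {
        // Metadata as reported by parse-tika for the capitalized tag.
        Map<String, String> tikaMeta = new HashMap<>();
        tikaMeta.put("ROBOTS", "NOINDEX,NOFOLLOW");

        System.out.println("exact: " + exactLookup(tikaMeta));
        System.out.println("case-insensitive: " + caseInsensitiveLookup(tikaMeta));
        System.out.println("noindex: " + isNoIndex(caseInsensitiveLookup(tikaMeta)));
    }
}
```

With the capitalized key, the exact lookup yields null while the case-insensitive one finds the value, which would match the behaviour I am seeing. Whether the right fix is in the indexer or in normalizing the key in parse-tika is a separate question.
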
Felix

[1] https://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.2
[2] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
[3] http://www.robotstxt.org/meta.html

> From: Sebastian Nagel
>
> Hi Felix,
>
> I've also checked parse-tika but the "robots=noindex" is also in the parse
> metadata, at least for the following test document:
>
> % cat /var/www/html/nutch/noindex.html
> <html>
> <head>
> <title>test</title>
> <meta name='robots' content='noindex,nofollow'>
> </head>
> <body>
> test for robots=noindex
> </body>
> </html>
>
> The test page is hosted via Apache httpd on
> http://localhost/nutch/noindex.html:
>
> - using parse-html
>
> % bin/nutch parsechecker \
>     -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" \
>     -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
>     http://localhost/nutch/noindex.html
> ...
> Parse Metadata: CharEncodingForConversion=windows-1252
> OriginalCharEncoding=windows-1252
> metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
> robots=noindex,nofollow
>
> - using parse-tika
>
> % bin/nutch parsechecker \
>     -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" \
>     -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
>     http://localhost/nutch/noindex.html
> ...
> Parse Metadata: metatag.robots=noindex,nofollow
> metatag.robots=noindex,nofollow dc:title=test
> Content-Encoding=ISO-8859-1 robots=noindex,nofollow
> Content-Type=text/html; charset=ISO-8859-1
>
>
> But I believe you. There can be multiple other reasons. Could you share the
> HTML or a snippet of it which makes the issue reproducible?
>
>
> Thanks,
> Sebastian
>
> On 5/23/19 10:01 AM, Felix von Zadow wrote:
> >
> > Hi Sebastian,
> >
> > thank you for trying to reproduce the problem!
> >
> >> The parse-metatags plugin only duplicates the "robots" metatags,
> >> adding it also as "metatag.robots" but keep the original "robots".
> >
> > This got me confused for a minute because you are absolutely right, they're
> > both there. So I checked again to see how I got to a point where I only had
> > "metatag.robots" but not "robots" and the problem only seems to occur when
> > parse-tika is used instead of parse-html. So with this minimal setup
> >
> > protocol-httpclient|parse-(tika|metatags)|index-(metadata)
> >
> > the parse metadata only contains "metatag.robots" while with this setup
> >
> > protocol-httpclient|parse-(html|metatags)|index-(metadata)
> >
> > the parse metadata contains both "metatag.robots" and "robots".
> >
> > Felix
> >
> >
> >> From: Sebastian Nagel
> >>
> >> Hi Felix,
> >>
> >> I tried to reproduce the problem. The parse-metatags plugin only
> >> duplicates the "robots" metatags,
> >> adding it also as "metatag.robots" but keep the original "robots".
> >>
> >> That is the case using the current master:
> >>
> >> - with parse-metatags and metatags.names="robots" the ParseData object
> >> contains:
> >>
> >> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> >> metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
> >> generator=WordPress 3.1
> >> robots=noindex,nofollow
> >>
> >> metatag.robots is even added twice, but most important "robots" is still
> >> present
> >>
> >> - without:
> >>
> >> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> >> generator=WordPress 3.1
> >> robots=noindex,nofollow
> >>
> >>
> >> Deleting the robots=noindex documents works as expected for both settings.
> >>
> >>> Or is it and I should file a report/patch?
> >>
> >> Yes, please open an issue to fix this on
> >> https://issues.apache.org/jira/projects/NUTCH
> >>
> >> Could be that there is some additional condition which I didn't hit.
> >>
> >> Can you also share the document for which it does not work?
> >>
> >>
> >> Thanks,
> >> Sebastian
> >>
> >>
> >>
> >> On 5/13/19 11:34 AM, Felix von Zadow wrote:
> >>> Hi all!
> >>>
> >>> So I was trying to use the option indexer.delete.robots.noindex (exclude
> >>> page when <meta robots="noindex"> is encountered).
> >>>
> >>> However, the page I'm testing with is still being indexed. I have
> >>> parse-metatags and index-metadata activated and
> >>> indexer.delete.robots.noindex=true, metatags.names="robots" and
> >>> index.parse.md="metatag.robots".
> >>>
> >>> Looking at IndexerMapReduce.java (#257) [1], the field that is being
> >>> checked is "robots" and not "metatag.robots". It does work as expected
> >>> when I change it to "metatag.robots":
> >>>
> >>> Before:
> >>> Indexing 3/3 documents
> >>> Deleting 0 documents
> >>> Indexer: number of documents indexed, deleted, or skipped:
> >>> Indexer: 3 indexed (add/update)
> >>>
> >>> After:
> >>> Indexing 2/2 documents
> >>> Deleting 0 documents
> >>> Indexer: number of documents indexed, deleted, or skipped:
> >>> Indexer: 1 deleted (robots=noindex)
> >>> Indexer: 2 indexed (add/update)
> >>>
> >>>
> >>> Am I missing something and this is not actually a bug but rather some
> >>> misconfiguration on my part?
> >>> Or is it and I should file a report/patch?
> >>>
> >>> Thanks!
> >>> Felix
> >>>
> >>>
> >>> [1] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257

