Hi Felix,

> There are plenty of resources online [3] suggesting a capitalized ROBOTS meta
> tag, and I can't seem to find any that say that it MUST be in lower case.
> So I guess this can still be considered a bug.

Yes, definitely. Please open a Jira issue to fix it.

Thanks,
Sebastian

On 5/23/19 3:16 PM, Felix von Zadow wrote:
> Hi Sebastian,
>
> thank you so much for checking again. With your test document I get the same
> result as you. Guess what the difference to my document was... Mine has the
> robots tag capitalized:
>
>   <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
>
> Apparently someone thought this tag was particularly important. Or they came
> from the 90s where this was common practice [1].
>
> Anyway, this leads to the following results:
>
> parse-html:
>
>   bin/nutch parsechecker
>     -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata"
>     -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots"
>     http://localhost:8080/noindex_caps.html
>
>   Parse Metadata: CharEncodingForConversion=windows-1252
>   OriginalCharEncoding=windows-1252 metatag.robots=noindex,nofollow
>   robots=noindex,nofollow
>
> parse-tika:
>
>   bin/nutch parsechecker
>     -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata"
>     -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots"
>     http://localhost:8080/noindex_caps.html
>
>   Parse Metadata: metatag.robots=NOINDEX,NOFOLLOW dc:title=test
>   Content-Encoding=ISO-8859-1 ROBOTS=NOINDEX,NOFOLLOW
>   Content-Type=text/html; charset=ISO-8859-1
>
> Note that parse-tika keeps the capitalization of "ROBOTS" while parse-html
> does not. So my guess is that in [2] parseData.getMeta("robots") becomes null
> and then the document is indexed.
>
> There are plenty of resources online [3] suggesting a capitalized ROBOTS meta
> tag, and I can't seem to find any that say that it MUST be in lower case. So
> I guess this can still be considered a bug.
>
> Felix
>
>
> [1] https://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.2
> [2] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
> [3] http://www.robotstxt.org/meta.html
>
>> From: Sebastian Nagel
>>
>> Hi Felix,
>>
>> I've also checked parse-tika, but "robots=noindex" is in the parse metadata
>> there as well, at least for the following test document:
>>
>>   % cat /var/www/html/nutch/noindex.html
>>   <html>
>>   <head>
>>   <title>test</title>
>>   <meta name='robots' content='noindex,nofollow'>
>>   </head>
>>   <body>
>>   test for robots=noindex
>>   </body>
>>   </html>
>>
>> The test page is hosted via Apache httpd on
>> http://localhost/nutch/noindex.html:
>>
>> - using parse-html
>>
>>   % bin/nutch parsechecker \
>>       -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" \
>>       -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
>>       http://localhost/nutch/noindex.html
>>   ...
>>   Parse Metadata: CharEncodingForConversion=windows-1252
>>   OriginalCharEncoding=windows-1252
>>   metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
>>   robots=noindex,nofollow
>>
>> - using parse-tika
>>
>>   % bin/nutch parsechecker \
>>       -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" \
>>       -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
>>       http://localhost/nutch/noindex.html
>>   ...
>>   Parse Metadata: metatag.robots=noindex,nofollow
>>   metatag.robots=noindex,nofollow dc:title=test
>>   Content-Encoding=ISO-8859-1 robots=noindex,nofollow
>>   Content-Type=text/html; charset=ISO-8859-1
>>
>>
>> But I believe you. There can be multiple other reasons. Could you share the
>> HTML or a snippet of it which makes the issue reproducible?
>>
>>
>> Thanks,
>> Sebastian
>>
>> On 5/23/19 10:01 AM, Felix von Zadow wrote:
>>>
>>> Hi Sebastian,
>>>
>>> thank you for trying to reproduce the problem!
>>>
>>>> The parse-metatags plugin only duplicates the "robots" metatags,
>>>> adding it also as "metatag.robots" but keeps the original "robots".
>>>
>>> This got me confused for a minute because you are absolutely right, they're
>>> both there. So I checked again to see how I got to a point where I only had
>>> "metatag.robots" but not "robots", and the problem only seems to occur when
>>> parse-tika is used instead of parse-html. So with this minimal setup
>>>
>>>   protocol-httpclient|parse-(tika|metatags)|index-(metadata)
>>>
>>> the parse metadata only contains "metatag.robots", while with this setup
>>>
>>>   protocol-httpclient|parse-(html|metatags)|index-(metadata)
>>>
>>> the parse metadata contains both "metatag.robots" and "robots".
>>>
>>> Felix
>>>
>>>
>>>> From: Sebastian Nagel
>>>>
>>>> Hi Felix,
>>>>
>>>> I tried to reproduce the problem. The parse-metatags plugin only duplicates
>>>> the "robots" metatags, adding it also as "metatag.robots" but keeps the
>>>> original "robots".
>>>>
>>>> That is the case using the current master:
>>>>
>>>> - with parse-metatags and metatags.names="robots" the ParseData object
>>>>   contains:
>>>>
>>>>   Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>>>>   metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
>>>>   generator=WordPress 3.1
>>>>   robots=noindex,nofollow
>>>>
>>>>   metatag.robots is even added twice, but most importantly "robots" is
>>>>   still present
>>>>
>>>> - without:
>>>>
>>>>   Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>>>>   generator=WordPress 3.1
>>>>   robots=noindex,nofollow
>>>>
>>>>
>>>> Deleting the robots=noindex documents works as expected for both settings.
>>>>
>>>>> Or is it and I should file a report/patch?
>>>>
>>>> Yes, please open an issue to fix this on
>>>> https://issues.apache.org/jira/projects/NUTCH
>>>>
>>>> Could be that there is some additional condition which I didn't hit.
>>>>
>>>> Can you also share the document for which it does not work?
>>>>
>>>>
>>>> Thanks,
>>>> Sebastian
>>>>
>>>>
>>>>
>>>> On 5/13/19 11:34 AM, Felix von Zadow wrote:
>>>>> Hi all!
>>>>>
>>>>> So I was trying to use the option indexer.delete.robots.noindex (exclude
>>>>> page when <meta robots="noindex"> is encountered).
>>>>>
>>>>> However, the page I'm testing with is still being indexed. I have
>>>>> parse-metatags and index-metadata activated and
>>>>> indexer.delete.robots.noindex=true, metatags.names="robots" and
>>>>> index.parse.md="metatag.robots".
>>>>>
>>>>> Looking at IndexerMapReduce.java (#257) [1], the field that is being
>>>>> checked is "robots" and not "metatag.robots". It does work as expected
>>>>> when I change it to "metatag.robots":
>>>>>
>>>>> Before:
>>>>>   Indexing 3/3 documents
>>>>>   Deleting 0 documents
>>>>>   Indexer: number of documents indexed, deleted, or skipped:
>>>>>   Indexer: 3 indexed (add/update)
>>>>>
>>>>> After:
>>>>>   Indexing 2/2 documents
>>>>>   Deleting 0 documents
>>>>>   Indexer: number of documents indexed, deleted, or skipped:
>>>>>   Indexer: 1 deleted (robots=noindex)
>>>>>   Indexer: 2 indexed (add/update)
>>>>>
>>>>>
>>>>> Am I missing something and this is not actually a bug but rather some
>>>>> misconfiguration on my part?
>>>>> Or is it and I should file a report/patch?
>>>>>
>>>>> Thanks!
>>>>> Felix
>>>>>
>>>>>
>>>>> [1] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
>>>>>
>>>>>
>>>
>
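
The capitalization problem discussed in this thread can be illustrated with a small, self-contained Java sketch. It is only a sketch under stated assumptions: it uses a plain java.util.Map in place of Nutch's parse metadata, and the class RobotsMetaSketch and its methods findRobots and isNoIndex are hypothetical names, not part of Nutch and not necessarily how a fix would look. It shows why an exact-case lookup of "robots" misses the "ROBOTS" key that parse-tika emits, and how a case-insensitive key and value comparison would catch it.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not Nutch code): case-insensitive handling of the robots meta tag. */
final class RobotsMetaSketch {

  /** Returns the value of the first key equal to "robots" ignoring case, or null if absent. */
  static String findRobots(Map<String, String> parseMeta) {
    for (Map.Entry<String, String> entry : parseMeta.entrySet()) {
      if ("robots".equalsIgnoreCase(entry.getKey())) {
        return entry.getValue();
      }
    }
    return null;
  }

  /** True if the comma-separated robots value contains "noindex", ignoring case and spaces. */
  static boolean isNoIndex(String robotsValue) {
    if (robotsValue == null) {
      return false;
    }
    for (String directive : robotsValue.split(",")) {
      if ("noindex".equals(directive.trim().toLowerCase(Locale.ROOT))) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Keys and values as reported for parse-tika earlier in the thread.
    Map<String, String> tikaMeta = new LinkedHashMap<>();
    tikaMeta.put("metatag.robots", "NOINDEX,NOFOLLOW");
    tikaMeta.put("ROBOTS", "NOINDEX,NOFOLLOW");

    // Exact-case lookup finds nothing, so the document would be indexed.
    System.out.println("exact lookup: " + tikaMeta.get("robots"));

    // Case-insensitive lookup finds the tag and recognizes the noindex directive.
    System.out.println("noindex detected: " + isNoIndex(findRobots(tikaMeta)));
  }
}
```

Whether a real fix should normalize the key inside parse-tika or make the lookup in the indexer case-insensitive is a design decision for the Jira issue; the sketch only demonstrates the mismatch.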

