Re: AW: AW: Nutch 1.15 not respecting robots=noindex?

2019-05-23 Thread Sebastian Nagel
Hi Felix,

> There are plenty of resources online [3] suggesting a capitalized ROBOTS meta tag,
> and I can't seem to find any that say that it MUST be in lower case.
> So I guess this can still be considered a bug.

Yes, definitely. Please open a Jira issue to fix it.

Thanks,
Sebastian

On 5

AW: AW: Nutch 1.15 not respecting robots=noindex?

2019-05-23 Thread Felix von Zadow
Hi Sebastian, thank you so much for checking again. With your test document I get the same result as you. Guess what the difference to my document was... Mine has the robots tag capitalized: Apparently someone thought this tag was particularly important. Or they came from the 90s where this wa
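To illustrate the difference Felix describes, here is a minimal Python sketch of a robots meta check that ignores case. It is not Nutch's parser code. The names `RobotsMetaParser`, `robots_directives` and `is_noindex` are hypothetical, and the sample tag in the usage lines is made up, since the exact tag from Felix's document is not shown above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> tags, ignoring case."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us,
        # but attribute values keep their original case.
        if tag != "meta":
            return
        attr_map = {key: (value or "") for key, value in attrs}
        # Compare the name value case-insensitively, so "ROBOTS" matches too.
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of lowercased robots directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def is_noindex(html):
    """True if any robots meta tag in the HTML carries 'noindex'."""
    return "noindex" in robots_directives(html)
```

With this sketch, `is_noindex('<META NAME="ROBOTS" CONTENT="NOINDEX">')` and `is_noindex('<meta name="robots" content="noindex">')` both return `True`. A comparison that only accepts the literal lowercase string `robots` would miss the first form, which is the kind of mismatch this thread is about.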

Re: AW: Nutch 1.15 not respecting robots=noindex?

2019-05-23 Thread Sebastian Nagel
Hi Felix,

I've also checked parse-tika, but "robots=noindex" is in the parse metadata there as well, at least for the following test document:

% cat /var/www/html/nutch/noindex.html
test test for robots=noindex

The test page is hosted via Apache httpd on http://localhost/nutch/noindex.ht

AW: Nutch 1.15 not respecting robots=noindex?

2019-05-23 Thread Felix von Zadow
Hi Sebastian,

thank you for trying to reproduce the problem!

> The parse-metatags plugin only duplicates the "robots" metatags,
> adding it also as "metatag.robots" but keep the original "robots".

This got me confused for a minute because you are absolutely right, they're both there. So I chec