Re: AW: AW: Nutch 1.15 not respecting robots=noindex?

Sebastian Nagel Thu, 23 May 2019 07:03:16 -0700

Hi Felix,

> There are plenty of resources online [3] suggesting a capitalized ROBOTS meta
> tag, and I can't seem to find any that say that it MUST be in lower case.
> So I guess this can still be considered a bug.


Yes, definitely. Please open a Jira issue to fix it.

Thanks,
Sebastian
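
For reference, a minimal sketch of the lookup problem guessed at in the
quoted message below. It uses a plain java.util.Map as a stand-in for the
parse metadata, not Nutch's own classes. The class and method names
(RobotsNoIndexSketch, isNoIndexExactKey, isNoIndexAnyCase) are hypothetical,
and the case-insensitive lookup is only one possible fix, not necessarily
the one that ends up in Nutch:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class RobotsNoIndexSketch {

    // Assumption: mimics a check that looks up the exact key "robots".
    // The value is compared case-insensitively, the key is not.
    public static boolean isNoIndexExactKey(Map<String, String> parseMeta) {
        String robots = parseMeta.get("robots");
        return robots != null
            && robots.toLowerCase(Locale.ROOT).contains("noindex");
    }

    // One possible fix (illustrative only): copy the metadata into a map
    // whose keys compare case-insensitively, then run the same check.
    public static boolean isNoIndexAnyCase(Map<String, String> parseMeta) {
        Map<String, String> anyCase =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        anyCase.putAll(parseMeta);
        return isNoIndexExactKey(anyCase);
    }

    public static void main(String[] args) {
        // Key as lowercased by a parser (the parse-html case in the thread).
        Map<String, String> lowerKey = new HashMap<>();
        lowerKey.put("robots", "noindex,nofollow");

        // Key kept as written in the page (the parse-tika case in the thread).
        Map<String, String> upperKey = new HashMap<>();
        upperKey.put("ROBOTS", "NOINDEX,NOFOLLOW");

        System.out.println(isNoIndexExactKey(lowerKey)); // true
        System.out.println(isNoIndexExactKey(upperKey)); // false: key not found
        System.out.println(isNoIndexAnyCase(upperKey));  // true
    }
}
```

With the exact-key lookup, the capitalized key is simply not found, so the
document would be treated as indexable even though it asks for noindex.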



On 5/23/19 3:16 PM, Felix von Zadow wrote:
> Hi Sebastian,
> 
> thank you so much for checking again. With your test document I get the same 
> result as you. Guess what the difference to my document was... Mine has the 
> robots tag capitalized:
> <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
> Apparently someone thought this tag was particularly important. Or they came 
> from the 90s where this was common practice [1].
> 
> Anyway, this leads to the following results:
> 
> parse-html:
> bin/nutch parsechecker 
> -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" 
> -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" 
> http://localhost:8080/noindex_caps.html
> Parse Metadata: CharEncodingForConversion=windows-1252 
> OriginalCharEncoding=windows-1252 metatag.robots=noindex,nofollow 
> robots=noindex,nofollow
> 
> parse-tika:
> bin/nutch parsechecker 
> -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" 
> -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" 
> http://localhost:8080/noindex_caps.html
> Parse Metadata: metatag.robots=NOINDEX,NOFOLLOW dc:title=test 
> Content-Encoding=ISO-8859-1 ROBOTS=NOINDEX,NOFOLLOW Content-Type=text/html; 
> charset=ISO-8859-1
> 
> Note that parse-tika keeps the capitalization of "ROBOTS" while parse-html 
> does not. So my guess is that in [2] parseData.getMeta("robots") becomes null 
> and then the document is indexed.
> 
> There are plenty of resources online [3] suggesting a capitalized ROBOTS meta 
> tag, and I can't seem to find any that say that it MUST be in lower case. So 
> I guess this can still be considered a bug.
> 
> Felix
> 
> 
> [1] https://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.2
> [2] 
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
> [3] http://www.robotstxt.org/meta.html
> 
>> From: Sebastian Nagel
>>
>> Hi Felix,
>>
>> I've also checked parse-tika, but "robots=noindex" is in the parse
>> metadata there as well, at least for the following test document:
>>
>> % cat /var/www/html/nutch/noindex.html
>> <html>
>> <head>
>> <title>test</title>
>> <meta name='robots' content='noindex,nofollow'>
>> </head>
>> <body>
>> test for robots=noindex
>> </body>
>> </html>
>>
>> The test page is hosted via Apache httpd on
>> http://localhost/nutch/noindex.html:
>>
>> - using parse-html
>>
>> % bin/nutch parsechecker \
>>   -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" \
>>   -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
>>   http://localhost/nutch/noindex.html
>> ...
>> Parse Metadata: CharEncodingForConversion=windows-1252
>> OriginalCharEncoding=windows-1252
>> metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
>> robots=noindex,nofollow
>>
>> - using parse-tika
>>
>> % bin/nutch parsechecker \
>>   -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" \
>>   -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
>>   http://localhost/nutch/noindex.html
>> ...
>> Parse Metadata: metatag.robots=noindex,nofollow
>> metatag.robots=noindex,nofollow dc:title=test
>> Content-Encoding=ISO-8859-1 robots=noindex,nofollow
>> Content-Type=text/html; charset=ISO-8859-1
>>
>>
>> But I believe you. There can be multiple other reasons. Could you share the
>> HTML, or a snippet of it, that makes the issue reproducible?
>>
>>
>> Thanks,
>> Sebastian
>>
>> On 5/23/19 10:01 AM, Felix von Zadow wrote:
>>>
>>> Hi Sebastian,
>>>
>>> thank you for trying to reproduce the problem!
>>>
>>>> The parse-metatags plugin only duplicates the "robots" metatag,
>>>> adding it also as "metatag.robots" but keeping the original "robots".
>>>
>>> This got me confused for a minute because you are absolutely right, they're
>>> both there. So I checked again to see how I got to a point where I only had
>>> "metatag.robots" but not "robots", and the problem only seems to occur when
>>> parse-tika is used instead of parse-html. So with this minimal setup
>>>
>>> protocol-httpclient|parse-(tika|metatags)|index-(metadata)
>>>
>>> the parse metadata only contains "metatag.robots" while with this setup
>>>
>>> protocol-httpclient|parse-(html|metatags)|index-(metadata)
>>>
>>> the parse metadata contains both "metatag.robots" and "robots".
>>>
>>> Felix
>>>
>>>
>>>> From: Sebastian Nagel
>>>>
>>>> Hi Felix,
>>>>
>>>> I tried to reproduce the problem. The parse-metatags plugin only duplicates
>>>> the "robots" metatag, adding it also as "metatag.robots" but keeping the
>>>> original "robots".
>>>>
>>>> That is the case using the current master:
>>>>
>>>> - with parse-metatags and metatags.names="robots" the ParseData object
>>>> contains:
>>>>
>>>> Parse Metadata: CharEncodingForConversion=utf-8
>>>> OriginalCharEncoding=utf-8
>>>> metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
>>>> generator=WordPress 3.1
>>>> robots=noindex,nofollow
>>>>
>>>> metatag.robots is even added twice, but most importantly "robots" is still
>>>> present
>>>>
>>>> - without:
>>>>
>>>> Parse Metadata: CharEncodingForConversion=utf-8
>>>> OriginalCharEncoding=utf-8
>>>> generator=WordPress 3.1
>>>> robots=noindex,nofollow
>>>>
>>>>
>>>> Deleting the robots=noindex documents works as expected for both settings.
>>>>
>>>>> Or is it and I should file a report/patch?
>>>>
>>>> Yes, please open an issue to fix this on
>>>>     https://issues.apache.org/jira/projects/NUTCH
>>>>
>>>> Could be that there is some additional condition which I didn't hit.
>>>>
>>>> Can you also share the document for which it does not work?
>>>>
>>>>
>>>> Thanks,
>>>> Sebastian
>>>>
>>>>
>>>>
>>>> On 5/13/19 11:34 AM, Felix von Zadow wrote:
>>>>> Hi all!
>>>>>
>>>>> So I was trying to use the option indexer.delete.robots.noindex (exclude
>>>>> page when <meta robots="noindex"> is encountered).
>>>>>
>>>>> However, the page I'm testing with is still being indexed. I have
>>>>> parse-metatags and index-metadata activated and
>>>>> indexer.delete.robots.noindex=true, metatags.names="robots" and
>>>>> index.parse.md="metatag.robots".
>>>>>
>>>>> Looking at IndexerMapReduce.java (#257) [1], the field that is being
>>>>> checked is "robots" and not "metatag.robots". It does work as expected
>>>>> when I change it to "metatag.robots":
>>>>>
>>>>> Before:
>>>>> Indexing 3/3 documents
>>>>> Deleting 0 documents
>>>>> Indexer: number of documents indexed, deleted, or skipped:
>>>>> Indexer:      3  indexed (add/update)
>>>>>
>>>>> After:
>>>>> Indexing 2/2 documents
>>>>> Deleting 0 documents
>>>>> Indexer: number of documents indexed, deleted, or skipped:
>>>>> Indexer:      1  deleted (robots=noindex)
>>>>> Indexer:      2  indexed (add/update)
>>>>>
>>>>>
>>>>> Am I missing something and this is not actually a bug but rather some
>>>>> misconfiguration on my part?
>>>>> Or is it and I should file a report/patch?
>>>>>
>>>>> Thanks!
>>>>> Felix
>>>>>
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
>>>>>
>>>>>
>>>
>
