Re: AW: Nutch 1.15 not respecting robots=noindex?

Sebastian Nagel Thu, 23 May 2019 01:28:33 -0700

Hi Felix,

I've also checked parse-tika, but "robots=noindex" is present in the parse
metadata there as well, at least for the following test document:


% cat /var/www/html/nutch/noindex.html
<html>
<head>
<title>test</title>
<meta name='robots' content='noindex,nofollow'>
</head>
<body>
test for robots=noindex
</body>
</html>

The test page is hosted via Apache httpd on http://localhost/nutch/noindex.html:

- using parse-html

% bin/nutch parsechecker \
  -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" \
  -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
  http://localhost/nutch/noindex.html
...
Parse Metadata: CharEncodingForConversion=windows-1252 
OriginalCharEncoding=windows-1252
metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow 
robots=noindex,nofollow

- using parse-tika

% bin/nutch parsechecker \
  -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" \
  -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
  http://localhost/nutch/noindex.html
...
Parse Metadata: metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow 
dc:title=test
Content-Encoding=ISO-8859-1 robots=noindex,nofollow Content-Type=text/html; 
charset=ISO-8859-1
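
If it helps when comparing the two outputs: below is a small Python sketch that
reports whether the plain "robots" key and the "metatag.robots" key both occur
in a pasted "Parse Metadata:" dump. It is only an illustration, assuming the
dump consists of whitespace-separated key=value pairs. The helper name
robots_keys is made up here and is not part of Nutch, and a value containing
spaces would be cut at the first space.

```python
import re

# Assumption: metadata is printed as whitespace-separated key=value pairs.
# The lookbehind requires the key to start at a whitespace boundary, so the
# "robots=" inside "metatag.robots=" is not mistaken for the plain key.
PLAIN_ROBOTS = re.compile(r"(?<!\S)robots=(\S+)")
METATAG_ROBOTS = re.compile(r"(?<!\S)metatag\.robots=(\S+)")


def robots_keys(parse_metadata: str) -> dict:
    """Return the first value found for each robots-related key, or None."""
    plain = PLAIN_ROBOTS.search(parse_metadata)
    prefixed = METATAG_ROBOTS.search(parse_metadata)
    return {
        "robots": plain.group(1) if plain else None,
        "metatag.robots": prefixed.group(1) if prefixed else None,
    }
```

Pasting the output of both parsechecker runs into it should make a missing
plain "robots" key easy to spot.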


But I believe you. There can be multiple other reasons. Could you share the
HTML, or a snippet of it, that makes the issue reproducible?
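
To illustrate why the plain "robots" key matters here: below is a minimal
Python model of the check as described in this thread, where the indexer looks
at the "robots" field of the parse metadata and not at "metatag.robots". It is
a simplified sketch, not the actual Nutch code. The function name
should_delete and the case-insensitive match are assumptions made for the
example.

```python
def should_delete(parse_metadata: dict, delete_robots_noindex: bool = True) -> bool:
    """Model of the noindex check: only the plain "robots" key is consulted.

    delete_robots_noindex stands in for indexer.delete.robots.noindex.
    """
    if not delete_robots_noindex:
        return False
    robots = parse_metadata.get("robots")
    # Assumption: the directive is matched case-insensitively.
    return robots is not None and "noindex" in robots.lower()


# Both keys present (what parsechecker shows for the test page above).
both = {"robots": "noindex,nofollow", "metatag.robots": "noindex,nofollow"}
# Only the prefixed key (what was reported for the failing setup).
prefixed_only = {"metatag.robots": "noindex,nofollow"}

print(should_delete(both))
print(should_delete(prefixed_only))
```

Under this model, a document whose parse metadata carries only
"metatag.robots" would never be deleted, which would match the reported
behaviour if the plain key really is missing for that page.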


Thanks,
Sebastian

On 5/23/19 10:01 AM, Felix von Zadow wrote:
> 
> Hi Sebastian,
> 
> thank you for trying to reproduce the problem!
> 
>> The parse-metatags plugin only duplicates the "robots" metatag,
>> adding it also as "metatag.robots" but keeping the original "robots".
> 
> This got me confused for a minute because you are absolutely right, they're 
> both there. So I checked again to see how I got to a point where I only had 
> "metatag.robots" but not "robots" and the problem only seems to occur when 
> parse-tika is used instead of parse-html. So with this minimal setup
> 
> protocol-httpclient|parse-(tika|metatags)|index-(metadata)
> 
> the parse metadata only contains "metatag.robots" while with this setup
> 
> protocol-httpclient|parse-(html|metatags)|index-(metadata)
> 
> the parse metadata contains both "metatag.robots" and "robots".
> 
> Felix
> 
> 
>> From: Sebastian Nagel
>>
>> Hi Felix,
>>
>> I tried to reproduce the problem. The parse-metatags plugin only duplicates
>> the "robots" metatag, adding it also as "metatag.robots" but keeping the
>> original "robots".
>>
>> That is the case using the current master:
>>
>> - with parse-metatags and metatags.names="robots" the ParseData object
>> contains:
>>
>> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>> metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
>> generator=WordPress 3.1
>> robots=noindex,nofollow
>>
>> metatag.robots is even added twice, but most importantly "robots" is still
>> present
>>
>> - without:
>>
>> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>> generator=WordPress 3.1
>> robots=noindex,nofollow
>>
>>
>> Deleting the robots=noindex documents works as expected for both settings.
>>
>>> Or is it and I should file a report/patch?
>>
>> Yes, please open an issue to fix this on
>>     https://issues.apache.org/jira/projects/NUTCH
>>
>> Could be that there is some additional condition which I didn't hit.
>>
>> Can you also share the document for which it does not work?
>>
>>
>> Thanks,
>> Sebastian
>>
>>
>>
>> On 5/13/19 11:34 AM, Felix von Zadow wrote:
>>> Hi all!
>>>
>>> So I was trying to use the option indexer.delete.robots.noindex (exclude
>>> page when <meta robots="noindex"> is encountered).
>>>
>>> However, the page I'm testing with is still being indexed. I have
>>> parse-metatags and index-metadata activated and
>>> indexer.delete.robots.noindex=true, metatags.names="robots" and
>>> index.parse.md="metatag.robots".
>>>
>>> Looking at IndexerMapReduce.java (#257) [1], the field that is being
>>> checked is "robots" and not "metatag.robots". It does work as expected
>>> when I change it to "metatag.robots":
>>>
>>> Before:
>>> Indexing 3/3 documents
>>> Deleting 0 documents
>>> Indexer: number of documents indexed, deleted, or skipped:
>>> Indexer:      3  indexed (add/update)
>>>
>>> After:
>>> Indexing 2/2 documents
>>> Deleting 0 documents
>>> Indexer: number of documents indexed, deleted, or skipped:
>>> Indexer:      1  deleted (robots=noindex)
>>> Indexer:      2  indexed (add/update)
>>>
>>>
>>> Am I missing something and this is not actually a bug but rather some
>>> misconfiguration on my part?
>>> Or is it and I should file a report/patch?
>>>
>>> Thanks!
>>> Felix
>>>
>>>
>>> [1]
>>> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
>>>
>>>
>
