AW: AW: Nutch 1.15 not respecting robots=noindex?

Felix von Zadow Thu, 23 May 2019 06:16:41 -0700

Hi Sebastian,

thank you so much for checking again. With your test document I get the same 
result as you. Guess what the difference to my document was... Mine has the 
robots tag capitalized:
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
Apparently someone thought this tag was particularly important. Or they came 
from the 90s where this was common practice [1].


Anyway, this leads to the following results:

parse-html:
bin/nutch parsechecker \
  -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" \
  -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
  http://localhost:8080/noindex_caps.html
Parse Metadata: CharEncodingForConversion=windows-1252 
OriginalCharEncoding=windows-1252 metatag.robots=noindex,nofollow 
robots=noindex,nofollow

parse-tika:
bin/nutch parsechecker \
  -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" \
  -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
  http://localhost:8080/noindex_caps.html
Parse Metadata: metatag.robots=NOINDEX,NOFOLLOW dc:title=test 
Content-Encoding=ISO-8859-1 ROBOTS=NOINDEX,NOFOLLOW Content-Type=text/html; 
charset=ISO-8859-1

Note that parse-tika keeps the capitalization of "ROBOTS" while parse-html does 
not. So my guess is that in [2] parseData.getMeta("robots") becomes null and 
then the document is indexed.
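
To illustrate the suspected mismatch, here is a minimal sketch. It is not the actual Nutch code: the class RobotsMetaLookup and its methods are made up for illustration, and a plain java.util.Map stands in for the parse metadata. It only shows why an exact-match lookup of "robots" misses a key stored as "ROBOTS", and how a case-insensitive lookup would still find it.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, NOT part of Nutch: a plain Map stands in for the
// parse metadata shown in the parsechecker output above.
class RobotsMetaLookup {

    // Returns the value of the first key that equals `name` ignoring case,
    // or null if there is no such key.
    static String getMetaIgnoreCase(Map<String, String> parseMeta, String name) {
        for (Map.Entry<String, String> entry : parseMeta.entrySet()) {
            if (entry.getKey().equalsIgnoreCase(name)) {
                return entry.getValue();
            }
        }
        return null;
    }

    // True if a robots entry (in any capitalization) contains "noindex"
    // (in any capitalization).
    static boolean isNoIndex(Map<String, String> parseMeta) {
        String robots = getMetaIgnoreCase(parseMeta, "robots");
        return robots != null
            && robots.toLowerCase(Locale.ROOT).contains("noindex");
    }

    public static void main(String[] args) {
        // Key and value capitalized, as in the parse-tika output above.
        Map<String, String> tikaMeta = new LinkedHashMap<>();
        tikaMeta.put("ROBOTS", "NOINDEX,NOFOLLOW");

        // The exact-match lookup does not find the upper-case key.
        System.out.println("exact lookup:   " + tikaMeta.get("robots"));
        // The case-insensitive lookup does.
        System.out.println("ignoring case:  " + isNoIndex(tikaMeta));
    }
}
```

Whether the real fix belongs in the indexer or in parse-tika (lowercasing the key the way parse-html apparently does) is a separate question; the sketch only demonstrates the lookup behaviour.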

There are plenty of resources online [3] suggesting a capitalized ROBOTS meta 
tag, and I can't seem to find any that say that it MUST be in lower case. So I 
guess this can still be considered a bug.

Felix


[1] https://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.2
[2] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
[3] http://www.robotstxt.org/meta.html

> From: Sebastian Nagel
> 
> Hi Felix,
> 
> I've also checked parse-tika, but "robots=noindex" is in the parse metadata
> there as well, at least for the following test document:
> 
> % cat /var/www/html/nutch/noindex.html
> <html>
> <head>
> <title>test</title>
> <meta name='robots' content='noindex,nofollow'>
> </head>
> <body>
> test for robots=noindex
> </body>
> </html>
> 
> The test page is hosted via Apache httpd on
> http://localhost/nutch/noindex.html:
> 
> - using parse-html
> 
> % bin/nutch parsechecker \
>   -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" \
>   -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
>   http://localhost/nutch/noindex.html
> ...
> Parse Metadata: CharEncodingForConversion=windows-1252
> OriginalCharEncoding=windows-1252
> metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
> robots=noindex,nofollow
> 
> - using parse-tika
> 
> % bin/nutch parsechecker \
>   -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" \
>   -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" \
>   http://localhost/nutch/noindex.html
> ...
> Parse Metadata: metatag.robots=noindex,nofollow
> metatag.robots=noindex,nofollow dc:title=test
> Content-Encoding=ISO-8859-1 robots=noindex,nofollow
> Content-Type=text/html; charset=ISO-8859-1
> 
> 
> But I believe you. There can be multiple other reasons. Could you share the
> HTML, or a snippet of it, which makes the issue reproducible?
> 
> 
> Thanks,
> Sebastian
> 
> On 5/23/19 10:01 AM, Felix von Zadow wrote:
> >
> > Hi Sebastian,
> >
> > thank you for trying to reproduce the problem!
> >
> >> The parse-metatags plugin only duplicates the "robots" metatags,
> >> adding it also as "metatag.robots" but keep the original "robots".
> >
> > This got me confused for a minute because you are absolutely right, they're
> > both there. So I checked again to see how I got to a point where I only had
> > "metatag.robots" but not "robots", and the problem only seems to occur when
> > parse-tika is used instead of parse-html. So with this minimal setup
> >
> > protocol-httpclient|parse-(tika|metatags)|index-(metadata)
> >
> > the parse metadata only contains "metatag.robots" while with this setup
> >
> > protocol-httpclient|parse-(html|metatags)|index-(metadata)
> >
> > the parse metadata contains both "metatag.robots" and "robots".
> >
> > Felix
> >
> >
> >> From: Sebastian Nagel
> >>
> >> Hi Felix,
> >>
> >> I tried to reproduce the problem. The parse-metatags plugin only duplicates
> >> the "robots" metatags,
> >> adding it also as "metatag.robots" but keep the original "robots".
> >>
> >> That is the case using the current master:
> >>
> >> - with parse-metatags and metatags.names="robots" the ParseData object
> >> contains:
> >>
> >> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> >> metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow
> >> generator=WordPress 3.1
> >> robots=noindex,nofollow
> >>
> >> metatag.robots is even added twice, but most importantly, "robots" is still
> >> present.
> >>
> >> - without:
> >>
> >> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> >> generator=WordPress 3.1
> >> robots=noindex,nofollow
> >>
> >>
> >> Deleting the robots=noindex documents works as expected for both settings.
> >>
> >>> Or is it and I should file a report/patch?
> >>
> >> Yes, please open an issue to fix this on
> >>     https://issues.apache.org/jira/projects/NUTCH
> >>
> >> Could be that there is some additional condition which I didn't hit.
> >>
> >> Can you also share the document for which it does not work?
> >>
> >>
> >> Thanks,
> >> Sebastian
> >>
> >>
> >>
> >> On 5/13/19 11:34 AM, Felix von Zadow wrote:
> >>> Hi all!
> >>>
> >>> So I was trying to use the option indexer.delete.robots.noindex (exclude
> >>> page when <meta name="robots" content="noindex"> is encountered).
> >>>
> >>> However, the page I'm testing with is still being indexed. I have
> >>> parse-metatags and index-metadata activated and
> >>> indexer.delete.robots.noindex=true, metatags.names="robots" and
> >>> index.parse.md="metatag.robots".
> >>>
> >>> Looking at IndexerMapReduce.java (#257) [1], the field that is being
> >>> checked is "robots" and not "metatag.robots". It does work as expected
> >>> when I change it to "metatag.robots":
> >>>
> >>> Before:
> >>> Indexing 3/3 documents
> >>> Deleting 0 documents
> >>> Indexer: number of documents indexed, deleted, or skipped:
> >>> Indexer:      3  indexed (add/update)
> >>>
> >>> After:
> >>> Indexing 2/2 documents
> >>> Deleting 0 documents
> >>> Indexer: number of documents indexed, deleted, or skipped:
> >>> Indexer:      1  deleted (robots=noindex)
> >>> Indexer:      2  indexed (add/update)
> >>>
> >>>
> >>> Am I missing something and this is not actually a bug but rather some
> >>> misconfiguration on my part?
> >>> Or is it and I should file a report/patch?
> >>>
> >>> Thanks!
> >>> Felix
> >>>
> >>>
> >>> [1] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
> >>>
> >>>
> >
