TheDJ added a comment.
FYI: disallowing crawling does not mean you have disallowed indexing for modern search engines. If another indexed page links to the URL, Google will still index it.
When you block URLs from being indexed in Google via robots.txt, they may still appear as URL-only listings in the search results. A better solution for completely keeping a particular page out of the index is a robots noindex meta tag, applied on a per-page basis. You can tell crawlers not to index a page, or not to index it and also not to follow its outbound links, by inserting the appropriate robots meta tag in the HTML head of the document you do not want indexed.
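As a sketch, the two standard robots meta tags look like this (these are the well-known directives from the robots meta tag convention, not code taken from this thread):

```html
<head>
  <!-- Keep this page out of the index, but still follow its links -->
  <meta name="robots" content="noindex">

  <!-- Keep this page out of the index AND do not follow outbound links -->
  <meta name="robots" content="noindex, nofollow">
</head>
```

Only one of the two tags would be used per page, depending on whether outbound links should still be followed.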
This is why we have NOINDEX and setRobotPolicy on OutputPage etc. It's just that these responses are redirects and/or not necessarily HTML; that's why I pointed at X-Robots-Tag.
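For responses where a meta tag cannot be embedded (redirects, non-HTML content), the same directives can be delivered as an HTTP response header. A minimal sketch, with a hypothetical redirect target URL for illustration:

```
HTTP/1.1 301 Moved Permanently
Location: https://example.org/target-page
X-Robots-Tag: noindex
```

The X-Robots-Tag header accepts the same values as the robots meta tag (e.g. `noindex, nofollow`), so it covers the redirect and non-HTML cases the meta tag cannot.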
Another thing to pay attention to is indicating canonical URLs whenever possible.
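A canonical URL can be declared either in the HTML head or, for non-HTML responses, via a Link header. A brief sketch, using a placeholder URL rather than any page from this task:

```html
<!-- In the HTML head: -->
<link rel="canonical" href="https://example.org/preferred-url">

<!-- Or as an HTTP response header for non-HTML content:
Link: <https://example.org/preferred-url>; rel="canonical"
-->
```

This tells search engines which of several duplicate or redirecting URLs should be treated as the one to index.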
Cc: Sjoerddebruin, TheDJ, Mbch331, Jonas, hoo, aude, Lydia_Pintscher, Tobi_WMDE_SW, Aklapper, thiemowmde, TerraCodes, D3r1ck01, MuhammadShuaib, Izno, Wikidata-bugs
_______________________________________________ Wikidata-bugs mailing list Wikidatafirstname.lastname@example.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs