[Wikidata-bugs] [Maniphest] [Commented On] T144308: [Task] Disallow Special:GoToLinkedPage in wikidata.org/robots.txt

2016-11-10 Thread thiemowmde
thiemowmde added a comment.
We changed our robots.txt two and a half weeks ago. Re-visiting possibly millions of URLs in such a short time is something neither we nor Google want. At the moment there are 28,000 left, it seems.

The links we want to exclude are tools and never meant to be indexed. On the one hand, Google can't know this. On the other hand, I wonder why an existing canonical tag is basically ignored and Google acts like it found Wikipedia articles on Wikidata.

Let's check again in another two weeks.

TASK DETAIL
https://phabricator.wikimedia.org/T144308

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Mbch331, thiemowmde
Cc: Sjoerddebruin, TheDJ, Mbch331, Jonas, hoo, aude, Lydia_Pintscher, Tobi_WMDE_SW, Aklapper, thiemowmde, TerraCodes, D3r1ck01, MuhammadShuaib, Izno, Wikidata-bugs

Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


2016-11-09 Thread Sjoerddebruin
Sjoerddebruin added a comment.
https://www.google.com/search?q=Ethiopian+wolf+site%3Awikidata.org was visited yesterday and is still indexed by Google.


2016-10-24 Thread Mbch331
Mbch331 added a comment.
Request on https://www.wikidata.org/wiki/MediaWiki_talk:Robots.txt#Also_exclude_URLs_with_question_marks is done.


2016-10-12 Thread thiemowmde
thiemowmde added a comment.
As you said, Special:GoToLinkedPage redirects and does not output HTML (except for the form, which is a single page, and the reason why Disallow: /wiki/Special:GoToLinkedPage should not be used). The target pages of these redirects are Wikipedia articles. They should be indexed, and they already have canonical tags.

I'm not sure what an X-Robots-Tag will do when used with a redirect.


I believe there is no point in crawling Special:GoToLinkedPage URLs, because they are guaranteed to do nothing but redirect to Wikipedia articles. We know each redirect represents a sitelink, and sitelinks are already accessible on the ordinary item page. We know all this. Google does not.
Similar for Special:ItemByTitle, which redirects to a Wikidata item. The exact same links already exist in the sidebars of the connected Wikipedia articles.


@Mbch331, please add the following lines to https://www.wikidata.org/wiki/MediaWiki:Robots.txt:

Disallow: /wiki/Special:GoToLinkedPage?
Disallow: /wiki/Special:ItemByTitle?
Disallow: /wiki/Special:SetSiteLink?

Do not simply remove the trailing slashes, because that would also exclude the special page forms themselves. We want these to appear in a Google search.
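The effect of the three rule variants discussed in this thread (ending in `?`, ending in `/`, or with no suffix) can be sketched with simple prefix matching. This is only an approximation of how crawlers evaluate robots.txt: it ignores wildcards and user-agent groups, `is_blocked` is a hypothetical helper, and the query parameter names in the example URL are illustrative rather than taken from the thread.

```python
# Simplified robots.txt matching: a Disallow rule blocks any URL whose
# path-plus-query starts with the rule text. Wildcards (* and $) and
# user-agent groups are deliberately ignored in this sketch.

def is_blocked(path_and_query, rules):
    """Return True if any Disallow rule is a prefix of the given path."""
    return any(path_and_query.startswith(rule) for rule in rules)

# The special page form itself (should stay crawlable).
FORM = "/wiki/Special:GoToLinkedPage"
# A redirecting URL; the parameter names are assumptions for illustration.
REDIRECT = "/wiki/Special:GoToLinkedPage?site=enwiki&itemid=Q173651"

with_question_mark = ["/wiki/Special:GoToLinkedPage?"]
with_trailing_slash = ["/wiki/Special:GoToLinkedPage/"]
bare = ["/wiki/Special:GoToLinkedPage"]

for name, rules in [
    ("?", with_question_mark),
    ("/", with_trailing_slash),
    ("bare", bare),
]:
    print(name, "form:", is_blocked(FORM, rules), "redirect:", is_blocked(REDIRECT, rules))
```

Under this model, the rule ending in `?` blocks only the redirecting URL, the rule ending in `/` blocks neither, and the bare rule blocks both the form and the redirect.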


2016-10-12 Thread TheDJ
TheDJ added a comment.
FYI: you disallowed crawling; for modern search engines that does not mean you disallowed indexing. If another indexed page links to the URL, Google will still index it.

To quote

When you block URLs from being indexed in Google via robots.txt, they may still show those pages as URL only listings in their search results. A better solution for completely blocking the index of a particular page is to use a robots noindex meta tag on a per page bases. You can tell them to not index a page, or to not index a page and to not follow outbound links by inserting either of the following code bits in the HTML head of your document that you do not want indexed.

This is why we have NOINDEX and setRobotPolicy on OutputPage etc. It's just that these are redirects and/or not necessarily HTML. That's why I pointed at X-Robots-Tag.
Another thing to pay attention to is indicating canonical URLs whenever possible.
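For illustration, this is roughly what a redirect response carrying an `X-Robots-Tag` header could look like, written as a minimal WSGI app. It is a sketch under assumptions, not how the Wikibase special pages are implemented: the app name, the 302 status, and the target URL are hypothetical. Whether search engines honor the header on a redirect is the open question raised in this thread, and a crawler can only see the header if robots.txt still allows it to fetch the URL.

```python
def redirect_with_noindex(environ, start_response):
    """Hypothetical WSGI app: redirect and ask crawlers not to index this URL."""
    # Placeholder target; a real special page would resolve a sitelink here.
    target = "https://en.wikipedia.org/wiki/Example"
    headers = [
        ("Location", target),
        ("X-Robots-Tag", "noindex"),  # header-level counterpart of the meta tag
        ("Content-Length", "0"),
    ]
    start_response("302 Found", headers)
    return [b""]


def call(app):
    """Invoke a WSGI app once and capture its status, headers, and body."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app({}, start_response))
    return captured["status"], captured["headers"], body


status, headers, body = call(redirect_with_noindex)
print(status, headers["Location"], headers["X-Robots-Tag"])
```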


2016-09-23 Thread Mbch331
Mbch331 added a comment.

In T144308#2662636, @Sjoerddebruin wrote:
Google still indexes these pages: https://www.google.nl/search?client=safari=en=African+wild+dog+site:wikidata.org=UTF-8=UTF-8_rd=cr=G1HlV9TLHemRwAKeurXICg (notice that the first result has a cached version from yesterday)


Probably due to the trailing slashes in the request posted in T144308#2597079. The URL for the cached entry is http://www.wikidata.org/wiki/Special:GoToLinkedPage?site=enwiki=Q173651. (Which has no slash after GoToLinkedPage)

@thiemowmde: Maybe we should remove the trailing slashes in the robots.txt?


2016-09-23 Thread Sjoerddebruin
Sjoerddebruin added a comment.
Google still indexes these pages: https://www.google.nl/search?client=safari=en=African+wild+dog+site:wikidata.org=UTF-8=UTF-8_rd=cr=G1HlV9TLHemRwAKeurXICg (notice that the first result has a cached version from yesterday)


2016-09-01 Thread TheDJ
TheDJ added a comment.
You'll probably want to consider adapting the extension to enforce this in a better way for all users.

Perhaps the X-Robots-Tag HTTP header can be used to remove indexing of the redirect... Not sure, redirects can be a bit problematic in that way.


2016-08-30 Thread hoo
hoo added a comment.
What about URLs like title=Special:GoToLinkedPage=dewiki=Q123456?