On 2013-09-16 8:01 PM, Gabriel Wicke wrote:
> On 09/16/2013 07:24 PM, Daniel Friesen wrote:
>> On 2013-09-16 7:09 PM, Gabriel Wicke wrote:
>>> Any of the entry points? Any new entry point? Anything we ever want to
>>> put into the root?
>>> We should be able to avoid most conflicts by picking prefixed entry
>>> points. However, as we can't drop the clashing /w/api.php any time soon
>>> I have removed the /wiki/ part from the RFC:
>>>
>>> https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs
>>>
>>> So now only the conversion from
>>>
>>> /w/index.php?title=foo?action=history
>>> to
>>> /foo?action=history
>>>
>>> is under discussion.
>>>
>>> Gabriel
>> Has the practice of disallowing /w/ or /index.php inside robots.txt to
>> force search engines to completely ignore search, edit pages,
>> exponential pagination, etc.. been considered?
> See
> https://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs#Migration
Ok. Though even assuming the non-standard "*" wildcard and "Allow:"
features are supported by all the bots we want to target, I don't like
the idea of blacklisting /wiki/*? this way.
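
For reference, the approach under discussion would be roughly the
following robots.txt (a sketch of the RFC's migration idea, not its
exact wording; it depends on the non-standard wildcard and Allow
extensions):

```
# Sketch only -- exact rules per the RFC's Migration section.
User-agent: *
Allow: /wiki/
Disallow: /wiki/*?
Disallow: /w/
```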

I don't think every url with a query in it is something we want to
blacklist from search engines. Plenty are, but some content is served
with a query and would still be worth indexing.

For example, the non-first pages of long categories and
Special:Allpages' pagination. The latter has robots=noindex (though I
think we may want to reconsider that), but the former is not noindexed;
with the introduction of rel="next", etc... it would be quite
reasonable to index, yet it is currently blacklisted by robots.txt.
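
For context, the rel="next"/rel="prev" hints mentioned are just link
elements in the page head; a sketch with hypothetical paginated
category urls:

```html
<link rel="prev" href="/wiki/Category:Foo?pagefrom=Apple">
<link rel="next" href="/wiki/Category:Foo?pagefrom=Mango">
```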
Additionally, while we normally want to noindex edit pages, that isn't
true of every redlink. Take redlinked category links, for example:
they point to an action=edit&redlink=1 url which, for a search engine,
would redirect back to the pretty url for the category. But robots.txt
masks this link, because the search engine cannot read the
intermediate redirect.
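
Concretely, what a crawler sees under the current robots.txt (a
sketch, with a hypothetical category name):

```
GET /w/index.php?title=Category:Foo&action=edit&redlink=1
  -> blocked by "Disallow: /w/", so the redirect to
     /wiki/Category:Foo is never fetched and the link target
     is lost to the index.
```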

My idea for fixing that cleanly was to make MediaWiki aware of this:
whether through a new routing system or simply through filters for
specific simple queries, have it output /wiki/title?query urls for the
queries we want indexed, and leave the robots-blacklisted stuff under
/w/ (though I also considered a separate short url path like
/w/page/$1 to make internal/robots-blacklisted urls pretty). Adding
Disallow: /wiki/*? to robots.txt, however, would preclude that.
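
As a rough illustration of that filter idea (not MediaWiki code; the
parameter whitelist and function name are hypothetical), the routing
decision could look like:

```python
# Hypothetical sketch: route whitelisted queries to the indexable
# /wiki/ path, everything else to the robots-blacklisted /w/ path.
from urllib.parse import urlencode

# Assumed examples of queries worth indexing; a value set of None
# means any value is acceptable for that parameter.
INDEXABLE_PARAMS = {"action": {"history"}, "pagefrom": None}

def build_url(title, query=None):
    """Return a /wiki/ url when every query parameter is whitelisted,
    otherwise fall back to the /w/index.php form."""
    query = query or {}

    def allowed(key, value):
        if key not in INDEXABLE_PARAMS:
            return False
        values = INDEXABLE_PARAMS[key]
        return values is None or value in values

    if all(allowed(k, v) for k, v in query.items()):
        qs = ("?" + urlencode(query)) if query else ""
        return "/wiki/" + title + qs
    return "/w/index.php?" + urlencode({"title": title, **query})

# Indexable pagination stays under /wiki/:
print(build_url("Category:Foo", {"pagefrom": "Bar"}))
# Edit/redlink queries stay under the blacklisted /w/ path:
print(build_url("Foo", {"action": "edit", "redlink": "1"}))
```

The point of the whitelist is that robots.txt can then keep a blanket
Disallow on /w/ without also hiding query urls we do want crawled.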

~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]


_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l