https://bugzilla.wikimedia.org/show_bug.cgi?id=25718

--- Comment #13 from Daniel Friesen <mediawiki-b...@nadir-seen-fire.com> 
2010-10-31 14:38:31 UTC ---
I just took an actual look at the patch and a closer look at the surrounding
code, and I see two things staring at me...

Firstly, something about the $variant handling feels wrong to me. I'm not quite
up to speed on the variant code, so it's one of three things: it's a MW bug
where variant handling that should be in the second block isn't there; or
variant handling is only supposed to be done for /short/Urls and the patch adds
a new place where /short/Urls show up but is missing the variant handling code;
or it's only meant to happen when $query is empty and I'm just too far behind.

Though that's probably less of an issue compared to this.

Looking at line 821 I see what seems to be part of a feature that allows you to
pass "-" to getLocalURL as a way of telling it "I don't have any query, but I
want you to give me a long index.php style url, because I have a case where
getting short urls would cause chaos, so NO short urls" -- although it does
look like a tweak would be nice so there is no trailing &.
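
In other words (hypothetical page name, and the exact prefixes depend on
$wgArticlePath/$wgScript, so treat this as a sketch of the intent rather than
the literal output):

  $title = Title::newFromText( 'Foo' );
  echo $title->getLocalURL();        // short form, e.g. /wiki/Foo
  echo $title->getLocalURL( '-' );   // forced long form, e.g.
                                     // /w/index.php?title=Foo&
                                     // (note the trailing & mentioned above)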

The way the patch is written seems to break that: if this patch is applied, it
looks like $query = "-" will start outputting short urls even though
getLocalURL was explicitly asked to output long urls, and additionally this
will be done without the variant handling that would otherwise have been done.

Also, now that I notice the comment, the &title= showing up in a short url does
not look like a proper implementation of the feature; even without the other
issues I'd reject the patch until that was fixed.


Re-reading the comments again, it looks to me like you have a lingering
misunderstanding of $wgArticlePath, $wgActionPaths, and what those blocks of
code actually do.

Firstly, of course, the note about how robots.txt can't be used in the default
config is moot: the robots.txt stuff is an advanced configuration for an
advanced, robots-optimized shorturl style setup, and for the reasons you
describe it requires short urls to function. Just because a "default"
configuration doesn't support it doesn't mean it should be broken universally
(this patch) when it is only supported in an advanced configuration. I should
also point out that if MediaWiki detects that the server is able to support it,
MediaWiki actually sets up $wgArticlePath to work in a /index.php/Article form,
so robots.txt IS actually usable by default on the right server by disallowing
"/index.php?".

Now for the actual misunderstanding. You commented: "As far as I can tell, there
is no use case within MediaWiki's default
configuration that touches line 824 of Title.php -- there is never a URL that
includes a query but not an action parameter (apart from the default 'view'
action, which also bypasses line 824)."
As I understand it from this -- even taking your later comment on history into
account -- you believe that calls to getLocalURL containing a query with an
action parameter are always caught by the chunk of code between lines 807-819.
That is incorrect. The block of code there is ONLY ever applied if the
non-default $wgActionPaths (NOT $wgArticlePath, but $wgActionPaths) is
configured, which in fact very, very few MediaWiki installations actually do.
So in actuality the code in lines 810-817, which does things with getLocalURL
calls that have an action= in the query, is almost NEVER run, at least not
unless the wiki is one of the rare few that have specially configured
$wgActionPaths. The line 824 you believe is bypassed by default is in fact
NEVER EVER bypassed by default. With the default config, line 824 handles every
single call to getLocalURL which contains anything at all in $query.
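
Stripped down to a sketch, the branch structure looks roughly like this
(paraphrased from memory, NOT a verbatim quote of Title.php; $dbkey,
$wgArticlePath, $wgActionPaths and $wgScript come from the surrounding method
and globals):

  if ( $query == '' ) {
      // No query at all: build a short $wgArticlePath url
      // (this branch is also where the $variant handling lives).
      $url = str_replace( '$1', $dbkey, $wgArticlePath );
  } else {
      $url = false;
      if ( !empty( $wgActionPaths )
          && preg_match( '/^(.*&|)action=([^&]*)(&(.*)|)$/', $query, $matches )
          && isset( $wgActionPaths[urldecode( $matches[2] )] )
      ) {
          // ~lines 807-819: ONLY reached when $wgActionPaths is configured,
          // which almost no installation does.
          $url = str_replace( '$1', $dbkey,
              $wgActionPaths[urldecode( $matches[2] )] );
          $rest = $matches[1] . ( isset( $matches[4] ) ? $matches[4] : '' );
          if ( $rest != '' ) {
              $url .= '?' . $rest;
          }
      }
      if ( $url === false ) {
          // ~line 824: with a default config this handles EVERY call that
          // has anything in $query (action=edit, action=raw, oldid=..., etc.)
          if ( $query == '-' ) {
              $query = '';
          }
          $url = "{$wgScript}?title={$dbkey}&{$query}";
      }
  }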

And to re-iterate, the expected behavior for getLocalURL and the like is that
urls with no query, and not marked with the faux "-" query that disallows
shorturls, will have a shorturl returned if possible, while any url with a
query is ALWAYS returned in long index.php form. Changing things so that urls
which currently come out long start coming out short will break things. I
forgot about it earlier, but there is another reason why that /w/ robots.txt
trick is used. The nature of most pages that use queries is that they are
dynamic. In other words they likely contain links which could cause search
engines to endlessly spider dynamically generated content that the wiki does
not want them to spider (ie: search queries). So changing these to short urls
may cause dynamic things to suddenly start being spidered by search engines
when they were never intended to be. Because of this, I believe that ANY
feature we add that allows getLocalURL to return short urls with queries
instead of long urls MUST be an explicit opt-in, where getLocalURL is passed an
extra optional argument telling it that shorturls with queries are ok, and any
code changed to use it must be changed by someone who understands the potential
ramifications.
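
To make that concrete, something along these lines is what I have in mind (the
extra parameter and its name are entirely hypothetical, and I'm assuming the
current signature is roughly getLocalURL( $query, $variant ); this is only
meant to illustrate the opt-in idea):

  // Hypothetical extra argument, defaulting to today's behaviour: callers
  // only get short urls with queries if they explicitly ask for them.
  public function getLocalURL( $query = '', $variant = false,
      $allowShortWithQuery = false
  ) {
      // ... existing logic ...
      // Only when $allowShortWithQuery is true would we build a
      // $wgArticlePath url and append '?' . $query to it.
  }

  // A caller that has actually thought about the robots.txt / spidering
  // implications would then opt in explicitly:
  $url = $title->getLocalURL( 'year=2010&month=11', false, true );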

To be frank, I actually think that the original use case for such a feature
shouldn't even be using it. The original request is to do with calendar pages,
where /wiki/Calendar is the normal calendar page,
/index.php?title=Calendar&year=####&month=## is the calendar page for ANY other
month in ANY year, and both of these pages contain forward/back links pointing
to other months. In other words, this looks like a prime example of exactly
what we do NOT want search engines spidering. If we were to drop
/index.php?title=Calendar&year=####&month=## to /wiki/Calendar?year=####&month=##
on a wiki using a properly set up robots.txt, the search engine would begin to
spider starting with /wiki/Calendar. Currently, because the long url format is
used, the search engine's spider is not allowed to continue spidering other
months. But if those are dropped to short urls, the search engine WILL continue
to spider. After spidering /wiki/Calendar it may find a next-month link to
/wiki/Calendar?year=2010&month=11, where it will find a link to
/wiki/Calendar?year=2010&month=12, where it will find a link to
/wiki/Calendar?year=2011&month=1, and so on... In other words, because of the
dynamic nature of the calendar, if we allow it to link to other short urls with
year/month queries, search engines may dive into an endless game of trying to
index years and months that will never have any actual content on them in the
wiki (and pound the wiki with requests they shouldn't be making in doing so).


So, to finish up my tl;dr post: the patch DOES introduce bugs. The SRF issue is
not a problem that already exists and that the patch merely leaves unchanged;
the problem is that the patch breaks existing "advanced" robots.txt
configurations which expect all dynamic urls with queries to use the full
script url rather than short urls, by making setups that use an explicit
$wgArticlePath, or the automatic /index.php/$1 form that a default MediaWiki
installation configures on some servers, emit short urls for dynamic pages
instead of the long urls they are supposed to use. And additionally, because
line 842 IS handled by every wiki that has no $wgActionPaths["raw"] configured
(ie: essentially every default wiki), the patch DOES break MediaWiki, causing
->getLocalURL('action=raw'); to output /wiki/$1?action=raw style urls instead
of the required long index.php form (ie: if you apply this patch, you'll
probably break Wikipedia).
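
Concretely, on a stock /wiki/$1 + /w/index.php layout I'd expect something like
this (hypothetical page name; the exact prefixes depend on $wgScriptPath):

  $title = Title::newFromText( 'Main_Page' );
  echo $title->getLocalURL( 'action=raw' );
  // today:           /w/index.php?title=Main_Page&action=raw
  // with the patch:  /wiki/Main_Page?action=raw  (not the form that raw page
  //                  handling and the robots.txt setups above expect)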

^_^ And finally... I'm typing this out while semi-half-asleep, so *disclaimer,
disclaimer, disclaimer*...

...so I think we should mark this WONTFIX... or is it INVALID?
