Matthew Flaschen <mflasc...@wikimedia.org> changed:
What |Removed |Added
See Also| |https://bugzilla.wikimedia.
--- Comment #8 from Matthew Flaschen <mflasc...@wikimedia.org> ---
(In reply to comment #0)
> This may be actually impossible, but I'm filing a bug to discuss strategies
> for preventing mirrors of Wikipedia from including pages we NOINDEX. A good
> example of this is user pages or user talk pages, and the new Draft namespace
> on English Wikipedia.
Preventing mirrors from including these pages is a WONTFIX.
For reference, the user namespace is not NOINDEX by default on English
Wikipedia, though __NOINDEX__ works.
> Technically speaking these pages are free content just like anything else on
> Wikipedia (with the exception of fair use images, etc.).
Yes, this (along with the Right to Fork) is why we must not do this. If we
exclude the pages from the dumps, it will make the freedom of the content much
less meaningful. It would also encourage people to mirror by crawling the HTML
(or even worse, mirroring it live), which is a poor practice and loses a lot of
information from the dumps.
> Numerous times, I've had Wikipedians bring up the valid point that mirrors
> erode our ability to control search indexing, because they mirror content we
> NOINDEX, but do not replicate the contents of our robots.txt.
Free content means giving up some control over what people do with it. The
edit screen used to say, "If you do not want your writing to be edited
mercilessly and redistributed at will, do not submit it." It no longer says
that, but it's just as true under our current licenses.
Wikipedia has a high overall search engine ranking, and sites simply mirroring
drafts (which by definition are generally not ready for primetime) probably
won't rank that high. But I accept that this could change, that it does not
apply to many other sites, and that there are probably exceptions even on
Wikipedia.
People have to comply with our license (attribution, stating license, etc.),
but they are allowed to distribute everything with or without marking it
NOINDEX. It is reasonable to encourage mirrors to preserve the robot policies
on their own HTML output, though.
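As a rough illustration of what preserving the robot policy on a mirror's own HTML could look like, here is a minimal Python sketch. The function names (robots_meta, render_mirror_page) and the page layout are hypothetical and not part of MediaWiki or any existing mirror software; the only real-world element is the standard robots meta tag.

```python
from html import escape


def robots_meta(noindex: bool) -> str:
    """Return a robots <meta> tag for a non-indexed page, or '' otherwise."""
    return '<meta name="robots" content="noindex">' if noindex else ""


def render_mirror_page(title: str, body_html: str, noindex: bool) -> str:
    """Build a bare-bones mirrored page that carries over the robot policy.

    Hypothetical helper: a real mirror would also add the attribution and
    license notice that the content license requires.
    """
    head = f"<title>{escape(title)}</title>{robots_meta(noindex)}"
    return (
        "<!DOCTYPE html><html><head>" + head + "</head>"
        "<body>" + body_html + "</body></html>"
    )
```

The noindex flag is deliberately an input here: how a mirror learns that a page is NOINDEX is a separate question from how it marks its output.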
Since a3aac44 in 2010 (pages last saved before then don't seem to have it
judging by a check of the akwiki dump), __NOINDEX__ and __INDEX__ have been
stored in the page_props table (along with all other __DOUBLEUNDERSCORE__ magic
words). That table is dumped, so it is relatively easy to check this on a
per-page basis.
I don't think the namespace robot policies are currently anywhere in the dump.
I've filed this as bug 58805.
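For the per-page check mentioned above, a mirror could scan the page_props SQL dump for the relevant rows. The following is only a sketch under several assumptions: that the property names are stored as 'noindex' and 'index', that each row begins with the page ID, property name, and value in that order, and that a simple regular expression is good enough in place of a real SQL parser. The function name robot_props is made up for illustration. It does not cover namespace-level robot policies, which as noted are not in the dump.

```python
import re

# Matches the first three fields of each row tuple:
#   (page_id,'propname','value'...
# Any further columns in the tuple are ignored. Assumed layout, not a spec.
ROW_RE = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)','((?:[^'\\]|\\.)*)'")


def robot_props(sql_lines):
    """Map page ID -> 'noindex' or 'index' from page_props INSERT lines.

    If a page somehow has both properties, the last one seen wins; that is
    a simplification, not necessarily how MediaWiki resolves it.
    """
    props = {}
    for line in sql_lines:
        if not line.startswith("INSERT INTO"):
            continue  # skip schema definitions and comments
        for match in ROW_RE.finditer(line):
            page_id, name = int(match.group(1)), match.group(2)
            if name in ("noindex", "index"):
                props[page_id] = name
    return props


# Hand-written sample line in the assumed dump format.
sample = [
    r"INSERT INTO `page_props` VALUES "
    r"(10,'noindex',''),(11,'defaultsort','Foo\'s bar'),(12,'index','');",
]
print(robot_props(sample))
```

On a real dump, the same function could be fed the lines of the gzipped SQL file opened in text mode with gzip.open; the resulting page IDs would then be joined against the page table to get titles.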
Wikibugs-l mailing list