https://bugzilla.wikimedia.org/show_bug.cgi?id=58758
Matthew Flaschen <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
           See Also|                            |https://bugzilla.wikimedia.org/show_bug.cgi?id=58805
         Resolution|---                         |WONTFIX

--- Comment #8 from Matthew Flaschen <[email protected]> ---
(In reply to comment #0)
> This may be actually impossible, but I'm filing a bug to discuss strategies
> for preventing mirrors of Wikipedia from including pages we NOINDEX. A good
> example of this is user pages or user talk pages, and the new Draft namespace
> on English Wikipedia.

Preventing them is a WONTFIX. For reference, the user namespace is not
NOINDEX by default on English Wikipedia, though __NOINDEX__ works.

> Technically speaking these pages are free content just like anything else on
> Wikipedia (with the exception of fair use images, etc.).

Yes, this (along with the Right to Fork) is why we must not do this. If we
exclude the pages from the dumps, it will make the freedom of the content
much less meaningful. It would also encourage people to mirror by crawling
the HTML (or even worse, mirroring it live), which is a poor practice and
loses a lot of information from the dumps.

> Numerous times, I've had Wikipedians bring up the valid point that mirrors
> erode our ability to control search indexing, because they mirror content we
> NOINDEX, but do not replicate the contents of our robots.txt.

Free content means giving up some control over what people do with it. The
edit screen used to say, "If you do not want your writing to be edited
mercilessly and redistributed at will, do not submit it." It no longer says
that, but it's just as true under our current licenses.

Wikipedia has a high overall search engine ranking, and sites simply
mirroring drafts (which by definition are generally not ready for the
primetime) probably won't rank that high.
But I accept this could change, does not apply to many other sites, and that
there are probably exceptions even on Wikipedia.

People have to comply with our license (attribution, stating license, etc.),
but they are allowed to distribute everything with or without marking it
NOINDEX.

It is reasonable to encourage mirrors to preserve the robot policies on
their own HTML output, though. Since a3aac44 in 2010 (pages last saved
before then don't seem to have it, judging by a check of the akwiki dump),
__NOINDEX__ and __INDEX__ have been stored in the page_props table (along
with all other __DOUBLEUNDERSCORE__ magic words). This is dumped, so it is
relatively easy to check this on a per-page basis.

I don't think the namespace robot policies are currently anywhere in the
dump. I've filed this as bug 58805.
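
As a rough illustration of the per-page check described above, here is a minimal Python sketch of how a mirror might carry the robot policy over to its own HTML. It assumes the page_props rows have already been parsed out of the dump into (pp_page, pp_propname) pairs, and that the two magic words appear under the property names "noindex" and "index"; both points are assumptions to verify against an actual dump. The function names and the namespace_default parameter are hypothetical; the default has to come from the mirror operator, since the namespace policies do not appear to be in the dump.

```python
from typing import Dict, Iterable, Tuple

# Assumed pp_propname values for __NOINDEX__ and __INDEX__; check a real dump.
NOINDEX_PROP = "noindex"
INDEX_PROP = "index"


def robot_overrides(rows: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Map page ID -> "noindex" or "index" for pages with an explicit magic word.

    `rows` are (pp_page, pp_propname) pairs already parsed from page_props.
    Unrelated properties are ignored. If a page has both properties, the
    later row wins in this sketch.
    """
    overrides: Dict[int, str] = {}
    for page_id, propname in rows:
        if propname in (NOINDEX_PROP, INDEX_PROP):
            overrides[page_id] = propname
    return overrides


def robots_meta(page_id: int, overrides: Dict[int, str],
                namespace_default: str = "index") -> str:
    """Build the robots meta tag a mirror could emit for one page.

    `namespace_default` is a hypothetical, operator-supplied fallback used
    when the page has no explicit __INDEX__ / __NOINDEX__.
    """
    policy = overrides.get(page_id, namespace_default)
    return f'<meta name="robots" content="{policy},follow">'


# Made-up sample rows, for illustration only.
sample_rows = [(10, "noindex"), (10, "hiddencat"), (11, "index"), (12, "notoc")]
sample_overrides = robot_overrides(sample_rows)
print(robots_meta(10, sample_overrides))             # explicit __NOINDEX__
print(robots_meta(12, sample_overrides, "noindex"))  # falls back to the default
```

Keeping the namespace default as a separate argument keeps the explicit per-page setting (which is in the dump) apart from the per-namespace policy (which the mirror has to obtain some other way).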
