As I read it, MZMcBride's code [1] used the pattern

    pattern = r'style[ ]*=[ ]*"(.+?)"'

and then normalized the internal spaces. After substituting
"display:none;", I think that translates directly to the regex search:

    insource:/style[ ]*=[ ]*\"display:[ ]*none;[ ]*\"/i

That gives me 487 articles [2].

I would also note that MZMcBride's code will only report a
"display: none" if that is the only declaration in the style attribute.
If multiple declarations are present, the instance is counted on a
different line of the report that includes all of those declarations.

I am happy to agree that searching the XML should be better than the
local search tool, but I still find these numbers hard to reconcile.

-Robert Rohde

[1] https://github.com/mzmcbride/dump-reports/blob/a8dbbcb3/xmldumpreader.py
[2] https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&search=insource%3A%2Fstyle%5B+%5D*%3D%5B+%5D*%5C%22display%3A%5B+%5D*none%3B%5B+%5D*%5C%22%2Fi&fulltext=Search&ns0=1&profile=advanced

On Tue, Oct 27, 2015 at 4:03 PM, Trey Jones <[email protected]> wrote:

> It appears that the summary may have normalized the formatting of the CSS.
>
> 143095 display: none;
>
> Your query[1] assumes a space after "display:" and gives 218 results. Using
> no space[2] gives 2,473 results, but still assumes that no other elements
> occur in the style attribute. A regex query[3] with "display:" + optional
> spaces + "none" gives 4,296 results, or a more reasonable average of 33 per
> result. That query may be overly aggressive and match outside of style
> contexts, but it also matches
> *list_style = text-align:center;display:none,*
> and *style="font-size: normal; text-align: left; display: none;"* which I
> think is a good thing (definitely in the latter case).
>
> Parsing a dump of enwiki is more accurate than running insource: queries.
>
> —Trey
>
> [1] https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&search=insource%3A%22style%3D%5C%22display%3A+none%3B%5C%22%22&fulltext=Search&ns0=1&profile=advanced
> [2] https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&search=insource%3A%22style%3D%5C%22display%3Anone%3B%5C%22%22&fulltext=Search&ns0=1&profile=advanced
> [3] https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=insource%3A%2Fdisplay%3A+*none%2F&fulltext=Search
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Tue, Oct 27, 2015 at 10:41 AM, Robert Rohde <[email protected]> wrote:
>
> > Okay, I misunderstood those as page counts, which would be way too high.
> > Even if they are explicit usage counts, I am still surprised they are
> > that high.
> >
> > BTW, is it surprising to anyone else that style elements aren't
> > searchable by default? Searching for "efcfff" [1], gives only a single
> > article result despite "background: #efcfff;" being reported 200k times.
> >
> > We can however search using "insource:efcfff" [2], which reports 5516
> > articles, implying this color is applied _on average_ roughly 39 times
> > per article.
> >
> > "display: none;" would appear even more impressive, with a reported 140k
> > uses in just 218 articles [3] or an average of 656 usages per page
> > containing it. That doesn't feel very likely to me. One possibility
> > would be if you mistakenly counted some or all pages outside of the main
> > namespace. Though only 218 articles use "display: none", there are
> > nearly 31000 other pages that include it [4], which seems like a much
> > more reasonable way to get to 140k total uses.
> >
> > -Robert
> >
> > [1] https://en.wikipedia.org/w/index.php?search=efcfff&title=Special%3ASearch&go=Go
> > [2] https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&search=insource%3Aefcfff&fulltext=Search&ns0=1&profile=advanced
> > [3] https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&search=insource%3A%22style%3D%5C%22display%3A+none%3B%5C%22%22&fulltext=Search&ns0=1&profile=advanced
> > [4] https://en.wikipedia.org/w/index.php?title=Special:Search&search=insource%3A%22style%3D%5C%22display%3A%20none%3B%5C%22%22&fulltext=Search&profile=all
> >
> > On Tue, Oct 27, 2015 at 2:32 PM, MZMcBride <[email protected]> wrote:
> >
> > > Robert Rohde wrote:
> > > >On Mon, Oct 26, 2015 at 2:13 AM, MZMcBride <[email protected]> wrote:
> > > >>The following are the top ten instances of inline styling from main
> > > >>namespace pages on the English Wikipedia, as of about 2015-10-02:
> > > >>
> > > >>1552197 text-align: center;
> > > >>499756 text-align: left;
> > > >>355952 background: #dfffdf;
> > > >>235222 background: #cfcfff;
> > > >>215038 background: #efcfff;
> > > >>210702 text-align: right;
> > > >>143095 display: none;
> > > >>93646 background: #efefef;
> > > >>86391 font-size: 90%;
> > > >>80420 background: #fff;
> > > >
> > > >I'm not sure what your bug is, but those counts are way too high to be
> > > >accurate reflections of the wikitext in the main namespace on enwiki.
> > >
> > > Err, based on what? :-)
> > >
> > > These numbers are instances of style="[...]", not page counts. Looking
> > > at a specific example from <https://phabricator.wikimedia.org/P2230>:
> > >
> > > 1164 font-family: 'microsoft yi baiti', 'noto sans yi', nsimsun-18030,
> > > simsun-18030, 'sil yi', code2000;
> > >
> > > These 1,164 inline styling instances all come from a single article:
> > > <https://en.wikipedia.org/w/index.php?oldid=672244691&action=edit>.
> > >
> > > Maybe that's the confusion? I tried to make my descriptions as clear as
> > > possible and I'm not saying a major bug is impossible, of course, but I
> > > don't have any reason so far to doubt the data I collected.
> > >
> > > Another strange case is "background-color: {{/meta/color}};", which had
> > > 16,432 instances. This almost looks like it would try to transclude a
> > > subpage of the article, but due to subpages being disabled in the main
> > > namespace on the English Wikipedia, it's actually transcluding a
> > > template named "/meta/color":
> > > <https://en.wikipedia.org/wiki/Template:/meta/color>.
> > >
> > > I did concurrently look at the approximate number of non-redirect pages
> > > that contain inline styling. My findings were that about 408,777
> > > non-redirect pages contain some kind of inline styling on the English
> > > Wikipedia (cf. <https://phabricator.wikimedia.org/T115228#1752223>).
> > >
> > > MZMcBride
> > >
> > > _______________________________________________
> > > Wikitech-l mailing list
> > > [email protected]
> > > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
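
The difference between counting whole style attributes and counting
individual declarations, raised at the top of this thread, can be sketched
in Python. This is only a minimal illustration and not the dump-reports
script itself: the regex is the one quoted from the thread, but the sample
wikitext, the helper names, and the exact normalization (lowercasing,
single spaces after ":" and ";") are assumptions made for the example.

```python
import re
from collections import Counter

# Pattern quoted in the thread for matching inline style attributes.
STYLE_RE = re.compile(r'style[ ]*=[ ]*"(.+?)"')


def declarations(value):
    """Split a style value into normalized 'prop: value;' strings.

    The normalization here is an assumption, not the original script's.
    """
    decls = []
    for part in value.split(";"):
        prop, sep, val = part.partition(":")
        if not sep:
            continue  # skip empty or malformed fragments
        decls.append(f"{prop.strip().lower()}: {' '.join(val.split()).lower()};")
    return decls


def count_whole_attributes(wikitext):
    """Count each complete style="..." value as a single key."""
    return Counter(
        " ".join(declarations(v)) for v in STYLE_RE.findall(wikitext)
    )


def count_declarations(wikitext):
    """Count every individual declaration, whatever else accompanies it."""
    counts = Counter()
    for v in STYLE_RE.findall(wikitext):
        counts.update(declarations(v))
    return counts


# Hypothetical sample wikitext, not taken from any real article.
SAMPLE = (
    '<span style="display:none;">a</span>\n'
    '<span style = "display: none;">b</span>\n'
    '<td style="text-align: left; display: none;">c</td>\n'
)

whole = count_whole_attributes(SAMPLE)
single = count_declarations(SAMPLE)
print(whole["display: none;"])   # 2: only attributes that are exactly this
print(single["display: none;"])  # 3: also the one combined with text-align
```

Under these assumptions, a report keyed on whole attribute values
undercounts "display: none;" relative to a per-declaration count, while a
search for the exact string style="display: none;" misses the spacing
variants that normalization folds together.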
