As I read it, MZMcBride's code [1] used the pattern

pattern = r'style[ ]*=[ ]*"(.+?)"'

and then normalized the internal spaces.

After substituting "display:none;", I think that translates directly to
the regex search:

insource:/style[ ]*=[ ]*\"display:[ ]*none;[ ]*\"/i

That gives me 487 articles [2].  I would also note that MZMcBride's code
will only report a "display: none" if that is the only declaration in the
style attribute.  If multiple declarations are present, the instance is
counted on a different line that includes all of them.
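
To illustrate the point, here is a minimal sketch.  It is not MZMcBride's
actual code: only STYLE_RE is the pattern quoted above, while normalize(),
SOLE_RE (a Python rendering of my insource: regex), and the sample string
are my own assumptions made up for this example.

```python
import re
from collections import Counter

# The pattern quoted above from xmldumpreader.py.
STYLE_RE = re.compile(r'style[ ]*=[ ]*"(.+?)"')

# Python rendering of the insource: regex (assumed equivalent for this sketch).
SOLE_RE = re.compile(r'style[ ]*=[ ]*"display:[ ]*none;[ ]*"', re.IGNORECASE)


def normalize(value):
    """Hypothetical normalization: lowercase, trim, one space after ':'."""
    decls = []
    for decl in value.lower().split(';'):
        if ':' not in decl:
            continue  # skip empty pieces such as the one after a trailing ';'
        prop, _, val = decl.partition(':')
        decls.append(f"{prop.strip()}: {' '.join(val.split())};")
    return ' '.join(decls)


def count_styles(wikitext):
    """Count each whole style="..." value after normalization."""
    return Counter(normalize(v) for v in STYLE_RE.findall(wikitext))


# Made-up sample text, not taken from any real article.
sample = (
    '<span style="display:none;">a</span>'
    '<span style = "display: none;">b</span>'
    '<td style="text-align: left; display: none;">c</td>'
)

counts = count_styles(sample)
sole_matches = len(SOLE_RE.findall(sample))
print(counts)
print(sole_matches)
```

On this sample, "display: none;" is counted twice and the combined value
"text-align: left; display: none;" lands on its own line with a count of
one.  SOLE_RE also finds two matches, so under these assumptions the regex
search and the per-value count should be measuring the same thing.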

I am happy to agree that searching the XML should be better than the local
search tool, but I still find these numbers hard to reconcile.

-Robert Rohde

[1] https://github.com/mzmcbride/dump-reports/blob/a8dbbcb3/xmldumpreader.py
[2]
https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&search=insource%3A%2Fstyle%5B+%5D*%3D%5B+%5D*%5C%22display%3A%5B+%5D*none%3B%5B+%5D*%5C%22%2Fi&fulltext=Search&ns0=1&profile=advanced



On Tue, Oct 27, 2015 at 4:03 PM, Trey Jones <[email protected]> wrote:

> It appears that the summary may have normalized the formatting of the CSS.
>
> 143095  display: none;
>
>
> Your query[1] assumes a space after "display:" and gives 218 results. Using
> no space[2] gives 2,473 results, but still assumes that no other elements
> occur in the style attribute. A regex query[3] with "display:" + optional
> spaces + "none" gives 4,296 results, or a more reasonable average of 33 per
> result. That query may be overly aggressive and match outside of style
> contexts, but it also matches *list_style = text-align:center;display:none,*
> and *style="font-size: normal; text-align: left; display: none;"* which I
> think is a good thing (definitely in the latter case).
>
> Parsing a dump of enwiki is more accurate than running insource: queries.
>
> —Trey
>
> [1]
>
> https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&search=insource%3A%22style%3D%5C%22display%3A+none%3B%5C%22%22&fulltext=Search&ns0=1&profile=advanced
>
> [2]
>
> https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&search=insource%3A%22style%3D%5C%22display%3Anone%3B%5C%22%22&fulltext=Search&ns0=1&profile=advanced
>
> [3]
>
> https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=insource%3A%2Fdisplay%3A+*none%2F&fulltext=Search
>
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Tue, Oct 27, 2015 at 10:41 AM, Robert Rohde <[email protected]> wrote:
>
> > Okay, I misunderstood those as page counts, which would be way too high.
> > Even if they are explicit usage counts, I am still surprised they are
> > that high.
> >
> > BTW, is it surprising to anyone else that style elements aren't
> > searchable by default?  Searching for "efcfff" [1], gives only a single
> > article result despite "background: #efcfff;" being reported 200k times.
> >
> > We can however search using "insource:efcfff" [2], which reports 5516
> > articles, implying this color is applied _on average_ roughly 39 times
> > per article.
> >
> > "display: none;" would appear even more impressive, with a reported 140k
> > uses in just 218 articles [3] or an average of 656 usages per page
> > containing it.  That doesn't feel very likely to me.  One possibility
> > would be if you mistakenly counted some or all pages outside of the main
> > namespace.  Though only 218 articles use "display: none", there are
> > nearly 31000 other pages that include it [4], which seems like a much
> > more reasonable way to get to 140k total uses.
> >
> > -Robert
> >
> > [1]
> > https://en.wikipedia.org/w/index.php?search=efcfff&title=Special%3ASearch&go=Go
> > [2]
> > https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&search=insource%3Aefcfff&fulltext=Search&ns0=1&profile=advanced
> > [3]
> > https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=advanced&search=insource%3A%22style%3D%5C%22display%3A+none%3B%5C%22%22&fulltext=Search&ns0=1&profile=advanced
> > [4]
> > https://en.wikipedia.org/w/index.php?title=Special:Search&search=insource%3A%22style%3D%5C%22display%3A%20none%3B%5C%22%22&fulltext=Search&profile=all
> >
> >
> > On Tue, Oct 27, 2015 at 2:32 PM, MZMcBride <[email protected]> wrote:
> >
> > > Robert Rohde wrote:
> > > >On Mon, Oct 26, 2015 at 2:13 AM, MZMcBride <[email protected]> wrote:
> > > >>The following are the top ten instances of inline styling from main
> > > >>namespace pages on the English Wikipedia, as of about 2015-10-02:
> > > >>
> > > >>1552197 text-align: center;
> > > >>499756  text-align: left;
> > > >>355952  background: #dfffdf;
> > > >>235222  background: #cfcfff;
> > > >>215038  background: #efcfff;
> > > >>210702  text-align: right;
> > > >>143095  display: none;
> > > >>93646   background: #efefef;
> > > >>86391   font-size: 90%;
> > > >>80420   background: #fff;
> > > >
> > > >I'm not sure what your bug is, but those counts are way too high to be
> > > >accurate reflections of the wikitext in the main namespace on enwiki.
> > >
> > > Err, based on what? :-)
> > >
> > > These numbers are instances of style="[...]", not page counts. Looking
> > > at a specific example from <https://phabricator.wikimedia.org/P2230>:
> > >
> > > 1164   font-family: 'microsoft yi baiti', 'noto sans yi', nsimsun-18030,
> > >        simsun-18030, 'sil yi', code2000;
> > >
> > > These 1,164 inline styling instances all come from a single article:
> > > <https://en.wikipedia.org/w/index.php?oldid=672244691&action=edit>.
> > >
> > > Maybe that's the confusion? I tried to make my descriptions as clear as
> > > possible and I'm not saying a major bug is impossible, of course, but I
> > > don't have any reason so far to doubt the data I collected.
> > >
> > > Another strange case is "background-color: {{/meta/color}};", which had
> > > 16,432 instances. This almost looks like it would try to transclude a
> > > subpage of the article, but due to subpages being disabled in the main
> > > namespace on the English Wikipedia, it's actually transcluding a
> > > template named "/meta/color":
> > > <https://en.wikipedia.org/wiki/Template:/meta/color>.
> > >
> > > I did concurrently look at the approximate number of non-redirect pages
> > > that contain inline styling. My findings were that about 408,777
> > > non-redirect pages contain some kind of inline styling on the English
> > > Wikipedia (cf. <https://phabricator.wikimedia.org/T115228#1752223>).
> > >
> > > MZMcBride
> > >
> > >
> > >
> > > _______________________________________________
> > > Wikitech-l mailing list
> > > [email protected]
> > > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> > >
