Okay.  Thanks for making the extra effort.

-Robert

On Thu, Oct 29, 2015 at 6:05 AM, MZMcBride <[email protected]> wrote:

> Robert Rohde wrote:
> >Which, after substituting "display:none;" I think translates directly to
> >the regex search:
> >
> >insource:/style[ ]*=[ ]*\"display:[ ]*none;[ ]*\"/i
> >
> >That gives me 487 articles.
>
> Almost, but not quite. You actually want this:
>
> insource:/style[ ]*=[ ]*\"display:[ ]*none;?[ ]*\"/i
>
> With the semicolon being made optional, the search results increase from
> 487 to 2,487 currently on the English Wikipedia. The normalization script
> (<https://phabricator.wikimedia.org/P2229>) made the trailing semicolon
> consistent, in addition to lowercasing and trying to account for strange
> spacing. For whatever reason, "display: none;" is often written without
> the trailing semicolon in main namespace pages on the English Wikipedia.
>
> I was worried that I may have made a major coding mistake, so I re-ran my
> script using this pattern:
>
> pattern = r'style[ ]*=[ ]*"[ ]*display[ ]*:[ ]*none[ ]*;?[ ]*"'
>
> The results are available here: <https://phabricator.wikimedia.org/P2255>.
> Sixteen articles have over 1,000 instances of "display: none;" each! The
> total is 142,176 instances of "display: none;" (normalized) in 2,507 main
> namespace pages on the English Wikipedia, as of about 2015-10-02.
>
> >I am happy to agree that searching the XML should be better than the local
> >search tool, but I still find these numbers hard to reconcile.
>
> After re-reviewing the code and re-running the script to focus on
> "display: none;" specifically, there's strong evidence to suggest that the
> numbers are accurate, if a bit surprising in some cases. :-)
>
> MZMcBride
>
>
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
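For anyone following along, the effect of the optional semicolon can be
sketched in Python with the two patterns quoted above. This is only an
illustration, not the script from P2229 or P2255: the sample strings are
made up, the helper name count_matches is hypothetical, and the
re.IGNORECASE flag is an assumption that mirrors the /i on the insource
search.

```python
import re

# Search pattern with the trailing semicolon required (from the insource query).
# re.IGNORECASE is an assumption that mirrors the /i flag on that query.
STRICT = re.compile(r'style[ ]*=[ ]*"display:[ ]*none;[ ]*"', re.IGNORECASE)

# Script pattern quoted above: semicolon optional, looser spacing allowed.
LENIENT = re.compile(
    r'style[ ]*=[ ]*"[ ]*display[ ]*:[ ]*none[ ]*;?[ ]*"', re.IGNORECASE
)

# Made-up wikitext snippets, for illustration only.
SAMPLES = [
    '<span style="display:none;">a</span>',
    '<span style="display:none">b</span>',
    '<span style="display: none">c</span>',
    '<span style = " display : none ; ">d</span>',
    '<span style="display:block;">e</span>',
]


def count_matches(pattern, texts):
    """Return the total number of non-overlapping matches across all texts."""
    return sum(len(pattern.findall(text)) for text in texts)


strict_hits = count_matches(STRICT, SAMPLES)
lenient_hits = count_matches(LENIENT, SAMPLES)
print(f"strict: {strict_hits}, lenient: {lenient_hits}")
```

On these samples the strict pattern only catches the first snippet, while
the lenient one also catches the variants without a semicolon or with
extra spaces, which is the same kind of gap as between 487 and 2,487.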
