Robert Rohde wrote:
>Which, after substituting "display:none;" I think translates directly to
>the regex search:
>
>insource:/style[ ]*=[ ]*\"display:[ ]*none;[ ]*\"/i
>
>That gives me 487 articles.

Almost, but not quite. You actually want this:

insource:/style[ ]*=[ ]*\"display:[ ]*none;?[ ]*\"/i

With the semicolon made optional, the number of search results on the
English Wikipedia increases from 487 to 2,487 at present. The normalization script
(<https://phabricator.wikimedia.org/P2229>) made the trailing semicolon
consistent, in addition to lowercasing and trying to account for strange
spacing. For whatever reason, "display: none;" is often written without
the trailing semicolon in main namespace pages on the English Wikipedia.
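
To illustrate the difference the optional semicolon makes, here is a minimal
sketch in Python. The two compiled patterns are approximate translations of
the insource searches above (the search engine's regex syntax is not the same
as Python's re module), and the sample strings are hypothetical, not taken
from real articles.

```python
import re

# Approximate Python translations of the two insource searches.
# Assumption: re.IGNORECASE stands in for the trailing /i flag.
STRICT = re.compile(r'style[ ]*=[ ]*"display:[ ]*none;[ ]*"', re.IGNORECASE)
OPTIONAL = re.compile(r'style[ ]*=[ ]*"display:[ ]*none;?[ ]*"', re.IGNORECASE)

# Hypothetical wikitext snippets for illustration only.
samples = [
    '<span style="display:none;">a</span>',
    '<span style="display:none">b</span>',
    '<span style = "display: none" >c</span>',
    '<span style="display:block;">d</span>',
]

# Keep only the snippets each pattern matches.
strict_hits = [s for s in samples if STRICT.search(s)]
optional_hits = [s for s in samples if OPTIONAL.search(s)]

print(len(strict_hits), "match with a required semicolon")
print(len(optional_hits), "match with an optional semicolon")
```

Only the first snippet satisfies the stricter pattern, while the first three
satisfy the one with ";?".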

I was worried that I may have made a major coding mistake, so I re-ran my
script using this pattern:

pattern = r'style[ ]*=[ ]*"[ ]*display[ ]*:[ ]*none[ ]*;?[ ]*"'

The results are available here: <https://phabricator.wikimedia.org/P2255>.
Sixteen articles have over 1,000 instances of "display: none;" each! The
total is 142,176 instances of "display: none;" (normalized) in 2,507 main
namespace pages on the English Wikipedia, as of about 2015-10-02.
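
For anyone wanting to reproduce this kind of count, the following is a rough
sketch of tallying matches per page with the pattern above. It is not the
actual script (see the Phabricator pastes for that). The count_instances
helper, the sample pages, and the use of re.IGNORECASE in place of the
script's lowercasing step are all assumptions made for illustration.

```python
import re
from collections import Counter

# Pattern quoted from the message. IGNORECASE is an assumption that stands in
# for the lowercasing done during normalization.
pattern = r'style[ ]*=[ ]*"[ ]*display[ ]*:[ ]*none[ ]*;?[ ]*"'
regex = re.compile(pattern, re.IGNORECASE)


def count_instances(pages):
    """Return a Counter of match counts per title, skipping pages with none."""
    counts = Counter()
    for title, text in pages:
        n = len(regex.findall(text))
        if n:
            counts[title] = n
    return counts


# Hypothetical (title, wikitext) pairs; a real run would read page text
# from a database dump instead.
pages = [
    ("Example A", '<td style="display:none">1</td><td style="DISPLAY: none;">2</td>'),
    ("Example B", "no hidden content"),
    ("Example C", '<span style = " display : none ; ">x</span>'),
]

counts = count_instances(pages)
print(sum(counts.values()), "instances in", len(counts), "pages")
for title, n in counts.most_common():
    print(title, n)
```

Sorting with most_common() puts the pages with the most instances first,
which is how outliers with very large counts become easy to spot.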

>I am happy to agree that searching the XML should be better than the local
>search tool, but I still find these numbers hard to reconcile.

After re-reviewing the code and re-running the script to focus on
"display: none;" specifically, there's strong evidence to suggest that the
numbers are accurate, if a bit surprising in some cases. :-)

MZMcBride



_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
