Okay. Thanks for making the extra effort. -Robert
On Thu, Oct 29, 2015 at 6:05 AM, MZMcBride <[email protected]> wrote:

> Robert Rohde wrote:
> >Which, after substituting "display:none;" I think translates directly to
> >the regex search:
> >
> >insource:/style[ ]*=[ ]*\"display:[ ]*none;[ ]*\"/i
> >
> >That gives me 487 articles.
>
> Almost, but not quite. You actually want this:
>
> insource:/style[ ]*=[ ]*\"display:[ ]*none;?[ ]*\"/i
>
> With the semicolon being made optional, the search results increase from
> 487 to 2,487 currently on the English Wikipedia. The normalization script
> (<https://phabricator.wikimedia.org/P2229>) made the trailing semicolon
> consistent, in addition to lowercasing and trying to account for strange
> spacing. For whatever reason, "display: none;" is often written without
> the trailing semicolon in main namespace pages on the English Wikipedia.
>
> I was worried that I may have made a major coding mistake, so I re-ran my
> script using this pattern:
>
> pattern = r'style[ ]*=[ ]*"[ ]*display[ ]*:[ ]*none[ ]*;?[ ]*"'
>
> The results are available here: <https://phabricator.wikimedia.org/P2255>.
> Sixteen articles have over 1,000 instances of "display: none;" each! The
> total is 142,176 instances of "display: none;" (normalized) in 2,507 main
> namespace pages on the English Wikipedia, as of about 2015-10-02.
>
> >I am happy to agree that searching the XML should be better than the local
> >search tool, but I still find these numbers hard to reconcile.
>
> After re-reviewing the code and re-running the script to focus on
> "display: none;" specifically, there's strong evidence to suggest that the
> numbers are accurate, if not a bit surprising in some cases. :-)
>
> MZMcBride
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
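
For anyone who wants to see why the optional semicolon changes the counts so
much, here is a minimal sketch in Python. It is not the script from P2229 or
P2255. The sample strings, the helper names, and the use of re.IGNORECASE
(chosen to mirror the /i flag on the insource: searches) are all assumptions
made for illustration.

```python
import re

# Strict form, as in the first insource: search: the semicolon is required.
# re.IGNORECASE is an assumption that mirrors the /i flag on the search.
STRICT = re.compile(r'style[ ]*=[ ]*"display:[ ]*none;[ ]*"', re.IGNORECASE)

# Looser form, taken from the pattern quoted in the message: the semicolon
# is optional and extra spaces are tolerated around "display" and ":".
LOOSE = re.compile(
    r'style[ ]*=[ ]*"[ ]*display[ ]*:[ ]*none[ ]*;?[ ]*"', re.IGNORECASE
)

# Hypothetical wikitext fragments, not taken from any real article.
SAMPLES = [
    '<span style="display:none;">a</span>',             # semicolon present
    '<span style="display: none">b</span>',             # no semicolon
    '<div STYLE = "display:none" >c</div>',             # odd case and spacing
    '<span style="color:red; display:none;">d</span>',  # other rule first
]


def count_instances(pattern, texts):
    """Return the total number of non-overlapping matches across texts."""
    return sum(len(pattern.findall(text)) for text in texts)


strict_total = count_instances(STRICT, SAMPLES)
loose_total = count_instances(LOOSE, SAMPLES)
print("strict:", strict_total, "loose:", loose_total)
```

Neither pattern matches the last sample, because both expect "display" to be
the first and only declaration inside the style attribute. Counting inline
styles that mix several declarations would need a different pattern.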
