[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

Toby Negrin tneg...@wikimedia.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|Unprioritized               |Normal

--- Comment #17 from Toby Negrin tneg...@wikimedia.org ---
Need collaboration with Platform to work on this further.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #16 from christ...@quelltextlich.at ---
Probably not relevant as the CSS should be interpreted as UTF-8
... but since I've been burnt by UTF-8 support on Windows a few times,
I checked the CSS of some prominent Wikipedias [1], and it seems that,
of them, only

  eswiki [2]
  ptwiki [3]
  plwiki [4]

had CSS classes using characters beyond 7-bit ASCII.

However, while eswiki and ptwiki are the affected ones, plwiki does not
seem to be affected.
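
A check like the one described above could be scripted roughly as
follows. This is only a sketch: the function name and the regular
expression are assumptions made for illustration, not the tooling that
was actually used, and the pattern is a loose approximation of CSS
class selectors rather than a real CSS parser.

```python
import re

# Loose approximation of a CSS class selector: a dot followed by an
# identifier that does not start with a digit. \w is Unicode-aware in
# Python 3, so accented letters are matched as part of the name.
CLASS_SELECTOR = re.compile(r"\.(-?[^\W\d][\w-]*)")


def non_ascii_classes(css_text):
    """Return the sorted class names that contain non-ASCII characters."""
    names = set(CLASS_SELECTOR.findall(css_text))
    return sorted(n for n in names if any(ord(c) > 127 for c in n))


if __name__ == "__main__":
    sample = ".béisbol { color: red } .foo { width: 1.5em }"
    print(non_ascii_classes(sample))
```

Pure-ASCII matches (including false positives such as file extensions
inside url(...)) are filtered out by the non-ASCII test.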



[1] arwiki cswiki dawiki dewiki elwiki enwiki eswiki fawiki fiwiki
frwiki hewiki idwiki itwiki jawiki kowiki nlwiki nowiki plwiki ptwiki
ruwiki svwiki trwiki ukwiki zhwiki

[2] eswiki:
  .arquería
  .astronomía
  .béisbol
  .canadá
  .cómics
  .comunicación
  [...]

[3] ptwiki:
  .page-Wikipédia_Esplanada_geral
  .page-Wikipédia_Esplanada_propostas

[4] plwiki:
  .page-Wikipedia_Strona_główna



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #7 from Oliver Keyes oke...@wikimedia.org ---
Which is fairly common. Even IE has started deliberately making ambiguous user
agents because the devs have realised that people write special rules around IE
UAs.

Is there anything interesting in the x_analytics field? I recall a problem with
a similar range of browsers from Zero - attempts to DDoS the ISP-level packet
inspection in Bangladesh.



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #8 from christ...@quelltextlich.at ---
(In reply to Matthew Flaschen from comment #6)
> (In reply to christian from comment #5)
> > It seems to be a Windows with (Firefox or Chrome) issue.
>
> Or a bot spoofing their user-agent to pretend to be such.

I checked that. And while of course, we cannot rule it out, it's
not too plausible to me.

The number of requests follows a strong weekly pattern.

For each day, the client IPs fall into between 200 and 500 different /24 IP
groups.
(Basically all matching the country for the relevant wikis. So Brazil IPs
fetching ptwiki, Venezuelan IPs fetching eswiki.)

Sure. A /smart/ botnet still could implement a weekly pattern and grab many
relevant different IPs that are correctly geolocated.
But then ... a smart botnet would not misinterpret data uris. And even if they
did by accident, such a smart botnet would notice and fix it.

So I'd rule bots out.



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #9 from christ...@quelltextlich.at ---
(In reply to Oliver Keyes from comment #7)
> Is there anything interesting in the x_analytics field?

No. X-Analytics is empty for all those requests.



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #10 from christ...@quelltextlich.at ---
For those who want to take a look themselves, there are prefiltered (from
sampled-1000 stream) tsvs for May and June 2014 in

  /home/qchris/data-uris

on stat1002 (the date in the file name corresponds to the date in the file name
of the sampled-1000 tsv files).



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

Bartosz Dziewoński matma@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matma@gmail.com

--- Comment #11 from Bartosz Dziewoński matma@gmail.com ---
(In reply to christian from comment #1)
> The images in the data uri scheme decode to images from VectorBeta like
>   VectorBeta/resources/typography/images/search-fade.png
>   VectorBeta/resources/typography/images/tab-break.png
>   VectorBeta/resources/typography/images/tab-current-fade.png
>   VectorBeta/resources/typography/images/portal-break.png

These images are also part of the core Vector skin, where
they sit at [mediawiki/core]/skins/vector/images.



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #12 from Oliver Keyes oke...@wikimedia.org ---
Hmm. Worth CCing the typography peeps and seeing if there's something weird in
the implementation?



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #13 from Bartosz Dziewoński matma@gmail.com ---
The images listed also do not have SVG versions, so I wouldn't blame the
SVG-PNG fallback mechanism.



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #14 from Bartosz Dziewoński matma@gmail.com ---
We were missing test cases that would prove that CSSMin is not borking data:
URIs generated by LESS mixins like .background-image(), so I added some in
https://gerrit.wikimedia.org/r/#/c/137698/ just in case.



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-05 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #15 from christ...@quelltextlich.at ---
(In reply to Bartosz Dziewoński from comment #11)
> These images are also part of the core Vector skin, [...]

*Facepalm*
I had core at an old commit :-(

Yup ... they can come from core as well :-) Thanks.



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #1 from christ...@quelltextlich.at ---
Looking through the log files, we indeed see requests for [1]

  http://es.wikipedia.org/wiki/Data:image/png;base64,iVBORw0K[...]

so webstatscollector is doing the right thing :-/

Currently, this traffic amounts to ~500K requests per day.

We see such requests back until the first sampled log files we still
have. (But they were fewer in numbers back then)

Requested URLs are mostly to eswiki (~58%), and ptwiki (~38%).

Referrers are either empty (~97%) or coming mostly from ptwiki (to a
lesser extent eswiki, enwiki).

User Agents match '^Mozilla/5\.0 (Windows NT [56]\.' for 98% of requests.

Unwrapping the inline data from the URLs, and looking at them it seems
they are just images for UI chrome.

The images in the data uri scheme decode to images from VectorBeta like
  VectorBeta/resources/typography/images/search-fade.png
  VectorBeta/resources/typography/images/tab-break.png
  VectorBeta/resources/typography/images/tab-current-fade.png
  VectorBeta/resources/typography/images/portal-break.png
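
The unwrapping step can be reproduced along these lines. This is a
sketch under assumptions: the function name is made up for
illustration, it only handles base64 data URIs, and it treats
everything after the first "data:" in the request path as the URI. It
is not the script that was used for the analysis above.

```python
import base64
import re
from urllib.parse import unquote, urlsplit

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_data_uri_request(url):
    """Return (mime_type, raw_bytes) for a base64 data URI embedded in the
    path of a request URL, or None if the path does not contain one."""
    # Some clients percent-encode '+' and '=' (%2B, %3D), so undo that first.
    path = unquote(urlsplit(url).path)
    # The logs show both "data:" and "Data:", so match case-insensitively.
    match = re.search(r"data:", path, re.IGNORECASE)
    if match is None:
        return None
    header, comma, payload = path[match.end():].partition(",")
    if not comma or not header.endswith(";base64"):
        return None
    mime_type = header[: -len(";base64")]
    # Restore any '=' padding that got lost along the way.
    payload += "=" * (-len(payload) % 4)
    return mime_type, base64.b64decode(payload)


if __name__ == "__main__":
    url = "http://es.wikipedia.org/wiki/Data:image/png;base64,iVBORw0KGgo%3D"
    mime_type, raw = decode_data_uri_request(url)
    print(mime_type, raw.startswith(PNG_SIGNATURE))
```

The demo payload is just the eight-byte PNG signature, which is enough
to confirm that a decoded blob is a PNG before comparing it against
the skin images.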





[1] Since they are just UI images, here are some concrete examples:

http://es.wikipedia.org/wiki/data:image/png;base64,iVBORw0KGgoNSUhEUgEuCAIAAABmjeQ9RElEQVR42mVO2wrAUAhy/f8fz+niVMTYQ3hLKkgGgN/IPvgIhUYYV/qogdP75J01V+JwrKZr/5YPcnzN3e6t7l+2K+EFX91B1daOi7sASUVORK5CYII=

http://pt.wikipedia.org/wiki/Data:image/png;base64,iVBORw0KGgoNSUhEUgEuCAIAAABmjeQ9RElEQVR42mVO2wrAUAhy/f8fz%2BniVMTYQ3hLKkgGgN/IPvgIhUYYV/qogdP75J01V%2BJwrKZr/5YPcnzN3e6t7l%2B2K%2BEFX91B1daOi7sASUVORK5CYII%3D

http://es.wikipedia.org/wiki/data:image/png;base64,iVBORw0KGgoNSUhEUgEQCAIAAABY/YLgJUlEQVQIHQXBsQEAAAjDoND/73UWdnerhmHVsDQZJrNWVg3Dqge6bgMe6bejNABJRU5ErkJggg==

http://es.wikipedia.org/wiki/Data:image/png;base64,iVBORw0KGgoNSUhEUgEQCAIAAABY/YLgJUlEQVQIHQXBsQEAAAjDoND/73UWdnerhmHVsDQZJrNWVg3Dqge6bgMe6bejNABJRU5ErkJggg%3D%3D



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

Matthew Flaschen mflasc...@wikimedia.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mflasc...@wikimedia.org

--- Comment #2 from Matthew Flaschen mflasc...@wikimedia.org ---
The bug looks like a browser/crawler bug where it's interpreting data URIs as
relative URLs due to not understanding the protocol (and having a weird default
for unknown protocols)
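
To illustrate the hypothesis: a client that resolves references
correctly leaves a data: URI untouched, whereas one that ignores the
scheme and treats the reference as a relative path ends up with
exactly the kind of URL seen in the logs. In the sketch below,
naive_resolve is a hypothetical stand-in for such a broken client and
the page URL is only an example; nothing here is taken from an actual
browser or crawler.

```python
from urllib.parse import urljoin


def naive_resolve(page_url, reference):
    """Hypothetical broken resolver: ignores the reference's scheme and
    appends it to the directory of the current page."""
    base_dir = page_url.rsplit("/", 1)[0]
    return base_dir + "/" + reference


if __name__ == "__main__":
    page = "http://es.wikipedia.org/wiki/Portada"
    ref = "data:image/png;base64,iVBORw0K"
    # Correct resolution: the reference has its own scheme, so it is
    # returned unchanged and no HTTP request is made for it.
    print(urljoin(page, ref))
    # Broken resolution: yields a bogus /wiki/data:... request URL.
    print(naive_resolve(page, ref))
```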

(In reply to christian from comment #1)
> User Agents match '^Mozilla/5\.0 (Windows NT [56]\.' for 98% of requests.

Do you know which browsers these actually are?  Does it have the MSIE or
Trident token?

It is a known issue that IE <= 7 (http://caniuse.com/#feat=datauri) does not
support data URIs.  However, my understanding is that it's supposed to just
drop it; I've never heard it would send a bogus request (I could be wrong,
though).



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #3 from Matthew Flaschen mflasc...@wikimedia.org ---
(In reply to Matthew Flaschen from comment #2)
> Do you know which browsers these actually are?  Does it have the MSIE or
> Trident token?

If you could share the full user agent, either publicly or privately, that
might be helpful.



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #4 from Matthew Flaschen mflasc...@wikimedia.org ---
(In reply to Matthew Flaschen from comment #2)
> It is a known issue that IE <= 7 (http://caniuse.com/#feat=datauri) does not
> support data URIs.  However, my understanding is that it's supposed to just
> drop it; I've never heard it would send a bogus request (I could be wrong,
> though).

This (old IE support) is also why we have a PNG fallback, which it's supposed
to use.



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #5 from christ...@quelltextlich.at ---
Sadly enough, no IE<=7 issue. That was the first impression yesterday as well
:-(

(In reply to Matthew Flaschen from comment #2)
> Do you know which browsers these actually are?

Yes. User Agents are for example (figured they are generic enough to post):

  Mozilla/5.0 (Windows NT 6.1; rv:29.0) Gecko/20100101 Firefox/29.0
  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36

> Does it have the MSIE or
> Trident token?

Nope.
Affected browsers are mostly Firefox (~65%) and Chrome (~33%).
In old versions and (as exhibited above) also new versions.

It seems to be a Windows with (Firefox or Chrome) issue.
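
A rough way to reproduce this kind of browser breakdown from the user
agent strings is sketched below. The bucket names and the function are
assumptions made for illustration. The pattern is the one from
comment #1 with the parenthesis escaped, since Python's re module
would otherwise treat it as the start of a group.

```python
import re

# Same prefix as in comment #1, with the literal parenthesis escaped.
WINDOWS_NT_5_OR_6 = re.compile(r"^Mozilla/5\.0 \(Windows NT [56]\.")


def classify_user_agent(user_agent):
    """Put a user agent string into a coarse, illustrative bucket."""
    if not WINDOWS_NT_5_OR_6.search(user_agent):
        return "other"
    # Internet Explorer identifies itself with an MSIE or Trident token.
    if "MSIE" in user_agent or "Trident" in user_agent:
        return "ie"
    if "Firefox/" in user_agent:
        return "firefox"
    if "Chrome/" in user_agent:
        return "chrome"
    return "other-windows"


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (Windows NT 6.1; rv:29.0) Gecko/20100101 Firefox/29.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
    ]
    for ua in samples:
        print(classify_user_agent(ua))
```

User agents can be spoofed, so this only describes what the clients
claim to be.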



[Bug 66112] data: URLs accounting for 6 of the top 10 most viewed articles reported by stats.grok.se

2014-06-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=66112

--- Comment #6 from Matthew Flaschen mflasc...@wikimedia.org ---
(In reply to christian from comment #5)
> It seems to be a Windows with (Firefox or Chrome) issue.

Or a bot spoofing their user-agent to pretend to be such.
