https://bugzilla.wikimedia.org/show_bug.cgi?id=60184

--- Comment #1 from christ...@quelltextlich.at ---
(In reply to comment #0)
> I'm sat here looking at a 6MB user agent field.

Interesting.
I was under the impression that requests >8K get truncated.
That's obviously wrong then :-)

Where can I find this user agent field?

> I'm not sure how Erik Z reads his files in, but if it's tab-sensitive we're
> potentially looking at a data loss issue with wikistats.

Although files may come with the wrong number of columns, it's actually
only a minor problem. For example, in December 2013 only about 0.0028%
of the rows in the sampled-1000 stream had a wrong column count. In
January 2014 it is 0.0029% so far.
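
A minimal sketch of how such a share could be measured, assuming
tab-separated lines and a known expected column count. The names
wrong_column_share and expected_columns are made up for illustration;
this is not the script that produced the numbers above.

```python
def wrong_column_share(lines, expected_columns):
    """Return (bad_rows, total_rows, percentage of bad rows).

    Hypothetical helper: `lines` is any iterable of tab-separated
    log lines, `expected_columns` is the column count a well-formed
    row should have (an assumption, not taken from the real format).
    """
    total = 0
    bad = 0
    for line in lines:
        total += 1
        # A stray tab inside a field (e.g. the user agent) adds a
        # column; a truncated line may be missing some.
        if len(line.rstrip("\n").split("\t")) != expected_columns:
            bad += 1
    percent = 100.0 * bad / total if total else 0.0
    return bad, total, percent


if __name__ == "__main__":
    sample = ["a\tb\tc\n", "d\te\tf\n", "g\th\tstray\ttab\n"]
    bad, total, percent = wrong_column_share(sample, 3)
    print(f"{bad} of {total} rows ({percent:.4f}%) have a wrong column count")
```

Passing an open file handle instead of a list works the same way, since
the function only iterates over lines.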

Adding escaping to the files would require many changes throughout
all of our infrastructure (e.g.: Wikipedia Zero), which I'd prefer to
avoid.

To put that 0.0029% into perspective: Udp2log dropped 0.4% of the
packets in December. And when comparing with historical values, we see
that this is an exceptionally low packet drop rate:
http://stats.wikimedia.org/wikimedia/squids/SquidDataMonthlyPerSquidSet.htm

> Obviously VK will solve for this once it's dealing with the whole firehose.

VK being varnishkafka?
If so ... Yes, I'd say waiting for Hadoop with the new JSON data
structures would be a good solution :-)

