https://bugzilla.wikimedia.org/show_bug.cgi?id=44236
Web browser: ---
Bug ID: 44236
Summary: Inconsistent field separation makes Squid logs in
Hadoop largely unusable
Product: Analytics
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: critical
Priority: Unprioritized
Component: General/Unknown
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected], [email protected],
[email protected], [email protected],
[email protected]
Classification: Unclassified
Mobile Platform: ---
Created attachment 11664
--> https://bugzilla.wikimedia.org/attachment.cgi?id=11664&action=edit
Screenshot of Beeswax showing parse failure
The field separator issue in the handling of Squid logs needs to be sorted out
first.
To summarize:
1) Kafka byte offset is delimited from hostname by a tab (\t).
2) Other fields are delimited by a space (\u0020).
3) The content-type field contains unescaped spaces.
4) Beeswax only supports splitting on a single character.
As a result:
1) Byte offset is not separable from the hostname
("316554683463cp1043.wikimedia.org").
2) Unescaped spaces in the content type field cause it to span a variable
number of columns.
3) It is impossible to select the user agent field.
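The effect described above can be reproduced with a minimal sketch, assuming a
much-simplified log line. Only the byte offset and hostname come from this
report; the other fields, their order, and the helper names make_line and
split_on_space are hypothetical and used for illustration only.

```python
# Sketch of the parse failure when a line is split on a single character.
# Assumption: the real Squid log has more fields; this layout is invented.

def make_line(content_type, user_agent):
    """Build a simplified line: offset, TAB, then space-delimited fields."""
    fields = [
        "cp1043.wikimedia.org",   # hostname (from the report)
        "GET",                    # hypothetical field
        "http://example.org/",    # hypothetical field
        content_type,             # may contain unescaped spaces
        user_agent,               # hypothetical single-token user agent
    ]
    return "316554683463" + "\t" + " ".join(fields)


def split_on_space(line):
    """Mimic a tool that can only split on one character (a space)."""
    return line.split(" ")


cols_a = split_on_space(make_line("image/png", "UA-1"))
cols_b = split_on_space(make_line("text/html; charset=UTF-8", "UA-2"))

# The tab is not a separator here, so offset and hostname share column 0.
print(repr(cols_a[0]))

# The content type spans one column in the first line and two in the second.
print(len(cols_a), len(cols_b))

# The same column index no longer points at the user agent in both lines.
print(cols_a[4], cols_b[4])
```

In this sketch, column 4 holds the user agent for the first line but a fragment
of the content type for the second, which is why no fixed column index can
select the user agent.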
I'd like a solution that does not require me to provide a jar file for
custom string processing.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l