https://bugzilla.wikimedia.org/show_bug.cgi?id=44236

       Web browser: ---
            Bug ID: 44236
           Summary: Inconsistent field separation makes Squid logs in
                    Hadoop largely unusable
           Product: Analytics
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: critical
          Priority: Unprioritized
         Component: General/Unknown
          Assignee: [email protected]
          Reporter: [email protected]
                CC: [email protected], [email protected],
                    [email protected], [email protected],
                    [email protected]
    Classification: Unclassified
   Mobile Platform: ---

Created attachment 11664
  --> https://bugzilla.wikimedia.org/attachment.cgi?id=11664&action=edit
Screenshot of Beeswax showing parse failure

Please sort out the field separator issue in the handling of Squid logs first.

To summarize:

1) Kafka byte offset is delimited from hostname by a tab (\t).
2) Other fields are delimited by a space (U+0020).
3) The content-type field contains unescaped spaces.
4) Beeswax only supports splitting on a single character.

As a result:

1) The byte offset is not separable from the hostname
("316554683463cp1043.wikimedia.org").
2) Unescaped spaces in the content type field cause it to span a variable
number of columns.
3) It is impossible to select the user agent field reliably, because its
column position shifts with the content type.
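
The effect can be reproduced outside Beeswax with a small Python sketch. The
two log lines below are hypothetical and heavily simplified (made-up field
layout and values, not the real Squid log format); only the separator
behaviour described above is modelled, and split_single_char is just an
illustrative helper name.

```python
# Hypothetical, simplified log lines. Layout assumed for illustration only:
# offset <TAB> hostname <SP> status <SP> content-type <SP> user-agent
LINE_A = "316554683463\tcp1043.wikimedia.org 200 text/html; charset=UTF-8 Mozilla/5.0"
LINE_B = "316554683464\tcp1043.wikimedia.org 200 image/png Mozilla/5.0"


def split_single_char(line, sep=" "):
    """Split on exactly one separator character, as a single-char splitter would."""
    return line.split(sep)


cols_a = split_single_char(LINE_A)
cols_b = split_single_char(LINE_B)

# The tab is not a separator here, so offset and hostname share column 0.
print(repr(cols_a[0]))

# The content type with an unescaped space occupies two columns in LINE_A
# but only one in LINE_B, so the total column count differs per line.
print(len(cols_a), len(cols_b))

# The user agent therefore has no fixed column index.
print(cols_a.index("Mozilla/5.0"), cols_b.index("Mozilla/5.0"))
```

With these sample lines, the same query column would point at the user agent
in one row and at part of the content type in another, which is the behaviour
seen in the attached screenshot's parse failure.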

I'd like a solution to this that does not require that I provide a jar file for
customized string processing.

