Re: [Wireshark-dev] utf8 support on http dissectors

Guy Harris Mon, 19 Mar 2018 01:55:42 -0700

(Don't CC individual developers on messages to wireshark-dev; we're all on that 
list, and we shouldn't be singled out, as none of us individually "own" this 
issue.)


On Mar 18, 2018, at 11:28 PM, Roberto Ayuso <[email protected]> wrote:

> I have seen that http dissector only manages content on ASCII, I modified the 
> source for my project changing it with ENC_UTF_8 on http.request_uri and 
> http.data
> 
> Can you consider put it as an option on the tshark command line? I have no 
> enough skills to do by myself.

For request/response fields and headers:

To quote RFC 7230:

        Historically, HTTP has allowed field content with text in the
        ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
        through use of [RFC2047] encoding.  In practice, most HTTP header
        field values use only a subset of the US-ASCII charset [USASCII].
        Newly defined header fields SHOULD limit their field values to
        US-ASCII octets.  A recipient SHOULD treat other octets in field
        content (obs-text) as opaque data.

RFC 2047 is "MIME (Multipurpose Internet Mail Extensions) Part Three: Message 
Header Extensions for Non-ASCII Text", which describes the 
"=?iso-8859-1?q?this=20is=20some=20text?=" mechanism used to encode non-ASCII - 
and not necessarily UTF-8 - text in mail message headers.
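
As a rough illustration of that mechanism, here is a minimal Python sketch that 
decodes RFC 2047 encoded-words using the standard library's 
email.header.decode_header.  The helper name decode_encoded_words is made up 
for this example, and none of this reflects how the Wireshark HTTP dissector 
handles headers.

```python
from email.header import decode_header


def decode_encoded_words(value: str) -> str:
    """Decode any RFC 2047 encoded-words in a header value (hypothetical helper)."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # The charset is named inside the encoded-word itself; chunks
            # without one are treated as ASCII here (an assumption of this sketch).
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # No encoded-words were found, so the value comes back as str.
            parts.append(chunk)
    return "".join(parts)


print(decode_encoded_words("=?iso-8859-1?q?this=20is=20some=20text?="))
# prints "this is some text"
```

Note that the charset travels with each encoded-word, so the decoder needs no 
global "default" encoding for text encoded this way.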

So:

        1) There appear to be "extended ASCII" encodings other than UTF-8 that 
have been used in HTTP requests and replies, so an option of that sort should 
perhaps allow more than just UTF-8 to be specified as the "default" encoding.   
(It would be implemented as a preference for the HTTP dissector, so it would 
allow a setting on the command line such as "-o http.charset=utf-8", but would 
also be settable through the GUI in Wireshark.)

        2) Are there HTTP headers that are not in ASCII and that don't use 
percent-escaping for the non-ASCII characters?

        3) RFC 3986 seems to be at least suggesting that percent-escape 
sequences in URLs represent UTF-8 encodings of characters (rather than, say, 
ISO 8859-n encodings, for some value of n); if that's the case, it would 
probably be appropriate to display the URL exactly as it appears in the 
message, *but* to also provide, as a separate field, the result of unescaping, 
*if* the result is valid UTF-8.
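
The idea in 3) can be sketched in Python as follows.  This is only an 
illustration of "unescape, then keep the result only if it is valid UTF-8"; 
the function name unescaped_uri_if_utf8 is hypothetical, and an actual 
dissector change would be written in C against the Wireshark APIs.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def unescaped_uri_if_utf8(raw_uri: bytes) -> Optional[str]:
    """Return the percent-unescaped URI as text, or None if it is not valid UTF-8."""
    # Turn each %XX sequence back into the single octet it stands for.
    unescaped = unquote_to_bytes(raw_uri)
    try:
        # Strict decoding: any invalid UTF-8 sequence raises an error.
        return unescaped.decode("utf-8")
    except UnicodeDecodeError:
        # The escapes were probably in some other encoding (e.g. ISO 8859-n),
        # so no unescaped field would be offered in that case.
        return None


print(unescaped_uri_if_utf8(b"/caf%C3%A9"))  # prints "/café"
print(unescaped_uri_if_utf8(b"/caf%E9"))     # prints "None"
```

The raw URI would still be displayed exactly as it appeared in the message; 
the unescaped text would only be an additional field, present when the 
function returns something other than None.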

For the body:

There is no such field as "http.data".  Did you mean "http.file_data", or 
something else?

The Content-Type header should, if the body is text, indicate what character 
encoding is used, e.g.

        Content-Type: text/plain;charset=utf-8

To quote RFC 2046:

        4.1.2.  Charset Parameter


           A critical parameter that may be specified in the Content-Type field
           for "text/plain" data is the character set.  This is specified with a
           "charset" parameter, as in:

             Content-type: text/plain; charset=iso-8859-1

           Unlike some other parameter values, the values of the charset
           parameter are NOT case sensitive.  The default character set, which
           must be assumed in the absence of a charset parameter, is US-ASCII.

so if there's no "charset=", the character set must be assumed to be ASCII, not 
UTF-8.
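
A minimal Python sketch of that rule, assuming the Content-Type value has 
already been extracted from the headers: it reads the "charset" parameter 
case-insensitively and falls back to US-ASCII when the parameter is absent.  
The names body_charset and decode_body are hypothetical, and the 
replace-on-error behavior is a choice made for this example only.

```python
from email.message import Message


def body_charset(content_type: str) -> str:
    """Return the lowercased charset parameter, or "us-ascii" if there is none."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value, which matches the RFC 2046
    # statement that charset values are not case sensitive.
    return msg.get_content_charset(failobj="us-ascii")


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a text body using the charset named by Content-Type."""
    # Octets that are invalid in the chosen charset become U+FFFD
    # rather than raising an error (an assumption of this sketch).
    return body.decode(body_charset(content_type), errors="replace")


print(body_charset("text/plain;charset=utf-8"))        # prints "utf-8"
print(body_charset("text/plain; charset=ISO-8859-1"))  # prints "iso-8859-1"
print(body_charset("text/plain"))                      # prints "us-ascii"
```

With no "charset=", UTF-8 octets in the body are therefore not decoded as 
UTF-8; each non-ASCII octet shows up as a replacement character.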
