Re: [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8

Evan Huus Wed, 29 Jan 2014 13:27:35 -0800

On Mon, Jan 27, 2014 at 7:17 PM, Evan Huus <[email protected]> wrote:
> Thank you Guy, this is excellent. It's more or less what I was looking
> for in [1] back in October (which, coincidentally, links to the bug I
> referred to earlier in this thread).
>
> All of these conversations should be condensed (probably using your
> email as a base) and put into a wiki page for future reference. I will
> try and do this when I find the time, if nobody else beats me to it.


First draft wiki page is available:
http://wiki.wireshark.org/Development/StringHandling

> Further comments and suggestions inline.
>
> [1] https://www.wireshark.org/lists/wireshark-dev/201310/msg00260.html
>
> On Sun, Jan 26, 2014 at 6:13 PM, Guy Harris <[email protected]> wrote:
>>
>> On Jan 26, 2014, at 1:27 PM, Jakub Zawadzki <[email protected]> 
>> wrote:
>>
>>> On Tue, Jan 21, 2014 at 08:01:15AM -0500, Evan Huus wrote:
>>>> On Tue, Jan 21, 2014 at 2:40 AM, Guy Harris <[email protected]> wrote:
>>>>>
>>>>> On Jan 20, 2014, at 5:59 PM, Evan Huus <[email protected]> wrote:
>>>>>
>>>>>> In which case is dumb search-and-replace of tvb_get_string with
>>>>>> tvb_get_string_enc and ENC_ASCII an easy way to make (part of) the API
>>>>>> transition?
>>>>>
>>>>> Did somebody say that had been done or suggest that it be done?
>>>>
>>>> I thought it was kind of implied when you wrote "We should also
>>>> probably audit all calls to tvb_get_string() and tvb_get_stringz() in
>>>> dissectors and change them to tvb_get_string_enc() with the
>>>> appropriate encoding."
>>>>
>>>> If tvb_get_string() behaves identically to tvb_get_string_enc(...
>>>> ENC_ASCII) then there doesn't seem much point in having both.
>>>
>>> We can also think about dropping STR_ASCII, STR_UNICODE stuff.
>>>
>>> To be honest I'm not happy about it, I'd rather still display
>>> non-printable ascii in escaped-hex-form.
>>
>> OK, first-principles time:
>>
>> A character string is a sequence of code points from a character set.  It's 
>> represented as a sequence of octets using a particular encoding for that 
>> character set, wherein each character is represented as a 1-or-more-octet 
>> subsequence in that sequence.
>>
>> In many of those encodings, not all subsequences of octets correspond to 
>> code points in the character set.  For example:
>>
>>         the 8-bit encoding of ASCII encodes each code point as an octet, and 
>> octets with the uppermost bit set don't correspond to ASCII code points;
>>
>>         the 8-bit encodings of "8-bit" character sets encode each code point 
>> as an octet and, in some of those character sets, there are fewer than 256 
>> code points, and some octet values don't correspond to code points in the 
>> character set;
>>
>>         UTF-8 encodes each Unicode code point as 1 or more octets, and:
>>
>>                 an octet sequence that begins with an octet with the 
>> uppermost bit set and the bit below it clear is invalid and doesn't 
>> correspond to a code point in Unicode;
>>
>>                 an octet sequence that begins with an octet with the 
>> uppermost two bits set, and where the 1 bits below it indicate that the 
>> sequence is N bytes long, but that has fewer than N-1 
>> octets-with-10-at-the-top following it (either because it's terminated by an 
>> octet that doesn't have 10 at the top or it's terminated by the end of the 
>> string), is invalid and doesn't correspond to a code point in Unicode;
>>
>>                 an octet sequence doesn't have the two problems above but 
>> that produces a value that's not a valid Unicode code point is invalid and 
>> (by definition) doesn't correspond to a code point in Unicode;
>>
>>         UCS-2 encodes each code point in the Unicode Basic Multilingual 
>> Plane as 2 octets (big-endian or little-endian), and not all values from 0 
>> to 65535 correspond to Unicode code points (see next item...);
>>
>>         UTF-16 encodes each Unicode code point as 2 or 4 octets (big-endian 
>> or little-endian), with code points in the Basic Multilingual Plane encoded 
>> as 2 octets and other code points encoded as a 2-octet "leading surrogate" 
>> followed by a 2-octet "trailing surrogate" (those are values between 0 and 
>> 65535 that are *not* Unicode code points; see previous item), and:
>>
>>                 a leading surrogate not followed by a trailing surrogate 
>> (either because it's followed by a 2-octet Unicode code point value or 
>> because it's at the end of the string) is not a valid UTF-16 sequence and 
>> doesn't correspond to a code point in Unicode;
>>
>>                 a trailing surrogate not preceded by a leading surrogate 
>> (either because it's at the beginning of the string or because it's preceded 
>> by a 2-octet Unicode code point value) is not a valid UTF-16 sequence and 
>> doesn't correspond to a code point in Unicode;
>>
>>                 a leading surrogate followed by a trailing surrogate that 
>> gives a value that's not a valid Unicode code point is invalid and (by 
>> definition) doesn't correspond to a code point in Unicode;
>>
>>         UCS-4 encodes each Unicode code point directly as 4 octets 
>> (big-endian or little-endian), and any value that corresponds to a surrogate 
>> or a value larger than the largest possible Unicode code point value is 
>> invalid and doesn't correspond to a code point in Unicode;
>>
>> etc..
>>
>> Strings in Wireshark are:
>>
>>         displayed to users, either directly in the packet containing them as 
>> part of the packet summary or detail, or indirectly, for example, by being 
>> stored as a pathname or pathname component for a file and shown in packets 
>> referring to the file by some ID rather than by pathname;
>>
>>         matched by packet-matching expressions (display filters, color 
>> filters, etc.);
>>
>>         processed internally by dissectors and taps (whether in C, Lua, or 
>> some other language);
>>
>>         handed to other programs by, for example, "tshark -T fields -e ...".
>>
>> In all of those cases, we need to do *something* with the invalid octet 
>> sequences.
>>
>> In the packet-matching expression case, I'd say that:
>>
>>         *all* comparison operations in which a string value from the packet 
>> is compared with a constant text string should fail if the string has 
>> invalid octet sequences (so 0x20 0xC0 0xC0 0xC0, as a purportedly-UTF-8 
>> string, is neither equal to nor unequal to " " or "hello" or....);
>
> In addition, we should have a monadic function "valid" (I believe the
> matching engine already supports the syntax for monadic functions, so
> this should be easy to do) which takes a string field and returns a
> boolean whether or not the string contains invalid octet sequences.
>
> The other question here is matching non-printing characters. If I want
> to match 0x13 0x10 I should probably be able to compare against
> "\r\n", but:
> - most non-printing characters don't have obvious escape sequences
> (see the bit on \uXX escape sequences below, I guess)
> - if I want to match a literal "\" and I'm typing in the shell, that
> means I need to type "\\\\" for all the escapes to process correctly.
> Yuck.
>
>>         comparison operations in which a string value from the packet is 
>> compared with an octet string (comparing against something such as 
>> 20:c0:c0:c0) should do an octet-by-octet comparison of the raw octets of the 
>> string (so 0x20 0xC0 0xC0 0xC0, no matter what the encoding, compares equal 
>> to 20:c0:c0:c0);
>>
>>         equality comparison operations between two string values from the 
>> packet succeed if either
>>
>>                 1) the two string values are valid in their encoding and 
>> translate to the same UTF-8 string
>>
>>         or
>>
>>                 2) the two string values have the same encoding and have the 
>> same octets.
>
> It took me a while to convince myself 2 was a good idea (since it
> seems on the surface that it might confuse the
> invalid-strings-are-not-comparable choice) but on reflection I think
> it's probably the right thing to do.
>
>> That would require more work.
>>
>> In the display case, an argument could be made that invalid octet sequences 
>> should be displayed as a sequence of \xNN escape sequences, one octet at a 
>> time.
>>
>> In the "processed internally" case, if the part of the string that's being 
>> looked at contains an invalid octet sequence, the processing should fail, 
>> otherwise the processing should still work.  For example, an HTTP request 
>> beginning with 0x47 0x45 0x54 0x20 0xC0 should be treated as a GET request 
>> with the operand being invalid, but an HTTP request beginning with 0x47 0x45 
>> 0x54 0xC0 should be treated as an invalid request.  That would *also* 
>> require more work.
>>
>> I'm not sure *what* the right thing to do when handing fields to other 
>> programs is.
>
> I see two probable use cases:
> - the program is interested in the raw bytes, in which case that's
> what we should send; if the string isn't valid, the reading program
> will deal with it
> - the program is interested in the string, in which case we should
> send it in the locale-appropriate encoding (most frequently UTF-8)
> with invalid sequences mapped to the encoding-appropriate replacement
> character
> These two should cover 99% of cases I can think of with relatively
> minimal effort on our part. The second should be default, since the
> most frequent case of "other program" is probably "stdout of a shell"
> or "text file".
>
>> So the functions that get strings from packets should not map invalid octet 
>> sequences to a sequence of \xNN escape sequences, as that would interfere 
>> with proper handling of the string when doing packet matching and internal 
>> processing.  For those cases, perhaps a combination of
>>
>>         1) replacing invalid sequences with REPLACEMENT CHARACTER
>>
>> and
>>
>>         2) providing a separate indication that this was done
>>
>> would be the right case.
>
> I agree. This makes it easy for dissectors that don't care to
> gracefully handle invalid strings. Dissectors that care about the type
> of invalidity can access the individual bytes, though we may want a
> helper function list_string_encoding_errors(tvb, offset, len, enc) or
> some such thing.
>
>> However, that then throws away information, so that you can't *display* that 
>> string with the invalid sequences shown as \xNN sequences.
>
> I have a vague idea that a combination of
> list_string_encoding_errors() and escape_encoding_errors() would be
> sufficient, but this definitely bears more thought. Maybe just a
> boolean parameter to tvb_get_string()? Or make tvb_get_string() always
> do the escaping, and provide a helper unescape_string() which removes
> the escaping and puts replacement characters in for invalid sequences?
> Hmm...
>
>> For now, my inclination is to continue with the "replace invalid sequences 
>> with REPLACEMENT CHARACTER in tvb_get_string* routines" strategy, but not 
>> treat that as the ultimate solution.  (I'll talk about some thoughts about 
>> what to do after that below.)
>>
>> Non-printable characters are an orthogonal issue; they *can* be represented 
>> in our UTF-8 encoding of Unicode, but they shouldn't be displayed in the UI 
>> as themselves.
>>
>> My inclination there is to replace them, when displaying, with escape 
>> sequences:
>>
>>         for code points >= 0x80, I'd display them as a \uXXXXXX escape 
>> sequence (whether to trim leading zeroes, and how many of them to trim, is 
>> left as an exercise for the reader; probably trim no more than two leading 
>> zeroes, but I'm not sure what to do if there's only one leading zero) - note 
>> that this won't show the value(s) of the octet(s) that make up the code 
>> point, it'll show the Unicode code point;
>>
>>         for 0x7F and most code points < 0x20, I'd display them either as 
>> \uXX, \xXX, or \ooo (whether to stick with Traditional Octal(TM), hex, or 
>> Unicode is left as an exercise for the reader);
>>
>>         for the characters that have their own C string escape sequences 
>> (e.g., tab, CR, LF), I'd display them as that escape sequence.
>>
>> (For the future, we might want to have the "value", in a protocol tree, of a 
>> string be a combination of the encoding and the raw octets; if some code 
>> wants the value for processing purposes, it'd call a routine that converts 
>> the value to UTF-8 with REPLACEMENT CHARACTER replacing invalid sequences, 
>> and if it wants the value for display purposes, it'd call a routine that 
>> converts it to UTF-8 with escape sequences replacing invalid sequences *and* 
>> with non-printable characters shown as the appropriate escape sequence.
>>
>> That raises the question of whether, when building a protocol tree, we need 
>> to put the value into the protocol tree item at all if the item is created 
>> with proto_tree_create_item(), or whether we just postpone extracting the 
>> value until we actually need it.  Lazy processing FTW....)
___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <[email protected]>
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:[email protected]?subject=unsubscribe

Re: [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8

Reply via email to