Re: [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8

Guy Harris Sun, 26 Jan 2014 15:16:59 -0800

On Jan 26, 2014, at 1:27 PM, Jakub Zawadzki <[email protected]> wrote:

> On Tue, Jan 21, 2014 at 08:01:15AM -0500, Evan Huus wrote:
>> On Tue, Jan 21, 2014 at 2:40 AM, Guy Harris <[email protected]> wrote:
>>> 
>>> On Jan 20, 2014, at 5:59 PM, Evan Huus <[email protected]> wrote:
>>> 
>>>> In which case is dumb search-and-replace of tvb_get_string with
>>>> tvb_get_string_enc and ENC_ASCII an easy way to make (part of) the API
>>>> transition?
>>> 
>>> Did somebody say that had been done or suggest that it be done?
>> 
>> I thought it was kind of implied when you wrote "We should also
>> probably audit all calls to tvb_get_string() and tvb_get_stringz() in
>> dissectors and change them to tvb_get_string_enc() with the
>> appropriate encoding."
>> 
>> If tvb_get_string() behaves identically to tvb_get_string_enc(...
>> ENC_ASCII) then there doesn't seem much point in having both.
> 
> We can also think about dropping STR_ASCII, STR_UNICODE stuff.
> 
> To be honest I'm not happy about it, I'd rather still display
> non-printable ascii in escaped-hex-form.

OK, first-principles time:

A character string is a sequence of code points from a character set.  It's 
represented as a sequence of octets using a particular encoding for that 
character set, wherein each character is represented as a 1-or-more-octet 
subsequence in that sequence.

In many of those encodings, not all subsequences of octets correspond to code 
points in the character set.  For example:

        the 8-bit encoding of ASCII encodes each code point as an octet, and 
octets with the uppermost bit set don't correspond to ASCII code points;

        the 8-bit encodings of "8-bit" character sets encode each code point as 
an octet and, in some of those character sets, there are fewer than 256 code 
points, and some octet values don't correspond to code points in the character 
set;

        UTF-8 encodes each Unicode code point as 1 or more octets, and:

                an octet sequence that begins with an octet with the uppermost 
bit set and the bit below it clear is invalid and doesn't correspond to a code 
point in Unicode;

                an octet sequence that begins with an octet with the uppermost 
two bits set, and where the 1 bits below it indicate that the sequence is N 
bytes long, but that has fewer than N-1 octets-with-10-at-the-top following it 
(either because it's terminated by an octet that doesn't have 10 at the top or 
it's terminated by the end of the string), is invalid and doesn't correspond to 
a code point in Unicode;

                an octet sequence doesn't have the two problems above but that 
produces a value that's not a valid Unicode code point is invalid and (by 
definition) doesn't correspond to a code point in Unicode;

        UCS-2 encodes each code point in the Unicode Basic Multilingual Plane 
as 2 octets (big-endian or little-endian), and not all values from 0 to 65535 
correspond to Unicode code points (see next item...);

        UTF-16 encodes each Unicode code point as 2 or 4 octets (big-endian or 
little-endian), with code points in the Basic Multilingual Plane encoded as 2 
octets and other code points encoded as a 2-octet "leading surrogate" followed 
by a 2-octet "trailing surrogate" (those are values between 0 and 65535 that 
are *not* Unicode code points; see previous item), and:

                a leading surrogate not followed by a trailing surrogate 
(either because it's followed by a 2-octet Unicode code point value or because 
it's at the end of the string) is not a valid UTF-16 sequence and doesn't 
correspond to a code point in Unicode;

                a trailing surrogate not preceded by a leading surrogate 
(either because it's at the beginning of the string or because it's preceded by 
a 2-octet Unicode code point value) is not a valid UTF-16 sequence and doesn't 
correspond to a code point in Unicode;

                a leading surrogate followed by a trailing surrogate that gives 
a value that's not a valid Unicode code point is invalid and (by definition) 
doesn't correspond to a code point in Unicode;

        UCS-4 encodes each Unicode code point directly as 4 octets (big-endian 
or little-endian), and any value that corresponds to a surrogate or a value 
larger than the largest possible Unicode code point value is invalid and 
doesn't correspond to a code point in Unicode;

etc..

Strings in Wireshark are:

        displayed to users, either directly in the packet containing them as 
part of the packet summary or detail, or indirectly, for example, by being 
stored as a pathname or pathname component for a file and shown in packets 
referring to the file by some ID rather than by pathname;

        matched by packet-matching expressions (display filters, color filters, 
etc.);

        processed internally by dissectors and taps (whether in C, Lua, or some 
other language);

        handed to other programs by, for example, "tshark -T fields -e ...".

In all of those cases, we need to do *something* with the invalid octet 
sequences.

In the packet-matching expression case, I'd say that:

        *all* comparison operations in which a string value from the packet is 
compared with a constant text string should fail if the string has invalid 
octet sequences (so 0x20 0xC0 0xC0 0xC0, as a purportedly-UTF-8 string, is 
neither equal to nor unequal to " " or "hello" or....);

        comparison operations in which a string value from the packet is 
compared with an octet string (comparing against something such as 20:c0:c0:c0) 
should do an octet-by-octet comparison of the raw octets of the string (so 0x20 
0xC0 0xC0 0xC0, no matter what the encoding, compares equal to 20:c0:c0:c0);

        equality comparison operations between two string values from the 
packet succeed if either

                1) the two string values are valid in their encoding and 
translate to the same UTF-8 string

        or

                2) the two string values have the same encoding and have the 
same octets.

That would require more work.

In the display case, an argument could be made that invalid octet sequences 
should be displayed as a sequence of \xNN escape sequences, one octet at a time.

In the "processed internally" case, if the part of the string that's being 
looked at contains an invalid octet sequence, the processing should fail, 
otherwise the processing should still work.  For example, an HTTP request 
beginning with 0x47 0x45 0x54 0x20 0xC0 should be treated as a GET request with 
the operand being invalid, but an HTTP request beginning with 0x47 0x45 0x54 
0xC0 should be treated as an invalid request.  That would *also* require more 
work.

I'm not sure *what* the right thing to do when handing fields to other programs 
is.

So the functions that get strings from packets should not map invalid octet 
sequences to a sequence of \xNN escape sequences, as that would interfere with 
proper handling of the string when doing packet matching and internal 
processing.  For those cases, perhaps a combination of

        1) replacing invalid sequences with REPLACEMENT CHARACTER

and

        2) providing a separate indication that this was done

would be the right case.

However, that then throws away information, so that you can't *display* that 
string with the invalid sequences shown as \xNN sequences.

For now, my inclination is to continue with the "replace invalid sequences with 
REPLACEMENT CHARACTER in tvb_get_string* routines" strategy, but not treat that 
as the ultimate solution.  (I'll talk about some thoughts about what to do 
after that below.)

Non-printable characters are an orthogonal issue; they *can* be represented in 
our UTF-8 encoding of Unicode, but they shouldn't be displayed in the UI as 
themselves.

My inclination there is to replace them, when displaying, with escape sequences:

        for code points >= 0x80, I'd display them as a \uXXXXXX escape sequence 
(whether to trim leading zeroes, and how many of them to trim, is left as an 
exercise for the reader; probably trim no more than two leading zeroes, but I'm 
not sure what to do if there's only one leading zero) - note that this won't 
show the value(s) of the octet(s) that make up the code point, it'll show the 
Unicode code point;

        for 0x7F and most code points < 0x20, I'd display them either as \uXX, 
\xXX, or \ooo (whether to stick with Traditional Octal(TM), hex, or Unicode is 
left as an exercise for the reader);

        for the characters that have their own C string escape sequences (e.g., 
tab, CR, LF), I'd display them as that escape sequence.

(For the future, we might want to have the "value", in a protocol tree, of a 
string be a combination of the encoding and the raw octets; if some code wants 
the value for processing purposes, it'd call a routine that converts the value 
to UTF-8 with REPLACEMENT CHARACTER replacing invalid sequences, and if it 
wants the value for display purposes, it'd call a routine that converts it to 
UTF-8 with escape sequences replacing invalid sequences *and* with 
non-printable characters shown as the appropriate escape sequence.

That raises the question of whether, when building a protocol tree, we need to 
put the value into the protocol tree item at all if the item is created with 
proto_tree_create_item(), or whether we just postpone extracting the value 
until we actually need it.  Lazy processing FTW....)

___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <[email protected]>
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:[email protected]?subject=unsubscribe

Re: [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8

Reply via email to