On Mon, Jan 27, 2014 at 7:17 PM, Evan Huus <[email protected]> wrote:
> Thank you Guy, this is excellent. It's more or less what I was looking
> for in [1] back in October (which, coincidentally, links to the bug I
> referred to earlier in this thread).
>
> All of these conversations should be condensed (probably using your
> email as a base) and put into a wiki page for future reference. I will
> try and do this when I find the time, if nobody else beats me to it.
A first draft wiki page is available:
http://wiki.wireshark.org/Development/StringHandling

> Further comments and suggestions inline.
>
> [1] https://www.wireshark.org/lists/wireshark-dev/201310/msg00260.html
>
> On Sun, Jan 26, 2014 at 6:13 PM, Guy Harris <[email protected]> wrote:
>>
>> On Jan 26, 2014, at 1:27 PM, Jakub Zawadzki <[email protected]> wrote:
>>
>>> On Tue, Jan 21, 2014 at 08:01:15AM -0500, Evan Huus wrote:
>>>> On Tue, Jan 21, 2014 at 2:40 AM, Guy Harris <[email protected]> wrote:
>>>>>
>>>>> On Jan 20, 2014, at 5:59 PM, Evan Huus <[email protected]> wrote:
>>>>>
>>>>>> In which case is dumb search-and-replace of tvb_get_string with
>>>>>> tvb_get_string_enc and ENC_ASCII an easy way to make (part of) the API
>>>>>> transition?
>>>>>
>>>>> Did somebody say that had been done or suggest that it be done?
>>>>
>>>> I thought it was kind of implied when you wrote "We should also
>>>> probably audit all calls to tvb_get_string() and tvb_get_stringz() in
>>>> dissectors and change them to tvb_get_string_enc() with the
>>>> appropriate encoding."
>>>>
>>>> If tvb_get_string() behaves identically to tvb_get_string_enc(...
>>>> ENC_ASCII) then there doesn't seem much point in having both.
>>>
>>> We can also think about dropping the STR_ASCII and STR_UNICODE stuff.
>>>
>>> To be honest I'm not happy about it; I'd rather still display
>>> non-printable ASCII in escaped-hex form.
>>
>> OK, first-principles time:
>>
>> A character string is a sequence of code points from a character set. It's
>> represented as a sequence of octets using a particular encoding for that
>> character set, wherein each character is represented as a 1-or-more-octet
>> subsequence in that sequence.
>>
>> In many of those encodings, not all subsequences of octets correspond to
>> code points in the character set.
>> For example:
>>
>> the 8-bit encoding of ASCII encodes each code point as an octet, and
>> octets with the uppermost bit set don't correspond to ASCII code points;
>>
>> the 8-bit encodings of "8-bit" character sets encode each code point
>> as an octet and, in some of those character sets, there are fewer than 256
>> code points, and some octet values don't correspond to code points in the
>> character set;
>>
>> UTF-8 encodes each Unicode code point as 1 or more octets, and:
>>
>> an octet sequence that begins with an octet with the
>> uppermost bit set and the bit below it clear is invalid and doesn't
>> correspond to a code point in Unicode;
>>
>> an octet sequence that begins with an octet with the
>> uppermost two bits set, and where the 1 bits below it indicate that the
>> sequence is N bytes long, but that has fewer than N-1
>> octets-with-10-at-the-top following it (either because it's terminated by an
>> octet that doesn't have 10 at the top or it's terminated by the end of the
>> string), is invalid and doesn't correspond to a code point in Unicode;
>>
>> an octet sequence that doesn't have the two problems above but
>> that produces a value that's not a valid Unicode code point is invalid and
>> (by definition) doesn't correspond to a code point in Unicode;
>>
>> UCS-2 encodes each code point in the Unicode Basic Multilingual
>> Plane as 2 octets (big-endian or little-endian), and not all values from 0
>> to 65535 correspond to Unicode code points (see next item...);
>>
>> UTF-16 encodes each Unicode code point as 2 or 4 octets (big-endian
>> or little-endian), with code points in the Basic Multilingual Plane encoded
>> as 2 octets and other code points encoded as a 2-octet "leading surrogate"
>> followed by a 2-octet "trailing surrogate" (those are values between 0 and
>> 65535 that are *not* Unicode code points; see previous item), and:
>>
>> a leading surrogate not followed by a trailing surrogate
>> (either because it's followed by a 2-octet Unicode code
>> point value or
>> because it's at the end of the string) is not a valid UTF-16 sequence and
>> doesn't correspond to a code point in Unicode;
>>
>> a trailing surrogate not preceded by a leading surrogate
>> (either because it's at the beginning of the string or because it's preceded
>> by a 2-octet Unicode code point value) is not a valid UTF-16 sequence and
>> doesn't correspond to a code point in Unicode;
>>
>> a leading surrogate followed by a trailing surrogate that
>> gives a value that's not a valid Unicode code point is invalid and (by
>> definition) doesn't correspond to a code point in Unicode;
>>
>> UCS-4 encodes each Unicode code point directly as 4 octets
>> (big-endian or little-endian), and any value that corresponds to a surrogate
>> or a value larger than the largest possible Unicode code point value is
>> invalid and doesn't correspond to a code point in Unicode;
>>
>> etc.
>>
>> Strings in Wireshark are:
>>
>> displayed to users, either directly in the packet containing them as
>> part of the packet summary or detail, or indirectly, for example, by being
>> stored as a pathname or pathname component for a file and shown in packets
>> referring to the file by some ID rather than by pathname;
>>
>> matched by packet-matching expressions (display filters, color
>> filters, etc.);
>>
>> processed internally by dissectors and taps (whether in C, Lua, or
>> some other language);
>>
>> handed to other programs by, for example, "tshark -T fields -e ...".
>>
>> In all of those cases, we need to do *something* with the invalid octet
>> sequences.
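[As a concrete illustration of the UTF-8 validity rules listed above, a minimal checker might look like the sketch below. This is a hypothetical helper written for this discussion, not part of the Wireshark API; the name utf8_is_valid and its signature are invented.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Wireshark API): returns true if buf[0..len)
 * is entirely well-formed UTF-8 under the rules described above: no
 * bad lead octets, no truncated sequences, no overlong forms, no
 * surrogates, and nothing above U+10FFFF. */
static bool utf8_is_valid(const uint8_t *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint8_t lead = buf[i];
        size_t n;           /* total octets in this sequence */
        uint32_t cp, min;   /* decoded code point and its minimum legal value */

        if (lead < 0x80) { i++; continue; }  /* plain ASCII */
        else if ((lead & 0xE0) == 0xC0) { n = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { n = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { n = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;  /* 10xxxxxx lead octet, or 0xF8 and up */

        if (i + n > len)
            return false;   /* sequence truncated by the end of the string */
        for (size_t j = 1; j < n; j++) {
            if ((buf[i + j] & 0xC0) != 0x80)
                return false;  /* not a continuation octet */
            cp = (cp << 6) | (buf[i + j] & 0x3F);
        }
        /* overlong forms, surrogates, and values past U+10FFFF decode to
         * something that is not a Unicode code point */
        if (cp < min || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return false;
        i += n;
    }
    return true;
}
```

[For example, the purportedly-UTF-8 string 0x20 0xC0 0xC0 0xC0 discussed below fails this check, because 0xC0 announces a 2-octet sequence but is followed by another lead octet.]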
>>
>> In the packet-matching expression case, I'd say that:
>>
>> *all* comparison operations in which a string value from the packet
>> is compared with a constant text string should fail if the string has
>> invalid octet sequences (so 0x20 0xC0 0xC0 0xC0, as a purportedly-UTF-8
>> string, is neither equal to nor unequal to " " or "hello" or...);
>
> In addition, we should have a monadic function "valid" (I believe the
> matching engine already supports the syntax for monadic functions, so
> this should be easy to do) which takes a string field and returns a
> boolean indicating whether or not the string contains invalid octet
> sequences.
>
> The other question here is matching non-printing characters. If I want
> to match 0x0d 0x0a I should probably be able to compare against
> "\r\n", but:
>
> - most non-printing characters don't have obvious escape sequences
> (see the bit on \uXX escape sequences below, I guess)
>
> - if I want to match a literal "\" and I'm typing in the shell, that
> means I need to type "\\\\" for all the escapes to process correctly.
> Yuck.
>
>> comparison operations in which a string value from the packet is
>> compared with an octet string (comparing against something such as
>> 20:c0:c0:c0) should do an octet-by-octet comparison of the raw octets of the
>> string (so 0x20 0xC0 0xC0 0xC0, no matter what the encoding, compares equal
>> to 20:c0:c0:c0);
>>
>> equality comparison operations between two string values from the
>> packet succeed if either
>>
>> 1) the two string values are valid in their encoding and
>> translate to the same UTF-8 string
>>
>> or
>>
>> 2) the two string values have the same encoding and have the
>> same octets.
>
> It took me a while to convince myself 2 was a good idea (since it
> seems on the surface that it might confuse the
> invalid-strings-are-not-comparable choice) but on reflection I think
> it's probably the right thing to do.
>
>> That would require more work.
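[The trivalent comparison semantics Guy proposes could be sketched roughly as below. This is a hedged illustration only; the type and function names are invented for this email and are not the actual epan/ftypes code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Three-valued result for display-filter string comparisons: comparing
 * against a string with invalid octet sequences is neither true nor
 * false, so the match simply fails. */
typedef enum { ST_FALSE = 0, ST_TRUE = 1, ST_INVALID = 2 } strtest_t;

/* Hypothetical sketch: compare a field's string value (already converted
 * to UTF-8) against a constant. 'field_valid' records whether the
 * field's raw octets were well-formed in their declared encoding. */
static strtest_t str_eq_const(const char *field_utf8, bool field_valid,
                              const char *constant)
{
    if (!field_valid)
        return ST_INVALID;  /* 0x20 0xC0 0xC0 0xC0 matches no text constant */
    return strcmp(field_utf8, constant) == 0 ? ST_TRUE : ST_FALSE;
}

/* Comparison against an octet string (e.g. 20:c0:c0:c0) always works on
 * the raw octets, regardless of whether the encoding was valid. */
static bool bytes_eq(const unsigned char *raw, size_t raw_len,
                     const unsigned char *pattern, size_t pat_len)
{
    return raw_len == pat_len && memcmp(raw, pattern, raw_len) == 0;
}
```

[Note how the same invalid field value fails every text comparison but still compares equal to its own raw octets, matching the two bullet points above.]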
>>
>> In the display case, an argument could be made that invalid octet sequences
>> should be displayed as a sequence of \xNN escape sequences, one octet at a
>> time.
>>
>> In the "processed internally" case, if the part of the string that's being
>> looked at contains an invalid octet sequence, the processing should fail;
>> otherwise, the processing should still work. For example, an HTTP request
>> beginning with 0x47 0x45 0x54 0x20 0xC0 should be treated as a GET request
>> with the operand being invalid, but an HTTP request beginning with 0x47 0x45
>> 0x54 0xC0 should be treated as an invalid request. That would *also*
>> require more work.
>>
>> I'm not sure *what* the right thing to do when handing fields to other
>> programs is.
>
> I see two probable use cases:
>
> - the program is interested in the raw bytes, in which case that's
> what we should send; if the string isn't valid, the reading program
> will deal with it
>
> - the program is interested in the string, in which case we should
> send it in the locale-appropriate encoding (most frequently UTF-8)
> with invalid sequences mapped to the encoding-appropriate replacement
> character
>
> These two should cover 99% of the cases I can think of with relatively
> minimal effort on our part. The second should be the default, since the
> most frequent case of "other program" is probably "stdout of a shell"
> or "a text file".
>
>> So the functions that get strings from packets should not map invalid octet
>> sequences to a sequence of \xNN escape sequences, as that would interfere
>> with proper handling of the string when doing packet matching and internal
>> processing. For those cases, perhaps a combination of
>>
>> 1) replacing invalid sequences with REPLACEMENT CHARACTER
>>
>> and
>>
>> 2) providing a separate indication that this was done
>>
>> would be the right choice.
>
> I agree. This makes it easy for dissectors that don't care to
Dissectors that care about the type > of invalidity can access the individual bytes, though we may want a > helper function list_string_encoding_errors(tvb, offset, len, enc) or > some such thing. > >> However, that then throws away information, so that you can't *display* that >> string with the invalid sequences shown as \xNN sequences. > > I have a vague idea that a combination of > list_string_encoding_errors() and escape_encoding_errors() would be > sufficient, but this definitely bears more thought. Maybe just a > boolean parameter to tvb_get_string()? Or make tvb_get_string() always > do the escaping, and provide a helper unescape_string() which removes > the escaping and puts replacement characters in for invalid sequences? > Hmm... > >> For now, my inclination is to continue with the "replace invalid sequences >> with REPLACEMENT CHARACTER in tvb_get_string* routines" strategy, but not >> treat that as the ultimate solution. (I'll talk about some thoughts about >> what to do after that below.) >> >> Non-printable characters are an orthogonal issue; they *can* be represented >> in our UTF-8 encoding of Unicode, but they shouldn't be displayed in the UI >> as themselves. 
>> >> My inclination there is to replace them, when displaying, with escape >> sequences: >> >> for code points >= 0x80, I'd display them as a \uXXXXXX escape >> sequence (whether to trim leading zeroes, and how many of them to trim, is >> left as an exercise for the reader; probably trim no more than two leading >> zeroes, but I'm not sure what to do if there's only one leading zero) - note >> that this won't show the value(s) of the octet(s) that make up the code >> point, it'll show the Unicode code point; >> >> for 0x7F and most code points < 0x20, I'd display them either as >> \uXX, \xXX, or \ooo (whether to stick with Traditional Octal(TM), hex, or >> Unicode is left as an exercise for the reader); >> >> for the characters that have their own C string escape sequences >> (e.g., tab, CR, LF), I'd display them as that escape sequence. >> >> (For the future, we might want to have the "value", in a protocol tree, of a >> string be a combination of the encoding and the raw octets; if some code >> wants the value for processing purposes, it'd call a routine that converts >> the value to UTF-8 with REPLACEMENT CHARACTER replacing invalid sequences, >> and if it wants the value for display purposes, it'd call a routine that >> converts it to UTF-8 with escape sequences replacing invalid sequences *and* >> with non-printable characters shown as the appropriate escape sequence. >> >> That raises the question of whether, when building a protocol tree, we need >> to put the value into the protocol tree item at all if the item is created >> with proto_tree_create_item(), or whether we just postpone extracting the >> value until we actually need it. Lazy processing FTW....) ___________________________________________________________________________ Sent via: Wireshark-dev mailing list <[email protected]> Archives: http://www.wireshark.org/lists/wireshark-dev Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev mailto:[email protected]?subject=unsubscribe
