Re: [Wireshark-dev] Note about proto_tree_add_unicode_string (r43379)

Guy Harris Tue, 19 Jun 2012 12:48:27 -0700

On Jun 19, 2012, at 12:01 PM, Jakub Zawadzki wrote:

> Hi,
> 
> String from tvb_get_ephemeral_string() still needs escaping with 
> format_text(),
> cause it doesn't check encoding.
> 
> When you use:
>  tvb_get_ephemeral_string_enc(tvb, offset, length, ENC_UTF_8 | ENC_NA);
> 
> It guarantees result encoded in UTF-8:
> * string as converted from the appropriate encoding to UTF-8 ...
> 
> (Code to do it is still in XXX's but this is bug in libwireshark and no one 
> can blame you that you used wrong function :))


Or, rather, there is no code to do it and there's an XXX comment noting that:

        case ENC_UTF_8:
                /*
                 * XXX - should map all invalid UTF-8 sequences
                 * to a "substitute" UTF-8 character.
                 */
                strbuf = tvb_get_ephemeral_string(tvb, offset, length);
                break;

*That's* why escaping is still needed - we don't handle malformed UTF-8 strings.

If we were to:

        store strings in the protocol tree as a combination of an encoding 
value and a raw bucket of bytes;

        in the display filter code:

                implement comparison of string fields with text strings by 
attempting to convert the string field value to UTF-8 at that point and having 
all comparisons other than != fail and have != succeed if the string can't be 
converted;

                also support comparison of string fields with *byte* strings, 
e.g. a sequence of hex-encoded octets, and have that compare the raw bucket of 
bytes;

        have the "construct a filter from this field" UI generate, for string 
fields, a comparison with a UTF-8 string if the value can be converted to UTF-8 
and a comparison with a byte string if it can't;

        in the code that generates displayable text for the protocol tree, 
replace anything in the string that can't be mapped to UTF-8 to a "substitute" 
UTF-8 character, including mapping invalid UTF-8 sequences if the string's 
encoding is UTF-8;

        do similar stuff if a string field's value is used to generate anything 
in the packet summary.

that might handle this, and might also defer conversion to UTF-8 until it's 
absolutely necessary, which might cut CPU usage.

(A further optimization might be to, in the display filter code implement 
comparison of string fields with text strings by attempting to convert the 
*text string* to the field's encoding at that point, with the result cached so 
you only have to do it once, and having all comparisons other than != fail and 
have != succeed if the string can't be converted.)

As for PDML:

        
http://www.nbee.org/doku.php?id=netpdl:visualization_expression#data_associated_to_each_netpdl_field

says

        value (always present)
                It contains the field value as an hex string. In other words, 
the content of the value attribute in case of a 16 bit field whose value is 
0800 (hex) is the string 0800. This value is exactly the same as written in the 
hex file, i.e. it does not change in case of little-endian or big-endian 
fields. Please note that for bitmasked fields, this value corresponds to the 
'unmasked' value (i.e. the real field value must be derived from a bitwise AND 
operation between the mask and the value).

        showvalue(always present)

                It contains the field value in a “printable” form. For 
instance, field value ffffffffffff (related to a MAC address) can be associated 
to value ffffff-ffffff and assigned to this attribute.

which seems to indicate that, for a string field, "value" should be the raw hex 
bytes of the string (except perhaps for byte-swapping in the case of UCS-2, 
UCS-4, or UTF-16), *NOT* anything that looks like the text, and suggests to me 
that "showvalue" would be the same as what's generated by the code that 
generates displayable text for the protocol tree, i.e. it'd have invalid UTF-8 
sequences replaced by a "substitute" character.

However, I don't know whether that's what programs out there that read PDML 
would expect - and it doesn't include the encoding for the string, so that code 
wouldn't know what to do with the string's value.
___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <[email protected]>
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:[email protected]?subject=unsubscribe

Re: [Wireshark-dev] Note about proto_tree_add_unicode_string (r43379)

Reply via email to