On Nov 8, 2009, at 12:55 PM, Mark Bednarczyk wrote:

>>> My library gets its filename from a java string and it currently
>>> converts it to plain UTF-8 charset and that works fine.
>>
>> On UN*X, it should perhaps be converted to whatever the
>> locale's filename character set is.
>
> But I don't actually call on any fopen calls directly. I rely on libpcap to
> work with the filesystem. Therefore I would like to go by the specs the
> libpcap provides for the pcap_open_offline call. It would be nice to somehow
> handle and provide a definitive specification when passing in a string.

The definitive specification is "it calls fopen(), so it does the same
thing as fopen()". *If* a file name happens to be encoded, in the file
system, using UTF-8, you would hand that UTF-8 string to fopen() to open
it, so you would do the same with pcap_open_offline(). If, instead, it
happens to be encoded using ISO 8859/1, or 8859/2, or 8859/15, or..., or
KOI-8, or Shift-JIS, or EUC-JP, or..., you'd hand a string in *that*
encoding. (Sorry, but UN*X internationalization antedated Unicode, so
they had to do *something*, and ended up doing a variety of different
things in different locales. Oh, and don't get me started about Unicode
normalization forms....)

>> I'm not sure how that would be determined, however. I might
>> be tempted to assume that, if the environment variable
>> LC_CTYPE is set that specifies the encoding, otherwise if
>> LANG is set that specifies the encoding, otherwise it might
>> be the C locale (which, I think, unfortunately says the
>> encoding is ASCII). However, GLib (not glibc,
>> GLib) has its own additional environment variables:
>>
>> http://library.gnome.org/devel/glib/stable/glib-running.html
>>
>> and I'm not sure why that's the case.
>>
>>> But in reality I'd like to support all unicode widths 8, 16 and even
>>> 32 bit. I'm not sure how those wider unicode chars would be handled.
>>
>> How are they handled elsewhere in Java? The File class seems
>> to work with Strings, and the String class, at least as I
>> understand the documentation, uses UTF-16 (presumably that's
>> what you mean by "unicode [width] ... 16 ... bit").
>
> Java has extensive unicode support for even the extended unicode widths where
> they combine 2 UTF-16 chars to describe a single character.

If that's "surrogate pairs", that's more like "combining two 16-bit
codes" - a surrogate pair is a single character, represented as two
"code units":

    http://unicode.org/standard/principles.html

    "Encoding Forms

    Character encoding standards define not only the identity of each
    character and its numeric value, or code point, but also how this
    value is represented in bits.

    The Unicode Standard defines three encoding forms that allow the
    same data to be transmitted in a byte, word or double word oriented
    format (i.e. in 8, 16 or 32-bits per code unit). All three encoding
    forms encode the same common character repertoire and can be
    efficiently transformed into one another without loss of data. The
    Unicode Consortium fully endorses the use of any of these encoding
    forms as a conformant way of implementing the Unicode Standard.

    UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of
    transforming all Unicode characters into a variable length encoding
    of bytes. It has the advantages that the Unicode characters
    corresponding to the familiar ASCII set have the same byte values as
    ASCII, and that Unicode characters transformed into UTF-8 can be
    used with much existing software without extensive software
    rewrites.

    UTF-16 is popular in many environments that need to balance
    efficient access to characters with economical use of storage. It is
    reasonably compact and all the heavily used characters fit into a
    single 16-bit code unit, while all other characters are accessible
    via pairs of 16-bit code units.

    UTF-32 is popular where memory space is no concern, but fixed width,
    single code unit access to characters is desired. Each Unicode
    character is encoded in a single 32-bit code unit when using UTF-32.

    All three encoding forms need at most 4 bytes (or 32-bits) of data
    for each character."
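
To make the code unit vs. code point distinction concrete, here is a minimal sketch in Java. The class name CodeUnitDemo and its helper methods are made up for illustration; only the java.lang and java.nio.charset calls are real API. It uses U+1D11E (a character outside the BMP) as the test character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of jNetPcap or libpcap.
final class CodeUnitDemo {
    /** Number of UTF-16 code units (Java chars) in s. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points (characters) in s. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes s occupies when encoded with the given charset. */
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so Java
        // stores it as a surrogate pair: two chars, one character.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(codeUnits(clef));                         // 2
        System.out.println(codePoints(clef));                        // 1
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // true

        // The same single character in two encoding forms.
        System.out.println(encodedLength(clef, StandardCharsets.UTF_8));    // 4
        System.out.println(encodedLength(clef, StandardCharsets.UTF_16BE)); // 4

        // A BMP character fits in one UTF-16 code unit.
        System.out.println(encodedLength("A", StandardCharsets.UTF_8));     // 1
        System.out.println(encodedLength("A", StandardCharsets.UTF_16BE));  // 2
    }
}
```

UTF_16BE is used rather than UTF_16 so that no byte-order mark is added to the byte count.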

At least as I read the description of the String class:

    http://java.sun.com/javase/6/docs/api/java/lang/String.html

it's based on UTF-16:

    "A String represents a string in the UTF-16 format in which
    supplementary characters are represented by surrogate pairs (see the
    section Unicode Character Representations in the Character class for
    more information). Index values refer to char code units, so a
    supplementary character uses two positions in a String.

    The String class provides methods for dealing with Unicode code
    points (i.e., characters), in addition to those for dealing with
    Unicode code units (i.e., char values)."

> Here is how java represents unicode characters:
>
> The char data type (and therefore the value that a Character object
> encapsulates) are based on the original Unicode specification, which defined
> characters as fixed-width 16-bit entities.

Meaning a single char can't represent a character outside the BMP.
However, from your example in "Decoding packets manually":

    String file = "capturefile.pcap";

        ...

    Pcap pcap = Pcap.openOffline(file, errbuf);

it appears that you use Strings for pathnames. As per my earlier mail,
pathnames seem to be Strings, hence UTF-16-encoded, so the pathnames
you'll be handed are UTF-16, not UCS-2 (UCS-2 encodes only the BMP, with
one 16-bit code unit per code point).

> So in summary, I think the answer is that UTF-8 is supported on all/most
> platforms and filesystem types right now.

It's supported on UN*Xes where file names happen to be encoded in UTF-8.
Mac OS X does that (in fact, that's all that's supported in HFS+,
although, *on disk*, HFS+ uses, I think, UTF-16, but what you see in the
UN*X APIs is UTF-8; the OS X SMB client assumes all file names are
UTF-8, mapping them to UTF-16 over the wire and mapping stuff received
from over the wire from UTF-16 back to UTF-8). Other UN*Xes probably
allow other encodings, hence my comment about mapping from UTF-8 to the
native file name encoding.
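
As a rough sketch of what "mapping to the native file name encoding" could look like on the Java side: the helper below converts a String to bytes in a caller-supplied charset and fails loudly, rather than silently substituting '?', if the name can't be represented. FileNameEncoder and encode() are hypothetical names, not part of jNetPcap; how the target charset gets chosen (locale, configuration, ...) is deliberately left to the caller, since that's the hard part.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the target charset is an input, not detected here.
final class FileNameEncoder {
    /**
     * Encodes name in the target charset. Throws CharacterCodingException
     * if name contains an unpaired surrogate or a character that the
     * target charset cannot represent.
     */
    static byte[] encode(String name, Charset target)
            throws CharacterCodingException {
        ByteBuffer buf = target.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(name));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String name = "M\u00FCller.pcap"; // "Müller.pcap"
        try {
            // Same 11 characters, different byte strings per encoding.
            System.out.println(encode(name, StandardCharsets.ISO_8859_1).length); // 11
            System.out.println(encode(name, StandardCharsets.UTF_8).length);      // 12

            // A Chinese character has no ISO 8859/1 representation.
            encode("\u4E2D.pcap", StandardCharsets.ISO_8859_1);
            System.out.println("unexpectedly encodable");
        } catch (CharacterCodingException e) {
            System.out.println("not representable in target charset");
        }
    }
}
```

The resulting byte array, plus a terminating '\0', is what the native side would hand to pcap_open_offline(); String.getBytes() is not used here because it replaces unmappable characters instead of reporting them.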

On Windows, however, it's not going to work - on Windows, I don't think
fopen() takes UTF-8-encoded pathnames, I think it takes pathnames
encoded in whatever the current "code page" is. That means that there
could be unopenable files (e.g., if your current code page is an Asian
DBCS code page, you probably won't be able to open a file named
"Müller's network problem.pcap"). You'd need pcap_wopen_offline(), or
something such as that, to fully support Unicode pathnames.

> The UTF-16 which is what my user is
> using for some Chinese characters in filename will not work with libpcap's
> pcap_open_offline(). The platform he is on is ubuntu

...which, being a Linux distribution, and hence a UN*X, expects
pathnames to be sequences of octets, with '/' as separators and '\0' as
a terminator. Handing it a UTF-16 string isn't going to work very well.

*If* the file's name is encoded with UTF-8, handing it a UTF-8 string
should work. If it's encoded in some other encoding, such as Big5:

    http://en.wikipedia.org/wiki/Big5

or GB 2312:

    http://en.wikipedia.org/wiki/GB2312

it probably won't work.

> I'm not sure what application created the file in the first place.
> Maybe we can discern if fopen was used to create the file using UTF-16
> encoding or some other system call.

Ultimately, the system call used to create the file was either open() or
creat() (and the former is a superset of the latter); they take octet
strings in some superset-of-ASCII encoding (UTF-8, ISO 8859/x, Big5,
GB 2312, Shift JIS, etc.), so that all octets in the range 0x00 through
0x7F represent the corresponding ASCII character, and only octets with
the 0x80 bit set are used to encode other characters.

The issue probably doesn't involve UTF-16, as that's not an octet-string
superset-of-ASCII encoding; it probably involves UTF-8 vs. some other
encoding of Chinese.
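
To illustrate why UTF-16 can't be what ended up in the file name on a UN*X, here is a small sketch; PathBytesDemo and its methods are invented for the example, and it sticks to charsets every Java runtime is required to provide (so it uses ISO 8859/1 vs. UTF-8 to stand in for "two different superset-of-ASCII encodings", rather than Big5 or GB 2312).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of jNetPcap or libpcap.
final class PathBytesDemo {
    /** True if the byte string contains a 0x00 octet, which a C API
     *  such as fopen() would treat as the end of the pathname. */
    static boolean hasNul(byte[] bytes) {
        for (byte b : bytes) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    /** True if cs encodes this pure-ASCII string to the same octets
     *  as US-ASCII does, as a superset-of-ASCII encoding must. */
    static boolean asciiPreserved(String ascii, Charset cs) {
        return Arrays.equals(ascii.getBytes(cs),
                             ascii.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        String path = "/tmp/a.pcap";

        // UTF-16 puts a 0x00 octet next to every ASCII character.
        System.out.println(hasNul(path.getBytes(StandardCharsets.UTF_16LE))); // true
        System.out.println(hasNul(path.getBytes(StandardCharsets.UTF_8)));    // false

        System.out.println(asciiPreserved(path, StandardCharsets.UTF_8));      // true
        System.out.println(asciiPreserved(path, StandardCharsets.ISO_8859_1)); // true
        System.out.println(asciiPreserved(path, StandardCharsets.UTF_16LE));   // false

        // Two superset-of-ASCII encodings still disagree on non-ASCII
        // characters, so the same name yields different octet strings.
        String name = "M\u00FCller"; // "Müller"
        System.out.println(Arrays.equals(
                name.getBytes(StandardCharsets.UTF_8),
                name.getBytes(StandardCharsets.ISO_8859_1))); // false
    }
}
```

The last comparison is the UTF-8-vs.-something-else situation: the name looks the same to the user, but the octets handed to open() don't match the octets stored in the directory.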

As for the other user who filed

    http://jnetpcap.com/node/456

he's using Windows (as per the reference to MinGW and jnetpcap.dll), so
his problem may ultimately be caused by the lack of
pcap_wopen_offline().

> Cheers,
> mark..
> http://jnetpcap.com
>
> _______________________________________________
> Winpcap-users mailing list
> [email protected]
> https://www.winpcap.org/mailman/listinfo/winpcap-users
