Thanks for your deep insight into the issue.

> he's using Windows (as per the reference to MinGW and
> jnetpcap.dll), so his problem may ultimately be caused by the
> lack of pcap_wopen_offline().

I agree. Java's native code interface (JNI) provides support for
converting Java strings to Unicode or UTF-8:

http://java.sun.com/javase/6/docs/technotes/guides/jni/spec/functions.html#string_operations

So when support for this comes from libpcap/WinPcap, I will be ready.
I can experiment some more with Windows wide-char to Java string
conversion in my getHardwareAddress function for an interface:

    /*
     * Name is in wide character format. So convert to plain UTF8.
     */
    int size = WideCharToMultiByte(0, 0, map->Name, -1, NULL, 0, NULL, NULL);
    char utf8[size + 1];
    WideCharToMultiByte(0, 0, map->Name, -1, utf8, size, NULL, NULL);

(Source starts on line 567:
http://jnetpcap.svn.sourceforge.net/viewvc/jnetpcap/jnetpcap/trunk/src/c/jnetpcap_utils.cpp?view=markup)

Now that I think about it, I should be able to convert directly from a
Windows wide-char string to a Java string, without the extra step of
going through UTF-8 as an intermediate between the two.

Cheers,
mark...

> -----Original Message-----
> From: Guy Harris [mailto:[email protected]]
> Sent: Sunday, November 08, 2009 5:28 PM
> To: [email protected]; [email protected]
> Subject: Re: [Winpcap-users] pcap_open_offline and unicode charsets
>
>
> On Nov 8, 2009, at 12:55 PM, Mark Bednarczyk wrote:
>
> >>> My library gets its filename from a java string and it currently
> >>> converts it to plain UTF-8 charset and that works fine.
> >>
> >> On UN*X, it should perhaps be converted to whatever the locale's
> >> filename character set is.
> >
> > But I don't actually call on any fopen calls directly. I rely on
> > libpcap to work with the filesystem. Therefore I would like to go by
> > the specs the libpcap provides for the pcap_open_offline call. It
> > would be nice to somehow handle and provide a definitive specification
> > when passing in a string.
>
> The definitive specification is "it calls fopen(), so it does
> the same thing as fopen()".
>
> *If* a file name happens to be encoded, in the file system,
> using UTF-8, you would hand that UTF-8 string to fopen() to
> open it, so you would do the same with pcap_open_offline().
> If, instead, it happens to be encoded using ISO 8859/1, or
> 8859/2, or 8859/15, or..., or KOI-8, or Shift-JIS, or EUJIS,
> or..., you'd hand a string in *that* encoding. (Sorry, but
> UN*X internationalization antedated Unicode, so they had to
> do *something*, and ended up doing a variety of different
> things in different locales. Oh, and don't get me started
> about Unicode normalization forms....)
>
> >>
> >> I'm not sure how that would be determined, however. I might be
> >> tempted to assume that, if the environment variable LC_CTYPE is set
> >> that specifies the encoding, otherwise if LANG is set that specifies
> >> the encoding, otherwise it might be the C locale (which, I think,
> >> unfortunately says the encoding is ASCII). However, GLib (not glibc,
> >> GLib) has its own additional environment variables:
> >>
> >> http://library.gnome.org/devel/glib/stable/glib-running.html
> >>
> >> and I'm not sure why that's the case.
> >>
> >>> But in reality I'd like to support all unicode widths 8, 16 and even
> >>> 32 bit. I'm not sure how those wider unicode chars would be handled.
> >>
> >> How are they handled elsewhere in Java? The File class seems to work
> >> with Strings, and the String class, at least as I understand the
> >> documentation, uses UTF-16 (presumably that's what you mean by
> >> "unicode [width] ... 16 ... bit").
> >
> > Java has extensive unicode support for even the extended unicode
> > widths where they combine 2 UTF-16 chars to describe a single
> > character.
>
> If that's "surrogate pairs", that's more like "combining two
> 16-bit codes" - a surrogate pair is a single character,
> represented as two "code units":
>
> http://unicode.org/standard/principles.html
>
> "Encoding Forms
>
> Character encoding standards define not only the identity of
> each character and its numeric value, or code point, but also
> how this value is represented in bits.
>
> The Unicode Standard defines three encoding forms that allow
> the same data to be transmitted in a byte, word or double
> word oriented format (i.e. in 8, 16 or 32-bits per code
> unit). All three encoding forms encode the same common
> character repertoire and can be efficiently transformed into
> one another without loss of data. The Unicode Consortium
> fully endorses the use of any of these encoding forms as a
> conformant way of implementing the Unicode Standard.
>
> UTF-8 is popular for HTML and similar protocols. UTF-8 is a
> way of transforming all Unicode characters into a variable
> length encoding of bytes. It has the advantages that the
> Unicode characters corresponding to the familiar ASCII set
> have the same byte values as ASCII, and that Unicode
> characters transformed into UTF-8 can be used with much
> existing software without extensive software rewrites.
>
> UTF-16 is popular in many environments that need to balance
> efficient access to characters with economical use of
> storage. It is reasonably compact and all the heavily used
> characters fit into a single 16-bit code unit, while all
> other characters are accessible via pairs of 16-bit code units.
>
> UTF-32 is popular where memory space is no concern, but fixed
> width, single code unit access to characters is desired. Each
> Unicode character is encoded in a single 32-bit code unit
> when using UTF-32.
>
> All three encoding forms need at most 4 bytes (or 32-bits) of
> data for each character."
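
To make the code unit vs. code point distinction above concrete, here
is a minimal Java sketch. The class and method names (CodeUnitsDemo,
codeUnits, codePoints, utf8Octets) are made up for illustration and are
not part of jNetPcap or libpcap; only standard java.lang and
java.nio.charset calls are used. It compares a character inside the BMP
with one outside it, which Java stores as a surrogate pair.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of jNetPcap.
class CodeUnitsDemo {
    // Number of UTF-16 code units (Java char values) in s.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (characters) in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of octets needed to encode s as UTF-8.
    static int utf8Octets(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+4E2D lies inside the BMP: one 16-bit code unit.
        String bmp = "\u4E2D";
        // U+20000 lies outside the BMP: Java stores it as a surrogate pair.
        String supplementary = new String(Character.toChars(0x20000));

        for (String s : new String[] { bmp, supplementary }) {
            System.out.printf("U+%X: %d code unit(s), %d code point(s), %d UTF-8 octet(s)%n",
                    s.codePointAt(0), codeUnits(s), codePoints(s), utf8Octets(s));
        }
    }
}
```

The point of the sketch is that String.length() counts code units, so
a pathname containing supplementary characters is longer in chars than
in characters, while the UTF-8 form never needs more than 4 octets per
character.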
>
> At least as I read the description of the String class:
>
> http://java.sun.com/javase/6/docs/api/java/lang/String.html
>
> it's based on UTF-16:
>
> "A String represents a string in the UTF-16 format in which
> supplementary characters are represented by surrogate pairs
> (see the section Unicode Character Representations in the
> Character class for more information). Index values refer to
> char code units, so a supplementary character uses two
> positions in a String.
>
> The String class provides methods for dealing with Unicode
> code points (i.e., characters), in addition to those for
> dealing with Unicode code units (i.e., char values)."
>
> > Here is how java represents unicode characters:
> >
> > The char data type (and therefore the value that a Character object
> > encapsulates) are based on the original Unicode specification, which
> > defined characters as fixed-width 16-bit entities.
>
> Meaning it can't handle characters outside the BMP.
>
> However, from your example in "Decoding packets manually":
>
>     String file = "capturefile.pcap";
>
>     ...
>
>     Pcap pcap = Pcap.openOffline(file, errbuf);
>
> it appears that you use Strings for pathnames. As per my
> earlier mail, pathnames seem to be Strings, hence
> UTF-16-encoded, so the pathnames you'll be handed are UTF-16,
> not UCS-2 (UCS-2 encodes only the BMP, with one 16-bit code
> unit per code point).
>
> > So in summary, I think the answer is that UTF-8 is supported on all/
> > most platforms and filesystem types right now.
>
> It's supported on UN*Xes where file names happen to be
> encoded in UTF-8. Mac OS X does that (in fact, that's all
> that's supported in HFS+, although, *on disk*, HFS+ uses, I
> think, UTF-16, but what you see in the UN*X APIs is UTF-8;
> the OS X SMB client assumes all file names are UTF-8, mapping
> them to UTF-16 over the wire and mapping stuff received from
> over the wire from UTF-16 back to UTF-8).
> Other UN*Xes probably allow other encodings, hence my comment
> about mapping from UTF-8 to the native file name encoding.
>
> On Windows, however, it's not going to work - on Windows, I
> don't think fopen() takes UTF-8-encoded pathnames, I think it
> takes pathnames encoded in whatever the current "code page"
> is. That means that there could be unopenable files (e.g.,
> if your current code page is an Asian DBCS code page, you
> probably won't be able to open a file named "Müller's network
> problem.pcap").
>
> You'd need pcap_wopen_offline(), or something such as that,
> to fully support Unicode pathnames.
>
> > The UTF-16 which is what my user is
> > using for some Chinese characters in filename will not work with
> > libpcap's pcap_open_offline(). The platform he is on is Ubuntu
>
> ...which, being a Linux distribution, and hence a UN*X,
> expects pathnames to be sequences of octets, with '/' as
> separators and '\0' as a terminator. Handing it a UTF-16
> string isn't going to work very well.
>
> *If* the file's name is encoded with UTF-8, handing it a
> UTF-8 string should work. If it's encoded in some other
> encoding, such as Big5:
>
> http://en.wikipedia.org/wiki/Big5
>
> or GB 2312:
>
> http://en.wikipedia.org/wiki/GB2312
>
> it probably won't work.
>
> > I'm not sure what application created the file in the first place.
> > Maybe we can discern if fopen was used to create the file using
> > UTF-16 encoding or some other system call.
>
> Ultimately, the system call used to create the file was
> either open() or creat() (and the former is a superset of the
> latter); they take octet strings in some superset-of-ASCII
> encoding (UTF-8, ISO 8859/x, Big5, GB 2312, Shift JIS, etc.),
> so that all octets in the range 0x00 through 0x7F represent
> the corresponding ASCII character, and only octets with the
> 0x80 bit set are used to encode other characters.
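
The "superset-of-ASCII octet string" property described above can be
checked from the Java side before a name is handed to native code. The
following is only a sketch under stated assumptions: the class and
method names (PathnameOctets, containsNul, asciiOctetsPreserved) are
hypothetical, and it inspects one particular name in one particular
charset rather than proving anything about the charset in general.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of jNetPcap or libpcap.
class PathnameOctets {
    // True if the encoded name contains a 0x00 octet, which a C API
    // such as fopen() would treat as the end of the pathname.
    static boolean containsNul(String name, Charset cs) {
        for (byte b : name.getBytes(cs)) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    // True if the octets without the 0x80 bit, read in order, are exactly
    // the ASCII characters of the name, i.e. every other character was
    // encoded using only octets that have the 0x80 bit set.
    static boolean asciiOctetsPreserved(String name, Charset cs) {
        StringBuilder fromOctets = new StringBuilder();
        for (byte b : name.getBytes(cs)) {
            if ((b & 0x80) == 0) {
                fromOctets.append((char) b);
            }
        }
        StringBuilder fromChars = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c < 0x80) {
                fromChars.append(c);
            }
        }
        return fromOctets.toString().contentEquals(fromChars);
    }

    public static void main(String[] args) {
        // "Müller's network problem.pcap", written with an escape so the
        // example does not depend on the source file's own encoding.
        String name = "M\u00FCller's network problem.pcap";
        Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_16BE
        };
        for (Charset cs : charsets) {
            System.out.println(cs + ": containsNul=" + containsNul(name, cs)
                    + ", asciiOctetsPreserved=" + asciiOctetsPreserved(name, cs));
        }
    }
}
```

For a name like this one, UTF-8 and ISO 8859/1 keep the ASCII octets
intact and contain no 0x00 octet, whereas UTF-16 puts a 0x00 octet next
to every ASCII character, which is why a UTF-16 string cannot be passed
through a char * pathname API.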
>
> The issue probably doesn't involve UTF-16, as that's not an
> octet-string superset-of-ASCII encoding; it probably
> involves UTF-8 vs. some other encoding of Chinese.
>
> As for the other user who filed
>
> http://jnetpcap.com/node/456
>
> he's using Windows (as per the reference to MinGW and
> jnetpcap.dll), so his problem may ultimately be caused by the
> lack of pcap_wopen_offline().
>
> > Cheers,
> > mark..
> > http://jnetpcap.com
> >
> >
> > _______________________________________________
> > Winpcap-users mailing list
> > [email protected]
> > https://www.winpcap.org/mailman/listinfo/winpcap-users
>
