Re: Unprintable 8-bit characters
On Tue, 8 Nov 2011 23:04:25 -0600 (CST) Robert Bonomi bon...@mail.r-bonomi.com wrote:

Conrad J. Sabatier conr...@cox.net wrote:

(grin)

Yes, and this is one area where the labels are more than a little misleading as well. My natural inclination is to think of UTF-8 as being a single-byte representation for each character in the set, whereas UTF-16, as the name implies, would be the wide, 2-byte version.

Not exactly.

Nonetheless, as I posted earlier in this thread, according to the info in gucharmap, the representations of the umlauted u are just the opposite of this:

Not exactly. Again.

UTF-8: 0xC3 0xBC
UTF-16: 0x00FC

Go figure, huh? :-)

In UTF-16, everything _is_ built from 16-bit entities. Notice that 0x00FC has -four- nybbles after the '0x'. Every character boundary is on a multiple of 16 bits. (Characters outside the Basic Multilingual Plane take two such 16-bit units, a 'surrogate pair', but none of the Western characters discussed here do.)

Ah yes! I hadn't noticed that. What's really weird, as I mentioned in a later private email to Polytropon last night, the copy-and-paste in gucharmap suddenly decided to start copying the UTF-8 code instead of the UTF-16. I have no idea why that changed.

In UTF-8, the 'base' charset -- 7-bit US-ASCII, including the 'C0' control group -- is represented by a single byte. Everything above 0x7F, including the 'C1' group and the Latin-1 'extended' characters, is represented by two bytes, and characters further up the code space take three or four. Thus, 'characters' have a *variable*length* representation -- one to four bytes. A character, however many bytes it takes, can begin on -any- byte boundary within a data stream, depending on 'what came before it'.

UTF-8 multi-byte representations are designed such that one can jump to any _byte_ offset within the file, and determine -- by looking *only* at the value of that byte -- whether it is (a) a single-byte character, (b) the first byte of a multi-byte sequence, or (c) a continuation byte of a multi-byte sequence.

With UTF-16 you can (as long as no surrogate pairs are involved) position directly to any -character-, by jumping to a _byte_ offset that is twice the index of the character you want. Given a byte offset, you always know the 'equivalent' _character_ offset.

With UTF-8, you have to read the character stream, counting 'characters' as you go, to get to the desired point. You can seek to an arbitrary _byte_ offset, but you do not know how many 'characters' into the file that offset is.

I see. Yes, that could certainly complicate things.

UTF-8 vs. UTF-16 is a trade-off between 'compactness' (UTF-8), and simplicity of addressing/representation (UTF-16).

This seems rather unfortunate to me. You would think that, by now, some standard character set might have emerged that would allow one to use, at the very least, the Western characters (as opposed to the Eastern or Oriental or Asian, if you will) with a reasonable expectation that others will see what was intended.

Heh. How many 'character' codes are you willing to devote to national 'currency symbols', just for starters? Probable minimum of two per currency -- one for the minimum coinage unit (cent, pence, pfennig, etc.) and one for the denomination unit (dollar, pound, mark, kroner, etc.)

Now, one (obviously) has to have the basic 'Roman' alphabet. Then there are all the diacritical markings (accent, accent grave, dot umlaut, ring, bar, 'hat', inverted hat, etc.) for vowels. And cedilla, tilde, etc., for select consonants. Plus language-specific symbols like 'ess-zett', 'thorn', etc. How about phonetic symbols, like 'schwa'? And Greek for all sorts of scientific use? What about Cyrillic characters, for many Eastern European languages?

Now, consider punctuation marks: the 'typewriter' basics. How many of 'minus-sign, hyphen, em-dash, en-dash, soft-hyphen' are needed? How many of 'accent, accent grave, apostrophe, opening/closing single-quote' are needed? Opening/closing double-quotes, and/or a 'position neutral' double-quote?

Other symbols, like --
  digits, common fractions,
  'Trademark', 'Registered trademark', 'copyright',
  'paragraph', 'section',
  superscripts -- exponents, footnotes, etc.
  subscripts -- chemical formulae, etc.
  Simple line-drawing graphics
  Diphthongs?? Ligatures??

Start counting things up. An 8-bit 'address space' gets used up _really_ quick. (wry grin)

I certainly get the point. :-) Thanks for that very thorough elucidation. :-)

Now I just have to figure out what the heck's going on here, why suddenly I'm seeing the exact opposite of what I was seeing yesterday. Thought I had everything straightened out for a while there. :-(

Oh, this is madness! :-)

-- Conrad J. Sabatier conr...@cox.net
___ freebsd-questions@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-questions To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org
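For anyone who wants to check the byte-level rules from this message, here is a minimal Python sketch. It is only an illustration, not anything from gucharmap or FreeBSD: the function names classify_utf8_byte and utf8_char_count are made up for this note. It shows the two encodings of the umlauted u, and how a single byte tells you whether you are at the start of a character.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte of a UTF-8 stream by its value alone.

    Hypothetical helper for illustration; assumes well-formed UTF-8.
    """
    if b < 0x80:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if b <= 0xBF:
        return "continuation"  # 10xxxxxx: second or later byte of a sequence
    if 0xC2 <= b <= 0xF4:
        return "lead"          # first byte of a 2-, 3- or 4-byte sequence
    return "invalid"           # 0xC0, 0xC1, 0xF5..0xFF never occur in UTF-8


def utf8_char_count(data: bytes) -> int:
    """Count characters by counting every byte that starts one."""
    return sum(1 for b in data if classify_utf8_byte(b) != "continuation")


if __name__ == "__main__":
    u_umlaut = "\u00fc"
    print(u_umlaut.encode("utf-8").hex())      # the two UTF-8 bytes
    print(u_umlaut.encode("utf-16-be").hex())  # one 16-bit unit
    data = "Müsli".encode("utf-8")
    print(len(data), utf8_char_count(data))    # bytes vs. characters
```

The character count has to walk the whole byte string, which is the "read the stream, counting as you go" cost described above; the byte count is known up front.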
Re: Unprintable 8-bit characters
On Tue, 8 Nov 2011 20:59:48 -0600, Conrad J. Sabatier wrote:

Same here. I've been guilty as well of neglecting to properly adjust my console configuration.

Sometimes "it just works", in combination with laziness, beats all proper concepts of doing things. :-)

Doesn't using LC_ALL obviate the need to set any of the other LC_* variables? At least, that's always been my understanding of it.

I have to admit that I haven't fully understood everything in that relation, but your understanding matches the documentation: when $LC_ALL is set, it overrides all the other $LC_* variables. The individual $LC_* (!ALL) variables modify subsets only when $LC_ALL is unset, on top of whatever $LANG provides. That way languages and character sets can be assigned independently (e. g. English program messages, but German file names properly displayed).

But, getting back to something you said earlier, what did you mean exactly about the precedence of LANG vs. LC_*?

The order, if I remember correctly, is: $LC_ALL wins whenever it is set, then the individual $LC_* variable for the category in question, and $LANG is only the fallback when neither of those is set. http://www.freebsd.org/doc/handbook/using-localization.html See 24.3.4.1.1.1 and 24.3.4.1.2.

Yes, and this is one area where the labels are more than a little misleading as well. My natural inclination is to think of UTF-8 as being a single-byte representation for each character in the set, whereas UTF-16, as the name implies, would be the wide, 2-byte version. Nonetheless, as I posted earlier in this thread, according to the info in gucharmap, the representations of the umlauted u are just the opposite of this:

UTF-8: 0xC3 0xBC
UTF-16: 0x00FC

Go figure, huh? :-)

I think Robert did explain it very well: While UTF-16 works in 2-byte units (one unit for all the characters we are talking about here), UTF-8 is variable width (one byte for US-ASCII, two for the umlauts, up to four for other characters).

But returning to the original question, I think Robert did explain it very well: There is no real consensus about what the different codings should mean.
They were meant to unify the representation of a very large set of characters, but basically there are many interpretations now, and how they show up to the user depends on the font in use, _if_ it has this mapping or that, or none.

This seems rather unfortunate to me. You would think that, by now, some standard character set might have emerged that would allow one to use, at the very least, the Western characters (as opposed to the Eastern or Oriental or Asian, if you will) with a reasonable expectation that others will see what was intended.

Assumptions, wishes, conclusions and hopes do differ from reality. :-)

For example, in October I had to assist working on a document containing German text and Chinese symbols. Decision: We use UTF-8 so the Chinese symbols can appear in the input. A name: Weng Tonghe [][][]. The brackets should symbolize the three characters for that name. They did show up properly in the editor, but on the printed page... Weng Tonghe [][]. What? Two? But there were three on input! As we found out, the "he" used in input was the wrong one (there are several characters read "he"), and the font used to render the text did not have that particular "he". When we found the correct one, finally three characters appeared, as intended and correct.

This should show: You _never_ know where things are wrong when something is missing - settings, fonts, who knows. In relation to file names, this is not a problem of the file system as it will store any name you want, but whether you can actually SEE or USE that file name - that's a completely different thing.

Again a fine demonstration why file names should be limited to printable ASCII and no spaces if you want them to work everywhere. :-)

Well, for myself, personally, I'm a bit of a stickler for language authenticity, you might call it.
Having studied both German and French rather extensively in my younger days, I'm quite fond of both languages, and rather keen on seeing them represented accurately (I especially wince at the use of the plain, unaccented vowel followed by an e in place of the umlaut, and to a lesser degree, the use of ss in place of Esszett), which has caused me no small amount of confusion, aggravation and frustration over the years, to be sure! :-)

Make sure to call it Eszett (Es = S and Zett = Z). The teletyping convention suggests dissolving ß to sz, because it's easier to recombine sz to ß, as that is likely to be correct, whereas recombining ss to ß is often wrong, as there are too many correct ss in texts.

Example: Mißwirtschaft - Miszwirtschaft - Mißwirtschaft === good. Messer - Meßer === wrong. In names (e. g. of towns): Staßfurt (right) != Stassfurt (wrong).

Note that !(sz - ß) in all cases, and !(ss - ß) as well, as the rule states that only a non-truncatable ss is to be set as Eszett. There are only a few sz that are real 'sz', typically in word gaps, e. g. Reiszange. :-)

The funny things start when diacritic marks and other non-US-ASCII representable elements change the meaning of a word. In such cases, it's often justified to use the proper localized representation. However, this is also the point
Re: Unprintable 8-bit characters
It's worth noting, too, that most of the non-Unicode encoding systems predate the Internet. When computers weren't really talking to each other, there was no real emphasis on interoperability, and every OS tended to come up with its own way of encoding foreign languages. Languages like French, German, and English generally have it easy -- almost everything ended up being Latin1 (aka ISO 8859-1). For other languages it can be much more complicated. There are at least three commonly used encoding systems for Chinese. Unicode is gradually winning, but you'll still find, for example, a lot of Chinese documents in GB2312 and Big5.
Re: Unprintable 8-bit characters
On Tue, 8 Nov 2011 18:42:36 -0600, Conrad J. Sabatier wrote: I've been trying to understand what the deal is with regards to the displaying of the extended 8-bit character set, i.e., 8-bit characters with the MSB set. Quite simply Unix dates from the days where the 8th bit was used as a 'parity' bit. Allowing detection of *all* single-bit errors -- especially over the notoriously un-reliable connections known as 'serial ports'. More specifically, I'm trying to figure out how to get the ls command to properly display filenames containing characters in this extended set. I have some MP3 files, for instance, whose names contain certain European characters, such as the lowercase u with umlaut (code 0xfc in the Latin set, according to gucharmap), that I just can't get ls to display properly. These characters seem to be considered by ls as unprintable, and the best I've been able to produce in the ls output is backslash interpretations of the characters using either the -B or -b options, otherwise the default ? is displayed in their place. The strange thing is that these characters will display just fine in xterm, gnome-terminal, etc. I can copy and paste them from the gucharmap utility into a shell command line or other application, and they appear as they should, but ls simply refuses to display them. I can print them using the printf command, even bash's builtin echo seems to have no problem with them. Only ls appears to have this problem. I've experimented with using various locales, using the LC_* variables, as well as the LANG variable (as documented in the environment section of the ls man page), all to no avail. Obviously you never read as far as the '-w' switch. grin Is this an inherent limitation of ls, It is -not- a limitation; rather it is a _desired_ behavior -- so that one can _tell_ where there is an 'unprintable' character (like \r, or\b) in a filename. There are *good*reasons*(TM) why -q is the default behavior for 'terminal' output. 
or is there some workaround or other solution? Do we need a new en_*.UTF-16 locale? Should we consider extending the ls command to handle these characters?

There _are_ improved versions of ls that do understand the 'locale' environment variables -- but those programs introduce a whole bunch of *other* 'not necessarily desired' behaviors -- like sorting upper-case and lower-case letters as 'equals', rather than regarding any upper-case as sorting before any lowercase.

Or is there just something about all of this that I'm just not getting? As an additional note, I notice that in the text console, this same character code (0xfc) produces an entirely different character (a lowercase n in a raised position, as for the exponent in a mathematical expression). Is there, in fact, no standardization re: the representation of these high bit characters?

"The nice thing about standards is that there are so many to choose from" applies. WITH A VENGEANCE!! There are at least FIFTEEN different sets of glyphs for the 'high bit set' byte codes *JUST* for the 'iso-8859' base charset. Plus 'utf-8'. And not counting the various bastardizations (e.g. 'CP-1252', etc.) that Microsoft has introduced.

Thanks to anyone who can help clear up this long-standing mystery for me.

Reading the fine manpage -- with particular attention to the '-q' and '-w' options -- should provide some enlightenment.
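As a rough illustration of the '-q' behavior described in this message, here is a simplified Python model. It is an assumption-laden sketch, not the actual ls source: the helper names is_printable and ls_q are invented, and the "printable" ranges are hard-coded approximations of what a C locale and an ISO 8859-1 locale would report.

```python
def is_printable(byte: int, charset: str = "C") -> bool:
    """Approximate the locale's notion of a printable byte.

    Assumption: the C locale accepts only 0x20..0x7E, and an
    ISO 8859-1 locale additionally accepts 0xA0..0xFF.
    """
    if 0x20 <= byte <= 0x7E:
        return True
    if charset == "ISO8859-1":
        return 0xA0 <= byte <= 0xFF
    return False


def ls_q(name: bytes, charset: str = "C") -> str:
    """Show a file name the way 'ls -q' would: '?' for unprintables."""
    # chr(b) is correct here because ISO 8859-1 byte values coincide
    # with the first 256 Unicode code points.
    return "".join(chr(b) if is_printable(b, charset) else "?" for b in name)


if __name__ == "__main__":
    print(ls_q(b"M\xfcller.mp3"))               # C locale
    print(ls_q(b"M\xfcller.mp3", "ISO8859-1"))  # Latin-1 locale
    print(ls_q(b"bad\rname"))                   # embedded carriage return
```

The last case is the reason for the default: a carriage return or backspace in a name becomes visible instead of silently rearranging the terminal output.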
Re: Unprintable 8-bit characters
On 09.11.2011 at 01:42, Conrad J. Sabatier conr...@cox.net wrote:

Pardon me if this may seem like a stupid question, but this is something that's been bugging me for a long time, and none of my research has turned up anything useful yet. I've been trying to understand what the deal is with regards to the displaying of the extended 8-bit character set, i.e., 8-bit characters with the MSB set. More specifically, I'm trying to figure out how to get the ls command to properly display filenames containing characters in this extended set. I have some MP3 files, for instance, whose names contain certain European characters, such as the lowercase u with umlaut (code 0xfc in the Latin set, according to gucharmap), that I just can't get ls to display properly. These characters seem to be considered by ls as unprintable, and the best I've been able to produce in the ls output is backslash interpretations of the characters using either the -B or -b options, otherwise the default ? is displayed in their place.

Unsure if I understand you correctly. (extended 8-bit character set with MSB? utf-16?) I'm confused by this charset stuff in general.

Assuming you want \0xfc displayed as ü,

    cat test.py
    python test.py
    ls -l

    #!/usr/local/bin/python
    # -*- coding: utf-8 -*-
    f=open('\xfc','w')
    f.close()

    total 2
    -rw-r--r-- 1 michael wheel 29 9 Nov 02:43 test.py
    -rw-r--r-- 1 michael wheel 0 9 Nov 02:44 ü

here is what works for me:

in my login class in /etc/login.conf:

    :charset=ISO-8859-1:\
    :lang=de_DE.ISO8859-1:\

``cap_mkdb /etc/login.conf'' after changes

in /etc/rc.conf:

    scrnmap=iso-8859-1_to_cp437
    font8x8=cp850-8x8
    font8x14=cp850-8x14
    font8x16=cp850-8x16

and in /etc/ttys, console type is set to ``cons25l1''

Regards, Michael
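A small Python sketch may help show why the same byte looks different in X and on the text console, and why a screen map such as iso-8859-1_to_cp437 matters. It only uses Python's standard codecs; the function name interpretations is made up for this note.

```python
def interpretations(raw: bytes) -> dict:
    """Decode the same bytes under several character sets.

    A value of None means the bytes are not valid in that encoding.
    """
    result = {}
    for codec in ("latin-1", "cp437", "utf-8"):
        try:
            result[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            result[codec] = None
    return result


if __name__ == "__main__":
    for codec, text in interpretations(b"\xfc").items():
        print(codec, repr(text))
```

Under ISO 8859-1 the byte 0xFC is the umlauted u, under the PC code page 437 it is a superscript n (the "raised n" seen on the console), and on its own it is not valid UTF-8 at all, which is the kind of input that makes tools complain about an illegal byte sequence.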
Re: Unprintable 8-bit characters
On Tue, 8 Nov 2011 19:17:27 -0600 (CST) Robert Bonomi bon...@mail.r-bonomi.com wrote: On Tue, 8 Nov 2011 18:42:36 -0600, Conrad J. Sabatier wrote: I've been trying to understand what the deal is with regards to the displaying of the extended 8-bit character set, i.e., 8-bit characters with the MSB set. Quite simply Unix dates from the days where the 8th bit was used as a 'parity' bit. Allowing detection of *all* single-bit errors -- especially over the notoriously un-reliable connections known as 'serial ports'. Ah, yes! The good old days. :-) More specifically, I'm trying to figure out how to get the ls command to properly display filenames containing characters in this extended set. I have some MP3 files, for instance, whose names contain certain European characters, such as the lowercase u with umlaut (code 0xfc in the Latin set, according to gucharmap), that I just can't get ls to display properly. These characters seem to be considered by ls as unprintable, and the best I've been able to produce in the ls output is backslash interpretations of the characters using either the -B or -b options, otherwise the default ? is displayed in their place. The strange thing is that these characters will display just fine in xterm, gnome-terminal, etc. I can copy and paste them from the gucharmap utility into a shell command line or other application, and they appear as they should, but ls simply refuses to display them. I can print them using the printf command, even bash's builtin echo seems to have no problem with them. Only ls appears to have this problem. I've experimented with using various locales, using the LC_* variables, as well as the LANG variable (as documented in the environment section of the ls man page), all to no avail. Obviously you never read as far as the '-w' switch. grin Yes, somehow that one went right past me. Haste makes waste! 
:-) Is this an inherent limitation of ls, It is -not- a limitation; rather it is a _desired_ behavior -- so that one can _tell_ where there is an 'unprintable' character (like \r, or\b) in a filename. There are *good*reasons*(TM) why -q is the default behavior for 'terminal' output. OK, I can see that. :-) or is there some workaround or other solution? Do we need a new en_*.UTF-16 locale? Should we consider extending the ls command to handle these characters? There _are_ improved versions of ls that do understand the 'locale' environment variables -- but those programs introduce a whole bunch of *other* 'not necessarily desired' behaviors -- like sorting upper-case and lower-case letters as 'equals', rather than regarding any upper-case as sorting before any lowercase. Well, *that* certainly won't do! That should be the exception, not the rule. Or is there just something about all of this that I'm just not getting? As an additional note, I notice that in the text console, this same character code (0xfc) produces an entirely different character (a lowercase n in a raised position, as for the exponent in a mathematical expression). Is there, in fact, no standardization re: the representation of these high bit characters? The nice thing about standards is that there are so many to choose from applies. WITH A VENGANCE!! There are at least FIFTEEN different sets of glyphs for the 'high bit set' byte codes *JUST* for the 'iso-8859' base charset. Plus 'utf-8' And not counting the various bastardiztions (e.g. 'CP-1252', etc.) that Microsoft has introduced. Thanks to anyone who can help clear up this long-standing mystery for me. Reading the fine manpage -- with particular attention to the '-q' and '-w' options should provie some enlightenment. Thank you very much. Some of this matched the suspicions I already had re: this matter. Don't know how I completely missed the -w switch. Mea culpa. 
:-) So, what would be the safest bet as far as the most universal representation for these characters? Something I've long wondered about when I've e-mailed people and copied/pasted these characters (are they really seeing what I'm seeing?). :-) -- Conrad J. Sabatier conr...@cox.net
Re: Unprintable 8-bit characters
On Tue, 8 Nov 2011 19:17:27 -0600 (CST) Robert Bonomi bon...@mail.r-bonomi.com wrote: On Tue, 8 Nov 2011 18:42:36 -0600, Conrad J. Sabatier wrote: I've been trying to understand what the deal is with regards to the displaying of the extended 8-bit character set, i.e., 8-bit characters with the MSB set. Quite simply Unix dates from the days where the 8th bit was used as a 'parity' bit. Allowing detection of *all* single-bit errors -- especially over the notoriously un-reliable connections known as 'serial ports'. More specifically, I'm trying to figure out how to get the ls command to properly display filenames containing characters in this extended set. I have some MP3 files, for instance, whose names contain certain European characters, such as the lowercase u with umlaut (code 0xfc in the Latin set, according to gucharmap), that I just can't get ls to display properly. These characters seem to be considered by ls as unprintable, and the best I've been able to produce in the ls output is backslash interpretations of the characters using either the -B or -b options, otherwise the default ? is displayed in their place. The strange thing is that these characters will display just fine in xterm, gnome-terminal, etc. I can copy and paste them from the gucharmap utility into a shell command line or other application, and they appear as they should, but ls simply refuses to display them. I can print them using the printf command, even bash's builtin echo seems to have no problem with them. Only ls appears to have this problem. I've experimented with using various locales, using the LC_* variables, as well as the LANG variable (as documented in the environment section of the ls man page), all to no avail. Obviously you never read as far as the '-w' switch. grin Just a quickie followup: Setting LC_ALL=en_US.UTF-8 and using ls -w was, in fact, the magic key (at least, in any of the X terminal apps; still getting the little exponential n in the console)! Thank you so much. 
I'll sleep much better tonight. :-) -- Conrad J. Sabatier conr...@cox.net
Re: Unprintable 8-bit characters
On Wed, 09 Nov 2011 02:51:31 +0100, Michael Ross wrote: Am 09.11.2011, 01:42 Uhr, schrieb Conrad J. Sabatier conr...@cox.net: Pardon me if this may seem like a stupid question, but this is something that's been bugging me for a long time, and none of my research has turned up anything useful yet. I've been trying to understand what the deal is with regards to the displaying of the extended 8-bit character set, i.e., 8-bit characters with the MSB set. More specifically, I'm trying to figure out how to get the ls command to properly display filenames containing characters in this extended set. I have some MP3 files, for instance, whose names contain certain European characters, such as the lowercase u with umlaut (code 0xfc in the Latin set, according to gucharmap), that I just can't get ls to display properly. These characters seem to be considered by ls as unprintable, and the best I've been able to produce in the ls output is backslash interpretations of the characters using either the -B or -b options, otherwise the default ? is displayed in their place. Unsure if I understand you correctly. (extended 8-bit character set with MSB? utf-16?) I'm confused by this charset stuff in general. Assuming you want \0xfc displayed as ü, cat test.py python test.py ls -l #!/usr/local/bin/python # -*- coding: utf-8 -*- f=open('\xfc','w') f.close() total 2 -rw-r--r-- 1 michael wheel 29 9 Nov 02:43 test.py -rw-r--r-- 1 michael wheel 0 9 Nov 02:44 ü here is what works for me: in my login class in /etc/login.conf: :charset=ISO-8859-1:\ :lang=de_DE.ISO8859-1:\ ``cap_mkdb /etc/login.conf'' after changes Ah, thanks - that seems to be the proper way to have the environmental variables set - instead of my (ab)use of setenv's in the csh config file. :-) Note the precedence of $LANG vs. $LC_* (as they can be used to configure things more precisely, e. g. regarding system messages or date formats; see example following). in /etc/rc.conf: scrnmap=iso-8859-1_to_cp437 Hm? CP437? Codepage? 
Isn't that some MS-DOS thing? I've never needed a screenmap to make extended characters (everything beyond US-ASCII) work.

    font8x8=cp850-8x8
    font8x14=cp850-8x14
    font8x16=cp850-8x16

and in /etc/ttys, console type is set to ``cons25l1''

I have a similar setting here, but that does _not_ work with UTF-8 encoded characters. If I want to use them, I have to change some environmental variables, from

    #---GERMAN/ENGLISH === DEFAULT
    setenv LC_ALL en_US.ISO8859-1
    setenv LC_MESSAGES en_US.ISO8859-1
    setenv LC_COLLATE de_DE.ISO8859-1
    setenv LC_CTYPE de_DE.ISO8859-1
    setenv LC_MONETARY de_DE.ISO8859-1
    setenv LC_NUMERIC de_DE.ISO8859-1
    setenv LC_TIME de_DE.ISO8859-1
    unsetenv LANG

to

    #---INTERNATIONAL-
    setenv LC_ALL en_US.UTF-8
    setenv LC_MESSAGES en_US.UTF-8
    setenv LC_COLLATE de_DE.UTF-8
    setenv LC_CTYPE de_DE.UTF-8
    setenv LC_MONETARY de_DE.UTF-8
    setenv LC_NUMERIC de_DE.UTF-8
    setenv LC_TIME de_DE.UTF-8
    setenv LANG de_DE.UTF-8

Then I can use UTF-8 characters inside rxvt-unicode. Of course, text mode console is limited to the first set of configuration, using the ISO 8859-1 character set. This worked long before UTF-8 arrived with the glorious idea that I should have 2 bytes where one is sufficient, to describe our (German) 6 umlauts and the Eszett ligature. :-)

Improper settings will result in [][] or "A-tilde three quarters upside-down question mark", depending on editor or terminal used.

But returning to the original question, I think Robert did explain it very well: There is no real consensus about what the different codings should mean. They were meant to unify the representation of a very large set of characters, but basically there are many interpretations now, and how they show up to the user depends on the font in use, _if_ it has this mapping or that, or none.

For running ls, -w is the right option to use - but IN COMBINATION with correct settings for the terminal emulation AND the presence of a font that will do.
Again a fine demonstration why file names should be limited to printable ASCII and no spaces if you want them to work everywhere. :-) -- Polytropon Magdeburg, Germany Happy FreeBSD user since 4.0 Andra moi ennepe, Mousa, ...
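On the precedence question raised in this message, the POSIX lookup order can be sketched in a few lines of Python. This is only a model of the rule, with a made-up function name (effective_locale), and it ignores implementation details such as how a particular libc validates locale names.

```python
def effective_locale(category: str, env: dict) -> str:
    """Return the locale a program would use for one category.

    POSIX order: LC_ALL first, then the category's own variable,
    then LANG; unset or empty variables are skipped, and the
    fallback is the "C" locale.
    """
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"


if __name__ == "__main__":
    mixed = {
        "LC_ALL": "en_US.UTF-8",
        "LC_COLLATE": "de_DE.UTF-8",
        "LANG": "de_DE.UTF-8",
    }
    print(effective_locale("LC_COLLATE", mixed))

    without_all = {k: v for k, v in mixed.items() if k != "LC_ALL"}
    print(effective_locale("LC_COLLATE", without_all))
    print(effective_locale("LC_MESSAGES", without_all))
```

One consequence: in a setup that sets LC_ALL together with individual LC_* variables, the individual ones are shadowed. To mix English messages with German collation, leave LC_ALL unset and set LANG plus the specific LC_* variables.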
Re: Unprintable 8-bit characters
On Tue, 8 Nov 2011 19:58:04 -0600, Conrad J. Sabatier wrote:

So, what would be the safest bet as far as the most universal representation for these characters? Something I've long wondered about when I've e-mailed people and copied/pasted these characters (are they really seeing what I'm seeing?). :-)

With lots of experience in how not to do it, I would like to suggest the following: Use US-ASCII letters only. This makes _sure_ they will display correctly everywhere and even under ultra-worst conditions (e. g. you are at a real serial console, a real DEC vt100). Filenames like kloesze_mit_muesli_foerdern_baerenhunger.mp3 can be processed by _any_ ls or mailer program. There is no need to worry about... hmmm... do they have the same character settings that I use? Do they have a font installed that can show the file names properly?

Rules:
  Substitute umlauts properly (*e).
  Substitute ß to sz (teletype convention).
  Remove accents or other marks completely, as well as strokes through characters or similar typographical specialities.
  If you can, use lowercase only.
  No spaces, use _ instead.
  Avoid any other special characters.

Make everything plain ASCII, and you can _still_ easily get the meaning.

The file system ITSELF doesn't care for the meaning of the characters. SAVING them and DISPLAYING them are two fully different things. Nobody stops you from making filenames like öÜÖß߀Łµ³¼`łøæſđ̣ĸ»¢.mp3, but they can cause trouble you can't predict. You _never_ know...

-- Polytropon Magdeburg, Germany Happy FreeBSD user since 4.0 Andra moi ennepe, Mousa, ...
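The renaming rules in this message are mechanical enough to sketch in Python. Treat this as one possible reading of the rules, not a finished tool: the name asciify and the substitution table are invented here, and characters that have no ASCII base form (stroked letters, currency signs and the like) are simply dropped rather than transliterated.

```python
import unicodedata

# German substitutions from the rules: umlaut -> vowel + e, Eszett -> sz.
SUBSTITUTIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "sz",
}


def asciify(name: str) -> str:
    """Turn a file name into lowercase US-ASCII without spaces."""
    # Compose first, so 'u' + combining diaeresis is seen as one 'ü'.
    composed = unicodedata.normalize("NFC", name)
    replaced = "".join(SUBSTITUTIONS.get(ch, ch) for ch in composed)
    # Decompose what is left and drop the combining marks (accents).
    decomposed = unicodedata.normalize("NFKD", replaced)
    unmarked = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything still outside ASCII is dropped in this sketch.
    ascii_only = unmarked.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower().replace(" ", "_")


if __name__ == "__main__":
    print(asciify("Klöße mit Müsli fördern Bärenhunger.mp3"))
    print(asciify("Café Señor.mp3"))
```

It does not handle the "avoid any other special characters" rule for ASCII punctuation; that would need an explicit allow-list.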
Re: Unprintable 8-bit characters
On Wed, 09 Nov 2011 02:51:31 +0100 Michael Ross g...@ross.cx wrote: Am 09.11.2011, 01:42 Uhr, schrieb Conrad J. Sabatier conr...@cox.net: Pardon me if this may seem like a stupid question, but this is something that's been bugging me for a long time, and none of my research has turned up anything useful yet. I've been trying to understand what the deal is with regards to the displaying of the extended 8-bit character set, i.e., 8-bit characters with the MSB set. More specifically, I'm trying to figure out how to get the ls command to properly display filenames containing characters in this extended set. I have some MP3 files, for instance, whose names contain certain European characters, such as the lowercase u with umlaut (code 0xfc in the Latin set, according to gucharmap), that I just can't get ls to display properly. These characters seem to be considered by ls as unprintable, and the best I've been able to produce in the ls output is backslash interpretations of the characters using either the -B or -b options, otherwise the default ? is displayed in their place. Unsure if I understand you correctly. (extended 8-bit character set with MSB? utf-16?) I'm confused by this charset stuff in general. That is to say, 8-bit characters with the most significant bit set, or characters greater than 0x7f. I can certainly appreciate your confusion; this is definitely a confusing area. 
In gucharmap, selecting the umlauted u in the Latin set, the Character Details tab reveals the following:

    U+00FC LATIN SMALL LETTER U WITH DIAERESIS

    General Character Properties
      In Unicode since: 1.1
      Unicode category: Letter, Lowercase

    Canonical decomposition:
      U+0075 LATIN SMALL LETTER U + U+0308 COMBINING DIAERESIS

    Various Useful Representations
      UTF-8: 0xC3 0xBC
      UTF-16: 0x00FC
      C octal escaped UTF-8: \303\274
      XML decimal entity: &#252;

So apparently, it's a two-byte character in UTF-8, which really throws a monkey wrench into the works in certain situations (for example, one of the little scripts I've written to process MP3 files uses the cut command, which complains about an illegal byte sequence).

Even more confusing, selecting the character and copying it to the clipboard, the UTF-16 representation (0xfc) is what actually gets used. Pasting this single-byte version into an X terminal (any of them: xterm, gnome-terminal, etc.) does display the correct character, an umlauted u, even if using an 8-bit locale, such as UTF-8. Majorly confusing!

Assuming you want \0xfc displayed as ü,

Yes, exactly.

    cat test.py
    python test.py
    ls -l

    #!/usr/local/bin/python
    # -*- coding: utf-8 -*-
    f=open('\xfc','w')
    f.close()

    total 2
    -rw-r--r-- 1 michael wheel 29 9 Nov 02:43 test.py
    -rw-r--r-- 1 michael wheel 0 9 Nov 02:44 ü

here is what works for me:

in my login class in /etc/login.conf:

    :charset=ISO-8859-1:\
    :lang=de_DE.ISO8859-1:\

``cap_mkdb /etc/login.conf'' after changes

in /etc/rc.conf:

    scrnmap=iso-8859-1_to_cp437
    font8x8=cp850-8x8
    font8x14=cp850-8x14
    font8x16=cp850-8x16

and in /etc/ttys, console type is set to ``cons25l1''

Thanks, I hadn't considered making those sorts of changes for the console. I work so seldom nowadays in the console, I'd forgotten all about that stuff (use it or lose it, as they say!). I'll certainly give that a try.

Much appreciation for both your and Robert's replies.

-- Conrad J.
Sabatier conr...@cox.net
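The representations that gucharmap lists for U+00FC can be reproduced with Python's standard library, which may make it clearer that they are all views of the same code point. This is just a sketch; the function name representations and the dictionary keys are invented for this note.

```python
import unicodedata


def representations(ch: str) -> dict:
    """Compute several common representations of one character."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    # Split the big-endian UTF-16 bytes into 16-bit code units.
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    return {
        "code_point": "U+%04X" % ord(ch),
        "name": unicodedata.name(ch),
        "utf8": " ".join("0x%02X" % b for b in utf8),
        "utf16": " ".join("0x%04X" % u for u in units),
        "c_octal": "".join("\\%03o" % b for b in utf8),
        "xml_decimal": "&#%d;" % ord(ch),
        "decomposition": ["U+%04X" % ord(c)
                          for c in unicodedata.normalize("NFD", ch)],
    }


if __name__ == "__main__":
    for key, value in representations("\u00fc").items():
        print(key, value)
```

Note that the code point (0xFC) happens to equal the ISO 8859-1 byte value, which is why the "UTF-16" figure looks like the familiar Latin-1 code.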
Re: Unprintable 8-bit characters
--As of November 8, 2011 7:58:04 PM -0600, Conrad J. Sabatier is alleged to have said:

So, what would be the safest bet as far as the most universal representation for these characters? Something I've long wondered about when I've e-mailed people and copied/pasted these characters (are they really seeing what I'm seeing?). :-)

--As for the rest, it is mine.

These days, the safest bet is UTF-8, or some other Unicode character set, in something that can convey what character set it is in. (Email can, depending on the mail client.)

Not that Unicode is universal yet, but it is designed to be (and is, generally) a solution to the 'multiple character encodings' problem. (By, of course, defining a new encoding.) It has a decent amount of traction, and in a decade or so - once other options have been firmly deprecated - I'd expect we could start discussing whether to switch ls to using it by default. ;)

All this is of course if you *must* go beyond 7-bit ASCII. (Which all forms of Unicode are designed to be a strict superset of.)

Daniel T. Staal

---
This email copyright the author. Unless otherwise noted, you are expressly allowed to retransmit, quote, or otherwise use the contents for non-commercial purposes. This copyright will expire 5 years after the author's death, or in 30 years, whichever is longer, unless such a period is in excess of local copyright law.
---
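As a rough illustration of "something that can convey what character set it is in", here is a sketch using Python's standard `email` package. The message body and subject are placeholders, and this is just one way a mail client might label its text; it is not taken from any client mentioned in the thread.

```python
from email.message import EmailMessage

# Build a message whose body contains a non-ASCII character and
# declare its encoding explicitly in the Content-Type header.
msg = EmailMessage()
msg["Subject"] = "Re: Unprintable 8-bit characters"
msg.set_content("Starship Trooper: c. W\u00fcrm\n", charset="utf-8")

# The charset travels with the message, so the receiving client
# does not have to guess how to interpret the bytes.
declared = msg.get_content_charset()
print(msg["Content-Type"])

# Text limited to 7-bit ASCII encodes to the same bytes in UTF-8.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
```

Whether the recipient actually sees the intended character still depends on their client honouring the declared charset and having a suitable font.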
Re: Unprintable 8-bit characters
On Wed, 9 Nov 2011 03:10:24 +0100 Polytropon free...@edvax.de wrote:

On Wed, 09 Nov 2011 02:51:31 +0100, Michael Ross wrote:

On 09.11.2011, at 01:42, Conrad J. Sabatier conr...@cox.net wrote:

[snip] I've been trying to understand what the deal is with regards to the displaying of the extended 8-bit character set, i.e., 8-bit characters with the MSB set. [snip]

Unsure if I understand you correctly. (extended 8-bit character set with MSB? utf-16?) I'm confused by this charset stuff in general.

Assuming you want \0xfc displayed as ü, [snip] here is what works for me:

in my login class in /etc/login.conf:

  :charset=ISO-8859-1:\
  :lang=de_DE.ISO8859-1:\

``cap_mkdb /etc/login.conf'' after changes

Ah, thanks - that seems to be the proper way to have the environment variables set - instead of my (ab)use of setenv's in the csh config file. :-)

Same here. I've been guilty as well of neglecting to properly adjust my console configuration.

Note the precedence of $LANG vs. $LC_* (as they can be used to configure things more precisely, e. g. regarding system messages or date formats; see example following).

in /etc/rc.conf:

  scrnmap=iso-8859-1_to_cp437

Hm? CP437? Codepage? Isn't that some MS-DOS thing? I've never needed a screenmap to make extended characters (everything beyond US-ASCII) work.

  font8x8=cp850-8x8
  font8x14=cp850-8x14
  font8x16=cp850-8x16

and in /etc/ttys, console type is set to ``cons25l1''

I have a similar setting here, but that does _not_ work with UTF-8 encoded characters.
If I want to use them, I have to change some environment variables, from

  #---GERMAN/ENGLISH === DEFAULT
  setenv LC_ALL en_US.ISO8859-1
  setenv LC_MESSAGES en_US.ISO8859-1
  setenv LC_COLLATE de_DE.ISO8859-1
  setenv LC_CTYPE de_DE.ISO8859-1
  setenv LC_MONETARY de_DE.ISO8859-1
  setenv LC_NUMERIC de_DE.ISO8859-1
  setenv LC_TIME de_DE.ISO8859-1
  unsetenv LANG

to

  #---INTERNATIONAL-
  setenv LC_ALL en_US.UTF-8
  setenv LC_MESSAGES en_US.UTF-8
  setenv LC_COLLATE de_DE.UTF-8
  setenv LC_CTYPE de_DE.UTF-8
  setenv LC_MONETARY de_DE.UTF-8
  setenv LC_NUMERIC de_DE.UTF-8
  setenv LC_TIME de_DE.UTF-8
  setenv LANG de_DE.UTF-8

Doesn't using LC_ALL obviate the need to set any of the other LC_* variables? At least, that's always been my understanding of it. But, getting back to something you said earlier, what did you mean exactly about the precedence of LANG vs. LC_*?

Then I can use UTF-8 characters inside rxvt-unicode. Of course, text mode console is limited to the first set of configuration, using the ISO 8859-1 character set. This worked long before UTF-8 arrived with the glorious idea that I should have 2 bytes where one is sufficient, to describe our (German) 6 umlauts and the Eszett ligature. :-)

<grin> Yes, and this is one area where the labels are more than a little misleading as well. My natural inclination is to think of UTF-8 as being a single-byte representation for each character in the set, whereas UTF-16, as the name implies, would be the wide, 2-byte version.

Nonetheless, as I posted earlier in this thread, according to the info in gucharmap, the representations of the umlauted u are just the opposite of this:

  UTF-8: 0xC3 0xBC
  UTF-16: 0x00FC

Go figure, huh? :-)

Improper settings will result in [][] or A-tilde three quarters upside-down question mark, depending on editor or terminal used.

Yes, I will definitely have to try using the recommendations that have come up in this thread re: the console.
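On the precedence question: POSIX specifies that LC_ALL, when set to a non-empty value, overrides every individual LC_* category, that the category variable comes next, and that LANG is only the fallback. The lookup can be sketched as follows; the function name `effective_locale` and the `env` dictionaries are illustrative only, and the final fallback to "C" is a simplification of "implementation default".

```python
def effective_locale(category: str, env: dict) -> str:
    """Resolve the locale for one category (e.g. 'LC_COLLATE').

    Lookup order per POSIX: LC_ALL, then the category variable,
    then LANG. Empty values count as unset.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # simplification: the implementation-defined default


# The UTF-8 settings quoted above: LC_ALL is set, so it wins for
# every category and the de_DE entries have no effect.
with_lc_all = {
    "LC_ALL": "en_US.UTF-8",
    "LC_COLLATE": "de_DE.UTF-8",
    "LANG": "de_DE.UTF-8",
}

# Without LC_ALL, the per-category value is used, and LANG fills
# in for categories that are not set individually.
without_lc_all = {
    "LC_MESSAGES": "en_US.UTF-8",
    "LANG": "de_DE.UTF-8",
}

print(effective_locale("LC_COLLATE", with_lc_all))
print(effective_locale("LC_MESSAGES", without_lc_all))
print(effective_locale("LC_TIME", without_lc_all))
```

So a mixed setup (English messages, German collation and dates) only takes effect when LC_ALL is left unset.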
But returning to the original question, I think Robert did explain it very well: There is no real consensus about what the different codings should mean. They were meant to unify the representation of a very large set of characters, but basically there are many interpretations now, and how they show up to the user depends on the font in use, _if_ it has this mapping or that, or none.

This seems rather unfortunate to me. You would think that, by now, some standard character set might have emerged that would allow one to use, at the very least, the Western characters (as opposed to the Eastern or Oriental or Asian, if you will) with a reasonable expectation that others will see what was intended.

For running ls, -w is the right option to use - but IN COMBINATION with correct settings for the terminal emulation AND the presence of a font that will do.

Yes. I'm still a little embarrassed for having completely overlooked that option earlier. Hasty (impatient) reading of man pages. :-)

Again a fine demonstration why file names should be limited to printable ASCII and no spaces if you want them
Re: Unprintable 8-bit characters
On Tue, 08 Nov 2011 21:27:16 -0400 Daniel Staal dst...@usa.net wrote:

--As of November 8, 2011 7:58:04 PM -0600, Conrad J. Sabatier is alleged to have said:

So, what would be the safest bet as far as the most universal representation for these characters? Something I've long wondered about when I've e-mailed people and copied/pasted these characters (are they really seeing what I'm seeing?). :-)

--As for the rest, it is mine.

These days, the safest bet is UTF-8, or some other Unicode character set, in something that can convey what character set it is in. (Email can, depending on the mail client.)

Not that Unicode is universal yet, but it is designed to be (and is, generally) a solution to the 'multiple character encodings' problem. (By, of course, defining a new encoding.) It has a decent amount of traction, and in a decade or so - once other options have been firmly deprecated - I'd expect we could start discussing whether to switch ls to using it by default. ;)

All this is of course if you *must* go beyond 7-bit ASCII. (Which all forms of Unicode are designed to be a strict superset of.)

That sounds sane and sensible. :-)

I've adjusted my environment to include:

  export LANG=en_US.UTF-8
  export LC_ALL=en_US.UTF-8

And also adjusted my console configuration to display these characters:

  font8x14=iso-8x14
  font8x16=iso-8x16
  font8x8=iso-8x8

And, last but not least, aliased ls to ensure these characters will actually be displayed:

  alias ls='ls -Fw'

Looking good here now:

  conrads:~$ cd Music/Progressive Rock/Yes/The Yes Album
  conrads:~/Music/Progressive Rock/Yes/The Yes Album$ ls *03*
  Yes - The Yes Album - 03 - Starship Trooper: a. Life Seeker - b. Disillusion - c. Würm.mp3

Many thanks to everyone for all the very helpful, useful information.

--
Conrad J. Sabatier conr...@cox.net
Re: Unprintable 8-bit characters
On Tue, 8 Nov 2011 20:24:18 -0600 Conrad J. Sabatier conr...@cox.net wrote:

Even more confusing, selecting the character and copying it to the clipboard, the UTF-16 representation (0xfc) is what actually gets used. Pasting this single-byte version into an X terminal (any of them: xterm, gnome-terminal, etc.) does display the correct character, an umlauted u, even if using an 8-bit locale, such as UTF-8. Majorly confusing!

Just realized on reading this how weird it sounds. What I was getting at here was that the (single-byte) UTF-16 code displays the correct character in a UTF-8 locale, even though the UTF-8 code for the character is supposedly a 2-byte sequence.

Anyway, enough about that. I've managed to get the results I was hoping for now, so I'm satisfied. :-)

Thanks again for all the responses.

--
Conrad J. Sabatier conr...@cox.net
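One possible way to untangle this, offered only as a guess since the thread never establishes what gucharmap or the terminals actually exchanged: 0xFC here may be the code point number (U+00FC) rather than a raw byte, and each program encodes that code point for its own locale. The Python sketch below separates the two notions; the variable names are illustrative.

```python
# The *code point* U+00FC is an abstract number, not a byte.
code_point = 0xFC
char = chr(code_point)

# Encoding that code point for a UTF-8 locale yields two bytes.
utf8 = char.encode("utf-8")

# A lone raw byte 0xFC, by contrast, is not valid UTF-8 at all,
# which is the kind of input a tool may reject as an illegal
# byte sequence.
try:
    b"\xfc".decode("utf-8")
    lone_byte_valid = True
except UnicodeDecodeError:
    lone_byte_valid = False

# The same raw byte is a perfectly good ISO 8859-1 character.
latin1 = b"\xfc".decode("iso-8859-1")

print(utf8, lone_byte_valid, latin1 == char)
```

Under that reading, a paste that "displays correctly" in a UTF-8 terminal would have been delivered as the two-byte sequence, not as the single byte.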
Re: Unprintable 8-bit characters
Conrad J. Sabatier conr...@cox.net wrote:

<grin> Yes, and this is one area where the labels are more than a little misleading as well. My natural inclination is to think of UTF-8 as being a single-byte representation for each character in the set, whereas UTF-16, as the name implies, would be the wide, 2-byte version.

Not exactly.

Nonetheless, as I posted earlier in this thread, according to the info in gucharmap, the representations of the umlauted u are just the opposite of this:

not exactly. Again.

  UTF-8: 0xC3 0xBC
  UTF-16: 0x00FC

Go figure, huh? :-)

In UTF-16, everything _is_ a 16-bit entity. Notice that 0x00FC has -four- nybbles after the '0x.' Every character boundary is on a multiple of 16 bits.

In UTF-8, the 'base' charset -- the 'C0' and 'C1' groups are represented by a single byte. 'extended' characters are represented by two bytes. Thus, 'characters' have a *variable*length* representation -- one or two bytes. A character, whether it is represented by one or two bytes, can begin on -any- byte boundary within a data stream, depending on 'what came before it'.

UTF-8 2-byte representations are designed such that one can jump to any _byte_ offset within the file, and determine -- by looking *only* at the value of that byte -- whether it is (a) a single-byte character, (b) the first byte of a two-byte sequence, or (c) the second byte of a two-byte sequence.

With UTF-16 you can position directly to any -character-, by jumping to a _byte_ offset that is twice the index of the character you want. Given a byte offset, you always know the 'equivalent' _character_ offset.

With UTF-8, you have to read the character stream, counting 'characters' as you go, to get to the desired point. You can seek to an arbitrary _byte_ offset, but you do not know how many 'characters' into the file that offset is.

UTF-8 vs. UTF-16 is a trade-off between 'compactness' (UTF-8), and simplicity of addressing/representation (UTF-16).

This seems rather unfortunate to me. You would think that, by now, some standard character set might have emerged that would allow one to use, at the very least, the Western characters (as opposed to the Eastern or Oriental or Asian, if you will) with a reasonable expectation that others will see what was intended.

Heh. How many 'character' codes are you willing to devote to national 'currency symbols', just for starters? Probable minimum of two per currency -- one for the minimum coinage unit (cent, pence, pfennig, etc.) and one for the denomination unit (dollar, pound, mark, kroner, etc.)

Now, one (obviously) has to have the basic 'Roman' alphabet.

Then there are all the diacritical markings (accent, accent grave, dot umlaut, ring, bar, 'hat', inverted hat, etc.) for vowels. And cedilla, tilde, etc., for select consonants. Plus language specific symbols like ess-zett, 'thorn', etc.

How about phonetic symbols, like 'schwa'? And Greek for all sorts of scientific use? What about Cyrillic characters, for many Eastern European languages?

Now, consider punctuation marks: the 'typewriter' basics,
  How many of 'minus-sign, hyphen, em-dash, en-dash, soft-hyphen' are needed?
  How many of 'accent, accent grave, apostrophe, opening/closing single-quote' are needed?
  opening/closing double-quotes, and/or a 'position neutral' double-quote?

Other symbols, like --
  digits,
  common fractions,
  'Trademark', 'Registered trademark', 'copyright'
  'paragraph', 'section',
  superscripts -- exponents, footnotes, etc.
  subscripts -- chemical formulae, etc.
  Simple line-drawing graphics
  Diphthongs??
  Ligatures??

Start counting things up. An 8-bit 'address space' gets used up _really_ quick.

<wry grin>
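The byte-classification and seeking argument above can be sketched in a few lines of Python. Note one point beyond what the post says: UTF-8 as specified also has three- and four-byte sequences, and only the 7-bit ASCII range is single-byte, so the sketch classifies bytes by bit pattern rather than assuming a maximum length of two. Likewise the "twice the index" rule for UTF-16 holds only for characters in the Basic Multilingual Plane. The function names `classify` and `char_offset` are made up for illustration.

```python
def classify(byte: int) -> str:
    """Classify one UTF-8 byte by looking only at its value."""
    if byte < 0x80:
        return "single"        # 0xxxxxxx: a complete ASCII character
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx: inside a multi-byte sequence
    # Simplification: everything from 0xC0 up is treated as a lead
    # byte; a validating decoder would reject some of these values.
    return "lead"


def char_offset(data: bytes, byte_offset: int) -> int:
    """Count the characters that start before byte_offset.

    UTF-8 offers no shortcut: the bytes have to be scanned.
    """
    return sum(
        1 for b in data[:byte_offset] if classify(b) != "continuation"
    )


text = "W\u00fcrm.mp3"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")

kinds = [classify(b) for b in utf8]
print(kinds)

# UTF-8: byte offset 3 is where 'r' starts, two characters in.
print(char_offset(utf8, 3))

# UTF-16 (BMP only): character i sits at byte offset 2 * i.
index = 1
unit = utf16[2 * index : 2 * index + 2]
print(unit)
```

The length difference is the compactness side of the trade-off: this file name takes 9 bytes in UTF-8 and 16 in UTF-16.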