Re: Unprintable 8-bit characters

2011-11-10 Thread Conrad J. Sabatier
On Tue, 8 Nov 2011 23:04:25 -0600 (CST)
Robert Bonomi bon...@mail.r-bonomi.com wrote:

 
 Conrad J. Sabatier conr...@cox.net wrote:
 
  grin
 
  Yes, and this is one area where the labels are more than a little
  misleading as well.  My natural inclination is to think of UTF-8 as
  being a single-byte representation for each character in the set,
  whereas UTF-16, as the name implies, would be the wide, 2-byte
  version.
 
 Not exactly.
 
  Nonetheless, as I posted earlier in this thread, according to the
  info in gucharmap, the representations of the umlauted u are just
  the opposite of this:
 
 not exactly. Again.
 
  UTF-8: 0xC3 0xBC
  UTF-16: 0x00FC
   
  Go figure, huh?  :-)
 
 In UTF-16, everything _is_ a 16-bit entity.  Notice that 0x00FC has
 -four- nybbles after the '0x.'  Every character boundary is on a
 multiple of 16 bits.

Ah yes!  I hadn't noticed that.

What's really weird, as I mentioned in a later private email to
Polytropon, last night, the copy-and-paste in gucharmap suddenly
decided to start copying the UTF-8 code instead of the UTF-16.  I have
no idea why that changed.
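
For reference, the two representations gucharmap reports for U+00FC can be
reproduced with a few lines of Python. This is only an illustrative sketch;
the variable names are made up for the example and have nothing to do with
gucharmap itself.

```python
# Encode U+00FC (LATIN SMALL LETTER U WITH DIAERESIS) in three encodings.
ch = "\u00fc"

utf8 = ch.encode("utf-8")         # two 8-bit code units
utf16 = ch.encode("utf-16-be")    # one 16-bit code unit, big-endian, no BOM
latin1 = ch.encode("iso-8859-1")  # one byte, the legacy "Latin" value

for label, data in (("UTF-8", utf8), ("UTF-16", utf16), ("ISO 8859-1", latin1)):
    # Print each encoded form as space-separated hex bytes.
    print(label, " ".join("0x%02X" % b for b in data))
```

The "8" and "16" in the names refer to the size of one code unit, not to the
size of a whole character, which is why the UTF-8 form can be the longer one.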

 In UTF-8, the 'base' charset -- the 7-bit ASCII range, including the
 'C0' controls -- is represented by a single byte.  'extended'
 characters are represented by two to four bytes.  Thus, 'characters'
 have a *variable*length* representation.  A character, whether it is
 represented by one byte or several, can begin on -any- byte boundary
 within a data stream, depending on 'what came before it'.  UTF-8
 multi-byte representations are designed such that one can jump to any
 _byte_ offset within the file, and determine -- by looking *only* at
 the value of that byte -- whether it is (a) a single-byte character,
 (b) the first byte of a multi-byte sequence, or (c) a continuation
 byte of a multi-byte sequence.
 
 With UTF-16 you can position directly to any -character- (as long as
 the text stays within the 16-bit 'basic' range, i.e. no surrogate
 pairs), by jumping to a _byte_ offset that is twice the index of the
 character you want.  Given a byte offset, you always know the
 'equivalent' _character_ offset.
 
 With UTF-8, you have to read the character stream, counting
 'characters' as you go, to get to the desired point.  You can seek to
 an arbitrary _byte_ offset, but you do not know how many 'characters'
 into the file that offset is.

I see.  Yes, that could certainly complicate things.
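
The byte test Robert describes can be sketched in Python. This is an
illustration only, not a validator: it classifies a single byte by its high
bits and counts characters by skipping continuation bytes. Note that UTF-8
sequences can be up to four bytes long, so the sketch distinguishes lead
bytes by sequence length. The function names are invented for this example.

```python
def classify(byte: int) -> str:
    """Classify one byte of a UTF-8 stream by looking only at its value."""
    if byte < 0x80:
        return "single"        # 0xxxxxxx: a complete ASCII character
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx: inside a multi-byte sequence
    if byte < 0xE0:
        return "lead2"         # 110xxxxx: starts a 2-byte sequence
    if byte < 0xF0:
        return "lead3"         # 1110xxxx: starts a 3-byte sequence
    if byte < 0xF8:
        return "lead4"         # 11110xxx: starts a 4-byte sequence
    return "invalid"           # 0xF8..0xFF never occur in UTF-8


def count_chars(data: bytes) -> int:
    """Count characters by counting every byte that is not a continuation.

    Assumes the input is well-formed UTF-8; no validation is done.
    """
    return sum(1 for b in data if classify(b) != "continuation")


name = "M\u00fcller".encode("utf-8")  # 6 characters, 7 bytes
print(len(name), count_chars(name))
```

Finding the n-th character still means scanning from the start, which is the
cost Robert refers to; only the per-byte classification is constant time.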

 UTF-8 vs. UTF-16 is a trade-off between 'compactness' (UTF-8), and 
 simplicity of addressing/representation (UTF-16).
 
  This seems rather unfortunate to me.  You would think that, by now,
  some standard character set might have emerged that would allow
  one to use, at the very least, the Western characters (as opposed
  to the Eastern or Oriental or Asian, if you will) with a
  reasonable expectation that others will see what was intended.
 
 Heh. 
 
 How many 'character' codes are you willing to devote to national
 'currency symbols', just for starters?  Probable minimum of two per
 currency -- one for the minimum coinage unit (cent, pence, pfennig,
 etc.) and one for the denomination unit (dollar, pound, mark, kroner,
 etc.)
 
 Now, one (obviously) has to have the basic 'Roman' alphabet. 
 
 Then there are all the diacritical markings (accent, accent grave, dot,
 umlaut, ring, bar, 'hat', inverted hat, etc.) for vowels.  And
 cedilla, tilde, etc., for select consonants.  Plus language specific
 symbols like ess-zett, 'thorn', etc.
 
 How about phonetic symbols, like 'schwa' ?
 
 And Greek for all sorts of scientific use?
 
 What about Cyrillic characters, for many Eastern European languages?
 
 Now, consider punctuation marks:
    the 'typewriter' basics,
    How many of 'minus-sign, hyphen, em-dash, en-dash, soft-hyphen'
      are needed? How many of 'accent, accent grave, apostrophe,
      opening/closing single-quote' are needed?
    opening/closing double-quotes, and/or a 'position neutral'
      double-quote?
 
 Other symbols, like --
    digits,
    common fractions,
    'Trademark', 'Registered trademark', 'copyright'
    'paragraph', 'section',
    superscripts -- exponents, footnotes, etc.
    subscripts -- chemical formulae, etc.
    Simple line-drawing graphics
 
 Diphthongs??  Ligatures??
 
 Start counting things up. 
 
 An 8-bit 'address space' gets used up _really_ quick.
 
 <wry grin>

I certainly get the point.  :-)  Thanks for that very thorough
elucidation.  :-)

Now I just have to figure out what the heck's going on here, why
suddenly I'm seeing the exact opposite of what I was seeing yesterday.
Thought I had everything straightened out for a while there.  :-(

Oh, this is madness!  :-)

-- 
Conrad J. Sabatier
conr...@cox.net
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org


Re: Unprintable 8-bit characters

2011-11-09 Thread Polytropon
On Tue, 8 Nov 2011 20:59:48 -0600, Conrad J. Sabatier wrote:
 Same here.  I've been guilty as well of neglecting to properly adjust
 my console configuration.

Sometimes "just works" in combination with laziness beats
all proper concepts of doing things. :-)



 Doesn't using LC_ALL obviate the need to set any of the other LC_*
 variables?  At least, that's always been my understanding of it.

I have to admit that I haven't fully understood everything
in that relation, but as far as I can tell, yes: $LC_ALL,
when set, overrides all the individual $LC_* variables. If
you leave $LC_ALL unset, the $LC_* (!ALL) can be assigned
independently, so that languages and character sets differ
per category (e. g. english program messages, but german
file names properly displayed).



 But, getting back to something you said earlier, what did you mean
 exactly about the precedence of LANG vs. LC_*?

There is, if I remember correctly, the idea that $LANG
is only the fallback: _if_ $LC_ALL is set, it wins over
everything; otherwise an individual $LC_* wins over $LANG
for its category.

http://www.freebsd.org/doc/handbook/using-localization.html
See 24.3.4.1.1.1 and 24.3.4.1.2.
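
As far as I know, POSIX defines the lookup order per category as: LC_ALL
first, then the category's own LC_* variable, then LANG, and finally an
implementation default (usually "C"). A minimal Python sketch of that lookup,
where the function name and the example environments are assumptions made up
for illustration:

```python
def effective_locale(category: str, env: dict) -> str:
    """Return the locale that applies to one category, e.g. 'LC_CTYPE'.

    Lookup order: LC_ALL, then the category variable, then LANG.
    Unset or empty variables are skipped; "C" is assumed as the default.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


# English messages, German everything else, no LC_ALL:
env = {"LANG": "de_DE.UTF-8", "LC_MESSAGES": "en_US.UTF-8"}
print(effective_locale("LC_MESSAGES", env))
print(effective_locale("LC_CTYPE", env))

# Once LC_ALL is set, the individual settings no longer matter:
env_all = dict(env, LC_ALL="en_US.ISO8859-1")
print(effective_locale("LC_CTYPE", env_all))
```

In practice this means a setup that wants different categories to differ
should leave LC_ALL unset.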


 Yes, and this is one area where the labels are more than a little
 misleading as well.  My natural inclination is to think of UTF-8 as being a
 single-byte representation for each character in the set, whereas
 UTF-16, as the name implies, would be the wide, 2-byte version.
 Nonetheless, as I posted earlier in this thread, according to the info
 in gucharmap, the representations of the umlauted u are just the
 opposite of this:
 
 UTF-8: 0xC3 0xBC
 UTF-16: 0x00FC
 
 Go figure, huh?  :-)

I think Robert explained it very well: While UTF-16 works
in fixed 2-byte units, UTF-8 is variable width (1 byte for
ASCII, two or more bytes for everything else).



  But returning to the original question, I think Robert
  did explain it very well: There is no real consensus
  about what the different codings should mean. They
  were meant to unify the representation of a very large
  set of characters, but basically there are many inter-
  pretations now, and how they show up to the user depends
  on the font in use, _if_ it has this mapping or that,
  or none.
 
 This seems rather unfortunate to me.  You would think that, by now,
 some standard character set might have emerged that would allow one
 to use, at the very least, the Western characters (as opposed to
 the Eastern or Oriental or Asian, if you will) with a reasonable
 expectation that others will see what was intended.

Assumptions, wishes, conclusions and hopes do differ from
reality. :-)

For example, in October I had to assist working on a
document containing german text and chinese symbols.
Decision: We use UTF-8 so the chinese symbols can appear
in the input. A name: Weng Tonghe [][][]. The brackets
should symbolize the three characters for that name.
They did show up properly in the editor, but on the
printed page... Weng Tonghe [][]. What? Two? But there
were three on input! As we found out, the "he" used
in input was the wrong one (there are several "he"s),
and the font used to render the text did not have that
particular "he". When we found the correct one, finally
three characters appeared, as intended and correct.

This should show: You _never_ know where things are
wrong when something is missing - settings, fonts,
who knows. In relation to file names, this is not a
problem of the file system as it will store any name
you want, but if you can actually SEE or USE that
file name - that's a completely different thing.



  Again a fine demonstration why file names should be
  limited to printable ASCII and no spaces if you want
  them to work everywhere. :-)
 
 Well, for myself, personally, I'm a bit of a stickler for language
 authenticity, you might call it.  Having studied both German and
 French rather extensively in my younger days, I'm quite fond of both
 languages, and rather keen on seeing them represented accurately (I
 especially wince at the use of the plain, unaccented vowel followed by
 an e in place of the umlaut, and to a lesser degree, the use of ss
 in place of Esszett), which has caused me no small amount of confusion,
 aggravation and frustration over the years, to be sure!  :-)

Make sure to call it Eszett (Es = S and Zett = Z).
The teletype convention suggests dissolving ß to sz,
because it's easier to recombine sz to ß, which is
likely to be correct, whereas recombining ss to ß is
often wrong, as there are too many correct ss in texts.

Example:
Mißwirtschaft -> Miszwirtschaft -> Mißwirtschaft  === good.
Messer -> Meßer  === wrong.

In names (e. g. of towns): Staßfurt (right) != Stassfurt (wrong).

Note that !(sz -> ß) in all cases, and !(ss -> ß)
as well, as the rule states that only a non-truncatable ss
is to be set as Eszett. There are only few sz that are
real 'sz', typically in word gaps, e. g. Reiszange. :-)



The funny things start when diacritic marks and other
non-US-ASCII representable elements change the meaning
of a word. In such cases, it's often justified to use
the proper localized representation. However, this is
also the point 

Re: Unprintable 8-bit characters

2011-11-09 Thread David Brodbeck
It's worth noting, too, that most of the non-Unicode encoding systems
predate the Internet.  When computers weren't really talking to each
other, there was no real emphasis on interoperability, and every OS
tended to come up with their own way of encoding foreign languages.
Languages like French, German, and English generally have it easy --
almost everything ended up being Latin1 (aka ISO 8859-1).  For other
languages it can be much more complicated.  There are at least three
commonly used encoding systems for Chinese.  Unicode is gradually
winning, but you'll still find, for example, a lot of Chinese
documents in GB2312 and Big5.


Re: Unprintable 8-bit characters

2011-11-08 Thread Robert Bonomi

On Tue, 8 Nov 2011 18:42:36 -0600, Conrad J. Sabatier wrote:

 I've been trying to understand what the deal is with regards to the
 displaying of the extended 8-bit character set, i.e., 8-bit characters
 with the MSB set.

Quite simply, Unix dates from the days when the 8th bit was used as a 'parity'
bit, allowing detection of *all* single-bit errors -- especially over the
notoriously unreliable connections known as 'serial ports'.
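
To make the parity idea concrete, here is a small Python sketch of even
parity on 7-bit codes. It is a generic illustration under my own naming, not
a description of any particular serial driver.

```python
def add_even_parity(code: int) -> int:
    """Put an even-parity bit into bit 7 of a 7-bit character code."""
    if not 0 <= code < 0x80:
        raise ValueError("only 7-bit codes can carry a parity bit")
    parity = bin(code).count("1") & 1  # 1 if the count of set bits is odd
    return code | (parity << 7)


def parity_ok(byte: int) -> bool:
    """True if the 8-bit value has an even number of set bits."""
    return bin(byte & 0xFF).count("1") % 2 == 0


sent = add_even_parity(ord("u"))  # 0x75 has five set bits, so bit 7 is set
damaged = sent ^ 0x04             # flip a single bit "on the wire"
print(hex(sent), parity_ok(sent), parity_ok(damaged))
```

With the 8th bit spent on parity, only 128 codes remain for characters, which
is why the upper half of the byte range was long treated as suspect.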

 More specifically, I'm trying to figure out how to get the ls command
 to properly display filenames containing characters in this extended
 set.  I have some MP3 files, for instance, whose names contain certain
 European characters, such as the lowercase u with umlaut (code 0xfc
 in the Latin set, according to gucharmap), that I just can't get ls to
 display properly.  These characters seem to be considered by ls as
 unprintable, and the best I've been able to produce in the ls
 output is backslash interpretations of the characters using either the
 -B or -b options, otherwise the default ? is displayed in their place.

 The strange thing is that these characters will display just fine in
 xterm, gnome-terminal, etc.  I can copy and paste them from the
 gucharmap utility into a shell command line or other application, and
 they appear as they should, but ls simply refuses to display them.  I
 can print them using the printf command, even bash's builtin echo seems
 to have no problem with them.  Only ls appears to have this problem.

 I've experimented with using various locales, using the LC_*
 variables, as well as the LANG variable (as documented in the
 environment section of the ls man page), all to no avail.

Obviously you never read as far as the '-w' switch.  <grin>

 Is this an inherent limitation of ls, 

It is -not- a limitation; rather it is a _desired_ behavior -- so that 
one can _tell_ where there is an 'unprintable' character (like \r, or \b)
in a filename.  There are *good*reasons*(TM) why -q is the default behavior
for 'terminal' output.

   or is there some workaround or
 other solution?  Do we need a new en_*.UTF-16 locale?  Should we
 consider extending the ls command to handle these characters?

There _are_ improved versions of ls that do understand the 'locale'
environment variables -- but those programs introduce a whole bunch of
*other* 'not necessarily desired' behaviors -- like sorting upper-case and
lower-case letters as 'equals', rather than regarding any upper-case as 
sorting before any lowercase.

Or is
 there just something about all of this that I'm just not getting?

 As an additional note, I notice that in the text console, this same
 character code (0xfc) produces an entirely different character (a
 lowercase n in a raised position, as for the exponent in a mathematical
 expression).  Is there, in fact, no standardization re: the
 representation of these high bit characters?

"The nice thing about standards is that there are so many to choose from"
applies.  WITH A VENGEANCE!!

There are at least FIFTEEN different sets of glyphs for the 'high bit set'
byte codes *JUST* for the 'iso-8859' base charset.  Plus 'utf-8'.  And not
counting the various bastardizations (e.g. 'CP-1252', etc.) that Microsoft
has introduced.

 Thanks to anyone who can help clear up this long-standing mystery for
 me.

Reading the fine manpage -- with particular attention to the '-q'
and '-w' options should provide some enlightenment.
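
The "many standards" point can be seen directly by decoding the single byte
0xFC under a few charsets. The sketch below uses Python's built-in codecs;
the helper name is made up. One plausible reading of the raised "n" on the
text console is that the console font follows code page 437, but that is an
assumption, not something established in this thread.

```python
def decode_all(raw: bytes, charsets):
    """Decode the same bytes under several charsets.

    Returns a dict mapping charset name to the decoded text, or to None
    where the bytes are not valid in that charset.
    """
    result = {}
    for charset in charsets:
        try:
            result[charset] = raw.decode(charset)
        except UnicodeDecodeError:
            result[charset] = None
    return result


seen = decode_all(b"\xfc", ["iso-8859-1", "cp437", "cp850", "utf-8"])
for charset, text in seen.items():
    print(charset, repr(text))
```

The same stored byte yields a different glyph per charset, and under UTF-8 a
lone 0xFC is not a character at all.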


Re: Unprintable 8-bit characters

2011-11-08 Thread Michael Ross

Am 09.11.2011, 01:42 Uhr, schrieb Conrad J. Sabatier conr...@cox.net:


Pardon me if this may seem like a stupid question, but this is
something that's been bugging me for a long time, and none of my
research has turned up anything useful yet.

I've been trying to understand what the deal is with regards to the
displaying of the extended 8-bit character set, i.e., 8-bit characters
with the MSB set.

More specifically, I'm trying to figure out how to get the ls command
to properly display filenames containing characters in this extended
set.  I have some MP3 files, for instance, whose names contain certain
European characters, such as the lowercase u with umlaut (code 0xfc
in the Latin set, according to gucharmap), that I just can't get ls to
display properly.  These characters seem to be considered by ls as
unprintable, and the best I've been able to produce in the ls
output is backslash interpretations of the characters using either the
-B or -b options, otherwise the default ? is displayed in their place.


Unsure if I understand you correctly.
(extended 8-bit character set with MSB? utf-16?)
I'm confused by this charset stuff in general.

Assuming you want \0xfc displayed as ü,


cat test.py && python test.py && ls -l


#!/usr/local/bin/python
# -*- coding: utf-8 -*-

f=open('\xfc','w')
f.close()
total 2

-rw-r--r--  1 michael  wheel  29  9 Nov 02:43 test.py
-rw-r--r--  1 michael  wheel   0  9 Nov 02:44 ü


here is what works for me:

in my login class in /etc/login.conf:

:charset=ISO-8859-1:\
:lang=de_DE.ISO8859-1:\

``cap_mkdb /etc/login.conf'' after changes


in /etc/rc.conf:

scrnmap=iso-8859-1_to_cp437
font8x8=cp850-8x8
font8x14=cp850-8x14
font8x16=cp850-8x16


and in /etc/ttys, console type is set to ``cons25l1''


Regards,

Michael


Re: Unprintable 8-bit characters

2011-11-08 Thread Conrad J. Sabatier
On Tue, 8 Nov 2011 19:17:27 -0600 (CST)
Robert Bonomi bon...@mail.r-bonomi.com wrote:

 
 On Tue, 8 Nov 2011 18:42:36 -0600, Conrad J. Sabatier wrote:
 
  I've been trying to understand what the deal is with regards to the
  displaying of the extended 8-bit character set, i.e., 8-bit
  characters with the MSB set.
 
 Quite simply Unix dates from the days where the 8th bit was used as a
 'parity' bit.  Allowing detection of *all* single-bit errors --
 especially over the notoriously un-reliable connections known as
 'serial ports'.

Ah, yes!  The good old days.  :-)

  More specifically, I'm trying to figure out how to get the ls
  command to properly display filenames containing characters in this
  extended set.  I have some MP3 files, for instance, whose names
  contain certain European characters, such as the lowercase u with
  umlaut (code 0xfc in the Latin set, according to gucharmap), that I
  just can't get ls to display properly.  These characters seem to be
  considered by ls as unprintable, and the best I've been able to
  produce in the ls output is backslash interpretations of the
  characters using either the -B or -b options, otherwise the default
  ? is displayed in their place.
 
  The strange thing is that these characters will display just fine in
  xterm, gnome-terminal, etc.  I can copy and paste them from the
  gucharmap utility into a shell command line or other application,
  and they appear as they should, but ls simply refuses to display
  them.  I can print them using the printf command, even bash's
  builtin echo seems to have no problem with them.  Only ls appears
  to have this problem.
 
  I've experimented with using various locales, using the LC_*
  variables, as well as the LANG variable (as documented in the
  environment section of the ls man page), all to no avail.
 
 Obviously you never read as far as the '-w' switch.  <grin>

Yes, somehow that one went right past me.  Haste makes waste!  :-)

  Is this an inherent limitation of ls, 
 
 It is -not- a limitation; rather it is a _desired_ behavior -- so
 that one can _tell_ where there is an 'unprintable' character (like
 \r, or \b) in a filename.  There are *good*reasons*(TM) why -q is the
 default behavior for 'terminal' output.

OK, I can see that.  :-)

  or is there some workaround or
  other solution?  Do we need a new en_*.UTF-16 locale?  Should we
  consider extending the ls command to handle these characters?
 
 There _are_ improved versions of ls that do understand the 'locale'
 environment variables -- but those programs introduce a whole bunch of
 *other* 'not necessarily desired' behaviors -- like sorting
 upper-case and lower-case letters as 'equals', rather than regarding
 any upper-case as sorting before any lowercase.

Well, *that* certainly won't do!  That should be the exception, not the
rule.

  Or is
  there just something about all of this that I'm just not getting?
 
  As an additional note, I notice that in the text console, this same
  character code (0xfc) produces an entirely different character (a
  lowercase n in a raised position, as for the exponent in a
  mathematical expression).  Is there, in fact, no standardization
  re: the representation of these high bit characters?
 
 "The nice thing about standards is that there are so many to choose
 from" applies.  WITH A VENGEANCE!!
 
 There are at least FIFTEEN different sets of glyphs for the 'high bit
 set' byte codes *JUST* for the 'iso-8859' base charset.  Plus
 'utf-8'.  And not counting the various bastardizations (e.g. 'CP-1252',
 etc.) that Microsoft has introduced.
 
  Thanks to anyone who can help clear up this long-standing mystery
  for me.
 
 Reading the fine manpage -- with particular attention to the
 '-q' and '-w' options should provide some enlightenment.

Thank you very much.  Some of this matched the suspicions I already had
re: this matter.

Don't know how I completely missed the -w switch. Mea culpa.  :-)

So, what would be the safest bet as far as the most universal
representation for these characters?  Something I've long wondered
about when I've e-mailed people and copied/pasted these characters (are
they really seeing what I'm seeing?).  :-)

-- 
Conrad J. Sabatier
conr...@cox.net


Re: Unprintable 8-bit characters

2011-11-08 Thread Conrad J. Sabatier
On Tue, 8 Nov 2011 19:17:27 -0600 (CST)
Robert Bonomi bon...@mail.r-bonomi.com wrote:

 
 On Tue, 8 Nov 2011 18:42:36 -0600, Conrad J. Sabatier wrote:
 
  I've been trying to understand what the deal is with regards to the
  displaying of the extended 8-bit character set, i.e., 8-bit
  characters with the MSB set.
 
 Quite simply Unix dates from the days where the 8th bit was used as a
 'parity' bit.  Allowing detection of *all* single-bit errors --
 especially over the notoriously un-reliable connections known as
 'serial ports'.
 
  More specifically, I'm trying to figure out how to get the ls
  command to properly display filenames containing characters in this
  extended set.  I have some MP3 files, for instance, whose names
  contain certain European characters, such as the lowercase u with
  umlaut (code 0xfc in the Latin set, according to gucharmap), that I
  just can't get ls to display properly.  These characters seem to be
  considered by ls as unprintable, and the best I've been able to
  produce in the ls output is backslash interpretations of the
  characters using either the -B or -b options, otherwise the default
  ? is displayed in their place.
 
  The strange thing is that these characters will display just fine in
  xterm, gnome-terminal, etc.  I can copy and paste them from the
  gucharmap utility into a shell command line or other application,
  and they appear as they should, but ls simply refuses to display
  them.  I can print them using the printf command, even bash's
  builtin echo seems to have no problem with them.  Only ls appears
  to have this problem.
 
  I've experimented with using various locales, using the LC_*
  variables, as well as the LANG variable (as documented in the
  environment section of the ls man page), all to no avail.
 
 Obviously you never read as far as the '-w' switch.  <grin>

Just a quickie followup:

Setting LC_ALL=en_US.UTF-8 and using ls -w was, in fact, the magic
key (at least, in any of the X terminal apps; still getting the little
exponential n in the console)!

Thank you so much.  I'll sleep much better tonight.  :-)
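
Why the locale and the switch both matter can be sketched roughly as follows.
This is NOT the real ls code, only a Python approximation of the behaviour
described in this thread: -q shows "?" for unprintable characters, -b shows
backslash escapes (octal here), and -w passes them through. "Printable" is
approximated with Python's str.isprintable() after decoding with the
locale's charset; the function name and mode letters are assumptions.

```python
def render_name(raw: bytes, charset: str, mode: str = "q") -> str:
    """Approximate how a file name might be shown on a terminal.

    charset: the charset implied by the locale ('ascii' for the C locale).
    mode: 'q' = '?' for unprintables, 'b' = octal escapes, 'w' = raw.
    """
    # surrogateescape keeps undecodable bytes as lone surrogates, which
    # str.isprintable() reports as not printable.
    text = raw.decode(charset, errors="surrogateescape")
    out = []
    for ch in text:
        if ch.isprintable():
            out.append(ch)
        elif mode == "q":
            out.append("?")
        elif mode == "b":
            # Turn the character back into its original bytes and escape them.
            for byte in ch.encode(charset, errors="surrogateescape"):
                out.append("\\%03o" % byte)
        else:
            # 'w': keep as-is; encode with surrogateescape to get raw bytes.
            out.append(ch)
    return "".join(out)


latin1_name = b"M\xfcller.mp3"    # name stored as ISO 8859-1 bytes
utf8_name = b"M\xc3\xbcller.mp3"  # same name stored as UTF-8 bytes

print(render_name(latin1_name, "ascii", "q"))
print(render_name(latin1_name, "ascii", "b"))
print(render_name(utf8_name, "utf-8", "q"))
```

Under this model, a UTF-8 locale alone already makes a UTF-8 name printable,
while a Latin-1 name in the C locale needs the raw mode plus a terminal that
interprets the byte as intended.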

-- 
Conrad J. Sabatier
conr...@cox.net


Re: Unprintable 8-bit characters

2011-11-08 Thread Polytropon
On Wed, 09 Nov 2011 02:51:31 +0100, Michael Ross wrote:
 Am 09.11.2011, 01:42 Uhr, schrieb Conrad J. Sabatier conr...@cox.net:
 
  Pardon me if this may seem like a stupid question, but this is
  something that's been bugging me for a long time, and none of my
  research has turned up anything useful yet.
 
  I've been trying to understand what the deal is with regards to the
  displaying of the extended 8-bit character set, i.e., 8-bit characters
  with the MSB set.
 
  More specifically, I'm trying to figure out how to get the ls command
  to properly display filenames containing characters in this extended
  set.  I have some MP3 files, for instance, whose names contain certain
  European characters, such as the lowercase u with umlaut (code 0xfc
  in the Latin set, according to gucharmap), that I just can't get ls to
  display properly.  These characters seem to be considered by ls as
  unprintable, and the best I've been able to produce in the ls
  output is backslash interpretations of the characters using either the
  -B or -b options, otherwise the default ? is displayed in their place.
 
 Unsure if I understand you correctly.
 (extended 8-bit character set with MSB? utf-16?)
 I'm confused by this charset stuff in general.
 
 Assuming you want \0xfc displayed as ü,
 
  cat test.py && python test.py && ls -l
 
 #!/usr/local/bin/python
 # -*- coding: utf-8 -*-
 
 f=open('\xfc','w')
 f.close()
 total 2
 
 -rw-r--r--  1 michael  wheel  29  9 Nov 02:43 test.py
 -rw-r--r--  1 michael  wheel   0  9 Nov 02:44 ü
 
 
 here is what works for me:
 
 in my login class in /etc/login.conf:
 
  :charset=ISO-8859-1:\
  :lang=de_DE.ISO8859-1:\
 
 ``cap_mkdb /etc/login.conf'' after changes

Ah, thanks - that seems to be the proper way to have
the environmental variables set - instead of my (ab)use
of setenv's in the csh config file. :-)

Note the precedence of $LANG vs. $LC_* (as they can
be used to configure things more precisely, e. g.
regarding system messages or date formats; see example
following).



 in /etc/rc.conf:
 
   scrnmap=iso-8859-1_to_cp437

Hm? CP437? Codepage? Isn't that some MS-DOS thing?
I've never needed a screenmap to make extended
characters (everything beyond US-ASCII) work.



   font8x8=cp850-8x8
   font8x14=cp850-8x14
   font8x16=cp850-8x16
 
 
 and in /etc/ttys, console type is set to ``cons25l1''

I have a similar setting here, but that does _not_ work
with UTF-8 encoded characters. If I want to use them, I
have to change some environmental variables, from

#---GERMAN/ENGLISH === DEFAULT
setenv  LC_ALL  en_US.ISO8859-1
setenv  LC_MESSAGES en_US.ISO8859-1
setenv  LC_COLLATE  de_DE.ISO8859-1
setenv  LC_CTYPE    de_DE.ISO8859-1
setenv  LC_MONETARY de_DE.ISO8859-1
setenv  LC_NUMERIC  de_DE.ISO8859-1
setenv  LC_TIME de_DE.ISO8859-1
unsetenv LANG

to

#---INTERNATIONAL-
setenv  LC_ALL  en_US.UTF-8
setenv  LC_MESSAGES en_US.UTF-8
setenv  LC_COLLATE  de_DE.UTF-8
setenv  LC_CTYPE    de_DE.UTF-8
setenv  LC_MONETARY de_DE.UTF-8
setenv  LC_NUMERIC  de_DE.UTF-8
setenv  LC_TIME de_DE.UTF-8
setenv  LANG    de_DE.UTF-8

Then I can use UTF-8 characters inside rxvt-unicode. Of
course, text mode console is limited to the first set
of configuration, using the ISO 8859-1 character set.

This worked long before UTF-8 arrived with the glorious
idea that I should have 2 bytes where one is sufficient,
to describe our (german) 6 umlauts and the Eszett ligature. :-)

Improper settings will result in [][] or A-tilde three
quarters upside-down question mark, depending on editor
or terminal used.


But returning to the original question, I think Robert
did explain it very well: There is no real consensus
about what the different codings should mean. They
were meant to unify the representation of a very large
set of characters, but basically there are many inter-
pretations now, and how they show up to the user depends
on the font in use, _if_ it has this mapping or that,
or none.

For running ls, -w is the right option to use - but IN
COMBINATION with correct settings for the terminal
emulation AND the presence of a font that will do.

Again a fine demonstration why file names should be
limited to printable ASCII and no spaces if you want
them to work everywhere. :-)



-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...


Re: Unprintable 8-bit characters

2011-11-08 Thread Polytropon
On Tue, 8 Nov 2011 19:58:04 -0600, Conrad J. Sabatier wrote:
 So, what would be the safest bet as far as the most universal
 representation for these characters?  Something I've long wondered
 about when I've e-mailed people and copied/pasted these characters (are
 they really seeing what I'm seeing?).  :-)

With lots of experience in how not to do it, I would
like to suggest the following: Use US-ASCII letters only.
This makes _sure_ they will display correctly everywhere
and even on ultra-worst conditions (e. g. you are at a
real serial console, a real DEC vt100).

Filenames like kloesze_mit_muesli_foerdern_baerenhunger.mp3
can be processed by _any_ ls or mailer program. There is
no need to worry about... hmmm... do they have the same
character settings that I use? Do they have a font installed
that can show the file names properly?

Rules: Substitute umlauts properly (*e). Substitute ß
to sz (teletype convention). Remove accents or other
marks completely, as well as strokes through characters
or similar typographical specialities. If you can, use
lowercase only. No spaces, use _ instead. Avoid any other
special characters. Make everything plain ASCII, and you
can _still_ easily get the meaning.
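
The rules above can be sketched as a small Python helper. It is only one
possible reading of them: the function name is made up, and letters that
have no decomposition (for example stroked letters such as ø or ł) are simply
dropped rather than transliterated.

```python
import unicodedata

# German umlauts become "*e", Eszett becomes "sz" (teletype convention).
_GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "sz"}

_ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789._-")


def ascii_filename(name: str) -> str:
    """Reduce a file name to lowercase plain ASCII without spaces."""
    # Compose first so that "u + combining diaeresis" is seen as "ü".
    name = unicodedata.normalize("NFC", name)
    for src, dst in _GERMAN.items():
        name = name.replace(src, dst)
    # Decompose what is left and drop the combining marks (accents etc.).
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    # Lowercase only, underscores instead of spaces.
    name = name.lower().replace(" ", "_")
    # Anything still outside the safe set is removed.
    return "".join(c for c in name if c in _ALLOWED)


print(ascii_filename("Klöße mit Müsli fördern Bärenhunger.mp3"))
```

The mapping is lossy by design: the original spelling cannot be recovered
from the result with certainty.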

The file system ITSELF doesn't care for the meaning of
the characters. SAVING them and DISPLAYING them are two
fully different things. Nobody stops you from making
filenames like öÜÖß߀Łµ³¼`łøæſđ̣ĸ»¢.mp3, but they can
cause trouble you can't predict. You _never_ know...



-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...


Re: Unprintable 8-bit characters

2011-11-08 Thread Conrad J. Sabatier
On Wed, 09 Nov 2011 02:51:31 +0100
Michael Ross g...@ross.cx wrote:

 Am 09.11.2011, 01:42 Uhr, schrieb Conrad J. Sabatier
 conr...@cox.net:
 
  Pardon me if this may seem like a stupid question, but this is
  something that's been bugging me for a long time, and none of my
  research has turned up anything useful yet.
 
  I've been trying to understand what the deal is with regards to the
  displaying of the extended 8-bit character set, i.e., 8-bit
  characters with the MSB set.
 
  More specifically, I'm trying to figure out how to get the ls
  command to properly display filenames containing characters in this
  extended set.  I have some MP3 files, for instance, whose names
  contain certain European characters, such as the lowercase u with
  umlaut (code 0xfc in the Latin set, according to gucharmap), that I
  just can't get ls to display properly.  These characters seem to be
  considered by ls as unprintable, and the best I've been able to
  produce in the ls output is backslash interpretations of the
  characters using either the -B or -b options, otherwise the default
  ? is displayed in their place.
 
 Unsure if I understand you correctly.
 (extended 8-bit character set with MSB? utf-16?)
 I'm confused by this charset stuff in general.

That is to say, 8-bit characters with the most significant bit set,
or characters greater than 0x7f.

I can certainly appreciate your confusion; this is definitely a
confusing area.  In gucharmap, selecting the umlauted u in the Latin
set, the Character Details tab reveals the following:

U+00FC LATIN SMALL LETTER U WITH DIAERESIS

General Character Properties

In Unicode since: 1.1
Unicode category: Letter, Lowercase
Canonical decomposition: U+0075 LATIN SMALL LETTER U + U+0308 COMBINING
DIAERESIS

Various Useful Representations

UTF-8: 0xC3 0xBC
UTF-16: 0x00FC

C octal escaped UTF-8: \303\274
XML decimal entity: &#252;

So apparently, it's a wide character in UTF-8, which really throws a
monkey wrench into the works in certain situations (for example, one of
the little scripts I've written to process MP3 files uses the cut
command, which complains about an illegal byte sequence).

Even more confusing, selecting the character and copying it to the
clipboard, the UTF-16 representation (0xfc) is what actually gets
used.  Pasting this single-byte version into an X terminal (any of
them: xterm, gnome-terminal, etc.) does display the correct character,
an umlauted u, even if using an 8-bit locale, such as UTF-8.  Majorly
confusing!

 Assuming you want \0xfc displayed as ü,

Yes, exactly.

  cat test.py && python test.py && ls -l
 
 #!/usr/local/bin/python
 # -*- coding: utf-8 -*-
 
 f=open('\xfc','w')
 f.close()
 total 2
 
 -rw-r--r--  1 michael  wheel  29  9 Nov 02:43 test.py
 -rw-r--r--  1 michael  wheel   0  9 Nov 02:44 ü
 
 
 here is what works for me:
 
 in my login class in /etc/login.conf:
 
  :charset=ISO-8859-1:\
  :lang=de_DE.ISO8859-1:\
 
 ``cap_mkdb /etc/login.conf'' after changes
 
 
 in /etc/rc.conf:
 
   scrnmap=iso-8859-1_to_cp437
   font8x8=cp850-8x8
   font8x14=cp850-8x14
   font8x16=cp850-8x16
 
 
 and in /etc/ttys, console type is set to ``cons25l1''

Thanks, I hadn't considered making those sorts of changes for the
console.  I work so seldom nowadays in the console, I'd forgotten all
about that stuff (use it or lose it, as they say!).  I'll certainly give
that a try.

Much appreciation for both yours and Robert's replies.

-- 
Conrad J. Sabatier
conr...@cox.net
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org


Re: Unprintable 8-bit characters

2011-11-08 Thread Daniel Staal
--As of November 8, 2011 7:58:04 PM -0600, Conrad J. Sabatier is alleged to 
have said:



So, what would be the safest bet as far as the most universal
representation for these characters?  Something I've long wondered
about when I've e-mailed people and copied/pasted these characters (are
they really seeing what I'm seeing?).  :-)


--As for the rest, it is mine.

These days, the safest bet is UTF-8, or some other Unicode character set, 
in something that can convey what character set it is in.  (Email can, 
depending on the mail client.)
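
As a rough illustration of "something that can convey what character set it is in", here is a sketch using Python's standard email package. The addresses and subject are placeholders, not taken from this thread. The charset ends up in the Content-Type header, which is what lets the receiving client decode the bytes correctly.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Starship Trooper"
msg["From"] = "sender@example.org"      # placeholder address
msg["To"] = "recipient@example.org"     # placeholder address

# Declare the encoding explicitly; it is recorded in the Content-Type header.
msg.set_content("c. W\u00fcrm", charset="utf-8")

print(msg["Content-Type"])
print(msg.get_content_charset())
```

Whether the recipient actually sees the umlaut still depends on their mail client honouring that header.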


Not that Unicode is universal yet, but it is designed to be (and is, 
generally) a solution to the 'multiple character encodings' problem.  (By, 
of course, defining a new encoding.)  It has a decent amount of traction, 
and in a decade or so - once other options have been firmly deprecated - 
I'd expect we could start discussing whether to switch ls to using it by 
default.  ;)


All this is of course if you *must* go beyond 7-bit ASCII.  (Which 
Unicode is designed to be a strict superset of; UTF-8 even keeps ASCII
text byte-for-byte unchanged.)


Daniel T. Staal

---
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---


Re: Unprintable 8-bit characters

2011-11-08 Thread Conrad J. Sabatier
On Wed, 9 Nov 2011 03:10:24 +0100
Polytropon free...@edvax.de wrote:

 On Wed, 09 Nov 2011 02:51:31 +0100, Michael Ross wrote:
  Am 09.11.2011, 01:42 Uhr, schrieb Conrad J. Sabatier
  conr...@cox.net:

[snip]

   I've been trying to understand what the deal is with regards to
   the displaying of the extended 8-bit character set, i.e., 8-bit
   characters with the MSB set.

[snip] 

  Unsure if I understand you correctly.
  (extended 8-bit character set with MSB? utf-16?)
  I'm confused by this charset stuff in general.
  
  Assuming you want \0xfc displayed as ü,

[snip]

  here is what works for me:
  
  in my login class in /etc/login.conf:
  
   :charset=ISO-8859-1:\
   :lang=de_DE.ISO8859-1:\
  
  ``cap_mkdb /etc/login.conf'' after changes
 
 Ah, thanks - that seems to be the proper way to have
 the environmental variables set - instead of my (ab)use
 of setenv's in the csh config file. :-)

Same here.  I've been guilty as well of neglecting to properly adjust
my console configuration.

 Note the precedence of $LANG vs. $LC_* (as they can
 be used to configure things more precisely, e. g.
 regarding system messages or date formats; see example
 following).
 
 
 
  in /etc/rc.conf:
  
  scrnmap=iso-8859-1_to_cp437
 
 Hm? CP437? Codepage? Isn't that some MS-DOS thing?
 I've never needed a screenmap to make extended
 characters (everything beyond US-ASCII) work.
 
 
 
  font8x8=cp850-8x8
  font8x14=cp850-8x14
  font8x16=cp850-8x16
  
  
  and in /etc/ttys, console type is set to ``cons25l1''
 
 I have a similar setting here, but that does _not_ work
 with UTF-8 encoded characters. If I want to use them, I
 have to change some environmental variables, from
 
   #---GERMAN/ENGLISH === DEFAULT
   setenv  LC_ALL  en_US.ISO8859-1
   setenv  LC_MESSAGES en_US.ISO8859-1
   setenv  LC_COLLATE  de_DE.ISO8859-1
   setenv  LC_CTYPE  de_DE.ISO8859-1
   setenv  LC_MONETARY de_DE.ISO8859-1
   setenv  LC_NUMERIC  de_DE.ISO8859-1
   setenv  LC_TIME de_DE.ISO8859-1
   unsetenv LANG
 
 to
 
   #---INTERNATIONAL-
   setenv  LC_ALL  en_US.UTF-8
   setenv  LC_MESSAGES en_US.UTF-8
   setenv  LC_COLLATE  de_DE.UTF-8
   setenv  LC_CTYPE  de_DE.UTF-8
   setenv  LC_MONETARY de_DE.UTF-8
   setenv  LC_NUMERIC  de_DE.UTF-8
   setenv  LC_TIME de_DE.UTF-8
   setenv  LANG  de_DE.UTF-8

Doesn't using LC_ALL obviate the need to set any of the other LC_*
variables?  At least, that's always been my understanding of it.

But, getting back to something you said earlier, what did you mean
exactly about the precedence of LANG vs. LC_*?
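
For reference, the lookup order that POSIX describes for these variables can be sketched as follows. This is a simplified model, not how any libc implements it; the function name effective_locale is made up for illustration, and the fallback to "C" is an assumption about the typical default.

```python
def effective_locale(category, env):
    """Return the locale name that applies to one category, e.g. 'LC_CTYPE'."""
    # LC_ALL wins over everything, then the specific LC_* variable, then LANG.
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var)
        if value:          # an empty value is treated the same as unset
            return value
    return "C"             # assumed default when nothing is set

env = {
    "LC_ALL": "en_US.UTF-8",
    "LC_CTYPE": "de_DE.UTF-8",
    "LANG": "de_DE.UTF-8",
}
print(effective_locale("LC_CTYPE", env))   # LC_ALL overrides LC_CTYPE

del env["LC_ALL"]
print(effective_locale("LC_CTYPE", env))   # now LC_CTYPE applies
print(effective_locale("LC_TIME", env))    # no LC_TIME set, so LANG applies
```

Under this model, setting LC_ALL does make the individual LC_* settings irrelevant, while LANG only fills in for categories that are not set explicitly.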

 Then I can use UTF-8 characters inside rxvt-unicode. Of
 course, text mode console is limited to the first set
 of configuration, using the ISO 8859-1 character set.
 
 This worked long before UTF-8 arrived with the glorious
 idea that I should have 2 bytes where one is sufficient,
 to describe our (german) 6 umlauts and the Eszett ligature. :-)

<grin>

Yes, and this is one area where the labels are more than a little
misleading as well.  My natural inclination is to think of UTF-8 as being a
single-byte representation for each character in the set, whereas
UTF-16, as the name implies, would be the wide, 2-byte version.
Nonetheless, as I posted earlier in this thread, according to the info
in gucharmap, the representations of the umlauted u are just the
opposite of this:

UTF-8: 0xC3 0xBC
UTF-16: 0x00FC

Go figure, huh?  :-)
 
 Improper settings will result in [][] or A-tilde three
 quarters upside-down question mark, depending on editor
 or terminal used.

Yes, I will definitely have to try using the recommendations that have
come up in this thread re: the console.
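
The "A-tilde" garbage described above is what UTF-8 bytes look like when they are read as ISO 8859-1. A small sketch, assuming Python 3, that reproduces both directions of the mismatch:

```python
text = "W\u00fcrm"

# UTF-8 bytes misread as ISO 8859-1: the two bytes of the umlaut show up
# as two separate characters, A-tilde followed by the one-quarter sign.
garbled = text.encode("utf-8").decode("iso-8859-1")
print(garbled)

# ISO 8859-1 bytes misread as UTF-8: the lone 0xfc byte is invalid, so a
# lenient decoder substitutes U+FFFD, the replacement character.
replaced = text.encode("iso-8859-1").decode("utf-8", errors="replace")
print(replaced)
```

Which of the two a user sees depends on which side (file or terminal) is using which encoding.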

 But returning to the original question, I think Robert
 did explain it very well: There is no real consensus
 about what the different codings should mean. They
 were meant to unify the representation of a very large
 set of characters, but basically there are many inter-
 pretations now, and how they show up to the user depends
 on the font in use, _if_ it has this mapping or that,
 or none.

This seems rather unfortunate to me.  You would think that, by now,
some standard character set might have emerged that would allow one
to use, at the very least, the Western characters (as opposed to
the Eastern or Oriental or Asian, if you will) with a reasonable
expectation that others will see what was intended.

 For running ls, -w is the right option to use - but IN
 COMBINATION with correct settings for the terminal
 emulation AND the presence of a font that will do.

Yes.  I'm still a little embarrassed for having completely overlooked
that option earlier.  Hasty (impatient) reading of man pages.  :-)
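
To make the "?" behaviour more concrete, here is a rough model of what a listing tool might do with a filename's raw bytes under a given locale encoding. It is only an illustration and not the actual ls source; ls_style_name is a hypothetical helper.

```python
def ls_style_name(raw: bytes, encoding: str) -> str:
    """Show a filename, replacing anything undecodable or unprintable with '?'."""
    decoded = raw.decode(encoding, errors="replace")
    return "".join(
        c if c.isprintable() and c != "\ufffd" else "?"
        for c in decoded
    )

latin1_name = b"W\xfcrm.mp3"       # filename stored as ISO 8859-1 bytes
utf8_name = b"W\xc3\xbcrm.mp3"     # the same name stored as UTF-8 bytes

print(ls_style_name(latin1_name, "ascii"))        # W?rm.mp3
print(ls_style_name(latin1_name, "iso-8859-1"))   # umlaut displayed
print(ls_style_name(latin1_name, "utf-8"))        # W?rm.mp3
print(ls_style_name(utf8_name, "utf-8"))          # umlaut displayed
```

The point of the sketch is that the same bytes are printable or not depending on the encoding the locale declares, which is why the environment settings matter as much as the ls option.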

 Again a fine demonstration why file names should be
 limited to printable ASCII and no spaces if you want
 them 

Re: Unprintable 8-bit characters

2011-11-08 Thread Conrad J. Sabatier
On Tue, 08 Nov 2011 21:27:16 -0400
Daniel Staal dst...@usa.net wrote:

 --As of November 8, 2011 7:58:04 PM -0600, Conrad J. Sabatier is
 alleged to have said:
 
  So, what would be the safest bet as far as the most universal
  representation for these characters?  Something I've long wondered
  about when I've e-mailed people and copied/pasted these characters
  (are they really seeing what I'm seeing?).  :-)
 
 --As for the rest, it is mine.
 
 These days, the safest bet is UTF-8, or some other Unicode character
 set, in something that can convey what character set it is in.
 (Email can, depending on the mail client.)
 
 Not that Unicode is universal yet, but it is designed to be (and is, 
 generally) a solution to the 'multiple character encodings' problem.
 (By, of course, defining a new encoding.)  It has a decent amount of
 traction, and in a decade or so - once other options have been firmly
 deprecated - I'd expect we could start discussing whether to switch
 ls to using it by default.  ;)
 
 All this is of course if you *must* go beyond 7-bit ASCII.  (Which
 Unicode is designed to be a strict superset of; UTF-8 even keeps ASCII
 text byte-for-byte unchanged.)

That sounds sane and sensible.  :-)

I've adjusted my environment to include:

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8

And also adjusted my console configuration to display these characters:

font8x14=iso-8x14
font8x16=iso-8x16
font8x8=iso-8x8

And, last but not least, aliased ls to ensure these characters will
actually be displayed:

alias ls='ls -Fw'

Looking good here now:

conrads:~$ cd Music/Progressive Rock/Yes/The Yes Album
conrads:~/Music/Progressive Rock/Yes/The Yes Album$ ls *03*
Yes - The Yes Album - 03 - Starship Trooper: a. Life Seeker - b.
Disillusion - c. Würm.mp3

Many thanks to everyone for all the very helpful, useful information.

-- 
Conrad J. Sabatier
conr...@cox.net


Re: Unprintable 8-bit characters

2011-11-08 Thread Conrad J. Sabatier
On Tue, 8 Nov 2011 20:24:18 -0600
Conrad J. Sabatier conr...@cox.net wrote:

 Even more confusing, selecting the character and copying it to the
 clipboard, the UTF-16 representation (0xfc) is what actually gets
 used.  Pasting this single-byte version into an X terminal (any of
 them: xterm, gnome-terminal, etc.) does display the correct character,
 an umlauted u, even if using an 8-bit locale, such as UTF-8.
 Majorly confusing!

Just realized on reading this how weird it sounds.  What I was getting
at here was that the (single-byte) UTF-16 code displays the correct
character in a UTF-8 locale, even though the UTF-8 code for the
character is supposedly a 2-byte sequence.

Anyway, enough about that.  I've managed to get the results I was
hoping for now, so I'm satisfied.  :-)

Thanks again for all the responses.

-- 
Conrad J. Sabatier
conr...@cox.net


Re: Unprintable 8-bit characters

2011-11-08 Thread Robert Bonomi

Conrad J. Sabatier conr...@cox.net wrote:

 <grin>

 Yes, and this is one area where the labels are more than a little
 misleading as well.  My natural inclination is to think of UTF-8 as being a
 single-byte representation for each character in the set, whereas
 UTF-16, as the name implies, would be the wide, 2-byte version.

Not exactly.

 Nonetheless, as I posted earlier in this thread, according to the info
 in gucharmap, the representations of the umlauted u are just the
 opposite of this:

not exactly. Again.

 UTF-8: 0xC3 0xBC
 UTF-16: 0x00FC
  
 Go figure, huh?  :-)

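In UTF-16, everything is built from 16-bit units.  Notice that 0x00FC has
-four- nybbles after the '0x.'  Every character in the Basic Multilingual
Plane (which covers all the Western characters) is exactly one 16-bit unit;
the rarer characters above U+FFFF take two units (a 'surrogate pair').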

In UTF-8, the 'base' charset -- 7-bit ASCII, U+0000 through U+007F -- is
represented by a single byte.  'extended' characters are represented by two
to four bytes (two for everything in Latin-1, including the umlauts).
Thus, 'characters' have a *variable*length* representation -- one to four 
bytes.  A character, however many bytes it takes, can 
begin on -any- byte boundary within a data stream, depending on 'what came 
before it'.  UTF-8 multi-byte representations are designed such that one can 
jump to any _byte_ offset within the file, and determine -- by looking *only* 
at the value of that byte -- whether it is (a) a single-byte character, (b) the 
first byte of a multi-byte sequence, or (c) a continuation byte of a
multi-byte sequence.
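
The "look at one byte and know where you are" property can be sketched in a few lines of Python. The function name and the labels it returns are made up for illustration. It only checks the leading bit pattern and does not reject the handful of lead-byte values (0xC0, 0xC1, 0xF5 and up) that well-formed UTF-8 never uses.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte of a UTF-8 stream by its leading bits."""
    if b < 0x80:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: second or later byte of a sequence
    if b < 0xE0:
        return "lead-2"        # 110xxxxx: first byte of a two-byte sequence
    if b < 0xF0:
        return "lead-3"        # 1110xxxx: first byte of a three-byte sequence
    if b < 0xF8:
        return "lead-4"        # 11110xxx: first byte of a four-byte sequence
    return "invalid"           # 0xF8..0xFF never appear in UTF-8

for byte in "W\u00fcrm".encode("utf-8"):
    print(f"0x{byte:02x} {classify_utf8_byte(byte)}")
```

Note that 0xfc, the Latin-1 code for the umlauted u, falls in the "invalid" range, which is why tools expecting UTF-8 complain about it.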

With UTF-16 (as long as the text contains no surrogate pairs) you can
position directly to any -character-, by jumping to a _byte_ offset that is
twice the index of the character you want. Given a byte offset, you always
know the 'equivalent' _character_ offset.
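
A small sketch of that direct addressing, assuming Python 3 and text limited to the Basic Multilingual Plane; utf16_unit_at is a hypothetical helper. The last lines show the case where the simple "times two" rule breaks down.

```python
text = "W\u00fcrm"
raw = text.encode("utf-16-le")   # little-endian, no byte-order mark

def utf16_unit_at(raw: bytes, index: int) -> str:
    """Decode the single 16-bit unit at character index `index` (BMP text only)."""
    return raw[2 * index : 2 * index + 2].decode("utf-16-le")

print(len(raw))                  # 8 bytes for 4 characters
print(utf16_unit_at(raw, 1))     # the umlauted u, found at byte offset 2

# A character above U+FFFF needs two 16-bit units, so the rule no longer holds.
clef = "\U0001D11E"              # MUSICAL SYMBOL G CLEF
print(len(clef.encode("utf-16-le")))   # 4 bytes for 1 character
```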

With UTF-8, you have to read the character stream, counting 'characters' 
as you go, to get to the desired point.  You can seek to an arbitrary
_byte_ offset, but you do not know how many 'characters' into the file 
that offset is.
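
The counting this requires can be sketched as below, assuming Python 3 and well-formed UTF-8 input. Both function names are illustrative. The trick is that every byte which is not a continuation byte (10xxxxxx) starts a new character.

```python
def utf8_char_offset(raw: bytes, index: int) -> int:
    """Return the byte offset at which character number `index` starts."""
    seen = 0
    for offset, b in enumerate(raw):
        if b & 0xC0 != 0x80:         # not a continuation byte: a character starts here
            if seen == index:
                return offset
            seen += 1
    raise IndexError("character index out of range")

def utf8_chars_before(raw: bytes, byte_offset: int) -> int:
    """Count how many characters start before the given byte offset."""
    return sum(1 for b in raw[:byte_offset] if b & 0xC0 != 0x80)

raw = "W\u00fcrm".encode("utf-8")    # 57 c3 bc 72 6d
print(utf8_char_offset(raw, 2))      # 'r' starts at byte 3, not byte 2
print(utf8_chars_before(raw, 3))     # two characters precede byte 3
```

Both functions have to walk the data from the start, which is the cost being contrasted with UTF-16 here.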

UTF-8 vs. UTF-16 is a trade-off between 'compactness' (UTF-8), and 
simplicity of addressing/representation (UTF-16).

 This seems rather unfortunate to me.  You would think that, by now,
 some standard character set might have emerged that would allow one
 to use, at the very least, the Western characters (as opposed to
 the Eastern or Oriental or Asian, if you will) with a reasonable
 expectation that others will see what was intended.

Heh. 

How many 'character' codes are you willing to devote to national 'currency 
symbols', just for starters?  Probable minimum of two per currency -- one
for the minimum coinage unit (cent, pence, pfennig, etc.) and one for
the denomination unit (dollar, pound, mark, kroner, etc.)

Now, one (obviously) has to have the basic 'Roman' alphabet. 

Then there are all the diacritical markings (accent, accent grave, dot,
umlaut, ring, bar, 'hat', inverted hat, etc.) for vowels.  And cedilla,
tilde, etc., for select consonants.  Plus language specific symbols like
ess-zett, 'thorn', etc.

How about phonetic symbols, like 'schwa' ?

And Greek for all sorts of scientific use?

What about Cyrillic characters, for many Eastern European languages?

Now, consider punctuation marks:
   the 'typewriter' basics, 
   How many of 'minus-sign, hyphen, em-dash, en-dash, soft-hyphen' are needed?
   How many of 'accent, accent grave, apostrophe, opening/closing single-quote'
   are needed?
   opening/closing double-quotes,  and/or a 'position neutral' double-quote?

Other symbols, like --
   digits,
   common fractions,
   'Trademark','Registered trademark','copyright' 
   'paragraph','section', 
   superscripts  -- exponents, footnotes, etc.
   subscripts -- chemical formulae, etc.
   Simple line-drawing graphics

Diphthongs??  Ligatures??

Start counting things up. 

An 8-bit 'address space' gets used up _really_ quick.
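
A quick way to see the limit, sketched in Python 3 with an arbitrary handful of sample characters: ISO 8859-1 has room for the umlauts, but several of the symbols listed above simply have no code in it. The helper fits_latin1 is made up for this example.

```python
def fits_latin1(ch: str) -> bool:
    """True if the character has a code in the 8-bit ISO 8859-1 set."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

samples = {
    "u with umlaut": "\u00fc",
    "euro sign": "\u20ac",
    "schwa": "\u0259",
    "Greek pi": "\u03c0",
    "Cyrillic zhe": "\u0436",
}

for name, ch in samples.items():
    print(f"{name:<14} {'fits' if fits_latin1(ch) else 'does not fit'}")
```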

<wry grin>



