Re: Unicode Filenames in Archives

2008-06-13 Thread Simos Xenitellis

SrinTuar wrote:

> Using some fairly recent OSes, such as Fedora Core 8 and Windows XP,
> I seem to have no way to move a bunch of files from one to the other
> while preserving the nice Unicode filenames I have.
>
> Specifically, the files (a few thousand of them) were created on the
> FC8 system.
>
> Putting them together in a zip file works fine FC8 to FC8, but fails
> miserably when trying to unzip on Windows.
>
> A bit of searching shows this:
> http://www.pkware.com/documents/casestudies/APPNOTE.TXT
>
> PKWARE has apparently declared a flag bit to mean all filenames are UTF-8.
>
> But at the same time, the developers of Info-ZIP say this:
> http://www.info-zip.org/FAQ.html
>
> Basically, that UTF-8 support is nowhere on their radar.
>
> Things work poorly in the opposite direction as well, for zip files
> created on Windows: sometimes I can guess the original encoding and
> reverse the damage, other times I cannot; perhaps the software that
> made the archive has already trashed the filenames.
>
> I've also given tarballs a shot for this task, but sadly Cygwin is
> ASCII-only.
>
> Because it works Linux to Linux, or at least Fedora to Fedora, and
> that is really good enough for me, it's not a major issue. But I'm
> curious to know if others have run into this cross-platform problem,
> and how they resolved it for themselves. That is, if anyone still
> reads this list.
>
> How do you go about making a basic archive containing non-ASCII
> filenames that you can have confidence will unpack well on most
> operating systems?

If you check the list archives, you will find a discussion from a few
years back.
One of the outcomes was that using ZIP with filenames in encodings
other than ASCII is a bit messy.
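
To make the ZIP side concrete: per the PKWARE APPNOTE, the flag bit in question is bit 11 of the general purpose flags (0x800). Below is a minimal sketch in Python, assuming its standard zipfile module as the writer; other zip tools may well behave differently, and the helper names are made up for illustration. The second helper shows the "guess the encoding and reverse the damage" repair for one assumed case, where UTF-8 name bytes were decoded as CP437.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: "filenames are UTF-8"


def utf8_flag_set(filename: str) -> bool:
    """Write one empty member to an in-memory ZIP and report whether
    the UTF-8 filename flag ended up set on that member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(filename, b"")
    with zipfile.ZipFile(buf) as zf:
        return bool(zf.infolist()[0].flag_bits & UTF8_FLAG)


def undo_cp437_guess(garbled: str) -> str:
    """Reverse the damage for the assumed case where UTF-8 name bytes
    were decoded as CP437 by the extracting tool."""
    return garbled.encode("cp437").decode("utf-8")


if __name__ == "__main__":
    print(utf8_flag_set("αρχείο.txt"))  # non-ASCII member name
    print(utf8_flag_set("plain.txt"))   # ASCII-only member name
```

The repair only works while the wrong decoding was lossless; once a tool has replaced bytes it could not map, the original name is gone.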


I would suggest that you tar and gzip (or bzip2) your archives. Will
these work on Windows?
Try extracting the files with 7-Zip. I would appreciate it if you
could report back on this.
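
A minimal sketch of the tar-and-gzip route, assuming Python's standard tarfile module and its pax format, which stores non-ASCII member names as UTF-8. The function name is made up for illustration. It only checks that names survive a round trip on the same machine; whether a given Windows extractor honours them is exactly the open question above.

```python
import io
import tarfile


def roundtrip_names(names):
    """Pack empty members with the given names into an in-memory
    .tar.gz (pax format) and return the names read back from it."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz",
                      format=tarfile.PAX_FORMAT) as tf:
        for name in names:
            info = tarfile.TarInfo(name)
            info.size = 0
            tf.addfile(info, io.BytesIO(b""))
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r:gz") as tf:
        return tf.getnames()


if __name__ == "__main__":
    print(roundtrip_names(["αρχείο.txt", "文件.txt"]))
```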


Speaking of 7-Zip, the 7z format is another option as well.

Simos


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Using UTF-8 console on Linux

2007-08-03 Thread Simos Xenitellis
On Fri, 03-08-2007 at 22:31 +0100, Ken Moffat wrote:
> On Tue, Jul 31, 2007 at 05:56:24PM +0200, Egmont Koblinger wrote:
> > On Tue, Jul 31, 2007 at 04:36:44PM +0100, Rui Santos wrote:
> >
> > > In my quest, I'd like to use UTF-8 in all consoles. I almost did
> > > it, except for a little detail: I cannot use any kind of accents with
> > > any of my letters. Here is what I do
> > >
> > > loadkeys /usr/share/kbd/compose.winkeys
> > > loadkeys /usr/share/kbd/compose.latin1.add
> >
> > Composing characters don't work with UTF-8, but a patch exists (not tested by
> > me). It was mentioned last week on the kernel list:
> > http://marc.info/?l=linux-kernel&m=118531371404736&w=2
>
> I'm sorry to be pedantic on my first post to this list, but it's
> only the non-latin-1 composing and dead keys which don't work.  Not
> trying to minimise the scope of the problem, I'd love to be able to
> type in more languages at the console.
>
> I've put an example British keymap at
> http://homepage.ntlworld.com/zarniwhoop/uk-utf.map - if a dead key
> in the standard xorg layout can be made to work on the console, it
> uses it (so, for example, dead acute works on a,e,i,o,u only), and
> there are some other variations.

That should be 
http://homepage.ntlworld.com/zarniwhoop/console/uk-utf.map

The keymap references a couple of files, unicode.map and compose.latin1.
Are they part of a distribution of console-data? (I am using Ubuntu.)

During the last attempt to get compose support in the kernel at LKML,
the response was that the kernel console support was supposed to provide
facilities for emergency usage only (serial debugger, access to fsck,
etc). Therefore, the patch described at
http://www.advogato.org/person/simosx/diary.html?start=2
although it solved the problem, got rejected. (Also the patch was
somewhat of a hack in the way it solved the problem).

Simos






Re: Using UTF-8 console on Linux

2007-07-31 Thread Simos Xenitellis
On Tue, 31-07-2007 at 17:56 +0200, Egmont Koblinger wrote:
> On Tue, Jul 31, 2007 at 04:36:44PM +0100, Rui Santos wrote:
>
> > In my quest, I'd like to use UTF-8 in all consoles. I almost did
> > it, except for a little detail: I cannot use any kind of accents with
> > any of my letters. Here is what I do
> >
> > loadkeys /usr/share/kbd/compose.winkeys
> > loadkeys /usr/share/kbd/compose.latin1.add
>
> Composing characters don't work with UTF-8, but a patch exists (not tested by
> me). It was mentioned last week on the kernel list:
> http://marc.info/?l=linux-kernel&m=118531371404736&w=2

There has been a discussion about this on this list; a summary is at
http://www.mail-archive.com/linux-utf8@nl.linux.org/msg04900.html

Hope this helps,
Simos

> Standalone accents should work, but you may need to pass a -u/--unicode
> option to loadkeys. (Not mentioned in the loadkeys manual, but printed by
> loadkeys --help.)
>
> I'm using something like this; this should work for you too:
>   echo -en '\033%G'
>   kbd_mode -u
>   setfont lat2-16 -m 8859-2
>   loadkeys -u hu
> If you press the keys, some accented vowels should appear.
>
> Furthermore, brand new in 2.6.22: there's a file called default_utf8 (or
> something similar) somewhere under /proc, echo 1 to it and your newly
> allocated or reset terminals will automatically be UTF-8 so you won't need
> that \033%G.
>
> bye,
>
> Egmont





Re: Cuneiform, and how to make the fonts work

2007-03-29 Thread Simos Xenitellis
On Thu, 2007-03-29 at 04:37 -0400, William J Poser wrote:
> For re-encoding a font to Unicode using FontForge (formerly called
> pfaedit), I have a little tutorial here:
> http://billposer.org/Linguistics/Computation/Reencoding/HowTo.html.
>
> If you do do this, please make the re-encoding available as I'm sure
> other people would like to use it.

Thanks Bill.
The guide is really useful!

Simos




Re: How to enter accented UTF-8 character on GNOME terminal

2007-03-25 Thread Simos Xenitellis

Since you use GNOME, you can either enable a keyboard layout that has
those characters (such as US International),
http://ubuntuguide.org/wiki/Ubuntu_Edgy#How_to_type_extended_characters
or use compose sequences (no need to enable a special keyboard layout),
http://ubuntuguide.org/wiki/Ubuntu_Edgy#How_to_set_the_Compose_key_to_type_special_characters

Simos

On Sat, 2007-03-24 at 13:03 -0400, William J Poser wrote:
> For entering non-ASCII characters, I use three techniques:
>
> (a) when the characters are part of a set used routinely, e.g.
> the alphabet of French, install a keyboard map specifically
> for that language (or, e.g., for ISO-8859-1, which includes it);
>
> (b) at the other extreme, when the character is some random character
> for which I have a one time need, use gucharmap, or, what is
> often quicker, look it up in my copy of the Unicode Consortium
> file NamesList.txt (http://unicode.org/Public/UNIDATA/NamesList.txt)
> and enter the character via its hex code using any of several
> methods depending on where I want to put it.
>
> (c) for the intermediate case, of characters that I use with some
> frequency but that aren't part of some language's writing
> system or where it isn't convenient to switch to a separate
> keyboard, I use a character entry utility of my own, available
> at: http://billposer.org/Software/CharEntry.html
> This works something like gucharmap, but instead of presenting
> all of Unicode it provides clickable charts of selected sets of
> characters: (a) the consonants of the International Phonetic
> Alphabet; (b) the IPA vowels; (c) a large set of roman letters with
> diacritics; and (d) a set of combining diacritics. There is also
> a widget that accepts hex codes. You can also define custom
> clickable character charts by reading a definition from a simple
> text file (basically each line consists of the hex code and
> the gloss to appear in the tool tip).
>
> Bill
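
Technique (b) above, looking a character up by name and entering it by hex code, can also be sketched with Python's standard unicodedata module, which carries the same character names as the Unicode data files. The two helper names are made up for illustration.

```python
import unicodedata


def hex_code_for(name: str) -> str:
    """Return the code point, as U+XXXX, of the character with this
    Unicode name (raises KeyError for an unknown name)."""
    return f"U+{ord(unicodedata.lookup(name)):04X}"


def char_from_hex(code: str) -> str:
    """Return the character for a hex code such as '00E9'."""
    return chr(int(code, 16))


if __name__ == "__main__":
    print(hex_code_for("LATIN SMALL LETTER E WITH ACUTE"))
    print(char_from_hex("00E9"))
```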
  
 
 




IRC Log (Was: Re: IRC Meeting ([EMAIL PROTECTED]) tomorrow Friday, 14Jul06, 20:00 GMT: Fonts; choosing fonts; fonts.conf; fontconfig)

2006-07-15 Thread Simos Xenitellis
Dear All,

The IRC session took place yesterday and the IRC log is available at
http://wiki.freedesktop.org/wiki/Software_2fFonts_2fConfiguration

At the end of the page there is a summary that is being built; feel free
to help out summarising the session and extracting actions to do.

There was interest in having font meetings more regularly, for example
in an IRC channel dedicated to font issues. We are looking into
getting #fonts on Freenode. If you can help out with this (it's already
registered), it will be appreciated.

Thanks to everyone for taking part,
Simos






IRC Meeting ([EMAIL PROTECTED]) tomorrow Friday, 14Jul06, 20:00 GMT: Fonts; choosing fonts; fonts.conf; fontconfig

2006-07-13 Thread Simos Xenitellis
Dear All,

I would like to announce an IRC meeting that will take place tomorrow
Friday, 14th July 2006, at 20:00 GMT, at #freedesktop on Freenode (IRC).

To find the exact local time for your country, see
http://www.timeanddate.com/worldclock/fixedtime.html?year=2006&month=7&day=14&hour=20&min=0&sec=0
For example, if you are in Paris, the meeting is on Friday at 10:00pm.
If you are in Asia/Australia, it will probably be inconvenient. Please
mail me about it. If there is enough interest (5 e-mails), we can
arrange a repeat meeting. Tell me what time is suitable for you. This is
particularly important for Indic/CJK/etc font issues.

The expected duration is 1 hour. Based on interest and support, we may
arrange extra sessions. 

The agenda includes:
1. discussion on basic issues on fonts; font licenses; 
2. building a list of 'desirable' FLOSS fonts for each language/script;
tell us your preference; promote your preference
3. build an 'optimal' fonts.conf file (+ suggest
snippets for fonts.d/); LSB common font repository
4. discuss the proposed fontconfig patch for granular font selection,
http://lists.freedesktop.org/archives/fontconfig/2006-June/002332.html
5. discuss writing a patch for fontconfig to disregard glyphs from
fonts at a very low level (as if the font did not have those glyphs in
the first place)

The webpage of this discussion is at
http://wiki.freedesktop.org/wiki/Software_2fFonts_2fConfiguration

The discussion (irclog) will be saved at the above URL as well.

Extra reading
1. Call to test the DejaVu fonts
http://fedoraproject.org/wiki/Fonts/DejavuFeedbackCall
2. The Open Font License (OFL) by SIL International
http://scripts.sil.org/OFL
3. Fonts page on the Freedesktop Wiki
http://wiki.freedesktop.org/wiki/Software_2fFonts
4. OpenFonts page on the Ubuntu Wiki
https://wiki.ubuntu.com/OpenFonts
5. Fonts management
https://wiki.ubuntu.com/FontManagement
6. Font issues across distros
http://fedoraproject.org/wiki/Fonts/FontMusings
7. Enhancing pango/fontconfig to help solve font issues
http://sourceforge.net/mailarchive/message.php?msg_id=18518811
8. re: fontconfig support to exclude glyphs from fonts
http://lists.freedesktop.org/archives/fontconfig/2006-June/002332.html
9. Open source casts new mold for type design (recent font article)
http://news.com.com/2102-7344_3-6092398.html?tag=st.util.print

A similar IRC meeting took place last week, on input methods and
multilingual writing support in Xorg; the discussion is available at
http://wiki.freedesktop.org/wiki/KeyboardInputDiscussion

Please forward this announcement where you feel appropriate.

Hope to see you tomorrow,
Simos






Re: Experiments with classical Greek keyboard input

2006-05-10 Thread Simos Xenitellis

Jan Willem Stumpel wrote:

> Joe Schaffner wrote:
>
> > After lengthy consideration, I have come to the conclusion xkb
> > [..] only maps keyboard events to keysyms, which are not
> > characters
>
> Many of them really are just characters.
>
> > I have these two keymaps i.e. groups on my system:
> >
> > /etc/X11/xkb/symbols/el -- The one I'm using
> >
> > /etc/X11/xkb/symbols/gr -- The dirty bastard
>
> Isn't this dirty bastard /etc/X11/xkb/symbols/pc/gr? Which version
> of X do you have?
>
> > include "el(extended)"
>
> This shows that you are really using both, because gr includes el.
> BTW in newer versions of X there is no el, only the dirty bastard.

Now the official one is gr. el is an alias for gr, to let old
configurations continue to work.
This is in Xorg 7.0+ and xkeyboard-config; in earlier Xorg your mileage
may vary.

> > key.type = "THREE_LEVEL";
> >
> > key <AD11> {[], [ dead_tilde, dead_diaeresis, dead_macron ]};
> > key <AD12> {[], [ dead_iota,  VoidSymbol,     dead_breve  ]};
> >
> > key <AC10> {[], [ dead_acute, dead_horn   ]};
> > key <AC11> {[], [ dead_grave, dead_ogonek ]};
> >
> > };
> >
> > I assume the list of keysyms captures the shifted state of the
> > key i.e. dead_acute is on the semi-colon key and dead_horn
> > is on the same key, shifted, the colon key.
>
> Yes, and in the case of three-level keys, the third level is
> accessed by the AltGr key (right-alt, most probably). So that's
> how you get the dead macron etc.
>
> Some keys might be four-level, in which case the fourth level is
> accessed by means of Shift-AltGr.
>
> > dead_grave is on the single-quote key and dead_ogonek is on
> > the double-quote key.
> >
> > That's a pretty good layout. I like it.
> >
> > Why not name these keysyms dead_psili and dead_dasia?
>
> Because these names are not known to the system. However, all
> UTF-8 characters are known to the system by default, having
> names beginning with U. So the designer of this layout could, and
> in my opinion should, have called them U0313 (for the dead psili)
> and U0314 (for the dead dasia).
  
The U notation for Unicode characters in the Compose file should be 
edited so that any numbers have 0x1000 added to them.
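
For reference, X11 gives a Unicode character that has no legacy keysym the value 0x01000000 plus its code point, so a name such as U0313 resolves to keysym 0x01000313. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and the sketch deliberately ignores the Latin-1 range, where keysyms equal the code points directly.

```python
UNICODE_KEYSYM_OFFSET = 0x01000000  # X11 offset for Unicode keysyms


def unicode_keysym(name: str) -> int:
    """Turn a U-notation name such as 'U0313' into its keysym value.

    Sketch only: it does not special-case code points below 0x100,
    which X11 maps to keysyms with the same numeric value.
    """
    if not name.startswith("U"):
        raise ValueError(f"not a U-notation name: {name!r}")
    return UNICODE_KEYSYM_OFFSET + int(name[1:], 16)


if __name__ == "__main__":
    for name in ("U0313", "U0314"):  # psili and dasia
        print(name, hex(unicode_keysym(name)))
```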

For more on this, and the chance to try out such an updated Compose file, see
https://bugs.freedesktop.org/show_bug.cgi?id=5129

I did not manage to try the file myself, as I run Breezy (oldish Xorg 6.8.2).
In Xorg 6.8.2 on Breezy I have problems typing psili, daseia and
several other combinations based on these. I think this relates to the
merging of the Greek Compose file into the common international one.
See
https://launchpad.net/distros/ubuntu/+source/gtk+2.0/+bug/21637
for more.
If someone has Xorg 7.0 and wants to try it out, please do and report back.

> This would have avoided the need for a special Greek Compose file,
> the existence of which is just a bother, ergo censeo delendam
> esse. There already exists an international Compose file (it is
> called the US file but it is really international), which serves
> all languages, including ancient and modern Greek, and which knows
> how to combine U0313 and U0314 with Greek letters and with other
> accents.

I second that.

> > Anyway, I activate the gr keymap like this:
> >
> > setxkbmap "us,gr(polytonic)" -option grp:alt_shift_toggle
> >
> > The command syntax is troublesome. There seem to be other ways
> > of doing it. Maybe I'm wrong, but it seems to work.
>
> You can put the keyboard options in the X configuration file
> (/etc/X11/xorg.conf, or /etc/X11/XF86Config-4).
>
> > [..] Yes, I can enter greek characters. The dead_acute seems to
> > work, but I am not sure if it is outputting a tonos or an acute.
> > It's probably a tonos.
>
> It should be, because having a separate acute is not considered
> correct anymore. The fonts you use should display the tonos as an
> acute. But if you really want to have the separate acute (oxia),
> there are ways.
>
> > None of the other dead keys seem to work.
> >
> > Any ideas?
>
> All the dead keys can be made to work. It is not magic; it is not
> even difficult. I apologise for blowing my own horn, but perhaps
> you really should read the bits relating to keyboard and Greek
> on http://www.jw-stumpel.nl/stestu.html.
>
> > It would be nice to see the entire character map in the same
> > place.
>
> To get a picture of your character map (or maps, if you have
> defined multiple maps) you could try
>
>   xkbcomp -xkm $DISPLAY
>   xkbprint server-0_0.xkm server-0_0.eps
>
> The resulting file, server-0_0.eps, can be viewed with gv. This
> xkbprint system seems a little bit flaky, though. You may have
> difficulty actually printing the map.

You can also use xev. Run it from the command line and give focus to
the xev window.
Switch the keyboard to Greek Polytonic and type ancient Greek. You will
be able to see the individual characters being sent. You will also be
able to see whether GTK+ filters and cuts off any dead keys.

There are some patches for GTK+ to add support for Greek polytonic
(it actually synchs Compose-Xorg with GTK+).
If you are the compile type of person (Gentoo?), try out

Re: Experiments with classical Greek keyboard input

2006-02-06 Thread Simos Xenitellis
On Mon, 2006-02-06 at 21:58 +0100, Jan Willem Stumpel wrote:
> Imitating the difficult-to-learn Windows system for 'multiple
> diacriticals' should IMHO be offered as an option, but not as the only

I am not sure what complexities the Windows keyboard layout has that
make it difficult to re-implement as an extra layout in Xorg. My
understanding is that it defines too many dead keys, because of a
limitation on stacking dead keys together.

> option. The ease with which diacriticals can be combined by means of
> xkb/Compose could be a 'Linux selling point' in the academic world.
>
> BTW I am now terribly confused about the tonos/oxia issue.
>
> -- Tonos and oxia are considered equivalent in Unicode - but why,
>    then, are there different code points for them (U+1FFD, and all
>    the letters with oxia, vs. U+0384 and all the letters with
>    tonos)? Where does it actually say that they are equivalent?

It is at
http://www.unicode.org/charts/PDF/U1F00.pdf

For example, see 1F71, Greek Small Letter Alpha with Oxia.
The three horizontal bars show equivalence between characters.
It shows that 1F71 == 03AC.

Such equivalences are common; compatible software should take care of
them for the end-users and fold characters to their canonical
equivalents.
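
The 1F71 == 03AC equivalence can be checked, and the folding demonstrated, with Python's standard unicodedata module. This is a minimal sketch; the function name same_text is made up for illustration, and NFC is chosen here as one reasonable normalisation form, not the only one.

```python
import unicodedata

ALPHA_OXIA = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA
ALPHA_TONOS = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS


def same_text(a: str, b: str) -> bool:
    """Compare two strings after folding canonically equivalent
    characters together with NFC normalisation."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # The character database records 1F71 as decomposing to 03AC.
    print(unicodedata.decomposition(ALPHA_OXIA))
    print(ALPHA_OXIA == ALPHA_TONOS)           # raw comparison
    print(same_text(ALPHA_OXIA, ALPHA_TONOS))  # normalised comparison
```

A search engine or editor that compares raw code points sees two different letters; one that normalises first sees the same letter.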

> -- Many (maybe most) font creators made different glyphs for oxia
>    and tonos (although others did not, see the Gentium font), because
>    they were looking at Unicode. But, surely, that was the correct
>    place to look?

Unicode does not dictate how fonts should look. See the Fonts section at
http://www.unicode.org/charts/PDF/U1F00.pdf
The selected font was merely a font donated for this purpose.

> -- Kostas calls it a bug of the fonts. If there is a bug, isn't it
>    in the Unicode standard?

I am not sure about the background of this; I think it has to do with
different schools of thought on what original documents looked like.

> I hope there is a way to put the genie back into the bottle. Just making
> the keyboard entry for oxia hard, forcing people not to use it, does
> not seem to be the right way.

The choice is between
1. not providing a way for people to type 1F71 and the other vowels
with oxia (the current situation);
2. providing a way to type vowels with oxia.

The preference is to move to choice 2, so that a user who wants this
option is free to use it.
Giving equal exposure to both oxia and tonos can create a mess in
documents. That's why oxia should be somewhere far away, not on a nearby
dead key.

Google does not yet normalise texts so that these equivalent characters
are treated the same.

Simos





[Fwd: Re: Experiments with classical Greek keyboard input]

2006-02-03 Thread Simos Xenitellis

Dear All,
This e-mail appears not to have made it to the list (Kostas is probably
not subscribed), so I am forwarding it, as it contains some interesting
information.

Simos

-------- Forwarded Message --------
From: Πιστιόλης Κωνσταντίνος (Konstantinos Pistiolis) pistiolis at ts dot sch dot gr
To: Simos Xenitellis simos74 at gmx dot net,
linux-utf8@nl.linux.org
Subject: Re: Experiments with classical Greek keyboard input
Date: Tue, 31 Jan 2006 22:11:05 +0200

On Mon, 30 Jan 2006 19:05:26 +, Simos Xenitellis
[EMAIL PROTECTED] wrote:

> Jan Willem Stumpel wrote:
> > Simos Xenitellis wrote:
> >
> > > You can have a look at this document,
> > > http://planet.hellug.gr/misc/polytonic/ Although it is in Greek, it
> > > should be feasible to discern the combinations proposed. For example,
> > > "Νεκρό πλήκτρο" is "Dead key" in the list. If there are queries, feel
> > > free to refer to me.
> >
> > Very interesting. Is this a proposal, or has it been implemented?
> > According to Babelfish, you say "Your distribution of Linux that
> > has been published after October 2005 should include the renewed system
> > that we describe here." Mine does not, but I don't trust the Babelfish
> > translation.
>
> The referenced document is indeed a proposal.
> You are correct about October 2005. Several distributions were released
> in October (Ubuntu, OpenSUSE), so the plan was to have the changes
> upstream by the end of the summer so that they would move to the new
> distributions as they appeared.
> However, this plan did not work out and we still have not submitted
> these changes.
> Konstantinos Pistiolis is working on this subject.
>
> > As far as I can see, it would not be difficult to implement it. Nothing
> > would have to be changed in the binaries, only in the xkb and Compose
> > files.
> >
> > I noticed you only want to use 'two level' keys (normal and shift), not
> > using AltGr. Is this some kind of standard? (e.g. Greek national
> > standard, or some other kind of standard)? The present pc/gr file in xkb
> > uses 'three level' keys.
>
> As far as I know there is no national standard for Greek polytonic.
> Windows XP supports Greek polytonic; however, it has the inherent
> disadvantage that you cannot stack more than one dead key, and because
> of this quite a lot of keys have to be used as dead keys. In addition,
> if a character accepts more than one diacritic, then you need three
> dead keys to cover all the cases (diacritic A, diacritic B,
> diacritic A+B).

If anything could count as a standard, it is the old typewriter layout
(computers were not used for text processing at the time polytonic was
removed from modern Greek), but it did not cover the full polytonic
system because it had no vareia (grave), makron or vrahy. It was used
for modern Greek rather than ancient Greek.
That keymap defines a dead key for every combination, and is more or less
followed by Windows XP, using 16 or more dead keys!

However, the proposed keymap uses the same principles and only needs 9
dead keys.

> Regarding the use of AltGr, there have been quite a few discussions on
> whether to use it or not. I do not have the full details at my disposal.
> Kostas, would you like to chip in on this?

The accents, the dead iota and the breathing marks shouldn't use it:
1. Most of the dead keys are used too often to be put on the third
level (except for makron and vrahy). Each symbol is used approximately
once every 3-5 words!
2. The AltGr chooser was not used in the old typewriter standard.
In fact, all symbols (except vareia = grave) have a position in
the old typewriter standard, which is preserved in the proposed keymap.

About makron and vrahy, I have proposed putting them on ] and } and not
as an AltGr combination, as the opening [ and { are already occupied
as dead keys (~ and iota subscript, in accordance with the typewriter
standard).
The idea is that it would not be bad to lose the closing brace if
the opening brace is lost too, and it would save the AltGr+dead_key
combinations for future use (see below).


The other symbols (ancient Greek numbers) are also needed in modern
(monotonic) Greek, and could be added either as AltGr combinations,
or composed with dead acute, or even in both ways, e.g.:
AltGr + sigma        : numeric stigma
or
dead tonos + sigma   : numeric stigma
I don't know if the latter odd combination would produce conflicts in
an international Compose file, but this idea was used in the past in
the Greek keyboard, in the following combinations:
dead_tonos + .  : above (middle) dot
dead_tonos + <  : «
dead_tonos + >  : »
I believe that Compose should actually be part of the keymap,
not the locale. Dead keys are very good sticky third-level choosers for
languages that use them.
The present pc/gr file uses AltGr for the euro symbol, the middle dot
and the «» symbols, along with the Compose combinations, and I suggest
the same (duality) for all new symbols.

Another idea is to use the same kind of rules to increase the usability
of the polytonic keyboard for writing

Re: Experiments with classical Greek keyboard input

2006-01-30 Thread Simos Xenitellis

Thomas Wolff wrote:
> I've only followed this discussion partially because I'm not familiar
> with ancient Greek, but I noticed a few things.
>
> Jan Willem Stumpel wrote:
>
> > Proposal (I tested this, with the small alpha only, and it seems to
> > work):
> >
> > -- Greek (modern and ancient) should use the common (international)
> >    Compose file.
> > -- The international Compose file should have different definitions for
> >    letters with simple tonos and letters with simple oxia. At present,
> >    the Compose file has
> >
> > <dead_acute> <Greek_alpha> : "ά" U03AC # GREEK SMALL LETTER ALPHA WITH TONOS
> >
> >    (and grep "GREEK SMALL LETTER ALPHA" Compose|grep -v AND|grep OXIA
> >    gives nothing!)
>
> It should actually list the following two entries from Unicode data:
> 1F71;GREEK SMALL LETTER ALPHA WITH OXIA;Ll;0;L;03AC;;;;N;;;1FBB;;1FBB
> 1FBB;GREEK CAPITAL LETTER ALPHA WITH OXIA;Lu;0;L;0386;;;;N;;;;1F71;
>
> I guess that's due to the following comments quoted from
> en_US.UTF-8/Compose (SUSE Linux 10.0):
>
> # Part 2
> # Compose map for Korean Hangul(Choseongul) Conjoining Jamos  automatically
> # generated  from UnicodeData-2.0.14.txt at
> #ftp://ftp.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt
> #   by Jungshik Shin [EMAIL PROTECTED]  2002-10-17
>
> This means the Compose data are quite outdated (Unicode 2.0!) and should
> be updated.
>
> Jungshik Shin, would you provide us with the script or program that you
> used to generate these entries automatically? That would be much
> appreciated.
> Actually, I would also like to equip my editor mined (http://towo.net/mined)
> with compose data automatically generated from Unicode data. I could
> do that myself but Jungshik Shin's contribution would help.
>
> Also, the following information would help:
> * What are the preferred keys that users would like to use to enter
>   oxia, tonos, etc. as accent prefix or combination keys?
> * Are any common keys (like quote mark, grave, acute) typically
>   associated with Greek accents or is that rather random and subject
>   to individual preference?
> * Are any common keyboard mappings in use that set some de facto standard
>   here? What are their mappings?
>
> If someone would answer these questions in a generic way (i.e. not
> referring to X key names or mappings or even the more mysterious X
> keyboard configuration properties), I would be grateful.
> (I admit the questions are a little bit redundant, trying to achieve
> the same result under different aspects.)
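
Generating compose data from Unicode data, as asked for above, can be sketched roughly as follows. This is not Jungshik Shin's script, only a minimal illustration assuming Python's standard unicodedata module. It writes every key in U-notation instead of mapping marks to dead_* keysym names, and it emits the marks in canonical order, which is not necessarily the order a user would type them; the function name is made up.

```python
import unicodedata


def compose_line(ch: str):
    """Build one Compose-style line for a precomposed character from
    its canonical decomposition, or return None if it has none."""
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) < 2:
        return None
    base, marks = decomposed[0], decomposed[1:]
    # Marks first (as dead keys would be typed), then the base letter.
    keys = [f"<U{ord(m):04X}>" for m in marks] + [f"<U{ord(base):04X}>"]
    sequence = " ".join(keys)
    return f'{sequence} : "{ch}" U{ord(ch):04X} # {unicodedata.name(ch)}'


if __name__ == "__main__":
    for ch in ("\u03ac", "\u1f71"):  # alpha with tonos, alpha with oxia
        print(compose_line(ch))
```

Both characters come out with the same key sequence, which is the tonos/oxia collision discussed in this thread: a generated Compose file has to pick one of them for that sequence.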
  

You can have a look at this document,
http://planet.hellug.gr/misc/polytonic/
Although it is in Greek, it should be feasible to discern the
combinations proposed. For example, "Νεκρό πλήκτρο" is "Dead key" in the
list.

If there are queries, feel free to refer to me.

The Compose file should be broken up into smaller files per script
rather than having a big monolithic file.
There is increasing interest in updating this area of Xorg
(http://community.livejournal.com/xkbconfig/) and I hope it gets done soon.


Simos




Re: Experiments with classical Greek keyboard input

2006-01-30 Thread Simos Xenitellis

O/H Jan Willem Stumpel έγραψε:

Simos Xenitellis wrote:

  
You can have a look at this document, 
http://planet.hellug.gr/misc/polytonic/ Although it is in Greek, it

should be feasible to discern the combinations proposed. For example,
Νεκρό πλήκτρο is Dead key in the list. If there are queries, feel
free to refer to me.



Very interesting. Is this a proposal, or has it been implemented?
According to Babelfish, you say Your distribution of Linux that
has been published after October 2005 should include the renewed system
that we describe here. Mine does not, but I don't trust the Babelfish
translation..
  

The referenced document is indeed a proposal.
You are correct about October 2005. Several distributions were released 
in October (Ubuntu, OpenSUSE) so the plan was to have the changes 
upstream by the end of the summer so that they move to the new 
distributions as they appear.
However, this plan did not work out and we still have not submitted these 
changes.

Konstantinos Pistiolis is working on this subject.

As far as I can see, it would not be difficult to implement it. Nothing
would have to be changed in the binaries, only in the xkb and Compose
files.

I noticed you only want to use 'two level' keys (normal and shift), not
using AltGr. Is this some kind of standard? (e.g. Greek national
standard, or some other kind of standard)? The present pc/gr file in xkb
uses 'three level' keys.
  
As far as I know there is no national standard for Greek polytonic. 
Windows XP supports Greek polytonic; 
however, it has an inherent disadvantage: you cannot stack more 
than one dead key. Because of this, 
quite a lot of keys have to be used as dead keys. In addition, if a 
character accepts more than one diacritic, 
then you need three dead keys to cover all the cases (diacritic A, 
diacritic B, diacritic A+B).


Regarding the usage of AltGr: there have been quite a few discussions on 
whether to use it or not. I do not have the full details at my disposal.

Kostas, would you like to chip in for this?

BTW I suppose when you say that tonos/oxia is on the ; key, you mean the
key which is ; on US keyboards, not the key which is ; on Greek keyboards?
  

Indeed, it is the physical key that is ; on the US keyboard.
The proposal document does not include a specific dead key to produce 
oxia. In the Windows XP layout there is such a dead key,
in an uncomfortable location however, for those end-users who would like 
to use it.
  

The Compose file should be broken in smaller files per script
rather than having a big monolithic file.



What advantage would this bring? If we have many small pieces of the
Compose file, how is the user (or the system) supposed to decide when to
use which piece? Wouldn't this create another configuration problem?
  
The configuration mechanism of Xorg would shield the end-user from this 
complexity. I am referring to the needs of the developers.
For example, suppose a lesser known language wants to make an 
installable package that adds writing support. The way this could be 
done is by dropping (adding) the appropriate files in the appropriate 
directory. Otherwise, there would be need to patch the monolithic file.
In addition, the Polytonic section in the Compose file is suitable to be 
auto-generated by a script, as the multiple diacritics on vowels give rise 
to many combinations.
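
To illustrate the auto-generation idea, here is a rough Python sketch. It lets Unicode normalisation work out which precomposed character each breathing/accent combination yields and prints Compose-style lines. The dead-key keysym names (dead_psili, dead_dasia and so on) and the choice of three vowels are assumptions made for this example, not what the layouts discussed here actually use.

```python
import itertools
import unicodedata

# Combining marks used by polytonic Greek, paired with the dead-key keysym
# assumed to produce each one. The keysym names are an assumption for this
# sketch; the layouts discussed in this thread used other keysyms.
BREATHINGS = {"dead_psili": "\u0313", "dead_dasia": "\u0314"}
ACCENTS = {"dead_acute": "\u0301", "dead_grave": "\u0300", "dead_tilde": "\u0342"}
VOWELS = {"Greek_alpha": "\u03b1", "Greek_eta": "\u03b7", "Greek_omega": "\u03c9"}


def compose(base, marks):
    """Return the single precomposed character, or None if there is none."""
    text = unicodedata.normalize("NFC", base + "".join(marks))
    return text if len(text) == 1 else None


def compose_entries():
    """Yield Compose-style lines for breathing, accent and breathing+accent."""
    marks_for = {**BREATHINGS, **ACCENTS}
    combos = [(k,) for k in BREATHINGS] + [(k,) for k in ACCENTS]
    combos += list(itertools.product(BREATHINGS, ACCENTS))
    for keysym, base in VOWELS.items():
        for combo in combos:
            char = compose(base, [marks_for[k] for k in combo])
            if char is None:
                continue  # no precomposed form exists for this combination
            keys = " ".join(f"<{k}>" for k in combo)
            yield f'{keys} <{keysym}> : "{char}" U{ord(char):04X}'


for line in compose_entries():
    print(line)
```

Extending the vowel table and adding the iota subscript would cover the rest of the block. The point is only that the combinations need not be typed by hand.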

UTF-8 allows using one system for all languages and scripts, without
changing locales. There is only one, IMHO unavoidable, but small,
disadvantage: some files (like fonts, and the Compose file) tend to
become rather big. But memory and disk space are not as expensive as
they used to be. And the user does not notice anything of this. She just
thinks: wow! I can input any language anywhere, at any time!
  
As I mention above, the splitting of the files would be an advantage for 
the developers.
The end-user would only see a GUI configuration tool. No setxkbmap or 
editing of xorg.conf.
There is increasing interest in updating this area of Xorg 
(http://community.livejournal.com/xkbconfig/) and I hope it gets done

soon.



Hmm.. xkb and Compose are two completely different mechanisms. One
is input to the other. People often complain about xkb being
'mysterious' or 'arcane'. Since xfree86 4.3 and x.org came around, it
isn't anymore. It just lacks user-level documentation. Recently, thanks
to this list, I have come close enough to enlightenment to attempt a
user-level description on my utf-8 page, sections 6.1 and 6.2
(http://www.jw-stumpel.nl/stestu).
  

Thanks for this.
We need to put in effort so that gswitchit (Keyboard Indicator applet in 
GNOME) becomes more advanced and more ubiquitous.

The plan is for gswitchit to be used for KDE as well.
This is the proper direction so end-users are happy that their settings 
just work.


Simos





Re: Experiments with classical Greek keyboard input

2006-01-27 Thread Simos Xenitellis
On Thu, 2006-01-26 at 20:29 +0100, Jan Willem Stumpel wrote:
 Simos Xenitellis wrote:
  Jan Willem Stumpel wrote:
 
  This also means that when you run scim, the ogonek and horn
  do not work as breathing signs even if the locale is
  el_GR.UTF-8, because scim's internal copy of the Compose file
  is only the common one.
  
  In addition, when you try to type Greek Polytonic in
  OpenOffice.org, it will not work. The reason is that the
  default Input Method, GTK+ IM does not know yet about these
  dead keys and does not pass them on. Therefore, when selecting
  Greek Polytonic in the X Input Method (XIM), for example, using
  System/Preferences/Keyboard or adding the Keyboard Indicator
  applet, all in GNOME (such as Ubuntu), you have to do first
  
  export GTK_IM_MODULE=xim then run OpenOffice.org
 
 The situation is very complicated, because there are many factors
 which can influence the result.
 
 Yes, it is best to set GTK_IM_MODULE=xim (in Debian, you put this 
 in /etc/environment). Then you can enter polytonic Greek 
 everywhere (using the xkb facilities) *if* the 
 /etc/X11/xkb/symbols/pc/gr has been hacked, *and* the locale is 
 any type of UTF-8 (with the possible exception of el_GR.UTF-8), 
 *and* the application has access to a proper font.

GTK+ 2.x based applications that are linked to pango are in the happy
situation where glyphs from different fonts are grouped together to fill
in the Unicode table. Therefore, if you have at least one font in your
system that has Greek Polytonic support, this will be used for your GTK+
application. For issues like font preference for this, the file
/etc/fonts/fonts.conf (fontconfig) is used which can dictate where to
choose from first.
OpenOffice.org appears to do its own font selection (it does not
obey fontconfig), which causes some pain for Greek. Specifically, if the
selected font in OOo does not have Greek glyphs AND your distribution
has Asian support, Greek glyphs will be chosen from Asian fonts.

 As far as I could find out, with GTK_IM_MODULE=xim,
 xkb-type polytonic Greek works (i.e. you can enter ᾆ) in just 
 about all situations; I tested all 12 combinations of 1-3 and A-D 
 below:
 
 1. No input method framework present
 2. uim present
 3. scim present
 
 A. text mode programs in xterm
AFAIK, xterm uses XIM by default.

 B. mozilla, bluefish
 C. openoffice
Both B and C are based on GTK+, so setting GTK_IM_MODULE to xim simply
directs them to use the standard X Input Method. Any scim/uim/iiimf present
cannot affect these applications when GTK_IM_MODULE is set to xim.

 D. QT programs
QT uses XIM directly, so it is not affected by setting GTK_IM_MODULE.
The QT folks are actually trying to make a QT Input Method, similar to
GTK+ IM.

 
 With scim, at first I thought that there were program types in
 which xkb polytonic Greek did not work. But this is (fortunately) 
 not the case. With scim, you must just take care that the keyboard 
 is set separately to English/European (i.e. direct input, 
 through xkb) for each application.
Indeed, that should be the case.
I did not find Greek Polytonic in either scim or uim, or even iiimf.
There was only modern Greek.

 Some uim and scim docs recommend using GTK_IM_MODULE=uim or
 GTK_IM_MODULE=scim. It seems this is not necessary; 
 GTK_IM_MODULE=xim works in all circumstances.
By setting GTK_IM_MODULE to xim, uim, scim or iiim, you enable
that framework for GTK+ applications. When the variable is set to xim,
none of the other frameworks is active for these GTK+ applications.

 But with the original /etc/x11/xkb/symbols/pc/gr, with or without 
 el_GR.UTF-8 locale, polytonic Greek does not work with scim. I now 

Which distribution are you using?
What changes did you make to the gr file to make it work
for you?
The latest is
http://cvs.freedesktop.org/xlibs/xkbdesc/symbols/gr?view=markup

There is some work to update the settings for Greek Polytonic.
Two thoughts here are: 
1. Place ¨ (dialytika) on the same dead key as with modern Greek.
2. There is no way to type oxia; tonos and oxia are considered
equivalent in Unicode 3.0+ and tonos is preferred. However, if users
would rather have an oxia option, I feel we should provide it.

 think the keyboard action is as follows:
 
 without uim or scim:
 
 keyboard -- xkb -- xlib Compose -- application
 
 with uim:
 
 keyboard -- xkb -- xlib Compose -- uim -- application
 
 and with scim:
 
 keyboard -- xkb -- scim's own Compose -- scim -- application

I think that when one sets GTK_IM_MODULE for GTK+ applications, one
injects a framework between keyboard and xkb. 

Do you consider the key combination that switches between layouts as
part of xkb or xlib Compose?  An important issue with all these
frameworks is that they make it difficult to have a single interface for
the end-user to use irrespective of the language she speaks.

Simos






Re: Experiments with classical Greek keyboard input

2006-01-25 Thread Simos Xenitellis

Jan Willem Stumpel wrote:

Alexandros Diamantidis wrote:

[sorry for taking a few days to reply...]

* Jan Willem Stumpel [2006-01-18 14:41]:

This does not work in my case. Also interchanging the entries (US 
first,

then GR) did not work. I mean you can get the accents, but not the
breathing signs. Strangely enough, even calling

LANG=el_GR.UTF-8 xterm

and then doing things in the new xterm, did not work! I don't 
understand

why. I have the el_GR.UTF-8 locale installed.



I really wonder why... I thought if you had a ~/.XCompose file, your
locale didn't matter (except if you specifically used it in that file,
by doing 'include %L'). Maybe it's not used at all?


I think it did not work because I am trying out scim and uim. I must
have been running scim at the time. It seems that when scim is running,
only its own internal version of the compose file is used. 
Customisations in ~/.XCompose do not work at all. With uim, they work. 
The Greek entry must of course come second in the ~/.XCompose file. I 
do not know how uim does it.


This also means that when you run scim, the ogonek and horn do not 
work as breathing signs even if the locale is el_GR.UTF-8, because 
scim's internal copy of the Compose file is only the common one.
In addition, when you try to type Greek Polytonic in OpenOffice.org, it 
will not work.
The reason is that the default Input Method, GTK+ IM does not know yet 
about these dead keys and does not pass them on.
Therefore, when selecting Greek Polytonic in the X Input Method (XIM), 
for example, using System/Preferences/Keyboard or adding the Keyboard 
Indicator applet, all in GNOME (such as Ubuntu), you have to do first


export GTK_IM_MODULE=xim
then run OpenOffice.org

In standard GNOME applications you can switch to the X Input Method 
(XIM) by right-clicking in any text box and selecting XIM from the 
context menu.


Simos
p.s.
Did I write about this in a previous e-mail?
The GTK+ IM bug with not supporting Greek Polytonic is at
http://bugzilla.gnome.org/show_bug.cgi?id=321896
and we are stuck on how to interpret some additions in the Compose file.




Re: Experiments with classical Greek keyboard input

2006-01-18 Thread Simos Xenitellis

Jan Willem Stumpel wrote:


Alexandros Diamantidis wrote:


When I made an initial try at a polytonic Greek keyboard, I couldn't
find a dead_comma_above and a dead_reversed_comma_above, so I just
(ab)used the first two keysyms that weren't otherwise meaningful on a
Greek keyboard. Subsequent updates to the Greek keyboard layout and
Compose files kept this (perhaps not strictly correct) arrangement.



This xkb stuff is not so easy to understand, but Alexandros' and Jim's
comments helped a lot.

I have so far always used a us_intl keyboard layout in order to enter
accents. This needs the AltGr key to change groups when a key must
produce more than 2 symbols.

But there is also a variant called alt-int of the us keyboard, which
uses extra levels (instead of a new group) to get the same effect. The
AltGr key is used to make the 3rd level. BTW I still don't know what to
press for the 4th level.

From the user's point of view, the behaviour of us_intl and
us(alt-intl) is exactly the same. You get all the accents (dead keys),
the Euro sign, etc. in the same way with both methods. But us(alt-intl)
does not use an extra group. So the groups can be used for other
languages (so you do not need to switch groups, only toggle them).

I found the following combination works nicely:

setxkbmap 'us(alt-intl),gr(polytonic)' \
 -option compose:rwin \
 -option grp:lwin_toggle

With this, left-Windows toggles between us(alt-intl) and polytonic Greek
mode. All characters, including things like ᾦ, can be made in Greek
mode, even in en_GB.UTF-8 locale, if the dead ogonek and horn in the
symbols/pc/gr file are replaced by the utf-8 characters COMBINING COMMA
ABOVE (0x1000313) and COMBINING REVERSED COMMA ABOVE (0x1000314); the
(default?) US Compose file then has lots of entries for combined Greek
characters.
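
The values 0x1000313 and 0x1000314 quoted above follow the X11 convention that a Unicode character can be used directly as a keysym by adding 0x1000000 to its code point. A tiny Python sketch of that arithmetic (the function name is made up for illustration):

```python
def unicode_keysym(char):
    """Return the X11 keysym value that encodes a Unicode character directly."""
    return 0x1000000 + ord(char)


print(hex(unicode_keysym("\u0313")))  # COMBINING COMMA ABOVE
print(hex(unicode_keysym("\u0314")))  # COMBINING REVERSED COMMA ABOVE
```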

This change would probably break things for Greek users unless the Greek
Compose file is also changed.

Other scripts can be added, e.g us(alt-intl),gr(polytonic),ru.


AFAIK, nowadays Greek uses the en_US.UTF-8 file for dead keys.
Specifically, Greek users of Ubuntu 5.10 have trouble with accents as 
the Greek file (el_GR.UTF-8) with the dead key sequences is not 
installed any more. By changing the configuration file to point to 
en_US.UTF-8, modern Greek works once again.
In addition, the name of the keyboard layout has reverted to gr (the 
country code, as with all other keyboard layouts) from the el that had 
been used for the last few years.


GTK+ has its own input method and requires dead keys to be registered, 
if you use this GTK+ IM input method. If you notice some GTK+ apps not 
working, this is where you investigate. For more on this, see

http://bugzilla.gnome.org/show_bug.cgi?id=321896

X.org has been in transition from the monolithic setup to the modular 
one you find now in X.org 7.0. Due to this,
files are being moved around, so you need to know where you submit 
patches to.
My understanding is that Greek (modern/ancient-polytonic) keysyms should 
come from the generic en_US.UTF-8 and not use a custom one.

The existing en_US.UTF-8 at
http://cvs.freedesktop.org/xorg/xc/nls/Compose/en_US.UTF-8?view=markup
shows that it covers many languages. This file appears to be monolithic 
one.
I will have to look closer to find the modular copy somewhere in the 
source tree. Any hints?


There are clashes with the reusing of dead_acute, dead_ogonek and so on 
in many different languages, causing trouble and conflicts when having a 
single compose file for all languages. I did not see a compelling reason 
against creating more symbol definitions. Are there any?
Now that the transition has taken place, I think patches would get 
accepted for a few more symbol definitions (that's their name, right?).


Indeed, keyboard support for X.org is a bit of a mystery as there 
appears to be nobody who claims expertise and answers questions.
The keyboard support was created by Sun engineers in the early 90s and 
there was a feeling that it was over-engineered. Those engineers have moved 
on to other work areas, some of them still at Sun (irc discussions at #xorg).




Still this setup generates warnings which probably explain why I cannot
reach the 4th level symbols (you see the warnings after closing X), like:

Warning: Type ONE_LEVEL has 1 levels but RALT has 2 symbols
   Ignoring extra symbols
Warning: Type THREE_LEVEL has 3 levels but AC11 has 4 symbols
   Ignoring extra symbols

Now how to fix this?


See
http://www.xfree86.org/current/XKB-Enhancing4.html

When you specify how many levels your keyboard layout will use, the 
table that looks like


 key <AE02>  { [ 2,   quotedbl,  twosuperior,oneeighth ] };
 key <AE03>  { [ 3,   sterling, threesuperior,sterling ] };

should have up to that number of columns.
In your case, somehow, more columns were found so some had to be ignored.
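
To find the offending keys before X complains, something along these lines could be used. It is only a sketch: it assumes simple one-line key definitions in the style shown above and takes the number of levels as a parameter rather than reading the key type from the layout. The function name is made up.

```python
import re

# Matches lines such as:  key <AE02> { [ 2, quotedbl, twosuperior ] };
KEY_LINE = re.compile(r"key\s+<(\w+)>\s*\{\s*\[([^\]]*)\]")


def extra_symbols(layout_text, levels):
    """Return {key name: symbol count} for keys with more symbols than levels."""
    offenders = {}
    for name, body in KEY_LINE.findall(layout_text):
        symbols = [s.strip() for s in body.split(",") if s.strip()]
        if len(symbols) > levels:
            offenders[name] = len(symbols)
    return offenders


sample = """
key <AE02> { [ 2, quotedbl, twosuperior, oneeighth ] };
key <AE03> { [ 3, sterling, threesuperior ] };
"""
print(extra_symbols(sample, levels=3))
```

With a three-level type, AE02 is reported because it lists four symbols, which is the situation the warning describes.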

Simos



Re: Thai Numbering in gnome-doc-utils

2005-12-05 Thread Simos Xenitellis

Theppitak Karoonboonyanan wrote:


Hello,

I'm going to translate gnome-doc-utils into Thai and find
two required Thai numberings are missing. One is Thai alphabetical,
and the other is Thai decimal digits.

Thai alphabetical numbering is run with Thai consonants in the range:

 U+0E01 (THAI CHARACTER KO KAI)
   :
 U+0E2E (THAI CHARACTER HO NOKHUK)

with three characters skipped, namely:

 - U+0E03 (THAI CHARACTER KHO KHUAT)
 - U+0E05 (THAI CHARACTER KHO KHON)
 - U+0E06 (THAI CHARACTER KHO RAKHANG)

(i.e. the sequence is: U+0E01, U+0E02, U+0E04, U+0E07 .. U+0E2E)

This is mainly used for numbering appendixes in Thai
documents, and occasionally used in ordered lists.

Numbering with Thai decimal digits is less used in general,
but exists in most official or military documents. It just uses
Thai digits in the range (U+0E50..U+0E59) for 0..9 respectively.
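
As a concrete reading of the two numbering schemes, here is a short Python sketch. It follows the description given here (consonants U+0E01 to U+0E2E minus the three skipped letters, and Thai digits starting at U+0E50 for 0). The function names are made up, and what should happen once the alphabetical labels run out is not specified above, so it is not handled.

```python
# Consonants U+0E01..U+0E2E, minus the three skipped letters named above.
SKIPPED = {0x0E03, 0x0E05, 0x0E06}
THAI_ALPHA = [chr(cp) for cp in range(0x0E01, 0x0E2F) if cp not in SKIPPED]


def thai_alpha(n):
    """Return the n-th (1-based) Thai alphabetical label."""
    return THAI_ALPHA[n - 1]


def thai_digits(n):
    """Render a non-negative integer with Thai decimal digits."""
    return "".join(chr(0x0E50 + int(d)) for d in str(n))


print([thai_alpha(i) for i in range(1, 5)])
print(thai_digits(2005))
```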

I'm not sure about the digit behavior described by W3C's XSLT,
nor what has been done in gnome-doc-utils, but let me mention
a common mistake in some implementations: the assumed translation
of digits. We would need an explicit way to specify whether to use
Thai digits in numbering, rather than having them translated automatically.

Thank you for your attention. Any comment would be appreciated.
 


That's quite an interesting issue.
I did not notice this information in the locale settings, nor in the
documentation of gettext.
Perhaps the linux-utf8 list is more appropriate for this?
I am cc:ing there as well.

Simos





Re: Command line output mis-alignment.

2005-10-27 Thread Simos Xenitellis

Amarendra Godbole wrote:


Hi,

The output of a command that prints tabular output (with a
tab separator) is susceptible to misalignment across
different languages. Mostly the headers get misaligned with the
column data in a multi-byte language like Japanese. I have been
thinking about this issue for a while, and here are the possible
solutions to it -

1. Space the columns based on the length of the header. For eg.,
  if the column data is "helloworld", then o/p would be -
head1 header2 headline3
-----------------------
hello hellowo helloworl
world rld     d

  Each column wraps. But this approach might break existing
  line-by-line parsing scripts.

2. Space the columns based on the longest length of the column
  data. This shall need two passes - one to find out the longest
  column data, and other to align-and-print the table.

3. Space the columns based on some pre-computation of the change
  in lengths of the English and Japanese equivalent string. For
  eg., if the Japanese string occupies 40% more columns approx.,
  then space the columns accordingly.

4. Leave the issue as-is. :) I have found this approach taken on
  HP-UX, where output of df command gets mis-aligned in Japanese
  locale.
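
Option 2 above (two passes, spacing by the widest cell) might look roughly like this in Python. The sketch counts East Asian wide and fullwidth characters as two terminal columns and everything else as one, which is an approximation: combining characters, for instance, are not treated specially. The function names are made up.

```python
import unicodedata


def display_width(text):
    """Approximate terminal columns: wide/fullwidth characters count as two."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)


def format_table(rows):
    """Pad every cell to the widest cell of its column (two passes)."""
    # Pass 1: find the widest cell in each column.
    widths = [max(display_width(cell) for cell in col) for col in zip(*rows)]
    # Pass 2: pad by display width, not by character or byte count.
    lines = []
    for row in rows:
        cells = [cell + " " * (w - display_width(cell))
                 for cell, w in zip(row, widths)]
        lines.append("  ".join(cells).rstrip())
    return lines


for line in format_table([["\u540d\u524d", "size"], ["a", "10"]]):
    print(line)
```

Padding by display width rather than by byte or character count is what keeps the header and data columns lined up in a Japanese locale.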

Can senior folks on this list help me with this? Can there be a
better approach more suitable to i18nized software?? Thanks a lot
in advance.
 

Would it be an option for you to default to, let's say, the POSIX or 
en_US.UTF-8 locales?
Before running the mentioned commands, you can reset the 
LANG/LANGUAGE variables on demand to values of your choice.

It looks like a hell of a problem to parse output that is affected by l10n.
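
For a script that has to parse such output, the locale can be overridden for the child process only. A minimal Python sketch, assuming that LC_ALL=C together with an empty LANGUAGE is enough to get untranslated output from the command in question (the helper name is made up):

```python
import os
import subprocess
import sys


def run_in_c_locale(argv):
    """Run argv with untranslated output and return its stdout as text."""
    env = dict(os.environ, LC_ALL="C", LANGUAGE="")
    result = subprocess.run(argv, env=env, capture_output=True,
                            text=True, check=True)
    return result.stdout


# Demo: the child process sees the overridden locale variable.
child = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
print(run_in_c_locale(child).strip())
```

A call such as run_in_c_locale(["df", "-P"]) would then give headers that do not change with the user's language, while the user's own session stays localised.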

Simos




Capitalisation of text, which library is it?

2005-09-03 Thread Simos Xenitellis

Hi All,
Several applications allow you to convert text to all caps, such as 
Firefox and OpenOffice.org.
In a Web page, the CSS can specify that a specific text should be shown 
with all caps.

In OOo, a similar option exists under Tools to change the case of text.

Do you know where this information is stored or which library deals with 
this task?

Is it CLDR? How does Firefox (non-CLDR) do it?
The Greek support for this is not good, and we are looking to correct it.
A similar situation may exist with other languages which specify case, 
and capitalisation of text follows certain complex rules depending on 
whether accents are involved.


Simos
p.s.
For example, in Greek, accents are generally dropped when the text 
becomes all caps, with some interesting exceptions.
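
To show the kind of rule involved, here is a simplistic Python sketch that drops the tonos when upper-casing and keeps the dialytika. It deliberately ignores the exceptions mentioned above and is not how any of the named applications do it. The function name is made up.

```python
import unicodedata

TONOS = "\u0301"  # combining acute accent, i.e. the decomposed Greek tonos


def greek_upper(text):
    """Uppercase text, dropping the tonos but keeping the dialytika."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch != TONOS)
    return unicodedata.normalize("NFC", stripped.upper())


print(greek_upper("\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"))
```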





[Fwd: [Bug 143014] Cannot input composed characters on Linux console (Unicode mode)]

2005-07-15 Thread Simos Xenitellis


Hi,
Could someone check with Fedora Core 3 (kernel-2.6.12-1.1372_FC3) 
whether this bug still exists?
I am not running FC3 at the moment, so I cannot test.

This bug report refers to an earlier discussion from this list, at
http://mail.nl.linux.org/linux-utf8/2005-01/msg00072.html

Cheers,
Simos

==
Summary: Cannot input composed characters on Linux console (Unicode mode)

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=143014

davej at redhat dοt com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Additional Comments From [EMAIL PROTECTED]  2005-07-15 14:40 EST ---
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem.   Please update to this new kernel, and
report whether or not it fixes your problem.

If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.

Thank you.






Re: weird representation for FFFD

2005-06-27 Thread Simos Xenitellis
On Monday, 27 June 2005, at 14:33, [EMAIL PROTECTED] wrote:
 On my fedora core 3 gnome desktop,
 I get a weird representation for U+FFFD.
 Here's what it looks like for you [�].
 
 It's the REPLACEMENT CHARACTER, and according
 to the following should be question mark enclosed
 in a solid diamond: http://www.unicode.org/charts/PDF/UFFF0.pdf
 I've been told that this is also the representation
 on windows and OSX.
 
 However I'm getting a weird comma like thing, which
 Markus Kuhn _has_ made reference to here I think:
 http://www.w3.org/2001/06/utf-8-wrong/UTF-8-test.html
 In the gnome charmap applet it seems to be the nimbus
 and schoolbook (sans and serif) fallback fonts that have
 this weird representation. The (Misc) Fixed fonts
 do have the question mark as expected.
 
 So why this weird representation?
 I'm writing an app where I would like to display
 characters that are invalid in the current encoding,
 and the comma-like thing is totally confusing for users.

Hi,
On my system (FC2), gucharmap says it's FreeSans.
Doesn't FC3 have FreeSans/FreeSerif/FreeMono?
Ubuntu and other distributions come with freefont by default, covering
a good range of the Unicode space.
If FC4 does not install freefont by default, you should file a bug
report.

Simos






Re: C source and execution encodings

2005-06-22 Thread Simos Xenitellis

Roger Leigh wrote:



A while back, I made the useful discovery that GCC accepts UTF-8
encoded C source by default, and in the generated object code uses
UTF-8 for narrow (char) strings, and UTF-32/UCS-4 for wide (wchar_t)
strings.

As an example:

#include <locale.h>
#include <stdio.h>

int
main (void)
{
 /* Pick up the locale from the environment. */
 setlocale (LC_ALL, "");
 printf("‘Name’\n");
 return 0;
}
}

This then correctly outputs the quotes:

$ ./test
‘Name’

A better example is here:

http://groups-beta.google.com/group/comp.lang.c.moderated/msg/bb55bb9f835eba6a?hl=en

In this case, you can output wide strings to narrow streams, and
narrow strings to wide streams.  In order to be able to do this, I
assume that the C runtime must know something of the execution
charsets in order to do the conversion, otherwise you wouldn't get
readable output.  Additionally, when you output a wide string with
wprintf(), it must be recoded to the narrow representation for
output??.

The above link is wrong.  I thought that given the C runtime's
knowledge of the execution charsets, it would recode the output into
the locale charset.  This does not appear to be the case, however.
The above program works the same in the C locale as a normal UTF-8
locale.

Can anyone confirm if the above is correct, or point to anywhere this
is documented?
 

Googling for gcc utf-8 brings up a discussion from this list (Dec 
2004) which references the GCC documentation.

The archive of that discussion starts at
http://mail.nl.linux.org/linux-utf8/2004-11/index.html#8

GCC documentation is available online at
http://gcc.gnu.org/onlinedocs/

The behaviour of the compiler regarding Unicode strings can be 
controlled with preprocessor options such as -finput-charset, 
-fexec-charset and -fwide-exec-charset.

The page for this is
http://gcc.gnu.org/onlinedocs/gcc-4.0.0/gcc/Preprocessor-Options.html#Preprocessor-Options
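
To make the narrow/wide distinction from Roger's message concrete, this small Python sketch shows the bytes the quoted string occupies in UTF-8 (what a char string would hold) and in UTF-32 (what a wchar_t string would hold). It assumes a 32-bit wchar_t and little-endian byte order, which is typical on Linux/x86 but not universal.

```python
text = "\u2018Name\u2019"  # the quoted string from the C example

narrow = text.encode("utf-8")    # bytes of a char[] literal
wide = text.encode("utf-32-le")  # bytes of a wchar_t[] literal (assumed LE)

# Characters, UTF-8 bytes, UTF-32 bytes
print(len(text), len(narrow), len(wide))  # prints: 6 10 24
print(narrow.hex())
```

Each curly quote takes three bytes in UTF-8 and four in UTF-32. If the runtime writes the narrow bytes out unchanged, that would explain why the program prints the same thing in the C locale as in a UTF-8 one.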

Hope this helps,
Simos




Re: How to detect the encoding of a string?

2005-06-21 Thread Simos Xenitellis
On Monday, 20 June 2005, at 12:59, Mike FABIAN wrote:
 Simos Xenitellis [EMAIL PROTECTED] wrote:
 
  Hi All,
  The ZIP format (http://www.info-zip.org/pub/infozip/doc/) appears not
  to specify the text encoding
  of the filenames of the compressed files, which causes a problem with
  unzip utilities when they try
  to uncompress .ZIP files that include filenames in non-UTF-8 encodings.
 
  ZIP programs such as unzip, file-roller (GNOME, at
  http://fileroller.sourceforge.net/) and ark (KDE)
  cannot guess the encoding of the filenames and automatically convert
  them to UTF-8.
 
  To solve this problem, a workaround is to be able to detect the
  encoding and automagically convert to UTF-8.
 
  Is there a library or sample program that can do such encoding
  detection based on short strings of unknown encoding
  (or to choose from encodings based on a smaller list than iconv --list)?
 
 I think it is better to use the filename-encoding-conversion tool
 convmv to fix the encoding *after* unpacking the archive.
 
 See: http://j3e.de/linux/convmv/
 
 (convmv is already included in SuSE Linux).

Thanks.

Though you must agree that this does not follow the principle of "Just
works"; the GUI tool will not be able to do the work for them. Most
end-users will end up in an "Am stuck -> give up" situation. :(

If we cannot solve it in a graceful way, we might be able to sweep this
whole issue under the carpet if we establish that only a very limited
number of end-users are really affected. Can we say that? People from
distros, do you have feedback on this?

Simos






SUMMARY: Zip/Unzip and encoding problem (Was: Re: How to detect the encoding of a string?)

2005-06-03 Thread Simos Xenitellis

Hi,
I'd like to thank everyone for their useful input on this issue. Based
on your suggestions I am opening a bug report for file-roller
(fileroller.sourceforge.net/) and notifying the authors of Info-Zip
(http://www.info-zip.org/pub/infozip/) on the issue. If you are from KDE
and use the ark archiver, please report as the problem exists there as
well. Same goes for 7-Zip Windows users (http://www.7-zip.org/).

If you have something more to add, please do so at the individual
bugzilla pages.

File-roller bug report:
http://bugzilla.gnome.org/show_bug.cgi?id=306403

7-Zip bug report:
https://sourceforge.net/tracker/index.php?func=detail&aid=1214471&group_id=14481&atid=114481

My main concern was with GUI ZIP archivers, which should work no matter
what the compressed file is (the "It just works" philosophy, GNOME :)), in
contrast to command line archivers. I just realised that at least
file-roller is a front-end to Info-Zip (zip/unzip/etc), so work is
needed there as well. Probably the same holds for Ark.

As noted in http://mail.nl.linux.org/linux-utf8/2005-06/msg6.html
unzip has a bug and tries to force a character conversion from CP437
to latin-1, losing the encoding information for any program that calls
it. Therefore, unzip should be fixed as well so that any program that
calls it can retrieve at least the original filename, and proceed with
an intelligent conversion to UTF-8. Actually, the fix might need to be
done in Info-Zip altogether, since unzip needs a way to extract and
place the file on the filesystem. There is no way that file-roller can
do something like unzip [EMAIL PROTECTED]@#$.doc --saveas=test.doc, since unzip
cannot extract a file to a different filename.

The thread started at http://mail.nl.linux.org/linux-utf8/2005-06/#0
and from there one can view the whole discussion.

Indeed, in the general case it is not possible to detect which 8-bit
encoding a string has. The byte values of the letters might give a
hint: in ISO-8859-x (x > 1) the bytes 128-159 are control codes and
letters start at 160, whereas CPxxx encodings (such as CP737) also
place letters in the 128-159 range.
The Zip program can look at the language variable (suppose it's
Greek). If the filename is not valid UTF-8, it's probably in a Greek
8-bit encoding. There are two main options here, ISO-8859-7 and CP737.
If you try iconv(1), it will only work for the correct encoding, while
it will fail for the other (due to the positioning).

Specifically:
 zipnote apoxairetisthrio-logos-mathith.zip | iconv -f CP737 -t utf-8
@   .doc
@ (comment above this line)
@ (zip file comment below this line)

 zipnote apoxairetisthrio-logos-mathith.zip | iconv -f ISO-8859-7 -t utf-8
@ iconv: illegal input sequence at position 5

 zipnote apoxairetisthrio-logos-mathith.zip | iconv -f CP1253 -t utf-8
@ iconv: illegal input sequence at position 6

Therefore, as Bruno described in
http://mail.nl.linux.org/linux-utf8/2005-06/msg9.html
the ZIP application should check if the filename is valid UTF-8, and if
not, it should do something about it: try to convert it to UTF-8 with
heuristics (see Bruno's e-mail), or else, as a last resort, replace the
unknown characters with the Unicode replacement character U+FFFD (thanks
Egmont, http://www.fileformat.info/info/unicode/char/fffd/index.htm).
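
The procedure just described (check for valid UTF-8, then try likely encodings, then fall back to U+FFFD) can be sketched in Python as follows. The candidate list and the rule of rejecting results that contain control characters are assumptions for this sketch. They are one possible heuristic, not what unzip or file-roller do.

```python
def decode_filename(raw, candidates):
    """Decode raw filename bytes: UTF-8 first, then candidates, then U+FFFD."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Heuristic: a wrong 8-bit guess often yields control characters.
        if text.isprintable():
            return text
    # Last resort: keep what is decodable, mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# The two main Greek options named above, in the order they are tried.
GREEK_CANDIDATES = ["iso8859_7", "cp737"]

raw_name = "\u03bb\u03cc\u03b3\u03bf\u03c2.doc".encode("cp737")
print(decode_filename(raw_name, GREEK_CANDIDATES))
```

Such a heuristic can still guess wrong when the bytes happen to be plausible in several encodings, so a GUI tool would probably want to let the user override the choice.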

Simos

On 03/Jun/2005 at 14:08, Bruno Haible wrote:
 Simos Xenitellis wrote:
  Is there a library or sample program that can do such encoding
  detection based on short strings of unknown encoding
  (or to choose from encodings based on a smaller list than iconv --list)?
 
 It's very unfortunate the encoding of the filenames is not specified in the
 central_directory_file_header in unzip.h. So the best you can do is to
 fall back on heuristics, based on these three bits of information:
 
  1) the version_made_by[1] field, which contains the OS on which the zip
 file was made.
  2) the locale (especially language) of the user who attempts to extract the
 zip,
  3) the set of filenames in the zip file.
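
For experimenting with the first and third bits of information, Python's zipfile module exposes the creating system and the general-purpose flags of each entry. The sketch below builds a small archive in memory and lists them. Bit 11 (0x800) is the flag the ZIP specification uses to mark names as UTF-8. In the specification, creating system code 0 stands for MS-DOS/FAT and 3 for Unix. The helper names are made up.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: name and comment are UTF-8


def describe_entries(zip_bytes):
    """Return (name, creating system code, UTF-8 flag set) for each entry."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [(info.filename, info.create_system,
                 bool(info.flag_bits & UTF8_FLAG))
                for info in archive.infolist()]


def make_sample():
    """Build a small in-memory archive with one ASCII and one Greek name."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("plain.txt", "x")
        archive.writestr("\u03bb\u03cc\u03b3\u03bf\u03c2.txt", "x")
    return buffer.getvalue()


for entry in describe_entries(make_sample()):
    print(entry)
```

For archives written by older tools the flag is normally unset, and the name has to be guessed with heuristics of the kind listed in this message.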
 
 Here's how you can use this information to do something meaningful:
 
 1) You know that AMIGA used the ISO-8859-1 encoding, ATARI used the ATARIST
encoding, FS_NTFS and FS_VFAT preferably use Windows encodings, BEOS
uses UTF-8, MAC uses the MAC-* specific encodings, MAC_OSX uses UTF-8 in
decomposed normal form.
 
 2) Assuming that the language of the person who extracts the zip often matches
the language of the one who created it, you can set up a list of encodings
to try:
 
Afrikaans      UTF-8 ISO-8859-15 ISO-8859-1
Albanian       UTF-8 ISO-8859-15 ISO-8859-1
Arabic         UTF-8 ISO-8859-6 CP1256
Armenian       UTF-8 ARMSCII-8
Basque         UTF-8 ISO-8859-15 ISO-8859-1
Breton         UTF-8 ISO-8859-15 ISO-8859-1
Bulgarian      UTF-8 ISO-8859-5
Byelorussian   UTF-8 ISO-8859-5
Catalan        UTF-8 ISO-8859-15 ISO-8859-1
Chinese        UTF-8 GB18030 CP936 CP950 BIG5 BIG5-HKSCS EUC-TW
Cornish
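
The first two heuristics quoted above (host system, then user language) could be combined roughly as follows. This is a sketch, not what any unzip tool does: the helper names and the two-letter language keys are assumptions, only a few rows of the table are copied, and the host-system numbers are the "version made by" codes from the PKWARE APPNOTE. Python's zipfile module exposes that code as ZipInfo.create_system.

```python
import io
import zipfile

# Host-system codes from the ZIP "version made by" field (PKWARE APPNOTE).
HOST_ENCODINGS = {
    1: ["ISO-8859-1"],   # Amiga
    16: ["UTF-8"],       # BeOS
    19: ["UTF-8"],       # OS X (names are stored in decomposed form)
}

# A few rows of the table above; the language keys are an assumption.
LANGUAGE_ENCODINGS = {
    "ar": ["UTF-8", "ISO-8859-6", "CP1256"],
    "bg": ["UTF-8", "ISO-8859-5"],
    "ca": ["UTF-8", "ISO-8859-15", "ISO-8859-1"],
}


def candidate_encodings(host_system, language):
    """Return the encodings to try, most likely first."""
    # A host system with one well-known encoding settles the question.
    if host_system in HOST_ENCODINGS:
        return list(HOST_ENCODINGS[host_system])
    # Otherwise fall back on the language of the person extracting.
    return list(LANGUAGE_ENCODINGS.get(language, ["UTF-8"]))


def host_systems(zip_bytes):
    """Return the host-system code of every member of an in-memory archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [info.create_system for info in archive.infolist()]
```

The third heuristic (looking at the set of filenames) is not covered here; it would rank the candidates by how plausible the decoded names look.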

How to detect the encoding of a string?

2005-06-02 Thread Simos Xenitellis


Hi All,
The ZIP format (http://www.info-zip.org/pub/infozip/doc/) appears not to
specify the text encoding of the filenames of the compressed files, which
causes a problem for unzip utilities when they try to uncompress .ZIP files
that include filenames in non-UTF-8 encodings.

ZIP programs such as unzip, file-roller (GNOME, at
http://fileroller.sourceforge.net/), and ark (KDE) cannot guess the encoding
of the filenames and automatically convert them to UTF-8.


To solve this problem, a workaround is to detect the encoding and
automagically convert to UTF-8.

Is there a library or sample program that can do such an encoding detection
based on short strings of unknown encoding (or choose among encodings from a
smaller list than iconv --list)?

It would be good to have something common to solve the problem for at least
file-roller and ark, which are based on graphical interfaces.

Any suggestions?

Simos

P.S.
If you would like to experiment with your own ZIP application,
try 
http://www.thranio.gr/sxolikes-giortes/telikes/omilies/apoxairetisthrio-logos-mathith.zip
The filename is encoded in CP737 (a la iconv). All open-source ZIP tools 
(=unzip, file-roller, ark) fail to detect the encoding.

WinZip is able to detect the encoding.
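
For experimenting from Python, here is a small sketch of how a name stored in a legacy code page could be recovered. It relies on the behaviour, as far as I know, that the standard zipfile module decodes member names as CP437 whenever the UTF-8 flag (general purpose bit 11) is not set. The function name is hypothetical, and the real encoding (CP737 for the file above) still has to be supplied by the caller or guessed.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the name is stored as UTF-8


def real_member_name(info, real_encoding):
    """Return the member name of a ZipInfo, re-decoded with real_encoding."""
    if info.flag_bits & UTF8_FLAG:
        # The archive declares UTF-8, so the name is already correct.
        return info.filename
    # zipfile decoded the raw bytes as CP437; undo that, then decode properly.
    return info.filename.encode("cp437").decode(real_encoding)
```

Usage would be along the lines of real_member_name(info, "cp737") for each info in ZipFile.infolist().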

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: incorrect non-ascii letters in web archive

2005-03-22 Thread Simos Xenitellis
Hi,
If you get hold of the mailing list admin, could you please add the 
following:

If someone visits http://mail.nl.linux.org/linux-utf8/ and tries to
subscribe from the link at the top, s/he will notice that nothing happens.
The problematic link is https://mail.nl.linux.org/cgi-bin/lsg2.cgi
and it appears that the HTTPS virtual server is not working.
Could you please fix it?

Thanks,
Simos
Egmont Koblinger wrote:
Hi,
Non-ascii letters appear quite often in this mailing list, but (due to a
mhonarc bug) they are usually unreadable in the official web archive of this
mailing list.
The bug is known and a fix is already available. About two weeks ago I wrote
a mail to the nl.linux.org admin asking for a fix but I got no reply.
Is there someone here (Markus maybe?) who could get in contact with the site
admins and fix the archive?
Original mail follows.
Thanks,
Egmont

- Forwarded message from Egmont Koblinger [EMAIL PROTECTED] -
Date: Thu, 10 Mar 2005 14:47:03 +0100
From: Egmont Koblinger [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: wrong non-ascii letters in mail archives
Hi,
I've found that the web archive of the linux-utf8 mailing list often
displays non-ascii characters incorrectly. This is quite bad in general, but
especially bad since this particular mailing list is about how to handle
non-English letters correctly, and people often write non-ascii letters in
their messages to demonstrate things. I see these messages perfectly in my
mail client (mutt 1.5.6 running inside a UTF-8 terminal) but they are
incorrect in the web archive.
An example is a message sent only a couple of minutes ago:
http://mail.nl.linux.org/linux-utf8/2005-03/msg00011.html
The original message is encoded in the UTF-8 character set (with Quoted
printable transfer encoding, but this shouldn't matter), but the archive
shows A and I characters with some accents instead of micro sign, greek mu,
german sharp s, greek beta, etc...
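
The symptom described above can be reproduced with a few lines of Python. This is a sketch that assumes the archive software misreads the UTF-8 bytes as ISO-8859-1; the message only reports the visible result, so the exact code page is a guess, and the function name is made up for the illustration.

```python
def as_seen_in_archive(text):
    """Encode as UTF-8, then misread every byte as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


# Each character becomes two: an accented A or I plus a second symbol.
for name, char in [("micro sign", "\u00b5"), ("greek mu", "\u03bc"),
                   ("sharp s", "\u00df"), ("greek beta", "\u03b2")]:
    print(name, ascii(as_seen_in_archive(char)))
```

Under that assumption, the micro sign and the sharp s come out starting with an accented A, and the two Greek letters with an accented I, which matches the report.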
This is a bug in mhonarc which is still present in their latest release
(2.6.10) but already fixed in CVS, see these:
bug report:
http://savannah.nongnu.org/bugs/?func=detailitem&item_id=11187
patch and commit log:
http://www.mhonarc.org/archive/html/mhonarc-commits/2004-12/msg1.html
So theoretically all you'd need to do is apply this trivial patch to mhonarc
and regenerate the archives; the accented letters would then be repaired.
I hope so :-))

Thanks,
Egmont
- End forwarded message -


Re: New version of UTF-8 on Linux

2005-03-21 Thread Simos Xenitellis
On 21 Mar 2005 at 18:36, Jan Willem Stumpel wrote:
 Would like to get comments & criticism especially about the
 input part:
 
 http://www.jw-stumpel.nl/stestu.html#T6.3

1. There are scripts that have a different class of complexity than
CJK, typically the Indic languages, Khmer, Lao, Burmese and so on. In
some of them, the rendering depends on the characters that come before
or follow, and may also require a dictionary to represent them
correctly.
2. You mention that [IIIMF] has "zero documentation". The choice of
words is not elegant.
IIIMF is rather simple to set up on Fedora 2, Fedora 3, etc.,
Ubuntu Linux, Debian and so on.
Have a look at
http://fedora.redhat.com/projects/i18n/iiimf-faq.html
http://www.openi18n.org/modules.php?op=modload&name=Sections&file=index&req=viewarticle&artid=30&page=1
http://apac.redhat.com/iiimftest/
http://anakin.ncst.ernet.in/~aparna/consolidated/x2004.html
You can write in most Indic languages using IIIMF with transliteration.
3. GTK+ IM is not limited to "some Gnome" (GNOME) programs but works
throughout the GNOME Desktop Environment and Development Platform.
Specific applications that use the GTK+ library, namely
Firefox/Thunderbird and OpenOffice.org, have problems with keyboard
shortcuts when you are typing in the other language, but that's
another issue. It looks bad to criticize GTK+ IM without explaining what
your complaint is about.
4. Thanks for mentioning im-classicalgreek.
An alternative is to use XIM and select polytonic. Have a look at
http://www.livejournal.com/users/simos74/32918.html
(sorry, it's in Greek). GTK+ IM would work for Ancient Greek once the
following issue gets resolved:
http://bugzilla.gnome.org/show_bug.cgi?id=167940
Until then, one can use XIM.

It's an interesting document that you are writing and I would be
interested to see how it progresses.

Best regards,
Simos Xenitellis






Re: GNU glibc locales and CLDR Questions ...

2005-03-15 Thread Simos Xenitellis

More stuff on CLDR...
http://www.advogato.org/person/roozbeh/diary.html?start=3

I should start a new thread, I suppose...

Simos




Re: GNU glibc locales and CLDR Questions ...

2005-03-09 Thread Simos Xenitellis

Tim wrote more on his blog (http://blogs.sun.com/roller/page/timf/).

Direct link is
http://blogs.sun.com/roller/page/timf/20050304#fixing_the_locale_data_mess

Please provide comments.

Simos

On 01 Mar 2005 at 19:38, Simos Xenitellis wrote:
 I'd like to add to this a post by Tim Foster (Sun, on Localisation), at
 http://blogs.sun.com/roller/page/timf/20050226#everything_is_interesting_when_you
 Skip the intro, read the rest until the end.
 There are different repositories for locale information and the
 situation gets more complicated...
 
 Simos
 
 On 14 Feb 2005 at 23:19, Simos Xenitellis wrote:
  On Mon, 2005-02-14 at 18:14, Edward H. Trager wrote:
   Hi,
   
   I posted the following on the [EMAIL PROTECTED] mailing list
   which I suppose is the best place for it.  But perhaps followers on this
   list have some insight on these questions, so I thought I would ask here too:
   
   
   I see that the recently-released glibc-2.3.4 has about 189 locales.
   In comparison, CLDR 1.2 has I believe 231 locales in it.
   
   Can someone please clarify for me the following simple questions:
   
   1) What is the current origin of the 189 locales in glibc-2.3.4? Are these
      still the set of accrued locale data from glibc, or have these data
      already been influenced/augmented by the CLDR/ICU locale data?
  
  I believe the glibc locales are individual submissions from the
  respective countries. I think there is a bit of reinvention of the
  wheel, as people have to show in some cases what is the official
  convention to represent data (like am_pm format, date format, etc).
  
  Per http://sources.redhat.com/ml/libc-locales/2005-q1/msg2.html
  it looks like there is a bottleneck in the processing of bug reports and
  updates to the glibc registry.
  
  Petter Reinholdtsen has done a very good job acting as a sort of
  intermediary between the glibc maintainers and the individuals who want
  to update their locale information.
  
  I feel there is a conflict of interest between the glibc maintainers
  and the GUI developers. The GUI developers want more freedom from glibc
  locale, so they maintain their own locale data for some fields (scary).
  For example, for am_pm (or 12-hour clock), the glibc maintainers
  prefer to set it only if that is the official representation in the country.
  Otherwise, this field should be empty. They also suggest that developers
  check whether this field is empty and, if it is, not show the time in
  12-hour format (as it would be technically incorrect).
  See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=140891&msg=47
  In GNOME and the clock applet, you have the option to choose between
  12-hour or 24-hour clock. So, for Greek, if you choose 12-hour clock,
  then 7pm shows as 7:00 since am_pm is blank (it's a bit ironic, as
  officially in Greece we have the 12-hour clock).
  There was a recent discussion on gnome-i18n on this. See:
  1. http://mail.gnome.org/archives/gnome-i18n/2005-February/msg00015.html
  (it does not matter if at 12-hour clock the am_pm does not show up),
  2. http://mail.gnome.org/archives/gnome-i18n/2005-February/msg00139.html
  (bypassing locale data for usability purposes?).
  
   2) If the current glibc locale data have not yet been influenced/augmented
      by the CLDR project, is there a plan to do so by the glibc maintainers?
  
  I would be really interested to learn the answer to this.
  
   3) If there is a plan by the glibc maintainers to derive all future glibc
      locale data from the CLDR XML data repository, does this mean that we can
      look forward to having all of the localedata in UTF-8 format when it is
      translated into the POSIX format required by glibc? (This would be much
      nicer than the mish-mash of legacy encodings).
   
   4) Is there any future plan to extend the glibc library to, say, read
      directly from the CLDR LDML XML format?
  
  I would be really interested to see such a project taking place. A
  situation where glibc locale data are not considered that useful is a
  bad one; why not get rid of it?
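
The am_pm check discussed earlier in this message (show a 12-hour clock only when the locale actually defines its AM/PM strings) might look like this in Python. It is a sketch only: the function names are made up for the illustration, and it reads the strings through strftime("%p") instead of reading the glibc locale field directly.

```python
import time


def locale_am_pm():
    """Return the AM and PM strings of the current LC_TIME locale."""
    morning = time.gmtime(7 * 3600)    # 07:00 UTC on 1 January 1970
    evening = time.gmtime(19 * 3600)   # 19:00 UTC on 1 January 1970
    return time.strftime("%p", morning), time.strftime("%p", evening)


def clock_format(am, pm, user_wants_12h):
    """Pick a strftime format; use 12-hour only if both strings exist."""
    if user_wants_12h and am and pm:
        return "%I:%M %p"
    # Without AM/PM strings a 12-hour time such as 7:00 would be ambiguous.
    return "%H:%M"
```

With an empty am_pm, as in the Greek locale described above, clock_format("", "", True) falls back to the 24-hour format instead of printing an ambiguous 7:00.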
  
  Simos
  
  



Re: GNU glibc locales and CLDR Questions ...

2005-03-04 Thread Simos Xenitellis
On 04 Mar 2005 at 10:23, Markus Kuhn wrote:
 Simos Xenitellis wrote on 2005-02-14 23:19 UTC:
  officially in Greece we have the 12-hour clock
 
 Even in formal written communication, e.g. on bus/train time tables and
 airport tickets? I have a hard time believing that.
 
 What does officially mean? Does Greece have any other standard for
 time notation than ELOT EN 28601?

I wonder if you actually have a copy of ELOT EN 28601, as we don't have one.
Perhaps you have a copy of the 28601 European Standard?

My source is the EU Publication service, at
http://publications.eu.int/code/el/el-4100800el.htm

 Is a significant fraction of the Greek population unfamiliar with what
 23:59 means?

I would answer empirically here as I do not have statistical
information. In everyday life people use the 12-hour notation.

Do a search for μμ (see below) on Google:
http://www.google.com/search?q=%CE%BC%CE%BC&sourceid=firefox&start=0&start=0&ie=utf-8&oe=utf-8&client=firefox-a&rls=org.mozilla:el-GR:official

You get over 500,000 hits. μμ is not used as an abbreviation for anything
else, so almost all hits count.

 How do you write AM and PM traditionally using the Greek alphabet?

am/AM is πμ/ΠΜ (προ μεσημβρίας)
pm/PM is μμ/ΜΜ (μετά μεσημβρίας)

 Why would you want to go back to something as broken and troublesome as
 the 12-h time-of-day notation on a computer?

I am trying my best for my language. 

If you can retrieve either standard:
a. ELOT EN 28601
b. European Standard 28601
I would be happy to read them and figure out what is better.

 Exercise: When is 12:00 AM today?
That was about 4 hours and a half ago.
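
For readers unsure about the exercise: in the usual convention 12:00 AM is midnight, which is what Python's parser assumes as well. A minimal sketch (the function name is made up, and %p parsing depends on the current locale, assumed here to be the default C locale):

```python
from datetime import datetime


def to_24h(text):
    """Convert a 12-hour time such as '12:00 AM' to 24-hour notation."""
    return datetime.strptime(text, "%I:%M %p").strftime("%H:%M")


print(to_24h("12:00 AM"))  # midnight
print(to_24h("12:00 PM"))  # noon
```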

I am cc:ing a Greek mailing list. Does anyone have any contact at all with
the guys at elot.gr who could help out?

Simos





Re: Thank God for Yudit.

2005-02-21 Thread Simos Xenitellis
On 21 Feb 2005 at 12:31, Elvis Presley wrote:
 Thanks for the help.
 
 I never would have found the GNOME indicator control panel alone, well, maybe
 in 100 years. :)
 
 Elvis
 
  --- Simos Xenitellis [EMAIL PROTECTED] wrote:
 
  As a GNOME user, my preferred method is to use GTK+ IM (yes, there is
  such a thing). Assuming you are running GNOME, you
 
  1. Make sure that no changes in xorg.conf are in place. That is, you have
  a vanilla system with regards to input methods, so we do not have
  interference.
 
 My xorg.conf is full of settings... I haven't changed anything,
 but I don't know exactly what you mean by vanilla. I proceed anyway...

By vanilla I meant not setting any XKB-related options. OK.

  2. Right-click on the panel
  Add to Panel/Utility/Keyboard Indicator
 
 , ... it works! I've got the indicator on the
 panel.

Here you type Greek but it's not shown properly. You are using Yahoo!
Mail which is doing a bad job with languages other than English.
Yahoo! Mail (and Hotmail) started out by using 8-bit encodings for the
mails and are slow to improve.
A better option is to try GMail as it defaults to Unicode and UTF-8. I
just sent you a GMail invite, I hope you make use of it.

  Strangely enough, apart from Indicator, it's also a Keyboard switcher.
  Once done, you will see the string USA on your panel. Right-click on
  it and choose Open Keyboard Preferences. Go to Layouts and choose
  the languages you want to use. Safe choices are US English and
  Greek, or you might go for UK English and Greek.
  You can add up to four languages in the list.
 
 I've got two languages, USA and Greek, but I can't add any others, the
 configuration error message appears.

What does the error message say?
What distribution are you using?
Can you show a screenshot of the error message?
The typical action in such a case is to search Google for the error
message. First try enclosing the error message in double quotes (such
as "Error 3982: Cannot load XYZ"), which will most probably lead you
to a similar report.

  Then, click on Layout Options. Here we choose the key combination to
  switch between languages. Under Group Shift/Ctrl behavior, I prefer to
  use both Shift keys. Choose whichever you prefer.
  Click ok and you are done!
 
 I see what you mean... I like Both Alt keys together change group, but I
 can't add the shortcut, because the configuration error message appears. 

Please show the error message.

  How to test?
  Open up gedit (Start/Accessories/Text Editor on Fedora Core). 
  Switch language and enjoy typing.
 
 Yes, it works. I can now type in Greek in GNOME! I can change the keymaps
 by clicking on the indicator.
 
  Right-click inside the Text Editor and you get an option to choose Input
  Methods. Here, you will see Default (GTK+ IM) and X Input Method (XIM).
  Default is good.
 
  Oh, tell me if you see something called Internet/Intranet in the list.
  Do not use setxkbmap with GTK+ IM, it will mess it up.
 
  (Default) is set, but it doesn't show GTK+ IM. I guess I'm
 still working with XKB. [No problem, I'm happy to be able to select the
 keymap with the mouse and type in Greek.] There are 10 selections, XIM is
 the last one, but no Internet/Intranet in the list.

Internet/Intranet does not show because you do not have the IIIMF
packages installed.
If you have FC3, check out
http://fedora.redhat.com/projects/i18n/iiimf-faq.html
Talking about Fedora and Red Hat, the latter committed last year to
supporting IIIMF as the input method to use.
IIIMF is platform-independent. If Unicode support arrives on the
Linux console in the future (through the framebuffer), IIIMF could be used
there as well, and likewise on Windows, OS X, and more...

Simos





Re: Locale for typing ancient Greek

2005-02-20 Thread Simos Xenitellis
On 20 Feb 2005 at 11:18, Pablo Saratxaga wrote:
 Kaixo!
 
 On Sun, Feb 20, 2005 at 11:04:27AM +0100, Koblinger Egmont wrote:
 
  Locales have nothing to do with keymaps, they're completely independent
 
 for keymaps yes, but for dead keys through the xkb composing mechanism
 it does matter a lot.
 if you use LC_ALL=el_GR.UTF-8 you can type with dead keys all
 greek polytonic, but probably only a limited set of latin
 accents.
 with LC_ALL=en_US.UTF-8 you can type lots of different latin
 accents, but only monotonic Greek.
 and with LC_ALL=ja_JP.UTF-8 you can't type any dead key at all.

Input methods are like black magic to me. Please tell me if the following is
correct! :)

You need GNOME for this.
At the moment I use (and recommend) GTK+ IM. To do so, right-click on
the panel, Add to the panel/Tools/Language Indicator. You will notice
a scary USA string on your panel. Right-click on it, then Open
Keyboard Preferences. Go to the Layouts menu and choose the layouts
you want. For Polytonic, find Greek, expand it, select Polytonic and add
it there. Up to four selections can be made here. Also check out the key
combination used to switch between languages. I use R-Shift+L-Shift to
cycle between them. Some other key combinations may strangely not work,
so in that case, go for the two Shifts.

Now open up gedit (Accessories/Text Editor), select Polytonic and start
typing. (If you have both Greek and Greek Polytonic, they both show as
Grc on the panel :(. We are working on that...)

The dead keys are under ;, [, ]. 

Try them out and you will notice that they do not work. Nothing appears
on screen. It does not matter what locale you have (en_US.UTF-8 or
el_GR.UTF-8). It just does not work (tm).

Why? See
http://mail.gnome.org/archives/gtk-i18n-list/2004-December/msg00044.html
(read the last part, about question 5).
The GTK+ Input Method does not know about Greek polytonic dead keys, only
monotonic (modern) ones.

So, what to do?
In gedit (or any other GTK+ text box), right click, choose Input
Methods, choose X Input Method.

Now you can type
  
and so on.

Simos Xenitellis






IIIMF (Was: Re: SCIM IM)

2005-02-18 Thread Simos Xenitellis
On Fri, 2005-02-18 at 15:59, Edward H. Trager wrote:
 On Friday 2005.02.18 04:36:21 -0800, Elvis Presley wrote:
  Thanks Everyone...
  
  I'm slowly getting the idea.
  
  I just found a little file on Fedora called '/etc/sysconfig/i18n' which
  contains what appear to be environment variables setting the locale. That
  would explain why nobody is using .profile anymore.
 
 I believe that Mandrake uses an i18n file too.  Mandrake sets every single
 LC_ variable explicitly.  I think it is better to only set the LANG variable,
 and then only set the LC_ variables for which one wants to override the
 default of whatever locale LANG is set to.
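
The advice above follows from how POSIX resolves the locale for each category: LC_ALL wins if set, then the specific LC_ variable, then LANG, and finally the "C" locale. A small sketch of that lookup (the function name is made up for the illustration; it takes a plain dictionary in place of the real environment):

```python
def effective_locale(env, category):
    """Return the locale name that applies to one category, e.g. LC_TIME."""
    # Precedence: LC_ALL, then the category itself, then LANG.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset and empty variables are both skipped
            return value
    return "C"


env = {"LANG": "el_GR.UTF-8", "LC_TIME": "en_GB.UTF-8"}
print(effective_locale(env, "LC_TIME"))      # the override
print(effective_locale(env, "LC_MESSAGES"))  # falls back to LANG
```

So setting LANG once and overriding one or two categories gives the same result as spelling out every LC_ variable, with less to maintain.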
 
  Now, what about input methods?
  
 
 Try SCIM (http://www.scim-im.org).  The latest version of SCIM comes with a
 modern Vietnamese input method.  Recall that modern Vietnamese basically uses
 the Latin alphabet with numerous diacritical marks on the letters.  One could
 easily create a similar input method map for polytonic Greek, for example.

Talking about Input Methods, another option is IIIMF, at
http://www.openi18n.org/modules.php?op=modload&name=Sections&file=index&req=viewarticle&artid=30&page=1
At least Fedora Core 3 supports it, and if predictions hold, it should
replace all other input methods, but that's another thread...
You can get packages for Debian/Ubuntu as well, and possibly other distros.

Simos





Re: Thank God for Yudit.

2005-02-14 Thread Simos Xenitellis

I second the reply by Edward on this e-mail.

It's not uncommon in open-source software to make an attempt to figure
out why something does not work and if indeed it does not work, report
it to the Bugzilla service (search first if it has been reported).

However, reading the rest of your e-mail it appears that the problem lies
not entirely with SuSE Linux 9.2 but with learning to use it effectively.

Writing in Greek works on GNOME (and SuSE), you may want to get some
background information on the basics at
http://members.hellug.gr/djart/articles/grlinux/grlinux.html 
The rest of your questions are Linux distribution related, so not very
relevant here.

Simos

On Mon, 2005-02-14 at 11:23, Elvis Presley wrote:
 1) Yudit works great!
.





Linux Unicode user-space console driver (again)

2005-01-22 Thread Simos Xenitellis
Hi All,
I am restarting afresh the thread
http://mail.nl.linux.org/linux-utf8/2005-01/#00020 titled Unicode and
the Linux Console (again). Looking back in the archives of this mailing
list, I noticed a similar thread, thus I am reusing the title
(http://mail.nl.linux.org/linux-utf8/2000-03/#00036)
That thread was asking in March 2000 the same thing we are asking now...
Please tell me if I make any mistakes below.

The clear outcome of the thread
(http://mail.nl.linux.org/linux-utf8/2005-01/#00020) was that in order
to support Unicode well on the Linux console, one should no longer touch
the kernel. Rather, one should write a user-space program to do
the work.
Markus Kuhn describes this direction very well at
http://mail.nl.linux.org/linux-utf8/2005-01/msg00061.html
and effectively closes that thread.

The user-space program should then use the framebuffer device of the
Linux kernel. A rather outdated HOWTO on the framebuffer can be found at
http://www.tldp.org/HOWTO/Framebuffer-HOWTO.html
There are several drivers for the Linux framebuffer, depending on the
graphics card one has. One may use vesafb, which is the lowest common
denominator, or hardware-accelerated versions (such as intelfb) if your
card is supported.
One issue is that some Linux distributions do not enable/use the
Linux framebuffer device but rather prefer the emergency terminal (panic
terminal). One reason for this is that most users do not use
the console; using an accelerated framebuffer and X at the same time
might lead to resource conflicts.
It remains to be figured out how to get the Linux distributions to provide a
basic framebuffer by default.

Now, are there examples of such user-space software that allows you to
use Unicode on the framebuffer?
Several have been mentioned in different sources; to sum up, there are
two that look (to me?) very promising.

A. jfbterm, http://jfbterm.sourceforge.jp/ by Fumitoshi UKAI.
"FBTERM/ME takes advantages of framebuffer device that is supported
since linux kernel 2.2.x (at least on ix86 architecture) and make it
enable to display multilingual text on console. It is developed on ix86
architecture, and it will works on other architectures such as
linux/ppc."
While searching for jfbterm, I noticed that it was used for some time in
the Debian distribution to display Unicode on the console, but it is no
longer used, as the Unicode support in the emergency console is now favoured.
I could not find mailing list archives and I believe that, if they exist,
they will be in Japanese.
Last release: May 2004.

B. uterm
http://members.ispwest.com/hanpaul/uterm.html
by John Paul (could be John Palmisano, not sure).
When you browse the website and Firefox says "No data", click the
link again until it loads.
uterm looks really promising.
There are two screenshots demonstrating a login screen and a test page
covering a few ranges of Unicode (Boxdrawing, Korean, Cyrillic, Greek,
Graphics characters).
The FAQ (http://members.ispwest.com/hanpaul/uterm.faq) is quite
descriptive and mentions that the latest development was in Jan 2005.
The link to the source code
(http://members.ispwest.com/hanpaul/uterm.src.tgz) is not available (one
can download a binary though).

All in all, 
1. I feel that uterm is quite promising but it needs more people to work
on it (currently it looks like a one-man show).
2. I do not know how to get in contact with the author. Ideas?
3. Do you feel that uterm is the way to go?

Cheers,
Simos Xenitellis


