RE: CJK question

2003-03-21 Thread Erik.Ostermueller



Have you tried the MS-Office Proofing Tools?
It contains a font for GB 18030 characters as well as an updated
version of the MS IME.

The install program forces you to have MS-Office XP installed.
However, I've seen comments on the web suggesting that you can use it
without XP.

--Erik Ostermueller

  -----Original Message-----
  From: Allen Haaheim [mailto:[EMAIL PROTECTED]]
  Sent: Friday, March 21, 2003 1:48 PM
  To: [EMAIL PROTECTED]
  Subject: CJK question
  Hello, 
  
  First, my apologies if I have missed something already available on
  the Unicode website, as well as for my total lack of expertise in the
  fields commonly under discussion in this forum. If anyone knows of a
  more appropriate place towards which I should direct my woes, I would
  be glad to hear of it.
  
  I work with Classical Chinese texts which contain numerous characters
  not in any character sets I use (Arial Unicode MS; CJK Unified
  Ideographs), or know of. I was delighted to see that most--if not
  all--of the graphs I need are in the Unicode charts. However, they
  all seem to be images, so it seems that I cannot use them. I need to
  be able to both input and display the rare graphs in my MS system. As
  far as I can tell, what I may need to do is add characters to my
  existing character sets, or get new character sets altogether.
  
  For example, in a website I commonly use to view reliable, annotated
  editions of source texts, www.chant.org (Chinese Ancient Texts Center
  at the Chinese U of Hong Kong), I frequently encounter blank spaces
  where rare graphs are located. (I have downloaded all their font
  packs.)
  
  Third-party overlays with "font-maker" utilities such as Twinbridge
  or Chinese Star are too unstable, in my experience causing frequent
  crashes. Furthermore, they are not always convertible to or
  compatible with Unicode. The most stable setup I have found for a PC
  is a localized version of Win2000; however, the MS input methods for
  Chinese are much too cumbersome and slow, cover only a portion of the
  language, of course have no font-maker utility, and do not even seem
  to be able to retrieve all the graphs stored in my existing character
  sets.
  
  Besides Chinese and Japanese graphs, I also occasionally need to
  enter Latin or other letters with diacritical marks, such as
  romanized Sanskrit (for Buddhist terms), IPA symbols, and others, but
  I cannot always find them in the character sets I have. Some of you
  should get a chuckle (or a groan) to learn that I'm manually
  searching through character sets in "Insert (drop-down); Symbol..."
  in MS Word 2000 to find them. Of course their location gets memorized
  after a while--but this gets more difficult with CJK graphs! (They're
  not true ideographs, by the way--they are mostly logographic.)
  
  I realize there may be no easy solution for my problem, 
  but any advice would be greatly appreciated. 
  
  From a frustrated sinologist,
  
  Allen Haaheim


Re: DBCS and Unicode 3.1

2003-02-18 Thread Erik.Ostermueller
Thanks, all, for your responses.
They helped me to better phrase my question:

Does anyone know of a way to process GB 18030 data in COBOL on MVS?


Thanks,

--Erik Ostermueller
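[Editorial note: this doesn't answer the MVS/COBOL question, but as background on why GB 18030 is awkward for traditional DBCS pipelines: it is a variable-width encoding that mixes 1-, 2-, and 4-byte sequences in one stream. A minimal, illustrative sketch using Python's built-in codec:]

```python
# GB 18030 mixes 1-, 2-, and 4-byte sequences, so fixed-width
# (2-bytes-per-character) DBCS assumptions do not hold for it.
samples = {
    "A": 1,            # ASCII range: 1 byte
    "\u4e2d": 2,       # a common CJK ideograph (U+4E2D): 2 bytes
    "\U00020000": 4,   # supplementary-plane ideograph: 4 bytes
}
for ch, width in samples.items():
    assert len(ch.encode("gb18030")) == width
```

Any COBOL solution would have to cope with this variable width, typically by converting to a fixed-width form (such as UTF-32) before per-character processing.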




DBCS and Unicode 3.1

2003-02-17 Thread Erik.Ostermueller
Hello all,

In the past, DBCS could support characters no larger than 2 bytes.  Correct?

Now that Unicode 3.1 has broken the two-byte barrier, is there a corresponding update 
for DBCS?

I've been getting most of my DBCS info from these URLs:
http://oss.software.ibm.com/icu/userguide/conversion-data.html
http://www-919.ibm.com/developer/dbcs/guide3.html#DBCS
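[Editorial note: to make the "two-byte barrier" concrete, Unicode 3.1 assigned the first supplementary-plane characters (beyond U+FFFF). In UTF-16 these take a surrogate pair (two 16-bit units); in GB 18030 and UTF-8 they take four bytes. A quick sketch:]

```python
# A CJK Extension B ideograph (U+20000), one of the supplementary-plane
# characters introduced with Unicode 3.1.
ch = "\U00020000"
assert len(ch.encode("utf-16-be")) == 4  # two 16-bit units: a surrogate pair
assert len(ch.encode("gb18030")) == 4    # a four-byte GB 18030 sequence
assert len(ch.encode("utf-8")) == 4      # four bytes in UTF-8 as well
```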

Thanks,

Erik Ostermueller




RE: {SPAM?} RE: CJK test data

2003-02-10 Thread Erik.Ostermueller
All,

For all those interested in following my search for GB 18030 test data.
I'm having another one of those 'senior moments'.  I could have sworn
that I sent this to the list already, but can't find it anywhere.
Forgive me if I've already sent this info.


I found some GB 18030 test data on the website of a private consultancy.
http://www.chinesization.com/gb18030_standard.htm

Download the testdata.zip from this page.

My Chinese-speaking colleague, Tianmiao Hu, informs me that
this _is_ the official test data.  Can anyone confirm or deny this?

It would be nice to find this same data from a more official source.
Instructions in English would be helpful, also. :-)

--Erik

   -Original Message-
   From: Anthony Fok [mailto:[EMAIL PROTECTED]]
   Sent: Saturday, February 08, 2003 3:35 AM
   To: Ostermueller, Erik
   Cc: [EMAIL PROTECTED]
   Subject: Re: {SPAM?} RE: CJK test data
   
   
   On Fri, Feb 07, 2003 at 04:19:04PM -0600, [EMAIL PROTECTED] wrote:
 Markus wrote:
  For general test data for determining support of GB 18030 I suggest to
  contact the Chinese government and its standards agency. They have
  defined a certification procedure, and I assume that the data and
  procedure are available. I have no direct contacts for this myself.

Here is contact info from an 18030 article by Tom Emerson.

   http://lisa.org/archive_domain/newsletters/2002/2.3/emerson.html
Hmmm.  No URL.  No email address.  This will be interesting.

Standard Conformity Testing Center for Information Products 
#1 Andingmen Dong Da Jie 
Beijing, China 
Tel: 84029573 or 84029792 
Fax: 64007681
   
   Tom Emerson's article is news to me, and I find it very helpful.  :-)
   
   There _is_ an e-mail address that interested parties could try, that of
   
   CHEN Zhuang 
 Chinese IT Standardization Technical Committee
 Chinese Electronics Standardization Institute
   
   His e-mail address is included in the Application of IANA Charset
   Registration for GB18030:
   
   http://www.iana.org/assignments/charset-reg/GB18030
   
   I suppose Mr. Chen does not work in the Testing Center, but he may be
   able to provide some other pointers.  :-)
   
   Cheers,
   
   Anthony
   
   -- 
   Anthony Fok Tung-Ling
   ThizLinux Laboratory   [EMAIL PROTECTED] 
http://www.thizlinux.com/
Debian Chinese Project [EMAIL PROTECTED]   http://www.debian.org/intl/zh/
Come visit Our Lady of Victory Camp!   http://www.olvc.ab.ca/




RE: CJK test data

2003-02-07 Thread Erik.Ostermueller
Markus wrote:
   For general test data for determining support of GB 18030 I suggest to
   contact the Chinese government and its standards agency. They have
   defined a certification procedure, and I assume that the data and
   procedure are available. I have no direct contacts for this myself.

Here is contact info from an 18030 article by Tom Emerson.
http://lisa.org/archive_domain/newsletters/2002/2.3/emerson.html
Hmmm.  No URL.  No email address.  This will be interesting.

Standard Conformity Testing Center for Information Products 
#1 Andingmen Dong Da Jie 
Beijing, China 
Tel: 84029573 or 84029792 
Fax: 64007681




CJK test data

2003-02-06 Thread Erik.Ostermueller
I'm starting to put together some CJK test data
as described below.

Before I dive in, I was curious if any of this
work is already available on the web.
If not, would others be interested in seeing this,
once complete?

###
CJK Test data.
This is just a start!

Need to produce a set of CJK data that is geared towards
testing string manipulation support in any software system.
The intent of the data would be to test software systems,
regardless of platform, software language or even API.

All data need English translations and instructions for
entering the data using an IME on a QWERTY keyboard.

Need tests to prove that a system SUPPORTS GB 18030
Need tests to prove that a system SUPPORTS GB 13000
Need tests to prove that a system DOES NOT support GB 18030
Need tests to prove that a system DOES NOT support GB 13000

Tests: need two sets of data, one for 13000, one for 18030
  1) Sorting Test 
a) include a list of un-ordered strings.
b) follow that with the same list, ordered properly.
  
  2) Text searching
-Need single character search and multiple character search.
 Must include the 'key' that we're looking for and 
  strings that do and do not contain that key.

  3) Character classification
We need data to test some subset of the predicate functions: isSpace(),
isAlpha(), is*().
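A toy illustration of tests 2 and 3 above, in Python with made-up strings (a real suite would draw its data from the official GB 18030 test set):

```python
# 2) Text searching: a key plus strings that do and do not contain it.
key = "\u4e2d\u56fd"                      # "China" (two ideographs)
positives = ["\u6211\u7231\u4e2d\u56fd"]  # strings that must match
negatives = ["\u65e5\u672c"]              # strings that must not match
assert all(key in s for s in positives)
assert all(key not in s for s in negatives)

# 3) Character classification predicates (isSpace, isAlpha, ...).
assert "\u3000".isspace()   # IDEOGRAPHIC SPACE counts as whitespace
assert "\u4e2d".isalpha()   # CJK ideographs classify as alphabetic
```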




RE: discovering code points with embedded nulls

2003-02-06 Thread Erik.Ostermueller
I didn't get any attachments.  Hmmm.

--Erik O.

   -Original Message-
   From: Stefan Persson [mailto:[EMAIL PROTECTED]]
   Sent: Thursday, February 06, 2003 11:12 AM
   To: Kent Karlsson
   Cc: Ostermueller, Erik; [EMAIL PROTECTED]
   Subject: Re: discovering code points with embedded nulls
   
   
   What is that strange file (winmail.dat) attached to 
   your mail?  I really 
   hope that it isn't a virus.
   
   Stefan
   
   Kent Karlsson wrote:
   
   What I'm hearing from you all is that a null in UTF-8 is
   for termination and termination only.
   Is this correct?
   
   
   
   No, NULL is a character (actually a control character) among many
   others. However, many C/C++ APIs (mis)use NULL as a string terminator
   since NULL isn't very useful for other things.
   
  /kent k
 
   
   
   
   
   
   




discovering code points with embedded nulls

2003-02-05 Thread Erik.Ostermueller
Hello, all.

I'm dealing with an API that claims it doesn't support Unicode
characters with embedded nulls.
I'm trying to figure out how much of a liability this is.

What is my best plan of attack for discovering precisely which code
points have embedded nulls, given a particular encoding?  I didn't find
it in the mailing-list archive.
I've googled for quite a while with no luck.

I'll want to do this for a few different versions of unicode and a few different 
encodings.
What if I write a program using some of the data files available at unicode.org?
Am I crazy (I'm new at this stuff) or am I getting warm?
Perhaps this data file: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ?

Algorithm:
INPUT: Name of unicode code point file
INPUT: Name of encoding (perhaps UTF-8)

Read code point from file.
Expand code point to encoded format for the given encoding.
Test all constituent bytes for 0x00.
Goto next code point from file.
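The algorithm above is straightforward to sketch in Python (standing in for any language with codec support); iterating over code points directly even makes UnicodeData.txt unnecessary:

```python
def code_points_with_null_bytes(encoding, limit=0x110000):
    """Return code points whose encoded form contains a 0x00 byte."""
    hits = []
    for cp in range(limit):
        try:
            encoded = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable here (e.g. lone surrogates)
        if 0 in encoded:
            hits.append(cp)
    return hits

# In UTF-8, only U+0000 itself encodes with a zero byte, so the
# "no embedded nulls" restriction costs just that one character.
```

For UTF-16 or UTF-32, by contrast, zero bytes appear everywhere (every ASCII character, for a start), so the restriction there would be far more serious.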

Thanks in advance for any help,

--Erik O.






compatibility between unicode 2.0 and 3.0

2003-01-31 Thread Erik.Ostermueller
We have a large amount of C++ that currently has Unicode 2.0 support.

Could you all help me figure out what types of operations will fail
if we attempt to pass Unicode 3.0 through this code?

I can start the list off with:

-sorting 
-searching for text 
-text comparison
-other character classification (isSpace, isDigit, etc...)

I understand that these operations probably won't work in ALL cases.
But how about basic plumbing code -- creating and copying strings?

As I mentioned in my last post, I've enjoyed
listening in on this forum -- I've learned a whole lot.

Thanks,

--Erik Ostermueller
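[Editorial note: Unicode 3.0 itself stays within the BMP, so 16-bit string plumbing written for 2.0 generally still copies and stores 3.0 text correctly; the sharper break comes with 3.1's supplementary characters, where per-unit indexing and classification see two code units for one character. A sketch of that mismatch:]

```python
# A supplementary-plane character (beyond U+FFFF, assigned in 3.1).
s = "\U00010300"
units = len(s.encode("utf-16-be")) // 2
assert units == 2   # a UTF-16 API built for Unicode 2.0 counts two "characters"
assert len(s) == 1  # but it is a single code point
```

Sorting, searching, and comparison can also change between versions as new characters and collation data are added, even for BMP-only text.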