RE: About Kana folding

2001-05-18 Thread Yves Arrouye

Kenneth,

Thanks for the explanations.

 So I'd suggest you be very careful when trying to do this kind of
 a folding. If it is just for surface text matching, the number of
 false positive matches would likely swamp the number of false
 negatives you'd be correcting.
 
 On the other hand, if you are doing a phonetic matching, then 
 of course
 you have to fold the Hiragana and Katakana forms together.

I am trying to work around a situation where people cannot register a
database key in Katakana and the same one in Hiragana (because the DB's
collation does some Kana folding), yet they need to be able to find it using
either of these (after this key has been migrated to some other system that
doesn't do Kana folding). I don't know if that's what you call surface text
matching. The matching will be done on the whole key, not using N-grams.

 The more serious problem of equivalencing for matching in Japanese
 would be kanji versus Hiragana, in particular. [...] Getting this kind of
thing
 right is far more important for matching in Japanese than just
 brute matching of Hiragana to Katakana.

And if one wanted to do that automatically (which is not my intent, Kanji
work fine), one would need a dictionary to go from words in Kanji to one
Kana, is that true?

YA




Re: UTF-8 signature in web and email

2001-05-18 Thread Martin Duerst
At 22:58 01/05/17 -0400, [EMAIL PROTECTED] wrote:
Martin Dürst wrote:

  There is about 5% of a justification
  for having a 'signature' on a plain-text, standalone file (the reason
  being that it's somewhat easier to detect that the file is UTF-8 from the
  signature than to read through the file and check the byte patterns
  (which is an extremely good method to distinguish UTF-8 from everything
  else)).

A plain-text file is more in need of such a signature than any other type of
file.  It is true that "fancy" text such as HTML or XML, which already has a
mechanism to indicate the character encoding, doesn't need a signature, but
this is not necessarily true of plain-text files, which will continue to
exist for a long time to come.

The strategy of checking byte patterns to detect UTF-8 is usually accurate,
but may require that the entire file be checked instead of just the first
three bytes.  In his September 1997 presentation in San Jose, Martin conceded
that "Because probability to detect UTF-8 [without a signature] is high, but
not 100%, this is a heuristic method" and then spent several pages evaluating
and refining the heuristics.  Using a signature is not just somewhat easier; it is
*much* easier.

Sorry, but I think your summary here is a bit slanted.
I indeed used several pages, but the main aim was to show that
in practice, it's virtually 100%, for many different cases.
People using this heuristic, who didn't really think it would
work that well after the talk, later confirmed that it
actually works extremely well (and they were writing production
code, not just testing stuff). On the other hand, I never met
anybody who showed me an example where it actually didn't work.
I would be interested to know about one if it exists.

I just said 'high, but not exactly 100%', because it was a technical
talk and not a marketing talk. Could be that this wasn't
easy to understand for some of the audience? There is no actual
need in practice to refine the heuristics.

The use of the signature may be easier than the heuristic in particular
if you want to know before reading a file what the encoding of the
file is. But in most cases, you will want to convert it somehow,
and in that case, it's easy to just read in bytes, and decide
lazily (i.e. when seeing the first few high-octet bytes) whether
to transcode the rest of the file e.g. as Latin-1 or as UTF-8.

Also, the signature really only helps if you are only dealing with
two different encodings, a single legacy encoding and UTF-8.
The signature won't help e.g. to keep apart Shift_JIS, EUC, and
JIS (and UTF-8), but the heuristics used for these cases can
easily be extended to UTF-8.
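
To make that concrete, here is a minimal sketch of such a byte-pattern check
(my illustration only, not the heuristic from the talk; it also skips fine points
such as overlong 3- and 4-byte sequences, which a production converter would reject):

    # Illustrative sketch: does this byte string look like well-formed UTF-8?
    # Legacy 8-bit text containing accented characters almost never passes,
    # which is why the heuristic works so well in practice.
    def looks_like_utf8(data):
        i, n = 0, len(data)
        while i < n:
            b = data[i]
            if b < 0x80:                    # plain ASCII byte
                i += 1
                continue
            if 0xC2 <= b <= 0xDF:           # lead byte of a 2-byte sequence
                need = 1
            elif 0xE0 <= b <= 0xEF:         # lead byte of a 3-byte sequence
                need = 2
            elif 0xF0 <= b <= 0xF4:         # lead byte of a 4-byte sequence
                need = 3
            else:                           # 0x80..0xC1 and 0xF5..0xFF never start a sequence
                return False
            if i + need >= n:
                return False
            if any(not 0x80 <= data[i + j] <= 0xBF for j in range(1, need + 1)):
                return False                # continuation bytes must be 10xxxxxx
            i += need + 1
        return True

    # looks_like_utf8('Grüße'.encode('utf-8'))   -> True
    # looks_like_utf8('Grüße'.encode('latin-1')) -> False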


  - When producing UTF-8 files/documents, *never* produce a 'signature'.
 There are quite some receivers that cannot deal with it, or that deal
 with it by displaying something. And there are many other problems.

If U+FEFF is not interpreted as a BOM or signature, then by process of
elimination it should be interpreted as a zero-width no-break space (ZWNBSP;
more on this later).  Any receiver that deals with a ZWNBSP by displaying a
visible glyph is not very smart about the way it handles Unicode text, and
should not be the deciding factor in how to encode it.

Don't think that display is everything that can be done to a text
file. An XML processor that doesn't expect a signature in UTF-8
will correctly reject the file if the signature comes before
an XML declaration. Same for many other formats and languages.


What are the "many other problems"?  Does this comment refer to programs and
protocols that require their own signatures as the first few bytes of an
input file (like shell scripts)?  The Unicode Standard 3.0 explicitly states
on page 325, "Systems that use the byte order mark must recognize that an
initial U+FEFF signals the byte order; it is not part of the textual
content."  Programs that go bonkers when handed a BOM need to be corrected to
conform to the intent of the UTC.

This would mean changing all compilers, all other software dealing with
formatted data, and so on, and all Unix utilities from 'cat' upwards.
In many cases, these applications and utilities are designed to work
without knowing what the encoding is, they work on a byte stream.
This makes it just impossible to conform to the above statement.
If you have an idea how that can be solved, please tell us.

The problem goes even further. How should the 'signature' be handled
in all the pieces of text data that may be passed around inside
an application, or between applications, but not as files?
Having to specify for each case who is responsible to add or
remove the 'signature', and doing the actual work, is just crazy.


Regards,   Martin.


Re: Ancient writing found in Turkmenistan

2001-05-18 Thread Peter Ilieve

Mike Ayers wrote:

 ... However, we don't want the article, we want the picture!

After lurking on this list for years, finally I can do something
vaguely useful. :-)

A piece about this appeared in The Times on Tuesday 15 May.
There was a picture of the seal spread over three columns but this
didn't make it to the online version of the story (which is at
http://www.thetimes.co.uk/article/0,,3-201723,00.html, with no
need to register, unlike its New York cousin).

I have made some scans of the picture and put them at
http://www.aldie.org.uk/unicode/.

The Times credits the picture to Fred Hiebert, who is one of the
archaeologists named in the story, so I hope I am fairly safe
against rampaging copyright lawyers. :-)


Peter Ilieve    [EMAIL PROTECTED]





Re: [OT] bits and bytes

2001-05-18 Thread Otto Stolz

On Thu, 17 May 2001 15:39:02 -0500, Peter Constable wrote:
 Can anyone clarify for me how big a byte has ever been? (If you could
 identify the particular hardware, that would be helpful.)

The TR440, a German brand of computer (designed and built here
at Konstanz), in use circa 1975..1990 (I don't remember the exact
life span), had a 48 + 2 bit word: 2 bits of data-type flag and
48 bits of data proper. The addressing was mostly per 24-bit
half-word, as a machine instruction or a pointer was stored in
a half-word each (with the data-type flag set to 2).

For character-type processing (data-type flag = 3), there were two
particular machine instructions, termed BNZ and CNZ (bringe/speichere
nächstes Zeichen = load/store next character) which could address
6-bit, 8-bit, or 12-bit bytes, at the programmer's discretion (I do
not quite remember whether these instructions could also handle
smaller or larger fractions of a machine word, such as 2-bit, 4-bit,
or 24-bit chunks; in any case, all 48 data bits were used per
word).

However, most available software only exploited the 8-bit variant,
and the only character encoding defined by the input/output routines
of the operating system was an 8-bit single-byte coded character set
(not counting a 6-bit byte mode available for backwards compatibility
with the predecessor TR4). There was even another machine instruction,
TOK (transportiere Oktaden = move octets) which was particularly
designed for the COBOL compiler and (as its name implies) could only
handle 8-bit bytes.

Note that the term Oktade (octet) was used by the vendor of this
machine as early as 1975 (or was it 1972?).

Aside: as a result of an integrierte Hardware-Software-Entwicklung
(integrated hardware-software development), this TOK instruction used
a particular addressing scheme, viz. a sequential numbering of the
octets. Hence, the COBOL compiler multiplied the usual address (used
elsewhere, e. g. in storage management) by 3 to arrive at an octet
address, and then the hardware divided that address by 3 to arrive
at a storage-address used elsewhere (e. g. in the micro-program for
storage access). Weird...

Best wishes,
  Otto Stolz



Re: [OT] bits and bytes

2001-05-18 Thread Bob_Hallissy


I was hoping someone with more detailed memory would mention this, but
since nobody has, and since it is a contender for having one of the largest
minimum addressable units (other than microcode storage):

I wrote a couple of programs for a Control Data Corporation (CDC) 6600 back
in the early '70s. I recall that the smallest addressable unit was a 60-bit
word (though there were special instructions to pack and unpack some size
of character -- was it 6-bit?).

Bob





Re: [OT] bits and bytes

2001-05-18 Thread Peter_Constable


Thanks for all the interesting feedback.

Now let me ask a slightly different question: Prior to Unicode and ISO
10646, what were the smallest and largest size code units ever used for
representing character data? In the various responses, there was reference
to 6- and 9-bit character representations (on the Unisys 1100: 6 for
Fielddata, and 9 for ASCII). There may have been other references to big
and small sizes for character data but if so it wasn't clear to me if
specifically character data was involved.

Any characters bigger than 9 bits or smaller than 6?



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]






Re: [OT] bits and bytes

2001-05-18 Thread Peter_Constable


On 05/18/2001 09:39:18 AM Michael \(michka\) Kaplan wrote:

Well, most of the various CJK encodings clearly would have a lot more than
9 bits to them. Kind of required for any system dealing with thousands of
characters.

But do any of them encode using code units larger than 8 bits? Certainly if
something like GB2312 were encoded in a flat (linear?) encoding that never
used code-unit sequences, the code units would have to be larger than 9
bits. But I've only ever heard of them being handled using sequences of
8-bit code units.


After sending out that second message, I noticed Nelson Beebe had said,
"However, kcc offered extended datatypes to access 6-bit, 7-bit, 8-bit,
9-bit, and 36-bit characters." Does that qualify for both the largest and
the smallest code units to represent characters?



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]






Re: [OT] bits and bytes

2001-05-18 Thread Michael \(michka\) Kaplan

Well, most of the various CJK encodings clearly would have a lot more than 9
bits to them. Kind of required for any system dealing with thousands of
characters.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

- Original Message -
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, May 18, 2001 6:35 AM
Subject: Re: [OT] bits and bytes



 Thanks for all the interesting feedback.

 Now let me ask a slightly different question: Prior to Unicode and ISO
 10646, what were the smallest and largest size code units ever used for
 representing character data? In the various responses, there was reference
 to 6- and 9-bit character representations (on the Unisys 1100: 6 for
 Fielddata, and 9 for ASCII). There may have been other references to big
 and small sizes for character data but if so it wasn't clear to me if
 specifically character data was involved.

 Any characters bigger than 9 bits or smaller than 6?



 - Peter


 ---
 Peter Constable

 Non-Roman Script Initiative, SIL International
 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
 Tel: +1 972 708 7485
 E-mail: [EMAIL PROTECTED]









Re: [OT] bits and bytes

2001-05-18 Thread Frank da Cruz

 Now let me ask a slightly different question: Prior to Unicode and ISO
 10646, what were the smallest and largest size code units ever used for
 representing character data?

 Any characters bigger than 9 bits or smaller than 6?

Of course, Baudot was a 5-bit code used widely in Teletype networks, Telexes,
TDDs, etc., and as the i/o code by computers that used such devices as their
control consoles.  I doubt that Baudot was ever used as a storage code for
files, but who knows.  However, Baudot-based TDDs still exist, even though
they are often emulated by PCs (whose UARTs can be programmed for 5-bit
bytes).

For that matter, hex bytes are 4-bit codes and some people read and write
them fluently :-)

As to larger-than-9-bit bytes, when I visited Japan in the heyday of the
DECSYSTEM-20 (mid 1980s), but before I knew much about character sets, I was
told that the DEC-20 was prized in Japan because of its variable-length bytes,
which were perfect for handling Japanese text.  I wish I had found out more
about this.

- Frank





Re: [OT] bits and bytes

2001-05-18 Thread Michael \(michka\) Kaplan

From: [EMAIL PROTECTED]

 But do any of them encode using code units larger than 8 bits? Certainly if
 something like GB2312 were encoded in a flat (linear?) encoding that never
 used code-unit sequences, the code units would have to be larger than 9
 bits. But I've only ever heard of them being handled using sequences of
 8-bit code units.

Ok, I think I understand what you are saying here, but I am not sure how
meaningful the question is in such a case. If an encoding is constructed
such that I need two octets to represent all characters, then we are looking
at a 16-bit requirement. The fact that it is done with particular byte ranges
doesn't really have a meaning in the context of your question, does it?

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






Re: [OT] bits and bytes

2001-05-18 Thread Markus Scherer

[EMAIL PROTECTED] wrote:
 the smallest and largest size code units ever used for representing character data?

Teletype machines commonly use a 5-bit code (Baudot, International Alphabet Nr. 2). It 
has Shift-In/Shift-Out codes to switch between an alphabetic default level and a level 
with digits and symbols.

Morse code uses a one-bit scheme, if you will, or a small number of codes (short/long 
sound and some 3 or 4 standard lengths of pauses) depending on how you look at it.

markus
DL6FCT




Re: UTF-8 signature in web and email

2001-05-18 Thread Edward Cherlin

At 10:58 PM -0400 5/17/01, [EMAIL PROTECTED] wrote:
The UTF-8 signature discussion appears every few months on this list,
usually as a religious debate between those who believe in it and those who
do not.  Be forewarned, my religion may not match yours.  :-)

My religion suggests that we find common ground and not engage in religious wars.

Keld Jørn Simonsen wrote:

  For UTF-8 there is no need to have a BOM, as there is only one
  way of serializing octets in UTF-8. There is no little-endian
  or big-endian. A BOM is superfluous and will be ignored.

You could say "should be ignored", but you can't speak for everybody 
else's software.

The debate is not about whether byte order needs to be specified in a UTF-8
file (of course it doesn't) but whether U+FEFF should be used as a signature
to identify the file as UTF-8, rather than some other byte-oriented encoding.

Which will only work if the software is ready to handle it.

Martin Dürst wrote:

  There is about 5% of a justification
  for having a 'signature' on a plain-text, standalone file (the reason
  being that it's somewhat easier to detect that the file is UTF-8 from the
  signature than to read through the file and check the byte patterns
  (which is an extremely good method to distinguish UTF-8 from everything
   else)).
[snip]
OK, that's enough context.

Last year, as previously the year before, we discussed the 
possibility of defining some standard Unicode plain text formats. The 
discussions foundered on the differences between text files meant for 
people to read, such as e-mail, FAQs, and so on, and text files meant 
for computers to process, such as delimited data files. We could not 
agree, for example, whether a limit on line length was to be 
required, permitted, or forbidden. We could not even agree that the 
rules would be different for different cases, and that we would 
attempt to enumerate the cases our standard would cover.

This BOM-as-signature debate is of the same type. Is it to be 
required, permitted, forbidden, or something else? The short answer 
is No. Users do not agree, and software cannot be made to agree, not 
even if a formal standard were created and widely used.

Martin knows of no actual cases where a non-UTF-8 file could be 
mistaken for UTF-8, so he says the signature is unnecessary, and goes 
on to say that it is actually harmful. Specifically, he asks how all 
Unix text-handling software could be made to work with a signature. 
It can't all be changed, but here is a possible method for coping.

Create a filter that strips an initial signature from a text stream, 
and passes the remainder through unchanged. You can be picky and make 
it verify that the stream is in UTF-8, if you like.

Create a filter that adds a signature to the beginning of a text 
stream, if it does not already have one. You can be picky, again.

Create a filter that can identify character sets heuristically and 
convert them to UTF-8.

Write your scripts carefully, so that you know when you are handling 
text in unknown character sets, and apply these filters as needed.

Then ordinary Unix utilities will be fed data that they will not 
choke on, in known encodings without extraneous non-text data.
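
As a minimal sketch of the first two filters (illustrative only; the function
names and the choice to pass short or already-signed streams through untouched
are my own assumptions, not an existing tool):

    import sys

    UTF8_SIG = b'\xef\xbb\xbf'

    def strip_signature(src, dst):
        # Remove one leading UTF-8 signature, pass everything else through unchanged.
        head = src.read(3)
        if head != UTF8_SIG:
            dst.write(head)
        dst.write(src.read())

    def add_signature(src, dst):
        # Prepend a UTF-8 signature unless the stream already starts with one.
        head = src.read(3)
        if head != UTF8_SIG:
            dst.write(UTF8_SIG)
        dst.write(head)
        dst.write(src.read())

    if __name__ == '__main__':
        strip_signature(sys.stdin.buffer, sys.stdout.buffer)

Hooked up to stdin/stdout like that, either one can sit at the head of a pipeline
so the utilities downstream never see the signature.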

In all other contexts, such as XML, if the standard allows for a 
signature, fine, and if not, don't use one. If there is no standard, 
you have to negotiate a private agreement if you want to send people 
something out of the ordinary.


Another way to look at the matter is to say that plain text is plain, 
and a signature is markup. Then a text file with a signature is, if 
not rich text, at least above the poverty line.
-- 

Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland




Re: [OT] bits and bytes

2001-05-18 Thread Peter_Constable


Morse code uses a one-bit scheme, if you will, or a small number of codes
(short/long sound and some 3 or 4 standard lengths of pauses) depending on
how you look at it.

Well, either you say that Morse code has a character set of three
characters: SPACE, DOT, DASH, meaning a two-bit encoding is required, or
you say that it has two characters: SPACE and BEEP, with the understanding
that two consecutive BEEPS are temporally contiguous -- and that would
require a one-bit encoding.

But I'm not really thinking of Morse code as characters in the sense we
generally use here. Rather, I think it is (in the terms of UTR#17) either a
Transfer Encoding Syntax, or a Character Encoding Form where the code units
are tones of a more or less constant frequency and of a fixed (one-bit
perspective) or tripartite (two-bit perspective) duration.
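
A toy sketch of the one-bit reading (my illustration only): with BEEP = 1 and
SPACE = 0, a dot is one beep unit, a dash is three contiguous beep units, the
inter-symbol gap is one silent unit and the inter-letter gap three.

    # Toy illustration of Morse as a one-bit (SPACE/BEEP) stream.
    DOT, DASH = '1', '111'              # a dash is three temporally contiguous beeps
    SYMBOL_GAP, LETTER_GAP = '0', '000'

    MORSE = {'S': [DOT, DOT, DOT], 'O': [DASH, DASH, DASH]}   # just enough for the demo

    def to_bits(word):
        return LETTER_GAP.join(SYMBOL_GAP.join(MORSE[c]) for c in word)

    # to_bits('SOS') == '101010001110111011100010101'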



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]






Re: UTF-8 signature in web and email

2001-05-18 Thread Michael \(michka\) Kaplan


michka

the only book on internationalization in VB at
http://www.i18nWithVB.com/

- Original Message -
From: Edward Cherlin [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, May 18, 2001 1:08 PM
Subject: Re: UTF-8 signature in web and email


 At 10:58 PM -0400 5/17/01, [EMAIL PROTECTED] wrote:
 The UTF-8 signature discussion appears every few months on this list,
 usually as a religious debate between those who believe in it and those
who
 do not.  Be forewarned, my religion may not match yours.  :-)

 My religion suggests that we find common ground and not engage in religious wars.

 Keld Jørn Simonsen wrote:
 
   For UTF-8 there is no need to have a BOM, as there is only one
   way of serializing octets in UTF-8. There is no little-endian
   or big-endian. A BOM is superfluous and will be ignored.

 You could say "should be ignored", but you can't speak for everybody
 else's software.

 The debate is not about whether byte order needs to be specified in a
UTF-8
 file (of course it doesn't) but whether U+FEFF should be used as a
signature
 to identify the file as UTF-8, rather than some other byte-oriented
encoding.

 Which will only work if the software is ready to handle it.

 Martin Dürst wrote:
 
   There is about 5% of a justification
   for having a 'signature' on a plain-text, standalone file (the reason
   being that it's somewhat easier to detect that the file is UTF-8 from
the
   signature than to read through the file and check the byte patterns
   (which is an extremely good method to distinguish UTF-8 from
everything
else)).
 [snip]
 OK, that's enough context.

 Last year, as previously the year before, we discussed the
 possibility of defining some standard Unicode plain text formats. The
 discussions foundered on the differences between text files meant for
 people to read, such as e-mail, FAQs, and so on, and text files meant
 for computers to process, such as delimited data files. We could not
 agree, for example, whether a limit on line length was to be
 required, permitted, or forbidden. We could not even agree that the
 rules would be different for different cases, and that we would
 attempt to enumerate the cases our standard would cover.

 This BOM-as-signature debate is of the same type. Is it to be
 required, permitted, forbidden, or something else? The short answer
 is No. Users do not agree, and software cannot be made to agree, not
 even if a formal standard were created and widely used.

 Martin knows of no actual cases where a non-UTF-8 file could be
 mistaken for UTF-8, so he says the signature is unnecessary, and goes
 on to say that it is actually harmful. Specifically, he asks how all
 Unix text-handling software could be made to work with a signature.
 It can't all be changed, but here is a possible method for coping.

 Create a filter that strips an initial signature from a text stream,
 and passes the remainder through unchanged. You can be picky and make
 it verify that the stream is in UTF-8, if you like.

 Create a filter that adds a signature to the beginning of a text
 stream, if it does not already have one. You can be picky, again.

 Create a filter that can identify character sets heuristically and
 convert them to UTF-8.

 Write your scripts carefully, so that you know when you are handling
 text in unknown character sets, and apply these filters as needed.

 Then ordinary Unix utilities will be fed data that they will not
 choke on, in known encodings without extraneous non-text data.

 In all other contexts, such as XML, if the standard allows for a
 signature, fine, and if not, don't use one. If there is no standard,
 you have to negotiate a private agreement if you want to send people
 something out of the ordinary.


 Another way to look at the matter is to say that plain text is plain,
 and a signature is markup. Then a text file with a signature is, if
 not rich text, at least above the poverty line.
 --

 Edward Cherlin
 Generalist
 "A knot!" exclaimed Alice. "Oh, do let me help to undo it."
 Alice in Wonderland







Re: UTF-8 signature in web and email

2001-05-18 Thread Michael \(michka\) Kaplan

From: Edward Cherlin [EMAIL PROTECTED]

A text file with a BOM is, if not rich text, at least above the poverty
line.

(modified from Ed's prior msg -- this one is a keeper!)

michka