RE: About Kana folding
Kenneth, thanks for the explanations.

So I'd suggest you be very careful when trying to do this kind of folding. If it is just for surface text matching, the number of false positive matches would likely swamp the number of false negatives you'd be correcting. On the other hand, if you are doing phonetic matching, then of course you have to fold the Hiragana and Katakana forms together.

I am trying to work around a situation where people cannot register a database key in Katakana and the same one in Hiragana (because the DB's collation does some Kana folding), yet they need to be able to find it using either of these (after this key has been migrated to some other system that doesn't do Kana folding). I don't know if that's what you call surface text matching. The matching will be done on the whole key, not using N-grams.

The more serious problem of equivalencing for matching in Japanese would be Kanji versus Hiragana, in particular. [...] Getting this kind of thing right is far more important for matching in Japanese than just brute matching of Hiragana to Katakana.

And if one wanted to do that automatically (which is not my intent; Kanji work fine), one would need a dictionary to go from words in Kanji to their Kana readings, is that true? YA
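The kind of Hiragana/Katakana folding under discussion can be sketched in a few lines. This is my own illustration, not code from the thread: it relies on the fact that the Katakana letters U+30A1..U+30F6 are laid out parallel to the Hiragana letters U+3041..U+3096, offset by 0x60, so a plain codepoint shift folds one script onto the other. (Characters outside that range, such as the prolonged sound mark U+30FC, are left alone, which is one reason real collations do more than this.)

```python
def fold_kana(text: str) -> str:
    """Fold Katakana letters onto their Hiragana counterparts
    by shifting codepoints in U+30A1..U+30F6 down by 0x60."""
    return "".join(
        chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )

# fold_kana("カタカナ") → "かたかな"; Hiragana input passes through unchanged.
```

A database doing Kana-insensitive collation would effectively apply a mapping like this to both keys before comparing them, which is why the Katakana and Hiragana spellings collide.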
Re: UTF-8 signature in web and email
At 22:58 01/05/17 -0400, [EMAIL PROTECTED] wrote:

Martin Dürst wrote: There is about 5% of a justification for having a 'signature' on a plain-text, standalone file (the reason being that it's somewhat easier to detect that the file is UTF-8 from the signature than to read through the file and check the byte patterns (which is an extremely good method to distinguish UTF-8 from everything else)).

A plain-text file is more in need of such a signature than any other type of file. It is true that "fancy" text such as HTML or XML, which already has a mechanism to indicate the character encoding, doesn't need a signature, but this is not necessarily true of plain-text files, which will continue to exist for a long time to come. The strategy of checking byte patterns to detect UTF-8 is usually accurate, but may require that the entire file be checked instead of just the first three bytes. In his September 1997 presentation in San Jose, Martin conceded that "Because probability to detect UTF-8 [without a signature] is high, but not 100%, this is a heuristic method" and then spent several pages evaluating and refining the heuristics. Using a signature is not somewhat easier, it is *much* easier.

Sorry, but I think your summary here is a bit slanted. I indeed used several pages, but the main aim was to show that in practice, it's virtually 100%, for many different cases. People using this heuristic, who didn't really think it would work that well after the talk, have confirmed later that it actually works extremely well (and they were writing production code, not just testing stuff). On the other hand, I have never met anybody who showed me an example where it actually didn't work; I would be interested to know about one if it exists. I said 'high, but not exactly 100%' because it was a technical talk and not a marketing talk. Could it be that this wasn't easy to understand for some of the audience? There is no actual need in practice to refine the heuristics.
The use of the signature may be easier than the heuristic, in particular if you want to know what the encoding of a file is before reading it. But in most cases, you will want to convert it somehow, and in that case it's easy to just read in bytes and decide lazily (i.e. when seeing the first few high-octet bytes) whether to transcode the rest of the file e.g. as Latin-1 or as UTF-8. Also, the signature really only helps if you are dealing with just two different encodings: a single legacy encoding and UTF-8. The signature won't help e.g. to keep apart Shift_JIS, EUC, and JIS (and UTF-8), but the heuristics used for these cases can easily be extended to UTF-8.

- When producing UTF-8 files/documents, *never* produce a 'signature'. There are quite a few receivers that cannot deal with it, or that deal with it by displaying something. And there are many other problems.

If U+FEFF is not interpreted as a BOM or signature, then by process of elimination it should be interpreted as a zero-width no-break space (ZWNBSP; more on this later). Any receiver that deals with a ZWNBSP by displaying a visible glyph is not very smart about the way it handles Unicode text, and should not be the deciding factor in how to encode it.

Don't think that display is everything that can be done to a text file. An XML processor that doesn't expect a signature in UTF-8 will correctly reject the file if the signature comes before an XML declaration. The same goes for many other formats and languages.

What are the "many other problems"? Does this comment refer to programs and protocols that require their own signatures as the first few bytes of an input file (like shell scripts)? The Unicode Standard 3.0 explicitly states on page 325, "Systems that use the byte order mark must recognize that an initial U+FEFF signals the byte order; it is not part of the textual content." Programs that go bonkers when handed a BOM need to be corrected to conform to the intent of the UTC.
This would mean changing all compilers, all other software dealing with formatted data, and so on, and all Unix utilities from 'cat' upwards. In many cases, these applications and utilities are designed to work without knowing what the encoding is; they work on a byte stream. This makes it simply impossible to conform to the above statement. If you have an idea how that can be solved, please tell us.

The problem goes even further. How should the 'signature' be handled in all the pieces of text data that may be passed around inside an application, or between applications, but not as files? Having to specify for each case who is responsible for adding or removing the 'signature', and doing the actual work, is just crazy. Regards, Martin.
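The byte-pattern heuristic Martin describes can be sketched as follows. This is my own illustration, not code from the thread; the function name is made up, and the check is deliberately simplified (it validates lead/continuation byte patterns but omits the finer overlong-sequence and surrogate exclusions a full validator would need). The point it demonstrates is that text valid under this check is almost never produced by accident in a legacy encoding.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if every high-bit byte in `data` participates in a
    well-formed UTF-8 multi-byte sequence (simplified check)."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                    # ASCII byte: always fine
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            n = 1                       # lead byte of a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            n = 2                       # lead byte of a 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            n = 3                       # lead byte of a 4-byte sequence
        else:
            return False                # invalid lead byte (e.g. Latin-1 text)
        if i + n >= len(data):
            return False                # sequence truncated at end of data
        for j in range(1, n + 1):
            if not 0x80 <= data[i + j] <= 0xBF:
                return False            # expected a continuation byte
        i += n + 1
    return True
```

Latin-1 text with accented letters fails almost immediately, because an isolated byte like 0xE9 ('é') demands two continuation bytes that ordinary legacy text will not supply; this is why the heuristic works so well in practice.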
Re: Ancient writing found in Turkmenistan
Mike Ayers wrote: ... However, we don't want the article, we want the picture! After lurking on this list for years, finally I can do something vaguely useful. :-) A piece about this appeared in The Times on Tuesday 15 May. There was a picture of the seal spread over three columns but this didn't make it to the online version of the story (which is at http://www.thetimes.co.uk/article/0,,3-201723,00.html, with no need to register, unlike its New York cousin). I have made some scans of the picture and put them at http://www.aldie.org.uk/unicode/. The Times credits the picture to Fred Hiebert, who is one of the archaeologists named in the story, so I hope I am fairly safe against rampaging copyright lawyers. :-) Peter Ilieve [EMAIL PROTECTED]
Re: [OT] bits and bytes
On Thu, 17 May 2001 15:39:02 -0500, Peter Constable wrote: Can anyone clarify for me how big a byte has ever been? (If you could identify the particular hardware, that would be helpful.)

The TR440, a German brand of computer (designed and built here at Konstanz), in use circa 1975..1990 (I don't remember the exact life span), had a 48 + 2 bit word: 2 bits of data-type flag, 48 bits of data proper. The addressing was mostly per 24-bit half-words, as a machine instruction or a pointer was stored in one half-word each (with the data-type flag set to 2).

For character-type processing (data-type flag = 3), there were two particular machine instructions, termed BNZ and CNZ (bringe/speichere nächstes Zeichen = load/store next character), which could address 6-bit, 8-bit, or 12-bit bytes, at the programmer's discretion (I do not quite remember whether these instructions could also handle smaller or larger fractions of a machine word, such as 2-bit, 4-bit, or 24-bit chunks; in any case, all 48 data bits of each word were used). However, most available software only exploited the 8-bit variant, and the only character encoding defined by the input/output routines of the operating system was an 8-bit single-byte coded character set (not counting a 6-bit byte mode available for backwards compatibility with the predecessor TR4). There was even another machine instruction, TOK (transportiere Oktaden = move octets), which was particularly designed for the COBOL compiler and (as its name implies) could only handle 8-bit bytes. Note that the term Oktade (octet) was used by the vendor of this machine as early as 1975 (or was it 1972?).

Aside: as a result of an integrierte Hardware-Software-Entwicklung (integrated hardware-software development), this TOK instruction used a particular addressing scheme, viz. a sequential numbering of the octets. Hence, the COBOL compiler multiplied the usual address (used elsewhere, e.g. in storage management) by 3 to arrive at an octet address, and then the hardware divided that address by 3 to arrive at the storage address used elsewhere (e.g. in the micro-program for storage access). Weird... Best wishes, Otto Stolz
Re: [OT] bits and bytes
I was hoping someone with a more detailed memory would mention this, but since no one has, and since it is a contender for having one of the largest minimal addressable units (other than microcode storage): I wrote a couple of programs for a Control Data Corporation (CDC) 6600 back in the early '70s. I recall that the smallest addressable unit was a 60-bit word (though there were special instructions to pack and unpack some size of character -- was it 6 bits?) Bob
Re: [OT] bits and bytes
Thanks for all the interesting feedback. Now let me ask a slightly different question: Prior to Unicode and ISO 10646, what were the smallest and largest code units ever used for representing character data? In the various responses, there was reference to 6- and 9-bit character representations (on the Unisys 1100: 6 for Fielddata, and 9 for ASCII). There may have been other references to big and small sizes for character data, but if so it wasn't clear to me whether specifically character data was involved. Any characters bigger than 9 bits or smaller than 6? - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
Re: [OT] bits and bytes
On 05/18/2001 09:39:18 AM Michael (michka) Kaplan wrote: Well, most of the various CJK encodings clearly would have a lot more than 9 bits to them. Kind of required for any system dealing with thousands of characters.

But do any of them encode using code units larger than 8 bits? Certainly if something like GB2312 were encoded in a flat (linear?) encoding that never used code-unit sequences, the code units would have to be larger than 9 bits. But I've only ever heard of them being handled using sequences of 8-bit code units. After sending out that second message, I noticed Nelson Beebe had said, However, kcc offered extended datatypes to access 6-bit, 7-bit, 8-bit, 9-bit, and 36-bit characters. Does that qualify for both the largest and the smallest code units to represent characters? - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
Re: [OT] bits and bytes
Well, most of the various CJK encodings clearly would have a lot more than 9 bits to them. Kind of required for any system dealing with thousands of characters. MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/ - Original Message - From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Friday, May 18, 2001 6:35 AM Subject: Re: [OT] bits and bytes [snip]
Re: [OT] bits and bytes
Now let me ask a slightly different question: Prior to Unicode and ISO 10646, what were the smallest and largest code units ever used for representing character data? Any characters bigger than 9 bits or smaller than 6?

Of course, Baudot was a 5-bit code used widely in Teletype networks, Telexes, TDDs, etc., and as the I/O code by computers that used such devices as their control consoles. I doubt that Baudot was ever used as a storage code for files, but who knows. However, Baudot-based TDDs still exist, even though they are often emulated by PCs (whose UARTs can be programmed for 5-bit bytes). For that matter, hex digits are 4-bit codes and some people read and write them fluently :-)

As to larger-than-9-bit bytes, when I visited Japan in the heyday of the DECSYSTEM-20 (mid 1980s), but before I knew much about character sets, I was told that the DEC-20 was prized in Japan because of its variable-length bytes, which were perfect for handling Japanese text. I wish I had found out more about this. - Frank
Re: [OT] bits and bytes
From: [EMAIL PROTECTED] But do any of them encode using code units larger than 8 bits? Certainly if something like GB2312 were encoded in a flat (linear?) encoding that never used code-unit sequences, the code units would have to be larger than 9 bits. But I've only ever heard of them being handled using sequences of 8-bit code units.

OK, I think I understand what you are saying here, but I am not sure how meaningful the question is in such a case. If an encoding is constructed such that I need two octets to represent all characters, then we are looking at a 16-bit requirement. The fact that it is done with particular ranges is not really something that has meaning in the context of your question, is it? MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/
Re: [OT] bits and bytes
[EMAIL PROTECTED] wrote: the smallest and largest size code units ever used for representing character data? Teletype machines commonly use a 5-bit code (Baudot, International Alphabet Nr. 2). It has Shift-In/Shift-Out codes to switch between an alphabetic default level and a level with digits and symbols. Morse code uses a one-bit scheme, if you will, or a small number of codes (short/long sound and some 3 or 4 standard lengths of pauses) depending on how you look at it. markus DL6FCT
Re: UTF-8 signature in web and email
At 10:58 PM -0400 5/17/01, [EMAIL PROTECTED] wrote: The UTF-8 signature discussion appears every few months on this list, usually as a religious debate between those who believe in it and those who do not. Be forewarned, my religion may not match yours. :-)

My religion suggests that we find common ground and not engage in religious wars.

Keld Jørn Simonsen wrote: For UTF-8 there is no need to have a BOM, as there is only one way of serializing octets in UTF-8. There is no little-endian or big-endian. A BOM is superfluous and will be ignored.

You could say it should be ignored, but you can't speak for everybody else's software. The debate is not about whether byte order needs to be specified in a UTF-8 file (of course it doesn't) but whether U+FEFF should be used as a signature to identify the file as UTF-8, rather than some other byte-oriented encoding. Which will only work if the software is ready to handle it.

Martin Dürst wrote: There is about 5% of a justification for having a 'signature' on a plain-text, standalone file (the reason being that it's somewhat easier to detect that the file is UTF-8 from the signature than to read through the file and check the byte patterns (which is an extremely good method to distinguish UTF-8 from everything else)). [snip]

OK, that's enough context. Last year, as the year before, we discussed the possibility of defining some standard Unicode plain text formats. The discussions foundered on the differences between text files meant for people to read, such as e-mail, FAQs, and so on, and text files meant for computers to process, such as delimited data files. We could not agree, for example, whether a limit on line length was to be required, permitted, or forbidden.
The short answer is No. Users do not agree, and software cannot be made to agree, not even if a formal standard were created and widely used.

Martin knows of no actual cases where a non-UTF-8 file could be mistaken for UTF-8, so he says the signature is unnecessary, and goes on to say that it is actually harmful. Specifically, he asks how all Unix text-handling software could be made to work with a signature. It can't all be changed, but here is a possible method for coping:

- Create a filter that strips an initial signature from a text stream and passes the remainder through unchanged. You can be picky and make it verify that the stream is in UTF-8, if you like.
- Create a filter that adds a signature to the beginning of a text stream, if it does not already have one. You can be picky, again.
- Create a filter that can identify character sets heuristically and convert them to UTF-8.
- Write your scripts carefully, so that you know when you are handling text in unknown character sets, and apply these filters as needed.

Then ordinary Unix utilities will be fed data that they will not choke on, in known encodings without extraneous non-text data. In all other contexts, such as XML, if the standard allows for a signature, fine, and if not, don't use one. If there is no standard, you have to negotiate a private agreement if you want to send people something out of the ordinary.

Another way to look at the matter is to say that plain text is plain, and a signature is markup. Then a text file with a signature is, if not rich text, at least above the poverty line. -- Edward Cherlin Generalist "A knot!" exclaimed Alice. "Oh, do let me help to undo it." Alice in Wonderland
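The first filter Edward describes, stripping an initial signature and passing the rest through, is only a few lines. This is a sketch of my own (the function name and chunk size are arbitrary), not code from the thread:

```python
# Filter: remove a leading UTF-8 signature (EF BB BF) from a byte
# stream, if present, and copy everything else through unchanged.
import sys

BOM = b"\xef\xbb\xbf"

def strip_signature(stream_in, stream_out):
    head = stream_in.read(3)
    if head != BOM:
        stream_out.write(head)      # no signature: keep these bytes
    while True:                     # copy the remainder verbatim
        chunk = stream_in.read(65536)
        if not chunk:
            break
        stream_out.write(chunk)

if __name__ == "__main__":
    strip_signature(sys.stdin.buffer, sys.stdout.buffer)
```

Dropped into a pipeline ahead of ordinary Unix utilities, a filter like this keeps the signature question out of the tools themselves, which is the whole point of Edward's proposal. The companion "add a signature" filter is the mirror image: write the three BOM bytes first unless the stream already starts with them.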
Re: [OT] bits and bytes
Morse code uses a one-bit scheme, if you will, or a small number of codes (short/long sound and some 3 or 4 standard lengths of pauses) depending on how you look at it.

Well, either you say that Morse code has a character set of three characters: SPACE, DOT, DASH, meaning a two-bit encoding is required, or you say that it has two characters: SPACE and BEEP, with the understanding that two consecutive BEEPs are temporally contiguous -- and that would require a one-bit encoding. But I'm not really thinking of Morse code as characters in the sense we generally use here. Rather, I think it is (in the terms of UTR#17) either a Transfer Encoding Syntax, or a Character Encoding Form where the code units are tones of a more or less constant frequency and of a fixed (one-bit perspective) or tripartite (two-bit perspective) duration. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
Re: UTF-8 signature in web and email
michka the only book on internationalization in VB at http://www.i18nWithVB.com/ - Original Message - From: Edward Cherlin [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Friday, May 18, 2001 1:08 PM Subject: Re: UTF-8 signature in web and email [snip]
Re: UTF-8 signature in web and email
From: Edward Cherlin [EMAIL PROTECTED] A text file with a BOM is, if not rich text, at least above the poverty line. (modified from Ed's prior msg -- this one is a keeper!) michka