Re: font-encoded hacks

2016-10-07 Thread Martin J. Dürst
Hello Andrew, On 2016/10/07 11:11, Andrew Cunningham wrote: Considering the mess that ad hoc fonts create. What is the best way forward? That's very clear: Use Unicode. Zwekabin, Mon, Zawgyi, and Zawgyi-Tai and their ilk? Most government translations I am seeing in Australia for Burmese are

Re: Why incomplete subscript/superscript alphabet ?

2016-10-04 Thread Martin J. Dürst
On 2016/10/04 19:35, Marcel Schneider wrote: On Mon, 3 Oct 2016 13:47:09 -0700, Asmus Freytag (c) wrote: Later, the beta and gamma were encoded for phonetic notation, but not the alpha. As a result, you can write basic formulas for select compounds, but not all. Given that these basic

Re: Dates in Japanese Era Names in Unicode Standard

2016-09-29 Thread Martin J. Dürst
68 October 23 Gregorian. Meiji 1 January 1 Lunar (and Keio 4 January 1 Lunar) is 1868 January 25 Gregorian. My best guess is that the author of Table 22-8 picked up the year value from spreadsheet showing "1867-12-31" in local time, originally intended to show merely "1868-01".

Re: Dates in Japanese Era Names in Unicode Standard

2016-09-29 Thread Martin J. Dürst
pon the day after Emperor Akihito's succession to the throne on 7 January 1989. -- Martin J. Dürst Department of Intelligent Information Technology College of Science and Engineering Aoyama Gakuin University Fuchinobe 5-1-10, Chuo-ku, Sagamihara 252-5258 Japan

Re: [Unicode] how to evaluate the "emoji support level" in given font?

2016-09-13 Thread Martin J. Dürst
Or, if such attempt (evaluate the support level of emoji by checking some codepoints) is wrong, is there any good method to evaluate the support level of emoji in given font? Regards, mpsuzuki . -- Martin J. Dürst Department of Intelligent Information Technology College of Science and Engine

Re: Whitespace characters in Unicode

2016-08-08 Thread Martin J. Dürst
On 2016/08/08 08:08, Sean Leonard wrote: On 8/6/2016 11:30 AM, Doug Ewell wrote: Additionally, in UTF-8, either LS or PS actually takes more bytes than CR plus LF, so the "increased text size" argument also discouraged use of the new controls. That is true, it takes 3 bytes. However, the
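The byte counts discussed in this message can be checked with a short Python sketch. The helper name `utf8_len` is made up for illustration and does not come from the thread.

```python
# Compare the UTF-8 size of the Unicode line and paragraph separators
# with the traditional CR LF pair.
LS = "\u2028"    # LINE SEPARATOR
PS = "\u2029"    # PARAGRAPH SEPARATOR
CRLF = "\r\n"    # CARRIAGE RETURN followed by LINE FEED


def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    for name, value in (("LS", LS), ("PS", PS), ("CR LF", CRLF)):
        print(name, utf8_len(value), "bytes")
```

LS and PS each fall in the three-byte range of UTF-8 (U+0800 to U+FFFF), while CR and LF are one byte each.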

Re: Non-standard 8-bit fonts still in use

2016-05-02 Thread Martin J. Dürst
Hello Don, I agree with Doug that creating a good keyboard layout is a good thing to do. Among the people on this list, you probably have the best contacts, and can help create some test layouts and see how people react. Also, creating fonts that have the necessary coverage but are encoded

Re: Support for Latin ligature IJ (was another thread)

2016-03-30 Thread Martin J. Dürst
On 2016/03/31 06:42, Philippe Verdy wrote: The use of "ÿ" in Dutch should also be considered as an orthographic fault, and it should be corrected into "ij" (to solve the capitalization problem), but there are occurrences in Dutch of "ÿ" which are correct (notably in borrowed French toponyms such

Re: Swapcase for Titlecase characters

2016-03-19 Thread Martin J. Dürst
Thanks everybody for the feedback. On 2016/03/19 04:33, Marcel Schneider wrote: On Fri, Mar 18, 2016, 08:43:56, Martin J. Dürst wrote: b) Convert to upper (or lower), which may simplify implementation. For example, 'Džinsi' (jeans) would become 'DžINSI' with a), 'DŽINSI' (or 'džinsi') with b
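The two options named in this message can be sketched in Python as follows. This is only an illustration, not Ruby's implementation: the function name `swapcase` and its `titlecase` parameter are hypothetical, and "titlecase character" is taken here to mean general category Lt.

```python
import unicodedata


def swapcase(text: str, titlecase: str = "keep") -> str:
    """Swap case, with a selectable policy for titlecase (Lt) characters.

    titlecase="keep"  -> option a): leave titlecase characters unchanged
    titlecase="upper" -> option b): convert titlecase characters to uppercase
    """
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Lt":
            out.append(ch.upper() if titlecase == "upper" else ch)
        elif ch.isupper():
            out.append(ch.lower())
        elif ch.islower():
            out.append(ch.upper())
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    word = "\u01C5insi"                  # 'Džinsi' with the titlecase digraph
    print(swapcase(word, "keep"))        # option a)
    print(swapcase(word, "upper"))       # option b), uppercase variant
```

With option a) the digraph U+01C5 stays as it is, and with the uppercase variant of option b) it becomes U+01C4, matching the 'DžINSI' and 'DŽINSI' results quoted in the message.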

Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-19 Thread Martin J. Dürst
On 2016/03/19 04:55, Garth Wallace wrote: On Fri, Mar 18, 2016 at 11:48 AM, Philippe Verdy wrote: 2016-03-18 19:11 GMT+01:00 Garth Wallace : Rotation is definitely not salient in standard go kifu like it is in fairy chess notation. Go variants for more

Swapcase for Titlecase characters

2016-03-18 Thread Martin J. Dürst
I'm working on extending the case conversion methods for the programming language Ruby from the current ASCII only to cover all of Unicode. Ruby comes with four methods for case conversion. Three of them, upcase, downcase, and capitalize, are quite clear. But we have hit a question for the

Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-09 Thread Martin J. Dürst
On 2016/03/10 07:52, Ken Whistler wrote: I don't know the answer to this. But I suspect that the source was from one of the collection of fonts associated with the STIX project research that led to the collection of mathematical symbols additions noted in L2/01-067 (superseded by

Re: precomposed polytonic Greek characters with macrons and other diacritics

2016-02-08 Thread Martin J. Dürst
On 2016/02/09 02:10, James Tauber wrote: http://jktauber.com/2016/01/28/polytonic-greek-unicode-is-still-not-perfect/ Hello James, I read your article. I just wanted to point out that in your problem 3, the two sequences aren't normalized because if the acute accent is first, that would be

Re: Unicode in the Curriculum?

2016-01-05 Thread Martin J. Dürst
I agree to a certain extent with Julian. There are extremely many subjects industry surely would like computer science students to learn in college, and internationalization/Unicode is only one of them. On the other hand, I think that universities teach about integer and floating point

Re: Proposal for German capital letter "ß"

2015-12-10 Thread Martin J. Dürst
Hello Marc, On 2015/12/10 14:35, Marc Blanchet wrote: This is an interesting example of a phenomenon that turns up in many other contexts, too. A similar example is the use of accents on upper-case letters in French in France where 'officially', upper-case letters are written without accents.

Re: Proposal for German capital letter "ß"

2015-12-09 Thread Martin J. Dürst
On 2015/12/10 09:30, Mark E. Shoulson wrote: I remember when we went through all this the first time around, encoding ẞ in the first place. People were saying "But the Duden says no!!!" And someone then pointed out, "Please close your Duden and cast your gaze upon ITS FRONT COVER, where you

Re: Devanagari and Subscript and Superscript

2015-12-08 Thread Martin J. Dürst
Hello Plug, I suggest using HTML: बक ्ष Regards, Martin. On 2015/12/09 12:24, Plug Gulp wrote: Hi, I am trying to understand if there is a way to use Devanagari characters (and grapheme clusters) as subscript and/or superscript in unicode text. It will help if someone could please direct

Re: ZWJ, ZWNJ and Markup languages.

2015-11-27 Thread Martin J. Dürst
On 2015/11/28 04:55, Plug Gulp wrote: The Unicode standard 8.0 states in chapter 23, section titled "Cursive Connection and Ligatures"(printed page #814, PDF page #850) that: "The zero width joiner and non-joiner characters are designed for use in plain text; they should not be used where

Re: A Bulldog moves on

2015-10-24 Thread Martin J. Dürst
Hello Doug, Thanks for making us aware of this very sad event. Michael did a lot for Unicode, and fought bravely with his illness. I hope we can all remember him this week at the Unicode Conference, where he gave so many amazing talks. I also hope that somebody somehow will be able to

Re: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)

2015-10-23 Thread Martin J. Dürst
On 2015/10/24 02:11, Rick McGowan wrote: William, All right... This is likely to be my last posting on the subject... ... there has been much objection to my invention in this mailing list over the years, with no good reason ever stated, ... If this invention had been made in the research

Re: Rights to the Emoji

2015-10-11 Thread Martin J. Dürst
You can also design your own version of the emoji you want to use. [I'm not a lawyer, but as far as I understand,] what's protected is the individual design, not the idea of a "donut" or "frowning face" emoji as such. Regards, Martin. On 2015/10/12 09:51, Shervin Afshar wrote: Those

Re: Unicode in passwords

2015-10-05 Thread Martin J. Dürst
On 2015/10/01 13:11, Jonathan Rosenne wrote: For languages such as Java, passwords should be handled as byte arrays rather than strings. This may make it difficult to apply normalization. Well, they should be received from the user interface as strings, then normalized, then converted to
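The order of operations suggested here (receive a string, normalize it, then convert it) might look like the following Python sketch. The choice of NFC and of UTF-8 as the byte encoding are assumptions for the example, since the snippet does not name either, and `password_to_bytes` is a hypothetical helper.

```python
import unicodedata


def password_to_bytes(password: str, form: str = "NFC") -> bytearray:
    """Normalize a password string, then convert it to a mutable byte array.

    Assumptions: NFC normalization and UTF-8 encoding. Both are choices
    made for this sketch, not something the thread prescribes.
    """
    normalized = unicodedata.normalize(form, password)
    return bytearray(normalized.encode("utf-8"))


if __name__ == "__main__":
    precomposed = "caf\u00e9"      # e with acute as a single code point
    decomposed = "cafe\u0301"      # e followed by COMBINING ACUTE ACCENT
    print(password_to_bytes(precomposed) == password_to_bytes(decomposed))
```

Normalizing before the conversion means that two keyboards producing different but canonically equivalent sequences still yield the same bytes for hashing.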

Re: Unicode in passwords

2015-10-05 Thread Martin J. Dürst
Some additional concerns: - Input methods for Chinese, Japanese,... need visual feedback to check that the correct Han character was selected. That may show (some parts of) the password to bystanders. - Length limitations of 8 bytes are few and far between these days, but they still exist.

Re: Deleting Lone Surrogates

2015-10-05 Thread Martin J. Dürst
On 2015/10/05 04:30, Asmus Freytag (t) wrote: On 10/4/2015 6:02 AM, Richard Wordingham wrote: In the absence of a specific tailoring, is the combination of a lone surrogate and a combining mark a user-perceived character? Does a lone surrogate constitute a user-perceived character? In an

Re: Concise term for non-ASCII Unicode characters

2015-09-21 Thread Martin J. Dürst
Hello Doug, On 2015/09/22 00:42, Doug Ewell wrote: I was thinking that something like "non–Basic-Latin Unicode" might be Is that non-Basic Latin or not Basic-Latin? useful. It avoids the confusion of referring to ASCII as a range of code points instead of a separate encoding standard.

Re: Concise term for non-ASCII Unicode characters

2015-09-20 Thread Martin J. Dürst
Hello Sean, On 2015/09/20 23:48, Sean Leonard wrote: What is the most concise term for characters or code points So we already have two different things we might need a term for. outside of the US-ASCII range (U+0000 - U+007F)? Sometimes I have referred to these as "extended characters"
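Whatever term is chosen, the set in question is easy to pin down in code. A minimal Python sketch, with `non_ascii` as an illustrative name:

```python
def non_ascii(text: str) -> list[str]:
    """Return the characters of `text` whose code point lies above U+007F."""
    return [ch for ch in text if ord(ch) > 0x7F]


if __name__ == "__main__":
    print(non_ascii("Martin J. D\u00fcrst"))   # only the u with diaeresis
    print(non_ascii("plain ASCII"))            # empty list
```

The test is on code points, so it is independent of whether the text is stored as UTF-8, UTF-16, or something else.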

Re: Upcoming proposal for Bitcoin sign

2015-09-06 Thread Martin J. Dürst
Hello Ken, You write "The bitcoin sign and baht symbol are two unrelated symbols that have some visual similarity.", but don't really give any supporting information for that claim. For example, searching for images of bitcoin and baht symbols shows that the Bitcoin usually has two vertical

Re: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?

2015-08-27 Thread Martin J. Dürst
Sorry to be late. Just some background information. On 2015/04/28 14:57, Makoto Kato wrote: Although I read JIS X 4051, it doesn't define that half-width katakana and full-width katakana are differently. I was on the committee that updated JIS X 4051 (mostly liaison/observer role). The

Re: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?

2015-08-27 Thread Martin J. Dürst
Sorry, one correction: On 2015/08/27 16:39, Martin J. Dürst wrote: In practice, technical restrictions in early limitations (one byte == one (half-width) character cell) led to a typographic distinction. The fact that half-width Kana used less space was exploited in fixed-pitch screen design

Re: Emoji characters for food allergens

2015-07-29 Thread Martin J. Dürst
On 2015/07/29 23:27, Andrew West wrote: On 29 July 2015 at 14:42, William_J_G Overington My diet can include soya There already is, you can write My diet can include soya. If you are likely to swell up and die if you eat a peanut (for example), you will not want to trust your life to an

Re: Mark-up to Indicate Words

2015-07-15 Thread Martin J. Dürst
Hello Richard, On 2015/07/15 16:49, Richard Wordingham wrote: What mark-up schemes exist to show that a sequence of letters and combining marks constitutes a single word? Such mark-up would be useful when using spell checkers. At present, I use U+2060 WORD JOINER (WJ) to indicate the absence

Re: International Register of Coded Character Sets

2015-06-21 Thread Martin J. Dürst
On 2015/06/22 05:37, Frédéric Grosshans wrote: I don't know if it's what you're looking for but Google brought me to the following URL. https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf I managed to download the pdf without problems. I also successfully downloaded a standard (

Re: Tag characters and in-line graphics (from Tag characters)

2015-06-05 Thread Martin J. Dürst
On 2015/06/04 17:03, Chris wrote: I wish Steve Jobs was here to give this lecture. Well, if Steve Jobs were still around, he could think about whether (and how many) users really want their private characters, and whether it was worth the time to have his engineers working on the solution.

Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Martin J. Dürst
On 2015/06/03 07:55, Chris wrote: As you point out, The UCS will not encode characters without a demonstrated usage.”. But there are use cases for characters that don’t meet UCS’s criteria for a world wide standard, but are necessary for more specific use cases, like specialised regional,

Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Martin J. Dürst
On 2015/05/29 11:37, John wrote: If I had a large document that reused a particular character thousands of times, Then it would be either a very boring document (containing almost only that same character) or it would be a very large document. would this HTML markup require embedding that

Re: Compatibility decomposition for Hebrew and Greek final letters

2015-02-19 Thread Martin J. Dürst
On 2015/02/20 05:17, Eli Zaretskii wrote: From: Philippe Verdy verd...@wanadoo.fr Date: Thu, 19 Feb 2015 20:31:07 +0100 Cc: Julian Bradfield jcb+unic...@inf.ed.ac.uk, unicode Unicode Discussion unicode@unicode.org The decompositions are not needed for plain text searches, that can use

Re: Compatibility decomposition for Hebrew and Greek final letters

2015-02-19 Thread Martin J. Dürst
On 2015/02/19 20:47, Julian Bradfield wrote: On 2015-02-19, Eli Zaretskii e...@gnu.org wrote: Does anyone know why does the UCD define compatibility decompositions for Arabic initial, medial, and final forms, but doesn't do the same for Hebrew final letters, like U+05DD HEBREW LETTER FINAL MEM?

Re: The NEW Keyboard Layout—IEAOU

2015-01-25 Thread Martin J. Dürst
What's better on this keyboard when compared to the Dvorak layout? At first sight, it looks heavily right-handed, all the letters that the Dvorak keyboard has on the homerow are on the right hand. Regards, Martin. P.S.: I'm a happy Dvorak user. On 2015/01/26 06:54, Robert Wheelock wrote:

Re: Unicode encoding policy

2014-12-23 Thread Martin J. Dürst
On 2014/12/24 09:50, Tex Texin wrote: True, however as William points out, apparently the rules have changed, I hope the rules get clarified to clearly state that these are exceptions. so it isn’t unreasonable to ask again whether the rules now allow it, or if people that dismissed the idea

Re: emoji are clearly the current meme fad

2014-12-17 Thread Martin J. Dürst
On 2014/12/18 06:49, Michael Everson wrote: Clearly the plural of emoji is emojis. Not in Japanese, where there are no plural forms. The question of what it is/will be in English will be decided by usage, not by grammar. I'd use 'emoji', but then I'm too biased towards Japanese to be

Code charts and code points (was: Re: fonts for U7.0 scripts)

2014-10-24 Thread Martin J. Dürst
On 2014/10/24 10:21, Asmus Freytag wrote: Peter is correct. The only fonts that should be released to the public are those that are Unicode encoded and have the correct shaping tables. Unlike the public, the code chart editors for Unicode have tools that can correctly handle not only

Re: Request for Information

2014-07-24 Thread Martin J. Dürst
On 2014/07/24 15:37, Richard Wordingham wrote: No. The text samples I could find quickly show scripta continua, but I suspect the line breaks are occurring at word or syllable boundaries. If I am right about the constraint on line break position, then this can be recovered by marking the

Re: Corrigendum #9

2014-06-03 Thread Martin J. Dürst
On 2014/06/03 07:08, Asmus Freytag wrote: On 6/2/2014 2:53 PM, Markus Scherer wrote: On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com mailto:prosfil...@gmail.com wrote: I would especially discourage any web browser from handling these; they're noncharacters used for

Re: FYI: More emoji from Chrome

2014-04-02 Thread Martin J. Dürst
On 2014/04/02 20:08, Christopher Fynn wrote: On 02/04/2014, Asmus Freytag asm...@ix.netcom.com wrote: On 4/2/2014 1:42 AM, Christopher Fynn wrote: Rather than Emoji it might be better if people learnt Han ideographs which are also compact (and a far more developed system of communication than

Re: Emoji

2014-04-02 Thread Martin J. Dürst
On 2014/04/03 02:00, James Lin wrote: Emoji or 顔文字, literally means Face word or Face Characters, essentially, Emoji is 絵文字 (picture character), 顔文字 is kaomoji (face character). Regards, Martin. provides an emotional state in the context of words. Emoji is very popular in APJ, and

Re: FYI: More emoji from Chrome

2014-04-01 Thread Martin J. Dürst
Now that it's no longer April 1st (at least not here in Japan), I can add a (moderately) serious comment. On 2014/04/02 01:43, Ilya Zakharevich wrote: On Tue, Apr 01, 2014 at 09:01:39AM +0200, Mark Davis ☕️ wrote: More emoji from Chrome:

Fwd: Updated Japanese Legacy Standard? (was: Re: Romanized Singhala got great reception in Sri Lanka)

2014-03-28 Thread Martin J. Dürst
J. Dürst due...@it.aoyama.ac.jp On 2014/03/16 14:36, Philippe Verdy wrote: You may still want to promote it at some government or education institution, in order to promote it as a national standard, except that there's little chance it will ever happen when all countries in ISO have stopped

Fwd: Re: Romanized Singhala got great reception in Sri Lanka

2014-03-28 Thread Martin J. Dürst
I got informed today by your IT Dept. that the mail below never went out. Resent herewith. Martin. Original Message Subject: Re: Romanized Singhala got great reception in Sri Lanka Date: Mon, 17 Mar 2014 14:37:00 +0900 From: Martin J. Dürst due...@it.aoyama.ac.jp On 2014

Re: Request for review: 3023bis (XML media types) makes significant changes

2013-12-18 Thread Martin J. Dürst
Hello Henry, Some comments on your specific questions, which may trigger some additional discussion. On 2013/12/12 1:43, Henry S. Thompson wrote: I'm one of the editors of a proposed replacement for RFC3023 [1], the media type registration for application/xml, text/xml and 3 others. The

Re: ¥ instead of \

2013-10-27 Thread Martin J. Dürst
On 2013/10/23 4:22, Asmus Freytag wrote: On 10/22/2013 11:38 AM, Jean-François Colson wrote: Hello. I know that in some Japanese encodings (JIS, EUC), \ was replaced by a ¥. On my computer, there are some Japanese fonts where the characters seems coded following Unicode, except for the \

Re: COMBINING OVER MARK?

2013-10-02 Thread Martin J. Dürst
On 2013/10/02 9:52, Leo Broukhis wrote: Thanks! That comes out exactly right, although using math markup for linguistic purposes is, IMO, a stretch. Why? Surely like in other fields (Math to start with), there somewhere is a boundary between plain text and rich text. Of course it's not

Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-07-05 Thread Martin J. Dürst
On 2013/07/05 16:04, Denis Jacquerye wrote: On Thu, Jul 4, 2013 at 12:07 PM, Michael Eversonever...@evertype.com wrote: The problem is in pretending that a cedilla and a comma below are equivalent because in some script fonts in France or Turkey routinely write some sort of

Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Martin J. Dürst
On 2013/07/05 17:25, Stephan Stiller wrote: What I had in mind was more specific: Germans are supposed to convert [ä,ö,ü,ß] to [ae,oe,ue,ss], though I don't know what's considered best/legal wrt documents required for entering the US, for example. I have always used Duerst on plane tickets
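The replacement convention described here can be written as a small lookup table. In this Python sketch, the lowercase and ß mappings come from the message. The uppercase mappings (Ä to Ae, and so on) are an assumption added for the example, and `transliterate_german` is a hypothetical name.

```python
import unicodedata

# Lowercase and sharp-s mappings are taken from the message.
# The uppercase rows are an assumption made for this sketch.
_TABLE = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
})


def transliterate_german(text: str) -> str:
    """Replace umlauts and sharp s with their two-letter ASCII spellings."""
    # Compose first so that 'u' + COMBINING DIAERESIS is also caught.
    return unicodedata.normalize("NFC", text).translate(_TABLE)


if __name__ == "__main__":
    print(transliterate_german("D\u00fcrst"))
```

Applied to the author's own name, this gives the "Duerst" spelling he mentions using on plane tickets.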

Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-07-03 Thread Martin J. Dürst
On 2013/06/22 0:32, Michael Everson wrote: On 21 Jun 2013, at 16:20, Khaled Hosnykhaledho...@eglug.org wrote: Yeah, I don't believe that you can language-tag individual file names for such display as that is markup. Why do you need to? You only need one language, it is not like file names

Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)

2013-04-23 Thread Martin J. Dürst
On 2013/04/23 18:01, William_J_G Overington wrote: On Monday 22 April 2013, Asmus Freytagasm...@ix.netcom.com wrote: I'm always suspicious if someone wants to discuss scope of the standard before demonstrating a compelling case on the merits of wide-spread actual use. The reason that I

Re: Why wasn't it possible to encode a coeng-like joiner for Tibetan?

2013-04-12 Thread Martin J. Dürst
On 2013/04/11 16:30, Michael Everson wrote: On 11 Apr 2013, at 00:09, Shriramana Sharmasamj...@gmail.com wrote: Or was the Khmer model of an invisible joiner a *later* bright idea? Yes. Later, yes. Bright? Most Cambodian experts disagree. Regards, Martin.

Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

2013-02-04 Thread Martin J. Dürst
Hello Roger, The conclusion to your question below is a very clear NO. The reason is that most text is already in NFC. In fact, as I wrote a few days or weeks ago, NFC was defined to capture what's usually around on the Web (and in other places, too). Trying to recommend that everything be in

Re: Normalization rate on the Web

2013-01-21 Thread Martin J. Dürst
On 2013/01/22 1:12, Denis Jacquerye wrote: Does anybody have any idea of how much of the Web is normalized in NFC or NFD? Or how much not normalized? I have never measured this. But at one time, there was only NFD (and NFKD). The Unicode Consortium, with input from W3C, then defined NFC (and

Re: What does it mean to not be a valid string in Unicode?

2013-01-08 Thread Martin J. Dürst
On 2013/01/08 14:43, Stephan Stiller wrote: Wouldn't the clean way be to ensure valid strings (only) when they're built Of course, the earlier erroneous data gets caught, the better. The problem is that error checking is expensive, both in lines of code and in execution time (I think there

Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Martin J. Dürst
On 2013/01/08 3:27, Markus Scherer wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. Things
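The distinction between "readable as code points" and "well-formed UTF-16" can be illustrated in Python, whose strings may also contain unpaired surrogates. This is a sketch only: `is_well_formed` is a hypothetical helper, and Python's behaviour is used here as a stand-in for the 16-bit string libraries the message refers to.

```python
def is_well_formed(text: str) -> bool:
    """Return True if `text` can be encoded as well-formed UTF-16."""
    try:
        text.encode("utf-16-le")      # strict mode rejects lone surrogates
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    lone = "a\ud800b"                  # unpaired high surrogate in the middle
    # Iterating still yields the surrogate as itself, one code point at a time.
    print([f"U+{ord(ch):04X}" for ch in lone])
    print(is_well_formed(lone))
    # The 'surrogatepass' error handler lets the code unit through unchanged.
    print(lone.encode("utf-16-le", errors="surrogatepass"))
```

Processing inside the program works on the string as it is. The strict check only matters at the point where data is exchanged as UTF-16 or UTF-8.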

Re: Why is endianness relevant when storing data on disks but not when in memory?

2013-01-05 Thread Martin J. Dürst
On 2013/01/06 7:21, Costello, Roger L. wrote: Does this mean that when exchanging Unicode data across the Internet the endianness is not relevant? Are these stated correctly: When Unicode data is in a file we would say, for example, The file contains UTF-32BE data. When Unicode
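The difference between the two serializations asked about here is easy to see for a single code point. A small Python sketch, using U+00E9 as an arbitrary example and `utf32_bytes` as an illustrative name:

```python
def utf32_bytes(ch: str) -> tuple[bytes, bytes]:
    """Return the UTF-32BE and UTF-32LE serializations of one character."""
    return ch.encode("utf-32-be"), ch.encode("utf-32-le")


if __name__ == "__main__":
    big, little = utf32_bytes("\u00e9")
    print(big.hex(" "))      # most significant byte first
    print(little.hex(" "))   # least significant byte first
    # In memory the program just sees the integer code point.
    print(hex(ord("\u00e9")))
```

Inside a running program the value is one 32-bit integer, and byte order only becomes visible once that integer is written out as a sequence of bytes.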

Re: Character name translations

2012-12-20 Thread Martin J. Dürst
On 2012/12/21 0:59, Asmus Freytag wrote: There have been efforts at a Japanese translation of the text of the standard, I have no idea whether that contains translated names for characters. JIS X 0221-1995, which is a translation of ISO 10646, contains some Japanese character names, but this

Tool to convert characters to character names

2012-12-19 Thread Martin J. Dürst
I'm looking for a (preferably online) tool that converts Unicode characters to Unicode character names. Richard Ishida's tools (http://rishida.net/tools/conversion/) do a lot of conversions, but not names. Regards, Martin.

Why 17 planes? (was: Re: Why 11 planes?)

2012-11-27 Thread Martin J. Dürst
Well, first, it is 17 planes (or have we switched to using hexadecimal numbers on the Unicode list already?) Second, of course this is in connection with UTF-16. I wasn't involved when UTF-16 was created, but it must have become clear that 2^16 (^ denotes exponentiation (to the power of))
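The connection between UTF-16 and the number 17 can be spelled out as arithmetic. A Python sketch of the standard surrogate-pair calculation (the variable names are illustrative):

```python
# Surrogate code unit ranges defined for UTF-16.
HIGH_SURROGATES = range(0xD800, 0xDC00)   # 1024 lead values
LOW_SURROGATES = range(0xDC00, 0xE000)    # 1024 trail values

PLANE_SIZE = 0x10000                      # 2**16 code points per plane

# Every (high, low) pair addresses one supplementary code point.
supplementary = len(HIGH_SURROGATES) * len(LOW_SURROGATES)   # 2**20
supplementary_planes = supplementary // PLANE_SIZE           # 16
total_planes = 1 + supplementary_planes                      # BMP + 16
last_code_point = PLANE_SIZE + supplementary - 1

if __name__ == "__main__":
    print(total_planes, hex(total_planes), f"U+{last_code_point:X}")
```

The total is 17 in decimal and 0x11 in hexadecimal, which is presumably the source of the "11 planes" in the earlier subject line.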

Re: cp1252 decoder implementation

2012-11-27 Thread Martin J. Dürst
On 2012/11/17 12:54, Buck Golemon wrote: On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewelld...@ewellic.org wrote: Buck Golemon wrote: Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and to map it to the equally-non-semantic U+81 ? U+0081 (there are always at least four
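The mapping under discussion can be tried out in Python, whose built-in `cp1252` codec treats 0x81 as undefined. The sketch below shows a pass-through decoder and the U+ notation with at least four hex digits. The function names are hypothetical, and the pass-through behaviour is one possible policy, not something any standard mandates.

```python
def decode_cp1252_passthrough(data: bytes) -> str:
    """Decode cp1252, mapping undefined bytes to the same-numbered code point."""
    out = []
    for byte in data:
        try:
            out.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Bytes such as 0x81 have no cp1252 assignment in Python's codec;
            # pass them through as the C1 control with the same number.
            out.append(chr(byte))
    return "".join(out)


def u_notation(ch: str) -> str:
    """Format a code point as U+XXXX, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"


if __name__ == "__main__":
    for byte in (0x80, 0x81):
        ch = decode_cp1252_passthrough(bytes([byte]))
        print(f"0x{byte:02X} -> {u_notation(ch)}")
```

Here 0x80 decodes to the euro sign as usual, while 0x81 comes out as U+0081.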

Re: Why 17 planes?

2012-11-27 Thread Martin J. Dürst
To this, my mother would say: Why keep it simple when we can make it complicated?. Regards, Martin. On 2012/11/27 21:01, Philippe Verdy wrote: That's a valid computation if the extension was limited to use only 2-surrogate encodings for supplementary planes. If we could use 3-surrogate

Re: cp1252 decoder implementation

2012-11-21 Thread Martin J. Dürst
On 2012/11/21 16:23, Peter Krefting wrote: Doug Ewell d...@ewellic.org: Somewhat off-topic, I find it amusing that tolerance of poorly encoded input is considered justification for changing the underlying standards, The encoding work at W3C, at least as far as I see it, is not an attempt to

Re: latin1 decoder implementation

2012-11-19 Thread Martin J. Dürst
On 2012/11/17 9:56, Philippe Verdy wrote: True. HTML5 makes its own reinterpretation of the IETF's MIME standard, defining its own protocol (which means that it is no longer fully compatible with MIME and its IANA database, because the mapping of the value of a charset= pseudo-attribute is

Re: latin1 decoder implementation

2012-11-17 Thread Martin J. Dürst
Just in case it helps, Ruby (since version 1.9) also uses 3). Regards, Martin. On 2012/11/17 6:48, Buck Golemon wrote: When decoding bytes to unicode using the latin1 scheme, there are three options for bytes not defined in the ISO-8859-1 standard. 1) Throw an error. 2) Insert the

Re: latin1 decoder implementation

2012-11-17 Thread Martin J. Dürst
On 2012/11/17 9:45, Doug Ewell wrote: If he is targeting HTML5, then none of this matters, because HTML5 says that ISO 8859-1 is really Windows-1252. Yes. But unless Python wants to limit its use to HTML5, this should be handled on a separate level (mapping an iso-8859-1 label to the

Re: Caret

2012-11-14 Thread Martin J. Dürst
On 2012/11/13 21:49, Eli Zaretskii wrote: I'd welcome that. Although the reality flies in the face of user requirements in this case: most bidi-aware editors, including my own work in Emacs, don't have 2 carets, for some reason. Maybe the developers didn't consider that important enough, or

Re: Missing geometric shapes

2012-11-08 Thread Martin J. Dürst
On 2012/11/08 19:15, Michael Everson wrote: On 8 Nov 2012, at 09:59, Simon Montagusmont...@smontagu.org wrote: Please take into account that the half-stars should be symmetric-swapped in RTL text. I attach an example from an advertisement for a movie published in Haaretz 2 November 2012 I

Re: Character set cluelessness

2012-10-02 Thread Martin J. Dürst
Richard - Complex script usually refers to scripts where rendering isn't just simply putting glyphs side by side. That includes stuff with combining marks, ligatures, reordering, stacking, and the like. Regards, Martin. On 2012/10/03 7:09, Richard Wordingham wrote: On Tue, 02 Oct 2012

Re: Character set cluelessness

2012-10-02 Thread Martin J. Dürst
So in order to get something going here, why doesn't Doug draft a letter to these guys (possibly based on the one from a few years ago) and then Mark sends it off in his position at Unicode, which hopefully will impress them more than just a personal contribution. Being upset in this list

Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?

2012-07-21 Thread Martin J. Dürst
Hello Karl, On 2012/07/21 0:41, Karl Pentzlin wrote: Looking for an example of plain text which is obvious to anybody, it seems to me that the Subject field of e-mails is a good example. Common e-mail software lets you enter any text but gives you never access to any higher-level protocol.

Re: Unicode String Models

2012-07-20 Thread Martin J. Dürst
On 2012/07/21 7:01, David Starner wrote: I'm concerned about the statement/implication that one can optimize for ASCII and Latin-1. It's too easy for a lot of developers to test speed with the English/European documents they have around and test correctness only with Chinese. I see the argument

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Martin J. Dürst
Hello Doug, On 2012/07/18 0:35, Doug Ewell wrote: For those who haven't had enough of this debate yet, here's a link to an informative blog (with some informative comments) from Michael Kaplan: Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)

Re: pre-HTML5 and the BOM

2012-07-18 Thread Martin J. Dürst
On 2012/07/18 16:35, Leif Halvard Silli wrote: Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900: The best reason is simply that nobody should be using crutches as long as they can walk with their own legs. Crutches, in that sense, is only about authoring convenience. And, of course

Re: pre-HTML5 and the BOM

2012-07-18 Thread Martin J. Dürst
Hello Leif, I think that more and more, we are on the wrong mailing list. Regards, Martin. On 2012/07/18 18:47, Leif Halvard Silli wrote: Martin J. Dürst, Wed, 18 Jul 2012 17:20:31 +0900: On 2012/07/18 16:35, Leif Halvard Silli wrote: Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900

Re: pre-HTML5 and the BOM

2012-07-17 Thread Martin J. Dürst
On 2012/07/13 22:31, Jukka K. Korpela wrote: 2012-07-13 16:12, Leif Halvard Silli wrote: The kind of BOM intolerance I know about in user agents is that some text browsers and IE5 for Mac (abandoned) convert the BOM into a (typically empty) line at the start of the body element. I wonder if

Re: pre-HTML5 and the BOM

2012-07-17 Thread Martin J. Dürst
On 2012/07/14 1:33, Philippe Verdy wrote: Fra: Jukka K. Korpelajkorp...@cs.tut.fi When the BOM is used in web pages or editors for UTF-8 encoded content it can sometimes introduce blank spaces or short sequences of strange-looking characters (such as ï»¿). For this reason, it is usually best
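The "strange-looking characters" come from reading the three bytes of the UTF-8 signature in a single-byte encoding. A Python sketch, using Windows-1252 as the assumed wrong guess and a made-up page fragment:

```python
import codecs

# A tiny page fragment saved as UTF-8 with a leading signature (BOM).
raw = codecs.BOM_UTF8 + "<p>hi</p>".encode("utf-8")

as_cp1252 = raw.decode("cp1252")      # wrong guess: three stray characters
as_utf8 = raw.decode("utf-8")         # BOM survives as U+FEFF
as_utf8_sig = raw.decode("utf-8-sig") # signature recognized and removed

if __name__ == "__main__":
    print(codecs.BOM_UTF8.hex(" "))
    print(repr(as_cp1252[:3]))
    print(repr(as_utf8[0]))
    print(as_utf8_sig)
```

A consumer that knows about the signature can drop it, as the `utf-8-sig` codec does. One that does not will pass it on to the parser as an invisible U+FEFF or show it as three Latin-1 characters.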

Re: pre-HTML5 and the BOM

2012-07-17 Thread Martin J. Dürst
On 2012/07/17 17:22, Leif Halvard Silli wrote: And an argument was put forward in the WHATWG mailinglist earlier this year/end of previous year, that a page with strict ASCII characters inside could still contain character entities/references for characters outside ASCII. Of course they can.

Re: pre-HTML5 and the BOM

2012-07-17 Thread Martin J. Dürst
Hello Leif, Sorry to be late with my answer. On 2012/07/13 20:44, Leif Halvard Silli wrote: Martin J. Dürst, Fri, 13 Jul 2012 18:17:05 +0900: On 2012/07/13 0:12, Leif Halvard Silli wrote: Doug Ewell, Wed, 11 Jul 2012 09:12:46 -0600: and people who want to create or modify UTF-8 files which

Re: pre-HTML5 and the BOM

2012-07-17 Thread Martin J. Dürst
Hello Leif, On 2012/07/18 4:35, Leif Halvard Silli wrote: But is the Windows Notepad really to blame? Pretty much so. There may have been other products from Microsoft that also did it, but with respect to forcing browsers and XML parsers to accept a UTF-8 BOM as a signature, Notepad was

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Martin J. Dürst
Hello Philippe, On 2012/07/18 3:37, Philippe Verdy wrote: 2012/7/17 Julian Bradfieldjcb+unic...@inf.ed.ac.uk: On 2012-07-16, Philippe Verdyverd...@wanadoo.fr wrote: I am also convinced that even Shell interpreters on Linux/Unix should recognize and accept the leading BOM before the hash/bang

Re: pre-HTML5 and the BOM

2012-07-17 Thread Martin J. Dürst
Hello Jukka, On 2012/07/17 23:31, Jukka K. Korpela wrote: 2012-07-17 17:11, Leif Halvard Silli wrote: For instance, early on in 'the Web', some appeared to think that all non-ASCII had to be represented as entities. Yes indeed. There's still some such stuff around. It's mostly unnecessary,

Re: pre-HTML5 and the BOM

2012-07-13 Thread Martin J. Dürst
On 2012/07/13 0:12, Leif Halvard Silli wrote: Doug Ewell, Wed, 11 Jul 2012 09:12:46 -0600: and people who want to create or modify UTF-8 files which will be consumed by a process that is intolerant of the signature should not use Notepad. That goes for HTML (pre-5) pages [snip] HTML5-parsers

Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-10 Thread Martin J. Dürst
On 2012/07/11 4:37, Asmus Freytag wrote: I recall, with certainty, having seen the : in the context of elementary instruction in arithmetic, as in 4 : 2 = ?, but am no longer positive about seeing ÷ in the same context. I remember this very well. In grade school, we had to learn two ways to

Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-10 Thread Martin J. Dürst
On 2012/07/11 10:35, Stephan Stiller wrote: About Martin Dürst's content re geteilt-gemessen: When I attended the German school system in approx the 1990s this distinction wasn't mentioned or taught. (I prefer to not give details about specific time and place for privacy reasons.) Sorry, but

Re: Sinhala naming conventions

2012-07-10 Thread Martin J. Dürst
On 2012/07/11 11:04, Mark E. Shoulson wrote: Ever start to feel that we would have been better off not to give official descriptive names at all? Or else really vague ones like LETTERLIKE THINGY NUMBER 5412? So much blood-pressure raised over the names... I'm feeling that way since about the

Re: Unicode 6.2 to Support the Turkish Lira Sign

2012-05-30 Thread Martin J. Dürst
On 2012/05/30 4:42, Roozbeh Pournader wrote: Just look what happened when the Japanese did their own font/character set hack. The backslash/yen problem is still with us, to this day... To be fair, the Japanese Yen at 0x5C was there long before Unicode, in the Japanese version of ISO 646.
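
The 0x5C ambiguity can be sketched in a few lines of Python. This assumes the Japanese national variant of ISO 646 (the Roman half of JIS X 0201), which differs from ASCII at 0x5C (yen sign) and 0x7E (overline); the table and function names are illustrative only and handle 7-bit input only.

```python
# Positions where the Japanese ISO 646 variant differs from ASCII.
JIS_ROMAN_DIFFERENCES = {
    0x5C: "\u00A5",  # YEN SIGN instead of REVERSE SOLIDUS
    0x7E: "\u203E",  # OVERLINE instead of TILDE
}

def decode_jis_roman(data: bytes) -> str:
    """Decode 7-bit bytes according to the Japanese ISO 646 variant."""
    if any(b > 0x7F for b in data):
        raise ValueError("only 7-bit input is handled in this sketch")
    return "".join(JIS_ROMAN_DIFFERENCES.get(b, chr(b)) for b in data)

path = b"C:\\dir"  # the byte after the colon is 0x5C

print(path.decode("ascii"))    # read as ASCII: backslash
print(decode_jis_roman(path))  # read as the Japanese variant: yen sign
```

The same byte yields two different characters depending on which 7-bit code the reader assumes, which is the root of the backslash/yen problem.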

Flag tags (was: Re: Unicode 6.2 to Support the Turkish Lira Sign)

2012-05-29 Thread Martin J. Dürst
On 2012/05/29 17:43, Asmus Freytag wrote: On 5/27/2012 5:52 PM, Michael Everson wrote: Get over it. Please just get over it. It doesn't matter. It's a blort. Time to agree with Michael. Get over it, is good advice here. Sovereign countries are free to decree currency symbols, whatever their

Re: Unicode, SMS and year 2012

2012-04-29 Thread Martin J. Dürst
On 2012/04/29 18:58, Szelp, A. Sz. wrote: While there are good reasons the authors of HTML5 brought to ignore SCSU or BOCU-1, having excluded UTF-32 which is the most direct, one-to-one mapping of Unicode codepoints to byte values seems shortsighted. Well, except that it's hopelessly
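
The trade-off being discussed can be seen with a short Python sketch. It only measures encoded sizes for a sample string and confirms that a UTF-32 code unit equals the code point value; the sample text and function name are assumptions made for illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of text in three Unicode encoding forms (no BOM)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}

markup = "<p>Hello</p>"  # typical ASCII-heavy markup, 12 characters
sizes = encoded_sizes(markup)
print(sizes)

# In UTF-32 the four bytes of a code unit are simply the code point value.
euro_value = int.from_bytes("\u20AC".encode("utf-32-be"), "big")
print(hex(euro_value))
```

For ASCII-heavy content such as markup, UTF-32 uses four times as many bytes as UTF-8, in exchange for the direct code point mapping mentioned in the quoted message.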

Re: Unicode, SMS and year 2012

2012-04-27 Thread Martin J. Dürst
On 2012/04/28 4:26, Mark Davis ☕ wrote: Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows
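
To show what the Punycode suggestion looks like in practice, here is a minimal Python sketch using the standard library's `punycode` codec. The sample word is an arbitrary choice, and the byte counts apply to this one sample only, not to text in general.

```python
word = "bücher"

puny = word.encode("punycode")    # basic ASCII letters first, then the delta part
utf8 = word.encode("utf-8")
utf16 = word.encode("utf-16-be")

print(puny, len(puny))
print(len(utf8), len(utf16))
print(puny.decode("punycode"))    # round-trips to the original word
```

Punycode output consists only of ASCII letters, digits, and hyphens, so it could in principle travel through a 7-bit channel; how well it compresses depends heavily on the script and the text.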

Re: Unicode, SMS and year 2012

2012-04-27 Thread Martin J. Dürst
On 2012/04/28 7:29, Cristian Secară wrote: On Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ wrote: Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a

Re: Unicode, SMS and year 2012

2012-04-27 Thread Martin J. Dürst
On 2012/04/27 17:06, Cristian Secară wrote: It turned out that they (ETSI and its groups) created a way to solve the 70-character limitation, namely the “National Language Single Shift” and “National Language Locking Shift” mechanisms. This is described in the 3GPP TS 23.038 standard and it was introduced
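
The 70-character figure follows from simple arithmetic, sketched below in Python. The sketch assumes the usual 140-octet payload of a single, unconcatenated SMS; the constant and function names are illustrative.

```python
SMS_PAYLOAD_OCTETS = 140  # assumed payload of one unconcatenated message

def max_characters(bits_per_character: int) -> int:
    """How many fixed-width characters fit into a single message payload."""
    return SMS_PAYLOAD_OCTETS * 8 // bits_per_character

gsm_7bit_limit = max_characters(7)   # default 7-bit alphabet
ucs2_limit = max_characters(16)      # 16-bit UCS-2 encoding

print(gsm_7bit_limit, ucs2_limit)
```

A message that needs even one character outside the 7-bit alphabet falls back to the 16-bit limit, which is the problem the shift tables were designed to ease by keeping more languages within 7-bit coding.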

Re: missing characters: combining marks above runs of more than 2 base letters

2011-11-20 Thread Martin J. Dürst
On 2011/11/21 5:54, Asmus Freytag wrote: On 11/20/2011 8:00 AM, Joó Ádám wrote: Leaving aside that CSS is presentation and not content, and is definitely not markup. HTML is a better candidate. Á The details of the appearance of the mark would be presentation. The scoping, like for applying

Default bidi ranges

2011-11-09 Thread Martin J. Dürst
I tried to find something like a normative description of the default bidi class of unassigned code points. In UTR #9, it says (http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types): Unassigned characters are given strong types in the algorithm. This is an explicit

Forum Problems

2011-10-24 Thread Martin J. Dürst
How can one use the Forum to comment on URI/IRI issues when one gets a message: Your message contains too many URLs. The maximum number of URLs allowed is 8. I never liked this forum stuff too much, and this hasn't made things better :-(. Regards, Martin.

Wrong UTF-8 encoders still around?

2011-10-20 Thread Martin J. Dürst
I'm hoping to get some advice from people with experience with various Unicode/transcoding libraries. RFC 3987 (the current IRI spec) has the following text: Note: Some older software transcoding to UTF-8 may produce illegal output for some input, in particular for characters outside
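
One well-known kind of illegal output is sketched below in Python, on the assumption that the note refers to supplementary (non-BMP) characters being encoded as two separately converted surrogates. The function `utf8_via_surrogates` is a deliberately faulty encoder written for illustration; it does not reproduce any particular library.

```python
def utf8_correct(code_point: int) -> bytes:
    """Well-formed UTF-8 for any scalar value."""
    return chr(code_point).encode("utf-8")

def utf8_via_surrogates(code_point: int) -> bytes:
    """Faulty encoder: splits a supplementary code point into UTF-16
    surrogates and encodes each surrogate as its own 3-byte sequence."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return b"".join(chr(s).encode("utf-8", "surrogatepass") for s in (high, low))

good = utf8_correct(0x10300)
bad = utf8_via_surrogates(0x10300)
print(good.hex(), bad.hex())

# A conforming decoder rejects the six-byte form.
try:
    bad.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)
```

Comparing a library's output for a supplementary character against the four-byte form is a quick way to check whether it has this defect.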
