from:"Markus Scherer"

Re: Is the binaryness/textness of a data format a property?

2020-03-22 Thread Markus Scherer via Unicode

On Sat, Mar 21, 2020 at 12:35 PM Doug Ewell via Unicode wrote: > I thought the whole premise of GB18030 was that it was Unicode mapped into > a GB2312 framework. What characters exist in GB18030 that don't exist in > Unicode, and have they been proposed for Unicode yet, and why was none of > the

Re: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Markus Scherer via Unicode

On Wed, Feb 12, 2020 at 11:37 AM Marius Spix via Unicode < unicode@unicode.org> wrote: > In my opinion, this is an invalid character, which should not be > included in Unicode. > Please remember that feedback that you want the committee to look at needs to go through

Fwd: ICU 66preview available

2019-12-05 Thread Markus Scherer via Unicode

documents are published on unicode-org.github.io/icu-docs/ – follow the “Dev” links there. Best regards, Markus Scherer for the ICU Project

Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-02 Thread Markus Scherer via Unicode

On Mon, Dec 2, 2019 at 5:47 PM विश्वासो वासुकिजः (Vishvas Vasuki) via Unicode wrote: > But that says that the definitions are at >> > >> https://github.com/unicode-org/cldr/releases/tag/latest/common/bcp47/transform.xml >> , >> but all one currently gets from that is an error message 'XML

Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-02 Thread Markus Scherer via Unicode

On Mon, Dec 2, 2019 at 8:42 AM Roozbeh Pournader via Unicode < unicode@unicode.org> wrote: > You don't need an ISO 15924 script code. You need to think in terms of BCP > 47. Sanskrit in Latin would be sa-Latn. > Right! Now, if you want to distinguish the different transcription systems for >

Re: Encoding the Nsibidi script (African) for writing the Igbo language

2019-11-11 Thread Markus Scherer via Unicode

On Mon, Nov 11, 2019 at 4:03 AM Philippe Verdy via Unicode < unicode@unicode.org> wrote: > But first there's still no code in ISO 15924 (first step easy to complete > before encoding in the UCS). > That's not first; it's nearly last. The script code standard says "In general, script codes shall

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Markus Scherer via Unicode

On Fri, Oct 11, 2019 at 12:05 PM Richard Wordingham via Unicode < unicode@unicode.org> wrote: > On Thu, 10 Oct 2019 15:23:00 -0700 > Markus Scherer via Unicode wrote: > > > [c \q{ch}]h should work like (ch|c)h. Note that the order matters in > > the alternation --
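
A small sketch of the point about alternation order, written for Python's backtracking `re` module. Python's `re` has no `\q{...}` syntax, so the two spellings of the alternation stand in for the literal cluster here; the pattern names and the helper `first_match` are made up for illustration.

```python
import re

# Assumption: (ch|c)h stands in for [c \q{ch}]h, since Python's re
# module does not support the \q{...} literal-cluster notation.
LONGEST_FIRST = re.compile(r"(ch|c)h")
SHORTEST_FIRST = re.compile(r"(c|ch)h")


def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


# With the longer alternative first, "chh" is consumed as "ch" + "h".
print(first_match(LONGEST_FIRST, "chh"))
# With the shorter alternative first, the engine settles for "c" + "h".
print(first_match(SHORTEST_FIRST, "chh"))
# Both orders still match plain "ch", by falling back to "c" + "h".
print(first_match(LONGEST_FIRST, "ch"), first_match(SHORTEST_FIRST, "ch"))
```

This only shows what a backtracking engine does; a "pure" regular expression engine that reports the longest match would not depend on the order.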

Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?

2019-10-11 Thread Markus Scherer via Unicode

On Fri, Oct 11, 2019 at 4:37 AM Fred Brennan via Unicode < unicode@unicode.org> wrote: > Many users are asking me and I'm not sure of the answer (nor how to find > it > out). > You can find out by looking at the data files that are being developed for Unicode 13. Look at the latest

Re: Pure Regular Expression Engines and Literal Clusters

2019-10-10 Thread Markus Scherer via Unicode

On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode < unicode@unicode.org> wrote: > An example UTS#18 gives for matching a literal cluster can be simplified > to, in its notation: > > [c \q{ch}] > > This is interpreted as 'match against "ch" if possible, otherwise > against "c". Thus

Re: Manipuri/Meitei customary writing system

2019-10-04 Thread Markus Scherer via Unicode

On Fri, Oct 4, 2019 at 2:05 PM Richard Wordingham via Unicode < unicode@unicode.org> wrote: > > >> Is the use of the Meitei script aspirational or customary? > > >> Which script is being used for major newspapers, popular books, > > >> and video captions? > > > > > > This may give you some more

Manipuri/Meitei customary writing system

2019-10-03 Thread Markus Scherer via Unicode

Dear Unicoders, Is Manipuri/Meitei customarily written in Bangla/Bengali script or in Meitei script? I am looking at https://en.wikipedia.org/wiki/Meitei_language#Writing_systems which seems to describe writing practice in transition, and I can't quite tell where it stands. Is the use of the

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Markus Scherer via Unicode

There are lots of ways to implement the UCA. When you want fast string comparison, the zero weights are useful for processing -- and you don't actually assemble a sort key. People who want sort keys usually want them to be short, so you spend time on compression. You probably also build sort

Re: Dealing with Georgian capitalization in programming languages

2018-10-02 Thread Markus Scherer via Unicode

On Tue, Oct 2, 2018 at 12:50 AM Martin J. Dürst via Unicode < unicode@unicode.org> wrote: > ... The only > operation that can cause problems is 'capitalize'. > > When I say "cause problems", I mean producing mixed-case output. I > originally thought that 'capitalize' would be fine. It is fine for

Re: Diacritic marks in parentheses

2018-07-26 Thread Markus Scherer via Unicode

I would not expect for Ä+combining () above = Ä᪻ to look right except with specialized fonts. http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%84%5Cu1ABB==0 Even if it worked widely, I think it would be confusing. I think you are best off writing Arzt/Ärztin. Viele Grüße, markus

Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-15 Thread Markus Scherer via Unicode

On Tue, May 15, 2018 at 10:47 AM, Johnny Farraj via Unicode < unicode@unicode.org> wrote: > Dear Unicode list members, > > I wish to get feedback about a new symbol submission proposal. > Just to clarify, this is a discussion list where you may get some useful feedback. This is not where you

Re: [Unicode] Re: Fonts and font sizes used in the Unicode

2018-03-05 Thread Markus Scherer via Unicode

On Mon, Mar 5, 2018 at 9:03 AM, suzuki toshiya via Unicode < unicode@unicode.org> wrote: > I have a question; if some people try to make a > translated version of Unicode, they should contact > all font contributors and ask for the license? > Unicode Consortium cannot give any sublicense? > If

Re: Fonts and font sizes used in the Unicode

2018-03-04 Thread Markus Scherer via Unicode

On Sun, Mar 4, 2018 at 6:10 AM, Helena Miton via Unicode < unicode@unicode.org> wrote: > Greetings. Is there a way to know which font and font size have been used > in the Unicode charts (for various writing systems)? Many thanks! > What are you trying to do? Many of the fonts are unique to the

Re: Emoji blooper

2018-02-13 Thread Markus Scherer via Unicode

On my machine (Chromebox+Gmail), the trumpets point down to the lower left. If you want to convey precise images, then send images... markus

Re: Internationalization & Unicode Conference 2018

2018-01-24 Thread Markus Scherer via Unicode

If your presentation is accepted for the conference, you should get a hotel discount. markus

Re: Minimal Implementation of Unicode Collation Algorithm

2017-12-04 Thread Markus Scherer via Unicode

On Mon, Dec 4, 2017 at 5:30 AM, Richard Wordingham via Unicode < unicode@unicode.org> wrote: > May a collation algorithm that always compares all strings as equal be a > compliant implementation of the Unicode Collation Algorithm (UTS #10)? > If not, by which clause is it not compliant?

Re: implicit weight base for U+2CEA2

2017-09-27 Thread Markus Scherer via Unicode

On Wed, Sep 27, 2017 at 4:07 PM, James Tauber wrote: > Ah yes, I was just going by membership in the CJK Unified Ideographs > Extension E block, not actual assignment. > > So the lack of assignment means it should fail the Unified_Ideograph > membership in

Re: implicit weight base for U+2CEA2

2017-09-27 Thread Markus Scherer via Unicode

On Wed, Sep 27, 2017 at 1:49 PM, James Tauber via Unicode < unicode@unicode.org> wrote: > I recently updated pyuca[1], my pure Python implementation of the Unicode > Collation Algorithm to work with 8.0.0, 9.0.0, and 10.0.0 but to get all > the tests to work, I had to special case the implicit

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-09-23 Thread Markus Scherer via Unicode

FYI, I changed the ICU behavior for the upcoming ICU 60 release (pending code review). Proposal & description: https://sourceforge.net/p/icu/mailman/message/35990833/ Code changes: http://bugs.icu-project.org/trac/review/13311 Best regards, markus On Thu, Aug 3, 2017 at 5:34 PM, Mark Davis ☕️

Re: Emoji Space

2017-07-17 Thread Markus Scherer via Unicode

On Mon, Jul 17, 2017 at 5:25 AM, Christoph Päper via Unicode < unicode@unicode.org> wrote: > As you may know, the combined original Japanese emoji set included three > whitespace characters: one was the full width of a (square) emoji, one was > half that and the last one was a quarter blank.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-03 Thread Markus Scherer via Unicode

On Wed, May 31, 2017 at 5:12 AM, Henri Sivonen wrote: > On Sun, May 21, 2017 at 7:37 PM, Mark Davis ☕️ via Unicode > wrote: > > There is plenty of time for public comment, since it was targeted at > Unicode > > 11, the release for about a year from

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Markus Scherer via Unicode

On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst wrote: > But there's plenty in the text that makes it absolutely clear that some > things cannot be included. In particular, it says > > > The term “maximal subpart of an ill-formed subsequence” refers to the code >

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Markus Scherer via Unicode

On Wed, May 24, 2017 at 3:56 PM, Karl Williamson wrote: > On 05/24/2017 12:46 AM, Martin J. Dürst wrote: > >> That's wrong. There was a public review issue with various options and >> with feedback, and the recommendation has been implemented and in use >> widely (among

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Markus Scherer via Unicode

On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode < unicode@unicode.org> wrote: > So, if the proposal for Unicode really was more of a "feels right" and not > a "deviate at your peril" situation (or necessary escape hatch), then we > are better off not making a RECOMMENDATION that goes

Re: Comparing Raw Values of the Age Property

2017-05-22 Thread Markus Scherer via Unicode

On Mon, May 22, 2017 at 2:44 PM, Richard Wordingham via Unicode < unicode@unicode.org> wrote: > Given two raw values of the Age property, defined in UCD file > DerivedAge.txt, how is a computer program supposed to compare them? > Apart from special handling for the value "Unassigned" and its
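
One possible way to compare such values, sketched under the assumption that every assigned Age value is a dotted pair of decimal integers such as "1.1" or "10.0". The helper name `age_key` is hypothetical, and "Unassigned" is deliberately rejected rather than ordered, because how to treat it is a separate decision.

```python
def age_key(age):
    """Turn an Age value like '10.0' into a tuple of ints, e.g. (10, 0).

    Assumption: assigned values are dotted decimal numbers. The special
    value 'Unassigned' needs its own handling and is rejected here.
    """
    if age == "Unassigned":
        raise ValueError("'Unassigned' needs special handling")
    return tuple(int(part) for part in age.split("."))


ages = ["10.0", "2.0", "9.0", "1.1"]

# Plain string comparison puts "10.0" before "9.0", which is wrong.
print(sorted(ages))
# Comparing the parsed tuples gives the chronological order.
print(sorted(ages, key=age_key))
```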

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Markus Scherer via Unicode

Let me try to address some of the issues raised here. The proposal changes a recommendation, not a requirement. Conformance applies to finding and interpreting valid sequences properly. This includes not consuming parts of valid sequences when dealing with illegal ones, as explained in the

Re: Emoji Compatibility Symbols

2017-04-04 Thread Markus Scherer

There were some symbols, mostly proprietary logos, that we did not propose for encoding in Unicode. See pages 83-89 of http://www.unicode.org/L2/L2010/10132-emojidata.pdf You could also mine the defunct symbols subcommittee page for more information:

Re: Proposal to add standardized variation sequences for chess notation

2017-04-03 Thread Markus Scherer

On Mon, Apr 3, 2017 at 2:33 PM, Michael Everson <ever...@evertype.com> wrote: > On 3 Apr 2017, at 18:51, Markus Scherer <markus@gmail.com> wrote: > > > It seems to me that higher-level layout (e.g., HTML+CSS) is appropriate for > the board layout (e.g., via

Re: Proposal to add standardized variation sequences for chess notation

2017-04-03 Thread Markus Scherer

It seems to me that higher-level layout (e.g., HTML+CSS) is appropriate for the board layout (e.g., via a table), board frame style, and cell/field shading. In each field, the existing characters should suffice. markus

Re: Unicode Emoji 5.0 characters now final

2017-03-29 Thread Markus Scherer

I think "recommended" could be renamed to "(expected to be) widely implemented". markus

Re: Unicode Emoji 5.0 characters now final

2017-03-28 Thread Markus Scherer

On Tue, Mar 28, 2017 at 11:41 AM, Doug Ewell wrote: > Mark Davis wrote: > > > 3. Valid, but not recommended: "usca". Corresponds to the valid > > Unicode subdivision code for California according to > > http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences > >

Re: Unicode Emoji 5.0 characters now final

2017-03-27 Thread Markus Scherer

On Mon, Mar 27, 2017 at 5:09 PM, Philippe Verdy wrote: > I followed the links. Check your links, you are referencing the proposal, > and this contradicts the published version 4.0 of TR51. Where is stability ? > Of course I am pointing to the proposal. The version of TR 51

Re: Unicode Emoji 5.0 characters now final

2017-03-27 Thread Markus Scherer

On Mon, Mar 27, 2017 at 4:58 PM, Philippe Verdy wrote: > This only describes the sequences encoded with 2 characters, not the newer > longer sequences for flags of subnational regions. the > unicode_region_subtag data does not contain anything about the flags for > the first

Re: Unicode Emoji 5.0 characters now final

2017-03-27 Thread Markus Scherer

On Mon, Mar 27, 2017 at 1:34 PM, Ken Whistler wrote: > Anybody could *attempt* to convey a flag of Pomerania (a rather handsome > black gryphon on a yellow background, btw) with an emoji tag sequence right > now, I suppose. I suppose not. Since it's bound to ISO 3166

Re: Unicode Emoji 5.0 characters now final

2017-03-27 Thread Markus Scherer

On Mon, Mar 27, 2017 at 1:39 PM, Philippe Verdy wrote: > Note also that ISO3166-2 is far from being stable, and this could > contradict Unicode encoding stability: it would then be required to ensure > this stability by only allowing sequences that are effectively registered

Re: Encoding of old compatibility characters

2017-03-27 Thread Markus Scherer

I think the interest has been low because very few documents survive in these encodings, and even fewer documents using not-already-encoded symbols. In my opinion, this is a good use of the Private Use Area among a very small group of people. See also

Re: Implications of Logical Order Exception Property

2017-01-25 Thread Markus Scherer

On Wed, Jan 25, 2017 at 12:00 PM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > > > 2) Claims that logical_order_exception is relevant for searching > > > (TUS, as above) > > > It informs the construction of the DUCET and could be used to > > suppress_contractions in a search

Re: Implications of Logical Order Exception Property

2017-01-25 Thread Markus Scherer

On Wed, Jan 25, 2017 at 11:10 AM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > I now have a clutch of errors to report on Unicode's use of the term > 'logical order' and references to logical_order_exception: > > 1) Claims that Thai is not encoded in logical order in >

Re: IdnaTest.txt and RFC 5893

2017-01-04 Thread Markus Scherer

On Wed, Jan 4, 2017 at 2:28 AM, Alastair Houghton < alast...@alastairs-place.net> wrote: > RFC 5893 seems pretty clear to me, and the problem really is that the test > vectors (which come from unicode.org) seem (to me) to be incorrect. https://tools.ietf.org/html/rfc5893#section-2 says "*The

Re: About standardized variants of characters in Dingbat block

2016-12-25 Thread Markus Scherer

On Sun, Dec 25, 2016 at 8:33 AM, Yifán Wáng <747.neut...@gmail.com> wrote: > I'm curious about the reason why U+270C VICTORY HAND ✌ has > standardized text and emoji styles defined but not with U+270A RAISED > FIST ✊ and U+270B RAISED HAND ✋. >

Re: Best practices for replacing UTF-8 overlongs

2016-12-20 Thread Markus Scherer

On Tue, Dec 20, 2016 at 8:59 AM, Ken Whistler wrote: > You found the resulting text in TUS 9.0, p. 126 - 129. The origin of the > text there about best practices for using U+FFFD was the discussion and > resolution of PRI #121 in August, 2008: > >

Re: Best practices for replacing UTF-8 overlongs

2016-12-19 Thread Markus Scherer

On Mon, Dec 19, 2016 at 3:04 PM, Karl Williamson wrote: > It seems counterintuitive to me that the two byte sequence C0 80 should be > replaced by 2 replacement characters under best practices, or that E0 80 80 > should also be replaced by 2. Each sequence was legal in
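
For anyone who wants to see the replacement counts in a concrete decoder, here is a quick check using CPython's built-in UTF-8 codec with `errors="replace"`. It shows what that one decoder does, not what the standard requires; the helper name `count_replacements` is made up for the sketch.

```python
def count_replacements(raw):
    """Decode bytes as UTF-8 and count the U+FFFD characters produced."""
    return raw.decode("utf-8", errors="replace").count("\ufffd")


# C0 can never start a well-formed sequence, and 80 is then a stray
# trail byte, so CPython emits one U+FFFD for each of the two bytes.
print(count_replacements(b"\xc0\x80"))

# E2 82 is a truncated but otherwise valid prefix (of E2 82 AC),
# so the two bytes together are replaced by a single U+FFFD.
print(count_replacements(b"A\xe2\x82B"))
```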

Re: Mixed-Script confusables in prog.languages

2016-12-04 Thread Markus Scherer

On Sun, Dec 4, 2016 at 3:09 AM, Reini Urban wrote: > Is anybody aware of any other language implementation, which does > confusable or mixed-script protection? > I think R has something, because it has this header: > https://cran.r-project.org/bin/windows/extsoft/3.4/ >

Re: Emoji mappings in Shift JIS / CP932/943

2016-12-03 Thread Markus Scherer

On Sat, Dec 3, 2016 at 2:37 PM, Christoph Päper wrote: > If an existing character encoding forms the (sole) base of an addition to > Unicode, shouldn’t it be part of the UTC’s job to document these sources? > This was obviously done in the case of Japanese emoji,

Re: Emoji mappings in Shift JIS / CP932/943

2016-12-02 Thread Markus Scherer

On Fri, Dec 2, 2016 at 4:35 AM, Christoph Päper wrote: > Could and should custom vendor extensions like the ones documented in > > - http://unicode.org/Public/UCD/latest/ucd/EmojiSources.txt > > be included in these mappings? > They could, but it would be best for

Re: IJ with accent

2016-09-28 Thread Markus Scherer

On Wed, Sep 28, 2016 at 9:16 AM, Philippe Verdy wrote: > My opinion is to put an accent on each letter and join them with a joiner > I don't see a reason for the joiner. markus

Re: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech

2016-08-26 Thread Markus Scherer

On Fri, Aug 26, 2016 at 10:26 AM, Ken Whistler wrote: > On 8/26/2016 10:01 AM, John O'Conner wrote: > >> What I find more interesting is how emoji (a small digital image or icon) >> was ever interpreted as encodable text for the Unicode Standard. If our >> German newspaper

Re: Whitespace characters in Unicode

2016-08-05 Thread Markus Scherer

On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard wrote: > What makes a character a "whitespace" in Unicode, e.g., why are ZWSP and > ZWNBSP not "whitespace" even though they clearly say "SPACE" in them? > I think "white space" basically wants to have an advance width

Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Markus Scherer

Interesting discussion! ICU does not support "is" nor "in" prefixes. I wasn't even aware that UAX #44 loose matching prescribes "is". ICU just implements what Property[Value]Aliases.txt say: # Loose matching should be applied to all property names and property values, with # the exception of

Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Markus Scherer

Note that the Block property is an artifact of how the committee organizes the encoding of characters. It is not very useful for processing. For that, the Script property, Script_Extensions, and others are normally much better. markus

Re: The Hebrew Extended (Proposed) Block

2016-05-10 Thread Markus Scherer

FYI It seems like 08xx is reserved for RTL scripts. http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedBidiClass.txt # The unassigned code points that default to R are in the ranges: # [\u0590-\u05FF *\u07C0-\u089F* \uFB1D-\uFB4F \U00010800-\U00010FFF \U0001E800-\U0001EDFF

Re: Case for letters j and J with acute

2016-02-09 Thread Markus Scherer

On Tue, Feb 9, 2016 at 7:58 AM, Michael Everson wrote: > On 9 Feb 2016, at 11:18, ACJ Unicode wrote: > > > This is taught in writing in primary school in the Netherlands (or at > least it was 30 years ago), but this practice is often abandoned soon >

Re: precomposed polytonic Greek characters with macrons and other diacritics

2016-02-08 Thread Markus Scherer

On Mon, Feb 8, 2016 at 10:47 AM, James Tauber wrote: > Even with all this, though, my own work includes accentuation and > syllabification algorithms, all of which are made more cumbersome by the > lack of precomposed characters indicating vowel length. I'm currently >

Re: Unicode password mapping for crypto standard

2016-01-05 Thread Markus Scherer

I would specify that UTF-8 must be used, without mapping. US-ASCII is a proper subset, so need not be mentioned explicitly, nor distinguished in the protocol. Mappings would require that all implementations carry relevant data, and are up to date to recent versions of Unicode, or else

Re: Hentaigana proposal

2015-12-10 Thread Markus Scherer

Dear Mr. Tranter, I can't tell whether you intend to start a discussion on this discussion mailing list, or intend to submit feedback on a proposal. Maybe you are looking for discussion before you formalize your feedback. If you do intend to submit feedback, then, once you have formulated a

Re: Question about Perl5 extended UTF-8 design

2015-11-05 Thread Markus Scherer

On Thu, Nov 5, 2015 at 9:25 AM, Philippe Verdy wrote: > (0xFF was reserved only in the old RFC version of UTF-8 when it allowed > code points up to 31 bits, but even this RFC is obsolete and should no > longer be used and it has never been approved by Unicode). > No, even in

Re: Emoji data in UCD xml ?

2015-11-03 Thread Markus Scherer

About http://www.unicode.org/L2/L2015/15299-ucd-emoji-props.pdf which has Emoji_Presentation (EP) ● Non_Emoji (NE) ● Default_Text (DT) ● Default_Emoji (DE) ● NA Why do we need both Non_Emoji and NA? Can't Non_Emoji be the default for all code points that are not mentioned in the data? markus

Re: Unpaired surrogates (was: Re: Why Work at Encoding Level?)

2015-10-19 Thread Markus Scherer

On Mon, Oct 19, 2015 at 1:32 PM, Doug Ewell wrote: > > ICU (but perhaps it's actually Java) seems to have a culture of > > tolerating lone surrogates, and rules for handling lone surrogates are > > strewn across the Unicode standards and annexes. > > I suspect you have an

Re: Deleting Lone Surrogates

2015-10-04 Thread Markus Scherer

I would not spend any time specifying intricate rules for unpaired surrogates in 16-bit strings, or out-of range values in 32-bit strings. Most processing will treat them like unassigned characters, like U+50005, with only default behaviors. markus

Re: Hentaigana and the Kana Supplement block

2015-07-27 Thread Markus Scherer

On Mon, Jul 27, 2015 at 4:46 PM, Garth Wallace gwa...@gmail.com wrote: where does that leave the Kana Supplement block? That block contains only two encoded characters, but was allocated 256 code points, presumably for the future encoding of hentaigana. With hentaigana handled by SVSes, it

Re: Precomposed Cyrillic letters

2015-07-09 Thread Markus Scherer

On Thu, Jul 9, 2015 at 8:53 AM, Doug Ewell d...@ewellic.org wrote: From http://www.unicode.org/L2/L2015/15169-montenegro-cyrillic.pdf, Addition of two letters from Montenegrin language, CYRILLIC script: 9. Can any of the proposed characters be encoded using a composed character sequence

Re: ISO 15924

2015-07-09 Thread Markus Scherer

Thanks! markus

Re: Possible issue with Character Fallback Substitutions between version 24 and 25 ?

2015-06-18 Thread Markus Scherer

If the chart does not reflect the data, then please submit a bug ticket. http://unicode.org/cldr/trac/newticket The data is what counts. markus

Re: Another take on the English apostrophe in Unicode

2015-06-04 Thread Markus Scherer

Looks all wrong to me. don’t is a contraction of two words, it is not one word. English is taught as that squiggle being punctuation, not a letter. (Unlike, say, the Hawaiʻian ʻOkina http://en.wikipedia.org/wiki/%CA%BBOkina.) You can't use simple regular expressions to find word boundaries.

Re: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Markus Scherer

On Mon, May 18, 2015 at 11:19 AM, Doug Ewell d...@ewellic.org wrote: Is the new mechanism intended to allow flag tags that include either subtype values or contains values? As far as I can tell from your quotes, CLDR will say what's valid (plus containment info), and Unicode permits you to

Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

2015-05-08 Thread Markus Scherer

On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy verd...@wanadoo.fr wrote: 2015-05-09 5:13 GMT+02:00 Richard Wordingham richard.wording...@ntlworld.com: I can't think of a practical use for the specific concepts of Unicode 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are

Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-07 Thread Markus Scherer

I assume that the JSON spec deliberately allows anything that Java and JavaScript allow. In particular, there is no requirement for a Java String or JavaScript string to contain text, or well-formed UTF-16, or only assigned characters. Some code stores binary data (sequence of arbitrary 16-bit

Re: Usage stats?

2015-03-27 Thread Markus Scherer

On Fri, Mar 27, 2015 at 1:27 PM, Michael Norton michaelanortons...@gmail.com wrote: Easy example: what's the code for [blank space] U+020 across all language sets of Unicode? Is it the same ie: 100%? I don't understand what you are asking, and I have a hunch you haven't said it in a way

Re: Fixing the sort order of the SignWriting symbols in Unicode 8

2015-02-24 Thread Markus Scherer

On Tue, Feb 24, 2015 at 9:38 AM, Stephen E Slevinski Jr sle...@signpuddle.net wrote: Hi Unicode list, This is a useful place for discussion, but once the discussion peters out please submit formal feedback: http://www.unicode.org/review/pri285/ I am concerned that the SignWriting symbols as

Re: Compatibility decomposition for Hebrew and Greek final letters

2015-02-20 Thread Markus Scherer

On Thu, Feb 19, 2015 at 11:51 PM, Eli Zaretskii e...@gnu.org wrote: I think decomposition to NFKD solves these issues, doesn't it? Not completely. Judging from your question, you expected more mappings than NFKD has. You might want to try the mappings that are used as input for deriving the
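
A short check of what NFKD does and does not map, using Python's standard `unicodedata` module. The three sample characters are my own picks for illustration: SUPERSCRIPT TWO, HEBREW LETTER FINAL KAF, and GREEK SMALL LETTER FINAL SIGMA.

```python
import unicodedata


def nfkd(text):
    """Return the NFKD (compatibility decomposition) form of `text`."""
    return unicodedata.normalize("NFKD", text)


# SUPERSCRIPT TWO has a compatibility decomposition to DIGIT TWO.
print(nfkd("\u00b2") == "2")

# The final forms have no decomposition, so NFKD leaves them alone.
print(nfkd("\u05da") == "\u05da")  # HEBREW LETTER FINAL KAF
print(nfkd("\u03c2") == "\u03c2")  # GREEK SMALL LETTER FINAL SIGMA
```

So a search that relies on NFKD alone would find ² for 2 but would not equate final and non-final letter forms.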

Re: Compatibility decomposition for Hebrew and Greek final letters

2015-02-19 Thread Markus Scherer

On Thu, Feb 19, 2015 at 12:17 PM, Eli Zaretskii e...@gnu.org wrote: Sorry, I disagree. First, collation data is overkill for search, since the order information is not required, so the weights are simply wasting storage. Second, people do want to find, e.g., ² when they search for 2 etc.

Re: About cultural/languages communities flags

2015-02-09 Thread Markus Scherer

On Mon, Feb 9, 2015 at 9:54 AM, Andrea Giammarchi andrea.giammar...@gmail.com wrote: if a cultural/language TLD is typed with Unicode RIS, then show the flag for these culture/language: This does not work. The Unicode RIS are defined to be used in pairs, with semantics according to
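
To illustrate the pairing, here is a minimal sketch that builds the two-character Regional Indicator Symbol sequence for a two-letter region code. The function name `flag_sequence` is hypothetical, and the sketch only checks the shape of the input, not whether the code is actually a valid region.

```python
RIS_BASE = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_sequence(region):
    """Map a two-letter ASCII region code to a pair of RIS characters.

    Assumption: validity of the region code itself is checked elsewhere.
    """
    if len(region) != 2 or not region.isascii() or not region.isalpha():
        raise ValueError("expected exactly two ASCII letters")
    return "".join(chr(RIS_BASE + ord(c) - ord("A")) for c in region.upper())


# "DE" becomes U+1F1E9 U+1F1EA; whether a flag glyph appears depends
# on the font and platform.
print([f"U+{ord(c):04X}" for c in flag_sequence("DE")])
```

Anything other than exactly two letters is rejected, which is the reason a longer tag cannot simply be spelled with RIS characters.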

Re: About cultural/languages communities flags

2015-02-09 Thread Markus Scherer

On Mon, Feb 9, 2015 at 1:11 PM, Joan Montané j...@montane.cat wrote: AFAIK, this is done in font side. Emoji flags are just ligatures, so a font can provide a ligature for 4 RIS characters. Technically true, but a font that violates the encoding standard would cause large problems. Imagine a

Re: Wrong plane numbers

2015-02-06 Thread Markus Scherer

These are not block boundaries. These lines are for book chart production, where we don't need to print every unassigned code point. markus ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode

N'Ko - which character? 02BC vs. 2019

2015-01-31 Thread Markus Scherer

Dear Unicoders, which is the proper second character in N'Ko? See below for details. Thanks, markus -- Forwarded message -- From: Doug Ewell d...@ewellic.org Date: Sat, Jan 31, 2015 at 9:16 AM Subject: Apostrophes (was: Re: ISO 639-3 changes) To: Philip Newton

Re: Looking for a standard on historical countries

2014-10-31 Thread Markus Scherer

On Fri, Oct 31, 2014 at 6:20 AM, Jörg Knappen jknap...@web.de wrote: Does someone here is aware of a standard or a de facto standard for names or codes of historical countries? For the requirement I have in mind, all countries where there was a printing press would be optimal coverage,

Re: Bliss?

2014-10-13 Thread Markus Scherer

On Mon, Oct 13, 2014 at 2:23 PM, Jean-François Colson j...@colson.eu wrote: I’ve found a 16-year-old proposal for Blissymbolics ( http://www.evertype.com/standards/iso10646/pdf/bliss.pdf ) but nothing more recent. Was that script rejected? Was it forgotten? Are there any technical difficulties

Re: Bliss?

2014-10-13 Thread Markus Scherer

As Michael said, I don't have information. But I found this which might help: http://en.wikipedia.org/wiki/Blissymbols#Towards_the_international_standardization_of_the_script markus

Re: Request for Information

2014-07-23 Thread Markus Scherer

Some of the data is available in the Unicode CLDR script metadata: http://unicode.org/cldr/trac/browser/trunk/common/properties/scriptMetadata.txt http://cldr.unicode.org/development/updating-codes/updating-script-metadata markus -- Google Internationalization Engineering

Re: Default case algorithms

2014-06-24 Thread Markus Scherer

The context-sensitive and/or language-sensitive mappings are here: http://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt Best regards, markus

Re: Default case algorithms

2014-06-24 Thread Markus Scherer

On Tue, Jun 24, 2014 at 4:56 PM, Daniel Bünzli daniel.buen...@erratique.ch wrote: Does an algorithm that simply applies R1 *regardless of context* constitute a default case algorithm or not ? I.e. does simply mapping each character C in a string using Uppercase_Mapping (C) (e.g. as exposed by

Re: Default case algorithms

2014-06-24 Thread Markus Scherer

On Tue, Jun 24, 2014 at 6:46 PM, Daniel Bünzli daniel.buen...@erratique.ch wrote: Having a look at the data it seems that the Uppercase_Mapping property of UCD includes (using the terminology of SpecialCasing.txt): * All the unconditional mappings of SpecialCasing.txt (context independent) *
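
As a concrete illustration of context-independent mappings that change string length, here is what Python's `str.upper()` returns for a few characters. This shows one implementation's behaviour (CPython 3, which as far as I know applies the unconditional full mappings and no language-specific tailoring); the helper `changes_length` is made up for the sketch.

```python
def changes_length(text):
    """True if uppercasing `text` yields a different number of code points."""
    return len(text.upper()) != len(text)


# LATIN SMALL LETTER SHARP S uppercases to two letters.
print("\u00df".upper())
# LATIN SMALL LIGATURE FI also expands to two letters.
print("\ufb01".upper())
# No language context is consulted: "i" always maps to "I" here,
# even though Turkish would want a dotted capital.
print("i".upper())
print(changes_length("\u00df"), changes_length("i"))
```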

Re: Corrigendum #9

2014-06-12 Thread Markus Scherer

On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson pub...@khwilliamson.com wrote: I have a something like a library that was written a long time ago (not by me) assuming that noncharacters were illegal in open interchange. Programs that use the library were guaranteed that they would not receive

Re: Corrigendum #9

2014-06-02 Thread Markus Scherer

On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell d...@ewellic.org wrote: I suspect everyone can agree on the edge cases, that noncharacters are harmless in internal processing, but probably should not appear in random text shipped around on the web. Right, in principle. However, it should be ok to

Re: Corrigendum #9

2014-06-02 Thread Markus Scherer

On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele shawn.ste...@microsoft.com wrote: To further my understanding, can someone provide examples of how these are used in actual practice? CLDR collation data defines special contraction mappings that start with a noncharacter, for

Re: Corrigendum #9

2014-06-02 Thread Markus Scherer

On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com wrote: I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if used carelessly for their stated purpose, can probably trigger serious

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-01 Thread Markus Scherer

On Sun, Jun 1, 2014 at 1:49 AM, Richard Wordingham richard.wording...@ntlworld.com wrote: D80: Unicode string: A code unit sequence containing code units of a particular Unicode encoding form... Right -- in a Unicode 16-bit string, you have a sequence of any 16-bit value in any order.

Re: Corrigendum #9

2014-06-01 Thread Markus Scherer

On Sun, Jun 1, 2014 at 7:49 AM, Karl Williamson pub...@khwilliamson.com wrote: Thanks, I had not thought about that. I'm thinking wording something like this is more appropriate Noncharacters may be openly interchanged, but it is inadvisable to do so without prior agreement, since at each

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-05-31 Thread Markus Scherer

On Sat, May 31, 2014 at 6:41 AM, Mark Davis ☕️ m...@macchiato.com wrote: I think you have a point here. We should probably change to: To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode scalar value (from U+0000 to U+D7FF and U+E000 to U+10FFFF),

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-05-31 Thread Markus Scherer

On Sat, May 31, 2014 at 1:59 AM, Richard Wordingham richard.wording...@ntlworld.com wrote: Bear in mind that a pattern \uD808 shall not match anything in a well-formed Unicode string. Depends. See the definitions of Unicode strings vs. UTF strings. \uD808\uDF45 specifies a sequence of two
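
For reference, the arithmetic that relates the two 16-bit values D808 and DF45 to a supplementary code point can be sketched as follows. The function names are made up for illustration; the constants come from the UTF-16 encoding form.

```python
def to_surrogates(code_point):
    """Split a supplementary code point into (lead, trail) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(lead, trail):
    """Combine a lead and a trail surrogate into one code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


# D808 DF45 is the UTF-16 form of U+12345.
print(hex(from_surrogates(0xD808, 0xDF45)))
print([hex(u) for u in to_surrogates(0x12345)])
```

A lone D808, by contrast, fails the pair check above, which is the case the thread is arguing about.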

Re: Block Boundaries (was: RE: Corrigendum #9)

2014-05-30 Thread Markus Scherer

In addition, the Block property is not particularly useful even in regular expressions or other processing. It is almost always more useful to use Script, Alphabetic, Unified_Ideograph, etc. Blocks help with planning and allocation but little else. markus

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-05-30 Thread Markus Scherer

If you use Unicode 16-bit strings, it's easy to pass through unpaired surrogates and treat them like code points; it's often not productive or necessary to check for them all the time, that is, to be strict about UTF-16. On the other hand, I don't think anyone expects you to support invalid

Re: Guillements in Email

2014-05-02 Thread Markus Scherer

If there is a Gmail bug, then please report it. Either way, I suggest you go into Gmail Settings and set it to Use Unicode (UTF-8) encoding for outgoing messages markus

Re: ID_Start, ID_Continue, and stability extensions

2014-04-28 Thread Markus Scherer

On Fri, Apr 25, 2014 at 11:06 PM, Mathias Bynens math...@qiwi.be wrote: My initial question can be rephrased as the following remark/change request: http://unicode.org/reports/tr31/#Default_Identifier_Syntax could make it more clear that “stability extensions” means `Other_ID_Start` and

Re: ID_Start, ID_Continue, and stability extensions

2014-04-25 Thread Markus Scherer

On Fri, Apr 25, 2014 at 6:05 AM, Steffen Nurpmeso sdao...@yandex.com wrote: |What I tried to say is, if you need ID_Start, then parse ID_Start from |DerivedCoreProperties.txt. That's more stable (and easier than parsing the |pieces and deriving | |# Lu + Ll + Lt + Lm + Lo + Nl |#

Re: Bidi Brackets for Dummies

2014-04-25 Thread Markus Scherer

On Fri, Apr 25, 2014 at 1:54 AM, Eli Zaretskii e...@gnu.org wrote: I also have a couple of questions about matching the canonical equivalents of the opening bracket: Please take a look at the date of the tech note. I suggest you start a new thread with a new subject for serious discussion.


1 - 100 of 446 matches
