from:"Andrew West"

Re: PUA (BMP) planned characters HTML tables

2019-08-12 Thread Andrew West via Unicode

On Mon, 12 Aug 2019 at 02:27, James Kass via Unicode
 wrote:
>
> On 2019-08-11 5:26 PM, [ Doug Ewell ] via Unicode wrote:
> > If you are thinking of these as potential future additions to the standard, 
> > keep in mind that accented letters that can already be represented by a 
> > combination of letter + accent will not ever be encoded. This is one of the 
> > longest-standing principles Unicode has.

People seem to be ignoring the fact that Marshallese and Latvian both
use L and N with cedilla, but with completely different glyph shapes:

> In January 2013, the Unicode Technical Committee discussed issues for the 
> representation of
> Marshallese orthography. In particular, Marshallese uses the Latin script and 
> requires the letters l,
> m, n, and o with cedilla. Latvian orthography uses the Latin script and 
> requires the letters g, k, l, n,
> and r with comma below. For Marshallese, it is unacceptable to display 
> cedillas as commas below.
> Conversely, for Latvian, it is unacceptable to display commas below as 
> cedillas.

However, as fonts have been following Latvian practice for these
letters (cedilla is displayed as a comma below) since before Unicode,
Marshallese users cannot get their desired outcome using standard
Unicode combining diacritical marks unless they apply a font specially
designed for Marshallese -- which you can never guarantee if you are
writing an email or posting on twitter, etc.

This issue was discussed at WG2 in 2013
(https://www.unicode.org/L2/L2013/13128-latvian-marshal-adhoc.pdf),
when there was a recommendation to encode precomposed letters L and N
with cedilla *with no decomposition*, but that solution does not seem
to have been taken up by the UTC.
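
The phrase "with no decomposition" matters because the existing precomposed letters are canonically equivalent to base letter + U+0327 COMBINING CEDILLA, so a font cannot tell the Latvian and Marshallese uses apart. A minimal Python sketch using the standard unicodedata module shows this; the helper name decomposition_of is only illustrative:

```python
import unicodedata

def decomposition_of(ch):
    """Return the code points of the canonical (NFD) decomposition of ch."""
    return [ord(c) for c in unicodedata.normalize("NFD", ch)]

# Existing precomposed L and N with cedilla, upper and lower case.
for ch in "\u013B\u013C\u0145\u0146":
    parts = " ".join(f"U+{cp:04X}" for cp in decomposition_of(ch))
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {parts}")
```

Each of the four letters decomposes to its base letter followed by U+0327, which is why text in either orthography ends up with the same underlying sequence.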

Andrew

Re: Fonts and Canonical Equivalence

2019-08-10 Thread Andrew West via Unicode

On Sat, 10 Aug 2019 at 15:46, Richard Wordingham via Unicode
 wrote:
>
> > Just retested on Windows 10 with
> > a Tibetan font that supports both sequences of vowels, and both
> > sequences display correctly under Harfbuzz (as expected), but only
> > vowel-below followed by vowel-above displays correctly when using
> > built-in Windows rendering.
>
> Does vowel above before vowel below yield a dotted circle?

Yes. Attached are screenshots for two real world examples, one which
is logically spelled as i + u, and one as u + i:

1. ཉིུ <0F49 0F72 0F74> [nyiu] as a contraction for ཉི་ཤུ [nyi shu] "twenty"

2. བཅིུག <0F56 0F45 0F74 0F72 0F42> [bcuig] as a contraction for
བཅུ་གཅིག [bcu gcig] "eleven"
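
To see how canonical ordering affects the two spellings above, here is a small sketch using Python's unicodedata module. The helper names are illustrative only; the combining classes (130 for U+0F72, 132 for U+0F74) come from the Unicode Character Database:

```python
import unicodedata

VOWEL_I = "\u0F72"  # TIBETAN VOWEL SIGN I, a vowel above
VOWEL_U = "\u0F74"  # TIBETAN VOWEL SIGN U, a vowel below

def canonical_order(text):
    """Apply canonical decomposition, which also sorts marks by combining class."""
    return unicodedata.normalize("NFD", text)

def codepoints(text):
    return " ".join(f"{ord(c):04X}" for c in text)

nyiu = "\u0F49\u0F72\u0F74"               # example 1, spelled i + u
bcuig = "\u0F56\u0F45\u0F74\u0F72\u0F42"  # example 2, spelled u + i

print(unicodedata.combining(VOWEL_I), unicodedata.combining(VOWEL_U))
print(codepoints(nyiu), "->", codepoints(canonical_order(nyiu)))
print(codepoints(bcuig), "->", codepoints(canonical_order(bcuig)))
```

Example 1 is already in canonical order and is unchanged. Example 2 is reordered to i + u (0F72 before 0F74), which is the vowel-above-first sequence that shows a dotted circle under the built-in Windows rendering.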

Andrew

Re: Fonts and Canonical Equivalence

2019-08-10 Thread Andrew West via Unicode

On Sat, 10 Aug 2019 at 08:29, Richard Wordingham via Unicode
 wrote:
>
> There are similar issues with Tibetan; some fonts do not work properly
> if a vowel below (ccc=132) is separated from the base of the
> consonant stack by a vowel above (ccc=130).

It's not that the fonts don't work, it's that some of the rendering
engines do not apply the OpenType features in the font that support
both sequences of vowels (vowel-above followed by vowel-below, and
vowel-below followed by vowel-above). Just retested on Windows 10 with
a Tibetan font that supports both sequences of vowels, and both
sequences display correctly under Harfbuzz (as expected), but only
vowel-below followed by vowel-above displays correctly when using
built-in Windows rendering.

It is very frustrating that Windows cannot correctly support the
display of Tibetan in normalized form, yet Harfbuzz does not have any
problems. Personally, I think USE is a failed experiment, and I wish
Microsoft would simply adopt Harfbuzz as the default rendering engine.

Andrew

Re: Proposal to extend the U+1F4A9 Symbol

2019-06-01 Thread Andrew West via Unicode

On Sat, 1 Jun 2019 at 23:32, Doug Ewell via Unicode  wrote:
>
> Tex wrote:
>
> > What I would find useful is an emoji for when my phone falls into the
> > toilet.
>
> I would have thought ⤵ would be sufficient.

Don't worry, a brand new foolproof method of defining emoji for
anything in the universe using Wikidata QIDs is coming to a phone near
you soon (http://www.unicode.org/L2/L2019/19082r-qid-emoji.pdf) ...
oh, there is no Wikidata QID for phone dropped in the toilet.

Andrew

Re: Encoding italic

2019-02-05 Thread Andrew West via Unicode

On Tue, 5 Feb 2019 at 15:34, wjgo_10...@btinternet.com via Unicode
 wrote:
>
> italic version of a glyph in plain text, including a suggestion of to
> which characters it could apply, would test whether such a proposal
> would be accepted to go into the Document Register for the Unicode
> Technical Committee to consider or just be deemed out of scope and
> rejected and not considered by the Unicode Technical Committee.

Just reminding you that "The initial character in a variation sequence
is never a nonspacing combining mark (gc=Mn) or a canonical
decomposable character" (The Unicode Standard 11.0 §23.4). This means
that a variation sequence cannot be defined for any precomposed
letters and diacritics, so for example you could not italicize the
word "fête" by simply adding VS14 after each letter because "ê" (in
NFC form) cannot act as the base for a variation sequence. You would
have to first convert any text to be italicized to NFD, then apply
VS14 to each non-combining character. This alone would make a VS
solution unacceptable in my opinion.
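
As a rough illustration of the extra step this would impose, the following Python sketch checks whether a character may start a variation sequence and then applies the NFD-then-VS14 procedure described above. The use of VS14 (U+FE0D) for italics is purely hypothetical, and both function names are made up for this example:

```python
import unicodedata

VS14 = "\uFE0D"  # VARIATION SELECTOR-14; its use for italic styling is hypothetical

def can_start_variation_sequence(ch):
    """False for nonspacing marks (gc=Mn) and canonically decomposable characters."""
    decomp = unicodedata.decomposition(ch)
    # Compatibility decompositions carry a "<tag>" prefix; canonical ones do not.
    canonically_decomposable = bool(decomp) and not decomp.startswith("<")
    return unicodedata.category(ch) != "Mn" and not canonically_decomposable

def italicize_with_vs14(text):
    """Convert to NFD, then append VS14 after every character that may be a base."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        out.append(ch)
        if can_start_variation_sequence(ch):
            out.append(VS14)
    return "".join(out)

styled = italicize_with_vs14("f\u00EAte")
print(" ".join(f"{ord(c):04X}" for c in styled))
```

The four-character NFC string becomes nine code points, and any later NFC normalization pass would have to leave the "e" and its circumflex uncomposed because VS14 now sits between them.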

Andrew

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Andrew West via Unicode

On Fri, 1 Feb 2019 at 22:20, Doug Ewell via Unicode  wrote:
>
> Richard Wordingham wrote:
>
> > Language tagging is already available in Unicode, via the tag
> > characters in the deprecated plane.
>
> Plane 14 isn't deprecated -- that isn't a property of planes -- and the
> tag characters U+E0020 through U+E007E have been un-deprecated for use
> with emoji flags. Only U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG are
> deprecated.

Cancel Tag is not deprecated any longer either
(http://www.unicode.org/Public/UNIDATA/PropList.txt).

Andrew

Re: Encoding italic

2019-01-29 Thread Andrew West via Unicode

On Mon, 28 Jan 2019 at 01:55, James Kass via Unicode
 wrote:
>
> This bold new concept was not mine.  When I tested it
> here, I was using the tag encoding recommended by the developer.

Congratulations James, you've successfully interchanged tag-styled
plain text over the internet with no adverse side effects. I copied
your email into BabelPad and your "bold" is shown bold (see attached
screenshot).

Andrew

Re: Encoding italic

2019-01-29 Thread Andrew West via Unicode

On Tue, 29 Jan 2019 at 10:25, Martin J. Dürst via Unicode
 wrote:
>
> The overall tag proposal had the desired effect: The original proposal
> to hijack some unused bytes in UTF-8 was defeated, and the tags itself
> were not actually used and therefore could be depreciated.

And the tag characters (all except E0001) are now no longer
deprecated. As flag tag sequences are now a thing
(http://www.unicode.org/reports/tr51/#valid-emoji-tag-sequences), and
are widely supported (including on Twitter), your and PV's objections
to using tag characters for a plain text font styling protocol simply
because they are tag characters carry zero weight.

Andrew

Re: Encoding italic (was: A last missing link)

2019-01-24 Thread Andrew West via Unicode

On Thu, 24 Jan 2019 at 15:42, James Kass  wrote:
>
> Here's a very polite reply from John Hudson from 2000,
> http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/1042.html
> ...and, over time, many of the replies to William Overington's colorful
> suggestions were less than polite.  But it was clear that colors were
> out-of-scope for a computer plain-text encoding standard.

Going off topic a little, I saw this tweet from Marijn van Putten
today which shows examples of Arabic script from early Quranic
manuscripts with phonetic information indicated by the use of red and
green dots:

https://twitter.com/PhDniX/status/1088171783461703682

I would be interested to know how those should be represented in Unicode.

Andrew

Re: Encoding italic (was: A last missing link)

2019-01-24 Thread Andrew West via Unicode

On Thu, 24 Jan 2019 at 13:59, James Kass via Unicode
 wrote:
>
> FAICT, the emoji repertoire is vendor-driven, just as the pre-Unicode
> emoji sets were vendor driven.  Pre-Unicode, if a vendor came up with
> cool ideas for new emoji they added new characters to the PUA.  Now that
> emoji are standardized, when vendors come up with new ideas they put
> them in the emoji ranges in order to preserve the standardization factor
> and ensure interoperability.  (That's probably over-simplified and there
> are bound to be other factors involved.)

I do not believe that recent (post-6.0) emoji additions are
vendor-driven. There is no formal vendor representation on the ESC,
and most ESC members do not work for vendors. Current emoji additions
are driven by ordinary users, who are actively encouraged by the UTC
to propose novel characters for encoding:

http://blog.unicode.org/2018/04/submissions-open-for-2020-emoji.html
http://blog.unicode.org/2016/09/emoji-deadline.html

The vendors happily lap up whatever emojis the UTC throws at them, but
they seem to have little interest in taking control of the emoji
process.

> We should no more expect the conventional Unicode character encoding
> model to apply to emoji than we should expect the old-fashioned text
> ranges to become vendor-driven.

Why should we not expect the conventional Unicode character encoding
model to apply to emoji?

We were told time and time again when emoji were first proposed that
they were required for encoding for interoperability with Japanese
telecoms whose usage had spilled over to the internet. At that time
there was no suggestion that encoding emoji was anything other than a
one-off solution to a specific problem with PUA usage by different
vendors, and I at least had no idea that emoji encoding would become a
constant stream with an annual quota of 60+ fast-tracked
user-suggested novelties. Maybe that was the hidden agenda, and I was
just naïve.

The ESC and UTC do an appallingly bad job at regulating emoji, and I
would like to see the Emoji Subcommittee disbanded, and decisions on
new emoji taken away from the UTC, and handed over to a consortium or
committee of vendors who would be given a dedicated vendor-use emoji
plane to play with (kinda like a PUA plane with pre-assigned
characters with algorithmic names [VENDOR-ASSIGNED EMOJI X] which
the vendors can then associate with glyphs as they see fit; and as
emoji seem to evolve over time they would be free to modify and
reassign glyphs as they like because the Unicode Standard would not
define the meaning or glyph for any characters in this plane).

Andrew

Re: Encoding italic (was: A last missing link)

2019-01-24 Thread Andrew West via Unicode

On Thu, 24 Jan 2019 at 02:10, Mark E. Shoulson via Unicode
 wrote:
>
> Unicode isn't here to encode cool new ideas that would be cool and
> new.  It's here for writing what people already do.

http://www.unicode.org/L2/L2018/18141r2-emoji-colors.pdf

"Add 14 colored emoji characters for decorative and/or descriptive
uses. These may be used to indicate that an emoji has a different
color."

No evidence has been provided that anybody is currently using colored
blobs for this purpose (in fact emoji users have explicitly rejected
this method for indicating emoji color:
http://www.unicode.org/L2/L2018/18208-white-wine-rgi.pdf), just an
assertion that it would be a good idea if emoji users could add a
colored swatch to an existing emoji to indicate what color they want
it to represent (note that the colored characters do not change the
color of the emoji they are attached to [before or after, depending
upon whether you are speaking French or English dialect of emoji],
they are just intended as a visual indication of what colour you wish
the emoji was).

This proposal to add 14 additional colored circles, squares and hearts
is a perfect example of a cool new idea for something that the authors
think would be really useful, but for which there is no evidence of
existing use. The UTC should have rejected it as out of scope, but we
all know that rules and procedures do not apply to the Emoji
Subcommittee, so in fact this cool new idea will be included in
Unicode 12 in March.

Andrew

Re: Encoding italic (was: A last missing link)

2019-01-20 Thread Andrew West via Unicode

On Sun, 20 Jan 2019 at 03:16, James Kass via Unicode
 wrote:
>
> Possible approaches include:
>
> 3 - Open/Close punctuation treatment
> Stateful.  Works on ranges.  Not currently supported in plain-text.
> Could be supported in applications which can take a text string URL and
> make it a clickable link.  Default appearance in nonsupporting apps may
> resemble existing plain-text italic kludges such as slashes.  The ASCII
> is already in the character string.

A possibility that I don't think has been mentioned so far would be to
use the existing tag characters (E0020..E007F). These are no longer
deprecated, and as they are used in emoji flag tag sequences, software
already needs to support them, and they should just be ignored by
software that does not support them. The advantages are that no new
characters need to be encoded, and they are flexible so that tag
sequences for start/end of italic, bold, fraktur, double-struck,
script, sans-serif styles could be defined. For example start and end
of italic styling could be defined as the tag sequences <i> and </i>
(E003C E0069 E003E and E003C E002F E0069 E003E).
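
The tag sequences given above are simply the ASCII strings "<i>" and "</i>" shifted into the tag block by adding 0xE0000 to each code point. A short Python sketch of that mapping follows; the styling protocol itself is only a suggestion, and all function names here are invented for illustration:

```python
TAG_BASE = 0xE0000  # tag characters mirror ASCII at this offset

def to_tag_characters(ascii_text):
    """Map printable ASCII (U+0020..U+007E) to the corresponding tag characters."""
    for c in ascii_text:
        if not 0x20 <= ord(c) <= 0x7E:
            raise ValueError(f"not printable ASCII: {c!r}")
    return "".join(chr(TAG_BASE + ord(c)) for c in ascii_text)

ITALIC_START = to_tag_characters("<i>")   # hypothetical start-of-italic sequence
ITALIC_END = to_tag_characters("</i>")    # hypothetical end-of-italic sequence

def wrap_italic(text):
    return ITALIC_START + text + ITALIC_END

def strip_tag_characters(text):
    """What a non-supporting application could do: drop E0020..E007F."""
    return "".join(c for c in text if not 0xE0020 <= ord(c) <= 0xE007F)

styled = wrap_italic("f\u00EAte")
print(" ".join(f"{ord(c):04X}" for c in styled))
```

Because the styling lives entirely in separate characters, the styled text itself is untouched and can stay in NFC.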

Andrew

Re: Private Use areas - Vertical Text

2018-08-29 Thread Andrew West via Unicode

On Wed, 29 Aug 2018 at 11:18,  wrote:
>
> I was using a change horizontal to vertical text feature in office, the
> PUA characters being from plane 15.

I tested with Word 2007, and normal PUA characters from my font were
displayed with vertical orientation in a vertical text box, but Plane
15 PUA characters were rotated.

I also tested with Word 2016, and both normal PUA characters and Plane
15 PUA characters were displayed with vertical orientation in a
vertical text box, as you want, although there were vertical spacing
issues with the Plane 15 PUA characters which suggest that the
vertical metrics tables (vhea and vmtx) in the font are not being
applied for Plane 15 characters (or it could be a problem with my
font).

Andrew

Re: Private Use areas - Vertical Text

2018-08-29 Thread Andrew West via Unicode

On Wed, 29 Aug 2018 at 05:07, via Unicode  wrote:
>
> Yes, as Richard says when CJK Zhuang text is displayed vertically whilst
> the Zhuang characters in Unicode remain upright, but those with PUA
> codepoints are rotated 90°.

John, you did not explain by what mechanism you were trying to display
vertical PUA Zhuang text.

I can display vertically-oriented PUA-encoded CJKVZ ideographs in
vertical layout in web pages using CSS, as demonstrated in this test
page:

http://www.babelstone.co.uk/Fonts/PUA_Vertical_Test.html

The PUA characters display with correct orientation under Windows 10
on the Edge, Chrome and Firefox browsers. The test page only fails
under IE, but we are not meant to use IE anymore anyway.

Andrew

Re: Private Use areas - Vertical Text

2018-08-29 Thread Andrew West via Unicode

On Tue, 28 Aug 2018 at 18:15, WORDINGHAM RICHARD via Unicode
 wrote:
>
> Unicode is doing what it can in this matter:
>
> (a) Zhuang PUA characters are being made individually obsolete.

Not by a nebulous entity called "Unicode", or even by the Unicode
Consortium per se, but by the hard work over many years by individual
experts such as John Knightley.

Andrew

Re: The Unicode Standard and ISO

2018-06-08 Thread Andrew West via Unicode

On 8 June 2018 at 13:01, Michael Everson via Unicode
 wrote:
>
> I wonder if Mark Davis will be quick to agree with me  when I say that 
> ISO/IEC 15897 has no use and should be withdrawn.

It was reviewed and confirmed in 2017, so the next systematic review
won't be until 2022. And as the standard is now under SC35, national
committees mirroring SC2 may well overlook (or be unable to provide
feedback to) the systematic review when it next comes around. I agree
that ISO/IEC 15897 has no use, and should be withdrawn.

Andrew

Re: Translating the standard

2018-03-12 Thread Andrew West via Unicode

On 12 March 2018 at 07:59, Marcel Schneider via Unicode
 wrote:
>
> Likewise ISO/IEC 10646 is available in a French version

No it is not, and never has been.

Why don't you check your facts before making misleading statements to this list?

> or at least, it should have an official French version like all ISO standards.

That is also blatantly untrue.

Only six of the publicly available ISO standards listed at
http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
have French versions, and one has a Russian version. You will notice
that there is no French version of ISO/IEC 10646.

Andrew

Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-03-07 Thread Andrew West via Unicode

On 7 March 2018 at 22:18, Philippe Verdy via Unicode
 wrote:
>
> Additional note: the UCS will never large enough to support the personal
> signatures of billions Chinese people living today or born since milleniums,
> or jsut those to be born in the next century. There's a need to represent
> these names using composed strings. A reasonable compositing/ligaturing
> process can then present almost all of them !

CJK characters invented for writing personal names are extremely rare,
and do not constitute a significant fraction of CJK ideographs
proposed for encoding. The majority of unencoded modern-use characters
in China (that are not systematic simplified forms of existing encoded
characters) are used in place names or in Chinese dialects or for
writing non-Chinese languages such as Zhuang.

Andrew

Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-02-28 Thread Andrew West via Unicode

On 28 February 2018 at 13:22, Christoph Päper via Unicode
 wrote:
>>
>> The 157 new Emoji are now available for adoption
>
> But Unicode 11.0 (which all new emojis but Pirate Flag and Infinity rely 
> upon) is not even in beta yet.

Don't even get me started on that!

>> There are approximately 7,000 living human languages,
>> but fewer than 100 of these languages are well-supported on computers,
>> mobile phones, and other devices. Adopt-a-character donations are used
>> to improve Unicode support for digitally disadvantaged languages, and to
>> help preserve the world’s linguistic heritage.
>
> Why is the announcement mentioning those numbers of languages at all?

I agree, the figures are meaningless and misleading (and intended to
mislead). I could list a hundred languages that are written with the
Latin script without pausing for breath. There are very very few
scripts in modern daily use that are not yet encoded in the UCS, but
letting out that secret will not help the Unicode Consortium to raise
money from character adoption.

The latest grant to Anshu from Character Adoption money is for three
historic scripts
(http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html).
If there were still so many digitally disadvantaged languages urgently
in need of script encoding then surely the Unicode Consortium would be
sponsoring those as a priority rather than historic scripts.

Andrew

Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-02-28 Thread Andrew West via Unicode

On 28 February 2018 at 10:48, Martin J. Dürst via Unicode
 wrote:
>>
>>> The 157 new Emoji are now available for adoption, to help the Unicode
>>> Consortium’s work on digitally disadvantaged languages.
>>
>> I'm quite curious what it the relation between the new emojis and the
>> digitally disadvantages languages. I see none.
>
> I think this was mentioned before on this list, in particular by Mark:
> The money collected from character adoptions (where emoji are a prominent
> target) is (mostly?) used to support work on not-yet-encoded (thus digitally
> disadvantaged) scripts.

Over $250,000 has been raised from Unicode character adoptions to
date. I am curious as to how much of this money has been spent, and
would very much like to see annual accounts showing how much money has
been received, and how much has been disbursed to whom and for what.

Andrew



> See e.g. the recent announcement at
> http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html.
>
> Regards,   Martin.

Re: UNICODE vehicle vanity registration?

2018-02-14 Thread Andrew West via Unicode

You can use ♥⭐➕ in California. Someone has U+1F913 🤓
(https://www.instagram.com/p/BVYtIHensDu/)

Andrew


On 14 February 2018 at 16:24, Stephane Bortzmeyer via Unicode <
unicode@unicode.org> wrote:

> On Wed, Feb 14, 2018 at 09:44:06PM +0530,
>  Shriramana Sharma via Unicode  wrote
>  a message of 6 lines which said:
>
> > Given that in the US vanity vehicle registrations with arbitrary
> > alphanumeric sequences upto 7 characters are permitted (I am correct
> > I hope?), I wonder who (here?) owns the UNICODE registration?
>
> Won't work in New York, unfortunately
>
> https://dmv.ny.gov/learn-about-personalized-plates
>
> "A character is a letter (A-Z), number (0-9) or space. Each space
> counts as one character."
>
>

Re: 0027, 02BC, 2019, or a new character?

2018-01-25 Thread Andrew West via Unicode

On 23 January 2018 at 00:55, James Kass via Unicode  wrote:
>
> Regular American users simply don't type umlauts, period.

Not even the president of the Unicode Consortium when referring to
Christoph Päper:

http://www.unicode.org/L2/L2018/18051-emoji-ad-hoc-resp.pdf

Andrew

Re: 0027, 02BC, 2019, or a new character?

2018-01-19 Thread Andrew West via Unicode

On 19 January 2018 at 13:19, Michael Everson via Unicode
 wrote:
>
> I’d go talk with him :-) I published Alice in Kazakh. He might like that.

Damn, you'll have to reprint it with apostrophes now.

Andrew

Re: 0027, 02BC, 2019, or a new character?

2018-01-19 Thread Andrew West via Unicode

On 19 January 2018 at 09:16, Shriramana Sharma via Unicode
 wrote:
> Wow. Somebody really needs to convey this to the Kazhaks. Else a
> short-sighted decision would ruin their chances at native IDNs. Any Kazhaks
> on this list?

There's only one Kazakh who counts, and I'm pretty sure he's not on this list.

Andrew

Re: Xiangqi Game Symbols (was Re: Proposal to add standardized variation sequences for chess notation)

2017-04-12 Thread Andrew West via Unicode

On 12 April 2017 at 15:58, Garth Wallace  wrote:
>
> So has that proposal been retracted now?

Once a proposal has been approved it cannot simply be retracted by the
submitter. On the SC2 side, the proposed characters have been subject
to ballot comments from national bodies, and no doubt they will be
discussed at the WG2 meeting in Hohhot later this year.

Andrew

Xiangqi Game Symbols (was Re: Proposal to add standardized variation sequences for chess notation)

2017-04-12 Thread Andrew West via Unicode

On 12 April 2017 at 05:12, Garth Wallace via Unicode
<unicode@unicode.org> wrote:
>
> Later Xiangqi proposals by Andrew West focused on
> the circled ideographs and did not pursue new diagram drawing characters,
> and were eventually successful.

My Xiangqi proposal
(http://www.unicode.org/L2/L2016/16255-n4748-xiangqi.pdf) proposed a
minimal set of logical game pieces for Xiangqi/Janggi, regardless of
shape (circular or octagonal) or design (traditional characters,
simplified characters, cursive characters, or pictures) which I
consider a font design issue, and explicitly did not seek to encode
circled ideographs. My proposal was rejected, and a different proposal
by Michael Everson
(http://www.unicode.org/L2/L2016/16270-n4766-xiangqi.pdf) to encode
all circled ideographs and negative circled ideographs attested in
Xiangqi game diagrams was accepted instead.

The accepted proposal for circled ideographs is a glyph encoding model
not a character encoding model as for other game symbols (Chess,
Dominos, Mahjong, Playing Cards, etc.), and in my opinion it is a very
bad model for several reasons. It makes the interchange of Xiangqi
game data and game diagrams problematic; it hinders normal text
processing operations on Xiangqi game pieces (for example, to search
for a red horse piece you have to search for three different
characters); and in modern computer usage Xiangqi game pieces may not
be represented as simple circled ideographs, but may be coloured
designs showing characters or images. It is also very likely that
vendors will want to produce emoji versions of Xiangqi pieces, and
these could not reasonably be considered to be glyph variants of
circled ideographs. There has been some negative feedback on the
circled ideographs model on the internet, and I believe that Michael
has now been convinced that this model is wrong, and should be
replaced by a model using logical game pieces.

Andrew

Re: Unicode Emoji 5.0 characters now final

2017-03-29 Thread Andrew West

On 29 March 2017 at 21:09, Doug Ewell  wrote:
>
>> I think "recommended" could be renamed to "(expected to be) widely
>> implemented".
>
> That's a modest improvement; it shifts from an advisory health warning
> not to use certain sequences to what it is, speculation that some
> sequences will be far better supported in the field than others.

I don't think that would work.
http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt explicitly
lists just the three subdivision flags for England, Scotland and Wales
under Emoji Tag Sequences, which indicates that they are special in an
undefined way that none of the thousands of other potential
subdivision flag tag sequences are. There must be a higher bar for
inclusion in the Emoji data files than simply that they are expected
to be widely implemented. Their inclusion in the Emoji data files and
the Emoji charts
(http://www.unicode.org/emoji/charts/emoji-ordering.html) must
indicate that only these three tag sequences are recommended or
sanctioned by the UTC.
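
For reference, every subdivision flag tag sequence has the same shape: U+1F3F4 WAVING BLACK FLAG, the subdivision code spelled in tag characters, then U+E007F CANCEL TAG. The Python sketch below builds such sequences; the function name is made up, and the codes "gbeng", "gbsct" and "gbwls" are the ones used for England, Scotland and Wales:

```python
BLACK_FLAG = "\U0001F3F4"  # WAVING BLACK FLAG
CANCEL_TAG = "\U000E007F"  # CANCEL TAG terminates the sequence
TAG_BASE = 0xE0000

def subdivision_flag(code):
    """Build a flag tag sequence from a subdivision code such as 'gbeng'."""
    code = code.lower()
    if not (code.isascii() and code.isalnum()):
        raise ValueError(f"unexpected subdivision code: {code!r}")
    tags = "".join(chr(TAG_BASE + ord(c)) for c in code)
    return BLACK_FLAG + tags + CANCEL_TAG

for code in ("gbeng", "gbsct", "gbwls"):
    seq = subdivision_flag(code)
    print(code, " ".join(f"{ord(c):04X}" for c in seq))
```

The same function produces a well-formed sequence for any other subdivision code, which is the point: nothing in the mechanism distinguishes the three listed flags from the rest.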

(In case anyone thinks I support the current situation, let me state
that I disagree vehemently with the UTC decision to only "recommend"
these three particular subdivision flag tag sequences.)

Andrew

Re: Manatee emoji?

2016-11-23 Thread Andrew West

On 23 November 2016 at 16:39, Ken Whistler  wrote:
> On 11/23/2016 7:15 AM, James Kass wrote:
>>
>> How many signatures on a petition would be needed before
>> Unicode would consider adding a non-existent character to the
>> repertoire?
>
> I would say somewhat more than zero (which could hardly be considered a
> petition) and less than 7,466,363,069 (current estimate of the world
> population).

Well, based on http://www.unicode.org/L2/L2016/16295r-animal-emoji.pdf
I would say between 4,737 and 6,941.

Andrew

Re: Dataset for all ISO639 code sorted by country/territory?

2016-11-10 Thread Andrew West

On 10 November 2016 at 17:56, Doug Ewell  wrote:
>
> Keep in mind that the CLDR table documents 675 of the world's best-known
> languages, counting variants such as three different orthographies of
> Uzbek.

Oddly, it seems that there are over 1.2 billion speakers of Cantonese
in China, but no speakers of Mandarin (the biggest language by number
of speakers in the world).

Andrew

Re: Bit arithmetic on Unicode characters?

2016-10-07 Thread Andrew West

On 7 October 2016 at 23:31, Doug Ewell  wrote:
>
> Well, "treacherous" is right. I'd hesitate to trust an algorithm to
> recognize PLANCK CONSTANT as the character name that logically fits
> between MATHEMATICAL ITALIC SMALL G and MATHEMATICAL ITALIC SMALL I.

Well, it could be picked up from that most treacherous of Unicode data
files http://www.unicode.org/Public/UNIDATA/NamesList.txt
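
The gap Doug describes is easy to see from the character data itself. This Python sketch (the helper name is illustrative) looks up the names around the hole in the Mathematical Italic range and the compatibility mapping of U+210E:

```python
import unicodedata

def name_or_none(cp):
    """Return the character name, or None for an unassigned code point."""
    return unicodedata.name(chr(cp), None)

# U+1D455 is the reserved slot where "italic small h" would otherwise sit.
for cp in (0x1D454, 0x1D455, 0x1D456, 0x210E):
    print(f"U+{cp:04X}", name_or_none(cp))

# PLANCK CONSTANT carries a <font> compatibility mapping to plain "h".
print(unicodedata.decomposition("\u210E"))
```

Nothing in the name PLANCK CONSTANT links it to the italic alphabet; only the compatibility decomposition (or a cross-reference in NamesList.txt) does.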

Andrew

Re: less-than or equal to with dot in the less-than part?

2016-08-10 Thread Andrew West

On 10 August 2016 at 12:21, Costello, Roger L.  wrote:
>
> Do you know if there is another version of the symbol, but with a straight 
> equals sign rather than a slanted equals sign? (The book that I referred to 
> uses a straight equals sign not a slanted equals sign)

No, but there are lots of standardized variants for mathematical glyph
variants of this sort (see first section of
http://www.unicode.org/Public/UNIDATA/StandardizedVariants.txt), so
you could ask the UTC to define two more mathematical standardized
variants:

2A7F FE00; with straight equal; # LESS-THAN OR SLANTED EQUAL TO WITH DOT INSIDE
2A80 FE00; with straight equal; # GREATER-THAN OR SLANTED EQUAL TO WITH DOT INSIDE
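
Entries in that format are straightforward to read mechanically. Below is a minimal Python sketch of a parser for one line, assuming the usual layout of semicolon-separated fields (variation sequence; description; optional shaping contexts) followed by a "#" comment. The function name is invented, and the sample line is the proposed entry above, not something currently in StandardizedVariants.txt:

```python
def parse_variant_line(line):
    """Parse 'BASE VS; description; contexts # NAME' into a dict, or return None."""
    data, _, comment = line.partition("#")
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 2 or not fields[0]:
        return None  # blank line or pure comment
    base, selector = (int(cp, 16) for cp in fields[0].split())
    return {
        "base": base,
        "selector": selector,
        "description": fields[1],
        "contexts": fields[2].split() if len(fields) > 2 else [],
        "name": comment.strip(),
    }

entry = parse_variant_line(
    "2A7F FE00; with straight equal; # LESS-THAN OR SLANTED EQUAL TO WITH DOT INSIDE"
)
print(entry)
```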

Then all you would need is to get someone to support the new
standardized variants in a math font.

Andrew

Re: USAT value in the kIRG_USource property

2016-06-28 Thread Andrew West

David,

As Mr Suzuki says, despite the U prefix, USAT is not a Unicode source
character.  The reason why a solitary USAT source reference has
suddenly popped up in Ext. B is that several thousand ideographs were
proposed for encoding by SAT in what will be CJK Ext. F in Unicode
10.0 next year (there are currently 2,884 USAT characters in Ext. F).
At the WG2 meeting in Matsue Japan last year, in response to UK ballot
comments, USAT-00061 was unified with U+20991 in Ext. B (see
http://www.unicode.org/wg2/docs/n4701-M64-Recommendations.pdf
Recommendation M64.05c).  I suppose that the Unicode Standard will be
updated with a description of SAT when Ext. F is included in v. 10
next year.

Andrew

On 28 June 2016 at 06:09, drmccreedy .  wrote:
> I see one codepoint now has the kIRG_USource property value of "USAT" in the
> Unihan_IRGSources.txt file from Unihan.zip:
>     U+20991 kIRG_USource USAT-00061
>
> UAX #45 (U-source Ideographs,
> http://www.unicode.org/reports/tr45/index.html) mentions UTC and UCI but not
> USAT.
>
> UAX #38 (Unicode Han Database, http://www.unicode.org/reports/tr38/) updated
> the syntax for the kIRG_USource property (but not the description) to
> U(TC|CI|SAT)-[0-9]{5} so I'm pretty sure it's not a typo.
>
> Where can I find a description of the USAT value?
>
> Thanks,
>
> David McCreedy
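
The kIRG_USource syntax quoted above can be checked directly. A small Python sketch, with an illustrative function name, using the pattern exactly as David cites it from UAX #38:

```python
import re

# Pattern as quoted from UAX #38 for the kIRG_USource property.
USOURCE_PATTERN = re.compile(r"U(TC|CI|SAT)-[0-9]{5}")

def is_valid_usource(value):
    """True if value matches the whole kIRG_USource syntax."""
    return USOURCE_PATTERN.fullmatch(value) is not None

for value in ("USAT-00061", "UTC-00001", "USAT-61", "SAT-00061"):
    print(value, is_valid_usource(value))
```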

Re: Are there Unicode symbols for parenthesis generator symbols?

2016-06-26 Thread Andrew West

On 26 June 2016 at 09:37, Costello, Roger L.  wrote:
>
> In the book Parsing Techniques the authors use a less than symbol with a dot 
> tucked inside for the open parenthesis and a greater than symbol with a dot 
> tucked insider for the close parenthesis. Also, they use an equal sign with a 
> dot over it. You can see the 3 symbols here:
>
> https://books.google.com/books?id=05xA_d5dSwAC=PA267=PA267=parenthesis+generator+symbols=bl=3OwyeBndO8=ZhwoeYRJjm3GTzNNP1vgsAVRisc=en=X=2=0ahUKEwi577X-o8XNAhWBaz4KHc0QA_EQ6AEIIzAB#v=onepage=parenthesis%20generator%20symbols=false
>
> Are there Unicode symbols for the 3 symbols?

Yes, and they have all been around since Unicode 1.0:

U+22D6 ⋖
U+22D7 ⋗
U+2250 ≐ (named APPROACHES THE LIMIT)
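
The names can be confirmed with Python's unicodedata module; this is just a quick lookup sketch, and the variable name is arbitrary:

```python
import unicodedata

# The three symbols from the book, looked up by code point.
symbol_names = {cp: unicodedata.name(chr(cp)) for cp in (0x22D6, 0x22D7, 0x2250)}

for cp, name in symbol_names.items():
    print(f"U+{cp:04X} {chr(cp)} {name}")
```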

Andrew

Re: NamesList.txt as data source

2016-03-29 Thread Andrew West

On 29 March 2016 at 16:19, Janusz S. Bień  wrote:
>
> > All documents submitted to WG2 and to L2 by individuals are copyright
> > of the author(s) of the document.  Documents do not need to carry a
> > copyright notice to have copyright, and submitting the documents to
> > Unicode Consortium and/or ISO does not affect the copyright status of
> > documents.
> >
> >> http://www.unicode.org/policies/ipr_policy.html
>
> Do you happen to know an analogical link for the ISO submissions? I was
> unable to find one quickly.

ISO/IEC Directives Part 1 (6th ed., 2015)

Section 2.13:

In ISO and IEC, there is an understanding that original material
contributed to become a part of an ISO,
IEC or ISO/IEC publication can be copied and distributed within the
ISO and/or IEC systems (as relevant)
as part of the consensus building process, this being without
prejudice to the rights of the original
copyright owner to exploit the original text elsewhere. Where material
is already subject to copyright,
the right should be granted to ISO and/or IEC to reproduce and
circulate the material. This is frequently
done without recourse to a written agreement, or at most to a simple
written statement of acceptance.
Where contributors wish a formal signed agreement concerning copyright
of any submissions they
make to ISO and/or IEC, such requests must be addressed to ISO Central
Secretariat or the IEC Central
Office, respectively.

Andrew

Re: NamesList.txt as data source

2016-03-29 Thread Andrew West

On 29 March 2016 at 06:15, Asmus Freytag (c)  wrote:
>
> What is the copyright status of the
> document?
>
> The terms of use (ostensibly for the entire site) are defined here:
>
> http://www.unicode.org/copyright.html

That refers to the Unicode Standard and data files and other pages
produced and published by the Unicode Consortium.  It does not and
cannot refer to documents submitted to the Unicode Consortium by
external entities or individuals.

> The document archive has not been designated with anything more restrictive, 
> more specific or even explicit, but the documents themselves do not carry 
> copyrights. As far as the Consortium is concerned, it requires the submitters 
> to follow this policy

All documents submitted to WG2 and to L2 by individuals are copyright
of the author(s) of the document.  Documents do not need to carry a
copyright notice to have copyright, and submitting the documents to
Unicode Consortium and/or ISO does not affect the copyright status of
documents.

> http://www.unicode.org/policies/ipr_policy.html
>
> which gives the Consortium the rights to distribute submissions for any 
> purpose.

A non-exclusive right.

> Can it be redistributed and replicated on other sites?

Ask the individual authors of the particular documents you want to redistribute.

> Can it be quoted literally in a Wikipedia entry?

Within the normal Wikipedia rules for quoting copyrighted material.

Andrew

Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-19 Thread Andrew West

On 18 March 2016 at 23:49, Garth Wallace  wrote:
>
> Correction: the 2-digit pairs would require 19 characters. There would
> be no need for a left half circle enclosed digit one, since the
> enclosed numbers 10–19 are already encoded. This would only leave
> enclosed 20 as a potential confusable. There would also be no need for
> a left third digit zero, saving one code point if the thirds are not
> unified with the halves, so there would be 29 thirds.
>
> And just to clarify, there would have to be separate half circled and
> negative half circled digits. So that would be 96 characters
> altogether, or 58 if left and right third-circles are unified with
> their half-circle equivalents.  Not counting ideographic numbers.

Thanks for your suggestion, I have added two new options to my draft
proposal, one based on your suggestion (60 characters: 10 left, 10
middle and 10 right for normal and negative circles) and one more
verdyesque (four enclosing circle format characters).  To be honest, I
don't think the UTC will go for either of these options, but I doubt
they will be keen to accept any of the suggested options.

> This may not work very well for ideographic numbers though. In the
> examples, they appear to be written vertically within their circles
> (AFAICT none of the moves in those diagrams are numbered 100 or above,
> although some are hard to read).

I have now added an example with circled ideographic numbers greater
than 100.  See Fig. 13 in

http://www.babelstone.co.uk/Unicode/GoNotation.pdf

In this example, numbers greater than 100 are written in two columns
within the circle, with hundreds on the right.

Andrew

Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-19 Thread Andrew West

Hi Frédéric,

The historic use of ideographic numbers for marking Go moves is
discussed in the latest draft of my document:

http://www.babelstone.co.uk/Unicode/GoNotation.pdf

Andrew


On 16 March 2016 at 13:35, Frédéric Grosshans
<frederic.grossh...@gmail.com> wrote:
> Le 15/03/2016 22:21, Andrew West a écrit :
>>
>>
>> Possibly.  I certainly have very little expectation that a proposal to
>> complete both sets to 999 (or even 399) would have any chance of
>> success.
>
> And then, there are also historical examples of ideographic numbers used
> for the same purpose in historic texts (like here
> http://sns.91ddcc.com/t/54057, here http://pmgs.kongfz.com/item_pic_464349/
> or here
> http://www.weibo.com/p/1001593905063666976890?from=page_100106_profile=6=wenzhangmod
> ).
>
> The above has been found with a quick google search, and I have no idea
> whether these symbols were used in the running text or not.
>
>   Frédéric
>

Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-15 Thread Andrew West

On 15 March 2016 at 19:48, K.C.Saff  wrote:
>
> I often see numbers roll over at 100, displayed on a new board, so even just
> the full set of two digit forms adds a lot of utility for go games.  This
> seems to be a standard practice at Wikipedia (
> https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol#Game_4 ), Sensei's
> Library and a lot of books that I've worked through.

That's certainly true, although it is not hard to find examples which
go over 100 (http://www.babelstone.co.uk/Ludus/Weiqi/FamousGames_279.jpg),
and even the AlphaGo vs Lee Sedol Wikipedia page shows one game
diagram that goes into the 200s.

> Completing both sets
> up to 99, adding "00", and including the most common markers (triangle,
> square, etc.) seems like a good, useful compromise.

Possibly.  I certainly have very little expectation that a proposal to
complete both sets to 999 (or even 399) would have any chance of
success.

I am currently working on a proposal for the triangle and square go
markers, and am still considering the best approach to the circled
numbers.  Any feedback would be most welcome.

http://www.babelstone.co.uk/Unicode/GoNotation.pdf

Andrew

Re: Gaps in Mathematical Alphanumeric Symbols

2016-03-10 Thread Andrew West

On 10 March 2016 at 20:49, Doug Ewell  wrote:
>
>>
>> http://www.unicode.org/charts/PDF/U1D400.pdf
>>
>> The annotation for each reserved code point refers to the character
>> that logically belongs there.
>
> NamesList.txt also has this information, and unlike the others, it's
> both official and machine-readable:

It (http://www.unicode.org/Public/UNIDATA/NamesList.txt) is
machine-readable, although the file specifically warns that "this file
should not be parsed for machine-readable information".
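
Bearing that warning in mind, here is a rough Python sketch of how the cross-references on reserved code points could be pulled out of NamesList-style text. The SAMPLE string is a hand-written excerpt that I am assuming matches the layout of the real file (code point, tab, name; then tab-indented "x" cross-reference lines), and reserved_xrefs is a hypothetical helper, not anything official.

```python
import re

# Hand-written excerpt in what is assumed to be NamesList.txt layout.
SAMPLE = (
    "1D454\tMATHEMATICAL ITALIC SMALL G\n"
    "\t# <font> 0067 latin small letter g\n"
    "1D455\t<reserved>\n"
    "\tx (planck constant - 210E)\n"
    "1D456\tMATHEMATICAL ITALIC SMALL I\n"
)

# An entry line starts with the code point in hex, then a tab, then the name.
ENTRY = re.compile(r"^([0-9A-F]{4,6})\t(.+)$")
# A cross-reference line looks like: <tab>x (lowercase name - HEX)
XREF = re.compile(r"^\tx \(.* - ([0-9A-F]{4,6})\)$")

def reserved_xrefs(text):
    """Map each reserved code point to the code point it cross-references."""
    result = {}
    current = None  # the reserved code point we are inside, if any
    for line in text.splitlines():
        entry = ENTRY.match(line)
        if entry:
            is_reserved = entry.group(2) == "<reserved>"
            current = int(entry.group(1), 16) if is_reserved else None
            continue
        xref = XREF.match(line)
        if xref and current is not None:
            result[current] = int(xref.group(1), 16)
    return result

for gap, target in reserved_xrefs(SAMPLE).items():
    print(f"U+{gap:04X} -> U+{target:04X}")
```

Anything beyond a quick look-up like this should probably rely on the normative data files instead.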

Andrew

Re: Gaps in Mathematical Alphanumeric Symbols

2016-03-10 Thread Andrew West

On 10 March 2016 at 19:09, Oren Watson  wrote:
>
> Is there a standard denoting which characters are part of each "mathematical
> variable alphabet"? There is a table on Wikipedia
> but the placement of characters into the gaps is unsourced. Perhaps I'm
> overthinking this, but I don't think it's necessarily obvious that the
> character BLACK-LETTER CAPITAL C should be used as the nonexistent character
> *MATHEMATICAL FRAKTUR CAPITAL C. Is there a document clarifying this?

Yes, the code charts in the Unicode Standard:

http://www.unicode.org/charts/PDF/U1D400.pdf

The annotation for each reserved code point refers to the character
that logically belongs there.
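
As a quick sanity check of the example in the question, this Python sketch uses the standard unicodedata module to confirm that the slot between Fraktur capital B and D is unassigned and that the letter already exists in Letterlike Symbols. The constant and function names are mine, and the result depends on the Unicode version shipped with your Python.

```python
import unicodedata

GAP = 0x1D506       # where *MATHEMATICAL FRAKTUR CAPITAL C would sit
STAND_IN = 0x212D   # BLACK-LETTER CAPITAL C in Letterlike Symbols

def is_unassigned(cp: int) -> bool:
    """True if the code point has general category Cn (not assigned)."""
    return unicodedata.category(chr(cp)) == "Cn"

print(is_unassigned(GAP))
print(unicodedata.name(chr(GAP - 1)))
print(unicodedata.name(chr(GAP + 1)))
print(unicodedata.name(chr(STAND_IN)))
```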

Andrew

Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-10 Thread Andrew West

On 10 March 2016 at 11:34, Leonardo Boiko  wrote:
> Isn't it better to use some sort of COMBINING ENCLOSING CIRCLE?

Of course that approach is possible, but it is quite problematic, both
from the perspective of the font developer and the end user, because
the circle would have to be able to combine with an indefinite number
of preceding characters, and it is not easy to either determine where
the boundary is (in the font) or specify the boundary (by the end
user).  For example, given a text string of "1234"
what does the combining circle combine with?  Unitary characters would
be just way simpler and more reliable.
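
To illustrate the boundary problem, here is a small Python sketch that groups each mark with the single character before it. This is a deliberate simplification of grapheme cluster segmentation (the function name simple_clusters is made up), but it shows why a combining circle after "1234" would by default attach only to the "4".

```python
import unicodedata

ENCLOSING_CIRCLE = "\u20DD"  # COMBINING ENCLOSING CIRCLE

def simple_clusters(text):
    """Group each mark (general category M*) with the character before it.

    A simplified stand-in for real grapheme cluster segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the preceding cluster
        else:
            clusters.append(ch)
    return clusters

# Only the last digit ends up sharing a cluster with the circle.
print(simple_clusters("1234" + ENCLOSING_CIRCLE))
```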

Andrew

Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-10 Thread Andrew West

On 10 March 2016 at 07:00, Martin J. Dürst  wrote:
>
> because these numbers can go up to the 200s, it doesn't make sense to
> register them all as characters (one would need over 500!).

I don't get why that would make no sense.  We already have CIRCLED
NUMBER 1 through 50, and NEGATIVE CIRCLED NUMBER 1 through 20, and
these characters are widely used (in East Asian contexts, at least)
for representing note numbers in text.  In my opinion it would be
eminently sensible to extend both series up to 999, which would cover
the needs of Go notation as well as note numbering for the vast
majority of users.
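
For reference, the existing characters are not contiguous, so mapping a number to its circled form takes a little arithmetic. The Python sketch below assumes the ranges U+2460..U+2473, U+3251..U+325F and U+32B1..U+32BF for circled 1 to 50, and U+2776..U+277F plus U+24EB..U+24F4 for negative circled 1 to 20 (the first ten being the dingbat negative circled digits). The function names are hypothetical.

```python
import unicodedata

def circled(n: int) -> str:
    """Return the existing circled number character for 1..50."""
    if 1 <= n <= 20:
        return chr(0x2460 + n - 1)
    if 21 <= n <= 35:
        return chr(0x3251 + n - 21)
    if 36 <= n <= 50:
        return chr(0x32B1 + n - 36)
    raise ValueError(f"no circled number character for {n}")

def negative_circled(n: int) -> str:
    """Return the existing negative circled number character for 1..20."""
    if 1 <= n <= 10:
        return chr(0x2776 + n - 1)   # dingbat negative circled digits
    if 11 <= n <= 20:
        return chr(0x24EB + n - 11)
    raise ValueError(f"no negative circled number character for {n}")

# Cross-check the arithmetic against the numeric values in the database.
for n in range(1, 51):
    assert unicodedata.numeric(circled(n)) == n
for n in range(1, 21):
    assert unicodedata.numeric(negative_circled(n)) == n

print(circled(1), circled(50), negative_circled(20))
```

Anything above 50 (or 20 for the negative series) has no single character at present, which is the gap under discussion.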

Andrew

Re: Purpose of and rationale behind Go Markers U+2686 to U+2689

2016-03-10 Thread Andrew West

On 10 March 2016 at 07:00, Martin J. Dürst  wrote:
>
> So yes, these symbols are used for mathematical research of the game of
> Go, and not as far as I know for actual notation.

Which indicates how absurd the proposal to emojify these four characters is.

http://www.unicode.org/L2/L2016/16021-game-pieces-emoji.pdf

Andrew

Re: Dark beer emoji

2015-09-02 Thread Andrew West

On 1 September 2015 at 17:37, Doug Ewell  wrote:
>
> As an alternative to this proposal that may provide more flexibility, I
> propose adapting the Fitzpatrick skin-tone modifiers from U+1F3FB to
> U+1F3FF to be valid for use following U+1F37A BEER MUG or U+1F37B
> CLINKING BEER MUGS.
>
> This could be done by establishing a normative correlation between the
> Fitzpatrick scale and the Standard Reference Method (SRM), Lovibond,
> and/or European Brewery Convention (EBC) beer color scales.
>
> This mechanism would allow the entire spectrum of beer styles to be
> depicted, instead of dividing beers arbitrarily into "light" and "dark,"
> in the same way (and for the same reason) that Unicode already supports
> a variety of skin tones.
>
> For example, a Budweiser or similar lager could be represented as
>  <1F37A, 1F3FB>, while a Newcastle Brown Ale might be 
> <1F37A, 1F3FD>. U+1F3FF could denote imperial stout or Baltic porter.
> There might be a need to encode an additional "Type 0" color modifier to
> extend the "light" end of the scale, such as for non-alcoholic brews, or
> for Coors Light.

Yet more blatant anti-ginger discrimination. Yet another reason to
encode a ginger emoji modifier at the earliest opportunity (see
https://www.change.org/p/apple-redheads-should-have-emoji-too), which
could then be applied to U+1F37A BEER MUG in order to depict ginger
beer.

Andrew

Re: a suggestion new emoji .

2015-08-19 Thread Andrew West

On 19 August 2015 at 12:36, Otto Stolz otto.st...@uni-konstanz.de wrote:

 You cannot suggest a new character just because it would
 be “nice to have”. Rather, you have to supply evidence that
 an additional character really needs to be encoded, e. g.
 because it is already widely used in print and cannot be
 represented in Unicode.

Well that may once have been the case, but certainly isn't any longer
with respect to emoji, especially emoji representing food and drink.

I suggest Emma reads Unicode Technical Report 51
http://unicode.org/reports/tr51/ especially section 1.2 Encoding
Considerations and Annex C Selection Factors, then start a petition to
the Unicode Consortium on www.change.org, and when she has 10,000
signatures make a formal request to the UTC.  Petitions don't
guarantee acceptance, but widely-petitioned emoji such as taco, cheese
wedge, paella and whisky tumbler have been successful.

Andrew

Re: a suggestion new emoji .

2015-08-19 Thread Andrew West

On 19 August 2015 at 19:45, Andrew West andrewcw...@gmail.com wrote:
 On 19 August 2015 at 19:22, Marcel Schneider charupd...@orange.fr wrote:

 On 19 Aug 2015 at 01:45, Emma Haneys miszhan...@gmail.com wrote:

  i suggest one and only for fruit category . it is a durian .

 Emma, at small sizes, and especially in monochrome rendering, the glyph of a 
 durian emoji would closely resemble the glyph of an eventual lychee emoji.

 I don't know, I think durian emoji would be quite distinctive, as
 shown in the examples on this page (I am rather taken with the sad
 durian which gets no hugs).

Sorry, this page:

http://www.cafepress.co.uk/+durian+stickers

Andrew

Re: a suggestion new emoji .

2015-08-19 Thread Andrew West

On 19 August 2015 at 19:22, Marcel Schneider charupd...@orange.fr wrote:

 On 19 Aug 2015 at 01:45, Emma Haneys miszhan...@gmail.com wrote:

  i suggest one and only for fruit category . it is a durian .

 Emma, at small sizes, and especially in monochrome rendering, the glyph of a 
 durian emoji would closely resemble the glyph of an eventual lychee emoji.

I don't know, I think durian emoji would be quite distinctive, as
shown in the examples on this page (I am rather taken with the sad
durian which gets no hugs).

Andrew

Re: Standardised Variation Sequences with Toggles

2015-08-17 Thread Andrew West

On 16 August 2015 at 23:50, Richard Wordingham
richard.wording...@ntlworld.com wrote:

 @+ For details about the implementation of variation sequences in
 Phags-pa, please refer to the Phags-pa section of the core
 specification.

 a) This is likely to be ignored by someone who is just looking for the
 *specification*.  I think replacing 'implementation' by 'rendering'
 would be better.  I would be inclined to add, 'These sequences are more
 complicated than they appear at first reading'.  Otherwise, someone
 will just add them to the character to glyph conversion section of a
 font and think, Job done.

That's not a plausible scenario. Phags-pa has complex shaping and
joining requirements, and it is impossible for someone to create a
properly functioning Phags-pa font based on the code charts alone. If
anyone did implement Phags-pa in a font based solely on the Unicode or
10646 code charts, with no joining or shaping behaviour, for use as a
fallback font or as a code chart font then naively implementing U+A856
+ U+FE00 (VS1) as a mirrored glyph is not unreasonable.  If they want
to produce a Phags-pa font for displaying running Phags-pa text then
at a minimum they will need to read the appropriate section of the
core specification.

Andrew

Re: Emoji characters for food allergens

2015-07-30 Thread Andrew West

On 30 July 2015 at 18:07, Marcel Schneider charupd...@orange.fr wrote:

 I'll try to respond to all,

Please don't.

Andrew

Re: Emoji characters for food allergens

2015-07-29 Thread Andrew West

On 29 July 2015 at 14:42, William_J_G Overington
wjgo_10...@btinternet.com wrote:

 For example, one such character could be used to be placed before a list of
 emoji characters for food allergens to indicate that a list of dietary
 need follows.

 For example,

 My dietary need is no gluten no dairy no egg

 There could be a way to indicate the following.

 My diet can include soya

There already is, you can write "My diet can include soya".

If you are likely to swell up and die if you eat a peanut (for
example), you will not want to trust your life to an emoji picture of
a peanut which could be mistaken for something else or rendered as a
square box for the recipient.  There may be a case to be made for
encoding symbols for food allergens for labelling purposes, but there
is no case for encoding such symbols as a form of symbolic language
for communication of dietary requirements.

Andrew

Re: Some questions about Unicode's CJK Unified Ideograph

2015-06-29 Thread Andrew West

On 28 June 2015 at 21:16, gfb hjjhjh c933...@gmail.com wrote:

 oh and by the way, could you (or someone else) please help look for the
 character ⿰亻革 also?

Not in the pipeline as far as I can see.

 Just seen a Chinese Wikipedia article introducing an
 ethnic group with the character as part of its name
 https://zh.m.wikipedia.org/wiki/(亻革)家人 but without a proper character for
 it. The article cites a CCTV program for its origin.

... which calls them 革家, and so is not evidence for the existence of
the character 亻革  (I don't doubt that the character exists, but
neither the Wikipedia article nor the CCTV web page are sufficient
evidence for it).

 And there seem to be a dozen more Wikipedia articles that contain unencoded
 han characters, as listed in
 https://zh.wikipedia.org/wiki/Category:含有未收錄漢字的条目

There are some 60 unencoded CJK characters in use on Wikimedia
projects (see 
https://commons.wikimedia.org/wiki/Category:Chinese_characters_not_in_Unicode),
which I include in my BabelStone Han PUA font (see U+F2D6..U+F2EF,
U+F2FD..U+F2FF, U+F3E0, U+F4C0..U+F4E1 listed at
http://www.babelstone.co.uk/Fonts/PUA.html).  The problem with most of
these characters is that Wikipedia is not a suitable source for
encoding, and evidence for use of these characters in printed sources
needs to be presented to the UTC and IRG for them to have any chance
of being encoded.
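
As an aside, the PUA ranges listed above can be tallied with a few lines of Python. This is only a sketch over the ranges quoted in this message (the names RANGES and code_points are mine); it counts the code points and confirms that each has general category Co (private use).

```python
import unicodedata

# Inclusive PUA ranges quoted above for the Wikimedia characters.
RANGES = [
    (0xF2D6, 0xF2EF),
    (0xF2FD, 0xF2FF),
    (0xF3E0, 0xF3E0),
    (0xF4C0, 0xF4E1),
]

def code_points(ranges):
    """Yield every code point in a list of inclusive (first, last) ranges."""
    for first, last in ranges:
        yield from range(first, last + 1)

points = list(code_points(RANGES))
print(len(points))
print(all(unicodedata.category(chr(cp)) == "Co" for cp in points))
```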

For an example of what you should do to get these characters encoded
see the latest revision of Ming Fan's Proposal to add 94 Chinese
characters to UAX #45
(http://www.unicode.org/L2/L2015/15098r3-chinese.pdf).

Andrew

Re: free download of ISO/IEC 10646 (was: Accessing the WG2 document register)

2015-06-11 Thread Andrew West

On 11 June 2015 at 10:38, Andrew West andrewcw...@gmail.com wrote:

 The Unicode terms of use http://unicode.org/copyright.html are far
 more restrictive, and state that "Any person is hereby authorized,
 without fee, to view, use, reproduce, and distribute all documents and
 files solely for informational purposes in the creation of products
 supporting the Unicode Standard, subject to the Terms and Conditions
 herein."  So if you are not planning to create a product supporting
 the Unicode Standard, you are not legally allowed to view or download
 any of the files comprising the Unicode Standard !

My apologies, according to the Unicode Consortium and Trademark Usage
Policy http://www.unicode.org/policies/logo_policy.html I should
always refer to The Unicode® Standard.  I hope that everyone on this
list will take note of this important policy in future messages.

Andrew

Re: Accessing the WG2 document register

2015-06-11 Thread Andrew West

On 11 June 2015 at 07:05, Philippe Verdy verd...@wanadoo.fr wrote:

 Personally I think that Unicode does a much better job to open its standard
 to many more people by offering different levels of participations and
 opening a large area open to every individual without paying considerable
 fees. I consider that the only standard that defines the UCS is TUS, not
 ISO/IEC 10646 (that is just a piece of junk, badly administered, and
 inaccessible to most people).

You do realise that by insulting ISO/IEC 10646 you are also insulting
a number of prominent members of the UTC and officers of the Unicode
Consortium who actively participate in the production and editing of
ISO/IEC 10646?

The latest version of ISO/IEC 10646 is not inaccessible to most
people, as it is (and has been since 2006) available for free download
from ISO at 
http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html.

Whilst I agree that the standard itself is irrelevant to the vast
majority of users, who can get by quite happily just knowing about the
Unicode Standard, I believe that the great importance of ISO/IEC 10646
lies in the process that goes to produce it, not in the resultant
standard.  The Unicode Consortium is largely controlled by a few large
American corporations, but ISO is open to participation by standards
organizations representing countries across the globe, and there are
currently thirty participating members of SC2, the committee which is
responsible for ISO/IEC 10646
http://lucia.itscj.ipsj.or.jp/itscj/servlets/ScmMem10?Com_Id=02PER_F=1.
The ISO ballot process allows stakeholders in scripts from these
countries to participate in the encoding process, and make the views
of their experts heard.  The ballot process also applies important
checks on the encoding process, and prevents scripts and characters
being encoded with undue haste if an encoding proposal is not yet
mature enough or if there is insufficient consensus among
stakeholders.  Not least, the ballot process allows for multiple
stages of review and correction of errors.

If Unicode were to go it alone, professional encoders such as Anshu
and Michael, who do not have an inherent stake in most of the scripts
they work on, would present their proposals to the UTC, who do not
have any expertise in such minority or historic scripts, but on the
basis that the proposal seems plausible they would approve it, and six
months later it would be in the next version of Unicode.  Yes, this
speeds up the encoding process enormously (which is usually at least
two years), but at what cost?  What happens when a couple of years
later, users of the script in question in Africa or Asia discover that
it has been encoded in Unicode but has a serious flaw or shortcoming
that no-one from the user community had an opportunity to correct (and
due to stability policies it is now too late to correct)?

So whilst ISO/IEC 10646 is certainly irrelevant to most people, I
strongly believe that the process whereby the standard is produced is
extremely beneficial to the Unicode Standard, and I would urge Anshu
and others to support the work of SC2 and WG2 rather than dismiss it
as a hindrance or irrelevance.

Andrew

Re: free download of ISO/IEC 10646 (was: Accessing the WG2 document register)

2015-06-11 Thread Andrew West

On 11 June 2015 at 10:12, Janusz S. Bień jsb...@mimuw.edu.pl wrote:

 The latest version of ISO/IEC 10646 is not inaccessible to most
 people, as it is (and has been since 2006) available for free download
 from ISO at 
 http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html.

 The page states clearly

  The following standards are made freely available for standardization
  purposes.

 In consequence I don't feel entitled to download it. Not only my
 curiosity is not a standardization purpose, but even teaching students
 about standards also doesn't qualify. I just show them the link and tell
 them to decide themselves to download or not :-)

I think you are reading far too much into the phrase "for
standardization purposes".  The license states that you are allowed to
store a copy on your personal computer and print off a single copy,
but says nothing about what purposes you may use the standards for.
In my opinion it is ridiculous to claim that you are not entitled to
download the documents.

The Unicode terms of use http://unicode.org/copyright.html are far
more restrictive, and state that "Any person is hereby authorized,
without fee, to view, use, reproduce, and distribute all documents and
files solely for informational purposes in the creation of products
supporting the Unicode Standard, subject to the Terms and Conditions
herein."  So if you are not planning to create a product supporting
the Unicode Standard, you are not legally allowed to view or download
any of the files comprising the Unicode Standard !

Andrew

Re: Re: Some questions about Unicode's CJK Unified Ideograph

2015-05-31 Thread Andrew West

On 31 May 2015 at 09:43, gfb hjjhjh c933...@gmail.com wrote:

 As of ⿰言亞 versus ⿰言亜, as I don't have much knowledge about Vietnamese and
 the character is from chu han instead of chu nom, I don't really know if
 there are any semantic difference between the two, but at least the one
 usage of ⿰言亜 shown in the word on that dictionary page would be something
 like dumb, mute which were not listed as part of the meaning of the
 character 䛩 in wiktionary.

The way CJK unification works, you don't need to show that there is a
semantic difference between the two forms, just that the form is used
in a reputable source.  Can you send me off-list a scan of the
character from the Vietnamese dictionary you mention?

 And for the proper name mark and book name mark, while I see the point that
 it would be best achieved via word processor styling or a markup language,
 would it be a good idea to integrate something similar to a markup language
 into Unicode, such as creating a character <ps> that indicates the start of a
 proper name mark and <pe> for the end of a proper name mark, so that typing
 <ps>PROPERNAME<pe> would result in something similar to <u>PROPERNAME</u>?

I think you can achieve the appropriate styling for web pages using CSS:

http://www.w3.org/TR/2013/WD-css-text-decor-3-20130103/#text-decoration-style-property

 And using the workaround suggested by Andrew, yes, the hair space works,
 but it leaves a gap between characters with a width equal to an 'i'. I have
 also tried characters like U+200C or U+034F, which do not work.

Even with OpenType it is not easy to contextually create a gap between
two combining underlines as the characters are not adjacent (I don't
think it is impossible, but the only way I can think of doing it is
rather unpleasant; perhaps other font experts on this list know an
easy way of doing it).

 and it seem
 like babelstone han is not supporting U+1AB6?

U+1AB6 is supported in the next release of BabelStone Han (due for
release very soon, probably within the next week or two).

 and is there any vertical
 edition of the two characters...

The combining underline and wavy line characters will work OK with a
vertically oriented CJK font (they will display on the left).
Unfortunately BabelStone does not currently work very well in vertical
orientation.

Andrew

Re: Re: Some questions about Unicode's CJK Unified Ideograph

2015-05-31 Thread Andrew West

On 31 May 2015 at 12:42, Andrew West andrewcw...@gmail.com wrote:

 Even with OpenType it is not easy to contextually create a gap between
 two combining underlines as the characters are not adjacent (I don't
 think it is impossible, but the only way I can think of doing it is
 rather unpleasant; perhaps other font experts on this list know an
 easy way of doing it).

Ignore that, I wasn't thinking straight.  It can be done easily using OpenType.

Andrew

Re: the usage of LATIN SMALL LETTER A WITH STROKE

2015-05-31 Thread Andrew West

On 31 May 2015 at 15:32, Janusz S. Bień jsb...@mimuw.edu.pl wrote:

 I'm curious what was the motivation for adding the character to
 Unicode. I understand the proposal is somewhere in the archives, perhaps
 it is available on the Internet?

Please see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2942.doc.

Andrew

Re: Some questions about Unicode's CJK Unified Ideograph

2015-05-30 Thread Andrew West

On 30 May 2015 at 02:50, Ken Whistler kenwhist...@att.net wrote:

1. I have seen a chinese character ⿰言亜 from a Vietnamese dictionary NHAT
DUNG THUONG DAM DICTIONARY

Extension F is harder to track down, because it has not yet been
approved by the UTC, and comes in two pieces, with different
progression so far in the ISO committee. Perhaps somebody on this list
who has better access to the relevant documents can let you
know whether ⿰言亜 can be found in those sets.

It's not in my lists of F1 and F2 characters.

2. Are combining characters like U+20DD intended to work with all different
types of characters, or is it some problem related to implementation? When
I write ゆ⃝ (Japanese Hiragana Letter Yu + Combining Enclosing Circle) they
appear separate in most fonts I use, but if I change the Hiragana Yu
into a conventional = sign or some Latin character, most fonts are at least
somehow able to put them together. Or is there any better/alternative
representation in Unicode that can show Japanese hiragana yu in a circle?

Combining enclosing marks in principle could work with most characters,
but in practice most arbitrary combinations do not work very well,
because they would require very complicated font support.

It's not that complicated, but I think most fonts don't support arbitrary
combinations with combining enclosing circle because there is little or no
demand for them. BabelStone Han displays Japanese Hiragana Letter Yu +
Combining Enclosing Circle quite well, but on the other hand it does not
work so well with CJK ideographs, and fails with Latin letters and
punctuation.

4. In CJK Symbols and Punctuation, Proper name mark and Book name mark are
not included. While there are characters like U+2584, U+FE33, U+FE4F, and
U+FE34 in Unicode that are more or less a representation of the two symbols,
they do not appear below or on the left of typed characters when text flow
is horizontal/vertical, and instead they occupy their own space, which makes
them of little use in daily life. While the proper name mark and
book name mark can be represented by text editing software and CSS, those
representations are not ideal and they do match Criteria for Encoding
Symbols. Is it possible to make a new Unicode symbol, or change some
current symbol into one that could appear in a suitable place relative to
other characters when typed? A property of the symbol is that when used in a
case like 美國紐約, in which 美國 and 紐約 are two different proper names
(place names), an underline should go below them without any separation
between the characters 美 and 國 or 紐 and 約 (when text is written
horizontally), but at the same time the underline should not be linked
between 國 and 紐, as 國 is the end of the first place name while 紐 is the
start of the other.

What you are talking about is, indeed, best handled by text styling
attributes, rather than by individual character encoding.

I agree. However, if you really do want to represent underlining of proper
names at the character encoding level, then you would have to do something
like put U+0332 Combining Low Line after each character to be underlined,
and select a font that supports Combining Low Line with CJK ideographs.
BabelStone Han supports this low-level method of underlining CJK
ideographs, but if you want a space in the underlining between 美國 and 紐約
you would have to insert a very thin space (U+200A Hair Space in this
example) between the characters.
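
A minimal Python sketch of that low-level method, for what it is worth: put U+0332 after every character of each name, and join the names with U+200A. The function name underline_names is made up for illustration, and whether the result displays acceptably depends entirely on the font.

```python
COMBINING_LOW_LINE = "\u0332"
HAIR_SPACE = "\u200A"

def underline_names(names):
    """Underline each proper name and separate adjacent names by a hair space."""
    marked = [
        "".join(ch + COMBINING_LOW_LINE for ch in name)  # mark every character
        for name in names
    ]
    return HAIR_SPACE.join(marked)

text = underline_names(["美國", "紐約"])
print(text)
print(" ".join(f"U+{ord(ch):04X}" for ch in text))
```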

Andrew

KPS 9566 mappings (was Re: Arrow dingbats)

2015-05-29 Thread Andrew West

As someone who supports opening of KPS 9566 encoded files in my
software (BabelPad), I am interested in those characters proposed by
DPRK (http://std.dkuug.dk/jtc1/sc2/wg2/Docs/n2374.pdf) that were not
accepted for encoding but which are still in the latest version of the
DPRK standard, KPS 9566-2012(?). Red Star OS 3.0 Unicode-maps most of
them to the PUA, which is not satisfactory in most cases.

LEFTWARDS SCISSORS = KPS 9566-2012 ACD5

There are five scissors characters at 2700..2704, but they are all
right-facing. I think it would not be unreasonable to encode a
left-facing scissors character for compatibility with KPS 9566.
Alternatively, standardized variants for left-facing and right-facing
scissors could be defined for all 2700..2704, but that might open a
nasty precedent that we come to regret, so I would prefer simply
encoding a single left-facing scissor character.

CIRCLED UPWARD INDICATION = KPS 9566-2012 ACD4

This could be represented as U+1F446 WHITE UP POINTING BACKHAND INDEX
+ U+20DD COMBINING ENCLOSING CIRCLE.
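
In a converter this could be handled as a one-to-many mapping. The Python sketch below is purely illustrative: the table KPS_TO_UNICODE and the function to_unicode are hypothetical, and only the ACD4 entry comes from the suggestion above.

```python
import unicodedata

# Hypothetical mapping table: KPS 9566 code -> Unicode string (may be a sequence).
KPS_TO_UNICODE = {
    0xACD4: "\U0001F446\u20DD",  # CIRCLED UPWARD INDICATION as base + enclosing circle
}

def to_unicode(kps_code: int):
    """Return the mapped Unicode string, or None if the code has no mapping here."""
    return KPS_TO_UNICODE.get(kps_code)

for ch in to_unicode(0xACD4):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```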

WHITE UP-POINTING TRIANGLE WITH BLACK TRIANGLE = KPS 9566-2012 A2F1
WHITE UP-POINTING TRIANGLE WITH HORIZONTAL FILL = KPS 9566-2012 A2F2
WHITE UP-POINTING TRIANGLE WITH UPPER LEFT TO LOWER RIGHT FILL = KPS
9566-2012 A2F3
WHITE UP-POINTING TRIANGLE WITH UPPER RIGHT TO LOWER LEFT FILL = KPS
9566-2012 A2F4

I don't know why these were not accepted for encoding.  As far as I
can tell, they cannot be represented by any current Unicode character,
and I think it would be reasonable to encode them for compatibility
with KPS 9566.

RIGHT PARENTHESIS WITH FULL STOP = KPS 9566-2012 A1DC
RIGHT DOUBLE ANGLE BRACKET WITH FULL STOP = KPS 9566-2012 A1DD

I understand why these were not accepted for encoding.  However, U+2047
DOUBLE QUESTION MARK, U+2048 QUESTION EXCLAMATION MARK, and U+2049
EXCLAMATION QUESTION MARK set a precedent: I believe they were encoded
because they are used in vertically oriented Mongolian text, where it
is problematic to embed ?? etc. horizontally.  That precedent suggests
that it may be appropriate to encode these two characters for
compatibility with KPS 9566.

VULGAR FRACTION ONE HALF WITH HORIZONTAL BAR = KPS 9566-2012 A7FA
VULGAR FRACTION ONE THIRD WITH HORIZONTAL BAR = KPS 9566-2012 A7FB
VULGAR FRACTION TWO THIRDS WITH HORIZONTAL BAR = KPS 9566-2012 A7FC
VULGAR FRACTION ONE QUARTER WITH HORIZONTAL BAR = KPS 9566-2012 A7FD
VULGAR FRACTION THREE QUARTERS WITH HORIZONTAL BAR = KPS 9566-2012 A7FE

These contrast with KPS 9566 A7CA..A7CE which are vulgar fractions
with diagonal bar.  The issue of distinguishing between a horizontal
and a diagonal fraction slash is not restricted to North Korea, and I
think that there is an argument to be made for defining standardized
variants for all vulgar fraction characters to specify a glyph with
either a horizontal bar or a diagonal bar.

HAMMER AND SICKLE AND BRUSH
CIRCLED HAMMER AND SICKLE AND BRUSH

I assume that there is no appetite to encode these symbols for the
Workers' Party of Korea, and so mapping them to the PUA is
appropriate.

There is also the proposed VERTICAL TILDE character which was not
accepted for encoding, but which Red Star OS 3.0 Unicode-maps to
U+2E2F VERTICAL TILDE which was added in Unicode 5.1 for Cyrillic
transliteration.  This mapping does not seem wholly satisfactory to me,
and I wonder whether it would not be better to simply encode a
PRESENTATION FORM FOR VERTICAL TILDE at FE1A.

Andrew

Re: simplified Chinese words （土+从）

2015-05-21 Thread Andrew West

Hi Shi Zhao,

The character 土+从 is not yet in Unicode, but it is scheduled for
inclusion in CJK Extension F.  You can see the character here
(http://www.unicode.org/L2/L2014/14271-n4637.pdf on p. 148), but you
should not rely on the code point which will surely change.

Andrew



On 21 May 2015 at 16:06, shi zhao shiz...@gmail.com wrote:
 simplified Chinese words （土+从, Hanyupinyin: zong1）don't in unihan.

 simplified Chinese: (土+从)
 traditional Chinese: 㙡 (U+3661)

 see
 http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=3661&useutf8=false
 http://glyphwiki.org/wiki/u2ff0-u571f-u4ece

 http://www.cnki.net/kcms/detail/Detail.aspx?dbname=CJFD2014filename=KJSY201404019v=MjA1NzdMdktMaWZZZDdHNEg5WE1xNDlFYllRSGZYZ3h2UjhRbUV3SlRReVFybVJFRnJDVVJMK2ZZdVJ1RkN2bFU=filetitle=%E4%BB%8E%E8%AF%AF%E5%90%8D%E2%80%9C%E9%B8%A1%E6%9E%9E%E8%8F%8C%E2%80%9D%E7%9C%8B%E7%A7%91%E6%8A%80%E5%90%8D%E8%AF%8D%E8%A7%84%E8%8C%83%E5%8C%96

Re: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Andrew West

On 18 May 2015 at 19:19, Doug Ewell d...@ewellic.org wrote:

 Is the new mechanism intended to allow flag tags that include either
 subtype values or contains values? For example:

That is my understanding.

 1F3F3 E0047 E0042 E002D E0053 E0043 E0054 (GB-SCT)
 for the Scottish flag

 and

 1F3F3 E0047 E0042 E002D E004E E004C E004B (GB-NLK)
 for the North Lanarkshire council area flag

I don't believe that North Lanarkshire has an associated flag, which I
think is the case for most UK counties and councils (Cornwall, Devon
and Dorset all have flags, but they may be the exceptions).  In fact
not all of the four nations comprising the UK have a flag -- for
political reasons there is no official flag for Northern Ireland, so I
do not know what an implementation would display for 1F3F3 E0047
E0042 E002D E004E E0049 E0052 (GB-NIR), perhaps just a plain flag
emblazoned with GB-NIR.
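
The sequences quoted above follow a simple pattern: the base flag
character, then one tag character (U+E0000 plus the ASCII value) for each
character of the region code.  A minimal sketch in Python, assuming the
draft form discussed in this thread; the function names are hypothetical.

```python
def flag_tag_sequence(code, base=0x1F3F3):
    """Return the base flag followed by one tag character per ASCII character."""
    for ch in code:
        # Tag characters U+E0020..U+E007E mirror printable ASCII 0x20..0x7E.
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"not a printable ASCII character: {ch!r}")
    return [base] + [0xE0000 + ord(ch) for ch in code]

def format_sequence(code_points):
    """Format code points as space-separated upper-case hex."""
    return " ".join(f"{cp:04X}" for cp in code_points)

print(format_sequence(flag_tag_sequence("GB-SCT")))
print(format_sequence(flag_tag_sequence("GB-NIR")))
```

This only builds the sequence; whether an implementation has a glyph to
show for it is a separate matter.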

Andrew

Re: Meroitic cursive fractions numerical values

2015-03-29 Thread Andrew West

On 28 March 2015 at 20:05, Karl Williamson pub...@khwilliamson.com wrote:

 Existing software that looks at the numeric values of characters is written
 expecting that rational numbers will have been reduced to their lowest form.

That seems to be a rather rash statement. I have software (BabelPad)
which parses the numeric values of characters for numeric sorting
purposes, and it parses 6/12 for MEROITIC CURSIVE FRACTION SIX
TWELFTHS as 0.5. Personally I find it hard to imagine how you could
write software that accepts 6/12 as input and is unable to come up
with the answer of a half.

I would say that fractions should not be reduced to their lowest form
in the Unicode data as some people may need to order fractions by
numerator or denominator, and reducing to lowest form could break the
expectations of some software.  Having said that, I note that the
numeric value of one character has been reduced in the Unicode data:
U+2189 VULGAR FRACTION ZERO THIRDS is given the numeric value of 0
rather than 0/3.
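
To illustrate the point, here is a minimal sketch in Python of parsing a
numeric value field without losing the numerator and denominator as
written.  It is not BabelPad's code, and the function name is
hypothetical.

```python
def parse_numeric_field(field):
    """Parse a numeric value field such as "6/12", "1/2" or "0".

    Returns the numerator and denominator exactly as written, plus the
    floating-point value for numeric sorting.
    """
    num, _, den = field.partition("/")
    numerator = int(num)
    denominator = int(den) if den else 1  # a plain integer has denominator 1
    return numerator, denominator, numerator / denominator

print(parse_numeric_field("6/12"))
print(parse_numeric_field("0"))
```

Software that wants to order by numerator or denominator can use the
first two values; software that only wants the magnitude can use the
third.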

Andrew
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode

Re: Terms for rotations

2014-11-13 Thread Andrew West

On 11 November 2014 01:17, David Starner prosfil...@gmail.com wrote:
 On Mon, Nov 10, 2014 at 4:12 PM, Whistler, Ken ken.whist...@sap.com wrote:
 Seriously, I think that Ilya's point is well-taken. Although in English
 there is a strong association of the phrase turn to the right with
 clockwise motion for control devices which rotate, if you take the
 phrase out of that mechanical context and just talk about the
 orientation of pictures on paper, there can be some ambiguity
 based on the conceptual confusion with the concept of
 turning to[wards] facing the right, which can mean something
 very different for symbols which seem to have built-in
 directions, like arrows.

 So is there anything wrong with CLOCKWISE and COUNTERCLOCKWISE? TURNED
 COUNTERCLOCKWISE seems a little verbose. WIDDERSHINS is shorter than
 COUNTERCLOCKWISE, but is not exactly a common term, especially in
 technical English.

ANTICLOCKWISE is the term used in the UCS (see names for 20D4, 20DA,
21B6, 21BA, 2233, 27F2, 2939, 293A, 293B, 293D, 293F, 2940, 29BC,
2A11, 2B6F, 2B8C, 2B8D, 2B8E, 2B8F, 2B94, 1F504).
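
A list like this can be regenerated by scanning the character names.  A
minimal sketch in Python using the standard unicodedata module; the
function name is hypothetical, and the result depends on the Unicode
version bundled with the Python build.

```python
import unicodedata

def code_points_named(word, limit=0x20000):
    """Return code points below `limit` whose character name contains `word`."""
    found = []
    for cp in range(limit):
        # Unnamed code points (unassigned, surrogates, etc.) yield "".
        name = unicodedata.name(chr(cp), "")
        if word in name:
            found.append(cp)
    return found

hits = code_points_named("ANTICLOCKWISE")
print(", ".join(f"{cp:04X}" for cp in hits))
```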

Andrew

Re: MONGOLIAN LETTER YA medial second form, incorrect image?

2014-11-13 Thread Andrew West

On 13 November 2014 10:00, Richard Ishida ish...@w3.org wrote:
 Before reporting this I want to check I have understood it correctly. If you
 know something about Mongolian variant selectors, please let me know if my
 conclusion is correct.

 I think the image for medial MONGOLIAN LETTER YA second form, 1836 180B, at
 http://www.unicode.org/Public/UNIDATA/StandardizedVariants.html is
 incorrect.

 I think it should have no upturn on the left.

Yes, you are correct.  It makes no sense to have an upturn as that
would be the same glyph as the first medial form.  You can see that
the second initial form and the second medial form both have the same
glyph with no upturn (ignore the dot, that is a printing artefact) in
Prof. Choijinzhab's Mongolian Encoding:

http://www.babelstone.co.uk/Mongolian/MGWBM/MGWBM_C034-C035.jpg

Andrew

Re: Code charts and code points (was: Re: fonts for U7.0 scripts)

2014-10-24 Thread Andrew West

On 24 October 2014 13:05, Shriramana Sharma samj...@gmail.com wrote:
 Hi Martin. If you haven't noticed it before, opening Unicode charts in
 PDF readers has something like SECURED on the top i.o.w. the charts
 are sorta DRM-protected. So you can't copy-paste the characters. Heck
 you can't even copy-paste the character *names*!

You can copy just fine with the Foxit PDF reader.  Like Jukka, I tried
randomly copying a number of PDF code charts (from
http://www.unicode.org/charts/) and I couldn't find any which were not
using the correct Unicode code points (maybe there are some, but I
gave up before I found them).

Andrew

Re: fonts for U7.0 scripts

2014-10-23 Thread Andrew West

On 22 October 2014 23:58, Asmus Freytag asm...@ix.netcom.com wrote:

 Nothing prevents people to put their fonts in the public domain, if they so
 desire, but that can't be a requirement of the character encoding process.

I never said or implied that making the font freely available should
be a requirement of the character encoding process (although I
personally think it ought to be).  I said that if the production of
the font was funded by the SEI then it should be made freely
available, and I think that is what donors to the SEI would expect,
certainly based on the text I quoted earlier which had been on the SEI
web site for many years before Debbie removed it yesterday.

Andrew

Re: fonts for U7.0 scripts

2014-10-23 Thread Andrew West

On 22 October 2014 21:47, Andrew Glass (WINDOWS)
andrew.gl...@microsoft.com wrote:

 I think that distributing fonts that are known to be deficient in shaping
 does not address needs other than reproducing code charts and suppressing
 tofu. Moreover, such fonts can mislead users into thinking that a script
 is supported when we know that more work remains to be done. When work
 appears to be complete to someone who can't read a script, the motivation
 to address the remaining issues to support that script is undermined.
 There can also be other negative consequences. I think that making a set
 of character-only fonts available would be against the interests of the
 SEI and Unicode.

Well, not all scripts have complex rendering behaviour, so for some
scripts the code chart font mapped to the correct Unicode code points
is all that is needed.

Even for fonts with deficient rendering behaviour or which are mapped
to ASCII or PUA code points, if the font was released under the SIL
Open Font license or an equivalent free license then people could use
it for the basis for a fully functional Unicode font.

 In this respect, I think the effort of the Noto project to include shaping
 support for complex scripts is commendable. I hope that the current gaps
 in Noto will soon be filled by suitable fonts so that the need to release
 'chart-only' fonts is removed.

I'm a great fan of the Noto project, but as Mark's original question
indicates Noto does not supply a solution for newly encoded scripts,
and I very much dislike the idea of Google having a monopoly on
supplying free fonts for minor and historic scripts.  A code chart
font, released under a free license such as the SIL OFL (with any
necessary limitations clearly stated) is far and away better than
leaving people puzzling over little square boxes for years.

Andrew

Re: fonts for U7.0 scripts

2014-10-22 Thread Andrew West

On 22 October 2014 08:27, Mark Davis ☕️ m...@macchiato.com wrote:
 I'm looking for freely downloadable TTF fonts for any of the following.  I'd
 appreciate links to sites for any of these:

 Bassa_Vah
 Duployan
 Grantha
 Khojki
 Khudawadi
 Mahajani
 Mende_Kikakui
 Modi
 Mro
 Nabataean
 Old_Permic
 Palmyrene
 Pau_Cin_Hau
 Tirhuta
 Warang_Citi

Was the encoding of any of these scripts funded by the Script Encoding
Initiative?  According to the SEI
(http://www.linguistics.berkeley.edu/sei/help.html) Funding is used
primarily for the creation of proposals on a per-project basis and for
fonts. Fonts will be made available over the Unicode website and will
be available for free distribution but cannot be bundled with
commercial products.

Although I have to say that I cannot see anywhere on the Unicode
website that provides fonts for SEI-funded scripts.

Andrew


Re: fonts for U7.0 scripts

2014-10-22 Thread Andrew West

Debbie,

Thanks for the explanation.  I just wonder: in order to get a script
accepted for encoding, the proposer has to provide a font for the
Unicode/10646 code charts, so creating a font (one that is at least
good enough for the code charts, even if it does not have full shaping
behaviour) is an essential part of the proposal process.  So if the SEI
is funding someone to research and write a proposal, is the funding
provided by SEI not at least indirectly funding the creation of a font?
And if so, should the font not be made freely available at the end of
the project?

Andrew


On 22 October 2014 14:48, Deborah W. Anderson dwand...@sonic.net wrote:
 Dear Andrew,
 Most of the scripts listed below did come via Script Encoding Initiative 
 (SEI), you are correct.

 The intent of SEI was to work on proposals and provide fonts but, to date, 
 the focus of the work has been almost exclusively on getting scripts into 
 Unicode and not on the creation of distributable fonts. I will modify the 
 wording on the webpage accordingly.

 Ideally, I would like to have free fonts made available via SEI, but it 
 hasn't been possible due to funding constraints. In the future, I plan to 
 work closely with ScriptSource (and other projects that make free fonts 
 available), and will encourage the creation and submission of free fonts to 
 such projects, though at this point SEI doesn't have the funding itself to 
 pay for such work, unfortunately.

 Debbie Anderson


Re: Bliss?

2014-10-14 Thread Andrew West

On 14 October 2014 17:06, Doug Ewell d...@ewellic.org wrote:

 Statements in the linked article such as the following (not written by
 Markus) always trouble me:

Gosh, I wonder who it could have been?

https://en.wikipedia.org/w/index.php?title=Blissymbols&diff=331226727&oldid=331223779

Andrew

Re: And what happened to...

2014-10-08 Thread Andrew West

On 8 October 2014 01:23, Mark E. Shoulson m...@kli.org wrote:

 The other thing I wanted to ask about has, sure enough, disappeared.  It's 
 the only Han character I'm following.  The infamous Biang-Biang Noodle 
 character, discussed at http://en.wikipedia.org/wiki/Biangbiang_noodles The 
 WP page said it was scheduled for Extension E (I know it says Extension F 
 now: I changed it), which has already been passed, so I looked through the 
 IRG web site and read up on a bunch of discussion tracing its fate.

The character was part of Unicode's Urgently Needed Characters (UNC)
submission of 19 characters to IRG in 2013
(http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg40/IRGN1936_UTC_UNC.zip),
but other IRG members had concerns about biang and some other
characters in the submission, and so Unicode's UNC resubmission in
2014 (http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg42/IRGN2005_UTC_UNC.zip)
was reduced to five characters, with all controversial or questioned
characters removed.  Those five characters are currently scheduled for
inclusion in Unicode 9.0, but biang remains in limbo.

On the other hand, I am pleased to see that yet another two variants
of the character 邊 biān (for which there are already 21 Ideographic
Variation Sequences defined) are scheduled for encoding at 2DD84 and
2DD85 (http://std.dkuug.dk/JTC1/SC2/WG2/docs/n4637.pdf).

The situation is highly unsatisfactory, but in my opinion the whole
CJK encoding process is highly unsatisfactory.

Andrew


Re: Discrepancies between kTotalStrokes and kRSUnicode in the Unihan database

2014-09-09 Thread Andrew West

Hi John,

You raise some interesting points, and I hope that one of the people
who maintain the Unihan database can address your issues better than I
can.

I think that the reason why the main CJK block shows the greatest
number of mismatches between kTotalStrokes and kRSUnicode is related
to the way that CJK characters were ordered in the initial Unicode 1.0
repertoire, which seems to have been based on the glyph shape used in
the particular source standard.  To take U+5040 偀 as an example, this
is mainly a Cantonese character, and I guess that the source from
which the Unicode character was derived used a traditional form of the
character with a broken grass radical, giving a residual stroke count
of 9, and it was thus ordered in the code charts as the first
character with nine strokes under radical 9, hence kRSUnicode = 9.9
(you can see in the Unicode 2.0 code charts
http://www.unicode.org/versions/Unicode2.0.0/CodeCharts2.pdf that
U+5040 indeed has a 4-stroke grass radical).  In a later version of
Unicode a new font was used in which U+5040 was represented in the
(single column) code charts with a  3-stroke grass radical glyph (see
for example the Unicode 4.0 code charts
http://www.unicode.org/versions/Unicode4.0.0/CodeCharts.pdf).
kTotalStrokes was presumably based on the glyph forms given in one of
these later versions of the code charts, and so for U+5040
kTotalStrokes = 10.
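
The mismatch can be checked mechanically.  A minimal sketch in Python,
assuming a simplified pattern for kRSUnicode values (radical, optional
apostrophe, residual strokes) and a hypothetical table of radical stroke
counts with only radical 9 filled in; abbreviated radical forms with
different stroke counts are not handled.

```python
import re

# Simplified pattern: radical number, optional apostrophe, residual strokes.
RS_PATTERN = re.compile(r"^(\d{1,3})('?)\.(-?\d{1,2})$")

# Hypothetical stroke-count table; only radical 9 (two strokes) is filled in.
RADICAL_STROKES = {9: 2}

def expected_total(rs_value, radical_strokes=RADICAL_STROKES):
    """Return radical strokes + residual strokes for a kRSUnicode value."""
    match = RS_PATTERN.match(rs_value)
    if match is None:
        raise ValueError(f"unrecognised kRSUnicode value: {rs_value!r}")
    radical = int(match.group(1))
    residual = int(match.group(3))
    return radical_strokes[radical] + residual

# U+5040: kRSUnicode = 9.9, kTotalStrokes = 10 (values from this thread).
derived = expected_total("9.9")
print(derived, derived == 10)
```

For U+5040 the derived total is 11 (2 + 9), one more than kTotalStrokes,
which matches the 4-stroke versus 3-stroke grass radical explanation
above.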

The problem of stroke counts is now compounded by the use of
multi-column code charts for CJK, with each character illustrated with
multiple regional glyph forms.  In many cases different glyph forms
with differing stroke counts are shown in different columns for the
same character, so the kTotalStrokes and kRSUnicode fields may not
reflect the stroke count for all regional variants of the same
character.  Furthermore, when regional variants of the same character
do have varying stroke counts it is not obvious which character form
should be used to calculate the values of kTotalStrokes and
kRSUnicode, which makes these two fields very problematic in my
opinion.

That kRSUnicode allows for multiple values, but only provides more
than one value in a tiny handful of cases (mostly where the character
can be classified under more than one radical), makes the situation
even worse in my opinion.  For processes that want to sort CJK
characters, it is very useful to have a single nominal radical-stroke
key for every encoded CJK character, but once you have multiple values
for kRSUnicode (and no indication which value is preferable under
which circumstances) then you are given a choice as to which value to
use but no way of knowing which the best choice is.

My solution would be to have a single kRSUnicode value giving a
nominal radical-stroke value for each character, harmonized with
kTotalStrokes, with stroke count for the two fields calculated
consistently according to some defined criteria; and if there are more
than one possible radicals for a particular character then just use
the radical under which it appears in the Unicode code charts.  In
addition I would create individual  kTotalStrokes and kRSUnicode
fields for each source (G, H, J, K, T, U, V, etc.), which would give
the preferred radical and stroke count for each regional glyph form
given in the code charts.

Andrew

On 8 September 2014 20:03, John Armstrong john.armstrong@gmail.com wrote:
 [Apologies if this issue has already been resolved.  I searched the
 Unicode.org site for discussions but I only found document dating from 2003
 which touches on the issue:  andrewcw...@alumni.princeton.edu RE: Unicode
 4.0.1 Beta Review 1. kRSUnicode Field
 (http://www.unicode.org/L2/L2003/03311-errata4.txt)]



 A CJK Han character is conventionally viewed as consisting of a radical plus
 a residual part or “phonetic”.  (For a character which is a radical the
 residual part is nothing.  The term “phonetic”, indicating that the residual
 part of the character points the pronunciation of the character, properly
 only applies to 90-95% of characters, but it applies in the examples below.
 )



 The two parts of a character each consist of a specific arrange of strokes,
 and together account for all the strokes in the character.  In particular,
 the number of strokes in the radical portion plus the number of strokes in
 the residual portion equals the total number of strokes in the character.
 The stroke count of a radical combined with a residual part is not always
 the same as the stroke count of the radical appearing on its own, but may be
 slightly or significantly less due to a minor or major abbreviation.  (A
 radical may have several forms which are used in different positions of the
 whole character, say left or right side vs. top or bottom.  These variants
 may have the same or different stroke counts.)



 Because of abbreviated variants the total stroke count for a character
 cannot be always be gotten by adding the stroke count of the radical in its
 standalone form to

Re: Noto adds CJK, plus new user-facing website

2014-07-16 Thread Andrew West

On 16 July 2014 00:33, Roozbeh Pournader rooz...@unicode.org wrote:
 Please excuse the spam, but I think it would be interesting for people here
 to know that the Noto open source project now supports CJK, which brings it
 very close to the goal of supporting every major script (and several minor
 and historical ones).

 Here is the CJK announcement:
 http://googledevelopers.blogspot.com/2014/07/noto-cjk-font-that-is-complete.html

Fantastic news, but I personally think that the decision to include
the four characters at U+9FCD through U+9FD0 in the Adobe Source Han /
Noto Sans Simplified Chinese fonts (and U+9FD0 in the Traditional
Chinese fonts) is extremely premature given that these characters have
only just been added to the draft repertoire for ISO/IEC 10646:2014
Amd. 2, and have not yet completed even their first round of the ISO
balloting process.  As such the code point allocations are not stable,
and should not be used in fonts for public consumption.

Andrew

Re: [Unicode] tablature characters for the Chinese guqin

2014-05-30 Thread Andrew West

On 30 May 2014 05:50, suzuki toshiya mpsuz...@hiroshima-u.ac.jp wrote:

 BTW, a few (only one?) characters for the latter style are
 sampled in a normal dictionary CiYuan, and will be included
 in CJK Unified Ideograph Extension F.

I hope not.  Just because it occurs in a Chinese dictionary does not
mean that it is a Han ideograph, and guqin tablature signs most
definitely are not Han ideographs.  The component elements of guqin
tablature signs should be encoded in a separate block, with an
encoding model that allows for the composition of arbitrary tablature
signs by fonts.

 However, I don't think
 encoding only one glyph for the tablature is so useful - there is any 
 avantgarde number using only one note?

It would be extremely unuseful to do so.

 Attached is IRG42 t-shirt of a tablature(?) taken from Dunhuang
 manuscript (Pelliot P3808).

Yes, it is tablature used for the pipa lute during the Tang dynasty.
I have a table of pipa tablature signs at:

http://babelstone.blogspot.co.uk/2012/12/one-to-twenty-in-jurchen-khitan-and-lute.html#Lute

And I discuss Song and Yuan dynasty flute tablature signs at:

http://babelstone.blogspot.co.uk/2012/12/one-to-ten-in-tangut-and-flute.html

Glyphs for both flute and pipa tablature signs are available in my
BabelStone Han font in the PUA at E000..E01D and E020..E04B
respectively.

Andrew

Re: two Hanzi

2014-03-21 Thread Andrew West

On 20 March 2014 16:43, Richard COOK rsc...@wenlin.com wrote:

 Interesting, yedict.com lists a few characters as 非unicode汉字, some 
 repeatedly.

 ⿹气云 =氲!굣
 ⿹气免 =冕
 ⿹气木 !㏙ # [U+2C1CF] Ext E
 ⿹气毛 !㯘

 One of these is in Ext E (from V-Source), but the other three seem not to be 
 encoded.

The Zhonghua Zihai 《中华字海》 dictionary includes thousands of
characters not yet encoded in Unicode, and just under the 气 radical
yedict.com lists 24 not-in-Unicode characters from Zhonghua Zihai,
of which 18 are not encoded or included in CJK-E:

⿹气一
⿰气刂 = U+520F 刏
⿰几气
⿹气Ó
⿹气木 = CJK-E U+2C1CF
⿹气云
⿹气毛
⿹气电 = CJK-E U+2C1D0
⿹气禾
⿹气未
⿹气目
⿹气囚
⿺气朴 = CJK-E U+2C1D1
⿹气ᆬ
⿹气免
⿹气ౄ
⿰气奏
⿹气ᇒ
⿹气泉
⿺气原 = CJK-E U+2C1D2
⿰⿱杳丂气 != U+20103 𠄃 (⿰⿱杳丂乞)
⿹气達 = CJK-E U+2C1D3
⿰⿳⿻工昍日丂气 != U+2010B 𠄋 (⿰⿳⿻工昍日丂乞)
⿹气羅

 Note that the structure may differ ...

 ⿹气X
 ⿺气X
 ⿱气X

 might all refer to the same abstract character ...

... but nevertheless would not be unified according to Annex S.

Andrew


Re: Re: [Unicode] two Hanzi

2014-03-20 Thread Andrew West

On 20 March 2014 14:12, Jörg Knappen jknap...@web.de wrote:

 Who writes a proposal?

I wish that there was a mechanism for encoding CJK characters that
allowed individuals to simply submit characters with appropriate
evidence to Unicode, and after review they could be added to the next
version of Unicode, but the reality is that you need to go through a long
and bureaucratic process involving the Ideographic Rapporteur Group
(IRG), with the result that it may take ten years to get a CJK
character encoded.  Even the Unicode Consortium seems powerless to
overcome IRG bureaucracy, as the sorry tale below illustrates.

In 2012 I wrote a proposal to encode 226 Han characters, including two
fish characters previously requested by Shi Zhao on this list
http://www.unicode.org/mail-arch/unicode-ml/y2012-m05/0259.html,
which I submitted to the Unicode Technical Committee (UTC):

http://www.unicode.org/L2/L2012/12333-cjk-f.pdf

The UTC accepted this document, and included the suggested characters
in the Unicode submission to the IRG for inclusion in the CJK-F
extension:

http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg39/IRGN1888_UTCExtensionF.zip

This was discussed at the IRG meeting in Hanoi in November 2012
(http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg39/IRG39.htm), but the
Unicode submission for CJK-F was entirely rejected by IRG just because
the submission was a couple of days late.

The UTC later submitted a proposal to encode 19 of the original
characters (including Shi Zhao's two fish characters) as urgently
needed characters:

http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg40/IRGN1936_UTC_UNC.zip

But this was rejected by IRG in November last year as they considered
that these characters were not urgent enough, so now we will have to
wait another four or five years before they can be considered for
CJK-G.

Good luck getting the characters for newtonium and nebulium encoded any sooner!

Andrew


Re: Heron element in Unicode?

2014-03-18 Thread Andrew West

On 18 March 2014 17:46, Mike Morrison m...@mikemorr.com wrote:

 Does the element which is the right half of 権, and the left half of 勧,
 歓, 観, exist anywhere in the current or proposed Unicode standards?

No.

 It's a simplification of 雚, and similar to but not the same as 隺. If
 not currently in Unicode, is it the sort of thing that might be
 considered for addition in the future?

People have been talking for many years about encoding the relatively
few CJK components that do not exist as characters in their own right,
and I think that there would be some support from the relevant
committees if a well-presented proposal was submitted.

Andrew


Re: Aw: Astrological symbol for Pluto?

2014-02-03 Thread Andrew West

On 3 February 2014 13:14, Shriramana Sharma samj...@gmail.com wrote:

 In any case, it seems its astronomical symbol was encoded quite early
 (DerivedAge = 1.1) which was before the 2006 IAU decision to demote it to
 dwarf planet status. Of course, even if it were encoded today I'm sure it
 would be the only dwarf planet to have a symbol encoded since no other dwarf
 planet has captured the common man's imagination (and basic knowledge) like
 Pluto, and I have not heard any of the other dwarf planets (Ceres, Haumea,
 Makemake and Eris) having any symbols...

Well, there are no fewer than four unencoded astrological symbols for
Eris according to this Wikipedia article:

https://en.wikipedia.org/wiki/Astrological_symbols

Andrew

Re: Tool to convert characters to character names

2012-12-20 Thread Andrew West

On 20 December 2012 05:03, Martin J. Dürst due...@it.aoyama.ac.jp wrote:
 I'm looking for a (preferably online) tool that converts Unicode characters
 to Unicode character names. Richard Ishida's tools
 (http://rishida.net/tools/conversion/) do a lot of conversions, but not
 names.

My What Unicode character is this? javascript tool converts Unicode
characters to Unicode character names:

http://www.babelstone.co.uk/Unicode/whatisit.html
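
For anyone who prefers a script to an online tool, the same conversion
can be sketched in a few lines of Python with the standard unicodedata
module.  This is not how the tool above is implemented, the function
name is hypothetical, and the names available depend on the Unicode
version bundled with the Python build.

```python
import unicodedata

def character_names(text):
    """Return (code point, character name) pairs for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for ch in text
    ]

for code_point, name in character_names("A\u00F7"):
    print(code_point, name)
```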

Andrew

Re: Claims of Conformance (was: Re: CLDR and ICU)

2012-07-26 Thread Andrew West

On 27 July 2012 00:42, Ken Whistler k...@sybase.com wrote:

 It is a whole nother kettle of fish when somebody says of their product
 This product conforms to the Unicode Standard, Version 6.2.0. There
 would be nothing misleading about their use of the Unicode Mark in
 such a case -- they are actually referring to the actual standard which
 claims the trademark. The reference is not misleading.

Yet such a claim would be wrong according to the Trademark Policy
page, because they omitted the ® symbol and used the word conforms
(they should have stated This product is compliant with the Unicode®
Standard, Version 6.2.0.).  The page clearly states that any claim of
conformance is not allowed to be made if the Unicode Word Mark
guidelines are not followed (e.g. omitting the ® after Unicode, or
using a verb other than use, implement, support, or are
compliant with), which implies that any wrongly formulated or
formatted claims of conformance are null and void, and should not be
accepted by potential users of the product.

I am sure we have discussed how stupid this page is on this list
before, and I for one refuse on principle to add the ® symbol to
Unicode when, for example, I claim conformance to the Unicode 6.1
normalization algorithm for BabelPad.  Perhaps people should be wary
of using my software because the Unicode Word Mark is misused, but
more likely they will think that Unicode's (oops!) trademark policy is
a little bit silly.

Andrew

Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-10 Thread Andrew West

On 10 July 2012 11:50, Leif Halvard Silli
xn--mlform-...@xn--mlform-iua.no wrote:

 My candidate characters, this round, are:

  DIVISION SIGN (÷) as minus sign.
  COLON (:) as division sign.
 MIDDLE DOT (·) as multiplication symbol.

The last one is already encoded as U+22C5 DOT OPERATOR.

 What's next? Would some formal action be needed?

Yes. If you really want to propose them then you must submit a
proposal form to Unicode and/or WG2:

http://www.unicode.org/pending/proposals.html

But I very much doubt that the committees would accept such characters
for encoding.

Andrew

Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-10 Thread Andrew West

On 10 July 2012 13:52, Jukka K. Korpela jkorp...@cs.tut.fi wrote:

 Yes. If you really want to propose them then you must submit a
 proposal form to Unicode and/or WG2:

 http://www.unicode.org/pending/proposals.html

 I don’t think Leif meant proposing new characters. Instead, I suppose he
 meant adding annotations to existing characters, changing the text of the
 standards (probably in the code charts, though notes about uses of
 individual characters also appear scattered around the chapters of the
 standard, too).

OK, in that case he needs to file a report at:

http://www.unicode.org/reporting.html

Andrew

Re: [OT] Flerovium and livermorium get names on the periodic table of elements

2012-06-01 Thread Andrew West

On 1 June 2012 23:02, Peter Constable peter...@microsoft.com wrote:

 http://www.theverge.com/2012/6/1/3057261/flerovium-livermorium-periodic-table-of-elements

There don't appear to have been any Chinese characters assigned to
these two elements yet, but it is interesting to note that there are
no simplified forms for eight of the elements with highest atomic
numbers:

104 Rf 鑪 钅卢
105 Db 𨧀 钅杜
106 Sg 𨭎 钅喜
107 Bh 𨨏 钅波
108 Hs 𨭆 钅黑
109 Mt 䥑 钅麦
111 Rg 錀 钅仑
112 Cn 鎶 钅哥

which are represented with PUA characters at:

http://zh.wikipedia.org/wiki/%E5%85%83%E7%B4%A0%E5%91%A8%E6%9C%9F%E8%A1%A8

and as components at:

http://zh.wikipedia.org/wiki/%E6%89%A9%E5%B1%95%E5%85%83%E7%B4%A0%E5%91%A8%E6%9C%9F%E8%A1%A8

(110 Ds is already encoded in CJK-D as U+2B7FC 𫟼)

Seem like candidates for urgent encoding to me.

Andrew

Re: Flag tags (was: Re: Unicode 6.2 to Support the Turkish Lira Sign)

2012-05-31 Thread Andrew West

On 31 May 2012 00:24, Mark Davis ☕ m...@macchiato.com wrote:

 There is definitely a problem.

Is it really such a problem?  Why can't implementations simply use
ZWSP to demarcate the 2-character units in a sequence of more than two
regional indicator symbols (and maybe always emit 2-character codes
wrapped between ZWSP on either side to be safe), so for example
US<ZWSP>ES<ZWSP>GE would be parsed as the regional indicator symbols
for USA, Spain and Georgia, whereas U<ZWSP>SE<ZWSP>SG<ZWSP>E would be
parsed as the regional indicator symbols for U (invalid), Sweden,
Singapore and E (invalid).  Algorithms such as line-breaking would not
break between two regional indicator symbols, but only at a ZWSP.

And if implementations wanted to support two- and three-letter
regional codes, they might parse
<ZWSP>GB<ZWSP>CYM<ZWSP>ENG<ZWSP>NIR<ZWSP>SCO<ZWSP> as the codes for
United Kingdom, Wales, England, Northern Ireland, and Scotland, and
represent them visually with the appropriate flag icons.
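
The ZWSP-delimited scheme described above could be sketched roughly as
follows.  This is only an illustration under stated assumptions, not
anything specified by Unicode: the function names parse_units and
to_regional are hypothetical, the input is assumed to contain nothing
but regional indicator symbols and U+200B ZERO WIDTH SPACE, and a unit
counts as valid purely on its length.

```python
ZWSP = "\u200b"      # ZERO WIDTH SPACE
RI_A = 0x1F1E6       # REGIONAL INDICATOR SYMBOL LETTER A
RI_Z = 0x1F1FF       # REGIONAL INDICATOR SYMBOL LETTER Z


def to_regional(letters):
    """Convert ASCII letters A-Z (e.g. 'US') to regional indicator symbols."""
    return "".join(chr(RI_A + ord(c) - ord("A")) for c in letters.upper())


def parse_units(text, allowed_lengths=(2,)):
    """Split text at ZWSP and return (letters, is_valid) for each unit.

    Assumption: a unit is valid only if its length is in allowed_lengths;
    no check is made that the letters form a real region code.
    """
    units = []
    for chunk in text.split(ZWSP):
        if not chunk:
            continue  # leading, trailing or doubled ZWSP
        if not all(RI_A <= ord(c) <= RI_Z for c in chunk):
            raise ValueError("expected only regional indicator symbols")
        letters = "".join(chr(ord(c) - RI_A + ord("A")) for c in chunk)
        units.append((letters, len(letters) in allowed_lengths))
    return units
```

With the default allowed_lengths=(2,) a lone symbol is flagged as
invalid; passing allowed_lengths=(2, 3) would also accept three-letter
units such as CYM or SCO.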

Andrew

Re: Flag tags

2012-05-31 Thread Andrew West

On 31 May 2012 10:20, Michael Everson ever...@evertype.com wrote:

 No at least the black pirate flag, and the checkered flag (for car racing).

 U+2690 WHITE FLAG
 U+2691 BLACK FLAG
 U+26FF WHITE FLAG WITH HORIZONTAL MIDDLE BLACK STRIPE
 U+1F38C CROSSED FLAGS
 U+1F3C1 CHEQUERED FLAG

 We are missing the JOLLY ROGER.

I propose U+20F1 COMBINING ENCLOSING FLAG, and a named sequence
U+2620 U+20F1 = JOLLY ROGER.

Andrew

Re: Plese add a Chinese Hanzi

2012-05-30 Thread Andrew West

I have found examples of the use of this character (鱼丹) in print in
the following academic article available on line:

Composition and Status of Fishes of Nanla River in Xishuangbanna,
Yunnan, China
ZHENG Lan-ping, CHEN Xiao-yong*, YANG Jun-xing
Zoological Research 2009, Jun. 30(3): 334−340

http://www.bioline.org.br/pdf?zr09050

See page 6: Wang XZ ref.; and Appendix I #1 Danio chrysotaeniata.

I personally think that rather than add characters such as this
piecemeal, it would be more useful if someone or some organization
could research what newly devised, unencoded characters are in use in
biology, chemistry, etc., and make a proposal to encode them all,
either via the Chinese national body or directly to IRG.  Characters
used in modern scientific literature should be considered urgent use,
in my opinion, and encoded sooner rather than later.

Andrew


On 30 May 2012 14:10, shi zhao shiz...@gmail.com wrote:

 (鱼丹) pinyin is: dan1.

 (鱼丹)  is Chinese name of some fish.

 In chinese:
 Danioninae =  (鱼丹)亚科
 Gymnodanid = 裸 (鱼丹)属
 Gymnodanid strigatus = 条纹裸(鱼丹)
 Danio =  (鱼丹)属
 Danio aequipinnatus = 波条(鱼丹)
 Danio kakhienansis = 红蚌(鱼丹)
 Danio myersi = 麦氏(鱼丹)
 Danio interrupta = 半线(鱼丹)
 Danio apogon = 缺须(鱼丹)
 Danio chrysotaeniatus = 金线(鱼丹)

 References:

 [1] Xin-Luo Chu, A Preliminary Revision of Fishes of The Genus Danio From
 China, Zoological Research, 1981, 2(2), p 145-154
 [2] CHEN YI-FENG,  HE SHUN-PING, A NEW GENUS AND A NEW SPECIES OF CYPRINID
 FISHES FROM YUNNAN, CHINA (CYPRINIFORMES: CYPRINIDAE: DANIONINAE),  Acta
 Zootaxonomica Sinica, 1992, 17(2), 238-240
 [3] CHEN Min,  A study of the Freshwater Fishes in Guangxi Part
 I.Cyprinidae:Danioninae,Leuciscinae and Cultrinae, Journal of Liuzhou
 Vocational & Technical College, 2001, 1(2), 64-69

 PS: In China, scientists sometimes will be made new Hanzi for the some
 new concept/terminology, especially in the field of biology, chemistry, etc.

Re: Plese add a Chinese Hanzi

2012-05-30 Thread Andrew West

On 30 May 2012 15:30, Michael Everson ever...@evertype.com wrote:

 http://www.bioline.org.br/pdf?zr09050

 (鱼皮) is also found there, on page 3.

That one is already encoded as U+9C8F 鲏 so it is odd that they needed
to create their own custom glyph for it.

 And (鱼芒) on page 4.

But that is an unencoded simplified form of U+29DF6 𩷶

Which just goes to prove my point that someone needs to research these
characters methodically.

Andrew

Re: [unicode] Re: vertical writing mode of modern Yi?

2012-05-01 Thread Andrew West

On 1 May 2012 03:48, suzuki toshiya mpsuz...@hiroshima-u.ac.jp wrote:

 I wouldn't expect to see vertical modern standard Yi text in modern
 publications, other than perhaps newspapers.

 I got a scanned image of Liangshan Ribao (涼山日報), dated 2002/Mar/9,
 the vertical text is laid out without glyph rotation.

Thanks! Those are very good examples, especially the whole paragraph
of vertically written text next to the photograph, and really should
settle the argument once and for all.  Now all we have to do is
convince fantasai and the other folk working on UTR #50 that the
people typesetting the Liangshan Daily really do know how to typeset
Yi, and that they are not making a terrible faux pas because of
technical limitations or unfamiliarity with Yi typographic traditions
(of course, for the standardized Liangshan Yi syllabary encoded in
Unicode there is no typographic tradition prior to the standardization
of the script in 1980).

Andrew

Re: Canadian aboriginal syllabics in vertical writing mode

2012-05-01 Thread Andrew West

On 1 May 2012 12:27, Michael Everson ever...@evertype.com wrote:
On 1 May 2012, at 11:16, suzuki toshiya wrote:

In current draft of UTR#50, the properties for Canadian aboriginal syllabics
are defined as U; S; S;. But seeing the PDFs like

http://www.unicode.org/reports/tr50/tr50-3.VerticalOrientation.txt gives:

1400..167F ; S ; S ; S
1401..167F ; U ; S ; S

which seems to be a mistake.

http://www.gov.nu.ca/save10/English/Documents/Newsletters/Newsletter%203/Newsletter%203%20-%20Inuktitut.pdf
http://www.cley.gov.nu.ca/pdf/Documentary%20Art%20Project_Inuk.pdf
it is questionable if the default value U is preferred.

I don't know what U means, but that rotation is weird and confusing and not
legible. In a cross-word, vertical text goes from top to bottom with no
rotation.

In UTR#50 S means that the glyphs should be rotated 90 degrees
clockwise wrt the Unicode charts, and U means that the glyphs should
have the same orientation as in the code charts. The draft UTR#50
specifies that UCAS glyphs should be unrotated in those parts of
the world where characters are mostly upright (whatever that means)
but rotated clockwise when used for vertical lines in East Asia.

The big problem with UTR#50 as I see it is that it only deals with
glyph orientation, but with complex and joining scripts such as
Mongolian and Ogham it is runs of text that are rotated in different
orientations, not individual glyphs. In the two examples linked to
above the UCAS text appears to be rotated counterclockwise in vertical
layout, so that it reads bottom-to-top sideways, which is not a mode
of vertical layout that UTR#50 deals with.
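
The kind of mistake noted above in the draft data file (two lines
covering almost the same range with different values) can be caught
mechanically.  The following is a minimal sketch, assuming only the
line layout quoted in this message ("XXXX..YYYY ; v1 ; v2 ; v3");
parse_ranges and find_overlaps are hypothetical helper names, and the
meaning of the individual columns is not interpreted.

```python
def parse_ranges(lines):
    """Parse 'XXXX..YYYY ; a ; b ; c' lines into (start, end, values)."""
    ranges = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments, if any
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        start, _, end = fields[0].partition("..")
        # A single code point (no '..') is treated as a one-element range.
        ranges.append((int(start, 16), int(end or start, 16), tuple(fields[1:])))
    return ranges


def find_overlaps(ranges):
    """Return (earlier, later) pairs where a range starts inside the
    furthest-reaching range seen before it (ranges sorted by start)."""
    overlaps = []
    widest = None
    for cur in sorted(ranges, key=lambda r: (r[0], r[1])):
        if widest is not None and cur[0] <= widest[1]:
            overlaps.append((widest, cur))
        if widest is None or cur[1] > widest[1]:
            widest = cur
    return overlaps


draft = [
    "1400..167F ; S ; S ; S",
    "1401..167F ; U ; S ; S",
]
for earlier, later in find_overlaps(parse_ranges(draft)):
    print("overlap: %04X..%04X %s vs %04X..%04X %s"
          % (earlier[0], earlier[1], earlier[2], later[0], later[1], later[2]))
```

Each range is compared only with the furthest-reaching earlier range,
so every overlapping line is reported at least once, though not every
overlapping pair is listed.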

Andrew

Re: [unicode] Re: vertical writing mode of modern Yi?

2012-04-04 Thread Andrew West

On 2 April 2012 07:20, fantasai fantasai.li...@inkedblade.net wrote:

 The question in my mind is,
  a) does the Yi community consider the Chinese style of typesetting
 vertical captions and suchlike to be the only correct way, or

I don't think you can separate an Yi typographic tradition from the
Chinese typographic tradition. All members of the Liangshan Yi
community in China who are literate in the standardized Liangshan Yi
syllabary encoded in Unicode will be bilingual in Chinese, and will
have been exposed to Chinese books and typesetting traditions for
their whole life.

  b) is it a consequence of the Chinese typesetting software that such text
 is typeset this way, and the correct orientation would match Old Yi, or

I doubt that there are very many young or middle-aged members of the
Liangshan Yi community who can read Old Yi or have had any substantial
exposure to Old Yi texts.

I find the whole argument that Unicode Yi should match the orientation
of Old Yi bewildering -- if users would expect that the glyph
orientation of Unicode Yi text laid out vertically should match Old
Yi, then why wouldn't you also suppose that they would expect the
glyph orientation of Unicode Yi text laid out horizontally to match
Old Yi, i.e. possibly rotated 90 degrees?

I really think that Old Yi is a red herring, and should not be used as
an argument for how modern standardized Yi should be oriented when
written vertically.

  c) would the Yi community consider either option acceptable and a matter
 of stylistic preference, similar to Latin characters, whose native text
 orientation is not vertical and thus can be found typeset both sideways
 and upright

I personally think that sideways oriented vertical Unicode Yi looks
very odd, and would be very surprised if it were the preferred
orientation of the user community.  Has anyone ever seen any examples
of vertically laid out Unicode Yi text with rotated glyphs in books or
on signs?

Andrew

Re: [unicode] Re: vertical writing mode of modern Yi?

2012-04-04 Thread Andrew West

On 4 April 2012 00:53, fantasai fantasai.li...@inkedblade.net wrote:

 If the software is capable of both options, and the people managing
 the typesetting process are comfortably literate in Yi and familiar
 with its vertical habits in handwritten texts, then we can consider
 the results of their work to be correct.

 But if the software is only capable of typesetting characters upright
 and not sideways, this will obviously be the result regardless of
 typographic preference.

 Also if the people typesetting the book are familiar with Chinese
 but not with Yi, they might assume that these characters, which like
 ideographs also fit in fixed-size boxes, should behave the same as
 Chinese. This is a reasonable assumption. But it may not be correct.

Have you got any evidence to support any of your speculations?  Unless
there is hard evidence that modern standardized Liangshan Yi should be
written vertically with rotated glyphs then, in my opinion, the
default position of UTR #50 should be that Yi characters do not rotate
when laid out vertically.

Andrew

Re: [unicode] Re: vertical writing mode of modern Yi?

2012-03-30 Thread Andrew West

On 29 March 2012 02:28,  mpsuz...@hiroshima-u.ac.jp wrote:

 My observation is only in imported bookstores in Japan,
 and some photos taken by the foreign visitors. I expected
 more living vertical texts of Yi may exist in China, but
 it might be too optimistic...

I wouldn't expect to see vertical modern standard Yi text in modern
publications, other than perhaps newspapers.

 Please let me ask a stupid question for confirmation; the top/
 bottom for Old Yi glyph cannot be defined without a specification
 the time & place that the script was used? In previous post, I
 wrote as if Yunnan Old Yi glyphs were not rotated from their
 original shapes, and Sichuan Old Yi glyphs were rotated (as modern
 Liangshan Yi) - it would be hasty observation, and I should not
 assume a history as if Sichuan Old Yi glyphs were rotated at some
 stage. It's right understanding?

I'm not sure.  I have little experience with Old Yi texts, but my
understanding from secondary sources is that the orientation varied
considerably from place to place, and sometimes even from village to
village in the same area.  But I suspect your observation may be a
correct generalization -- I think that you know more than me on the
subject.

 Indeed, Old Yi is not coded yet, I know (rather, no official
 proposal is submitted to WG2, I think).

There is N3288, "Preliminary Proposal to encode Classical Yi
characters", submitted by China in February 2007:

http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3288.pdf (134MB)

This proposes 88,613 Old Yi characters, but it is just a long list of
every glyph and glyph variant (many variants with almost imperceptible
differences) that the group responsible for the proposal had managed
to find in all Old Yi texts from different times and different places
that they examined.  This glyph-based encoding proposal is not
actionable, and if Old Yi is to be encoded a new character-based
proposal needs to be produced (or perhaps a number of proposals for
separate Old Yi scripts based on different regional traditions would
be better).  However, I think there are too few people with a good
knowledge of both Yi scripts and Unicode for this to become a reality
any time soon.

 And, I think your pointing
 is very important - Modern (Liangshan syllabicalized) Yi and
 Old Yi should be regarded as different scripts and they could
 have different preferences about their text layouts (it's right
 understanding?)

Yes.

 I was assuming that modern Liangshan syllabicalized Yi and Old
 Yi materials may share same preferences about vertical writing
 mode, but it might be hasty assumption - I have to agree, the
 materials in my hands are too few to push my assumption.

I only have a couple of Old Yi books on my bookshelves, so I am no
expert on the subject, but I think that in a way UCS Yi and modern
Old Yi (Old Yi texts in modern publication) do share the same
vertical writing preferences -- that is to say, in both scripts glyphs
are written in the same orientation for both horizontal and vertical
layout (i.e. there is no glyph rotation when horizontal text is laid
out vertically).  This can be seen in the scans from the two modern
editions of Old Yi texts from Guizhou below (ISBN 7-5412-0787-X and
7-5412-0659-8 respectively) where the book title is written
horizontally on the front cover and vertically on the title page, with
no change in glyph orientation:

http://www.babelstone.co.uk/Yi/Images/Sujulimi2.jpg
http://www.babelstone.co.uk/Yi/Images/Sujulimi3.jpg

http://www.babelstone.co.uk/Yi/Images/YizuYuanliu2.jpg
http://www.babelstone.co.uk/Yi/Images/YizuYuanliu3.jpg

The key difference between these Old Yi texts and UCS Yi is that the
glyphs are rotated 90 degrees clockwise in UCS Yi compared with Old Yi
*in both horizontal and vertical layout* (cf. the 3rd character of
Sujulimi with the corresponding Liangshan Yi character ꇖ), but they
both follow the Chinese model for vertical layout.

 Checking the latest draft of UTR#50, it seems that the vertical
 writing mode for UCS Yi is different from CJK Ideographs;

3400..4DBF ; U ; U ; U
4DC0..4DFF ; U ; U ; U
4E00..9FFF ; U ; U ; U
A000..A48F ; S ; S ; S
A490..A4CF ; S ; S ; S

 I guess that somebody found a vertical writing mode of UCS Yi
 with rotated (umm, I should say as recovery-rotated?)
 glyphs, or, somebody guessed the vertical writing mode of UCS
 Yi by Old Yi materials. If the background is former, I want
 to see it.

From the discussion at
http://www.unicode.org/forum/viewtopic.php?f=35&t=222 it seems to
have been the latter case.

 In Sichuan dialects, I could not find the orientation of the glyphs
 in Sichuan volumes of ISBN 7-5367-2637-6

 https://www.codeblog.org/blog/mpsuzuki/images/20120327_1.gif

The characters on this page are written in the same orientation as UCS
Yi (cf. the characters ꐑꅝꐒ on the 3rd column from the left).

 More investigation is needed.

I think that a lot more investigation is required to understand how
best to

Re: vertical writing mode of modern Yi?

2012-03-27 Thread Andrew West

On 27 March 2012 06:18, suzuki toshiya mpsuz...@hiroshima-u.ac.jp wrote:

 Is there any typesetted material of modern Yi syllabic script in vertical
 writing mode?

Probably nothing more than titles on book spines and names of
government offices written on gate pillars.   However, I believe that
these examples are sufficient to establish the vertical writing mode
of the modern Yi script.  My observation is that the standardized
Liangshan Yi script that is encoded in Unicode is written vertically
with no rotation of glyphs, in the same way that Chinese characters
are written vertically.

 On the spines of the manually written books for old Yi, the situation
 is same; non-rotated glyphs are laid out vertically. Ah, vertical is
 the native writing mode, so I should say as on the front cover, non-
 rotated glyphs are laid out horizontally.

That seems reasonable, but as Old Yi was written in a variety of
orientations in different times and different places it is hard to
agree on what the correct vertical and horizontal layout of Old Yi
(or perhaps more correctly, the various Old Yi scripts) should be.
Moreover, as Old Yi has not been encoded, and therefore cannot be
represented in Unicode (other than using the PUA), the orientation
behaviour of Old Yi does not seem particularly relevant to Unicode in
general or to UTR#50 (http://www.unicode.org/reports/tr50/tr50-3.html)
in particular.

 In the volume for Sichuan dialect (p.751), you can find the glyphs
 looking like modern-Yi-after-rotation. You may wonder if the volume
 for Sichuan dialect includes only modern Yi, and it should not be
 recognized as Old Yi?. In the last page for Sichuan volume (p.889),
 you can find some glyphs that are not included in modern Yi.

Even if it shares many of the same glyphs, the standardized Liangshan
Yi syllabary encoded in Unicode is not the same script as the Old Yi
script used to represent the same language (Liangshan Yi) prior to
standardization of the script, and you cannot really represent any Old
Yi texts using Unicode Yi without normalizing the text in a way that
would be unacceptable for scholarly purposes.  Really, anything
written in an Old Yi script is irrelevant to discussions of the
behaviour of the standardized Liangshan Yi script, and just causes
unnecessary confusion and eventually leads to the definition of
incorrect vertical text layout properties for Unicode Yi.

Andrew

Re: Code2000 on SourceForge

2012-02-03 Thread Andrew West

On 3 February 2012 20:41, James Kass jamesk...@att.net wrote:

 Don't worry. Taking somebody to court for using of my fonts for any purpose 
 is something what I *strongly oppose*.

James,

It's great to see you back, just a pity that your English seems to
have deteriorated so much over the last couple of years.  Could you
explain what you meant exactly by "As New Year gift I included both
037F and 03F3 in this font to appease all classicist needs"
(http://sourceforge.net/projects/phpwebsite/forums/forum/49348/topic/4573554)
when you added a capital yot glyph to the reserved code point U+037F
in the latest version of the Code2000 font (and made no other changes
to the font other than the name table)?

Best regards,

Andrew West

Re: Armenian Eternity Sign (proposal)

2012-01-19 Thread Andrew West

On 19 January 2012 15:26, Doug Ewell d...@ewellic.org wrote:

 I always assumed that ARMENIAN in the name of these characters referred
 to the script with which they are usually used, not the country of
 Armenia. Names of countries aren't normally used in character names;
 that's why we have FARSI SYMBOL.

I am sure that in this case Armenian does refer to the script.  If
they are widely used in other script contexts then I would prefer to
see them named 058D RIGHT-FACING ETERNITY SIGN and 058E LEFT-FACING
ETERNITY SIGN, and give them a Unicode script property of common.  I
have no problem with encoding them in the Armenian block, even as
symbols common to multiple scripts, as this is analogous to the
encoding of the four svasti signs in the Tibetan block even though
they are used in multiple script contexts.

NB As they are still at the PDAM ballot stage there is still time to
make changes to their names and/or code positions ... which is unusual
as most problems are only brought to the attention of this list when
it is too late to make any such changes.

Andrew

Code2000 on SourceForge (was Re: [indic] Re: Lack of Complex script rendering support on Android)

2011-11-07 Thread Andrew West

On 7 November 2011 08:34,  a...@peoplestring.com wrote:

 Code2000 supports most BMP code points of Unicode 5.2. It is open sourced
 from September:

 http://code2000.sourceforge.net/

I have doubts as to whether this project was actually created by James
Kass.  The project comprises the last public version of code2000.ttf
and a 210MB code2000.asm file which turns out to be a dump of the
ttf file in human-readable form, both of which could easily have been
put onto SourceForge in contravention of copyright and license by
someone pretending to be James who wants the font to be open source
now that the official Code2000 site has disappeared.  James once told
me that Code2000 was maintained as a 66 megabyte dBASE III database
file, which is not what is on SourceForge.

Andrew

Re: Unihan data for U+2B5B8 error

2011-10-20 Thread Andrew West

On 19 October 2011 18:41, John H. Jenkins jenk...@apple.com wrote:

 U+613F kDefinition (variant/simplification of U+9858 願) desire, want, wish; 
 (archaic) prudent, cautious
 U+613F kSemanticVariant U+9858<kFenn:T
 U+613F kSpecializedSemanticVariant U+9858<kHanYu:T
 U+613F kTraditionalVariant U+613F U+9858
 U+613F kSimplifiedVariant U+613F
 U+9858 kSimplifiedVariant U+613F U+2B5B8
 U+9858 kSemanticVariant U+9613F<kFenn:T

 Andrew, does that look like it covers everything correctly?

Looks OK to me (except for the typo on the last line), although I
wonder about the necessity for:

U+613F kSimplifiedVariant U+613F

Where a character can either traditionalify (what is the opposite of
simplify?) to another character or stay the same then it is useful to
have (e.g.):

U+613F kTraditionalVariant U+613F U+9858

But where a character does not change on simplification, is it not
redundant to give it a kSimplifiedVariant mapping to itself?  I note
that the following characters have kSimplifiedVariant mappings to
themselves, all of which can either stay the same or change when
converted to traditional:

U+4F59 余
U+53F0 台
U+540E 后
U+5FD7 志
U+6781 极
U+8721 蜡

But there are other characters that fit this paradigm that do not have
kSimplifiedVariant mappings to themselves, such as:

U+5E72 干

But maybe that is a reflection of this line:

U+5E72  kTraditionalVariant U+4E7E U+5E79

which I think should be:

U+5E72  kTraditionalVariant U+4E7E U+5E72 U+5E79
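
Checks of this kind could be run over the variant data rather than by
eye.  The sketch below is only an illustration: it assumes input lines
in the "U+XXXX kField U+YYYY ..." layout quoted in this thread, the
helper names (parse_variants, self_mappings, missing_reverse) are
hypothetical, and it is not how the Unihan database is actually
maintained.

```python
from collections import defaultdict


def parse_variants(lines):
    """Build {field: {code point: [targets]}} from 'U+X kField U+Y ...' lines."""
    table = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        cp, field, values = parts[0], parts[1], parts[2:]
        # Drop any '<source' suffix such as '<kFenn:T' from each target.
        table[field][cp] = [v.split("<", 1)[0] for v in values]
    return table


def self_mappings(table, field):
    """Code points whose value list for this field includes themselves."""
    return sorted(cp for cp, targets in table[field].items() if cp in targets)


def missing_reverse(table, forward="kSimplifiedVariant",
                    reverse="kTraditionalVariant"):
    """(source, target) pairs where target has no reverse mapping to source."""
    missing = []
    for cp, targets in table[forward].items():
        for target in targets:
            if target != cp and cp not in table[reverse].get(target, []):
                missing.append((cp, target))
    return missing


sample = [
    "U+613F kTraditionalVariant U+613F U+9858",
    "U+613F kSimplifiedVariant U+613F",
    "U+9858 kSimplifiedVariant U+613F U+2B5B8",
]
table = parse_variants(sample)
print(self_mappings(table, "kSimplifiedVariant"))
print(missing_reverse(table))
```

On this three-line sample the second check flags U+9858 -> U+2B5B8,
but only because the sample leaves out the kTraditionalVariant line
for U+2B5B8; run over the full data, it would list genuine asymmetries.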

Andrew

Re: Unihan data for U+2B5B8 error

2011-10-19 Thread Andrew West

On 19 October 2011 02:38, shi zhao shiz...@gmail.com wrote:
 Unihan data for U+2B5B8 error?
 see http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=2b5b8&useutf8=false

Anything in particular we are meant to be looking at?

Andrew

Re: Unihan data for U+2B5B8 error

2011-10-19 Thread Andrew West

On 19 October 2011 10:43, shi zhao shiz...@gmail.com wrote:
 The page said kTraditionalVariant of U+2B5B8 is U+9858 願.

which is correct.

 ) said U+2B5B8 𫖸 is kSimplifiedVariant of U+9858 願, U+613F 愿 is
 kSemanticVariant, but 愿 is simplified of 願, not U+2B5B8 𫖸.

which I agree is not correct.  It's not always clear how asymmetrical
cases like this should be handled.  For U+9918 餘, which is analogous,
with a common simplified form U+4F59 余 and an alternate simplified
form U+9980 馀, the Unihan database lists them both as simplified
variants of U+9918:

U+9918  kSimplifiedVariant  U+4F59 U+9980

On this precedent, I would expect:

U+9858  kSimplifiedVariant  U+613F U+2B5B8

I suggest you report this issue on the Unicode Error Reporting form:

http://www.unicode.org/reporting.html

Andrew

Re: Letter from the Hungarian Standards Institution

2011-09-14 Thread Andrew West

On 14 September 2011 09:41, Karl Pentzlin karl-pentz...@acssoft.de wrote:
 Dear Gábor,
 thank you for your letter regarding ISO/IEC 10646/PDAM 1.2 - Amendment 1.

 You asked me to arrange for a German negative vote on that document,

I received the same letter, via BSI, this morning, and although I
haven't been able to find anything prohibiting canvassing in the
JTC1 directives, it seems to me that to write such a letter to all
NBs on the committee, using language such as "I officially request you to
vote No", is highly irregular.

Andrew
