from:"Janusz S. Bień via Unicode"

Re: Variation Sequences (and L2-11/059)

2019-03-13 Thread Janusz S. Bień via Unicode

On Wed, Mar 13 2019 at  9:48 -07, Ken Whistler wrote:
> On 3/13/2019 2:42 AM, Janusz S. Bień via Unicode wrote:
>> Hi!
>>
>> On Mon, Jul 16 2018 at  7:07 +02, Janusz S. Bień via Unicode wrote:
>>> FAQ (http://unicode.org/faq/vs.html) states:
>>>
>>>  For historic scripts, the variation sequence provides a useful tool,
>>>  because it can show mistaken or nonce glyphs and relate them to the
>>>  base character. It can also be used to reflect the views of
>>>  scholars, who may see the relation between the glyphs and base
>>>  characters differently. Also, new variation sequences can be added
>>>  for new variant appearances (and their relation to the base
>>>  characters) as more evidence is discovered.
>> I'm proof-reading a paper where I quote the above fragment and to my
>> surprise I noticed it's no longer present in the FAQ.
>
> That text is, in fact, still present on the FAQ page in question:
>
> https://www.unicode.org/faq/vs.html#18

I apologize for jumping to the wrong conclusion; I should have checked it more
carefully.

>
>>
>> So my questions are:
>>
>> 1. Does the change mean the change of the official policy of the
>> Consortium?
>
> Your premise here, however, is mistaken. The FAQ pages do *not*, and
> never have represented official policy of the Unicode Consortium.

That I expected but asked just to be on the safe side.

> The
> individual FAQ entries are contributed by many people -- some
> attributed, and some not. They are updated or added to periodically by
> various editors, in response to feedback, or as old entries grow
> out-dated, or new issues arise. Those updates are editorial, and do
> not reflect any official decision process by Unicode technical
> committees or officers. The FAQ main page itself points out that "The
> FAQs are contributed by many people," and invites the public to submit
> possible new entries for editing and addition to the list of FAQs.

BTW, what about the copyright of FAQ entries? Do I guess correctly that it
belongs to the Consortium? To be specific, what about using an entry in
full, in English or in translation, as or in a Wikipedia entry?

>
> For official technical content, refer to the published technical
> specifications themselves, which are carefully controlled, versioned,
> and archived.
>
> For official policies of the Unicode Consortium, refer to the Unicode
> Consortium policies page, which is also carefully controlled:
>
> https://www.unicode.org/policies/policies.html

Thanks for the reminder.


>> 2. Are the archival versions of the FAQ available somewhere?
>
> https://web.archive.org/web/*/https://www.unicode.org/faq/

Great!
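
For checking how a FAQ page read on a particular date, a snapshot
address can be built by hand. The sketch below is only an illustration:
it assumes the usual Wayback Machine address pattern (a timestamp such
as YYYYMMDD, optionally extended up to YYYYMMDDhhmmss, placed before the
page address), and the function name wayback_url is made up here.

```python
def wayback_url(page_url, timestamp):
    """Build a Wayback Machine address for page_url near the given timestamp.

    Assumption: the archive serves the capture closest to the timestamp,
    so the date does not have to match a capture exactly.
    """
    if not (timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 4 to 14 digits, e.g. YYYYMMDD")
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


# Date of the original question about the FAQ on variation sequences.
print(wayback_url("https://www.unicode.org/faq/vs.html", "20180716"))
```

Comparing the capture nearest to July 2018 with the current page would
show whether the quoted paragraph was ever removed.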

Best regards

Janusz

-- 
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Variation Sequences (and L2-11/059)

2019-03-13 Thread Janusz S. Bień via Unicode

Hi!

On Mon, Jul 16 2018 at  7:07 +02, Janusz S. Bień via Unicode wrote:
> FAQ (http://unicode.org/faq/vs.html) states:
>
> For historic scripts, the variation sequence provides a useful tool,
> because it can show mistaken or nonce glyphs and relate them to the
> base character. It can also be used to reflect the views of
> scholars, who may see the relation between the glyphs and base
> characters differently. Also, new variation sequences can be added
> for new variant appearances (and their relation to the base
> characters) as more evidence is discovered.

I'm proof-reading a paper where I quote the above fragment and to my
surprise I noticed it's no longer present in the FAQ.

So my questions are:

1. Does the change mean the change of the official policy of the
Consortium?

2. Are the archival versions of the FAQ available somewhere?

3. Are the changes to the FAQ documented somehow (a version control
system?)?

Best regards

Janusz


Update to the second question summary (was: A sign/abbreviation for "magister")

2018-12-02 Thread Janusz S. Bień via Unicode

On Sun, Dec 02 2018 at 10:33 +0100, Hans Åberg via Unicode wrote:
>> On 30 Oct 2018, at 22:50, Ken Whistler via Unicode wrote:
>> 
>> On 10/30/2018 2:32 PM, James Kass via Unicode wrote:
>>> but we can't seem to agree on how to encode its abbreviation. 
>> 
>> For what it's worth, "mgr" seems to be the usual abbreviation in Polish for 
>> it.
>
> It was common in the 1800s to singly and doubly underline superscript
> abbreviations in handwriting according to [1-2], and [2] also mentions
> the abbreviation discussed in this thread.

Thank you very much for this reference to the very abbreviation! I
looked up Wikipedia but I haven't read it carefully enough :-(

>
> 1. https://en.wikipedia.org/wiki/Ordinal_indicator
> 2. https://en.wikipedia.org/wiki/Ordinal_indicator#cite_note-1

Best regards

Janusz


A sign/abbreviation for "magister" - third question summary

2018-11-06 Thread Janusz S. Bień via Unicode

On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bień via Unicode wrote:
> Hi!
>
> On the over 100 years old postcard
>
> https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6
>
> you can see 2 occurrences of a symbol which is explicitly explained (in
> Polish) as meaning "Magister".
>

[...]

> The third and the last question is: how to encode this symbol in
> Unicode?

A constructive answer to my question was provided quickly by James Kass:

On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote:
> Mr͇ / M=ͬ

I answered:

On Sun, Oct 28 2018 at 18:28 +0100, Janusz S. Bień via Unicode wrote:

[...]

> For me only the latter seems acceptable. Using COMBINING LATIN SMALL
> LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as
> the base character. However in the lack of a better solution I can live
> with it :-)
>
> An alternative would be to use SMALL EQUALS SIGN, but looks like fonts
> supporting it are rather rare. 

and Philippe Verdy commented:

On Sun, Oct 28 2018 at 18:54 +0100, Philippe Verdy via Unicode wrote:

[...]

>
> There's a third alternative, that uses the superscript letter r,
> followed by the combining double underline, instead of the normal
> letter r followed by the same combining double underline.  

Some comments were made also by Michael Everson:

On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote:

[...]

> I would encode this as Mʳ if you wanted to make sure your data
> contained the abbreviation mark. It would not make sense to encode it
> as M=ͬ or anything else like that, because the “r” is not modifying a
> dot or a squiggle or an equals sign.  The dot or squiggle or equals
> sign has no meaning at all. And I would not encode it as Mr͇, firstly
> because it would never render properly and you might as well encode it
> as Mr. or M:r, and second because in the IPA at least that character
> indicates an alveolar realization in disordered speech. (Of course it
> could be used for anything.)

FYI, I decided to use the encoding proposed by Philippe Verdy (if I
understand him correctly):

Mʳ̳

i.e.

'LATIN CAPITAL LETTER M' (U+004D)
'MODIFIER LETTER SMALL R' (U+02B3)
'COMBINING DOUBLE LOW LINE' (U+0333)

for purely pragmatic reasons: it is rendered quite well in my
Emacs. According to the "fc-search-codepoint" script, the sequence is
supported on my computer by almost 150 fonts, so I hope to find in due
time a way to render it correctly also in XeTeX. I'm also going to add
it to my private named sequences list
(https://bitbucket.org/jsbien/unicode4polish).
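
The sequence can also be inspected programmatically. The following is a
minimal sketch in Python using only the standard unicodedata module; the
names MAGISTER and describe are made up for illustration and are not
taken from the named sequences list mentioned above.

```python
import unicodedata

# Hypothetical constant: M + MODIFIER LETTER SMALL R + COMBINING DOUBLE LOW LINE.
MAGISTER = "\u004D\u02B3\u0333"


def describe(text):
    """Return (code point, character name, combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]


for code_point, name, combining_class in describe(MAGISTER):
    print(code_point, name, combining_class)
```

A non-zero combining class on the last character confirms that the
double low line attaches to the preceding modifier letter rather than
standing on its own.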

The same post contained a statement which I don't accept:

On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote:

[...]

> The squiggle in your sample, Janusz, does not indicate anything; it is
> only a decoration, and the abbreviation is the same without it.

One of the reasons I disagree was described by me in the separate thread
"use vs mention":

https://unicode.org/mail-arch/unicode-ml/y2018-m10/0133.html

There were also some other statements which I find unacceptable:

On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote:

[...]

> The abbreviation in the postcard, rendered in plain text, is "Mr".

He was supported by Julian Bradfield in his mail on Wed, Oct 31 2018 at
9:38 GMT (and earlier in a private mail).

I understand that both of them by "plain text" mean Unicode.

On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote:

>  You could use the various hacks you've discussed, with modifier
> letters; but that is not "encoding", that is "abusing Unicode to do
> markup". At least, that's the view I take!

and was supported by Asmus Freytag on Wed, Oct 31 2018 at  3:12
-0700.

The latter elaborated his view later and I answered:

On Fri, Nov 02 2018 at 17:20 +0100, Janusz S. Bień via Unicode wrote:
> On Fri, Nov 02 2018 at  5:09 -0700, Asmus Freytag via Unicode wrote:

[...]

>> All else is just applying visual hacks
>
> I don't mind hacks if they are useful and serve the intended purpose,
> even if they are visual :-)

[...]

>> at the possible cost of obscuring the contents.
>
> It's for the users of the transcription to decide what is obscuring the
> text and what, to the contrary, makes the transcription more readable
> and useful.

Please note that it's me who makes the transcription, it's me who has a
vision of the future use and users, and in consequence it's me who makes
the decision which aspects of text to encode. Accusing me of "abusing
Unicode" will not stop me from doing it my way.

I hope that at least James Kass understands my attitude:

On Mon, Oct 29 2018 at  7:57 GMT, James Kass via Unicode wrote:

[...]

> If I were entering plain text data from an old post card, I'd try to
> keep the data as close to the source as possible.

[...]

A sign/abbreviation for "magister" - second question summary

2018-11-06 Thread Janusz S. Bień via Unicode



On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bień via Unicode wrote:
> Hi!
>
> On the over 100 years old postcard
>
> https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6
>
> you can see 2 occurrences of a symbol which is explicitly explained (in
> Polish) as meaning "Magister".

[...]

> The second question is: are you familiar with such or a similar symbol?
> Have you ever seen it in print?

Later I provided some additional information:

On Sat, Oct 27 2018 at 16:09 +0200, Janusz S. Bień via Unicode wrote:
>
> The postcard was written at the front during the First World War by an
> Austro-Hungarian soldier. He explains the meaning of the abbreviation
> to his wife, so it looks like the abbreviation was in use but not very
> popular.

On Sat, Oct 27 2018 at 20:25 +0200, Janusz S. Bień via Unicode wrote:

[...]

> In the meantime I looked up some other postcards written by the same
> person and found several other abbreviations, including № 'NUMERO SIGN'
> (U+2116), written in the same way, i.e. with a double instead of a single
> line.

The similarity to № 'NUMERO SIGN' was mentioned quite often in the
thread; there seems to be no need to quote all these mentions here.

A more general observation was formulated by Richard Wordingham:

On Sun, Oct 28 2018 at  8:13 GMT, Richard Wordingham via Unicode wrote:

[...]

> The notation is a quite widespread format for abbreviations.  The
> first letter is normal sized, and the subsequent letter is written in
> some variety of superscript with a squiggle underneath so that it
> doesn't get overlooked.  

Various examples of such abbreviations were also mentioned several times
in the thread, but again there seems to be no need to quote all these
mentions here.

Nobody, however, reported any other occurrence of the symbol in question.

Best regards

Janusz



A sign/abbreviation for "magister" - first question summary

2018-11-06 Thread Janusz S. Bień via Unicode

On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bień via Unicode wrote:
> Hi!
>
> On the over 100 years old postcard
>
> https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6
>
> you can see 2 occurrences of a symbol which is explicitly explained (in
> Polish) as meaning "Magister".
>
> First question is: how do you interpret the symbol? For me it is
> definitely the capital M followed by the superscript "r" (written in an
> old style no longer used in Poland), but there is something below the
> superscript. It looks like a small "z", but such an interpretation
> doesn't make sense to me.

I got two complementary answers almost immediately:

On Sat, Oct 27 2018 at 9:11 -0400, Robert Wheelock wrote:

> It is constructed much like the symbol for numero, only with a capital
> M accompanied by a superscript small r having an underbar (or
> double underbar).


On Sat, Oct 27 2018 at  6:58 -0700, Asmus Freytag via Unicode wrote:

[...]

> My suspicion would be that the small "z" is rather a "=" that
> acquired a connecting stroke as part of quick handwriting.  A./

and on the same day this interpretation was supported by Philippe Verdy:

On Sat, Oct 27 2018 at 20:35 +0200, Philippe Verdy via Unicode wrote:

[...]

> I have the same kind of reading: the zigzagging stroke is a
> handwritten emphasis of the superscript r above it (explicitly noting
> that it terminates the abbreviation), just like the small underline that
> happens sometimes below the superscript o in the abbreviation of
> "numero" (as well sometimes there was not just one but two small
> underlines, including in some prints).
>
> This sample is a perfect example of fast cursive handwriting (due to
> high variability of all other letter shapes, sizes and joinings, where
> even the capital M is written as two unconnected strokes), and it's
> not abnormal to see in such condition this cursive joining between the
> two underlining strokes so that it looks like a single zigzag.

Later it was summarized by James Kass:

On Fri, Nov 02 2018 at  2:59 GMT, James Kass via Unicode wrote:
> Alphabetic script users write things the way they are spelled and
> spell things the way they are written.  The abbreviation in question
> as written consists of three recognizable symbols.  An "M", a
> superscript "r", and an equal sign (= two lines).  It can be printed,
> handwritten, or in fraktur; it will still consist of those same three
> recognizable symbols.
>
> We're supposed to be preserving the past, not editing it or revising
> it.

It was commented by Julian Bradfield:

On Fri, Nov 02 2018 at  8:54 GMT, Julian Bradfield via Unicode wrote:

[...]

> That's not true. The squiggle under the r is a squiggle - it is a
> matter of interpretation (on which there was some discussion a hundred
> messages up-thread or so :) whether it was intended to be = .
> Just as it is a matter of interpretation whether the superscript and
> squiggle were deeply meaningful to the writer, or whether they were
> just a stylistic flourish for Mr.

The abbreviation in question definitely consists of three symbols: an
"M", a superscript "r" and the third one, which I think was best
described by Robert Wheelock as double (under)bar, with the connecting
stroke mentioned first by Asmus Freytag.

This third element was referred to, also by myself, as a squiggle, but
after looking up the definition of the word in a dictionary

  a short line that has been written or drawn and that curves and
  twists in a way that is not regular

I think this is a misnomer. Unfortunately I have no better proposal.

Best regards

Janusz



Re: A sign/abbreviation for "magister"

2018-11-02 Thread Janusz S. Bień via Unicode

On Fri, Nov 02 2018 at  5:09 -0700, Asmus Freytag via Unicode wrote:

[...]

> To transcribe the postcard would mean selecting the characters
> appropriate for the printed equivalent of the text.

You seem to make implicit assumptions which are not necessarily
true. For me, transcribing the postcard would mean meeting the needs
of the intended users of the transcription.

> If the printed form had a standard way of superscripting letters with
> a decoration below when used for abbreviations, then, and only then
> would we start discussing whether this decoration needs to be encoded,
> or whether it is something a font can supply as part of rendering the
> (sequence of) superscripted letters. (Perhaps with the aid of markup
> identifying the sequence as abbreviation).

As I wrote already some time ago on the list, the alternative "encoding
or using a specialized font" is a false dichotomy. These days texts are
encoded for processing (in particular searching); rendering is just a
kind of side effect.
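
To illustrate the point about searching, here is a minimal sketch (not
a description of any particular search tool) of how a transcription
that keeps the superscript and the double underline can still be found
by a plain "Mr" query. It assumes compatibility decomposition (NFKD)
followed by removal of combining marks; the function name search_key is
made up for the example.

```python
import unicodedata


def search_key(text):
    """Fold text for searching.

    NFKD turns MODIFIER LETTER SMALL R into a plain "r"; combining marks
    such as COMBINING DOUBLE LOW LINE are then dropped. Note that this
    also drops diacritics that decompose (e.g. the acute of "ń"), while
    letters without a decomposition, such as "ł", stay unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base_only.casefold()


transcription = "W.Pan M\u02B3\u0333 Micha\u0142 Ga\u0142kiewicz"
print(search_key("Mr") in search_key(transcription))
```

The encoded text keeps the distinction, and the folding is applied only
at query time, so nothing is lost for users who do care about the exact
form.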

On the other hand, whom do you mean by "we" and what do you mean by
"encoding"? If I guess correctly what you mean by these words, then
you are discussing an issue which was never raised by anybody (if I'm
wrong, please quote the relevant post). Again, it is not clear to me whom
you want to convince or inform.

> All else is just applying visual hacks

I don't mind hacks if they are useful and serve the intended purpose,
even if they are visual :-)

> to simulate a specific appearance,

As I said above, the appearance is not necessarily of primary
importance.

> at the possible cost of obscuring the contents.

It's for the users of the transcription to decide what is obscuring the
text and what, to the contrary, makes the transcription more readable
and useful.

Best regards

Janusz


mail attribution (was: A sign/abbreviation for "magister")

2018-11-02 Thread Janusz S. Bień via Unicode

On Thu, Nov 01 2018 at  6:43 -0700, Asmus Freytag via Unicode wrote:
> On 11/1/2018 12:52 AM, Richard Wordingham via Unicode wrote:
>
>  On Wed, 31 Oct 2018 11:35:19 -0700
> Asmus Freytag via Unicode  wrote:

[...]

> Unfortunately, your emails are extremely hard to read in plain text.
> It is even difficult to tell who wrote what.

My previous mail is unfortunately an example.

>
> Not sure why that is. After they make the round trip, they look fine
> to me.

When displaying your HTML mail, Emacs Gnus doesn't show the attributions
correctly. If I forget to edit them by hand when replying, we get
confusion like that in my previous mail.

I guess I should submit this as a bug or feature request to Emacs
developers. Perhaps Richard Wordingham should do the same for the mail
agent he uses.

Best regards

Janusz


Re: A sign/abbreviation for "magister"

2018-11-01 Thread Janusz S. Bień via Unicode

On Thu, Nov 01 2018 at 13:34 -0700, Asmus Freytag via Unicode wrote:
> On 11/1/2018 10:23 AM, Janusz S. Bień via Unicode wrote:

[...]

> Looks like you completely missed my point. Nobody ever claimed that
> reproducing all variations in manuscripts is in scope of Unicode, so
> whom do you want to convince that it is not?
>
> Looks like you are missing my point about there being a continuum with
> not clear lines that can be perfectly drawn a-priori.

Why do you think so? There is nothing in my posts which can be used to
support your claim. Perhaps you confused me with some other poster?

Best regards

Janusz


Re: A sign/abbreviation for "magister"

2018-11-01 Thread Janusz S. Bień via Unicode

On Thu, Nov 01 2018 at  8:43 -0700, Asmus Freytag via Unicode wrote:
> On 11/1/2018 12:33 AM, Janusz S. Bień via Unicode wrote:
>
>  On Wed, Oct 31 2018 at 12:14 -0700, Ken Whistler via Unicode wrote:
>
>  On 10/31/2018 11:27 AM, Asmus Freytag via Unicode wrote:
>
>  
>  but we don't have an agreement that reproducing all variations in
>  manuscripts is in scope.
>
>
> In fact, I would say that in the UTC, at least, we have an agreement
> that that clearly is out of scope!
>
> Trying to represent all aspects of text in manuscripts, including
> handwriting conventions, as plain text is hopeless.  There is no
> principled line to draw there before you get into arbitrary
> calligraphic conventions.
>
>
> Your statements are perfect examples of "attacking a straw man":
>
>
> Perhaps you are joking?
>
> Not sure which of us you were suggesting as the jokester here.
>
> I don't think it's a joke to recognize that there is a continuum here
> and that there is no line that can be drawn which is based on
> straightforward principles. This is a pattern that keeps surfacing the
> deeper you look at character coding questions.

Looks like you completely missed my point. Nobody ever claimed that
reproducing all variations in manuscripts is in scope of Unicode, so
whom do you want to convince that it is not?

Best regards

Janusz


Re: A sign/abbreviation for "magister"

2018-11-01 Thread Janusz S. Bień via Unicode

On Wed, Oct 31 2018 at 12:14 -0700, Ken Whistler via Unicode wrote:
> On 10/31/2018 11:27 AM, Asmus Freytag via Unicode wrote:
>>
>>  but we don't have an agreement that reproducing all variations in
>>  manuscripts is in scope.
>
> In fact, I would say that in the UTC, at least, we have an agreement
> that that clearly is out of scope!
>
> Trying to represent all aspects of text in manuscripts, including
> handwriting conventions, as plain text is hopeless.  There is no
> principled line to draw there before you get into arbitrary
> calligraphic conventions.

Your statements are perfect examples of "attacking a straw man":

 Straw Man (Fallacy Of Extension): attacking an exaggerated or
 caricatured version of your opponent's position.

 http://www.don-lindsay-archive.org/skeptic/arguments.html

 https://en.wikipedia.org/wiki/Straw_man

 https://en.wikipedia.org/wiki/The_Art_of_Being_Right

Perhaps you are joking?

Best regards

Janusz


use vs mention (was: second attempt)

2018-10-31 Thread Janusz S. Bień via Unicode

On Wed, Oct 31 2018 at  9:38 GMT, Julian Bradfield via Unicode wrote:
> On 2018-10-31, Janusz S. Bień via Unicode wrote:

[...]

>> The relevant fragment of the postcard in a loose translation is
>>
>> Use the following address:   ...
>>  is the abbreviation of magister.
>>
>> I don't think your rendering
>>
>>Mr is the abbreviation of magister.
>>
>> has the same meaning.
>
> I do

The author of the postcard definitely *referred* to the abbreviation in
the form *used* in the postcard.

We don't know whether the abbreviation "Mr", spelled exactly this way,
already existed at that time and in that geographical area.

You still don't see the difference in the meaning?

Best regards

Janusz


Re: second attempt

2018-10-31 Thread Janusz S. Bień via Unicode

On Wed, Oct 31 2018 at  9:38 GMT, Julian Bradfield via Unicode wrote:
> On 2018-10-31, Janusz S. Bień via Unicode wrote:
>> On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote:
>
> [ as did I in private mail ]
>
>>> The abbreviation in the postcard, rendered in
>>> plain text, is "Mr".
>>
>> The relevant fragment of the postcard in a loose translation is
>>
>> Use the following address:   ...
>>  is the abbreviation of magister.
>>
>> I don't think your rendering
>>
>>Mr is the abbreviation of magister.
>>
>> has the same meaning.
>
> I do, for the reasons stated by many.

How many?

I'm aware only of you and Doug Ewell.

>
> If the topic were a study of the ways in which people indicate
> abbreviations by typographic or manuscript styling, then it would be
> important to know the exact form of the marks; but that is not plain
> text.

Let me remind you what plain text is according to the Unicode glossary:

Computer-encoded text that consists only of a sequence of code
points from a given standard, with no other formatting or structural
information.

If you try to use this definition to decide what is and what is not a
character, you get a vicious circle.

As mentioned already by others, there is no other generally accepted
definition of plain text.

Best regards

Janusz


second attempt (was: A sign/abbreviation for "magister")

2018-10-30 Thread Janusz S. Bień via Unicode

My previous attempt to send this mail was rejected by the list as
spam. If this one does not appear on the list, would you be so kind as to
forward it to the list and the listmaster?

On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote:

[...]

> The abbreviation in the postcard, rendered in
> plain text, is "Mr".

The relevant fragment of the postcard in a loose translation is

Use the following address:   ...
 is the abbreviation of magister.

I don't think your rendering

   Mr is the abbreviation of magister.

has the same meaning.

Please note that I didn't ask *whether* to encode the abbreviation. I
asked *how* to do it.

If you think it is impossible to encode it in Unicode (without using
PUA), just say this explicitly.

BTW, I find it strange that nobody refers to an old thread

https://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0117.html

Best regards

Janusz


Re: A sign/abbreviation for "magister"

2018-10-29 Thread Janusz S. Bień via Unicode

On Mon, Oct 29 2018 at  7:57 GMT, James Kass wrote:
> Janusz S. Bień asked,
>
>> Do you claim that in the ground-truth for HWR the
>> squiggle and raising don't matter?
>
> Not me!

I know, sorry if my previous mail was confusing.

> "McCoy", "M=ͨCoy", and "M-ͨCoy" are three different ways of
> writing the same surname.  If I were entering plain text data from an
> old post card, I'd try to keep the data as close to the source as
> possible.  Because that would be my purpose.  Others might have
> different purposes.  As you state, it depends on the intention. But,
> if there were an existing plain text convention I'd be inclined to use
> it.  Conventions allow for the possibility of interchange, direct
> encoding would ensure it.

Best regards

Janusz


Re: A sign/abbreviation for "magister"

2018-10-28 Thread Janusz S. Bień via Unicode

On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson wrote:
> This is no different from the Irish name McCoy which can be written MᶜCoy
> where the raising of the c is actually just decorative, though perhaps
> it was once an abbreviation for Mac. In some styles you can see a line
> or a dot under the raised c. This is purely decorative.
>
> I would encode this as Mʳ if you wanted to make sure your data
> contained the abbreviation mark. 
[...]

> The squiggle in your sample, Janusz, does not indicate anything; it is
> only a decoration, and the abbreviation is the same without it.

I have received an even more radical suggestion off the list:

>>>  The third and the last question is: how to encode this symbol in
>>>  Unicode?
>
> Why would you need to? Its plain text content is adequately
> represented by "Mr"

On Sun, Oct 28 2018 at 23:57 GMT, James Kass wrote:
> The umlauts in the band name "Mötley Crüe" are decorative, yet the
> difference between "Mötley Crüe" and "Motley Crue" is one of
> spelling.  Although the tilde in the place name "Rancho Peñasquitos"
> is *not* decorative, "Rancho Peñasquitos" vs. "Rancho Penasquitos" is
> still a spelling difference.

[...]

> If "Mccoy" vs. "McCoy" vs. "MCCOY" vs. "MC COY" represent spelling
> differences, then so do "McCoy" vs "MᶜCoy".  It's a matter of opinion,
> and opinions often differ.

Well said, but I make the claim stronger; it depends on the purpose of
the encoding and intended applications.

Handwriting recognition (HWR) is no longer just an abstract possibility;
it's a facility available to everybody, e.g. in Transkribus
(https://transkribus.eu/), which I actually use for transcribing the
texts of interest. Do you claim that in the ground-truth for HWR the
squiggle and raising don't matter?

Best regards

Janusz


Re: A sign/abbreviation for "magister"

2018-10-28 Thread Janusz S. Bień via Unicode

On Sun, Oct 28 2018 at 15:19 +0100, Philippe Verdy via Unicode wrote:
> Given that the "squiggle" below letters is actually given distinctive
> semantics, I think it should be encoded as a combining character (to be
> written not after a "superscript" but after any normal base letter,
> possibly with other combining characters, or CGJ if needed because of
> the compatibility equivalence).  That "squiggle" (which may look like
> an underscore) would have the effect of implicitly making the base
> letter superscript (smaller and elevated). It would probably have a
> "combining below" class.

Seems to me an elegant solution.

[...]

On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote:
> Mr͇ / M=ͬ

For me only the latter seems acceptable. Using COMBINING LATIN SMALL
LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as
the base character. However in the lack of a better solution I can live
with it :-)

An alternative would be to use SMALL EQUALS SIGN, but looks like fonts
supporting it are rather rare. 

>
> On Sun, 28 Oct 2018 at 10:41, arno.schmitt via Unicode wrote:

[...]

>  Looks to me like U+2116 № NUMERO SIGN
>  which perhaps should not have been encoded,
>  since we have both U+004E LATIN CAPITAL LETTER N and
>  U+00BA º MASCULINE ORDINAL INDICATOR

I'm rather sure it is inherited from a character set used for the
round-trip test.
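
Whatever its origin, the compatibility status of NUMERO SIGN is
recorded in the character database and can be read with Python's
standard unicodedata module. The sketch below only prints the
decomposition mappings; it does not show which legacy character set the
character came from, and the helper name decomposition_of is made up
for the example.

```python
import unicodedata


def decomposition_of(ch):
    """Return (character name, decomposition mapping) from the character database."""
    return unicodedata.name(ch), unicodedata.decomposition(ch)


# NUMERO SIGN, MASCULINE ORDINAL INDICATOR, MODIFIER LETTER SMALL R
for ch in ("\u2116", "\u00BA", "\u02B3"):
    name, mapping = decomposition_of(ch)
    print(f"U+{ord(ch):04X}", name, "->", mapping)
```

A tag such as <compat> or <super> at the start of a mapping marks a
compatibility decomposition, i.e. the character is treated as a variant
of the plain letters that follow the tag.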

Best regards

Janusz


Re: A sign/abbreviation for "magister"

2018-10-27 Thread Janusz S. Bień via Unicode

On Sat, Oct 27 2018 at  5:58 -0700, Asmus Freytag via Unicode wrote:

[...]

> My suspicion would be that the small "z" is rather a "=" that acquired
> a connecting stroke as part of quick handwriting.

You must be right.

In the meantime I looked up some other postcards written by the same
person and found several other abbreviations, including № 'NUMERO SIGN'
(U+2116), written in the same way, i.e. with a double instead of a single
line.

So we have a consensus about how to interpret the sign, but there are
still open questions about the scope of its usage, and its encoding.

Thanks once again to all who contributed to the discussion.

Best regards

Janusz


Re: A sign/abbreviation for "magister"

2018-10-27 Thread Janusz S. Bień via Unicode

On Sat, Oct 27 2018 at 16:32 +0200, rein wrote:
> Janusz,
>
> "wszystkimi m(oj)ami rączki"  some sort of  plural instrumentalis :)

Rather "moimi", although still the phrase sounds strange.

> "embracing you with all my  hands/arms"

Now "kiss" (całować) and "embrace" (obejmować) are strictly separated,
but perhaps 100 years ago it was different.

Best regards

Janusz

P.S. This discussion is completely off the topic of the list, but I'm
very grateful for the help received on and off the list.


Re: A sign/abbreviation for "magister"

2018-10-27 Thread Janusz S. Bień via Unicode

On Sat, Oct 27 2018 at 14:36 +0200, rein wrote:
> Janusz,
>
> reminds me of the "numero sign " 

Yes, that's definitely similar.

>
> I tried to read the letter but couldn't manage to all the way ;)

Congratulations, you have done it better than me!

>
> Droga i Kochana Wiriańko

Rather "Wisieńko": "Ludwika" -> "Ludwisieńka" ->"Wisieńka"

>
> załaczam Ci z tą fotografiją list Staszki - odpisałem już jej też.  co
> u Was więcej słychać żadnych jeszcze ni mam odpowiedzi

I didn't recognize "odpowiedzi".

>ze znanych Ci miejscowoci ?adresować?

"Adresować" makes sense, although some letters seem missing.

> do Staszki jak tyś chciała pisać
>(W.Pan Mr Michał Gałkiewicz Feldspital 411 Feldpost 380.) Mr znaczy
>Magister. On przy tem szpitalu aptekarzem.  całuję Cię ze wargatkiem

I read this as "wszystkiemi".

>Mami

I can't guess a word which would make sense of this phrase...

> rączki Twój Kochający Włodek 12/9 917
>
> pozdrawiam, Rein

Nawzajem :-)

>
> On Sat, 27 Oct 2018 13:10:20 +0200, Janusz S. Bień via Unicode wrote:

[...]

>> The second question is: are you familiar with such or a similar symbol?
>> Have you ever seen it in print?

The postcard is from the front of the First World War, written by an
Austro-Hungarian soldier. He explains the meaning of the abbreviation
to his wife, so it looks like the abbreviation was used but was not very
common.

>>
>> The third and the last question is: how to encode this symbol in
>> Unicode?

I've got a comment to this question off the list, but I'm waiting to see
more opinions.

Best regards

Janusz

P.S. I subscribe to the list in digest form but I also look up the
archive. I think Asmus Freytag's interpretation is the correct one (a
similar interpretation was also suggested off the list).

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

A sign/abbreviation for "magister"

2018-10-27 Thread Janusz S. Bień via Unicode



Hi!

On this postcard, which is over 100 years old,

https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6

you can see 2 occurrences of a symbol which is explicitly explained (in
Polish) as meaning "Magister".

The first question is: how do you interpret the symbol? For me it is
definitely the capital M followed by the superscript "r" (written in an
old style no longer used in Poland), but there is something below the
superscript. It looks like a small "z", but such an interpretation
doesn't make sense to me.

The second question is: are you familiar with such or a similar symbol?
Have you ever seen it in print?

The third and the last question is: how to encode this symbol in
Unicode?

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Unicode String Models

2018-09-09 Thread Janusz S. Bień via Unicode

On Sat, Sep 08 2018 at 18:36 +0200, Mark Davis ☕️ via Unicode wrote:
> I recently did some extensive revisions of a paper on Unicode string models 
> (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

It's a good opportunity to propose a better term for "extended grapheme
cluster": these are usually neither extended nor clusters, and it's also
not obvious that they are always graphemes.

Cf. the earlier threads

https://www.unicode.org/mail-arch/unicode-ml/y2017-m03/0031.html
https://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0040.html

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: CLDR

2018-09-03 Thread Janusz S. Bień via Unicode

On Fri, Aug 31 2018 at 10:27 +0200, Manuel Strehl via Unicode wrote:
> The XML files in these folders:
>
> https://unicode.org/repos/cldr/tags/latest/common/

Thanks for the link.

In the meantime I rediscovered Locale Explorer

http://demo.icu-project.org/icu-bin/locexp

which I used some time ago.

On Fri, Aug 31 2018 at 12:17 +0200, Marcel Schneider via Unicode wrote:
> On 31/08/18 07:27 Janusz S. Bień via Unicode wrote:
> […]
>> > Given NamesList.txt / Code Charts comments are kept minimal by design, 
>> > one couldn’t simply pop them into XML or whatever, as the result would be 
>> > disappointing and call for completion in the aftermath. Yet another task 
>> > competing with CLDR survey.
>> 
>> Please elaborate. It's not clear to me what you mean.
>
> These comments are designed for the Code Charts and as such must not be
> disproportionate in exhaustivity. Eg we have lists of related languages 
> ending 
> in an ellipsis.

Looks like we have different comments in mind.

[...]

>> > Reviewing CLDR data is IMO top priority.
>> > There are many flaws to be fixed in many languages including in English.
>> > A lot of useful digest charts are extracted from XML there,
>> 
>> Which XML? where?
>
> More precisely it is LDML, the CLDR-specific XML.
> What I called “digest charts” are the charts found here:
>
> http://www.unicode.org/cldr/charts/34/
>
> The access is via this page:
>
> http://cldr.unicode.org/index/downloads
>
> where the charts are in the Charts column, while the raw data is under
> SVN Tag.

Thanks for the link. I found especially interesting the Polish section
in

https://www.unicode.org/cldr/charts/34/subdivisionNames/other_indo_european.html

Looks like complete rubbish, e.g.

plmp = Federal Capital Territory(???) = Pomerania (Latin/English name of
Pomorze) transliterated into the Greek alphabet (and something in
Arabic).

The header of the page says "The coverage depends on the availability of
data in wikidata for these names" but I was unable to find this rubbish
in Wikidata (but I was not looking very hard).

>
>> 
>> > and we really 
>> > need to go through the data and correct the many many errors, please.

But who is the right person or institution to do it?

>> 
>> Some time ago I tried to have a close look at the Polish locale and
>> found the CLDR site prohibitively confusing.
>
> I experienced some trouble too, mainly because "SVN Tag" is counter-intuitive 
> for the access to the XML data (except when knowing about SubVersioN).
> Polish data is found here:
>
> https://www.unicode.org/cldr/charts/34/summary/pl.html
>
> The access is via the top of the "Summary" index page (showing root data):
>
> https://www.unicode.org/cldr/charts/34/summary/root.html
>
> You may wish to particularly check the By-Type charts:
>
> https://www.unicode.org/cldr/charts/34/by_type/index.html
>
> Here I’d suggest to first focus on alphabetic information and on punctuation.
>
> https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html
>
> Under Latin (table caption, without anchor) we find out what punctuation 
> Polish has compared to other locales using the same script.
> The exact character appears when hovering the header row.
> Eg U+2011 NON-BREAKING HYPHEN is systematically missing, which is 
> an error in almost every locale using hyphen. TC is about to correct that.
>
> Further you will see that while Polish is using apostrophe
> https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish
> CLDR does not have the correct apostrophe for Polish, as opposed eg to French.

I understand that by "the correct apostrophe" you mean U+2019 RIGHT
SINGLE QUOTATION MARK.

> You may wish to note that from now on, both U+0027 APOSTROPHE and 
> U+0022 QUOTATION MARK are ruled out in almost all locales, given the 
> preferred characters in publishing are U+2019 and, for Polish, the U+201E and 
> U+201D that are already found in CLDR pl.

The situation seems more complicated because the chart

https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html

contains a different list of punctuation characters than

https://www.unicode.org/cldr/charts/34/summary/pl.html.

I guess the latter is the primary one, and it contains U+2019 RIGHT
SINGLE QUOTATION MARK (and U+2018 LEFT SINGLE QUOTATION MARK, too).
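
The code points named in this exchange can be checked against the
Unicode Character Database. The following is only a minimal sketch in
Python (a language chosen for illustration, not one used in the thread),
relying on the standard unicodedata module; the helper name describe is
an assumption.

```python
import unicodedata

# Code points mentioned in the discussion of Polish punctuation.
CODE_POINTS = [0x0022, 0x0027, 0x2011, 0x2018, 0x2019, 0x201D, 0x201E]

def describe(cp: int) -> str:
    """Return 'U+XXXX NAME' using the UCD version bundled with Python."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"

for cp in CODE_POINTS:
    print(describe(cp))
```

The names come from whatever UCD version the Python installation ships,
so the check says nothing about which of these characters CLDR lists for
a given locale.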
   
>
> Note however that according to the information provided by English Wikipedia:
> https://en.wikipedia.org/wiki/Quotation_mark#Polish
> Polish also uses single quotes, that by contrast are still missing in CLDR.

You are right, but who cares? It looks like this has no practical
importance. Nobody complains about the wrong use of quotation marks in
Polish by Word or OpenOffice, so it looks like the software doesn't use
this information. So this is rather a matter of aesthetics...

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: UCD in XML or in CSV? (is: UCD data consumption)

2018-09-03 Thread Janusz S. Bień via Unicode

On Sun, Sep 02 2018 at  4:16 +0200, 

[...]

> So you can understand that I’m not unaware of the complexity of UCD. Though
> I don’t think that this could be an argument for not publishing a medium-size 
> CSV file with scalar values listed as in UnicodeData.txt.

For a non-programmer like me CSV is a much more convenient form than
XML: I can not only use it with a spreadsheet, but also import it
directly into a database and analyse it with various queries. XML is
politically correct, but practically almost unusable without a
specialised parser.
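
The database workflow described above might look roughly like this. It
is a sketch only: the two sample rows are written in the
semicolon-separated layout of UnicodeData.txt, while the table name
chars, the choice of columns and the use of SQLite are assumptions made
for illustration.

```python
import csv
import io
import sqlite3

# Two rows in the semicolon-separated layout of UnicodeData.txt (15 fields).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""

def load(text: str) -> sqlite3.Connection:
    """Load code point, name and general category into an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE chars (cp INTEGER PRIMARY KEY, name TEXT, gc TEXT)")
    for row in csv.reader(io.StringIO(text), delimiter=";"):
        db.execute("INSERT INTO chars VALUES (?, ?, ?)",
                   (int(row[0], 16), row[1], row[2]))
    return db

db = load(SAMPLE)
# Example query: names of all lowercase letters in the sample.
lowercase = db.execute("SELECT name FROM chars WHERE gc = 'Ll'").fetchall()
print(lowercase)
```

The same loader would accept a full file in that layout, since it only
depends on the field order and the semicolon delimiter.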

On Sat, Sep 01 2018 at 15:15 +0200, unicode@unicode.org writes:
> On 31/08/18 10:47 Manuel Strehl via Unicode wrote:
>> 
>> To handle the UCD XML file a streaming parser like Expat is necessary.
>
> Thanks for the tip. However for my needs, Expat looks like overkill, and I’m 
> looking out for a much simpler standalone tool, just converting XML to CSV.

I think CSV and XML can coexist peacefully, we just need an open source
round-trip converter.
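
What such a round-trip converter could do is sketched below in Python.
This is an illustration under stated assumptions, not an existing tool:
the attribute names cp, na and gc are modelled on UAX #42, but the
sample omits the XML namespace and nearly all properties, and the
function names are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified sample: attribute names (cp, na, gc) are modelled on UAX #42,
# but the namespace and most properties are omitted for brevity.
XML_SAMPLE = """<repertoire>
<char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu"/>
<char cp="0061" na="LATIN SMALL LETTER A" gc="Ll"/>
</repertoire>"""

FIELDS = ["cp", "na", "gc"]

def xml_to_csv(xml_text: str) -> str:
    """Write one CSV row per <char> element, one column per attribute."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for char in ET.fromstring(xml_text).iter("char"):
        writer.writerow({f: char.get(f, "") for f in FIELDS})
    return out.getvalue()

def csv_to_xml(csv_text: str) -> str:
    """Rebuild a <repertoire> element from the CSV rows."""
    root = ET.Element("repertoire")
    for row in csv.DictReader(io.StringIO(csv_text)):
        ET.SubElement(root, "char", {f: row[f] for f in FIELDS})
    return ET.tostring(root, encoding="unicode")

csv_text = xml_to_csv(XML_SAMPLE)
# Converting back to XML and again to CSV should give the same table.
round_tripped = xml_to_csv(csv_to_xml(csv_text))
print(csv_text)
```

A real converter would have to handle the namespace, ranges of code
points and the full property list, which this sketch leaves out.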

Last but not least, let me remind you that the thread was started by
the question of what is the most convenient way to describe the
properties of PUA characters.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

CLDR (was: Private Use areas)

2018-08-30 Thread Janusz S. Bień via Unicode

On Thu, Aug 30 2018 at  2:27 +0200, unicode@unicode.org writes:

[...]

> Given NamesList.txt / Code Charts comments are kept minimal by design, 
> one couldn’t simply pop them into XML or whatever, as the result would be 
> disappointing and call for completion in the aftermath. Yet another task 
> competing with CLDR survey.

Please elaborate. It's not clear to me what you mean.

> Reviewing CLDR data is IMO top priority.
> There are many flaws to be fixed in many languages including in English.
> A lot of useful digest charts are extracted from XML there,

Which XML? where?

> and we really 
> need to go through the data and correct the many many errors, please.

Some time ago I tried to have a close look at the Polish locale and
found the CLDR site prohibitively confusing.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-28 Thread Janusz S. Bień via Unicode

On Tue, Aug 28 2018 at  9:43 -0700, unicode@unicode.org writes:
> On August 23, 2011, Asmus Freytag wrote:
>
>> On 8/23/2011 7:22 AM, Doug Ewell wrote:
>>> Of all applications, a word processor or DTP application would want
>>> to know more about the properties of characters than just whether
>>> they are RTL. Line breaking, word breaking, and case mapping come to
>>> mind.
>>>
>>> I would think the format used by standard UCD files, or the XML
>>> equivalent, would be preferable to making one up:

Right. I was not so quick to state this so early, but 2 years ago I
wrote to the MUFI list:


--8<---cut here---start->8---
On Sat, Jan 02 2016 at 12:35 CET, odd.hau...@uib.no writes:

[...]

> Note the permanent URI at the University Library in Bergen. This will
> in all likelihood be the last recommendation of its kind (and
> certainly the last edited by the undersigned), so please look out for
> new solutions (databases or the like) on the MUFI web site!

I think that one of the forms, perhaps even the primary one, should
follow the original Unicode Character Database and the
output of Unibook (http://www.unicode.org/unibook/).

The idea can be tested by converting the present recommendation to this
form. Unfortunately I'm unable to contribute myself to this task.

One of the advantages would be that the various character browsers can
be adapted relatively easily to provide info about the MUFI characters.

A simpler variant of this idea is to use a Unibook-like format to
document fonts. Quick-and-dirty tools for this purpose have been
prepared by a student of mine:

https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/
https://bitbucket.org/jsbien/unicode-ucd-parser

A sample output of the tools is available at

https://bitbucket.org/jsbien/parkosz-font/downloads/Parkosz1907draft.pdf

(the font is also quick-and-dirty and unfinished work).

--8<---cut here---end--->8---

Unfortunately there was no reaction.

>>
>> The right answer would follow the XML format of the UCD.
>>
>> That's the only format that allows all necessary information contained
>> in one file,

For me the comments and cross-references contained in NamesList.txt
are also necessary. Do I understand correctly that only "ISO Comment
properties" are included in the file?

>> and it would leverage of any effort that users of the
>> main UCD have made in parsing the XML format.
>>
>> An XML format should also be flexible in that you can add/remove not
>> just characters, but properties as needed.
>>
>> The worst thing to do, other than designing something from scratch,
>> would be to replicate the UnicodeData.txt layout with its random, but
>> fixed collection of properties and insanely many semi-colons. None of
>> the existing UCD txt files carries all the needed data in a single
>> file.
>
> I don't know if or how I responded 7 years ago, but at least today, I
> think this is an excellent suggestion.
>
> If the goal is to encourage vendors to support PUA assignments, using an
> exceedingly well-defined format (UAX #42) sitting atop one of the most
> widely used base formats ever (XML), with all property information in a
> single repository (per PUA scheme), would be great encouragement.

I think we also need the data in a format accepted by UniBook.

> I've devised lots of novel file formats and I think this is one use
> case where that would be a real hindrance.

> Storing this information in a font, by hook or crook, would lock users
> of those PUA characters into that font. At that rate, you might as well
> use ASCII-hacked fonts, as we did 25 years ago.

Storing the information in a font is inappropriate not only for
technical reasons; as I wrote recently (on Thu, Aug 23 2018):

> Fonts are for *rendering*, new characters and variants are more and
> more often needed for *input* of real life old texts with sufficient
> precision.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Emacs Verbose Character Entry (was Private Use Areas)

2018-08-24 Thread Janusz S. Bień via Unicode

On Thu, Aug 23 2018 at 22:15 +0100, unicode@unicode.org writes:
> On Thu, 23 Aug 2018 21:47:03 +0200
> "Janusz S. Bień via Unicode"  wrote:
>
>> My needs are very simple, for example C-x 8 Return LATIN CAPITAL
>> LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with
>> the code E010. I can provide the list of names and codes.
>
> While it should obviously yield, if anything,  or
>  for 'LATIN CAPITAL LETTER A WITH MACRON AND
> BREVE',

In my opinion there is no question what

'LATIN CAPITAL LETTER A WITH MACRON AND BREVE'

should yield, because the name should be absent from the name list.

My example concerns names like

'LATIN CAPITAL LETTER A WITH MACRON AND BREVE [MUFI]'
'COMBINING ABBREVIATION MARK SUPERSCRIPT UR ROUND R FORM [MUFI]'

etc.

[...]

> The Emacs command "C-x 8 RET" expects the name of a single codepoint.

It's OK and in my opinion it should stay this way.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-24 Thread Janusz S. Bień via Unicode

On Fri, Aug 24 2018 at 16:12 +0300, e...@gnu.org writes:
>> From: jsb...@mimuw.edu.pl (Janusz S. Bień)
>> Cc: unicode@unicode.org,  richard.wording...@ntlworld.com
>> Date: Thu, 23 Aug 2018 21:47:03 +0200
>> 
>> I'm very glad you joined the discussion.
>
> I'm sorry for not joining sooner.  In my defense, I missed the
> reference to Emacs, and the rest of the discussion is not really
> interesting for me, as using PUA for new characters is not something I
> have interest in or experience with.

I don't think you missed anything important.

>
>> My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER
>> A WITH MACRON AND BREVE [MUFI] should yield the character with the code
>> E010. I can provide the list of names and codes.
>
> So you'd like to extend "C-x 8 RET" to recognize names of additional
> characters and associate them with codepoints in the PUA area?  That
> shouldn't be hard to add.

I would prefer extensibility over efficiency; I don't mind loading PUA
information from a source declared somehow in .emacs.d, so that I can
change/expand the list of characters from time to time.
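
The requested behaviour can be sketched outside Emacs. The following
Python is only an illustration of a name lookup that consults a
user-maintained list before the standard names; it is not Emacs code,
the "code;name" file layout and the function names are assumptions, and
only the E010 entry is taken from this thread.

```python
import unicodedata

# Hypothetical user-maintained list in "code;name" form; only the E010 entry
# comes from the discussion, the file layout itself is an assumption.
PUA_NAMES = """\
E010;LATIN CAPITAL LETTER A WITH MACRON AND BREVE [MUFI]
"""

def load_pua_names(text: str) -> dict[str, str]:
    """Map each declared name to its character, skipping blank lines."""
    table = {}
    for line in text.splitlines():
        if line.strip():
            code, name = line.split(";", 1)
            table[name.strip().upper()] = chr(int(code, 16))
    return table

def lookup(name: str, pua: dict[str, str]) -> str:
    """Try the private list first, then fall back to standard Unicode names."""
    key = name.strip().upper()
    return pua[key] if key in pua else unicodedata.lookup(key)

pua = load_pua_names(PUA_NAMES)
print(hex(ord(lookup("LATIN CAPITAL LETTER A WITH MACRON AND BREVE [MUFI]", pua))))
```

Because the list is read at run time, it can be edited or extended
without rebuilding anything, which is the extensibility asked for here.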

> But is that all? won't you also want to tell Emacs about the
> properties of those characters?

Personally I would also like to be able to change the case of a
letter or string, and I am willing to prepare the necessary information
for MUFI characters.

Displaying other properties would be nice, but for me this is not
crucial. Moreover, somebody has to prepare the data...

> or be able to set up fonts for displaying them?

It would be nice. I haven't asked for it because I typeset my texts with
XeTeX or LuaTeX and the input is more important for me than rendering.

> IOW, would it be okay to have these
> characters be "second-class citizens" in Emacs?

For me it would be acceptable.

BTW, I just got a perhaps crazy idea: what about treating a PUA
declaration (as you probably noticed, there may be conflicting ones) as a
separate coding system? Of course some mechanism for escaping the
standard PUA interpretation would be needed.

>
>> > It is true that the Unicode related data is produced at build time,
>> > but only some of that is actually recorded in the Emacs binary, the
>> > rest is loaded upon demand.  But all the data is stored in data
>> > structures that are mutable, given some Lisp programming.
>> 
>> I never was fluent in Lisp programming and by now I forgot almost
>> everything I knew, so it's not a task for me. I was thinking about
>> submitting a feature request, but I forgot also the proper procedures to
>> do it.
>
> The proper procedure is to type "M-x report-emacs-bug RET" and then
> describe the feature(s) you'd like to see added/improved.

I will definitely remember now :-)

>
>> Moreover I had the impression that I'm the only person who needs
>> it...
>
> That shouldn't stop you.  Many a feature in Emacs started as a request
> from a single individual.
>
>> > (It is not clear to me which part of the Unicode data you would like
>> > to change; are you talking about adding characters to the list of
>> > those defined by Unicode?  If you are using the PUA codepoints, it's
>> > possible that you will need to update Emacs's notion of PUA as well.)
>> 
>> Yes, I would like the PUA codepoints to be handled analogously to the
>> proper ones. What do you mean by Emacs's notion of PUA?
>
> Emacs knows about the PUA regions of the Unicode code-space, and
> treats those codepoints specially.  The features you request will
> probably need to affect the PUA region as well, because the codepoints
> you use should no longer be treated as PUA.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-24 Thread Janusz S. Bień via Unicode

On Thu, Aug 23 2018 at 11:49 -0700, beckie...@gmail.com writes:
> On Thu, Aug 23, 2018 at 5:10 AM, Janusz S. Bień  wrote:
>
>  > I already provide this myself for my uses of the PUA as well as the
>  > CSUR and any vendor-specific agreements I can find:
>  >
>  > http://www.kreativekorp.com/charset/PUADATA/
>
>  I would prefer to see the data in a repository, so others can
>  comment and contribute.
>
> That is actually my intent for the future. Though it's not quite ready yet:
>
> https://github.com/kreativekorp/charset/tree/master/puadata

Great!

>
> That's the data in a "pre-compiled" form; it's turned into a "proper"
> PUADATA directory using this script:
>
> https://github.com/kreativekorp/charset/blob/master/bin/build-public.py
>
>  As for "any vendor-specific agreements", do MUFI and LINCUA qualify?
>
> I certainly do want to see MUFI and LINCUA provided in this form, but
> I put them in a different category along with CSUR. I basically have
> three categories of PUA agreements:
>
> Fonts - PUA assignments specific to a font family, e.g. Constructium, 
> Fairfax, Nishiki-teki, Quivira, Junicode, etc.

You are probably aware that Junicode 1.000, released in September 2017,
supports in full MUFI 4.0  (released in December 2015). I don't know
whether Junicode contains now any PUA characters which are not in MUFI.

>
> Public - PUA agreements meant to be widely used, e.g. CSUR, UCSUR,
> MUFI, LINCUA, etc.
>
> Vendors - PUA assignments meant to be used by a single vendor or
> platform, e.g. Adobe, Apple, etc. but also Linux, MirOS, etc.
>
> Thank you for those links by the way. I had tried to find charts for
> MUFI in the past but had somehow been unsuccessful.

Similar files for a different purpose have been created by Mikkel Eide
Eriksen:

https://github.com/mikkelee/mufi-latex

An earlier version of MUFI was incorporated in the ENRICH Gaiji bank:

http://v2.manuscriptorium.com/apps/gbank/

You can download the source but it doesn't seem useful.

A version of MUFI is available also as a searchable character database
created by the present single-person MUFI board, i.e. Tarrin Wills, as a
part of the beta version of a new MUFI site:

http://skaldic.abdn.ac.uk/m.php?p=mufi

Some time ago I wrote on the mufi-fonts list:

--8<---cut here---start->8---
On Sun, Dec 03 2017 at  6:55 +0100, jsb...@mimuw.edu.pl writes:

[...]

> I wanted the file quickly to get an overview of the recently released
> corpus of 16th century Polish, and it seemed to me that the simplest
> and fastest way is to convert the PDF recommendation in a semi-automatic
> way. It was more cumbersome than I expected, but thanks to this approach
> I've discovered a typo in the recommendation: letter I instead of digit
> 1 in EAFI, the code for LATIN ENLARGED LETTER SMALL LIGATURE AE (p. 93
> in the code chart order version).
>
> For the planned extension of the program I need more info on MUFI
> characters, preferably in the format of the UnicodeData.txt. This time
> however I intend to make haste slowly, so I have a question:
>
> Is it possible to make publicly available for download the database
> underlying http://skaldic.abdn.ac.uk/db.php?if=mufi=mufi_char?

--8<---cut here---end--->8---

Unfortunately I got no answer to the question.
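
A typo of the kind mentioned in the quoted message (the letter I in
place of the digit 1 in EAFI) can be caught mechanically once the codes
are available as text. The following Python is a minimal sketch; the
function name and the sample list are assumptions for illustration.

```python
def invalid_codes(codes: list[str]) -> list[str]:
    """Return the entries that are not valid hexadecimal code points."""
    bad = []
    for code in codes:
        try:
            value = int(code, 16)
        except ValueError:
            # Contains a character that is not a hexadecimal digit.
            bad.append(code)
            continue
        if not 0 <= value <= 0x10FFFF:
            # Parses as hex but lies outside the Unicode code space.
            bad.append(code)
    return bad

# "EAFI" (letter I) is the typo quoted above; "EAF1" is the intended code.
print(invalid_codes(["E010", "EAFI", "EAF1"]))
```

Such a check only finds codes that are not hexadecimal at all; a wrong
but well-formed code would pass unnoticed.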


>  > Of course there is no way to get software to use this information.
>
>  What kind of software do you have in mind?
>
> Unicode-related utilities, text editors to start with. You pretty much
> hit the nail on the head with uniname and emacs as examples. :)

Thanks! As for uniname by Bill Poser, I exchanged mails with him in
2011:

--8<---cut here---start->8---
On Sun, Aug 28 2011 at 12:01 +0200, jsb...@mimuw.edu.pl writes:

[...]

> A student of mine wrote an alternative program according to my
> specification. The program is GPLed and available with
>
> git clone http://students.mimuw.edu.pl/~findepi/unihistext unihistext

Now https://bitbucket.org/jsbien/unihistext

>
> The source is ready for Debian packaging.
>
> I think the program is worth better distribution, but its author is no
> longer interested in it. Would you be so kind to consider including
> either the program itself in your uniutils or extend your unidesc with
> its features?
>
> Best regards
>
> Janusz

On Sun, Aug 28 2011 at 16:03 -0700, billpos...@gmail.com writes:
> In principle, sure. I'll have a look at it.

--8<---cut here---end--->8---

Unfortunately nothing happened, and I thought I should not press the
point.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-23 Thread Janusz S. Bień via Unicode

On Thu, Aug 23 2018 at 22:17 +0300, e...@gnu.org writes:
>> Date: Thu, 23 Aug 2018 20:30:52 +0200
>> Cc: Richard Wordingham 
>> From: "Janusz S. Bień via Unicode" 
>> 
>> >> and in Emacs - to my disappointment it looks like the Unicode data are
>> >> set at the compile time, but perhaps this can be negotiated with the
>> >> developers.
>> >
>> > Can you be more specific?
>> 
>> I often search characters by name with C-x 8 Return. I would like to use
>> it also for MUFI characters, I have already the name list (the example
>> directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked
>> very closely into the problem and don't remember now the details, but my
>> impression was that it's not simple.
>
> What is "it" in the last sentence?  IOW, what is not simple about that
> with Emacs?

I'm very glad you joined the discussion.

My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER
A WITH MACRON AND BREVE [MUFI] should yield the character with the code
E010. I can provide the list of names and codes.

>
> It is true that the Unicode related data is produced at build time,
> but only some of that is actually recorded in the Emacs binary, the
> rest is loaded upon demand.  But all the data is stored in data
> structures that are mutable, given some Lisp programming.

I never was fluent in Lisp programming and by now I forgot almost
everything I knew, so it's not a task for me. I was thinking about
submitting a feature request, but I forgot also the proper procedures to
do it. Moreover I had the impression that I'm the only person who needs
it...

>
> (It is not clear to me which part of the Unicode data you would like
> to change; are you talking about adding characters to the list of
> those defined by Unicode?  If you are using the PUA codepoints, it's
> possible that you will need to update Emacs's notion of PUA as well.)

Yes, I would like the PUA codepoints to be handled analogously to the
proper ones. What do you mean by Emacs's notion of PUA?

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-23 Thread Janusz S. Bień via Unicode

On Thu, Aug 23 2018 at 17:26 +0100, unicode@unicode.org writes:
> On Thu, 23 Aug 2018 17:39:15 +0200
> Philippe Verdy via Unicode  wrote:
>
>> You make a confusion: I do not propose "hacking" existing codes, but
>> instead adding new codes for private variations. It's then up to PUV
>> sequence authors to choose an appropriate base character that can
>> have the properties they want to be inherited by the private-use
>> variation sequence, or to choose a base character that will provide
>> some reasonable reading if rendered as is (by renderers or fonts
>> not implementing the private variation sequence, given that they will
>> also append a symbol for the PUV itself after the standard character).
>
> Variation sequences cannot be used to add new characters.  Most PUA
> characters are used to represent new characters.  A
> standard-conformant private variation sequence would generally achieve
> the same effect as could be achieved by a font feature (typically one
> of the cvxx, though possibly one of the ssxx),

This is a typical but IMHO obsolete perspective. Fonts are for
*rendering*, new characters and variants are more and more often needed
for *input* of real life old texts with sufficient precision.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-23 Thread Janusz S. Bień via Unicode

On Thu, Aug 23 2018 at 17:11 +0100, unicode@unicode.org writes:
> On Thu, 23 Aug 2018 14:10:35 +0200
> "Janusz S. Bień via Unicode"  wrote:
>
>> What kind of software do you have in mind?
>> 
>> I'm primarily interested in the locally developed programs
>> 
>> https://bitbucket.org/jsbien/unihistext/
>> 
>> https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/
>
> It looks as though the security certificates are awry - has someone
> forgotten to pay the protection money to the right people?  (Firefox
> objects with "The page you are trying to view cannot be shown because
> the authenticity of the received data could not be verified.")

I see no such problems with Firefox ESR 52.9.0 on Debian
testing. Moreover the program reports that the certificate is valid till
04/21/2020.

>
>> and in Emacs - to my disappointment it looks like the Unicode data are
>> set at the compile time, but perhaps this can be negotiated with the
>> developers.
>
> Can you be more specific?

I often search characters by name with C-x 8 Return. I would like to use
it also for MUFI characters, I have already the name list (the example
directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked
very closely into the problem and don't remember now the details, but my
impression was that it's not simple.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-23 Thread Janusz S. Bień via Unicode

On Tue, Aug 21 2018 at 11:23 -0700, unicode@unicode.org writes:
> On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bień via Unicode 
>  wrote:
>
>  I think PUA users should provide the
>  properties of the characters used in a form analogous to that of
>  Unicode itself, and the software should be able to use this additional
>  information.
>
> I already provide this myself for my uses of the PUA as well as the
> CSUR and any vendor-specific agreements I can find:
>
> http://www.kreativekorp.com/charset/PUADATA/

I would prefer to see the data in a repository, so others can
comment and contribute.

As for "any vendor-specific agreements", do MUFI and LINCUA qualify?

https://folk.uib.no/hnooh/mufi/
http://andron-typeforum.xobor.de/t10f13-Towards-a-linguistic-corporate-use-area-LINCUA.html

>
> Of course there is no way to get software to use this information.

What kind of software do you have in mind?

I'm primarily interested in the locally developed programs

https://bitbucket.org/jsbien/unihistext/

https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/

and in Emacs - to my disappointment it looks like the Unicode data are
set at compile time, but perhaps this can be negotiated with the
developers.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Private Use areas

2018-08-21 Thread Janusz S. Bień via Unicode

On Tue, Aug 21 2018 at 16:56 +0200, unicode@unicode.org writes:
> On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:
>> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
>> > Is there a block of RTL PUA also?
>> 
>> No.
>
> Perhaps there should be?
>
> What about designating a part of the PUA to have a specific property?  Only
> certain properties matter enough:
> * wide
> * RTL
> * combining
> as most others are better represented in the font itself.
>
> This could be done either by parceling one of existing PUA ranges: planes 15
> and 16 are virtually unused thus any damage would be negligible; or perhaps
> by allocating a new range elsewhere.

I don't think it's a good idea. I think PUA users should provide the
properties of the characters used in a form analogous to that of
Unicode itself, and the software should be able to use this additional
information.
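
How software could combine the fixed PUA ranges with user-supplied
properties is sketched below in Python. The three ranges are those
defined by the Unicode Standard; the property table, its field names and
the values given for E010 are hypothetical and serve only as an
illustration.

```python
import unicodedata

# The three Private Use ranges defined by the Unicode Standard.
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
]

def is_private_use(cp: int) -> bool:
    """True if the code point lies in one of the Private Use ranges."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

# Hypothetical user-supplied properties keyed by code point; the field names
# mirror UnicodeData.txt (general category, bidi class), values are illustrative.
USER_PROPERTIES = {0xE010: {"gc": "Lu", "bidi": "L"}}

def general_category(cp: int) -> str:
    """Prefer the user's declaration for PUA code points, else ask the UCD."""
    if is_private_use(cp) and cp in USER_PROPERTIES:
        return USER_PROPERTIES[cp]["gc"]
    return unicodedata.category(chr(cp))

print(general_category(0xE010), general_category(0xE011), general_category(0x41))
```

Undeclared PUA code points keep the default general category Co, so
nothing changes for users who supply no declarations.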

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

L2-11/059 and L2/13‐153 (was: Variation Sequences (and L2-11/059))

2018-07-29 Thread Janusz S. Bień via Unicode

On Mon, Jul 16 2018 at  7:07 +0200, jsb...@mimuw.edu.pl writes:

[...]

> To the best of my knowledge, the only attempt to introduce additional
> variation sequences was the strongly criticised Karl Pentzlin's proposal
> L2-11/059
>
> http://www.unicode.org/L2/L2011/11059-latin-cyr-var.pdf
>
> What has happened to it? I don't remember any information about it on the
> list.

Anybody willing to share his knowledge on this topic?

Looking for something else I found

L2/13‐153

Proposal to Use Standardized Variation Sequences to Encode Church
Slavonic Glyph Variants in Unicode
https://www.unicode.org/L2/L2013/13153-variants.pdf

I don't remember the proposal being discussed on the list and I was
unable to find any trace of a discussion in the list archives. Is my memory
correct?

I found also

L2/13‐164
Expert Feedback on Cyrillic proposals...
https://www.unicode.org/L2/L2013/13164-cyrillic-fdbk.pdf

containing interesting comments on VS by David Birnbaum.

I understand that L2-11/059 and L2/13-153 have been rejected, but I would
be happy to have more information about it.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

old Polish and Unicode (was: Variation Sequences (and L2-11/059))

2018-07-20 Thread Janusz S. Bień via Unicode

I apologize for sending by mistake the previous post with no new
content.

On Thu, Jul 19 2018 at 17:47 +0100, wjgo_10...@btinternet.com writes:

[...]

> I found the following.
>
> https://en.wikipedia.org/wiki/Old_Polish_language

Thanks again for your interest in the Polish language.

There is also

https://en.wikipedia.org/wiki/History_of_Polish
https://en.wikipedia.org/wiki/Middle_Polish_language
https://en.wikipedia.org/wiki/Polish_orthography
https://en.wikipedia.org/wiki/History_of_Polish_orthography

To make a long story short, this is just a mess. Looking for a good link
to recommend I just found

https://culture.pl/en/article/a-foreigners-guide-to-the-polish-alphabet

which seems worth looking at (but the multimedia version doesn't work
for me).

I used to recommend the paper

http://wbl.klf.uw.edu.pl/45/

which unfortunately seems to be no longer available on the Internet.

>
> WJGO >> So you could if you wish try to make your own font
>
> JSB >Actually I tried:
>
> JSB > https://bitbucket.org/jsbien/parkosz-font/
>
> Thank you for the link to the font. I have studied the font in the 
> FontCreator program (version 8).

Please revisit the site, I just added some links and comments. This
project is now orphaned. 

>
> I remember that I produced an OpenType font using Variation Selectors
> and OpenType Glyph Substitution back in April 2017. I wrote about it
> and provided a link to the font and a link to a typecase document.
>
> https://forum.high-logic.com/viewtopic.php?f=10=7033
>
> Although that font is about chess, I am thinking that that is the sort
> of font that is needed for what you are wanting to do. This could use
> variation selectors or could use circled digits as desired.

Thanks for the link. I think I will do some tests with XeLaTeX.

>
> I am a researcher and I am looking for a worthwhile project related to
> typography in which to participate from time to time - no money
> charged, no money to pay - and I am interested in printed books of the
> incunabula period and the early sixteenth century.
>
> I do not know any Polish, but I do not need to be involved in choosing
> which glyphs are needed, so my not knowing any Polish would not seem
> to be a problem.

Please feel free to take over the font for Parkosz's treatise, if you
wish to.

I think another interesting challenge is "Nowy Karakter Polski", a
16th-century treatise comparing several proposals for Polish spelling,
which uses various strange characters. You can find the scan in various
places and in various formats, e.g.

https://books.google.pl/books?id=Z3ojMAAJ
http://www.dbc.wroc.pl/publication/4239

The treatise is one of the important sources used by the
dictionary of the 16th century Polish language:

http://spxvi.edu.pl/

The only English language presentation of the dictionary seems to be

Luto-Kamińska, A. (2017). Several words on the dictionary of the 16th
century Polish language.

unfortunately behind a paywall:

http://www.dbpia.co.kr/Journal/ArticleList/VOIS00297995#

The history of the dictionary is long and sad. The work started in 1949
(!) and after the initial enthusiasm and generous funding the team had
to struggle with various difficulties; as a consequence the dictionary
is still unfinished, but the work continues, although rather slowly.

In my unpublished presentation

http://bc.klf.uw.edu.pl/179/

I show how the editors managed quoting "Nowy Karakter" (slides
26-35). It looks like in the time of hot type the strange letters were
written by hand, and there was a regression when the dictionary started
to be typeset on a computer.

In my presentation I made some suggestions on how to use Unicode for
"Nowy Karakter" (slides 40-69). Unfortunately the dictionary editors were
not interested in the proposal (they had at the time much more important
problems).

Not long ago the team received a long-awaited grant for computerizing the
work on the dictionary, in particular for creating a corpus of 16th
century texts. It looks like the corpus was prepared rather in a hurry and
there was no time or money to develop a faithful rendering of "Nowy
Karakter". The work exists in the corpus in two forms:

PDF: http://rcin.org.pl/publication/82568
HTML: http://spxvi.edu.pl/korpus/teksty/JanNKar/

I must say that for a typical user of the dictionary the solution
applied is probably a good one. The spelling has been modernized but the
occurrences of strange characters have been marked with color in the PDF,
and in the HTML additionally with some information displayed when you
hover over the appropriate fragment of the text.

This solution is however not applicable to e.g. quotations in a research
paper when color is for some reason not allowed.

So encoding "Nowy Karakter Polski" in Unicode and providing a font for
it is still in my opinion an interesting open problem.

Cf. also the thread

http://www.unicode.org/mail-arch/unicode-ml/y2010-m04/0024.html

BTW, I was definitely too optimistic...

Best regards

Janusz

--

Re: Variation Sequences (and L2-11/059)

2018-07-17 Thread Janusz S. Bień via Unicode

On Tue, Jul 17 2018 at  8:34 -0700, Asmus Freytag writes:
> On 7/16/2018 10:04 PM, Janusz S. Bień via Unicode wrote:
>
>  I understand there is no sufficient demand for the Unicode Consortium
> maintaining a supplementary non-ideographic variation database. Hence
> for the time being  a kind of Private Use variation database seems to be
> the only solution - am I right?
>
> The question comes down to resources, among other things. As well as to
> whether there are actual users / implementers waiting for and ready to
> adopt such a database as solution to their problems.

I hope the resources are sufficient to improve the wording of the
variation sequence FAQ. Do we agree that at present users/implementers
are rather misled by it?

> A strawman proposal could identify these issues and some ways that they
> might be addressed and then ask for criteria of what the UTC might deem
> sufficient.

Perhaps this statement should be put into the FAQ, instead of "you should
propose your addition as a variation sequence"?

On Tue, Jul 17 2018 at 13:45 +0100, William_J_G Overington writes:
> Janusz S. Bien wrote:
>
>> I understand there is no sufficient demand for the Unicode
>> Consortium maintaining a supplementary non-ideographic variation
>> database. Hence for the time being a kind of Private Use variation
>> database seems to be the only solution - am I right?
>
> Well, with the greatest respect, in my opinion, no.
>
> You could use my suggestion and send a copy of your encoding to the
> Unicode Technical Committee (UTC) and maybe they will endorse it.

Difficult to do as there is no "my encoding".

>
> There is precedence over the astronaut emoji where in glyph
> substitution the rocket was lost and a space suit was obtained from
> somewhere.
>
> For my suggestion the circled digit would be lost and an alternate
> glyph introduced.

You seem to assume that my concern is only rendering.

[...]


On Tue, Jul 17 2018 at 14:07 +0100, William_J_G Overington writes:
> WJGO >> My suggestion is to use for each desired glyph a sequence
> consisting of three characters, and then have an OpenType font decode
> them so that the glyph can be displayed.
>
> JSB >This is a prohibitive requirement, because for years there has been a
> lack of font creators interested in old Polish.
>
> Well, I have not been aware of any call for participation. It seems an 
> interesting project.
>
> I make OpenType fonts using the FontCreator program.
>
> There is an active forum with helpful people participating.
>
> So you could if you wish try to make your own font

Actually I tried:

https://bitbucket.org/jsbien/parkosz-font/

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Variation Sequences (and L2-11/059)

2018-07-16 Thread Janusz S. Bień via Unicode

On Mon, Jul 16 2018 at 19:00 +0100, wjgo_10...@btinternet.com writes:
> Hi
>
>> I ask the question because there are now several historical corpora
>> of Polish under development, which use at present a kind of fall-back
>> or some other ad hoc solutions for "nonce glyphs", as they are called
>> in the FAQ.
>
> I wonder if you could say please what are the "kind of fall-back or
> some other ad hoc solutions" please.

I would prefer not to go into details. I think some of those "solutions"
are simply wrong but the list is not the right place to criticize them.

> The reason I ask is because I have thought of a possible solution to
> the problem that has graceful fall-back and uses only plane 0
> characters, no Private Use Area characters at all: I am wondering
> whether my suggestion will be of use or if it is just another method
> that could just be added to a collection of "kind of fall-back or some
> other ad hoc solutions".
>
> My suggestion is to use for each desired glyph a sequence consisting
> of three characters, and then have an OpenType font decode them so
> that the glyph can be displayed.

This is a prohibitive requirement, because for years there has been a lack
of font creators interested in old Polish.

> Each such sequence being of the form.
>
> Base character ZERO WIDTH JOINER then a circled digit character or a
> circled number character.
>
> http://www.unicode.org/charts/PDF/U2460.pdf
>
> Thus there being up to twenty specific glyphs for each base character.
>
> The list of glyphs could be gradually extended as needed and if an
> attempt to display a newly added glyph is made using a font
> implemented from an earlier list then there would be graceful
> fall-back to the base character followed by a circled digit.
>
> It would be helpful for entering text into documents if the ZERO WIDTH
> JOINER character has a visible glyph within the font. Then entering
> text with OpenType glyph substitution turned off could be easier to
> carry out.

I perceive your proposal as "visible variation selectors for private
variation sequences", as a text encoded this way can be easily converted
into a text using real variation selectors.

I think it might be a reasonable temporary solution, but not the
ultimate one.
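
The conversion mentioned above can be sketched in a few lines of
Python. This is only an illustration: it assumes that a circled number
n (U+2460..U+2473, i.e. 1 to 20) is mapped to VARIATION SELECTOR-n,
which is my arbitrary choice and not something proposed in this thread,
and the function names are made up.

```python
import re

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# ZWJ followed by one of CIRCLED DIGIT ONE .. CIRCLED NUMBER TWENTY.
_VISIBLE_SELECTOR = re.compile(ZWJ + "([\u2460-\u2473])")


def variation_selector(n: int) -> str:
    """Return VARIATION SELECTOR-n for n in 1..256."""
    if not 1 <= n <= 256:
        raise ValueError("variation selector number out of range")
    if n <= 16:
        return chr(0xFE00 + n - 1)      # VS1..VS16
    return chr(0xE0100 + n - 17)        # VS17..VS256


def to_variation_selectors(text: str) -> str:
    """Replace every ZWJ + circled number with one variation selector."""
    def repl(match):
        number = ord(match.group(1)) - 0x2460 + 1
        return variation_selector(number)
    return _VISIBLE_SELECTOR.sub(repl, text)
```

The resulting sequences would of course still be private ones, as they
are not listed in any of the files recognized by the standard.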

> I am wondering quite how acceptable such a solution would be for
> standardization: the list of ways that something can be encoded using
> a ZWJ (ZERO WIDTH JOINER) character seems to have recently been de
> facto extended for use with generating emoji sequences - not with
> circled digits but use of ZWJ to change meaning which is a far bigger
> extension than needed for this suggestion as meaning would often be
> unaltered when using this suggestion.

I would expect arguments that it has no obvious advantage over
variation sequences.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Variation Sequences (and L2-11/059)

2018-07-16 Thread Janusz S. Bień via Unicode

On Mon, Jul 16 2018 at  1:08 -0700, unicode@unicode.org writes:
> The use case would seem to be more properly served by some form of
> registration mechanism, like the one IVD represents for ideographs.

I agree.

>
> The use of "standardized" variation sequences with the understanding
> that those would be (fairly) widely implemented would, in contrast, be
> best reserved to cases where the encoding in the Standard resulted
> in deliberately unifying some variations for which there is
> nevertheless a common (!) use case of requiring each alternate to be
> selected.

I agree.

[...]

> On 7/15/2018 10:07 PM, Janusz S. Bień via Unicode wrote:
>
>  
> FAQ (http://unicode.org/faq/vs.html) states:
>
> For historic scripts, the variation sequence provides a useful tool,
> because it can show mistaken or nonce glyphs and relate them to the
> base character. It can also be used to reflect the views of
> scholars, who may see the relation between the glyphs and base
> characters differently. Also, new variation sequences can be added
> for new variant appearances (and their relation to the base
> characters) as more evidence is discovered.
>
> It states also:
>
>    What variation sequences are valid?
>    Only those listed in StandardizedVariants.txt...

The full answer is:

Only those listed in StandardizedVariants.txt,
emoji-variation-sequences.txt, or the registered sequences listed in
the Ideographic Variation Database (IVD).
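
For concreteness, checking a sequence against such a list could look
like the Python sketch below. It assumes the semicolon-separated layout
of StandardizedVariants.txt (code points first, "#" starting a
comment); the sample lines are written in that layout and should be
checked against the real file, and the function name is invented.

```python
# Sample lines in the layout of StandardizedVariants.txt (illustrative).
SAMPLE = """\
# comment lines and blank lines are skipped

0030 FE00; short diagonal stroke form; # DIGIT ZERO
2205 FE00; zero with long diagonal stroke overlay form; # EMPTY SET
"""


def parse_variants(text: str) -> set:
    """Collect the listed variation sequences as Python strings."""
    listed = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop trailing comment
        if not data:
            continue
        code_points = data.split(";")[0].split()
        listed.add("".join(chr(int(cp, 16)) for cp in code_points))
    return listed


listed = parse_variants(SAMPLE)
print("\u2205\ufe00" in listed)   # sequence present in the sample
print("a\ufe00" in listed)        # sequence absent from the sample
```

A complete check would have to consult emoji-variation-sequences.txt
and the IVD as well, which is not attempted here.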

Do we agree that the statements are not consistent, at least with your
view, which I share?

I understand there is not sufficient demand for the Unicode Consortium
to maintain a supplementary non-ideographic variation database. Hence
for the time being a kind of Private Use variation database seems to be
the only solution - am I right?

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Variation Sequences (and L2-11/059)

2018-07-16 Thread Janusz S. Bień via Unicode



FAQ (http://unicode.org/faq/vs.html) states:

For historic scripts, the variation sequence provides a useful tool,
because it can show mistaken or nonce glyphs and relate them to the
base character. It can also be used to reflect the views of
scholars, who may see the relation between the glyphs and base
characters differently. Also, new variation sequences can be added
for new variant appearances (and their relation to the base
characters) as more evidence is discovered.

It states also:

   What variation sequences are valid?
   Only those listed in StandardizedVariants.txt...

However the file in question contains only sections for mathematics and
some rather exotic scripts.

To the best of my knowledge, the only attempt to introduce additional
variation sequences was Karl Pentzlin's strongly criticised proposal
L2-11/059

http://www.unicode.org/L2/L2011/11059-latin-cyr-var.pdf

What has happened to it? I don't remember any information about it on the
list.

However my primary question is:

Are variation sequences *really* recommended for historical scripts?

I ask the question because there are now several historical corpora of
Polish under development, which use at present a kind of fall-back or
some other ad hoc solutions for "nonce glyphs", as they are called in
the FAQ.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-02-28 Thread Janusz S. Bień via Unicode


Thanks to all who answered. The answers are very clear, but the original
message and the adoption page are in my opinion much less clear. I can
however live with it :-)

Best regards

Janusz

On Wed, Feb 28 2018 at 11:53 +0100, m...@macchiato.com writes:
> Also, please click through from the announcement to 
> http://www.unicode.org/consortium/adopt-a-character.html.
>
> If it isn't apparent from that page what the relationship is, we have some 
> work to do...
>
> Mark

> On Wed, Feb 28, 2018 at 11:48 AM, Martin J. Dürst via Unicode 
> <unicode@unicode.org> wrote:
>
>  On 2018/02/28 19:38, Janusz S. Bień via Unicode wrote:
>
>  On Tue, Feb 27 2018 at 13:45 -0800, announceme...@unicode.org writes:
>
>  The 157 new Emoji are now available for adoption, to help the Unicode
>  Consortium’s work on digitally disadvantaged languages.
>
>  I'm quite curious what is the relation between the new emojis and the
>  digitally disadvantaged languages. I see none.
>
>  I think this was mentioned before on this list, in particular by Mark:
>  The money collected from character adoptions (where emoji are a prominent
>  target) is (mostly?) used to support work on not-yet-encoded (thus digitally
>  disadvantaged) scripts. See e.g. the recent announcement at
>  http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html.



-- 
   ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-02-28 Thread Janusz S. Bień via Unicode

On Tue, Feb 27 2018 at 13:45 -0800, announceme...@unicode.org writes:

> The 157 new Emoji are now available for adoption, to help the Unicode
> Consortium’s work on digitally disadvantaged languages.

I'm quite curious what is the relation between the new emojis and the
digitally disadvantaged languages. I see none.

Best regards

Janusz

-- 
   ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Re: Unicode Digest, Vol 50, Issue 13

2018-02-18 Thread Janusz S. Bień via Unicode

On Sun, Feb 18 2018 at 18:03 CET, kilob...@angband.pl writes:
> On Sun, Feb 18, 2018 at 02:35:00PM +0100, Janusz S. Bień via Unicode wrote:
>> On Sun, Feb 18 2018 at 14:06 CET, unicode@unicode.org writes:
>> > Subject: metric for block coverage
>> >
>> > Hi!
>> > As a part of Debian fonts team work, we're trying to improve fonts review:
>> > ways to organize them, add metadata, pick which fonts are installed by
>> > default and/or recommended to users, etc.
>> >
>> > I'm looking for a way to determine a font's coverage of available scripts. 
>> > It's probably reasonable to do this per Unicode block.  Also, it's a safe
>> > assumption that a font which doesn't know a codepoint can do no complex
>> > shaping of such a glyph, thus looking at just codepoints should be adequate
>> > for our purposes.
>> 
>> As a Debian user using some rare characters for old Polish
>> transliteration I would be happy with a tool which scans
>> available/installed fonts for a specific list of characters and shows
>> only those fonts which support the whole list. Of course showing also
>> the characters in question would be very desirable.
>
> Thanks, your suggestion is a good addition to the wishlist of features we'd
> want to have.  Especially for the "available" case -- it'd be tedious to
> install all candidates just to check them.
>
> As for "installed":
> fc-list ':charset=16e5' file family

Thanks!

Some time ago I was looking at various Debian font utilities and found
nothing suitable, but it looks like I should use Google more intensively:

https://unix.stackexchange.com/questions/162305/find-the-best-font-for-rendering-a-codepoint
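
For a whole list of characters, the fc-list call quoted above can be
repeated per character and the results intersected. The Python sketch
below does just that; it assumes fc-list is on the PATH and uses only
the ':charset=<hex>' pattern form shown above, while the function names
are mine.

```python
import subprocess


def charset_pattern(ch: str) -> str:
    """Build the fontconfig pattern for a single character."""
    return ":charset={:x}".format(ord(ch))


def fonts_for_char(ch: str) -> set:
    """Ask fc-list for the families of installed fonts covering ch."""
    result = subprocess.run(
        ["fc-list", charset_pattern(ch), "family"],
        capture_output=True, text=True, check=True,
    )
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def fonts_for_all(chars: str, lookup=fonts_for_char) -> set:
    """Return the families reported for every character in chars."""
    sets = [lookup(ch) for ch in chars]
    return set.intersection(*sets) if sets else set()


# Example (needs fontconfig installed):
#     print(sorted(fonts_for_all("\u16e5\ua751")))
```

This covers only installed fonts, not the "available" case mentioned
above.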

Best regards

Janusz

-- 
   ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Re: metric for block coverage

2018-02-18 Thread Janusz S. Bień via Unicode

On Sun, Feb 18 2018 at 17:33 CET, e...@gnu.org writes:
>> Cc: unicode-requ...@unicode.org
>> Date: Sun, 18 Feb 2018 14:35:00 +0100
>> From: "Janusz S. Bień via Unicode" <unicode@unicode.org>
>> 
>> As a Debian user using some rare characters for old Polish
>> transliteration I would be happy with a tool which scans
>> available/installed fonts for a specific list of characters and shows
>> only those fonts which support the whole list. Of course showing also
>> the characters in question would be very desirable.
>
> I'm sure you know about BabelMap.  It has such a feature.

Yes, I know about BabelMap, but was not aware of the feature. Thank you.

I'm interested in a tool for Linux. I suppose BabelMap can be run on
Linux with Wine, but will this feature work in such a situation? I can
of course give it a try, but I have practically no experience with Wine.

Best regards

Janusz

-- 
   ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Re: Unicode Digest, Vol 50, Issue 13

2018-02-18 Thread Janusz S. Bień via Unicode

On Sun, Feb 18 2018 at 14:06 CET, unicode@unicode.org writes:

[...]

> From: Adam Borowski via Unicode 
> Subject: metric for block coverage
> To: unicode@unicode.org
> Date: Sat, 17 Feb 2018 23:18:25 +0100
> Reply-To: Adam Borowski 
>
> Hi!
> As a part of Debian fonts team work, we're trying to improve fonts review:
> ways to organize them, add metadata, pick which fonts are installed by
> default and/or recommended to users, etc.
>
> I'm looking for a way to determine a font's coverage of available scripts. 
> It's probably reasonable to do this per Unicode block.  Also, it's a safe
> assumption that a font which doesn't know a codepoint can do no complex
> shaping of such a glyph, thus looking at just codepoints should be adequate
> for our purposes.

As a Debian user using some rare characters for old Polish
transliteration I would be happy with a tool which scans
available/installed fonts for a specific list of characters and shows
only those fonts which support the whole list. Of course showing also
the characters in question would be very desirable.

Best regards

Janusz

-- 
   ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Re: TIRONIAN SIGN ET

2018-01-27 Thread Janusz S. Bień via Unicode

On Sat, Jan 27 2018 at 21:59 CET, davidj_fau...@yahoo.ca writes:

[...]

> As far as I can tell, it was originally proposed in the document n1747 
> 'Contraction mark characters for the UCS’ by Everson. However, I
> cannot find that document anywhere.

Thank you very much for the reference.

On the page

http://www.evertype.com/formal.html

there is the link

http://unicode.org/wg2/docs/n1747.pdf

but it does not work. However the page

http://www.unicode.org/wg2/WG2-registry.html

states

The archival document directory for WG2 is accessible here:
http://std.dkuug.dk/jtc1/sc2/wg2/ The archives contain all
available documents through 2014

and the document is at

ftp://std.dkuug.dk/ftp.anonymous/JTC1/SC2/WG2/docs/n1747.pdf

Actually the character is "inherited" from

 ISO 5426-2:1996 Information and documentation -- Extension of
 the Latin alphabet coded character set for bibliographic
 information interchange -- Part 2: Latin characters used in
 minor European languages and obsolete typography

Hence my curiosity is fully satisfied :-)

Thanks again!

Janusz

-- 
   ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Re: TIRONIAN SIGN ET

2018-01-27 Thread Janusz S. Bień via Unicode

On Sat, Jan 27 2018 at 20:53 CET, r...@unicode.org writes:
> Hello Janusz --
>
> Try this: http://www.unicode.org/L2/L2017/17300-n4841-tironian-et.pdf
>
> Regards,
>
> On 1/27/2018 11:40 AM, Janusz S. Bień via Unicode wrote:
>> Hi!
>>
>> I try to find in UTC Document Register the proposals for characters
>> which interest me for some reasons. I'm usually rather successful, but
>> I'm unable to find the proposal for TIRONIAN SIGN ET.

I've seen this document, but I'm looking for an earlier one. The
character was introduced in Unicode 3.0 in 1999, cf. e.g.

http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML015/0250.html

Regards

Janusz

-- 
   ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

TIRONIAN SIGN ET

2018-01-27 Thread Janusz S. Bień via Unicode


Hi!

I am trying to find in the UTC Document Register the proposals for
characters which interest me for some reason. I'm usually rather
successful, but I'm unable to find the proposal for TIRONIAN SIGN ET.

Any hints?

Best regards

Janusz

-- 
   ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
