Re: Unicode "no-op" Character?

2019-07-12 Thread Sławomir Osipiuk via Unicode
Hello again everyone,

Though I initially took the shoo-away, there have been some comments
made since then that I feel compelled to rebut. To avoid spamming the
list, I’ve combined my responses into a single message.

Before that, I will say, again, for the record: I know this NOOP idea
is unlikely to ever happen. Certainly not with the responses I've
gotten. I haven't submitted it, nor even looked into how to. I know it
would be rejected. This is a thought experiment, nothing more. If that
doesn't interest you, please disregard this message.

And again, the hypothetical NOOP is a character whose canonical
equivalent is the absence of a character. The logical consequences of
that statement apply fully.

On Wed, Jul 3, 2019 at 8:00 PM Shawn Steele via Unicode
 wrote:
>
> Even more complicated is that, as pointed out by others, it's pretty much 
> impossible to say "these n codepoints should be ignored and have no meaning" 
> because some process would try to use codepoints 1-3 for some private 
> meaning.  Another would use codepoint 1 for their own thing, and there'd be a 
> conflict.

This is so utterly, completely, and severely missing the point I'm
starting to feel like a madman screaming to the heavens, "Why can't
they just understand?!"

Yes, a different process will have a different private meaning for the
codepoint. That is not a bug, it is a feature. A conflict is always
resolved by the current process saying, "I'm holding the string now.
The old NOOPs are gone, canonically decomposed to nothing. The new
ones mean what I want them to mean, as long as I or my buddies hold
the string. If you didn't want that, you shouldn't have given the
string to me!" This conflict-resolution mechanism is the special
sauce. If a process needs a private marker that will be preserved in
interchange, there are plenty of PUA characters to use, and even a
couple of private control characters.
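
A minimal Python sketch of this conflict resolution, assuming the hypothetical NOOP sits at U+000F as elsewhere in this thread (not what U+000F means in the real standard): whoever currently holds the string wipes any inherited NOOPs and then inserts its own.

    NOOP = "\u000f"   # hypothetical NOOP; the real U+000F is SHIFT IN

    def take_ownership(text: str) -> str:
        """Wipe NOOPs left over from whoever held the string before us."""
        return text.replace(NOOP, "")

    def mark_packets(text: str, width: int) -> str:
        """Insert our own NOOPs, here used as fixed-width packet markers."""
        clean = take_ownership(text)
        chunks = [clean[i:i + width] for i in range(0, len(clean), width)]
        return NOOP.join(chunks)

    incoming = "ab\u000fcdef"                 # carries another process's markers
    print(repr(mark_packets(incoming, 3)))    # 'abc\x0fdef': theirs gone, ours added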

> I also think that the conversation has pretty much proven that such a system 
> is mathematically impossible.  (You can't have a "private" no-meaning 
> codepoint that won't conflict with other "private" uses in a public space).

No such thing has been proven in the slightest. Any conflict is
resolved, in the default case, by normalizing all NOOPs to nothing.

On Wed, Jul 3, 2019 at 5:46 PM Mark E. Shoulson via Unicode
 wrote:
>
> Um... How could you be sure that process X would get the no-ops that process 
> W wrote?  After all, it's *discardable*, like you said, and the database 
> programs and libraries aren't in on the secret.

Yes, there is a requirement that W and X communicate via some
"NOOP-preserving path" (call it a NOOPPP). Such paths would generally
be very short and direct, because NOOPs are intended to be ephemeral,
not archival! They wouldn't be hard to come by. Memory mappings or
pipes. Direct inter-process comms. Anything that operates at
byte-level. Even simple persisting mechanisms like file storage or
databases can preserve NOOP by doing... nothing. "Discardable" doesn't
mean it must be discarded, merely that it can be. Where there are no
security implications or other need, strings containing NOOP can
simply be passed through and stored as-is. Where any interface,
library, or process does not preserve NOOP, it cannot be part of a
NOOPPP. Tough luck.

> Moreover, as you say, what about when Process Z (or its companions) comes 
> along and is using THE SAME MECHANISM for something utterly different?  How 
> does it know that process W wasn't writing no-ops for it, but was writing 
> them for Process X?

It is the responsibility of Process Z (and any process that interprets
NOOPs non-trivially) to be aware of the context/source of what it's
receiving. Prior agreement or advertised contract.

On Wed, Jul 3, 2019 at 2:06 PM Rebecca Bettencourt  wrote:
>
> And the database driver filters out the U+000F completely as a matter of best 
> practice and security-in-depth.

I'm struggling to see the security implication of "store this string,
verbatim, in your regular VARCHAR (or whatever) text field". I can
store the string "DROP TABLE [STUDENTS];" in a text field and unless
the database is horribly broken it will store that without issue. A
database could strip NOOP out of text fields and still claim to be
Unicode conformant. But I wonder why it would bother to do that. And
even then, you could just store the string in a VARBINARY field or
whatever just accepts bytes.

> You can't say "this character should be ignored everywhere" and "this 
> character should be preserved everywhere" at the same time. That's the 
> contradiction.

I have not said "this character should be preserved everywhere". That
statement is completely false. Unfortunately, that means what I said
is still not being understood at all. Forgive me for being frustrated.

Finally, a general comment:

I think people are getting hung up on this idea because they’re still
thinking in terms of what is being guaranteed, while this is
explicitly about 

RE: Unicode "no-op" Character?

2019-07-04 Thread Doug Ewell via Unicode
Shawn Steele wrote:

> Even more complicated is that, as pointed out by others, it's pretty
> much impossible to say "these n codepoints should be ignored and have
> no meaning" because some process would try to use codepoints 1-3 for
> some private meaning.  Another would use codepoint 1 for their own
> thing, and there'd be a conflict.

That's pretty much what happened with NUL. It was originally intended (long, 
long before Unicode) to be ignorable and have no meaning, but then other 
processes were designed that gave it specific meaning, and that was pretty much 
that.

While the Unix/C "end of string" convention was not the only case in which NUL 
was hijacked, it is certainly the best-known, and the greatest impediment to 
any current attempt to use it with its original meaning.
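
A small Python sketch of the hijacking Doug describes, simulating a C-style consumer for which the first NUL byte terminates the string, so nothing after an embedded "ignorable" NUL survives:

    def c_string(buf: bytes) -> bytes:
        """Mimic reading a NUL-terminated C string from a byte buffer."""
        return buf.split(b"\x00", 1)[0]

    payload = "ab\u0000cd".encode("utf-8")   # NUL embedded as an "ignorable" marker
    print(c_string(payload))                 # b'ab': everything after the NUL is lost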

--
Doug Ewell | Thornton, CO, US | ewellic.org





RE: Unicode "no-op" Character?

2019-07-03 Thread Shawn Steele via Unicode
I think you're overstating my concern :)

I meant that those things tend to be particular to a certain context and often 
aren't interesting for interchange.  A text editor might find it convenient to 
place word boundaries in the middle of something another part of the system 
thinks is a single unit to be rendered.  At the same time, a rendering engine 
might find it interesting that there's an ff together and want to mark it to be 
shown as a ligature though that text editor wouldn't be keen on that at all.

As has been said, these are private mechanisms for things that individual 
processes find interesting.  It's not useful to mark those for interchange as 
the text editor's word-breaking marks would interfere with the graphics engine's 
glyph-breaking marks.  Not to mention the transmission buffer size marks 
originally mentioned, which could be anywhere.

The "right" thing to do here is to use an internal higher level mechanism to 
keep track of these things however the component needs.  That can even be 
interchanged with another component designed to the same principles, via 
mechanisms like the PUA.  However, those components can't expect their private 
mechanisms are useful or harmless to other processes.  

Even more complicated is that, as pointed out by others, it's pretty much 
impossible to say "these n codepoints should be ignored and have no meaning" 
because some process would try to use codepoints 1-3 for some private meaning.  
Another would use codepoint 1 for their own thing, and there'd be a conflict.  

As a thought experiment, I think it's certainly decent to ask the question 
"could such a mechanism be useful?"  It's an intriguing thought and a decent 
hypothesis that this kind of system could be privately useful to an 
application.  I also think that the conversation has pretty much proven that 
such a system is mathematically impossible.  (You can't have a "private" 
no-meaning codepoint that won't conflict with other "private" uses in a public 
space).

It might be worth noting that this kind of thing used to be fairly common in 
early computing.  Word processors would inject a "CTRL-I" token to toggle 
italics on or off.  Old printers used escape sequences to mark the start of 
bold or italic or underlined text.  Those were private and 
pseudo-private mechanisms that were used internally and/or documented for others 
that wanted to interoperate with their systems.  (The printer folks would tell 
the word processors how to make italics happen, then other printer folks would 
use the same or similar mechanisms for compatibility - except for the dude that 
didn't get the memo and made their own scheme.)

Unicode was explicitly intended *not* to encode any of that kind of markup, 
and, instead, be "plain text," leaving other interesting metadata to other 
higher level protocols.  Whether those be word breaking, sentence parsing, 
formatting, buffer sizing or whatever.

-Shawn

-Original Message-
From: Unicode  On Behalf Of Richard Wordingham via 
Unicode
Sent: Wednesday, July 3, 2019 4:20 PM
To: unicode@unicode.org
Subject: Re: Unicode "no-op" Character?

On Wed, 3 Jul 2019 17:51:29 -0400
"Mark E. Shoulson via Unicode"  wrote:

> I think the idea being considered at the outset was not so complex as 
> these (and indeed, the point of the character was to avoid making 
> these kinds of decisions).

Shawn Steele appeared to be claiming that there was no good, interesting reason 
for separating base character and combining mark.  I was refuting that notion.  
Natural text boundaries can get very messy - some languages have word 
boundaries that can be *within* an indecomposable combining mark.

Richard.



Re: Unicode "no-op" Character?

2019-07-03 Thread Richard Wordingham via Unicode
On Wed, 3 Jul 2019 17:51:29 -0400
"Mark E. Shoulson via Unicode"  wrote:

> I think the idea being considered at the outset was not so complex as 
> these (and indeed, the point of the character was to avoid making
> these kinds of decisions).

Shawn Steele appeared to be claiming that there was no good, interesting
reason for separating base character and combining mark.  I was
refuting that notion.  Natural text boundaries can get very messy -
some languages have word boundaries that can be *within* an
indecomposable combining mark.

Richard.


Re: Unicode "no-op" Character?

2019-07-03 Thread Mark E. Shoulson via Unicode
What you're asking for, then, is completely possible and achievable—but 
not in the Unicode Standard.  It's out of scope for Unicode, it sounds 
like.  You've said you realize it won't happen in Unicode, but it still 
can happen.  Go forth and implement it, then: make your higher-level 
protocol and show its usefulness and get the industry to use and honor 
it because of how handy it is, and best of luck with that.


~mark

On 7/3/19 2:22 PM, Ken Whistler via Unicode wrote:



On 7/3/2019 10:47 AM, Sławomir Osipiuk via Unicode wrote:


Is my idea impossible, useless, or contradictory? Not at all.


What you are proposing is in the realm of higher-level protocols.

You could develop such a protocol, and then write processes that 
honored it, or try to convince others to write processes to honor it. 
You could use PUA characters, or non-characters, or existing control 
codes -- the implications for use of any of those would be slightly 
different, in practice, but in any case would be an HLP.


But your idea is not a feasible part of the Unicode Standard. There 
are no "discardable" characters in Unicode -- *by definition*. The 
discussion of "ignorable" characters in the standard is nuanced and 
complicated, because there are some characters which are carefully 
designed to be transparent to some, well-specified processes, but not 
to others. But no characters in the standard are (or can be) ignorable 
by *all* processes, nor can a "discardable" character ever be defined 
as part of the standard.


The fact that there are a myriad of processes implemented (and 
distributed who knows where) that do 7-bit ASCII (or 8-bit 8859-1) 
conversion to/from UTF-16 by integral type conversion is a simple 
existence proof that U+000F is never, ever, ever, ever going to be 
defined to be "discardable" in the Unicode Standard.


--Ken






Re: Unicode "no-op" Character?

2019-07-03 Thread Mark E. Shoulson via Unicode
I think the idea being considered at the outset was not so complex as 
these (and indeed, the point of the character was to avoid making these 
kinds of decisions). There was a desire for some reason to be able to 
chop up a string into equal-length pieces or something, and some of 
those divisions might wind up between bases and diacritics or who knows 
where else.  Rather than have to work out acceptable places to place the 
characters, the request was for a no-op character that could safely be 
plopped *anywhere*, even in the middle of combinations like that.


~mark

On 6/23/19 4:24 AM, Richard Wordingham via Unicode wrote:

On Sat, 22 Jun 2019 23:56:50 +
Shawn Steele via Unicode  wrote:


+ the list.  For some reason the list's reply header is confusing.

From: Shawn Steele
Sent: Saturday, June 22, 2019 4:55 PM
To: Sławomir Osipiuk 
Subject: RE: Unicode "no-op" Character?

The original comment about putting it between the base character and
the combining diacritic seems peculiar.  I'm having a hard time
visualizing how that kind of markup could be interesting?

There are a number of possible interesting scenarios:

1) Chopping the string into user perceived characters.  For example,
the Khmer sequences of COENG plus letter are named sequences.  Akin to
this is identifying resting places for a simple cursor, e.g. allowing it
to be positioned between a base character and a spacing, unreordered
subscript.  (This last possibility overlaps with rendering.)

2) Chopping the string into collating elements.  (This can require
renormalisation, and may raise a rendering issue with HarfBuzz, where
renormalisation is required to get marks into a suitable order for
shaping.  I suspect no-op characters would disrupt this
renormalisation; CGJ may legitimately be used to affect rendering this
way, even though it is supposed to have no other effect* on rendering.)

3) Chopping the string into default grapheme clusters.  That
separates a coeng from the following character with which it
interacts.

*Is a Unicode-compliant *renderer* allowed to distinguish diaeresis
from the umlaut mark?

Richard.





Re: Unicode "no-op" Character?

2019-07-03 Thread Mark E. Shoulson via Unicode
Um... How could you be sure that process X would get the no-ops that 
process W wrote?  After all, it's *discardable*, like you said, and the 
database programs and libraries aren't in on the secret.  The database 
API functions might well strip it out, because it carries no meaning to 
them. Unless you can count on _certain_ programs not discarding it, and 
then you'd need either specialty libraries or some kind of registry or 
terminology for "this program does NOT strip no-ops" vs ones that do... 
But then they wouldn't be discardable, would they?  Not by 
non-discarding programs.  Which would have to have ways to pass them 
around between themselves.


Moreover, as you say, what about when Process Z (or its companions) 
comes along and is using THE SAME MECHANISM for something utterly 
different?  How does it know that process W wasn't writing no-ops for 
it, but was writing them for Process X?  And of course, Z will trash 
them and insert its own there, and when process X comes to read it, they 
won't be there. You'd need to make sure that NOBODY is allowed to touch 
the string between *pairs* of generators and consumers of no-ops, 
specifically designated for each other.


Yes, this is about consensual acts between responsible processes W and 
X, but that's exactly what the PUA is for: being assigned meaning 
between consenting processes. And they are not discardable by 
non-consenting processes, precisely because they mean something to 
someone.  If your no-ops carry meaning, they are going to need to be 
preserved and passed around and not thrown away.  If they carry no 
meaning, why are you dealing with them?  Yes, PUA characters are 
annoying and break up grapheme clusters and stuff.  But they're the only 
way to do what you're trying to do.


~mark

On 7/3/19 11:44 AM, Sławomir Osipiuk via Unicode wrote:


A process, let’s call it Process W, adds a bunch of U+000F to a string 
it received, or built, or a user entered via keyboard. Maybe it’s to 
packetize. Maybe to mark every word that is an anagram of the name of 
a famous 19th-century painter, or that represents a pizza topping. 
Maybe something else. This is a versatile character. Process W is done 
adding U+000F to the string. It stores it in a database UTF-8 encoded 
field. Encoding isn’t a problem. The database is happy.


Now Process X runs. Process X is meant to work with Process W and it’s 
well-aware of how U+000F is used. It reads the string from the 
database. It sees U+000F and interprets it. It chops the string into 
packets, or does a websearch for each famous painter, or it orders 
pizza. The private meaning of U+000F is known to both Process X and 
Process W. There is useful information encoded in-band, within a 
limited private context.


But now we have Process Y. Process Y doesn’t care about packets or 
painters or pizza. Process Y runs outside of the private context that 
X and W had. Process Y translates strings into Morse code for 
transmission. As part of that, it replaces common words with 
abbreviations. Process Y doesn’t interpret U+000F. Why would it? It 
has no semantic value to Process Y.


Process Y reads the string from the database. Internally, it clears 
all instances of U+000F from the string. They’re just taking up space. 
They’re meaningless to Y. It compiles the Morse code sequence into an 
audio file.


But now we have Process Z. Process Z wants to take a string and mark 
every instance of five contiguous Latin consonants. It scrapes the 
database looking for text strings. It finds the string Process W 
created and marked. Z has no obligation to W. It’s not part of that 
private context. Process Z clears all instances of U+000F it finds, 
then inserts its own wherever it finds five-consonant clusters. It 
stores its results in a UTF-16LE text file. It’s allowed to do that.


Nothing impossible happened here. Let’s summarize:

Processes W and X established a private meaning for U+000F by 
agreement and interacted based on that meaning.


Process Y ignored U+000F completely because it assigned no meaning to it.

Process Z assigned a completely new meaning to U+000F. That’s 
permitted because U+000F is special and is guaranteed to have no 
semantics without private agreement and doesn’t need to be preserved.


There is no need to escape anything. Escaping is used when a character 
must have more than one meaning (i.e. it is overloaded, as when it is 
both text and markup). U+000F only gets one meaning in any context. In 
a new context, the meaning gets overridden, not overloaded. That’s 
what makes it special.


I don’t expect to see any of this in official Unicode. But I take 
exception to the idea that I’m suggesting something impossible.


*From:*Philippe Verdy [mailto:verd...@wanadoo.fr]
*Sent:* Wednesday, July 03, 2019 04:49
*To:* Sławomir Osipiuk
*Cc:* unicode Unicode Discussion
*Subject:* Re: Unicode "no-op" Character?

Your goal is **impossible** to reach with Unicode. As

Re: Unicode "no-op" Character?

2019-07-03 Thread Ken Whistler via Unicode


On 7/3/2019 10:47 AM, Sławomir Osipiuk via Unicode wrote:


Is my idea impossible, useless, or contradictory? Not at all.


What you are proposing is in the realm of higher-level protocols.

You could develop such a protocol, and then write processes that honored 
it, or try to convince others to write processes to honor it. You could 
use PUA characters, or non-characters, or existing control codes -- the 
implications for use of any of those would be slightly different, in 
practice, but in any case would be an HLP.


But your idea is not a feasible part of the Unicode Standard. There are 
no "discardable" characters in Unicode -- *by definition*. The 
discussion of "ignorable" characters in the standard is nuanced and 
complicated, because there are some characters which are carefully 
designed to be transparent to some, well-specified processes, but not to 
others. But no characters in the standard are (or can be) ignorable by 
*all* processes, nor can a "discardable" character ever be defined as 
part of the standard.


The fact that there are a myriad of processes implemented (and 
distributed who knows where) that do 7-bit ASCII (or 8-bit 8859-1) 
conversion to/from UTF-16 by integral type conversion is a simple 
existence proof that U+000F is never, ever, ever, ever going to be 
defined to be "discardable" in the Unicode Standard.


--Ken




Re: Unicode "no-op" Character?

2019-07-03 Thread Rebecca Bettencourt via Unicode
On Wed, Jul 3, 2019 at 8:47 AM Sławomir Osipiuk via Unicode <
unicode@unicode.org> wrote:

> Security gateways filter it out completely, as a matter of best practice
> and security-in-depth.
>
>
>
> A process, let’s call it Process W, adds a bunch of U+000F to a string it
> received, or built, or a user entered via keyboard. ...
>
It stores in it a database UTF-8 encoded field...
>

And the database driver filters out the U+000F completely as a matter of
best practice and security-in-depth.

You can't say "this character should be ignored everywhere" and "this
character should be preserved everywhere" at the same time. That's the
contradiction.


RE: Unicode "no-op" Character?

2019-07-03 Thread Sławomir Osipiuk via Unicode
The fact that this would require a change that is unlikely to occur is a fact I 
have stated repeatedly. It is pointless to tell me that.

 

The rest of the thread, after my initial question was answered, was a thought 
experiment, and while I strongly disagree that such posts are “pointless” 
(actually, reading through the archives of this mailing list it is those ideas 
that have fascinated me the most and I found most engaging and enlightening) I 
admit I’m new here, so I will defer.

 

Is my idea unrealistic at this point in time? Yes. I have admitted so.

 

Is my idea impossible, useless, or contradictory? Not at all.

 

 

From: Mark Davis ☕️ [mailto:m...@macchiato.com] 
Sent: Wednesday, July 03, 2019 13:33
To: Sławomir Osipiuk
Cc: verdy_p; unicode Unicode Discussion
Subject: Re: Unicode "no-op" Character?

 

Your goal is not achievable. We can't wave a magic wand and have all processes 
everywhere suddenly (or even within decades) ignore U+000F in all processing.

 

This thread is pointless and should be terminated.



Re: Unicode "no-op" Character?

2019-07-03 Thread Mark Davis ☕️ via Unicode
Your goal is not achievable. We can't wave a magic wand and have all
processes everywhere suddenly (or even within decades) ignore U+000F in
all processing.

This thread is pointless and should be terminated.

Mark


On Wed, Jul 3, 2019 at 5:48 PM Sławomir Osipiuk via Unicode <
unicode@unicode.org> wrote:

> I’m frustrated at how badly you seem to be missing the point. There is
> nothing impossible nor self-contradictory here. There is only the matter
> that Unicode requires all scalar values to be preserved during interchange.
> This is in many ways a good idea, and I don’t expect it to change, but
> something else would be possible if this requirement were explicitly
> dropped for a well-defined small subset of characters (even just one
> character). A modern-day SYN.
>
>
>
> Let’s say it’s U+000F. The standard takes my proposal and makes it a
> discardable, null-displayable character. What does this mean?
>
>
>
> U+000F may appear in any text. It has no (external) semantic value. But it
> may appear. It may appear a lot.
>
>
>
> Display routines (which are already dealing with combining, ligaturing,
> non-/joiners, variations, initial/medial/finals forms) understand that
> U+000F is to be processed as a no-op. Do nothing with this. Drop it. Move
> to the next character. Simple.
>
>
>
> Security gateways filter it out completely, as a matter of best practice
> and security-in-depth.
>
>
>
> A process, let’s call it Process W, adds a bunch of U+000F to a string it
> received, or built, or a user entered via keyboard. Maybe it’s to
> packetize. Maybe to mark every word that is an anagram of the name of a
> famous 19th-century painter, or that represents a pizza topping. Maybe
> something else. This is a versatile character. Process W is done adding
> U+000F to the string. It stores in it a database UTF-8 encoded field.
> Encoding isn’t a problem. The database is happy.
>
>
>
> Now Process X runs. Process X is meant to work with Process W and it’s
> well-aware of how U+000F is used. It reads the string from the database. It
> sees U+000F and interprets it. It chops the string into packets, or does a
> websearch for each famous painter, or it orders pizza. The private meaning
> of U+000F is known to both Process X and Process W. There is useful
> information encoded in-band, within a limited private context.
>
>
>
> But now we have Process Y. Process Y doesn’t care about packets or
> painters or pizza. Process Y runs outside of the private context that X and
> W had. Process Y translates strings into Morse code for transmission. As
> part of that, it replaces common words with abbreviations. Process Y
> doesn’t interpret U+000F. Why would it? It has no semantic value to Process
> Y.
>
>
>
> Process Y reads the string from the database. Internally, it clears all
> instances of U+000F from the string. They’re just taking up space. They’re
> meaningless to Y. It compiles the Morse code sequence into an audio file.
>
>
>
> But now we have Process Z. Process Z wants to take a string and mark every
> instance of five contiguous Latin consonants. It scrapes the database
> looking for text strings. It finds the string Process W created and marked.
> Z has no obligation to W. It’s not part of that private context. Process Z
> clears all instances of U+000F it finds, then inserts its own wherever it
> finds five-consonant clusters. It stores its results in a UTF-16LE text
> file. It’s allowed to do that.
>
>
>
> Nothing impossible happened here. Let’s summarize:
>
>
>
> Processes W and X established a private meaning for U+000F by agreement
> and interacted based on that meaning.
>
>
>
> Process Y ignored U+000F completely because it assigned no meaning to it.
>
>
>
> Process Z assigned a completely new meaning to U+000F. That’s permitted
> because U+000F is special and is guaranteed to have no semantics without
> private agreement and doesn’t need to be preserved.
>
>
>
> There is no need to escape anything. Escaping is used when a character
> must have more than one meaning (i.e. it is overloaded, as when it is both
> text and markup). U+000F only gets one meaning in any context. In a new
> context, the meaning gets overridden, not overloaded. That’s what makes it
> special.
>
>
>
> I don’t expect to see any of this in official Unicode. But I take
> exception to the idea that I’m suggesting something impossible.
>
>
>
>
>
> *From:* Philippe Verdy [mailto:verd...@wanadoo.fr]
> *Sent:* Wednesday, July 03, 2019 04:49
> *To:* Sławomir Osipiuk
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Unicode "no-op" Character?
>
>
>
> Your goal is **impossible** to reach with Unicode. Assume such a character
> is "added" to the UCS; then it can appear in the text. Your goal being that
> it should be "guaranteed" not to be used in any text means that your
> "character" cannot be encoded at all.
>


RE: Unicode "no-op" Character?

2019-07-03 Thread Sławomir Osipiuk via Unicode
I’m frustrated at how badly you seem to be missing the point. There is nothing 
impossible nor self-contradictory here. There is only the matter that Unicode 
requires all scalar values to be preserved during interchange. This is in many 
ways a good idea, and I don’t expect it to change, but something else would be 
possible if this requirement were explicitly dropped for a well-defined small 
subset of characters (even just one character). A modern-day SYN.

 

Let’s say it’s U+000F. The standard takes my proposal and makes it a 
discardable, null-displayable character. What does this mean?

 

U+000F may appear in any text. It has no (external) semantic value. But it may 
appear. It may appear a lot.

 

Display routines (which are already dealing with combining, ligaturing, 
non-/joiners, variations, initial/medial/finals forms) understand that U+000F 
is to be processed as a no-op. Do nothing with this. Drop it. Move to the next 
character. Simple.

 

Security gateways filter it out completely, as a matter of best practice and 
security-in-depth.

 

A process, let’s call it Process W, adds a bunch of U+000F to a string it 
received, or built, or a user entered via keyboard. Maybe it’s to packetize. 
Maybe to mark every word that is an anagram of the name of a famous 
19th-century painter, or that represents a pizza topping. Maybe something else. 
This is a versatile character. Process W is done adding U+000F to the string. 
It stores it in a database UTF-8 encoded field. Encoding isn’t a problem. The 
database is happy.

 

Now Process X runs. Process X is meant to work with Process W and it’s 
well-aware of how U+000F is used. It reads the string from the database. It 
sees U+000F and interprets it. It chops the string into packets, or does a 
websearch for each famous painter, or it orders pizza. The private meaning of 
U+000F is known to both Process X and Process W. There is useful information 
encoded in-band, within a limited private context.

 

But now we have Process Y. Process Y doesn’t care about packets or painters or 
pizza. Process Y runs outside of the private context that X and W had. Process 
Y translates strings into Morse code for transmission. As part of that, it 
replaces common words with abbreviations. Process Y doesn’t interpret U+000F. 
Why would it? It has no semantic value to Process Y.

 

Process Y reads the string from the database. Internally, it clears all 
instances of U+000F from the string. They’re just taking up space. They’re 
meaningless to Y. It compiles the Morse code sequence into an audio file.

 

But now we have Process Z. Process Z wants to take a string and mark every 
instance of five contiguous Latin consonants. It scrapes the database looking 
for text strings. It finds the string Process W created and marked. Z has no 
obligation to W. It’s not part of that private context. Process Z clears all 
instances of U+000F it finds, then inserts its own wherever it finds 
five-consonant clusters. It stores its results in a UTF-16LE text file. It’s 
allowed to do that.

 

Nothing impossible happened here. Let’s summarize:

 

Processes W and X established a private meaning for U+000F by agreement and 
interacted based on that meaning.

 

Process Y ignored U+000F completely because it assigned no meaning to it.

 

Process Z assigned a completely new meaning to U+000F. That’s permitted because 
U+000F is special and is guaranteed to have no semantics without private 
agreement and doesn’t need to be preserved.

 

There is no need to escape anything. Escaping is used when a character must 
have more than one meaning (i.e. it is overloaded, as when it is both text and 
markup). U+000F only gets one meaning in any context. In a new context, the 
meaning gets overridden, not overloaded. That’s what makes it special.

 

I don’t expect to see any of this in official Unicode. But I take exception to 
the idea that I’m suggesting something impossible.
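
A minimal Python sketch of the W/X/Y/Z roles above, using the hypothetical U+000F marker; the topping words and marking scheme are made up purely for illustration:

    MARK = "\u000f"

    def process_w(text: str) -> str:
        """W: privately marks words naming pizza toppings."""
        toppings = {"olive", "basil"}
        return " ".join(MARK + w if w in toppings else w for w in text.split())

    def process_x(marked: str):
        """X: shares W's agreement, so it reads the marks back out."""
        return [w.lstrip(MARK) for w in marked.split() if w.startswith(MARK)]

    def process_y(marked: str) -> str:
        """Y: outside the agreement; the marks mean nothing, so it drops them."""
        return marked.replace(MARK, "")

    def process_z(marked: str) -> str:
        """Z: drops inherited marks, then assigns U+000F its own new meaning."""
        clean = marked.replace(MARK, "")
        return clean.replace(" ", MARK + " ")   # Z's unrelated private marker

    s = process_w("fresh basil and olive bread")
    print(process_x(s))      # ['basil', 'olive']
    print(process_y(s))      # 'fresh basil and olive bread', marks discarded
    print(repr(process_z(s)))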

 

 

From: Philippe Verdy [mailto:verd...@wanadoo.fr] 
Sent: Wednesday, July 03, 2019 04:49
To: Sławomir Osipiuk
Cc: unicode Unicode Discussion
Subject: Re: Unicode "no-op" Character?

 

Your goal is **impossible** to reach with Unicode. Assume such a character is 
"added" to the UCS; then it can appear in the text. Your goal being that it 
should be "guaranteed" not to be used in any text means that your "character" 
cannot be encoded at all.



Aw: Re: Unicode "no-op" Character?

2019-07-03 Thread Marius Spix via Unicode

A few suggestions

 

There is a reason why the C standard library function fgetc(FILE*) returns an int instead of a char: the constant EOF (end of file) must lie outside the range of valid char values.
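
A rough Python analogue of that point, assuming nothing beyond the standard library: the end-of-stream signal is out of band (read(1) returns an empty bytes object, a value no data byte can equal), so no in-band byte has to be reserved as a terminator.

    import io

    stream = io.BytesIO(b"\x00\x0e\x0f")    # even NUL and 0x0E/0x0F are plain data
    while True:
        b = stream.read(1)
        if b == b"":                         # out-of-band end-of-stream sentinel
            break
        print(b)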

 

Some encodings like Base64 or Quoted-printable use the escape character =, but make sure that you can still encode this escape character in another way.

 

Another possible encoding would be using a "continue" flag. For example, you could use the least significant bit to signal whether the stream ends or is continued; this allows you to encode 7 bits per byte and is used for arbitrary-length integers or other variable-length structures where terminator characters like 0x00 may be part of the data.
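
A sketch of that continue-flag scheme, with the low bit of each byte meaning "another byte follows" as described above (other variable-length integer formats put the flag in the high bit instead):

    def encode_varint(n: int) -> bytes:
        """Encode a non-negative integer, 7 data bits per byte."""
        out = bytearray()
        while True:
            bits, n = n & 0x7F, n >> 7
            more = 1 if n else 0
            out.append((bits << 1) | more)   # low bit = continue flag
            if not more:
                return bytes(out)

    def decode_varint(buf: bytes):
        """Return (value, number of bytes consumed)."""
        value, shift, i = 0, 0, 0
        while True:
            byte = buf[i]
            value |= (byte >> 1) << shift
            shift, i = shift + 7, i + 1
            if not (byte & 1):               # flag clear: this was the last byte
                return value, i

    print(decode_varint(encode_varint(300)))   # (300, 2)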

 

 

 

Sent: Wednesday, 3 July 2019 at 10:49
From: "Philippe Verdy via Unicode" 
To: "Sławomir Osipiuk" 
Cc: "unicode Unicode Discussion" 
Subject: Re: Unicode "no-op" Character?



On Wed, Jul 3, 2019 at 06:09, Sławomir Osipiuk <sosip...@gmail.com> wrote:

I don’t think you understood me at all. I can packetize a string with any character that is guaranteed not to appear in the text.

Your goal is **impossible** to reach with Unicode. Assume such a character is "added" to the UCS; then it can appear in the text. Your goal being that it should be "guaranteed" not to be used in any text means that your "character" cannot be encoded at all. Unicode and ISO **require** that any proposed character can be used in text without limitation. Logically it would be rejected because your character would not be usable at all from the start.

So you have no choice: you must use some transport format for your "packeting", just like what is used in MIME for emails, in HTTP(S) for streaming, or in internationalized domain names.

For your escaping mechanism you have a very large choice already of characters considered special only for your chosen transport syntax.

Your goal shows a chicken-and-egg problem. It is not solvable without creating self-contradictions immediately (and if you attempt to add some restriction to avoid the contradiction, then you'll fall on cases where you can no longer transport your message and your protocol will become unusable).








Re: Unicode "no-op" Character?

2019-07-03 Thread Philippe Verdy via Unicode
Also consider that C0 controls (like STX and ETX) can already be used for
packetizing, but immediately comes the need for escaping (DLE has been used
for that goal, just before the character to preserve in the stream content,
notably before DLE itself, or STX and ETX).
There's then no need at all for any new character in Unicode. But if your
protocol does not allow any form of escaping, then it is broken, as it
cannot transport **all** valid Unicode text.
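
A small Python sketch of that classic framing, assuming the usual C0 meanings: STX/ETX delimit the packet and DLE is stuffed in front of any payload byte that collides with one of the three.

    STX, ETX, DLE = 0x02, 0x03, 0x10

    def frame(payload: bytes) -> bytes:
        body = bytearray()
        for b in payload:
            if b in (STX, ETX, DLE):
                body.append(DLE)             # escape the reserved byte
            body.append(b)
        return bytes([STX]) + bytes(body) + bytes([ETX])

    def unframe(packet: bytes) -> bytes:
        out, escaped = bytearray(), False
        for b in packet[1:-1]:               # strip STX ... ETX
            if not escaped and b == DLE:
                escaped = True               # drop the DLE, keep the next byte
                continue
            out.append(b)
            escaped = False
        return bytes(out)

    msg = "any text, including \x03 and \x10".encode("utf-8")
    print(unframe(frame(msg)) == msg)        # True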

On Wed, Jul 3, 2019 at 10:49, Philippe Verdy wrote:

> On Wed, Jul 3, 2019 at 06:09, Sławomir Osipiuk wrote:
>
>> I don’t think you understood me at all. I can packetize a string with any
>> character that is guaranteed not to appear in the text.
>>
>
> Your goal is **impossible** to reach with Unicode. Assume such a character
> is "added" to the UCS; then it can appear in the text. Your goal being that
> it should be "guaranteed" not to be used in any text means that your
> "character" cannot be encoded at all. Unicode and ISO **require** that any
> proposed character can be used in text without limitation. Logically it
> would be rejected because your character would not be usable at all from
> the start.
>
> So you have no choice: you must use some transport format for your
> "packeting", just like what is used in MIME for emails, in HTTP(S) for
> streaming, or in internationalized domain names.
>
> For your escaping mechanism you have a very large choice already of
> characters considered special only for your chosen transport syntax.
>
> Your goal shows a chicken-and-egg problem. It is not solvable without
> creating self-contradictions immediately (and if you attempt to add some
> restriction to avoid the contradiction, then you'll fall on cases where you
> can no longer transport your message and your protocol will become unusable).
>


Re: Unicode "no-op" Character?

2019-07-03 Thread Philippe Verdy via Unicode
On Wed, Jul 3, 2019 at 06:09, Sławomir Osipiuk wrote:

> I don’t think you understood me at all. I can packetize a string with any
> character that is guaranteed not to appear in the text.
>

Your goal is **impossible** to reach with Unicode. Assume such a character is
"added" to the UCS; then it can appear in the text. Your goal being that it
should be "guaranteed" not to be used in any text means that your
"character" cannot be encoded at all. Unicode and ISO **require** that any
proposed character can be used in text without limitation. Logically it
would be rejected because your character would not be usable at all from
the start.

So you have no choice: you must use some transport format for your
"packeting", just like what is used in MIME for emails, in HTTP(S) for
streaming, or in internationalized domain names.

For your escaping mechanism you have a very large choice already of
characters considered special only for your chosen transport syntax.

Your goal shows a chicken-and-egg problem. It is not solvable without
creating self-contradictions immediately (and if you attempt to add some
restriction to avoid the contradiction, then you'll fall on cases where you
can no longer transport your message and your protocol will become unusable).


RE: Unicode "no-op" Character?

2019-07-02 Thread Sławomir Osipiuk via Unicode
I don’t think you understood me at all. I can packetize a string with any 
character that is guaranteed not to appear in the text. Suggestions of TAB or 
EQUALS don’t even meet that simple criterion; they often appear in text. They 
require some kind of special escaping mechanism.

 

But assume my string has a chosen character for indicating packets. But before 
I send it out, I want to show the string to the user. I can’t just throw it 
into a display method. I’d have TABs or EQUALs or UNKNOWN GLYPHs all over the 
place visible to the user. I don’t want that. So now I have to make a new copy 
of the string with my special boundary-char removed, then display that copied 
string. Or I could keep the original string, from before I added the packet 
boundaries, but that’s if I predict or assume ahead of time that I will need to 
display it, which in reality I might not. But that still means two copies of 
the string, one of which might be a waste. More code. More processing.

 

I can do all that. But why?

 

This thread is about a tool for convenience. I don’t “need” it, in the sense 
that a task is insoluble without it. I’m a programmer, I know how to code. I 
“want” it, because a tool like that would make some tasks much faster and 
simpler. Your proposed solution doesn’t.

 

From: Philippe Verdy [mailto:verd...@wanadoo.fr] 
Sent: Saturday, June 29, 2019 15:47
To: Sławomir Osipiuk
Cc: Shawn Steele; unicode Unicode Discussion
Subject: Re: Unicode "no-op" Character?

 

If you want to "packetize" arbitrarily long Unicode text, you don't need any 
new magic character. Just prepend your packet with a base character used as a 
syntaxic delimiter, that does not combine with what follows in any 
normalization.



Re: Unicode "no-op" Character?

2019-06-29 Thread Philippe Verdy via Unicode
If you want to "packetize" arbitrarily long Unicode text, you don't need
any new magic character. Just prepend your packet with a base character
used as a syntactic delimiter that does not combine with what follows in
any normalization.

There's a fine character for that: the TAB control. Except that during
transmission it may turn into a SPACE that would combine. (the same will
happen with "=" which can combine with a combining slash).

But look at the normalization data (and consider that Unicode guarantees
that there will not be any addition of a new combining pair starting with the
same base character): there are a LOT of suitable base characters in Unicode
which you can use as a syntactic delimiter.

Some examples (in the ASCII subset) include the hyphen-minus, the
apostrophe-quote, the double quotation mark...

So it's easy to split an arbitrarily long text at arbitrary character
position, even in the middle of any cluster or combining sequence. It does
not matter that this character may create a "cluster" with the following
character, your "packetized" stream is still not readable text, but only a
transport syntax (just like quoted-printable, or Base64).

You can also freely choose the base character at the end of each packet
(newlines are not safe as lines may be merged, but like Base64, "=" is fine
to terminate each packet, as well as the two ASCII quotation marks, and in fact
all punctuation and symbols from ASCII; you can even use the ASCII letters
and digits).

If your packets have variable lengths, you may need to use escaping, or you
may prepend the length (in characters or in combining sequences) of your
packet before the expected terminator.

All this is used in MIME for attachments in emails, with the two common
transport syntaxes: Quoted-Printable, which uses escaping, and Base64, which
does not require any length but requires a distinctive terminator (not used to
encode the data part of the "packet") for variable-length "packets".
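
A minimal Python sketch of the length-prefix alternative mentioned above, with "=" as an illustrative field terminator; because each payload's length in characters is declared up front, the payload itself never needs escaping.

    def pack(packets) -> str:
        return "".join(f"{len(p)}={p}" for p in packets)

    def unpack(stream: str):
        out, i = [], 0
        while i < len(stream):
            j = stream.index("=", i)          # end of the decimal length field
            n = int(stream[i:j])
            out.append(stream[j + 1:j + 1 + n])
            i = j + 1 + n
        return out

    chunks = ["naï", "ve = ", "text"]         # payloads may freely contain '='
    print(unpack(pack(chunks)) == chunks)     # True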





On Sun, Jun 23, 2019 at 02:35, Sławomir Osipiuk via Unicode <
unicode@unicode.org> wrote:

> I assure you, it wasn’t very interesting. :-) Headache-y, more like. The
> diacritic thing was completely inapplicable anyway, as all our text was
> plain English. I really don’t want to get into what the thing was, because
> it sounds stupider the more I try to explain it. But it got the wheels
> spinning in my head, and now that I’ve been reading up a lot about Unicode
> and older standards like 2022/6429, it got me thinking whether there might
> already be an elegant solution.
>
>
>
> But, as an example I’m making up right now, imagine you want to packetize
> a large string. The packets are not all equal sized, the sizes are
> determined by some algorithm. And the packet boundary may occur between a
> base char and a diacritic. You insert markers into the string at the packet
> boundaries. You can then store the string, copy it, display it, or pass it
> to the sending function which will scan the string and know to send the
> next packet when it reaches the marker. And you can now do all that without
> the need to pass around extra metadata (like a list of ints of where the
> packet boundaries are supposed to be) or to re-calculate the boundaries;
> it’s still just a big string. If a different application sees the string,
> it will know to completely ignore the packet markers; it can even strip
> them out if it wants to (the canonical equivalent of the noop character is
> the absence of a character).
>
>
>
> As should be obvious, I’m not recommending this as good practice.
>
>
>
>
>
> *From:* Shawn Steele [mailto:shawn.ste...@microsoft.com]
> *Sent:* Saturday, June 22, 2019 19:57
> *To:* Sławomir Osipiuk; unicode@unicode.org
> *Subject:* RE: Unicode "no-op" Character?
>
>
>
> + the list.  For some reason the list’s reply header is confusing.
>
>
>
> *From:* Shawn Steele
> *Sent:* Saturday, June 22, 2019 4:55 PM
> *To:* Sławomir Osipiuk 
> *Subject:* RE: Unicode "no-op" Character?
>
>
>
> The original comment about putting it between the base character and the
> combining diacritic seems peculiar.  I’m having a hard time visualizing how
> that kind of markup could be interesting?
>
>
>
> *From:* Unicode  *On Behalf Of *Slawomir
> Osipiuk via Unicode
> *Sent:* Saturday, June 22, 2019 2:02 PM
> *To:* unicode@unicode.org
> *Subject:* RE: Unicode "no-op" Character?
>
>
>
> I see there is no such character, which I pretty much expected after
> Google didn’t help.
>
>
>
> The original problem I had was solved long ago but the recent article
> about watermarking reminded me of it, and my question was mostly out of
> curiosity. The task wasn’t, strictly speaking, about “padding”, but about
> marking – injecting “flag” characters at arbitrary points in a string
> without affecting the resulting visible text. I think we ended up using
> ESC, which is a dumb choice in retrospect, though the whole approach was a
> bit of a hack anyway and the process it was for isn’t being used anymore.
>


New control characters! (was: Re: Unicode "no-op" Character?)

2019-06-25 Thread Sławomir Osipiuk via Unicode
All right. Thanks to everyone who offered suggestions. I think the final
choice will depend on the specific application, if I ever face this puzzle
again.

 

If nothing else, this discussion has helped me formulate what exactly it is
I'm imagining, which is actually a bit different than what I started with.
So, just to put it out there for the internet to archive (with the likes of
the various proposed "unofficial" UTFs I've been reading about), here are my
two proposed control characters (why just one when you can have two at twice
the price?)

 

Implementors, feel free to jump right on this. :-) I chose to assign them to
0xE and 0xF because the use of ISO2022-style stateful shifts is expressly
not permitted by ISO 10646, so by my reading the existence of those code
points inside a UCS stream is a roundabout error. Therefore I'm reclaiming
them for something useful.

 

 

EP1 - EPHEMERAL PRIVATE SENTINEL 1 (0x0E)

 

EP1 is executed as a null operation at the presentation layer. The formation
of ligatures, the behavior of combining characters, and similar presentation
mechanisms, must proceed as if EP1 were not present even when it occurs
within sequences that effect such mechanisms.

EP1 is intended to be used as a private process-internal sentinel or flag
character. EP1 may be added at any position in the character stream. EP1
may be removed from the stream by any receiving process that has not
established an agreement for special handling of EP1.

EP1 should be removed from the stream prior to any security validation. It
must not interfere with the recognition of security-sensitive keywords,
sequences, or credentials.

 

 

EP2 - EPHEMERAL PRIVATE SENTINEL 2 (0x0F)

 

EP2 is executed as a null operation at the presentation layer. The formation
of ligatures, the behavior of combining characters, and similar presentation
mechanisms, must proceed as if EP2 were not present even when it occurs
within sequences that effect such mechanisms.

EP2 is intended to be used as a private process-internal sentinel or flag
character. EP2 may be added at any position in the character stream. EP2
may be removed from the stream by any receiving process that has not
established an agreement for special handling of EP2.

EP2 should be removed from the stream prior to any security validation. It
must not interfere with the recognition of security-sensitive keywords,
sequences, or credentials.
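
A minimal Python sketch of how a process with no special agreement might honour these hypothetical definitions, treating EP1/EP2 as droppable before display (in the real standard 0x0E and 0x0F are SHIFT OUT and SHIFT IN, not this):

    EP1, EP2 = "\u000e", "\u000f"

    def display(text: str) -> str:
        """Render as if EP1/EP2 were not present."""
        return "".join(ch for ch in text if ch not in (EP1, EP2))

    def receive(text: str) -> str:
        """No agreement for special handling, so the sentinels may be removed."""
        return display(text)                  # same operation: drop them

    marked = "pack\u000fetized\u000f text"
    print(display(marked))                    # 'packetized text'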



Re: Unicode "no-op" Character?

2019-06-24 Thread J Decker via Unicode
On Mon, Jun 24, 2019 at 5:35 PM David Starner via Unicode <
unicode@unicode.org> wrote:

> On Sun, Jun 23, 2019 at 10:41 PM Shawn Steele via Unicode
>  wrote:
>
> IMO, since it's unlikely that anyone expects
> that they can transmit a NUL through an arbitrary channel, unlike a
> random private use character.

You would be wrong.
NUL is a valid codepoint like any other, except in the C standard
library and its descendants.
And I expect it to be maintained. And, for the most part, it is (except for
emscripten).

>
> --
> Kie ekzistas vivo, ekzistas espero.
>


Re: Unicode "no-op" Character?

2019-06-24 Thread David Starner via Unicode
On Sun, Jun 23, 2019 at 10:41 PM Shawn Steele via Unicode
 wrote:
> Which leads us to the key.  The desire is for a character that has no public 
> meaning, but has some sort of private meaning.  In other words it has a 
> private use.  Oddly enough, there is a group of characters intended for 
> private use, in the PUA ;-)

Who's private use? If you have a stream of data that is being
packetted for transmission, using a Private Use character is likely to
mangle data that is being transmitted at some point. A NUL is likely
to be the best option, IMO, since it's unlikely that anyone expects
that they can transmit a NUL through an arbitrary channel, unlike a
random private use character.

-- 
Kie ekzistas vivo, ekzistas espero.


RE: Unicode "no-op" Character?

2019-06-24 Thread Sławomir Osipiuk via Unicode
It's discardable outside of the context/process that created it.
For a receiving process there is a difference between "this character has a
meaning you don't understand" and "this character had a transitory meaning
that has been exhausted".
The first implies that it needs to be preserved and survive round-trip
transmission (in fact the Unicode standard requires that). The second
implies that it can be discarded.
The first implies that it should be displayed to the user even if only as an
"unknown something here". The second implies it should be ignored completely
in display.

Noncharacters have a use as internal-only sentinels, but they are difficult
for an intermediate process to use if the text it receives already contains
them (http://www.unicode.org/faq/private_use.html#nonchar10) and they break
up combinations (they have a display effect, even if it's a subtle one).

Private Use Characters are nice but they are still "part of" the text; if
they are removed, the text is semantically changed. And they too display as
something. I have to go back to how the SYN control character is defined.
ECMA16/ISO1745 says "SYN is generally removed at the receiving Terminal
Installation." It has a transitory purpose that is exhausted as soon as it
is received. I wish Unicode hadn't shied away from either formalizing SYN or
providing some kind of equivalent. I know it wasn't part of the scope
Unicode set for itself, but I can still dream.
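
A small Python sketch of the internal-only sentinel pattern mentioned above, using the noncharacter U+FFFF strictly inside one function and never emitting it (and showing the FAQ's awkward case of incoming text that already contains it):

    SENTINEL = "\uffff"                            # a noncharacter, internal use only

    def split_at(text: str, positions):
        """Split text at the given offsets via an internal sentinel."""
        if SENTINEL in text:                       # incoming text already uses it
            text = text.replace(SENTINEL, "")      # (the FAQ's awkward case)
        for offset, pos in enumerate(sorted(positions)):
            text = text[:pos + offset] + SENTINEL + text[pos + offset:]
        return text.split(SENTINEL)                # the sentinel never leaves here

    print(split_at("abcdef", [2, 4]))              # ['ab', 'cd', 'ef']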

-Original Message-
From: Shawn Steele [mailto:shawn.ste...@microsoft.com] 
Sent: Monday, June 24, 2019 01:39
To: Sławomir Osipiuk; unicode@unicode.org
Cc: 'Richard Wordingham'
Subject: RE: Unicode "no-op" Character?

But... it's not actually discardable.  The hypothetical "packet"
architecture (using the term architecture somewhat loosely) needed the
information being tunneled in by this character.  If it was actually
discardable, then the "noop" character wouldn't be required as it would be
discarded.

Since the character conveys meaning to some parts of the system, then it's
not actually a "noop" and it's not actually "discardable".  

What is actually being requested isn't a character that nobody has meaning
for, but rather a character that has no PUBLIC meaning.  

Which leads us to the key.  The desire is for a character that has no public
meaning, but has some sort of private meaning.  In other words it has a
private use.  Oddly enough, there is a group of characters intended for
private use, in the PUA ;-)

Of course if the PUA characters interfered with the processing of the
string, they'd need to be stripped, but you're sort of already in that
position by having a private flag in the middle of a string.

-Shawn  




RE: Unicode "no-op" Character?

2019-06-23 Thread Shawn Steele via Unicode
But... it's not actually discardable.  The hypothetical "packet" architecture 
(using the term architecture somewhat loosely) needed the information being 
tunneled in by this character.  If it was actually discardable, then the "noop" 
character wouldn't be required as it would be discarded.

Since the character conveys meaning to some parts of the system, then it's not 
actually a "noop" and it's not actually "discardable".  

What is actually being requested isn't a character that nobody has meaning for, 
but rather a character that has no PUBLIC meaning.  

Which leads us to the key.  The desire is for a character that has no public 
meaning, but has some sort of private meaning.  In other words it has a private 
use.  Oddly enough, there is a group of characters intended for private use, in 
the PUA ;-)

Of course if the PUA characters interfered with the processing of the string, 
they'd need to be stripped, but you're sort of already in that position by 
having a private flag in the middle of a string.

-Shawn  

-Original Message-
From: Unicode  On Behalf Of Slawomir Osipiuk via 
Unicode
Sent: Saturday, June 22, 2019 6:10 PM
To: unicode@unicode.org
Cc: 'Richard Wordingham' 
Subject: RE: Unicode "no-op" Character?

That's the key to the no-op idea. The no-op character could not ever be assumed 
to survive interchange with another process. It'd be canonically equivalent to 
the absence of character. It could be added or removed at any position by a 
Unicode-conformant process. A program could wipe all the no-ops from a string 
it has received, and insert its own for its own purposes. (In fact, it should 
wipe the old ones so as not to confuse
itself.) It's "another process's discardable junk" unless known, 
internally-only, to be meaningful at a particular stage.

While all the various (non)joiners/ignorables are interesting, none of them 
have this property.

In fact, that might be the best description: It's not just an "ignorable", it's 
a "discardable". Unicode doesn't have that, does it?

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard 
Wordingham via Unicode
Sent: Saturday, June 22, 2019 20:59
To: unicode@unicode.org
Cc: Shawn Steele
Subject: Re: Unicode "no-op" Character?

If they're conveying an invisible message, one would have to strip out original 
ZWNBSP/WJ/ZWSP that didn't affect line-breaking.  The weak point is that that 
assumes that line-break opportunities are well-defined.  For example, they 
aren't for SE Asian text.

Richard.




RE: Unicode "no-op" Character?

2019-06-23 Thread Sławomir Osipiuk via Unicode
Ah, sorry. I meant to say that the string should always be normalized (not 
"sanitized") before being checked for exploits (i.e. sanitized).

-Original Message-
From: Sławomir Osipiuk [mailto:sosip...@gmail.com] 
Sent: Sunday, June 23, 2019 20:28
To: unicode@unicode.org
Cc: 'Richard Wordingham'
Subject: RE: Unicode "no-op" Character?

The string should always be sanitized before being checked for exploits





RE: Unicode "no-op" Character?

2019-06-23 Thread Sławomir Osipiuk via Unicode
On the subject of security, I've read through: 
https://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters which says:
"The issue is the following: A gateway might be checking for a sensitive 
sequence of characters, say "delete". If what is passed in is "deXlete", where 
X is a noncharacter, the gateway lets it through: the sequence "deXlete" may be 
in and of itself harmless. However, suppose that later on, past the gateway, an 
internal process invisibly deletes the X. In that case, the sensitive sequence 
of characters is formed, and can lead to a security breach."

Checking a string for a sequence of characters, then passing the string to a 
different function which (potentially) modifies it, then using it in a context 
where the security checks matter, just screams bad practice to me. There should 
be no modification permitted between a security check and security-sensitive 
use. The string should always be sanitized before being checked for exploits. 
Any function which modifies the characters in any way (and is not itself 
security-aware) should implicitly mark the string as unsafe again. Or am I off 
base? Security is not really my specialty, but the approach described in the TR 
stinks horribly to me.
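
A tiny Python sketch of the ordering problem quoted from TR36, with the noncharacter U+FDD0 standing in for the "X": the gateway checks the raw input, and a later step's invisible deletion re-forms the sensitive word.

    X = "\ufdd0"                               # noncharacter standing in for "X"

    def gateway_ok(text: str) -> bool:
        return "delete" not in text            # naive check on the raw input

    def later_internal_step(text: str) -> str:
        return text.replace(X, "")             # invisibly deletes the noncharacter

    attack = "de" + X + "lete"
    print(gateway_ok(attack))                  # True: the gateway is fooled
    print(later_internal_step(attack))         # 'delete': the sensitive word re-forms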

And in my idea, noops would be stripped as part of string sanitization. But the 
more I consider it, the more I understand such a thing would have had to be 
built into Unicode at the earliest stages. Basically, it's too late now.

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard 
Wordingham via Unicode
Sent: Sunday, June 23, 2019 04:37
To: unicode@unicode.org
Subject: Re: Unicode "no-op" Character?

Discardables are a security risk, as security filters may find it hard
to take them into account.

Richard.





Re: Unicode "no-op" Character?

2019-06-23 Thread Richard Wordingham via Unicode
On Sat, 22 Jun 2019 21:10:08 -0400
Sławomir Osipiuk via Unicode  wrote:

> In fact, that might be the best description: It's not just an
> "ignorable", it's a "discardable". Unicode doesn't have that, does it?

No, though the byte order mark at the start of a file comes close.
Discardables are a security risk, as security filters may find it hard
to take them into account.

Richard.



Re: Unicode "no-op" Character?

2019-06-23 Thread Richard Wordingham via Unicode
On Sat, 22 Jun 2019 23:56:50 +
Shawn Steele via Unicode  wrote:

> + the list.  For some reason the list's reply header is confusing.
> 
> From: Shawn Steele
> Sent: Saturday, June 22, 2019 4:55 PM
> To: Sławomir Osipiuk 
> Subject: RE: Unicode "no-op" Character?
> 
> The original comment about putting it between the base character and
> the combining diacritic seems peculiar.  I'm having a hard time
> visualizing how that kind of markup could be interesting?

There are a number of possible interesting scenarios:

1) Chopping the string into user perceived characters.  For example,
the Khmer sequences of COENG plus letter are named sequences.  Akin to
this is identifying resting places for a simple cursor, e.g. allowing it
to be positioned between a base character and a spacing, unreordered
subscript.  (This last possibility overlaps with rendering.)

2) Chopping the string into collating elements.  (This can require
renormalisation, and may raise a rendering issue with HarfBuzz, where
renormalisation is required to get marks into a suitable order for
shaping.  I suspect no-op characters would disrupt this
renormalisation; CGJ may legitimately be used to affect rendering this
way, even though it is supposed to have no other effect* on rendering.)

3) Chopping the string into default grapheme clusters.  That
separates a coeng from the following character with which it
interacts.

*Is a Unicode-compliant *renderer* allowed to distinguish diaeresis
from the umlaut mark?

Richard. 
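
Illustrating scenarios 1 and 3 above, a quick sketch using the third-party
regex package (pip install regex), whose \X pattern matches extended grapheme
clusters per UAX #29; exact results depend on the Unicode version the package
ships with:

    import regex  # third-party; the stdlib re module has no \X

    def clusters(s):
        return regex.findall(r"\X", s)

    print(clusters("e\u0301"))        # base + combining acute: one cluster
    print(clusters("e\u200B\u0301"))  # a format character in between splits
                                      # the user-perceived character apart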



RE: Unicode "no-op" Character?

2019-06-22 Thread Sławomir Osipiuk via Unicode
That's the key to the no-op idea. The no-op character could not ever be
assumed to survive interchange with another process. It'd be canonically
equivalent to the absence of a character. It could be added or removed at any
position by a Unicode-conformant process. A program could wipe all the
no-ops from a string it has received, and insert its own for its own
purposes. (In fact, it should wipe the old ones so as not to confuse
itself.) It's "another process's discardable junk" unless known,
internally-only, to be meaningful at a particular stage.

While all the various (non)joiners/ignorables are interesting, none of them
have this property.

In fact, that might be the best description: It's not just an "ignorable",
it's a "discardable". Unicode doesn't have that, does it?

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard
Wordingham via Unicode
Sent: Saturday, June 22, 2019 20:59
To: unicode@unicode.org
Cc: Shawn Steele
Subject: Re: Unicode "no-op" Character?

If they're conveying an invisible message, one would have to strip out
original ZWNBSP/WJ/ZWSP that didn't affect line-breaking.  The weak
point is that that assumes that line-break opportunities are
well-defined.  For example, they aren't for SE Asian text.

Richard.



Re: Unicode "no-op" Character?

2019-06-22 Thread Richard Wordingham via Unicode
On Sat, 22 Jun 2019 23:56:11 +
Shawn Steele via Unicode  wrote:

> Assuming you were using any of those characters as "markup", how
> would you know when they were intentionally in the string and not
> part of your marking system?

If they're conveying an invisible message, one would have to strip out
original ZWNBSP/WJ/ZWSP that didn't affect line-breaking.  The weak
point is that that assumes that line-break opportunities are
well-defined.  For example, they aren't for SE Asian text.

Richard.


RE: Unicode "no-op" Character?

2019-06-22 Thread Sławomir Osipiuk via Unicode
I assure you, it wasn't very interesting. :-) Headache-y, more like. The
diacritic thing was completely inapplicable anyway, as all our text was
plain English. I really don't want to get into what the thing was, because
it sounds stupider the more I try to explain it. But it got the wheels
spinning in my head, and now that I've been reading up a lot about Unicode
and older standards like 2022/6429, it got me thinking whether there might
already be an elegant solution.

 

But, as an example I'm making up right now, imagine you want to packetize a
large string. The packets are not all equal sized, the sizes are determined
by some algorithm. And the packet boundary may occur between a base char and
a diacritic. You insert markers into the string at the packet boundaries.
You can then store the string, copy it, display it, or pass it to the
sending function which will scan the string and know to send the next packet
when it reaches the marker. And you can now do all that without the need to
pass around extra metadata (like a list of ints of where the packet
boundaries are supposed to be) or to re-calculate the boundaries; it's still
just a big string. If a different application sees the string, it will know
to completely ignore the packet markers; it can even strip them out if it
wants to (the canonical equivalent of the noop character is the absence of a
character).
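
A sketch of that made-up packetization scheme in Python; since no noop exists,
U+FEFF stands in as the marker purely for illustration, and the names are
invented for the example:

    MARK = "\uFEFF"   # stand-in marker; a real NOOP would be discardable by anyone

    def mark_packets(s, boundaries):
        # Insert a marker at each boundary index computed by "some algorithm".
        out, prev = [], 0
        for b in sorted(boundaries):
            out.append(s[prev:b])
            out.append(MARK)
            prev = b
        out.append(s[prev:])
        return "".join(out)

    def send_packets(marked, send):
        # The sender just splits on the marker; no side list of offsets needed.
        for packet in marked.split(MARK):
            send(packet)

    def strip_markers(s):
        # Any other consumer may simply ignore or remove the markers.
        return s.replace(MARK, "")

    marked = mark_packets("abcde\u0301fg", [3, 5])  # one boundary falls between
                                                    # the base 'e' and its acute
    send_packets(marked, print)
    assert strip_markers(marked) == "abcde\u0301fg"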

 

As should be obvious, I'm not recommending this as good practice.

 

 

From: Shawn Steele [mailto:shawn.ste...@microsoft.com] 
Sent: Saturday, June 22, 2019 19:57
To: Sławomir Osipiuk; unicode@unicode.org
Subject: RE: Unicode "no-op" Character?

 

+ the list.  For some reason the list's reply header is confusing.

 

From: Shawn Steele 
Sent: Saturday, June 22, 2019 4:55 PM
To: Sławomir Osipiuk 
Subject: RE: Unicode "no-op" Character?

 

The original comment about putting it between the base character and the
combining diacritic seems peculiar.  I'm having a hard time visualizing how
that kind of markup could be interesting?

 

From: Unicode  On Behalf Of Slawomir Osipiuk
via Unicode
Sent: Saturday, June 22, 2019 2:02 PM
To: unicode@unicode.org
Subject: RE: Unicode "no-op" Character?

 

I see there is no such character, which I pretty much expected after Google
didn't help.

 

The original problem I had was solved long ago but the recent article about
watermarking reminded me of it, and my question was mostly out of curiosity.
The task wasn't, strictly speaking, about "padding", but about marking -
injecting "flag" characters at arbitrary points in a string without
affecting the resulting visible text. I think we ended up using ESC, which
is a dumb choice in retrospect, though the whole approach was a bit of a
hack anyway and the process it was for isn't being used anymore.



RE: Unicode "no-op" Character?

2019-06-22 Thread Shawn Steele via Unicode
Assuming you were using any of those characters as "markup", how would you know 
when they were intentionally in the string and not part of your marking system?

-Original Message-
From: Unicode  On Behalf Of Richard Wordingham via 
Unicode
Sent: Saturday, June 22, 2019 4:17 PM
To: unicode@unicode.org
Subject: Re: Unicode "no-op" Character?

On Sat, 22 Jun 2019 17:50:49 -0400
Sławomir Osipiuk via Unicode  wrote:

> If faced with the same problem today, I’d probably just go with U+FEFF 
> (really only need a single char, not a whole delimited substring) or a 
> different C0 control (maybe SI/LS0) and clean up the string if it 
> needs to be presented to the user.

You'd really want an intelligent choice between U+FEFF (ZWNBSP) (better
U+2060 WJ) and U+200B (ZWSP).  

> I still think an “idle”/“null tag”/“noop”  character would be a neat 
> addition to Unicode, but I doubt I can make a convincing enough case 
> for it.

You'd still only be able to insert it between characters, not between code 
units, unless you were using UTF-32.

Richard.




RE: Unicode "no-op" Character?

2019-06-22 Thread Shawn Steele via Unicode
+ the list.  For some reason the list's reply header is confusing.

From: Shawn Steele
Sent: Saturday, June 22, 2019 4:55 PM
To: Sławomir Osipiuk 
Subject: RE: Unicode "no-op" Character?

The original comment about putting it between the base character and the 
combining diacritic seems peculiar.  I'm having a hard time visualizing how 
that kind of markup could be interesting?

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Slawomir Osipiuk 
via Unicode
Sent: Saturday, June 22, 2019 2:02 PM
To: unicode@unicode.org
Subject: RE: Unicode "no-op" Character?

I see there is no such character, which I pretty much expected after Google 
didn't help.

The original problem I had was solved long ago but the recent article about 
watermarking reminded me of it, and my question was mostly out of curiosity. 
The task wasn't, strictly speaking, about "padding", but about marking - 
injecting "flag" characters at arbitrary points in a string without affecting 
the resulting visible text. I think we ended up using ESC, which is a dumb 
choice in retrospect, though the whole approach was a bit of a hack anyway and 
the process it was for isn't being used anymore.


Re: Unicode "no-op" Character?

2019-06-22 Thread Richard Wordingham via Unicode
On Sat, 22 Jun 2019 17:50:49 -0400
Sławomir Osipiuk via Unicode  wrote:

> If faced with the same problem today, I’d
> probably just go with U+FEFF (really only need a single char, not a
> whole delimited substring) or a different C0 control (maybe SI/LS0)
> and clean up the string if it needs to be presented to the user.

You'd really want an intelligent choice between U+FEFF (ZWNBSP) (better
U+2060 WJ) and U+200B (ZWSP).  

> I still think an “idle”/“null tag”/“noop”  character would be a neat
> addition to Unicode, but I doubt I can make a convincing enough case
> for it.

You'd still only be able to insert it between characters, not between
code units, unless you were using UTF-32.

Richard.



RE: Unicode "no-op" Character?

2019-06-22 Thread Sławomir Osipiuk via Unicode
Indeed. There are plenty of control characters that seem useful, but they 
really aren’t, due to lack of support from common software. Unicode is 
deliberately silent about most of them, which is fair, but not always 
convenient. If faced with the same problem today, I’d probably just go with 
U+FEFF (really only need a single char, not a whole delimited substring) or a 
different C0 control (maybe SI/LS0) and clean up the string if it needs to be 
presented to the user.

I still think an “idle”/“null tag”/“noop”  character would be a neat addition 
to Unicode, but I doubt I can make a convincing enough case for it.
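
A sketch of that cleanup step, assuming U+FEFF or a C0 control such as SI
(U+000F) had been pressed into service as the internal marker:

    MARKERS = {"\uFEFF", "\u000F"}

    def for_display(s):
        # Drop the private markers before the string reaches the user.
        return "".join(ch for ch in s if ch not in MARKERS)

    print(for_display("foo\uFEFFbar\u000Fbaz"))   # foobarbaz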

 

From: J Decker [mailto:d3c...@gmail.com] 
Sent: Saturday, June 22, 2019 17:19
To: Sławomir Osipiuk
Cc: Unicode Discussion
Subject: Re: Unicode "no-op" Character?

 

But it doesn't appear anything actually 'supports' that.



Re: Unicode "no-op" Character?

2019-06-22 Thread J Decker via Unicode
On Sat, Jun 22, 2019 at 2:04 PM Sławomir Osipiuk via Unicode <
unicode@unicode.org> wrote:

> I see there is no such character, which I pretty much expected after
> Google didn’t help.
>
>
>
> The original problem I had was solved long ago but the recent article
> about watermarking reminded me of it, and my question was mostly out of
> curiosity. The task wasn’t, strictly speaking, about “padding”, but about
> marking – injecting “flag” characters at arbitrary points in a string
> without affecting the resulting visible text. I think we ended up using
> ESC, which is a dumb choice in retrospect, though the whole approach was a
> bit of a hack anyway and the process it was for isn’t being used anymore.
>

The spec would suggest that there are escape codes like that, which can be
used.
APC,   U+009F
ST, String Terminator, U+009C
which are supposed to delimit a sequence of characters that should not be
displayed, but may be used to control the application displaying them
(assuming it understands them).

https://www.aivosto.com/articles/control-characters.html

156 (0x9C) ST, String Terminator:
ESC \ Closes a string opened by APC, DCS, OSC, PM or SOS.

159 (0x9F) APC, Application Program Command:
ESC _ Starts an application program command string. ST will end the
command. The interpretation of the command is subject to the program in
question.
But it doesn't appear anything actually 'supports' that.
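
A sketch of that mechanism anyway, for what it's worth: a private payload
wrapped in APC ... ST, and a consumer that doesn't understand it stripping the
whole command string out (using the C1 single-character forms; ESC _ and
ESC \ are the 7-bit equivalents):

    import re

    APC, ST = "\u009F", "\u009C"

    def embed(payload, text, pos):
        # Wrap a private command string in APC ... ST and drop it into the text.
        return text[:pos] + APC + payload + ST + text[pos:]

    def strip_commands(text):
        # A consumer that doesn't interpret the commands can remove them wholesale.
        return re.sub("\u009F[^\u009C]*\u009C", "", text)

    s = embed("packet=3", "hello world", 5)
    print(strip_commands(s))   # hello world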


RE: Unicode "no-op" Character?

2019-06-22 Thread Sławomir Osipiuk via Unicode
I see there is no such character, which I pretty much expected after Google
didn't help.

 

The original problem I had was solved long ago but the recent article about
watermarking reminded me of it, and my question was mostly out of curiosity.
The task wasn't, strictly speaking, about "padding", but about marking -
injecting "flag" characters at arbitrary points in a string without
affecting the resulting visible text. I think we ended up using ESC, which
is a dumb choice in retrospect, though the whole approach was a bit of a
hack anyway and the process it was for isn't being used anymore.



RE: Unicode "no-op" Character?

2019-06-22 Thread Doug Ewell via Unicode
Sławomir Osipiuk wrote:

> Does Unicode include a character that does nothing at all? I'm talking
> about something that can be used for padding data without affecting
> interpretation of other characters, including combining chars and
> ligatures.

I join Shawn Steele in wondering what your "data padding" use case is for such 
a character. Most modern protocols don't require string fields to be exactly N 
characters long, or have their own mechanism for storing the real string length 
and ignoring any padding characters.

If you just need to fill up space at the end of a line, and need a character 
that has as little disruptive meaning as possible, I agree that U+FEFF is 
probably the closest you'll get.

NULL, of course, was intended to serve exactly this purpose, but everyone has 
decided for themselves what the C0 code points should be used for, and "display 
a .notdef glyph" is one of the popular choices.

--
Doug Ewell | Thornton, CO, US | ewellic.org





Re: Unicode "no-op" Character?

2019-06-22 Thread Rebecca T via Unicode
Perhaps a codepoint from a private use area and another processing step to
add/ remove them would work for you?

On Sat, Jun 22, 2019, 1:39 AM Mark Davis ☕️ via Unicode 
wrote:

> There is nothing like what you are describing. Examples:
>
>1. Display — There are a few of the Default Ignorables that are always
>treated as invisible, and have little effect on other characters. However,
>even those will generally interfere with the display of sequences (be
>between 'q' and U+0308 ( q̈ ); within emoji sequences, within
>ligatures, etc), line break, etc.
>2. Interpretation — There is no character that would always be ignored
>by all processes. Some processes may ignore some characters (eg a search
>indexer may ignore most default ignorables), but there is nothing that all
>processes will ignore.
>
> The only exception would be if some cooperating processes that had agreed
> beforehand to strip some particular character.
>
> Mark
>
>
> On Sat, Jun 22, 2019 at 6:49 AM Sławomir Osipiuk via Unicode <
> unicode@unicode.org> wrote:
>
>> Does Unicode include a character that does nothing at all? I’m talking
>> about something that can be used for padding data without affecting
>> interpretation of other characters, including combining chars and
>> ligatures. I.e. a character that could hypothetically be inserted between a
>> latin E and a combining acute and still produce É. The historical
>> description of U+0016 SYNCHRONOUS IDLE seems like pretty much exactly what
>> I want. It only has one slight disadvantage: it doesn’t work. All software
>> I’ve tried displays it as an unknown character and it definitely breaks up
>> combinations. And U+0000 NULL seems even worse.
>>
>>
>>
>> I can imagine the answer is that this thing I’m looking for isn’t a
>> character at all and so should be the business of “a higher-level protocol”
>> and not what Unicode was made for… but Unicode does include some odd things
>> so I wonder if there is something like that regardless. Can anyone offer
>> any suggestions?
>>
>>
>>
>> Sławomir Osipiuk
>>
>


Re: Unicode "no-op" Character?

2019-06-22 Thread Mark Davis ☕️ via Unicode
There is nothing like what you are describing. Examples:

   1. Display — There are a few of the Default Ignorables that are always
   treated as invisible, and have little effect on other characters. However,
   even those will generally interfere with the display of sequences (be
   between 'q' and U+0308 ( q̈ ); within emoji sequences, within ligatures,
   etc), line break, etc.
   2. Interpretation — There is no character that would always be ignored
   by all processes. Some processes may ignore some characters (eg a search
   indexer may ignore most default ignorables), but there is nothing that all
   processes will ignore.

The only exception would be if some cooperating processes that had agreed
beforehand to strip some particular character.

Mark
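
A small demonstration of point 1, using Python's unicodedata: E plus combining
acute composes under NFC, but the same pair with a default-ignorable ZERO
WIDTH SPACE between them does not (the intervening starter blocks composition,
and rendering generally breaks too):

    import unicodedata

    plain   = "E\u0301"          # E + COMBINING ACUTE ACCENT
    blocked = "E\u200B\u0301"    # same, with ZERO WIDTH SPACE in between

    print(unicodedata.normalize("NFC", plain) == "\u00C9")    # True: composes to É
    print(unicodedata.normalize("NFC", blocked) == blocked)   # True: left uncombined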


On Sat, Jun 22, 2019 at 6:49 AM Sławomir Osipiuk via Unicode <
unicode@unicode.org> wrote:

> Does Unicode include a character that does nothing at all? I’m talking
> about something that can be used for padding data without affecting
> interpretation of other characters, including combining chars and
> ligatures. I.e. a character that could hypothetically be inserted between a
> latin E and a combining acute and still produce É. The historical
> description of U+0016 SYNCHRONOUS IDLE seems like pretty much exactly what
> I want. It only has one slight disadvantage: it doesn’t work. All software
> I’ve tried displays it as an unknown character and it definitely breaks up
> combinations. And U+0000 NULL seems even worse.
>
>
>
> I can imagine the answer is that this thing I’m looking for isn’t a
> character at all and so should be the business of “a higher-level protocol”
> and not what Unicode was made for… but Unicode does include some odd things
> so I wonder if there is something like that regardless. Can anyone offer
> any suggestions?
>
>
>
> Sławomir Osipiuk
>


Re: Unicode "no-op" Character?

2019-06-22 Thread Alex Plantema via Unicode

Op zaterdag 22 juni 2019 02:14 schreef Sławomir Osipiuk via Unicode:


Does Unicode include a character that does nothing at all? I'm
talking about something that can be used for padding data without
affecting interpretation of other characters, including combining
chars and ligatures. I.e. a character that could hypothetically be
inserted between a latin E and a combining acute and still produce É.
The historical description of U+0016 SYNCHRONOUS IDLE seems like
pretty much exactly what I want. It only has one slight disadvantage:
it doesn't work. All software I've tried displays it as an unknown
character and it definitely breaks up combinations. And U+0000 NULL
seems even worse.


DEL was used as such on paper tape to replace errors.

Alex.



Re: Unicode "no-op" Character?

2019-06-21 Thread J Decker via Unicode
Sounds like a great use for ZWNBSP  (zero width non-breaking space) 0xFEFF
(Also used as BOM)
or that doesn't break; maybe 'ZERO WIDTH SPACE' (U+200B)

On Fri, Jun 21, 2019 at 9:48 PM Sławomir Osipiuk via Unicode <
unicode@unicode.org> wrote:

> Does Unicode include a character that does nothing at all? I’m talking
> about something that can be used for padding data without affecting
> interpretation of other characters, including combining chars and
> ligatures. I.e. a character that could hypothetically be inserted between a
> latin E and a combining acute and still produce É. The historical
> description of U+0016 SYNCHRONOUS IDLE seems like pretty much exactly what
> I want. It only has one slight disadvantage: it doesn’t work. All software
> I’ve tried displays it as an unknown character and it definitely breaks up
> combinations. And U+0000 NULL seems even worse.
>
>
>
> I can imagine the answer is that this thing I’m looking for isn’t a
> character at all and so should be the business of “a higher-level protocol”
> and not what Unicode was made for… but Unicode does include some odd things
> so I wonder if there is something like that regardless. Can anyone offer
> any suggestions?
>
>
>
> Sławomir Osipiuk
>


RE: Unicode "no-op" Character?

2019-06-21 Thread Shawn Steele via Unicode
I'm curious what you'd use it for?

From: Unicode  On Behalf Of Slawomir Osipiuk via 
Unicode
Sent: Friday, June 21, 2019 5:14 PM
To: unicode@unicode.org
Subject: Unicode "no-op" Character?

Does Unicode include a character that does nothing at all? I'm talking about 
something that can be used for padding data without affecting interpretation of 
other characters, including combining chars and ligatures. I.e. a character 
that could hypothetically be inserted between a latin E and a combining acute 
and still produce É. The historical description of U+0016 SYNCHRONOUS IDLE 
seems like pretty much exactly what I want. It only has one slight 
disadvantage: it doesn't work. All software I've tried displays it as an 
unknown character and it definitely breaks up combinations. And U+0000 NULL 
seems even worse.

I can imagine the answer is that this thing I'm looking for isn't a character 
at all and so should be the business of "a higher-level protocol" and not what 
Unicode was made for... but Unicode does include some odd things so I wonder if 
there is something like that regardless. Can anyone offer any suggestions?

Sławomir Osipiuk