Your goal is not achievable. We can't wave a magic wand and have all processes everywhere suddenly (or even within decades) ignore U+000F in all processing. That will not happen.
This thread is pointless and should be terminated.

Mark

On Wed, Jul 3, 2019 at 5:48 PM Sławomir Osipiuk via Unicode <unicode@unicode.org> wrote:

> I’m frustrated at how badly you seem to be missing the point. There is
> nothing impossible nor self-contradictory here. There is only the matter
> that Unicode requires all scalar values to be preserved during interchange.
> This is in many ways a good idea, and I don’t expect it to change, but
> something else would be possible if this requirement were explicitly
> dropped for a well-defined small subset of characters (even just one
> character). A modern-day SYN.
>
> Let’s say it’s U+000F. The standard takes my proposal and makes it a
> discardable, null-displayable character. What does this mean?
>
> U+000F may appear in any text. It has no (external) semantic value. But it
> may appear. It may appear a lot.
>
> Display routines (which are already dealing with combining, ligaturing,
> non-/joiners, variations, initial/medial/final forms) understand that
> U+000F is to be processed as a no-op. Do nothing with this. Drop it. Move
> to the next character. Simple.
>
> Security gateways filter it out completely, as a matter of best practice
> and security-in-depth.
>
> A process, let’s call it Process W, adds a bunch of U+000F to a string it
> received, or built, or a user entered via keyboard. Maybe it’s to
> packetize. Maybe to mark every word that is an anagram of the name of a
> famous 19th-century painter, or that represents a pizza topping. Maybe
> something else. This is a versatile character. Process W is done adding
> U+000F to the string. It stores it in a UTF-8 encoded database field.
> Encoding isn’t a problem. The database is happy.
>
> Now Process X runs. Process X is meant to work with Process W and it’s
> well-aware of how U+000F is used. It reads the string from the database. It
> sees U+000F and interprets it.
> It chops the string into packets, or does a
> websearch for each famous painter, or it orders pizza. The private meaning
> of U+000F is known to both Process X and Process W. There is useful
> information encoded in-band, within a limited private context.
>
> But now we have Process Y. Process Y doesn’t care about packets or
> painters or pizza. Process Y runs outside of the private context that X and
> W had. Process Y translates strings into Morse code for transmission. As
> part of that, it replaces common words with abbreviations. Process Y
> doesn’t interpret U+000F. Why would it? It has no semantic value to Process
> Y.
>
> Process Y reads the string from the database. Internally, it clears all
> instances of U+000F from the string. They’re just taking up space. They’re
> meaningless to Y. It compiles the Morse code sequence into an audio file.
>
> But now we have Process Z. Process Z wants to take a string and mark every
> instance of five contiguous Latin consonants. It scrapes the database
> looking for text strings. It finds the string Process W created and marked.
> Z has no obligation to W. It’s not part of that private context. Process Z
> clears all instances of U+000F it finds, then inserts its own wherever it
> finds five-consonant clusters. It stores its results in a UTF-16LE text
> file. It’s allowed to do that.
>
> Nothing impossible happened here. Let’s summarize:
>
> Processes W and X established a private meaning for U+000F by agreement
> and interacted based on that meaning.
>
> Process Y ignored U+000F completely because it assigned no meaning to it.
>
> Process Z assigned a completely new meaning to U+000F. That’s permitted
> because U+000F is special and is guaranteed to have no semantics without
> private agreement and doesn’t need to be preserved.
>
> There is no need to escape anything. Escaping is used when a character
> must have more than one meaning (i.e.
> it is overloaded, as when it is both
> text and markup). U+000F only gets one meaning in any context. In a new
> context, the meaning gets overridden, not overloaded. That’s what makes it
> special.
>
> I don’t expect to see any of this in official Unicode. But I take
> exception to the idea that I’m suggesting something impossible.
>
> *From:* Philippe Verdy [mailto:verd...@wanadoo.fr]
> *Sent:* Wednesday, July 03, 2019 04:49
> *To:* Sławomir Osipiuk
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Unicode "no-op" Character?
>
> Your goal is **impossible** to reach with Unicode. Assume such a character
> is "added" to the UCS, then it can appear in the text. Your goal being that
> it should be "warrantied" not to be used in any text, means that your
> "character" cannot be encoded at all.
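
For readers following the thread, the W/X/Y/Z scenario in the quoted proposal can be sketched in a few lines of Python. This is only an illustration of the hypothetical semantics being argued about: in the actual standard U+000F is the C0 control SHIFT IN and has no "discardable" status. The function names, the choice of marking pizza toppings by prefixing a word with U+000F, and the treatment of "y" as a consonant are all assumptions made for the example.

```python
import re

# U+000F, treated here as the proposal's hypothetical "discardable" character.
NOOP = "\u000f"

def strip_noop(text):
    """Process Y's view: U+000F carries no meaning, so drop every instance."""
    return text.replace(NOOP, "")

def process_w(text, toppings):
    """Process W: mark each pizza-topping word by prefixing it with U+000F."""
    words = strip_noop(text).split(" ")
    return " ".join(NOOP + w if w.lower() in toppings else w for w in words)

def process_x(text):
    """Process X: shares W's private agreement and reads back the marked words."""
    return [w[1:] for w in text.split(" ") if w.startswith(NOOP)]

def process_z(text):
    """Process Z: discards any existing marks, then inserts its own U+000F
    before every run of five contiguous Latin consonants."""
    clean = strip_noop(text)
    five_consonants = re.compile(r"[b-df-hj-np-tv-z]{5}", re.IGNORECASE)
    return five_consonants.sub(lambda m: NOOP + m.group(0), clean)

# W stores a marked string; UTF-8 encodes U+000F as the single byte 0x0F.
stored = process_w("order olives and lengths", {"olives"})
stored_bytes = stored.encode("utf-8")

# X, Y and Z each read the same stored string.
read_back = stored_bytes.decode("utf-8")
marked_for_x = process_x(read_back)
text_for_y = strip_noop(read_back)
text_from_z = process_z(read_back)
```

The sketch only shows that the four behaviours can coexist on one stored string. It says nothing about the objection raised at the top of this message, which is whether existing software everywhere could ever be expected to treat the character this way.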