Re: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters))

2015-06-09 Thread Nathan Sharfi

 On Jun 3, 2015, at 1:26 AM, William_J_G Overington 
 wjgo_10...@btinternet.com wrote:
 
 Private Use Area in Use (from Tag characters and in-line graphics (from Tag 
 characters))
 
 
 That's not agreed upon. I'd say that the general agreement is that the 
 private ranges are of limited usefulness for some very limited use cases 
 (such as designing encodings for new scripts).
 
 
 They are of limited usefulness precisely because it is pathologically hard 
 to make use of them in their current state of technological evolution. If 
 they were easy to make use of, people would be using them all the time. I’d 
 bet good money that if you surveyed a lot of applications where custom 
 characters are being used, they are not using private use ranges. Now why 
 would that be?
 
 
 Actually, I have used Private Use Area characters a lot, and, once I had got 
 used to them, I found them incredibly straightforward to use.

That's nice; I've found some persistent annoyances when I use PUA codepoints.

A while back I learned Quikscript, an alternate English orthography. Since May 
2013, my blog's been in Quikscript using PUA codepoints. I've also joined the 
Shavian mailing list, sent e-mails in Shavian, and wrote an "I'm switching my 
Quikscript blog to Shavian" blog post in Shavian for April Fool's Day. To do 
all this typing, I made both Quikscript and Shavian keyboard layouts for OS X, 
as well as a Quikscript font. All of my Quikscript stuff is linked to from 
https://www.frogorbits.com/qs/ if you're interested.

I'm something of a Johnny-come-lately to Shavian, so I've only used it in the 
SMP with fonts others have made.

So, how much nicer is dealing with Shavian?

- The Keyboard Viewer and input-source preview know what font to use for each 
key for Shavian; Quikscript keyboard layouts display boxes for the letters 
because there's no way for the system to guess which font to use for a 
particular codepoint. 
- Double-tapping a Shavian word in my browser will select the word; 
double-tapping a Quikscript word will select just one letter.
- Internet Explorer will happily break Quikscript text in the middle of a word; 
Shavian gets broken at word boundaries just like English. While IE's behavior 
is unlike other browsers' and Not What I Want, I can't fault the IE team; I 
could be using PUA code points for a language that doesn't use spaces much, 
like Japanese.
- I can read and write Shavian posts on Twitter on the desktop in a reasonable 
font for both Shavian and other scripts; if I wanted to do the same in 
Quikscript, I'd have to have a custom user-supplied stylesheet to override 
Twitter's own font suggestions.
- Scripts already in Unicode attract the attention of talented completionist 
organizations that PUA communities generally can't attract beforehand. Everson 
Mono, Noto, and Segoe UI Historic (as of Windows 10) — all great typefaces — 
support Shavian and not Quikscript.

This tends to be because:

- I could have multiple fonts that have wildly differing meanings and glyphs 
mapped to the same code point; the OS can't guess which I might mean.
- All the information that the OS needs to detect word breaks is in character 
properties data supplied by the Consortium and handled by the OS.
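
A minimal Python sketch of the second point, using only the standard `unicodedata` module. The Shavian letter U+10450 has a name and a letter category in the Unicode Character Database, while a PUA code point such as U+E000 (picked arbitrarily for illustration; it is not necessarily one that any Quikscript font uses) carries only the generic private-use category:

```python
import unicodedata

def describe(ch):
    """Return (general category, character name or None) for one character."""
    return unicodedata.category(ch), unicodedata.name(ch, None)

# A regularly encoded Shavian letter from the SMP.
shavian = describe("\U00010450")

# An arbitrary code point from the BMP Private Use Area.
private = describe("\uE000")
```

Here `shavian` reports a letter category and a character name, while `private` reports only the private-use category and no name, which is all that generic software has to go on when it tries to pick a font or find word boundaries.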

~ ~ ~

Specialists like us might be able to put up with these things, but we can't 
control everything about the reading and writing experience online unless we're 
all resigned to taking pictures of handwritten text.


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-07 Thread Doug Ewell

Mark E. Shoulson mark at kli dot org wrote:


Isn't this what webfonts are all about?  You specify a font in the
stylesheet, give it a URL, and your browser goes and downloads it and
displays the text in it.


That's great if you have a stylesheet, a URL, and a browser. HTML is 
fancy text, and pretty much implies some sort of online connection. I 
thought we were talking about plain text, and apologize if we weren't or 
if that important detail was not clear.


--
Doug Ewell | http://ewellic.org | Thornton, CO  



Re: Tag characters and in-line graphics (from Tag characters)

2015-06-07 Thread Philippe Verdy
2015-06-07 18:39 GMT+02:00 Doug Ewell d...@ewellic.org:

 Mark E. Shoulson mark at kli dot org wrote:

  Isn't this what webfonts are all about?  You specify a font in the
 stylesheet, give it a URL, and your browser goes and downloads it and
 displays the text in it.


 That's great if you have a stylesheet, a URL, and a browser. HTML is fancy
 text, and pretty much implies some sort of online connection.


Everything in HTML is embeddable in a standalone document, including
graphics. HTML does not imply any online connection. HTML is independent of
HTTP or other transports.
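
As a rough illustration of what "embeddable in a standalone document" can mean for fonts, the following Python sketch builds a CSS `@font-face` rule whose font data is carried inline as a `data:` URI, so that no network fetch is needed. The function name, the family name `ExampleScript`, and the placeholder bytes are all assumptions made up for this example; a real document would embed the bytes of an actual font file.

```python
import base64

def inline_font_face(family, font_bytes, mime="font/woff2"):
    """Build a CSS @font-face rule whose font is embedded as a data: URI."""
    payload = base64.b64encode(font_bytes).decode("ascii")
    return (
        "@font-face { font-family: '%s'; "
        "src: url(data:%s;base64,%s); }" % (family, mime, payload)
    )

# Placeholder bytes stand in for a real font file read from disk.
rule = inline_font_face("ExampleScript", b"not a real font")

# A self-contained HTML document: the stylesheet and font travel with the text.
html = (
    "<!DOCTYPE html><html><head><style>%s</style></head>"
    "<body></body></html>" % rule
)
```

This only shows that the container can be offline; it does not address the plain-text objection raised earlier in the thread.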


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-05 Thread Asmus Freytag (t)

On 6/4/2015 17:03 , Chris wrote:
This whole discussion is about the fact that it would be technically 
possible to have private character sets and private agreements that 
your OS downloads without the user being aware of it. 


The sticky issues are not the questions of how to make available fonts 
or images for use by the OS.


Instead, they concern the fact that any such model violates some 
pretty basic guarantees of plain text that the entire net infrastructure 
relies on.


There are very obvious security issues. They start with tracking; every 
time you access a custom code point, that fact potentially results in a 
trackable interaction. This problem affects even the sticker solution 
that people are hoping for in the case of emoji. (On my system, no 
external resources are displayed when I first open any message, and 
there is a reason for that.)


Beyond tracking, and beyond stickers (that is pictures that look like 
pictures) a generalized custom character set would allow text that is 
no longer really stable. You would be able to deliver identical e-mails 
to people that display differently, because when you serve the custom 
fonts, you would be able to customize what you deliver under the same 
custom character set designator.


While this would be a wonderful way to circumvent censorship (other than 
the man in the middle version), you would likewise seriously undermine 
the ability to filter unwanted or undesirable texts, because the custom 
character set engine might recognize when a request comes from a filter 
and not the end user. (Just the other day, I came across a hacked 
website that responded differently to search engines than to live users, 
making the hack effective for one and invisible to the other. Custom 
character sets would seem to just add to the hackers' arsenal here.)


Finally, custom character sets sound like a great idea when thinking of 
an extension of an existing character set. But that's not where the 
issues are. The issues come in when you use the same technology to 
provide aliases for existing code points or for other custom characters.


Aliasing undermines the ability to do search (or any other 
content-focused processing, from sorting to spell-check).


At that point, the circle closes.

When Unicode was created, the alternative then was ISO 2022, which was a 
standard that addressed the issue of how to switch among (albeit 
pre-defined) character sets to achieve, in principle, coverage equal to 
the union of these character sets.


Unicode was created to address two main deficiencies of that situation. 
Unification addressed the aliasing issue, so that code points were no 
longer opaque but could be interpreted by software (other than 
display), which was the second big drawback of the patchwork of 
character sets. A processing model for opaque code points is possible to 
define, but it isn't very practical, and in the late eighties people had 
had enough and were glad to be quit of it.


Seen from this perspective, the discussion about custom character sets 
presents itself as a giant step backward, undermining the very advances 
that underlie the rapid acceptance and spread of Unicode.


A./


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-05 Thread Martin J. Dürst

On 2015/06/04 17:03, Chris wrote:


I wish Steve Jobs was here to give this lecture.


Well, if Steve Jobs were still around, he could think about whether (and 
how many) users really want their private characters, and whether it was 
worth the time to have his engineers working on the solution. I'm not 
sure he would come to the same conclusion as you.



This whole discussion is about the fact that it would be technically possible 
to have private character sets and private agreements that your OS downloads 
without the user being aware of it.

Now if the unicode consortium were to decide on standardising a technological 
process whereby rendering engines could seamlessly download representations of 
custom characters without user intervention, no doubt all the vendors would 
support it, and all the technical mumbo jumbo of installing privately agreed 
character sets would be something users could leave for the technology to sort 
out.


You are right that it would be strictly technically possible. Not only 
that, it has been so for 10 or 20 years.


As an example, in 1996 at the WWW Conference in Paris I was 
participating in a workshop on internationalization for the Web, and by 
chance I was sitting between the participant from Adobe and the 
participant from Microsoft. These were the main companies working on 
font technology at that time, and I asked them how small it would be 
possible to make a font for a single character using their technologies 
(the purpose of such a font, as people on this thread should be able to 
guess, would be as part of a solution to exchange single, user-defined 
characters).


I don't even remember their answers. The important thing here is that 
the idea, and the technology, have been around for a long time. So why 
didn't it catch on? Maybe the demand is just not as big as some 
contributors on this list claim.


Also, maybe while the technology itself isn't rocket science, the 
responsible people at the relevant companies have enough experience with 
technology deployment to hold back. To give an example of why the 
deployment aspect is important, there were various Web-like hypertext 
technologies around when the Web took off in the 1990s. One of them was 
called HyperG. It was technologically 'better' than the Web, in that it 
avoided broken links. But it was much more difficult to deploy, and so 
it is forgotten, whereas the Web took off.


Regards,   Martin.


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-05 Thread William_J_G Overington
Asmus Freytag wrote about security issues.
This is interesting reading and I have learned a lot from the post about 
various security issues.
Whilst the post is in this thread and follows from a post in this thread, the 
topic seems to have moved to the Custom characters thread.
It seems to me that what you write about would not apply to my suggestion in 
my original post: is that correct?
http://www.unicode.org/mail-arch/unicode-ml/y2015-m05/0218.html
Also the following two posts.
http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0009.html
http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0027.html
Whilst the ideas raised by Chris are interesting, they do seem to be distinctly 
different from what I suggested.
So, for clarity, do you regard my suggested format as having any security 
issues, and if so, what please?
I know that some people have opined that my suggested format is out of scope 
for Unicode, yet the scope of Unicode is what the Unicode Technical Committee 
decides is the scope of Unicode, and my suggested format does provide a way to 
include custom glyphs within a Unicode plain text document by using the new 
base character followed by tag characters method.
William Overington
5 June 2015


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-05 Thread Doug Ewell
I wrote, crumpled up, and threw away about three different responses. I
thought about ISO 2022 and about accessing the web for every PUA
character, as Asmus mentioned, and about the size of the user base, as
Martin mentioned. I thought about character properties and about
ephemerality.

I didn't think of the spoofing implications that Asmus described, which
would affect both the automatic PUA font download and the inline drawing
language. Either of these could be used to spell out, let's say,
paypal.com rather convincingly and with minimal effort.

I might have more experience with the PUA than many list members, having
transcribed the 27,000-word Alice's Adventures in Wonderland into my
constructed alphabet two years ago, in a PUA encoding, so that Michael
Everson could publish it in book form. One of the many learning
experiences of this project was finding out which software tools play
nicely with the PUA and which don't. Some tools just worked while
others would not give acceptable results with any amount of effort.

At no point, however, did I suppose that a font with my alphabet, or any
of the jillions of others that have been invented during a boring day
in class (see Omniglot for tons of examples), should be silently
downloaded to a user's computer, consuming bandwidth and disk space,
without her knowledge. That's practically malware. Maybe I'm just not
enough of a Distinguished Visionary to understand how insanely great
this would be (unfortunately, celebrity name-dropping doesn't work with
me).

Unicode has stated consistently for at least 23 years that it would not
ever standardize PUA usage, and over the years some UTC members have
used terms like "strongly discouraged" and "not interoperable even in
the presence of an agreement". Given this, and given that no system I'm
aware of magically downloads fonts for *regularly encoded characters* (I
still have no font for Arabic math symbols), I personally would not
expect Unicode to perform a 180 on this.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Tag characters and in-line graphics (from Tag characters)

2015-06-04 Thread Chris
 
 No, that's why you include a reference to the font in the private agreement, 
 so that interested parties can install it and see the special character(s).

People with their iphones and ipads and so forth don’t want to have “private 
agreements”, they don’t want to “install character sets”. They want it to “just 
work”.

I wish Steve Jobs was here to give this lecture.

I highly doubt actually that it is even possible to install a private character 
set font on an iphone such that it would be available to all applications.

This whole discussion is about the fact that it would be technically possible 
to have private character sets and private agreements that your OS downloads 
without the user being aware of it.

Now if the unicode consortium were to decide on standardising a technological 
process whereby rendering engines could seamlessly download representations of 
custom characters without user intervention, no doubt all the vendors would 
support it, and all the technical mumbo jumbo of installing privately agreed 
character sets would be something users could leave for the technology to sort 
out.





Re: Tag characters and in-line graphics (from Tag characters)

2015-06-04 Thread Chris

 On 4 Jun 2015, at 10:59 am, David Starner prosfil...@gmail.com wrote:
 
 On Wed, Jun 3, 2015 at 5:46 PM Chris idou...@gmail.com wrote:
 
 I personally think emoji should have one, single definitive representation 
 for this exact reason.
 
 Then you want an image. I don't see what's hard about that.


I already explained why an image and/or HTML5 is not a character. I’ll repeat 
again. And the world of characters is not limited to emoji.

1. HTML5 doesn’t separate one particular representation (font, size, etc) from 
the actual meaning of the character. So you can’t paste it somewhere and expect 
to increase its point size or change its font.
2. It’s highly inefficient in space to drop multi-kilobyte strings into a 
document to represent one character.
3. The entire design of HTML has nothing to do with characters. So there is no 
way to process a string of characters interspersed with HTML elements and know 
which of those elements are a “character”. This makes programmatic manipulation 
impossible, and means most computer applications simply will not allow HTML in 
scenarios where they expect a list of “characters”.
4. There is no way to compare 2 HTML elements and know they are talking about 
the same character. I could put some HTML representation of a character in my 
document, you could put a different one in, and there would be absolutely no way 
to know that they are the same character. Even if we are in the same community 
and agree on the existence of this character.
5. Similarly, there is no way to search or index HTML elements. If an HTML 
document contained an image of a particular custom character, there would be no 
way to ask google or whatever to find all the documents with that character. 
Different documents would represent it differently. HTML is a rendering 
technology. It makes things LOOK a particular way, without actually ENCODING 
anything about it. The only part of HTML that is actually searchable in a 
deterministic fashion is the part that is encoded - the unicode part.


  
 The community interested in tony the tiger can make decisions like that. 
 
 That is a hell of a handwave. In practice, you've got a complex decision 
 that's always going to be a bit controversial, and one that most 
 communities won't bother trying to make.

Apparently the world makes decisions all the time without meeting in committee. 
Strange but true. It’s called making a decision. Facebook have created a lot of 
emoji characters without consulting any committee and it seems to work fine, 
albeit restricted to the facebook universe because of a lack of a standard.

 
  
 You can’t know because they’re images.
 
 You can't know because the only obvious equivalence relation is exact image 
 identity. 

Because… there is no standard!! If facebook wants to define 2 emoji images, 
maybe one is bigger than the other, and yet basically the same, to mean the 
same thing, then that would be their choice. Since I expect they have a lot of 
smart people working there, I expect it would work rather well. Just like 
Microsoft issues courier fonts in different point sizes and we all feel they 
have made that work fairly well.

You seem to be arguing the nonsense position that if someone, for example, made 
a snowflake glyph slightly different to the unicode official one, it would be 
wrong. That of course is nonsense. People can make sensible decisions about 
this without the unicode committee.


 
 You can’t iterate over compressed bits. You can’t process them.
 
 Why not? In any language I know of that has iterators, there would be no 
 problem writing one that iterates over compressed input. If you need to 
 mutate them, that is hard in compressed formats, but a new CPU can store War 
 and Peace in the on-CPU cache.  

You can’t do it because no standard library, programming language, or operating 
system is set up to iterate over characters of compressed data. So if you want 
to shift compressed bits around in your app, it will take an awful lot of work, 
and the bits won’t be recognised by anyone else.

Now if someone wants to define the next version of unicode to be a compressed 
format, and every platform supports that with standard libraries, computer 
languages etc, then fine that could work.

Yet again I point out, lots of things MIGHT be possible in the real world IF 
that is how a standard is formulated. But all the chatter about this or that 
technology is pie in the sky without that standard.



Re: Tag characters and in-line graphics (from Tag characters)

2015-06-03 Thread Chris

 On 3 Jun 2015, at 11:24 pm, David Starner prosfil...@gmail.com wrote:
 
 Chris wrote:
  There is no way to compare 2 HTML elements and know they are talking about 
  the same character
 
 That's because character identity is a hard problem. Is the emoji TIGER the 
 same as TONY THE TIGER or as TONY THE TIGER GIVING THE VICTORY SIGN?  


I personally think emoji should have one, single definitive representation for 
this exact reason. The subtleties of emotion between one happy face and 
another can be miles apart. Emoji are a little different to other symbols in 
that respect. Symbols that are purely symbolic can be changed as much as you 
like as long as they are recognisable. Emoji have too many shades of meaning 
for allowing change.

Both of these scenarios are an argument that there should be custom characters 
with at least one official representation. Emoji because you don’t really want 
variation. Symbols because if you don’t have a local representation, then 
something is better than nothing. If you don’t have a local Snow Flake for 
example, any old snow flake will be fine.

This is not a hard problem at all. Is one tony the tiger the same as another? 
The community interested in tony the tiger can make decisions like that. But 
having made that decision there needs to be a way for generic computer programs 
that don’t know about that community to do reasonable things with tony the 
tiger characters.

 
 You can index links to images. If two documents represent it differently, 
 then I go back to the above; we can't know that they're the same thing.

You can’t know because they’re images. That’s my exact point. Anybody talking 
about HTML5 and images as a solution to custom characters is not proposing a 
valid solution.


 
 On Tue, Jun 2, 2015 at 7:11 PM Chris idou...@gmail.com wrote:
 You can’t ask the entire computing universe to compress everything all the 
 time.
 
 Anytime we care about how much space text takes up, it should be compressed. 
 It compresses very well. On the other hand, it's rare that anyone cares 
 anymore; what's a few hundred kilobytes between friends? 


You compress things when they are on the move. Between computers and as you are 
writing it to a file. But you can’t compress generically while it is in memory. 
You can’t iterate over compressed bits. You can’t process them.






Re: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters))

2015-06-03 Thread John
I don’t use old software, I use up to date versions of everything on a Mac. 
Very standard setup. 




There’s a lot of links there. Maybe they do work in PDFs, but they certainly 
don’t work in the browser, and they don’t work when I click the txt files. 
Basically what you’re saying is that PDFs have a way to make this work.




so what?




Unless we are proposing that everything in the universe be PDF, this doesn’t 
really help. There should be a standard way to put custom characters anywhere 
that characters belong and have things “just work”. Clearly right now things 
don’t just work. And without even bothering to try I know if I tried cutting 
and pasting from those PDFs into somewhere else, it won’t work.


—
Chris

On Wed, Jun 3, 2015 at 11:20 PM, Philippe Verdy verd...@wanadoo.fr
wrote:

 Note that copy-pasting from a PDF to another document is very tricky: the
 PDF format requires that embedded fonts use precise glyph naming
 conventions to map glyphs back to characters, otherwise the Unicode
 character sequences associated with a glyph (or multiple glyphs if they are
 ligatured or in complex layouts or with uncommon decorations, or rendered
 on a non-uniform background, or with glyphs filled with a pattern, such as
 labels over a photograph or cartographic map) will not be recognized. This
 remark about PDFs is also applicable to PostScript documents.
 Some PDF readers in that case attempt to perform some OCR (plus dictionary
 lookups to fix misreadings) for common glyph forms, but will almost always
 fail if the glyphs are too specific, such as when they include swashes,
 ligatures, or unknown scripts and scripts with complex layouts (such as the
 invented script created by William for noting sentences with specific
 characters with new glyphs, a specific syntax and specific layout
 rules). In other cases the PDF reader will just put in the clipboard only a
 bitmap for the selection, and it will be another piece of software that
 will attempt to interpret the bitmap with OCR.
 The glyph naming conventions are documented in PDF specifications, but many
 PDF creators do not follow these rules, and copying text from these PDFs
 fails.
 2015-06-03 15:03 GMT+02:00 Philippe Verdy verd...@wanadoo.fr:
 This possibly fails because William possibly forgot to embed his font in
 the document itself (or Serif PagePlus forgets to do it when it creates the
 PDF document, and refuses to embed glyphs from the font that are bound to
 Unicode PUAs when it creates the embedded font). However, there is no such
 problem when creating PDFs with MS Office, or via the Adobe Acrobat printer
 driver or other printer drivers generating PDF files (including Google
 Cloud Print).

 So this could be a misuse of Serif PagePlus when creating the PDF (I don't
 know this software; maybe there are options set that tell it not to
 embed fonts from a list of fonts that the recipient is supposed to have
 installed locally, to save storage space for the document by avoiding
 such embedding). Another reason may be that the font is marked as not
 embeddable within its exposed properties.

 Another reason may be that John tries to open the document with software
 that does not handle embedded fonts, or that ignores them and uses only the
 fonts preinstalled by John in his preferences. In such a case the result
 depends only on the fonts preinstalled on his local system (which do not
 include the fonts created by William), or his software is set up to use
 exclusively a specific local Unicode font for all PUAs.

 (Software that behaved in this bad way included old versions of Internet
 Explorer, due to limitations of its text renderers. However, this should not
 happen with PDFs, provided you have used a correct plugin version for
 displaying PDFs in the browser: if this fails in the browser, download the
 document and view it with Adobe Reader instead of via the plugin. There
 are many PDF plugins on the market that do not support essential features and
 are just built to display PDFs containing scanned bitmaps, but with very poor
 support of text or vector graphics, or are tuned specifically to change the
 document for another device or paper format.)

 Without knowing which software is used (and which PDF in the list does
 not load correctly), it is difficult to tell, but I have no problems
 with the few docs I saw that were created by William. So:

 NO F = NO FAIL for me.

 2015-06-03 13:38 GMT+02:00 John idou...@gmail.com:

 Yep, I clicked on your document and saw an empty square where your
 character should be.

 F = FAIL.

 —
 Chris


 On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington 
 wjgo_10...@btinternet.com wrote:

 Private Use Area in Use (from Tag characters and in-line graphics (from
 Tag characters))


  That's not agreed upon. I'd say that the general agreement is that
 the private ranges are of limited usefulness for some very limited use
 cases (such as designing encodings for new scripts).


  They are of limited usefulness precisely because

Re: Tag characters and in-line graphics (from Tag characters)

2015-06-03 Thread John
So what you’re saying is that the current situation where you see an empty 
square □ for unknown characters is better than seeing something useful?


—
Chris

On Thu, Jun 4, 2015 at 12:59 AM, Doug Ewell d...@ewellic.org wrote:

 Chris idou747 at gmail dot com wrote:
 Right now, what happens if you have a domain or locale requirement for
 a special character?
 That's what the PUA is for. Assign a PUA code point to your special
 character, create a font which implements the PUA character, create a
 brief private agreement which states that this code point refers to
 that character and which mentions the font, put the private agreement on
 the web, and publish your document with a reference to the agreement.
 For most non-professionals, creating the font is the tricky part.
 Also see Section 23.5 of TUS.
 Note that I am disagreeing with Martin about the PUA being useful only
 as a scratch area for standardization.
 --
 Doug Ewell | http://ewellic.org | Thornton, CO 
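
One small piece of the recipe quoted above can be sketched in Python: listing which Private Use Area code points a document actually uses, so that a private agreement can enumerate them. The three ranges are the PUA ranges defined by Unicode; the function names and the sample string are assumptions made up for this example.

```python
PUA_RANGES = (
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)

def is_private_use(ch):
    """True if the character lies in one of the three PUA ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

def private_use_code_points(text):
    """Return the distinct PUA code points in text, in numeric order."""
    code_points = sorted({ord(ch) for ch in text if is_private_use(ch)})
    return ["U+%04X" % cp for cp in code_points]

# Hypothetical document text mixing ordinary and private-use characters.
found = private_use_code_points("plain \uE000 text \uE001\uE000 \U000F0000")
```

The scan says nothing about what the code points mean; that is exactly the part the private agreement and the font have to supply.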

Re: Tag characters and in-line graphics (from Tag characters)

2015-06-03 Thread David Starner
On Wed, Jun 3, 2015 at 5:46 PM Chris idou...@gmail.com wrote:


 I personally think emoji should have one, single definitive representation
 for this exact reason.


Then you want an image. I don't see what's hard about that.


 The community interested in tony the tiger can make decisions like that.


That is a hell of a handwave. In practice, you've got a complex decision
that's always going to be a bit controversial, and one that most
communities won't bother trying to make.



 You can’t know because they’re images.


You can't know because the only obvious equivalence relation is exact image
identity.

You can’t iterate over compressed bits. You can’t process them.


Why not? In any language I know of that has iterators, there would be no
problem writing one that iterates over compressed input. If you need to
mutate them, that is hard in compressed formats, but a new CPU can store
War and Peace in the on-CPU cache.
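
A minimal sketch of such an iterator in Python, assuming zlib-compressed UTF-8 input delivered in arbitrary chunks. The function name and the sample text are made up for illustration, and this is one possible approach, not a claim about how any particular application does it.

```python
import codecs
import zlib

def iter_chars(compressed_chunks):
    """Yield characters one at a time from zlib-compressed UTF-8 chunks."""
    decompressor = zlib.decompressobj()
    # The incremental decoder copes with multi-byte characters that are
    # split across chunk boundaries.
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in compressed_chunks:
        yield from decoder.decode(decompressor.decompress(chunk))
    # Flush whatever is still buffered in the decompressor and decoder.
    yield from decoder.decode(decompressor.flush(), final=True)

sample = "War and Peace ❄"
blob = zlib.compress(sample.encode("utf-8"))

# Feed the compressed data in tiny pieces to exercise the streaming path.
chunks = [blob[i:i + 4] for i in range(0, len(blob), 4)]
result = "".join(iter_chars(chunks))
```

The whole text is never held decompressed in memory at once; only the current chunk is. Mutating compressed data in place remains the hard part, as noted above.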


Re: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters))

2015-06-03 Thread John
Yep, I clicked on your document and saw an empty square where your character 
should be.




F = FAIL.



—
Chris

On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington
wjgo_10...@btinternet.com wrote:

 Private Use Area in Use (from Tag characters and in-line graphics (from Tag 
 characters))
 That's not agreed upon. I'd say that the general agreement is that the 
 private ranges are of limited usefulness for some very limited use cases 
 (such as designing encodings for new scripts).
 They are of limited usefulness precisely because it is pathologically hard 
 to make use of them in their current state of technological evolution. If 
 they were easy to make use of, people would be using them all the time. I’d 
 bet good money that if you surveyed a lot of applications where custom 
 characters are being used, they are not using private use ranges. Now why 
 would that be?
 Actually, I have used Private Use Area characters a lot, and, once I had got 
 used to them, I found them incredibly straightforward to use.
 I have made fonts that include Private Use Area encodings using the 
 High-Logic FontCreator program and then used those fonts in Serif PagePlus, 
 both to produce PDF documents and PNG graphics, as needed for my particular 
 project at the time.
 For example,
 http://forum.high-logic.com/viewtopic.php?f=10&t=2957
 http://forum.high-logic.com/viewtopic.php?f=10&t=2672
 William Overington
 3 June 2015

Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters))

2015-06-03 Thread William_J_G Overington
Private Use Area in Use (from Tag characters and in-line graphics (from Tag 
characters))


 That's not agreed upon. I'd say that the general agreement is that the 
 private ranges are of limited usefulness for some very limited use cases 
 (such as designing encodings for new scripts).


 They are of limited usefulness precisely because it is pathologically hard to 
 make use of them in their current state of technological evolution. If they 
 were easy to make use of, people would be using them all the time. I’d bet 
 good money that if you surveyed a lot of applications where custom characters 
 are being used, they are not using private use ranges. Now why would that be?


Actually, I have used Private Use Area characters a lot, and, once I had got 
used to them, I found them incredibly straightforward to use.


I have made fonts that include Private Use Area encodings using the High-Logic 
FontCreator program and then used those fonts in Serif PagePlus, both to 
produce PDF documents and PNG graphics, as needed for my particular project at 
the time.


For example,


http://forum.high-logic.com/viewtopic.php?f=10&t=2957


http://forum.high-logic.com/viewtopic.php?f=10&t=2672


William Overington




3 June 2015



Re: Tag characters and in-line graphics (from Tag characters)

2015-06-03 Thread Philippe Verdy
Compression is even more important today on mobile networks: mobile apps
are very verbose over the net, and you can easily end up paying for the
extra volume. In addition, mobile networks are frequently much slower than
advertised: even if you pay the extra subscription to get 3G/4G, you depend
on antennas and on the number of people around you.
In my home, 3G/4G in fact does not work at all, and this is the case in
many places around my city, even though they are sold as having full
coverage. For example, just downloading or updating an application is
simply impossible: I have to be at home, connected to my Wi-Fi router, and
when its internet link fails (this sometimes happens for several hours) I
have extremely slow connections on 3G/4G (which is overcrowded at the same
time and only delivers 2G speeds).
A lot of people frequently have to put up with low bandwidth on mobile
networks, independently of the price they paid for their subscription.
So compressing data is still extremely important (even for text or for the
smallest web requests). Thankfully, compression is now part of the web
transport, but this does not mean that apps need not learn to represent
their interchanged data efficiently and develop less verbose protocols and
APIs.

More and more people now use mobile networks rather than fixed landline
internet access (or home Wi-Fi routers connected to it), and even for the
latter, fiber access is still just for a minority of people in dense
areas; the others don't get more than a handful of megabit/s on their DSL
access. If you look at internet connections worldwide, a large majority of
people don't get more than 2 megabit/s. This is enough for reading and
sending SMS, for phone calls, or for exchanging emails, but not if you need
frequent updates to your apps, your apps are too verbose, and there are too
many apps in the background. Many people cannot view videos on their mobile
access, or only with very poor quality if they view them live (nor can they
download them slowly, for lack of storage space on their mobile device, so
videos have to remain short in total volume and duration).

So I disagree: compression is absolutely needed (even more today than it
was in the past, when mobile Internet access was still for a minority).
Mobile networks are not really faster today: their bandwidth does not
double every three years like the local performance of devices. But with
this extra local performance, you can support more complex compression
schemes that require more CPU/GPU power, which is no longer a bottleneck;
the real bottleneck is the effectively available bandwidth of the mobile
network (smaller than the connection bandwidth, because this bandwidth is
shared... and expensive).



2015-06-03 15:24 GMT+02:00 David Starner prosfil...@gmail.com:

 Chris wrote:
  There is no way to compare 2 HTML elements and know they are talking
 about the same character

 That's because character identity is a hard problem. Is the emoji TIGER
 the same as TONY THE TIGER or as TONY THE TIGER GIVING THE VICTORY SIGN?


 http://www.engadget.com/2014/04/30/you-may-be-accidentally-sending-friends-a-hairy-heart-emoji/

 Note that even in Unicode, the set ẛ  ᷥ ſ ṡ s S Ŝ may be considered the
 same character or up to seven different characters, depending on
 case-folding, canonicalization and accent dropping.

  Similarly, there is no way to search or index html elements. If a HTML
 document contained an image of a particular custom character, there would
 be no way to ask google or whatever to find all the documents with that
 character. Different documents would represent it differently.

 You can index links to images. If two documents represent it differently,
 then I go back to the above; we can't know that they're the same thing.

 On Tue, Jun 2, 2015 at 7:11 PM Chris idou...@gmail.com wrote:

 You can’t ask the entire computing universe to compress everything all
 the time.


 Anytime we care about how much space text takes up, it should be
 compressed. It compresses very well. On the other hand, it's rare that
 anyone cares anymore; what's a few hundred kilobytes between friends?



Re: Tag characters and in-line graphics (from Tag characters)

2015-06-03 Thread William_J_G Overington
Earlier in this thread, on 2 June 2015, I wrote as follows:
 A mechanism to be able to use the method to define a glyph linked to a 
 Unicode code point would be a useful facility to add for use in a situation 
 where the glyph is for a regular Unicode character.
I have now thought of a mechanism to use.
Please imagine the base character followed by a sequence of tag characters, the 
tag characters here represented by ordinary letters and digits.
Here is an example of the mechanism for defining the glyph for U+E702 in a 
particular document as 7 red pixels.
HE702U7r
The tag H character switches to hexadecimal input mode, then there are as many 
tag characters as necessary to express in hexadecimal notation the code point 
of the character for which the definition is being made, then there is a tag U 
character to action the definition and go out of hexadecimal input mode. The 
tag 7r is to express 7 red pixels.
In practice the number of tag characters after the tag U character might be 
around 200, the above tag 7r is just a minimal example so as to explain the 
concept.
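
As a rough illustration of the mechanism described above, here is a minimal Python sketch. It assumes that each ordinary letter or digit stands for the tag character at U+E0000 plus its ASCII value (as in the existing tag block) and that the hexadecimal digits are uppercase. The names to_tags, from_tags and parse_definition are made up for this example; they are not part of any proposal or library.

```python
import re

TAG_BASE = 0xE0000  # tag characters mirror printable ASCII at U+E0020..U+E007E


def to_tags(ascii_text: str) -> str:
    """Map printable ASCII to the corresponding tag characters."""
    return "".join(chr(TAG_BASE + ord(c)) for c in ascii_text)


def from_tags(tag_text: str) -> str:
    """Map tag characters back to printable ASCII."""
    return "".join(chr(ord(c) - TAG_BASE) for c in tag_text)


def parse_definition(tag_text: str):
    """Split 'H<hex digits>U<pixels>' into (code point, pixel description)."""
    match = re.fullmatch(r"H([0-9A-F]+)U(.*)", from_tags(tag_text))
    if match is None:
        raise ValueError("not a glyph definition")
    return int(match.group(1), 16), match.group(2)


code_point, pixels = parse_definition(to_tags("HE702U7r"))
print(hex(code_point), pixels)  # 0xe702 7r
```

Because U is not a hexadecimal digit, the end of the code point is unambiguous however many digits are used.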

While posting, may I mention please one other matter?
Previously I mentioned using tag R, tag G and tag B in defining colours. I now 
add tag A to that colour definition so as to define opacity, that is, what is 
sometimes called transparency, where 0 means totally transparent and 255 means 
totally opaque. If no value is stated for A then it should be presumed to have 
a value of 255, so that the default situation is to define opaque colours.
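
A small Python sketch of the default described above, using ordinary letters in place of tag characters and the 192R224G64B form from the original proposal. The function name parse_colour is hypothetical, and treating a missing R, G or B value as 0 is my assumption, since the proposal only states a default for A.

```python
import re


def parse_colour(spec: str):
    """Parse e.g. '192R224G64B' or '192R224G64B128A' into (R, G, B, A)."""
    # A defaults to 255 (opaque) as stated; the 0 defaults are an assumption.
    channels = {"R": 0, "G": 0, "B": 0, "A": 255}
    for value, name in re.findall(r"(\d+)([RGBA])", spec):
        channels[name] = int(value)
    return tuple(channels[name] for name in "RGBA")


print(parse_colour("192R224G64B"))      # (192, 224, 64, 255)
print(parse_colour("192R224G64B128A"))  # (192, 224, 64, 128)
```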

I feel that the information in this thread is now a good basis for the 
assessment of this suggested format as to whether it could be a useful open 
source system with good interoperability potential that could usefully be 
submitted to the Unicode Technical Committee.
William Overington
3 June 2015


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-03 Thread David Starner
Chris wrote:
 There is no way to compare 2 HTML elements and know they are talking
about the same character

That's because character identity is a hard problem. Is the emoji TIGER the
same as TONY THE TIGER or as TONY THE TIGER GIVING THE VICTORY SIGN?

http://www.engadget.com/2014/04/30/you-may-be-accidentally-sending-friends-a-hairy-heart-emoji/

Note that even in Unicode, the set ẛ  ᷥ ſ ṡ s S Ŝ may be considered the
same character or up to seven different characters, depending on
case-folding, canonicalization and accent dropping.
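
To make that concrete, here is a minimal Python sketch using the standard unicodedata module. It applies one possible "loose" comparison key (compatibility decomposition, then dropping combining marks, then case folding); the name loose_key and this particular combination of steps are my own choice for illustration, not a standard algorithm. The combining long s (U+1DE5) is left out because it is itself a combining mark and would simply be stripped.

```python
import unicodedata

# U+1E9B, U+017F, U+1E61, s, S, U+015C: six of the letters from the set above.
LETTERS = "\u1e9b\u017f\u1e61sS\u015c"


def loose_key(text: str) -> str:
    """Decompose, drop combining marks, then case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


print(len(set(LETTERS)))                    # distinct code points as written
print(len({c.casefold() for c in LETTERS})) # distinct after case folding only
print({loose_key(c) for c in LETTERS})      # distinct after the full loose key
```

How many "characters" there are depends entirely on which of these steps the comparison applies.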

 Similarly, there is no way to search or index html elements. If a HTML
document contained an image of a particular custom character, there would
be no way to ask google or whatever to find all the documents with that
character. Different documents would represent it differently.

You can index links to images. If two documents represent it differently,
then I go back to the above; we can't know that they're the same thing.

On Tue, Jun 2, 2015 at 7:11 PM Chris idou...@gmail.com wrote:

 You can’t ask the entire computing universe to compress everything all the
 time.


Anytime we care about how much space text takes up, it should be
compressed. It compresses very well. On the other hand, it's rare that
anyone cares anymore; what's a few hundred kilobytes between friends?
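
For anyone who wants to check the "compresses very well" claim on their own text, a minimal sketch with Python's standard zlib module follows. The sample string is deliberately repetitive, so the ratio it prints is much better than real prose would give; substitute your own text to get a meaningful figure.

```python
import zlib

# Deliberately repetitive sample; real text will compress less dramatically.
text = ("Anytime we care about how much space text takes up, "
        "it should be compressed. ") * 200

raw = text.encode("utf-8")
packed = zlib.compress(raw, 9)  # 9 = best compression, slowest

print(len(raw), "bytes raw")
print(len(packed), "bytes compressed")
print(f"ratio: {len(raw) / len(packed):.1f}x")
```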


Re: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters))

2015-06-03 Thread Philippe Verdy
Note that copy-pasting from a PDF to another document is very tricky: the
PDF format requires that embedded fonts use precise glyph naming
conventions to map glyphs back to characters; otherwise the Unicode
character sequences associated with a glyph (or with multiple glyphs if
they are ligatured, in complex layouts, with uncommon decorations, rendered
on a non-uniform background, or filled with a pattern, such as labels over
a photograph or cartographic map) will not be recognized. This remark about
PDFs also applies to PostScript documents.

Some PDF readers in that case attempt to perform some OCR (plus dictionary
lookups to fix misreadings) for common glyph forms, but this will almost
always fail if the glyphs are too specific, such as when they include
swashes, ligatures, or unknown scripts and scripts with complex layouts
(such as the script invented by William for noting sentences with specific
characters with new glyphs, a specific syntax and specific layout rules).
In other cases the PDF reader will just put a bitmap of the selection in
the clipboard, and it will be up to other software to attempt to interpret
the bitmap with OCR.

The glyph naming conventions are documented in the PDF specifications, but
many PDF creators do not follow these rules, and copying text from these
PDFs fails.
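
As an illustration of what such a naming convention looks like, here is a small Python sketch of one widely used scheme, the one from Adobe's glyph list specification as I understand it: "uni" followed by groups of four uppercase hex digits, or "u" followed by four to six. The function name glyph_name_to_text is made up, and the sketch ignores the named-glyph list and underscore-joined ligature names, so it is far from a complete implementation.

```python
import re


def glyph_name_to_text(name: str):
    """Map names like 'uniE702' or 'u1F5BC' to text; None if unrecognised."""
    base = name.split(".", 1)[0]  # drop suffixes such as '.alt'
    match = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", base)
    if match:
        digits = match.group(1)
        return "".join(chr(int(digits[i:i + 4], 16))
                       for i in range(0, len(digits), 4))
    match = re.fullmatch(r"u([0-9A-F]{4,6})", base)
    if match and int(match.group(1), 16) <= 0x10FFFF:
        return chr(int(match.group(1), 16))
    return None


print(glyph_name_to_text("uniE702") == "\ue702")  # True
print(glyph_name_to_text("glyph42"))              # None
```

A glyph named something like "glyph42" gives a reader nothing to map back to a code point, which is one way copy and paste can fail.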


Re: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters))

2015-06-03 Thread Philippe Verdy
This possibly fails because William possibly forgot to embed his font in
the document itself (or Serif PagePlus forgets to do it when it creates the
PDF document, or refuses to embed glyphs from the font that are bound to
Unicode PUAs when it creates the embedded font). However, there is no such
problem when creating PDFs with MS Office, or via the Adobe Acrobat printer
driver or other printer drivers generating PDF files (including Google
Cloud Print).

So this could be a misuse of Serif PagePlus when creating the PDF (I don't
know this software; maybe there are options set up that tell it not to
embed fonts from a list of fonts that the recipient is supposed to have
installed locally, to save storage space for the document by avoiding
such embedding). Another reason may be that the font is marked as not
embeddable within its exposed properties.

Another reason may be that John tries to open the document with software
that does not handle embedded fonts, or that ignores them and uses only the
fonts preinstalled by John in his preferences. In such a case the result
depends only on the fonts preinstalled on his local system (which do not
include the fonts created by William), or his software is set up to use
exclusively a specific local Unicode font for all PUAs.

(Software that behaved in this bad way included old versions of Internet
Explorer, due to limitations of its text renderers. However, this should
not happen with PDFs, provided you use a correct plugin version for
displaying PDFs in the browser. If this fails in the browser, download the
document and view it with Adobe Reader instead of the plugin: there
are many PDF plugins on the market that do not support essential features
and are just built to display PDFs containing scanned bitmaps, with very
poor support for text or vector graphics, or are tuned specifically to
adapt the document to another device or paper format.)

Without knowing which software is used (and which PDF in the list does not
load correctly), it is difficult to tell, but for my part I have no
problems with the few docs I saw created by William. So:

NO F = NO FAIL for me.

2015-06-03 13:38 GMT+02:00 John idou...@gmail.com:

 Yep, I clicked on your document and saw an empty square where your
 character should be.

 F = FAIL.

 —
 Chris


 On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington 
 wjgo_10...@btinternet.com wrote:

 Private Use Area in Use (from Tag characters and in-line graphics (from
 Tag characters))


  That's not agreed upon. I'd say that the general agreement is that the
 private ranges are of limited usefulness for some very limited use cases
 (such as designing encodings for new scripts).


  They are of limited usefulness precisely because it is pathologically
 hard to make use of them in their current state of technological evolution.
 If they were easy to make use of, people would be using them all the time.
 I’d bet good money that if you surveyed a lot of applications where custom
 characters are being used, they are not using private use ranges. Now why
 would that be?


 Actually, I have used Private Use Area characters a lot, and, once I had
 got used to them, I found them incredibly straightforward to use.


 I have made fonts that include Private Use Area encodings using the
 High-Logic FontCreator program and then used those fonts in Serif PagePlus,
 both to produce PDF documents and PNG graphics, as needed for my particular
 project at the time.


 For example,


 http://forum.high-logic.com/viewtopic.php?f=10&t=2957


 http://forum.high-logic.com/viewtopic.php?f=10&t=2672


 William Overington




 3 June 2015





Re: Tag characters and in-line graphics (from Tag characters)

2015-06-03 Thread Doug Ewell
Chris idou747 at gmail dot com wrote:

 Right now, what happens if you have a domain or locale requirement for
 a special character?

That's what the PUA is for. Assign a PUA code point to your special
character, create a font which implements the PUA character, create a
brief private agreement which states that this code point refers to
that character and which mentions the font, put the private agreement on
the web, and publish your document with a reference to the agreement.
For most non-professionals, creating the font is the tricky part.

Also see Section 23.5 of TUS.
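
When preparing such a private agreement it helps to list exactly which private-use code points a document contains. A minimal Python sketch using the standard unicodedata module follows; private-use characters have general category Co. The helper names are made up for this example.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """True for code points in the BMP PUA or in planes 15 and 16."""
    return unicodedata.category(ch) == "Co"


def private_use_chars(text: str):
    """List the distinct private-use code points found in text."""
    return sorted({f"U+{ord(c):04X}" for c in text if is_private_use(c)})


sample = "a\ue702b\U000f0000"
print(private_use_chars(sample))  # ['U+E702', 'U+F0000']
```

The resulting list is what the agreement would need to document, along with the font that implements those code points.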

Note that I am disagreeing with Martin about the PUA being useful only
as a scratch area for standardization.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Tag characters and in-line graphics (from Tag characters)

2015-06-03 Thread Doug Ewell
Chris idou747 at gmail dot com wrote:

 Why shouldn’t there be a standard way to go out on the net and find
 the canonical glyph for a code?

Because there isn't one. Glyphs are suggestions, meant to convey the
identity of the character.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Tag characters and in-line graphics (from Tag characters)

2015-06-03 Thread Philippe Verdy
2015-06-04 2:59 GMT+02:00 David Starner prosfil...@gmail.com:

 You can’t iterate over compressed bits. You can’t process them.


 Why not? In any language I know of that has iterators, there would be no
 problem writing one that iterates over compressed input. If you need to
 mutate them, that is hard in compressed formats, but a new CPU can store
 War and Peace in the on-CPU cache.


You're right: today the CPU is no longer the bottleneck, which is now
* the speed of long buses and communication links, with their limited (and
costly) bandwidth, as this is a shared medium used by more and more people
and requiring massive infrastructure, or with physical constraints even on
the fastest serial buses, both implying transmission round-trip times
(limiting random access, which is a severe problem now that we have to
access extremely large volumes of data distributed over multiple devices or
over a full network);
* the storage capacity of the fastest storage medium (such as flash
memory, which is the only option for mobile devices, but also the most
expensive).
In both cases you need compression (the second bottleneck, on storage
volumes, will fade out in a few years, but not the bandwidth constraints).
It really pays now to use compression schemes, even the most complex ones
such as those used to transmit live video: locally a CPU or GPU will easily
handle the compression scheme.
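
The point quoted above about iterating over compressed input can be sketched in a few lines of Python with the standard zlib and codecs modules. This is only one way to do it; the name iter_chars and the 64-byte chunk size are arbitrary choices for the example.

```python
import codecs
import zlib


def iter_chars(compressed: bytes, chunk_size: int = 64):
    """Yield characters from zlib-compressed UTF-8 without inflating it all."""
    inflater = zlib.decompressobj()
    decoder = codecs.getincrementaldecoder("utf-8")()
    for start in range(0, len(compressed), chunk_size):
        chunk = compressed[start:start + chunk_size]
        # The incremental decoder copes with characters split across chunks.
        yield from decoder.decode(inflater.decompress(chunk))
    yield from decoder.decode(inflater.flush(), final=True)


original = "War and Peace \u25a1 " * 100
packed = zlib.compress(original.encode("utf-8"))
print("".join(iter_chars(packed)) == original)  # True
```

Only one small chunk is inflated at a time, so the consumer never needs the whole uncompressed text in memory.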

Research on compression schemes is really not over; it has never been as
active as it is today, including for text, because of the explosion of
data volumes, even if the volume of text is now largely overwhelmed by
the volume of images, video and audio. But you can't compute a lot of
things from audio/image/video data sources: we still need text to give
semantics to these media, from which you can derive data or perform
searches. There is still a lot to do for handling images and speech audio
and detecting some semantics in them, but you won't get as much information
from audio or video as can be represented by text. OCR, for example, is a
very heuristic process that produces lots of false guesses, still far more
than human brains do, which can cope with the broad range of variations
that we call cultures; computers are still very poor at recognizing
cultures with as many variations as those we recognize through social
interactions and years of education and *personal* experience.


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-03 Thread Doug Ewell

Chris John idou747 at gmail dot com wrote:


So what you’re saying is that the current situation where you see an
empty square □ for unknown characters is better than seeing something
useful?


No, that's why you include a reference to the font in the private 
agreement, so that interested parties can install it and see the special 
character(s).


--
Doug Ewell | http://ewellic.org | Thornton, CO  



Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Philippe Verdy
Once again, no! Unicode is a standard for encoding characters, not for
encoding syntactic elements of a glyph definition!

Your project is out of scope. You still want to reinvent the wheel.

To create a syntax, define it within a language, which does not need new
characters (you're not creating an APL grammar using specific symbols for
some operators, more or less based on Greek letters and geometric shapes:
those are just like mathematical symbols). Programming languages and data
languages (JavaScript, XML, JSON, HTML...) and their syntax are themselves
encoded in plain-text documents using standard characters and don't
need new characters. APL is an exception only because computers or
keyboards were produced to facilitate the input (those who didn't have such
keyboards used specific editors or the APL runtime environment, which offer
an input method for entering programs in this APL input mode).

And again you want the chicken before the egg: have you ever even read the
encoding policy? The UCS will not encode characters without a demonstrated
usage. Nothing in what you propose is really used, apart from being
proposed only by you and used only by you for your private use (or with a
few of your unknown friends, but this is invisible and unverifiable).
Nothing has been published.

Currency symbols are an exception to the demonstrated-use rule, only
because once they are created they are extremely rapidly needed by lots of
people, in fact most people of a region as large as a country, and by many
other countries that will reference or use them. But even in this case,
what is encoded is the character itself, not the glyph or new characters
used to define the glyph!

Can you stop proposing off-topic subjects like this on this list? You
are not speaking about Unicode or characters. Another list would be more
appropriate. You help no one here because all you want is to radically
change the goals of TUS.

2015-06-02 11:01 GMT+02:00 William_J_G Overington wjgo_10...@btinternet.com:

 Perhaps the solution to at least some of the various issues that have been
 discussed in this thread is to define a tag letter z as a code within the
 local glyph memory requests, as follows.



Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread William_J_G Overington
Perhaps the solution to at least some of the various issues that have been 
discussed in this thread is to define a tag letter z as a code within the local 
glyph memory requests, as follows.

Local glyph memory, for use in compressing a document where the same glyph is 
used two or more times in the document:
3t7r means this is local glyph 3 being defined at its first use in the document 
as 7 red pixels
3h here local glyph 3 is being used
3z7r means this is local glyph 3 being defined, though not used, at the start 
of the document as 7 red pixels
More than one local glyph could be defined at the start of the document, as 
desired.
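
A minimal Python sketch of how a receiver might handle the t, h and z requests described above, with ordinary letters standing in for tag characters and each list item standing for one base character plus its tag sequence. The function name render and the idea of returning the displayed pixel descriptions as a list are assumptions made only for this illustration.

```python
import re


def render(requests):
    """Return the pixel descriptions actually displayed, in order."""
    glyphs = {}  # local glyph number -> pixel description
    shown = []
    for request in requests:
        match = re.fullmatch(r"(\d+)([thz])(.*)", request)
        if match is None:
            shown.append(request)  # an unnumbered graphic such as '5y'
            continue
        number, op, body = int(match.group(1)), match.group(2), match.group(3)
        if op in "tz":
            glyphs[number] = body          # t and z both define the glyph
        if op in "th":
            shown.append(glyphs[number])   # t and h both display it
    return shown


print(render(["3z7r", "3h", "5y", "3h"]))  # ['7r', '5y', '7r']
```

With z, every definition can sit at the start of the document and each later use is just the short h form.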

This would mean that use of such a glyph within the document would be by just 
using the quite short base character followed by tag characters sequence using 
the h request. This would enable document editing to be easier to accomplish.

A mechanism to be able to use the method to define a glyph linked to a Unicode 
code point would be a useful facility to add for use in a situation where the 
glyph is for a regular Unicode character.

May I mention something that I forgot to mention earlier please?
When only one pixel of a particular colour is being specified, it can be 
specified using just the code for the colour.
For example, for 1 red pixel please use r on its own, there is no need to use 
1r though 1r should be made to work just in case anyone does use that format.
There was a time when I used to use the FORTH programming language and this 
format of first inputting the number then the operator is based on the way that 
the FORTH programming language works.
William Overington
2 June 2015
Original message
From : wjgo_10...@btinternet.com
Date : 27/05/2015 - 17:26 (GMTST)
To : unicode@unicode.org
Subject : Tag characters and in-line graphics (from Tag characters)
Tag characters and in-line graphics (from Tag characters)
This document suggests a way to use the method of a base character together 
with tag characters to produce a graphic. The approach is theoretical and has 
not, at this time, been tried in practice.
The application in mind is to enable the graphic for an emoji character to be 
included within a plain text stream, though there will hopefully be other 
applications.
The base character could be either an existing character, such as U+1F5BC FRAME 
WITH PICTURE, or a new character as decided. Tests could be carried out using a 
Private Use Area character as the base character.
The explanation here is intended to explain the suggested technique by 
examples, as a basis for discussion. In each example, please consider that 
the characters listed are each the tag version of the character used here 
and that they all, as a group, follow one base character.
The examples are deliberately short so as to explain the idea. A real use 
example might have around two hundred or so tag characters following the base 
character, maybe more, sometimes fewer.
Examples of displays:
Each example is left to right along the line then lines down the page from 
upper to lower.
7r means 7 pixels red
7r5y means 7 pixels red then 5 pixels yellow
7r5y-3b means 7 pixels red then 5 pixels yellow then next line then 3 pixels 
blue
Examples of colours available:
k black
n brown
r red
o orange
y yellow
g green (0, 255, 0)
b blue
m magenta
e grey
w white
c cyan
p pink
d dark grey
i light grey (thus avoiding using lowercase l so as to avoid confusion with 
figure 1)
f deeper green (foliage colour) (0, 128, 0)
Next line request:
- moves to the next line
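
The run-length notation above can be decoded with a few lines of Python. This is a sketch only: it uses ordinary letters in place of tag characters, produces rows of colour letters rather than RGB values (the proposal gives RGB values for only two of the colours), and treats a missing count as 1. The name decode_rows is made up for the example.

```python
import re

COLOUR_LETTERS = "knroygbmewcpdif"
RUN = rf"(\d*)([{COLOUR_LETTERS}]|-)"


def decode_rows(spec: str):
    """Expand e.g. '7r5y-3b' into rows of colour letters, one per pixel."""
    if not re.fullmatch(rf"(?:\d*[{COLOUR_LETTERS}]|-)*", spec):
        raise ValueError(f"unrecognised pixel description: {spec!r}")
    rows = [""]
    for count, symbol in re.findall(RUN, spec):
        if symbol == "-":
            rows.append("")                        # next line request
        else:
            rows[-1] += symbol * int(count or "1")  # bare letter = 1 pixel
    return rows


print(decode_rows("7r5y-3b"))  # ['rrrrrrryyyyy', 'bbb']
```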
Local palette requests:
192R224G64B2s means store as local palette colour 2 the colour (R=192, G=224, 
B=64)
7,2u means 7 pixels using local palette colour 2
Local glyph memory, for use in compressing a document where the same glyph is 
used two or more times in the document:
3t7r means this is local glyph 3 being defined at its first use in the document 
as 7 red pixels
3h here local glyph 3 is being used
The above is for bitmaps. It would be possible to use a similar technique to 
specify a vector glyph as used in fontmaking using on-curve and off-curve 
points specified as X, Y coordinates together with N for on-curve and F for 
off-curve. There would need to be a few other commands so as to specify places 
in the tag character stream where definition of a contour starts and so as to 
separate the definitions of the glyphs for a colour font and so on. This could 
be made OpenType compatible so that a received glyph could be added into a font.
Please feel free to suggest improvements. One improvement could be as to how to 
build a Unicode code point into a picture so that a font could be transmitted.
William Overington
27 May 2015


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Martin J. Dürst

On 2015/06/03 07:55, Chris wrote:


As you point out, “The UCS will not encode characters without a demonstrated 
usage.” But there are use cases for characters that don’t meet UCS’s criteria for a 
world wide standard, but are necessary for more specific use cases, like specialised 
regional, business, or domain specific situations.


Unicode contains *a lot* of characters for specialized regional, 
business, or domain specific situations.



My question is, given that unicode can’t realistically (and doesn’t aim to) 
encode every possible symbol in the world, why shouldn’t there be an EXTENSIBLE 
method for encoding, so that people don’t have to totally rearchitect their 
computing universe because they want ONE non-standard character in their 
documents?


As has been explained, there are technologies that allow you to do (more 
or less) that. Information technology, like many other technologies, 
works best when finding common cases used by many people. Let's look at 
some examples:


Character encodings work best when they are used widely and uniformly. I 
don't know anybody who actually uses all the characters in Unicode 
(except the guys that work on the standard itself). So for each 
individual, a smaller set would be okay. And there were (and are) 
smaller sets, not for individuals, but for countries, regions, scripts, 
and so on. Originally (when memory was very limited), these legacy 
encodings were more efficient overall, but that's no longer the case. So 
everything is moving towards Unicode.


Most website creators don't use all the features in HTML5. So having 
different subsets for different use cases may seem to be convenient. But 
overall, it's much more efficient to have one Hypertext Markup Language, 
so that's where everybody is converging.


From your viewpoint, it looks like having something in between 
character encodings and HTML is what you want. It would only contain the 
features you need, and nothing more, and would work in all the places 
you wanted it to work. Asmus's inline text may be something similar.


The problem is that such an intermediate technology only makes sense if 
it covers the needs of lots and lots of people. It would add a third 
technology level (between plain text and marked-up text), which would 
divert energy from the current two levels and make things more complicated.


Up to now, such a third level hasn't emerged, among other reasons because 
both existing technologies were good at absorbing the most important use 
cases from the middle. Unicode continues to encode whatever symbols 
gain reasonable popularity, so every time somebody has a really good use 
case for the middle layer with a symbol that isn't yet in Unicode, that 
use case gets taken away. HTML (or Web technology in general) also 
worked to improve the situation, with technologies such as SVG and Web 
Fonts.


No technology is perfect, and so there are still some gaps between 
character encoding and markup, some of which may in due time eventually 
be filled up, but I don't think a third layer in the middle will emerge 
soon.


Regards,   Martin.


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Chris

I was asking why the glyphs for right arrow ➡ are inconsistent in many sources, 
through a couple of iterations of unicode. Perhaps I might observe that one of 
the reasons is there is no technical link between the code and the glyph. I 
can’t realistically write a display engine that goes to unicode.org 
or wherever, and dynamically finds the right standard 
glyph for unknown codes. This is also manifest in my seeing empty squares □ for 
characters my platform doesn’t know about. This isn’t the case with XML where I 
can send someone a random XML document, and there is a standard way to go out 
there on the internet and check if that XML is conformant. Why shouldn’t there 
be a standard way to go out on the net and find the canonical glyph for a code? 
If there was, then non-standard glyphs would fall out of that technology 
naturally.

So people are talking about all these technologies that are out there, html5, 
cmap, fonts and so forth, but there is no standard way to construct a list of 
“characters”, some of which might be non-standard, and be able to embed that 
ANYWHERE one might reasonably expect characters, have it processed in a normal 
way as characters, be sent anywhere and understood.

As you point out, “The UCS will not encode characters without a demonstrated 
usage.” But there are use cases for characters that don’t meet UCS’s criteria 
for a world wide standard, but are necessary for more specific use cases, like 
specialised regional, business, or domain specific situations.

My question is, given that unicode can’t realistically (and doesn’t aim to) 
encode every possible symbol in the world, why shouldn’t there be an EXTENSIBLE 
method for encoding, so that people don’t have to totally rearchitect their 
computing universe because they want ONE non-standard character in their 
documents?

Right now, what happens if you have a domain or locale requirement for a 
special character?  Most likely you suffer without it, because even though you 
could get it to render in some situations (like hand coding some IMGs into your 
web site), you just know you won’t be able to realistically input it into 
emails, word documents, spreadsheets, and whatever other random applications on 
a daily basis.

What I’m asking is: is it really beyond the unicode consortium’s scope, and/or 
would it really be a redundant technology to, for example, define a UTF-64 
coding format, where 32 bits allow 4 billion businesses and individuals to 
define their own characters sets (each of up to 4 billion characters), then 
have standard places on the internet (similar to DNS lookup servers) that can 
provide anyone with glyphs and fonts for it?

Right now, yes there are cmaps, but no standard way to combine characters from 
different encodings. No standard way to find the cmap for an unknown encoding. 
There is HTML5, but that doesn’t produce something that is recognisable as a 
list of characters that can be processed as such. (If there is an IMG in text, 
is it a “character” or an illustration in the text? How can you refer to a 
particular set of characters without having your own web server? How you render 
that text bigger, with the standard reference glyph without manually searching 
the internet where to find it? There is a host of problems here).

All these problems look unsolved to me, and they also look like encoding 
technology problems to me too. What other consortium out there is working 
on character encoding problems?


 On 2 Jun 2015, at 7:40 pm, Philippe Verdy verd...@wanadoo.fr wrote:
 
 Once again no ! Unicode is a standard for encoding characters, not for 
 encoding some syntactic element of a glyph definition !
 
 Your project is out of scope. You still want to reinvent the wheel.
 
 For creating syntax, define it within a language (which does not need new 
 characters (you're not creating an APL grammar using specific symbols for 
 some operators more or less based on Greek letters and geometric shapes: they 
 are just like mathematic symbols). Programming languages and data languages 
 (Javascript, XML, JSON, HTML...) and their syntax are encoded themselves in 
 plain text documents using standard characters) and don't need new 
 characters, APL being an exception only because computers or keyboards were 
 produced to facilitate the input (those that don't have such keyboards used 
 specific editors or the APL runtime environment that offer an input method 
 for entering programs in this APL input mode).
 
 And again you want the chicken before the egg: have you only ever read the 
 encoding policy ? The UCS will not encode characters without a demonstrated 
 usage. Nothing in what you propose is really used except being proposed only 
 by you, and used only by you for your private use (or with a few of your 
 unknown friends, but this is invisible and unverifiable). Nothing has been 
 published.
 
 Even for currency symbols (which are an exception to the demonstrated use, 
 

Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Chris

Martin, you seem to be labouring under the impression that HTML5 is a 
substitute for character encoding. If it is, why do we need unicode? We could 
just have documents laden with IMG tags, and restrict ourselves to ascii.

It seems I need to spell out one more time why HTML is not character encoding:

1. HTML5 doesn’t separate one particular representation (font, size, etc) from 
the actual meaning of the character. So you can’t paste it somewhere and expect 
to increase its point size or change its font.
2. It’s highly inefficient in space to drop multi-kilobyte strings into a 
document to represent one character.
3. The entire design of HTML has nothing to do with characters. So there is no 
way to process a string of characters interspersed with HTML elements and know 
which of those elements are a “character”. This makes programmatic manipulation 
impossible, and means most computer applications simply will not allow HTML in 
scenarios where they expect a list of “characters”.
4. There is no way to compare 2 HTML elements and know they are talking about 
the same character. I could put some HTML representation of a character in my 
document, you could put a different one in, and there would be absolutely no way 
to know that they are the same character. Even if we are in the same community 
and agree on the existence of this character.
5. Similarly, there is no way to search or index html elements. If a HTML 
document contained an image of a particular custom character, there would be no 
way to ask google or whatever to find all the documents with that character. 
Different documents would represent it differently. HTML is a rendering 
technology. It makes things LOOK a particular way, without actually ENCODING 
anything about it. The only part of HTML that is actually searchable in a 
deterministic fashion is the part that is encoded - the unicode part.

Unicode encodes symbols that have “reasonable popularity”. (a) that is not all 
of them. (b) how can a symbol attain reasonable popularity when it is not in 
unicode? Of course some can, but others have their popularity hindered by the 
very fact that they are not encoded!

Take the poop emoji that people recently have been talking about here. It 
gained popularity because the Japanese telecom companies decided to encode it. 
If they hadn’t encoded it, would it have become popular through normal 
culture such that the unicode consortium would have adopted it? No, it wouldn’t! 
The Japanese telcos were able to do this because they controlled their entire 
user base from hardware on up to encodings. That won’t be happening into the 
future, so new interesting and potentially universal emojis won’t ever come 
into existence in the way that this one did because of the control the unicode 
consortium exercises over this technology. But the problem isn’t restricted to 
emojis, many other potentially popular symbols can’t come into existence 
either. The internet *COULD* be the birthplace of lots of interesting new 
symbols in the same way that Japanese telecom companies birthed the original 
emojis, but it won’t be because the unicode consortium rules it from the top 
down.

Summary: 
1. HTML renders stuff, it encodes nothing. It addresses a completely different 
problem domain. If rendering and encoding were the same problem, unicode could 
disband now.
2. Unicode encodes stuff, but isn’t extensible in a way that is broadly useful, 
i.e. in a way that allows anybody (or any application) receiving a custom 
character to know what it is, or how to render it, or to combine it with other 
custom character sets.
3. The problem under discussion is not a rendering problem. HTML5 lacks nothing 
in terms of ability to render. Yet the problem remains. Because it’s an 
encoding problem. Encoding problems are in the unicode domain, not in the HTML5 
domain.

You say that character encodings work best when they are used widely and 
uniformly.  But they can only be as wide or as uniform as reality itself.  We 
could try and conform reality to technology and… for example… force all the 
world to use Latin characters and 128 ASCII representations. OR we can conform 
technology to reality. Not all encodings need to be, or ought to be as 
universal as requiring one world wide committee to pass judgment on them.



 On 3 Jun 2015, at 11:09 am, Martin J. Dürst due...@it.aoyama.ac.jp wrote:
 
 On 2015/06/03 07:55, Chris wrote:
 
 As you point out, “The UCS will not encode characters without a demonstrated 
 usage.”. But there are use cases for characters that don’t meet UCS’s 
 criteria for a world wide standard, but are necessary for more specific use 
 cases, like specialised regional, business, or domain specific situations.
 
 Unicode contains *a lot* of characters for specialized regional, business, or 
 domain specific situations.

 
 My question is, given that unicode can’t realistically (and doesn’t aim to) 
 encode every possible symbol in the world, why shouldn’t 

Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Chris

 On 3 Jun 2015, at 11:22 am, Martin J. Dürst due...@it.aoyama.ac.jp wrote:
 
 On 2015/05/29 11:37, John wrote:
 
 If I had a large document that reused a particular character thousands of 
 times,
 
 Then it would be either a very boring document (containing almost only that 
 same character) or it would be a very large document.


If you have a daughter, look at her Facebook messenger, and then get back to me.


 would this HTML markup require embedding that character thousands of times, 
 or could I define the character once at the beginning of the sequence, and 
 then refer back to it in a space efficient way?
 
 If you want space efficiency, the best thing to do is to use generic 
 compression. Many generic compression methods are available, many of them are 
 widely supported, and all of them will be dealing with your case in a very 
 efficient way


You can’t ask the entire computing universe to compress everything all the 
time. And that is what your comment amounts to. Because the whole point under 
discussion is how can we encode stuff such that you can hope to universally 
move it around between different documents, formats, applications, input fields 
and platforms without any massage.


 Given that it's been agreed that private use ranges are a good thing,
 
 That's not agreed upon. I'd say that the general agreement is that the 
 private ranges are of limited usefulness for some very limited use cases 
 (such as designing encodings for new scripts).


They are of limited usefulness precisely because it is pathologically hard to 
make use of them in their current state of technological evolution. If they 
were easy to make use of, people would be using them all the time. I’d bet good 
money that if you surveyed a lot of applications where custom characters are 
being used, they are not using private use ranges. Now why would that be?


 and given that we can agree that exchanging data is a good thing,
 
 Yes, but there are many other ways to do that besides Unicode. And for many 
 purposes, these other ways are better suited.

The point is a universally recognised way. Of course you, me or anybody could 
design many good ways to solve any problem we might come up with. That doesn’t 
mean it will interoperate with anybody else though.

 
 maybe something should bring those two things together. Just a thought.
 
 Just a 'non sequitur'.
 
 Regards,   Martin.




Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Martin J. Dürst

On 2015/05/29 11:37, John wrote:


If I had a large document that reused a particular character thousands of times,


Then it would be either a very boring document (containing almost only 
that same character) or it would be a very large document.



would this HTML markup require embedding that character thousands of times, or 
could I define the character once at the beginning of the sequence, and then 
refer back to it in a space efficient way?


If you want space efficiency, the best thing to do is to use generic 
compression. Many generic compression methods are available, many of 
them are widely supported, and all of them will be dealing with your 
case in a very efficient way.
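
A minimal sketch of the generic-compression point, using Python's standard zlib module. The sample text and the choice of U+E000 as the repeated custom character are assumptions made only for illustration; nothing here is specific to any format discussed in this thread.

```python
import zlib

# Hypothetical document: the same private-use character (U+E000)
# reused thousands of times among ordinary text.
text = "hello \ue000 " * 5000

raw = text.encode("utf-8")        # plain UTF-8 bytes
packed = zlib.compress(raw)       # generic DEFLATE compression

# The repeated sequence costs very little once compressed.
print(len(raw), len(packed))

# Compression is lossless: the original text comes back unchanged.
restored = zlib.decompress(packed).decode("utf-8")
```

The same holds for any widely supported general-purpose compressor; whether every application along the way can be expected to apply one is a separate question.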



Given that it's been agreed that private use ranges are a good thing,


That's not agreed upon. I'd say that the general agreement is that the 
private ranges are of limited usefulness for some very limited use cases 
(such as designing encodings for new scripts).



and given that we can agree that exchanging data is a good thing,


Yes, but there are many other ways to do that besides Unicode. And for 
many purposes, these other ways are better suited.



maybe something should bring those two things together. Just a thought.


Just a 'non sequitur'.

Regards,   Martin.


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Philippe Verdy
No, nothing about what you propose, which is to encode graphics directly
with a custom syntax using specific Unicode characters for this syntax
itself.
There's no such statement in the UTR, even for longer term.
What is proposed instead is a way to *reference* (not define) graphics.
For the rest, you need a rich-text format to embed graphics (using the
syntax of this rich-text format, such as HTML), but this syntax remains out
of scope of Unicode which will not standardize any graphic format, or any
language by its syntax.
Even for CLDR, you will use some JSON or XML rich-text format to create
references, or embed some small graphics. But CLDR is NOT part of the
Unicode Standard itself, and does not encode new characters (and I've not
seen the CLDR requesting additions in the UCS for its own use, instead it
uses its own assignments for PUAs where needed, and also for its own
private locale tags for internal references within the CLDR data itself).

2015-06-02 12:37 GMT+02:00 William_J_G Overington wjgo_10...@btinternet.com
:

 Responding to Philippe Verdy:

  Nothing has been published.

 It has been published. It is published in this thread for discussion prior
 to a possible submission to the Unicode Technical  Committee that could
 take place if people on this mailing list feel that it is a good solution
 to the problem raised in section 8 of the following document.

 http://www.unicode.org/reports/tr51/tr51-2.html

 Direct link to

 8 Longer Term Solutions

 http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term


 William Overington

 2 June 2015




Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Julian Bradfield
On 2015-06-02, William_J_G Overington wjgo_10...@btinternet.com wrote:
  take place if people on this mailing list feel that it is a good 
 solution to the problem raised in section 8 of the following document.
 http://www.unicode.org/reports/tr51/tr51-2.html

That section does not raise a problem. It says what the solution to
the emoji problem is: namely that people who want to embed graphics in
text should fix their protocols to allow it, instead of subverting
Unicode to do it.

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread William_J_G Overington
Responding to Philippe Verdy:
 Nothing has been published.
It has been published. It is published in this thread for discussion prior
to a possible submission to the Unicode Technical Committee that could
take place if people on this mailing list feel that it is a good
solution to the problem raised in section 8 of the following document.
http://www.unicode.org/reports/tr51/tr51-2.html
Direct link to
8 Longer Term Solutions
http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term
William Overington
2 June 2015


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Ken Whistler



On 6/2/2015 2:01 AM, William_J_G Overington wrote:
Local glyph memory, for use in compressing a document where the same 
glyph is used two or more times in the document:


Um, that technology already exists. It is called a font.




A mechanism to be able to use the method to define a glyph linked to a 
Unicode code point would be a useful facility to add for use in a 
situation where the glyph is for a regular Unicode character.


And that mechanism has also already been defined. It is called a cmap:

http://www.microsoft.com/typography/otspec/cmap.htm

--Ken





Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Philippe Verdy
2015-06-01 1:33 GMT+02:00 Chris idou...@gmail.com:


 Of course, anyone can invent a character set. The difficult bit is having
 a standard way of combining custom character sets. That’s why a standard
 would be useful.

 And while stuff like this can, to some extent, be recognised by magic
 numbers, and unique strings in headers, such things are unreliable. Just
 because example.net/mycharset/ appears near the start of a document,
 doesn’t necessarily mean it was meant to define a character set. Maybe it
 was a document discussing character sets.


That's not what I described. I spoke about using a MIME-compatible private
charset identifier, and how such a private identifier can be made
reasonably unique by binding it to a domain name or URI.

If you had read more carefully I also said that it was absolutely not
necessary to dereference that URL: there are many XML schemas binding their
namespaces to a URI which is itself not a webpage or to any downloadable
DTD or XML schema or XML stylesheet. Google and Microsoft are using this a
lot in lots of schemas (which are not described and documented at this URL
if they are documented).

The URI by itself is just an identifier, it becomes a webpage only when you
use it in a web page with an href attribute to create a hyperlink, or to
perform some query to a service returning some data. An identifier for a
private charset does not need to perform any request to be usable by
itself, we just have the identifier which is sufficient by itself. The URI
can also be only a base URI for a collection of resources (whose URLs start
with this base URI, with conventional extensions appended to get the
character properties, or a font; but the best way is to embed this data in
your document, in some header or footer, if your document using the private
charset is not part of a collection of docs using the same private charset)

In that case, you don't need a new UTF: UTF-8 remains usable and you can
map your private charset to standard PUAs (and/or to hacked characters)
according to the private charset needs. The charset indicated in your
document (by some meta header) should be sufficient to avoid collisions
with other private conventions, it will define the scope of your private
charset as the document itself, which will then be interchangeable (and
possibly mixable with other documents with some renumbering if there are
collisions of assignments between two distinct private charsets: in the
document header, add to the charset identifier the range of PUAs which is
used; then with two documents colliding on this range, you can reencode one
automatically by creating a compound charset with subranges of PUAs
remapped differently to other ranges).
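
The renumbering step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it looks at the BMP Private Use Area (U+E000..U+F8FF) alone, it treats any PUA code point used by both documents as a collision, and the function names pua_used and remap_colliding are hypothetical. A real scheme would also have to record the resulting mapping in the document header.

```python
# BMP Private Use Area: U+E000..U+F8FF.
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF


def pua_used(text):
    """Return the set of BMP PUA code points that occur in text."""
    return {ord(c) for c in text if PUA_FIRST <= ord(c) <= PUA_LAST}


def remap_colliding(doc_a, doc_b):
    """Re-encode doc_b so its PUA code points no longer collide with doc_a's.

    Returns the re-encoded doc_b and the {old: new} mapping that was applied.
    """
    used_a = pua_used(doc_a)
    used_b = pua_used(doc_b)
    taken = used_a | used_b
    # Free PUA code points, in ascending order, not used by either document.
    free = (cp for cp in range(PUA_FIRST, PUA_LAST + 1) if cp not in taken)
    mapping = {}
    for cp in sorted(used_a & used_b):
        mapping[cp] = next(free)  # raises StopIteration if the PUA is full
    return doc_b.translate(mapping), mapping


# Hypothetical documents from two private charsets that both use U+E000.
doc_a = "a\ue000b"
doc_b = "x\ue000y\ue001"
new_b, mapping = remap_colliding(doc_a, doc_b)
```

After the call, doc_a and new_b can be concatenated without one private character being mistaken for the other, provided the mapping travels with the merged document.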


Re: Tag characters and in-line graphics (from Tag characters)

2015-05-31 Thread Asmus Freytag (t)

On 5/31/2015 5:33 AM, Chris-as-John wrote:


Yes, Asmus good post. But I don’t really think HTML, even a subset, is 
really the right solution.


The longer I think about this, what would be needed would be something 
like an abstract format. A specification of the capabilities to be 
supported and the types of properties needed to support them in an 
extensible way. HTML and CSS would possibly become an implementation of 
such a specification.


There would still be a place for a character set, that is Unicode, as an 
efficient way to implement the most basic and most standard features of 
text contents, but perhaps some extension mechanism that can handle 
various extensions.


The first level of extension is support for recent (or rare) code points 
in the character set (additional fonts, etc, as you mention).


The next level of extension could be support for collections of custom 
entities that are not available as character sets (stickers and the like).


And finally, there would have to be a way to deal with one-offs, such 
as actual images that do not form categorizable sets, but are used in an 
ad-hoc manner and behave like custom characters.


And so on.

It should be possible to describe all of this in a way that allows it to 
be mapped to HTML and CSS or to any other rich text format -- the goal, 
after all is to make such inline text as widely and effortlessly 
interchangeable as plain text is today (or at least nearly so).


By keeping the specification abstract, you could accommodate both SGML 
like formats where ascii-string markup is intermixed with the text, as 
well as pure text buffers with place holder code points and links to 
external data.


But, however bored you are with plain Unicode emoji, as long as there 
isn't an agreed upon common format for rich inline text I see very 
little chance that those cute facebook emoji will do anything other than 
firmly keep you in that particular ghetto.


A./

I’m reminded of the design for XML itself, it is supposed to start 
with a header that defines what that XML will conform to. Those 
definitions contain some unique identifiers of that XML schema, which 
happens to be a URL. The URL is partly just a convenient unique 
identifier, but also, the XML engine, if it doesn’t know about that 
schema could go to that URL and download the schema, and check that 
the XML  conforms to that schema.


Similarly, imagine a text format that had a header with something like:
\uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345

Now all the characters following in the text will interpret characters 
that start with 12345 with respect to that character set. What would 
you find at facebook.com/charsets/pusheen-the-cat-emoji/? You might 
find bitmaps, truetype fonts, vector graphics, etc. You might find 
many many representations of that character set that your rendering 
engine could cache for future use. The text format wouldn’t be reliant 
on today’s favorite rendering technology, whether bitmap, truetype 
fonts, or whatever. Right now, if you go to a website that references 
unicode that your platform doesn’t know about, you see nothing. If a 
format like this existed, character sets would be infinitely 
extensible, everybody on earth could see characters, even if their 
platform wasn’t previously aware of them, and the format would be 
independent of today’s rendering technologies. Let’s face it, HTML5 
changes every few years, and I don’t think anybody wants the 
fundamental textual representation dependent on an entire layout 
engine. And also the whole range of what HTML5 can do, even some 
subset, is too much information. You don’t necessarily want your text 
to embed the actual character set. Perhaps that might be a useful 
option, but I think most people would want to uniquely identify the 
character set, in a way that an engine can download it, but without 
defining the actual details itself. Of course, certain charsets would 
probably become pervasive enough that platforms would just include 
them for convenience. Emojis by major messaging platforms. Maybe 
characters related to specialised domains like, I don’t know, mapping 
or specialised work domains or whatever, But without having to be 
subservient to the central unicode committee.


As someone who is a keen user of Facebook messenger, and who sees them 
bring out a new set of emoji almost every week, I think the world will 
soon be totally bored with the plain basic emoji that unicode has defined.



—
Chris


On Sun, May 31, 2015 at 9:06 PM, Asmus Freytag (t) 
asmus-...@ix.netcom.com wrote:


reading this discussion, I agree with your reductio ad absurdum
of infinitely nested HTML.

But I think you are onto something with your hypothetical example
of the subset that works in ALL textual situations.

There's clearly a use case for something like it, and I believe
many people would intuitively agree on a set of features 

Re: Tag characters and in-line graphics (from Tag characters)

2015-05-31 Thread Philippe Verdy
The abstract format already exists also for HTML, with the MIME charset
extension of the media-type text/plain (it can also be embedded in a meta
tag, where the HTML source file is just stored in a filesystem, so that a
webserver can parse it and provide the correct MIME header, if the
webserver has no repository for metadata and must infer the media type from
the file content itself with some guesser).

It also exists in various conventions for source code (recognized by
editors such as vi(m) or Emacs, or for Unix shells) using embedded magic
identifiers near the top of the file.

You can use it to send an identifier for a private charset without having
to request for a registration of the charset in the IANA database (which is
not intended for private encodings). The private charset can be named in a
unique way (consider using a private charset name based on a domain name
you own, such as x-www.example.net-mycharset-1 if you own the domain name
example.net). It will be enough for the initial experimentation for a few
years (or more, provided that you renew this domain name). Your charset can
contain various definitions: a mapping of your codepoints (including PUAs,
or standard codepoints, or hacked codepoints if you have no other
solution to get the correct character properties working with existing
algorithms such as case mappings, collation, layout behavior in text
renderers).

Such a solution would allow a more predictable management of PUAs (by
allowing to control their scope of use, by binding them, only in some magic
header of the document, to a private charset that remains reasonably
unique). For example x-example.net-mycharset-1 would map to a URL like
//www.example.net/mycharset/1/ containing some schema (it could be the base
address of an XML or JSON file, and of a web font containing the relevant
glyphs, and of a character properties database to override the default ones
from the standard). If you already know this private charset in your
application, you don't need to download any of these files: the URL is just
an identifier and your file can still be used in standalone mode, just like
you can parse many standard XML schemas by just recognizing the URLs
assigned to the XML namespaces, without even having to find a DTD or XML
schema definition from an external resource. If needed, your app can contain
a local repository in some cache folder where you can extend the number of
private charsets that can be recognized.
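
A rough sketch of the identifier-to-URL binding and the local repository mentioned above. The parsing rule is an assumption (the last two hyphen-separated fields are taken as charset name and version, the rest as the domain), and charset_base_url, lookup, KNOWN_CHARSETS and the sample character name are hypothetical, not part of any existing convention.

```python
def charset_base_url(identifier):
    """Turn 'x-example.net-mycharset-1' into '//www.example.net/mycharset/1/'.

    Assumption: the last two hyphen-separated fields are the charset name
    and version; everything between 'x-' and those fields is the domain.
    """
    if not identifier.startswith("x-"):
        raise ValueError("not a private charset identifier: " + identifier)
    domain, name, version = identifier[2:].rsplit("-", 2)
    if not domain.startswith("www."):
        domain = "www." + domain
    return "//{}/{}/{}/".format(domain, name, version)


# Hypothetical local repository of private charsets the application
# already knows, keyed by base URL. No network access is involved: the
# URL is used purely as an identifier.
KNOWN_CHARSETS = {
    "//www.example.net/mycharset/1/": {0xE000: "MY PRIVATE LETTER A"},
}


def lookup(identifier):
    """Return locally known definitions, or None if they are not cached."""
    return KNOWN_CHARSETS.get(charset_base_url(identifier))
```

Only when lookup returns None would an application have any reason to dereference the URL, and even then doing so stays optional.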



Full interoperability will still not be possible if you need to mix in the
same document texts encoded with different private charsets (there's always
a risk of collision) without a way to reencode some of them to a joined
charset without the collisions, by inferring a new private charset (it's not
impossible to do, after all this is done already with XML schemas that you
can mix together: you just need to rename the XML namespaces, keeping the
URLs to which they are bound, when there's a collision on the XML namespace
names, a situation that occurs sometimes because of versioning where some
features of a schema are not fully upward compatible).

Yes this complicates things a bit, but much less than when using documents
in which PUA assignments are not negotiated at all (even minimally to make
sure they are compatible when mixing sources), and for which there exists
for now absolutely no protocol defined for such negotiation (TUS says that
PUAs are usable and interchangeable under private mutual agreement but
still provides no schemes for supporting such mutual agreement, and for
this reason, PUAs are almost always rejected, and people want true
permanent assignments for characters that are very specific, badly
documented, or insufficiently known to have reliable permanent properties).

So let's think about securing the use of PUAs with some identification
scheme (for plain-text formats, it should just be allowed to negotiate a
single charset for the whole, using the magic header tricks that have long
been used by charset guessers, including for autodetecting UTF-8 encoded
files).

This would also solve the chicken-and-egg problem where we need more
sources to attest an effective usage before encoding new characters, but
developing this usage is extremely difficult (and much slower) in our
modern technologies where most documents are now handled digitally (in
the past it was possible to create a metal font and use it immediately to
start editing books, and there were many more people using handwriting and
drawings, so it was much less difficult to invent new characters, than it
is today, unless you're a big company that has enough resources to develop
this usage alone, such as Japanese telcos or Google, Yahoo, Samsung or
Microsoft introducing new sets of Emojis for their instant messaging
platform, with tons of developers working for them to develop a wide range
of services around it...)

However I'm not saying that Unicode should specify how such private charset
containing private 

Re: Tag characters and in-line graphics (from Tag characters)

2015-05-31 Thread Doug Ewell

David Starner wrote:


I would say that a system would conform with Unicode in having yellow
heart red (in a non-monochrome font) as well as if it made it a cross.
Either way it's violating character identity. I'd say that being
monochromatic is now like being monospaced; it's suboptimal for a
Unicode implementation, but hardly something Unicode can condemn as
nonconformant.


This seems fair and sensible. My main point was that being monochromatic 
(i.e. black) is conformant, and was an attempt to challenge the 
statement about character color sometimes being a recorded property. I 
don't see any Unicode character properties that identify color, only 
character names, which don't carry property information.


--
Doug Ewell | http://ewellic.org | Thornton, CO  



Re: Tag characters and in-line graphics (from Tag characters)

2015-05-31 Thread Chris

Of course, anyone can invent a character set. The difficult bit is having a 
standard way of combining custom character sets. That’s why a standard would be 
useful.

And while stuff like this can, to some extent, be recognised by magic numbers, 
and unique strings in headers, such things are unreliable. Just because 
example.net/mycharset/ appears near the start 
of a document, doesn’t necessarily mean it was meant to define a character set. 
Maybe it was a document discussing character sets.

And while it is tempting to allow the “container” to define the “header” 
information, whether the container be html defining something in its HEAD tag, 
or some proprietary format (MS-Word), or whatever, that doesn’t really solve 
anybody’s problem in a standard way. For a start, what if you want to copy text 
to the clipboard? You want the thing receiving it to be able to interpret it in 
a self-contained way.

The 2 obvious implementations for a standard seem to be:

1) A standard (optional) header. Perhaps if the string starts with a special 
character, then follows a header defining charsets first. These would allocate 
character ranges for custom characters, and point to where their renderings can 
be found. Standard programming libraries on all platforms would invisibly act 
appropriately on these headers. If you concatenated strings with conflicting 
namespaces, standard libraries would seamlessly reallocate one of the custom 
namespaces and merge the headers.

2) Make a new character set, let’s call it UTF-64. 32 bits would be allocated 
for custom character sets. Anybody could apply to a central authority to be 
allocated a custom id (32 bits=4 billion ids). A central location, kind of like 
a domain name system, would map that id to the URL where the canonical 
definition for that character set is.
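
Option 1 above can be sketched roughly as follows. This is only an illustration under assumptions: the `TaggedText` type, the header layout (charset URL mapped to a start code point and a size), and the choice to always move the second string's ranges are hypothetical, not part of any standard or existing library.

```python
from dataclasses import dataclass

# Basic Multilingual Plane Private Use Area, used here as the pool of custom code points.
PUA_START, PUA_END = 0xE000, 0xF8FF


@dataclass
class TaggedText:
    """Hypothetical string-with-header: charset URL -> (first code point, size)."""
    header: dict
    body: str


def concat(a: TaggedText, b: TaggedText) -> TaggedText:
    """Concatenate a and b, reallocating b's custom ranges so they cannot clash with a's."""
    header = dict(a.header)
    next_free = max((start + size for start, size in header.values()),
                    default=PUA_START)
    remap = {}  # translation table applied to b's body
    for url, (start, size) in b.header.items():
        if url in header:
            new_start = header[url][0]  # same charset: reuse a's allocation
        else:
            new_start = next_free
            if new_start + size - 1 > PUA_END:
                raise ValueError("out of private-use code points")
            header[url] = (new_start, size)
            next_free += size
        for i in range(size):
            remap[start + i] = new_start + i
    return TaggedText(header, a.body + b.body.translate(remap))
```

Both inputs may use U+E000 for different charsets; after `concat` the second charset's characters sit in a fresh range and the merged header records both allocations.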

The 2nd option has the advantage that the file format is fixed width like 
normal plain text documents. Concatenating custom character set strings is no 
issue. The canonical location for a character set isn’t forevermore mapped to a 
particular domain owner. Nothing about the meaning of the characters is defined 
in the actual bits other than the unique id. The disadvantage is it needs a 
central authority to maintain the list of ids, and map them to domains.
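
The fixed-width property of the second option can be illustrated with a small sketch. The byte layout here is an assumption made for the example (charset id in the high 32 bits, code in the low 32 bits, big-endian, id 0 meaning standard Unicode); the proposal above does not specify one.

```python
import struct


def encode_utf64(units):
    """Pack (charset_id, code) pairs as fixed-width 8-byte big-endian units."""
    return b"".join(struct.pack(">II", charset_id, code) for charset_id, code in units)


def decode_utf64(data):
    """Unpack 8-byte units back into (charset_id, code) pairs."""
    if len(data) % 8:
        raise ValueError("truncated unit")
    return [struct.unpack_from(">II", data, i) for i in range(0, len(data), 8)]
```

Because every unit carries its own charset id, concatenating two encoded strings is plain byte concatenation, with no header to merge.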



 On 1 Jun 2015, at 7:26 am, Philippe Verdy verd...@wanadoo.fr wrote:
 
 The abstract format already exists for HTML (with the MIME charset 
 extension of the media type text/plain; it can also be embedded in a meta 
 tag, where the HTML source file is just stored in a filesystem, so that a 
 webserver can parse it and provide the correct MIME header, if the webserver 
 has no repository for metadata and must infer the media type from the file 
 content itself with some guesser).
 
 It also exists in various conventions for source code (recognized by editors 
 such as vi(m) or Emacs, or for Unix shells) using embedded magic identifiers 
 near the top of the file.
 
 You can use it to send an identifier for a private charset without having to 
 request registration of the charset in the IANA database (which is not 
 intended for private encodings). The private charset can be named in a unique 
 way (consider using a private charset name based on a domain name you own, 
 such as x-www.example.net-mycharset-1 if you own the domain name 
 example.net). It will be enough for the initial experimentation 
 for a few years (or more, provided that you renew this domain name). Your 
 charset can contain various definitions: a mapping of your codepoints 
 (including PUAs, or standard codepoints, or hacked codepoints if you have 
 no other solution to get the correct character properties working with 
 existing algorithms such as case mappings, collation, layout behavior in text 
 renderers).
 
 Such a solution would allow a more predictable management of PUAs by allowing 
 control of their scope of use, by binding them, only in some magic header of 
 the document, to a private charset that remains reasonably unique. For 
 example, x-example.net-mycharset-1 would map to a URL like 
 //www.example.net/mycharset/1/ 
 containing some schema (it could be the base address of an XML or JSON file, 
 and of a web font containing the relevant glyphs, and of a character 
 properties database to override the default ones from the standard). If you 
 already know this private charset in your application, you don't need to 
 download any of these files: the URL is just an identifier and your file can 
 still be used in standalone mode, just like you can parse many standard XML 
 schemas by just recognizing the URLs assigned to the XML namespaces, without 
 even having to find a DTD or XML schema definition from an external resource. 
 If needed, your app can contain a local repository in some cache folder where 
 you can extend the number of private charsets that can be recognized.
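
The name-to-URL convention described in the quoted message can be sketched as below. The parsing rule is an assumption made for illustration (the domain is taken to be everything up to the first hyphen after "x-", so domains containing hyphens would not work), and `charset_url` and `resolve` are hypothetical helper names.

```python
def charset_url(name: str) -> str:
    """Map a private charset name such as x-example.net-mycharset-1 to its identifying URL."""
    if not name.startswith("x-"):
        raise ValueError("not a private charset name")
    domain, sep, rest = name[2:].partition("-")
    if not sep or not domain or not rest:
        raise ValueError("expected x-<domain>-<path parts>")
    return f"//www.{domain}/{rest.replace('-', '/')}/"


def resolve(name: str, known: dict):
    """Return (definition, url); definition is None when the charset is not known locally."""
    url = charset_url(name)
    return known.get(url), url
```

When the URL is already in the local table nothing has to be downloaded, which is the "URL is just an identifier" case from the message.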
 
 

Re: Tag characters and in-line graphics (from Tag characters)

2015-05-31 Thread Asmus Freytag (t)

John,

reading this discussion, I agree with your reductio ad absurdum of 
infinitely nested HTML.


But I think you are onto something with your hypothetical example of the 
subset that works in ALL textual situations.


There's clearly a use case for something like it, and I believe many 
people would intuitively agree on a set of features for it.


What people seem to have in mind is something like inline text. 
Something beyond a mere stream of plain text (with effectively every 
character rendered visibly), but still limited in important ways by 
general behavior of inline text: a string of it, laid out, must wrap and 
line break, any objects included in it must behave like characters 
(albeit of custom width, height and appearance), and so on. Paragraph 
formatting, stacked layout, header levels and all those good things 
would not be available.


With such a subset clearly defined, many quirky limitations might no 
longer be necessary; any container that today only takes plain text 
could be upgraded to take inline text. I can see some inline 
containers retaining a nesting limitation, but I could imagine that it 
is possible to arrive at a consistent definition of such inline format.


Going further, I can't shake the impression that without a clean 
definition of an inline text format along those lines, any attempts at 
making stickers and similar solutions stick are doomed to failure.


The interesting thing in defining such a format is not how to represent 
it in HTML or CSS syntax, but in describing what feature sets it must 
(minimally) support. Doing it that way would free existing 
implementations of rich text to map native formats onto that minimally 
required subset and to add them to their format translators for HTML or 
whatever else they use for interchange.


Only with a definition can you ever hope to develop a processing model. 
It won't be as simple as for plain text strings, but it should be able 
to support common abstractions (like iteration by logical unit). It 
would have to support the management of external resources - if the 
inline format allows images, custom fonts, etc. one would need a way to 
manage references to them in the local context.
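
One possible shape for such a processing model, as a rough sketch only: an inline run is a list of plain strings and character-like objects, and iteration yields one logical unit at a time. The `InlineObject` type and its fields are hypothetical, and iterating strings per code point is a simplification (a real model would more likely use grapheme clusters).

```python
from dataclasses import dataclass
from typing import Iterator, List, Union


@dataclass(frozen=True)
class InlineObject:
    """A character-like object: it takes part in line layout but has a custom appearance."""
    resource: str   # key into a local table of external resources (image, font, ...)
    width: float
    height: float


Unit = Union[str, InlineObject]


def logical_units(run: List[Unit]) -> Iterator[Unit]:
    """Yield every code point and every inline object as one unit, in order."""
    for item in run:
        if isinstance(item, str):
            yield from item
        else:
            yield item
```

A caller can then count, index, or wrap units without caring whether a given unit is an ordinary character or a custom object.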


If your skeptical position proves correct in that this is something that 
turns out to not be tractable, then I think you've provided conclusive 
proof why stickers won't happen and why encoding emoji was the only 
sensible decision Unicode could have taken.


A./

On 5/30/2015 7:14 AM, John wrote:


Hmm, these once entities of which you speak, do they require 
javascript? Because I'm not sure what we are looking for here is 
static documents requiring a full programming language.


But let's say for a moment that html5 can, or could do the job here. 
Then to make the dream come true that you could just cut and paste 
text that happened to contain a custom character to somewhere else, 
and nothing untoward would happen, would mean that everything in the 
computing universe should allow full blown html. So every Java Swing 
component, every Apple gui component, every .NET component, every 
windows component, every browser, every Android and IOS component 
would allow text entry of HTML entities. OK, so let's say everyone 
agrees with this course of action, now the universal text format is HTML.


But in this new world where anywhere that previously you could input 
text, you can now input full blown html, does that actually make 
sense? Does it make sense that you can for example, put full blown 
HTML inside a H1 tag in html itself? That's a lot of recursion going 
on there. Or in a MS-Excel cell? Or interspersed in some otherwise 
fairly regular text in a Word document?


I suppose someone could define a strict limited subset of HTML to be 
that subset that makes sense in ALL textual situations. That subset 
would be something like just defining things that act like characters, 
and not like a full blown rendering engine. But who would define that 
subset? Not the HTML groups, because their mandate is to define full 
blown rendering engines. It would be more likely to be something like 
the unicode group.


And also, in this brave new world where HTML5 is the new standard text 
format, what would the binary format of it be? I mean, if I have the 
string of unicode characters IMG would that be HTML5 image definition 
that should be rendered as such? Or would it be text that happens to 
contain greater than symbol, I, M and G? It would have to be the 
former I guess, and thereby there would no longer be a unicode symbol 
for the mathematical greater than symbol. Rather there would be a 
unicode symbol for opening a HTML tag, and the text code for greater 
than would be &gt;. Never again would a computer store > to mean 
greater than. Do we want HTML to be so pervasive? Not sure it deserves 
that.


And from a programmers point of view, he wants to be able to iterate 
over an array of characters and treat each one the same way, 

Re: Tag characters and in-line graphics (from Tag characters)

2015-05-31 Thread John
Yes, Asmus good post. But I don’t really think HTML, even a subset, is really 
the right solution. I’m reminded of the design for XML itself, it is supposed 
to start with a header that defines what that XML will conform to. Those 
definitions contain some unique identifiers of that XML schema, which happens 
to be a URL. The URL is partly just a convenient unique identifier, but also, 
the XML engine, if it doesn’t know about that schema could go to that URL and 
download the schema, and check that the XML  conforms to that schema.




Similarly, imagine a text format that had a header with something like:

\uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345




Now any characters in the following text that start with 12345 would be 
interpreted with respect to that character set. What would you find at 
facebook.com/charsets/pusheen-the-cat-emoji/? You might find bitmaps, truetype 
fonts, vector graphics, etc. You might find many many representations of that 
character set that your rendering engine could cache for future use. The text 
format wouldn’t be reliant on today’s favorite rendering technology, whether 
bitmap, truetype fonts, or whatever. Right now, if you go to a website that 
references unicode that your platform doesn’t know about, you see nothing. If a 
format like this existed, character sets would be infinitely extensible, 
everybody on earth could see characters, even if their platform wasn’t 
previously aware of them, and the format would be independent of today’s 
rendering technologies. Let’s face it, HTML5 changes every few years, and I 
don’t think anybody wants the fundamental textual representation dependent on 
an entire layout engine. And also the whole range of what HTML5 can do, even 
some subset, is too much information. You don’t necessarily want your text to 
embed the actual character set. Perhaps that might be a useful option, but I 
think most people would want to uniquely identify the character set, in a way 
that an engine can download it, but without defining the actual details itself. 
Of course, certain charsets would probably become pervasive enough that 
platforms would just include them for convenience. Emojis by major messaging 
platforms. Maybe characters related to specialised domains like, I don’t know, 
mapping or specialised work domains or whatever. But without having to be 
subservient to the central unicode committee.
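
A minimal sketch of reading such a header, assuming the syntax is exactly the example line given above (the literal text \uCHARSET:, a location, a comma, then a numeric prefix). Treating the prefix as an integer is an assumption, and `parse_charset_header` is a hypothetical name.

```python
HEADER_PREFIX = "\\uCHARSET:"  # a literal backslash followed by "uCHARSET:"


def parse_charset_header(line: str):
    """Return (location, prefix) for a charset header line, or None for ordinary text."""
    if not line.startswith(HEADER_PREFIX):
        return None
    location, sep, prefix = line[len(HEADER_PREFIX):].rpartition(",")
    if not sep or not location or not prefix.isdigit():
        raise ValueError("malformed charset header")
    return location, int(prefix)
```

The location is only an identifier at this point; whether a renderer fetches it, finds it in a cache, or already ships the character set is a separate decision.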




As someone who is a keen user of Facebook messenger, and who sees them bring 
out a new set of emoji almost every week, I think the world will soon be 
totally bored with the plain basic emoji that unicode has defined.





—
Chris

On Sun, May 31, 2015 at 9:06 PM, Asmus Freytag (t)
asmus-...@ix.netcom.com wrote:

 John,
 reading this discussion, I agree with your reductio ad absurdum of 
 infinitely nested HTML.
 But I think you are onto something with your hypothetical example of the 
 subset that works in ALL textual situations.
 There's clearly a use case for something like it, and I believe many 
 people would intuitively agree on a set of features for it.
 What people seem to have in mind is something like inline text. 
 Something beyond a mere stream of plain text (with effectively every 
 character rendered visibly), but still limited in important ways by 
 general behavior of inline text: a string of it, laid out, must wrap and 
 line break, any objects included in it must behave like characters 
 (albeit of custom width, height and appearance), and so on. Paragraph 
 formatting, stacked layout, header levels and all those good things 
 would not be available.
 With such a subset clearly defined, many quirky limitations might no 
 longer be necessary; any container that today only takes plain text 
 could be upgraded to take inline text. I can see some inline 
 containers retaining a nesting limitation, but I could imagine that it 
 is possible to arrive at a consistent definition of such inline format.
 Going further, I can't shake the impression that without a clean 
 definition of an inline text format along those lines, any attempts at 
 making stickers and similar solutions stick are doomed to failure.
 The interesting thing in defining such a format is not how to represent 
 it in HTML or CSS syntax, but in describing what feature sets it must 
 (minimally) support. Doing it that way would free existing 
 implementations of rich text to map native formats onto that minimally 
 required subset and to add them to their format translators for HTML or 
 whatever else they use for interchange.
 Only with a definition can you ever hope to develop a processing model. 
 It won't be as simple as for plain text strings, but it should be able 
 to support common abstractions (like iteration by logical unit). It 
 would have to support the management of external resources - if the 
 inline format allows images, custom fonts, etc. one would need a way to 
 manage references to them in the local context.
 If 

Re: Tag characters and in-line graphics (from Tag characters)

2015-05-30 Thread Philippe Verdy
2015-05-30 10:47 GMT+02:00 William_J_G Overington wjgo_10...@btinternet.com
:

 Responding to Doug Ewell:

  I think this cuts to the heart of what people have been trying to say
 all along.

  Historically, Unicode was not meant to be the means by which brand new
 ideas are run up the proverbial flagpole to see if they will gain traction.

 History is interesting and can be a good guide, yet many things that are
 an accepted part of Unicode today started as new ideas that gained traction
 and became implemented. So history should not be allowed to be a reason to
 restrict progress.

 For example, there was the extension from 1 plane to 17 planes.


Actually this was a restriction of the UCS to *only* 17 planes. Before that
the UCS contained 31-bit code points, i.e. 32768 planes!
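
The arithmetic behind those figures, as a quick check (a plane is 65,536 code points; the variable names are just for this illustration):

```python
PLANE_SIZE = 1 << 16                         # 65,536 code points per plane

planes_31_bit = (1 << 31) // PLANE_SIZE      # planes in a 31-bit code space
code_points_17_planes = 17 * PLANE_SIZE      # size of the restricted code space
max_code_point = code_points_17_planes - 1   # highest code point in 17 planes

print(planes_31_bit, hex(code_points_17_planes), hex(max_code_point))
```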

If you're speaking about the old Unicode 1.0, it was then still not the UCS
and it was then incompatible with the UCS in many important parts, and the
initial target of Unicode was only to have an industry standard
immediately usable between a few software providers (Unicode 1.0 was then
not an international standard, forget it!).


Re: Tag characters and in-line graphics (from Tag characters)

2015-05-30 Thread Doug Ewell
Note: Everything below is my personal opinion and does not represent any
official Unicode Consortium or UTC position.

William_J_G Overington wjgo underscore 10009 at btinternet dot com
wrote:

 Historically, Unicode was not meant to be the means by which brand
 new ideas are run up the proverbial flagpole to see if they will gain
 traction.

 History is interesting and can be a good guide, yet many things that
 are an accepted part of Unicode today started as new ideas that gained
 traction and became implemented. So history should not be allowed to
 be a reason to restrict progress.

I used "historically" to distinguish between the pre- and post-Emoji
Revolution eras. There have clearly been changes recently, but there is
still at least a minimal expectation that proposed characters will
fulfill a demonstrated need.

I'm not seeing any truly novel, untested ideas in the list below that
Unicode implemented purely on speculation.

 For example, there was the extension from 1 plane to 17 planes.

That was an architectural extension, brought about by the realization
that 64K code points wasn't enough for even the original scope. There's
no comparison.

 There was the introduction of emoji support.

Emoji proponents would argue that emoji support began in 1.0 with the
inclusion of various dingbats. But even emoji are arguably characters
in some sense. They aren't a mini-language used to define images pixel
by pixel.

 There was the introduction of the policy of colour sometimes being a
 recorded property rather than having just the original monochrome
 recording policy.

There isn't any such policy. There is a variation selector to suggest
that the rendering engine show certain characters in emoji style
instead of text style, and there are characters with colors in their
names, but there is no policy that specific colors are recorded as
part of the encoding. YELLOW HEART could conformantly appear in any
color.
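
This can be checked against the character database that ships with Python's standard `unicodedata` module. The sketch shows that the colour appears only in the character name, and builds the two variation sequences for U+2764 HEAVY BLACK HEART as an example of requesting text or emoji style; the variable names are just for this illustration.

```python
import unicodedata

heart = "\U0001F49B"
name = unicodedata.name(heart)          # the colour word lives only in the name
category = unicodedata.category(heart)  # general category, which says nothing about colour

# Variation selectors only request a presentation style for the preceding character.
text_style = "\u2764\uFE0E"    # HEAVY BLACK HEART + VARIATION SELECTOR-15 (text style)
emoji_style = "\u2764\uFE0F"   # HEAVY BLACK HEART + VARIATION SELECTOR-16 (emoji style)

print(name, category, [unicodedata.name(c) for c in emoji_style])
```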

 There has been the change of encoding policy that facilitated the
 introduction of the Indian Rupee character into Unicode and ISO/IEC
 10646 far more quickly than had been thought possible, so that the
 encoding was ready for use when needed.

That's not a change to what types of things get encoded. It's a
procedural change, one which I would agree has been applied with
increasing creativity.

 There has been the recent encoding policy change regarding encoding of
 pure electronic use items taking place without (extensive prior use
 using a Private Use Area encoding), such as the encoding of the
 UNICORN FACE.

This is probably your best analogy. People like Asmus have addressed it,
saying it's not reasonable to expect users to adopt PUA solutions and
wait for them to catch on.

 There is the recent change to the deprecation status of most of the
 tag characters and the acceptance of the base character followed by
 tag characters technique so as to allow the specifying of a larger
 collection of particular flags.

There must have been a great wailing and gnashing of teeth over that
decision. So many statements were made over the years about the basic
evilness of tag characters.

But the concept of representing flags was already agreed upon as a
compatibility measure, and the Regional Indicator Symbols solution was
a compromise that allowed expansion beyond the 10 flags that Japanese
telcos chose to include. RIS were an architectural decision. The tag
solution (to be fully outlined in a future PRI) was another
architectural decision. Neither (I believe) is analogous to a scope
decision to start encoding different types of non-character things as if
they were characters, and as I have said before, assigning a glyph to a
thing that isn't a character doesn't make it one.
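
For readers unfamiliar with the two mechanisms, here is a small sketch of the code point arithmetic involved. The Regional Indicator Symbols start at U+1F1E6 for the letter A, and the tag characters mirror printable ASCII at an offset of U+E0000. How a complete tag sequence for a flag is assembled is deliberately not shown, since the message above says that was still to be outlined; `flag` and `to_tag_characters` are hypothetical helper names.

```python
def flag(region: str) -> str:
    """Build the Regional Indicator Symbol pair for a two-letter region code."""
    region = region.upper()
    if len(region) != 2 or not all("A" <= c <= "Z" for c in region):
        raise ValueError("expected two ASCII letters")
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in region)


def to_tag_characters(text: str) -> str:
    """Map printable ASCII to the corresponding tag characters (U+E0020..U+E007E)."""
    if not all(0x20 <= ord(c) <= 0x7E for c in text):
        raise ValueError("only printable ASCII has tag equivalents")
    return "".join(chr(0xE0000 + ord(c)) for c in text)
```

Neither function defines anything new: both only pick existing code points, which is the sense in which these were architectural rather than scope decisions.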

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Tag characters and in-line graphics (from Tag characters)

2015-05-30 Thread David Starner
I would say that a system would conform with Unicode in having yellow heart
red (in a non-monochrome font) as well as if it made it a cross. Either way
it's violating character identity. I'd say that being monochromatic is now
like being monospaced; it's suboptimal for a Unicode implementation, but
hardly something Unicode can condemn as nonconformant.

On 4:25pm, Sat, May 30, 2015 Doug Ewell d...@ewellic.org wrote:

 Note: Everything below is my personal opinion and does not represent any
 official Unicode Consortium or UTC position.

 William_J_G Overington wjgo underscore 10009 at btinternet dot com
 wrote:

  Historically, Unicode was not meant to be the means by which brand
  new ideas are run up the proverbial flagpole to see if they will gain
  traction.
 
  History is interesting and can be a good guide, yet many things that
  are an accepted part of Unicode today started as new ideas that gained
  traction and became implemented. So history should not be allowed to
  be a reason to restrict progress.

 I used "historically" to distinguish between the pre- and post-Emoji
 Revolution eras. There have clearly been changes recently, but there is
 still at least a minimal expectation that proposed characters will
 fulfill a demonstrated need.

 I'm not seeing any truly novel, untested ideas in the list below that
 Unicode implemented purely on speculation.

  For example, there was the extension from 1 plane to 17 planes.

 That was an architectural extension, brought about by the realization
 that 64K code points wasn't enough for even the original scope. There's
 no comparison.

  There was the introduction of emoji support.

 Emoji proponents would argue that emoji support began in 1.0 with the
 inclusion of various dingbats. But even emoji are arguably characters
 in some sense. They aren't a mini-language used to define images pixel
 by pixel.

  There was the introduction of the policy of colour sometimes being a
  recorded property rather than having just the original monochrome
  recording policy.

 There isn't any such policy. There is a variation selector to suggest
 that the rendering engine show certain characters in emoji style
 instead of text style, and there are characters with colors in their
 names, but there is no policy that specific colors are recorded as
 part of the encoding. YELLOW HEART could conformantly appear in any
 color.

  There has been the change of encoding policy that facilitated the
  introduction of the Indian Rupee character into Unicode and ISO/IEC
  10646 far more quickly than had been thought possible, so that the
  encoding was ready for use when needed.

 That's not a change to what types of things get encoded. It's a
 procedural change, one which I would agree has been applied with
 increasing creativity.

  There has been the recent encoding policy change regarding encoding of
  pure electronic use items taking place without (extensive prior use
  using a Private Use Area encoding), such as the encoding of the
  UNICORN FACE.

 This is probably your best analogy. People like Asmus have addressed it,
 saying it's not reasonable to expect users to adopt PUA solutions and
 wait for them to catch on.

  There is the recent change to the deprecation status of most of the
  tag characters and the acceptance of the base character followed by
  tag characters technique so as to allow the specifying of a larger
  collection of particular flags.

 There must have been a great wailing and gnashing of teeth over that
 decision. So many statements were made over the years about the basic
 evilness of tag characters.

 But the concept of representing flags was already agreed upon as a
 compatibility measure, and the Regional Indicator Symbols solution was
 a compromise that allowed expansion beyond the 10 flags that Japanese
 telcos chose to include. RIS were an architectural decision. The tag
 solution (to be fully outlined in a future PRI) was another
 architectural decision. Neither (I believe) is analogous to a scope
 decision to start encoding different types of non-character things as if
 they were characters, and as I have said before, assigning a glyph to a
 thing that isn't a character doesn't make it one.

 --
 Doug Ewell | http://ewellic.org | Thornton, CO 





Re: Tag characters and in-line graphics (from Tag characters)

2015-05-30 Thread William_J_G Overington
Responding to Leo Broukhis:

 A more common occurrence is the need to include a non-standard character in a 
 text message, be it a ski piste symbol or an obscure CJK ideogram. Have you 
 thought of  embedding TrueType in Unicode? 

Not congruently so, yet, in effect, yes, as I have considered including 
individual OpenType-compatible glyphs in a base character followed by tag 
characters format. OpenType is a development from TrueType that can achieve 
more than can TrueType on its own.

There is a little about this in the last two paragraphs of the following post.

http://www.unicode.org/mail-arch/unicode-ml/y2015-m05/0218.html

There would need to be a few additions to make it work effectively: for 
example, a value for each of advance width, ascent maximum, descent maximum and 
fontunits per em.
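
As a rough illustration of why those four values matter, the sketch below scales a glyph's metrics from font units to pixels. The `GlyphMetrics` type, its field names, and the sample numbers are assumptions for the example, not part of the suggested format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlyphMetrics:
    """Per-glyph values a renderer would need, all expressed in font units."""
    advance_width: int
    ascent_max: int
    descent_max: int
    units_per_em: int

    def scaled(self, em_size_px: float):
        """Return (advance, ascent, descent) in pixels for the given em size."""
        k = em_size_px / self.units_per_em
        return (self.advance_width * k, self.ascent_max * k, self.descent_max * k)
```

Without the font units per em value, the other three numbers cannot be turned into a size on screen, so the glyph could not be placed in a line of text.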

William Overington

30 May 2015








Re: Tag characters and in-line graphics (from Tag characters)

2015-05-30 Thread William_J_G Overington
Responding to Doug Ewell:

 I think this cuts to the heart of what people have been trying to say all 
 along.

 Historically, Unicode was not meant to be the means by which brand new ideas 
 are run up the proverbial flagpole to see if they will gain traction.

History is interesting and can be a good guide, yet many things that are an 
accepted part of Unicode today started as new ideas that gained traction and 
became implemented. So history should not be allowed to be a reason to restrict 
progress.

For example, there was the extension from 1 plane to 17 planes.

There was the introduction of emoji support.

There was the introduction of the policy of colour sometimes being a recorded 
property rather than having just the original monochrome recording policy.

There has been the change of encoding policy that facilitated the introduction 
of the Indian Rupee character into Unicode and ISO/IEC 10646 far more quickly 
than had been thought possible, so that the encoding was ready for use when 
needed.

There has been the recent encoding policy change regarding encoding of pure 
electronic use items taking place without (extensive prior use using a Private 
Use Area encoding), such as the encoding of the UNICORN FACE.

There is the recent change to the deprecation status of most of the tag 
characters and the acceptance of the base character followed by tag characters 
technique so as to allow the specifying of a larger collection of particular 
flags.



The two questions that I asked in my response to a post by Mark E. Shoulson are 
relevant here.

Suppose that a plain text file is to include just one non-standard emoji 
graphic. How would that be done otherwise than by the format that I am 
suggesting?

What if there were three such non-standard emoji graphics needed in the plain 
text file, the second graphic being used twice. How would that be done 
otherwise than by the format that I am suggesting?

William Overington

30 May 2015





Re: Tag characters and in-line graphics (from Tag characters)

2015-05-30 Thread John
Hmm, these once entities of which you speak, do they require javascript? 
Because I'm not sure what we are looking for here is static documents requiring 
a full programming language.




But let's say for a moment that html5 can, or could do the job here. Then to 
make the dream come true that you could just cut and paste text that happened 
to contain a custom character to somewhere else, and nothing untoward would 
happen, would mean that everything in the computing universe should allow full 
blown html. So every Java Swing component, every Apple gui component, every 
.NET component, every windows component, every browser, every Android and IOS 
component would allow text entry of HTML entities. OK, so let's say everyone 
agrees with this course of action, now the universal text format is HTML.




But in this new world where anywhere that previously you could input text, you 
can now input full blown html, does that actually make sense? Does it make 
sense that you can for example, put full blown HTML inside a H1 tag in html 
itself? That's a lot of recursion going on there. Or in a MS-Excel cell? Or 
interspersed in some otherwise fairly regular text in a Word document?




I suppose someone could define a strict limited subset of HTML to be that 
subset that makes sense in ALL textual situations. That subset would be 
something like just defining things that act like characters, and not like a 
full blown rendering engine. But who would define that subset? Not the HTML 
groups, because their mandate is to define full blown rendering engines. It 
would be more likely to be something like the unicode group.




And also, in this brave new world where HTML5 is the new standard text format, 
what would the binary format of it be? I mean, if I have the string of unicode 
characters IMG would that be HTML5 image definition that should be rendered as 
such? Or would it be text that happens to contain greater than symbol, I, M and 
G? It would have to be the former I guess, and thereby there would no longer be 
a unicode symbol for the mathematical greater than symbol. Rather there would 
be a unicode symbol for opening a HTML tag, and the text code for greater than 
would be &gt;. Never again would a computer store > to mean greater than. Do we 
want HTML to be so pervasive? Not sure it deserves that.




And from a programmers point of view, he wants to be able to iterate over an 
array of characters and treat each one the same way, regardless if it is a 
custom character or not. Without that kind of programmatic abstraction, the 
whole thing can never gain traction. I don't think fully blown HTML embedded in 
your text can fulfill that. A very strictly defined subset, possibly could. 
Sure HTML5 can RENDER stuff adequately, if the only aim of the game is to provide a 
correct rendering. But to be able to actually treat particular images embedded 
as characters, and have some programming library see that abstraction 
consistently, I'm not sure I'm convinced that is possible. Not without nailing 
down exactly what html elements in what particular circumstances constitute a 
character.




I guess in summary, yes we have the technology already to render anything. But 
I don't think the whole standards framework does anything to allow the 
computing universe to actually exchange custom characters as if they were just 
any other text. Someone would actually have to  work on a standard to do that, 
not just point to html5.








On Saturday, 30 May 2015 at 5:08 am, Philippe Verdy verd...@wanadoo.fr, wrote:


2015-05-29 4:37 GMT+02:00 John idou...@gmail.com:

“Today the world goes very well with HTML(5), which is now the best markup 
language for documents (including for inserting embedded images that don’t 
require any external request)”

If I had a large document that reused a particular character thousands of 
times, would this HTML markup require embedding that character thousands of 
times, or could I define the character once at the beginning of the sequence, 
and then refer back to it in a space efficient way?





HTML(5) allows defining *once* entities for images that can then be reused 
thousands of times without repeating their definition. You can do this as well 
with CSS styles, just define a class for a small element. This element may 
still be an image, but the semantic is carried by the class you assign to it. 
You are not required to provide an external source URL for that image if the 
CSS style provides the content.




You may also use PUAs for the same purpose (however, I have not seen how CSS 
would allow styling individual characters in text elements, as these characters 
are not elements and there's no defined selector for pseudo-elements matching a 
single character). PUAs are perfectly usable in the situation where you have 
embedded a custom font in your document for assigning glyphs to characters (you 
can still do that, but I would avoid TrueType/OpenType for this purpose and 
would use the SVG 

Re: Tag characters and in-line graphics (from Tag characters)

2015-05-29 Thread William_J_G Overington
Responding to Mark E. Shoulson:


 As was pointed out to me, essentially what you are saying is you reject my 
 premise that one size does not fit all.


Well, I do not know where that came from, but no, I do not reject that premise. 
There is plain text, there is HTML, there is XML.


HTML is good for web pages.


Plain text is, amongst other applications, good for text messages.


The format that I am suggesting would allow the image for a non-standard emoji 
character to be included in a text message, with the image located at the 
correct place in the text.


I have not purported that it become the only format for transmitting images.


 You would prefer *everything* be in plain text, so you wouldn't have to use 
 other formats for it. You're essentially converting plain text into THE 
 format for everything. 


No. Use the best format for the task that is being carried out. I am 
enthusiastic that as much as possible can be done in open source formats rather 
than an end user of computing equipment needing to rely on expensive proprietary 
software packages with proprietary file formats that cannot be accessed without 
expensive software.


  If you really believe one size should fit all in this way, ...


But I don't.


Just because I opine that plain text is best for some applications and I have 
suggested a format that would allow a graphic to be included directly in a 
plain text file does not mean that I opine that everything should be plain text.


For example, I use HTML files, gif files, png files, pdf files, wav files, TTF 
files as appropriate.


http://www.users.globalnet.co.uk/~ngo/library.htm


http://www.users.globalnet.co.uk/~ngo/spec0001.htm


http://www.users.globalnet.co.uk/~ngo/song1018.htm


http://www.users.globalnet.co.uk/~ngo/song1021.htm


I have embedded a wav file in a pdf and published the result on the web.


http://www.users.globalnet.co.uk/~ngo/the_mobile_art_shop.pdf


Suppose that a plain text file is to include just one non-standard emoji 
graphic. How would that be done otherwise than by the format that I am 
suggesting?


What if there were three such non-standard emoji graphics needed in the plain 
text file, the second graphic being used twice. How would that be done 
otherwise than by the format that I am suggesting?


William Overington


29 May 2015






Re: Tag characters and in-line graphics (from Tag characters)

2015-05-29 Thread William_J_G Overington
Responding to Philippe Verdy:

 There's no advantage because what you want to create is effectively another
 markup language with its own syntax (but requiring new obscure characters
 that most applications and users will not be able to interpret and render
 correctly in the way intended by you, ...

Well, if the format became accepted as part of Unicode then appropriate
applications could well be produced that would interpret the format and display
an image in the desired place.

  ... and with still many things you have forgotten about the specific needs
  for images (e.g. colorimetry profiles, aspect ratio of pixels with bitmaps,
  undesired effects that must be controlled such as moiré artefacts).

The format is at present just a basic suggestion. Rather than just stating what
you consider I have forgotten and dismissing the format, how about joining in
the progress, specifying what you consider needs adding to the format, and
perhaps suggesting how to add that functionality in the style that the format
uses?

 You don't need new characters to create a markup language and its syntax.
 Today the world goes very well with HTML(5), which is now the best markup
 language for documents (including for inserting embedded images that don't
 require any external request, or embedding special effects on images, such as
 animation or dynamic layouts for adapting the document to the rendering
 device, with the help of CSS and JavaScript, which are also embeddable).

The two questions that I asked in my response to a post by Mark E. Shoulson are
relevant here.

Suppose that a plain text file is to include just one non-standard emoji
graphic. How would that be done otherwise than by the format that I am
suggesting?

What if there were three such non-standard emoji graphics needed in the plain
text file, the second graphic being used twice? How would that be done
otherwise than by the format that I am suggesting?

 At least with HTML5 they don't try to reinvent the image formats and there's
 ample space for supporting multiple image formats tuned for specific needs
 (e.g. JPEG, PNG, GIF, SVG, TIFF...) including animation and video, and
 synchronization of images and audio in time for videos, or with user
 interactions. They are designed separately and benefit from patient
 research carried out over many years (your desired format, still undocumented,
 is far below the level needed for images, independently of the markup syntax
 you want to create to support them, and independently of the fact that you
 also want to encode these syntactic elements with new characters, something
 that is absolutely not needed for any markup language)

Well, it is undocumented apart from posts in this thread because I have put
forward the format for discussion. A pdf document for consideration by the
Unicode Technical Committee could be produced and submitted if there is
interest in the format, the content of the pdf document perhaps including
suggestions from this thread if any such suggestions are forthcoming.

 In summary, you are reinventing the wheel.

Well, this is progress, producing an additional format for expressing an image
for application in various specific specialised circumstances.

William Overington

29 May 2015


Re: Tag characters and in-line graphics (from Tag characters)

2015-05-29 Thread Leo Broukhis
 The format that I am suggesting would allow the image for a non-standard 
 emoji character to be included in a text message, with the image located at 
 the correct place in the text.

A more common occurrence is the need to include a non-standard
character in a text message, be it a ski piste symbol or an obscure
CJK ideogram. Have you thought of embedding TrueType in Unicode?

Leo

On Fri, May 29, 2015 at 1:38 AM, William_J_G Overington
wjgo_10...@btinternet.com wrote:
 Responding to Mark E. Shoulson:


 As was pointed out to me, essentially what you are saying is you reject my 
 premise that one size does not fit all.


 Well, I do not know where that came from, but no, I do not reject that 
 premise. There is plain text, there is HTML, there is XML.


 HTML is good for web pages.


 Plain text is, amongst other applications, good for text messages.


 The format that I am suggesting would allow the image for a non-standard 
 emoji character to be included in a text message, with the image located at 
 the correct place in the text.


 I have not purported that it become the only format for transmitting images.


 You would prefer *everything* be in plain text, so you wouldn't have to use 
 other formats for it. You're essentially converting plain text into THE 
 format for everything.


 No. Use the best format for the task that is being carried out. I am 
 enthusiastic that as much as possible can be done in open source formats 
 rather than an end user of computing equipment needing to rely on expensive 
 proprietary software packages with proprietary file formats that cannot be 
 accessed without expensive software.


  If you really believe one size should fit all in this way, ...


 But I don't.


 Just because I opine that plain text is best for some applications and I have 
 suggested a format that would allow a graphic to be included directly in a 
 plain text file does not mean that I opine that everything should be plain 
 text.


 For example, I use HTML files, gif files, png files, pdf files, wav files, 
 TTF files as appropriate.


 http://www.users.globalnet.co.uk/~ngo/library.htm


 http://www.users.globalnet.co.uk/~ngo/spec0001.htm


 http://www.users.globalnet.co.uk/~ngo/song1018.htm


 http://www.users.globalnet.co.uk/~ngo/song1021.htm


 I have embedded a wav file in a pdf and published the result on the web.


 http://www.users.globalnet.co.uk/~ngo/the_mobile_art_shop.pdf


 Suppose that a plain text file is to include just one non-standard emoji 
 graphic. How would that be done otherwise than by the format that I am 
 suggesting?


 What if there were three such non-standard emoji graphics needed in the plain 
 text file, the second graphic being used twice. How would that be done 
 otherwise than by the format that I am suggesting?


 William Overington


 29 May 2015







Re: Tag characters and in-line graphics (from Tag characters)

2015-05-29 Thread Doug Ewell
William_J_G Overington wjgo underscore 10009 at btinternet dot com
wrote:

 There's no advantage because what you want to create is effectively
 another markup language with its own syntax (but requiring new
 obscure characters that most applications and users will not be able
 to interpret and render correctly in the way intended by you, ...

 Well, if the format became accepted as part of Unicode then
 appropriate applications could well be produced that would interpret
 the format and display an image in the desired place.

I think this cuts to the heart of what people have been trying to say
all along.

Historically, Unicode was not meant to be the means by which brand new
ideas are run up the proverbial flagpole to see if they will gain
traction.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Tag characters and in-line graphics (from Tag characters)

2015-05-29 Thread Philippe Verdy
2015-05-29 4:37 GMT+02:00 John idou...@gmail.com:

 “Today the world goes very well with HTML(5) which is now the best markup
 language for documents (including for inserting embedded images that don’t
 require any external request”

 If I had a large document that reused a particular character thousands of
 times, would this HTML markup require embedding that character thousands of
 times, or could I define the character once at the beginning of the
 sequence, and then refer back to it in a space efficient way?


HTML(5) allows defining *once* entities for images that can then be reused
thousands of times without repeating their definition. You can do this as
well with CSS styles: just define a class for a small element. This element
may still be an image, but the semantics are carried by the class you
assign to it. You are not required to provide an external source URL for
that image if the CSS style provides the content.

You may also use PUAs for the same purpose (however, I have not seen how CSS
would allow styling individual characters in text elements, as these
characters are not elements and there's no defined selector for
pseudo-elements matching a single character). PUAs are perfectly usable in
the situation where you have embedded a custom font in your document for
assigning glyphs to characters (you can still do that, but I would avoid
TrueType/OpenType for this purpose and would use the SVG font format, which
is valid in CSS, for defining a collection of glyphs).

If the document is not restricted to being standalone, you can of course use
links to an external shared CSS stylesheet and to the SVG font referenced
by the stylesheet. With such an approach, you don't even need to use classes
on elements; you use plain text with very compact PUAs. It's up to you to
decide whether the document must be standalone (embedding everything it
needs) or may use external references for missing definitions; HTML allows
both (and SVG as well when it contains plain-text elements).
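
To make the "define once, reuse by class" idea concrete, here is a minimal
Python sketch that writes a standalone HTML page. It is only an illustration
under stated assumptions: the function name build_standalone_html, the class
prefix "pua-" and the use of a ::before rule with a base64 data URL are choices
made for this example, not something specified in the thread.

```python
import base64
import html


def build_standalone_html(text, glyph_svgs):
    """Return a standalone HTML page for `text`.

    `glyph_svgs` maps a single Private Use Area character to SVG markup.
    Each image is written once, in a CSS rule; every occurrence in the
    text becomes a short <span> that only names the class.
    (Hypothetical helper, written for illustration.)
    """
    rules = []
    for ch, svg in glyph_svgs.items():
        data = base64.b64encode(svg.encode("utf-8")).decode("ascii")
        rules.append(
            f".pua-{ord(ch):04x}::before "
            f'{{ content: url("data:image/svg+xml;base64,{data}"); }}'
        )

    parts = []
    for ch in text:
        if ch in glyph_svgs:
            # Reference to the image: the class name carries the meaning.
            parts.append(f'<span class="pua-{ord(ch):04x}"></span>')
        else:
            parts.append(html.escape(ch))

    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>demo</title>\n<style>\n" + "\n".join(rules) + "\n</style></head>\n"
        "<body><p>" + "".join(parts) + "</p></body></html>"
    )


if __name__ == "__main__":
    svg = "<svg xmlns='http://www.w3.org/2000/svg' width='8' height='8'/>"
    print(build_standalone_html("a\ue000b\ue000", {"\ue000": svg}))
```

In the generated page the image data appears once in the stylesheet, however
many times the character occurs in the text. Note that the empty spans carry no
text of their own, so this sketch does nothing for searching or copying.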


Re: Tag characters and in-line graphics (from Tag characters)

2015-05-28 Thread Mark E. Shoulson
As was pointed out to me, essentially what you are saying is you reject 
my premise that one size does not fit all.  You would prefer 
*everything* be in plain text, so you wouldn't have to use other 
formats for it.  You're essentially converting plain text into THE 
format for everything.


But it isn't suited for that.  If you really believe one size should fit 
all in this way, I think the problem is that pretty much all of the rest 
of the computer science community doesn't agree with you.  Sorry.


~mark

On 05/28/2015 07:50 AM, William_J_G Overington wrote:

Responding to Mark E. Shoulson:

The big advantage of this new format is that the result is an unambiguous 
Unicode plain text file and could be placed within a file of plain text without 
having to make the whole document a markup file to some format. Plain text is 
the key advantage.

The following may be useful as a guide to the original problem that I am trying 
to solve.

http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term

I tried to apply the brilliant new base character followed by tag characters 
format to the problem.

In the future, maybe Serif DrawPlus will have the ability to export a picture 
to this new format.

William Overington

28 May 2015





Re: Tag characters and in-line graphics (from Tag characters)

2015-05-28 Thread John
“Today the world goes very well with HTML(5) which is now the best markup 
language for documents (including for inserting embedded images that don’t 
require any external request”




If I had a large document that reused a particular character thousands of 
times, would this HTML markup require embedding that character thousands of 
times, or could I define the character once at the beginning of the sequence, 
and then refer back to it in a space efficient way?




Part of the reason, at least, for having any code system rather than just pixels 
and images is to encode data efficiently and consistently. Unicode has private 
use ranges of codes. I can see an argument that it would be desirable to be 
able to send someone text with private use ranges and have the header define 
some default renderings. I’m not sure that replacing a document of 100,000 
characters with 100,000 embedded HTML5 img tags is the same thing. It would be 
inefficient in space, impossible to process (e.g. to find all the instances of a 
particular character or sequence), and so forth.
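
As a rough worked example of the space and processing argument, the Python
sketch below compares 100,000 occurrences of one private use character with
100,000 occurrences of a short piece of markup. The code point U+E000 and the
markup string are assumptions chosen for illustration; real markup that embeds
image data in every tag would be far longer.

```python
def compare(n, pua_char="\ue000", markup='<span class="c"></span>'):
    """Compare n repeats of a PUA character with n repeats of a markup element.

    Both default arguments are illustrative assumptions, not values taken
    from the thread.
    """
    as_text = pua_char * n
    as_markup = markup * n
    return {
        # A BMP private use character takes 3 bytes in UTF-8.
        "pua_bytes": len(as_text.encode("utf-8")),
        "markup_bytes": len(as_markup.encode("utf-8")),
        # Finding every instance is an ordinary string operation.
        "pua_hits": as_text.count(pua_char),
    }


if __name__ == "__main__":
    print(compare(100_000))
```

With these assumed inputs the plain-text form is several times smaller, and
counting or locating the character needs no knowledge of any markup syntax.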




Given that it's been agreed that private use ranges are a good thing, and given 
that we can agree that exchanging data is a good thing, maybe something should 
bring those two things together. Just a thought.


—
Chris

On Fri, May 29, 2015 at 9:45 AM, Mark E. Shoulson m...@kli.org wrote:

 As was pointed out to me, essentially what you are saying is you reject 
 my premise that one size does not fit all.  You would prefer 
 *everything* be in plain text, so you wouldn't have to use other 
 formats for it.  You're essentially converting plain text into THE 
 format for everything.
 But it isn't suited for that.  If you really believe one size should fit 
 all in this way, I think the problem is that pretty much all of the rest 
 of the computer science community doesn't agree with you.  Sorry.
 ~mark
 On 05/28/2015 07:50 AM, William_J_G Overington wrote:
 Responding to Mark E. Shoulson:

 The big advantage of this new format is that the result is an unambiguous 
 Unicode plain text file and could be placed within a file of plain text 
 without having to make the whole document a markup file to some format. 
 Plain text is the key advantage.

 The following may be useful as a guide to the original problem that I am 
 trying to solve.

 http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term

 I tried to apply the brilliant new base character followed by tag 
 characters format to the problem.

 In the future, maybe Serif DrawPlus will have the ability to export a 
 picture to this new format.

 William Overington

 28 May 2015


Re: Tag characters and in-line graphics (from Tag characters)

2015-05-28 Thread William_J_G Overington
Responding to Mark E. Shoulson:

The big advantage of this new format is that the result is an unambiguous 
Unicode plain text file and could be placed within a file of plain text without 
having to make the whole document a markup file to some format. Plain text is 
the key advantage.

The following may be useful as a guide to the original problem that I am trying 
to solve.

http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term

I tried to apply the brilliant new base character followed by tag characters 
format to the problem.

In the future, maybe Serif DrawPlus will have the ability to export a picture 
to this new format.

William Overington

28 May 2015




Tag characters and in-line graphics (from Tag characters)

2015-05-27 Thread William_J_G Overington
Tag characters and in-line graphics (from Tag characters)

This document suggests a way to use the method of a base character together 
with tag characters to produce a graphic. The approach is theoretical and has 
not, at this time, been tried in practice.

The application in mind is to enable the graphic for an emoji character to be 
included within a plain text stream, though there will hopefully be other 
applications.

The base character could be either an existing character, such as U+1F5BC FRAME 
WITH PICTURE, or a new character as decided. Tests could be carried out using a 
Private Use Area character as the base character.

The explanation here is intended to explain the suggested technique by 
examples, as a basis for discussion. In each example, please consider that the 
characters listed are each the tag version of the character shown here and that 
they all, as a group, follow one base character.

The examples are deliberately short so as to explain the idea. A real use 
example might have around two hundred or so tag characters following the base 
character, maybe more, sometimes fewer.
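
As an illustration of "the tag version of the character", the following Python
sketch maps a string of ASCII commands onto the Unicode tag characters
(U+E0020..U+E007E, which mirror ASCII 0x20..0x7E) and places them after a base
character. The function names are hypothetical, and the choice of U+1F5BC as
base is just the example given above.

```python
BASE = "\U0001F5BC"   # U+1F5BC FRAME WITH PICTURE, the example base character
TAG_OFFSET = 0xE0000  # tag character = ASCII code point + 0xE0000


def to_tag_sequence(commands, base=BASE):
    """Return `base` followed by the tag-character form of `commands`."""
    for ch in commands:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"no tag character for {ch!r}")
    return base + "".join(chr(TAG_OFFSET + ord(ch)) for ch in commands)


def from_tag_sequence(sequence):
    """Split a base-plus-tags sequence back into (base, ASCII commands)."""
    base, tags = sequence[0], sequence[1:]
    return base, "".join(chr(ord(t) - TAG_OFFSET) for t in tags)


if __name__ == "__main__":
    seq = to_tag_sequence("7r5y-3b")
    print([f"U+{ord(c):04X}" for c in seq])
    print(from_tag_sequence(seq))
```

Each tag character occupies 4 bytes in UTF-8, so a sequence of about two 
hundred tag characters would take roughly 800 bytes plus the base character.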

Examples of displays:

Each example runs left to right along the line, then line by line down the page 
from upper to lower.

7r means 7 pixels red
7r5y means 7 pixels red then 5 pixels yellow
7r5y-3b means 7 pixels red then 5 pixels yellow then next line then 3 pixels 
blue

Examples of colours available:

k black
n brown
r red
o orange
y yellow
g green (0, 255, 0)
b blue
m magenta
e grey
w white
c cyan
p pink
d dark grey
i light grey (thus avoiding lowercase l, which could be confused with 
figure 1)
f deeper green (foliage colour) (0, 128, 0)

Next line request:

- moves to the next line

Local palette requests:

192R224G64B2s means store as local palette colour 2 the colour (R=192, G=224, 
B=64)
7,2u means 7 pixels using local palette colour 2

Local glyph memory, for use in compressing a document where the same glyph is 
used two or more times in the document:

3t7r means this is local glyph 3 being defined at its first use in the document 
as 7 red pixels
3h means local glyph 3 is being used here
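
The commands above can be read mechanically. The Python sketch below decodes
the ASCII form of a command string into rows of pixels. It is a sketch under
assumptions, since the format is only outlined by example: the function names
are hypothetical, named colours are kept as names rather than RGB values (only
two RGB values are given above), the local palette is assumed to last for one
graphic, and "t" is assumed to define a glyph as the whole remainder of the
sequence.

```python
import re

# Colour letters exactly as listed in the proposal.
COLOURS = {
    "k": "black", "n": "brown", "r": "red", "o": "orange", "y": "yellow",
    "g": "green", "b": "blue", "m": "magenta", "e": "grey", "w": "white",
    "c": "cyan", "p": "pink", "d": "dark grey", "i": "light grey",
    "f": "deeper green",
}

TOKEN = re.compile(
    r"(?P<r>\d+)R(?P<g>\d+)G(?P<b>\d+)B(?P<slot>\d+)s"  # 192R224G64B2s
    r"|(?P<ucount>\d+),(?P<uslot>\d+)u"                   # 7,2u
    r"|(?P<count>\d+)(?P<colour>[a-z])"                   # 7r
    r"|(?P<newline>-)"                                    # next line
)


def decode_pixels(body):
    """Decode run-length commands into a list of rows of colours."""
    rows, row, palette = [], [], {}
    pos = 0
    while pos < len(body):
        m = TOKEN.match(body, pos)
        if m is None:
            raise ValueError(f"cannot parse at position {pos}: {body[pos:]!r}")
        pos = m.end()
        if m["slot"] is not None:          # store a local palette colour
            palette[int(m["slot"])] = (int(m["r"]), int(m["g"]), int(m["b"]))
        elif m["uslot"] is not None:       # run of a palette colour
            row.extend([palette[int(m["uslot"])]] * int(m["ucount"]))
        elif m["newline"] is not None:     # move to the next line
            rows.append(row)
            row = []
        else:                              # run of a named colour
            if m["colour"] not in COLOURS:
                raise ValueError(f"unknown colour letter {m['colour']!r}")
            row.extend([COLOURS[m["colour"]]] * int(m["count"]))
    rows.append(row)
    return rows


def decode_graphic(tags, glyphs):
    """Decode one graphic; `glyphs` is the document-wide local glyph memory."""
    use = re.fullmatch(r"(\d+)h", tags)
    if use:                                # 3h: reuse a stored glyph
        return glyphs[int(use[1])]
    define = re.match(r"(\d+)t", tags)
    if define:                             # 3t7r: define and display
        rows = decode_pixels(tags[define.end():])
        glyphs[int(define[1])] = rows
        return rows
    return decode_pixels(tags)


if __name__ == "__main__":
    memory = {}
    print(decode_graphic("7r5y-3b", memory))
    print(decode_graphic("3t7r", memory))
    print(decode_graphic("3h", memory))
```

A receiving application would first convert the tag characters back to their 
ASCII equivalents and then pass the result to a decoder of this kind.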

The above is for bitmaps. It would be possible to use a similar technique to 
specify a vector glyph as used in fontmaking using on-curve and off-curve 
points specified as X, Y coordinates together with N for on-curve and F for 
off-curve. There would need to be a few other commands so as to specify places 
in the tag character stream where definition of a contour starts and so as to 
separate the definitions of the glyphs for a colour font and so on. This could 
be made OpenType compatible so that a received glyph could be added into a font.

Please feel free to suggest improvements. One improvement could be as to how to 
build a Unicode code point into a picture so that a font could be transmitted.

William Overington

27 May 2015


RE: Tag characters and in-line graphics (from Tag characters)

2015-05-27 Thread Doug Ewell
William_J_G Overington wjgo underscore 10009 at btinternet dot com
wrote:

 Please feel free to suggest improvements.

http://en.wikipedia.org/wiki/Scalable_Vector_Graphics

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Tag characters and in-line graphics (from Tag characters)

2015-05-27 Thread Mark E. Shoulson

I think I've figured out the philosophy WJGO is trying to follow here.

We should have a way to encode graphics in Unicode
We should have a way to encode programming instructions in Unicode
How about
We should have a way to encode sound-waves in Unicode?
Or
We should have a way to encode *moving* graphics, maybe with sound, in 
Unicode?


Now, he didn't say the last two, in fairness to him.  But I think that's 
the thinking.  WJGO, not *everything* computers do has to be part of 
Unicode.  Doing so essentially makes *everything* that wants to support 
Unicode have to be... well, pretty much *everything* all other 
computers are.  We have graphics formats that encode graphics; they're 
*good* at it.  They're made for it. We have sound formats for encoding 
sounds.  We have various bytecodes for programming--different ones, 
written by different people, that do things in different ways, because 
one size does not fit all.  Unicode can't be the one size.  It was never 
intended to.  Don't make Unicode into an operating system, or worse, THE 
operating system.  It's a character encoding.  For encoding characters.


~mark

On 05/27/2015 12:26 PM, William_J_G Overington wrote:

Tag characters and in-line graphics (from Tag characters)


This document suggests a way to use the method of a base character 
together with tag characters to produce a graphic. The approach is 
theoretical and has not, at this time, been tried in practice.



The application in mind is to enable the graphic for an emoji 
character to be included within a plain text stream, though there will 
hopefully be other applications.