RE: Normalization rate on the Web

2013-01-21 Thread Shawn Steele
I have no idea what the stats are, however some systems generate more NFC and 
others more NFD.  And then some publisher uses NFC systems but an author uses 
an NFD system, so the pages served end up with a mixture.

I generally recommend using comparisons and index keys that understand NFC/NFD 
and compare accurately regardless of the form.
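
A minimal sketch of that kind of comparison, using Python's standard `unicodedata` module. The helper names `nfc_key` and `canonically_equal` and the tiny `index` dictionary are made up for illustration; a real system might instead use a collation library that handles canonical equivalence itself.

```python
import unicodedata

def nfc_key(text: str) -> str:
    """Return a key that is the same for canonically equivalent strings."""
    return unicodedata.normalize("NFC", text)

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings regardless of which normalization form they use."""
    return nfc_key(a) == nfc_key(b)

composed = "caf\u00e9"      # e-acute as a single code point (NFC)
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT (NFD)

# Hypothetical index keyed on the normalized form.
index = {nfc_key(composed): "page-1"}

print(composed == decomposed)                   # raw comparison fails
print(canonically_equal(composed, decomposed))  # normalized comparison matches
print(index.get(nfc_key(decomposed)))           # lookup works for either form
```

Normalizing only the keys leaves the stored text untouched, so mixed-form pages can still be served as they were authored.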

-Shawn

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Denis Jacquerye
Sent: Monday, January 21, 2013 8:12 AM
To: Unicode Discussion
Subject: Normalization rate on the Web

Does anybody have any idea of how much of the Web is normalized in NFC or NFD? 
Or how much not normalized?

How would one find out or try to make a smart guess?

I know a lot of library catalogue data is in NFD or somewhat decomposed. Is 
there any other field that heavily uses decomposition?

--
Denis Moyogo Jacquerye
African Network for Localisation http://www.africanlocalisation.net/
Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/
DejaVu fonts --- http://www.dejavu-fonts.org/








Re: Normalization rate on the Web

2013-01-21 Thread Bjoern Hoehrmann
* Denis Jacquerye wrote:
Does anybody have any idea of how much of the Web is normalized in NFC
or NFD? Or how much not normalized?

How would one find out or try to make a smart guess?

"How much" is not a good question here. Let's say there are only two web
pages: one is very short and used by only one person once a year, the
other page is very long and used by one billion people once per day. If
one of the two pages is in NFC, the other in NFD, it would be misleading
to say that 50% of the web is in NFD. More realistically, let's say
Wikipedia articles are all neither NFC nor NFD, but Google search result
pages are all NFD. How much would that be?

That problem aside, you could always get a list of web sites or pages,
download them, or use a pre-packaged dataset, and analyse it.
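
As a rough sketch of the analysis step, assuming the page text has already been downloaded and decoded to Unicode strings: the function names `classify` and `tally` and the sample strings are hypothetical, and `unicodedata.is_normalized` needs Python 3.8 or later.

```python
import unicodedata
from collections import Counter

def classify(text: str) -> str:
    """Label a string as 'both', 'NFC', 'NFD' or 'neither'."""
    is_nfc = unicodedata.is_normalized("NFC", text)
    is_nfd = unicodedata.is_normalized("NFD", text)
    if is_nfc and is_nfd:
        return "both"      # e.g. pure ASCII, identical in either form
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "neither"       # mixed or otherwise unnormalized content

def tally(pages):
    """Count how many pages fall into each category."""
    return Counter(classify(page) for page in pages)

# Stand-in for downloaded page text (made-up sample data).
sample_pages = [
    "plain ascii",
    "Dageb\u00fcll",        # precomposed u-umlaut
    "Dagebu\u0308ll",       # u followed by COMBINING DIAERESIS
    "\u00fc and u\u0308",   # both spellings on one page
]
print(tally(sample_pages))
```

This counts every page equally, so it does not address the weighting problem described above; weighting by page length or traffic would need data that is not in the pages themselves.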

My personal experience is that non-NFC content in German or English is
fairly rare; I can tell fairly easily because my smartphone cannot render
various characters like German umlauts properly when decomposed, so
I encounter that problem sometimes, mainly on sites that quote heavily
from PDFs and similar content.
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 



Spiral symbol

2013-01-21 Thread Andrés Sanhueza
Hello.
I have wondered whether it would be a good idea to propose a spiral
character, basically because I believe it is the only major symbol recurrently
used to represent swearing in comics that is missing from Unicode. Most
of the time it is replaced with the more common at sign (@), but an actual
spiral would still be good to have. I am not sure yet whether there is enough
documentation. Some emoji representations display the CYCLONE character
(U+1F300) as a spiral, yet I don't think that makes it a good replacement.

Andrés Sanhueza


Re: Normalization rate on the Web

2013-01-21 Thread Andrew Cunningham
Hi Denis,

A few thoughts ... library data may be NFC or NFD, but is more likely to
conform to the MARC character repertoire, so isn't exactly NFD.

Vietnamese data is either 1) NFC or 2) neither NFC nor NFD

It would be rare to find Vietnamese data in NFD

For a range of African languages, mainly ones using diacritics and diacritic
stacking, it may be 1) NFC, 2) NFD or 3) neither NFC nor NFD depending on
the input framework used.
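
To make "neither NFC nor NFD" concrete, here is a small Python sketch. The sequence used (precomposed ê followed by a combining dot below) is chosen purely for illustration and is an assumption, not something taken from a particular data set; the helper `code_points` is likewise made up.

```python
import unicodedata

def code_points(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# LATIN SMALL LETTER E WITH CIRCUMFLEX + COMBINING DOT BELOW:
# partly composed, partly decomposed.
mixed = "\u00ea\u0323"

print("input", code_points(mixed))
for form in ("NFC", "NFD"):
    normalized = unicodedata.normalize(form, mixed)
    # The final column is False when the input is not already in that form.
    print(form, code_points(normalized), normalized == mixed)
```

Both comparisons come out False: NFC folds the sequence into one code point, while NFD splits it into a base letter plus two combining marks in canonical order.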
On Jan 22, 2013 3:26 AM, Denis Jacquerye moy...@gmail.com wrote:

 Does anybody have any idea of how much of the Web is normalized in NFC
 or NFD? Or how much not normalized?

 How would one find out or try to make a smart guess?

 I know a lot of library catalogue data is in NFD or somewhat
 decomposed. Is there any other field that heavily uses decomposition?

 --
 Denis Moyogo Jacquerye
 African Network for Localisation http://www.africanlocalisation.net/
 Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/
 DejaVu fonts --- http://www.dejavu-fonts.org/





Re: Spiral symbol

2013-01-21 Thread Asmus Freytag

On 1/21/2013 4:11 PM, Andrés Sanhueza wrote:

Hello.
I have wondered whether it would be a good idea to propose a
spiral character, basically because I believe it is the only major
symbol recurrently used to represent swearing in comics that is
missing from Unicode.


If it should come to a proposal, I can help out with one or two 
citations of the use of this symbol for that purpose in contexts that 
are not that different from other lettering in the same sources; no 
more different than emoji are from regular words.


A./

Most of the time it is replaced with the more common at sign (@), but
an actual spiral would still be good to have. I am not sure yet whether
there is enough documentation. Some emoji representations display the
CYCLONE character (U+1F300) as a spiral, yet I don't think that makes
it a good replacement.


Andrés Sanhueza





Re: Normalization rate on the Web

2013-01-21 Thread Martin J. Dürst

On 2013/01/22 1:12, Denis Jacquerye wrote:

Does anybody have any idea of how much of the Web is normalized in NFC
or NFD? Or how much not normalized?


I have never measured this. But at one time, there was only NFD (and 
NFKD). The Unicode Consortium, with input from W3C, then defined NFC 
(and NFKC) to be much closer to the actual encodings used on the Web.


So in some sense, Web Content is (mostly) NFC *by design*.
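
One way to see this, as a sketch only: take the upper half of a legacy single-byte encoding, decode it, and check which form the result is in. ISO-8859-1 is my choice of a representative legacy encoding here, not something specific to the original design discussions; `unicodedata.is_normalized` needs Python 3.8 or later.

```python
import unicodedata

# Every byte in the upper half of ISO-8859-1 (0xA0..0xFF), which includes
# the accented letters used by many Western European languages.
legacy_bytes = bytes(range(0xA0, 0x100))
text = legacy_bytes.decode("latin-1")

# Plain transcoding maps each byte to one precomposed code point.
print(unicodedata.is_normalized("NFC", text))   # already NFC
print(unicodedata.is_normalized("NFD", text))   # not NFD: accents are precomposed
```

So content that was simply converted from such an encoding arrives as NFC without any explicit normalization step.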

Regards,    Martin.



How would one find out or try to make a smart guess?

I know a lot of library catalogue data is in NFD or somewhat
decomposed. Is there any other field that heavily uses decomposition?

--
Denis Moyogo Jacquerye
African Network for Localisation http://www.africanlocalisation.net/
Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/
DejaVu fonts --- http://www.dejavu-fonts.org/