RE: Usage stats?

Phillips, Addison Fri, 27 Mar 2015 13:43:52 -0700

What you might be looking for is the CLDR project’s “exemplar sets” (see, 
for example, [1]), which describe which characters are customarily used for a 
given language and which are sometimes used. However, this is not the same 
thing as a statistical distribution. One of the points of Unicode is that any 
character can be used at any time in any document, regardless of language.

[1] http://www.unicode.org/cldr/charts/27/by_type/core_data.alphabetic_information.main.html
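
To make the distinction concrete: an exemplar set answers "is this character customarily used in this language?", not "how often does it occur?". The following is a minimal Python sketch under stated assumptions. The `EXEMPLARS` table is a hypothetical, abbreviated stand-in loosely modeled on French, not data copied from CLDR, and `classify_characters` is an illustrative helper name, not a CLDR API.

```python
# Hypothetical, abbreviated exemplar data for illustration only.
# Real exemplar sets should be taken from the CLDR data files.
EXEMPLARS = {
    "main": set("abcdefghijklmnopqrstuvwxyzàâæçéèêëîïôœùûüÿ"),
    "auxiliary": set("áåäãāíìóòöúñ"),
}


def classify_characters(text, exemplars=EXEMPLARS):
    """Sort the letters of `text` into main, auxiliary, and other buckets.

    This records only set membership. It says nothing about how
    frequently each character occurs.
    """
    result = {"main": set(), "auxiliary": set(), "other": set()}
    for ch in text.casefold():
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation
        if ch in exemplars["main"]:
            result["main"].add(ch)
        elif ch in exemplars["auxiliary"]:
            result["auxiliary"].add(ch)
        else:
            result["other"].add(ch)
    return result


if __name__ == "__main__":
    print(classify_characters("Señor Ωmega"))
```

A character that lands in the "other" bucket is still perfectly valid in a document in that language, which is the point made above: any character can appear in any document.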

From: Unicode [mailto:[email protected]] On Behalf Of Michael Norton
Sent: Friday, March 27, 2015 1:25 PM
To: John D. Burger
Cc: Vint Cerf; [email protected]
Subject: Re: Usage stats?

Just using the tools and formulations we have at present ought to allow Unicode 
to produce a usage set without indexing the entire web, which would give 
implementors an indication of variances for traffic, overflow, and 
override purposes relative to users of the standard. If the figure varies 
significantly from page:website, website:region, region:language, for example, 
it simplifies our ability to standardize the set.

I have particular concerns, but, like Google, they are proprietary.

On Fri, Mar 27, 2015 at 4:23 PM, John D. Burger <[email protected]> wrote:
On Mar 27, 2015, at 15:57, Michael Norton <[email protected]> wrote:

Why wouldn't Unicode itself have it?

Because, as Ken explained, acquiring (and constantly updating) such statistics 
would require roughly the effort that Google puts into its crawler. And it 
wouldn't include all the printed material that isn't on the web.

Turning your question around, why would Unicode have this information? What 
would be the value, and how would it be worth the (considerable) effort 
required?

- John Burger
  MITRE

On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler <[email protected]> wrote:
Search engine companies (and in particular, Google) have such
information squirreled away in their index databases, at least as
far as usage stats for Unicode characters on the web go -- but it
is proprietary information, and they generally don't publish
information about such statistics.

Perhaps there are researchers out there who have set web crawlers
on a mission to generate such web statistics for publication, and maybe
somebody on this list knows of such research -- but it would be
virtually impossible to generate such information for the much
wider collection of documents and data that are not easily accessible
for web indexing (behind password walls, in PDF document archives,
in proprietary databases, ...). As an example of why this is a problem,
consider the fact that there are *peta*bytes of information picked up
and stored in databases from scanners and other devices used at
tens of millions of retail points of sale. Such data, by its nature, would tend
to skew heavily towards use of ASCII a-z and digits 0-9 in its
character data. How would you end up weighting such (mostly
publicly inaccessible) data in trying to count up for overall statistics
on character use?

There are more traditional usage count studies that focus on
counts of character frequency within single-language orthographies
in single scripts (e.g., letter frequencies for French text), but I don't
think that is what you were asking about.
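
For reference, the single-language kind of count mentioned above is straightforward to compute over a text sample. The following Python sketch is only an illustration: the function name `char_frequencies` and the choice to case-fold and normalize to NFC before counting are assumptions, not something prescribed in this thread.

```python
from collections import Counter
import unicodedata


def char_frequencies(text):
    """Return the relative frequency of each letter in `text`.

    The text is case-folded and normalized to NFC first, so that
    "É" and "e" + U+0301 COMBINING ACUTE ACCENT are both counted
    as the single letter "é".
    """
    normalized = unicodedata.normalize("NFC", text.casefold())
    counts = Counter(ch for ch in normalized if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the letters from most to least frequent.
    return {ch: n / total for ch, n in counts.most_common()}


if __name__ == "__main__":
    sample = "Été comme hiver, le café reste ouvert."
    for ch, freq in char_frequencies(sample).items():
        print(f"{ch}\t{freq:.3f}")
```

The result describes only the sample it was given. As noted above, the hard part of producing overall statistics is deciding which corpora to count and how to weight them, not the counting itself.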

Here is some discussion of a similar question posted on Stack Overflow:

http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics

--Ken

On 3/27/2015 9:31 AM, Michael Norton wrote:
Hello and thank you for an incredible service (just joining the list). Is 
there a list of usage statistics per character of the Unicode set available 
somewhere?

_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode

--

Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com<http://www.nortonsnook.com/>

"All great actors are mere mathematical masters of speech and the human body."
