On 7/30/2013 2:15 PM, Doug Ewell wrote:
> Asmus Freytag <asmusf at ix dot netcom dot com> wrote:
>
>>> A code page is not, in general,
>>> the same as an encoding scheme.
>>
>> What is, then, the proper definition of a "code page"?
>
> I might not be able to do better than Potter Stewart here. I think of a
> code page as a deliberately targeted subset of all encodable characters,
> such that different "pages" make up the whole "book." The Unicode
> Glossary uses the example of MS-DOS code page 437; the concept wouldn't
> apply unless other pages existed, covering different repertoires.
I'm not privy to the thinking behind the actual origin of the term, but I always assumed that the term "page" was chosen in analogy to the way one speaks of a "page" of memory - something that can be swapped in and out.

So, by selecting a different code page, one would swap the definition of the bytes in a byte stream such that they resulted in different displayed elements (and correspond to different key strokes).

The early code pages were small and fixed width (code unit == code point), so this kind of image makes sense.
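That image can be made concrete with Python's built-in codecs: the same byte stream yields different displayed characters depending on which single-byte code page is "swapped in" (a minimal sketch; the byte value 0x82 is chosen arbitrarily):

```python
# One byte stream, three single-byte code pages (code unit == code point):
# the meaning of each byte changes with the code page that is "swapped in".
raw = b"\x82"

for codec in ("cp437", "cp1252", "latin-1"):
    print(codec, repr(raw.decode(codec)))

# cp437 reads 0x82 as 'é' (U+00E9), cp1252 as a low-9 quotation mark
# (U+201A), and latin-1 as the C1 control character U+0082.
```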

Later, the concept was effectively generalized, first to any kind of character set, and then to any kind of encoding scheme. East Asian character sets could exist in multiple encoding schemes that (with some limited differences relating to ASCII characters) encoded the same repertoire, but with different byte sequences.
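A sketch of that East Asian situation, using codecs from the Python standard library: the same character (U+3042, hiragana "a") is encoded as different byte sequences by three Japanese encoding schemes that cover essentially the same repertoire:

```python
# One character, three encoding schemes for (roughly) the same repertoire,
# three different byte sequences on the wire.
ch = "\u3042"  # HIRAGANA LETTER A

print(ch.encode("shift_jis"))   # two bytes, 0x82 0xA0
print(ch.encode("euc_jp"))      # two bytes, 0xA4 0xA2
print(ch.encode("iso2022_jp"))  # the same JIS code point, wrapped in
                                # escape sequences that switch charsets
```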

Because Unicode was created (by design, if not always 100% in actuality) to allow lossless mappings from pre-existing character sets, the code page identifier doubled as a mapping identifier, and it is very widely used for that purpose.
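One concrete trace of that doubling survives on Windows, where the Unicode encoding schemes themselves received code page numbers (65001 for UTF-8, 1200 for UTF-16LE). A sketch, assuming a CPython interpreter recent enough (3.8+) to treat cp65001 as an alias of the utf-8 codec:

```python
# Windows reuses the code page ID space for Unicode encoding schemes:
# code page 65001 *is* UTF-8, so the numeric "code page" and the
# encoding-scheme name select the very same mapping.
data = "caf\u00e9".encode("utf-8")

assert data.decode("cp65001") == data.decode("utf-8") == "café"
```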

That doesn't mean that mappings are code pages.

Whether Unicode is a "code page" is something that you can argue up and down. In the original scheme, as extended, it can very naturally be just another code page; in the architectures that support it, though, it has a different place, due to its nature as the universal mapping target.

In the end, the universal nature of Unicode means that architectures which depended on swapping character sets (code pages) in mid-stream are no longer viable; they have been replaced by this single superset. Code pages live on only to describe data and devices that are stuck in a particular past (even if some, like ISO 8859-1 or Windows-1252, remain relatively alive and kicking).

I'm happy to think of Unicode as something outside the old code page definition, but also as the "code page to end all code pages". Both work for me, so seeing code page IDs defined for all the encoding schemes doesn't worry me.

A./

> Later, it was realized that in order to specify what encoding data
> were in or, for example, to specify a conversion from UTF-7 and UTF-8
> to UTF-16 (native encoding scheme) one needed some suitable ID number
> to identify the mapping. Well, extending the code page id was the most
> natural way to do that, because, on several platforms, the use of a
> numerical ID from the IBM code page registry was established practice.

I don't think the existence of numeric identifiers for Unicode encoding
schemes suffices to make them "code pages."

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell



