On 7/30/2013 2:15 PM, Doug Ewell wrote:
> Asmus Freytag <asmusf at ix dot netcom dot com> wrote:
>
>>> A code page is not, in general,
>>> the same as an encoding scheme.
>>
>> What is, then, the proper definition of a "code page"?
>
> I might not be able to do better than Potter Stewart here. I think of a
> code page as a deliberately targeted subset of all encodable characters,
> such that different "pages" make up the whole "book." The Unicode
> Glossary uses the example of MS-DOS code page 437; the concept wouldn't
> apply unless other pages existed, covering different repertoires.
I'm not privy to the thinking behind the actual origin of the term, but I always assumed that the term "page" was chosen in analogy to the way one speaks of a "page" of memory - something that can be swapped in and out.

So, by selecting a different code page, one would swap the definition of the bytes in a byte stream such that they resulted in different displayed elements (and correspond to different key strokes).

The early code pages were small and fixed width (code unit == code point), so this kind of image makes sense.
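That image can be made concrete with Python's built-in codecs: the same byte stream yields different displayed characters depending on which single-byte code page is "swapped in" (a minimal sketch; the byte value 0x82 is chosen arbitrarily):

```python
# One byte stream, three single-byte code pages (code unit == code point):
# the meaning of each byte changes with the code page that is "swapped in".
raw = b"\x82"

for codec in ("cp437", "cp1252", "latin-1"):
    print(codec, repr(raw.decode(codec)))

# cp437 reads 0x82 as 'é' (U+00E9), cp1252 as a low-9 quotation mark
# (U+201A), and latin-1 as the C1 control character U+0082.
```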

Later, the concept was effectively generalized, first to any kind of character set, and then to any kind of encoding scheme. East Asian character sets could exist in multiple encoding schemes that (with some limited differences relating to ASCII characters) encoded the same repertoire, but with different byte sequences.
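A sketch of that East Asian situation, using codecs from the Python standard library: the same character (U+3042, hiragana "a") is encoded as different byte sequences by three Japanese encoding schemes that cover essentially the same repertoire:

```python
# One character, three encoding schemes for (roughly) the same repertoire,
# three different byte sequences on the wire.
ch = "\u3042"  # HIRAGANA LETTER A

print(ch.encode("shift_jis"))   # two bytes, 0x82 0xA0
print(ch.encode("euc_jp"))      # two bytes, 0xA4 0xA2
print(ch.encode("iso2022_jp"))  # the same JIS code point, wrapped in
                                # escape sequences that switch charsets
```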

Because Unicode was created (by design, if not always 100% in actuality) to allow lossless mappings from pre-existing character sets, the code page identifier doubled as a mapping identifier, and it is very widely used for that purpose.
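One concrete trace of that doubling survives on Windows, where the Unicode encoding schemes themselves received code page numbers (65001 for UTF-8, 1200 for UTF-16LE). A sketch, assuming a CPython interpreter recent enough (3.8+) to treat cp65001 as an alias of the utf-8 codec:

```python
# Windows reuses the code page ID space for Unicode encoding schemes:
# code page 65001 *is* UTF-8, so the numeric "code page" and the
# encoding-scheme name select the very same mapping.
data = "caf\u00e9".encode("utf-8")

assert data.decode("cp65001") == data.decode("utf-8") == "café"
```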

That doesn't mean that mappings are code pages.

Whether Unicode is a "code page" is something that you can argue up and down. In the original scheme, as extended, it can very naturally be just another code page; in the architectures that support it, though, it has a different place, due to its nature as the universal mapping target.

In the end, the universal nature of Unicode means that architectures which depended on swapping character sets (code pages) in mid-stream are no longer viable; they have been replaced by this single superset. Code pages live on only to describe data and devices that are stuck in a particular past (even if some, like ISO 8859-1 or Windows-1252, remain relatively alive and kicking).

I'm happy to think of Unicode as something outside the old code page definition, but also as the "code page to end all code pages". Both work for me, so seeing code page IDs defined for all the encoding schemes doesn't worry me.

A./

> Later, it was realized that in order to specify what encoding data
> were in or, for example, to specify a conversion from UTF-7 and UTF-8
> to UTF-16 (native encoding scheme) one needed some suitable ID number
> to identify the mapping. Well, extending the code page id was the most
> natural way to do that, because, on several platforms, the use of a
> numerical ID from the IBM code page registry was established practice.

I don't think the existence of numeric identifiers for Unicode encoding
schemes suffices to make them "code pages."

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell



