First, what Markus said. That's the high-level picture.

Some more details:

On 12/19/2012 7:59 AM, Costello, Roger L. wrote:
2. I have a text editor open and it contains text that is encoded using 
encoding A. I select some of the text and copy it to the clipboard. What is 
copied: (a) the characters (glyphs) that I visually see displayed on the 
screen, or (b) the hex values of each character displayed on the screen, or (c) 
the codepoints, or (d) something else (what else)?

Clipboards can support multiple formats for your data.

Text editors usually copy both a raw and a formatted stream of data. The receiving application can pick which format to use. (The "Paste Special..." command in many applications shows which formats are on the clipboard.)

Some clipboards support limited conversion among raw text formats. For example, on Windows, in an effort to help migration to Unicode, the clipboard will accept text in Unicode and make it available as text in encoding A (or vice versa), as long as A is the predefined legacy encoding for that system.
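To make that concrete, here is a toy model of such a clipboard. It is only a sketch and not the Windows API: the class ToyClipboard, the format names "unicode" and "legacy", and the choice of cp1252 as the system legacy encoding are all assumptions made up for illustration.

```python
LEGACY = "cp1252"  # assumption: the system's predefined legacy encoding


class ToyClipboard:
    """Hypothetical clipboard that synthesizes one raw text format from the other."""

    def __init__(self):
        self._formats = {}

    def put(self, fmt, data):
        # Placing new data replaces whatever was on the clipboard before.
        self._formats = {fmt: data}

    def get(self, fmt):
        if fmt in self._formats:
            return self._formats[fmt]
        # Unicode was copied, legacy text is requested: convert on demand.
        if fmt == "legacy" and "unicode" in self._formats:
            text = self._formats["unicode"].decode("utf-16-le")
            return text.encode(LEGACY, errors="replace")
        # Legacy text was copied, Unicode is requested: convert on demand.
        if fmt == "unicode" and "legacy" in self._formats:
            text = self._formats["legacy"].decode(LEGACY)
            return text.encode("utf-16-le")
        return None  # no conversion offered for any other format


clip = ToyClipboard()
clip.put("unicode", "caf\u00e9".encode("utf-16-le"))  # the text "café"
legacy_view = clip.get("legacy")

clip.put("legacy", b"caf\xe9")
unicode_view = clip.get("unicode").decode("utf-16-le")
html_view = clip.get("html")
```

Note the errors="replace" on the way to the legacy format: a character that the legacy encoding lacks becomes a question mark, which is why this kind of conversion is called limited.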

Finally, to completely answer your question: raw text is present as a stream of code points, in Unicode or in whatever encoding the clipboard format uses.
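The difference between options (b) and (c) in your question can be seen in a few lines of Python. This is just an illustration with a sample word of my own choosing: the code points stay the same, while the byte ("hex") values depend on the encoding.

```python
text = "Gr\u00fc\u00dfe"  # the German word "Grüße", 5 characters

# The code points are the same no matter how the text is stored.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# The byte values differ from one encoding to the next.
as_utf8 = text.encode("utf-8").hex(" ")
as_utf16 = text.encode("utf-16-le").hex(" ")
as_latin1 = text.encode("latin-1").hex(" ")

print(code_points)
print("UTF-8:   ", as_utf8)
print("UTF-16LE:", as_utf16)
print("Latin-1: ", as_latin1)
```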

3. Continuing question 2, when text is copied, is its encoding also copied? Is 
the encoding stored in the clipboard?

I'd say usually not, though I'm not familiar with all clipboards on all systems. There is usually some indicator of format (such as HTML vs. Plain Text), and in the example I gave, there's the Unicode vs. Legacy text format.

4. I have two text editors open. Text editor #1 contains text that is encoded using 
encoding A while text editor #2 contains text that is encoded using encoding B. Encoding 
A is different from encoding B. I copy text from #1 and paste it into #2. Does text 
editor #2 realize, "Oh my, the text being inserted uses a different encoding so I 
better convert each of its hex value into the equivalent hex value in my encoding." 
Is that the way it works?

Most encodings can't be converted into each other. The exceptions are few. Almost all encodings can be converted TO Unicode (the reverse is true only if the text happens to not contain characters that are undefined in the target encoding). Some other encodings may have one or more "partner" encodings, which contain the same characters, but with different layouts.
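A small Python sketch of those three cases, using Python's built-in codecs. The sample strings and the helper try_encode are mine, and KOI8-R / Windows-1251 are offered only as an approximate example of "partners": they agree on the basic Russian letters used here, not on every character.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if some character is undefined there."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# Legacy -> Unicode: works once the source encoding is known.
text = b"caf\xe9".decode("latin-1")  # "café"

# Unicode -> legacy: works only if every character exists in the target.
back = try_encode(text, "cp1252")
greek = "\u03b1\u03b2\u03b3"  # alpha, beta, gamma
greek_in_latin1 = try_encode(greek, "latin-1")     # fails: no Greek in Latin-1
greek_in_greek = try_encode(greek, "iso-8859-7")   # succeeds

# "Partner" encodings: same letters, different byte layout.
russian = "\u043f\u0440\u0438\u0432\u0435\u0442"  # a six-letter Russian word
koi8 = russian.encode("koi8-r")
cp1251 = russian.encode("cp1251")
converted = koi8.decode("koi8-r").encode("cp1251")  # going through Unicode

print(back, greek_in_latin1, greek_in_greek)
print(koi8.hex(" "), "vs", cp1251.hex(" "))
```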

The easiest way to avoid problems is for the editors to work in Unicode (as Markus wrote) and then worry about encodings only when reading or writing files for particular purposes (if for some reason Unicode files are not acceptable or available).
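In code, that approach might look like the following sketch. The function names load_text and save_text, the file names, and the choice of cp1252 for the legacy file are assumptions for illustration; the point is only that the encoding appears at the file boundary and nowhere else.

```python
import os
import tempfile


def load_text(path, encoding):
    """Decode a file into Unicode text; the encoding matters only here."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


def save_text(path, text, encoding="utf-8"):
    """Encode Unicode text on the way out; UTF-8 unless told otherwise."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a legacy document: the word "naïve" stored in Windows-1252.
    legacy_path = os.path.join(tmp, "legacy.txt")
    with open(legacy_path, "wb") as f:
        f.write(b"na\xefve")

    # Inside the editor everything is Unicode, so edits are not limited
    # to the legacy repertoire (the arrow below is not in Windows-1252).
    document = load_text(legacy_path, "cp1252")
    document += " \u2192 done"

    utf8_path = os.path.join(tmp, "document.txt")
    save_text(utf8_path, document)
    roundtrip = load_text(utf8_path, "utf-8")

print(roundtrip == document)
```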

Most editors can distinguish between a clipboard format for Unicode and one for a legacy encoding, and if the user says "paste the legacy" but the document is in Unicode, they will convert, because that conversion is usually possible and well supported. Beyond that, the choices aren't attractive: a conversion is not guaranteed to succeed, so most people don't bother writing code for such scenarios.

To come back to what Markus wrote and state it in a different way: if you have any choice (that is, you are not forced to open legacy documents) you should walk away from "encodings" other than Unicode as rapidly as possible. They are definitely not something that your application should work with natively. It's too messy.

(And why, do you think, was Unicode invented in the first place :) )

A./
