Re: A few questions about encoding discovery, copying text, and pasting text in one encoding into text in another encoding

Asmus Freytag Wed, 19 Dec 2012 10:36:14 -0800

First, what Markus said. That's the high-level picture.


Some more details:

On 12/19/2012 7:59 AM, Costello, Roger L. wrote:

2. I have a text editor open and it contains text that is encoded using 
encoding A. I select some of the text and copy it to the clipboard. What is 
copied: (a) the characters (glyphs) that I visually see displayed on the 
screen, or (b) the hex values of each character displayed on the screen, or (c) 
the codepoints, or (d) something else (what else)?


Clipboards can support multiple formats for your data.

Text editors usually copy both a raw and a formatted stream of data. The receiving application can pick which format to use (when you look at the "Paste Special..." command in many applications you can see what formats are on the clipboard).

Some clipboards support limited conversion among raw text formats. For example, on Windows, in an effort to help migration to Unicode, the clipboard will accept text in Unicode and make it available as text in encoding A (or vice versa) as long as A is the predefined legacy encoding for that system.

Finally, to completely answer your question: raw text is present as a stream of code points in Unicode or whatever encoding.
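To make "a stream of code points in Unicode or whatever encoding" concrete, here is a minimal Python sketch. The sample string and the choice of cp1252 as a stand-in for "encoding A" are assumptions made for illustration; the point is only that the same characters end up as different byte streams depending on the raw text format.

```python
# Illustrative sample text; any string would do.
text = "naïve €5"

# The abstract characters, identified by their Unicode code points.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Two possible raw serializations of the same characters.
as_utf16 = text.encode("utf-16-le")   # two bytes per character for this text
as_legacy = text.encode("cp1252")     # one byte per character (assumed "encoding A")

print(code_points)
print(as_utf16.hex(" "))
print(as_legacy.hex(" "))
```

Neither byte stream carries a label saying which encoding it is in; the receiver has to know that from the clipboard format it asked for.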


3. Continuing question 2, when text is copied, is its encoding also copied? Is 
the encoding stored in the clipboard?

I'd say, usually not. But I'm not familiar with all clipboards on all systems. But there's usually some indicator of format (such as HTML vs. Plain Text) and in the example I gave, there's the Unicode vs. Legacy text format.


4. I have two text editors open. Text editor #1 contains text that is encoded using 
encoding A while text editor #2 contains text that is encoded using encoding B. Encoding 
A is different from encoding B. I copy text from #1 and paste it into #2. Does text 
editor #2 realize, "Oh my, the text being inserted uses a different encoding so I 
better convert each of its hex value into the equivalent hex value in my encoding." 
Is that the way it works?

Most encodings can't be converted into each other. The exceptions are few. Almost all encodings can be converted TO Unicode (the reverse is true only if the text happens to not contain characters that are undefined in the target encoding). Some other encodings may have one or more "partner" encodings, which contain the same characters, but with different layouts.
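The asymmetry can be seen in a few lines of Python. This is only a sketch: `to_legacy` is a hypothetical helper, and Latin-1 and ISO 8859-7 are example encodings picked for illustration, not anything from the original question.

```python
def to_legacy(text, encoding):
    """Return text encoded in `encoding`, or None if some character has no mapping there."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# Legacy -> Unicode: every Latin-1 byte maps to a Unicode character.
as_unicode = b"caf\xe9".decode("latin-1")

# Unicode -> legacy: works only if the target encoding defines every character.
back_to_latin1 = to_legacy(as_unicode, "latin-1")   # succeeds
greek_in_latin1 = to_legacy("αβγ", "latin-1")       # None: Latin-1 has no Greek letters
greek_in_8859_7 = to_legacy("αβγ", "iso8859-7")     # succeeds: ISO 8859-7 is a Greek encoding
```

Converting directly from one legacy encoding to another runs into the same limit: it only works for the characters both happen to define.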

The easiest way to avoid problems is for the editors to work in Unicode (as Markus wrote) and then worry about encodings only when reading or writing files for particular purposes (if for some reason Unicode files are not acceptable or available).
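A rough sketch of that pattern in Python, where `str` is already Unicode: decode once when reading, work on Unicode in memory, encode once when writing. The helper names, the file names, and the choice of cp1252 for the legacy file are all assumptions for the sake of the example.

```python
import os
import tempfile

def read_text(path, encoding):
    """Decode a file into a Unicode string at the boundary."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()

def write_text(path, text, encoding="utf-8"):
    """Encode a Unicode string only at the moment of writing."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        legacy_path = os.path.join(tmp, "legacy.txt")
        unicode_path = os.path.join(tmp, "unicode.txt")
        # Simulate an existing legacy document (bytes in cp1252).
        with open(legacy_path, "wb") as f:
            f.write(b"caf\xe9")
        text = read_text(legacy_path, "cp1252")   # now plain Unicode
        text += " \u2615"                          # edit freely; this character is not in cp1252
        write_text(unicode_path, text)             # save as UTF-8
        with open(unicode_path, "rb") as f:
            return text, f.read()

text, saved_bytes = demo()
```

Everything between the read and the write never has to know what encoding the file was in.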

Most editors can tell the difference between a clipboard format for Unicode vs. legacy encoding, and if the user says "paste the legacy" but the document is in Unicode, they would convert, because that is usually well supported and possible. Beyond that, the choices aren't attractive, because you are not guaranteed to succeed with a conversion, so most people don't bother trying to write code for such scenarios.
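As a sketch of that paste logic: the clipboard below is modeled as a plain Python dict, and the format names `"unicode"` and `"legacy"`, the function `paste_as_unicode`, and the cp1252 default are all hypothetical. No real clipboard API is being described here.

```python
def paste_as_unicode(clipboard, legacy_encoding="cp1252", prefer_legacy=False):
    """Pick a plain-text format from a simulated clipboard and return Unicode text."""
    has_unicode = "unicode" in clipboard
    has_legacy = "legacy" in clipboard
    if has_legacy and (prefer_legacy or not has_unicode):
        # Legacy -> Unicode is the well-supported direction.
        return clipboard["legacy"].decode(legacy_encoding)
    if has_unicode:
        return clipboard["unicode"]
    raise LookupError("no plain-text format on the clipboard")

# The same text offered in both formats, as a source application might do.
clipboard = {"unicode": "€5", "legacy": b"\x805"}

pasted_default = paste_as_unicode(clipboard)
pasted_legacy = paste_as_unicode(clipboard, prefer_legacy=True)
```

Note that the legacy bytes only decode correctly because the receiver assumes the same legacy encoding the sender used; nothing in the byte stream itself says so.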

To come back to what Markus wrote and state it in a different way: if you have any choice (that is, are not forced to open legacy documents) you should walk away from "encodings" other than Unicode as rapidly as possible - they are definitely not something that your application should work with natively. It's too messy.


(And why, do you think, was Unicode invented in the first place :) )

A./
