After more tests, it seems that Word does in fact change a SOFT HYPHEN (U+00AD) on input into <control US> (U+001F), which it uses not as a regular "soft hyphen" but as its own "optional hyphen".
This is then changed back to a regular soft hyphen when the text is copied to the clipboard in a rich-text format or saved in a rich-text format like HTML, but it is changed into a visible U+00AC in the plain-text version placed in the clipboard or when saving to a plain-text file, most probably because of a legacy usage of "¬" in word processors initially made for old MS-DOS (probably WordPerfect, or even old versions of Word) for encoding their own "optional hyphen".

When I tested it with Word 2010 in Windows 7, I did not see that the regular soft hyphen had been converted internally to U+001F, because there was no U+001F in the Unicode-encoded HTML file I had saved (U+001F is invalid in HTML and must be converted, and Word does choose the correct SHY character U+00AD in this case).

Yes, all this is a mess, and I see no reason why Word still changes a regular "soft hyphen" internally into its own legacy "optional hyphen", which it cannot preserve even when saving to a UTF-8 plain-text document (which assumes full support of Unicode, and not a legacy 8-bit "OEM" encoding whose "IBM graphics" charset displays a graphic glyph at the ASCII control position 0x1F, for example on the DOS-like console). I could accept Word doing that only when saving to a non-"ANSI" text file, because SHY is not even present in some of those OEM PC charsets. When saving to an "ANSI" file (some code page based on ISO 8859), SHY could have been used for a long time (note that U+00AC "¬" is mapped in OEM code pages 437 and 850 at 0xAA; code page 437 has no mapping for SHY, while code page 850 does have one, at 0xF0).

When pasting into the command-line console, which now has full support for Unicode in its display buffer, the character should still be a regular SHY U+00AD. It is only when an application whose input code page is not Unicode reads characters from the console (e.g. reading from legacy BIOS keystrokes, from standard input, or from other legacy Windows APIs working in the OEM charset) that the Unicode SHY U+00AD in the display buffer "may" be changed to 0xAA if the legacy application uses code page 437 or 850 (Windows APIs working in an "ANSI" charset should still return SHY at 0xAD), and to 0x1F otherwise (if there is not even a "¬" mapped in that legacy code page).

All this looks like a confusion between the internal storage and processing in Word and what should be part of specific text file format converters (they are extension DLL plugins in the "converters" directory), rather than being built in such a hardwired way into Word's core engine. In the DOM-like Visual Basic interface of Word (or COM/DCOM), there may exist macros (or even extension plugins for various linguistic correctors such as external dictionaries) that still depend on detecting U+001F in the internal work buffer, or on generating it when working with those "optional hyphens". But here also, this should just be part of this VB or COM interface, and subject to versioning (version tracking of interfaces is a required component of COM programming), which may use any one of the various text format converters. Word should still make every effort to maintain the distinction between the SOFT HYPHEN and the NOT SIGN, even with its legacy "optional hyphen" control mapped to U+001F.

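For anyone who wants to check the code page mappings discussed above, here is a minimal Python sketch. It relies only on the codec tables bundled with Python, which is an assumption: they may not match in every detail what Word or a given Windows API does. The helper name encode_or_none is purely illustrative.

```python
# Minimal sketch: look up how the three characters discussed above are
# mapped in two OEM code pages and two "ANSI"-style encodings, using
# Python's bundled codec tables (assumed, not verified against Windows).

SHY = "\u00ad"       # SOFT HYPHEN
NOT_SIGN = "\u00ac"  # NOT SIGN
US = "\u001f"        # <control US>, used by Word as its "optional hyphen"

CODECS = ["cp437", "cp850", "cp1252", "latin-1"]


def encode_or_none(ch, codec):
    """Return the single byte value of ch in codec, or None if unmapped."""
    try:
        return ch.encode(codec)[0]
    except UnicodeEncodeError:
        return None


def fmt(byte):
    """Format a byte value as hex, or report that there is no mapping."""
    return "unmapped" if byte is None else f"0x{byte:02X}"


if __name__ == "__main__":
    for codec in CODECS:
        print(
            f"{codec:8}"
            f" SHY={fmt(encode_or_none(SHY, codec))}"
            f" NOT={fmt(encode_or_none(NOT_SIGN, codec))}"
            f" US={fmt(encode_or_none(US, codec))}"
        )
```

With these tables, the NOT SIGN sits at 0xAA in both OEM code pages and SHY at 0xAD in the "ANSI" ones; SHY has no mapping in cp437, while cp850 lists it at 0xF0.
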
To run complete tests, you should know that the Windows clipboard exposes several parallel versions of the same source text. These are normally provided through negotiation and collaboration with the source application, which just indicates which output formats it supports; the Windows clipboard stores the data itself, in memory or in a temporary swap file using a standard text format, only if the source application must exit. The standard clipboard then becomes the effective intermediate target, capable of storing a rich-text format that it can expose itself and convert later for any other target application.

And the results differ depending on which source or target application or file format you use through the Windows clipboard or to/from Word itself. This consideration is not specific to text: you have the same problem with images, even if Windows defines its own "portable" DIB format, with various capabilities for color spaces, color depths, pixel aspect ratios, logical resolutions in twips, and so on, plus a legacy BMP format that is also supported by internal lossy image converters.

-- Philippe.

2011/7/7 Andreas Prilop <prilop4...@trashmail.net>:
> On Sat, 2 Jul 2011, Jukka K. Korpela wrote:
>
>> And there is really no guarantee that programs support the
>> soft hyphen. For one, Microsoft Word doesn’t—it treats it
>> as just another printable character.
>
> ... and also:
> http://www.cs.tut.fi/~jkorpela/shy.html#word
>
> MS Word's behaviour depends on the setting
> File > Options > Advanced > Cut, copy, and paste >
> Pasting from other programs.
> "Keep Text Only" : U+00AD remains U+00AD.
> "Merge Formatting": U+00AD is changed to U+001F.
>
> When I copy MS Word's own soft hyphen (i.e. U+001F)
> from MS Word into any other program, I get U+00AC (¬).
> :-(
>
> --
> From the New World:
> http://www.google.co.uk/search?ie=ISO-8859-2&q=Dvofi%E1k