Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"
Marco van de Voort wrote:
> In our previous episode, Hans-Peter Diettrich said:
>>>> storage, we'll have to take that into account.
>>>
>>> (16-bit codepages were designed into OS/2 and Windows NT before utf-8 even
>>> existed)
>>
>> Right, both systems were developed by Microsoft :-]
>
> A cooperation between IBM and Microsoft starting in 1984 to somewhere in the
> early nineties, yes. (or Micro Soft, I can't remember when they dropped the
> space).

AFAIK MS let IBM pay the bill for the OS/2 development, and used that experience in the development of Windows (NT).

>> No problem, as long as proper host/network byteorder conversion is
>> applied in reading/writing such files.
>
> I don't see that as something evident.

It's evident in the case of reading/writing words on byte-based media, where the byte order is important.

> crlf vs lf is not fully transparent either, just open an lf file with
> notepad. Many unix editors show crs etc. There isn't even a universal
> marker to signal it (like BOMs)

The handling of Carriage Return (CR) and Line Feed (LF) was essential on mechanical (teletype-style) terminals. A Teletype terminal had no input buffer, and couldn't perform a full carriage return within the transmission time of the following code. That's why most protocols sent a CR first, to start the carriage movement, followed by an LF, which was processed before arrival of the next code. LF and CR had different purposes, and could be used individually for special printing effects (overwrite, form feed).

Newer devices (and computers) had no such timing requirements, so that a single character code was sufficient to indicate a (logical) end-of-line. Unfortunately some companies used CR for that purpose, others used LF, and MS used CR+LF as an EOL indicator. With respect to text output on printing devices, the CR+LF convention certainly was the correct solution. Problems arose only in data exchange between different systems, which had to cope with all three conventions.

Unicode provided no improvement; on the contrary, the same mess was continued with de/composed accented characters and umlauts :-(

> Putting layer upon layer in a misguided attempt to make anything accept
> anything transparent is IMHO a waste of both time and computing resources.
> Better intensively maintain a few good converters, and strengthen metadata
> processing and retention to make it automatic in a few places where it
> really matters.
>
> I'm no security expert, but I guess from a security viewpoint that is better
> too.

I don't know about any text processing model really *superior* to Unicode, do you? And OOP is perfectly suited to implement multi-layer models.

DoDi
___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
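A minimal sketch of the three end-of-line conventions mentioned in this message (CR+LF, lone CR, lone LF), written in Python purely for illustration. The function name normalize_eol is hypothetical and not part of the FPC RTL; the sketch only shows that the CR+LF pair has to be handled before the single characters.

```python
def normalize_eol(text: str, eol: str = "\n") -> str:
    """Map CR+LF, lone CR and lone LF to one logical end-of-line marker."""
    # Handle the CR+LF pair first, otherwise it would count as two line ends.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    # Optionally re-emit the line ends in the convention the target expects.
    return unified if eol == "\n" else unified.replace("\n", eol)


mixed = "dos line\r\ncr line\runix line\n"
print(normalize_eol(mixed).splitlines())
# prints ['dos line', 'cr line', 'unix line']
```

Passing eol="\r\n" produces the CR+LF form again, which is the shape a converter for exchange between systems would need.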
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"
In our previous episode, Hans-Peter Diettrich said:
> >> storage, we'll have to take that into account.
> >
> > (16-bit codepages were designed into OS/2 and Windows NT before utf-8 even
> > existed)
>
> Right, both systems were developed by Microsoft :-]

A cooperation between IBM and Microsoft starting in 1984 to somewhere in the early nineties, yes. (or Micro Soft, I can't remember when they dropped the space).

From http://en.wikipedia.org/wiki/Microsoft#1984.E2.80.9394:_Windows_and_Office

"Microsoft released its version of OS/2 to original equipment manufacturers (OEMs) on April 2, 1987"

> No problem, as long as proper host/network byteorder conversion is
> applied in reading/writing such files.

I don't see that as something evident.

crlf vs lf is not fully transparent either, just open an lf file with notepad. Many unix editors show crs etc. There isn't even a universal marker to signal it (like BOMs).

Putting layer upon layer in a misguided attempt to make anything accept anything transparent is IMHO a waste of both time and computing resources. Better intensively maintain a few good converters, and strengthen metadata processing and retention to make it automatic in a few places where it really matters.

I'm no security expert, but I guess from a security viewpoint that is better too.

> But times have changed, nowadays the Internet requires certain common
> standards (e.g. 8-bit bytes = octets, HTML, Unicode and more), which
> allow for data exchange across machine and country boundaries.

Internet protocols are properly annotated with metadata, so are easiest to deal with. That doesn't make it a requirement to push this throughout the whole RTL; a simple routine in e.g. the webserver can handle that at the gate without bogging down the rest of the system with redundant checks.
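A rough sketch of the "handle it at the gate" idea, in Python for illustration only. It assumes the metadata arrives as an HTTP-style Content-Type header; the function name decode_at_gate and the UTF-8 default are assumptions of this sketch, not anything proposed in the thread.

```python
from email.message import Message


def decode_at_gate(body: bytes, content_type: str, default: str = "utf-8") -> str:
    """Decode incoming bytes once, using the charset declared in the metadata."""
    # Let the standard library parse the header instead of splitting it by hand.
    header = Message()
    header["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    charset = header.get_content_charset() or default
    return body.decode(charset)


print(decode_at_gate(b"Gr\xfc\xdfe", "text/html; charset=ISO-8859-1"))
# prints Grüße
```

Everything behind this function sees only decoded text, so no later layer needs to repeat the encoding check.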
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"
Marco van de Voort wrote:
> In our previous episode, Hans-Peter Diettrich said:
>> While it certainly is a stupid (Microsoft) idea to use UTF-16 for file
>> storage, we'll have to take that into account.
>
> (16-bit codepages were designed into OS/2 and Windows NT before utf-8 even
> existed)

Right, both systems were developed by Microsoft :-]

No problem, as long as proper host/network byteorder conversion is applied in reading/writing such files.

But in former times every computer manufacturer was proud of *his* clever text processing features, with characters stored in 6- to 9-bit registers. In those times it was an essential *marketing* feature when files could *not* be read by competing systems, due to different byte size, bit/byte order, character sets, file formats etc.

But times have changed, nowadays the Internet requires certain common standards (e.g. 8-bit bytes = octets, HTML, Unicode and more), which allow for data exchange across machine and country boundaries.

The lack of far-east support already forced the Japanese to invent their own BIOS, codepages etc. Nowadays continued use of UCS2 would have forced the Chinese to invent their own character encoding, which then would be used by more people than UCS2. Guess what would happen to the rest of the world, then...

Or will the Chinese government enforce such a development soon, to eliminate the need for continued censorship of foreign web pages, because legal equipment then could only present genuine Chinese pages, but no more HTML, JavaScript and Unicode? What would the official Chinese programming language look like?

DoDi
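To make the host/network byte order remark concrete, here is a small Python sketch that writes 16-bit code units in network (big-endian) order and reads them back, whatever the host's own byte order is. The two function names are hypothetical; only the struct module calls are real.

```python
import struct


def write_units_network_order(units):
    """Serialize 16-bit code units in network (big-endian) byte order."""
    # "!" selects network byte order, "H" is one unsigned 16-bit value.
    return struct.pack(f"!{len(units)}H", *units)


def read_units_network_order(data):
    """Read 16-bit code units back into host integers."""
    count = len(data) // 2
    return list(struct.unpack(f"!{count}H", data[: count * 2]))


units = [0x0048, 0x00E9, 0x20AC]  # 'H', 'é', '€' as UTF-16 code units
data = write_units_network_order(units)
print(data.hex())
# prints 004800e920ac
print(read_units_network_order(data) == units)
# prints True
```

The byte sequence is identical on little-endian and big-endian hosts, because the conversion is applied at the read/write boundary and nowhere else.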
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"
In our previous episode, Hans-Peter Diettrich said:
> While it certainly is a stupid (Microsoft) idea to use UTF-16 for file
> storage, we'll have to take that into account.

(16-bit codepages were designed into OS/2 and Windows NT before utf-8 even existed)
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"
Jonas Maebe wrote:
> On 26/11/14 21:25, Hans-Peter Diettrich wrote:
>> What about file I/O?
>> It should be possible to read (and write) files of either endianness.
>
> Standard I/O only supports single byte code pages (which should be
> documented).

Please clarify "single byte code pages". SBCS are a subset of the ANSI/ISO codepages, with the complement being MBCS. Did you mean that basic I/O is implemented for AnsiString (byte based) only, not for UnicodeString (word based)?

> Reading a unicodestring from a text file converts from the single byte
> code page to the native-endianness UTF-16 format.

Is this different from assigning an AnsiString to a UnicodeString?

While it certainly is a stupid (Microsoft) idea to use UTF-16 for file storage, we'll have to take that into account.

With file I/O in particular it makes sense to allow for *all* encodings (see BOM), including UTF-16. The file encoding must be independent of the string type and encoding used to hold the file data in memory. See Delphi TEncoding, for use with streams (TStreamReader/Writer) and TStrings.

DoDi
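The point about keeping the file encoding separate from the in-memory string can be sketched as follows, in Python for illustration. The helper decode_with_bom and the cp1252 fallback are assumptions of this sketch; UTF-32 BOMs are deliberately left out, because the UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark and would need to be tested first.

```python
import codecs

# Byte order marks recognised by this sketch, with the codec each one implies.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(raw: bytes, fallback: str = "cp1252") -> str:
    """Pick the file encoding from a leading BOM, else use a fallback code page."""
    for bom, codec_name in _BOMS:
        if raw.startswith(bom):
            # Strip the mark itself; it is metadata, not part of the text.
            return raw[len(bom):].decode(codec_name)
    return raw.decode(fallback)


big = codecs.BOM_UTF16_BE + "Aé".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "Aé".encode("utf-16-le")
print(decode_with_bom(big) == decode_with_bom(little) == "Aé")
# prints True
```

Both files yield the same in-memory string, so the caller never learns or cares which endianness was on disk.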
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"
On 26/11/14 21:25, Hans-Peter Diettrich wrote:
> Jonas Maebe wrote:
>> On 26/11/14 17:41, Tomas Hajny wrote:
>>> BTW, in this context - can users choose UTF16BE on little endian
>>> platforms (and vice versa)?
>>
>> No, because we do not have any routines that allow a user to set/change
>> the codepage of a unicodestring (either at run time or at compile time).
>
> What about file I/O?
> It should be possible to read (and write) files of either endianness.

Standard I/O only supports single byte code pages (which should be documented). Reading a unicodestring from a text file converts from the single byte code page to the native-endianness UTF-16 format.

Jonas
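As an illustration of the conversion described here (single byte code page in the file, native-endian UTF-16 in memory), a short Python sketch follows. It is an analogy only, not the FPC implementation; the function name to_native_utf16 and the cp1252 default are assumptions.

```python
import sys


def to_native_utf16(raw: bytes, codepage: str = "cp1252") -> bytes:
    """Convert single-byte code page data to UTF-16 in the host's byte order."""
    text = raw.decode(codepage)
    # Choose the UTF-16 flavour that matches the machine running the code.
    native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    return text.encode(native)


units = to_native_utf16(b"caf\xe9")
print(len(units))
# prints 8
```

Each input byte becomes one 16-bit code unit here, and the order of the two bytes inside each unit depends on the host.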
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"
Jonas Maebe wrote:
> On 26/11/14 17:41, Tomas Hajny wrote:
>> BTW, in this context - can users choose UTF16BE on little endian
>> platforms (and vice versa)?
>
> No, because we do not have any routines that allow a user to set/change
> the codepage of a unicodestring (either at run time or at compile time).

What about file I/O?
It should be possible to read (and write) files of either endianness.

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"
On 26/11/14 17:41, Tomas Hajny wrote:
> BTW, in this context - can users choose UTF16BE on little endian
> platforms (and vice versa)?

No, because we do not have any routines that allow a user to set/change the codepage of a unicodestring (either at run time or at compile time).

Jonas
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"
On 26 Nov 14, at 17:23, Jonas Maebe wrote:
> On 26/11/14 17:21, Sven Barth wrote:
> > Yes, nevertheless the header record is the same for UnicodeString and
> > AnsiString and thus it also has a codepage field which is always
> > initialized to CP_UTF16 however.
>
> It can also be CP_UTF16BE (which it is on big endian FPC targets right now).

BTW, in this context - can users choose UTF16BE on little endian platforms (and vice versa)? In other words - do the respective UnicodeStringManager routines have to check the source/target endianness in the respective field of the header when performing conversion between AnsiStrings and UnicodeStrings? Not that the current implementations for Unix or Windows would seem to worry about this, but I'd prefer to know (and preferably have documented ;-) ) whether it's an omission or intention based on an explicitly defined rule...

Tomas
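What "checking the endianness field of the header" would amount to can be sketched like this, in Python for illustration. TaggedString is a hypothetical stand-in for a string with a codepage field, not the real FPC header record; the values 1200 and 1201 are the Windows code page identifiers for UTF-16 little endian and big endian.

```python
from dataclasses import dataclass

CP_UTF16 = 1200    # UTF-16, little endian
CP_UTF16BE = 1201  # UTF-16, big endian


@dataclass
class TaggedString:
    """Hypothetical model: a payload plus the codepage recorded in its header."""
    codepage: int
    payload: bytes


def to_text(value: TaggedString) -> str:
    """Decode the payload according to the header field, not the host order."""
    codecs_by_page = {CP_UTF16: "utf-16-le", CP_UTF16BE: "utf-16-be"}
    codec_name = codecs_by_page.get(value.codepage)
    if codec_name is None:
        raise ValueError(f"unsupported codepage {value.codepage}")
    return value.payload.decode(codec_name)


le = TaggedString(CP_UTF16, "é".encode("utf-16-le"))
be = TaggedString(CP_UTF16BE, "é".encode("utf-16-be"))
print(to_text(le) == to_text(be))
# prints True
```

A conversion routine that ignores the field and assumes host order would decode one of the two payloads incorrectly, which is the case the question is about.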
