Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"

2014-11-29 Thread Hans-Peter Diettrich

Marco van de Voort wrote:
> In our previous episode, Hans-Peter Diettrich said:
> > > > storage, we'll have to take that into account.
> > >
> > > (16-bit codepages were designed into OS/2 and Windows NT before utf-8 even
> > > existed)
> >
> > Right, both systems were developed by Microsoft :-]
>
> A cooperation between IBM and Microsoft starting in 1984 to somewhere in the
> early nineties, yes. (or Micro Soft, I can't remember
> when they dropped the space).

AFAIK MS let IBM pay the bill for the OS/2 development, and used that
experience in the development of Windows (NT).



> > No problem, as long as proper host/network byteorder conversion is
> > applied in reading/writing such files.
>
> I don't see that as something evident.

It's evident in the case of reading/writing words on byte-based media,
where the byte order is important.
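
A rough illustration of the point about words on byte-based media, as a Python sketch rather than FPC code (the helper names write_word and read_word are made up for this example): the same 16-bit code unit serializes to different byte sequences depending on the byte order, and reading it back with the wrong order yields a different value.

```python
import struct

def write_word(value: int, big_endian: bool) -> bytes:
    """Serialize one 16-bit word in the requested byte order."""
    fmt = ">H" if big_endian else "<H"
    return struct.pack(fmt, value)

def read_word(data: bytes, big_endian: bool) -> int:
    """Deserialize one 16-bit word, assuming the given byte order."""
    fmt = ">H" if big_endian else "<H"
    return struct.unpack(fmt, data)[0]

code_unit = 0x00E4  # U+00E4, LATIN SMALL LETTER A WITH DIAERESIS

le = write_word(code_unit, big_endian=False)  # low byte first
be = write_word(code_unit, big_endian=True)   # high byte first

# Reading with the matching order round-trips the value;
# reading with the wrong order swaps the two bytes.
right = read_word(be, big_endian=True)
wrong = read_word(be, big_endian=False)
```

Here right is 0x00E4 again, while wrong comes out as 0xE400, which is why the reader has to know (or detect) the byte order of the file.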



> crlf vs lf is not fully transparent
> either, just open an lf file with notepad. Many unix editors show crs etc.
> There isn't even a universal marker to signal it (like BOMs)


The handling of Carriage Return (CR) and Line Feed (LF) was essential on
mechanical (teletype-style) terminals. A Teletype terminal had no input
buffer, and couldn't perform a full Carriage Return within the
transmission time of the following code. That's why most protocols sent
a CR first, to start the carriage movement, followed by an LF, which was
processed before arrival of the next code. CR and LF had different
purposes, and could be used individually for special printing effects
(overwrite, form feed).

Newer devices (and computers) had no such timing requirements, so that a
single character code was sufficient to indicate a (logical)
end-of-line. Unfortunately some companies used CR for that purpose, others
used LF, and MS used CR+LF as an EOL indicator. WRT text output on
printing devices, the CR+LF convention certainly was the correct
solution. Problems arose only in data exchange between multiple
different systems, which had to cope with all three conventions. Unicode
provided no improvement; on the contrary, the same mess was continued with
de/composed accented characters and umlauts :-(
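
To make the two problems above concrete, here is a minimal Python sketch (not FPC code; normalize_eol is a name invented for this example). It maps all three end-of-line conventions to LF, and shows that a composed and a decomposed umlaut only compare equal after Unicode normalization.

```python
import unicodedata

def normalize_eol(text: str) -> str:
    """Map CR+LF and bare CR to LF. CR+LF must be handled first,
    otherwise each CR+LF pair would turn into two line breaks."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

mixed = "a\r\nb\rc\nd"          # all three conventions in one string
unified = normalize_eol(mixed)

composed = "\u00fc"      # precomposed LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # "u" followed by COMBINING DIAERESIS

# The raw strings differ although they render identically.
same_raw = composed == decomposed
# After normalizing both to NFC they compare equal.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))
```

In both cases the data has to be brought into one agreed form before it is compared or exchanged.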




> Putting layer upon layer in a misguided attempt to make anything accept
> anything transparent is IMHO a waste of both time resources and computing.
> Better intensively maintain a few good converters, and strengthen metadata
> processing and retention to make it automatic in a few places where it
> really matters. I'm no security expert, but I guess from a security
> viewpoint that is better too.


I don't know about any text processing model really *superior* to 
Unicode, do you?


And OOP is perfectly suited to implement multi-layer models.

DoDi

___
fpc-devel maillist  -  [email protected]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"

2014-11-29 Thread Marco van de Voort
In our previous episode, Hans-Peter Diettrich said:
> >> storage, we'll have to take that into account.
> > 
> > (16-bit codepages were designed into OS/2 and Windows NT before utf-8 even
> > existed)
> 
> Right, both systems were developed by Microsoft :-]

A cooperation between IBM and Microsoft starting in 1984 to somewhere in the
early nineties, yes. (or Micro Soft, I can't remember
when they dropped the space).

From
http://en.wikipedia.org/wiki/Microsoft#1984.E2.80.9394:_Windows_and_Office

"Microsoft released its version of OS/2 to original equipment manufacturers
(OEMs) on April 2, 1987"
 
> No problem, as long as proper host/network byteorder conversion is 
> applied in reading/writing such files. 

I don't see that as something evident. crlf vs lf is not fully transparent
either, just open an lf file with notepad. Many unix editors show crs etc.
There isn't even a universal marker to signal it (like BOMs)

Putting layer upon layer in a misguided attempt to make anything accept
anything transparent is IMHO a waste of both time resources and computing.
Better intensively maintain a few good converters, and strengthen metadata
processing and retention to make it automatic in a few places where it
really matters. I'm no security expert, but I guess from a security
viewpoint that is better too.

> But times have changed, nowadays the Internet requires certain common 
> standards (e.g. 8-bit bytes = octets, HTML, Unicode and more), which 
> allow for data exchange across machine and country boundaries.

Internet protocols are properly annotated with metadata, so are easiest to
deal with.  That doesn't make it a requirement to push this throughout the
whole RTL, a simple routine in e.g.  the webserver can handle that at the
gate without bogging down the rest of the system with redundant checks.
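
A minimal sketch of such a conversion "at the gate", in Python rather than FPC, under stated assumptions: decode_at_gate and DEFAULT_CHARSET are names invented for this example, and the fallback to ISO-8859-1 is an arbitrary choice for the sketch. The charset is taken from the Content-Type metadata once, and everything behind the gate only sees already-decoded text.

```python
from email.message import Message

# Assumption for this sketch: bodies without a declared charset
# are treated as ISO-8859-1.
DEFAULT_CHARSET = "iso-8859-1"

def decode_at_gate(content_type: str, body: bytes) -> str:
    """Decode a request body once, using the charset named in its
    Content-Type header, so later code needs no encoding checks."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset is declared.
    charset = msg.get_content_charset() or DEFAULT_CHARSET
    return body.decode(charset)

declared = decode_at_gate("text/html; charset=UTF-8", "\u00e4".encode("utf-8"))
undeclared = decode_at_gate("text/plain", b"\xe4")
```

Both calls yield the same one-character string, so the rest of the program never has to look at the original encoding again.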
 


Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"

2014-11-28 Thread Hans-Peter Diettrich

Marco van de Voort wrote:
> In our previous episode, Hans-Peter Diettrich said:
> > While it certainly is a stupid (Microsoft) idea to use UTF-16 for file
> > storage, we'll have to take that into account.
>
> (16-bit codepages were designed into OS/2 and Windows NT before utf-8 even
> existed)

Right, both systems were developed by Microsoft :-]

No problem, as long as proper host/network byteorder conversion is 
applied in reading/writing such files. But in former times every 
computer manufacturer was proud of *his* clever text processing 
features, with characters stored in 6 up to 9 bit registers. In those 
times it was an essential *marketing* feature, when files could *not* be 
read by competing systems, due to different bytesize, bit-/byteorder, 
character sets, file formats etc.


But times have changed, nowadays the Internet requires certain common 
standards (e.g. 8-bit bytes = octets, HTML, Unicode and more), which 
allow for data exchange across machine and country boundaries.


The lack of far-east support already forced the Japanese to invent their
own BIOS, codepages etc.  Nowadays continued use of UCS2 would have forced
the Chinese to invent their own character encoding, which then would be used
by more people than UCS2. Guess what would happen to the rest of the
world, then...



Or will the Chinese government enforce such a development soon, to
eliminate the need for continued censorship of foreign web pages,
because legal equipment could then present only genuine Chinese pages,
but no more HTML, JavaScript and Unicode? What would the official Chinese
programming language look like?



DoDi



Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"

2014-11-28 Thread Marco van de Voort
In our previous episode, Hans-Peter Diettrich said:
> While it certainly is a stupid (Microsoft) idea to use UTF-16 for file 
> storage, we'll have to take that into account.

(16-bit codepages were designed into OS/2 and Windows NT before utf-8 even
existed)
 


Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"

2014-11-27 Thread Hans-Peter Diettrich

Jonas Maebe wrote:
> On 26/11/14 21:25, Hans-Peter Diettrich wrote:
> > What about file I/O?
> > It should be possible to read (and write) files of either endianness.
>
> Standard I/O only supports single byte code pages (which should be
> documented).


Please clarify "single byte code pages".

SBCS are a subset of the ANSI/ISO codepages, with the complement being MBCS.

Did you mean that basic I/O is implemented for AnsiString (byte based) 
only, not for UnicodeString (word based)?



> Reading a unicodestring from a text file converts from the
> single byte code page to the native-endianness UTF-16 format.


Is this different from assigning an AnsiString to a UnicodeString?

While it certainly is a stupid (Microsoft) idea to use UTF-16 for file 
storage, we'll have to take that into account.


With file I/O in particular it makes sense to allow for *all* encodings (see
BOM), including UTF-16. The file encoding must be independent of the
string type and encoding used to hold the file data in memory. See
Delphi TEncoding, for use with streams (TStreamReader/Writer) and TStrings.
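
The idea can be sketched roughly as follows, in Python rather than Pascal, and without modelling Delphi's TEncoding: decode_file_bytes, the _BOMS table and the cp1252 fallback are assumptions made for this example. The file encoding is detected from the BOM, and the result is an ordinary in-memory string whatever the file used.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM starts with the same two
# bytes as the UTF-16 LE BOM, so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Assumption for this sketch: files without a BOM are in a single
# byte code page, here cp1252.
FALLBACK = "cp1252"

def decode_file_bytes(data: bytes) -> str:
    """Pick the decoder from the BOM, strip the BOM, and return text
    whose in-memory form no longer depends on the file encoding."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name)
    return data.decode(FALLBACK)

text = "Gr\u00fc\u00dfe"
from_be = decode_file_bytes(codecs.BOM_UTF16_BE + text.encode("utf-16-be"))
from_le = decode_file_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))
from_ansi = decode_file_bytes(text.encode("cp1252"))
```

All three inputs decode to the same string, whether the file was UTF-16 of either endianness or a single byte code page.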


DoDi



Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"

2014-11-27 Thread Jonas Maebe
On 26/11/14 21:25, Hans-Peter Diettrich wrote:
> Jonas Maebe wrote:
>> On 26/11/14 17:41, Tomas Hajny wrote:
>>> BTW, in this context - can users choose UTF16BE on little endian
>>> platforms (and vice versa)?
>>
>> No, because we do not have any routines that allow a user to set/change
>> the codepage of a unicodestring (either at run time or at compile time).
> 
> What about file I/O?
> It should be possible to read (and write) files of either endianness.

Standard I/O only supports single byte code pages (which should be
documented). Reading a unicodestring from a text file converts from the
single byte code page to the native-endianness UTF-16 format.
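
What that conversion amounts to can be sketched like this. It is a Python illustration, not the RTL implementation; read_as_native_utf16 and the cp1252 default are assumptions made for the example.

```python
import sys

# UTF-16 flavour matching the byte order of the machine running this.
NATIVE_UTF16 = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

def read_as_native_utf16(raw: bytes, codepage: str = "cp1252") -> bytes:
    """Interpret raw file bytes in a single byte code page and return
    them as UTF-16 code units in native byte order (no BOM)."""
    text = raw.decode(codepage)
    return text.encode(NATIVE_UTF16)

units = read_as_native_utf16(b"\xe4")  # one byte in, one 16-bit unit out
```

The single input byte 0xE4 becomes the two-byte code unit for U+00E4, laid out in whichever order the host uses.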


Jonas



Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"

2014-11-26 Thread Hans-Peter Diettrich

Jonas Maebe wrote:
> On 26/11/14 17:41, Tomas Hajny wrote:
> > BTW, in this context - can users choose UTF16BE on little endian
> > platforms (and vice versa)?
>
> No, because we do not have any routines that allow a user to set/change
> the codepage of a unicodestring (either at run time or at compile time).


What about file I/O?
It should be possible to read (and write) files of either endianness.

DoDi



Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"

2014-11-26 Thread Jonas Maebe
On 26/11/14 17:41, Tomas Hajny wrote:
> BTW, in this context - can users choose UTF16BE on little endian 
> platforms (and vice versa)?

No, because we do not have any routines that allow a user to set/change
the codepage of a unicodestring (either at run time or at compile time).


Jonas



Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicodesupport"

2014-11-26 Thread Tomas Hajny
On 26 Nov 14, at 17:23, Jonas Maebe wrote:
> On 26/11/14 17:21, Sven Barth wrote:
> > Yes, nevertheless the header record is the same for UnicodeString and
> > AnsiString and thus it also has a codepage field which is always
> > initialized to CP_UTF16 however.
> 
> It can also be CP_UTF16BE (which it is on big endian FPC targets right now).

BTW, in this context - can users choose UTF16BE on little endian 
platforms (and vice versa)? In other words - do the respective 
UnicodeStringManager routines have to check the source/target 
endianness in the respective field of the header when performing 
conversion between AnsiStrings and UnicodeStrings?

Not that the current implementations for Unix or Windows would seem 
to worry about this, but I'd prefer to know (and preferably have 
documented ;-) ) whether it's an omission or intention based on an 
explicitly defined rule...

Tomas
