Re: [fpc-devel] simple UTF tests

2012-01-09 Thread Michael Schnell

On 01/06/2012 06:53 PM, Hans-Peter Diettrich wrote:


> You're right, the XE compiler lacks some error checks :-(

If this is indeed considered a bug in Delphi, FPC _could_ take the 
saner route and report the length of an AnsiString(CP_UTF16) in Words 
(i.e. UTF code units, as is done with UTF-8).


-Michael

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] simple UTF tests

2012-01-09 Thread Marco van de Voort
In our previous episode, Michael Schnell said:
>> You're right, the XE compiler lacks some error checks :-(
>
> If this indeed is considered a bug in Delphi, FPC _could_ in fact in a
> more sane way provide the length of an AnsiString(CP_UTF16) in terms
> of Words (i.e. UTF codes, as done with UTF8),

An ansistring is always 8-bit. Nothing can be done there, except to emit a
warning or an error if code page 1200 (UTF-16) is used.



Re: [fpc-devel] simple UTF tests

2012-01-09 Thread Michael Schnell

On 01/09/2012 10:32 AM, Marco van de Voort wrote:
> An ansistring is always 8-bit.

Sorry I can't follow here.

Of course the term ANSI suggests 8 bit, but it also suggests one 
visible character = 8 bit, thus non-UTF.


If a type called ANSI... is used to hold UTF codes, the term ANSI is 
abused anyway, and the handling of the type can then be defined in any way 
that seems appropriate.


-Michael


Re: [fpc-devel] simple UTF tests

2012-01-09 Thread Michael Schnell

P.S.:

I found at 
http://en.wikipedia.org/wiki/Windows_ANSI_code_page#ANSI_code_page that 
there are in fact code page numbers for UTF encodings:


The following Windows code pages exist:


...

65000 UTF-7
65001 UTF-8


But this seems to be a proprietary Microsoft definition, while AFAIK 
ANSI denotes the American National Standards Institute.


- Michael


Re: [fpc-devel] simple UTF tests

2012-01-09 Thread Hans-Peter Diettrich

Michael Schnell wrote:

> On 01/09/2012 10:32 AM, Marco van de Voort wrote:
>> An ansistring is always 8-bit.
>
> Sorry I can't follow here.
>
> Of course the term ANSI suggests 8 bit, but it also suggests one
> visible character = 8 bit, thus non UTF.


ANSI also covers MBCS (DBCS).

> If a type called ANSI... is used to hold UTF codes, the term ANSI is
> abused anyway and now the handling of the type can be defined in any way
> that seems appropriate,


For legacy reasons Ansi means types with 8 bit AnsiChar(!), in 
contrast to WideChar or other Char sizes.


DoDi



Re: [fpc-devel] simple UTF tests

2012-01-09 Thread Marco van de Voort
In our previous episode, Michael Schnell said:
>> An ansistring is always 8-bit.
> Sorry I can't follow here.
>
> Of course the term ANSI suggests 8 bit, but it also suggests one
> visible character = 8 bit, thus non UTF.

No, it means that the encoding granularity is 8-bit. Length returns the number
of encoding units, not the number of codepoints (conceptually 32-bit values,
encoded as sequences of 8-bit (ansistring) or 16-bit (widestring,
unicodestring) units), nor the number of printable characters (each possibly
made up of multiple codepoints).
 
> If a type called ANSI... is used to hold UTF codes, the term ANSI is
> abused anyway and now the handling of the type can be defined in any way
> that seems appropriate,

Whatever the name is, in all current Unicode Delphi versions and FPC
ansistring means 8-bit string exclusively.
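
The three kinds of "length" distinguished above can be sketched as follows. 
This is an illustration in Python rather than Pascal, so it says nothing about 
what FPC's or Delphi's Length actually returns; the function name 
length_report and the sample string are made up for the example, and the 
"visible" count is only a rough approximation that skips combining marks 
(real grapheme segmentation is more involved).

```python
import unicodedata


def length_report(text):
    """Count the same text in several units (hypothetical helper)."""
    return {
        # 8-bit encoding units, i.e. what an 8-bit string type would count
        "utf8_units": len(text.encode("utf-8")),
        # 16-bit encoding units, i.e. what a 16-bit string type would count
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Unicode codepoints (Python's own len() counts these)
        "code_points": len(text),
        # Rough count of printable characters: ignore combining marks
        "visible_approx": sum(
            1 for ch in text if unicodedata.combining(ch) == 0
        ),
    }


# 'e' + COMBINING ACUTE ACCENT, then MUSICAL SYMBOL G CLEF (outside the BMP)
sample = "e\u0301\U0001D11E"
report = length_report(sample)
print(report)
# {'utf8_units': 7, 'utf16_units': 4, 'code_points': 3, 'visible_approx': 2}
```

The same text has four different lengths, depending only on which unit is 
counted.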



Re: [fpc-devel] simple UTF tests

2012-01-09 Thread Michael Schnell

On 01/09/2012 11:09 AM, Marco van de Voort wrote:
> Whatever the name is, in all current Unicode Delphi versions and FPC
> ansistring means 8-bit string exclusively.


OK, so the definition of

AnsiString(CP_UTF16) in FPC and
AnsiString(1200) in Delphi means

an 8-bit string whose data bytes represent UTF-16 code units in 
low-byte-first (little-endian) order.

This is of course not very straightforward, sane or portable, but it is 
neither illegal nor ambiguous.

(Of course this definition should be stated explicitly somewhere.)
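
A minimal sketch of that reading, in Python rather than Pascal: it only 
shows what "UTF-16 code units stored low byte first in an 8-bit buffer" 
looks like, and assumes nothing about how FPC or Delphi actually treat such 
a type. The variable names are made up for the illustration.

```python
text = "A\u00e9\u20ac"  # 'A', 'é' (U+00E9), '€' (U+20AC)

# UTF-16 code units written low byte first (little-endian), no BOM
raw = text.encode("utf-16-le")

# What an 8-bit string would report as its length: the number of bytes
byte_length = len(raw)

# Reassemble the 16-bit code units from consecutive byte pairs
code_units = [
    int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)
]

print(raw.hex())                     # 4100e900ac20
print(byte_length)                   # 6
print([hex(u) for u in code_units])  # ['0x41', '0xe9', '0x20ac']
```

Under this interpretation an 8-bit length is always twice the number of 
UTF-16 code units.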

-Michael


Re: [fpc-devel] simple UTF tests

2012-01-09 Thread peter green
> But this seems to be a proprietary Microsoft definition while AFAIK,
> ANSI denotes American National Standards Institute.

While ANSI does denote the american national standards institute in general, 
it doesn't really mean that in this context.


A windows machine has two main code pages in use: the OEM code page and 
the ANSI code page (both are language dependent, and for some languages 
they may be the same code page). The OEM code page is one of the original 
PC code pages and afaict is mostly used for the console. The ANSI code 
page is used for the non-unicode versions of stuff in windows itself.


The term ANSI comes from the fact that the initial ANSI code page 
(1252) was based on an ANSI draft of what became ISO-8859-1. 1252 is 
fairly close to ISO-8859-1 (it just replaces the rarely used control 
characters in the range 0x80-0x9F with additional printable characters) 
but most of the other ANSI code pages bear little to no relationship to 
any ANSI or ISO standard encoding.
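
The difference between 1252 and ISO-8859-1 can be seen by decoding the same 
bytes both ways. This is a small Python sketch; the codec names "cp1252" and 
"latin-1" are Python's, and the three sample bytes are chosen only for 
illustration.

```python
raw = bytes([0x80, 0x93, 0xE9])

# Windows-1252 maps 0x80 and 0x93 to printable characters
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps the same bytes to C1 control characters
as_latin1 = raw.decode("latin-1")

# Positions at which the two interpretations disagree
differing = [i for i in range(len(raw)) if as_cp1252[i] != as_latin1[i]]

print([hex(ord(ch)) for ch in as_cp1252])  # ['0x20ac', '0x201c', '0xe9']
print([hex(ord(ch)) for ch in as_latin1])  # ['0x80', '0x93', '0xe9']
print(differing)                           # [0, 1]
```

Only the bytes in the 0x80-0x9F range differ; 0xE9 is 'é' in both.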


Afaict in europe, america and australasia both the ANSI and OEM code 
pages are simple encodings with one byte per user-visible character and 
all characters drawn left to right.  Once you move to asia and africa 
though, that no longer holds: CJK languages are represented by 
multibyte encodings, vietnamese is represented using combining 
characters, and middle eastern languages bring the complications of 
bidirectional text. MS encourages programmers to use unicode nowadays 
and afaict languages added more recently to windows (like the indic 
languages) don't have any non-unicode support at all.
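
As a rough illustration of the single-byte versus multibyte point, the 
following Python sketch counts the bytes each character takes in a western 
ANSI code page (1252) and in the japanese one (932). The helper name 
bytes_per_character is made up, and the codec names are Python's, assumed 
here to correspond to the windows code pages of the same number.

```python
def bytes_per_character(text, codec):
    """Return the encoded size in bytes of each character (hypothetical helper)."""
    return [len(ch.encode(codec)) for ch in text]


# Western ANSI code page: every character is exactly one byte
western = bytes_per_character("caf\u00e9", "cp1252")

# Japanese ANSI code page: ASCII stays one byte, kanji take two
japanese = bytes_per_character("a\u65e5\u672c", "cp932")

print(western)   # [1, 1, 1, 1]
print(japanese)  # [1, 2, 2]
```

So even among ANSI code pages, a byte count is not a character count once 
a multibyte code page is involved.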


Windows also defines other code page numbers that are used as neither 
ANSI nor OEM code pages. UTF-8 falls into this category.


Delphi is a windows program (yeah there was an abortive linux port but 
that came much later and didn't stick around for long) so it inherits 
windows terminology. FPC/lazarus is essentially a delphi clone but is 
cross platform, so it's put in the position of trying to interpret and 
stretch windows-grounded ideas to fit a cross-platform context.




Re: [fpc-devel] simple UTF tests

2012-01-09 Thread Michael Schnell

So it's me who went insane. But at least I now know why.

-Michael


Re: [fpc-devel] simple UTF tests

2012-01-09 Thread Hans-Peter Diettrich

Michael Schnell wrote:

> On 01/09/2012 11:09 AM, Marco van de Voort wrote:
>> Whatever the name is, in all current Unicode Delphi versions and FPC
>> ansistring means 8-bit string exclusively.
>
>
> OK so the definition of
>
> AnsiString(CP_UTF16) in FPC and
> AnsiString(1200) in Delphi means


Nothing sane :-(

DoDi
