
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-12-03 Thread Michael Schnell


On 12/03/2014 05:02 AM, Hans-Peter Diettrich wrote:

Michael Schnell wrote:

 - It does not result in additional conversions.

It does, e.g. in searching or sorting of StringList, when it can contain
strings of different encodings. The choice of a unique encoding for
application strings (maybe CP_ACP, UTF-8 or UTF-16) eliminates such
conversions.

If multiple encoding brands are involved, a system without DynamicString 
also will need to do conversions. So DynamicString does not impose 
*additional* conversions.


So the Checking Overhead is nothing but a rumor. (Remember, I don't 
suggest dropping the standard statically typed paradigm 
altogether, as tight loops of course work best that way.)

The rumor is the unimportant Conversion Overhead, i.e. how often a
check leads to a conversion. When no check is required, conversions
consequently cannot occur at all.

Please re-read the text I wrote.
 - If DynamicString is not used in the user code, the compiler creates 
the same code as before. So no overhead.
 - If DynamicString is used (in user code or in a library interface), 
but only a single encoding brand is used everywhere where statically 
encoded strings are in place (a single program-wide string 
representation, as you suggested in your previous mail), the only runtime 
overhead imposed is that at the locations where DynamicString is used 
(i.e. not in any tight loops) the compiler adds a check of the 
EncodingType variable. Here (unless the user actively 
decides to create string variables with encoding brands other than the 
program-wide default) the code *always* finds at runtime that no 
conversion is necessary and acts as if the string were not dynamic 
but already correct. The overhead of checking is obviously at most 
some 5 ASM instructions and hence negligible compared with the function 
call needed to enter the library function in question.
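
The check described here can be approximated in current FPC, with RawByteString standing in for the proposed DynamicString. This is only a sketch under that assumption: EnsureCodePage is a hypothetical helper, not an RTL routine, and a compiler would presumably emit the comparison inline rather than call a function.

```pascal
program DynCheckSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: emulates the check a compiler could emit when a
// dynamically encoded string arrives where a fixed code page is expected.
// RawByteString only stands in for the proposed DynamicString type.
function EnsureCodePage(const S: RawByteString;
  Target: TSystemCodePage): RawByteString;
begin
  Result := S;  // RawByteString to RawByteString: no conversion
  // Fast path: compare the code page stored in the string header.
  if (Length(Result) > 0) and (StringCodePage(Result) <> Target) then
    SetCodePage(Result, Target, True);  // slow path: convert the payload
end;

var
  U: UTF8String;
  R: RawByteString;
begin
  U := 'plain ascii';
  R := EnsureCodePage(U, CP_UTF8);
  WriteLn('code page matches target: ', StringCodePage(R) = CP_UTF8);
end.
```

When every string in the program already carries the target code page, only the comparison runs, which is the "no conversion necessary" case argued for above.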




RawByteString cannot serve two different purposes :-( 

As I pointed out as well: a variable's encoding brand can't be static and 
dynamic at the same time. This is the cause of the major misconception 
imposed by Delphi regarding RawByteString. And this is why I would leave 
RawByteString aside (as it is / as it is assumed to be / whatever) and 
for any improvement use a completely new type name and a CP_ANY 
constant / value.





In *Delphi* it is used as a polymorphic string, capable of *holding*
actual strings of any encoding. But when assigned to a variable of a
different encoding, a conversion may occur that converts the string into
the declared (static) encoding of the target variable.

Seemingly rather close to what I suggest as DynamicString. But (see 
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support 
) with a dynamic String the encoding brand number of such String would 
not be allowed to ever be written into the EncodingType field in the 
string header.


If this were true, why do the Delphi docs discourage making decent 
use of the dynamic feature of RawByteString?


Anyway, a dynamic string type only makes sense if it is used in as 
many library interfaces as possible (including TStrings). This is not 
done in Delphi. There it is not nice, in many cases restricting how the 
user can make use of these libraries, but it is not as critical as with 
fpc, where you need to consider portability issues.




In *FPC* it currently is used somewhat close to your idea, i.e. no
conversion occurs in an assignment either to *or from* a RawByteString
and some other AnsiString. 


As said, to avoid ambiguity, I vote for adding yet another string type 
name (e.g. ByteString denoted by CP_BYTE) that is *known* to disallow 
any conversion (and leave RawByteString as close as possible to the 
moving target Delphi presents).




I understand the FPC attempt, to allow *at the same time* for the new
(encoded) and old (unencoded) AnsiString behaviour, where no automatic
conversions are allowed. But this would require at the same time, that
e.g. all string literals *also* are stored in that (immutable) encoding,
and that this encoding can *not* be changed at runtime, while
DefaultSystemCodePage *can* be changed.


I feel that this (simplified) attempt can't result in a decent paradigm. 
It is close to impossible to completely describe the behavior in an 
understandable way and it's prone to a lot of ambiguity.


That is why I tried to invent a concept that I suppose might work and 
will not break (much) existing code. It is intended to be built from 
the ground up (it is not even necessary to assume that the content of a 
string is printable/readable, but it should easily work for that 
application). It would allow for making flexible use of strings with 
understandable and easy to use syntax candy, and would not impose 
restrictions on portability any more. IMHO it would not impose 
(noticeable) performance degradation, either.


-Michael
___
fpc-devel maillist  -

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-12-03 Thread Michael Schnell


On 12/03/2014 12:52 AM, Hans-Peter Diettrich wrote:


You forget that Jonas refers to *dynamic* string encodings, unknown at 
compile time.


???
In your other mail you pointed out that fpc (unlike Delphi) does not 
provide *dynamic* string encoding with RawByteString (and where else 
would it be supported?).


-Michael
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-12-03 Thread Michael Schnell


On 12/03/2014 12:52 AM, Hans-Peter Diettrich wrote:

In Delphi *no* string can have a dynamic encoding of CP_NONE or CP_ACP,


If you really do have dynamic strings, obviously the *definition* 
(i.e. CP_...) of such strings is strictly static (just for compiler use) 
and can never be used as the *dynamic* notation of the *current* 
encoding (in the EncodingType field).


IMHO a different implementation is not workable.

-Michael


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-12-02 Thread Michael Schnell


On 11/28/2014 09:15 PM, Hans-Peter Diettrich wrote:


You suggested to use string as UTF-16 on Windows, and UTF-8 on 
Linux. That's what I understand as a unique program-wide string 
representation (not sourcecode-wide, instead program as *compiled*). 
Then I cannot see any need or use for another DynamicString type.

I did already understand what you meant, and I understand that this 
unique program-wide string representation is better than having the 
libraries' APIs (including TStrings) force a fixed string encoding 
brand, independently of the OS we compile for (and of selectable $mode 
specifications). But I don't *suggest* this way, as it is not very 
versatile and hampers portability. As said, I *suggest* using 
DynamicString in such cases. Nonetheless, the types simply called 
String might be done in the way you suggest.


Nothing can be broken, as long as the Delphi behaviour is undefined. 

That of course is correct, but it just follows the poor excuse 
Embarcadero offers for the flawed implementation of RawByteString 
(which, as we both agree, will never be fixed). (In fact there are many 
instances where old flaws have been deliberately reproduced so as not 
to break compatibility.)


Applied to FPC/Lazarus code (compiler, libraries, IDE...) this means 
that it's obviously easier to *prevent* possibly different 
static/dynamic encodings, instead of *checking and reacting* on such 
flaws throughout the entire codebase. 

OK. Kill the type RawByteString and the constant CP_NONE and the 
usability of its value $. I do vote for doing so and instead 
providing new types such as ByteString, WordString, DWordString, and 
QWordString denoted by the constants CP_Byte = $FF01, CP_Word = $FF02, 
CP_DWord = $FF04, CP_QWord = $FF08.
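
Written out as Pascal constants, the proposal reads as follows. None of these identifiers exist in FPC or Delphi. Reading the low byte of each value as the element size in bytes is an assumption based on the values chosen, and ElementSize is a hypothetical helper used only to show that reading.

```pascal
program ProposedBrands;
{$mode objfpc}{$H+}

const
  // Brand numbers as proposed in this mail; not defined by FPC or Delphi.
  CP_Byte  = $FF01;
  CP_Word  = $FF02;
  CP_DWord = $FF04;
  CP_QWord = $FF08;

// Assumption: the low byte of each brand is the element size in bytes.
function ElementSize(Brand: Word): Integer;
begin
  Result := Brand and $FF;
end;

begin
  WriteLn('ByteString element size:  ', ElementSize(CP_Byte));
  WriteLn('WordString element size:  ', ElementSize(CP_Word));
  WriteLn('DWordString element size: ', ElementSize(CP_DWord));
  WriteLn('QWordString element size: ', ElementSize(CP_QWord));
end.
```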


Apart from that, every encoding-tolerant code will execute much slower 
than code without a need for checks and conversions everywhere.

As I pointed out I don't agree at all.
 - The check is only two ASM instructions.
 - It does not result in additional conversions. In fact, in appropriate 
cases it can avoid a huge number of conversions (especially when 
calling libraries, e.g. by means of TStrings).
 - In pure user code, the check is only done if DynamicString really is 
used in the user code, hence only when the user knows what to do. In 
fact the typical degradation is 0 %.
 - When calling libraries (e.g. via TStrings), the check is very small 
compared with the function call that results from the same 
statement. Estimated typical degradation: 0.01 %.


So the Checking Overhead is nothing but a rumor. (Remember, I don't 
suggest dropping the standard statically typed paradigm altogether, 
as tight loops of course work best that way.)


That is why fpc would need to define an additional type name (e.g 
DynamicString) and encoding brand number (e.g. CP_ANY = $FF00) 
for a decently usable type for intermediately holding a  String content.


This again would make *FPC* programs incompatible with Delphi. 

As I explained, this would not break any backwards 
compatibility, even if TStrings uses this type.
 - The new type is just additional, so its mere existence can't break 
anything: you don't need to use it in user code if you don't want to.
 - The use of DynamicString in the interface of library functions does 
not break anything, as it is (to be) constructed in a way that provides 
full compatibility.


Please do show any code (not containing RawByteString) that is not 
compatible when using the DynamicString paradigm as described in 
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support#Analysis 
. Maybe the page needs to be improved.


While fixing the RawByteString flaw would at least allow to *compile* 
FPC code with Delphi, the use of a different encoding value would 
definitely prevent compilation of such code with Delphi. What's the 
more serious incompatibility?

IMHO this would be much more dangerous than introducing a decently 
working new DynamicString type.

RawXxxString can be used for really uncoded data as done with 
old-style strings in a lot of applications.


Such a feature would be appreciated by many users, indeed :-)


While I would happily follow your suggestion of making indecent use of 
this type impossible in the fpc compiler, I don't think it's very 
dangerous to re-introduce the abysmal Delphi-compatible behavior of 
RawByteString (the documented as well as the undocumented 
features).


But why do you say "would be appreciated"? Is it not possible to use 
RawByteString in the way the name suggests, by never bringing it 
together with any string variable of a different encoding brand and 
hence avoiding any conversion, be it intentional/documented/useful or not?



Anyway: I added a sentence in the introduction of the wiki page, 
explaining the paradigm a little more explicitly.




-Michael





Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-12-02 Thread Michael Schnell


On 12/02/2014 01:05 PM, Michael Schnell wrote:
But why do you say "would be appreciated"? Is it not possible to use 
RawByteString in the way the name suggests, by never bringing it 
together with any string variable of a different encoding brand and 
hence avoiding any conversion, be it intentional/documented/useful or 
not?

Of course you can't use any TStrings descendant (such as TStringList) in 
such code, as with Delphi, TStrings is based on a statically typed 
String brand. This would be made possible by introducing DynamicString 
and using this type for TStrings and friends.


-Michael




Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-12-02 Thread Michael Schnell


On 11/29/2014 07:55 AM, Jonas Maebe wrote:
Exactly the same goes for converting strings with code page CP_NONE to 
a different code page: your program is broken when it tries to do that,


Accessing an array beyond its bounds is not detectable at compile 
time, and with range checking switched off it is technically not 
detectable at runtime either, so *undefined* can't be avoided there. 
The attempt to convert strings with code page CP_NONE to a different 
code page, by contrast, is easily detectable by the compiler, as we 
have predefined string variable type brands 
here. Thus, if the outcome is *defined* *to* *be* *undefined* it can and 
should result in a compiler error message.


-Michael

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-12-02 Thread Hans-Peter Diettrich


Michael Schnell wrote:

On 11/28/2014 09:15 PM, Hans-Peter Diettrich wrote:


Apart from that, every encoding-tolerant code will execute much slower 
than code without a need for checks and conversions everywhere.

As I pointed out I don't agree at all.
 - The check is only two ASM instructions
 - It does not result in additional conversions.


It does, e.g. in searching or sorting of StringList, when it can contain
strings of different encodings. The choice of a unique encoding for
application strings (maybe CP_ACP, UTF-8 or UTF-16) eliminates such
conversions.


So the Checking Overhead is nothing but a rumor. (Remember, I don't 
suggest dropping the standard statically typed paradigm altogether, 
as tight loops of course work best that way.)


The rumor is the unimportant Conversion Overhead, i.e. how often a
check leads to a conversion. When no check is required, conversions
consequently cannot occur at all.


RawXxxString can be used for really uncoded data as done with 
old-style strings in a lot of applications.


Such a feature would be appreciated by many users, indeed :-)


But why do you say "would be appreciated"? Is it not possible to use 
RawByteString in the way the name suggests, by never bringing it 
together with any string variable of a different encoding brand and 
hence avoiding any conversion, be it intentional/documented/useful or not?


RawByteString cannot serve two different purposes :-(

In *Delphi* it is used as a polymorphic string, capable of *holding*
actual strings of any encoding. But when assigned to a variable of a
different encoding, a conversion may occur that converts the string into
the declared (static) encoding of the target variable.

In *FPC* it currently is used somewhat close to your idea, i.e. no
conversion occurs in an assignment either to *or from* a RawByteString
and some other AnsiString. We only can *hope* that *all* AnsiString
operations are based on the dynamic encoding of every operand, with
according checks and conversions inserted everywhere. This actually is
not true, because the compiler relies on the static encoding of
AnsiString variables, and inserts checks and conversions only when that
encoding is different. Actually a single AnsiString type would be
sufficient, because it already can hold data of any encoding :-(

I understand the FPC attempt, to allow *at the same time* for the new
(encoded) and old (unencoded) AnsiString behaviour, where no automatic
conversions are allowed. But this would require at the same time, that
e.g. all string literals *also* are stored in that (immutable) encoding,
and that this encoding can *not* be changed at runtime, while
DefaultSystemCodePage *can* be changed.

When the result of a conversion of a string of encoding CP_NONE is
undefined, which of course is correct for the *dynamic* encoding, this
simply could be changed into "conversions of CP_NONE strings do
nothing". Then CP_NONE would be the perfect encoding for old-style
AnsiStrings, with the only remaining problem being string expressions and
assignments where the operands have a different dynamic encoding. In
these cases all operands would have to be converted into the CP_NONE
encoding, as specified in another DefaultNoneEncoding constant (not
variable!); the same encoding would apply in assignments *to* variables
of a different encoding. Then also all type aliases for AnsiStrings must
have unique names, which allow to distinguish e.g.
  type UTF8String = AnsiString;
from
  type NewUTF8String = type AnsiString(CP_UTF8);
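
The difference between the two declarations can be observed with StringCodePage. This is a minimal sketch, not part of the original mail: the alias is named OldStyleUTF8 here (a hypothetical name) because current FPC already predefines UTF8String in the System unit, and the printed code pages depend on the platform's DefaultSystemCodePage, so no fixed output is claimed.

```pascal
program AliasVsDistinct;
{$mode objfpc}{$H+}

type
  // Plain alias: shares the static code page of AnsiString (CP_ACP).
  OldStyleUTF8 = AnsiString;
  // Distinct type with a fixed static code page of CP_UTF8.
  NewUTF8String = type AnsiString(CP_UTF8);

var
  A: OldStyleUTF8;
  B: NewUTF8String;
begin
  A := 'abc';
  // The static code pages differ, so the compiler may insert a
  // conversion for this assignment.
  B := A;
  WriteLn('alias, dynamic code page:         ', StringCodePage(A));
  WriteLn('distinct type, dynamic code page: ', StringCodePage(B));
end.
```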

DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-29 Thread Hans-Peter Diettrich


Jonas Maebe wrote:

On 28/11/14 21:30, Hans-Peter Diettrich wrote:

I prefer to specify and document everything *before* coding, so that
everybody can expect that the code will behave as specified.


If certain behaviour is explicitly undefined, it *is* specified and
documented. It means that your program is buggy if it triggers such
behaviour, and that the effect of triggering it could be anything.

[...]

An example from FPC itself is accessing an array beyond its bounds when
range checking is switched off.


After this hint I reviewed the "Code page identifiers" section again, and 
probably found the source of the misunderstandings.


CP_NONE: this value indicates that no code page information has been 
associated with the string data. The result of any explicit or implicit 
operation that converts this data to another code page is undefined.


Does this mean CP_NONE is not an allowed *dynamic* (string *data*) 
encoding, just like any other undefined encoding value?


In this case the description is correct, but it describes a special 
case of some *undefined* general rule about valid and invalid dynamic 
encodings. Then this general rule should be documented 
first, not only for CP_NONE. Then also documentation of the *intended* 
purpose of CP_NONE, for the *static* encoding of the RawByteString type, 
is missing altogether.


As Delphi doesn't allow for a dynamic encoding of CP_NONE, I don't 
understand the purpose of the FPC description. Now in turn some FPC 
developer might have misunderstood the (Delphi) handling of 
RawByteStrings, assuming that it were okay to omit a conversion in an 
assignment of RawByteString to an AnsiString of a different encoding.


That's why I think that the incorrect handling of such RawByteString 
assignments in FPC should be fixed, according to the general rule of 
assignments to a string of a different (static) encoding. CP_NONE 
definitely *is* different from any other encoding, and Delphi does not 
define an exception for RawByteStrings.




Exactly the same goes for converting strings with code page CP_NONE to a
different code page: your program is broken when it tries to do that,
and we cannot guarantee any outcome. This is exactly what "the behaviour
is undefined" means.


When a string *really* has a *dynamic* encoding of CP_NONE, this of 
course is illegal and thus will result in an undefined result. ACK, so 
far. But since Delphi (quietly) changes a SetCodePage to CP_NONE into 
the current CP_ACP, the undefined situation (invalid dynamic encoding) 
must have been forced by some illegal *hack* before, or in the FPC case 
by some erroneous (not Delphi conforming) RTL code.


DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-28 Thread Jonas Maebe



On 27 Nov 2014, at 17:11, Hans-Peter Diettrich drdiettri...@aol.com wrote:

 Such statements come only from writers that do not believe that their words 
 can be understood in various ways ;-)

I'm sorry, but I simply cannot discuss with people that, when I literally state 
"the result is undefined", think that I may actually have meant "the result is 
defined and if you change the implementation and/or keep it stable across 
compiler releases, then it will also conform to whatever you think that this 
defined behaviour should be". I don't have the energy nor the patience for that.


Jonas

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-28 Thread Michael Schnell


On 11/27/2014 03:44 PM, Hans-Peter Diettrich wrote:
The universal paradigm would allow for extensions (e.g. UTF-32, 
multiple 16 Bit Code pages, an additional fully dynamic String type, 
n-byte un-encoded string types), as I described in the Wiki page.


Even if feasible, such arbitrary string storage can dramatically 
increase the number of implicit string conversions. 


Of course it can do harm in that respect, if the user is silly enough to 
*explicitly* define variables in a brand without thinking about what he 
is doing. But this is exactly the same when he just uses the stuff 
currently offered by Delphi and fpc. If you arbitrarily define code pages 
for the variables holding your 8 bit (ANSI) strings you will enforce many 
conversions.


Currently in Delphi if you don't define special code pages anything will 
be UTF-16. So no unnecessary conversions.


In fpc (and maybe Lazarus, as well) I suppose the way currently in the 
works is (when not changing the Default behavior by certain options):
 - when compiling for Windows, String is UTF-16, and the RTL and LCL 
ubiquitously use String: So no unnecessary conversion
 - when compiling for Linux,  String is UTF-8, and the RTL and LCL 
ubiquitously use String: So no unnecessary conversion, either.


If this is done in the libraries (e.g. RTL and LCL) and in user code, 
this would allow for as little conversions as possible and thus best 
performance. Here, you would need different library binaries which might 
or might not be a problem.


But of course the portability is very questionable (including, but not 
limited to, the fact that the result of pos is different).
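
The remark about pos can be illustrated with the same two characters held once as UTF-8 and once as UTF-16. This is a sketch only; the strings are built from explicit code units so that the source file encoding plays no role.

```pascal
program PosDiffers;
{$mode objfpc}{$H+}

var
  U8: UTF8String;
  U16: UnicodeString;
begin
  // "a-umlaut" (U+00E4) followed by "b", as UTF-8 bytes.
  SetLength(U8, 3);
  U8[1] := #$C3;
  U8[2] := #$A4;
  U8[3] := 'b';

  // The same two characters as UTF-16 code units.
  SetLength(U16, 2);
  U16[1] := WideChar($00E4);
  U16[2] := WideChar('b');

  // Length and Pos count bytes for U8 and 16-bit code units for U16.
  WriteLn('UTF-8 : Length=', Length(U8), ' Pos=', Pos('b', U8));
  WriteLn('UTF-16: Length=', Length(U16), ' Pos=', Pos('b', U16));
end.
```

The UTF-8 string reports the "b" at index 3 and the UTF-16 string at index 2, so code that stores or compares such indices behaves differently depending on what String maps to on the target platform.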


When (on top of this) the interfaces to libraries (including 
TStrings) are done with DynamicString (encoding brand CP_ANY), no additional 
conversions would be necessary, because all other strings use the 
same encoding brand (either UTF-16 or UTF-8, depending on the OS) and 
hence the dynamic encoding of all DynamicStrings used would always be 
exactly that brand. Hence, IMHO, this would do no harm at all, as the 
overhead the compiler needs to add just to check the dynamic type 
brand and find that no conversion is necessary is extremely small.


But now the user has a choice !

 - If he does not do anything regarding the encoding brand of his 
strings, he will not notice the existence of the DynamicString Type at 
all. Not even Performance-wise. (But he might encounter portability issues.)
 - if he decides that he wants to use a dedicated encoding brand in all 
or parts of his code, he of course needs to know what he is doing. This 
can result

   - in improved portability (if decently done)
   - in improved performance (if decently done), e.g. by using one-byte 
strings for compactly storing the information and two-byte strings for 
e.g. search loops, or by using the best fitting encoding in the loops in 
the user code while allowing auto-conversion when accessing the 
libraries in case the underlying OS enforces a different encoding.
   - in disastrous increase of auto-conversions and thus performance 
degradation, (if not decently done).



An *efficient* implementation would be based on a single program-wide 
string representation, with different encodings being handled only in 
an exchange with external data sources.

Yep. But it would result in severe user code portability issues (see 
above). IMHO using DynamicString at the correct locations would not be 
(noticeably) less efficient but a lot more versatile.



Cassandra
After all I have the impression that the known RawByteString flaws 
will never be fixed in Delphi, in order to encourage the users to take 
the step to UnicodeString. Now the question is whether these flaws are 
fixed in FPC, or whether Lazarus will become the first project that 
definitely requires an complete move to UnicodeString, for reliable 
operation.

For best support of non-UTF-16 platforms I'd suggest to fix the flaws...
/Cassandra
I also don't think we will ever see a fix for the poor implementation of 
RawByteString (avoiding the word flaw and the suggestion of a bad 
purpose), because it would break existing user code.
Regarding fpc, correcting the flaws and keeping the name RawByteString 
would result in incompatibility issues vs Delphi and would break code 
that is ported from Delphi.


That is why fpc would need to define an additional type name (e.g 
DynamicString) and encoding brand number (e.g. CP_ANY = $FF00) for a 
decently usable type for intermediately holding a  String content. (see 
Wiki - 
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support 
)


RawXxxString can be used for really uncoded data as done with 
old-style strings in a lot of applications. Even if seriously flawed 
auto-conversion might be implemented in fpc for RawByteString (for 
Delphi compatibility), the user can easily avoid it by not directly 
combining RAW and differently statically encoded strings in an operation.


-Michael

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-28 Thread Michael Schnell


On 11/27/2014 07:29 PM, Hans-Peter Diettrich wrote:

Michael Schnell wrote:
 E.g. there are (at least) two "Code pages" for UTF-16 (LE and 
BE) that would be worth supporting.


You are confusing codepages and encodings :-(

That is why I put quotation marks around "Code pages". I used this wording 
because fpc (and Delphi?) uses it, abbreviated as CP, in the constant 
names CP_UTF8, CP_UTF16 and CP_UTF16BE [see Jonas' post: 
"CP_UTF16 and CP_UTF16BE can be returned by StringCodePage() when called 
on a unicodestring, and that's it."]





See it as a multi-level protocol for text processing. 

Yep. I see that it is workable and I understand the (supposedly mostly 
historical) reasons. But IMHO it is not a good (i.e. crafted from the 
ground up) concept.




It's known that the Delphi AnsiString implementation is flawed,...

And hence it's frustrating to see that fpc needs to follow for 
compatibility reasons. That is why I suggested an improved 
implementation (see 
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support). 
The seriously flawed Delphi-compatible use of the dynamic 
encoding-brand (and bytes-per-element) information (only implemented 
with RawByteString) can be left as it is, while a decent implementation 
with a new DynamicString type (CP_ANY) should be crafted.




I see no problem in using the same names and values. Delphi documents 
clearly state: ...

I fear that there will be code that relies on the flawed behavior of 
RawByteString ("it's a feature, not a bug") and using the same name with 
different behavior would break it. And a really usable DynamicString 
would not adhere to that description.


-Michael

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-28 Thread Hans-Peter Diettrich


Jonas Maebe wrote:


I'm sorry, but I simply cannot discuss with people that, when I
literally state "the result is undefined", think that I may actually
have meant "the result is defined and if you change the
implementation and/or keep it stable across compiler releases, then
it will also conform to whatever you think that this defined
behaviour should be". I don't have the energy nor the patience for
that.


I also have no use for continuing such discussions.

I prefer to specify and document everything *before* coding, so that 
everybody can expect that the code will behave as specified.


DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-28 Thread Hans-Peter Diettrich


Michael Schnell wrote:

I fear that there will be code that relies on the flawed behavior of 
RawByteString ("it's a feature, not a bug") and using the same name with 
different behavior would break it. And a really usable DynamicString 
would not adhere to that description.


How can somebody rely on behaviour *stated* as undefined, or not 
working as defined?


DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-28 Thread Hans-Peter Diettrich


Michael Schnell wrote:

On 11/27/2014 03:44 PM, Hans-Peter Diettrich wrote:


An *efficient* implementation would be based on a single program-wide 
string representation, with different encodings being handled only in 
an exchange with external data sources.
Yep. But it would result in severe user code portability issues (see 
above). IMHO using DynamicString at the correct locations would not be 
(noticeably) less efficient but a lot more versatile.


You suggested to use string as UTF-16 on Windows, and UTF-8 on Linux. 
That's what I understand as a unique program-wide string representation 
(not sourcecode-wide, instead program as *compiled*). Then I cannot see 
any need or use for another DynamicString type.



I also don't think we will ever see a fix for the poor implementation of 
RawByteString (avoiding the word flaw and the suggestion of a bad 
purpose), because it would break existing user code.


Nothing can be broken, as long as the Delphi behaviour is undefined. 
Code relying on specific compiler/library bugs is bound to that 
compiler, not portable in any way.


Regarding fpc, correcting the flaws and keeping the name RawByteString 
would result in incompatibility issues vs Delphi and breaking code that 
will be ported from Delphi.


Same as above. When application code works properly with strings of 
*sometimes* different static and dynamic encoding, it will not stop 
working with strings of *never* different encodings.


Of course the opposite is not true. When some code works properly (only) 
with strings of the same static and dynamic encoding, it will stop 
working when compiled with Delphi. Then the coder has to insert explicit 
checks for the dynamic encoding of *all* strings, all over his code.


Applied to FPC/Lazarus code (compiler, libraries, IDE...) this means 
that it's obviously easier to *prevent* possibly different 
static/dynamic encodings, instead of *checking and reacting* on such 
flaws throughout the entire codebase. Apart from that, every 
encoding-tolerant code will execute much slower than code without a need 
for checks and conversions everywhere.


I seriously doubt that the FPC developers ever realized these 
consequences, and the amount of time required for finding, reporting and 
fixing the bugs in all affected pieces of their code :-(


That is why fpc would need to define an additional type name (e.g 
DynamicString) and encoding brand number (e.g. CP_ANY = $FF00) for a 
decently usable type for intermediately holding a  String content.


This again would make *FPC* programs incompatible with Delphi. While 
fixing the RawByteString flaw would at least allow to *compile* FPC code 
with Delphi, the use of a different encoding value would definitely 
prevent compilation of such code with Delphi. What's the more serious 
incompatibility?



RawXxxString can be used for really uncoded data as done with 
old-style strings in a lot of applications.


Such a feature would be appreciated by many users, indeed :-)

DoDi

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-28 Thread Jonas Maebe

On 28/11/14 21:30, Hans-Peter Diettrich wrote:
 I prefer to specify and document everything *before* coding, so that
 everybody can expect that the code will behave as specified.

If certain behaviour is explicitly undefined, it *is* specified and
documented. It means that your program is buggy if it triggers such
behaviour, and that the effect of triggering it could be anything.

This is standard practice in computer science. E.g., pretty much every
manual of every processor contains descriptions of explicitly undefined
behaviour (search e.g. for undefined in the Intel or ARM architecture
manuals).

An example from FPC itself is accessing an array beyond its bounds when
range checking is switched off. *Some* of the possible outcomes are
accessing a value from a variable declared before/after it, accessing
random data that has nothing to do with any of those variables, a
program crash, or actually accessing an element of the array anyway. We
don't guarantee that any of those possibilities will happen, we don't
say that those are the only possibilities, we don't say they stay the
same across compiler or OS versions, or even across program executions.
Hence, it's undefined.

Exactly the same goes for converting strings with code page CP_NONE to a
different code page: your program is broken when it tries to do that,
and we cannot guarantee any outcome. This is exactly what "the behaviour
is undefined" means.
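
A minimal sketch of the array case, assuming FPC in objfpc mode with the SysUtils unit (the program and variable names are made up for illustration). With range checking on ({$R+}) the out-of-bounds write is a defined runtime error; with {$R-} the very same statement would be undefined, and nothing could be said about the outcome.

```pascal
program RangeDemo;
{$mode objfpc}{$H+}
{$R+}  // range checking on: an out-of-bounds index raises a defined runtime error

uses
  SysUtils;  // maps the range check runtime error to an ERangeError exception

var
  a: array[0..3] of Integer;
  i: Integer;
begin
  // Computed at run time so the compiler cannot reject the index at compile time.
  i := High(a) + 1 + ParamCount;
  try
    a[i] := 1;  // one element past the end of the array
    WriteLn('no error reported');
  except
    on E: ERangeError do
      WriteLn('range check error caught: ', E.Message);
  end;
end.
```

Changing {$R+} to {$R-} removes the check; the program may then appear to work, overwrite a neighbouring variable, or crash, and none of these is guaranteed.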


Jonas

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-27 Thread Marco van de Voort

In our previous episode, Hans-Peter Diettrich said:
  concatenated without data loss and that the result is then converted to
  the target string's encoding (except in case the target is
  RawByteString). How that is implemented exactly is undefined; again in
  the meaning of "undefined", not in the meaning of "undefined when
  defined as meaning X".
 
 In this case the implementation is compiler specific, somewhat 
 different from undefined (in a RawByteString):
 CP_NONE: this value indicates that no code page information has been 
 associated with the string data. The result of any explicit or implicit 
 operation that converts this data to another code page is undefined.
 
 IMO the result is well defined: it's the string with the encoding of 
 that other codepage. An undefined result, as I understand it, would 
 mean the result can be anything, unrelated to the function input.

This is usually called implementation defined. But implementation defined
implies it will remain the same in every iteration of the compiler (usually
documented).  If that is not wanted/possible, then it is considered
undefined.

So even if a value happens to be defined in one version of the compiler, it
doesn't automatically make it implementation defined. It needs to be a
documented choice for that.

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-27 Thread Michael Schnell


On 11/26/2014 05:25 PM, Sven Barth wrote:




 So seemingly you could do MyStringType   = type 
AnsiString(CP_UTF16), and seemingly the size information is set 
according to this.


No, you can't, because the RTL does not handle that. For AnsiString 
the element size is *always* 1. It's hardcoded. AFAIK Delphi even raises 
a compile error if you use CP_UTF16.




Thanks for the clarification.

I now understand that the Element Size field in the String header is 
quite dummy, as under the hood there are two completely separate 
concepts for one-byte-Strings and 2-Byte Strings and none for other 
Element sizes.


This to me is not obvious at all, as the language syntax and the String 
header data structure suggest a more universal paradigm for multiple 
string type brands, that each have an element-size and 
code-ID-number setting, handled by a common infrastructure.


The universal paradigm would allow for extensions (e.g. UTF-32, 
multiple 16 Bit Code pages, an additional fully dynamic String type, 
n-byte un-encoded string types), as I described in the Wiki page.


The dual mode concept of course does not provide such extensibility, 
and so I stop thinking about this (and bothering the community), and am 
happy that it just works as it is.


Thanks again,
-Michael

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-27 Thread Michael Schnell


On 11/26/2014 05:37 PM, Jonas Maebe wrote:

invalid (in the meaning of undefined) in both FPC and Delphi.
Sorry (I am not a native speaker). But to me undefined and invalid 
have  completely different meanings (in this context). An Invalid use 
of the language would result in an error (compiler or runtime), while an 
undefined language construct would result in something that might work 
in some way, but there is no guarantee that the outcome is always the 
same (e.g. in another instance or another compiler version).



CP_UTF16 and CP_UTF16BE can be returned by StringCodePage() when 
called on a unicodestring, and that's it.


I now do understand (see my reply to Sven).
-Michael



Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-27 Thread Michael Schnell


On 11/26/2014 09:30 PM, Hans-Peter Diettrich wrote:
So seemingly you could do MyStringType   = type 
AnsiString(CP_UTF16), and seemingly the size information is set 
according to this.

Not in Delphi XE.

Thanks for the clarification.

I did have some hope that fpc would be (or could be extended to be) 
better than Delphi in that respect.


I now see the reason for the (to me rather queer) name 
AnsiString for the code page aware string type. I erroneously 
supposed that the syntax finally used would be something like 
MyStringType = type String(CP_UTF16), with no restriction to 
ANSI, and with the CP_ constant defining both a code page and an 
element size, as suggested by the language syntax when working with 
strings using auto-conversion, and by the structure of the string content 
header.


There still might be room for (fully compatible) improvement (as I 
described in the Wiki), but it's even more difficult to do than I supposed.


Thanks again,
-Michael

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-27 Thread Michael Schnell


On 11/26/2014 07:13 PM, Hans-Peter Diettrich wrote:


Not all codepages have a fixed number of bytes per character.
The string preamble contains the *element size* (1 for AnsiString), 
just like with every dynamic array.
Sorry for the sloppy wording. Of course I meant element size 
("character" here obviously does not mean a printable item).


-Michael

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-27 Thread Jonas Maebe

On 26/11/14 23:41, Hans-Peter Diettrich wrote:
 In this case the implementation is compiler specific, somewhat
 different from undefined (in a RawByteString):
 CP_NONE: this value indicates that no code page information has been
 associated with the string data. The result of any explicit or implicit
 operation that converts this data to another code page is undefined.
 
 IMO the result is well defined: it's the string with the encoding of
 that other codepage.

Unless you actually tested this on all platforms and noted that is the
case, you cannot state this. And if you would actually test it, you
would discover that it is wrong
(http://bugs.freepascal.org/view.php?id=22501#c61238 ).

As mentioned in a previous discussion: don't use IMO (in my opinion)
when talking about testable facts. A testable fact is either true or
false, opinions do not enter the picture.

 An undefined result, as I understand it, would
 mean the result can be anything, unrelated to the function input.

Which is 100% correct.

 IMO a better wording should be found, that does not cause the current
 obvious confusion of some readers.

The confusion only occurs for readers that do not believe what is written.


Jonas

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-27 Thread Hans-Peter Diettrich


Michael Schnell schrieb:

I now understand that the Element Size field in the String header is 
quite dummy, as under the hood there are two completely separate 
concepts for one-byte-Strings and 2-Byte Strings and none for other 
Element sizes.


After a code review I realized that the element size field is specific 
to dynamic strings, not present in dynamic arrays. Since the element 
size is bound to the string type, it could be omitted in the FPC 
implementation. [With little win, when the record alignment is preserved]


This to me is not obvious at all, as the language syntax and the String 
header data structure suggest a more universal paradigm for multiple 
string type brands, that each have an element-size and 
code-ID-number setting, handled by a common infrastructure.


This may have been envisaged by the Delphi architects, but was not 
continued later.


The universal paradigm would allow for extensions (e.g. UTF-32, 
multiple 16 Bit Code pages, an additional fully dynamic String type, 
n-byte un-encoded string types), as I described in the Wiki page.


Even if feasible, such arbitrary string storage can dramatically 
increase the number of implicit string conversions. An *efficient* 
implementation would be based on a single program-wide string 
representation, with different encodings being handled only in an 
exchange with external data sources.


That standard encoding may be Ansi or Unicode; even Delphi allows for 
both models, where Ansi again suggests the use of one specific codepage 
(CP_ACP) for best performance.



<Cassandra>
After all I have the impression that the known RawByteString flaws will 
never be fixed in Delphi, in order to encourage the users to take the 
step to UnicodeString. Now the question is whether these flaws are fixed 
in FPC, or whether Lazarus will become the first project that definitely 
requires a complete move to UnicodeString, for reliable operation.

For best support of non-UTF-16 platforms I'd suggest to fix the flaws...
</Cassandra>

DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-27 Thread Hans-Peter Diettrich


Michael Schnell schrieb:

On 11/26/2014 07:13 PM, Hans-Peter Diettrich wrote:


Not all codepages have a fixed number of bytes per character.
The string preamble contains the *element size* (1 for AnsiString), 
just like with every dynamic array.
Sorry for the sloppy wording. Of course I meant element size 
("character" here obviously does not mean a printable item).


I'd restrict the use of character to physical Char types, just to 
avoid any misinterpretation.


Printable items (glyphs) are independent from the storage format. 
Ligatures or umlauts can consist of multiple codepoints, and several 
Unicode codepoints are not even printable.


A single printable character, as selectable by a single cursor step, 
can consist of multiple codepoints, even (or just) in Unicode.
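
A small sketch of this distinction, assuming FPC's UnicodeString (UTF-16 elements); the program and variable names are only illustrative. Length counts storage elements (WideChars), not codepoints and not printable characters.

```pascal
program CodepointDemo;
{$mode objfpc}{$H+}

var
  composed, decomposed, astral: UnicodeString;
begin
  // U+00E4: precomposed "a with diaeresis", one codepoint, one WideChar
  composed := WideChar($00E4);

  // U+0061 + U+0308: letter "a" followed by a combining diaeresis;
  // two codepoints that display as a single printable character
  decomposed := WideChar($0061);
  decomposed := decomposed + WideChar($0308);

  // U+1F600: a single codepoint outside the BMP, stored as a surrogate pair
  astral := WideChar($D83D);
  astral := astral + WideChar($DE00);

  WriteLn('composed:   ', Length(composed), ' element(s)');   // 1
  WriteLn('decomposed: ', Length(decomposed), ' element(s)'); // 2
  WriteLn('astral:     ', Length(astral), ' element(s)');     // 2
end.
```

All three strings correspond to one cursor step on screen, yet only the first one has Length 1.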



That's why I'd expect that the FPC documentation includes a glossary and 
definition of the terms, which should be used in the documentation and 
discussions.


DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-27 Thread Hans-Peter Diettrich


Jonas Maebe schrieb:

On 26/11/14 23:41, Hans-Peter Diettrich wrote:

In this case the implementation is compiler specific, somewhat
different from undefined (in a RawByteString):
CP_NONE: this value indicates that no code page information has been
associated with the string data. The result of any explicit or implicit
operation that converts this data to another code page is undefined.

IMO the result is well defined: it's the string with the encoding of
that other codepage.


Unless you actually tested this on all platforms and noted that is the
case, you cannot state this. And if you would actually test it, you
would discover that it is wrong
(http://bugs.freepascal.org/view.php?id=22501#c61238 ).


Bugs obviously violate some specification/definition, else it's not a 
bug, it's a feature ;-)



As mentioned in a previous discussion: don't use IMO (in my opinion)
when talking about testable facts. A testable fact is either true or
false, opinions do not enter the picture.


We're just talking about interpretations, not facts.



An undefined result, as I understand it, would
mean the result can be anything, unrelated to the function input.


Which is 100% correct.


Do you see any use for such function definitions, except in random 
generators?



IMO a better wording should be found, that does not cause the current
obvious confusion of some readers.


The confusion only occurs for readers that do not believe what is written.


Such statements come only from writers that do not believe that their 
words can be understood in various ways ;-)


DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-27 Thread Hans-Peter Diettrich


Michael Schnell schrieb:

On 11/26/2014 06:37 PM, Hans-Peter Diettrich wrote:


An AnsiString consists of AnsiChar's. The *meaning* of these char's 
(bytes) depends on their encoding, regardless of whether the used 
encoding is or is not stored with the string.
I understand that the implementation (in Delphi) seems to be driven more 
by the wording (ANSI) than by the logical paradigm the language syntax 
suggests. The language syntax and the string header fields suggest that 
both the element-size and the code-ID-number need to be adhered to (be it 
statically or dynamically, depending on the usage instance). E.g. there 
are at least two code pages for UTF-16 (LE and BE) that would 
be worth supporting.


You are confusing codepages and encodings :-(

UTF-7, UTF-8, UTF-16 and UTF-16BE describe different representations of 
the same values (Unicode codepoints). And I agree, all commonly used 
encodings should be implemented, at least for data import/export.



It's essential to distinguish between low-level (physical) AnsiChar 
values, and *logical* characters possibly consisting of multiple 
AnsiChars.
I now do see that the implementation is done following this concept. But 
the language syntax and the string header field suggest a more versatile 
paradigm, providing a universal reference counting element string type.


See it as a multi-level protocol for text processing. The bottom 
(physical) level deals with physical storage items (AnsiChar, 
WideChar...), and how they are stored in memory or files. Like it 
doesn't make sense to deal with individual bytes of real numbers in 
computations, it doesn't make sense to deal with individual bytes 
(AnsiChars) of logical characters - except in type/encoding conversions. 
Higher levels deal with logical values, which can consist of multiple 
physical items, and may need different interpretations (in case of Ansi 
codepages). This level is partially covered now by AnsiString encodings 
and UTF-16 surrogate pairs, which allow mapping the values into full 
Unicode (UCS-4) codepoints. But these codepoints still are not 
sufficient for a correct interpretation and manipulation of logical 
characters, which again can consist of multiple codepoints (decomposed 
umlauts, ligatures...). In a next level another (mostly language 
specific) interpretation may be required, like which logical characters 
have to be treated together (ligatures, non-breaking characters...). 
Some natural languages (Hebrew, Arabic...) require another special 
handling of (mixed) LTR/RTL reading, and of paths, influencing the 
graphical representation of character sequences; but that's nothing an 
application or library writer should have to deal with, such 
functionality should be provided by the target platform.


There must be a boundary between the standard (RTL) handling of the 
physical items and encodings, and higher text processing levels, up to 
language specific processing (how to break words, when to apply 
capitalization, syntax checks...), so that such special handling can be 
implemented in dedicated extensions (libraries, classes), by developers 
familiar with the rules and conventions of the natural languages.


For now we are talking only about the handling up to individual Unicode 
codepoints, and related string manipulation. For this, at least one 
string representation must exist that covers the full Unicode range of 
codepoints (UTF-8 or UTF-16 for now). When such an implementation claims 
undefined behaviour, then this can only mean implementation flaws, 
resulting in something different from what can be expected from proper 
Unicode handling. This includes invalid parameter values in subroutine 
calls, which should result in proper (defined) runtime error reporting 
(AV, error result...).


WRT AnsiString encodings, the only acceptable (expected) differences 
can result from lossy conversions, when converting proper Unicode into a 
non-UTF encoding. Even then the results should be consistent, even if 
the concrete results depend on some external (platform...) convention or 
settings.


IMO.


That's why I wonder *when* exactly the result of such an expression 
*is* converted (implicitly) into the static encoding of the target 
variable, and when *not*.
I understand that the idea is, to use the static encoding information 
provided by the type definition whenever possible.


Right, but here whenever possible depends on the correspondence of 
static and dynamic encoding. When the dynamic encoding can *ever* be 
different from the static encoding, except for RawByteString, I consider 
it NOT possible to derive the need for a conversion from the static 
encoding. In the handling of floatingpoint values we may have to expect 
invalid operations (division by zero, overflow...) or values (NaN...), 
but NOT that a Double variable ever contains two Integer values - unless 
forced by dirty hacks out of compiler control. Why should this be 
different and acceptable with

[fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Michael Schnell


I fail to understand some of the text.

It seems to be unavoidable to use the name ANSIString even though I 
always throw up when seeing a thing called ANSI containing Unicode 
(e.g. UTF8String = type AnsiString(CP_UTF8)).



Seemingly here the bytes per character setting implicitly is thought 
of as a part of the code-page definition. Correct?



In section Dynamic code page:

When assigning a string to a plain AnsiString (= AnsiString(CP_ACP)) or 
ShortString, the string data will however be converted to 
DefaultSystemCodePage. The dynamic code page of that AnsiString(CP_ACP) 
will then be the current value of DefaultSystemCodePage (e.g. 1250 for 
the Windows-1250 code page), even though its static code page is CP_ACP 
(which is a constant <> 1250). This is one example of how the static 
code page can differ from the dynamic code page. Subsequent sections 
will describe more such scenarios.


1) A short String does not have a Code page notification so for this 
static code page can differ from the dynamic code page does not seem 
to make much sense.


2) I fail to understand how with this explanation that seems to force 
auto conversion for assignments between types with different code page 
settings (also for CP_ACP) the static code page can differ from the 
dynamic code page can happen.


In fact this disaster seems to be able to happen (see section 
RawByteString) if assigning a string with a static code page X1 to a 
RawByteString (hence no conversion) and then assigning that 
RawByteString to a string with a static code page X2 (no conversion 
again). In fact I assume that without abusing RawByteString such 
intersexual strings can't be produced, otherwise this would be rather 
disastrous for normal users.




In section RawByteString:

the results of conversions from/to the CP_NONE code page are undefined.

In effect the behavior is exactly defined in this section ("As a first 
approximation"). Does that mean it is due to be changed? Is there a 
reason not to keep the described behavior (just never do any conversion)? 
Of course this can produce intersexual strings. Is this great 
harm? If yes, I think assigning a RawByteString to a string with a 
static code page should be completely forbidden at compile time or 
result in a runtime error if the code page does not match.


-Michael

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Mattias Gaertner

On Wed, 26 Nov 2014 11:23:17 +0100
Michael Schnell mschn...@lumino.de wrote:

[...]
 It seems to be unavoidable to use the name ANSIString even though I 
 always throw up when seeing a thing called ANSI containing Unicode 
 (e.g. UTF8String = type AnsiString(CP_UTF8)).

Is there a question?

 
 Seemingly here the bytes per character setting implicitly is thought 
 of as a part of the code-page definition. Correct?

Code pages define bytes per character. 
As you know: Don't confuse character with glyph and codepoint.
Ansistring supports only one byte per character code pages.

 
 In section Dynamic code page:
 
 When assigning a string to a plain AnsiString (= AnsiString(CP_ACP)) or 
 ShortString, the string data will however be converted to 
 DefaultSystemCodePage. The dynamic code page of that AnsiString(CP_ACP) 
 will then be the current value of DefaultSystemCodePage (e.g. 1250 for 
 the Windows-1250 code page), even though its static code page is CP_ACP 
 (which is a constant <> 1250). This is one example of how the static 
 code page can differ from the dynamic code page. Subsequent sections 
 will describe more such scenarios.
 
 1) A short String does not have a Code page notification so for this 
 static code page can differ from the dynamic code page does not seem 
 to make much sense.

What is a Code page notification? Do you mean code page information?
IMO the phrase The dynamic code page of that AnsiString is clear,
that it does *not* talk about ShortString.

Mattias

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Michael Schnell


On 11/26/2014 11:40 AM, Mattias Gaertner wrote:
Ansistring supports only one byte per character code pages. 


Even more confused. Am I wrong thinking that with code aware Strings,  
for Delphi XE compatibility, in Windows CP_ACP needs to be UTF16 (if not 
right, then due later) ?



What is a Code page notification? Do you mean code page information?

Yep.


that it does *not* talk about ShortString.

OK.
-Michael

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Sven Barth

Am 26.11.2014 11:53 schrieb Michael Schnell mschn...@lumino.de:

 On 11/26/2014 11:40 AM, Mattias Gaertner wrote:

 Ansistring supports only one byte per character code pages.


 Even more confused. Am I wrong thinking that with code aware Strings,
for Delphi XE compatibility, in Windows CP_ACP needs to be UTF16 (if not
right, then due later) ?

Yes, you're wrong. In Delphi (and FPC) CP_ACP corresponds by default with
the current system codepage (e.g. CP1252 on a German Windows). CP_UTF16 is
not supported, because AnsiString only supports 1-Byte character strings
(and UTF-8 as the odd one) and not 2-Byte character strings.
The difference to Delphi currently is that for FPC
String=AnsiString(CP_ACP) and for Delphi String=UnicodeString (aka 2-Byte
string).

Regards,
Sven

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Mattias Gaertner

On Wed, 26 Nov 2014 11:23:17 +0100
Michael Schnell mschn...@lumino.de wrote:

[...]
 2) I fail to understand how with this explanation that seems to force 
 auto conversion for assignments between types with different code page 
 settings (also for CP_ACP) the static code page can differ from the 
 dynamic code page can happen.

For example:
CP_ACP=0, DefaultSystemCodePage=1252
That means static code page is always 0, while dynamic code page can be
0 or 1252. Both describe the same encoding.

RawByteString has static cp CP_NONE=$FFFF, but its dynamic cp is always
different, for example CP_ACP=0, 1252 or CP_UTF8.
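
The difference between static and dynamic code page can be observed with StringCodePage. The following is a sketch under the assumption of an FPC version with code page aware strings; the program and variable names are illustrative, and the numbers printed depend on the platform's DefaultSystemCodePage.

```pascal
program CodePageDemo;
{$mode objfpc}{$H+}

var
  u: UTF8String;     // static code page CP_UTF8
  r: RawByteString;  // static code page CP_NONE
  a: AnsiString;     // static code page CP_ACP
begin
  u := 'abc';
  r := u;  // no conversion: r takes over the dynamic code page of u
  a := u;  // converted to DefaultSystemCodePage if that differs from UTF-8

  WriteLn('CP_ACP                = ', CP_ACP);
  WriteLn('DefaultSystemCodePage = ', DefaultSystemCodePage);
  WriteLn('dynamic code page of u: ', StringCodePage(u));
  WriteLn('dynamic code page of r: ', StringCodePage(r));
  WriteLn('dynamic code page of a: ', StringCodePage(a));
end.
```

According to the wiki text quoted in this thread, r should report the code page of u rather than CP_NONE, and a should report the current DefaultSystemCodePage rather than the constant CP_ACP.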

 
 In fact this disaster seems to be able to happen (see section 
 RawByteString) if assigning a string with a static code page X1 to a 
 RawByteString (hence no conversion) and then assigning that 
 RawByteString to a string with a static code page X2 (no conversion 
 again). In fact I assume that without abusing RawByteString such 
 intersexual strings can't be produced, otherwise this would be rather 
 disastrous for normal users.

You can use SetCodePage as well. ;)
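
A minimal sketch of that relabelling, assuming the RTL procedure SetCodePage(var s: RawByteString; CodePage: TSystemCodePage; Convert: Boolean); the program name is made up. With Convert = False only the code page field in the string header is changed, the bytes stay as they are.

```pascal
program RelabelDemo;
{$mode objfpc}{$H+}

var
  r: RawByteString;
begin
  r := 'abc';
  WriteLn('before: code page ', StringCodePage(r), ', length ', Length(r));

  // Convert = False: the string data is not touched, only its label changes.
  SetCodePage(r, CP_UTF8, False);
  WriteLn('after:  code page ', StringCodePage(r), ', length ', Length(r));
end.
```

For pure ASCII data the relabelling is harmless; for bytes above $7F it changes how the same data is interpreted, which is exactly the mismatch discussed above.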

 
 In section RawByteString:
 
 the results of conversions from/to the CP_NONE code page are undefined.

... because CP_NONE is not a real code page.


Mattias

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Mattias Gaertner

On Wed, 26 Nov 2014 11:52:50 +0100
Michael Schnell mschn...@lumino.de wrote:

 On 11/26/2014 11:40 AM, Mattias Gaertner wrote:
  Ansistring supports only one byte per character code pages. 
 
 Even more confused. Am I wrong thinking that with code aware Strings,  
 for Delphi XE compatibility, in Windows CP_ACP needs to be UTF16 (if not 
 right, then due later) ?

No.
In mode delphiunicode String=UnicodeString.

Mattias

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Michael Schnell


On 11/26/2014 12:09 PM, Sven Barth wrote:
 In Delphi (and FPC) CP_ACP corresponds by default with the current 
system codepage (e.g. CP1252 on a German Windows). 


OK. So in Delphi XE (in Germany) String(CP_ACP) is the same as 
String(CP1252) but different from String without brackets which in turn 
is the same as String(CP_UTF16) ? Correct ?


CP_UTF16 is not supported, because AnsiString only supports 1-Byte 
character strings (and UTF-8 as the odd one) and not 2-Byte character 
strings.


I still don't understand. The wiki article seems to suggest that it is 
about a type called ANSIString that features a dynamically settable 
code page information. From discussions about Delphi and FPC, I only 
know a String type with a dynamically settable code page information 
that also features a dynamically settable Bytes per Character 
information and hence does support 1, 2 and 4 Bytes per Character. 
(e.g. UTF-8, UTF-16, and UTF-32).


The difference to Delphi currently is that for FPC 
String=AnsiString(CP_ACP) and for Delphi String=UnicodeString (aka 
2-Byte string).




I understand that you mean (e.g.) Delphi XE. But what version of FPC is 
"currently"? Am I wrong assuming that in the svn we do have the 
NewStrings library that supports dynamic code-page *and* 
byte-per-character settings and hence supports e.g. CP1251, UTF-8, 
UTF-16, and UTF-32 ? So I seem to understand the meaning of 
String(CP1252), String(CP_UTF8), and String(CP_UTF16) (which seems do be 
the Delphi notation), but I seemingly don't get the exact meaning of 
AnsiString(CP_ACP) or AnsiString(CP1251)


In the end, what the definition of String without brackets is, might 
be due to a settable compiler option and/or the OS the compiler is set 
to create code for.


-Michael

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Michael Schnell


On 11/26/2014 12:13 PM, Mattias Gaertner wrote:

In mode delphiunicode String=UnicodeString.

I see.

So even in Delphi XE where UnicodeString is denoted by CP_UTF16, the 
value of the constant CP_UTF16 is not the same as the value of the 
(constant or) variable CP_ACP, (while OTOH using the value of CP_UTF16 
in a type or variable definition performs the same as using 0 {is 
CP_DEFAULT name of the appropriate constant ?} ).


I understand that fpc with mode delphiunicode is supposed to work in 
the same way.


-Michael


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Michael Schnell


On 11/26/2014 12:10 PM, Mattias Gaertner wrote:


the results of conversions from/to the CP_NONE code page are undefined.
... because CP_NONE is not a real code page.


So you understand result as what you would get when printing.

In the context of this wiki page I would understand result as the 
binary content of the variable in question.


Is this "undefined" in the meaning of "not predictable by the user in 
the current version of fpc", or in the meaning of "due to change when 
updating fpc"?


-Michael

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Michael Schnell


After re-reading yet another question:

In section String concatenations there is no mentioning about 
auto-conversion. For statically typed Strings it's rather obvious that 
they will be auto-converted if appropriate. Technically, if differently 
encoded, they seem to be converted to Unicode and the result is 
converted to match the target.


Regarding RawByteStrings, the page states that a RawByteString 
assignment "has exactly the same behavior as assigning that 
AnsiString(X) to another AnsiString(X) variable with the same value of 
X: no code page conversion or copying occurs". Seemingly this is not 
true for the intermediate results of concatenations. Here the dynamic 
encoding information seems to determine whether a conversion happens 
and which one. If this is the case, it should be mentioned. (Whether or 
not this makes sense is another question: is the code page information 
of RawByteString meant to be "NONE" (i.e. "RAW") or "dynamic" (i.e. 
"complex")?)


-Michael

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Sven Barth

On 26.11.2014 12:37, Michael Schnell mschn...@lumino.de wrote:

 On 11/26/2014 12:09 PM, Sven Barth wrote:

  In Delphi (and FPC) CP_ACP corresponds by default with the current
system codepage (e.g. CP1252 on a German Windows).


 OK. So in Delphi XE (in Germany) String(CP_ACP) is the same as
String(CP1252) but different from String without brackets which in turn is
the same as String(CP_UTF16) ? Correct ?

There is no "String" with brackets. You can only use "AnsiString" followed
by brackets, not "String". And "String" in Delphi 2009+ is the same as
"UnicodeString", which is a different compiler internal type than
"AnsiString(CP_UTF16)" would be if it were allowed.


 CP_UTF16 is not supported, because AnsiString only supports 1-Byte
character strings (and UTF-8 as the odd one) and not 2-Byte character
strings.


 I still don't understand. The wiki article seems to suggest that it is
about a type called ANSIString that features a dynamically settable code
page information. From discussions about Delphi and FPC, I only know a
String type with a dynamically settable code page information that also
features a dynamically settable Bytes per Character information and hence
does support 1, 2 and 4 Bytes per Character. (e.g. UTF-8, UTF-16, and
UTF-32).

While both AnsiString and UnicodeString have the current codepage and the
character size in their header record, the code page is only used for
AnsiString and the size cannot be influenced in any way (for an AnsiString
it's always 1 and for a UnicodeString it's always 2). There is no UTF-32
string (at least not in the sense of a compiler provided type).



 The difference to Delphi currently is that for FPC
String=AnsiString(CP_ACP) and for Delphi String=UnicodeString (aka 2-Byte
string).


 I understand that you mean (e.g.) Delphi XE. But what version of FPC is
currently.

FPC is none, because when Delphi introduced the code page aware AnsiString
it switched at the same time from having String=AnsiString to
String=UnicodeString. FPC did only the first part for now (so at best FPC
would be a "not quite 2009" :P ).

 Am I wrong assuming that in the svn we do have the NewStrings library
that supports dynamical code-page *and* byte-per-character settings and
hence supports e.g. CP1251, UTF-8, UTF-16, and UTF-32 ? So I seem to
understand the meaning of String(CP1252), String(CP_UTF8), and
String(CP_UTF16) (which seems do be the Delphi notation), but I seemingly
don't get the exact meaning of AnsiString(CP_ACP) or AnsiString(CP1251)

No. The Delphi notation is the same as in FPC: AnsiString(codepage).
An AnsiString(CP_1251) normally holds string data encoded with the
CP-1251 codepage, while an AnsiString(CP_ACP) holds string data encoded with
whatever encoding DefaultSystemCodePage denoted at the time of
assignment. This can be, for example, CP_1251 as well, or something different
like CP_UTF8 (it cannot, however, be CP_ACP again, nor CP_UTF16 or CP_UTF32).
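
To make the static/dynamic distinction concrete, here is a minimal sketch, assuming an FPC version with code page aware AnsiStrings. The type name TCP1251String and the variable names are made up for illustration, and the numeric code page 1251 is used directly instead of a named constant. The printed values depend on the platform and on DefaultSystemCodePage, so none are claimed here.

```pascal
program CodePageTypes;
{$mode objfpc}{$H+}

type
  // Hypothetical type name; 1251 is the numeric Windows-1251 code page.
  TCP1251String = type AnsiString(1251);

var
  cyr: TCP1251String;   // static code page 1251
  plain: AnsiString;    // static code page CP_ACP

begin
  cyr := 'abc';
  // Static code pages differ, so the compiler may insert a conversion
  // to whatever DefaultSystemCodePage is at this point.
  plain := cyr;
  WriteLn('DefaultSystemCodePage      = ', DefaultSystemCodePage);
  WriteLn('dynamic code page of cyr   = ', StringCodePage(cyr));
  WriteLn('dynamic code page of plain = ', StringCodePage(plain));
end.
```

StringCodePage reports the dynamic code page stored with the string data, which is what the explanation above is about.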

 In the end, what the definition of String without brackets is, might be
due to a settable compiler option and/or the OS the compiler is set to
create code for.

That is already the case:

- any mode, H- : ShortString
- any mode except delphiunicode, H+ : AnsiString(CP_ACP)
- mode delphiunicode, H+ : UnicodeString
(there's also a modeswitch to change String to UnicodeString, but I forgot
its name -.-)
Please note that these switches are always per unit, as precompiled units
(like the RTL ones) cannot be influenced.
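
A quick way to see which mapping is active in a given unit is to look at SizeOf(Char). This is only a sketch: the program name is arbitrary, and swapping the directive line for {$mode delphiunicode} is the assumed way to get the UnicodeString mapping.

```pascal
program StringKind;
{$mode objfpc}{$H+}
// Assumption: replace the directive line above with {$mode delphiunicode}
// to see the UnicodeString mapping. The modeswitch mentioned in the text
// is believed to be {$modeswitch unicodestrings}.

begin
  if SizeOf(Char) = 1 then
    WriteLn('Char is 1 byte: String is ShortString or AnsiString here')
  else
    WriteLn('Char is ', SizeOf(Char), ' bytes: String is UnicodeString here');
end.
```

Because the switches are per unit, the check only tells you about the unit it is compiled in.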

Regards,
Sven

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Jonas Maebe

On 26/11/14 12:53, Michael Schnell wrote:
[CP_NONE]
 Is this undefined in the meaning of not predictable by the user in
 the current version of fpc, or in the meaning of due to change when
 updating fpc.

This "undefined" literally means "undefined". It does not mean
"undefined in a meaning that is defined in a particular way".


Jonas

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Michael Schnell


On 11/26/2014 03:05 PM, Sven Barth wrote:




 OK. So in Delphi XE (in Germany) String(CP_ACP) is the same as 
String(CP1252) but different from String without brackets which in 
turn is the same as String(CP_UTF16) ? Correct ?


There is no String with brackets. You can only use AnsiString 
followed by brackets, not String. And String in Delphi 2009+ is 
the same as UnicodeString which is a different compiler internal type 
than AnsiString(CP_UTF16) would be if it would be allowed.


While both AnsiString and UnicodeString have the current codepage and 
the character size in their header record, the code page is only used 
for AnsiString and the size cannot be influenced in any way (for an 
AnsiString it's always 1 and for a UnicodeString it's always 2).



OK.
So what is the notation in Delphi (and hence supposedly in FPC with 
mode delphiunicode) to define a variable with the (static) string 
encoding type CP_XXX with XXX = 1252, UTF8, UTF16 ?

I found this:
  CP_ACP     = 0;     // default to ANSI code page
  CP_UTF16   = 1200;  // utf-16
  CP_UTF16BE = 1201;  // unicodeFFFE
  CP_UTF7    = 65000; // utf-7
  CP_UTF8    = 65001; // utf-8
  CP_ASCII   = 20127; // us-ascii
  CP_NONE    = $FFFF; // rawbytestring encoding

So seemingly you could do MyStringType   = type 
AnsiString(CP_UTF16), and seemingly the size information is set 
according to this.


There is no UTF-32 string (at least not in the sense of a compiler 
provided type).



I see (It's a shame).

Thanks a lot for your patience,
-Michael

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Jonas Maebe

On 26/11/14 13:11, Michael Schnell wrote:
 In section String concatenations there is no mentioning about
 auto-conversion.

There is.

 For statically typed Strings it's rather obvious that
 they will be auto-converted if appropriate.

It's probably rather obvious because it is literally mentioned in that
section.

 Technically - if differently
 encode - they seem to be converted to Unicode and the result is
 converted to match the target.

Technically, that section literally states that they will be
concatenated without data loss and that the result is then converted to
the target string's encoding (except in case the target is
RawByteString). How that is implemented exactly is undefined; again in
the meaning of "undefined", not in the meaning of "undefined when
defined as meaning X".

 Regarding RawByteStrings there has been the definition a RawByteString
 has exactly the same behavior as assigning that AnsiString(X) to another
 AnsiString(X) variable with the same value of X: no code page conversion
 or copying occurs. Seemingly this is not true for the intermediate
 results of concatenations.

That paragraph only specifies that code page-aware strings are
concatenated without data loss, and then defines to which code page the
result will be converted before assigning it to the target.

Even if the intermediary result of a concatenation would be a
RawByteString (which is not stated nor necessarily ever the case), then
the above would apply and hence the (dynamic) code page of that
RawByteString would be the one as defined by the above-mentioned rules
before it would be assigned to the target.


Jonas

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Sven Barth

On 26.11.2014 15:30, Mattias Gaertner nc-gaert...@netcologne.de wrote:

 On Wed, 26 Nov 2014 15:05:16 +0100
 Sven Barth pascaldra...@googlemail.com wrote:

 [...]
  While both AnsiString and UnicodeString have the current codepage and
the
  character size in their header record

 AFAIK UnicodeString has only a static (fixed) code page.

Yes, nevertheless the header record is the same for UnicodeString and
AnsiString and thus it also has a codepage field which is always
initialized to CP_UTF16 however.

Regards,
Sven

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Jonas Maebe

On 26/11/14 17:21, Sven Barth wrote:
 Yes, nevertheless the header record is the same for UnicodeString and
 AnsiString and thus it also has a codepage field which is always
 initialized to CP_UTF16 however.

It can also be CP_UTF16BE (which it is on big endian FPC targets right now).


Jonas


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Jonas Maebe

On 26/11/14 16:19, Michael Schnell wrote:
 So seemingly you could do MyStringType   = type
 AnsiString(CP_UTF16), and seemingly the size information is set
 according to this.

As several people have told you several times, that is invalid (in the
meaning of "undefined") in both FPC and Delphi. I've mentioned this on
the FPC_Unicode_support wiki page now.

CP_UTF16 and CP_UTF16BE can be returned by StringCodePage() when called
on a unicodestring, and that's it.
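
A minimal sketch of that one legitimate appearance of CP_UTF16/CP_UTF16BE, assuming the RTL provides a StringCodePage overload that takes a UnicodeString (the program and variable names are illustrative only):

```pascal
program UnicodeCP;
{$mode objfpc}{$H+}

var
  us: UnicodeString;

begin
  us := 'abc';
  // Per the explanation above this is expected to report CP_UTF16
  // (1200), or CP_UTF16BE (1201) on big endian targets.
  WriteLn('code page reported for a UnicodeString: ', StringCodePage(us));
end.
```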




Jonas

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Mattias Gaertner

On Wed, 26 Nov 2014 17:23:48 +0100
Jonas Maebe jonas.ma...@elis.ugent.be wrote:

 On 26/11/14 17:21, Sven Barth wrote:
  Yes, nevertheless the header record is the same for UnicodeString and
  AnsiString and thus it also has a codepage field which is always
  initialized to CP_UTF16 however.
 
 It can also be CP_UTF16BE (which it is on big endian FPC targets right now).

I see. 

Can you create a CP_UTF16BE on little Endian systems?

type u = UnicodeString(CP_UTF16BE); gives an error.


Mattias

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Mattias Gaertner

On Wed, 26 Nov 2014 17:50:31 +0100
Mattias Gaertner nc-gaert...@netcologne.de wrote:

 On Wed, 26 Nov 2014 17:23:48 +0100
 Jonas Maebe jonas.ma...@elis.ugent.be wrote:
 
  On 26/11/14 17:21, Sven Barth wrote:
   Yes, nevertheless the header record is the same for UnicodeString and
   AnsiString and thus it also has a codepage field which is always
   initialized to CP_UTF16 however.
  
  It can also be CP_UTF16BE (which it is on big endian FPC targets right now).
 
 I see. 
 
 Can you create a CP_UTF16BE on little Endian systems?
 
 type u = UnicodeString(CP_UTF16BE); gives an error.

Jonas has answered this. Thanks. 

Mattias

Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Hans-Peter Diettrich


Mattias Gaertner wrote:

On Wed, 26 Nov 2014 11:23:17 +0100
Michael Schnell mschn...@lumino.de wrote:


Seemingly here the bytes per character setting implicitly is thought 
of as a part of the code-page definition. correct ?


Code page define bytes per character.


Huh?

Not all codepages have a fixed number of bytes per character.
The string preamble contains the *element size* (1 for AnsiString), just 
like with every dynamic array.




As you know: Don't confuse character with glyph and codepoint.


Right, but what is what?

I feel a need for an exact (official) definition of such (and more) 
terms, in order to prevent further misunderstandings of the 
documentation and in discussions.


E.g. code page has different meanings, when used with ANSI/ISO and 
Unicode character sets.
While ANSI/ISO codepages describe different mappings of bytes into 
characters, Unicode codepages define subsets of the whole Unicode range.


My understanding of character is a *logical* unit (letter), with 
possibly different encodings, values and sizes in different codepages 
(character sets).

What's the term for the *physical* unit (AnsiChar, WideChar)?



Ansistring supports only one byte per character code pages.


Huh?

What's your definition of character?

AnsiString supports MBCS codepages as well. The restriction is the 
physical storage unit (1 byte per string item), as imposed by AnsiChar.


DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Hans-Peter Diettrich


Michael Schnell wrote:

On 11/26/2014 11:40 AM, Mattias Gaertner wrote:
Ansistring supports only one byte per character code pages. 


Even more confused. Am I wrong thinking that with code aware Strings,  
for Delphi XE compatibility, in Windows CP_ACP needs to be UTF16 (if not 
right, than due later) ?


Delphi XE does not properly support UTF-8. CP_ACP seems to depend on 
western/far-eastern versions, where the western version assumes and 
allows for any SBCS; I don't know of the same in far-east versions.
The SBCS restriction makes it possible to simplify standard string 
handling and conversions, because every character (=byte) can be 
exchanged in place. UTF-8 doesn't fit into this picture, because it's a 
MBCS.


UTF-16 is not a valid value for CP_ACP in Delphi, because it's a 2-byte 
encoding. Even if the Delphi architects may have thought about a common 
string type with a variable element size (1,2,4), this surely soon 
turned out to be a stupid idea, so that AnsiString and 
WideString/UnicodeString are still strictly distinct types. WideString 
and UnicodeString imply UTF-16, with platform specific byte order 
(endianness). The latter matters almost only to compiler and 
library coders, in host/network byte order conversions. For the sake of 
completeness, PDP-11 processors use yet another byte order, maybe more 
word-based processors (DG...) as well.


DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Hans-Peter Diettrich


Michael Schnell wrote:

On 11/26/2014 12:09 PM, Sven Barth wrote:
 In Delphi (and FPC) CP_ACP corresponds by default with the current 
system codepage (e.g. CP1252 on a German Windows). 


OK. So in Delphi XE (in Germany) String(CP_ACP) is the same as 
String(CP1252) but different from String without brackets which in turn 
is the same as String(CP_UTF16) ? Correct ?


CP_ACP (and CP_NONE) describes a *static* encoding, and has a fixed 
value (CP_ACP=0, CP_NONE=$FFFF). The dynamic encoding of strings, kept 
in AnsiString(0) or RawByteString variables, must be obtained from the 
string itself. When the string is empty, StringCodepage returns 
DefaultSystemCodePage (for CP_ACP).



CP_UTF16 is not supported, because AnsiString only supports 1-Byte 
character strings (and UTF-8 as the odd one) and not 2-Byte character 
strings.


I still don't understand. The wiki article seems to suggest that it is 
about a type called ANSIString that features a dynamically settable 
code page information. From discussions about Delphi and FPC, I only 
know a String type with a dynamically settable code page information 
that also features a dynamically settable Bytes per Character 
information and hence does support 1, 2 and 4 Bytes per Character. 
(e.g. UTF-8, UTF-16, and UTF-32).


You should have noticed that there exists no String or Char type, that 
would allow for arbitrary bytes/char counts (see my other answer for 
details).



The difference to Delphi currently is that for FPC 
String=AnsiString(CP_ACP) and for Delphi String=UnicodeString (aka 
2-Byte string).




I understand that you mean (e.g.) Delphi XE. But what version of FPC is 
currently. Am I wrong assuming that in the svn we do have the 
NewStrings library that supports dynamical code-page *and* 
byte-per-character settings and hence supports e.g. CP1251, UTF-8, 
UTF-16, and UTF-32 ?


The byte-per-character field is read-only, just like for any dynamic array.

So I seem to understand the meaning of 
String(CP1252), String(CP_UTF8), and String(CP_UTF16) (which seems do be 
the Delphi notation), but I seemingly don't get the exact meaning of 
AnsiString(CP_ACP) or AnsiString(CP1251)


The Delphi notation is the same, e.g. AnsiString(CP_ACP).

In the end, what the definition of String without brackets is, might 
be due to a settable compiler option and/or the OS the compiler is set 
to create code for.


Right, the *generic* String type can be mapped to either ShortString, 
AnsiString(0) or UnicodeString, depending on compiler versions and 
switches. A rough guess can be derived from sizeof(Char).


DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Hans-Peter Diettrich


Michael Schnell wrote:

I fail to understand some of the text.

It seems to be unavoidable to use the name ANSIString even though I 
always throw up when seeing a thing called ANSI containing Unicode 
(e. g.   UTF8String = type AnsiString(CP_UTF8) ).



Seemingly here the bytes per character setting implicitly is thought 
of as a part of the code-page definition. correct ?


An AnsiString consists of AnsiChar's. The *meaning* of these char's 
(bytes) depends on their encoding, regardless of whether the used 
encoding is or is not stored with the string.


It's essential to distinguish between low-level (physical) AnsiChar 
values, and *logical* characters possibly consisting of multiple AnsiChars.




In section Dynamic code page:

When assigning a string to a plain AnsiString (= AnsiString(CP_ACP)) or 
ShortString, the string data will however be converted to 
DefaultSystemCodePage. The dynamic code page of that AnsiString(CP_ACP) 
will then be the current value of DefaultSystemCodePage (e.g. 1250 for 
the Windows-1250 code page), even though its static code page is CP_ACP 
(which is a constant <> 1250). This is one example of how the static 
code page can differ from the dynamic code page. Subsequent sections 
will describe more such scenarios.


1) A short String does not have a Code page notification so for this 
static code page can differ from the dynamic code page does not seem 
to make much sense.


The text correctly states dynamic code page of that AnsiString. 
ShortString (and AnsiChar) has no encoding indicator, they are assumed 
to be encoded in CP_ACP.



2) I fail to understand how with this explanation that seems to force 
auto conversion for assignments between types with different code page 
settings (also for CP_ACP) the static code page can differ from the 
dynamic code page can happen.


Continue reading until you have understood the special handling of 
string literals and RawByteString.


In fact this disaster seems to be able to happen (see section 
RawByteString) if assigning a string with a static code page X1 to a 
RawByteString (hence no conversion) and then assigning that 
RawByteString to a string with a static code page X2 (no conversion 
again). In fact I assume that without abusing RawByteString such 
intersexual strings can't be produced, otherwise this would be rather 
disastrous for normal users.


*All* intermediate strings, generated during the evaluation of string 
expressions, only have a dynamic encoding, thus can be considered as 
being RawByteStrings.


That's why I wonder *when* exactly the result of such an expression *is* 
converted (implicitly) into the static encoding of the target variable, 
and when *not*.


Obviously the compiler inserts a conversion request for the *direct* 
assignment of one string variable to another one of a different 
*static* encoding. But what happens when a string expression doesn't 
have such a known static encoding???




In section RawByteString:

the results of conversions from/to the CP_NONE code page are undefined.

In effect the behavior is exactly defined in this section As a first 
approximation.


Right, the result *is* well defined, but has no *predetermined* dynamic 
encoding.


The entire mess results from the bad interpretation of RawByteString 
assignments, which IMO was well thought out by the Delphi language 
architects, but not understood by the Delphi compiler coders. This 
interpretation also found its way into FPC:


Less intuitive is probably that when a RawByteString is assigned to an 
AnsiString(X), the same happens: no code page conversion[...]


It's clear that a conversion *can* be omitted for every assignment *to* 
a RawByteString. That's one of the purposes of that type - to avoid 
excess conversions into CP_ACP or UnicodeString.


But it's unclear why the heck the conversion should be omitted on 
assignment to any *other* AnsiString type, as soon as the source string 
is a RawByteString???


Therefore I'd suggest a compiler switch, implementing the lame Delphi 
compatible behaviour only on *demand*, while the FPC default would force 
any necessary conversions with *every* assignment to any other 
(non-CP_NONE) AnsiString type. This simple change will safely prevent 
strings whose static and dynamic encodings differ, so that the 
corresponding tests can be removed safely from library *and* user code.



The proper use of RawByteStrings deserves further documentation, for 
users who want/need their own (generic) stringhandling routines. Topics 
should be:

- how to determine the dynamic encoding of strings (StringCodePage)
- how to force required conversions (SetCodePage)
- how to deal with strings of different encodings
- how to minimize the number of string conversions
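
As a starting point for the first two topics, here is a minimal sketch, assuming an FPC version with code page aware AnsiStrings. The helper name Report and the variable names are hypothetical. Which code pages get printed depends on the platform and on how string literals are handled, so no output is claimed.

```pascal
program RawDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: a RawByteString parameter accepts any AnsiString
// without forcing a conversion, so the routine can inspect the dynamic
// code page that the string carries.
procedure Report(const s: RawByteString);
begin
  WriteLn('dynamic code page: ', StringCodePage(s),
          ', length in bytes: ', Length(s));
end;

var
  u: UTF8String;
  r: RawByteString;

begin
  u := 'abc';
  Report(u);                       // passed as-is, no conversion to CP_ACP
  r := u;                          // r keeps whatever dynamic code page u has
  SetCodePage(r, 1252, True);      // convert the data and relabel it as 1252
  Report(r);
  SetCodePage(r, CP_UTF8, False);  // relabel only, the bytes stay untouched
  Report(r);
end.
```

The third argument of SetCodePage is what separates a real conversion from a mere relabeling, which is also the easiest way to end up with a string whose label does not match its bytes.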

DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Hans-Peter Diettrich


Mattias Gaertner wrote:


For example:
CP_ACP=0, DefaultSystemCodePage=1252
That means static code page is always 0, while dynamic code page can be
0 or 1252. Both describe the same encoding.


A *dynamic* encoding *never* can be CP_ACP nor CP_NONE (in Delphi). 
These values are allowed only for *static* types in type declarations.

CP_UTF16 is also not allowed.

Delphi StringCodePage reports the current default codepage 
(DefaultSystemCodePage) for empty AnsiStrings, CP_UTF16 for all 
UnicodeStrings.



In section RawByteString:

the results of conversions from/to the CP_NONE code page are undefined.


... because CP_NONE is not a real code page.


The same for CP_ACP.

DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Hans-Peter Diettrich


Michael Schnell wrote:

So seemingly you could do MyStringType   = type 
AnsiString(CP_UTF16), and seemingly the size information is set 
according to this.


Not in Delphi XE.

DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Hans-Peter Diettrich


Jonas Maebe wrote:


Technically, that section literally states that they will be
concatenated without data loss and that the result is then converted to
the target string's encoding (except in case the target is
RawByteString). How that is implemented exactly is undefined; again in
the meaning of undefined, not in the meaning of undefined when
defined as meaning X.


In this case the implementation is "compiler specific", somewhat 
different from "undefined" (in a RawByteString):
"CP_NONE: this value indicates that no code page information has been 
associated with the string data. The result of any explicit or implicit 
operation that converts this data to another code page is undefined."


IMO the result is well defined: it's the string with the encoding of 
that other codepage. An undefined result, as I understand it, would 
mean the result can be anything, unrelated to the function input.


The branch taken in execution of an IF statement also is not 
undefined, only because it depends on the actual condition value.


The value of a local variable initially is undefined, i.e. can be any 
value. But after an assignment it *is* defined, even if that value still 
may be *unpredictable* by static code analysis.


IMO a better wording should be found, that does not cause the current 
obvious confusion of some readers.




Regarding RawByteStrings there has been the definition a RawByteString
has exactly the same behavior as assigning that AnsiString(X) to another
AnsiString(X) variable with the same value of X: no code page conversion
or copying occurs. Seemingly this is not true for the intermediate
results of concatenations.


That paragraph only specifies that code page-aware strings are
concatenated without data loss, and then defines to which code page the
result will be converted before assigning it to the target.


What's the meaning of no copying occurs? Of course the reference to 
the string is copied into the target variable!


What's the same value of X, in case of AnsiString(CP_ACP) and 
AnsiString(DefaultSystemCodePage)?




Even if the intermediary result of a concatenation would be a
RawByteString (which is not stated nor necessarily ever the case), then
the above would apply and hence the (dynamic) code page of that
RawByteString would be the one as defined by the above-mentioned rules
before it would be assigned to the target.


Please note that the other statements refer to *static* encodings, 
therefore my question about the (assumed) static encoding of an 
intermediate result. When the compiler inserts a conversion request 
based on *static* encodings, will it or will it not insert such a 
request before an intermediate result is assigned to the target variable?



Suggestion:

During string operations the source strings are converted [to CP_ACP?] 
when they have a different [dynamic?] encoding. When the result is 
stored in a variable, it is converted as required by the static encoding 
of the target.


Where "as required" means that a static target encoding of CP_ACP is 
replaced by the DefaultSystemCodePage, while CP_NONE does not require a 
conversion.


The CP_ACP case should be clarified as well, because it's unclear 
whether CP_ACP(=0) is *considered* equal to the current 
DefaultSystemCodePage, even if both values are *always* different (see 
above). The use of CP_ACP instead of DefaultSystemCodePage can be 
confusing and should be avoided or clarified before.


Perhaps it would help to concentrate on the following steps:
1) (string) operand fetch
2) (string) operations
3) (string) assignment

1) Fetching an operand removes any information about the static encoding 
of the source, only its dynamic encoding persists.
[Now the handling of non-AnsiString sources can be explained, like for 
literals, ShortString etc.

RawByteString is not special here, it's only a static encoding.
]

2) String operations take into account the dynamic encoding of their 
operands, with lossless conversions inserted as required.


3) When a string is assigned to a variable, it is converted, where 
necessary, as required by the static encoding of the target, with 
possible data loss.

[about required see above.
Special case: when the source is a variable, no conversion occurs when 
the *static* source and target types are compatible.

What exactly is compatible with CP_ACP?
]

DoDi


Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support

2014-11-26 Thread Sven Barth


On 26.11.2014 19:54, Hans-Peter Diettrich wrote:

UTF-16 is not a valid value for CP_ACP in Delphi, because it's a 2-byte
encoding. Even if the Delphi architects may have thought about an common
string type, with a variable element size (1,2,4), this certainly turned
out soon as a stupid idea, so that AnsiString and
WideString/UnicodeString still are strictly distinct types. WideString
and UnicodeString imply UTF-16, with platform specific byte order
(endianness). The latter becomes important almost only to compiler and
library coders, in host/network byteorder conversions. For the sake of
completeness, pdp-11 processors use yet another byte order, maybe more
word-based processors (DG...) as well.


Just a little remark: please don't throw in WideString, which is a 
completely different type and only there for easy compatibility with COM 
and other Windows APIs. Unlike UnicodeString, this type is not reference 
counted, for example, nor does it have the code page and element size 
information that an Ansi-/UnicodeString has.
(In FPC WideString is the same as UnicodeString for all non-Windows 
platforms.)


Regards,
Sven
