Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 12/03/2014 12:52 AM, Hans-Peter Diettrich wrote:
> In Delphi *no* string can have a dynamic encoding of CP_NONE or CP_ACP,

If you really do have "dynamic" strings, then obviously the *definition*
(i.e. CP_...) of such strings is strictly static (for compiler use only)
and can never be used as the *dynamic* notation of the *current*
encoding (in the EncodingType field).

IMHO a different implementation is not workable.

-Michael
___
fpc-devel maillist - [email protected]
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 12/03/2014 12:52 AM, Hans-Peter Diettrich wrote:
> You forget that Jonas refers to *dynamic* string encodings, unknown at
> compile time.

??? In your other mail you pointed out that fpc (unlike Delphi) does
not provide *dynamic* string encoding with RawByteString (and where else
would it be supported?).

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 12/03/2014 10:42 AM, Michael Schnell wrote:
> That is why I tried to invent a concept

BTW: I can't help with the implementation, but I'll be happy to do
testing and write documentation (e.g. in Wiki format).

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 12/03/2014 05:02 AM, Hans-Peter Diettrich wrote:
> Michael Schnell wrote:
>> - It does not result in additional conversions.
> It does, e.g. in searching or sorting of a StringList, when it can
> contain strings of different encodings. The choice of a unique encoding
> for application strings (maybe CP_ACP, UTF-8 or UTF-16) eliminates such
> conversions.

If multiple encoding brands are involved, a system without DynamicString
will also need to do conversions. So DynamicString does not impose
*additional* conversions.

>> So the "Checking Overhead" is nothing but a rumor. (Remember, I don't
>> suggest dropping the standard "statically typed" paradigm altogether,
>> as tight loops of course work best that way.)
> The rumor is the unimportant "Conversion Overhead", i.e. how often a
> check leads to a conversion. When no check is required, conversions
> consequently cannot occur at all.

Please re-read the text I wrote.
- If DynamicString is not used in the user code, the compiler creates
the same code as before. So no overhead.
- If DynamicString is used (in user code or in a library interface),
but only a single encoding brand is used everywhere where statically
encoded strings are in place ("a single program-wide string
representation", as you suggested in your previous mail), the only
runtime overhead is that, at the locations where DynamicString is used
(i.e. not in any tight loops), the compiler inserts an additional check
of the "EncodingType" variable. Here (unless the user actively decides
to create string variables with encoding brands other than the
program-wide default) the code *always* finds at runtime that no
conversion is necessary and acts as if the string were not dynamic but
already "correct". The overhead of checking is obviously at most some
5 ASM instructions and hence negligible compared to the function call
needed to enter the library function in question.

> RawByteString cannot serve two different purposes :-(

As I pointed out as well: a variable's encoding brand can't be static
and dynamic at the same time. This is the cause of the major
misconception imposed by Delphi regarding RawByteString. And this is why
I would leave RawByteString aside (as it is / as it is assumed to be /
whatever) and, for any improvement, use a completely new type name and a
"CP_ANY" constant / value.

> In *Delphi* it is used as a polymorphic string, capable of *holding*
> actual strings of any encoding. But when assigned to a variable of a
> different encoding, a conversion may occur that converts the string
> into the declared (static) encoding of the target variable.

Seemingly rather close to what I suggest as "DynamicString". But (see
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support
) with a dynamic string, the encoding brand number of such a string type
would never be allowed to be written into the EncodingType field in the
string header.

If this were true, why do the Delphi docs discourage making decent use
of the dynamic feature of RawByteString?

Anyway, a "dynamic" string type only makes sense if it is used in as
many library interfaces as possible (and in TStrings). This is not done
in Delphi. There it is not nice, in many cases restricting how the user
can make use of these libraries, but it is not as critical as with fpc,
where you need to consider portability issues.

> In *FPC* it currently is used somewhat close to your idea, i.e. no
> conversion occurs when a RawByteString is assigned *to or from* some
> other AnsiString.

As said, to avoid ambiguity, I vote for adding yet another string type
name (e.g. "ByteString", denoted by CP_BYTE) that is *known* to disallow
any conversion (and for leaving RawByteString as close as possible to
the moving target Delphi presents).

> I understand the FPC attempt to allow *at the same time* for the new
> (encoded) and old (unencoded) AnsiString behaviour, where no automatic
> conversions are allowed. But this would require at the same time that
> e.g. all string literals *also* are stored in that (immutable)
> encoding, and that this encoding can *not* be changed at runtime, while
> DefaultSystemCodePage *can* be changed.

I feel that this (simplified) attempt can't result in a decent paradigm.
It is close to impossible to completely describe the behavior in an
understandable way, and it's prone to a lot of ambiguity.

That is why I tried to invent a concept that I suppose might work and
will not break (much) existing code. It is intended to be "straight"
from the ground up (it is not even necessary to assume that the content
of a "String" is printable/readable, but it should easily work for that
application). It would allow for making flexible use of strings with
understandable and easy to use syntax candy, and would no longer impose
restrictions on portability. IMHO it would not impose (noticeable)
performance degradation, either.

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Michael Schnell wrote:
> On 11/29/2014 07:55 AM, Jonas Maebe wrote:
>> Exactly the same goes for converting strings with code page CP_NONE to
>> a different code page: your program is broken when it tries to do that,
> While accessing an array beyond its bounds is not detectable at compile
> time, and accessing an array beyond its bounds when range checking is
> switched off is technically not detectable at runtime, so that
> *undefined* can't be avoided, the attempt to convert strings with code
> page CP_NONE to a different code page is easily detectable by the
> compiler, as we have predefined string variable type "brands" here.
> Thus, if the outcome is *defined* *to* *be* *undefined*, it can and
> should result in a compiler error message.

You forget that Jonas refers to *dynamic* string encodings, unknown at
compile time. At runtime the dynamic encoding of every string is stored
together with the string data, like the size of a dynamic array is
stored together with the array data.

In Delphi *no* string can have a dynamic encoding of CP_NONE or CP_ACP,
so that nothing can be broken. In fact all CP_xxx constants are private
in System.pas; they are not available to user or library code.
SetCodePage (i.e. the RTL/OS function for casting AnsiString into
UnicodeString) replaces 0 (CP_ACP) by DefaultSystemCodePage before a
conversion, and returns an empty string for an unknown target codepage,
like $ (CP_NONE).

For the curious: for the exact behaviour of SetCodePage see
MultiByteToWideChar (on Windows) and UnicodeFromLocaleChars (on POSIX),
which are ultimately used by Delphi to perform (the first step of) an
encoding conversion. For MultiByteToWideChar see the list of allowed
CP_xxx constants, as #defined in windows.h, how they are replaced, and
what shit may happen to your strings when using them. The function
returns 0 if it does not succeed; since this result is used to determine
the required buffer size (length of the resulting string), the resulting
string then is empty.

DoDi
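
The difference between the declared (static) and the stored (dynamic) encoding discussed above can be looked at with a small Free Pascal sketch. It assumes FPC 3.0 or later; StringCodePage and SetCodePage are RTL routines, while the program name, the variable names and the choice of code page 1252 are only illustrative assumptions. The printed values depend on compiler version and platform, so none are claimed here.

```pascal
program ShowDynamicEncoding;
{$mode objfpc}{$H+}

var
  U: UTF8String;     // declared (static) code page: CP_UTF8
  R: RawByteString;  // declared (static) code page: CP_NONE

begin
  U := 'abc';
  // Assigning to a RawByteString copies the reference; the code page
  // stored in the string header travels with the data.
  R := U;

  WriteLn('dynamic code page of U: ', StringCodePage(U));
  WriteLn('dynamic code page of R: ', StringCodePage(R));

  // Ask the RTL to convert the data and update the stored code page.
  // 1252 is an arbitrary example value.
  SetCodePage(R, 1252, True);
  WriteLn('dynamic code page of R after SetCodePage: ', StringCodePage(R));
end.
```

Running it on the target platform shows which value the string header actually carries at each step, which is the "dynamic encoding" the messages above argue about.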
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Michael Schnell wrote:
> On 11/28/2014 09:15 PM, Hans-Peter Diettrich wrote:
>> Apart from that, every encoding-tolerant code will execute much slower
>> than code without a need for checks and conversions everywhere.
> As I pointed out I don't agree at all.
> - The check is only two ASM instructions
> - It does not result in additional conversions.

It does, e.g. in searching or sorting of a StringList, when it can
contain strings of different encodings. The choice of a unique encoding
for application strings (maybe CP_ACP, UTF-8 or UTF-16) eliminates such
conversions.

> So the "Checking Overhead" is nothing but a rumor. (Remember, I don't
> suggest dropping the standard "statically typed" paradigm altogether,
> as tight loops of course work best that way.)

The rumor is the unimportant "Conversion Overhead", i.e. how often a
check leads to a conversion. When no check is required, conversions
consequently cannot occur at all.

>> RawXxxString can be used for really "uncoded" data as done with
>> old-style strings in a lot of applications. Such a feature would be
>> appreciated by many users, indeed :-)
> But why do you say "would be appreciated"? Is it not possible to use
> "RawByteString" in the way the name suggests, by never bringing it
> together with any String variable of a different encoding brand and
> hence avoiding any conversion, be it intentional/documented/useful or
> not?

RawByteString cannot serve two different purposes :-(

In *Delphi* it is used as a polymorphic string, capable of *holding*
actual strings of any encoding. But when assigned to a variable of a
different encoding, a conversion may occur that converts the string into
the declared (static) encoding of the target variable.

In *FPC* it currently is used somewhat close to your idea, i.e. no
conversion occurs when a RawByteString is assigned *to or from* some
other AnsiString.

We can only *hope* that *all* AnsiString operations are based on the
dynamic encoding of every operand, with corresponding checks and
conversions inserted everywhere. This actually is not true, because the
compiler relies on the static encoding of AnsiString variables, and
inserts checks and conversions only when that encoding is different.
Actually a single AnsiString type would be sufficient, because it
already can hold data of any encoding :-(

I understand the FPC attempt to allow *at the same time* for the new
(encoded) and old (unencoded) AnsiString behaviour, where no automatic
conversions are allowed. But this would require at the same time that
e.g. all string literals *also* are stored in that (immutable) encoding,
and that this encoding can *not* be changed at runtime, while
DefaultSystemCodePage *can* be changed.

When the result of a conversion of a string of encoding CP_NONE is
undefined, which of course is correct for the *dynamic* encoding, this
could simply be changed into "conversions of CP_NONE strings do
nothing". Then CP_NONE would be the perfect encoding for old-style
AnsiStrings, the only remaining problem being string expressions and
assignments where the operands have different dynamic encodings. In
these cases all operands would have to be converted into the CP_NONE
encoding, as specified in another DefaultNoneEncoding constant (not
variable!); the same encoding would apply in assignments *to* variables
of a different encoding. Then all type aliases for AnsiStrings must also
have unique names, which allow one to distinguish e.g.
  type UTF8String = AnsiString;
from
  type NewUTF8String = type AnsiString(CP_UTF8);

DoDi
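
The closing distinction between a plain alias and a distinct, code-page-tagged type can be written out as a compilable sketch. This assumes FPC 3.0 or later, where `type AnsiString(CP_UTF8)` is accepted syntax; the type names TPlainAlias and TTaggedUTF8 are hypothetical stand-ins chosen so that System.UTF8String is not shadowed, and the printed values are deliberately not stated because they depend on the platform's default code page.

```pascal
program AliasVersusTaggedType;
{$mode objfpc}{$H+}

type
  // Plain alias: the same type as AnsiString, no code page of its own.
  TPlainAlias = AnsiString;
  // Distinct type with a declared (static) code page.
  TTaggedUTF8 = type AnsiString(CP_UTF8);

var
  A: TPlainAlias;
  B: TTaggedUTF8;

begin
  A := 'abc';
  B := 'abc';
  // Print the code page stored with each string's data.
  WriteLn('TPlainAlias, dynamic code page: ', StringCodePage(A));
  WriteLn('TTaggedUTF8, dynamic code page: ', StringCodePage(B));
end.
```

Only the second declaration gives the compiler a static encoding it can compare against when deciding whether to insert a conversion.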
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On Tue, 02 Dec 2014 13:31:44 +0100 Michael Schnell wrote:
> [...]*defined* *to* *be* *undefined*

Ooh, that is soo meta. lol

Mattias
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/29/2014 05:36 PM, Hans-Peter Diettrich wrote:
> As Delphi doesn't allow for a dynamic encoding of CP_NONE, I don't
> understand the purpose of the FPC description.

As you suggested in the other mail, the Delphi implementation of
RawByteString is seriously flawed, and this was supposedly introduced by
the compiler implementers not having understood what the language
architects had in mind. (Documentation writers unsuccessfully tried to
make sense of what the implementation provides.)

> Now in turn some FPC developer might have misunderstood the (Delphi)
> handling of RawByteStrings, assuming that it were okay to omit a
> conversion in an assignment of RawByteString to an AnsiString of a
> different encoding.

IMHO a poor attempt to correct things in a way that is a little bit
compatible both with the senseless description in the Delphi docs and
with the flawed implementation. IMHO this can't result in anything
useful, and fpc with RawByteString should either react with a compiler
error or imitate the flawed behavior as far as possible.

If we really want a dynamic behavior (which IMHO would be very welcome),
we need a decent implementation of it, and this is very incompatible
with RawByteString and hence needs an additional type name (I suggest
DynamicString) and encoding brand number (I suggest CP_ANY = $FF00).

> That's why I think that the incorrect handling of such RawByteString
> assignments in FPC should be fixed,

That's why *I* think that the incorrect handling of such RawByteString
assignments in FPC should not be fixed, ;-)

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/29/2014 07:55 AM, Jonas Maebe wrote:
> Exactly the same goes for converting strings with code page CP_NONE to
> a different code page: your program is broken when it tries to do that,

While accessing an array beyond its bounds is not detectable at compile
time, and accessing an array beyond its bounds when range checking is
switched off is technically not detectable at runtime, so that
*undefined* can't be avoided, the attempt to convert strings with code
page CP_NONE to a different code page is easily detectable by the
compiler, as we have predefined string variable type "brands" here.
Thus, if the outcome is *defined* *to* *be* *undefined*, it can and
should result in a compiler error message.

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 12/02/2014 01:05 PM, Michael Schnell wrote:
> But why do you say "would be appreciated"? Is it not possible to use
> "RawByteString" in the way the name suggests, by never bringing it
> together with any String variable of a different encoding brand and
> hence avoiding any conversion, be it intentional/documented/useful or
> not?

Of course you can't use any TStrings descendant (such as TStringList) in
such code, because, as in Delphi, TStrings is based on a statically
typed String brand. This would be made possible by introducing
DynamicString and using this type for TStrings and friends.

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/28/2014 09:15 PM, Hans-Peter Diettrich wrote:
> You suggested to use "string" as UTF-16 on Windows, and UTF-8 on Linux.
> That's what I understand as a unique program-wide string representation
> (not sourcecode-wide, instead program as *compiled*). Then I cannot see
> any need or use for another DynamicString type.

I did already understand your meaning, and I understand that this
"unique program-wide string representation" is better than having the
libraries' APIs (including TStrings) force a fixed string encoding
brand, independently of the OS we compile for (and of selectable $mode
specifications). But I don't *suggest* this way, as it is not very
versatile and hampers portability. As said, I *suggest* using
DynamicString in such cases. Nonetheless, the types simply called
"String" might be done in the way you suggest.

> Nothing can be broken, as long as the Delphi behaviour is undefined.

That of course is correct, but it just follows the poor excuse
Embarcadero offers for the flawed implementation of RawByteString (which,
as we both agree, will never be fixed). (In fact there are many
instances where old flaws have been deliberately reproduced so as not to
break compatibility.)

> Applied to FPC/Lazarus code (compiler, libraries, IDE...) this means
> that it's obviously easier to *prevent* possibly different
> static/dynamic encodings, instead of *checking and reacting* on such
> flaws throughout the entire codebase.

OK. Kill the type RawByteString and the constant CP_NONE and the
usability of its value $. I do vote for doing so and instead providing
new types such as ByteString, WordString, DWordString and QWordString,
denoted by the constants CP_Byte = $FF01, CP_Word = $FF02,
CP_DWord = $FF04, CP_QWord = $FF08.

> Apart from that, every encoding-tolerant code will execute much slower
> than code without a need for checks and conversions everywhere.

As I pointed out I don't agree at all.
- The check is only two ASM instructions.
- It does not result in additional conversions. In fact, in appropriate
cases it can avoid a huge number of conversions (especially when calling
libraries, e.g. by means of TStrings).
- In pure user code, the check is only done if DynamicString really is
used in the user code, hence only when the user knows what to do. In
fact, typical degradation = 0%.
- When calling libraries (e.g. via TStrings), the check is very small
compared to the function call made by the same statement. Estimated
typical degradation = 0.01%.

So the "Checking Overhead" is nothing but a rumor. (Remember, I don't
suggest dropping the standard "statically typed" paradigm altogether,
as tight loops of course work best that way.)

>> That is why fpc would need to define an additional type name (e.g.
>> "DynamicString") and encoding brand number (e.g. "CP_ANY" = $FF00)
>> for a decently usable type for intermediately holding a String
>> content.
> This again would make *FPC* programs incompatible with Delphi.

As I explained at length, this would not break any backwards
compatibility, even if TStrings uses this type.
- The new type is just additional, so its mere existence can't break
anything: you don't need to use it in user code if you don't want to.
- The use of DynamicString in the interface of library functions does
not break anything, as it is (to be) constructed in a way that provides
full compatibility.

Please do show any code (not containing RawByteString) that is not
compatible when using the DynamicString paradigm as described in
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support#Analysis .
Maybe the page needs to be improved.

> While fixing the RawByteString flaw would at least allow to *compile*
> FPC code with Delphi, the use of a different encoding value would
> definitely prevent compilation of such code with Delphi. What's the
> more serious incompatibility?

IMHO this would be much more dangerous than introducing a decently
working new DynamicString type.

> RawXxxString can be used for really "uncoded" data as done with
> old-style strings in a lot of applications. Such a feature would be
> appreciated by many users, indeed :-)

While I would happily follow your suggestion of making "indecent" use of
this type impossible in the fpc compiler, I don't think it's very
dangerous to re-introduce the abysmal Delphi-compatible behavior of
RawByteString (the documented as well as the undocumented "features").

But why do you say "would be appreciated"? Is it not possible to use
"RawByteString" in the way the name suggests, by never bringing it
together with any String variable of a different encoding brand and
hence avoiding any conversion, be it intentional/documented/useful or
not?

Anyway: I added a sentence to the introduction of the wiki page,
explaining the paradigm a little more explicitly.

-Michael
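
The "check first, convert only if needed" pattern argued for above can be approximated with today's RTL. The following is a minimal sketch assuming FPC 3.0 or later. The helper ToUTF8 and the program name are hypothetical, RawByteString merely stands in for the proposed DynamicString type (which does not exist), and the sketch says nothing about how many instructions a compiler-generated check would really cost.

```pascal
program CheckThenConvert;
{$mode objfpc}{$H+}

// Hypothetical helper: accepts a string of any dynamic encoding and
// returns it as UTF-8, converting only when the stored code page differs.
function ToUTF8(const S: RawByteString): UTF8String;
var
  Tmp: RawByteString;
begin
  Tmp := S;
  // The "check": compare the code page stored with the data.
  if StringCodePage(Tmp) <> CP_UTF8 then
    // The "conversion": only reached when the encodings really differ.
    SetCodePage(Tmp, CP_UTF8, True);
  Result := Tmp;
end;

var
  L: RawByteString;

begin
  L := 'abc';
  // Re-tag the bytes as code page 1252 without converting them
  // (1252 is an arbitrary example value).
  SetCodePage(L, 1252, False);
  WriteLn('before: ', StringCodePage(L));
  WriteLn('after : ', StringCodePage(ToUTF8(L)));
end.
```

When every string in a program already carries the same code page, the branch with SetCodePage is never taken, which is the situation described in the message as costing only the check itself.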
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Jonas Maebe wrote:
> On 28/11/14 21:30, Hans-Peter Diettrich wrote:
>> I prefer to specify and document everything *before* coding, so that
>> everybody can expect that the code will behave as specified.
> If certain behaviour is explicitly undefined, it *is* specified and
> documented. It means that your program is buggy if it triggers such
> behaviour, and that the effect of triggering it could be anything.
> [...] An example from FPC itself is accessing an array beyond its
> bounds when range checking is switched off.

After this hint I reviewed the "Code page identifiers" section again,
and could probably find the source of the misunderstandings.

>> CP_NONE: this value indicates that no code page information has been
associated with the string data. The result of any explicit or implicit
operation that converts this data to another code page is undefined. <<

Does this mean "CP_NONE is not an allowed *dynamic* (string *data*)
encoding", just like any other undefined encoding value? In this case
the description is correct, but it describes a special case of some
*undefined* general rule about valid and invalid dynamic encodings in
general. Then this general rule should be documented first, not only for
CP_NONE. Then also the documentation of the *intended* purpose of
CP_NONE, for the *static* encoding of the RawByteString type, is missing
entirely.

As Delphi doesn't allow for a dynamic encoding of CP_NONE, I don't
understand the purpose of the FPC description.

Now in turn some FPC developer might have misunderstood the (Delphi)
handling of RawByteStrings, assuming that it were okay to omit a
conversion in an assignment of RawByteString to an AnsiString of a
different encoding.

That's why I think that the incorrect handling of such RawByteString
assignments in FPC should be fixed, according to the general rule of
assignments to a string of a different (static) encoding. CP_NONE
definitely *is* different from any other encoding, and Delphi does not
define an exception for RawByteStrings.

> Exactly the same goes for converting strings with code page CP_NONE to
> a different code page: your program is broken when it tries to do that,
> and we cannot guarantee any outcome. This is exactly what "the
> behaviour is undefined" means.

When a string *really* has a *dynamic* encoding of CP_NONE, this of
course is illegal and thus will produce an undefined result. ACK, so
far. But since Delphi (quietly) changes a SetCodePage to CP_NONE into
the current CP_ACP, the undefined situation (invalid dynamic encoding)
must have been forced by some illegal *hack* before, or in the FPC case
by some erroneous (not Delphi conforming) RTL code.

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 28/11/14 21:30, Hans-Peter Diettrich wrote:
> I prefer to specify and document everything *before* coding, so that
> everybody can expect that the code will behave as specified.

If certain behaviour is explicitly undefined, it *is* specified and
documented. It means that your program is buggy if it triggers such
behaviour, and that the effect of triggering it could be anything.

This is standard practice in computer science. E.g., pretty much every
manual of every processor contains descriptions of explicitly undefined
behaviour (search e.g. for "undefined" in the Intel or ARM architecture
manuals).

An example from FPC itself is accessing an array beyond its bounds when
range checking is switched off. *Some* of the possible outcomes are
accessing a value from a variable declared before/after it, accessing
random data that has nothing to do with any of those variables, a
program crash, or actually accessing an element of the array anyway. We
don't guarantee that any of those possibilities will happen, we don't
say that those are the only possibilities, we don't say they stay the
same across compiler or OS versions, or even across program executions.
Hence, it's undefined.

Exactly the same goes for converting strings with code page CP_NONE to a
different code page: your program is broken when it tries to do that,
and we cannot guarantee any outcome. This is exactly what "the behaviour
is undefined" means.

Jonas
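
The array example can be tried out with a small sketch, assuming FPC with the SysUtils unit, which turns run-time errors into exceptions. With range checking enabled through {$R+}, the out-of-bounds store below is expected to be caught as ERangeError; with {$R-} the same store falls under the undefined behaviour described above and no outcome is promised. The program and variable names are illustrative only.

```pascal
program RangeCheckSketch;
{$mode objfpc}{$H+}
{$R+}  // range checking on; with R- the store below would be undefined

uses
  SysUtils;  // maps run-time errors to exceptions such as ERangeError

var
  A: array[0..3] of Integer;
  I: Integer;

begin
  // Computed at run time so the compiler cannot reject the index statically.
  I := 4 + ParamCount;
  try
    A[I] := 1;  // index is outside 0..3
    WriteLn('no range error raised');
  except
    on E: ERangeError do
      WriteLn('range check failed: ', E.Message);
  end;
end.
```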
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Michael Schnell wrote:
> I fear that there will be code that relies on the "flawed" behavior of
> RawByteString ("it's a feature, not a bug") and using the same name
> with different behavior would break it. And a really usable
> DynamicString would not adhere to that description.

How can somebody "rely" on behaviour *stated* as undefined, or not
working as defined?

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Jonas Maebe wrote:
> I'm sorry, but I simply cannot discuss with people that, when I
> literally state "the result is undefined", think that I may actually
> have meant "the result is defined and if you change the implementation
> and/or keep it stable across compiler releases, then it will also
> conform to whatever you think that this defined behaviour should be". I
> don't have the energy nor the patience for that.

I also have no use for continuing such discussions.

I prefer to specify and document everything *before* coding, so that
everybody can expect that the code will behave as specified.

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Michael Schnell wrote:
> On 11/27/2014 03:44 PM, Hans-Peter Diettrich wrote:
>> An *efficient* implementation would be based on a single program-wide
>> string representation, with different encodings being handled only in
>> an exchange with external data sources.
> Yep. But it would result in severe user code portability issues (see
> above). IMHO using DynamicString at the correct locations would not be
> (noticeably) less efficient but a lot more versatile.

You suggested to use "string" as UTF-16 on Windows, and UTF-8 on Linux.
That's what I understand as a unique program-wide string representation
(not sourcecode-wide, instead program as *compiled*). Then I cannot see
any need or use for another DynamicString type.

> I also don't think we will ever see a fix for the poor implementation
> of RawByteString (avoiding the word flaw and the suggestion of a bad
> purpose), because it would break existing user code.

Nothing can be broken, as long as the Delphi behaviour is undefined.
Code relying on specific compiler/library bugs is bound to that
compiler, not portable in any way.

> Regarding fpc, "correcting the flaws" and keeping the name
> RawByteString would result in incompatibility issues vs Delphi and
> breaking code that will be ported from Delphi.

Same as above. When application code works properly with strings of
*sometimes* different static and dynamic encoding, it will not stop
working with strings of *never* different encodings.

Of course the opposite is not true. When some code works properly (only)
with strings of the same static and dynamic encoding, it will stop
working when compiled with Delphi. Then the coder has to insert explicit
checks for the dynamic encoding of *all* strings, all over his code.

Applied to FPC/Lazarus code (compiler, libraries, IDE...) this means
that it's obviously easier to *prevent* possibly different
static/dynamic encodings, instead of *checking and reacting* on such
flaws throughout the entire codebase.

Apart from that, every encoding-tolerant code will execute much slower
than code without a need for checks and conversions everywhere. I
seriously doubt that the FPC developers ever realized these
consequences, and the amount of time required for finding, reporting and
fixing the bugs in all affected pieces of their code :-(

> That is why fpc would need to define an additional type name (e.g.
> "DynamicString") and encoding brand number (e.g. "CP_ANY" = $FF00) for
> a decently usable type for intermediately holding a String content.

This again would make *FPC* programs incompatible with Delphi. While
fixing the RawByteString flaw would at least allow to *compile* FPC code
with Delphi, the use of a different encoding value would definitely
prevent compilation of such code with Delphi. What's the more serious
incompatibility?

RawXxxString can be used for really "uncoded" data as done with
old-style strings in a lot of applications. Such a feature would be
appreciated by many users, indeed :-)

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/27/2014 07:29 PM, Hans-Peter Diettrich wrote:
> Michael Schnell wrote:
>> E.g. there are at least two "Code pages" for UTF-16 ("LE" and "BE")
>> that would be worth supporting.
> You are confusing codepages and encodings :-(

That is why I put quotation marks around "Code pages". I used this
wording because fpc (and Delphi?) uses it, abbreviated as "CP", in the
constant names "CP_UTF8", "CP_UTF16" and "CP_UTF16BE" [see Jonas' post:
"CP_UTF16 and CP_UTF16BE can be returned by StringCodePage() when called
on a unicodestring, and that's it."]

> See it as a multi-level protocol for text processing.

Yep. I see that it is workable and I understand the (supposedly mostly
historical) reasons. But IMHO it is not a good (i.e. crafted from the
ground up) concept.

> It's known that the Delphi AnsiString implementation is flawed,...

And hence it's frustrating to see that fpc needs to follow for
compatibility reasons. That is why I suggested an improved
implementation (see ->
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support).
The seriously flawed Delphi-compatible use of the dynamic encoding-brand
(and bytes-per-element) information (only implemented with
RawByteString) can be left as it is, and a decent implementation with a
new DynamicString type (CP_ANY) should be crafted.

> I see no problem in using the same names and values. Delphi documents
> clearly state: ...

I fear that there will be code that relies on the "flawed" behavior of
RawByteString ("it's a feature, not a bug") and using the same name with
different behavior would break it. And a really usable DynamicString
would not adhere to that description.

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/27/2014 03:44 PM, Hans-Peter Diettrich wrote:
The "universal paradigm" would allow for extensions (e.g. UTF-32,
multiple 16 Bit Code pages, an additional fully dynamic String type,
n-byte "un-encoded" string types), as I described in the Wiki page.
Even if feasable, such arbitrary string storage can dramatically
increase the number of implicit string conversions.
Of course it can do harm in that respect, if the user is silly enough to
*explicitly* define variables in a brand without thinking about what he
is doing. But this is exactly the same when he just uses the stuff
currently offered by Delphi and fpc. If you arbitrarily define code pages
for the variables holding your 8 bit ("ANSI") strings, you will enforce
many conversions.
Currently in Delphi if you don't define special code pages anything will
be UTF-16. So no unnecessary conversions.
In fpc (and maybe Lazarus, as well) I suppose the way currently in the
works is (when not changing the Default behavior by certain options):
- when compiling for Windows, "String" is UTF-16, and the RTL and LCL
ubiquitously use "String": So no unnecessary conversion
- when compiling for Linux, "String" is UTF-8, and the RTL and LCL
ubiquitously use "String": So no unnecessary conversion, either.
If this is done in the libraries (e.g. RTL and LCL) and in user code,
it allows for as few conversions as possible and thus the best
performance. Here, you would need different library binaries, which might
or might not be a problem.
But of course the portability is very questionable (including, but not
limited to, the fact that the result of "pos" is different).
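To illustrate the "pos" remark: a minimal sketch, written in Python rather than Pascal and meant purely as an illustration, of how the 1-based index of the same character differs depending on whether it is counted in UTF-8 or UTF-16 code units. The helper names pos_utf8 and pos_utf16 are made up for this example and are not RTL functions.

```python
def pos_utf8(needle: str, haystack: str) -> int:
    """1-based index of needle, counted in UTF-8 code units (bytes); 0 if absent."""
    return haystack.encode("utf-8").find(needle.encode("utf-8")) + 1


def pos_utf16(needle: str, haystack: str) -> int:
    """1-based index of needle, counted in UTF-16 code units; 0 if absent."""
    h = haystack.encode("utf-16-le")
    n = needle.encode("utf-16-le")
    # Step on 2-byte boundaries so a match never straddles two code units.
    for i in range(0, len(h) - len(n) + 1, 2):
        if h[i:i + len(n)] == n:
            return i // 2 + 1
    return 0


# 'ä' occupies two code units in UTF-8 but only one in UTF-16,
# so the same search reports different positions.
text = "\u00e4b"
index_in_utf8 = pos_utf8("b", text)
index_in_utf16 = pos_utf16("b", text)
```

Code that stores or compares such indices is therefore tied to the element size of the string type it was written for.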
When (on top of this) doing the interfaces to libraries (including
TStrings) with "DynamicString" (encoding brand "CP_ANY"), no additional
conversions would be necessary: all other Strings use the same encoding
brand (either UTF-16 or UTF-8, depending on the OS), and hence the
dynamic encoding of all DynamicStrings used would always be exactly that
brand. Hence, IMHO, this would not harm at all, as the overhead needed
to just check the dynamic type brand and find that no conversion is
necessary is extremely small.
But now the user has a choice !
- If he does not do anything regarding the encoding brand of his
strings, he will not notice the existence of the DynamicString type at
all, not even performance-wise. (But he might encounter portability issues.)
- If he decides that he wants to use a dedicated encoding brand in all
or parts of his code, he of course needs to know what he is doing. This
can result
  - in improved portability (if decently done)
  - in improved performance (if decently done), e.g. by using one-byte
  strings for compactly storing the information and two-byte strings for
  e.g. search loops, or by using the best fitting encoding in the loops in
  the user code while allowing auto-conversion when accessing the
  libraries in case the underlying OS enforces a different encoding
  - in a disastrous increase of auto-conversions and thus performance
  degradation (if not decently done).
An *efficient* implementation would be based on a single program-wide
string representation, with different encodings being handled only in
an exchange with external data sources.
Yep. But it would result in severe user code portability issues (see
above). IMHO using DynamicString at the correct locations would not be
(noticeably) less efficient but a lot more versatile.
After all I have the impression that the known RawByteString flaws
will never be fixed in Delphi, in order to encourage the users to take
the step to UnicodeString. Now the question is whether these flaws are
fixed in FPC, or whether Lazarus will become the first project that
definitely requires a complete move to UnicodeString, for reliable
operation.
For best support of non-UTF-16 platforms I'd suggest to fix the flaws...
I also don't think we will ever see a fix for the poor implementation of
RawByteString (avoiding the word flaw and the suggestion of a bad
purpose), because it would break existing user code.
Regarding fpc, "correcting the flaws" and keeping the name RawByteString
would result in incompatibility issues vs Delphi and breaking code that
will be ported from Delphi.
That is why fpc would need to define an additional type name (e.g.
"DynamicString") and encoding brand number (e.g. "CP_ANY" = $FF00) for a
decently usable type for intermediately holding a String content (see
Wiki ->
http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support).
RawXxxString can be used for really "uncoded" data as done with
old-style strings in a lot of applications. Even if "seriously flawed"
auto-conversion might be implemented in fpc for RawByteString (for
Delphi compatibility), the user can easily avoid it by not directly
combining RAW and differently statically encoded strings in an operation.
-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 27 Nov 2014, at 17:11, Hans-Peter Diettrich wrote:
> Such statements come only from writers that do not believe that their words
> can be understood in various ways ;-)
I'm sorry, but I simply cannot discuss with people that, when I literally
state "the result is undefined", think that I may actually have meant "the
result is defined and if you change the implementation and/or keep it
stable across compiler releases, then it will also conform to whatever you
think that this defined behaviour should be". I don't have the energy nor
the patience for that.
Jonas
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Michael Schnell wrote:
On 11/26/2014 06:37 PM, Hans-Peter Diettrich wrote:
An AnsiString consists of AnsiChar's. The *meaning* of these char's
(bytes) depends on their encoding, regardless of whether the used
encoding is or is not stored with the string.
I understand that the implementation (in Delphi) seems to be driven more
by the Wording ("ANSI") than by the logical paradigm the language syntax
suggests. The language syntax and the string header fields suggest that
both the element-size as the code-ID-number need to be adhered to (be it
statically or dynamically - depending on the usage instance). E.g. there
are at least two "Code pages" for UTF-16 ("LE" and "BE") that would
be worth supporting.
You are confusing codepages and encodings :-(
UTF-7, UTF-8, UTF-16 and UTF-16BE describe different representations of
the same values (Unicode codepoints). And I agree, all commonly used
encodings should be implemented, at least for data import/export.
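A minimal sketch of that distinction, in Python rather than Pascal and only for illustration: one Unicode codepoint is serialized with three of the encodings named above, giving three different byte sequences that all decode back to the same value. The variable names are made up for this example.

```python
text = "\u00c4"  # 'Ä', a single Unicode codepoint (U+00C4)

# One codepoint, three byte-level representations (shown as hex).
representations = {
    codec: text.encode(codec).hex(" ")
    for codec in ("utf-8", "utf-16-le", "utf-16-be")
}

# Decoding each representation with its own codec yields the same codepoint.
roundtrip = {
    codec: bytes.fromhex(hexed).decode(codec)
    for codec, hexed in representations.items()
}
```

Here the UTF-16 "LE" and "BE" variants contain the same two bytes in opposite order, which is why they count as different encodings of one value rather than as different character sets.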
It's essential to distinguish between low-level (physical) AnsiChar
values, and *logical* characters possibly consisting of multiple
AnsiChars.
I now do see that the implementation is done following this concept. But
the language syntax and the string header field suggest a more versatile
paradigm, providing a universal reference counting "element string" type.
See it as a multi-level protocol for text processing. The bottom
(physical) level deals with physical storage items (AnsiChar,
WideChar...), and how they are stored in memory or files. Like it
doesn't make sense to deal with individual bytes of real numbers in
computations, it doesn't make sense to deal with individual bytes
(AnsiChars) of logical characters - except in type/encoding conversions.
Higher levels deal with logical values, which can consist of multiple
physical items, and may need different interpretations (in case of Ansi
codepages). This level is partially covered now by AnsiString encodings
and UTF-16 surrogate pairs, which allow mapping the values into full
Unicode (UCS-4) codepoints. But these codepoints still are not
sufficient for a correct interpretation and manipulation of logical
characters, which again can consist of multiple codepoints (decomposed
umlauts, ligatures...). In a next level another (mostly language
specific) interpretation may be required, like which logical characters
have to be treated together (ligatures, non-breaking characters...).
Some natural languages (Hebrew, Arabic...) require another special
handling of (mixed) LTR/RTL reading, and of "paths", influencing the
graphical representation of character sequences; but that's nothing an
application or library writer should have to deal with, such
functionality should be provided by the target platform.
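The levels described above can be made concrete with a small sketch, in Python rather than Pascal and purely illustrative: a decomposed umlaut is one logical character made of two codepoints, and a codepoint outside the Basic Multilingual Plane is one codepoint made of two UTF-16 storage items (a surrogate pair). The helper utf16_units is a name made up for this example.

```python
import unicodedata


def utf16_units(s: str) -> int:
    """Number of 16-bit storage items needed to hold s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


# Logical character level: 'ä' as one precomposed codepoint ...
composed = "\u00e4"
# ... or decomposed into 'a' followed by a combining diaeresis (U+0308).
decomposed = unicodedata.normalize("NFD", composed)

# Codepoint level: U+1F600 is a single codepoint,
# but needs a surrogate pair (two WideChar-sized items) in UTF-16.
outside_bmp = "\U0001F600"
```

Both strings in the first pair display as the same character, yet they compare unequal unless they are normalized first, which is one reason the codepoint level alone is not sufficient for correct text manipulation.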
There must be a boundary between the standard (RTL) handling of the
physical items and encodings, and higher text processing levels, up to
language specific processing (how to break words, when to apply
capitalization, syntax checks...), so that such special handling can be
implemented in dedicated extensions (libraries, classes), by developers
familiar with the rules and conventions of the natural languages.
For now we are talking only about the handling up to individual Unicode
codepoints, and related string manipulation. For this, at least one
string representation must exist that covers the full Unicode range of
codepoints (UTF-8 or UTF-16 for now). When such an implementation claims
"undefined" behaviour, then this can only mean implementation flaws,
resulting in something different from what can be expected from proper
Unicode handling. This includes invalid parameter values in subroutine
calls, which should result in proper (defined) runtime error reporting
(AV, error result...).
WRT AnsiString encodings, the only acceptable (expected) differences
can result from lossy conversions, when converting proper Unicode into a
non-UTF encoding. Even then the results should be consistent, even if
the concrete results depend on some external (platform...) convention or
settings.
IMO.
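As an illustration of such a lossy but consistent conversion, here is a minimal sketch in Python rather than Pascal. The function convert is a name made up for this example, and the choice to substitute "?" for unrepresentable characters is an assumption of the sketch (it is what Python's "replace" error handler does), not a statement about what the FPC RTL does.

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode data from codec src to codec dst.

    Characters that dst cannot represent are replaced by '?', so the
    result is lossy but deterministic.
    """
    return data.decode(src).encode(dst, errors="replace")


# 'é' exists in Windows-1252, the Greek capital omega (U+03A9) does not.
utf8_text = "\u00e9\u03a9".encode("utf-8")

ansi = convert(utf8_text, "utf-8", "cp1252")   # lossy for the omega
back = convert(ansi, "cp1252", "utf-8")        # cannot restore the omega
```

Repeating the conversion always gives the same bytes, so the result is consistent even though information was lost.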
That's why I wonder *when* exactly the result of such an expression
*is* converted (implicitly) into the static encoding of the target
variable, and when *not*.
I understand that the idea is, to use the static encoding information
provided by the type definition whenever possible.
Right, but here "whenever possible" depends on the correspondence of
static and dynamic encoding. When the dynamic encoding can *ever* be
different from the static encoding, except for RawByteString, I consider
it NOT possible to derive the need for a conversion from the static
encoding. In the handling of floating-point values we may have to expect
invalid operations (division by zero, overflow...) or values (NaN...),
but NOT that a Double variable ever contains two Integer values - unless
forced by dirty hacks out of compiler control. Why should this be
different and acc
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Jonas Maebe wrote:
On 26/11/14 23:41, Hans-Peter Diettrich wrote:
In this case the implementation is "compiler specific", somewhat
different from "undefined" (in a RawByteString):
"CP_NONE: this value indicates that no code page information has been
associated with the string data. The result of any explicit or implicit
operation that converts this data to another code page is undefined."
IMO the result is well defined: it's the string with the encoding of
that "other" codepage.
Unless you actually tested this on all platforms and noted that is the
case, you cannot state this. And if you would actually test it, you
would discover that it is wrong
(http://bugs.freepascal.org/view.php?id=22501#c61238 ).
Bugs obviously violate some specification/definition, else "it's not a
bug, it's a feature" ;-)
As mentioned in a previous discussion: don't use "IMO" ("in my opinion")
when talking about testable facts. A testable fact is either true or
false, opinions do not enter the picture.
We're just talking about interpretations, not facts.
An "undefined" result, as I understand it, would
mean "the result can be anything, unrelated to the function input".
Which is 100% correct.
Do you see any use for such function definitions, except in random
generators?
IMO a better wording should be found, that does not cause the current
obvious confusion of some readers.
The confusion only occurs for readers that do not believe what is written.
Such statements come only from writers that do not believe that their
words can be understood in various ways ;-)
DoDi
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Michael Schnell wrote:
On 11/26/2014 07:13 PM, Hans-Peter Diettrich wrote:
Not all codepages have a fixed number of bytes per character.
The string preamble contains the *element size* (1 for AnsiString),
just like with every dynamic array.
Sorry for sloppy wording. Of course I did mean "element size"
("Character" here obviously is not "printable item").
I'd restrict the use of "character" to physical Char types, just to
avoid any misinterpretation.
Printable items (glyphs) are independent from the storage format.
Ligatures or umlauts can consist of multiple "codepoints", and several
Unicode codepoints are not even printable.
A single printable "character", as selectable by a single cursor step,
can consist of multiple codepoints, even (or just) in Unicode.
That's why I'd expect that the FPC documentation includes a glossary and
definition of the terms, which should be used in the documentation and
discussions.
DoDi
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Michael Schnell wrote:
I now understand that the "Element Size" field in the String header is
quite dummy, as under the hood there are two completely separate concepts
for one-byte-Strings and 2-Byte Strings and none for other Element sizes.
After a code review I realized that the element size field is specific to
dynamic strings, not present in dynamic arrays. Since the element size is
bound to the string type, it could be omitted in the FPC implementation.
[With little win, when the record alignment is preserved]
This to me is not obvious at all, as the language syntax and the String
header data structure suggest a more universal paradigm for multiple
string type brands, that each have an "element-size" and "code-ID-number"
setting, handled by a common infrastructure.
This may have been envisaged by the Delphi architects, but was not
continued later.
The "universal paradigm" would allow for extensions (e.g. UTF-32,
multiple 16 Bit Code pages, an additional fully dynamic String type,
n-byte "un-encoded" string types), as I described in the Wiki page.
Even if feasible, such arbitrary string storage can dramatically increase
the number of implicit string conversions. An *efficient* implementation
would be based on a single program-wide string representation, with
different encodings being handled only in an exchange with external data
sources. That standard encoding may be Ansi or Unicode; even Delphi allows
for both models, where Ansi again suggests the use of one specific
codepage (CP_ACP) for best performance.
After all I have the impression that the known RawByteString flaws will
never be fixed in Delphi, in order to encourage the users to take the
step to UnicodeString. Now the question is whether these flaws are fixed
in FPC, or whether Lazarus will become the first project that definitely
requires a complete move to UnicodeString, for reliable operation.
For best support of non-UTF-16 platforms I'd suggest to fix the flaws...
DoDi
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Sven Barth wrote:
Just a little remark: please don't throw in WideString, which is a
completely different type and only there for easy compatibility with COM
and other Windows APIs.
I mentioned it for completeness, and because (at least in Delphi) the
elements of a UnicodeString are WideChar.
Unlike UnicodeString this type is not reference counted for example nor
does it have the code page and element size information that an
Ansi-/UnicodeString has.
This implementation detail is independent from the common payload
encoding.
(In FPC WideString is the same as UnicodeString for all non-Windows
platforms)
sic!
DoDi
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/26/2014 06:37 PM, Hans-Peter Diettrich wrote:
An AnsiString consists of AnsiChar's. The *meaning* of these char's
(bytes) depends on their encoding, regardless of whether the used
encoding is or is not stored with the string.
I understand that the implementation (in Delphi) seems to be driven more
by the Wording ("ANSI") than by the logical paradigm the language syntax
suggests. The language syntax and the string header fields suggest that
both the element-size as the code-ID-number need to be adhered to (be it
statically or dynamically - depending on the usage instance). E.g. there
are at least two "Code pages" for UTF-16 ("LE" and "BE") that would
be worth supporting.
It's essential to distinguish between low-level (physical) AnsiChar
values, and *logical* characters possibly consisting of multiple
AnsiChars.
I now do see that the implementation is done following this concept. But
the language syntax and the string header field suggest a more versatile
paradigm, providing a universal reference counting "element string" type.
That's why I wonder *when* exactly the result of such an expression
*is* converted (implicitly) into the static encoding of the target
variable, and when *not*.
I understand that the idea is, to use the static encoding information
provided by the type definition whenever possible. I understand that if
no RawByteString is involved in the operation, the static encoding
information is sufficient and hence the potential calls to the dedicated
conversion library functions can completely be constructed at compile time.
In Delphi the use of the dynamic encoding information seems to be very
rare (and the implementation does not make much sense to me).
The entire mess results from the bad interpretation of RawByteString
assignments, which IMO was well thought by the Delphi language
architects, but not understood by the Delphi compiler coders.
I fully agree with you.
I suppose the original idea was to create an (additional) fully dynamic
type brand for which, whenever it is used, the compiler needs to read the
dynamic encoding information (both element-size and encoding-ID-number)
and act appropriately. With that decently implemented, in fact, TStrings
and similar classes could use this type for universal handling of all
String type brands.
My hope was that fpc might be able to correct this error of the Delphi
compiler coders. But of course, for Delphi compatibility, the type name
RawByteString and the code-ID-number $ can't be used any more;
a new name and ID number would need to be invented. IMHO this in fact
is possible and viable (see wiki page for details).
-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 26/11/14 23:41, Hans-Peter Diettrich wrote:
> In this case the implementation is "compiler specific", somewhat
> different from "undefined" (in a RawByteString):
> "CP_NONE: this value indicates that no code page information has been
> associated with the string data. The result of any explicit or implicit
> operation that converts this data to another code page is undefined."
>
> IMO the result is well defined: it's the string with the encoding of
> that "other" codepage.
Unless you actually tested this on all platforms and noted that is the
case, you cannot state this. And if you would actually test it, you
would discover that it is wrong
(http://bugs.freepascal.org/view.php?id=22501#c61238 ).
As mentioned in a previous discussion: don't use "IMO" ("in my opinion")
when talking about testable facts. A testable fact is either true or
false, opinions do not enter the picture.
> An "undefined" result, as I understand it, would
> mean "the result can be anything, unrelated to the function input".
Which is 100% correct.
> IMO a better wording should be found, that does not cause the current
> obvious confusion of some readers.
The confusion only occurs for readers that do not believe what is written.
Jonas
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/26/2014 07:13 PM, Hans-Peter Diettrich wrote:
Not all codepages have a fixed number of bytes per character.
The string preamble contains the *element size* (1 for AnsiString),
just like with every dynamic array.
Sorry for sloppy wording. Of course I did mean "element size"
("Character" here obviously is not "printable item").
-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/26/2014 07:54 PM, Hans-Peter Diettrich wrote:
Delphi XE does not properly support UTF-8.
That is what I supposed. Of course the developers at Embarcadero did not
need to think about portability to other OSes than Windows when crafting
the concept. But obviously fpc needs proper support for UTF-8, as it is
constructed to do decent programs for Linux OS.
Hence I did suppose/hope that in fpc the code-page aware string type
implementation would be versatile enough to support any ANSI code page,
UTF-8 and UTF-16 (LE and BE), and be extensible to support e.g. UTF-32
(plus an additional fully dynamic String "brand").
Thanks for your explanations,
-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/26/2014 09:30 PM, Hans-Peter Diettrich wrote:
So seemingly you could do MyStringType = type AnsiString(CP_UTF16), and
seemingly the size information is set according to this.
Not in Delphi XE.
Thanks for the clarification. I did have some hope that fpc would be (or
could be extended to be) better than Delphi in that respect.
I now do see the reason that resulted in the (to me rather odd) naming
"AnsiString" for the code page aware string type. I erroneously supposed
the syntax that finally would be used would be something like
"MyStringType = type String(CP_UTF16)", with no restriction to ANSI, but
with the CP_ constant defining a code page as well as an element size, as
suggested by the language syntax while working with strings using
auto-conversion, and by the structure of the string content header.
There still might be room for (fully compatible) improvement (as I
described in the Wiki), but it's even more difficult to do than I
supposed.
Thanks again,
-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/26/2014 05:37 PM, Jonas Maebe wrote:
invalid (in the meaning of "undefined") in both FPC and Delphi.
Sorry (I am not a native speaker). But to me "undefined" and "invalid"
have completely different meanings (in this context). An "invalid" use of
the language would result in an error (compiler or runtime), while an
"undefined" language construct would result in something that might work
in some way, but there is no guarantee that the outcome is always the
same (e.g. in another instance or another compiler version).
CP_UTF16 and CP_UTF16BE can be returned by StringCodePage() when called
on a unicodestring, and that's it.
I now do understand (see my reply to Sven).
-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/26/2014 05:25 PM, Sven Barth wrote:
> So seemingly you could do MyStringType = type AnsiString(CP_UTF16), and
> seemingly the size information is set according to this.
No, you can't, because the RTL does not handle that. For AnsiString the
element size is *always* 1. It's hardcoded. AFAIK Delphi even does a
compile error if you use CP_UTF16.
Thanks for the clarification. I now understand that the "Element Size"
field in the String header is quite dummy, as under the hood there are
two completely separate concepts for one-byte-Strings and 2-Byte Strings
and none for other Element sizes.
This to me is not obvious at all, as the language syntax and the String
header data structure suggest a more universal paradigm for multiple
string type brands, that each have an "element-size" and "code-ID-number"
setting, handled by a common infrastructure.
The "universal paradigm" would allow for extensions (e.g. UTF-32,
multiple 16 Bit Code pages, an additional fully dynamic String type,
n-byte "un-encoded" string types), as I described in the Wiki page. The
"dual mode" concept of course does not provide such extensibility, and so
I stop thinking about this (and bothering the community), and am happy
that it just works as it is.
Thanks again,
-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
In our previous episode, Hans-Peter Diettrich said:
> > concatenated without data loss and that the result is then converted to
> > the target string's encoding (except in case the target is
> > RawByteString). How that is implemented exactly is undefined; again in
> > the meaning of "undefined", not in the meaning of "undefined when
> > defined as meaning X".
>
> In this case the implementation is "compiler specific", somewhat
> different from "undefined" (in a RawByteString):
> "CP_NONE: this value indicates that no code page information has been
> associated with the string data. The result of any explicit or implicit
> operation that converts this data to another code page is undefined."
>
> IMO the result is well defined: it's the string with the encoding of
> that "other" codepage. An "undefined" result, as I understand it, would
> mean "the result can be anything, unrelated to the function input".
This is usually called "implementation defined". But implementation
defined implies it will remain the same in every iteration of the
compiler (usually documented). If that is not wanted/possible, then it is
considered "undefined".
So even if a value happens to be defined in one version of the compiler,
it doesn't automatically make it implementation defined. It needs to be a
documented choice for that.
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 26.11.2014 19:54, Hans-Peter Diettrich wrote:
UTF-16 is not a valid value for CP_ACP in Delphi, because it's a 2-byte
encoding. Even if the Delphi architects may have thought about a common
string type, with a variable element size (1,2,4), this certainly turned
out soon as a stupid idea, so that AnsiString and
WideString/UnicodeString still are strictly distinct types. WideString
and UnicodeString imply UTF-16, with platform specific byte order
(endianness). The latter becomes important almost only to compiler and
library coders, in host/network byteorder conversions. For the sake of
completeness, pdp-11 processors use yet another byte order, maybe more
word-based processors (DG...) as well.
Just a little remark: please don't throw in WideString, which is a
completely different type and only there for easy compatibility with COM
and other Windows APIs. Unlike UnicodeString this type is not reference
counted for example nor does it have the code page and element size
information that an Ansi-/UnicodeString has.
(In FPC WideString is the same as UnicodeString for all non-Windows
platforms)
Regards,
Sven
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Jonas Maebe wrote:
Technically, that section literally states that they will be concatenated
without data loss and that the result is then converted to the target
string's encoding (except in case the target is RawByteString). How that
is implemented exactly is undefined; again in the meaning of "undefined",
not in the meaning of "undefined when defined as meaning X".
In this case the implementation is "compiler specific", somewhat
different from "undefined" (in a RawByteString):
"CP_NONE: this value indicates that no code page information has been
associated with the string data. The result of any explicit or implicit
operation that converts this data to another code page is undefined."
IMO the result is well defined: it's the string with the encoding of that
"other" codepage. An "undefined" result, as I understand it, would mean
"the result can be anything, unrelated to the function input".
The branch taken in execution of an IF statement also is not "undefined",
only because it depends on the actual condition value. The value of a
local variable initially is "undefined", i.e. can be any value. But after
an assignment it *is* defined, even if that value still may be
*unpredictable* by static code analysis.
IMO a better wording should be found, that does not cause the current
obvious confusion of some readers.
Regarding RawByteStrings there has been the definition "a RawByteString
has exactly the same behavior as assigning that AnsiString(X) to another
AnsiString(X) variable with the same value of X: no code page conversion
or copying occurs". Seemingly this is not true for the intermediate
results of concatenations.
That paragraph only specifies that code page-aware strings are
concatenated without data loss, and then defines to which code page the
result will be converted before assigning it to the target.
What's the meaning of "no copying occurs"? Of course the reference to the
string is copied into the target variable!
What's "the same value of X", in case of AnsiString(CP_ACP) and
AnsiString(DefaultSystemCodePage)?
Even if the intermediary result of a concatenation would be a
RawByteString (which is not stated nor necessarily ever the case), then
the above would apply and hence the (dynamic) code page of that
RawByteString would be the one as defined by the above-mentioned rules
before it would be assigned to the target.
Please note that the other statements refer to *static* encodings,
therefore my question about the (assumed) static encoding of an
intermediate result. When the compiler inserts a conversion request based
on *static* encodings, will it or will it not insert such a request,
before an intermediate result is assigned to the target variable?
Suggestion:
"During string operations the source strings are converted [to CP_ACP?]
when they have a different [dynamic?] encoding. When the result is stored
in a variable, it is converted as required by the static encoding of the
target."
Where "as required" means that a static target encoding of CP_ACP is
replaced by the DefaultSystemCodePage, while CP_NONE does not require a
conversion.
The CP_ACP case should be clarified as well, because it's unclear whether
CP_ACP(=0) is *considered* equal to the current DefaultSystemCodePage,
even if both values are *always* different (see above). The use of
"CP_ACP" instead of "DefaultSystemCodePage" can be confusing and should
be avoided or clarified before.
Perhaps it would help to concentrate on the following steps:
1) (string) operand fetch
2) (string) operations
3) (string) assignment
1) Fetching an operand removes any information about the static encoding
of the source, only its dynamic encoding persists.
[Now the handling of non-AnsiString sources can be explained, like for
literals, ShortString etc. RawByteString is not special here, it's only a
static encoding.]
2) String operations take into account the dynamic encoding of their
operands, with lossless conversions inserted as required.
3) When a string is assigned to a variable, it is eventually converted as
required by the static encoding of the target, with possible data loss.
[about "required" see above. Special case: when the source is a variable,
no conversion occurs when the *static* source and target types are
"compatible". What exactly is compatible with CP_ACP?]
DoDi
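The three steps proposed in the message above (operand fetch, operation, assignment) can be modelled as a small sketch. It is written in Python rather than Pascal and describes only the suggested wording, not how FPC or Delphi actually behave. Everything in it is an assumption made for illustration: the Str record, the use of Python codec names in place of numeric code pages, cp1252 as the default system code page, and UTF-8 as the lossless intermediate encoding for mixed operands.

```python
from dataclasses import dataclass

CP_ACP = "CP_ACP"    # static-only marker: "use the default system code page"
CP_NONE = "CP_NONE"  # static-only marker: "no conversion" (RawByteString)
DEFAULT_SYSTEM_CODE_PAGE = "cp1252"  # assumption for this sketch


@dataclass(frozen=True)
class Str:
    data: bytes   # payload
    codec: str    # dynamic encoding, as a Python codec name


def fetch(value: Str) -> Str:
    """Step 1: only the payload and its dynamic encoding survive the fetch."""
    return value


def concat(a: Str, b: Str) -> Str:
    """Step 2: use the dynamic encodings; convert losslessly if they differ."""
    if a.codec == b.codec:
        return Str(a.data + b.data, a.codec)
    text = a.data.decode(a.codec) + b.data.decode(b.codec)
    return Str(text.encode("utf-8"), "utf-8")  # assumed lossless intermediate


def assign(value: Str, static_cp: str) -> Str:
    """Step 3: convert as required by the target's static encoding (may lose data)."""
    if static_cp == CP_NONE:
        return value  # CP_NONE requires no conversion
    target = DEFAULT_SYSTEM_CODE_PAGE if static_cp == CP_ACP else static_cp
    if value.codec == target:
        return value
    text = value.data.decode(value.codec)
    return Str(text.encode(target, errors="replace"), target)


utf8_part = Str("\u00e9".encode("utf-8"), "utf-8")
ansi_part = Str("\u00e9".encode("cp1252"), "cp1252")
joined = concat(fetch(utf8_part), fetch(ansi_part))
stored_ansi = assign(joined, CP_ACP)
stored_raw = assign(joined, CP_NONE)
```

In this model CP_ACP and CP_NONE never appear as a dynamic encoding; they exist only as static targets, which matches the distinction the thread keeps returning to.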
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Michael Schnell wrote:
> So seemingly you could do MyStringType = type AnsiString(CP_UTF16), and seemingly the size information is set according to this.
Not in Delphi XE.
DoDi
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Mattias Gaertner wrote:
> For example: CP_ACP=0, DefaultSystemCodePage=1252. That means static code page is always 0, while dynamic code page can be 0 or 1252. Both describe the same encoding.
A *dynamic* encoding can *never* be CP_ACP or CP_NONE (in Delphi). These values are allowed only for *static* types in type declarations. CP_UTF16 is also not allowed. Delphi's StringCodePage reports the current default code page (DefaultSystemCodePage) for empty AnsiStrings, and CP_UTF16 for all UnicodeStrings.
> In section "RawByteString": "the results of conversions from/to the CP_NONE code page are undefined." ... because CP_NONE is not a real code page.
The same holds for CP_ACP.
DoDi
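The difference between the static and the dynamic code page discussed above can be inspected directly. The following is a minimal sketch, assuming FPC 3.0 or later; the program and variable names are made up for illustration, and the printed values depend on the platform, the compiler version and the current DefaultSystemCodePage, so no particular output is claimed here.

```pascal
program CodePageDemo;
{$mode objfpc}{$H+}
var
  A: AnsiString;   // static code page: CP_ACP (0)
  U: UTF8String;   // static code page: CP_UTF8 (65001)
  E: AnsiString;   // deliberately left empty
begin
  U := 'abc';
  A := U;          // assignment between different static code pages
  E := '';
  WriteLn('DefaultSystemCodePage:  ', DefaultSystemCodePage);
  WriteLn('dynamic code page of U: ', StringCodePage(U));
  WriteLn('dynamic code page of A: ', StringCodePage(A));
  WriteLn('reported for empty E:   ', StringCodePage(E));
end.
```

Comparing the line for A with DefaultSystemCodePage shows whether the dynamic code page of a plain AnsiString is reported as 0 or as the actual system code page on a given setup.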
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Michael Schnell schrieb: I fail to understand some of the text. It seems to be unavoidable to use the name "ANSIString" even though I always though up when seeing a thing called "ANSI" containing Unicode (e. g. "UTF8String = type AnsiString(CP_UTF8)" ). Seemingly here the "bytes per character" setting implicitly is thought of as a port of the "code-page" definition. correct ? An AnsiString consists of AnsiChar's. The *meaning* of these char's (bytes) depends on their encoding, regardless of whether the used encoding is or is not stored with the string. It's essential to distinguish between low-level (physical) AnsiChar values, and *logical* characters possibly consisting of multiple AnsiChars. In section "Dynamic code page": "When assigning a string to a plain AnsiString (= AnsiString(CP_ACP)) or ShortString, the string data will however be converted to DefaultSystemCodePage. The dynamic code page of that AnsiString(CP_ACP) will then be the current value of DefaultSystemCodePage (e.g. 1250 for the Windows-1250 code page), even though its static code page is CP_ACP (which is a constant <> 1250). This is one example of how the static code page can differ from the dynamic code page. Subsequent sections will describe more such scenarios." 1) A short String does not have a Code page notification so for this "static code page can differ from the dynamic code page" does not seem to make much sense. The text correctly states "dynamic code page of that AnsiString". ShortString (and AnsiChar) has no encoding indicator, they are assumed to be encoded in CP_ACP. 2) I fail to understand how with this explanation that seems to force auto conversion for assignments between types with different "code page" settings (also for CP_ACP) the "static code page can differ from the dynamic code page" can happen. Continue reading until you understood the special handling of string literals and RawByteString. 
In fact this disaster seems to be able to happen (see section "RawByteString") if assigning a string with a static code page X1 to a RawByteString (hence no conversion) and then assigning that RawByteString to a string with a static code page X2 (no conversion again). In fact I assume that without abusing RawByteString such "intersexual" strings can't be produced, otherwise this would be rather disastrous for normal users. *All* intermediate strings, generated during the evaluation of string expressions, only have a dynamic encoding, thus can be considered as being RawByteStrings. That's why I wonder *when* exactly the result of such an expression *is* converted (implicitly) into the static encoding of the target variable, and when *not*. Obviously the compiler inserts an conversion request for the *direct* assignment of one string variable to another one, of an different *static* encoding. But what happens when a string expression doesn't have such a known static encoding??? In section "RawByteString": "the results of conversions from/to the CP_NONE code page are undefined." In effect the behavior is exactly defined in this section "As a first approximation". Right, the result *is* well defined, but has no *predetermined* dynamic encoding. The entire mess results from the bad interpretation of RawByteString assignments, which IMO was well thought by the Delphi language architects, but not understood by the Delphi compiler coders. This interpretation also found its way into FPC: "Less intuitive is probably that when a RawByteString is assigned to an AnsiString(X), the same happens: no code page conversion[...]" It's clear that a conversion *can* be omitted for every assignment *to* an RawByteString. That's one of the purposes of that type - to avoid excess conversions into CP_ACP or UnicodeString. But it's unclear why the heck the assignment to any *other* AnsiString type should be omitted, as soon as the source string is a RawByteString??? 
Therefore I'd suggest an compiler switch, implementing the lame Delphi compatible behaviour only on *demand*, while the FPC default would force eventual conversions with *every* assignment to any other (non-CP_NONE) AnsiString type. This simple change will safely prevent strings of different static and dynamic encoding, so that according tests can be removed safely from library *and* user code. The proper use of RawByteStrings deserves further documentation, for users who want/need their own (generic) stringhandling routines. Topics should be: - how to determine the dynamic encoding of strings (StringCodePage) - how to force required conversions (SetCodePage) - how to deal with strings of different encodings - how to minimize the number of string conversions DoDi ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Michael Schnell schrieb: On 11/26/2014 11:40 AM, Mattias Gaertner wrote: Ansistring supports only one byte per character code pages. Even more confused. Am I wrong thinking that with code aware Strings, for Delphi XE compatibility, in Windows CP_ACP needs to be UTF16 (if not right, than due later) ? Delphi XE does not properly support UTF-8. CP_ACP seems to depend on western/far-eastern versions, where the western version assumes and allows for any SBCS; I don't know of the same in far-east versions. The SBCS restriction allows to simplify standard string handling and conversions, because every character (=byte) can be exchanged in place. UTF-8 doesn't fit into this picture, because it's a MBCS. UTF-16 is not a valid value for CP_ACP in Delphi, because it's a 2-byte encoding. Even if the Delphi architects may have thought about an common string type, with a variable element size (1,2,4), this certainly turned out soon as a stupid idea, so that AnsiString and WideString/UnicodeString still are strictly distinct types. WideString and UnicodeString imply UTF-16, with platform specific byte order (endianness). The latter becomes important almost only to compiler and library coders, in host/network byteorder conversions. For the sake of completeness, pdp-11 processors use yet another byte order, maybe more word-based processors (DG...) as well. DoDi ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Michael Schnell schrieb: On 11/26/2014 12:09 PM, Sven Barth wrote: In Delphi (and FPC) CP_ACP corresponds by default with the current system codepage (e.g. CP1252 on a German Windows). OK. So in Delphi XE (in Germany) String(CP_ACP) is the same as String(CP1252) but different from String without brackets which in turn is the same as String(CP_UTF16) ? Correct ? CP_ACP (and CP_NONE) describes a *static* encoding, and has an fixed value (CP_ACP=0, CP_NONE=$). The dynamic encoding of strings, kept in AnsiString(0) or RawByteString variables, must be obtained from the string itself. When the string is empty, StringCodepage returns DefaultSystemCodePage (for CP_ACP). CP_UTF16 is not supported, because AnsiString only supports 1-Byte character strings (and UTF-8 as the odd one) and not 2-Byte character strings. I still don't understand. The wiki article seems to suggest that it is about a type called "ANSIString" that features a dynamically settable "code page information". From discussions about Delphi and FPC, I only know a String type with a dynamically settable "code page information" that also features a dynamically settable "Bytes per Character information" and hence does support 1, 2 and 4 "Bytes per Character". (e.g. UTF-8, UTF-16, and UTF-32). You should have noticed that there exists no String or Char type, that would allow for arbitrary bytes/char counts (see my other answer for details). The difference to Delphi currently is that for FPC String=AnsiString(CP_ACP) and for Delphi String=UnicodeString (aka 2-Byte string). I understand that you mean (e.g.) Delphi XE. But what version of FPC is "currently". Am I wrong assuming that in the svn we do have the "NewStrings" library that supports dynamical code-page *and* byte-per-character settings and hence supports e.g. CP1251, UTF-8, UTF-16, and UTF-32 ? The byte-per-character field is read-only, just like for any dynamic array. 
So I seem to understand the meaning of String(CP1252), String(CP_UTF8), and String(CP_UTF16) (which seems do be the Delphi notation), but I seemingly don't get the exact meaning of "AnsiString(CP_ACP)" or "AnsiString(CP1251)" The Delphi notation is the same, e.g. AnsiString(CP_ACP). In the end, what the definition of "String" without brackets is, might be due to a settable compiler option and/or the OS the compiler is set to create code for. Right, the *generic* String type can be mapped to either ShortString, AnsiString(0) or UnicodeString, depending on compiler versions and switches. A raw guess can be derived from sizeof(Char). DoDi ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Mattias Gaertner schrieb: On Wed, 26 Nov 2014 11:23:17 +0100 Michael Schnell wrote: Seemingly here the "bytes per character" setting implicitly is thought of as a port of the "code-page" definition. correct ? Code page define bytes per character. Huh? Not all codepages have a fixed number of bytes per character. The string preamble contains the *element size* (1 for AnsiString), just like with every dynamic array. As you know: Don't confuse character with glyph and codepoint. Right, but what is what? I feel a need for an exact (official) definition of such (and more) terms, in order to prevent further misunderstandings of the documentation and in discussions. E.g. "code page" has different meanings, when used with ANSI/ISO and Unicode character sets. While ANSI/ISO codepages desribe different mappings of bytes into characters, Unicode codepages define subsets of the whole Unicode range. My understanding of "character" is a *logical* unit (letter), with possibly different encodings, values and sizes in different codepages (character sets). What's the term for the *physical* unit (AnsiChar, WideChar)? Ansistring supports only one byte per character code pages. Huh? What's your definition of "character"? AnsiString supports MBCS codepages as well. The restriction is the physical storage unit (1 byte per string item), as imposed by AnsiChar. DoDi ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On Wed, 26 Nov 2014 17:50:31 +0100 Mattias Gaertner wrote: > On Wed, 26 Nov 2014 17:23:48 +0100 > Jonas Maebe wrote: > > > On 26/11/14 17:21, Sven Barth wrote: > > > Yes, nevertheless the header record is the same for UnicodeString and > > > AnsiString and thus it also has a codepage field which is always > > > initialized to CP_UTF16 however. > > > > It can also be CP_UTF16BE (which it is on big endian FPC targets right now). > > I see. > > Can you create a CP_UTF16BE on little Endian systems? > > type u = UnicodeString(CP_UTF16BE); gives an error. Jonas has answered this. Thanks. Mattias ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On Wed, 26 Nov 2014 17:23:48 +0100 Jonas Maebe wrote: > On 26/11/14 17:21, Sven Barth wrote: > > Yes, nevertheless the header record is the same for UnicodeString and > > AnsiString and thus it also has a codepage field which is always > > initialized to CP_UTF16 however. > > It can also be CP_UTF16BE (which it is on big endian FPC targets right now). I see. Can you create a CP_UTF16BE on little Endian systems? type u = UnicodeString(CP_UTF16BE); gives an error. Mattias ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 26/11/14 16:19, Michael Schnell wrote: > So seemingly you could do MyStringType = type > AnsiString(CP_UTF16), and seemingly the size information is set > according to this. As several people have told you several times, that is invalid (in the meaning of "undefined") in both FPC and Delphi. I've mentioned this on the FPC_Unicode_support wiki page now. CP_UTF16 and CP_UTF16BE can be returned by StringCodePage() when called on a unicodestring, and that's it. Jonas ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
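Jonas's statement about StringCodePage() on a unicodestring can be checked with a few lines. This is only a sketch, assuming FPC 3.0 or later; the program name is hypothetical, and whether CP_UTF16 or CP_UTF16BE is reported depends on the endianness of the target.

```pascal
program UnicodeCodePage;
{$mode objfpc}{$H+}
var
  W: UnicodeString;
begin
  W := 'abc';
  // The code page field exists in the header, but is fixed for UnicodeString.
  WriteLn('StringCodePage(W) = ', StringCodePage(W));
  WriteLn('CP_UTF16          = ', CP_UTF16);
  WriteLn('CP_UTF16BE        = ', CP_UTF16BE);
  // The element size is a property of the type, not something a code page can change.
  WriteLn('SizeOf(W[1])      = ', SizeOf(W[1]));
end.
```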
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 26/11/14 17:21, Sven Barth wrote: > Yes, nevertheless the header record is the same for UnicodeString and > AnsiString and thus it also has a codepage field which is always > initialized to CP_UTF16 however. It can also be CP_UTF16BE (which it is on big endian FPC targets right now). Jonas ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 26.11.2014 15:30, "Mattias Gaertner" wrote:
> On Wed, 26 Nov 2014 15:05:16 +0100 Sven Barth wrote:
> > [...]
> > While both AnsiString and UnicodeString have the current codepage and the character size in their header record
> AFAIK UnicodeString has only a static (fixed) code page.
Yes, nevertheless the header record is the same for UnicodeString and AnsiString, and thus it also has a codepage field, which however is always initialized to CP_UTF16.
Regards, Sven
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 26/11/14 13:11, Michael Schnell wrote: > In section "String concatenations" there is no mentioning about > auto-conversion. There is. > For statically typed Strings it's rather obvious that > they will be auto-converted if appropriate. It's probably rather obvious because it is literally mentioned in that section. > Technically - if differently > encode - they seem to be converted to Unicode and the result is > converted to match the target. Technically, that section literally states that they will be concatenated without data loss and that the result is then converted to the target string's encoding (except in case the target is RawByteString). How that is implemented exactly is undefined; again in the meaning of "undefined", not in the meaning of "undefined when defined as meaning X". > Regarding RawByteStrings there has been the definition "a RawByteString > has exactly the same behavior as assigning that AnsiString(X) to another > AnsiString(X) variable with the same value of X: no code page conversion > or copying occurs". Seemingly this is not true for the intermediate > results of concatenations. That paragraph only specifies that code page-aware strings are concatenated without data loss, and then defines to which code page the result will be converted before assigning it to the target. Even if the intermediary result of a concatenation would be a RawByteString (which is not stated nor necessarily ever the case), then the above would apply and hence the (dynamic) code page of that RawByteString would be the one as defined by the above-mentioned rules before it would be assigned to the target. Jonas ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
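The concatenation rule described in this message (operands are combined without data loss, then the result is converted to the code page of the target) can be observed from user code. A minimal sketch, assuming FPC 3.0 or later; TCP1252String is a hypothetical type name chosen for the example, and how the intermediate result is represented internally remains undefined, as stated above.

```pascal
program ConcatDemo;
{$mode objfpc}{$H+}
type
  TCP1252String = type AnsiString(1252);  // hypothetical name, static code page 1252
var
  L: TCP1252String;
  U, Target: UTF8String;
begin
  L := 'abc';
  U := 'def';
  // Operands with different static code pages; the target decides the final encoding.
  Target := L + U;
  WriteLn('result: ', Target);
  WriteLn('dynamic code page of Target: ', StringCodePage(Target));
end.
```

Changing the declaration of Target to RawByteString is a simple way to see the exception mentioned for that type.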
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/26/2014 03:05 PM, Sven Barth wrote:
> > OK. So in Delphi XE (in Germany) String(CP_ACP) is the same as String(CP1252) but different from String without brackets which in turn is the same as String(CP_UTF16) ? Correct ?
> There is no "String with brackets". You can only use "AnsiString" followed by brackets, not "String". And "String" in Delphi 2009+ is the same as UnicodeString which is a different compiler internal type than AnsiString(CP_UTF16) would be if it would be allowed.
> While both AnsiString and UnicodeString have the current codepage and the character size in their header record the code page is only used for AnsiString and the size can not be influenced in any way (for an AnsiString it's always 1 and for a UnicodeString it's always 2).
OK. So what is the notation in Delphi (and hence supposedly in FPC with "mode delphiunicode") to define a variable with the (static) string encoding type "CP" with XXX = 1252, UTF8, UTF16 ?
I found this:
CP_ACP     = 0;     // default to ANSI code page
CP_UTF16   = 1200;  // utf-16
CP_UTF16BE = 1201;  // unicodeFFFE
CP_UTF7    = 65000; // utf-7
CP_UTF8    = 65001; // utf-8
CP_ASCII   = 20127; // us-ascii
CP_NONE    = $;     // rawbytestring encoding
So seemingly you could do MyStringType = type AnsiString(CP_UTF16), and seemingly the size information is set according to this.
> There is no UTF-32 string (at least not in the sense of a compiler provided type).
I see (It's a shame).
Thanks a lot for your patience,
-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 26/11/14 12:53, Michael Schnell wrote: [CP_NONE] > Is this "undefined" in the meaning of "not predictable by the user" in > the "current" version of fpc, or in the meaning of "due to change" when > updating fpc. This "undefined" literally means "undefined". It does not mean "undefined in a meaning that is defined in a particular way". Jonas ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On Wed, 26 Nov 2014 15:05:16 +0100 Sven Barth wrote: >[...] > While both AnsiString and UnicodeString have the current codepage and the > character size in their header record AFAIK UnicodeString has only a static (fixed) code page. Mattias ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 26.11.2014 12:37, "Michael Schnell" wrote:
> On 11/26/2014 12:09 PM, Sven Barth wrote:
>> In Delphi (and FPC) CP_ACP corresponds by default with the current system codepage (e.g. CP1252 on a German Windows).
> OK. So in Delphi XE (in Germany) String(CP_ACP) is the same as String(CP1252) but different from String without brackets which in turn is the same as String(CP_UTF16) ? Correct ?
There is no "String with brackets". You can only use "AnsiString" followed by brackets, not "String". And "String" in Delphi 2009+ is the same as UnicodeString which is a different compiler internal type than AnsiString(CP_UTF16) would be if it would be allowed.
>> CP_UTF16 is not supported, because AnsiString only supports 1-Byte character strings (and UTF-8 as the odd one) and not 2-Byte character strings.
> I still don't understand. The wiki article seems to suggest that it is about a type called "ANSIString" that features a dynamically settable "code page information". From discussions about Delphi and FPC, I only know a String type with a dynamically settable "code page information" that also features a dynamically settable "Bytes per Character information" and hence does support 1, 2 and 4 "Bytes per Character". (e.g. UTF-8, UTF-16, and UTF-32).
While both AnsiString and UnicodeString have the current codepage and the character size in their header record, the code page is only used for AnsiString and the size can not be influenced in any way (for an AnsiString it's always 1 and for a UnicodeString it's always 2). There is no UTF-32 string (at least not in the sense of a compiler provided type).
>> The difference to Delphi currently is that for FPC String=AnsiString(CP_ACP) and for Delphi String=UnicodeString (aka 2-Byte string).
> I understand that you mean (e.g.) Delphi XE. But what version of FPC is "currently".
FPC is none, because when Delphi introduced the code page aware AnsiString it switched at the same time from having String=AnsiString to String=UnicodeString. FPC did only the first part for now (so at best FPC would be a "not quite 2009" :P ).
> Am I wrong assuming that in the svn we do have the "NewStrings" library that supports dynamical code-page *and* byte-per-character settings and hence supports e.g. CP1251, UTF-8, UTF-16, and UTF-32 ?
> So I seem to understand the meaning of String(CP1252), String(CP_UTF8), and String(CP_UTF16) (which seems do be the Delphi notation), but I seemingly don't get the exact meaning of "AnsiString(CP_ACP)" or "AnsiString(CP1251)"
No. The Delphi notation is the same as in FPC: AnsiString(codepage). And an AnsiString(CP_1251) normally holds string data encoded with the CP-1251 codepage, while an AnsiString(CP_ACP) holds string data encoded with whatever encoding the DefaultSystemCodePage denoted at the time of assignment. This can be for example CP_1251 as well or something different like CP_UTF8 (it can however not be CP_ACP again nor CP_UTF16 nor CP_UTF32).
> In the end, what the definition of "String" without brackets is, might be due to a settable compiler option and/or the OS the compiler is set to create code for.
That is already the case:
- any mode, H- : ShortString
- any mode except delphiunicode, H+ : AnsiString(CP_ACP)
- mode delphiunicode, H+ : UnicodeString
(there's also a modeswitch to change String to UnicodeString, but I forgot its name -.-)
Please note that these switches are always per unit, as precompiled units (like the RTL ones) can not be influenced.
Regards, Sven
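The mapping of the generic String type listed above can be probed per unit with SizeOf, in the spirit of the "raw guess from sizeof(Char)" mentioned elsewhere in this thread. A minimal sketch, assuming FPC 3.0 or later; the program name is made up, and the idea is to recompile it with {$H-} or with {$mode delphiunicode} and compare the numbers.

```pascal
program WhichString;
{$mode objfpc}{$H+}   // try {$H-} or {$mode delphiunicode} here and compare
var
  S: String;
begin
  S := 'x';
  // Char follows the generic String type: 1 byte for Ansi/ShortString, 2 for UnicodeString.
  WriteLn('SizeOf(Char) = ', SizeOf(Char));
  WriteLn('SizeOf(S[1]) = ', SizeOf(S[1]));
  // A ShortString is stored inline; AnsiString and UnicodeString are pointer-sized references.
  WriteLn('SizeOf(S)    = ', SizeOf(S));
end.
```

Because the switches apply per unit, the result only describes the unit that contains these lines, not the RTL or other precompiled units.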
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
After re-reading yet another question: In section "String concatenations" there is no mentioning about auto-conversion. For statically typed Strings it's rather obvious that they will be auto-converted if appropriate. Technically - if differently encode - they seem to be converted to Unicode and the result is converted to match the target. Regarding RawByteStrings there has been the definition "a RawByteString has exactly the same behavior as assigning that AnsiString(X) to another AnsiString(X) variable with the same value of X: no code page conversion or copying occurs". Seemingly this is not true for the intermediate results of concatenations. Here the dynamical encoding information seems to define the fact and type of conversion. If this is the fact it should be mentioned. (Whether or not this makes sense is another question: is the code information of "RawByteString" meant to be "NONE" (i.e. "RAW") or "dynamic" (i.e. "complex") ). -Michael ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/26/2014 12:10 PM, Mattias Gaertner wrote: "the results of conversions from/to the CP_NONE code page are undefined." ... because CP_NONE is not a real code page. So you understand "result" as what you would get when printing. In the context of this wiki page I would understand "result" as the binary content of the variable in question. Is this "undefined" in the meaning of "not predictable by the user" in the "current" version of fpc, or in the meaning of "due to change" when updating fpc. -Michael ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/26/2014 12:13 PM, Mattias Gaertner wrote:
In mode delphiunicode String=UnicodeString.
I see.
So even in Delphi XE where "UnicodeString" is denoted by "CP_UTF16", the
value of the constant CP_UTF16 is not the same as the value of the
(constant or) variable CP_ACP, (while OTOH using the value of CP_UTF16
in a type or variable definition performs the same as using 0 {is
CP_DEFAULT name of the appropriate constant ?} ).
I understand that fpc with "mode delphiunicode" is supposed to work in
the same way.
-Michael
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/26/2014 12:09 PM, Sven Barth wrote: In Delphi (and FPC) CP_ACP corresponds by default with the current system codepage (e.g. CP1252 on a German Windows). OK. So in Delphi XE (in Germany) String(CP_ACP) is the same as String(CP1252) but different from String without brackets which in turn is the same as String(CP_UTF16) ? Correct ? CP_UTF16 is not supported, because AnsiString only supports 1-Byte character strings (and UTF-8 as the odd one) and not 2-Byte character strings. I still don't understand. The wiki article seems to suggest that it is about a type called "ANSIString" that features a dynamically settable "code page information". From discussions about Delphi and FPC, I only know a String type with a dynamically settable "code page information" that also features a dynamically settable "Bytes per Character information" and hence does support 1, 2 and 4 "Bytes per Character". (e.g. UTF-8, UTF-16, and UTF-32). The difference to Delphi currently is that for FPC String=AnsiString(CP_ACP) and for Delphi String=UnicodeString (aka 2-Byte string). I understand that you mean (e.g.) Delphi XE. But what version of FPC is "currently". Am I wrong assuming that in the svn we do have the "NewStrings" library that supports dynamical code-page *and* byte-per-character settings and hence supports e.g. CP1251, UTF-8, UTF-16, and UTF-32 ? So I seem to understand the meaning of String(CP1252), String(CP_UTF8), and String(CP_UTF16) (which seems do be the Delphi notation), but I seemingly don't get the exact meaning of "AnsiString(CP_ACP)" or "AnsiString(CP1251)" In the end, what the definition of "String" without brackets is, might be due to a settable compiler option and/or the OS the compiler is set to create code for. -Michael ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On Wed, 26 Nov 2014 11:52:50 +0100 Michael Schnell wrote: > On 11/26/2014 11:40 AM, Mattias Gaertner wrote: > > Ansistring supports only one byte per character code pages. > > Even more confused. Am I wrong thinking that with code aware Strings, > for Delphi XE compatibility, in Windows CP_ACP needs to be UTF16 (if not > right, than due later) ? No. In mode delphiunicode String=UnicodeString. Mattias ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On Wed, 26 Nov 2014 11:23:17 +0100 Michael Schnell wrote: >[...] > 2) I fail to understand how with this explanation that seems to force > auto conversion for assignments between types with different "code page" > settings (also for CP_ACP) the "static code page can differ from the > dynamic code page" can happen. For example: CP_ACP=0, DefaultSystemCodePage=1252 That means static code page is always 0, while dynamic code page can be 0 or 1252. Both describe the same encoding. RawByteString has static cp CP_NONE=$, but its dynamic cp is always different, for example CP_ACP=0, 1252 or CP_UTF8. > In fact this disaster seems to be able to happen (see section > "RawByteString") if assigning a string with a static code page X1 to a > RawByteString (hence no conversion) and then assigning that > RawByteString to a string with a static code page X2 (no conversion > again). In fact I assume that without abusing RawByteString such > "intersexual" strings can't be produced, otherwise this would be rather > disastrous for normal users. You can use SetCodePage as well. ;) > In section "RawByteString": > > "the results of conversions from/to the CP_NONE code page are undefined." ... because CP_NONE is not a real code page. Mattias ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
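The two ways of ending up with a dynamic code page that differs from what the declaration suggests (assignment through RawByteString, and SetCodePage) can be tried out as follows. This is a sketch under the assumption of FPC 3.0 or later; the program name is hypothetical, the string content is plain ASCII so that relabelling does no harm, and a real conversion on Unix-like systems may additionally require a widestring manager such as the cwstring unit.

```pascal
program RawByteDemo;
{$mode objfpc}{$H+}
var
  U: UTF8String;
  R: RawByteString;
begin
  U := 'abc';
  R := U;                          // no conversion: R keeps the dynamic code page of U
  WriteLn('after R := U:      ', StringCodePage(R));
  SetCodePage(R, 1252, False);     // relabel only: the bytes stay as they are
  WriteLn('after relabelling: ', StringCodePage(R));
  SetCodePage(R, CP_UTF8, True);   // convert the data and update the code page field
  WriteLn('after converting:  ', StringCodePage(R));
end.
```

Relabelling with Convert set to False is exactly the kind of operation that can produce a mismatch between content and code page when the data is not pure ASCII.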
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
Am 26.11.2014 11:53 schrieb "Michael Schnell" : > > On 11/26/2014 11:40 AM, Mattias Gaertner wrote: >> >> Ansistring supports only one byte per character code pages. > > > Even more confused. Am I wrong thinking that with code aware Strings, for Delphi XE compatibility, in Windows CP_ACP needs to be UTF16 (if not right, than due later) ? Yes, you're wrong. In Delphi (and FPC) CP_ACP corresponds by default with the current system codepage (e.g. CP1252 on a German Windows). CP_UTF16 is not supported, because AnsiString only supports 1-Byte character strings (and UTF-8 as the odd one) and not 2-Byte character strings. The difference to Delphi currently is that for FPC String=AnsiString(CP_ACP) and for Delphi String=UnicodeString (aka 2-Byte string). Regards, Sven ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On 11/26/2014 11:40 AM, Mattias Gaertner wrote: Ansistring supports only one byte per character code pages. Even more confused. Am I wrong thinking that with code aware Strings, for Delphi XE compatibility, in Windows CP_ACP needs to be UTF16 (if not right, than due later) ? What is a "Code page notification"? Do you mean "code page information"? Yep. that it does *not* talk about ShortString. OK. -Michael ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
On Wed, 26 Nov 2014 11:23:17 +0100 Michael Schnell wrote: >[...] > It seems to be unavoidable to use the name "ANSIString" even though I > always though up when seeing a thing called "ANSI" containing Unicode > (e. g. "UTF8String = type AnsiString(CP_UTF8)" ). Is there a question? > Seemingly here the "bytes per character" setting implicitly is thought > of as a port of the "code-page" definition. correct ? Code page define bytes per character. As you know: Don't confuse character with glyph and codepoint. Ansistring supports only one byte per character code pages. > In section "Dynamic code page": > > "When assigning a string to a plain AnsiString (= AnsiString(CP_ACP)) or > ShortString, the string data will however be converted to > DefaultSystemCodePage. The dynamic code page of that AnsiString(CP_ACP) > will then be the current value of DefaultSystemCodePage (e.g. 1250 for > the Windows-1250 code page), even though its static code page is CP_ACP > (which is a constant <> 1250). This is one example of how the static > code page can differ from the dynamic code page. Subsequent sections > will describe more such scenarios." > > 1) A short String does not have a Code page notification so for this > "static code page can differ from the dynamic code page" does not seem > to make much sense. What is a "Code page notification"? Do you mean "code page information"? IMO the phrase "The dynamic code page of that AnsiString" is clear, that it does *not* talk about ShortString. Mattias ___ fpc-devel maillist - [email protected] http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
[fpc-devel] Trying to understand the wiki-Page "FPC Unicode support"
I fail to understand some of the text.

It seems to be unavoidable to use the name "ANSIString", even though I always throw up when seeing a thing called "ANSI" containing Unicode (e.g. "UTF8String = type AnsiString(CP_UTF8)").

Seemingly the "bytes per character" setting is implicitly thought of here as a part of the "code-page" definition. Correct?

In section "Dynamic code page":

"When assigning a string to a plain AnsiString (= AnsiString(CP_ACP)) or ShortString, the string data will however be converted to DefaultSystemCodePage. The dynamic code page of that AnsiString(CP_ACP) will then be the current value of DefaultSystemCodePage (e.g. 1250 for the Windows-1250 code page), even though its static code page is CP_ACP (which is a constant <> 1250). This is one example of how the static code page can differ from the dynamic code page. Subsequent sections will describe more such scenarios."

1) A ShortString does not have a code page notification, so for it "static code page can differ from the dynamic code page" does not seem to make much sense.

2) I fail to understand how, given this explanation, which seems to force automatic conversion for assignments between types with different "code page" settings (also for CP_ACP), the case "static code page can differ from the dynamic code page" can happen at all. In fact this disaster seems to be possible (see section "RawByteString") by assigning a string with static code page X1 to a RawByteString (hence no conversion) and then assigning that RawByteString to a string with static code page X2 (no conversion again). I assume that without abusing RawByteString such "mismatched" strings can't be produced; otherwise this would be rather disastrous for normal users.

In section "RawByteString":

"the results of conversions from/to the CP_NONE code page are undefined."

In effect, the behavior is exactly defined in this section, albeit "As a first approximation". Does that mean it is due to be changed?
Is there a reason not to keep the described behavior (simply never do any conversion)? Of course this can produce mismatched strings. Is that a great harm? If so, I think assigning a RawByteString to a string with a static code page should either be forbidden completely at compile time or result in a runtime error if the code page does not match.

-Michael
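The X1 -> RawByteString -> X2 scenario from point 2 can be tried out with a small Free Pascal program. This is only a sketch, assuming FPC 3.0 or later. TCP1250String and the variable names are hypothetical, and the choice of CP_UTF8 and 1250 for X1 and X2 is arbitrary. What the second assignment actually does is exactly the open question, so the program reports the dynamic code pages instead of presuming a result.

```pascal
program RawByteRoundTrip;
{$mode objfpc}{$H+}

type
  // Hypothetical type name, following the wiki's
  // "UTF8String = type AnsiString(CP_UTF8)" pattern.
  TCP1250String = type AnsiString(1250);

var
  U: UTF8String;      // static code page X1 = CP_UTF8
  R: RawByteString;   // static code page CP_NONE
  W: TCP1250String;   // static code page X2 = 1250
begin
  U := 'abc';
  R := U;   // the step described above as "no conversion"
  W := R;   // the step in question: converted or not?
  WriteLn('dynamic code page of U: ', StringCodePage(U));
  WriteLn('dynamic code page of R: ', StringCodePage(R));
  WriteLn('dynamic code page of W: ', StringCodePage(W));
  // FALSE here would mean W carries a dynamic code page that differs
  // from its static one.
  WriteLn('W matches its static code page: ', StringCodePage(W) = 1250);
end.
```

With pure ASCII content the bytes are identical in both code pages, so the program only shows the code page tags. Content outside ASCII would be needed to see whether the data itself is reinterpreted.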
