Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 12/03/2014 05:02 AM, Hans-Peter Diettrich wrote:
> Michael Schnell wrote:
>> - It does not result in additional conversions.
> It does, e.g. in searching or sorting of StringList, when it can contain strings of different encodings. The choice of a unique encoding for application strings (maybe CP_ACP, UTF-8 or UTF-16) eliminates such conversions.

If multiple encoding brands are involved, a system without DynamicString will also need to do conversions. So DynamicString does not impose *additional* conversions.

>> - So the Checking Overhead is nothing but a rumor. (Remember, I don't suggest dropping the standard statically typed paradigm altogether, as tight loops of course work best in that way.)
> The rumor is the unimportant Conversion Overhead, i.e. how often a check leads to a conversion. When no check is required, conversions consequently cannot occur at all.

Please re-read the text I wrote.
- If DynamicString is not used in the user code, the compiler creates the same code as before. So no overhead.
- If DynamicString is used (in user code or in a library interface), but only a single encoding brand is used everywhere statically encoded strings are in place (a single program-wide string representation, as you suggested in your previous mail), the only runtime overhead is that at the locations where DynamicString is used (i.e. not in any tight loops) the compiler implements an additional check of the EncodingType variable. Here (unless the user actively decides to create string variables with encoding brands other than the program-wide default) the code *always* finds at runtime that no conversion is necessary and acts as if the String were not dynamic, but already correct. The overhead of checking is obviously at most some 5 ASM instructions and hence negligible compared to the function call that goes with entering the library function in question.

> RawByteString cannot serve two different purposes :-(

As I pointed out as well: a variable's encoding brand can't be static and dynamic at the same time. This is the cause of the major misconception imposed by Delphi regarding RawByteString. And this is why I would leave RawByteString aside (as it is / as it is assumed to be / whatever) and for any improvement use a completely new type name and a CP_ANY constant / value.

> In *Delphi* it is used as a polymorphic string, capable of *holding* actual strings of any encoding. But when assigned to a variable of a different encoding, a conversion may occur that converts the string into the declared (static) encoding of the target variable.

Seemingly rather close to what I suggest as DynamicString. But (see http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support ) with a dynamic String the encoding brand number of such a String would never be allowed to be written into the EncodingType field in the string header. If this were true, why do the Delphi docs discourage making decent use of the dynamic feature of RawByteString?

Anyway. A dynamic String type only makes sense if it is used in as many library interfaces as possible (including TStrings). This is not done in Delphi. In Delphi this is not nice, as in many cases it restricts how the user can make use of these libraries, but it is not as critical as with fpc, where you need to consider portability issues.

> In *FPC* it currently is used somewhat close to your idea, i.e. no conversion occurs in both an assignment to *and from* a RawByteString to some other AnsiString.

As said, to avoid ambiguity, I vote for adding yet another string type name (e.g. ByteString, denoted by CP_BYTE) that is *known* to disallow any conversion (and for leaving RawByteString as close as possible to the moving target Delphi presents).

> I understand the FPC attempt to allow *at the same time* for the new (encoded) and old (unencoded) AnsiString behaviour, where no automatic conversions are allowed. But this would require at the same time that e.g. all string literals *also* are stored in that (immutable) encoding, and that this encoding can *not* be changed at runtime, while DefaultSystemCodePage *can* be changed.

I feel that this (simplified) attempt can't result in a decent paradigm. It is close to impossible to describe the behavior completely in an understandable way, and it's prone to a lot of ambiguity.

That is why I tried to invent a concept that I suppose might work and will not break (much) existing code. It is intended to be straight from the ground up (it is not even necessary to assume that the content of a String is printable/readable, but it should easily work for that application). It would allow for making flexible use of Strings with understandable and easy-to-use syntax candy, and would no longer impose restrictions on portability. IMHO it would not impose (noticeable) performance degradation, either.

-Michael
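The size of the check discussed above can be made concrete with a minimal sketch. DynamicString and CP_ANY do not exist in FPC, so the sketch uses RawByteString as a stand-in, and `EnsureCodePage` is a hypothetical helper name, not an RTL function. It assumes FPC 3.0 or later, where `StringCodePage` and `SetCodePage` are available in the system unit. It is only an illustration of "compare the dynamic code page first, convert only on mismatch", not the mechanism proposed on the wiki page.

```pascal
program DynCheckSketch;
{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring; // widestring manager, needed on Unix for real code page conversions
{$endif}

// Hypothetical helper: return S in the Target code page.
// The common case is a single comparison against the dynamic code page
// stored in the string header; a conversion happens only on a mismatch.
function EnsureCodePage(const S: RawByteString;
  Target: TSystemCodePage): RawByteString;
begin
  Result := S;
  if (Result <> '') and (StringCodePage(Result) <> Target) then
    SetCodePage(Result, Target, True); // True = convert the payload as well
end;

var
  U: UTF8String;
  R: RawByteString;
begin
  U := 'plain ASCII text';
  R := EnsureCodePage(U, CP_UTF8);
  // After the call the dynamic code page of R matches the requested one.
  WriteLn(StringCodePage(R) = CP_UTF8);
end.
```

When every string in the program already carries the same code page, the `if` branch is never taken, which is the situation described in the message above.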
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 12/03/2014 12:52 AM, Hans-Peter Diettrich wrote:
> You forget that Jonas refers to *dynamic* string encodings, unknown at compile time.

??? In your other mail you pointed out that fpc (unlike Delphi) does not provide *dynamic* string encoding with RawByteString (and where else would it be supported?).

-Michael
___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 12/03/2014 12:52 AM, Hans-Peter Diettrich wrote:
> In Delphi *no* string can have a dynamic encoding of CP_NONE or CP_ACP,

If you really do have dynamic strings, obviously the *definition* (i.e. CP_...) of such strings is strictly static (just for compiler use) and can never be used as the *dynamic* notation of the *current* encoding (in the EncodingType field). IMHO a different implementation is not workable.

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 11/28/2014 09:15 PM, Hans-Peter Diettrich wrote:
> You suggested to use string as UTF-16 on Windows, and UTF-8 on Linux. That's what I understand as a unique program-wide string representation (not sourcecode-wide, instead program as *compiled*). Then I cannot see any need or use for another DynamicString type.

I already understood your meaning, and I understand that this unique program-wide string representation is better than having the libraries' APIs (including TStrings) force a fixed string encoding brand, independently of the OS we compile for (and of selectable $mode specifications). But I don't *suggest* this way, as it is not very versatile and hampers portability. As said, I *suggest* using DynamicString in such cases. Nonetheless, the types simply called String might be done in the way you suggest.

> Nothing can be broken, as long as the Delphi behaviour is undefined.

That of course is correct, but it just follows the poor excuse Embarcadero offers for the flawed implementation of RawByteString (which, as we both agree, will never be fixed). (In fact there are many instances where old flaws have been deliberately reproduced in order not to break compatibility.)

> Applied to FPC/Lazarus code (compiler, libraries, IDE...) this means that it's obviously easier to *prevent* possibly different static/dynamic encodings, instead of *checking and reacting* on such flaws throughout the entire codebase.

OK. Kill the type RawByteString and the constant CP_NONE and the usability of its value $. I do vote for doing so and instead providing new types such as ByteString, WordString, DWordString, and QWordString, denoted by the constants CP_Byte = $FF01, CP_Word = $FF02, CP_DWord = $FF04, CP_QWord = $FF08.

> Apart from that, every encoding-tolerant code will execute much slower than code without a need for checks and conversions everywhere.

As I pointed out, I don't agree at all.
- The check is only two ASM instructions.
- It does not result in additional conversions. In fact, in appropriate cases it can avoid a huge number of conversions (especially when calling libraries, e.g. by means of TStrings).
- In pure user code, the check is only done if DynamicString really is used in the user code, hence only when the user knows what to do. In fact, commonly degradation = 0%.
- When calling libraries (e.g. via TStrings), the check is very small compared to the function call that is done as a result of the same statement. Estimated common degradation = 0.01%.

So the Checking Overhead is nothing but a rumor. (Remember, I don't suggest dropping the standard statically typed paradigm altogether, as tight loops of course work best in that way.)

>> That is why fpc would need to define an additional type name (e.g. DynamicString) and encoding brand number (e.g. CP_ANY = $FF00) for a decently usable type for intermediately holding a String content.
> This again would make *FPC* programs incompatible with Delphi.

As I decently explained, this would not break any backwards compatibility, even if TStrings uses this type.
- The new type is just additional, so its mere existence can't break anything: you don't need to use it in user code if you don't want to.
- The use of DynamicString in the interface of library functions does not break anything, as it is (to be) constructed in a way that provides full compatibility.

Please do show any code (not containing RawByteString) that is not compatible when using the DynamicString paradigm as described in http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support#Analysis . Maybe the page needs to be improved.

> While fixing the RawByteString flaw would at least allow to *compile* FPC code with Delphi, the use of a different encoding value would definitely prevent compilation of such code with Delphi. What's the more serious incompatibility?

IMHO this would be much more dangerous than introducing a decently working new DynamicString type.

>> RawXxxString can be used for really uncoded data as done with old-style strings in a lot of applications.
> Such a feature would be appreciated by many users, indeed :-)

While I would happily follow you in suggesting that indecent use of this type be made impossible via the fpc compiler, I don't think it's very dangerous to re-introduce the abysmal Delphi-compatible behavior of RawByteString (the documented as well as the undocumented features).

But why do you say "would be appreciated"? Is it not possible to use RawByteString in the way the name suggests, by never bringing it together with any String variable of a different encoding brand and hence avoiding any conversion, be it intentional/documented/useful or not?

Anyway: I added a sentence in the introduction of the wiki page, explaining the paradigm a little more explicitly.

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 12/02/2014 01:05 PM, Michael Schnell wrote:
> But why do you say "would be appreciated"? Is it not possible to use RawByteString in the way the name suggests, by never bringing it together with any String variable of a different encoding brand and hence avoiding any conversion, be it intentional/documented/useful or not?

Of course you can't use any TStrings descendant (such as TStringList) in such code, because, as with Delphi, TStrings is based on a statically typed String brand. This would be made possible by introducing DynamicString and using this type for TStrings and friends.

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 11/29/2014 07:55 AM, Jonas Maebe wrote:
> Exactly the same goes for converting strings with code page CP_NONE to a different code page: your program is broken when it tries to do that,

Accessing an array beyond its bounds is not detectable at compile time, and with range checking switched off it is technically not detectable at runtime either, so *undefined* can't be avoided there. In contrast, the attempt to convert strings with code page CP_NONE to a different code page is easily detectable by the compiler, as we have predefined string variable type brands here. Thus, if the outcome is *defined* *to* *be* *undefined*, it can and should result in a compiler error message.

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Michael Schnell wrote:
> On 11/28/2014 09:15 PM, Hans-Peter Diettrich wrote:
>> Apart from that, every encoding-tolerant code will execute much slower than code without a need for checks and conversions everywhere.
> As I pointed out, I don't agree at all.
> - The check is only two ASM instructions
> - It does not result in additional conversions.

It does, e.g. in searching or sorting of StringList, when it can contain strings of different encodings. The choice of a unique encoding for application strings (maybe CP_ACP, UTF-8 or UTF-16) eliminates such conversions.

> So the Checking Overhead is nothing but a rumor. (Remember, I don't suggest dropping the standard statically typed paradigm altogether, as tight loops of course work best in that way.)

The rumor is the unimportant Conversion Overhead, i.e. how often a check leads to a conversion. When no check is required, conversions consequently cannot occur at all.

>>> RawXxxString can be used for really uncoded data as done with old-style strings in a lot of applications.
>> Such a feature would be appreciated by many users, indeed :-)
> But why do you say "would be appreciated"? Is it not possible to use RawByteString in the way the name suggests, by never bringing it together with any String variable of a different encoding brand and hence avoiding any conversion, be it intentional/documented/useful or not?

RawByteString cannot serve two different purposes :-(

In *Delphi* it is used as a polymorphic string, capable of *holding* actual strings of any encoding. But when assigned to a variable of a different encoding, a conversion may occur that converts the string into the declared (static) encoding of the target variable.

In *FPC* it currently is used somewhat close to your idea, i.e. no conversion occurs in both an assignment to *and from* a RawByteString to some other AnsiString.

We can only *hope* that *all* AnsiString operations are based on the dynamic encoding of every operand, with corresponding checks and conversions inserted everywhere. This actually is not true, because the compiler relies on the static encoding of AnsiString variables, and inserts checks and conversions only when that encoding is different. Actually a single AnsiString type would be sufficient, because it already can hold data of any encoding :-(

I understand the FPC attempt to allow *at the same time* for the new (encoded) and old (unencoded) AnsiString behaviour, where no automatic conversions are allowed. But this would require at the same time that e.g. all string literals *also* are stored in that (immutable) encoding, and that this encoding can *not* be changed at runtime, while DefaultSystemCodePage *can* be changed.

When the result of a conversion of a string of encoding CP_NONE is "undefined", which of course is correct for the *dynamic* encoding, this could simply be changed into "conversions of CP_NONE strings do nothing". Then CP_NONE would be the perfect encoding for old-style AnsiStrings, with the only remaining problem being string expressions and assignments in which the operands have a different dynamic encoding. In these cases all operands would have to be converted into the CP_NONE encoding, as specified in another DefaultNoneEncoding constant (not variable!); the same encoding would apply in assignments *to* variables of a different encoding.

Then also all type aliases for AnsiStrings must have unique names, which allow one to distinguish e.g.
  type UTF8String = AnsiString;
from
  type NewUTF8String = type AnsiString(CP_UTF8);

DoDi
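The difference between a plain alias and a type with its own static code page, as in the two declarations above, can be observed with a minimal sketch. It assumes FPC 3.0 or later; `OldStyleString` is a hypothetical name chosen here to avoid redefining the RTL's own UTF8String. What the last line prints depends on DefaultSystemCodePage, so no output is claimed.

```pascal
program StaticVsDynamic;
{$mode objfpc}{$H+}

type
  OldStyleString = AnsiString;                // plain alias: still AnsiString (CP_ACP)
  NewUTF8String  = type AnsiString(CP_UTF8);  // distinct type, static code page UTF-8

var
  Legacy: OldStyleString;
  Utf: NewUTF8String;
  Raw: RawByteString;
begin
  Utf := 'abc';
  Raw := Utf;     // RawByteString target: payload and dynamic code page are kept
  Legacy := Utf;  // static code pages differ, so a conversion may be inserted here

  // StringCodePage reports the *dynamic* code page stored in the string header.
  WriteLn('Utf:    ', StringCodePage(Utf));
  WriteLn('Raw:    ', StringCodePage(Raw));
  WriteLn('Legacy: ', StringCodePage(Legacy));
end.
```

Comparing the three reported values on a given platform shows where the compiler relied on the static declaration and where the dynamic value travelled with the data.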
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Jonas Maebe wrote:
> On 28/11/14 21:30, Hans-Peter Diettrich wrote:
>> I prefer to specify and document everything *before* coding, so that everybody can expect that the code will behave as specified.
> If certain behaviour is explicitly undefined, it *is* specified and documented. It means that your program is buggy if it triggers such behaviour, and that the effect of triggering it could be anything. [...] An example from FPC itself is accessing an array beyond its bounds when range checking is switched off.

After this hint I reviewed the "Code page identifiers" section again, and could probably find the source of the misunderstandings.

  "CP_NONE: this value indicates that no code page information has been associated with the string data. The result of any explicit or implicit operation that converts this data to another code page is undefined."

Does this mean CP_NONE is not an allowed *dynamic* (string *data*) encoding, just like any other undefined encoding value? In this case the description is correct, but it describes a special case of some *undefined* general rule about valid and invalid dynamic encodings in general. Then this general rule should be documented first, not only for CP_NONE.

Then also documentation of the *intended* purpose of CP_NONE, for the *static* encoding of the RawByteString type, is missing entirely. As Delphi doesn't allow for a dynamic encoding of CP_NONE, I don't understand the purpose of the FPC description.

Now in turn some FPC developer might have misunderstood the (Delphi) handling of RawByteStrings, assuming that it is okay to omit a conversion in an assignment of a RawByteString to an AnsiString of a different encoding. That's why I think that the incorrect handling of such RawByteString assignments in FPC should be fixed, according to the general rule of assignments to a string of a different (static) encoding. CP_NONE definitely *is* different from any other encoding, and Delphi does not define an exception for RawByteStrings.

> Exactly the same goes for converting strings with code page CP_NONE to a different code page: your program is broken when it tries to do that, and we cannot guarantee any outcome. This is exactly what "the behaviour is undefined" means.

When a string *really* has a *dynamic* encoding of CP_NONE, this of course is illegal and thus will result in an undefined result. ACK, so far. But since Delphi (quietly) changes a SetCodePage to CP_NONE into the current CP_ACP, the undefined situation (invalid dynamic encoding) must have been forced by some illegal *hack* before, or in the FPC case by some erroneous (not Delphi-conforming) RTL code.

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 27 Nov 2014, at 17:11, Hans-Peter Diettrich drdiettri...@aol.com wrote:
> Such statements come only from writers that do not believe that their words can be understood in various ways ;-)

I'm sorry, but I simply cannot discuss with people who, when I literally state "the result is undefined", think that I may actually have meant "the result is defined, and if you change the implementation and/or keep it stable across compiler releases, then it will also conform to whatever you think that this defined behaviour should be". I have neither the energy nor the patience for that.

Jonas
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 11/27/2014 03:44 PM, Hans-Peter Diettrich wrote:
>> The universal paradigm would allow for extensions (e.g. UTF-32, multiple 16-bit code pages, an additional fully dynamic String type, n-byte un-encoded string types), as I described in the Wiki page.
> Even if feasible, such arbitrary string storage can dramatically increase the number of implicit string conversions.

Of course it can do harm in that respect, if the user is silly enough to *explicitly* define variables in a brand without thinking about what he is doing. But this is exactly the same when he just uses the stuff currently offered by Delphi and fpc. If you arbitrarily define code pages for the variables holding your 8-bit (ANSI) strings, you will enforce many conversions.

Currently in Delphi, if you don't define special code pages, everything will be UTF-16. So no unnecessary conversions.

In fpc (and maybe Lazarus as well) I suppose the way currently in the works is (when not changing the default behavior by certain options):
- when compiling for Windows, String is UTF-16, and the RTL and LCL ubiquitously use String: so no unnecessary conversion
- when compiling for Linux, String is UTF-8, and the RTL and LCL ubiquitously use String: so no unnecessary conversion, either.

If this is done in the libraries (e.g. RTL and LCL) and in user code, this would allow for as few conversions as possible and thus best performance. Here, you would need different library binaries, which might or might not be a problem. But of course the portability is very questionable (including, but not limited to, the fact that the result of pos is different).

When (on top of this) doing the interfaces to libraries (including TStrings) with DynamicString (encoding brand CP_ANY), no additional conversions would be necessary, because all other Strings use the same encoding brand (either UTF-16 or UTF-8, depending on the OS) and hence the dynamic encoding of all DynamicStrings used would always be exactly that brand. Hence, IMHO, this would not harm at all, as the overhead the compiler needs to implement just to check the dynamic type brand and find that no conversion is necessary is extremely small.

But now the user has a choice!
- If he does not do anything regarding the encoding brand of his strings, he will not notice the existence of the DynamicString type at all. Not even performance-wise. (But he might encounter portability issues.)
- If he decides that he wants to use a dedicated encoding brand in all or parts of his code, he of course needs to know what he is doing. This can result
  - in improved portability (if decently done)
  - in improved performance (if decently done), e.g. by using one-byte strings for compactly storing the information and two-byte strings for e.g. search loops, or by using the best-fitting encoding in the loops in the user code while allowing auto-conversion when accessing the libraries in case the underlying OS enforces a different encoding
  - in a disastrous increase of auto-conversions and thus performance degradation (if not decently done).

> An *efficient* implementation would be based on a single program-wide string representation, with different encodings being handled only in an exchange with external data sources.

Yep. But it would result in severe user code portability issues (see above). IMHO using DynamicString at the correct locations would not be (noticeably) less efficient but a lot more versatile.

> <Cassandra>
> After all I have the impression that the known RawByteString flaws will never be fixed in Delphi, in order to encourage the users to take the step to UnicodeString. Now the question is whether these flaws are fixed in FPC, or whether Lazarus will become the first project that definitely requires a complete move to UnicodeString, for reliable operation. For best support of non-UTF-16 platforms I'd suggest to fix the flaws...
> </Cassandra>

I also don't think we will ever see a fix for the poor implementation of RawByteString (avoiding the word "flaw" and the suggestion of a bad purpose), because it would break existing user code.

Regarding fpc, correcting the flaws and keeping the name RawByteString would result in incompatibility issues vs. Delphi and would break code that is ported from Delphi.

That is why fpc would need to define an additional type name (e.g. DynamicString) and encoding brand number (e.g. CP_ANY = $FF00) for a decently usable type for intermediately holding a String content (see Wiki - http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support ).

RawXxxString can be used for really uncoded data, as done with old-style strings in a lot of applications. Even if seriously flawed auto-conversion might be implemented in fpc for RawByteString (for Delphi compatibility), the user can easily avoid it by not directly combining RAW and differently statically encoded strings in an operation.

-Michael
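The portability remark about pos in the message above can be illustrated with a minimal sketch. It assumes FPC 3.0 or later and uses explicit UnicodeString and UTF8String variables to stand in for "String on Windows" and "String on Linux" under the scheme being discussed. The text is built from code units so that the example does not depend on the source file encoding.

```pascal
program PosDiffers;
{$mode objfpc}{$H+}

var
  U16: UnicodeString;
  U8: UTF8String;
begin
  // Build "a", U+00E4 (a-umlaut), "b" step by step.
  U16 := 'a';
  U16 := U16 + WideChar($00E4);
  U16 := U16 + 'b';
  U8 := UTF8Encode(U16);

  // Length and Pos count code units, not characters.
  WriteLn('UTF-16: Length = ', Length(U16), ', Pos of b = ', Pos('b', U16));
  WriteLn('UTF-8:  Length = ', Length(U8), ', Pos of b = ', Pos('b', U8));
end.
```

Since U+00E4 occupies one UTF-16 code unit but two UTF-8 bytes, the UTF-16 line should report position 3 and the UTF-8 line position 4 for the same text, which is the kind of difference user code has to cope with when String changes its encoding per platform.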
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 11/27/2014 07:29 PM, Hans-Peter Diettrich wrote:
> Michael Schnell wrote:
>> E.g. there are (at least) two "Code pages" for UTF-16 (LE and BE) that would be worth supporting.
> You are confusing codepages and encodings :-(

That is why I put quotation marks around "Code pages". I used this wording because fpc (and Delphi?) uses it, abbreviated as CP, in the constant names CP_UTF8, CP_UTF16 and CP_UTF16BE. [See Jonas' post: "CP_UTF16 and CP_UTF16BE can be returned by StringCodePage() when called on a unicodestring, and that's it."]

> See it as a multi-level protocol for text processing.

Yep. I see that it is workable and I understand the (supposedly mostly historical) reasons. But IMHO it is not a good (i.e. crafted from the ground up) concept.

> It's known that the Delphi AnsiString implementation is flawed,...

And hence it's frustrating to see that fpc needs to follow for compatibility reasons. That is why I suggested an improved implementation (see http://wiki.freepascal.org/not_Delphi_compatible_enhancement_for_Unicode_Support). The seriously flawed Delphi-compatible use of the dynamic encoding-brand (and bytes-per-element) information (only implemented with RawByteString) can be left as it is, while a decent implementation with a new DynamicString type (CP_ANY) should be crafted.

> I see no problem in using the same names and values. Delphi documents clearly state: ...

I fear that there will be code that relies on the flawed behavior of RawByteString ("it's a feature, not a bug"), and using the same name with different behavior would break it. And a really usable DynamicString would not adhere to that description.

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Jonas Maebe wrote:
> I'm sorry, but I simply cannot discuss with people who, when I literally state "the result is undefined", think that I may actually have meant "the result is defined, and if you change the implementation and/or keep it stable across compiler releases, then it will also conform to whatever you think that this defined behaviour should be". I have neither the energy nor the patience for that.

I also have no use for continuing such discussions. I prefer to specify and document everything *before* coding, so that everybody can expect that the code will behave as specified.

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Michael Schnell wrote:
> I fear that there will be code that relies on the flawed behavior of RawByteString ("it's a feature, not a bug"), and using the same name with different behavior would break it. And a really usable DynamicString would not adhere to that description.

How can somebody rely on behaviour *stated* as undefined, or not working as defined?

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Michael Schnell wrote:
> On 11/27/2014 03:44 PM, Hans-Peter Diettrich wrote:
>> An *efficient* implementation would be based on a single program-wide string representation, with different encodings being handled only in an exchange with external data sources.
> Yep. But it would result in severe user code portability issues (see above). IMHO using DynamicString at the correct locations would not be (noticeably) less efficient but a lot more versatile.

You suggested to use string as UTF-16 on Windows, and UTF-8 on Linux. That's what I understand as a unique program-wide string representation (not sourcecode-wide, instead program as *compiled*). Then I cannot see any need or use for another DynamicString type.

> I also don't think we will ever see a fix for the poor implementation of RawByteString (avoiding the word "flaw" and the suggestion of a bad purpose), because it would break existing user code.

Nothing can be broken, as long as the Delphi behaviour is undefined. Code relying on specific compiler/library bugs is bound to that compiler, not portable in any way.

> Regarding fpc, correcting the flaws and keeping the name RawByteString would result in incompatibility issues vs. Delphi and would break code that is ported from Delphi.

Same as above. When application code works properly with strings of *sometimes* different static and dynamic encoding, it will not stop working with strings of *never* different encodings.

Of course the opposite is not true. When some code works properly (only) with strings of the same static and dynamic encoding, it will stop working when compiled with Delphi. Then the coder has to insert explicit checks for the dynamic encoding of *all* strings, all over his code.

Applied to FPC/Lazarus code (compiler, libraries, IDE...) this means that it's obviously easier to *prevent* possibly different static/dynamic encodings, instead of *checking and reacting* on such flaws throughout the entire codebase. Apart from that, every encoding-tolerant code will execute much slower than code without a need for checks and conversions everywhere.

I seriously doubt that the FPC developers ever realized these consequences, and the amount of time required for finding, reporting and fixing the bugs in all affected pieces of their code :-(

> That is why fpc would need to define an additional type name (e.g. DynamicString) and encoding brand number (e.g. CP_ANY = $FF00) for a decently usable type for intermediately holding a String content.

This again would make *FPC* programs incompatible with Delphi. While fixing the RawByteString flaw would at least allow to *compile* FPC code with Delphi, the use of a different encoding value would definitely prevent compilation of such code with Delphi. What's the more serious incompatibility?

> RawXxxString can be used for really uncoded data as done with old-style strings in a lot of applications.

Such a feature would be appreciated by many users, indeed :-)

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 28/11/14 21:30, Hans-Peter Diettrich wrote:
> I prefer to specify and document everything *before* coding, so that everybody can expect that the code will behave as specified.

If certain behaviour is explicitly undefined, it *is* specified and documented. It means that your program is buggy if it triggers such behaviour, and that the effect of triggering it could be anything. This is standard practice in computer science. E.g., pretty much every manual of every processor contains descriptions of explicitly undefined behaviour (search e.g. for "undefined" in the Intel or ARM architecture manuals).

An example from FPC itself is accessing an array beyond its bounds when range checking is switched off. *Some* of the possible outcomes are accessing a value from a variable declared before/after it, accessing random data that has nothing to do with any of those variables, a program crash, or actually accessing an element of the array anyway. We don't guarantee that any of those possibilities will happen, we don't say that those are the only possibilities, and we don't say they stay the same across compiler or OS versions, or even across program executions. Hence, it's undefined.

Exactly the same goes for converting strings with code page CP_NONE to a different code page: your program is broken when it tries to do that, and we cannot guarantee any outcome. This is exactly what "the behaviour is undefined" means.

Jonas
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
In our previous episode, Hans-Peter Diettrich said:
>> concatenated without data loss and that the result is then converted to the target string's encoding (except in case the target is RawByteString). How that is implemented exactly is undefined; again in the meaning of "undefined", not in the meaning of "undefined when defined as meaning X".
> In this case the implementation is compiler specific, somewhat different from undefined (in a RawByteString):
> "CP_NONE: this value indicates that no code page information has been associated with the string data. The result of any explicit or implicit operation that converts this data to another code page is undefined."
> IMO the result is well defined: it's the string with the encoding of that other codepage. An undefined result, as I understand it, would mean the result can be anything, unrelated to the function input.

This is usually called "implementation defined". But implementation defined implies it will remain the same in every iteration of the compiler (usually documented). If that is not wanted/possible, then it is considered undefined.

So even if a value happens to be defined in one version of the compiler, it doesn't automatically make it implementation defined. It needs to be a documented choice for that.
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 11/26/2014 05:25 PM, Sven Barth wrote:
>> So seemingly you could do MyStringType = type AnsiString(CP_UTF16), and seemingly the size information is set according to this.
> No, you can't, because the RTL does not handle that. For AnsiString the element size is *always* 1. It's hardcoded. AFAIK Delphi even does a compile error if you use CP_UTF16.

Thanks for the clarification. I now understand that the Element Size field in the String header is quite dummy, as under the hood there are two completely separate concepts for one-byte Strings and two-byte Strings and none for other element sizes.

This to me is not obvious at all, as the language syntax and the String header data structure suggest a more universal paradigm for multiple string type brands that each have an element-size and code-ID-number setting, handled by a common infrastructure.

The universal paradigm would allow for extensions (e.g. UTF-32, multiple 16-bit code pages, an additional fully dynamic String type, n-byte un-encoded string types), as I described in the Wiki page. The dual mode concept of course does not provide such extensibility, and so I stop thinking about this (and bothering the community), and am happy that it just works as it is.

Thanks again,
-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 11/26/2014 05:37 PM, Jonas Maebe wrote:
> invalid (in the meaning of undefined) in both FPC and Delphi.

Sorry (I am not a native speaker). But to me "undefined" and "invalid" have completely different meanings (in this context). An invalid use of the language would result in an error (compiler or runtime), while an undefined language construct would result in something that might work in some way, but there is no guarantee that the outcome is always the same (e.g. in another instance or another compiler version).

> CP_UTF16 and CP_UTF16BE can be returned by StringCodePage() when called on a unicodestring, and that's it.

I now do understand (see my reply to Sven).

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 11/26/2014 09:30 PM, Hans-Peter Diettrich wrote:
>> So seemingly you could do MyStringType = type AnsiString(CP_UTF16), and seemingly the size information is set according to this.
> Not in Delphi XE.

Thanks for the clarification. I did have some hope that fpc would be (or could be extended to be) better than Delphi in that respect.

I now do see the reason that resulted in the (to me rather queer) naming "AnsiString" for the code page aware string type. I erroneously supposed the syntax that finally would be used would be something like MyStringType = type String(CP_UTF16), with no restriction to ANSI, but with the CP_ constant defining both a code page and an element size, as suggested by the language syntax while working with strings using auto-conversion, and by the structure of the string content header.

There still might be room for (fully compatible) improvement (as I described in the Wiki), but it's even more difficult to do than I supposed.

Thanks again,
-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 11/26/2014 07:13 PM, Hans-Peter Diettrich wrote:
> Not all codepages have a fixed number of bytes per character. The string preamble contains the *element size* (1 for AnsiString), just like with every dynamic array.

Sorry for sloppy wording. Of course I did mean "element size" ("Character" here obviously is not "printable item").

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 26/11/14 23:41, Hans-Peter Diettrich wrote:
> In this case the implementation is compiler specific, somewhat different from undefined (in a RawByteString):
> "CP_NONE: this value indicates that no code page information has been associated with the string data. The result of any explicit or implicit operation that converts this data to another code page is undefined."
> IMO the result is well defined: it's the string with the encoding of that other codepage.

Unless you actually tested this on all platforms and noted that is the case, you cannot state this. And if you would actually test it, you would discover that it is wrong (http://bugs.freepascal.org/view.php?id=22501#c61238 ).

As mentioned in a previous discussion: don't use "IMO" (in my opinion) when talking about testable facts. A testable fact is either true or false, opinions do not enter the picture.

> An undefined result, as I understand it, would mean the result can be anything, unrelated to the function input.

Which is 100% correct.

> IMO a better wording should be found, that does not cause the current obvious confusion of some readers.

The confusion only occurs for readers that do not believe what is written.

Jonas
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Michael Schnell wrote:
> I now understand that the Element Size field in the String header is quite dummy, as under the hood there are two completely separate concepts for one-byte-Strings and 2-Byte Strings and none for other Element sizes.

After a code review I realized that the element size field is specific to dynamic strings, not present in dynamic arrays. Since the element size is bound to the string type, it could be omitted in the FPC implementation. [With little win, when the record alignment is preserved]

> This to me is not obvious at all, as the language syntax and the String header data structure suggest a more universal paradigm for multiple string type brands, that each have an element-size and code-ID-number setting, handled by a common infrastructure.

This may have been envisaged by the Delphi architects, but was not continued later.

> The universal paradigm would allow for extensions (e.g. UTF-32, multiple 16 Bit Code pages, an additional fully dynamic String type, n-byte un-encoded string types), as I described in the Wiki page.

Even if feasible, such arbitrary string storage can dramatically increase the number of implicit string conversions. An *efficient* implementation would be based on a single program-wide string representation, with different encodings being handled only in an exchange with external data sources. That standard encoding may be Ansi or Unicode; even Delphi allows for both models, where Ansi again suggests the use of one specific codepage (CP_ACP) for best performance.

<Cassandra>
After all I have the impression that the known RawByteString flaws will never be fixed in Delphi, in order to encourage the users to take the step to UnicodeString. Now the question is whether these flaws are fixed in FPC, or whether Lazarus will become the first project that definitely requires a complete move to UnicodeString, for reliable operation. For best support of non-UTF-16 platforms I'd suggest to fix the flaws...
</Cassandra>

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Michael Schnell wrote:
> On 11/26/2014 07:13 PM, Hans-Peter Diettrich wrote:
>> Not all codepages have a fixed number of bytes per character. The string preamble contains the *element size* (1 for AnsiString), just like with every dynamic array.
> Sorry for sloppy wording. Of course I did mean "element size" ("Character" here obviously is not "printable item").

I'd restrict the use of "character" to physical Char types, just to avoid any misinterpretation. Printable items (glyphs) are independent from the storage format. Ligatures or umlauts can consist of multiple codepoints, and several Unicode codepoints are not even printable. A single printable character, as selectable by a single cursor step, can consist of multiple codepoints, even (or just) in Unicode.

That's why I'd expect that the FPC documentation includes a glossary and definition of the terms, which should be used in the documentation and discussions.

DoDi
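A small illustration of the "one printable character, several codepoints" remark above. This is only a Python sketch (not FPC code); the helper name `describe` is made up for the example. It compares the precomposed and the decomposed spelling of the same umlaut.

```python
import unicodedata

composed = "\u00fc"      # 'ü' as one precomposed code point
decomposed = "u\u0308"   # 'u' followed by COMBINING DIAERESIS


def describe(text):
    """Return (code points, UTF-8 bytes, UTF-16 code units) for text."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")) // 2,
    )


print(describe(composed))
print(describe(decomposed))

# Both spellings render as the same glyph; NFC normalization maps the
# decomposed form onto the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The same visible character therefore has different lengths depending on whether one counts storage elements, codepoints or glyphs, which is why a glossary of these terms would help.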
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Jonas Maebe wrote:
> On 26/11/14 23:41, Hans-Peter Diettrich wrote:
>> In this case the implementation is compiler specific, somewhat different from undefined (in a RawByteString):
>> "CP_NONE: this value indicates that no code page information has been associated with the string data. The result of any explicit or implicit operation that converts this data to another code page is undefined."
>> IMO the result is well defined: it's the string with the encoding of that other codepage.
> Unless you actually tested this on all platforms and noted that is the case, you cannot state this. And if you would actually test it, you would discover that it is wrong (http://bugs.freepascal.org/view.php?id=22501#c61238 ).

Bugs obviously violate some specification/definition, else it's not a bug, it's a feature ;-)

> As mentioned in a previous discussion: don't use "IMO" (in my opinion) when talking about testable facts. A testable fact is either true or false, opinions do not enter the picture.

We're just talking about interpretations, not facts.

>> An undefined result, as I understand it, would mean the result can be anything, unrelated to the function input.
> Which is 100% correct.

Do you see any use for such function definitions, except in random generators?

>> IMO a better wording should be found, that does not cause the current obvious confusion of some readers.
> The confusion only occurs for readers that do not believe what is written.

Such statements come only from writers that do not believe that their words can be understood in various ways ;-)

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Michael Schnell wrote:
> On 11/26/2014 06:37 PM, Hans-Peter Diettrich wrote:
>> An AnsiString consists of AnsiChar's. The *meaning* of these char's (bytes) depends on their encoding, regardless of whether the used encoding is or is not stored with the string.
> I understand that the implementation (in Delphi) seems to be driven more by the wording (ANSI) than by the logical paradigm the language syntax suggests. The language syntax and the string header fields suggest that both the element-size and the code-ID-number need to be adhered to (be it statically or dynamically, depending on the usage instance). E.g. there are (at least) two code pages for UTF-16 (LE and BE) that would be worth supporting.

You are confusing codepages and encodings :-(
UTF-7, UTF-8, UTF-16 and UTF-16BE describe different representations of the same values (Unicode codepoints). And I agree, all commonly used encodings should be implemented, at least for data import/export.

>> It's essential to distinguish between low-level (physical) AnsiChar values, and *logical* characters possibly consisting of multiple AnsiChars.
> I now do see that the implementation is done following this concept. But the language syntax and the string header field suggest a more versatile paradigm, providing a universal reference counting "element string" type.

See it as a multi-level protocol for text processing. The bottom (physical) level deals with physical storage items (AnsiChar, WideChar...), and how they are stored in memory or files. Just as it doesn't make sense to deal with individual bytes of real numbers in computations, it doesn't make sense to deal with individual bytes (AnsiChars) of logical characters, except in type/encoding conversions.

Higher levels deal with logical values, which can consist of multiple physical items, and may need different interpretations (in case of Ansi codepages). This level is partially covered now by AnsiString encodings and UTF-16 surrogate pairs, which allow mapping the values into full Unicode (UCS-4) codepoints. But these codepoints still are not sufficient for a correct interpretation and manipulation of logical characters, which again can consist of multiple codepoints (decomposed umlauts, ligatures...).

In a next level another (mostly language specific) interpretation may be required, like which logical characters have to be treated together (ligatures, non-breaking characters...). Some natural languages (Hebrew, Arabic...) require another special handling of (mixed) LTR/RTL reading, and of paths, influencing the graphical representation of character sequences; but that's nothing an application or library writer should have to deal with, such functionality should be provided by the target platform.

There must be a boundary between the standard (RTL) handling of the physical items and encodings, and higher text processing levels, up to language specific processing (how to break words, when to apply capitalization, syntax checks...), so that such special handling can be implemented in dedicated extensions (libraries, classes), by developers familiar with the rules and conventions of the natural languages.

For now we are talking only about the handling up to individual Unicode codepoints, and related string manipulation. For this at least one string representation must exist that covers the full Unicode range of codepoints (UTF-8 or UTF-16 for now). When such an implementation claims "undefined behaviour", then this can only mean implementation flaws, resulting in something different from what can be expected from proper Unicode handling. This includes invalid parameter values in subroutine calls, which should result in proper (defined) runtime error reporting (AV, error result...).

WRT AnsiString encodings, the only acceptable (expected) differences can result from lossy conversions, when converting proper Unicode into a non-UTF encoding. Even then the results should be consistent, even if the concrete results depend on some external (platform...) convention or settings. IMO.

>> That's why I wonder *when* exactly the result of such an expression *is* converted (implicitly) into the static encoding of the target variable, and when *not*.
> I understand that the idea is, to use the static encoding information provided by the type definition whenever possible.

Right, but here "whenever possible" depends on the correspondence of static and dynamic encoding. When the dynamic encoding can *ever* be different from the static encoding, except for RawByteString, I consider it NOT possible to derive the need for a conversion from the static encoding.

In the handling of floatingpoint values we may have to expect invalid operations (division by zero, overflow...) or values (NaN...), but NOT that a Double variable ever contains two Integer values, unless forced by dirty hacks out of compiler control. Why should this be different and acceptable with
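To make the "different representations of the same codepoints" and "lossy conversion" points of the message above concrete, here is a short Python sketch (not FPC code). The sample text and the choice of cp1252 as the non-UTF target are assumptions made only for the illustration.

```python
# 'ä' (U+00E4) plus the musical G clef (U+1D11E), which lies outside the BMP.
text = "\u00e4\U0001D11E"

utf8 = text.encode("utf-8")
utf16le = text.encode("utf-16-le")
utf16be = text.encode("utf-16-be")

# Three different byte sequences ...
print(utf8.hex(), utf16le.hex(), utf16be.hex())

# ... that all decode back to the same two code points.
same_text = (
    utf8.decode("utf-8")
    == utf16le.decode("utf-16-le")
    == utf16be.decode("utf-16-be")
    == text
)
print(same_text)

# U+1D11E needs a surrogate pair in UTF-16: 3 code units for 2 code points.
utf16_units = len(utf16le) // 2
print(utf16_units)

# Converting to a non-UTF code page is lossy: the clef has no cp1252
# equivalent and is replaced by '?' when errors="replace" is requested.
lossy = text.encode("cp1252", errors="replace")
print(lossy)
```

The replacement character chosen for unmappable codepoints is a platform or library convention, which matches the remark that lossy results depend on external settings while the lossless paths should always agree.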
[fpc-devel] Trying to understand the wiki-Page FPC Unicode support
I fail to understand some of the text.

It seems to be unavoidable to use the name ANSIString even though I always throw up when seeing a thing called ANSI containing Unicode (e.g. UTF8String = type AnsiString(CP_UTF8) ). Seemingly here the "bytes per character" setting implicitly is thought of as a part of the code-page definition. Correct ?

In section "Dynamic code page":
"When assigning a string to a plain AnsiString (= AnsiString(CP_ACP)) or ShortString, the string data will however be converted to DefaultSystemCodePage. The dynamic code page of that AnsiString(CP_ACP) will then be the current value of DefaultSystemCodePage (e.g. 1250 for the Windows-1250 code page), even though its static code page is CP_ACP (which is a constant different from 1250). This is one example of how the static code page can differ from the dynamic code page. Subsequent sections will describe more such scenarios."

1) A short String does not have a Code page notification, so for this "the static code page can differ from the dynamic code page" does not seem to make much sense.

2) I fail to understand how, with this explanation that seems to force auto conversion for assignments between types with different code page settings (also for CP_ACP), "the static code page can differ from the dynamic code page" can happen.

In fact this disaster seems to be able to happen (see section RawByteString) if assigning a string with a static code page X1 to a RawByteString (hence no conversion) and then assigning that RawByteString to a string with a static code page X2 (no conversion again). In fact I assume that without abusing RawByteString such "intersexual" strings can't be produced, otherwise this would be rather disastrous for normal users.

In section "RawByteString":
"the results of conversions from/to the CP_NONE code page are undefined."
In effect the behavior is exactly defined in this section "As a first approximation". Does that mean it is due to be changed ?

Is there a reason not to keep the described behavior (just never do any conversion) ? Of course this can produce "intersexual" strings. Is this great harm ? If yes, I think assigning a RawByteString to a string with a static code page should be completely forbidden at compile time or result in a runtime error if the code page does not match.

-Michael
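The RawByteString round trip described above (X1 to RawByteString to X2, with no conversion at either step) can be modelled in a few lines of Python. This is a sketch of the effect only, not of the RTL: `TaggedBytes`, `convert` and `relabel` are hypothetical names, and Python codec names stand in for code pages.

```python
from dataclasses import dataclass


@dataclass
class TaggedBytes:
    """Hypothetical stand-in for a code-page-aware string: payload plus tag."""
    data: bytes
    codec: str  # Python codec name standing in for the code page

    def text(self):
        """Interpret the payload according to the tag."""
        return self.data.decode(self.codec)


def convert(s, target_codec):
    """Re-encode the payload so that the text stays the same."""
    return TaggedBytes(s.text().encode(target_codec), target_codec)


def relabel(s, target_codec):
    """Keep the payload and only change the tag (no conversion)."""
    return TaggedBytes(s.data, target_codec)


original = TaggedBytes("Grüße".encode("utf-8"), "utf-8")
converted = convert(original, "cp1252")
relabeled = relabel(original, "cp1252")

print(converted.text())   # text preserved, bytes changed
print(relabeled.text())   # bytes preserved, text garbled
```

Relabelling keeps the bytes but silently changes what they mean, which is the harm in question: the payload no longer matches the encoding its type claims.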
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On Wed, 26 Nov 2014 11:23:17 +0100 Michael Schnell mschn...@lumino.de wrote:
> [...]
> It seems to be unavoidable to use the name ANSIString even though I always throw up when seeing a thing called ANSI containing Unicode (e.g. UTF8String = type AnsiString(CP_UTF8) ).

Is there a question?

> Seemingly here the "bytes per character" setting implicitly is thought of as a part of the code-page definition. Correct ?

Code pages define bytes per character. As you know: Don't confuse character with glyph and codepoint. Ansistring supports only one byte per character code pages.

> In section "Dynamic code page":
> "When assigning a string to a plain AnsiString (= AnsiString(CP_ACP)) or ShortString, the string data will however be converted to DefaultSystemCodePage. The dynamic code page of that AnsiString(CP_ACP) will then be the current value of DefaultSystemCodePage (e.g. 1250 for the Windows-1250 code page), even though its static code page is CP_ACP (which is a constant different from 1250). This is one example of how the static code page can differ from the dynamic code page. Subsequent sections will describe more such scenarios."
> 1) A short String does not have a Code page notification, so for this "the static code page can differ from the dynamic code page" does not seem to make much sense.

What is a "Code page notification"? Do you mean code page information? IMO the phrase "The dynamic code page of that AnsiString" is clear, that it does *not* talk about ShortString.

Mattias
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 11/26/2014 11:40 AM, Mattias Gaertner wrote:
> Ansistring supports only one byte per character code pages.

Even more confused. Am I wrong thinking that with code aware Strings, for Delphi XE compatibility, in Windows CP_ACP needs to be UTF16 (if not right now, then due later) ?

> What is a "Code page notification"? Do you mean code page information?

Yep.

> that it does *not* talk about ShortString.

OK.

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 26.11.2014 11:53, Michael Schnell mschn...@lumino.de wrote:
> On 11/26/2014 11:40 AM, Mattias Gaertner wrote:
>> Ansistring supports only one byte per character code pages.
> Even more confused. Am I wrong thinking that with code aware Strings, for Delphi XE compatibility, in Windows CP_ACP needs to be UTF16 (if not right, than due later) ?

Yes, you're wrong. In Delphi (and FPC) CP_ACP corresponds by default with the current system codepage (e.g. CP1252 on a German Windows). CP_UTF16 is not supported, because AnsiString only supports 1-byte character strings (and UTF-8 as the odd one) and not 2-byte character strings.

The difference to Delphi currently is that for FPC String=AnsiString(CP_ACP) and for Delphi String=UnicodeString (aka 2-byte string).

Regards,
Sven
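The element-size point in the reply above (1-byte elements for AnsiString, including UTF-8 as "the odd one", versus 2-byte elements for UnicodeString) can be checked with a few encodings in Python. This is only an illustration; `storage` is a made-up helper and the element sizes are passed in by hand rather than read from any string header.

```python
def storage(text, codec, element_size):
    """Return (number of elements, number of bytes) for the encoded text."""
    data = text.encode(codec)
    return len(data) // element_size, len(data)


word = "Grüße"  # 5 code points

ansi = storage(word, "cp1252", 1)      # 1-byte elements, 1 element per character
utf8 = storage(word, "utf-8", 1)       # 1-byte elements, 'ü' and 'ß' take 2 each
utf16 = storage(word, "utf-16-le", 2)  # 2-byte elements

print(ansi, utf8, utf16)
```

UTF-8 fits the 1-byte element model only because its elements are bytes; the number of elements per character still varies, which is what makes it the odd one.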
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On Wed, 26 Nov 2014 11:23:17 +0100 Michael Schnell mschn...@lumino.de wrote:
> [...]
> 2) I fail to understand how with this explanation that seems to force auto conversion for assignments between types with different code page settings (also for CP_ACP) "the static code page can differ from the dynamic code page" can happen.

For example: CP_ACP=0, DefaultSystemCodePage=1252.
That means the static code page is always 0, while the dynamic code page can be 0 or 1252. Both describe the same encoding.
RawByteString has static cp CP_NONE=$FFFF, but its dynamic cp is always different, for example CP_ACP=0, 1252 or CP_UTF8.

> In fact this disaster seems to be able to happen (see section RawByteString) if assigning a string with a static code page X1 to a RawByteString (hence no conversion) and then assigning that RawByteString to a string with a static code page X2 (no conversion again). In fact I assume that without abusing RawByteString such "intersexual" strings can't be produced, otherwise this would be rather disastrous for normal users.

You can use SetCodePage as well. ;)

> In section "RawByteString": "the results of conversions from/to the CP_NONE code page are undefined."

... because CP_NONE is not a real code page.

Mattias
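The static/dynamic distinction from the reply above can be sketched as a toy model in Python. This is an assumption-laden illustration of the wiki's description as quoted in the thread, not the RTL implementation: `ModelString`, `effective_cp`, `from_text`, `assign` and the `CODECS` table are all hypothetical, and the model always resolves CP_ACP to the default code page.

```python
from dataclasses import dataclass

CP_ACP = 0         # placeholder meaning "use the default system code page"
CP_UTF8 = 65001
CP_NONE = 0xFFFF   # static code page of RawByteString

# Hypothetical mapping from code page numbers to Python codec names.
CODECS = {1250: "cp1250", 1252: "cp1252", CP_UTF8: "utf-8"}

default_system_code_page = 1252  # stand-in for DefaultSystemCodePage


@dataclass
class ModelString:
    static_cp: int   # fixed by the declared type
    dynamic_cp: int  # describes the bytes actually stored
    data: bytes


def effective_cp(static_cp):
    """Resolve the CP_ACP placeholder to the current default code page."""
    return default_system_code_page if static_cp == CP_ACP else static_cp


def from_text(text, static_cp):
    """Create a model string holding text in its declared encoding."""
    cp = effective_cp(static_cp)
    return ModelString(static_cp, cp, text.encode(CODECS[cp]))


def assign(source, target_static_cp):
    """Model assigning source to a variable with the given static code page."""
    if target_static_cp == CP_NONE:
        # RawByteString target: bytes and dynamic code page are kept as-is.
        return ModelString(CP_NONE, source.dynamic_cp, source.data)
    cp = effective_cp(target_static_cp)
    text = source.data.decode(CODECS[source.dynamic_cp])
    return ModelString(target_static_cp, cp, text.encode(CODECS[cp]))


utf8 = from_text("ä", CP_UTF8)
plain = assign(utf8, CP_ACP)   # converted to the default code page
raw = assign(utf8, CP_NONE)    # taken over without conversion

print(plain.static_cp, plain.dynamic_cp, plain.data)
print(raw.static_cp, raw.dynamic_cp, raw.data)
```

In both results the static and dynamic code pages differ, yet neither string is inconsistent: the dynamic value always describes the bytes that are actually stored.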
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On Wed, 26 Nov 2014 11:52:50 +0100 Michael Schnell mschn...@lumino.de wrote:
> On 11/26/2014 11:40 AM, Mattias Gaertner wrote:
>> Ansistring supports only one byte per character code pages.
> Even more confused. Am I wrong thinking that with code aware Strings, for Delphi XE compatibility, in Windows CP_ACP needs to be UTF16 (if not right, than due later) ?

No. In mode delphiunicode String=UnicodeString.

Mattias
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 11/26/2014 12:09 PM, Sven Barth wrote:
> In Delphi (and FPC) CP_ACP corresponds by default with the current system codepage (e.g. CP1252 on a German Windows).

OK. So in Delphi XE (in Germany) String(CP_ACP) is the same as String(CP1252) but different from String without brackets which in turn is the same as String(CP_UTF16) ? Correct ?

> CP_UTF16 is not supported, because AnsiString only supports 1-Byte character strings (and UTF-8 as the odd one) and not 2-Byte character strings.

I still don't understand. The wiki article seems to suggest that it is about a type called ANSIString that features a dynamically settable code page information. From discussions about Delphi and FPC, I only know a String type with a dynamically settable code page information that also features a dynamically settable "bytes per character" information and hence does support 1, 2 and 4 bytes per character (e.g. UTF-8, UTF-16, and UTF-32).

> The difference to Delphi currently is that for FPC String=AnsiString(CP_ACP) and for Delphi String=UnicodeString (aka 2-Byte string).

I understand that you mean (e.g.) Delphi XE. But what version of FPC is "currently" ? Am I wrong assuming that in the svn we do have the NewStrings library that supports dynamical code-page *and* byte-per-character settings and hence supports e.g. CP1251, UTF-8, UTF-16, and UTF-32 ?

So I seem to understand the meaning of String(CP1252), String(CP_UTF8), and String(CP_UTF16) (which seems to be the Delphi notation), but I seemingly don't get the exact meaning of AnsiString(CP_ACP) or AnsiString(CP1251).

In the end, what the definition of String without brackets is, might be due to a settable compiler option and/or the OS the compiler is set to create code for.

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 11/26/2014 12:13 PM, Mattias Gaertner wrote:
> In mode delphiunicode String=UnicodeString.

I see. So even in Delphi XE where UnicodeString is denoted by CP_UTF16, the value of the constant CP_UTF16 is not the same as the value of the (constant or) variable CP_ACP (while OTOH using the value of CP_UTF16 in a type or variable definition performs the same as using 0 {is CP_DEFAULT the name of the appropriate constant ?} ).

I understand that fpc with mode delphiunicode is supposed to work in the same way.

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 11/26/2014 12:10 PM, Mattias Gaertner wrote:
>> the results of conversions from/to the CP_NONE code page are undefined.
> ... because CP_NONE is not a real code page.

So you understand "result" as what you would get when printing. In the context of this wiki page I would understand "result" as the binary content of the variable in question.

Is this "undefined" in the meaning of "not predictable by the user in the current version of fpc", or in the meaning of "due to change when updating fpc" ?

-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
After re-reading, yet another question:

In section "String concatenations" there is no mention of auto-conversion. For statically typed Strings it's rather obvious that they will be auto-converted if appropriate. Technically, if differently encoded, they seem to be converted to Unicode and the result is converted to match the target.

Regarding RawByteStrings there has been the definition:
"a RawByteString has exactly the same behavior as assigning that AnsiString(X) to another AnsiString(X) variable with the same value of X: no code page conversion or copying occurs."

Seemingly this is not true for the intermediate results of concatenations. Here the dynamical encoding information seems to define the fact and type of conversion. If this is the fact, it should be mentioned. (Whether or not this makes sense is another question: is the code information of RawByteString meant to be NONE (i.e. RAW) or dynamic (i.e. complex) ?)

-Michael
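The concatenation strategy guessed at above (convert differently encoded operands to Unicode, concatenate, then convert to the target encoding) can be sketched in Python. Whether FPC does exactly this is the open question of the message; the sketch only shows what the strategy means. `concat` is a hypothetical helper, and Python codec names stand in for code pages.

```python
def concat(parts, target_codec):
    """Concatenate (bytes, codec) pairs via Unicode and encode for the target."""
    text = "".join(data.decode(codec) for data, codec in parts)
    return text.encode(target_codec)


left = ("Grü".encode("cp1252"), "cp1252")
right = ("ße".encode("utf-8"), "utf-8")

converted = concat([left, right], "utf-8")
naive = left[0] + right[0]  # plain byte concatenation, no conversion at all

print(converted.decode("utf-8"))
print(naive == converted)
```

The naive byte concatenation mixes two encodings in one buffer, so no single code page tag can describe the result. That is why the dynamic encoding of each operand matters for concatenation even when a RawByteString is involved.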
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 26.11.2014 12:37, Michael Schnell mschn...@lumino.de wrote:
> On 11/26/2014 12:09 PM, Sven Barth wrote:
>> In Delphi (and FPC) CP_ACP corresponds by default with the current system codepage (e.g. CP1252 on a German Windows).
> OK. So in Delphi XE (in Germany) String(CP_ACP) is the same as String(CP1252) but different from String without brackets which in turn is the same as String(CP_UTF16) ? Correct ?

There is no String with brackets. You can only use AnsiString followed by brackets, not String. And String in Delphi 2009+ is the same as UnicodeString, which is a different compiler internal type than AnsiString(CP_UTF16) would be if it were allowed.

>> CP_UTF16 is not supported, because AnsiString only supports 1-Byte character strings (and UTF-8 as the odd one) and not 2-Byte character strings.
> I still don't understand. The wiki article seems to suggest that it is about a type called ANSIString that features a dynamically settable code page information. From discussions about Delphi and FPC, I only know a String type with a dynamically settable code page information that also features a dynamically settable "bytes per character" information and hence does support 1, 2 and 4 bytes per character (e.g. UTF-8, UTF-16, and UTF-32).

While both AnsiString and UnicodeString have the current codepage and the character size in their header record, the code page is only used for AnsiString and the size cannot be influenced in any way (for an AnsiString it's always 1 and for a UnicodeString it's always 2). There is no UTF-32 string (at least not in the sense of a compiler provided type).

>> The difference to Delphi currently is that for FPC String=AnsiString(CP_ACP) and for Delphi String=UnicodeString (aka 2-Byte string).
> I understand that you mean (e.g.) Delphi XE. But what version of FPC is "currently".

FPC is none, because when Delphi introduced the code page aware AnsiString it switched at the same time from having String=AnsiString to String=UnicodeString. FPC did only the first part for now (so at best FPC would be a "not quite 2009" :P ).

> Am I wrong assuming that in the svn we do have the NewStrings library that supports dynamical code-page *and* byte-per-character settings and hence supports e.g. CP1251, UTF-8, UTF-16, and UTF-32 ? So I seem to understand the meaning of String(CP1252), String(CP_UTF8), and String(CP_UTF16) (which seems to be the Delphi notation), but I seemingly don't get the exact meaning of AnsiString(CP_ACP) or AnsiString(CP1251)

No. The Delphi notation is the same as in FPC: AnsiString(codepage). And an AnsiString(CP_1251) normally holds string data encoded with the CP-1251 codepage, while an AnsiString(CP_ACP) holds string data encoded with whatever encoding the DefaultSystemCodePage denoted at the time of assignment. This can be for example CP_1251 as well or something different like CP_UTF8 (it can however not be CP_ACP again nor CP_UTF16 nor CP_UTF32).

> In the end, what the definition of String without brackets is, might be due to a settable compiler option and/or the OS the compiler is set to create code for.

That is already the case:
- any mode, H- : ShortString
- any mode except delphi_unicode, H+ : AnsiString(CP_ACP)
- mode delphi_unicode, H+ : UnicodeString
(there's also a modeswitch to change String to UnicodeString, but I forgot its name -.-)

Please note that these switches are always per unit, as precompiled units (like the RTL ones) cannot be influenced.

Regards,
Sven
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 26/11/14 12:53, Michael Schnell wrote:
> [CP_NONE] Is this undefined in the meaning of not predictable by the user in the current version of fpc, or in the meaning of due to change when updating fpc.

This undefined literally means undefined. It does not mean undefined in a meaning that is defined in a particular way.

Jonas
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 11/26/2014 03:05 PM, Sven Barth wrote:
>> OK. So in Delphi XE (in Germany) String(CP_ACP) is the same as String(CP1252) but different from String without brackets which in turn is the same as String(CP_UTF16) ? Correct ?
>
> There is no String with brackets. You can only use AnsiString followed by brackets, not String. And String in Delphi 2009+ is the same as UnicodeString which is a different compiler internal type than AnsiString(CP_UTF16) would be if it would be allowed. While both AnsiString and UnicodeString have the current codepage and the character size in their header record the code page is only used for AnsiString and the size can not be influenced in any way (for an AnsiString it's always 1 and for a UnicodeString it's always 2).

OK. So what is the notation in Delphi (and hence supposedly in FPC with mode delphiunicode) to define a variable with the (static) string encoding type CP with XXX = 1252, UTF8, UTF16 ?

I found this:

  CP_ACP     = 0;     // default to ANSI code page
  CP_UTF16   = 1200;  // utf-16
  CP_UTF16BE = 1201;  // unicodeFFFE
  CP_UTF7    = 65000; // utf-7
  CP_UTF8    = 65001; // utf-8
  CP_ASCII   = 20127; // us-ascii
  CP_NONE    = $FFFF; // rawbytestring encoding

So seemingly you could do MyStringType = type AnsiString(CP_UTF16), and seemingly the size information is set according to this.

> There is no UTF-32 string (at least not in the sense of a compiler provided type).

I see (It's a shame).

Thanks a lot for your patience,
-Michael
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 26/11/14 13:11, Michael Schnell wrote:
> In section String concatenations there is no mention of auto-conversion.

There is.

> For statically typed Strings it's rather obvious that they will be auto-converted if appropriate.

It's probably rather obvious because it is literally mentioned in that section.

> Technically - if differently encoded - they seem to be converted to Unicode and the result is converted to match the target.

Technically, that section literally states that they will be concatenated without data loss and that the result is then converted to the target string's encoding (except in case the target is RawByteString). How that is implemented exactly is undefined; again in the meaning of undefined, not in the meaning of undefined when defined as meaning X.

> Regarding RawByteStrings there has been the definition a RawByteString has exactly the same behavior as assigning that AnsiString(X) to another AnsiString(X) variable with the same value of X: no code page conversion or copying occurs. Seemingly this is not true for the intermediate results of concatenations.

That paragraph only specifies that code page-aware strings are concatenated without data loss, and then defines to which code page the result will be converted before assigning it to the target. Even if the intermediary result of a concatenation would be a RawByteString (which is not stated nor necessarily ever the case), then the above would apply and hence the (dynamic) code page of that RawByteString would be the one as defined by the above-mentioned rules before it would be assigned to the target.

Jonas
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 26.11.2014 15:30, Mattias Gaertner nc-gaert...@netcologne.de wrote:
> On Wed, 26 Nov 2014 15:05:16 +0100 Sven Barth pascaldra...@googlemail.com wrote:
>> [...] While both AnsiString and UnicodeString have the current codepage and the character size in their header record
>
> AFAIK UnicodeString has only a static (fixed) code page.

Yes, nevertheless the header record is the same for UnicodeString and AnsiString, and thus it also has a codepage field, which however is always initialized to CP_UTF16.

Regards, Sven
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 26/11/14 17:21, Sven Barth wrote:
> Yes, nevertheless the header record is the same for UnicodeString and AnsiString and thus it also has a codepage field which is always initialized to CP_UTF16 however.

It can also be CP_UTF16BE (which it is on big endian FPC targets right now).

Jonas
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 26/11/14 16:19, Michael Schnell wrote:
> So seemingly you could do MyStringType = type AnsiString(CP_UTF16), and seemingly the size information is set according to this.

As several people have told you several times, that is invalid (in the meaning of undefined) in both FPC and Delphi. I've mentioned this on the FPC_Unicode_support wiki page now. CP_UTF16 and CP_UTF16BE can be returned by StringCodePage() when called on a unicodestring, and that's it.

Jonas
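To see what the header fields discussed here contain at run time, a minimal sketch along these lines may help. It assumes FPC 3.0 or later and relies on the System unit's StringCodePage and StringElementSize, which are assumed here to have overloads for both AnsiString and UnicodeString arguments. Per the messages above, the UnicodeString code page should come out as CP_UTF16 (1200) on little-endian targets and CP_UTF16BE (1201) on big-endian ones; the AnsiString value depends on the platform.

```pascal
program HeaderFields;
{$mode objfpc}{$H+}

var
  a: AnsiString;      // element size fixed at 1
  u: UnicodeString;   // element size fixed at 2
begin
  a := 'abc';
  u := 'abc';

  // Both string kinds carry a code page and an element size in their header,
  // but only the AnsiString code page can vary between strings.
  WriteLn('AnsiString   : code page ', StringCodePage(a),
          ', element size ', StringElementSize(a));
  WriteLn('UnicodeString: code page ', StringCodePage(u),
          ', element size ', StringElementSize(u));
end.
```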
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On Wed, 26 Nov 2014 17:23:48 +0100 Jonas Maebe jonas.ma...@elis.ugent.be wrote:
> On 26/11/14 17:21, Sven Barth wrote:
>> Yes, nevertheless the header record is the same for UnicodeString and AnsiString and thus it also has a codepage field which is always initialized to CP_UTF16 however.
>
> It can also be CP_UTF16BE (which it is on big endian FPC targets right now).

I see. Can you create a CP_UTF16BE on little Endian systems?

type u = UnicodeString(CP_UTF16BE);

gives an error.

Mattias
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On Wed, 26 Nov 2014 17:50:31 +0100 Mattias Gaertner nc-gaert...@netcologne.de wrote:
> On Wed, 26 Nov 2014 17:23:48 +0100 Jonas Maebe jonas.ma...@elis.ugent.be wrote:
>> On 26/11/14 17:21, Sven Barth wrote:
>>> Yes, nevertheless the header record is the same for UnicodeString and AnsiString and thus it also has a codepage field which is always initialized to CP_UTF16 however.
>>
>> It can also be CP_UTF16BE (which it is on big endian FPC targets right now).
>
> I see. Can you create a CP_UTF16BE on little Endian systems? type u = UnicodeString(CP_UTF16BE); gives an error.

Jonas has answered this. Thanks.

Mattias
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Mattias Gaertner wrote:
> On Wed, 26 Nov 2014 11:23:17 +0100 Michael Schnell mschn...@lumino.de wrote:
>> Seemingly here the bytes per character setting implicitly is thought of as a part of the code-page definition. correct ?
>
> Code pages define bytes per character.

Huh? Not all codepages have a fixed number of bytes per character. The string preamble contains the *element size* (1 for AnsiString), just like with every dynamic array.

> As you know: Don't confuse character with glyph and codepoint.

Right, but what is what? I feel a need for an exact (official) definition of such (and more) terms, in order to prevent further misunderstandings of the documentation and in discussions. E.g. code page has different meanings when used with ANSI/ISO and Unicode character sets. While ANSI/ISO codepages describe different mappings of bytes into characters, Unicode codepages define subsets of the whole Unicode range.

My understanding of character is a *logical* unit (letter), with possibly different encodings, values and sizes in different codepages (character sets). What's the term for the *physical* unit (AnsiChar, WideChar)?

> Ansistring supports only one byte per character code pages.

Huh? What's your definition of character? AnsiString supports MBCS codepages as well. The restriction is the physical storage unit (1 byte per string item), as imposed by AnsiChar.

DoDi
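The distinction between the physical storage unit and the logical character can be made concrete with a short sketch. It assumes FPC 3.0 or later; the helper Utf8CodePoints is hypothetical and written only for this illustration. The two UTF-8 bytes of U+00E4 are built by hand, so no source code page conversion is involved, and the program should report an element size of 1, a Length of 2 and a single code point.

```pascal
program ElementsVsCharacters;
{$mode objfpc}{$H+}

// Hypothetical helper: counts UTF-8 code points by skipping
// continuation bytes, which all have the bit pattern 10xxxxxx.
function Utf8CodePoints(const s: RawByteString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  r: RawByteString;
begin
  // U+00E4 is encoded in UTF-8 as the two bytes C3 A4.
  SetLength(r, 2);
  r[1] := AnsiChar($C3);
  r[2] := AnsiChar($A4);
  SetCodePage(r, CP_UTF8, False);  // only label the data, do not convert it

  WriteLn('element size: ', StringElementSize(r));  // bytes per AnsiChar
  WriteLn('Length      : ', Length(r));             // counts AnsiChars (bytes)
  WriteLn('code points : ', Utf8CodePoints(r));     // counts logical units
end.
```

Length counts storage elements, which is why an AnsiString can hold multi-byte encodings even though each element is a single byte.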
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Michael Schnell wrote:
> On 11/26/2014 11:40 AM, Mattias Gaertner wrote:
>> Ansistring supports only one byte per character code pages.
>
> Even more confused. Am I wrong thinking that with code aware Strings, for Delphi XE compatibility, in Windows CP_ACP needs to be UTF16 (if not right, then due later) ?

Delphi XE does not properly support UTF-8. CP_ACP seems to depend on western/far-eastern versions, where the western version assumes and allows for any SBCS; I don't know of the same in far-east versions. The SBCS restriction allows simplifying standard string handling and conversions, because every character (=byte) can be exchanged in place. UTF-8 doesn't fit into this picture, because it's an MBCS.

UTF-16 is not a valid value for CP_ACP in Delphi, because it's a 2-byte encoding. Even if the Delphi architects may have thought about a common string type with a variable element size (1,2,4), this certainly soon turned out to be a stupid idea, so that AnsiString and WideString/UnicodeString still are strictly distinct types.

WideString and UnicodeString imply UTF-16, with platform-specific byte order (endianness). The latter becomes important almost only to compiler and library coders, in host/network byte order conversions. For the sake of completeness, PDP-11 processors use yet another byte order, maybe more word-based processors (DG...) as well.

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Michael Schnell wrote:
> On 11/26/2014 12:09 PM, Sven Barth wrote:
>> In Delphi (and FPC) CP_ACP corresponds by default with the current system codepage (e.g. CP1252 on a German Windows).
>
> OK. So in Delphi XE (in Germany) String(CP_ACP) is the same as String(CP1252) but different from String without brackets which in turn is the same as String(CP_UTF16) ? Correct ?

CP_ACP (and CP_NONE) describes a *static* encoding, and has a fixed value (CP_ACP=0, CP_NONE=$FFFF). The dynamic encoding of strings, kept in AnsiString(0) or RawByteString variables, must be obtained from the string itself. When the string is empty, StringCodepage returns DefaultSystemCodePage (for CP_ACP).

>> CP_UTF16 is not supported, because AnsiString only supports 1-Byte character strings (and UTF-8 as the odd one) and not 2-Byte character strings.
>
> I still don't understand. The wiki article seems to suggest that it is about a type called ANSIString that features a dynamically settable code page information. From discussions about Delphi and FPC, I only know a String type with a dynamically settable code page information that also features a dynamically settable Bytes per Character information and hence does support 1, 2 and 4 Bytes per Character. (e.g. UTF-8, UTF-16, and UTF-32).

You should have noticed that there exists no String or Char type that would allow for arbitrary bytes/char counts (see my other answer for details).

>> The difference to Delphi currently is that for FPC String=AnsiString(CP_ACP) and for Delphi String=UnicodeString (aka 2-Byte string).
>
> I understand that you mean (e.g.) Delphi XE. But what version of FPC is currently. Am I wrong assuming that in the svn we do have the NewStrings library that supports dynamical code-page *and* byte-per-character settings and hence supports e.g. CP1251, UTF-8, UTF-16, and UTF-32 ?

The byte-per-character field is read-only, just like for any dynamic array.

> So I seem to understand the meaning of String(CP1252), String(CP_UTF8), and String(CP_UTF16) (which seems to be the Delphi notation), but I seemingly don't get the exact meaning of AnsiString(CP_ACP) or AnsiString(CP1251)

The Delphi notation is the same, e.g. AnsiString(CP_ACP).

> In the end, what the definition of String without brackets is, might be due to a settable compiler option and/or the OS the compiler is set to create code for.

Right, the *generic* String type can be mapped to either ShortString, AnsiString(0) or UnicodeString, depending on compiler versions and switches. A rough guess can be derived from sizeof(Char).

DoDi
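The rough guess from sizeof(Char) mentioned above could look like the following sketch. It is only an illustration under assumptions: it presumes that Char is two bytes exactly when String maps to UnicodeString, and it tells AnsiString from ShortString by comparing the variable's size with the size of a pointer. With the directives shown it should report AnsiString(CP_ACP); swapping in {$mode delphiunicode} or {$H-} should change the answer.

```pascal
program StringKind;
{$mode objfpc}{$H+}  // assumption: replace with {$mode delphiunicode} or {$H-} to compare

var
  s: String;
begin
  if SizeOf(Char) = 2 then
    // Char is WideChar, so the generic String is a 2-byte string type
    WriteLn('String is UnicodeString')
  else if SizeOf(s) = SizeOf(Pointer) then
    // a reference-counted string variable is just a pointer
    WriteLn('String is AnsiString(CP_ACP)')
  else
    // ShortString stores its characters inline (length byte plus data)
    WriteLn('String is ShortString');
end.
```

Because the mapping is decided per unit, such a check only describes the unit it is compiled in.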
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Michael Schnell wrote:
> I fail to understand some of the text. It seems to be unavoidable to use the name ANSIString even though I always though up when seeing a thing called ANSI containing Unicode (e. g. UTF8String = type AnsiString(CP_UTF8) ). Seemingly here the bytes per character setting implicitly is thought of as a part of the code-page definition. correct ?

An AnsiString consists of AnsiChar's. The *meaning* of these char's (bytes) depends on their encoding, regardless of whether the used encoding is or is not stored with the string. It's essential to distinguish between low-level (physical) AnsiChar values, and *logical* characters possibly consisting of multiple AnsiChars.

> In section Dynamic code page: When assigning a string to a plain AnsiString (= AnsiString(CP_ACP)) or ShortString, the string data will however be converted to DefaultSystemCodePage. The dynamic code page of that AnsiString(CP_ACP) will then be the current value of DefaultSystemCodePage (e.g. 1250 for the Windows-1250 code page), even though its static code page is CP_ACP (which is a constant different from 1250). This is one example of how the static code page can differ from the dynamic code page. Subsequent sections will describe more such scenarios.
>
> 1) A short String does not have a Code page notification so for this static code page can differ from the dynamic code page does not seem to make much sense.

The text correctly states dynamic code page of that AnsiString. ShortString (and AnsiChar) has no encoding indicator, they are assumed to be encoded in CP_ACP.

> 2) I fail to understand how with this explanation that seems to force auto conversion for assignments between types with different code page settings (also for CP_ACP) the static code page can differ from the dynamic code page can happen.

Continue reading until you have understood the special handling of string literals and RawByteString.

> In fact this disaster seems to be able to happen (see section RawByteString) if assigning a string with a static code page X1 to a RawByteString (hence no conversion) and then assigning that RawByteString to a string with a static code page X2 (no conversion again). In fact I assume that without abusing RawByteString such intersexual strings can't be produced, otherwise this would be rather disastrous for normal users.

*All* intermediate strings, generated during the evaluation of string expressions, only have a dynamic encoding, and thus can be considered as being RawByteStrings. That's why I wonder *when* exactly the result of such an expression *is* converted (implicitly) into the static encoding of the target variable, and when *not*. Obviously the compiler inserts a conversion request for the *direct* assignment of one string variable to another one of a different *static* encoding. But what happens when a string expression doesn't have such a known static encoding???

> In section RawByteString: the results of conversions from/to the CP_NONE code page are undefined. In effect the behavior is exactly defined in this section As a first approximation.

Right, the result *is* well defined, but has no *predetermined* dynamic encoding.

The entire mess results from the bad interpretation of RawByteString assignments, which IMO was well thought out by the Delphi language architects, but not understood by the Delphi compiler coders. This interpretation also found its way into FPC:

  Less intuitive is probably that when a RawByteString is assigned to an AnsiString(X), the same happens: no code page conversion[...]

It's clear that a conversion *can* be omitted for every assignment *to* a RawByteString. That's one of the purposes of that type - to avoid excess conversions into CP_ACP or UnicodeString. But it's unclear why the heck the conversion should be omitted in an assignment to any *other* AnsiString type, as soon as the source string is a RawByteString???

Therefore I'd suggest a compiler switch, implementing the lame Delphi compatible behaviour only on *demand*, while the FPC default would force any required conversions with *every* assignment to any other (non-CP_NONE) AnsiString type. This simple change will safely prevent strings of different static and dynamic encoding, so that the corresponding tests can be removed safely from library *and* user code.

The proper use of RawByteStrings deserves further documentation, for users who want/need their own (generic) string-handling routines. Topics should be:
- how to determine the dynamic encoding of strings (StringCodePage)
- how to force required conversions (SetCodePage)
- how to deal with strings of different encodings
- how to minimize the number of string conversions

DoDi
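The first two topics in that list can be sketched with the routines named there. The following is a minimal illustration, assuming FPC 3.0 or later; the procedure Describe and its parameter names are made up for this example, and cwstring is pulled in on Unix on the assumption that real conversions need it. A const RawByteString parameter accepts strings of any static code page without converting them, StringCodePage reads the dynamic code page, and SetCodePage with Convert set to True forces a conversion.

```pascal
program RawByteParams;
{$mode objfpc}{$H+}
{$ifdef unix}uses cwstring;{$endif}  // assumption: needed on Unix for real conversions

// Hypothetical helper: reports the dynamic code page of whatever it is given.
// Passing a string to a RawByteString parameter does not convert it.
procedure Describe(const tag: String; const s: RawByteString);
begin
  WriteLn(tag, ': code page ', StringCodePage(s), ', ', Length(s), ' bytes');
end;

var
  u: UTF8String;
  r: RawByteString;
begin
  u := 'abc';
  Describe('u', u);

  r := u;                      // no conversion: r keeps the data and code page of u
  Describe('r', r);

  SetCodePage(r, 1252, True);  // explicit request: convert the data to CP1252
  Describe('r after SetCodePage', r);
end.
```

Keeping data in a RawByteString until an encoding is actually required is one way to limit the number of conversions, which is the last topic in the list.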
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Mattias Gaertner wrote:
> For example: CP_ACP=0, DefaultSystemCodePage=1252 That means static code page is always 0, while dynamic code page can be 0 or 1252. Both describe the same encoding.

A *dynamic* encoding *never* can be CP_ACP nor CP_NONE (in Delphi). These values are allowed only for *static* types in type declarations. CP_UTF16 is also not allowed. Delphi StringCodePage reports the current default codepage (DefaultSystemCodePage) for empty AnsiStrings, CP_UTF16 for all UnicodeStrings.

>> In section RawByteString: the results of conversions from/to the CP_NONE code page are undefined.
>
> ... because CP_NONE is not a real code page.

The same for CP_ACP.

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Michael Schnell wrote:
> So seemingly you could do MyStringType = type AnsiString(CP_UTF16), and seemingly the size information is set according to this.

Not in Delphi XE.

DoDi
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
Jonas Maebe wrote:
> Technically, that section literally states that they will be concatenated without data loss and that the result is then converted to the target string's encoding (except in case the target is RawByteString). How that is implemented exactly is undefined; again in the meaning of undefined, not in the meaning of undefined when defined as meaning X.

In this case the implementation is compiler specific, somewhat different from undefined (in a RawByteString):

  CP_NONE: this value indicates that no code page information has been associated with the string data. The result of any explicit or implicit operation that converts this data to another code page is undefined.

IMO the result is well defined: it's the string with the encoding of that other codepage. An undefined result, as I understand it, would mean the result can be anything, unrelated to the function input. The branch taken in execution of an IF statement also is not undefined, only because it depends on the actual condition value. The value of a local variable initially is undefined, i.e. can be any value. But after an assignment it *is* defined, even if that value still may be *unpredictable* by static code analysis. IMO a better wording should be found, that does not cause the current obvious confusion of some readers.

>> Regarding RawByteStrings there has been the definition a RawByteString has exactly the same behavior as assigning that AnsiString(X) to another AnsiString(X) variable with the same value of X: no code page conversion or copying occurs. Seemingly this is not true for the intermediate results of concatenations.
>
> That paragraph only specifies that code page-aware strings are concatenated without data loss, and then defines to which code page the result will be converted before assigning it to the target.

What's the meaning of no copying occurs? Of course the reference to the string is copied into the target variable! What's the same value of X, in case of AnsiString(CP_ACP) and AnsiString(DefaultSystemCodePage)?

> Even if the intermediary result of a concatenation would be a RawByteString (which is not stated nor necessarily ever the case), then the above would apply and hence the (dynamic) code page of that RawByteString would be the one as defined by the above-mentioned rules before it would be assigned to the target.

Please note that the other statements refer to *static* encodings, therefore my question about the (assumed) static encoding of an intermediate result. When the compiler inserts a conversion request based on *static* encodings, will it or will it not insert such a request before an intermediate result is assigned to the target variable?

Suggestion: During string operations the source strings are converted [to CP_ACP?] when they have a different [dynamic?] encoding. When the result is stored in a variable, it is converted as required by the static encoding of the target. Here, as required means that a static target encoding of CP_ACP is replaced by the DefaultSystemCodePage, while CP_NONE does not require a conversion.

The CP_ACP case should be clarified as well, because it's unclear whether CP_ACP(=0) is *considered* equal to the current DefaultSystemCodePage, even if both values are *always* different (see above). The use of CP_ACP instead of DefaultSystemCodePage can be confusing and should be avoided or clarified before.

Perhaps it would help to concentrate on the following steps:
1) (string) operand fetch
2) (string) operations
3) (string) assignment

1) Fetching an operand removes any information about the static encoding of the source, only its dynamic encoding persists. [Now the handling of non-AnsiString sources can be explained, like for literals, ShortString etc. RawByteString is not special here, it's only a static encoding.]

2) String operations take into account the dynamic encoding of their operands, with lossless conversions inserted as required.

3) When a string is assigned to a variable, it is converted, if necessary, as required by the static encoding of the target, with possible data loss. [About required see above. Special case: when the source is a variable, no conversion occurs when the *static* source and target types are compatible. What exactly is compatible with CP_ACP?]

DoDi
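The three steps (operand fetch, operation, assignment) can be observed with a small experiment. This is only a sketch under assumptions: it needs FPC 3.0 or later, the type name TCP1252String is made up, and cwstring is included on Unix on the assumption that conversions need it. It concatenates two operands with different static code pages and prints the dynamic code page that ends up in two differently typed targets. No particular result for the RawByteString target is claimed here, since that is exactly the case the thread calls unclear.

```pascal
program ConcatSteps;
{$mode objfpc}{$H+}
{$ifdef unix}uses cwstring;{$endif}  // assumption: needed on Unix for real conversions

type
  TCP1252String = type AnsiString(1252);  // hypothetical name

var
  a:      TCP1252String;
  u:      UTF8String;
  target: UTF8String;     // static code page CP_UTF8
  raw:    RawByteString;  // static code page CP_NONE
begin
  a := 'abc';   // ASCII only, so no conversion can lose data
  u := 'def';

  // Steps 1 and 2: operands are fetched and concatenated.
  // Step 3: the result is stored according to the target's static code page.
  target := a + u;
  raw    := a + u;

  WriteLn('operand a : ', StringCodePage(a));
  WriteLn('operand u : ', StringCodePage(u));
  WriteLn('target    : ', StringCodePage(target));
  WriteLn('raw       : ', StringCodePage(raw));
end.
```

Comparing the last two lines across compiler versions and platforms is one way to check how step 3 treats a target whose static encoding is CP_NONE.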
Re: [fpc-devel] Trying to understand the wiki-Page FPC Unicode support
On 26.11.2014 19:54, Hans-Peter Diettrich wrote:
> UTF-16 is not a valid value for CP_ACP in Delphi, because it's a 2-byte encoding. Even if the Delphi architects may have thought about an common string type, with a variable element size (1,2,4), this certainly turned out soon as a stupid idea, so that AnsiString and WideString/UnicodeString still are strictly distinct types. WideString and UnicodeString imply UTF-16, with platform specific byte order (endianness). The latter becomes important almost only to compiler and library coders, in host/network byteorder conversions. For the sake of completeness, pdp-11 processors use yet another byte order, maybe more word-based processors (DG...) as well.

Just a little remark: please don't throw in WideString, which is a completely different type and only there for easy compatibility with COM and other Windows APIs. Unlike UnicodeString, this type is not reference counted, for example, nor does it have the code page and element size information that an Ansi-/UnicodeString has. (In FPC WideString is the same as UnicodeString for all non-Windows platforms.)

Regards, Sven