Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
> Well, this is still working at the wrong level. The code
> that's in pg_verifymbstr is mainly intended to enforce the
> *system wide* assumption that multibyte characters must have
> the high bit set in every byte. (We do not support encodings
> without this property in the backend, because it breaks code
> that looks for ASCII characters ... such as the main
> parser/lexer ...) It's not really intended to check that the
> multibyte character is actually legal in its encoding.

Ok, point taken.

> The "special UTF-8 check" was never more than a very
> quick-n-dirty hack that was in the wrong place to start with.
> We ought to be getting rid of it not institutionalizing it.
> If you want an exact encoding-specific check on the
> legitimacy of a multibyte sequence, I think the right way to
> do it is to add another function pointer to pg_wchar_table
> entries to let each encoding have its own check routine.
> Perhaps this could be defined so as to avoid a separate call
> to pg_mblen inside the loop, and thereby not add any new
> overhead. I'm thinking about an API something like
>
>     int validate_mbchar(const unsigned char *str, int len)
>
> with result +N if a valid character N bytes long is present
> at *str, and -N if an invalid character is present at *str
> and it would be appropriate to display N bytes in the complaint.
> (N must be <= len in either case.) This would reduce the
> main loop of pg_verifymbstr to a call of this function and an
> error-case-handling block.

Sounds like a plan...

> regards, tom lane

Regards,

John Hansen

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Tatsuo Ishii wrote:
>> Tom Lane wrote:
>>> If I understood what I was reading, this would take several things:
>>> * Remove the "special UTF-8 check" in pg_verifymbstr;
>>> * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
>>> * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
>>> Are there any other places that would have to change? Would this break
>>> anything? The testing aspect is what's bothering me at the moment.
>>
>> Does this change what client_encoding = UNICODE might produce? The JDBC
>> driver will need some tweaking to handle this -- Java uses UTF-16
>> internally and I think some supplementary character (?) scheme for
>> values above 0xFFFF as of JDK 1.5.
>
> Java doesn't handle UCS above 0xFFFF? I didn't know that. As long as you
> put in/out JDBC, it shouldn't be a problem. However if other APIs put in
> such data, you will get into trouble...

Internally, Java strings are arrays of UTF-16 values. Before JDK 1.5, all the string-manipulation library routines assumed that one code point == one UTF-16 value, so you can't represent values above 0xFFFF. The 1.5 libraries understand using supplementary characters to use multiple UTF-16 values per code point. See http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

However, the JDBC driver needs to be taught how to translate between UTF-8 representations of code points above 0xFFFF and pairs of UTF-16 values. Previously it didn't need to do anything since the server didn't use those high values. It's a minor thing..

-O

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
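[Editor's note: the UTF-16 translation described above is simple arithmetic: a code point above 0xFFFF is split into a high and a low surrogate. A minimal sketch in C; the function name is ours for illustration, and the real work would of course live in the JDBC driver, in Java.]

```c
#include <assert.h>

/* Split a Unicode code point above 0xFFFF into a UTF-16 surrogate
 * pair -- the transformation a driver must apply when decoding
 * 4-byte UTF-8 from the server.  Illustration only. */
void to_surrogate_pair(unsigned int cp, unsigned short *hi, unsigned short *lo)
{
    cp -= 0x10000;                 /* leaves a 20-bit value */
    *hi = 0xD800 | (cp >> 10);     /* high surrogate: top 10 bits */
    *lo = 0xDC00 | (cp & 0x3FF);   /* low surrogate: bottom 10 bits */
}
```

For example, U+10000 becomes the pair D800/DC00, and U+10FFFF becomes DBFF/DFFF.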
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
"John Hansen" <[EMAIL PROTECTED]> writes:
> Ahh, but that's not the case. You cannot just delete the check, since
> not all combinations of bytes are valid UTF8. UTF bytes FE & FF never
> appear in a byte sequence for instance.

Well, this is still working at the wrong level. The code that's in pg_verifymbstr is mainly intended to enforce the *system wide* assumption that multibyte characters must have the high bit set in every byte. (We do not support encodings without this property in the backend, because it breaks code that looks for ASCII characters ... such as the main parser/lexer ...) It's not really intended to check that the multibyte character is actually legal in its encoding.

The "special UTF-8 check" was never more than a very quick-n-dirty hack that was in the wrong place to start with. We ought to be getting rid of it, not institutionalizing it. If you want an exact encoding-specific check on the legitimacy of a multibyte sequence, I think the right way to do it is to add another function pointer to pg_wchar_table entries to let each encoding have its own check routine. Perhaps this could be defined so as to avoid a separate call to pg_mblen inside the loop, and thereby not add any new overhead. I'm thinking about an API something like

    int validate_mbchar(const unsigned char *str, int len)

with result +N if a valid character N bytes long is present at *str, and -N if an invalid character is present at *str and it would be appropriate to display N bytes in the complaint. (N must be <= len in either case.) This would reduce the main loop of pg_verifymbstr to a call of this function and an error-case-handling block.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
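[Editor's note: for UTF-8, the proposed contract could look roughly like the following sketch. The name is hypothetical and this is not the code that eventually went into wchar.c; a full validator would also reject overlong encodings and surrogate values, which are omitted here.]

```c
#include <assert.h>

/* Sketch of the proposed per-encoding check routine for UTF-8.
 * Returns +N if a valid N-byte character starts at *str, or -N if an
 * invalid sequence starts there and N bytes should be shown in the
 * complaint.  N is always <= len. */
int utf8_validate_mbchar(const unsigned char *str, int len)
{
    int mblen, i;
    unsigned char c = str[0];

    if (c < 0x80)
        return 1;                    /* plain ASCII */
    else if ((c & 0xE0) == 0xC0)
        mblen = 2;
    else if ((c & 0xF0) == 0xE0)
        mblen = 3;
    else if ((c & 0xF8) == 0xF0)
        mblen = 4;                   /* covers up to U+10FFFF */
    else
        return -1;                   /* bare continuation byte, or F8-FF */

    if (mblen > len)
        return -len;                 /* sequence truncated at end of string */

    for (i = 1; i < mblen; i++)
        if ((str[i] & 0xC0) != 0x80) /* continuation bytes must be 10xxxxxx */
            return -i;

    return mblen;
}
```

This would reduce the pg_verifymbstr loop to one call per character plus an error block, exactly as proposed.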
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
> Tom Lane wrote:
> > If I understood what I was reading, this would take several things:
> > * Remove the "special UTF-8 check" in pg_verifymbstr;
> > * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
> > * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
> >
> > Are there any other places that would have to change? Would this break
> > anything? The testing aspect is what's bothering me at the moment.
>
> Does this change what client_encoding = UNICODE might produce? The JDBC
> driver will need some tweaking to handle this -- Java uses UTF-16
> internally and I think some supplementary character (?) scheme for
> values above 0xFFFF as of JDK 1.5.

Java doesn't handle UCS above 0xFFFF? I didn't know that. As long as you put in/out JDBC, it shouldn't be a problem. However if other APIs put in such data, you will get into trouble...
--
Tatsuo Ishii

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Oliver Jowett <[EMAIL PROTECTED]> writes:
> Does this change what client_encoding = UNICODE might produce? The JDBC
> driver will need some tweaking to handle this -- Java uses UTF-16
> internally and I think some supplementary character (?) scheme for
> values above 0xFFFF as of JDK 1.5.

You're not likely to get out anything you didn't put in, so I'm not sure it matters.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Tom Lane wrote:
> If I understood what I was reading, this would take several things:
> * Remove the "special UTF-8 check" in pg_verifymbstr;
> * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
> * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.
>
> Are there any other places that would have to change? Would this break
> anything? The testing aspect is what's bothering me at the moment.

Does this change what client_encoding = UNICODE might produce? The JDBC driver will need some tweaking to handle this -- Java uses UTF-16 internally and I think some supplementary character (?) scheme for values above 0xFFFF as of JDK 1.5.

-O

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
> -----Original Message-----
> From: Oliver Elphick [mailto:[EMAIL PROTECTED]
> Sent: Sunday, August 08, 2004 7:43 AM
> To: Tom Lane
> Cc: John Hansen; Hackers; Patches
> Subject: Re: [HACKERS] UNICODE characters above 0x10000
>
> On Sat, 2004-08-07 at 07:10, Tom Lane wrote:
> > Oliver Elphick <[EMAIL PROTECTED]> writes:
> > > glibc provides various routines (mb...) for handling Unicode. How
> > > many of our supported platforms don't have these?
> >
> > Every one that doesn't use glibc. Don't bother proposing a glibc-only
> > solution (and that's from someone who works for a glibc-only company;
> > you don't even want to think about the push-back you'll get from other
> > quarters).
>
> No. that's not what I was proposing. My suggestion was to use these
> routines if they are sufficiently widely implemented, and our own
> routines where standard ones are not available.
>
> The man page for mblen says
> "CONFORMING TO
>    ISO/ANSI C, UNIX98"
>
> Is glibc really the only C library to conform?
>
> If using the mb... routines isn't feasible, IBM's ICU library
> (http://oss.software.ibm.com/icu/) is available under the X licence,
> which is compatible with BSD as far as I can see. Besides character
> conversion, ICU can also do collation in various locales and encodings.
>
> My point is, we shouldn't be writing a new set of routines to do half a
> job if there are already libraries available to do all of it.

This sounds like a brilliant move, if anything.

> --
> Oliver Elphick
> [EMAIL PROTECTED]
> Isle of Wight
> http://www.lfix.co.uk/oliver
> GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA
>
> "Be still before the LORD and wait patiently for him;
> do not fret when men succeed in their ways, when they
> carry out their wicked schemes." Psalms 37:7

Kind Regards,

John Hansen

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
> -----Original Message-----
> From: Tom Lane [mailto:[EMAIL PROTECTED]
> Sent: Sunday, August 08, 2004 2:43 AM
> To: Dennis Bjorklund
> Cc: Tatsuo Ishii; John Hansen; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
>
> Dennis Bjorklund <[EMAIL PROTECTED]> writes:
> > On Sat, 7 Aug 2004, Tatsuo Ishii wrote:
> >> Anyway my point is if current specification of Unicode only allows
> >> 24-bit range, why we need to allow usage against the specification?
>
> > Is there a specific reason you want to restrict it to 24 bits?
>
> I see several places that have to allocate space on the basis of the
> maximum encoded character length possible in the current encoding (look
> for uses of pg_database_encoding_max_length). Probably the only one
> that's really significant for performance is text_substr(), but that's
> enough to be an argument against setting maxmblen higher than we have to.
>
> It looks to me like supporting 4-byte UTF-8 characters would be enough
> to handle the existing range of Unicode codepoints, and that is
> probably as much as we want to do.
>
> If I understood what I was reading, this would take several things:
> * Remove the "special UTF-8 check" in pg_verifymbstr;

I strongly disagree; this would mean one could store any sequence of bytes in the db, as long as the bytes are above 0x80. That would not be valid utf8, and would leave the data in an inconsistent state. Setting the client encoding to unicode implies that this is what we're going to feed the database, and should guarantee that what comes out of a select is valid utf8. We can make sure of that by doing the check before it's inserted.

> * Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;

pg_utf_mblen should handle any case according to the specification. Currently, it will return 3 even for 4, 5, and 6 byte sequences. Those places where pg_utf_mblen is called, we should check to make sure that the length is between 1 and 4 inclusive, and that the sequence is valid. This is what I made the patch for.

> * Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

That I have no problem with.

> Are there any other places that would have to change? Would this break
> anything? The testing aspect is what's bothering me at the moment.
>
> regards, tom lane

Just my $0.02 worth,

Kind Regards,

John Hansen

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
On Sat, 2004-08-07 at 07:10, Tom Lane wrote: > Oliver Elphick <[EMAIL PROTECTED]> writes: > > glibc provides various routines (mb...) for handling Unicode. How many > > of our supported platforms don't have these? > > Every one that doesn't use glibc. Don't bother proposing a glibc-only > solution (and that's from someone who works for a glibc-only company; > you don't even want to think about the push-back you'll get from other > quarters). No. that's not what I was proposing. My suggestion was to use these routines if they are sufficiently widely implemented, and our own routines where standard ones are not available. The man page for mblen says "CONFORMING TO ISO/ANSI C, UNIX98" Is glibc really the only C library to conform? If using the mb... routines isn't feasible, IBM's ICU library (http://oss.software.ibm.com/icu/) is available under the X licence, which is compatible with BSD as far as I can see. Besides character conversion, ICU can also do collation in various locales and encodings. My point is, we shouldn't be writing a new set of routines to do half a job if there are already libraries available to do all of it. -- Oliver Elphick [EMAIL PROTECTED] Isle of Wight http://www.lfix.co.uk/oliver GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA "Be still before the LORD and wait patiently for him; do not fret when men succeed in their ways, when they carry out their wicked schemes." Psalms 37:7 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Dennis Bjorklund <[EMAIL PROTECTED]> writes:
> On Sat, 7 Aug 2004, Tatsuo Ishii wrote:
>> Anyway my point is if current specification of Unicode only allows
>> 24-bit range, why we need to allow usage against the specification?

> Is there a specific reason you want to restrict it to 24 bits?

I see several places that have to allocate space on the basis of the maximum encoded character length possible in the current encoding (look for uses of pg_database_encoding_max_length). Probably the only one that's really significant for performance is text_substr(), but that's enough to be an argument against setting maxmblen higher than we have to.

It looks to me like supporting 4-byte UTF-8 characters would be enough to handle the existing range of Unicode codepoints, and that is probably as much as we want to do.

If I understood what I was reading, this would take several things:
* Remove the "special UTF-8 check" in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

Are there any other places that would have to change? Would this break anything? The testing aspect is what's bothering me at the moment.

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
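[Editor's note: the 4-byte extension of pg_utf2wchar_with_len mentioned above reduces to one more masking-and-shifting case. A hedged sketch of just that decode step, not the actual patch; it assumes the sequence was already validated.]

```c
#include <assert.h>

/* Decode one already-validated 4-byte UTF-8 sequence (lead byte
 * 11110xxx) into a 32-bit value -- the extra case that
 * pg_utf2wchar_with_len would need.  Simplified illustration. */
unsigned int utf8_decode4(const unsigned char *s)
{
    return ((unsigned int)(s[0] & 0x07) << 18) |
           ((unsigned int)(s[1] & 0x3F) << 12) |
           ((unsigned int)(s[2] & 0x3F) << 6)  |
           ((unsigned int)(s[3] & 0x3F));
}
```

For example, the bytes F0 90 80 80 decode to 0x10000, which fits comfortably in a 32-bit pg_wchar.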
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
> -----Original Message-----
> From: Dennis Bjorklund [mailto:[EMAIL PROTECTED]
> Sent: Saturday, August 07, 2004 11:23 PM
> To: John Hansen
> Cc: Takehiko Abe; [EMAIL PROTECTED]
> Subject: RE: [PATCHES] [HACKERS] UNICODE characters above 0x10000
>
> On Sat, 7 Aug 2004, John Hansen wrote:
>
> > Now, is it really 24 bits tho?
> > Afaict, it's really 21 (0 - 0x10FFFF)
>
> Yes, up to 0x10FFFF should be enough.
>
> The 24 is not really important, this is all about what utf-8 strings
> to accept as input. The strings are stored as utf-8 strings and when
> processed inside pg it uses wchar_t that is 32 bit (on some systems at
> least). By restricting the utf-8 input to unicode we can in the future
> store each character as 3 bytes if we want.

Which brings us back to something like the attached...

> --
> /Dennis Björklund

Regards,

John Hansen

wchar.c.patch
Description: wchar.c.patch

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Yea, I know: 0xF0000 - 0x10FFFF, 2 separate planes iirc ...

John

-----Original Message-----
From: Dennis Bjorklund [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 07, 2004 9:06 PM
To: John Hansen
Cc: Tatsuo Ishii; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: [PATCHES] [HACKERS] UNICODE characters above 0x10000

On Sat, 7 Aug 2004, John Hansen wrote:

> should not allow them to be stored, since there might be someone using
> the high ranges for a private character set, which could very well be
> included in the specification some day.

There are areas reserved for private character sets.

--
/Dennis Björklund

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
On Sat, 7 Aug 2004, Tatsuo Ishii wrote: > More seriously, Unicode is filled with tons of confusion and > inconsistency IMO. Remember that once Unicode adovocates said that the > merit of Unicode was it only requires 16-bit width. Now they say they > need surrogate pairs and 32-bit width chars... > > Anyway my point is if current specification of Unicode only allows > 24-bit range, why we need to allow usage against the specification? Whatever problems they have had in the past, the ISO 10646 defines formally a 31-bit character set. Are you saying that applications should reject strings that contain characters that it does not recognize? Is there a specific reason you want to restrict it to 24 bits? In practice it does not matter much since it's not used today, I just don't know why you want it. -- /Dennis Björklund ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Well, maybe we'd be better off compiling a list of (in?)valid ranges from the full unicode database (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and http://www.unicode.org/Public/UNIDATA/Unihan.txt) and, with every release of pg, updating the detection logic so only valid characters are allowed?

Regards,

John Hansen

-----Original Message-----
From: Tatsuo Ishii [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 07, 2004 8:46 PM
To: John Hansen
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

> Yes, but the specification allows for 6byte sequences, or 32bit
> characters.

UTF-8 is just an encoding specification, not character set specification. Unicode only has 17 256x256 planes in its specification.

> As dennis pointed out, just because they're not used, doesn't mean we
> should not allow them to be stored, since there might be someone using
> the high ranges for a private character set, which could very well be
> included in the specification some day.

We should expand it to 64-bit since some day the specification might be changed then:-) More seriously, Unicode is filled with tons of confusion and inconsistency IMO. Remember that once Unicode advocates said that the merit of Unicode was it only requires 16-bit width. Now they say they need surrogate pairs and 32-bit width chars...

Anyway my point is if current specification of Unicode only allows 24-bit range, why we need to allow usage against the specification?
--
Tatsuo Ishii

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
On Sat, 7 Aug 2004, John Hansen wrote:

> should not allow them to be stored, since there might be someone using
> the high ranges for a private character set, which could very well be
> included in the specification some day.

There are areas reserved for private character sets.

--
/Dennis Björklund

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
4 actually; 0x10000 needs four bytes:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

0x10000 = 1 0000 0000 0000 0000

Fill in the blanks, starting from the bottom, you get:

11110000 10010000 10000000 10000000

Regards,

John Hansen

-----Original Message-----
From: Christopher Kings-Lynne [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 07, 2004 8:47 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

> Now it's entirely possible that the underlying support is a few bricks
> shy of a load --- for instance I see that pg_utf_mblen thinks there
> are no UTF8 codes longer than 3 bytes whereas your code goes to 4.
> I'm not an expert on this stuff, so I don't know what the UTF8 spec
> actually says. But I do think you are fixing the code at the wrong level.

Surely there are UTF-8 codes that are at least 3 bytes. I have a _vague_ recollection that you have to keep escaping and escaping to get up to like 4 bytes for some asian code points?

Chris

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Now it's entirely possible that the underlying support is a few bricks shy of a load --- for instance I see that pg_utf_mblen thinks there are no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not an expert on this stuff, so I don't know what the UTF8 spec actually says. But I do think you are fixing the code at the wrong level. Surely there are UTF-8 codes that are at least 3 bytes. I have a _vague_ recollection that you have to keep escaping and escaping to get up to like 4 bytes for some asian code points? Chris ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
> Yes, but the specification allows for 6byte sequences, or 32bit
> characters.

UTF-8 is just an encoding specification, not character set specification. Unicode only has 17 256x256 planes in its specification.

> As dennis pointed out, just because they're not used, doesn't mean we
> should not allow them to be stored, since there might be someone using
> the high ranges for a private character set, which could very well be
> included in the specification some day.

We should expand it to 64-bit since some day the specification might be changed then:-) More seriously, Unicode is filled with tons of confusion and inconsistency IMO. Remember that once Unicode advocates said that the merit of Unicode was it only requires 16-bit width. Now they say they need surrogate pairs and 32-bit width chars...

Anyway my point is if current specification of Unicode only allows 24-bit range, why we need to allow usage against the specification?
--
Tatsuo Ishii

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Yes, but the specification allows for 6byte sequences, or 32bit characters.

As dennis pointed out, just because they're not used, doesn't mean we should not allow them to be stored, since there might be someone using the high ranges for a private character set, which could very well be included in the specification some day.

Regards,

John Hansen

-----Original Message-----
From: Tatsuo Ishii [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 07, 2004 8:09 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; John Hansen; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

> Dennis Bjorklund <[EMAIL PROTECTED]> writes:
> > ... This also means that the start byte can never start with 7 or 8
> > ones, that is illegal and should be tested for and rejected. So the
> > longest utf-8 sequence is 6 bytes (and the longest character needs 4
> > bytes (or 31 bits)).
>
> Tatsuo would know more about this than me, but it looks from here like
> our coding was originally designed to support only 16-bit-wide
> internal characters (ie, 16-bit pg_wchar datatype width). I believe
> that the regex library limitation here is gone, and that as far as
> that library is concerned we could assume a 32-bit internal character
> width. The question at hand is whether we can support 32-bit
> characters or not --- and if not, what's the next bug to fix?

pg_wchar has been already 32-bit datatype. However I doubt there's actually a need for 32-bit width character sets. Even Unicode only uses up to 0x0010FFFF, so 24-bit should be enough...
--
Tatsuo Ishii

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
> Dennis Bjorklund <[EMAIL PROTECTED]> writes:
> > ... This also means that the start byte can never start with 7 or 8
> > ones, that is illegal and should be tested for and rejected. So the
> > longest utf-8 sequence is 6 bytes (and the longest character needs 4
> > bytes (or 31 bits)).
>
> Tatsuo would know more about this than me, but it looks from here like
> our coding was originally designed to support only 16-bit-wide internal
> characters (ie, 16-bit pg_wchar datatype width). I believe that the
> regex library limitation here is gone, and that as far as that library
> is concerned we could assume a 32-bit internal character width. The
> question at hand is whether we can support 32-bit characters or not ---
> and if not, what's the next bug to fix?

pg_wchar has been already 32-bit datatype. However I doubt there's actually a need for 32-bit width character sets. Even Unicode only uses up to 0x0010FFFF, so 24-bit should be enough...
--
Tatsuo Ishii

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
This should do it.

Regards,

John Hansen

-----Original Message-----
From: Dennis Bjorklund [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 07, 2004 5:02 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

On Sat, 7 Aug 2004, Tom Lane wrote:

> question at hand is whether we can support 32-bit characters or not
> --- and if not, what's the next bug to fix?

True, and that's hard to just give an answer to. One could do some simple testing, make sure regexps work and then treat anything else that might not work, as bugs to be fixed later on when found. The alternative is to inspect all code paths that involve strings, not fun at all :-)

My previous mail talked about utf-8 translation. Not all characters possible to form using utf-8 are assigned by the unicode org. However, the part that interprets the unicode strings are in the os so different os'es can give different results. So I think pg should just accept even 6 byte utf-8 sequences even if some characters are not currently assigned.

--
/Dennis Björklund

wchar.c.patch
Description: wchar.c.patch

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
On Sat, 7 Aug 2004, Tom Lane wrote: > question at hand is whether we can support 32-bit characters or not --- > and if not, what's the next bug to fix? True, and that's hard to just give an answer to. One could do some simple testing, make sure regexps work and then treat anything else that might not work, as bugs to be fixed later on when found. The alternative is to inspect all code paths that involve strings, not fun at all :-) My previous mail talked about utf-8 translation. Not all characters possible to form using utf-8 are assigned by the unicode org. However, the part that interprets the unicode strings are in the os so different os'es can give different results. So I think pg should just accept even 6 byte utf-8 sequences even if some characters are not currently assigned. -- /Dennis Björklund ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Dennis Bjorklund <[EMAIL PROTECTED]> writes: > ... This also means that the start byte can never start with 7 or 8 > ones, that is illegal and should be tested for and rejected. So the > longest utf-8 sequence is 6 bytes (and the longest character needs 4 > bytes (or 31 bits)). Tatsuo would know more about this than me, but it looks from here like our coding was originally designed to support only 16-bit-wide internal characters (ie, 16-bit pg_wchar datatype width). I believe that the regex library limitation here is gone, and that as far as that library is concerned we could assume a 32-bit internal character width. The question at hand is whether we can support 32-bit characters or not --- and if not, what's the next bug to fix? regards, tom lane ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Ahh, but that's not the case. You cannot just delete the check, since not all combinations of bytes are valid UTF8. UTF bytes FE & FF never appear in a byte sequence for instance. UTF8 is more than two bytes btw; up to 6 bytes are used to represent an UTF8 character. The 5 and 6 byte characters are currently not in use tho.

I didn't actually notice the difference in UTF8 width between my original patch and my last, so attached, updated patch.

Regards,

John Hansen

-----Original Message-----
From: Tom Lane [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 07, 2004 3:07 PM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

"John Hansen" <[EMAIL PROTECTED]> writes:
> My apologies for not reading the code properly.
> Attached patch using pg_utf_mblen() instead of an indexed table.
> It now also do bounds checks.

I think you missed my point. If we don't need this limitation, the correct patch is simply to delete the whole check (ie, delete lines 827-836 of wchar.c, and for that matter we'd then not need the encoding local variable). What's really at stake here is whether anything else breaks if we do that. What else, if anything, assumes that UTF characters are not more than 2 bytes?

Now it's entirely possible that the underlying support is a few bricks shy of a load --- for instance I see that pg_utf_mblen thinks there are no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not an expert on this stuff, so I don't know what the UTF8 spec actually says. But I do think you are fixing the code at the wrong level.

regards, tom lane

wchar.c.patch
Description: wchar.c.patch

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
On Sat, 2004-08-07 at 06:06, Tom Lane wrote: > Now it's entirely possible that the underlying support is a few bricks > shy of a load --- for instance I see that pg_utf_mblen thinks there are > no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not > an expert on this stuff, so I don't know what the UTF8 spec actually > says. But I do think you are fixing the code at the wrong level. UTF-8 characters can be up to 6 bytes long: http://www.cl.cam.ac.uk/~mgk25/unicode.html glibc provides various routines (mb...) for handling Unicode. How many of our supported platforms don't have these? If there are still some that don't, wouldn't it be better to use the standard routines where they do exist? -- Oliver Elphick [EMAIL PROTECTED] Isle of Wight http://www.lfix.co.uk/oliver GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA "Be still before the LORD and wait patiently for him; do not fret when men succeed in their ways, when they carry out their wicked schemes." Psalms 37:7 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Oliver Elphick <[EMAIL PROTECTED]> writes:
> glibc provides various routines (mb...) for handling Unicode.  How many
> of our supported platforms don't have these?

Every one that doesn't use glibc.  Don't bother proposing a glibc-only solution (and that's from someone who works for a glibc-only company; you don't even want to think about the push-back you'll get from other quarters).

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
On Sat, 7 Aug 2004, Tom Lane wrote:
> shy of a load --- for instance I see that pg_utf_mblen thinks there are
> no UTF8 codes longer than 3 bytes whereas your code goes to 4.  I'm not
> an expert on this stuff, so I don't know what the UTF8 spec actually
> says.  But I do think you are fixing the code at the wrong level.

I can give some general info about utf-8.  This is how it is encoded:

   character range        encoding
   ---------------------  --------------------------------------------------
   0000 0000 - 0000 007F: 0xxxxxxx
   0000 0080 - 0000 07FF: 110xxxxx 10xxxxxx
   0000 0800 - 0000 FFFF: 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000 - 001F FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   0020 0000 - 03FF FFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   0400 0000 - 7FFF FFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

If the first byte starts with a 1, then the number of ones gives the length of the utf-8 sequence, and the rest of the bytes in the sequence always start with 10 (this makes it possible to look anywhere in the string and quickly find the start of a character).  This also means that the start byte can never start with 7 or 8 ones; that is illegal and should be tested for and rejected.

So the longest utf-8 sequence is 6 bytes, enough to encode a 31-bit value (all currently assigned Unicode characters fit in 4 bytes).

-- 
/Dennis Björklund

---(end of broadcast)---
TIP 6: Have you searched our list archives?
http://archives.postgresql.org
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Possibly, since I got it wrong once more.  About to give up, but attached, updated patch.

Regards,

John Hansen

-----Original Message-----
From: Oliver Elphick [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 07, 2004 3:56 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

On Sat, 2004-08-07 at 06:06, Tom Lane wrote:
> Now it's entirely possible that the underlying support is a few bricks
> shy of a load --- for instance I see that pg_utf_mblen thinks there
> are no UTF8 codes longer than 3 bytes whereas your code goes to 4.
> I'm not an expert on this stuff, so I don't know what the UTF8 spec
> actually says.  But I do think you are fixing the code at the wrong level.

UTF-8 characters can be up to 6 bytes long:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

glibc provides various routines (mb...) for handling Unicode.  How many of our supported platforms don't have these?  If there are still some that don't, wouldn't it be better to use the standard routines where they do exist?

-- 
Oliver Elphick [EMAIL PROTECTED]
Isle of Wight http://www.lfix.co.uk/oliver
GPG: 1024D/A54310EA 92C8 39E7 280E 3631 3F0E 1EC0 5664 7A2F A543 10EA

"Be still before the LORD and wait patiently for him; do not fret when men succeed in their ways, when they carry out their wicked schemes."  Psalms 37:7

wchar.c.patch
Description: wchar.c.patch

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
"John Hansen" <[EMAIL PROTECTED]> writes: > My apologies for not reading the code properly. > Attached patch using pg_utf_mblen() instead of an indexed table. > It now also do bounds checks. I think you missed my point. If we don't need this limitation, the correct patch is simply to delete the whole check (ie, delete lines 827-836 of wchar.c, and for that matter we'd then not need the encoding local variable). What's really at stake here is whether anything else breaks if we do that. What else, if anything, assumes that UTF characters are not more than 2 bytes? Now it's entirely possible that the underlying support is a few bricks shy of a load --- for instance I see that pg_utf_mblen thinks there are no UTF8 codes longer than 3 bytes whereas your code goes to 4. I'm not an expert on this stuff, so I don't know what the UTF8 spec actually says. But I do think you are fixing the code at the wrong level. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
My apologies for not reading the code properly.  Attached patch using pg_utf_mblen() instead of an indexed table.  It now also does bounds checks.

Regards,

John Hansen

-----Original Message-----
From: Tom Lane [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 07, 2004 4:37 AM
To: John Hansen
Cc: Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

"John Hansen" <[EMAIL PROTECTED]> writes:
> Attached, as promised, small patch removing the limitation, adding
> correct utf8 validation.

Surely this is badly broken --- it will happily access data outside the bounds of the given string.  Also, doesn't pg_mblen already know the length rules for UTF8?  Why are you duplicating that knowledge?

regards, tom lane

wchar.c.patch
Description: wchar.c.patch

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
"John Hansen" <[EMAIL PROTECTED]> writes: > Attached, as promised, small patch removing the limitation, adding > correct utf8 validation. Surely this is badly broken --- it will happily access data outside the bounds of the given string. Also, doesn't pg_mblen already know the length rules for UTF8? Why are you duplicating that knowledge? regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000
Attached, as promised, small patch removing the limitation, adding correct utf8 validation.

Regards,

John

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John Hansen
Sent: Friday, August 06, 2004 2:20 PM
To: 'Hackers'
Subject: [HACKERS] UNICODE characters above 0x10000

I've started work on a patch for this problem.  Doing regression tests at present.  I'll get back when done.

Regards,

John

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faqs/FAQ.html

wchar.c.patch
Description: wchar.c.patch

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly