Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
Added to TODO:

    * Change memory allocation for multi-byte functions so memory is
      allocated inside conversion functions

      Currently we preallocate memory based on worst-case usage.

---

Tom Lane wrote:
> Tatsuo Ishii writes:
> > Thinking more, it struck me that users can define an arbitrary growth
> > rate by using CREATE CONVERSION. So it seems we need functionality to
> > define the growth rate anyway.
>
> Seems to me that would be an argument for moving the palloc inside the
> conversion functions, as I suggested before.
>
> In practice though, I find it hard to imagine a pair of encodings for
> which the growth rate is more than 3x. You'd need something that
> translates a single-byte character into 4 or more bytes (pretty
> unlikely, especially considering we require all these encodings to be
> ASCII supersets); or something that translates a 2-byte character into
> more than 6 bytes.
>
> regards, tom lane

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://postgres.enterprisedb.com
Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
This has been saved for the 8.4 release:

    http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---

Tatsuo Ishii wrote:
> The conclusion of the discussion appears to be that we could safely
> reduce MAX_CONVERSION_GROWTH from 4 to 3 for all existing built-in
> conversions.
>
> However, since user-defined conversions could have an arbitrary growth
> rate, it would probably be better to leave it as it is for now.
>
> For 8.4, maybe we could change the conversion functions' signature so
> that we don't need a fixed conversion rate, as Tom suggested.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
The conclusion of the discussion appears to be that we could safely
reduce MAX_CONVERSION_GROWTH from 4 to 3 for all existing built-in
conversions.

However, since user-defined conversions could have an arbitrary growth
rate, it would probably be better to leave it as it is for now.

For 8.4, maybe we could change the conversion functions' signature so
that we don't need a fixed conversion rate, as Tom suggested.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> Where are we on this?
Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
Sorry for the delay.

> On Tue, May 29, 2007 20:51, Tatsuo Ishii wrote:
>
> > Thinking more, it struck me that users can define an arbitrary growth
> > rate by using CREATE CONVERSION. So it seems we need functionality to
> > define the growth rate anyway.
>
> Would it make sense to define just the longest and shortest character
> lengths for an encoding? Then for any conversion you'd have a safe
> estimate of
>
>     ceil(target_encoding.max_char_len / source_encoding.min_char_len)
>
> ...without going through every possible conversion.

This will not work, since certain conversions map n characters to
m characters.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
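A small sketch of why a per-character bound can fall short once a conversion maps one character to several. The function name `fits_bound` is made up for illustration, and the second case in the usage note is a hypothetical user-defined conversion, not one that exists in PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * True if a buffer sized as in_bytes * bound is large enough to hold
 * out_bytes of converted output.
 */
bool
fits_bound(size_t in_bytes, size_t out_bytes, size_t bound)
{
    return out_bytes <= in_bytes * bound;
}
```

With a per-character estimate of 4 (target characters of at most 4 bytes, source characters of at least 1 byte), the SHIFT_JIS_2004 case from this thread (2 bytes in, 6 bytes out) still fits. A hypothetical conversion that turns a 1-byte character into two 4-byte characters (8 bytes out) does not.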
Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
Where are we on this?

---

Tom Lane wrote:
> [...]
> #define MAX_CONVERSION_GROWTH 4
>
> It strikes me that this is overly pessimistic, since we do not support
> 5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
> in any supported encoding that require 4 bytes in another. Could we
> reduce the multiplier to 3? Or even 2? This has a direct impact on the
> longest COPY lines we can support, so I'd like it not to be larger than
> necessary.
>
> regards, tom lane

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
On Tue, May 29, 2007 20:51, Tatsuo Ishii wrote:

> Thinking more, it struck me that users can define an arbitrary growth
> rate by using CREATE CONVERSION. So it seems we need functionality to
> define the growth rate anyway.

Would it make sense to define just the longest and shortest character
lengths for an encoding? Then for any conversion you'd have a safe
estimate of

    ceil(target_encoding.max_char_len / source_encoding.min_char_len)

...without going through every possible conversion.

Jeroen
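The estimate above can be written with integer arithmetic only. This is a sketch of the formula as proposed, not existing PostgreSQL code; `estimated_growth` and its parameter names are assumptions chosen to mirror the message.

```c
#include <assert.h>
#include <stddef.h>

/*
 * ceil(target_max_char_len / source_min_char_len), computed without
 * floating point.  Both lengths are in bytes.
 */
size_t
estimated_growth(size_t target_max_char_len, size_t source_min_char_len)
{
    assert(source_min_char_len > 0);
    return (target_max_char_len + source_min_char_len - 1)
        / source_min_char_len;
}
```

For a target whose longest character is 4 bytes and a source whose shortest character is 1 byte, this gives 4. The estimate treats each source character as producing exactly one target character.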
Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
On Tue, May 29, 2007 at 10:00:06AM -0400, Tom Lane wrote:
> In practice though, I find it hard to imagine a pair of encodings for
> which the growth rate is more than 3x. You'd need something that
> translates a single-byte character into 4 or more bytes (pretty
> unlikely, especially considering we require all these encodings to be
> ASCII supersets); or something that translates a 2-byte character into
> more than 6 bytes.

Many characters in the 0x80..0xff range of single-byte encodings like
LATIN1 become four bytes in GB18030 (e.g., LATIN1 f1 = GB18030 81 30 8a 39).
PostgreSQL doesn't currently support such conversions but it's something
to be aware of.

--
Michael Fuhr
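As an illustration, the byte lengths reported in this thread can be put in a table and the required multiplier computed from them. The struct, the table, and `required_growth` are hypothetical names for this sketch. The lengths come from the messages here, not from PostgreSQL's conversion maps.

```c
#include <assert.h>
#include <stddef.h>

/* One sample mapping: byte lengths before and after conversion. */
typedef struct
{
    const char *label;
    size_t      in_bytes;
    size_t      out_bytes;
} SampleMapping;

/* Lengths as reported in this thread. */
const SampleMapping thread_samples[] = {
    {"SJIS half-width kana -> UTF8", 1, 3},
    {"SHIFT_JIS_2004 0x82f5 -> UTF8 pair", 2, 6},
    {"LATIN1 0xf1 -> GB18030 81 30 8a 39", 1, 4},
};

/* Smallest integer multiplier that covers every sample. */
size_t
required_growth(const SampleMapping *samples, size_t n)
{
    size_t worst = 1;

    for (size_t i = 0; i < n; i++)
    {
        size_t m;

        assert(samples[i].in_bytes > 0);
        /* Round up so that in_bytes * m >= out_bytes. */
        m = (samples[i].out_bytes + samples[i].in_bytes - 1)
            / samples[i].in_bytes;
        if (m > worst)
            worst = m;
    }
    return worst;
}
```

The first two samples need a multiplier of 3. Adding the LATIN1 to GB18030 sample raises it to 4.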
Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
Tatsuo Ishii writes:
> Thinking more, it struck me that users can define an arbitrary growth
> rate by using CREATE CONVERSION. So it seems we need functionality to
> define the growth rate anyway.

Seems to me that would be an argument for moving the palloc inside the
conversion functions, as I suggested before.

In practice though, I find it hard to imagine a pair of encodings for
which the growth rate is more than 3x. You'd need something that
translates a single-byte character into 4 or more bytes (pretty
unlikely, especially considering we require all these encodings to be
ASCII supersets); or something that translates a 2-byte character into
more than 6 bytes.

regards, tom lane
Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
> Oops. You are right. MAX_CONVERSION_GROWTH should be 3 (= (2*3)/2),
> rather than 6, for this case.
>
> So it seems we could safely lower MAX_CONVERSION_GROWTH to 3 for
> the moment.

Thinking more, it struck me that users can define an arbitrary growth
rate by using CREATE CONVERSION. So it seems we need functionality to
define the growth rate anyway.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
> On Mon, May 28, 2007 at 10:23:42PM -0400, Tom Lane wrote:
> > Tatsuo Ishii writes:
> > > I'm afraid we have to make it larger, rather than smaller, for 8.3.
> > > For example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair*
> > > of 3-byte UTF_8 characters (0x00e3818b and 0x00e3829a). See
> > > util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.
> >
> > > So the worst case is now 6, rather than 3.
> >
> > Yipes.
>
> Isn't MAX_CONVERSION_GROWTH a multiplier? Doesn't 2 bytes becoming
> 2 * 3 bytes represent a growth of 3, not 6? Or does that 2-byte
> SHIFT_JIS_2004 sequence have a 1-byte sequence in another supported
> encoding? Or am I missing something?

Oops. You are right. MAX_CONVERSION_GROWTH should be 3 (= (2*3)/2),
rather than 6, for this case.

So it seems we could safely lower MAX_CONVERSION_GROWTH to 3 for
the moment.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
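The arithmetic can be checked directly with the byte sequences quoted in this thread. A minimal sketch; `growth_multiplier` and the array names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Byte values taken from the messages in this thread. */
const unsigned char sjis2004_in[] = {0x82, 0xf5};
const unsigned char utf8_out[] = {0xe3, 0x81, 0x8b, 0xe3, 0x82, 0x9a};

/* Smallest integer multiplier m such that in_len * m >= out_len. */
size_t
growth_multiplier(size_t in_len, size_t out_len)
{
    assert(in_len > 0);
    return (out_len + in_len - 1) / in_len;
}
```

Two input bytes become six output bytes, so the multiplier for this character is 3. The output is 6 bytes long, but the growth is measured against the 2-byte input.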
Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
On Mon, May 28, 2007 at 10:23:42PM -0400, Tom Lane wrote:
> Tatsuo Ishii writes:
> > I'm afraid we have to make it larger, rather than smaller, for 8.3.
> > For example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair*
> > of 3-byte UTF_8 characters (0x00e3818b and 0x00e3829a). See
> > util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.
>
> > So the worst case is now 6, rather than 3.
>
> Yipes.

Isn't MAX_CONVERSION_GROWTH a multiplier? Doesn't 2 bytes becoming
2 * 3 bytes represent a growth of 3, not 6? Or does that 2-byte
SHIFT_JIS_2004 sequence have a 1-byte sequence in another supported
encoding? Or am I missing something?

--
Michael Fuhr
Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
> > Can we add a column to pg_conversion which represents the "growth
> > rate"? This would bring the rate for most encodings well below 6.
>
> We need to do something, but the pg_conversion catalog seems a bad place
> to put the info --- don't we have places that need to be able to do
> conversion without catalog access?

Can you tell me where? I thought conversion functions were always called
through OidFunctionCall5, so we need to consult the pg_conversion catalog
beforehand anyway.

> Perhaps better would be to redefine the API for the conversion functions
> so that they palloc their own result space. Then each conversion
> function would have to know the maximum growth rate for its particular
> conversion. This change would also make it feasible for a conversion
> function to prescan the data and determine an exact output size, if that
> seemed worthwhile because the potential growth rate was too extreme.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
Tatsuo Ishii writes:
> I'm afraid we have to make it larger, rather than smaller, for 8.3.
> For example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair*
> of 3-byte UTF_8 characters (0x00e3818b and 0x00e3829a). See
> util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.

> So the worst case is now 6, rather than 3.

Yipes.

> Can we add a column to pg_conversion which represents the "growth
> rate"? This would bring the rate for most encodings well below 6.

We need to do something, but the pg_conversion catalog seems a bad place
to put the info --- don't we have places that need to be able to do
conversion without catalog access?

Perhaps better would be to redefine the API for the conversion functions
so that they palloc their own result space. Then each conversion
function would have to know the maximum growth rate for its particular
conversion. This change would also make it feasible for a conversion
function to prescan the data and determine an exact output size, if that
seemed worthwhile because the potential growth rate was too extreme.

regards, tom lane
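A rough sketch of the "prescan and allocate inside the conversion function" idea. It uses LATIN1 to UTF8 because the sizes are easy to compute exactly: bytes below 0x80 stay one byte, and the rest become two. malloc() stands in for palloc(), and both function names are hypothetical. This is not the signature of any existing conversion function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Pass 1: exact UTF-8 size of a LATIN1 string, excluding the NUL. */
size_t
latin1_utf8_size(const unsigned char *src, size_t len)
{
    size_t out = 0;

    for (size_t i = 0; i < len; i++)
        out += (src[i] < 0x80) ? 1 : 2;
    return out;
}

/*
 * Pass 2: convert into a buffer that the function allocates itself.
 * malloc() is a stand-in for palloc(); the caller frees the result.
 * Returns NULL on allocation failure.
 */
unsigned char *
latin1_to_utf8_alloc(const unsigned char *src, size_t len, size_t *out_len)
{
    size_t         need = latin1_utf8_size(src, len);
    unsigned char *dst = malloc(need + 1);
    unsigned char *p = dst;

    if (dst == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++)
    {
        unsigned char c = src[i];

        if (c < 0x80)
            *p++ = c;           /* ASCII passes through unchanged */
        else
        {
            /* LATIN1 byte c is code point U+00c: two-byte UTF-8 form */
            *p++ = (unsigned char) (0xc0 | (c >> 6));
            *p++ = (unsigned char) (0x80 | (c & 0x3f));
        }
    }
    *p = '\0';
    *out_len = need;
    return dst;
}
```

The caller no longer needs to know a worst-case multiplier. The cost is a second pass over the input, which is the trade-off described above.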
Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?
> [...]
> #define MAX_CONVERSION_GROWTH 4
>
> It strikes me that this is overly pessimistic, since we do not support
> 5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
> in any supported encoding that require 4 bytes in another. Could we
> reduce the multiplier to 3? Or even 2? This has a direct impact on the
> longest COPY lines we can support, so I'd like it not to be larger than
> necessary.

I'm afraid we have to make it larger, rather than smaller, for 8.3.
For example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair*
of 3-byte UTF_8 characters (0x00e3818b and 0x00e3829a). See
util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.

So the worst case is now 6, rather than 3.

Can we add a column to pg_conversion which represents the "growth
rate"? This would bring the rate for most encodings well below 6.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
[HACKERS] What is the maximum encoding-conversion growth rate, anyway?
I just rearranged the code in mbutils.c a little bit to make it more
robust if conversion of an over-length string is attempted, and noted
this comment:

/*
 * When converting strings between different encodings, we assume that space
 * for converted result is 4-to-1 growth in the worst case. The rate for
 * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
 * kanna -> UTF8 is the worst case). So "4" should be enough for the moment.
 *
 * Note that this is not the same as the maximum character width in any
 * particular encoding.
 */
#define MAX_CONVERSION_GROWTH 4

It strikes me that this is overly pessimistic, since we do not support
5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
in any supported encoding that require 4 bytes in another. Could we
reduce the multiplier to 3? Or even 2? This has a direct impact on the
longest COPY lines we can support, so I'd like it not to be larger than
necessary.

regards, tom lane
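To make the COPY remark concrete, here is a minimal sketch of how a fixed multiplier caps the longest convertible input. It assumes the output buffer is sized as len * growth + 1 and that a single allocation is capped at 1 GB - 1. `ALLOC_LIMIT`, `worst_case_buffer`, and `max_convertible_len` are illustrative names, not PostgreSQL's API.

```c
#include <assert.h>
#include <stddef.h>

/* Value quoted in the message above. */
#define MAX_CONVERSION_GROWTH 4

/* Assumption: a 1 GB - 1 cap on a single allocation, for illustration. */
#define ALLOC_LIMIT ((size_t) 0x3fffffff)

/* Worst-case output buffer for len input bytes, plus a trailing NUL. */
size_t
worst_case_buffer(size_t len, size_t growth)
{
    return len * growth + 1;
}

/* Longest input whose worst-case buffer still fits under the cap. */
size_t
max_convertible_len(size_t growth)
{
    assert(growth > 0);
    return (ALLOC_LIMIT - 1) / growth;
}
```

Under these assumptions, a multiplier of 4 allows inputs of up to 268435455 bytes, and a multiplier of 3 allows up to 357913940 bytes. A smaller multiplier therefore allows longer lines.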