Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2008-03-11 Thread Bruce Momjian

Added to TODO:

* Change memory allocation for multi-byte functions so memory is
  allocated inside conversion functions

  Currently we preallocate memory based on worst-case usage.
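
  As a rough illustration of the worst-case preallocation this item refers
  to, the sketch below sizes an output buffer as the input length times a
  fixed growth factor. MAX_CONVERSION_GROWTH is the macro quoted further
  down in this thread; MAX_ALLOC_SIZE (1 GB - 1) and both helper names are
  assumptions made for this example, not the code in mbutils.c.

```c
#include <assert.h>
#include <stddef.h>

/* Growth factor quoted in this thread from mbutils.c. */
#define MAX_CONVERSION_GROWTH 4

/* Assumed allocation ceiling for this sketch: 1 GB - 1. */
#define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)

/*
 * Hypothetical helper: bytes to preallocate for converting src_len input
 * bytes, including a trailing NUL.  Returns 0 when the worst case would
 * not fit under MAX_ALLOC_SIZE.
 */
static size_t
worst_case_dest_size(size_t src_len)
{
    if (src_len > (MAX_ALLOC_SIZE - 1) / MAX_CONVERSION_GROWTH)
        return 0;
    return src_len * MAX_CONVERSION_GROWTH + 1;
}

/* Hypothetical helper: longest input accepted for a given growth factor. */
static size_t
max_convertible_len(size_t growth)
{
    assert(growth > 0);
    return (MAX_ALLOC_SIZE - 1) / growth;
}
```

  Under these assumptions, lowering the factor from 4 to 3 raises the
  longest convertible input from roughly 256 MB to roughly 341 MB, which is
  why the factor matters for long COPY lines.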


---

Tom Lane wrote:
 Tatsuo Ishii [EMAIL PROTECTED] writes:
  Thinking more, it struck me that users can define an arbitrary growth
  rate by using CREATE CONVERSION. So it seems we need functionality to
  define the growth rate anyway.
 
 Seems to me that would be an argument for moving the palloc inside the
 conversion functions, as I suggested before.
 
 In practice though, I find it hard to imagine a pair of encodings for
 which the growth rate is more than 3x.  You'd need something that
 translates a single-byte character into 4 or more bytes (pretty
 unlikely, especially considering we require all these encodings to be
 ASCII supersets); or something that translates a 2-byte character into
 more than 6 bytes.
 
   regards, tom lane
 

-- 
  Bruce Momjian  [EMAIL PROTECTED]http://momjian.us
  EnterpriseDB http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-07-18 Thread Tatsuo Ishii
Sorry for the delay.

 On Tue, May 29, 2007 20:51, Tatsuo Ishii wrote:
 
  Thinking more, it struck me that users can define an arbitrary growth
  rate by using CREATE CONVERSION. So it seems we need functionality to
  define the growth rate anyway.
 
 Would it make sense to define just the longest and shortest character
 lengths for an encoding?  Then for any conversion you'd have a safe
 estimate of
 
   ceil(target_encoding.max_char_len / source_encoding.min_char_len)
 
 ...without going through every possible conversion.

This will not work, since certain conversions map n characters to m
characters.
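
To make the objection concrete, here is a small sketch comparing the
proposed estimate with the growth observed for one mapping. The
EncodingLimits struct, both function names, and the one-to-two mapping
used in the note below are hypothetical; only the SHIFT_JIS_2004 example
comes from this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-encoding limits, in bytes per character. */
typedef struct
{
    size_t min_char_len;
    size_t max_char_len;
} EncodingLimits;

/* Proposed estimate: ceil(target.max_char_len / source.min_char_len). */
static size_t
estimated_growth(EncodingLimits source, EncodingLimits target)
{
    assert(source.min_char_len > 0);
    return (target.max_char_len + source.min_char_len - 1)
        / source.min_char_len;
}

/* Smallest integer factor g such that dst_bytes <= g * src_bytes. */
static size_t
observed_growth(size_t src_bytes, size_t dst_bytes)
{
    assert(src_bytes > 0);
    return (dst_bytes + src_bytes - 1) / src_bytes;
}
```

For SHIFT_JIS_2004 (1 to 2 bytes per character) to UTF8 (assumed 1 to 4
bytes), the estimate is 4, and the 0x82f5 case (2 bytes in, 6 bytes out)
stays within it at 3. A hypothetical user-defined conversion that maps one
1-byte character to two 3-byte characters in a target whose longest
character is 3 bytes would have an estimate of 3 but an observed growth of
6, so the estimate is not a safe bound for n-to-m conversions.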
--
Tatsuo Ishii
SRA OSS, Inc. Japan



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-07-18 Thread Tatsuo Ishii
The conclusion of the discussion appears to be that we could safely
reduce MAX_CONVERSION_GROWTH from 4 to 3 for all existing built-in
conversions.

However, since user-defined conversions could have an arbitrary growth
rate, it would probably be better to leave it as it is for now.

For 8.4, maybe we could change the conversion functions' signature so
that we don't need a fixed conversion growth rate, as Tom suggested.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

 Where are we on this?
 
 ---
 
 Tom Lane wrote:
  I just rearranged the code in mbutils.c a little bit to make it more
  robust if conversion of an over-length string is attempted, and noted
  this comment:
  
  /*
   * When converting strings between different encodings, we assume that space
   * for converted result is 4-to-1 growth in the worst case. The rate for
   * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
   * kana -> UTF8 is the worst case).  So 4 should be enough for the moment.
   *
   * Note that this is not the same as the maximum character width in any
   * particular encoding.
   */
  #define MAX_CONVERSION_GROWTH  4
  
  It strikes me that this is overly pessimistic, since we do not support
  5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
  in any supported encoding that require 4 bytes in another.  Could we
  reduce the multiplier to 3?  Or even 2?  This has a direct impact on the
  longest COPY lines we can support, so I'd like it not to be larger than
  necessary.
  
  regards, tom lane
  



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-07-18 Thread Bruce Momjian

This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---

Tatsuo Ishii wrote:
 The conclusion of the discussion appears to be that we could safely
 reduce MAX_CONVERSION_GROWTH from 4 to 3 for all existing built-in
 conversions.
 
 However, since user-defined conversions could have an arbitrary growth
 rate, it would probably be better to leave it as it is for now.
 
 For 8.4, maybe we could change the conversion functions' signature so
 that we don't need a fixed conversion growth rate, as Tom suggested.
 --
 Tatsuo Ishii
 SRA OSS, Inc. Japan
 
  Where are we on this?
  
  ---
  
  Tom Lane wrote:
   I just rearranged the code in mbutils.c a little bit to make it more
   robust if conversion of an over-length string is attempted, and noted
   this comment:
   
   /*
    * When converting strings between different encodings, we assume that space
    * for converted result is 4-to-1 growth in the worst case. The rate for
    * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
    * kana -> UTF8 is the worst case).  So 4 should be enough for the moment.
    *
    * Note that this is not the same as the maximum character width in any
    * particular encoding.
    */
   #define MAX_CONVERSION_GROWTH  4
   
   It strikes me that this is overly pessimistic, since we do not support
   5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
   in any supported encoding that require 4 bytes in another.  Could we
   reduce the multiplier to 3?  Or even 2?  This has a direct impact on the
   longest COPY lines we can support, so I'd like it not to be larger than
   necessary.
   
 regards, tom lane
   

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-07-16 Thread Bruce Momjian

Where are we on this?

---

Tom Lane wrote:
 I just rearranged the code in mbutils.c a little bit to make it more
 robust if conversion of an over-length string is attempted, and noted
 this comment:
 
 /*
  * When converting strings between different encodings, we assume that space
  * for converted result is 4-to-1 growth in the worst case. The rate for
  * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
  * kana -> UTF8 is the worst case).  So 4 should be enough for the moment.
  *
  * Note that this is not the same as the maximum character width in any
  * particular encoding.
  */
 #define MAX_CONVERSION_GROWTH  4
 
 It strikes me that this is overly pessimistic, since we do not support
 5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
 in any supported encoding that require 4 bytes in another.  Could we
 reduce the multiplier to 3?  Or even 2?  This has a direct impact on the
 longest COPY lines we can support, so I'd like it not to be larger than
 necessary.
 
   regards, tom lane
 

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-29 Thread Tatsuo Ishii
  On Mon, May 28, 2007 at 10:23:42PM -0400, Tom Lane wrote:
   Tatsuo Ishii [EMAIL PROTECTED] writes:
    I'm afraid we have to make it larger, rather than smaller for 8.3. For
    example 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
    UTF8 characters (0x00e3818b and 0x00e3829a). See
    util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.
    
    So the worst case is now 6, rather than 3.
   
   Yipes.
  
  Isn't MAX_CONVERSION_GROWTH a multiplier?  Doesn't 2 bytes becoming
  2 * 3 bytes represent a growth of 3, not 6?  Or does that 2-byte
  SHIFT_JIS_2004 sequence have a 1-byte sequence in another supported
  encoding?  Or am I missing something?
 
 Oops. You are right. MAX_CONVERSION_GROWTH should be 3 (= (2*3)/2),
 rather than 6, for this case.
 
 So it seems we could safely lower MAX_CONVERSION_GROWTH to 3 for
 the moment.

Thinking more, it struck me that users can define an arbitrary growth
rate by using CREATE CONVERSION. So it seems we need functionality to
define the growth rate anyway.
--
Tatsuo Ishii
SRA OSS, Inc. Japan



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-29 Thread Tom Lane
Tatsuo Ishii [EMAIL PROTECTED] writes:
 Thinking more, it struck me that users can define an arbitrary growth
 rate by using CREATE CONVERSION. So it seems we need functionality to
 define the growth rate anyway.

Seems to me that would be an argument for moving the palloc inside the
conversion functions, as I suggested before.

In practice though, I find it hard to imagine a pair of encodings for
which the growth rate is more than 3x.  You'd need something that
translates a single-byte character into 4 or more bytes (pretty
unlikely, especially considering we require all these encodings to be
ASCII supersets); or something that translates a 2-byte character into
more than 6 bytes.

regards, tom lane



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-29 Thread Michael Fuhr
On Tue, May 29, 2007 at 10:00:06AM -0400, Tom Lane wrote:
 In practice though, I find it hard to imagine a pair of encodings for
 which the growth rate is more than 3x.  You'd need something that
 translates a single-byte character into 4 or more bytes (pretty
 unlikely, especially considering we require all these encodings to be
 ASCII supersets); or something that translates a 2-byte character into
 more than 6 bytes.

Many characters in the 0x80..0xff range of single-byte encodings
like LATIN1 become four bytes in GB18030 (e.g., LATIN1 f1 = GB18030
81 30 8a 39).  PostgreSQL doesn't currently support such conversions
but it's something to be aware of.

-- 
Michael Fuhr



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-29 Thread Jeroen T. Vermeulen
On Tue, May 29, 2007 20:51, Tatsuo Ishii wrote:

 Thinking more, it struck me that users can define an arbitrary growth
 rate by using CREATE CONVERSION. So it seems we need functionality to
 define the growth rate anyway.

Would it make sense to define just the longest and shortest character
lengths for an encoding?  Then for any conversion you'd have a safe
estimate of

  ceil(target_encoding.max_char_len / source_encoding.min_char_len)

...without going through every possible conversion.


Jeroen





Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-28 Thread Tatsuo Ishii
 I just rearranged the code in mbutils.c a little bit to make it more
 robust if conversion of an over-length string is attempted, and noted
 this comment:
 
 /*
  * When converting strings between different encodings, we assume that space
  * for converted result is 4-to-1 growth in the worst case. The rate for
  * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
  * kana -> UTF8 is the worst case).  So 4 should be enough for the moment.
  *
  * Note that this is not the same as the maximum character width in any
  * particular encoding.
  */
 #define MAX_CONVERSION_GROWTH  4
 
 It strikes me that this is overly pessimistic, since we do not support
 5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
 in any supported encoding that require 4 bytes in another.  Could we
 reduce the multiplier to 3?  Or even 2?  This has a direct impact on the
 longest COPY lines we can support, so I'd like it not to be larger than
 necessary.

I'm afraid we have to make it larger, rather than smaller, for 8.3. For
example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
UTF8 characters (0x00e3818b and 0x00e3829a). See
util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.

So the worst case is now 6, rather than 3.

Can we add a column to pg_conversion which represents the growth
rate? This would reduce the rate for most encodings to much less than
6.
--
Tatsuo Ishii
SRA OSS, Inc. Japan



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-28 Thread Tom Lane
Tatsuo Ishii [EMAIL PROTECTED] writes:
 I'm afraid we have to make it larger, rather than smaller, for 8.3. For
 example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
 UTF8 characters (0x00e3818b and 0x00e3829a). See
 util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.

 So the worst case is now 6, rather than 3.

Yipes.

 Can we add a column to pg_conversion which represents the growth
 rate? This would reduce the rate for most encodings to much less than
 6.

We need to do something, but the pg_conversion catalog seems a bad place
to put the info --- don't we have places that need to be able to do
conversion without catalog access?

Perhaps better would be to redefine the API for the conversion functions
so that they palloc their own result space.  Then each conversion
function would have to know the maximum growth rate for its particular
conversion.  This change would also make it feasible for a conversion
function to prescan the data and determine an exact output size, if that
seemed worthwhile because the potential growth rate was too extreme.
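
A minimal sketch of that idea, assuming a LATIN1-to-UTF8 conversion: the
function prescans its input to compute the exact output size, then
allocates its own result. The function name and signature are
hypothetical, and malloc stands in for palloc so the example compiles
outside the backend; it is not the existing conversion function API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical conversion routine that allocates its own result.
 * Prescan: a LATIN1 byte below 0x80 needs 1 UTF8 byte, any other needs 2.
 * Returns a NUL-terminated malloc'd buffer (caller frees), or NULL if the
 * allocation fails.  *dst_len receives the length without the NUL.
 */
static unsigned char *
latin1_to_utf8_alloc(const unsigned char *src, size_t src_len,
                     size_t *dst_len)
{
    size_t      need = 0;
    size_t      i;
    unsigned char *dst;
    unsigned char *p;

    assert(src != NULL || src_len == 0);
    assert(dst_len != NULL);

    /* First pass: compute the exact output size. */
    for (i = 0; i < src_len; i++)
        need += (src[i] < 0x80) ? 1 : 2;

    dst = (unsigned char *) malloc(need + 1);
    if (dst == NULL)
        return NULL;

    /* Second pass: emit the UTF8 bytes. */
    p = dst;
    for (i = 0; i < src_len; i++)
    {
        if (src[i] < 0x80)
            *p++ = src[i];
        else
        {
            *p++ = (unsigned char) (0xc0 | (src[i] >> 6));
            *p++ = (unsigned char) (0x80 | (src[i] & 0x3f));
        }
    }
    *p = '\0';
    *dst_len = need;
    return dst;
}
```

With this shape the caller never needs a global growth constant: each
conversion knows its own worst case, and the prescan costs one extra pass
over the input in exchange for an exactly sized buffer.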

regards, tom lane



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-28 Thread Tatsuo Ishii
  Can we add a column to pg_conversion which represents the growth
  rate? This would reduce the rate for most encodings to much less than
  6.
 
 We need to do something, but the pg_conversion catalog seems a bad place
 to put the info --- don't we have places that need to be able to do
 conversion without catalog access?

Can you tell me where? I thought conversion functions were always
called via OidFunctionCall5, so we need to consult the pg_conversion
catalog beforehand anyway.

 Perhaps better would be to redefine the API for the conversion functions
 so that they palloc their own result space.  Then each conversion
 function would have to know the maximum growth rate for its particular
 conversion.  This change would also make it feasible for a conversion
 function to prescan the data and determine an exact output size, if that
 seemed worthwhile because the potential growth rate was too extreme.
--
Tatsuo Ishii
SRA OSS, Inc. Japan



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-28 Thread Michael Fuhr
On Mon, May 28, 2007 at 10:23:42PM -0400, Tom Lane wrote:
 Tatsuo Ishii [EMAIL PROTECTED] writes:
  I'm afraid we have to make it larger, rather than smaller, for 8.3. For
  example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
  UTF8 characters (0x00e3818b and 0x00e3829a). See
  util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.
 
  So the worst case is now 6, rather than 3.
 
 Yipes.

Isn't MAX_CONVERSION_GROWTH a multiplier?  Doesn't 2 bytes becoming
2 * 3 bytes represent a growth of 3, not 6?  Or does that 2-byte
SHIFT_JIS_2004 sequence have a 1-byte sequence in another supported
encoding?  Or am I missing something?

-- 
Michael Fuhr



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-28 Thread Tatsuo Ishii
 On Mon, May 28, 2007 at 10:23:42PM -0400, Tom Lane wrote:
  Tatsuo Ishii [EMAIL PROTECTED] writes:
   I'm afraid we have to make it larger, rather than smaller, for 8.3. For
   example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
   UTF8 characters (0x00e3818b and 0x00e3829a). See
   util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.
  
   So the worst case is now 6, rather than 3.
  
  Yipes.
 
 Isn't MAX_CONVERSION_GROWTH a multiplier?  Doesn't 2 bytes becoming
 2 * 3 bytes represent a growth of 3, not 6?  Or does that 2-byte
 SHIFT_JIS_2004 sequence have a 1-byte sequence in another supported
 encoding?  Or am I missing something?

Oops. You are right. MAX_CONVERSION_GROWTH should be 3 (= (2*3)/2),
rather than 6, for this case.

So it seems we could safely lower MAX_CONVERSION_GROWTH to 3 for
the moment.
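
The arithmetic can be checked with a short sketch. The byte values are the
ones given earlier in this thread for SHIFT_JIS_2004 0x82f5; the array
names and the growth_factor helper are hypothetical, introduced only for
this example.

```c
#include <assert.h>
#include <stddef.h>

/* SHIFT_JIS_2004 0x82f5, as cited in this thread (2 bytes). */
static const unsigned char sjis2004_in[] = {0x82, 0xf5};

/* Its UTF8 form: a pair of 3-byte characters (6 bytes). */
static const unsigned char utf8_out[] = {0xe3, 0x81, 0x8b, 0xe3, 0x82, 0x9a};

/*
 * Hypothetical helper: smallest integer factor g such that
 * dst_len <= g * src_len, i.e. the multiplier a fixed-growth
 * preallocation would need for this input.
 */
static size_t
growth_factor(size_t src_len, size_t dst_len)
{
    assert(src_len > 0);
    return (dst_len + src_len - 1) / src_len;
}
```

Calling growth_factor(sizeof(sjis2004_in), sizeof(utf8_out)) gives 3, the
same multiplier as a 1-byte half-width kana becoming a 3-byte UTF8
character, so this mapping does not push the required factor above 3.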
--
Tatsuo Ishii
SRA OSS, Inc. Japan
