Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2008-03-11 Thread Bruce Momjian

Added to TODO:

* Change memory allocation for multi-byte functions so memory is
  allocated inside conversion functions

  Currently we preallocate memory based on worst-case usage.
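
A minimal sketch of what worst-case preallocation implies, not the actual
mbutils.c code. The multiplier 4 is the value quoted later in this thread.
ALLOC_LIMIT (1 GB - 1), the function names, and the use of malloc in place
of palloc are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_CONVERSION_GROWTH 4             /* value quoted in the thread */
#define ALLOC_LIMIT ((size_t) 0x3fffffff)   /* assumed cap: 1 GB - 1 */

/* Largest source length whose worst-case output still fits under the cap. */
size_t
max_convertible_len(size_t growth)
{
    assert(growth > 0);
    return (ALLOC_LIMIT - 1) / growth;      /* reserve 1 byte for the NUL */
}

/*
 * Worst-case preallocation: the buffer is sized from the source length
 * alone, before any conversion happens. Returns NULL if the request
 * would exceed the assumed cap (or if malloc fails).
 */
unsigned char *
alloc_worst_case(size_t src_len)
{
    if (src_len > max_convertible_len(MAX_CONVERSION_GROWTH))
        return NULL;
    return malloc(src_len * MAX_CONVERSION_GROWTH + 1);
}
```

Under these assumptions, lowering the multiplier from 4 to 3 raises the
longest convertible input from about 256 MB to about 341 MB, which is why
the constant matters for long COPY lines.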


---

Tom Lane wrote:
> Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> > Thinking more, it struck me that users can define an arbitrary growth
> > rate by using CREATE CONVERSION. So it seems we need functionality to
> > define the growth rate anyway.
> 
> Seems to me that would be an argument for moving the palloc inside the
> conversion functions, as I suggested before.
> 
> In practice though, I find it hard to imagine a pair of encodings for
> which the growth rate is more than 3x.  You'd need something that
> translates a single-byte character into 4 or more bytes (pretty
> unlikely, especially considering we require all these encodings to be
> ASCII supersets); or something that translates a 2-byte character into
> more than 6 bytes.
> 
>   regards, tom lane
> 
> ---(end of broadcast)---
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>    choose an index scan if your joining column's datatypes do not
>    match

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-07-18 Thread Bruce Momjian

This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---

Tatsuo Ishii wrote:
> The conclusion of the discussion appears to be that we could safely
> reduce MAX_CONVERSION_GROWTH from 4 to 3 for all existing built-in
> conversions.
> 
> However, since user-defined conversions could have an arbitrary growth
> rate, it would probably be better to leave it as it is now.
> 
> For 8.4, maybe we could change the conversion functions' signature so
> that we don't need a fixed conversion rate, as Tom suggested.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> 
> > Where are we on this?
> > 
> > ---
> > 
> > Tom Lane wrote:
> > > I just rearranged the code in mbutils.c a little bit to make it more
> > > robust if conversion of an over-length string is attempted, and noted
> > > this comment:
> > > 
> > > /*
> > > >  * When converting strings between different encodings, we assume that space
> > > >  * for converted result is 4-to-1 growth in the worst case. The rate for
> > > >  * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
> > > >  * kanna -> UTF8 is the worst case).  So "4" should be enough for the moment.
> > >  *
> > >  * Note that this is not the same as the maximum character width in any
> > >  * particular encoding.
> > >  */
> > > #define MAX_CONVERSION_GROWTH  4
> > > 
> > > It strikes me that this is overly pessimistic, since we do not support
> > > 5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
> > > in any supported encoding that require 4 bytes in another.  Could we
> > > reduce the multiplier to 3?  Or even 2?  This has a direct impact on the
> > > longest COPY lines we can support, so I'd like it not to be larger than
> > > necessary.
> > > 
> > >   regards, tom lane
> > > 
> > > ---(end of broadcast)---
> > > TIP 4: Have you searched our list archives?
> > > 
> > >http://archives.postgresql.org
> > 
> > -- 
> >   Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
> >   EnterpriseDB   http://www.enterprisedb.com
> > 
> >   + If your life is a hard drive, Christ can be your backup. +

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-07-18 Thread Tatsuo Ishii
The conclusion of the discussion appears to be that we could safely
reduce MAX_CONVERSION_GROWTH from 4 to 3 for all existing built-in
conversions.

However, since user-defined conversions could have an arbitrary growth
rate, it would probably be better to leave it as it is now.

For 8.4, maybe we could change the conversion functions' signature so
that we don't need a fixed conversion rate, as Tom suggested.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> Where are we on this?
> 
> ---
> 
> Tom Lane wrote:
> > I just rearranged the code in mbutils.c a little bit to make it more
> > robust if conversion of an over-length string is attempted, and noted
> > this comment:
> > 
> > /*
> >  * When converting strings between different encodings, we assume that space
> >  * for converted result is 4-to-1 growth in the worst case. The rate for
> >  * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
> >  * kanna -> UTF8 is the worst case).  So "4" should be enough for the moment.
> >  *
> >  * Note that this is not the same as the maximum character width in any
> >  * particular encoding.
> >  */
> > #define MAX_CONVERSION_GROWTH  4
> > 
> > It strikes me that this is overly pessimistic, since we do not support
> > 5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
> > in any supported encoding that require 4 bytes in another.  Could we
> > reduce the multiplier to 3?  Or even 2?  This has a direct impact on the
> > longest COPY lines we can support, so I'd like it not to be larger than
> > necessary.
> > 
> > regards, tom lane
> > 
> > ---(end of broadcast)---
> > TIP 4: Have you searched our list archives?
> > 
> >http://archives.postgresql.org
> 
> -- 
>   Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
>   EnterpriseDB   http://www.enterprisedb.com
> 
>   + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-07-18 Thread Tatsuo Ishii
Sorry for the delay.

> On Tue, May 29, 2007 20:51, Tatsuo Ishii wrote:
> 
> > Thinking more, it struck me that users can define an arbitrary growth
> > rate by using CREATE CONVERSION. So it seems we need functionality to
> > define the growth rate anyway.
> 
> Would it make sense to define just the longest and shortest character
> lengths for an encoding?  Then for any conversion you'd have a safe
> estimate of
> 
>   ceil(target_encoding.max_char_len / source_encoding.min_char_len)
> 
> ...without going through every possible conversion.

This will not work, since certain conversions map n characters to m
characters, so a per-character length ratio does not bound the growth.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
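
A small sketch of the objection above. EncodingInfo, its fields, and both
function names are hypothetical, and the one-character-to-two-characters
mapping is an invented user-defined conversion used only as a
counterexample, not an existing PostgreSQL conversion.

```c
#include <assert.h>

/* Hypothetical per-encoding metadata, as in the proposed estimate. */
typedef struct
{
    int min_char_len;   /* shortest character, in bytes */
    int max_char_len;   /* longest character, in bytes */
} EncodingInfo;

/* ceil(target.max_char_len / source.min_char_len) */
int
estimated_growth(EncodingInfo source, EncodingInfo target)
{
    assert(source.min_char_len > 0);
    return (target.max_char_len + source.min_char_len - 1)
        / source.min_char_len;
}

/* Growth actually needed for one mapping: ceil(out_bytes / in_bytes). */
int
actual_growth(int in_bytes, int out_bytes)
{
    assert(in_bytes > 0);
    return (out_bytes + in_bytes - 1) / in_bytes;
}
```

With a source encoding whose shortest character is 1 byte and a target
whose longest character is 4 bytes, the estimate is 4. A conversion that
turns a single 1-byte character into two 3-byte characters needs a factor
of 6, which the estimate does not cover.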



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-07-16 Thread Bruce Momjian

Where are we on this?

---

Tom Lane wrote:
> I just rearranged the code in mbutils.c a little bit to make it more
> robust if conversion of an over-length string is attempted, and noted
> this comment:
> 
> /*
>  * When converting strings between different encodings, we assume that space
>  * for converted result is 4-to-1 growth in the worst case. The rate for
>  * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
>  * kanna -> UTF8 is the worst case).  So "4" should be enough for the moment.
>  *
>  * Note that this is not the same as the maximum character width in any
>  * particular encoding.
>  */
> #define MAX_CONVERSION_GROWTH  4
> 
> It strikes me that this is overly pessimistic, since we do not support
> 5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
> in any supported encoding that require 4 bytes in another.  Could we
> reduce the multiplier to 3?  Or even 2?  This has a direct impact on the
> longest COPY lines we can support, so I'd like it not to be larger than
> necessary.
> 
>   regards, tom lane

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-29 Thread Jeroen T. Vermeulen
On Tue, May 29, 2007 20:51, Tatsuo Ishii wrote:

> Thinking more, it struck me that users can define an arbitrary growth
> rate by using CREATE CONVERSION. So it seems we need functionality to
> define the growth rate anyway.

Would it make sense to define just the longest and shortest character
lengths for an encoding?  Then for any conversion you'd have a safe
estimate of

  ceil(target_encoding.max_char_len / source_encoding.min_char_len)

...without going through every possible conversion.


Jeroen





Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-29 Thread Michael Fuhr
On Tue, May 29, 2007 at 10:00:06AM -0400, Tom Lane wrote:
> In practice though, I find it hard to imagine a pair of encodings for
> which the growth rate is more than 3x.  You'd need something that
> translates a single-byte character into 4 or more bytes (pretty
> unlikely, especially considering we require all these encodings to be
> ASCII supersets); or something that translates a 2-byte character into
> more than 6 bytes.

Many characters in the 0x80..0xff range of single-byte encodings
like LATIN1 become four bytes in GB18030 (e.g., LATIN1 f1 = GB18030
81 30 8a 39).  PostgreSQL doesn't currently support such conversions
but it's something to be aware of.

-- 
Michael Fuhr



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-29 Thread Tom Lane
Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> Thinking more, it struck me that users can define an arbitrary growth
> rate by using CREATE CONVERSION. So it seems we need functionality to
> define the growth rate anyway.

Seems to me that would be an argument for moving the palloc inside the
conversion functions, as I suggested before.

In practice though, I find it hard to imagine a pair of encodings for
which the growth rate is more than 3x.  You'd need something that
translates a single-byte character into 4 or more bytes (pretty
unlikely, especially considering we require all these encodings to be
ASCII supersets); or something that translates a 2-byte character into
more than 6 bytes.

regards, tom lane



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-29 Thread Tatsuo Ishii
> > On Mon, May 28, 2007 at 10:23:42PM -0400, Tom Lane wrote:
> > > Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> > > > I'm afraid we have to make it larger, rather than smaller, for 8.3.
> > > > For example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair*
> > > > of 3-byte UTF-8 characters (0x00e3818b and 0x00e3829a). See
> > > > util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.
> > > 
> > > > So the worst case is now 6, rather than 3.
> > > 
> > > Yipes.
> > 
> > Isn't MAX_CONVERSION_GROWTH a multiplier?  Doesn't 2 bytes becoming
> > 2 * 3 bytes represent a growth of 3, not 6?  Or does that 2-byte
> > SHIFT_JIS_2004 sequence have a 1-byte sequence in another supported
> > encoding?  Or am I missing something?
> 
> Oops. You are right. MAX_CONVERSION_GROWTH should be 3 (=
> (2*3)/2), rather than 6, for this case.
> 
> So it seems we could safely reduce MAX_CONVERSION_GROWTH to 3 for
> the moment.

Thinking more, it struck me that users can define an arbitrary growth
rate by using CREATE CONVERSION. So it seems we need functionality to
define the growth rate anyway.
--
Tatsuo Ishii
SRA OSS, Inc. Japan



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-28 Thread Tatsuo Ishii
> On Mon, May 28, 2007 at 10:23:42PM -0400, Tom Lane wrote:
> > Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> > > I'm afraid we have to make it larger, rather than smaller, for 8.3.
> > > For example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair*
> > > of 3-byte UTF-8 characters (0x00e3818b and 0x00e3829a). See
> > > util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.
> > 
> > > So the worst case is now 6, rather than 3.
> > 
> > Yipes.
> 
> Isn't MAX_CONVERSION_GROWTH a multiplier?  Doesn't 2 bytes becoming
> 2 * 3 bytes represent a growth of 3, not 6?  Or does that 2-byte
> SHIFT_JIS_2004 sequence have a 1-byte sequence in another supported
> encoding?  Or am I missing something?

Oops. You are right. MAX_CONVERSION_GROWTH should be 3 (=
(2*3)/2), rather than 6, for this case.

So it seems we could safely reduce MAX_CONVERSION_GROWTH to 3 for
the moment.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
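
The arithmetic above can be checked with a short sketch. The two UTF-8
sequences quoted in the thread (e3 81 8b and e3 82 9a) decode to the code
points U+304B and U+309A. The function names are hypothetical and this is
not the PostgreSQL conversion code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point (up to U+10FFFF) as UTF-8; returns bytes written. */
size_t
utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    assert(cp <= 0x10FFFF);
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}

/* Multiplier needed for one mapping: ceil(out_bytes / in_bytes). */
size_t
growth_factor(size_t in_bytes, size_t out_bytes)
{
    assert(in_bytes > 0);
    return (out_bytes + in_bytes - 1) / in_bytes;
}

/* UTF-8 output for the 2-byte SHIFT_JIS_2004 sequence 0x82f5. */
size_t
sjis2004_82f5_to_utf8(unsigned char *out)
{
    size_t n = utf8_encode(0x304B, out);    /* e3 81 8b */

    n += utf8_encode(0x309A, out + n);      /* e3 82 9a */
    return n;
}
```

The 2-byte input produces 6 bytes of output, so growth_factor(2, 6) is 3,
the same multiplier as the 1-byte half-width kana case that becomes 3 bytes
in UTF-8.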



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-28 Thread Michael Fuhr
On Mon, May 28, 2007 at 10:23:42PM -0400, Tom Lane wrote:
> Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> > I'm afraid we have to make it larger, rather than smaller, for 8.3.
> > For example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair*
> > of 3-byte UTF-8 characters (0x00e3818b and 0x00e3829a). See
> > util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.
> 
> > So the worst case is now 6, rather than 3.
> 
> Yipes.

Isn't MAX_CONVERSION_GROWTH a multiplier?  Doesn't 2 bytes becoming
2 * 3 bytes represent a growth of 3, not 6?  Or does that 2-byte
SHIFT_JIS_2004 sequence have a 1-byte sequence in another supported
encoding?  Or am I missing something?

-- 
Michael Fuhr



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-28 Thread Tatsuo Ishii
> > Can we add a column to pg_conversion which represents the "growth
> > rate"? This would bring the rate for most encodings well below 6.
> 
> We need to do something, but the pg_conversion catalog seems a bad place
> to put the info --- don't we have places that need to be able to do
> conversion without catalog access?

Can you tell me where? I thought conversion functions are always
called via OidFunctionCall5, so we need to consult the pg_conversion
catalog beforehand anyway.

> Perhaps better would be to redefine the API for the conversion functions
> so that they palloc their own result space.  Then each conversion
> function would have to know the maximum growth rate for its particular
> conversion.  This change would also make it feasible for a conversion
> function to prescan the data and determine an exact output size, if that
> seemed worthwhile because the potential growth rate was too extreme.
--
Tatsuo Ishii
SRA OSS, Inc. Japan



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-28 Thread Tom Lane
Tatsuo Ishii <[EMAIL PROTECTED]> writes:
> I'm afraid we have to make it larger, rather than smaller, for 8.3.
> For example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair*
> of 3-byte UTF-8 characters (0x00e3818b and 0x00e3829a). See
> util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.

> So the worst case is now 6, rather than 3.

Yipes.

> Can we add a column to pg_conversion which represents the "growth
> rate"? This would bring the rate for most encodings well below 6.

We need to do something, but the pg_conversion catalog seems a bad place
to put the info --- don't we have places that need to be able to do
conversion without catalog access?

Perhaps better would be to redefine the API for the conversion functions
so that they palloc their own result space.  Then each conversion
function would have to know the maximum growth rate for its particular
conversion.  This change would also make it feasible for a conversion
function to prescan the data and determine an exact output size, if that
seemed worthwhile because the potential growth rate was too extreme.

regards, tom lane
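
A rough sketch of the API shape proposed above, in which the conversion
function prescans its input and allocates an exactly sized result. LATIN1
to UTF-8 is used only because its rule is simple (ASCII stays 1 byte,
0x80..0xff becomes 2 bytes). The function names and signature are
hypothetical, and malloc stands in for palloc so the sketch compiles
outside the backend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Prescan: exact UTF-8 size of a LATIN1 string, excluding the NUL. */
size_t
latin1_utf8_size(const unsigned char *src, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
        n += (src[i] < 0x80) ? 1 : 2;
    return n;
}

/*
 * Hypothetical conversion function that owns its allocation.
 * Returns a NUL-terminated buffer the caller must free, or NULL on
 * allocation failure. *out_len receives the length without the NUL.
 */
unsigned char *
latin1_to_utf8_alloc(const unsigned char *src, size_t len, size_t *out_len)
{
    size_t need = latin1_utf8_size(src, len);
    unsigned char *dst = malloc(need + 1);
    unsigned char *p = dst;

    if (dst == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++)
    {
        if (src[i] < 0x80)
            *p++ = src[i];
        else
        {
            /* U+0080..U+00FF always encode as two UTF-8 bytes. */
            *p++ = (unsigned char) (0xC0 | (src[i] >> 6));
            *p++ = (unsigned char) (0x80 | (src[i] & 0x3F));
        }
    }
    assert((size_t) (p - dst) == need);   /* prescan matched the output */
    *p = '\0';
    *out_len = need;
    return dst;
}
```

With this shape the caller no longer needs a global multiplier: a
3-byte input such as 'a', 0xf1, 'o' yields exactly 4 output bytes, at the
cost of reading the input twice.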



Re: [HACKERS] What is the maximum encoding-conversion growth rate, anyway?

2007-05-28 Thread Tatsuo Ishii
> I just rearranged the code in mbutils.c a little bit to make it more
> robust if conversion of an over-length string is attempted, and noted
> this comment:
> 
> /*
>  * When converting strings between different encodings, we assume that space
>  * for converted result is 4-to-1 growth in the worst case. The rate for
>  * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
>  * kanna -> UTF8 is the worst case).  So "4" should be enough for the moment.
>  *
>  * Note that this is not the same as the maximum character width in any
>  * particular encoding.
>  */
> #define MAX_CONVERSION_GROWTH  4
> 
> It strikes me that this is overly pessimistic, since we do not support
> 5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
> in any supported encoding that require 4 bytes in another.  Could we
> reduce the multiplier to 3?  Or even 2?  This has a direct impact on the
> longest COPY lines we can support, so I'd like it not to be larger than
> necessary.

I'm afraid we have to make it larger, rather than smaller, for 8.3.
For example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair*
of 3-byte UTF-8 characters (0x00e3818b and 0x00e3829a). See
util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.

So the worst case is now 6, rather than 3.

Can we add a column to pg_conversion which represents the "growth
rate"? This would bring the rate for most encodings well below 6.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
