Re: [HACKERS] improve Chinese locale performance

2013-09-09 Thread Quan Zongliang

On 09/06/2013 01:02 AM, Robert Haas wrote:

On Wed, Sep 4, 2013 at 11:02 PM, Quan Zongliang quanzongli...@gmail.com wrote:

I have a new idea: add a compare-method column to pg_collation.
Every collation would have its own compare function, or null.
When varstr_cmp() is called, if the specified collation has a
compare function, call it instead of strcoll().


I think we're going to need to have two kinds of collations:
OS-derived collations (which get all of their smarts from the OS), and
PG-internal collations (which use PG-aware code for everything).
Which I suspect is a bit more involved than what you're imagining, but
mixing and matching doesn't seem likely to end well.

However, what you're proposing might serve as a useful demonstration
of how much performance there is to be gained here.


Understood.

I am just trying to speed up text comparison, not redesign the locale support.

Do you have a plan to do this?

Thank you.

Quan Zongliang



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] improve Chinese locale performance

2013-09-09 Thread Robert Haas
On Mon, Sep 9, 2013 at 5:22 AM, Quan Zongliang quanzongli...@gmail.com wrote:
 Understood.

 I am just trying to speed up text comparison, not redesign the locale support.

 Do you have a plan to do this?

Not any time soon, anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] improve Chinese locale performance

2013-09-05 Thread Robert Haas
On Wed, Sep 4, 2013 at 11:02 PM, Quan Zongliang quanzongli...@gmail.com wrote:
 I have a new idea: add a compare-method column to pg_collation.
 Every collation would have its own compare function, or null.
 When varstr_cmp() is called, if the specified collation has a
 compare function, call it instead of strcoll().

I think we're going to need to have two kinds of collations:
OS-derived collations (which get all of their smarts from the OS), and
PG-internal collations (which use PG-aware code for everything).
Which I suspect is a bit more involved than what you're imagining, but
mixing and matching doesn't seem likely to end well.

However, what you're proposing might serve as a useful demonstration
of how much performance there is to be gained here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] improve Chinese locale performance

2013-09-04 Thread Quan Zongliang

On 07/23/2013 09:42 PM, Craig Ringer wrote:

(Replying on phone, please forgive bad quoting)

Isn't this pretty much what adopting ICU is supposed to give us? OS-independent 
collations?

I'd be interested in seeing the test data for this performance report, partly 
as I'd like to see how ICU collations would compare when ICU is crudely hacked 
into place for testing.


I have a new idea: add a compare-method column to pg_collation.
Every collation would have its own compare function, or null.
When varstr_cmp() is called, if the specified collation has a
compare function, call it instead of strcoll().

How about this?
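The idea above amounts to a per-collation dispatch in the string-compare path. A minimal sketch in C, with hypothetical names (Collation, my_varstr_cmp, and toy_ci_cmp are illustrative only, not PostgreSQL's actual structures; strcmp() stands in for the strcoll() fallback):

```c
#include <assert.h>
#include <string.h>

/* A collation entry carries an optional compare callback; the compare
 * path uses it when present and falls back to the OS comparison
 * otherwise.  All names here are illustrative. */
typedef int (*collate_fn)(const char *a, const char *b);

typedef struct {
    const char *name;
    collate_fn  compare;   /* NULL means "use the OS comparison" */
} Collation;

static int my_varstr_cmp(const char *a, const char *b, const Collation *c)
{
    if (c->compare != NULL)
        return c->compare(a, b);   /* PG-internal, locale-aware compare */
    return strcmp(a, b);           /* fallback, standing in for strcoll() */
}

/* A toy compare function (ASCII case-insensitive) just to show the
 * dispatch; a real one would encode e.g. GB18030 ordering. */
static int toy_ci_cmp(const char *a, const char *b)
{
    for (;; a++, b++) {
        int ca = (*a >= 'A' && *a <= 'Z') ? *a + 32 : (unsigned char)*a;
        int cb = (*b >= 'A' && *b <= 'Z') ? *b + 32 : (unsigned char)*b;
        if (ca != cb || ca == 0)
            return ca - cb;
    }
}
```

The point of the sketch is only the null-check dispatch: collations with no special function keep today's behavior unchanged.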

Regards.

Quan Zongliang





Re: [HACKERS] improve Chinese locale performance

2013-08-01 Thread Robert Haas
On Sun, Jul 28, 2013 at 5:39 AM, Martijn van Oosterhout
klep...@svana.org wrote:
 On Tue, Jul 23, 2013 at 10:34:21AM -0400, Robert Haas wrote:
 I pretty much lost interest in ICU upon reading that they use UTF-16
 as their internal format.

 http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

 The UTF-8 support has been steadily improving:

   For example, icu::Collator::compareUTF8() compares two UTF-8 strings
   incrementally, without converting all of the two strings to UTF-16 if
   there is an early base letter difference.

 http://userguide.icu-project.org/strings/utf-8

 For all other encodings you should be able to use an iterator. As to
 performance I have no idea.

 The main issue with strxfrm() is its lame API. If it supported
 returning prefixes you'd be set, but as it is you need 10MB of memory
 just to transform a 10MB string, even if only the first few characters
 would be enough to sort...

Yep, definitely.  And by "10MB" you mean 90MB, at least on my Mac,
which is really outrageous.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] improve Chinese locale performance

2013-07-28 Thread Martijn van Oosterhout
On Tue, Jul 23, 2013 at 10:34:21AM -0400, Robert Haas wrote:
 I pretty much lost interest in ICU upon reading that they use UTF-16
 as their internal format.
 
 http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

The UTF-8 support has been steadily improving:

  For example, icu::Collator::compareUTF8() compares two UTF-8 strings
  incrementally, without converting all of the two strings to UTF-16 if
  there is an early base letter difference.

http://userguide.icu-project.org/strings/utf-8

For all other encodings you should be able to use an iterator. As to
performance I have no idea.

The main issue with strxfrm() is its lame API. If it supported
returning prefixes you'd be set, but as it is you need 10MB of memory
just to transform a 10MB string, even if only the first few characters
would be enough to sort...
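The API limitation can be seen directly in C: strxfrm() reports the full transformed length, and a truncated result is not usable for comparison, so the whole key must be allocated up front. A small sketch, assuming the default locale (the helper names required_key_len, key_fits, and make_key are illustrative):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* strxfrm() returns the full transformed length even when the buffer
 * is too small -- and if that return value is >= the buffer size, the
 * buffer contents are indeterminate (per C/POSIX), so a truncated key
 * cannot be sorted on.  The whole transformed string must fit. */
static size_t required_key_len(const char *s)
{
    return strxfrm(NULL, s, 0);   /* n == 0 with NULL dest: measure only */
}

static int key_fits(const char *s, size_t bufsize)
{
    return required_key_len(s) < bufsize;
}

/* Allocate and build the full sort key for a string. */
static char *make_key(const char *s)
{
    size_t n = required_key_len(s) + 1;   /* +1 for the trailing NUL */
    char *key = malloc(n);
    strxfrm(key, s, n);
    return key;                           /* caller frees */
}
```

Once the full keys exist, strcmp() on them has the same sign as strcoll() on the originals; the complaint is that there is no standard way to get a usable short prefix of the key.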

Regards,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] improve Chinese locale performance

2013-07-23 Thread Robert Haas
On Mon, Jul 22, 2013 at 12:49 PM, Greg Stark st...@mit.edu wrote:
 On Mon, Jul 22, 2013 at 12:50 PM, Peter Eisentraut pete...@gmx.net wrote:
 I think part of the problem is that we call strcoll for each comparison,
 instead of doing strxfrm once for each datum and then just strcmp for
 each comparison.  That is effectively equivalent to what the proposal
 implements.

 Fwiw I used to be a big proponent of using strxfrm. But upon further
 analysis I realized it was a really difficult tradeoff. strxfrm saves
 potentially a lot of cpu cost but at the expense of expanding the size
 of the sort key. If the sort spills to disk or even if it's just
 memory bandwidth limited it might actually be slower than doing the
 additional cpu work of calling strcoll.

 It's hard to see how to decide in advance which way will be faster. I
 suspect strxfrm is still the better bet, especially for complex large
 character set based locales like Chinese. strcoll might still win by a
 large margin on simple mostly-ascii character sets.

The storage blow-up on systems I've tested is on the order of 10x.
That's possibly fine if the data still fits in memory, but it pretty
much sucks if it makes your sort spill to disk, which seems like a
likely outcome in many cases.

But I don't have much trouble believing the OP's contention that he's
coded a locale-specific version that is faster than the version that
ships with the OS.  On glibc, for example, we copy the strings we want
to compare, so that we can add a terminating zero byte.  The first
thing that glibc does is call strlen().  That's pretty horrible, and
I'm not at all sure the horror ends there, either.
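In outline, the pattern being criticized looks like this (a simplified illustration, not PostgreSQL's actual varstr_cmp code): the inputs are length-counted, strcoll() wants NUL-terminated strings, so every comparison pays for two copies before the locale library even starts, and glibc may then scan the strings again.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the "copy just to add a terminator" pattern: the inputs
 * are length-counted (like PostgreSQL text datums), but strcoll()
 * requires NUL-terminated C strings, so each comparison does two
 * allocations and two copies first. */
static int counted_str_cmp(const char *a, size_t alen,
                           const char *b, size_t blen)
{
    char *abuf = malloc(alen + 1);
    char *bbuf = malloc(blen + 1);
    int   r;

    memcpy(abuf, a, alen);  abuf[alen] = '\0';
    memcpy(bbuf, b, blen);  bbuf[blen] = '\0';

    r = strcoll(abuf, bbuf);   /* glibc may immediately strlen() these */

    free(abuf);
    free(bbuf);
    return r;
}
```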

It would be great to have support for user-defined collations in
PostgreSQL.  Let the user provide their own comparison function and
whatever else is needed and use that instead of the OS-specific
support.  Aside from the performance advantages, one could even create
collations that have the same names and orderings on all platforms we
support.  Our support team has gotten more than one inquiry of the
form "what's the equivalent of Linux collation XYZ on Windows?" - and
telling them that there is no exact equivalent is not the answer they
want to hear.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] improve Chinese locale performance

2013-07-23 Thread Craig Ringer
(Replying on phone, please forgive bad quoting)

Isn't this pretty much what adopting ICU is supposed to give us? OS-independent 
collations?

I'd be interested in seeing the test data for this performance report, partly 
as I'd like to see how ICU collations would compare when ICU is crudely hacked 
into place for testing.


Re: [HACKERS] improve Chinese locale performance

2013-07-23 Thread Robert Haas
On Tue, Jul 23, 2013 at 9:42 AM, Craig Ringer cr...@2ndquadrant.com wrote:
 (Replying on phone, please forgive bad quoting)

 Isn't this pretty much what adopting ICU is supposed to give us? 
 OS-independent collations?

Yes.

 I'd be interested in seeing the test data for this performance report, partly 
 as I'd like to see how ICU collations would compare when ICU is crudely 
 hacked into place for testing.

I pretty much lost interest in ICU upon reading that they use UTF-16
as their internal format.

http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

What that would mean for us is that instead of copying the input
strings into a temporary buffer and passing the buffer to strcoll(),
we'd need to convert them to ICU's representation (which means writing
twice as many bytes as the length of the input string in cases where
the input string is mostly single-byte characters) and then call ICU's
strcoll() equivalent.  I agree that it might be worth testing, but I
can't work up much optimism.  It seems to me that something that
operates directly on the server encoding could run a whole lot faster.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] improve Chinese locale performance

2013-07-23 Thread Quan Zongliang

On 07/23/2013 09:42 PM, Craig Ringer wrote:

(Replying on phone, please forgive bad quoting)

Isn't this pretty much what adopting ICU is supposed to give us? OS-independent 
collations?


Yes, we need OS-independent collations.


I'd be interested in seeing the test data for this performance report, partly 
as I'd like to see how ICU collations would compare when ICU is crudely hacked 
into place for testing.







Re: [HACKERS] improve Chinese locale performance

2013-07-22 Thread Craig Ringer
On 07/22/2013 12:17 PM, Quan Zongliang wrote:
 Hi hackers,
 
 I tried to improve performance when the database locale is Chinese.
 
 Under openSUSE, creating an index on a table with 54,996 rows takes:
 locale=C, 140ms
 locale=zh_CN, 985ms
 
 I think Linux's strcoll() is too slow, so I made a new UTF-8 to
 GB18030 map that stores the Chinese collation order and avoids
 calling strcoll(). With my modified code, the same operation under
 locale=zh_CN takes 203ms.

It might be worth looking at glibc's strcoll() implementation. See if it
performs better with the latest glibc, and if not, try to improve
glibc's strcoll().

I'd be interested in seeing a test case for this that shows that the
results of your new collation are exactly the same as the original
strcoll() based approach.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] improve Chinese locale performance

2013-07-22 Thread Quan Zongliang

On 07/22/2013 03:54 PM, Craig Ringer wrote:

On 07/22/2013 12:17 PM, Quan Zongliang wrote:

Hi hackers,

I tried to improve performance when the database locale is Chinese.

Under openSUSE, creating an index on a table with 54,996 rows takes:
locale=C, 140ms
locale=zh_CN, 985ms

I think Linux's strcoll() is too slow, so I made a new UTF-8 to
GB18030 map that stores the Chinese collation order and avoids
calling strcoll(). With my modified code, the same operation under
locale=zh_CN takes 203ms.


It might be worth looking at glibc's strcoll() implementation. See if it
performs better with the latest glibc, and if not, try to improve
glibc's strcoll().

I'd be interested in seeing a test case for this that shows that the
results of your new collation are exactly the same as the original
strcoll() based approach.


They are not exactly the same.
I found some errors in glibc's strcoll() when ordering Chinese
characters: there are many special characters in Chinese, and glibc's
strcoll() does not handle them all, or misses some of them.

Yes, the best way would be to improve glibc's strcoll(),
but I don't know how to do that.


Thanks,
Quan Zongliang





Re: [HACKERS] improve Chinese locale performance

2013-07-22 Thread Peter Eisentraut
On 7/22/13 3:54 AM, Craig Ringer wrote:
 It might be worth looking at glibc's strcoll() implementation. See if it
 performs better with the latest glibc, and if not, try to improve
 glibc's strcoll().

I think part of the problem is that we call strcoll for each comparison,
instead of doing strxfrm once for each datum and then just strcmp for
each comparison.  That is effectively equivalent to what the proposal
implements.
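The transform-once approach described above can be sketched in C: compute strxfrm() once per datum, after which every pairwise comparison is a cheap strcmp() on the keys whose sign matches strcoll() on the originals in the current LC_COLLATE locale (the helper names xfrm_key and key_cmp are illustrative):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Transform each datum once; afterwards every pairwise comparison is a
 * plain strcmp() on the keys, ordering the same way strcoll() orders
 * the original strings. */
static char *xfrm_key(const char *s)
{
    size_t n = strxfrm(NULL, s, 0) + 1;   /* measure, then allocate */
    char  *key = malloc(n);
    strxfrm(key, s, n);
    return key;                           /* caller frees */
}

/* qsort() comparator over an array of pre-built keys. */
static int key_cmp(const void *pa, const void *pb)
{
    return strcmp(*(char *const *)pa, *(char *const *)pb);
}
```

A sort would build keys once (n strxfrm() calls), then qsort() with key_cmp so the O(n log n) comparisons are cheap strcmp() calls rather than strcoll() calls, at the cost of storing the expanded keys.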





Re: [HACKERS] improve Chinese locale performance

2013-07-22 Thread Greg Stark
On Mon, Jul 22, 2013 at 12:50 PM, Peter Eisentraut pete...@gmx.net wrote:
 I think part of the problem is that we call strcoll for each comparison,
 instead of doing strxfrm once for each datum and then just strcmp for
 each comparison.  That is effectively equivalent to what the proposal
 implements.

Fwiw I used to be a big proponent of using strxfrm. But upon further
 analysis I realized it was a really difficult tradeoff. strxfrm saves
potentially a lot of cpu cost but at the expense of expanding the size
of the sort key. If the sort spills to disk or even if it's just
memory bandwidth limited it might actually be slower than doing the
additional cpu work of calling strcoll.

It's hard to see how to decide in advance which way will be faster. I
suspect strxfrm is still the better bet, especially for complex large
character set based locales like Chinese. strcoll might still win by a
large margin on simple mostly-ascii character sets.


-- 
greg




Re: [HACKERS] improve Chinese locale performance

2013-07-22 Thread Andrew Dunstan


On 07/22/2013 12:49 PM, Greg Stark wrote:

On Mon, Jul 22, 2013 at 12:50 PM, Peter Eisentraut pete...@gmx.net wrote:

I think part of the problem is that we call strcoll for each comparison,
instead of doing strxfrm once for each datum and then just strcmp for
each comparison.  That is effectively equivalent to what the proposal
implements.

Fwiw I used to be a big proponent of using strxfrm. But upon further
analysis I realized it was a really difficult tradeoff. strxfrm saves
potentially a lot of cpu cost but at the expense of expanding the size
of the sort key. If the sort spills to disk or even if it's just
memory bandwidth limited it might actually be slower than doing the
additional cpu work of calling strcoll.

It's hard to see how to decide in advance which way will be faster. I
suspect strxfrm is still the better bet, especially for complex large
character set based locales like Chinese. strcoll might still win by a
large margin on simple mostly-ascii character sets.





Perhaps we need a bit of performance testing to prove the point.

Maybe the behaviour should be locale-dependent.

cheers

andrew


