Re: [HACKERS] Looked at TODO:Considering improving performance of computing CHAR() value lengths

2014-10-11 Thread Bruce Momjian
On Mon, Aug  4, 2014 at 11:48:38AM +0300, Heikki Linnakangas wrote:
> The biggest issue with the previously discussed patches was the lack
> of performance testing. People posted results of various
> micro-benchmarks that showed different aspects, but no-one
> comprehensively covered all the interesting cases. I'd like to see a
> simple suite of performance tests that show:

Any progress on this?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Looked at TODO:Considering improving performance of computing CHAR() value lengths

2014-08-04 Thread Heikki Linnakangas

On 08/02/2014 09:43 PM, John Cochran wrote:

I took at look at the TODO list and got interested in the possible
optimization of the bcTruelen() function. Read the archived messages about
that subject and decided to see what could be done.

I tested the performance of 5 different versions of bcTruelen().
1. The code as it exists in postgresql currently.
2. The code with the patch described in the above mentioned messages.
3. A modification of the code mentioned in 2 above using the concept of a
sentinel.
4. A modification of the code mentioned in 1 above using a sentinel.
5. A modification of the code mentioned in 4 using knowledge of 1B and 4B
headers.

First, let me describe a sentinel. The key thing about using a sentinel is
that with on, you can guarantee loop termination without having to check
for an index. What you do is place a value that you're searching for 1 past
the limit of the range you're searching for. By doing this, you can
optimize your loop to have just 1 condition to check for instead of 2
(search value and index limit). This mean that your loop now needs to just
execute 3 opcodes (pointer decrement, search compare, conditional jump)
instead of 5 (pointer decrement, search compare, conditional jump, limit
compare, conditional jump). In the case of bcTruelen(), the sentinel value
is anything that's non-space.

I ran all 5 different versions of bcTruelen() in a benchmark program that
simply allocated a gig of memory, populated the memory with varlena
structures of varying length, alignment, and trailing spaces. Then called
the different versions of bcTruelen() using a pseudo random sequence to try
and prevent processor caching from affecting the values.

Overall, the results were as follows:


Can we see the benchmark program and the actual data, please?


1. For small numbers of trailing space, the "4 at a time" approach was a
consistent loser in terms of performance. If there's a large number of
trailing spaces, the performance of the "4 at a time" approach could be
quite impressive however.
2. The best typical performer was the single byte at a time approach using
a sentinel and knowledge of the types of headers. For that routine, the
performance compared against the current version of bcTruelen() was as
follows
CHAR(1) with only spaces. Current version wins. New version loses by about
20%
CHAR(1) with non-space. New version wins by about 10%.
CHAR(2) with only spaces. New version loses by about 4%.
CHAR(2) with one space. New version wins by about 15%.
CHAR(2) with no spaces. New version wins by about 10%

Once CHAR(4) was reached, the new version was faster for all numbers of
trailing spaces. I suspect the reason for the new routine being slower for
short CHAR types with all spaces is because using the sentinel results in
an extra loop iteration over the code that checks for limits each
iteration. But once the number of iterations gets large enough, the shorter
loop wins even though it iterated one extra time. The tests were performed
on an Intel Core2 duo.

The code effectively has 3 identical loops with different surroundings.
1. If the arg is pointing to a 1B style header, then there is no need to
worry about a sentinel. No 1B header matches a space character whether the
target system is big or little endian.
2. For those cases where the arg is pointing to 4B style header, you do
have to worry about a good sentinel existing. For a little endian machine,
such an issue happens if you're dealing with a CHAR(134217728) through
CHAR(138412031). Unlikely to be sure. But for a big endian machine,
CHAR(32), CHAR(288) and other values can cause problems. So for 4B style
headers, a check is made to see if the sentinel location has a value of '
'. If it does, it is replaced with a zero, the scan is then performed, and
then the sentinel is then restored with the original space. It is of course
possible to unconditionally save the value at the sentinel location, stuff
in a zero, then restore the original value, but in doing so, more overhead
in incurred every call, slowing down the function.


You can't modify *arg, even momentarily, because it might point directly 
to an on-disk buffer. The same Datum can be read by another process at 
the same time, or even written out to disk.


You could simply check if the byte in the header happens to be != 0x20. 
It's very likely that it is, so that you can use the optimized version.



The biggest issue with the previously discussed patches was the lack of 
performance testing. People posted results of various micro-benchmarks 
that showed different aspects, but no-one comprehensively covered all 
the interesting cases. I'd like to see a simple suite of performance 
tests that show:


1. The worst case performance.
2. Performance with a couple of typical cases. CHAR(n), where n is 
between 1-64 would be interesting, and where the number of spaces at the 
end varies.


You said you did benchmarking with a custom program, which is good, but 
it would be 

[HACKERS] Looked at TODO:Considering improving performance of computing CHAR() value lengths

2014-08-02 Thread John Cochran
Greetings,
I took at look at the TODO list and got interested in the possible
optimization of the bcTruelen() function. Read the archived messages about
that subject and decided to see what could be done.

I tested the performance of 5 different versions of bcTruelen().
1. The code as it exists in postgresql currently.
2. The code with the patch described in the above mentioned messages.
3. A modification of the code mentioned in 2 above using the concept of a
sentinel.
4. A modification of the code mentioned in 1 above using a sentinel.
5. A modification of the code mentioned in 4 using knowledge of 1B and 4B
headers.

First, let me describe a sentinel. The key thing about using a sentinel is
that with on, you can guarantee loop termination without having to check
for an index. What you do is place a value that you're searching for 1 past
the limit of the range you're searching for. By doing this, you can
optimize your loop to have just 1 condition to check for instead of 2
(search value and index limit). This mean that your loop now needs to just
execute 3 opcodes (pointer decrement, search compare, conditional jump)
instead of 5 (pointer decrement, search compare, conditional jump, limit
compare, conditional jump). In the case of bcTruelen(), the sentinel value
is anything that's non-space.

I ran all 5 different versions of bcTruelen() in a benchmark program that
simply allocated a gig of memory, populated the memory with varlena
structures of varying length, alignment, and trailing spaces. Then called
the different versions of bcTruelen() using a pseudo random sequence to try
and prevent processor caching from affecting the values.

Overall, the results were as follows:
1. For small numbers of trailing space, the "4 at a time" approach was a
consistent loser in terms of performance. If there's a large number of
trailing spaces, the performance of the "4 at a time" approach could be
quite impressive however.
2. The best typical performer was the single byte at a time approach using
a sentinel and knowledge of the types of headers. For that routine, the
performance compared against the current version of bcTruelen() was as
follows
CHAR(1) with only spaces. Current version wins. New version loses by about
20%
CHAR(1) with non-space. New version wins by about 10%.
CHAR(2) with only spaces. New version loses by about 4%.
CHAR(2) with one space. New version wins by about 15%.
CHAR(2) with no spaces. New version wins by about 10%

Once CHAR(4) was reached, the new version was faster for all numbers of
trailing spaces. I suspect the reason for the new routine being slower for
short CHAR types with all spaces is because using the sentinel results in
an extra loop iteration over the code that checks for limits each
iteration. But once the number of iterations gets large enough, the shorter
loop wins even though it iterated one extra time. The tests were performed
on an Intel Core2 duo.

The code effectively has 3 identical loops with different surroundings.
1. If the arg is pointing to a 1B style header, then there is no need to
worry about a sentinel. No 1B header matches a space character whether the
target system is big or little endian.
2. For those cases where the arg is pointing to 4B style header, you do
have to worry about a good sentinel existing. For a little endian machine,
such an issue happens if you're dealing with a CHAR(134217728) through
CHAR(138412031). Unlikely to be sure. But for a big endian machine,
CHAR(32), CHAR(288) and other values can cause problems. So for 4B style
headers, a check is made to see if the sentinel location has a value of '
'. If it does, it is replaced with a zero, the scan is then performed, and
then the sentinel is then restored with the original space. It is of course
possible to unconditionally save the value at the sentinel location, stuff
in a zero, then restore the original value, but in doing so, more overhead
in incurred every call, slowing down the function.

If you wish me to submit a patch, I can do so easily.

Code is as follows:
/*
 * "True" length (not counting trailing blanks) of a BpChar
 */
int
bcTruelen(BpChar *arg)
{
char*s;
char*p;

if (VARATT_IS_1B(arg))
{
s = (void *)arg;
p = s + VARSIZE_1B(arg);
do
--p;
while(*p == ' ');
}
else
{
s = VARDATA_4B(arg);
p = s + VARSIZE_4B(arg)-VARHDRSZ;
--s;/* Point to where sentinel will be */

if (*s == ' ') {
*s = 0; /* Stuff in a non-blank sentinel */
do
--p;
while(*p == ' ');
*s = ' ';   /* Replace blank */
} else {
do
--p;
while(*p == ' ');
}
}

return p-s; /* Return length */
}


-- 

There are two ways of constructing a software design: One way is to make it
so simple that there are obviously no deficiencies and