Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-31 Thread Robert Haas
On Sun, Oct 30, 2011 at 8:02 AM, Kääriäinen Anssi
anssi.kaariai...@thl.fi wrote:
 Table size is around 600MB, index size is around 350MB and VM on-disk
 size is 16kB with default fillfactor. With fillfactor = 10, the VM size is
 104kB, and table size is around 6GB.  The index size is the same.

What I think you're probably measuring here (oprofile would tell us
for sure) is that once the size of the table goes beyond about half a
gigabyte, it will have more than one page in the visibility map.  The
index-only scan code keeps the most recently used visibility map page
pinned to save on overhead, but if you're bouncing back and forth
between data in the first ~500MB of the table and data in the last
~100MB, each switch will result in dropping the current pin and
getting a new one, which figures to be fairly expensive.  With the
table only a little over 500MB, you're probably only changing VM
pages every couple of tuples, but with a 6GB table just about every
tuple will switch to a new VM page.

Now, maybe you're right and the CPU caches are the more significant
effect.  But I wouldn't like to bet on it without seeing how much the
drop-and-get-new-pin operations are costing us.
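The arithmetic behind the half-gigabyte figure: with 8kB pages and one
visibility bit per heap page, a single VM page covers roughly 64k heap
blocks, i.e. about 512MB of heap (a simplified model; the real map loses a
little to the page header). A toy counter, with that assumption baked in,
shows how access order drives the number of pin switches:

```c
#include <assert.h>

/* Simplified model, not the server's actual constants: one 8kB VM page
 * holds one visibility bit per 8kB heap page, so a single VM page covers
 * about 64k heap blocks (~512MB of heap); the page header is ignored. */
#define BLCKSZ 8192
#define HEAP_BLOCKS_PER_VM_PAGE (BLCKSZ * 8)

/* Count how often a scan must drop its pin and re-pin a different VM page
 * when heap blocks are visited in the given order. */
static long count_pin_switches(const long *heap_blocks, long n)
{
    long switches = 0;
    long pinned = -1;               /* no VM page pinned yet */

    for (long i = 0; i < n; i++)
    {
        long vm_page = heap_blocks[i] / HEAP_BLOCKS_PER_VM_PAGE;

        if (vm_page != pinned)
        {
            switches++;             /* drop the old pin, acquire a new one */
            pinned = vm_page;
        }
    }
    return switches;
}
```

In-order access over a multi-gigabyte table switches once per ~512MB of
heap, while bouncing between distant regions switches on nearly every tuple.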

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-31 Thread Anssi Kääriäinen

On 10/31/2011 02:44 PM, Robert Haas wrote:

What I think you're probably measuring here (oprofile would tell us
for sure) is that once the size of the table goes beyond about half a
gigabyte, it will have more than one page in the visibility map.  The
index-only scan code keeps the most recently used visibility map page
pinned to save on overhead, but if you're bouncing back and forth
between data in the first ~500MB of the table and data in the last
~100MB, each switch will result in dropping the current pin and
getting a new one, which figures to be fairly expensive.  With the
table only a little over 500MB, you're probably only changing VM
pages every couple of tuples, but with a 6GB table just about every
tuple will switch to a new VM page.

Now, maybe you're right and the CPU caches are the more significant
effect.  But I wouldn't like to bet on it without seeing how much the
drop-and-get-new-pin operations are costing us.


Maybe I should have left the analysis part out of the post;
I don't know the internals, so my analysis is likely to be wrong.
Now that I think of it, claiming that the cache effect accounts for
50% of the runtime is probably a little overstated...

However, the part about clustering being important still stands.
According to the test, random access to the VM can add around
50% overhead.

Stupid question, but why not keep the whole VM pinned?

 - Anssi



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-31 Thread Robert Haas
On Mon, Oct 31, 2011 at 9:51 AM, Anssi Kääriäinen
anssi.kaariai...@thl.fi wrote:
 Stupid question, but why not keep the whole VM pinned?

It might be that keeping more than one VM page pinned is a good idea,
but we'd have to think carefully about it.  For example, if we pin too
many pages in shared_buffers, other queries could start erroring out
for failure to find an evictable buffer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-30 Thread Kääriäinen Anssi
Quoting Robert Haas:

I tried this on my MacBook Pro this morning, using pgbench -i -s 500
to create a database about 7.5GB in size, and then using SELECT
sum(1) FROM pgbench_accounts as a test query, on a build WITHOUT
--enable-cassert.  This machine has 4GB of memory, and I set
shared_buffers = 400MB.  (No, I'm not sure whether that's the optimal
setting for shared_buffers for this machine.)


I did some tests comparing the case where the index-ordered tuples
are in the same order as the base table against the case where they
are not. The idea is to test what effect accessing the VM in random
rather than sequential order has on performance. I suspect the above
test accesses the VM in order (the accounts table is effectively
clustered on the index used in the test), but I might be mistaken.

My test setup is this:
drop table if exists test;
drop table if exists test2;
create unlogged table test /* with (fillfactor = 10) */ 
as select generate_series(0, 20*1000*1000) as id;
create index idx1 on test(id);
vacuum test;
create unlogged table test2 /* with (fillfactor = 10) */
as (select * from test order by random());
create index idx2 on test2(id);
vacuum test2;

Table size is around 600MB, index size is around 350MB and VM on-disk
size is 16kB with default fillfactor. With fillfactor = 10, the VM size is
104kB, and table size is around 6GB.  The index size is the same.
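As a sanity check on those numbers, the reported on-disk VM sizes can be
reproduced with simple arithmetic, assuming 8kB pages, one visibility bit
per heap page, and roughly a 24-byte VM page header (the header size is my
assumption, not a quoted figure):

```c
#include <assert.h>

#define BLCKSZ 8192
/* Usable visibility bits per VM page; the 24-byte header is an assumption. */
#define VM_BITS_PER_PAGE ((BLCKSZ - 24) * 8)

/* On-disk visibility map size for a heap of the given size: one bit per
 * heap block, rounded up to whole 8kB pages. */
static long long vm_size_bytes(long long table_bytes)
{
    long long heap_blocks = table_bytes / BLCKSZ;
    long long vm_pages =
        (heap_blocks + VM_BITS_PER_PAGE - 1) / VM_BITS_PER_PAGE;

    return vm_pages * BLCKSZ;       /* the VM is stored in whole pages */
}
```

A ~600MB table yields a 2-page (16kB) map, and a ~6.3GB table a 13-page
(104kB) map, matching the sizes observed in the test.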

Results for the randomly ordered table:
# select count(*) from test2;
14822.045 ms
14826.253 ms
14815.450 ms

Results for the effectively clustered table:
# select count(*) from test;
11761.890 ms
11767.926 ms
11810.900 ms

Now, this test still has the benefit of fitting the VM easily into the L1 cache.

Next, I did an ugly hack to get the table size large enough that the VM
will thrash the L1 cache, while keeping the test setup creation time
somewhat reasonable. My hardware is old: 1GB of memory, a Genuine
Intel(R) CPU L2400 @ 1.66GHz, and a 32kB L1 data cache.

The hack is to simply set fillfactor to 10. The VM size is now 104kB, and
the table size is about 6.3GB, while the index size is unchanged from the
test above.

Results for the randomly ordered table:
# select count(*) from test2;
21606.683 ms
21829.063 ms
21637.434 ms

Results for the effectively clustered table:
# select count(*) from test;
11714.663 ms
11449.264 ms
11658.534 ms

Now, the next step would be to thrash the L2 cache (a 20GB table should
do this on Sandy Bridge, where the per-core L2 cache is 256kB). I don't
have hardware to do that test. It is worth noting that the last-level
cache is shared on Sandy Bridge, so it is likely that an index-only scan
of a large enough table would slow down other processes, too. Without
tests this is only FUD, though. The test would be to scan a 20GB table's
index repeatedly in one process, see how that affects standard in-memory
pgbench results for other processes, and then compare against doing the
same with a sequential scan process.

Lessons learned (or what I learned, at least):
  - Clustering is important for index-only scans. Picking a clustered index
over a non-clustered one will have a big performance effect.
  - Large-table index-only scans are going to be more expensive, compared
to a sequential scan, than the pgbench accounts test suggests. I assume
that the accounts table is effectively clustered on the index used; I
haven't verified this.
  - There is the possibility that index-only scans will thrash the caches
for other processes, too. Not tested, though.

I am sure these results will vary significantly based on hardware used. I
am also notorious for screwing up benchmarks, so verifying these results
is recommended.

You will need around 16GB of disk space for the fillfactor = 10 test. I would
recommend you have more than 1GB of memory, otherwise creating the
test setup can take some time...

 - Anssi Kääriäinen


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-30 Thread Kääriäinen Anssi
Sorry, I forgot to include the version used and some information about my setup:
PostgreSQL version: Git HEAD as of:
Date:   Fri Oct 28 21:18:36 2011 -0400
Commit: 51eba98cf4595e90730dedd9305da8aa84b649ee

Compiled with defaults (the only change was --with-pgport=5431). I used default
settings; shared_buffers is 24MB. The system is Linux Mint Debian edition
(kernel 3.0.0, gcc 4.6.1). The interesting parts about my hardware were in the
original post.

 - Anssi




Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-28 Thread Bruce Momjian
Tom Lane wrote:
 I wonder how trustworthy the measure of the visibilitymap_test call site
 as a consumer of cycles really is.  I've frequently noticed that
 oprofile blames remarkably large fractions of the runtime on individual
 statements that appear to be quite trivial.  I'm not sure if that

Are these statements perhaps near kernel calls?

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-28 Thread Robert Haas
On Mon, Oct 24, 2011 at 4:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder how trustworthy the measure of the visibilitymap_test call site
 as a consumer of cycles really is.

 I'm not sure either.  I guess we could try short-circuiting
 visibilitymap_test and see what that does to performance (let's leave
 correct answers out of it).

 That would conflate the cost of the call with the cost of the function.
 Maybe you could try manually inlining the visibility test?

I fooled around with this some more on my Fedora 14 desktop machine.
pgbench database, scale factor 20, shared_buffers=400MB.  I ran the
query select count(*) from pgbench_accounts.  I'm having a hard time
getting completely reproducible results, but it appears that warming
the cache makes the sequential scan drop from maybe 390 ms to 245 ms,
while an index-only scan takes about 350 ms, so which one is
better depends a lot on your assumptions about what is going on on the
system at the same time (which means maybe we ought not to sweat it).

If I wipe out the whole if-block that calls visibilitymap_test(), the
index-only scan drops down to about 300 ms (and delivers potentially
wrong answers, of course).  Inlining visibilitymap_test (see attached
vismap-inline.patch) causes the index-only scan to drop to about 330
ms.

I also tried changing the BufferIsValid() tests in
visibilitymap_test() to use BufferIsInvalid() instead, with the sense
of the tests reversed (see attached vismap-test-invalid.patch).  Since
BufferIsInvalid() just checks for InvalidBuffer instead of also doing
the sanity checks, it's significantly cheaper.  This also reduced the
time to about 330 ms, so seems clearly worth doing.  Apparently these
changes don't stack, because doing both things only gets me down to
about 320 ms, which is fairly unexciting for the amount of ugliness
that inlining entails.

I tried sprinkling a little branch-prediction magic on this code using
GCC's __builtin_expect().  That initially seemed to help, but once I
changed the BufferIsValid() test to !BufferIsInvalid() essentially all
of the savings disappeared.
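For reference, the hint experiment looked roughly like this (a sketch, not
the actual patch; the likely/unlikely macro names are my own convention,
and __builtin_expect is GCC-specific):

```c
#include <assert.h>

/* GCC branch-prediction hints; these affect code layout only and never
 * change the computed result. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Toy visibility-bit test over a plain bitmap (a stand-in for the real
 * visibilitymap_test, which this does not reproduce). */
static int vm_bit_is_set(const unsigned char *map, long blk)
{
    /* Hint that the all-visible bit is usually set on this workload. */
    if (likely(map[blk / 8] & (1 << (blk % 8))))
        return 1;
    return 0;
}
```

The point of the experiment was that once the cheap comparison was in
place, the layout hint bought essentially nothing further.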

I also spent some time poking through the opannotate -s -a output,
which shows where time is being spent by individual assembler
instruction, but also annotates the assembler listing with the
original source code.  At least according to oprofile, the time that
is being spent in IndexOnlyNext() is mostly being spent on seemingly
innocuous operations like saving and restoring registers.  For
example, much of the time being attributed to the visibilitymap_test()
call in IndexOnlyNext() is actually attributable to the instructions
that are calculating what address to pass for scandesc->heapRelation.
Many but not all of the pointer dereferences at the top of
IndexOnlyNext() have a chunk of cycles attributed to them, and while
they're not that significant individually, they add up.  Similarly,
the long series of instructions to which index_getattr() resolves
bleeds cycles at just about every step.  There's not much to optimize
there, though, unless we want to add some code that avoids decoding
the tuple altogether in the particular case of a zero-argument
aggregate, or maybe something more general that only pulls out the
columns that are actually needed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


vismap-inline.patch
Description: Binary data


vismap-test-invalid.patch
Description: Binary data



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-28 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I also tried changing the BufferIsValid() tests in
 visibilitymap_test() to use BufferIsInvalid() instead, with the sense
 of the tests reversed (see attached vismap-test-invalid.patch).  Since
 BufferIsInvalid() just checks for InvalidBuffer instead of also doing
 the sanity checks, it's significantly cheaper.  This also reduced the
 time to about 330 ms, so seems clearly worth doing.

Hmm.  I wonder whether it wouldn't be better to get rid of the range
checks in BufferIsValid, or better convert them into Asserts.  It seems
less than intuitive that BufferIsValid and BufferIsInvalid aren't simple
inverses.
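A simplified model of the asymmetry (constants are illustrative, not the
server's; the real BufferIsValid also accommodates local buffers):
BufferIsInvalid() only compares against InvalidBuffer, while
BufferIsValid() also range-checks, so an out-of-range buffer number is
rejected by both macros at once:

```c
#include <assert.h>

#define InvalidBuffer 0
#define NBuffers 1024   /* pretend shared_buffers holds 1024 pages */

/* Cheap test: a single comparison. */
#define BufferIsInvalid(b) ((b) == InvalidBuffer)

/* Dearer test: also sanity-checks that the number is in range, which is
 * why the two macros are not simple inverses of each other. */
#define BufferIsValid(b)   ((b) > 0 && (b) <= NBuffers)
```

Under this model, buffer number 5000 is neither "valid" nor "invalid",
which is exactly the non-intuitive behavior being discussed.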

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-28 Thread Robert Haas
On Fri, Oct 28, 2011 at 2:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I also tried changing the BufferIsValid() tests in
 visibilitymap_test() to use BufferIsInvalid() instead, with the sense
 of the tests reversed (see attached vismap-test-invalid.patch).  Since
 BufferIsInvalid() just checks for InvalidBuffer instead of also doing
 the sanity checks, it's significantly cheaper.  This also reduced the
 time to about 330 ms, so seems clearly worth doing.

 Hmm.  I wonder whether it wouldn't be better to get rid of the range
 checks in BufferIsValid, or better convert them into Asserts.  It seems
 less than intuitive that BufferIsValid and BufferIsInvalid aren't simple
 inverses.

Seems reasonable.  It would break if anyone is using an out-of-range
buffer number in lieu of InvalidBuffer, but I doubt that anyone is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-28 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Oct 28, 2011 at 2:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Hmm.  I wonder whether it wouldn't be better to get rid of the range
 checks in BufferIsValid, or better convert them into Asserts.  It seems
 less than intuitive that BufferIsValid and BufferIsInvalid aren't simple
 inverses.

 Seems reasonable.  It would break if anyone is using an out-of-range
 buffer number in lieu of InvalidBuffer, but I doubt that anyone is.

Yeah, I find that unlikely as well.  But leaving Asserts in place would
tell us.

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-28 Thread Robert Haas
On Fri, Oct 28, 2011 at 3:27 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Oct 28, 2011 at 2:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Hmm.  I wonder whether it wouldn't be better to get rid of the range
 checks in BufferIsValid, or better convert them into Asserts.  It seems
 less than intuitive that BufferIsValid and BufferIsInvalid aren't simple
 inverses.

 Seems reasonable.  It would break if anyone is using an out-of-range
 buffer number in lieu of InvalidBuffer, but I doubt that anyone is.

 Yeah, I find that unlikely as well.  But leaving Asserts in place would
 tell us.

OK.  Should I go do that, or are you all over it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-28 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Oct 28, 2011 at 3:27 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah, I find that unlikely as well.  But leaving Asserts in place would
 tell us.

 OK.  Should I go do that, or are you all over it?

Go for it.

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-28 Thread Robert Haas
On Fri, Oct 28, 2011 at 3:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Oct 28, 2011 at 3:27 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah, I find that unlikely as well.  But leaving Asserts in place would
 tell us.

 OK.  Should I go do that, or are you all over it?

 Go for it.

OK, done.  Any other thoughts on the rest of what I wrote?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-25 Thread Wolfgang Wilhelm
Hello,

my experience is that as soon as index-only scans are available, they get
used, sometimes simply because a user assumes they must be faster, even
when the index is made ridiculously wide just to get all the needed
columns into the index...

Regards
Wolfgang Wilhelm




From: Tom Lane t...@sss.pgh.pa.us
To: Robert Haas robertmh...@gmail.com
Cc: Kevin Grittner kevin.gritt...@wicourts.gov; pgsql-hackers@postgresql.org
Sent: Monday, 24 October 2011, 21:35
Subject: Re: [HACKERS] So, is COUNT(*) fast now?

Robert Haas robertmh...@gmail.com writes:
 But even though Tom's statement that most indexes are one column might
 be a slight exaggeration, I suspect it probably is true that the
 optimizations he's talking about for large numbers of columns won't
 produce any material benefit even for a 3 or 4 column index.  Which
 makes me think maybe we should focus our efforts elsewhere.

Right.  If we thought the average was something like ten, it might be
worth pursuing optimizations similar to slot_getallattrs.  If it's
around two or three, almost certainly not.

Your point about people trying to create wider indexes to exploit
index-only scans is an interesting one, but I think it's premature to
optimize on the basis of hypotheses about what people might do in
future.

Not sure about your other idea of returning multiple tuples per
amgettuple call.  The trouble with that is that it will add complexity
(and hence cycles) at the nodeIndexscan level, because now nodeIndexscan
will have to buffer those tuples, keep track of whether it's fetching
forward or backward, etc etc.  Plus another layer of the same in
indexam.c (index_getnext etc).  I'm not at all convinced that it's
likely to be a net win.

I wonder how trustworthy the measure of the visibilitymap_test call site
as a consumer of cycles really is.  I've frequently noticed that
oprofile blames remarkably large fractions of the runtime on individual
statements that appear to be quite trivial.  I'm not sure if that
represents real hardware-level effects such as cache line switching,
or whether it's just measurement artifacts.  Keep in mind that
sampling-based measurements are always subject to sampling artifacts.

            regards, tom lane


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Robert Haas
On Sun, Oct 23, 2011 at 7:01 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Oct 21, 2011 at 12:52 PM, Robert Haas robertmh...@gmail.com wrote:

 Also, this line is kind of expensive:

        if (!visibilitymap_test(scandesc->heapRelation,
                                ItemPointerGetBlockNumber(tid),
                                node->ioss_VMBuffer))

 Around 2%.  But I don't see any way to avoid that, or even make it cheaper.

 Could we cache by ItemPointerGetBlockNumber(tid) the results of those
 tests, for groups of tids on the same index page?

 How useful this would be would depend on how well-clustered the table
 and index are.

I thought about that, but the existing code is so ridiculously cheap
that it's hard to believe a caching layer would save enough to pay for
itself.  I mean, I presume that the cost attributed to that line has
to be associated with either (a) one of the pointer dereferences, (b)
the expense of evaluating ItemPointerGetBlockNumber(), (c) setting up
the function call, or perhaps (d) overhead incurred as a result of
branch mis-prediction.  The actual time spent *inside*
visibilitymap_test will be attributed to that function, not this one.

If you add up the time for this line and visibilitymap_test(), it's
like 10% of the runtime, which seems pretty significant.  But it's 10%
of the runtime that is spent on basically a handful of arithmetic
operations and then reading a byte from shared memory.  It's
astonishing to find that so expensive on a test with just one backend
running.  If you stick some kind of cache in there, it's going to
involve adding a branch that isn't there now, and I think that's going
to be pricey given how hot this code apparently is.

Also, I'm not sure it's safe.  Suppose that we lock the index page,
return a tuple, check the visibility map, and find the page all
visible.  Another transaction comes along, adds a tuple to the index
page, and clears the visibility map bit.  We then go back, relock the
index page, and return another tuple.  We'd better notice that the
visibility map bit has now been cleared, or we're in trouble.

I wonder if it might help to create some method for the index to
return all the matching keys on a single index page in one call.   If
you're dealing with an MVCC snapshot, any new tuples added after the
start of the scan can't be visible anyway.  That would save the
overhead of unlocking and relocking the buffer once per tuple, and
probably some overhead associated with validating and decoding the
possibly-changed page contents each time.  If you did it that way, it
would also be safe to do what you're proposing - if a bunch of the
tuples on the index page are also on the same heap page, you could do
one visibility map check for all of them.
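Under that batching scheme, Jeff's caching idea becomes straightforward:
while walking one index page's worth of TIDs, remember the last heap block
tested and reuse its result. A sketch (the struct, names, and callback are
hypothetical stand-ins, not PostgreSQL's actual API):

```c
#include <assert.h>

/* One-entry cache over a visibilitymap_test-like callback.  real_checks
 * counts how many underlying VM lookups a batch actually performs. */
typedef struct
{
    long cached_block;      /* last heap block tested, or -1 */
    int  cached_result;
    long real_checks;
} VMCache;

static int vm_test_cached(VMCache *c, long heap_block, int (*vm_test)(long))
{
    if (c->cached_block != heap_block)
    {
        c->cached_block = heap_block;
        c->cached_result = vm_test(heap_block);
        c->real_checks++;           /* cache miss: do the real lookup */
    }
    return c->cached_result;
}

/* Stand-in for the real VM test: pretend every page is all-visible. */
static int all_visible_stub(long blk)
{
    (void) blk;
    return 1;
}
```

With a well-clustered index, consecutive TIDs share a heap block, so a
batch of N tuples collapses to far fewer than N real checks; random order
gets no benefit, which matches the clustering dependence noted above.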

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 I had wondered whether it'd be worth optimizing that along the
 lines of slot_getallattrs().  But most indexes probably have only
 one column, or anyway not enough to make for a useful savings.
 
From a heavily-used production database:
 
cir=> select indnatts, count(*) from pg_index group by indnatts
order by indnatts;
 indnatts | count 
----------+-------
        1 |   200
        2 |   684
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)
 
This includes system table and TOAST table indexes (which seem to
have two columns).  There are over 400 user tables, each of which
has a primary key, so most primary keys in our database are more
than one column.
 
-Kevin



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Tom Lane t...@sss.pgh.pa.us wrote:
 I had wondered whether it'd be worth optimizing that along the
 lines of slot_getallattrs().  But most indexes probably have only
 one column, or anyway not enough to make for a useful savings.
 
 From a heavily-used production database:
 
 cir=> select indnatts, count(*) from pg_index group by indnatts
 order by indnatts;
  indnatts | count 
 ----------+-------
         1 |   200
         2 |   684
         3 |   155
         4 |    76
         5 |    43
         6 |    13
         7 |     2
         9 |     1
 (8 rows)
 
 This includes system table and TOAST table indexes (which seem to
 have two columns).

Yeah, TOAST indexes are 2-column.  It would be best to exclude those
from your counts, since it seems pretty unlikely that anyone will care
how fast nodeIndexonlyscan.c is for scans on toast tables.

 There are over 400 user tables, each of which
 has a primary key, so most primary keys in our database are more
 than one column.

It doesn't look to me like the mean is above 2 (unless you have many
fewer toast tables than I suspect), so trying to optimize many-column
cases isn't going to help.

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 Yeah, TOAST indexes are 2-column.  It would be best to exclude
 those from your counts, since it seems pretty unlikely that anyone
 will care how fast nodeIndexonlyscan.c is for scans on toast
 tables.
 
User indexes (excluding toast):
 
 indnatts | count 
----------+-------
        1 |   200
        2 |   222
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)
 
System indexes (excluding toast):
 
 indnatts | count 
----------+-------
        1 |    46
        2 |    24
        3 |     9
        4 |     5
(4 rows)
 
 It doesn't look to me like the mean is above 2 (unless you have
 many fewer toast tables than I suspect), so trying to optimize
 many-column cases isn't going to help.
 
The mean is 2.4 (give or take a little depending on whether you
include system tables).  I have no idea where the optimization
becomes worthwhile, but the assertion that most indexes probably
have a single column worried me.  I'm sure there are databases where
that is true (especially for those who insist on adding a
meaningless surrogate key column to every table), but there are many
where it isn't true.  I would guess that our average of 2.4 is
higher than most, though.
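For the record, recomputing the mean from the first table above (which, per
the later copy/paste correction, covers non-toast system and user indexes
together) does land at about 2.43:

```c
#include <assert.h>

/* Weighted mean of index column counts from the posted histogram. */
static double mean_indnatts(void)
{
    int natts[]  = {1, 2, 3, 4, 5, 6, 7, 9};
    int counts[] = {200, 222, 155, 76, 43, 13, 2, 1};
    long weighted = 0, total = 0;

    for (int i = 0; i < 8; i++)
    {
        weighted += (long) natts[i] * counts[i];
        total += counts[i];
    }
    return (double) weighted / total;   /* 1729 / 712 ~= 2.43 */
}
```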
 
-Kevin



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Kevin Grittner
Copy/paste problems -- the first set includes the system tables
except for toast.  User tables would be the difference between the
results below.  Sorry.
 
-Kevin
 
 
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
  
 Yeah, TOAST indexes are 2-column.  It would be best to exclude
 those from your counts, since it seems pretty unlikely that
 anyone will care how fast nodeIndexonlyscan.c is for scans on
 toast tables.
  
 User indexes (excluding toast):
  
  indnatts | count 
 ----------+-------
         1 |   200
         2 |   222
         3 |   155
         4 |    76
         5 |    43
         6 |    13
         7 |     2
         9 |     1
 (8 rows)
  
 System indexes (excluding toast):
  
  indnatts | count 
 ----------+-------
         1 |    46
         2 |    24
         3 |     9
         4 |     5
 (4 rows)
  
 It doesn't look to me like the mean is above 2 (unless you have
 many fewer toast tables than I suspect), so trying to optimize
 many-column cases isn't going to help.
  
 The mean is 2.4 (give or take a little depending on whether you
 include system tables).  I have no idea where the optimization
 becomes worthwhile, but the assertion that most indexes probably
 have a single column worried me.  I'm sure there are databases
 where that is true (especially for those who insist on adding a
 meaningless surrogate key column to every table), but there are
 many where it isn't true.  I would guess that our average of 2.4
 is higher than most, though.
  
 -Kevin




Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 2:15 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 It doesn't look to me like the mean is above 2 (unless you have
 many fewer toast tables than I suspect), so trying to optimize
 many-column cases isn't going to help.

 The mean is 2.4 (give or take a little depending on whether you
 include system tables).  I have no idea where the optimization
 becomes worthwhile, but the assertion that most indexes probably
 have a single column worried me.  I'm sure there are databases where
 that is true (especially for those who insist on adding a
 meaningless surrogate key column to every table), but there are many
 where it isn't true.  I would guess that our average of 2.4 is
 higher than most, though.

Keep in mind that the existence of index-only scans is going to
encourage people to try to create covering indexes on columns that
aren't indexed today.  It's not clear how much of a win that will be;
for certain workloads, HOT might make it backfire spectacularly.

But even though Tom's statement that most indexes are one column might
be a slight exaggeration, I suspect it probably is true that the
optimizations he's talking about for large numbers of columns won't
produce any material benefit even for a 3 or 4 column index.  Which
makes me think maybe we should focus our efforts elsewhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 But even though Tom's statement that most indexes are one column might
 be a slight exaggeration, I suspect it probably is true that the
 optimizations he's talking about for large numbers of columns won't
 produce any material benefit even for a 3 or 4 column index.  Which
 makes me think maybe we should focus our efforts elsewhere.

Right.  If we thought the average was something like ten, it might be
worth pursuing optimizations similar to slot_getallattrs.  If it's
around two or three, almost certainly not.

Your point about people trying to create wider indexes to exploit
index-only scans is an interesting one, but I think it's premature to
optimize on the basis of hypotheses about what people might do in
future.

Not sure about your other idea of returning multiple tuples per
amgettuple call.  The trouble with that is that it will add complexity
(and hence cycles) at the nodeIndexscan level, because now nodeIndexscan
will have to buffer those tuples, keep track of whether it's fetching
forward or backward, etc etc.  Plus another layer of the same in
indexam.c (index_getnext etc).  I'm not at all convinced that it's
likely to be a net win.

I wonder how trustworthy the measure of the visibilitymap_test call site
as a consumer of cycles really is.  I've frequently noticed that
oprofile blames remarkably large fractions of the runtime on individual
statements that appear to be quite trivial.  I'm not sure if that
represents real hardware-level effects such as cache line switching,
or whether it's just measurement artifacts.  Keep in mind that
sampling-based measurements are always subject to sampling artifacts.

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Josh Berkus
On 10/24/11 12:35 PM, Tom Lane wrote:
 Your point about people trying to create wider indexes to exploit
 index-only scans is an interesting one, but I think it's premature to
 optimize on the basis of hypotheses about what people might do in
 future.

I don't think that this is hypothetical at all.  I know *I'll* be doing
it, and we can expect users who are familiar with MySQL and Oracle to do
it as well.

No, it won't be the majority of our users, who are using ORMs and thus
don't really think about indexing at all.  But it will be a significant
number of users who are performance-sensitive ... such as most or all of
our data warehousing users.

Mind you, we're pretty much talking exclusively about users whose tables
don't fit in memory ... usually tables which are 10X or more the size of
memory.

One case which is going to be critical to test is the join table, i.e.
the table which supports many-to-many joins and consists only of keys
from the respective two other tables.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Your point about people trying to create wider indexes to exploit
 index-only scans is an interesting one, but I think it's premature to
 optimize on the basis of hypotheses about what people might do in
 future.

Well, I don't think it's too much of a stretch to guess that people
will try to use covering indexes; that's common practice on other
products, and a frequent and heartfelt request from
people with large (non-memory-resident) databases they want to migrate
to PostgreSQL.  Exactly to what degree they'll do that, and how well
it will work, is another question.  But I have little doubt that it
will be tried.

 Not sure about your other idea of returning multiple tuples per
 amgettuple call.  The trouble with that is that it will add complexity
 (and hence cycles) at the nodeIndexscan level, because now nodeIndexscan
 will have to buffer those tuples, keep track of whether it's fetching
 forward or backward, etc etc.  Plus another layer of the same in
 indexam.c (index_getnext etc).  I'm not at all convinced that it's
 likely to be a net win.

I definitely agree that you don't want two layers of caching, but I
don't see why we'd need that.  I wasn't thinking of changing
index_getnext() at all, but rather adding a new API that fills a
buffer (or a pair of buffers, maybe) with index tuples and heap TIDs.
It should spit them out in the same order that multiple
index_getnext() calls would have done and leave the scan position
wherever it would have ended up after a number of index_getnext_tid()
calls equal to the number of TIDs returned.  Any user of the API (and
it might be just nodeIndexonlyscan.c) would just need to keep track of
the number of tuples returned and the number consumed to date.
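The bookkeeping is simple enough to sketch in a few lines of Python (purely illustrative - the real API would be C in indexam.c, and the name fetch_batch here is invented):

```python
class IndexScan:
    """Toy stand-in for an index scan yielding (index_tuple, heap_tid) pairs."""

    def __init__(self, pairs):
        self._pairs = list(pairs)
        self._pos = 0  # scan position

    def getnext(self):
        # Single-tuple fetch, in the spirit of index_getnext_tid().
        if self._pos >= len(self._pairs):
            return None
        item = self._pairs[self._pos]
        self._pos += 1
        return item

    def fetch_batch(self, max_tuples):
        # Hypothetical batched API: fill a buffer with up to max_tuples
        # entries and leave the scan position exactly where that many
        # successive getnext() calls would have left it.
        batch = self._pairs[self._pos:self._pos + max_tuples]
        self._pos += len(batch)
        return batch


scan = IndexScan([("itup%d" % i, (0, i)) for i in range(10)])
buffered = scan.fetch_batch(4)                   # node buffers 4 tuples
consumed = [buffered.pop(0) for _ in range(2)]   # caller has used 2 so far
# The scan is now positioned at tuple 4, as if getnext() had run 4 times,
# so the caller only tracks "tuples returned" vs. "tuples consumed".
```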

This actually gets into a wider architectural discussion, which is
whether the fact that the whole executor (modulo bitmap scans and a
few other special cases) operates on one tuple at a time is a good
design...  but my brain hurts just thinking about that.

 I wonder how trustworthy the measure of the visibilitymap_test call site
 as a consumer of cycles really is.  I've frequently noticed that
 oprofile blames remarkably large fractions of the runtime on individual
 statements that appear to be quite trivial.  I'm not sure if that
 represents real hardware-level effects such as cache line switching,
 or whether it's just measurement artifacts.  Keep in mind that
 sampling-based measurements are always subject to sampling artifacts.

I'm not sure either.  I guess we could try short-circuiting
visibilitymap_test and see what that does to performance (let's leave
correct answers out of it).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Kevin Grittner
Josh Berkus j...@agliodbs.com wrote:
 
 On 10/24/11 12:35 PM, Tom Lane wrote:
 Your point about people trying to create wider indexes to exploit
 index-only scans is an interesting one, but I think it's
 premature to optimize on the basis of hypotheses about what
 people might do in future.
 
 I don't think that this is hypothetical at all.  I know *I'll* be
 doing it, and we can expect users who are familiar with MySQL and
 Oracle to do it as well.
 
And Sybase, and MS SQL Server.  And others, most likely.  When we were
using Sybase we added extra columns to some indexes to overcome
performance problems through covering-index techniques, and we've never
gotten around to narrowing those indexes again, so they're already here.  :-)
 
 One case which is going to be critical to test is the join
 table, i.e. the table which supports many-to-many joins and
 consists only of keys from the respective two other tables.
 
Yeah, that is an important use of covering indexes for us.
 
-Kevin



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder how trustworthy the measure of the visibilitymap_test call site
 as a consumer of cycles really is.

 I'm not sure either.  I guess we could try short-circuiting
 visibilitymap_test and see what that does to performance (let's leave
 correct answers out of it).

That would conflate the cost of the call with the cost of the function.
Maybe you could try manually inlining the visibility test?

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-23 Thread Jeff Janes
On Fri, Oct 21, 2011 at 12:07 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Oct 21, 2011 at 2:33 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I think HeapTupleSatisfiesMVCC is probably being skipped anyway in
 this case, since all the heap pages should be PD_ALL_VISIBLE.

 Proves my point ;-) ... you're comparing a code path that's been beat on
 for *years* with one that just got written.

 I know.  I wrote a chunk of it.  :-)  My point is just that it'd be
 nice to make it better.

 Anyhow, here's the scoop.  On my desktop machine running F14, running
 SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of
 oprofile data:

 176830   13.0801  postgres postgres ExecProject

Hi Robert,

count(*) and sum(1) do different things internally, and in my hands
sum(1) is ~10% slower.

I don't know how to dump the output of ExecBuildProjectionInfo into a
human readable form, so I don't know the basis of the difference.  But
I wonder if using count(*) would lower the weight of the ExecProject
function.


Cheers,

Jeff



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-23 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 count(*) and sum(1) do different things internally, and in my hands
 sum(1) is ~10% slower.
 I don't know how to dump the output of ExecBuildProjectionInfo into a
 human readable form, so I don't know the basis of the difference.  But
 I wonder if using count(*) would lower the weight of the ExecProject
 function.

Probably.  count() doesn't actually have any arguments, so there's
nothing for ExecProject to do.  sum(1) invokes the generic case there
(ExecTargetList).  I suppose we could add another special-case path for
constant tlist elements, but I suspect that would mostly be optimizing
for benchmarks rather than helping real-world cases.

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-23 Thread Jeff Janes
On Fri, Oct 21, 2011 at 12:52 PM, Robert Haas robertmh...@gmail.com wrote:

 Also, this line is kind of expensive:

        if (!visibilitymap_test(scandesc->heapRelation,
                                ItemPointerGetBlockNumber(tid),
                                &node->ioss_VMBuffer))

 Around 2%.  But I don't see any way to avoid that, or even make it cheaper.

Could we cache by ItemPointerGetBlockNumber(tid) the results of those
tests, for groups of tids on the same index page?

How useful this would be would depend on how well-clustered the table
and index are.
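A sketch of that caching idea in Python (the real code would be C in nodeIndexonlyscan.c; the wrapper here is hypothetical): remember the result for the heap block of the previous TID, and only repeat the real visibility-map test when the block number changes.

```python
def make_cached_vm_test(vm_test):
    """Wrap a visibility-map test so consecutive TIDs on the same heap
    block reuse the previous answer instead of re-testing."""
    state = {"block": None, "visible": None, "calls": 0}

    def cached_test(tid):
        block = tid[0]  # a TID is (block_number, offset)
        if block != state["block"]:
            state["calls"] += 1  # count underlying vm_test invocations
            state["block"] = block
            state["visible"] = vm_test(block)
        return state["visible"]

    cached_test.state = state
    return cached_test


# Pretend even-numbered heap blocks are all-visible.
vm = make_cached_vm_test(lambda block: block % 2 == 0)
tids = [(0, 1), (0, 2), (0, 3), (1, 1), (1, 2), (0, 4)]
results = [vm(t) for t in tids]
# Only 3 underlying tests run: the block changes 0 -> 1 -> 0.
```

As above, how much this saves depends on clustering: a well-clustered table gives long runs of TIDs on the same heap block, a badly clustered one defeats the cache.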


Cheers,

Jeff



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-22 Thread Andres Freund
On Friday, October 21, 2011 08:14:12 PM Robert Haas wrote:
 On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  I don't know why you'd imagine that touching an index is free, or even
  cheap, CPU-wise.  The whole point of the index-only optimization is to
  avoid I/O.  When you try it on a case where there's no I/O to be saved,
  and no shared-buffers contention to be avoided, there's no way it's
  going to be a win.
  
  Well, call me naive, but I would have thought touching six times less
  data would make the operation run faster, not slower.
  
  It's not touching six times less data.  It's touching the exact same
  number of tuples either way, just index tuples in one case and heap
  tuples in the other.
 
 Yeah, but it works out to fewer pages.
But access to those is not sequential. I guess if you measure cache hit ratios 
the index scan will come out significantly worse.


Andres



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-22 Thread desmodemone
2011/10/22 Andres Freund and...@anarazel.de

 On Friday, October 21, 2011 08:14:12 PM Robert Haas wrote:
  On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
   Robert Haas robertmh...@gmail.com writes:
   On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
   I don't know why you'd imagine that touching an index is free, or
 even
   cheap, CPU-wise.  The whole point of the index-only optimization is
 to
   avoid I/O.  When you try it on a case where there's no I/O to be
 saved,
   and no shared-buffers contention to be avoided, there's no way it's
   going to be a win.
  
   Well, call me naive, but I would have thought touching six times less
   data would make the operation run faster, not slower.
  
   It's not touching six times less data.  It's touching the exact same
   number of tuples either way, just index tuples in one case and heap
   tuples in the other.
 
  Yeah, but it works out to fewer pages.
 But access to those is not sequential. I guess if you measure cache hit
 ratios
 the index scan will come out significantly worse.


 Andres



But access to those is not sequential - yes, I agree.
In my opinion that is the problem: if the query needs to scan the whole
b-tree index without accessing the table rows, the better way to read the
index is sequentially.  Queries like count(*) do not need the data in any
particular order, so I think we could read all the blocks (better, only
the leaf blocks) without touching the branch blocks much.

For example, for a query like this:

select column_a  from table ;

it is better to read the index data in physical (sequential) order.

For a query like this:

select column_a  from table order by  column_a ;

it is better to read the index in logical order, from the root block to
the first branch blocks and their leaf blocks, so we can save the sort.

Mat


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-22 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On Friday, October 21, 2011 08:14:12 PM Robert Haas wrote:
 On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 It's not touching six times less data.  It's touching the exact same
 number of tuples either way, just index tuples in one case and heap
 tuples in the other.

 Yeah, but it works out to fewer pages.

 But access to those is not sequential. I guess if you measure cache hit 
 ratios 
 the index scan will come out significantly worse.

Huh?  In the case he's complaining about, the index is all in RAM.
Sequentiality of access is not an issue (at least not at the page
level --- within a page I suppose there could be cache-line effects).

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-22 Thread Andres Freund
On Saturday, October 22, 2011 05:20:26 PM Tom Lane wrote:
 Andres Freund and...@anarazel.de writes:
  On Friday, October 21, 2011 08:14:12 PM Robert Haas wrote:
  On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  It's not touching six times less data.  It's touching the exact same
  number of tuples either way, just index tuples in one case and heap
  tuples in the other.
  
  Yeah, but it works out to fewer pages.
  
  But access to those is not sequential. I guess if you measure cache hit
  ratios the index scan will come out significantly worse.
 
 Huh?  In the case he's complaining about, the index is all in RAM.
 Sequentiality of access is not an issue (at least not at the page
 level --- within a page I suppose there could be cache-line effects).
I was talking about L2/L3 caches...

Andres



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-22 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On Saturday, October 22, 2011 05:20:26 PM Tom Lane wrote:
 Huh?  In the case he's complaining about, the index is all in RAM.
 Sequentiality of access is not an issue (at least not at the page
 level --- within a page I suppose there could be cache-line effects).

 I was talking about L2/L3 caches...

Yeah, but unless you think cache lines cross page boundaries (and we do
take pains to align the buffers on 8K addresses), there's not going to
be any sequentiality effect.  Even if there were, it would only apply
if the pages chanced to be adjacent in the buffer array, and there is no
reason to expect that to be the case, for either seqscans or indexscans.

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-22 Thread Robert Haas
On Fri, Oct 21, 2011 at 10:57 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Yeah, but it works out to fewer pages.

 But since those pages are already in RAM, why would it matter all that
 much?  (Other than in the case of highly concurrent access, which you
 don't seem to be testing?)

Well, because memory access takes time, and accessing more memory
takes more time.  In the testing that I've done recently, performance
on in-memory workloads seems to be extremely sensitive to memory
speed, so you'd think that cutting down on memory access would be a
win.

 One of Tom's commits that made it not lock the same index page over
 and over again (once for each tuple on it) made me think it should be
 much faster than the seq scan, but a bit of random flailing about
 convinced me that any saving from this was compensated for by the
 high overhead of FunctionCall2Coll and all of the hokey-pokey that
 that call entails, which a seqscan can skip entirely.

Huh.  Not sure whether that's significant or not.

 If count(*) could cause the index-only scan to happen in physical
 order of the index, rather than logical order, that might be a big
 win.  Both for all in memory and for not-all-in-memory.

That's an interesting point.  I sort of assumed that would only help
for not-all-in-memory, but maybe not.  The trouble is that I think
there are some problematic concurrency issues there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-22 Thread karavelov
- Quoting Tom Lane (t...@sss.pgh.pa.us), on 22.10.2011 at 19:19 -

 Andres Freund and...@anarazel.de writes:
 On Saturday, October 22, 2011 05:20:26 PM Tom Lane wrote:
 Huh?  In the case he's complaining about, the index is all in RAM.
 Sequentiality of access is not an issue (at least not at the page
 level --- within a page I suppose there could be cache-line effects).
 
 I was talking about L2/L3 caches...
 
 Yeah, but unless you think cache lines cross page boundaries (and we do
 take pains to align the buffers on 8K addresses), there's not going to
 be any sequentiality effect.  Even if there were, it would only apply
 if the pages chanced to be adjacent in the buffer array, and there is no
 reason to expect that to be the case, for either seqscans or indexscans.
 
   regards, tom lane

I worked on the in-memory hash tables of the Parrot project.  They are not
the same as btrees, but the structure and memory layout are not that
different - tuples go into pages, etc.

I have benchmarked iterating over such hash tables - a sequential scan of
the same table comes out 20-30% faster than a scan ordered by the hash
value of the key.  And this overhead is purely from CPU cache lines - the
numbers of instructions executed on the processor are pretty much the same
(counted by valgrind).

So I do think that scanning indexes sequentially (in physical order) would
help even when all the data is in the buffer cache.
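The effect is easy to demonstrate in miniature, even from a scripting language (a rough sketch, not a serious benchmark - absolute numbers mean little in Python, and a C harness would isolate the cache-line effect far more cleanly): walk the same array once in physical order and once in a shuffled order standing in for hash/key order.

```python
import random
import time

N = 1_000_000
data = list(range(N))

seq_order = list(range(N))      # physical order
rnd_order = seq_order[:]
random.Random(42).shuffle(rnd_order)  # stand-in for hash/key order

def walk(order):
    # Touch every element of data exactly once, in the given order.
    t0 = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return total, time.perf_counter() - t0

total_seq, t_seq = walk(seq_order)
total_rnd, t_rnd = walk(rnd_order)
# Same work, same result; only the memory access pattern differs,
# which is exactly what the cache-line argument is about.
```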

Best regards

--
Luben Karavelov

Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-22 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Oct 21, 2011 at 10:57 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 If count(*) could cause the index-only scan to happen in physical
 order of the index, rather than logical order, that might be a big
 win.  Both for all in memory and for not-all-in-memory.

 That's an interesting point.  I sort of assumed that would only help
 for not-all-in-memory, but maybe not.  The trouble is that I think
 there are some problematic concurrency issues there.

Yeah.  We managed to make physical-order scanning work for VACUUM
because it's okay if VACUUM sometimes sees the same index tuple twice;
it'll just make the same decision about (not) deleting it.  That will
not fly for regular querying.

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-22 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Oct 21, 2011 at 3:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Anyhow, here's the scoop.  On my desktop machine running F14, running
 SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of
 oprofile data:
 176830   13.0801  postgres postgres ExecProject

 Hm, that's weird.  In both these cases, I'd have expected that
 ExecProject would get optimized away thanks to selection of a physical
 tlist for the scan node.  Wonder if that got broken ...

 If it did, it looks like it wasn't recent.  I set up the same test
 case on my MacBook using REL9_1_STABLE and REL9_0_STABLE and set a
 breakpoint on ExecProject().  Both back-branches appear to also call
 ExecProject() for every tuple.

Oh, the ExecProject calls are coming from advance_aggregates().
Move along, nothing to see here ...

regards, tom lane



[HACKERS] So, is COUNT(*) fast now?

2011-10-21 Thread Robert Haas
Laments at:

http://wiki.postgresql.org/wiki/FAQ#Why_is_.22SELECT_count.28.2A.29_FROM_bigtable.3B.22_slow.3F
http://wiki.postgresql.org/wiki/Slow_Counting

I tried this on my MacBook Pro this morning, using pgbench -i -s 500
to create a database about 7.5GB in size, and then using SELECT
sum(1) FROM pgbench_accounts as a test query, on a build WITHOUT
--enable-cassert.  This machine has 4GB of memory, and I set
shared_buffers = 400MB.  (No, I'm not sure whether that's the optimal
setting for shared_buffers for this machine.)

With enable_indexonlyscan = false, times for this query are: 96799.747
ms, 89108.987 ms, 114433.664 ms.
With enable_indexonlyscan = true, times were: 16779.325 ms, 16537.726
ms, 16703.229 ms.

That's about six times faster.  It's worth noting that the
pgbench_accounts table has relatively narrow rows.  On a table with
wider rows (but not so wide that they get toasted and become narrow
again), the benefit might be more.  On the other hand, this is also a
table that's just been vacuumed, and you've got the happy case where
the table (6404 MB) does not fit in memory but the index (1071 MB)
does.  What happens on a smaller test case?  Here are the results at
scale factor = 100:

enable_indexonlyscan = false times: 1774.945 ms, 1784.985 ms, 1836.099 ms
enable_indexonlyscan = true times: 1450.445 ms, 1452.407 ms, 1452.426 ms

That's about a 23% speedup.  At this scale factor, everything fits
into memory, but the index by itself (214 MB) fits into memory while
the table (1281 MB) does not.  Let's crank the scale factor down some
more.  Here are the results at scale_factor = 20:

enable_indexonlyscan = false times: 352.213 ms, 353.988 ms, 350.859 ms
enable_indexonlyscan = true times: 300.623 ms, 301.355 ms, 302.346 ms

Now the entire database fits into shared_buffers, but we're still
seeing a 17% speedup.  But this turns out to be misleading.  The
ring-buffer logic is actually preventing shared_buffers from getting
all of the heap blocks in cache quickly.  If I run the query over and
over again until the whole table actually makes it into shared
buffers, the sequential scan gets much faster:

enable_indexonlyscan = false times after lots and lots of prewarming:
215.487 ms, 219.006 ms, 218.490 ms

That's a bit disappointing - it's now more than a third faster to do
the sequential scan, even though the sequential scan has to touch six
times as many blocks (at scale factor 20, index is 43 MB, table is 256
MB) all of which are in cache.  Of course, touching that many fewer
blocks does have some advantages if there is concurrent activity on
the system, but it still seems unfortunate that the ratio of runtime
to blocks touched is more than 8x higher for the index-only case.
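For what it's worth, the arithmetic behind that 8x figure, from the round numbers above (301 ms over 43 MB for the index-only scan, ~217 ms over 256 MB for the prewarmed seqscan):

```python
# Scale factor 20, everything cached (figures from the message above).
index_mb, index_ms = 43, 301      # index-only scan
table_mb, table_ms = 256, 217     # prewarmed sequential scan, ~average

ms_per_mb_index = index_ms / index_mb   # 7.0 ms per MB touched
ms_per_mb_seq = table_ms / table_mb     # ~0.85 ms per MB touched
ratio = ms_per_mb_index / ms_per_mb_seq
# ratio comes out a bit over 8: the index-only scan does much more CPU
# work per block touched, even though it touches ~6x fewer blocks.
```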

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 That's a bit disappointing - it's now more than a third faster to do
 the sequential scan, even though the sequential scan has to touch six
 times as many blocks (at scale factor 20, index is 43 MB, table is 256
 MB) all of which are in cache.  Of course, touching that many fewer
 blocks does have some advantages if there is concurrent activity on
 the system, but it still seems unfortunate that the ratio of runtime
 to blocks touched is more than 8x higher for the index-only case.

I don't know why you'd imagine that touching an index is free, or even
cheap, CPU-wise.  The whole point of the index-only optimization is to
avoid I/O.  When you try it on a case where there's no I/O to be saved,
*and* no shared-buffers contention to be avoided, there's no way it's
going to be a win.

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-21 Thread Robert Haas
On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 That's a bit disappointing - it's now more than a third faster to do
 the sequential scan, even though the sequential scan has to touch six
 times as many blocks (at scale factor 20, index is 43 MB, table is 256
 MB) all of which are in cache.  Of course, touching that many fewer
 blocks does have some advantages if there is concurrent activity on
 the system, but it still seems unfortunate that the ratio of runtime
 to blocks touched is more than 8x higher for the index-only case.

 I don't know why you'd imagine that touching an index is free, or even
 cheap, CPU-wise.  The whole point of the index-only optimization is to
 avoid I/O.  When you try it on a case where there's no I/O to be saved,
 *and* no shared-buffers contention to be avoided, there's no way it's
 going to be a win.

Well, call me naive, but I would have thought touching six times less
data would make the operation run faster, not slower.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I don't know why you'd imagine that touching an index is free, or even
 cheap, CPU-wise.  The whole point of the index-only optimization is to
 avoid I/O.  When you try it on a case where there's no I/O to be saved,
 *and* no shared-buffers contention to be avoided, there's no way it's
 going to be a win.

 Well, call me naive, but I would have thought touching six times less
 data would make the operation run faster, not slower.

It's not touching six times less data.  It's touching the exact same
number of tuples either way, just index tuples in one case and heap
tuples in the other.  You've arranged the test case so that all these
tuples are in shared buffers already, so there's no data movement to be
avoided.  What this test case proves is that btree's overhead per index
tuple touched is significantly more than the cost of the fastest path
through HeapTupleSatisfiesMVCC, which I don't find surprising
considering how much sweat has been expended on that code path over the
years.

(It may well be that it's not even btree at fault but the per-tuple
visits to the visibility map ... did you do any oprofiling yet?)

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-21 Thread Robert Haas
On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I don't know why you'd imagine that touching an index is free, or even
 cheap, CPU-wise.  The whole point of the index-only optimization is to
 avoid I/O.  When you try it on a case where there's no I/O to be saved,
 *and* no shared-buffers contention to be avoided, there's no way it's
 going to be a win.

 Well, call me naive, but I would have thought touching six times less
 data would make the operation run faster, not slower.

 It's not touching six times less data.  It's touching the exact same
 number of tuples either way, just index tuples in one case and heap
 tuples in the other.

Yeah, but it works out to fewer pages.

 You've arranged the test case so that all these
 tuples are in shared buffers already, so there's no data movement to be
 avoided.  What this test case proves is that btree's overhead per index
 tuple touched is significantly more than the cost of the fastest path
 through HeapTupleSatisfiesMVCC, which I don't find surprising
 considering how much sweat has been expended on that code path over the
 years.

I think HeapTupleSatisfiesMVCC is probably being skipped anyway in
this case, since all the heap pages should be PD_ALL_VISIBLE.

 (It may well be that it's not even btree at fault but the per-tuple
 visits to the visibility map ... did you do any oprofiling yet?)

No, but I think that might be a good idea.  Maybe I'll go do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 What this test case proves is that btree's overhead per index
 tuple touched is significantly more than the cost of the fastest path
 through HeapTupleSatisfiesMVCC, which I don't find surprising
 considering how much sweat has been expended on that code path over the
 years.

 I think HeapTupleSatisfiesMVCC is probably being skipped anyway in
 this case, since all the heap pages should be PD_ALL_VISIBLE.

Proves my point ;-) ... you're comparing a code path that's been beat on
for *years* with one that just got written.

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-21 Thread Robert Haas
On Fri, Oct 21, 2011 at 2:33 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I think HeapTupleSatisfiesMVCC is probably being skipped anyway in
 this case, since all the heap pages should be PD_ALL_VISIBLE.

 Proves my point ;-) ... you're comparing a code path that's been beat on
 for *years* with one that just got written.

I know.  I wrote a chunk of it.  :-)  My point is just that it'd be
nice to make it better.

Anyhow, here's the scoop.  On my desktop machine running F14, running
SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of
oprofile data:

176830   13.0801  postgres                 postgres                 ExecProject
170028   12.5770  postgres                 postgres                 IndexOnlyNext
96631     7.1478  postgres                 postgres                 visibilitymap_test
86019     6.3628  postgres                 postgres                 advance_aggregates
74366     5.5009  postgres                 postgres                 ExecScan
72428     5.3575  postgres                 postgres                 ExecClearTuple
68483     5.0657  postgres                 postgres                 btgettuple
60614     4.4836  postgres                 postgres                 advance_transition_function
59680     4.4145  postgres                 postgres                 ExecProcNode
52295     3.8683  postgres                 postgres                 _bt_checkkeys
52078     3.8522  libc-2.12.90.so          libc-2.12.90.so          __memcpy_sse2
49548     3.6651  postgres                 postgres                 index_getnext_tid
48265     3.5702  postgres                 postgres                 ExecEvalConst
42989     3.1799  postgres                 postgres                 _bt_next
40544     2.9990  postgres                 postgres                 _bt_readpage
35162     2.6009  no-vmlinux               no-vmlinux               /no-vmlinux
33639     2.4883  postgres                 postgres                 MemoryContextReset

And without index-only scans, but with everything in shared_buffers:

169515   18.4261  postgres                 postgres                 ExecProject
94827    10.3076  postgres                 postgres                 heapgettup_pagemode
84850     9.2231  postgres                 postgres                 advance_aggregates
57998     6.3043  postgres                 postgres                 advance_transition_function
55638     6.0478  postgres                 postgres                 ExecEvalConst
53684     5.8354  postgres                 postgres                 heapgetpage
51411     5.5883  postgres                 postgres                 ExecScan
48387     5.2596  postgres                 postgres                 ExecProcNode
44129     4.7968  postgres                 postgres                 ExecStoreTuple
30759     3.3435  postgres                 postgres                 heap_getnext
25923     2.8178  postgres                 postgres                 SeqNext
24145     2.6245  postgres                 postgres                 CheckForSerializableConflictOut
23155     2.5169  postgres                 postgres                 ExecAgg
18864     2.0505  postgres                 postgres                 heap_page_prune_opt
18784     2.0418  no-vmlinux               no-vmlinux               /no-vmlinux

The index-only scan takes about 385 ms, while the non-index-only
version takes about 284 ms.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-21 Thread Robert Haas
On Fri, Oct 21, 2011 at 3:07 PM, Robert Haas robertmh...@gmail.com wrote:
 [ oprofile results ]

*grovels through the line-by-line results*

Hmm, I guess there is a bit of a hotspot in StoreIndexTuple, which is
probably being folded into IndexOnlyNext in the per-function timings:

ExecClearTuple(slot);
for (i = 0; i < nindexatts; i++)
    values[i] = index_getattr(itup, i + 1, itupdesc, &isnull[i]);
ExecStoreVirtualTuple(slot);

If I'm reading these results right, that section is about 3% of the
total number of samples.

Also, this line is kind of expensive:

if (!visibilitymap_test(scandesc->heapRelation,
                        ItemPointerGetBlockNumber(tid),
                        &node->ioss_VMBuffer))

Around 2%.  But I don't see any way to avoid that, or even make it cheaper.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Anyhow, here's the scoop.  On my desktop machine running F14, running
 SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of
 oprofile data:

 176830   13.0801  postgres postgres ExecProject

Hm, that's weird.  In both these cases, I'd have expected that
ExecProject would get optimized away thanks to selection of a physical
tlist for the scan node.  Wonder if that got broken ...

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Hmm, I guess there is a bit of a hotspot in StoreIndexTuple, which is
 probably being folded into IndexOnlyNext in the per-function timings:

 ExecClearTuple(slot);
 for (i = 0; i < nindexatts; i++)
     values[i] = index_getattr(itup, i + 1, itupdesc, &isnull[i]);
 ExecStoreVirtualTuple(slot);

I had wondered whether it'd be worth optimizing that along the lines of
slot_getallattrs().  But most indexes probably have only one column,
or anyway not enough to make for a useful savings.

regards, tom lane



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-21 Thread Robert Haas
On Fri, Oct 21, 2011 at 3:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Anyhow, here's the scoop.  On my desktop machine running F14, running
 SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of
 oprofile data:

 176830   13.0801  postgres                 postgres                 ExecProject

 Hm, that's weird.  In both these cases, I'd have expected that
 ExecProject would get optimized away thanks to selection of a physical
 tlist for the scan node.  Wonder if that got broken ...

If it did, it looks like it wasn't recent.  I set up the same test
case on my MacBook using REL9_1_STABLE and REL9_0_STABLE and set a
breakpoint on ExecProject().  Both back-branches appear to also call
ExecProject() for every tuple.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-21 Thread Jeff Janes
On Fri, Oct 21, 2011 at 11:14 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I don't know why you'd imagine that touching an index is free, or even
 cheap, CPU-wise.  The whole point of the index-only optimization is to
 avoid I/O.  When you try it on a case where there's no I/O to be saved,
 *and* no shared-buffers contention to be avoided, there's no way it's
 going to be a win.

 Well, call me naive, but I would have thought touching six times less
 data would make the operation run faster, not slower.

 It's not touching six times less data.  It's touching the exact same
 number of tuples either way, just index tuples in one case and heap
 tuples in the other.

 Yeah, but it works out to fewer pages.

But since those pages are already in RAM, why would it matter all that
much?  (Other than in the case of highly concurrent access, which you
don't seem to be testing?)

One of Tom's commits that made it not lock the same index page over
and over again (once for each tuple on it) made me think it should be
much faster than the seq scan, but a bit of random flailing about
convinced me that any savings from this were compensated for by the
high overhead of FunctionCall2Coll and all of the hokey-pokey that
that call entails, which a seqscan can skip entirely.

If count(*) could cause the index-only scan to happen in physical
order of the index, rather than logical order, that might be a big
win.  Both for all in memory and for not-all-in-memory.

Cheers,

Jeff
