Re: [HACKERS] REVIEW: pg_stat_statements with query tree based normalization

2012-01-21 Thread Kääriäinen Anssi

I am going to produce another revision in response to feedback already
received. I intend to outline the steps that it will take to resolve
some outstanding issues in the next day or so. It would be nice if you
could take a look at the revised patch that is ultimately produced.
Should I keep you posted?

Please do.

I tried the patch because I wanted to investigate the Django test suite.
While I did that, I ended up doing a small review of the feature. I can easily
do that again with an updated patch.

 - Anssi

Sent via pgsql-hackers mailing list (
To make changes to your subscription:

[HACKERS] REVIEW: pg_stat_statements with query tree based normalization

2012-01-20 Thread Kääriäinen Anssi
This is a short review of pg_stat_statements based on quick testing of the patch.

1. Installation: after building PostgreSQL and the contrib modules and adding
pg_stat_statements to shared_preload_libraries, I got this error:
FATAL:  could not create shared memory segment: Invalid argument
DETAIL:  Failed system call was shmget(key=5431001, size=34627584, 03600)

So, I needed to raise my SHM (SysV shared memory) limits. I guess there isn't
anything surprising or erroneous about this, but I figured it is worth mentioning.
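For reference, on Linux the relevant SysV limits can be raised via sysctl. The values below are only illustrative; kernel.shmmax must exceed the segment size reported by the failed shmget call:

```
# /etc/sysctl.conf (illustrative values only; apply with sysctl -p)
kernel.shmmax = 134217728    # max size of one segment, in bytes
kernel.shmall = 2097152      # total shared memory, in pages
```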

2. Usability:
  - If you have two similarly named tables in different schemas, for example
public.tbl and some_schema.tbl, these tables will get different entries in
pg_stat_statements. However, the table names are not schema-qualified, so it is
impossible to see which table is which:
# select query, calls from pg_stat_statements where query like 'select%test%';
        query        | calls
---------------------+-------
 select * from test; |     4
 select * from test; |     2

# select * from tmp.test;
# select query, calls from pg_stat_statements where query like 'select%test%';
        query        | calls
---------------------+-------
 select * from test; |     5
 select * from test; |     2

# select * from test;
# select query, calls from pg_stat_statements where query like 'select%test%';
        query        | calls
---------------------+-------
 select * from test; |     5
 select * from test; |     3

- It would be nice from a user perspective to always transform "where id in
(list of values)" to "where id in (?)", regardless of the length of the list.
Now "where id in (1, 2)" is grouped into a different entry than
"where id in (1, 2, 3)".
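To make the suggestion concrete, here is a toy string-level sketch in Python of the desired grouping rule. The patch itself works on the query tree, not on strings, and this naive regex ignores nested parentheses, subqueries, and string literals:

```python
import re

def normalize_in_list(query: str) -> str:
    """Collapse any literal IN (...) list to IN (?) so that lists of
    different lengths map to the same statistics entry.
    Naive sketch: ignores nesting, subqueries and string literals."""
    return re.sub(r"(?i)\bin\s*\([^()]*\)", "IN (?)", query)

print(normalize_in_list("select * from t where id in (1, 2)"))
print(normalize_in_list("select * from t where id in (1, 2, 3)"))
# both print: select * from t where id IN (?)
```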

3. I tried to run Django's test suite a few times to see if there would be any
change in behavior. Some results (note that I haven't tried to reproduce this
error on master without the patch):

test_django_testdb_default=# SELECT aggregation_publisher.id,
aggregation_publisher.name, aggregation_publisher.num_awards,
MIN(aggregation_book.pubdate) AS earliest_book FROM
aggregation_publisher LEFT OUTER JOIN aggregation_book ON
(aggregation_publisher.id = aggregation_book.publisher_id) GROUP BY
aggregation_publisher.id, aggregation_publisher.name,
aggregation_publisher.num_awards HAVING NOT
(MIN(aggregation_book.pubdate) IS NULL) ORDER BY earliest_book ASC;

ERROR:  unrecognized node type for havingclause node: 315
test_django_testdb_default=# \d aggregation_publisher
   Table "public.aggregation_publisher"
   Column   |          Type          | Modifiers
------------+------------------------+-----------
 id         | integer                | not null default 
 name       | character varying(255) | not null
 num_awards | integer                | not null
Indexes:
    "aggregation_publisher_pkey" PRIMARY KEY, btree (id)
Referenced by:
TABLE aggregation_book CONSTRAINT aggregation_book_publisher_id_fkey 
FOREIGN KEY (publisher_id) REFERENCES aggregation_publisher(id) DEFERRABLE 

The time used for insert statements seems suspiciously low. Maybe PostgreSQL is 
just faster than I thought :)

query  | INSERT INTO django_content_type (id, name, app_label, 
model) VALUES (?, ?, ?, ?)
calls  | 5490
total_time | 0.8231193

Multi-value inserts do not seem to be normalized:
query  | INSERT INTO custom_pk_business_employees (business_id, 
employee_id) VALUES ('Sears', 456), ('Sears', 123)
calls  | 1256
total_time | 0.619693

I did not see any noticeable difference in runtimes with pg_stat_statements
installed or uninstalled (as an extension).
Not tested on master without the patch at all.

Overall the feature seems to be really useful.

 - Anssi

Re: [HACKERS] Re: [patch] Include detailed information about a row failing a CHECK constraint into the error message

2011-11-10 Thread Kääriäinen Anssi

What I want to find in the end is something which tells me this row
causes the error. Unfortunately, as the new row of the table with the
constraint is not yet on disk, it doesn't really have its own ctid, and
therefore I cannot report that. (Which makes sense, obviously.)

Would an error with the row's PK value be useful? Something like row
with primary key 'pk_val' fails check 'foo_check'. That would be limited
in size, yet give some context.

There are two problems I can see:
  - The PK value doesn't necessarily identify the row in any useful
manner (SERIAL primary key in INSERT).
  - The table might lack PK constraint (skip the detail in this case?)

 - Anssi

Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-30 Thread Kääriäinen Anssi
Quoting Robert Haas:

I tried this on my MacBook Pro this morning, using pgbench -i -s 500
to create a database about 7.5GB in size, and then using SELECT
sum(1) FROM pgbench_accounts as a test query, on a build WITHOUT
--enable-cassert.  This machine has 4GB of memory, and I set
shared_buffers = 400MB.  (No, I'm not sure whether that's the optimal
setting for shared_buffers for this machine.)

I did some tests comparing the effect of the index-ordered tuples not being in
the same order as in the base table. The idea is to test what effect accessing
the visibility map (VM) in random as opposed to sequential order has on
performance. I suspect the above test accesses the VM in order (the accounts
table is effectively clustered on the index used in the test). I might be
mistaken here.

My test setup is this:
drop table if exists test;
drop table if exists test2;
create unlogged table test /* with (fillfactor = 10) */ 
as select generate_series(0, 20*1000*1000) as id;
create index idx1 on test(id);
vacuum test;
create unlogged table test2 /* with (fillfactor = 10) */
as (select * from test order by random());
create index idx2 on test2(id);
vacuum test2;

Table size is around 600MB, index size is around 350MB, and the on-disk VM
size is 16kB with the default fillfactor. With fillfactor = 10, the VM size is
104kB and the table size is around 6GB. The index size is the same.

Results for the randomly ordered table:
# select count(*) from test2;
14822.045 ms
14826.253 ms
14815.450 ms

Results for the effectively clustered table:
# select count(*) from test;
11761.890 ms
11767.926 ms
11810.900 ms

Now, this test still has the benefit of fitting the VM easily into the L1 cache.

Next, I did an ugly hack to get the table size large enough that the VM will
thrash the L1 cache, while still having a somewhat reasonable test setup
creation time. My hardware is old: 1GB of memory, processor Genuine
Intel(R) CPU L2400 @ 1.66GHz. The L1 data cache size is 32kB on my machine.

The hack is to simply set fillfactor to 10. The VM size is now 104kB and the
table size is about 6.3GB, while the index size is still the same as above.

Results for the randomly ordered table:
# select count(*) from test2;
21606.683 ms
21829.063 ms
21637.434 ms

Results for the effectively clustered table:
# select count(*) from test;
11714.663 ms
11449.264 ms
11658.534 ms

Now, the next step would be to thrash the L2 cache (a 20GB table size should
do this on Sandy Bridge, where the L2 cache is 256KB). I don't have the
hardware to do that test. It is worth noting that the last-level cache is
shared on Sandy Bridge, so it is likely that an index-only scan of a large
enough table would slow down other processes, too. Without tests this is only
FUD, though. The test would be to scan a 20GB table's index repeatedly in one
process, and see how it affects standard in-memory pgbench results for other
processes. Compare this with doing the same with a sequential scan process.

Lessons learned (or what I learned, at least):
  - Clustering is important for index-only scans. Picking a clustered index
over a non-clustered index will have a big performance effect.
  - Large-table index-only scans are going to be more expensive compared
to a sequential scan than the pgbench accounts test suggests. I assume
that the accounts table is effectively clustered on the index used. I
haven't verified this.
  - There is the possibility that index-only scans will thrash the caches for
other processes, too. Not tested, though.

I am sure these results will vary significantly based on hardware used. I
am also notorious for screwing up benchmarks, so verifying these results
is recommended.

You will need around 16GB of disk space for the fillfactor = 10 test. I would
recommend you have more than 1GB of memory, otherwise creating the
test setup can take some time...

 - Anssi Kääriäinen

Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-30 Thread Kääriäinen Anssi
Sorry, I forgot to include the version used and some information about my setup:
PostgreSQL version: Git HEAD as of:
Date:   Fri Oct 28 21:18:36 2011 -0400
Commit: 51eba98cf4595e90730dedd9305da8aa84b649ee

Compiled with defaults (the only change: --with-pgport=5431). I used default
settings; the shared_buffers size is 24MB. The system is Linux Mint Debian Edition
(kernel 3.0.0, gcc 4.6.1). The interesting parts about my hardware were in the
original post.

 - Anssi


Re: [HACKERS] Index only scan paving the way for auto clustered tables?

2011-10-11 Thread Kääriäinen Anssi
Robert Haas wrote:

And it seems to me that there could easily be format changes that
would make sense for particular cases, but not across the board,

- index-organized tables (heap is a btree, and secondary indexes
reference the PK rather than the TID; this is how MySQL does it, and
Oracle offers it as an option)
- WORM tables (no updates or deletes, and no inserts after creating
transaction commits, allowing a much smaller tuple header)
- non-transactional tables (tuples visible as soon as they're written,
again allowing for smaller tuple header; useful for internal stuff and
perhaps for insert-only log tables)

This is probably a silly idea, but I have been wondering about the
following: instead of having visibility info in the row header, have
a couple of row visibility slots in the page header. These slots
could be shared between rows in the page, so that if you do a bulk
insert/update/delete you would only use one slot. If the slots
overflow, you would use an external slots buffer.

When a row is all-visible, no slot would be used at all.

The xmin, xmax and cid would live in the slots. ctid would keep its
current meaning, except when the external slots are used; then ctid
would point to the external slot, which would hold the real row
header. I don't know if there are any other row header parts that
could be shared.

The external slots buffer would then contain xmin, xmax, cid and
the real ctid.

Updates would write the new rows to another page in the heap,
and old rows would stay in place, just as now. So there would not
be any redo log like configuration. Also, the external slots buffer
would be small (18 bytes per row), so it would not get out of
cache too easily.

The performance would suck if you had lots of small updates, or
long running transactions. On the other hand in data warehousing,
where bulk loads are normal, and there are a lot of small rows,
this could actually work.
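To make the sharing concrete, here is a toy Python model of the proposal. Everything here is invented for illustration (the slot limit, the names, the overflow behavior); it only demonstrates how one slot could serve a whole bulk insert and how all-visible rows would need no slot at all:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VisibilitySlot:
    """Visibility info normally kept per row header, here shared."""
    xmin: int
    xmax: Optional[int] = None
    cid: int = 0

@dataclass
class Page:
    MAX_SLOTS = 4  # invented limit; overflow would spill to an external buffer
    slots: list = field(default_factory=list)
    rows: list = field(default_factory=list)   # (value, slot index or None)

    def bulk_insert(self, values, xmin):
        # a whole bulk insert shares a single visibility slot
        if len(self.slots) >= self.MAX_SLOTS:
            raise OverflowError("spill to external slots buffer")
        self.slots.append(VisibilitySlot(xmin=xmin))
        slot = len(self.slots) - 1
        self.rows.extend((v, slot) for v in values)

    def mark_all_visible(self):
        # an all-visible row needs no slot at all
        self.rows = [(v, None) for v, _ in self.rows]
        self.slots.clear()
```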

As said, this is probably a silly idea. But as pluggable heap types
came up, I thought to ask if this could actually work. If this kind of
wondering post is inappropriate for this list, please tell me so
that I can avoid it in the future.

 - Anssi Kääriäinen


Re: [HACKERS] pg_cancel_backend by non-superuser

2011-10-01 Thread Kääriäinen Anssi

In *every* case -- and there are many -- where we've had people
express pain, this would have sufficed.  Usually the problem is a
large index creation gone awry, or an automated backup process
blocking a schema change that has taken half the locks it needs, or
something like that -- all by the same role that is under control of
the folks feeling distress.  If this minimal set is uncontroversial, I
would like to see that much committed and then spend some time
hand-wringing on whether to extend it.

If one does want to extend it, I think role inheritance makes the most
sense: a child role should be able to cancel its parent role's
queries, and not vice-versa. Since one can use SET ROLE in this case
anyway to basically act on behalf on that role, I think that, too,
should be uncontroversial.

It would be a step in the right direction if the DB owner could see all queries
to the DB in pg_stat_activity.

 - Anssi

Re: [HACKERS] index-only scans

2011-08-13 Thread Kääriäinen Anssi

Now, suppose that we know that 50% of the heap pages have their
visibility map bits set.  What's the chance that this query won't need
a heap fetch?  Well, the chance is 50% *if* you assume that a row
which has been quiescent for a long time is just as likely to be
queried as one that has been recently inserted or updated.  However,
in many real-world use cases, nothing could be farther from the truth.

What do we do about that?

The example is much more realistic if the query fetches the N latest rows
from a table. That is a very common use case, and the whole relation's visibility
statistics are completely wrong for that query. Wouldn't it be great if there
were something like pg_stat_statements that knew the statistics per query,
derived from usage...

Even if the statistics are not available per query, the statistics could be 
calculated from the relation usage: the weighted visibility of the pages would 
be pages_visible_when_read / total_pages_read for the relation. That percentage 
would minimize the average cost of the plans much better than just the 
non-weighted visibility percentage.

For the above example, if the usage is 90% read the N latest rows and we assume 
they are never visible, the weighted visibility percentage would be 10% while 
the non-weighted visibility percentage could be 90%. Even if the visibility 
percentage would be incorrect for the queries reading old rows, by definition 
of the weighted visibility percentage there would not be too many of them.
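A back-of-the-envelope sketch of the bookkeeping being proposed; the names are mine, and `was_all_visible` stands for a flag that would be recorded at page-read time:

```python
def weighted_visibility(page_reads):
    """page_reads: (page_id, was_all_visible) pairs, one per page read.
    Weighted visibility = pages_visible_when_read / total_pages_read."""
    total = len(page_reads)
    visible = sum(1 for _, ok in page_reads if ok)
    return visible / total if total else 0.0

# 90% of reads hit the newest, not-yet-all-visible pages:
reads = [(n, False) for n in range(90)] + [(n, True) for n in range(90, 100)]
print(weighted_visibility(reads))  # 0.1, even if most of the table's pages are visible
```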

The same idea could of course be used to calculate the effective cache hit
ratio for each table. Cache hit ratio would have the problem of feedback loops,
though.

Of course, keeping such statistic could be more expensive than the benefit it 
gives. On the other hand, page hit percentage is already available...

 - Anssi


2011-02-11 Thread Kääriäinen Anssi

From: Tom Lane
Sent: Friday, February 11, 2011 7:35 PM
To: Dimitri Fontaine
Cc: David E. Wheeler; Robert Haas; Josh Berkus

We will add logic to find a chain of update scripts leading from oldv to
v, in case that exact combination is not available in the extension's
script directory.  (NOTE: maybe in the CREATE ... FROM case, it would be
a better idea to not do that search, but insist on finding exactly
extname-oldv-v.sql?  That would provide at least a little bit of extra
protection against a wrong FROM choice.  Not sure how much it helps.)

Version strings will have no hard-wired semantics except equality; we
don't need a sorting rule.

This has the side effect that you can also have downgrade scripts. I don't know
if this is designed or just coincidental, so I thought it would be worth
mentioning. It can have some impact on how the update chain to the desired
version is found (loops in the graph), although standard graph traversal
algorithms should handle this just fine. The worst case is that if you are
upgrading from 1.2 to 2.0, the chosen path is 1.2 -> 1.1 -> 2.0, even if there
exists a path 1.2 -> 1.8 -> 1.9 -> 2.0. This could potentially result in data
loss, if the downgrade drops some columns or something like that.
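The hop-count behaviour described above can be sketched like this (the real resolution lives in the extension machinery in C; this only models the graph search, with the version strings and script set invented for the example):

```python
from collections import deque

def shortest_update_path(scripts, source, target):
    """scripts: (from_version, to_version) pairs for which an update (or
    downgrade) script exists. BFS returns a minimal-hop version chain."""
    graph = {}
    for a, b in scripts:
        graph.setdefault(a, []).append(b)
    seen = {source}
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# the worst case from the text: the shorter chain goes through a downgrade
scripts = {("1.2", "1.1"), ("1.1", "2.0"),
           ("1.2", "1.8"), ("1.8", "1.9"), ("1.9", "2.0")}
print(shortest_update_path(scripts, "1.2", "2.0"))  # ['1.2', '1.1', '2.0']
```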

All this can of course be avoided by documenting that even though it is possible
to define a downgrade script, you shouldn't do it...

 - Anssi
 PS. I hope this mail comes out somewhat sanely formatted, using our lovely 
OWA-webmail here...


Re: [HACKERS] SSI and Hot Standby

2011-01-21 Thread Kääriäinen Anssi
When I test your example, though, I'm getting the serialization
failure on T3 rather than T2, so I'd call that a bug.  Will
investigate.  Thanks again for your tests!  You seem to be able to
shake out issues better than anyone else!  Once found, fixing them
is not usually very hard, it's coming up with that creative usage
pattern to *find* the problem which is the hard part.

Thank you very much, but I do not deserve this honor. I was just constructing
an example for myself so that I could understand why a read-only transaction
might pose a problem. I posted it to help other people see a concrete
example of the problem. I had no idea it would expose an actual bug in the code.

OK if I add this one to our dcheck test suite, too?

It is of course OK. And if you want to add this as an example in the
documentation, that would be great. This is a simple but concrete example of why
a read-only serializable transaction might cause an anomaly. If I am not
mistaken, there isn't any complete example in the documentation. It was hard
for me to grasp why there might be a problem, and I don't think I am alone.

 - Anssi


Re: [HACKERS] REVIEW: Extensions support for pg_dump

2011-01-17 Thread Kääriäinen Anssi
 Well I'm not seeing that here

I am not at work at the moment and I don't have the possibility to compile 
PostgreSQL on this computer, so the example here is from memory.

The issue I saw was this: assume you have an extension foo, containing one 
function, test().

-- restricted due to dependency

-- not restricted!

The same can be done using CREATE OR REPLACE.

I hope this is not an error on my part. It is possible I made one, because I
had a lot of schemas and my search_path might have been wrong...

 - Anssi
PS: Using web email client, I hope this comes out in somewhat sane format.