Re: [HACKERS] counting pallocs

2012-05-17 Thread Heikki Linnakangas

On 17.05.2012 06:43, Robert Haas wrote:

The attached patch provides some rough instrumentation for determining
where palloc calls are coming from.  This is obviously just for
noodling around with, not for commit, and there may well be bugs.  But
enjoy.

I gave this a quick spin on a couple of test workloads: a very short
pgbench test, a very short pgbench -S test, and the regression tests.
On the pgbench test, the top culprits are ExecInitExpr() and
expression_tree_mutator(); in both cases, the lappend() call for the
T_List case is the major contributor.  Other significant contributors
include _copyVar(), which I haven't drilled into terribly far but
seems to be coming mostly from add_vars_to_targetlist();
buildRelationAliases() via lappend, pstrdup, and makeString;
ExecAllocTupleTableSlot(); and makeColumnRef() via makeNode, lcons,
and makeString.


What percentage of total CPU usage is the palloc() overhead in these 
tests? If we could totally eliminate the palloc() overhead, how much 
faster would the test run?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



[HACKERS] master and sync-replica diverging

2012-05-17 Thread Erik Rijkers
AMD FX 8120 / CentOS 6.2 / latest source (git head)


It seems to be quite easy to force a 'sync' replica to become unequal to the
master by recreating+loading a table in a while loop.


For this test I compiled+checked+installed three separate instances on the
same machine.  The replica application_names are 'wal_receiver_$copy', where
$copy is 01 and 02 respectively.

$ ./sync_state.sh
  pid  | application_name |   state   | sync_state
-------+------------------+-----------+------------
 19520 | wal_receiver_01  | streaming | sync
 19567 | wal_receiver_02  | streaming | async
(2 rows)

 port | synchronous_commit | synchronous_standby_names
------+--------------------+---------------------------
 6564 | on                 | wal_receiver_01
(1 row)

 port | synchronous_commit | synchronous_standby_names
------+--------------------+---------------------------
 6565 | off                |
(1 row)

 port | synchronous_commit | synchronous_standby_names
------+--------------------+---------------------------
 6566 | off                |
(1 row)
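
(sync_state.sh itself isn't shown; presumably it runs something like the
following: pg_stat_replication on the master for the first table, and
current_setting() against each port for the rest.)

-- on the master (port 6564): one row per connected standby
SELECT pid, application_name, state, sync_state
FROM pg_stat_replication;

-- on each of the three instances in turn
SELECT current_setting('port') AS port,
       current_setting('synchronous_commit') AS synchronous_commit,
       current_setting('synchronous_standby_names') AS synchronous_standby_names;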



The test consists of creating a table, loading tab-separated data from a file
with COPY, and then taking the rowcount of that table (13 MB, almost 200k
rows) in all three instances:


# wget http://flybase.org/static_pages/downloads/FB2012_03/genes/fbgn_annotation_ID_fb_2012_03.tsv.gz

slurp_file=fbgn_annotation_ID_fb_2012_03.tsv.gz

zcat $slurp_file \
 | grep -v '^#' \
 | grep -Ev '^[[:space:]]*$' \
 | psql -c "
drop table if exists $table cascade;
create table $table (
 gene_symbol              text
,primary_fbgn             text
,secondary_fbgns          text
,annotation_id            text
,secondary_annotation_ids text
);
copy $table from stdin csv delimiter E'\t';
" ;

# count on master:
echo "select current_setting('port') port, count(*) from $table" | psql -qtXp 6564

# count on wal_receiver_01 (sync replica):
echo "select current_setting('port') port, count(*) from $table" | psql -qtXp 6565

# count on wal_receiver_02 (async replica):
echo "select current_setting('port') port, count(*) from $table" | psql -qtXp 6566



I expected the rowcounts from master and sync replica to always be the same.

Initially this seemed to be the case, but when I ran the above sequence in a
while loop for a few minutes, about 10% of the rowcounts from the sync
replica were not equal to the master.

Perhaps not a likely scenario, but surely such a deviating rowcount on a
sync replica should not be possible?


thank you,


Erik Rijkers







Re: [HACKERS] Why is indexonlyscan so darned slow?

2012-05-17 Thread Ants Aasma
On Thu, May 17, 2012 at 6:08 AM, Joshua Berkus j...@agliodbs.com wrote:
 As you can see, the indexonlyscan version of the query spends 5% as much time 
 reading the data as the seq scan version, and doesn't have to read the heap 
 at all.  Yet it spends 20 seconds doing ... what, exactly?

 BTW, kudos on the new explain analyze reporting ... works great!

Looks like timing overhead. Timing is called twice per tuple, which
gives around 950ns per timing call for your index-only result. This is
around what is expected of HPET-based timing. If you are on Linux you
can check which clocksource you are using by running:
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

You can verify that it is due to timing overhead by adding TIMING OFF
to the EXPLAIN options, or use the pg_test_timing utility to check the
timing overhead on your system. With HPET-based timing I'm seeing
660ns timing overhead and 26.5s execution for your query; with timing
off, execution time falls to 2.1s. For reference, TSC-based timing
gives 19.2ns overhead and 2.3s execution time with timing on.
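
(Concretely, with the 9.2 EXPLAIN options; the query here is just a
stand-in for the one posted upthread:)

-- per-node timing disabled; row counts and total runtime are still reported
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM pgbench_accounts;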

Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: [HACKERS] Why is indexonlyscan so darned slow?

2012-05-17 Thread Joshua Berkus
Ants,

Well, that's somewhat better, but again hardly the gain in performance I'd
expect to see ... especially since these are ideal circumstances for an
index-only scan.

bench2=# select count(*) from pgbench_accounts;
  count   
----------
 2000
(1 row)

Time: 3827.508 ms

bench2=# set enable_indexonlyscan=off;
SET
Time: 0.241 ms
bench2=# select count(*) from pgbench_accounts;
  count   
----------
 2000
(1 row)

Time: 16012.444 ms

For some reason counting tuples in an index takes 5X as long (per tuple) as 
counting them in a table.  Why?



Re: [HACKERS] could not open relation with OID errors after promoting the standby to master

2012-05-17 Thread Joachim Wieland
On Wed, May 16, 2012 at 11:38 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Well, that is not surprising in itself -- InitTempTableNamespace calls
 RemoveTempRelations to cleanup from a possibly crashed previous backend
 with the same ID.  So that part of the backtrace looks normal to me
 (unless there is something weird going on, which might very well be the
 case).

Right, I guess the stack trace is okay but some state was obviously wrong.

I was able to clean that up now by some catalog hacking, but I'm
definitely going to dump and reload soon.

I found out that it was certain backend IDs which couldn't create
temporary tables, meaning that when I did a CREATE TEMP TABLE in one of
these few backend IDs (about 4-5, all with low ID numbers, which is why I
hit them quite often), it would give me this "could not open relation
with OID x" error.

I also couldn't drop the temp schema in these backends:

# drop schema pg_temp_4;
ERROR:  cache lookup failed for relation 1990987636

# select oid, * from pg_namespace ;
(got oid 4664506 for pg_temp_4)

# select * from pg_class where oid = 1990987636;
(no rows returned)

# delete from pg_namespace where oid = 4664506;
DELETE 1

# create temp table myfoo(a int);
CREATE TABLE

Later on I also found some leftover pg_type entries from temporary
tables that didn't exist anymore. I'm quite certain that I shouldn't
see these anymore... And I also found a few entries in pg_class with
relistemp='t' whose OIDs are considerably older than anything recent.
This kinda suggests that there might be something weird going on when
you have temp tables in flight and fail over; at least that's the only
explanation I have for how this could have happened.
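
(For anyone hunting similar damage, a sketch of queries that would surface
such orphans; relistemp is the pre-9.1 catalog column used above:)

-- pg_type rows whose backing relation no longer exists
SELECT t.oid, t.typname
FROM pg_type t
WHERE t.typrelid <> 0
  AND NOT EXISTS (SELECT 1 FROM pg_class c WHERE c.oid = t.typrelid);

-- temp-table entries in pg_class, oldest first
SELECT oid, relname FROM pg_class WHERE relistemp ORDER BY oid;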



Re: [HACKERS] counting pallocs

2012-05-17 Thread Robert Haas
On Thu, May 17, 2012 at 2:28 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 What percentage of total CPU usage is the palloc() overhead in these tests?
 If we could totally eliminate the palloc() overhead, how much faster would
 the test run?

AllocSetAlloc is often the top CPU consumer in profiling results, but
it's typically only in the single-digit percentages.  However, there's
also some distributed overhead that's more difficult to measure.  For
example, the fact that OpExpr uses a List instead of directly pointing
to its arguments costs us three pallocs - plus three more if we ever
copy it - but it also means that accessing the first element of an
OpExpr requires three pointer dereferences instead of one, and
accessing the second one requires four pointer dereferences instead of
one.  There's no real way to isolate the overhead of that, but it's
got to cost at least something.

The reality - I'm not sure whether it's a happy reality or a sad
reality - is that most CPU profiles of PostgreSQL are pretty flat.
The nails that stick up have, for the most part, long since been
pounded down.  If we want to make further improvements to our parse
and plan time, and I do, because I think we lag our competitors, then
I think this is the kind of stuff we need to look at.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] master and sync-replica diverging

2012-05-17 Thread Joshua Berkus
Erik,

Are you taking the counts *while* the table is loading?  In sync
replication, it's possible for the counts to differ for a short time due to
one of three things:

* the transaction has been saved to the replica and the confirm message
hasn't reached the master yet;
* the replica has synched the transaction to the WAL log, but due to
wal_delay settings hasn't yet applied it to the tables in memory (see the
query sketch below);
* updating the master with synchronous_commit = local.
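
(The second case is visible from the master; with 9.2's pg_stat_replication
columns, a standby that has flushed but not yet replayed a commit shows
flush_location ahead of replay_location:)

SELECT application_name, flush_location, replay_location
FROM pg_stat_replication;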



Re: [HACKERS] Strange issues with 9.2 pg_basebackup replication

2012-05-17 Thread Joshua Berkus
Jim, Fujii,

Even more fun:

1) Set up a server as a cascading replica (e.g. max_wal_senders = 3, 
standby_mode = on )

2) Connect the server to *itself* as a replica.

3) This will work and report success, up until you do your first write.

4) Then ... segfault!  
 


- Original Message -
 On 5/16/12 10:53 AM, Fujii Masao wrote:
  On Wed, May 16, 2012 at 3:43 AM, Joshua Berkus j...@agliodbs.com wrote:
 
  Before restarting it, you need to do pg_basebackup and make a
  base
  backup
  onto the standby again. Since you started the standby without
  recovery.conf,
  a series of WAL in the standby has gotten inconsistent with that
  in
  the master.
  So you need a fresh backup to restart the standby.
 
  You're not understanding the bug.  The problem is that the standby
  came up and reported that it was replicating OK, when clearly it
  wasn't.
 
  8. Got this fatal error on the standby server:
 
  LOG:  record with incorrect prev-link 0/7B8 at 0/7E0
  LOG:  record with incorrect prev-link 0/7B8 at 0/7E0
 
  ... this error message repeated every 5s.
 
  According to your first report, ISTM you got error messages.
 
 Only *after* it was correctly setup.
 
 Josh's point is that if you flub the configuration, you should get an
 error, which is not what's happening now. Right now it just comes up
 and acts as if nothing's wrong.
 --
 Jim C. Nasby, Database Architect   j...@nasby.net
 512.569.9461 (cell) http://jim.nasby.net
 



Re: [HACKERS] counting pallocs

2012-05-17 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 One piece of reasonably low-hanging fruit appears to be OpExpr.  It
 seems like it would be better all around to put Node *arg1 and Node
 *arg2 in there instead of a list...  aside from saving pallocs, it
 seems like it would generally simplify the code.

Obviously, Stephen Frost's list-allocation patch would affect your
results here ... but I wonder how much the above change would affect
*his* results.  Specifically, the observation that most lists are 1
or 2 elements long would presumably become less true, but I wonder
by how much exactly.

regards, tom lane



Re: [HACKERS] read() returns ERANGE in Mac OS X

2012-05-17 Thread Florian Pflug
On May 16, 2012, at 15:51, Tom Lane wrote:
 Alvaro Herrera alvhe...@alvh.no-ip.org writes:
 We just came across a situation where a corrupted HFS+ filesystem
 appears to return ERANGE on a customer machine.  Our first reaction was
 to turn zero_damaged_pages on to allow taking a pg_dump backup of the
 database, but surprisingly this does not work.  A quick glance at the
 code shows the reason:
 ...
 Note that zero_damaged_pages only enters the picture if it's a short
 read, not if the read actually fails completely.
 
 Is this by design, or is this just an oversight?
 
 It is by design, in that the only contemplated case was truncated-away
 pages.  I'm pretty hesitant to consider allowing arbitrary kernel errors
 to be ignored here …

Maybe we should have zero_missing_pages which would only zero on short reads,
and zero_damaged_pages which would zero on all IO errors?

Or we could have zero_damaged_pages zero only if read() reports EIO, and then
add any platform-specific additional error codes as we learn about them.
ERANGE on darwin would be the first such addition.

In any case, it seems to me that at least EIO should trigger zeroing, since
that is presumably what you'd get on a filesystem with integrated checksums
like ZFS.
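
(For context, the rescue procedure mentioned above is the usual
session-level toggle; superuser only, and whatever was in the bad pages
is lost. The table name is hypothetical:)

SET zero_damaged_pages = on;
SELECT count(*) FROM damaged_table;  -- bad pages are zeroed with a warning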

best regards,
Florian Pflug




Re: [HACKERS] Strange issues with 9.2 pg_basebackup replication

2012-05-17 Thread Ants Aasma
On Thu, May 17, 2012 at 3:42 PM, Joshua Berkus j...@agliodbs.com wrote:
 Even more fun:

 1) Set up a server as a cascading replica (e.g. max_wal_senders = 3, 
 standby_mode = on )

 2) Connect the server to *itself* as a replica.

 3) This will work and report success, up until you do your first write.

 4) Then ... segfault!

I cannot reproduce this. Attached is the script that I use for cascade
replication testing. With it I can see the replica connecting to
itself but no segfault.

Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


test_cascading.sh
Description: Bourne shell script



Re: [HACKERS] master and sync-replica diverging

2012-05-17 Thread Erik Rijkers
On Thu, May 17, 2012 14:32, Joshua Berkus wrote:
 Erik,

 Are you taking the counts *while* the table is loading?  In sync
 replication, it's possible for the counts to differ for a short time due
 to one of three things:

 * transaction has been saved to the replica and confirm message hasn't
 reached the master yet

The count(*) was done in the way that I showed, i.e. *after* psql had exited.
My understanding is that, with synchronous replication 'on' and configured
properly, psql could only return *after* the sync replica had the data
safely on disk.  Either my understanding is not correct or there is a bug
in postgres sync-rep.

 * replica has synched the transaction to the WAL log, but due to wal_delay 
 settings hasn't yet
 applied it to the tables in memory.

settings are untouched.

 * updating the master with synchronous_commit = local.

During my test:  synchronous_commit = on


Thanks,


Erik Rijkers



Re: [HACKERS] psql bug

2012-05-17 Thread Teodor Sigaev

will investigate that

Tom Lane wrote:

Teodor Sigaev teo...@sigaev.ru writes:

After editing a query with an external editor, psql exits on Ctrl-C:


FWIW, I failed to reproduce that on any of my machines.  Maybe
your editor is leaving the tty in a funny state?

regards, tom lane


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: [HACKERS] master and sync-replica diverging

2012-05-17 Thread Ants Aasma
On Thu, May 17, 2012 at 4:53 PM, Erik Rijkers e...@xs4all.nl wrote:
 The count(*) was done in the way that I showed, i.e. *after* psql had
 exited.  My understanding is that, with synchronous replication 'on' and
 configured properly, psql could only return *after* the sync replica had
 the data safely on disk.  Either my understanding is not correct or there
 is a bug in postgres sync-rep.

Commit can only return when the sync replica has the data safely on disk,
but this doesn't mean that it's visible yet.

The sequence of events, in dot notation:

commit_command -> master_wal_sync -> replica_wal_sync ->
    master_commit_visible -> commit_response
replica_wal_sync -> replica_replay_wal -> replica_commit_visible

If you issue a select on the replica after getting a commit response
from the master, you can see that the query getting a snapshot races with
replay of the commit record.
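
(A test script can close that race by polling; function names as of 9.2:)

-- on the master, right after the COPY commits:
SELECT pg_current_xlog_location();

-- on the replica, repeat until the result reaches the master's value,
-- then take the count:
SELECT pg_last_xlog_replay_location();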

Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: [HACKERS] master and sync-replica diverging

2012-05-17 Thread Erik Rijkers
On Thu, May 17, 2012 16:10, Ants Aasma wrote:
 Commit can only return when sync-replica has the data safely on disk,
 but this doesn't mean that it's visible yet.

 The sequence of events, in dot notation:
 commit_command -> master_wal_sync -> replica_wal_sync ->
     master_commit_visible -> commit_response
 replica_wal_sync -> replica_replay_wal -> replica_commit_visible

 If you issue a select on the replica after getting a commit response
 from master you can see that the query getting a snapshot races with
 replay of the commit record.


Ah yes, that makes sense. I hadn't thought of that.

Thank you for that explanation.




Re: [HACKERS] Missing optimization when filters are applied after window functions

2012-05-17 Thread Volker Grabsch
Hitoshi Harada wrote:
 On Wed, May 16, 2012 at 12:50 AM, Volker Grabsch v...@notjusthosting.com wrote:
  I propose the following general optimization: If all window
  functions are partitioned by the same first field (here: id),
  then any filter on that field should be executed before
  WindowAgg. So a query like this:
 
 I think that's possible.  Currently the planner doesn't think any
 qualification from the upper query can be pushed down to a query that
 has a window function.  It would be possible to let it push down if
 the expression matches PARTITION BY expression.

Sounds great!

 However, the
 challenge is that a query may have a number of window functions that
 have different PARTITION BY expressions.  At the time of pushing down
 in the planning, it is not obvious which window function comes first.

I don't really understand what you mean by "which window function
comes first", because to my understanding, all window functions of
a query belong to the same level in the query hierarchy. But then,
my knowledge of PostgreSQL internals isn't very deep, either.

 One idea is to restrict such optimization in only case of single
 window function, and the other is to make it generalize and cover a
 lot of cases.

From a practical point of view, the restriction to a single window
function wouldn't be that bad, although I'd prefer to think about the
number of different windows rather than the number of window functions.

In other words, every optimization that is correct for a single window
function is also correct for multiple window functions if they all use
the same window.
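
(For concreteness, a sketch of the equivalence, using a hypothetical table
t with partition key id: because the filter is on the PARTITION BY column,
these two forms must return the same rows, yet today only the second avoids
computing the window over the whole table.)

SELECT * FROM (
    SELECT id, val, sum(val) OVER (PARTITION BY id) AS s
    FROM t
) sub
WHERE id = 42;

SELECT id, val, sum(val) OVER (PARTITION BY id) AS s
FROM t
WHERE id = 42;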

 That said, our planner on window functions has a lot of improvement to
 be done.  Every kind of optimization I see is what I raised above;
 they can be done easily by hacking in a small case, or they can be
 done by generalizing for the most of cases.  My understanding is our
 project tends to like the latter and it takes a little time but covers
 more use cases.

I'd also prefer to see a general solution, as this provides less
room for unpleasant surprises (e.g. "This query is only slightly
different from the previous one. Why does it take so much longer?").

On the other hand, any small improvement is a big step forward
regarding window functions.

Unfortunately, I can't volunteer for that, as it is currently
impossible for me to allocate enough time for this.

However, any pointer to where to look in the source (or in the
manual) would be of great help. Maybe I'll find at least enough time
to provide a rough proposal, or to improve existing attempts
to solve this issue.

Also, is there any chance to include a (simple) attempt at such an
optimization in the PostgreSQL 9.2 beta, or is this only a possible
topic for 9.3 and later?


Regards,
Volker

-- 
Volker Grabsch



Re: [HACKERS] Pre-alloc ListCell's optimization

2012-05-17 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 So I guess the first question here is - does it improve performance?
 
 Because if it does, then it's worth pursuing ... if not, that's the
 first thing to fix.

Alright, so I've done some pgbench runs using all-default configs with just
a straight-up './configure' and pgbench -S -T 300, 3 runs each and then
averaged:

llist_opt: 9289 tps
HEAD:      9286 tps

I realize we see tons of palloc() calls happening, but now I'm wondering
if they really contribute all that much time, overall.  Also, I'm
wondering if all the benefit from removing the palloc()s is being
sucked up by the regression in list_concat().

A few folks have mentioned just going whole-hog and doing all list
allocations in blocks of 8, which would actually allow us to go back to
an O(1) list_concat, though we wouldn't be able to free the 2nd list
passed to list_concat in that case, which may or may not be acceptable,
based on how necessary those couple pfree's we had previously are.

Thanks,

Stephen




Re: [HACKERS] Why is indexonlyscan so darned slow?

2012-05-17 Thread Jeff Janes
On Thu, May 17, 2012 at 5:22 AM, Joshua Berkus j...@agliodbs.com wrote:
 For some reason counting tuples in an index takes 5X as long (per tuple) as 
 counting them in a table.  Why?


It looks like the IOS is taking 4x less time, not more time.

Anyway, the IOS follows the index logical structure, not the physical
structure, so if the index is not in RAM it will really be hurt by the
lack of sequential reads.
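
(A quick way to compare how much of each structure has to fit in RAM;
pgbench_accounts_pkey is pgbench's default primary-key index:)

SELECT pg_size_pretty(pg_relation_size('pgbench_accounts'))      AS heap_size,
       pg_size_pretty(pg_relation_size('pgbench_accounts_pkey')) AS index_size;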

Cheers,

Jeff



Re: [HACKERS] Strange issues with 9.2 pg_basebackup replication

2012-05-17 Thread Fujii Masao
On Thu, May 17, 2012 at 12:01 PM, Joshua Berkus j...@agliodbs.com wrote:

  And: if we still have to ship logs, what's the point in even having
  cascading replication?

 At least cascading replication (1) allows you to adopt more flexible
 configuration of servers,

 I'm just pretty shocked.  The last time we talked about this, at the end of 
 the 9.1 development cycle, you almost had remastering using streaming-only 
 replication working, you just ran out of time.  Now it appears that you've 
 abandoned working on that completely.  What's going on?

You mean that remastering is, after promoting one of the standby servers,
making the remaining standbys reconnect to the new master and resolve the
timeline gap without the shared archive? Yep, that's one of my TODO items,
but I'm not sure if I have enough time to implement that for 9.3.

Regards,

-- 
Fujii Masao



Re: [HACKERS] Strange issues with 9.2 pg_basebackup replication

2012-05-17 Thread Fujii Masao
On Thu, May 17, 2012 at 10:42 PM, Ants Aasma a...@cybertec.at wrote:
 On Thu, May 17, 2012 at 3:42 PM, Joshua Berkus j...@agliodbs.com wrote:
 Even more fun:

 1) Set up a server as a cascading replica (e.g. max_wal_senders = 3, 
 standby_mode = on )

 2) Connect the server to *itself* as a replica.

 3) This will work and report success, up until you do your first write.

 4) Then ... segfault!

 I cannot reproduce this.

Me, neither.

Josh, could you show me the detailed procedure to reproduce the problem?

Regards,

-- 
Fujii Masao



Re: [HACKERS] Missing optimization when filters are applied after window functions

2012-05-17 Thread Nicolas Barbier
2012/5/17 Volker Grabsch v...@notjusthosting.com:

 Also, is there any chance to include a (simple) attempt of
 such an optimiztation into PostgreSQL-9.2 beta, or is this
 only a possible topic for 9.3 and later?

For 9.2, you're about 4 months late :-). The last commitfest was in January:

https://commitfest.postgresql.org/action/commitfest_view?id=13

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?



Re: [HACKERS] Why is indexonlyscan so darned slow?

2012-05-17 Thread Joshua Berkus
Jeff,

That's in-RAM speed ... I ran the query twice to make sure the index was 
cached, and it didn't get any better.  And I meant 5X per byte rather than 5X 
per tuple.

I talked this over with Haas, and his opinion is that we have a LOT of
overhead in the way we traverse indexes, especially lookups which happen
once per leaf node instead of in bulk.  Certainly the performance I'm
seeing would be consistent with that idea.

I'll try some multi-column covering indexes next to see how it looks. 
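
(Something along these lines, using pgbench's own columns; the index name
is made up:)

-- with (aid, abalance) both in the index, a sum over abalance can run as
-- an index-only scan and never visit the heap
CREATE INDEX pgbench_accounts_cover_idx
    ON pgbench_accounts (aid, abalance);
SELECT sum(abalance) FROM pgbench_accounts;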



Re: [HACKERS] Why is indexonlyscan so darned slow?

2012-05-17 Thread Jeff Janes
On Thu, May 17, 2012 at 11:35 AM, Joshua Berkus j...@agliodbs.com wrote:
 Jeff,

 That's in-RAM speed ... I ran the query twice to make sure the index was 
 cached, and it didn't get any better.  And I meant 5X per byte rather than 5X 
 per tuple.

Ah, OK, that makes more sense.  I played around with this, specifically
count(*), quite a bit when IOS first came out, and I attributed a
large part of the time to the code that forms a tuple out of raw
bytes and the code that advances the aggregate.  The first one is
probably more a per-tuple cost than a per-byte one, and the second is
definitely a per-tuple cost.

I can't find my detailed notes from this work, so this is just from memory.

Cheers,

Jeff



Re: [HACKERS] Strange issues with 9.2 pg_basebackup replication

2012-05-17 Thread Joshua Berkus
Yeah, I don't know how I produced the crash in the first place, because of
course the self-replica should block all writes, and retesting it I can't
get it to accept a write.

So the bug is just that you can connect a server to itself as its own
replica.  Since I can't think of any good reason to do this, we should
simply error out on startup if someone sets things up that way.  How can we
detect that we've connected streaming replication to the same server?



Re: [HACKERS] psql bug

2012-05-17 Thread Teodor Sigaev

FWIW, I failed to reproduce that on any of my machines.  Maybe
your editor is leaving the tty in a funny state?


It seems that system() resets the sigaction state on FreeBSD. I modified

void
setup_cancel_handler(void)
{
    fprintf(stderr, "%p - %p\n",
            pqsignal(SIGINT, handle_sigint), handle_sigint);
}

and added calls to it around the system() call. Next:

% export EDITOR=echo
% psql postgres
0x0 - 0x409620
SET
Timing is on.
psql (9.2beta1)
Type "help" for help.

postgres=# \e
0x409620 - 0x409620
/tmp/psql.edit.7997.sql
0x0 - 0x409620
postgres=#

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
