[HACKERS] master and sync-replica diverging
AMD FX 8120 / centos 6.2 / latest source (git head)

It seems to be quite easy to force a 'sync' replica to not be equal to the master by recreating+loading a table in a while loop.

For this test I compiled+checked+installed three separate instances on the same machine. The replica application_names are 'wal_receiver_$copy' where $copy is 01, resp. 02.

$ ./sync_state.sh
  pid  | application_name |   state   | sync_state
-------+------------------+-----------+------------
 19520 | wal_receiver_01  | streaming | sync
 19567 | wal_receiver_02  | streaming | async
(2 rows)

 port | synchronous_commit | synchronous_standby_names
------+--------------------+---------------------------
 6564 | on                 | wal_receiver_01
(1 row)

 port | synchronous_commit | synchronous_standby_names
------+--------------------+---------------------------
 6565 | off                |
(1 row)

 port | synchronous_commit | synchronous_standby_names
------+--------------------+---------------------------
 6566 | off                |
(1 row)

The test consists of creating a table, loading tab-separated data from file with COPY, and then taking the rowcount of that table (13 MB, almost 200k rows) in all three instances:

# wget http://flybase.org/static_pages/downloads/FB2012_03/genes/fbgn_annotation_ID_fb_2012_03.tsv.gz

slurp_file=fbgn_annotation_ID_fb_2012_03.tsv.gz

zcat $slurp_file \
  | grep -v '^#' \
  | grep -Ev '^[[:space:]]*$' \
  | psql -c "
    drop table if exists $table cascade;
    create table $table (
       gene_symbol              text
      ,primary_fbgn             text
      ,secondary_fbgns          text
      ,annotation_id            text
      ,secondary_annotation_ids text
    );
    copy $table from stdin csv delimiter E'\t';
  ";

# count on master:
echo "select current_setting('port') port, count(*) from $table" | psql -qtXp 6564

# count on wal_receiver_01 (sync replica):
echo "select current_setting('port') port, count(*) from $table" | psql -qtXp 6565

# count on wal_receiver_02 (async replica):
echo "select current_setting('port') port, count(*) from $table" | psql -qtXp 6566

I expected the rowcounts from master and sync replica to always be the same.

Initially this seemed to be the case, but when I run the above sequence in a while loop for a few minutes, about 10% of the rowcounts from the sync replica are not equal to the master's.

Perhaps not a likely scenario, but surely such a deviating rowcount on a sync replica should not be possible?

thank you,

Erik Rijkers
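A minimal sketch of such a driver loop, assuming the load step above has been saved as load_table.sh (a hypothetical name) and that $table is set in the environment:

#!/bin/sh
# Sketch: repeat the recreate+COPY+count cycle and flag any divergence.
# Ports as in the setup above: 6564 = master, 6565 = sync replica.
for i in $(seq 1 100); do
  ./load_table.sh >/dev/null
  m=$(echo "select count(*) from $table" | psql -qtXp 6564)
  s=$(echo "select count(*) from $table" | psql -qtXp 6565)
  [ "$m" != "$s" ] && echo "run $i: master=$m sync_replica=$s DIVERGED"
done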
Re: [HACKERS] Why is indexonlyscan so darned slow?
On Thu, May 17, 2012 at 6:08 AM, Joshua Berkus wrote:
> As you can see, the indexonlyscan version of the query spends 5% as much time
> reading the data as the seq scan version, and doesn't have to read the heap
> at all. Yet it spends 20 seconds doing ... what, exactly?
>
> BTW, kudos on the new explain analyze reporting ... works great!

Looks like timing overhead. Timing is called twice per tuple, which gives around 950ns per timing call for your index-only result. This is around what is expected of hpet-based timing. If you are on Linux you can check what clocksource you are using by running

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

You can verify that it is due to timing overhead by adding "timing off" to the explain clause. Or use the pg_test_timing utility to check the timing overhead on your system. With hpet-based timing I'm seeing 660ns timing overhead and 26.5s execution for your query; with timing off, execution time falls to 2.1s. For reference, tsc-based timing gives 19.2ns overhead and 2.3s execution time with timing.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
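For example (a sketch; t stands in for the queried table, and TIMING is the 9.2 EXPLAIN option referred to above):

-- Per-node timing off: row counts are still reported, but no per-tuple
-- clock reads are made.
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM t;

-- Default: timing on, i.e. two clock reads per tuple returned.
EXPLAIN (ANALYZE) SELECT count(*) FROM t;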
Re: [HACKERS] Why is indexonlyscan so darned slow?
Ants,

Well, that's somewhat better, but again hardly the gain in performance I'd expect to see ... especially since these are ideal circumstances for index-only scan.

bench2=# select count(*) from pgbench_accounts;
 count
-------
  2000
(1 row)

Time: 3827.508 ms

bench2=# set enable_indexonlyscan=off;
SET
Time: 0.241 ms

bench2=# select count(*) from pgbench_accounts;
 count
-------
  2000
(1 row)

Time: 16012.444 ms

For some reason counting tuples in an index takes 5X as long (per tuple) as counting them in a table. Why?

- Original Message -
> Looks like timing overhead. Timing is called twice per tuple which
> gives around 950ns per timing call for your index only result.
> [...]
> With hpet based timing I'm seeing 660ns timing overhead and 26.5s
> execution for your query, with timing off execution time falls to
> 2.1s. For reference, tsc based timing gives 19.2ns overhead and 2.3s
> execution time with timing.
>
> Ants Aasma
Re: [HACKERS] "could not open relation with OID" errors after promoting the standby to master
On Wed, May 16, 2012 at 11:38 PM, Alvaro Herrera wrote:
> Well, that is not surprising in itself -- InitTempTableNamespace calls
> RemoveTempRelations to cleanup from a possibly crashed previous backend
> with the same ID. So that part of the backtrace looks normal to me
> (unless there is something weird going on, which might very well be the
> case).

Right, I guess the stack trace is okay, but some state was obviously wrong. I was able to clean that up now by some catalog hacking, but I'm definitely going to dump and reload soon.

I found out that it was certain backend ids which couldn't create temporary tables: when I did a "create temp table" in these few backend ids (about 4-5, all with low id numbers, which is why I hit them quite often), it would give me this "could not open relation with OID x" error. I also couldn't drop the temp schema in these backends:

# drop schema pg_temp_4;
ERROR: cache lookup failed for relation 1990987636

# select oid, * from pg_namespace;
(got oid 4664506 for "pg_temp_4")

# select * from pg_class where oid = 1990987636;
(no rows returned)

# delete from pg_namespace where oid = 4664506;
DELETE 1

# create temp table myfoo(a int);
CREATE TABLE

Later on I also found some leftover pg_type entries from temporary tables that no longer exist. I'm quite certain I shouldn't see these anymore... And I also found a few entries in pg_class with relistemp='t' whose oid is considerably older than anything recent.

This kinda suggests that there might be something weird going on when you have temp tables in flight and fail over; at least, that's the only explanation I have for how this could have happened.
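Queries along these lines can surface such leftovers (a sketch against the pre-9.1 catalogs, where pg_class still has relistemp, as in the report above):

-- Temp relations whose namespace row no longer exists.
SELECT c.oid, c.relname, c.relnamespace
FROM pg_class c
WHERE c.relistemp
  AND NOT EXISTS (SELECT 1 FROM pg_namespace n WHERE n.oid = c.relnamespace);

-- pg_type rows still pointing at relations that are gone.
SELECT t.oid, t.typname, t.typrelid
FROM pg_type t
WHERE t.typrelid <> 0
  AND NOT EXISTS (SELECT 1 FROM pg_class c WHERE c.oid = t.typrelid);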
Re: [HACKERS] counting pallocs
On Thu, May 17, 2012 at 2:28 AM, Heikki Linnakangas wrote:
> What percentage of total CPU usage is the palloc() overhead in these tests?
> If we could totally eliminate the palloc() overhead, how much faster would
> the test run?

AllocSetAlloc is often the top CPU consumer in profiling results, but it's typically only in the single-digit percentages. However, there's also some distributed overhead that's more difficult to measure. For example, the fact that OpExpr uses a List instead of directly pointing to its arguments costs us three pallocs - plus three more if we ever copy it - but it also means that accessing the first element of an OpExpr requires three pointer dereferences instead of one, and accessing the second one requires four pointer dereferences instead of one. There's no real way to isolate the overhead of that, but it's got to cost at least something.

The reality - I'm not sure whether it's a happy reality or a sad reality - is that most CPU profiles of PostgreSQL are pretty flat. The nails that stick up have, for the most part, long since been pounded down. If we want to make further improvements to our parse and plan time, and I do, because I think we lag our competitors, then I think this is the kind of stuff we need to look at.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
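To illustrate the dereference counts, a simplified sketch (these are not the real PostgreSQL struct definitions; field and type names are made up):

typedef struct ListCell { void *ptr_value; struct ListCell *next; } ListCell;
typedef struct List     { int length; ListCell *head; } List;

typedef struct OpExprList   { int opno; List *args; } OpExprList;  /* today */
typedef struct OpExprDirect { int opno; void *arg1; void *arg2; } OpExprDirect;

/* List-based: e->args->head->ptr_value is three dereferences, and the List
 * header plus two cells were three separate pallocs. */
static void *first_arg_list(OpExprList *e)  { return e->args->head->ptr_value; }

/* Four dereferences for the second argument. */
static void *second_arg_list(OpExprList *e) { return e->args->head->next->ptr_value; }

/* Direct: one dereference each, no extra allocations. */
static void *first_arg_direct(OpExprDirect *e)  { return e->arg1; }
static void *second_arg_direct(OpExprDirect *e) { return e->arg2; }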
Re: [HACKERS] master and sync-replica diverging
Erik,

Are you taking the counts *while* the table is loading? In sync replication, it's possible for the counts to differ for a short time due to one of three things:

* the transaction has been saved to the replica, but the confirm message hasn't reached the master yet
* the replica has synched the transaction to the WAL log, but due to wal_delay settings hasn't yet applied it to the tables in memory (see the sketch after this message)
* the master is being updated with synchronous_commit = local

- Original Message -
> AMD FX 8120 / centos 6.2 / latest source (git head)
>
> It seems to be quite easy to force a 'sync' replica to not be equal
> to master by recreating+loading a table in a while loop.
> [...]
> Perhaps not a likely scenario, but surely such a deviating rowcount
> on a sync replica should not be possible?
>
> Erik Rijkers
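For the second case, the standby's monitoring functions make the receive/replay gap visible directly (a sketch; these functions exist under these names in 9.1/9.2):

-- Run on the standby: "received" is what has been synced to the standby's
-- WAL, "replayed" is what is actually visible to queries. A gap between
-- the two is the second case above.
SELECT pg_last_xlog_receive_location() AS received,
       pg_last_xlog_replay_location()  AS replayed;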
Re: [HACKERS] Strange issues with 9.2 pg_basebackup & replication
Jim, Fujii,

Even more fun:

1) Set up a server as a cascading replica (e.g. max_wal_senders = 3, standby_mode = on).

2) Connect the server to *itself* as a replica (see the sketch after this message).

3) This will work and report success, up until you do your first write.

4) Then ... segfault!

- Original Message -
> On 5/16/12 10:53 AM, Fujii Masao wrote:
> > On Wed, May 16, 2012 at 3:43 AM, Joshua Berkus wrote:
> >>
> >>> Before restarting it, you need to do pg_basebackup and make a base
> >>> backup onto the standby again. Since you started the standby without
> >>> recovery.conf, a series of WAL in the standby has gotten inconsistent
> >>> with that in the master. So you need a fresh backup to restart the
> >>> standby.
> >>
> >> You're not understanding the bug. The problem is that the standby
> >> came up and reported that it was replicating OK, when clearly it
> >> wasn't.
> >
> >> 8. Got this fatal error on the standby server:
> >>
> >> LOG: record with incorrect prev-link 0/7B8 at 0/7E0
> >> LOG: record with incorrect prev-link 0/7B8 at 0/7E0
> >>
> >> ... this error message repeated every 5s.
> >
> > According to your first report, ISTM you got error messages.
>
> Only *after* it was correctly setup.
>
> Josh's point is that if you flub the configuration, you should get an
> error, which is not what's happening now. Right now it just comes up
> and acts as if nothing's wrong.
> --
> Jim C. Nasby, Database Architect
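For concreteness, step 2 amounts to something like this (a sketch; the data directory path and port number are made up):

# The standby's recovery.conf points back at the standby's own port.
cat >> /data/standby/recovery.conf <<EOF
standby_mode = 'on'
primary_conninfo = 'host=127.0.0.1 port=5433'
EOF
pg_ctl -D /data/standby restart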
Re: [HACKERS] counting pallocs
Robert Haas writes:
> One piece of reasonably low-hanging fruit appears to be OpExpr. It
> seems like it would be better all around to put Node *arg1 and Node
> *arg2 in there instead of a list... aside from saving pallocs, it
> seems like it would generally simplify the code.

Obviously, Stephen Frost's list-allocation patch would affect your results here ... but I wonder how much the above change would affect *his* results. Specifically, the observation that most lists are 1 or 2 elements long would presumably become less true, but I wonder by how much exactly.

regards, tom lane
Re: [HACKERS] read() returns ERANGE in Mac OS X
On May 16, 2012, at 15:51, Tom Lane wrote:
> Alvaro Herrera writes:
>> We just came across a situation where a corrupted HFS+ filesystem
>> appears to return ERANGE on a customer machine. Our first reaction was
>> to turn zero_damaged_pages on to allow taking a pg_dump backup of the
>> database, but surprisingly this does not work. A quick glance at the
>> code shows the reason:
>> ...
>> Note that zero_damaged_pages only enters the picture if it's a short
>> read, not if the read actually fails completely.
>
>> Is this by design, or is this just an oversight?
>
> It is by design, in that the only contemplated case was truncated-away
> pages. I'm pretty hesitant to consider allowing arbitrary kernel errors
> to be ignored here …

Maybe we should have zero_missing_pages, which would only zero on short reads, and zero_damaged_pages, which would zero on all IO errors? Or we could have zero_damaged_pages zero only if read() reports EIO, and then add any platform-specific additional error codes as we learn about them. ERANGE on darwin would be the first such addition.

In any case, it seems to me that at least EIO should trigger zeroing, since that is presumably what you'd get on a filesystem with integrated checksums like ZFS.

best regards,
Florian Pflug
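A sketch of the proposed branching, for illustration only (this is not the actual md.c code, and the error-code list is exactly the open question):

#include <errno.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Sketch: which read() outcomes page-zeroing might cover. */
static int
read_block(int fd, char *buffer, int zero_damaged_pages)
{
    ssize_t nbytes = read(fd, buffer, BLCKSZ);

    if (nbytes == BLCKSZ)
        return 0;                      /* normal case */

    if (nbytes >= 0 && zero_damaged_pages)
    {
        memset(buffer, 0, BLCKSZ);     /* short read: the only case zeroed today */
        return 0;
    }

    if (nbytes < 0 && zero_damaged_pages &&
        (errno == EIO || errno == ERANGE))  /* proposed: hard IO errors too */
    {
        memset(buffer, 0, BLCKSZ);
        return 0;
    }

    return -1;                         /* report the error as before */
}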
Re: [HACKERS] Strange issues with 9.2 pg_basebackup & replication
On Thu, May 17, 2012 at 3:42 PM, Joshua Berkus wrote:
> Even more fun:
>
> 1) Set up a server as a cascading replica (e.g. max_wal_senders = 3,
> standby_mode = on )
>
> 2) Connect the server to *itself* as a replica.
>
> 3) This will work and report success, up until you do your first write.
>
> 4) Then ... segfault!

I cannot reproduce this. Attached is the script that I use for cascade replication testing. With it I can see the replica connecting to itself, but no segfault.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

[Attachment: test_cascading.sh, a Bourne shell script]
Re: [HACKERS] master and sync-replica diverging
On Thu, May 17, 2012 14:32, Joshua Berkus wrote:
> Erik,
>
> Are you taking the counts *while* the table is loading? In sync replication,
> it's possible for the counts to differ for a short time due to one of three
> things:
>
> * transaction has been saved to the replica and confirm message hasn't
> reached the master yet

The count(*) was done in the way that I showed, i.e. *after* psql had exited. My understanding is that, with synchronous replication 'on' and configured properly, psql could only return *after* the sync replica had the data safely on disk. Either my understanding is not correct or there is a bug in postgres sync-rep.

> * replica has synched the transaction to the WAL log, but due to wal_delay
> settings hasn't yet applied it to the tables in memory.

Settings are untouched.

> * updating the master with synchronous_commit = local.

During my test: synchronous_commit = on

Thanks,

Erik Rijkers
Re: [HACKERS] psql bug
Will investigate that.

Tom Lane wrote:
> Teodor Sigaev writes:
>> After editing query with external editor psql exits on Ctrl-C:
>
> FWIW, I failed to reproduce that on any of my machines. Maybe your
> editor is leaving the tty in a funny state?
>
> regards, tom lane

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
Re: [HACKERS] master and sync-replica diverging
On Thu, May 17, 2012 at 4:53 PM, Erik Rijkers wrote:
> The count(*) was done in the way that I showed, i.e. *after* psql had
> exited. My understanding is that, with synchronous replication 'on' and
> configured properly, psql could only return *after* the sync-replica had
> the data safely on disk. Either my understanding is not correct or there
> is a bug in postgres sync-rep.

Commit can only return when the sync replica has the data safely on disk, but this doesn't mean that it's visible yet.

The sequence of events, in dot notation:

commit_command -> master_wal_sync -> replica_wal_sync -> master_commit_visible -> commit_response
replica_wal_sync -> replica_replay_wal -> replica_commit_visible

If you issue a select on the replica after getting a commit response from the master, you can see that the query getting a snapshot races with replay of the commit record.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
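One way to make Erik's test deterministic is to wait for replay to catch up before counting. A sketch, with ports as in his setup (note that comparing WAL locations as strings is a simplification; 9.2's pg_xlog_location_diff() would do this properly):

# After the COPY commits on the master (6564), wait until the sync
# replica (6565) has replayed up to the master's current WAL position.
target=$(echo "select pg_current_xlog_location()" | psql -qtXp 6564)
while :; do
  replayed=$(echo "select pg_last_xlog_replay_location()" | psql -qtXp 6565)
  [ "$replayed" = "$target" ] || [ "$replayed" \> "$target" ] && break
  sleep 0.1
done
echo "select count(*) from $table" | psql -qtXp 6565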
Re: [HACKERS] master and sync-replica diverging
On Thu, May 17, 2012 16:10, Ants Aasma wrote:
> Commit can only return when the sync replica has the data safely on disk,
> but this doesn't mean that it's visible yet.
> [...]
> If you issue a select on the replica after getting a commit response
> from the master, you can see that the query getting a snapshot races
> with replay of the commit record.

Ah yes, that makes sense. I hadn't thought of that. Thank you for that explanation.
Re: [HACKERS] Missing optimization when filters are applied after window functions
Hitoshi Harada schrieb:
> On Wed, May 16, 2012 at 12:50 AM, Volker Grabsch wrote:
> > I propose the following general optimization: If all window
> > functions are partitioned by the same first field (here: id),
> > then any filter on that field should be executed before
> > WindowAgg. So a query like this:
>
> I think that's possible. Currently the planner doesn't think any
> qualification from the upper query can be pushed down to a query that
> has a window function. It would be possible to let it push down if
> the expression matches the PARTITION BY expression.

Sounds great!

> However, the
> challenge is that a query may have a number of window functions that
> have different PARTITION BY expressions. At the time of pushing down
> in the planning, it is not obvious which window function comes first.

I don't really understand what you mean by "which window function comes first", because to my understanding, all window functions of a query belong to the same level in the query hierarchy. But then, my knowledge of PostgreSQL internals isn't very deep, either.

> One idea is to restrict such optimization to only the case of a single
> window function; the other is to generalize it and cover most cases.

From a practical point of view, the restriction to a single window function wouldn't be that bad, although I'd prefer to think about the number of different windows rather than the number of window functions. In other words, every optimization that is correct for a single window function is also correct for multiple window functions if those all use the same window.

> That said, our planner on window functions has a lot of improvement to
> be done. Every kind of optimization I see is what I raised above;
> they can be done easily by hacking in a small case, or they can be
> done by generalizing for most cases. My understanding is our
> project tends to like the latter; it takes a little more time but
> covers more use cases.

I'd also prefer to see a general solution, as this provides less room for unpleasant surprises (e.g. "This query is only slightly different from the previous one. Why does it take so much longer?"). On the other hand, any small improvement is a big step forward regarding window functions.

Unfortunately, I can't volunteer on that, as it is currently impossible for me to allocate enough time for this. However, any pointer to where to look in the source (or in the manual) would be great. Maybe I'll find at least enough time to provide a rough proposal, or to improve existing attempts to solve this issue.

Also, is there any chance to include a (simple) attempt at such an optimization into the PostgreSQL 9.2 beta, or is this only a possible topic for 9.3 and later?

Regards,
Volker

--
Volker Grabsch
---<<(())>>---
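To recap the proposal with a concrete example (table and column names are made up for illustration):

-- Today, the outer filter runs only after WindowAgg has processed every row:
SELECT *
FROM (SELECT id,
             sum(amount) OVER (PARTITION BY id) AS id_total
      FROM payments) AS sub
WHERE id = 42;

-- Because every window is partitioned by id, filtering on id first is
-- semantically equivalent; this is the rewrite one does by hand today,
-- and what the proposed optimization would do automatically:
SELECT id,
       sum(amount) OVER (PARTITION BY id) AS id_total
FROM payments
WHERE id = 42;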
Re: [HACKERS] Pre-alloc ListCell's optimization
* Robert Haas (robertmh...@gmail.com) wrote:
> So I guess the first question here is - does it improve performance?
>
> Because if it does, then it's worth pursuing ... if not, that's the
> first thing to fix.

Alright, so I've done some pgbench runs using an all-default config with just a straight-up './configure' and pgbench -S -T 300, 3 runs each, then averaged:

llist_opt: 9289 tps
HEAD:      9286 tps

I realize we see tons of palloc() calls happening, but now I'm wondering if they really contribute all that much time, overall. Also, I'm wondering if all the benefit from removing the palloc()s is being sucked up by the regression in list_concat(). A few folks have mentioned just going whole-hog and doing all list allocations in blocks of 8, which would actually allow us to go back to an O(1) list_concat, though we wouldn't be able to free the 2nd list passed to list_concat in that case, which may or may not be acceptable, based on how necessary those couple of pfree()s we had previously are.

Thanks,
Stephen
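A sketch of the block-allocation idea under discussion (illustrative only; these are not the actual pg_list.h structures or the patch itself):

#include <stdlib.h>

#define CELLS_PER_BLOCK 8

typedef struct CellBlock
{
    void             *cells[CELLS_PER_BLOCK];
    int               nused;
    struct CellBlock *next;
} CellBlock;

typedef struct BlockList
{
    int        length;
    CellBlock *head;
    CellBlock *tail;
} BlockList;

/* Appends only allocate once per 8 cells. */
static void
block_list_append(BlockList *list, void *datum)
{
    if (list->tail == NULL || list->tail->nused == CELLS_PER_BLOCK)
    {
        CellBlock *blk = calloc(1, sizeof(CellBlock));  /* palloc0 in PG */

        if (list->tail)
            list->tail->next = blk;
        else
            list->head = blk;
        list->tail = blk;
    }
    list->tail->cells[list->tail->nused++] = datum;
    list->length++;
}

/* O(1) concat: splice list2's chain onto list1. The downside raised in the
 * thread: list2's blocks are now shared, so they cannot be pfree'd. */
static void
block_list_concat(BlockList *list1, BlockList *list2)
{
    if (list2->head == NULL)
        return;
    if (list1->tail)
        list1->tail->next = list2->head;
    else
        list1->head = list2->head;
    list1->tail = list2->tail;
    list1->length += list2->length;
}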
Re: [HACKERS] Why is indexonlyscan so darned slow?
On Thu, May 17, 2012 at 5:22 AM, Joshua Berkus wrote:
> Ants,
>
> Well, that's somewhat better, but again hardly the gain in performance
> I'd expect to see ... especially since these are ideal circumstances
> for index-only scan.
> [...]
> For some reason counting tuples in an index takes 5X as long (per
> tuple) as counting them in a table. Why?

It looks like the IOS is taking 4x less time, not more time.

Anyway, the IOS follows the index's logical structure, not the physical structure, so if the index is not in RAM it will really be hurt by the lack of sequential reads.

Cheers,
Jeff
Re: [HACKERS] Strange issues with 9.2 pg_basebackup & replication
On Thu, May 17, 2012 at 12:01 PM, Joshua Berkus wrote:
>> > And: if we still have to ship logs, what's the point in even having
>> > cascading replication?
>>
>> At least cascading replication (1) allows you to adopt more flexible
>> configuration of servers,
>
> I'm just pretty shocked. The last time we talked about this, at the end of
> the 9.1 development cycle, you almost had remastering using streaming-only
> replication working, you just ran out of time. Now it appears that you've
> abandoned working on that completely. What's going on?

You mean that "remastering" is, after promoting one of the standby servers, making the remaining standbys reconnect to the new master and resolve the timeline gap without a shared archive? Yep, that's one of my TODO items, but I'm not sure I'll have enough time to implement that for 9.3.

Regards,

--
Fujii Masao
Re: [HACKERS] Strange issues with 9.2 pg_basebackup & replication
On Thu, May 17, 2012 at 10:42 PM, Ants Aasma wrote:
> On Thu, May 17, 2012 at 3:42 PM, Joshua Berkus wrote:
>> Even more fun:
>>
>> 1) Set up a server as a cascading replica (e.g. max_wal_senders = 3,
>> standby_mode = on )
>>
>> 2) Connect the server to *itself* as a replica.
>>
>> 3) This will work and report success, up until you do your first write.
>>
>> 4) Then ... segfault!
>
> I cannot reproduce this.

Me neither.

Josh, could you show me the more detailed procedure to reproduce the problem?

Regards,

--
Fujii Masao
Re: [HACKERS] Missing optimization when filters are applied after window functions
2012/5/17 Volker Grabsch:
> Also, is there any chance to include a (simple) attempt at
> such an optimization into the PostgreSQL 9.2 beta, or is this
> only a possible topic for 9.3 and later?

For 9.2, you're about 4 months late :-). The last commitfest was in January:
https://commitfest.postgresql.org/action/commitfest_view?id=13

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
Re: [HACKERS] Why is indexonlyscan so darned slow?
Jeff,

That's in-RAM speed ... I ran the query twice to make sure the index was cached, and it didn't get any better. And I meant 5X per byte rather than 5X per tuple.

I talked this over with Haas, and his opinion is that we have a LOT of overhead in the way we traverse indexes, especially lookups which happen once per leaf node instead of in bulk. Certainly the performance I'm seeing would be consistent with that idea.

I'll try some multi-column covering indexes next to see how it looks.

- Original Message -
> It looks like the IOS is taking 4x less time, not more time.
>
> Anyway, the IOS follows the index logical structure, not the physical
> structure, so if the index is not in RAM it will really be hurt by the
> lack of sequential reads.
>
> Cheers,
>
> Jeff
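For reference, a "covering" index here is just a multi-column index whose columns satisfy the whole query; a sketch against the standard pgbench schema:

-- An index covering a balance lookup, so an index-only scan can answer it
-- without touching the heap (assuming the visibility map is current,
-- e.g. right after a VACUUM).
CREATE INDEX pgbench_accounts_aid_abalance
    ON pgbench_accounts (aid, abalance);

VACUUM pgbench_accounts;

EXPLAIN (ANALYZE)
SELECT aid, abalance FROM pgbench_accounts WHERE aid BETWEEN 1 AND 100000;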
Re: [HACKERS] Why is indexonlyscan so darned slow?
On Thu, May 17, 2012 at 11:35 AM, Joshua Berkus wrote:
> Jeff,
>
> That's in-RAM speed ... I ran the query twice to make sure the index was
> cached, and it didn't get any better. And I meant 5X per byte rather than
> 5X per tuple.

Ah, OK, that makes more sense.

I played around with this, specifically count(*), quite a bit when IOS first came out, and I attributed a large part of the time to the code that forms a tuple out of raw bytes, and the code that advances the aggregate. The first one is probably more a per-tuple cost than per-byte, and the second definitely is a per-tuple cost.

I can't find my detailed notes from this work, so this is just from memory.

Cheers,
Jeff
Re: [HACKERS] Strange issues with 9.2 pg_basebackup & replication
Yeah, I don't know how I produced the crash in the first place, because of course the self-replica should block all writes, and retesting it I can't get it to accept a write.

So the bug is just that you can connect a server to itself as its own replica. Since I can't think of any good reason to do this, we should simply error out on startup if someone sets things up that way. How can we detect that we've connected streaming replication to the same server?

- Original Message -
> On Thu, May 17, 2012 at 10:42 PM, Ants Aasma wrote:
> > I cannot reproduce this.
>
> Me neither.
>
> Josh, could you show me the more detailed procedure to reproduce the
> problem?
>
> Regards,
>
> --
> Fujii Masao
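As a stopgap, the condition can at least be approximated from outside the server (a naive sketch; the path is made up, and a real in-server check would have to handle hostnames, multiple interfaces, and so on):

PGDATA=/data/standby
my_port=$(psql -Atc "show port")
if grep -E "primary_conninfo.*(localhost|127\.0\.0\.1).*port=$my_port" \
        "$PGDATA/recovery.conf" >/dev/null
then
    echo "WARNING: primary_conninfo appears to point at this server itself"
fi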
Re: [HACKERS] psql bug
>> FWIW, I failed to reproduce that on any of my machines. Maybe your
>> editor is leaving the tty in a funny state?

It seems the system() call cleans up the sigaction state on FreeBSD. I modified

void
setup_cancel_handler(void)
{
    fprintf(stderr, "%p -> %p\n", pqsignal(SIGINT, handle_sigint), handle_sigint);
}

and added calls to it around the system() call. Next:

% export EDITOR=echo
% psql postgres
0x0 -> 0x409620
SET
Timing is on.
psql (9.2beta1)
Type "help" for help.

postgres=# \e
0x409620 -> 0x409620
/tmp/psql.edit.7997.sql
0x0 -> 0x409620
postgres=#

--
Teodor Sigaev
E-mail: teo...@sigaev.ru
WWW: http://www.sigaev.ru/
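If that is what's happening, the fix would presumably be to re-install the handler after system() returns. A sketch (illustrative, not the actual psql patch; psql would use pqsignal() rather than signal()):

#include <signal.h>
#include <stdlib.h>

extern void handle_sigint(int signum);   /* psql's existing handler, assumed */

static int
run_editor(const char *command)
{
    int result = system(command);

    /* Defensive: restore our handler in case system() clobbered it. */
    signal(SIGINT, handle_sigint);
    return result;
}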