Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-24 Thread Florian Pflug
On Oct24, 2011, at 01:27 , Simon Riggs wrote:
 FATAL:  could not access status of transaction 21110784
 which, in pg_subtrans, is the first xid on a new subtrans page. So we
 have missed zeroing a page.
 
 pg_control shows ... Latest checkpoint's oldestActiveXID:  2111
 which shows quite clearly that the pg_control file is later than it should be.

But shouldn't pg_control be largely irrelevant in a hot backup scenario? Most
(all?) of the information contained therein should be overwritten with the
contents of the checkpoint referenced by the backup label, shouldn't it?

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-24 Thread Florian Pflug
On Oct23, 2011, at 22:48 , Daniel Farina wrote:
 It doesn't seem meaningful for StartupCLOG (or, indeed, any of the
 hot-standby path functionality) to be called before that code is
 executed, but it is anyway right now.

I think the idea is to check that the CLOG part which recovery *won't*
overwrite is consistent (or rather, given the simplicity of the check,
at least accessible).

Heikki said the following somewhere else in this thread when I suggested
something similar to your proposal:

 There are pretty clear rules on what state clog can be in. When you launch 
 postmaster in a standby:
 
 * Any clog preceding the nextXid from the checkpoint record we start 
 recovery from, must either be valid, or the clog file must be missing 
 altogether (which can happen when it was vacuumed away while the backup was in 
 progress - if the clog is still needed at the end of the backup it must not be 
 missing, of course).
 * Any clog following nextXid can be garbled or missing.
 
 Recovery will overwrite any clog after nextXid from the WAL, but not the 
 clog before it.

I think Simon's theory that we're starting recovery from the wrong place,
i.e. should start with an earlier WAL location, is probably correct. The
question is, why?

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Silent failure with invalid hba_file setting

2011-10-24 Thread Thom Brown
On 19 October 2011 05:20, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Eisentraut pete...@gmx.net writes:
 I wouldn't add extra special checks for that.  It might not be
 completely unreasonable to have a standby that no one can connect to,
 for example.

 Well, you couldn't monitor its state then, so I don't find that example
 very convincing.  But if you were intent on having that, you could
 easily set up a pg_hba file containing only reject entries.

I hadn't noticed you'd committed some changes around this until now. Thanks.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] termination of backend waiting for sync rep generates a junk log message

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 10:05 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 as it seems to me that any
 client that is paranoid enough to care about sync rep had better
 already be handling the case of a connection loss during commit.

 Agreed, but that is a problem that by definition we can't help with.
 Also, the issue with connection loss is that you really can't know
 whether your transaction got committed without reconnecting and looking
 for evidence.  There is no reason at all to inject such uncertainty
 into the cancel-SyncRepWaitForLSN case.  We know the transaction got
 committed,

I disagree.  The whole point of synchronous replication is that the
user is worried about the case where the primary goes away just after
the commit is acknowledged to the client.  Consider the following
scenario: Someone has determined that it can't be reached from 90% of
the corporate Internet, but the synchronous standby, which is
naturally on another network, still has connectivity.  So they log
into the master and perform a fast shutdown.  When they reconnect, the
connection pooler (or other mechanism) redirects their connection to
the standby, which has since been promoted.  ISTM that the client had
darn well better go search for hard evidence about the transaction
state.

 But I think that throwing an ERROR is likely to cause a LOT of client
 breakage, even if you have some special (human-invisible?) flag that
 indicates that you don't really mean it.  If we must do something
 other than simulating a server disconnect, letting the command
 completion message go through and annotating it with a NOTICE or
 WARNING seems preferable.

 I think you're thinking narrowly of the SyncRepWaitForLSN case.  What
 I'm trying to point out is that there's a boatload of post-commit code
 which is capable of sometimes throwing errors, and that's not ever
 going to go away completely.

 It might be that it'd work to deal with this by reducing the reported
 strength of all such cases from ERROR to WARNING.  Not sure that that's
 a good idea, but it might work.

It's hard to be sure that a systematic approach will work.  For
example, if we can't nuke a memory context for some reason, it
wouldn't be utterly crazy to just ignore the problem and try to
soldier on.  We've probably leaked some memory, but oh well.  If we've
failed to release a heavyweight lock we had better call
LockReleaseAll() somehow, but the details of what gets sent to the
client are negotiable and a WARNING is probably fine.  On the other
hand, if we experienced some failure that affects our ability to make
the transaction globally visible (like we wrote the commit record but
then fail trying to acquire ProcArrayLock to clear our xmin), it's
hard to believe that anything other than PANIC is enough.

Because of that and similar cases elsewhere, including for example
inside the lock manager, I've long been feeling grumpy about this:

/* Ensure we will have room to remember the lock */
if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
    elog(ERROR, "too many LWLocks taken");

It seems to me that the idea that the abort path is going to be able
to recover from that situation is wildly optimistic.  Fortunately, our
coding practices are good enough that I think it never happens anyway,
but if it does it should surely PANIC.
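
To make that concrete, the change being argued for is just a severity bump on
the snippet quoted above (a sketch, not a proposed patch):

    /* Ensure we will have room to remember the lock */
    if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
        elog(PANIC, "too many LWLocks taken");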

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updated version of pg_receivexlog

2011-10-24 Thread Magnus Hagander
On Mon, Oct 24, 2011 at 16:12, Jaime Casanova ja...@2ndquadrant.com wrote:
 On Mon, Oct 24, 2011 at 7:40 AM, Magnus Hagander mag...@hagander.net wrote:

 synchronous_standby_names='*' is prone to such confusion in general, but it
 seems that it's particularly surprising if a running pg_basebackup lets a
 commit in synchronous replication to proceed. Maybe we just need a warning
 in the docs. I think we should advise that synchronous_standby_names='*' is
 dangerous in general, and cite this as one reason for that.

 Hmm. I think this is common enough that we want to make sure we avoid
 it in code.

 Could we pass a parameter from the client indicating to the master
 that it refuses to be a sync slave? An optional keyword to the
 START_REPLICATION command, perhaps?


 can't you execute set synchronous_commit to off/local for this connection?

This is a walsender connection, it doesn't take SQL. Plus it's the
receiving end, and SET sync_commit is for the sending end.

That said, we are reasonably safe with the current implementation, because
it always sets the flush location to InvalidXLogRecPtr, so it will not be
considered as a sync slave. Should we ever start accepting the write location
as the point to sync against, the problem will show up, of course.
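
Roughly, that safety comes from the fact that a walsender is only eligible to
be the synchronous standby once it has reported a valid flush position; a
simplified sketch of that eligibility test (not the actual syncrep.c code,
field names follow the walsender shared state):

    if (walsnd->pid != 0 &&
        walsnd->sync_standby_priority > 0 &&
        !XLogRecPtrIsInvalid(walsnd->flush))
    {
        /* candidate for synchronous standby */
    }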

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)

2011-10-24 Thread Alexander Korotkov
Hi!

On Mon, Oct 17, 2011 at 12:38 PM, Jeff Davis pg...@j-davis.com wrote:

  I started implementing subtype_diff, and I noticed that it requires
  defining an extra function for each range type. Previously, the numeric
  types could just use a cast, which was convenient for user-defined range
  types.
 
  If you have any other ideas to make that cleaner, please let me know.
  Otherwise I'll just finish implementing subtype_diff.

I think implementing subtype_diff for each datatype is OK.  We could
implement some universal function based on the minus operator and a cast to
double precision, but such a solution might be unacceptable in terms of both
predictability (the operator and casting function might not do what we
expect) and performance.

I'm beginning to think that we should just allow the user to specify
 their own gist_penalty function. Specifying just the subtype_diff
 doesn't save much time, and it can only be limiting. Additionally, it's
 harder for users to understand the purpose of the function.

If we allow the user to specify their own gist_penalty function, then such a
function would have to deal with:
1) GiST-specific data structures such as GISTENTRY.
2) Decomposing ranges using range_deserialize.
3) Infinities, which we could otherwise handle in the general penalty function.
That's why I prefer to implement subtype_diff.
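
For illustration, a per-type subtype_diff for a float8-based range can be as
small as this (a sketch following the usual fmgr conventions; the function
name is only illustrative):

    #include "postgres.h"
    #include "fmgr.h"

    Datum
    float8range_subdiff(PG_FUNCTION_ARGS)
    {
        float8  v1 = PG_GETARG_FLOAT8(0);
        float8  v2 = PG_GETARG_FLOAT8(1);

        /* distance between two subtype values, as a double, for the penalty */
        PG_RETURN_FLOAT8(v1 - v2);
    }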

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] ALTER TABLE ONLY ...DROP CONSTRAINT is broken in HEAD.

2011-10-24 Thread Robert Haas
On Mon, Sep 12, 2011 at 11:53 AM, Alexey Klyukin
al...@commandprompt.com wrote:
 This works in 9.1, but not in HEAD:

 CREATE TABLE parent(id INTEGER, CONSTRAINT id_check CHECK(id1));
 CREATE TABLE child() INHERITS(parent);

 ALTER TABLE ONLY parent DROP CONSTRAINT id_check;

 I'm getting:
 ERROR:  relation 16456 has non-inherited constraint "id_check"
 where 16456 is the oid of the child table.

 It seems that the pg_constraint scan at ATExecDropConstraint 
 (tablecmds.c:6751) is re-reading those tuples that were updated in the 
 previous iterations of this scan, at least that's what I've observed in gdb. 
 I'm not sure how to fix this yet.

Woops, seems this got overlooked.  It's been fixed, though: see commit
c0f03aae0469e758964faac0fb741685170c39a5.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Use new oom_score_adj without a new compile-time constant

2011-10-24 Thread Robert Haas
On Fri, Sep 23, 2011 at 4:05 PM, Dan McGee d...@archlinux.org wrote:
 On Fri, Sep 23, 2011 at 2:49 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Sep 19, 2011 at 4:36 PM, Dan McGee d...@archlinux.org wrote:
 [ patch ]

 I suppose it's Tom who really needs to comment on this, but I'm not
 too enthusiastic about this approach.  Duplicating the Linux kernel
 calculation into our code means that we could drift if the formula
 changes again.
 Why would the formula ever change? This seems like a different excuse
 way of really saying you don't appreciate the hacky approach, which I
 can understand completely. However, it simply doesn't make sense for
 them to change this formula, as it scales the -17 to 16 old range to
 the new -1000 to 1000 range. Those endpoints won't be changing unless
 a third method is introduced, in which case this whole thing is moot
 and we'd need to fix it yet again.

 I like Tom's previous suggestion (I think) of allowing both constants
 to be defined - if they are, then we try oom_score_adj first and fall
 back to oom_adj if that fails.  For bonus points, we could have
 postmaster stat() its own oom_score_adj file before forking and set a
 global variable to indicate the results.  That way we'd only ever need
 to test once per postmaster startup (at least until someone figures
 out a way to swap out the running kernel without stopping the
 server...!).
 This would be fine, it just seems unreasonably complicated, not to
 mention unnecessary. I might as well point this out [1]. I don't think
 a soul out there has built without defining this to 0 (if they define
 it at all), and not even that many people are using it. Is it all that
 bad of an idea to just force it to 0 for both knobs and be done with
 it?
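
(For reference, the old-to-new scaling described above mirrors the kernel's
documented conversion between the two knobs; this is only a sketch, not code
from the patch:)

    /* Map a legacy oom_adj value (-17..15) onto the oom_score_adj scale
     * (-1000..1000); -17, the "disable" value, maps to -1000 exactly. */
    static int
    oom_adj_to_score_adj(int oom_adj)
    {
        return oom_adj * 1000 / 17;
    }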

Did we do anything about this?  Anyone else have an opinion on what
ought to be done?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-10-24 Thread Heikki Linnakangas

On 24.10.2011 15:29, Fujii Masao wrote:

+<listitem>
+ <para>
+  Copy the pg_control file from the cluster directory to the global
+  sub-directory of the backup. For example:
+ <programlisting>
+ cp $PGDATA/global/pg_control /mnt/server/backupdir/global
+ </programlisting>
+ </para>
+</listitem>


Why is this step required? The control file is overwritten by 
information from the backup_label anyway, no?



+<listitem>
+ <para>
+  Again connect to the database as a superuser, and execute
+  <function>pg_stop_backup</>. This terminates the backup mode, but does not
+  perform a switch to the next WAL segment, create a backup history file and
+  wait for all required WAL segments to be archived,
+  unlike that during normal processing.
+ </para>
+</listitem>


How do you ensure that all the required WAL segments have been archived, 
then?



+   </orderedlist>
+</para>
+
+<para>
+ You cannot use the <application>pg_basebackup</> tool to take the backup
+ from the standby.
+</para>


Why not? We have cascading replication now.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] termination of backend waiting for sync rep generates a junk log message

2011-10-24 Thread Robert Haas
On Sun, Oct 23, 2011 at 6:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Well, there is a general problem that anything which throws an ERROR
 too late in the commit path is Evil; and sync rep makes that worse to
 the extent that it adds more stuff late in the commit path, but it
 didn't invent the problem.

 BTW, it strikes me that if we want to do something about that, it ought
 to be possible; but it has to be built into error handling, not a
 localized hack for sync rep.

 Consider a design along these lines: we invent a global flag that gets
 set at some appropriate point in RecordTransactionCommit (probably right
 where we exit the commit critical section) and is not cleared until we
 send a suitable message to the client --- I think either
 command-complete or an error message would qualify, but that would have
 to be analyzed more carefully than I've done so far.  If elog.c is told
 to send an error message while this flag is set, then it does something
 special to inform the client that this was a post-commit error and the
 xact is in fact committed.

 My inclination for the something special would be to add a new error
 message field, but that could be difficult for clients to examine
 depending on what sort of driver infrastructure they're dealing with.
 You could also imagine emitting a separate NOTICE or WARNING message,
 which is analogous to the current hack in SyncRepWaitForLSN, but seems
 pretty ugly because it requires clients to re-associate that event with
 the later error message.  (But it might be worth doing anyway for human
 users, even if we provide a different flag mechanism that is intended
 for program consumption.)  Or maybe we could override the SQLSTATE with
 some special value.  Or something else.

 Given infrastructure like this, it would be reasonable for
 SyncRepWaitForLSN to just throw an ERROR if it gets an interrupt,
 instead of trying to kluge its own solution.
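
(For concreteness, a rough sketch of the flag-based scheme described above;
all names here are hypothetical, not an actual patch:)

    /* Set once the commit record is safely down; cleared when any message
     * (command-complete or error) has been sent to the client. */
    static bool post_commit_pending = false;

    void
    MarkCommitDurable(void)
    {
        /* called from RecordTransactionCommit(), on leaving the critical section */
        post_commit_pending = true;
    }

    bool
    ErrorIsPostCommit(void)
    {
        /* elog.c could consult this to flag an outgoing error as post-commit */
        return post_commit_pending;
    }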

I actually think that emitting a NOTICE or WARNING and then slamming
the connection shut is quite elegant, as it seems to me that any
client that is paranoid enough to care about sync rep had better
already be handling the case of a connection loss during commit.  I
realize that handling query cancellation in a somewhat different way
is a wart, though, and I'm not necessarily opposed to changing the
behavior.

But I think that throwing an ERROR is likely to cause a LOT of client
breakage, even if you have some special (human-invisible?) flag that
indicates that you don't really mean it.  If we must do something
other than simulating a server disconnect, letting the command
completion message go through and annotating it with a NOTICE or
WARNING seems preferable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-10-24 Thread Heikki Linnakangas

On 24.10.2011 15:29, Fujii Masao wrote:

In your patch, FPW is always WAL-logged at startup even when FPW has
not been changed since last shutdown. I don't think that's required.
I changed the recovery code so that it keeps track of last FPW indicated
by WAL record. Then, at end of startup, if that FPW is equal to FPW
specified in postgresql.conf (which means that FPW has not been changed
since last shutdown or crash), WAL-logging of FPW is skipped. This change
prevents unnecessary WAL-logging. Thought?
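
(A sketch of the skip being described, with hypothetical names; the real patch
may structure this differently:)

    /* At end of recovery: only WAL-log the full_page_writes setting if it
     * differs from the last value recovery saw in the WAL. */
    static void
    MaybeLogFullPageWrites(bool configuredFpw, bool lastFpwSeenInWAL)
    {
        if (configuredFpw != lastFpwSeenInWAL)
            LogFullPageWritesChange(configuredFpw);     /* hypothetical */
    }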


One problem with this whole FPW-tracking is that pg_lesslog makes it 
fail. I'm not sure what we need to do about that - maybe just add a 
warning to the docs. But it leaves a bit of a bad feeling in my mouth. 
Usually we try to make features work orthogonally, without dependencies 
to other settings. Now this feature requires that full_page_writes is 
turned on in the master, and also that you don't use pg_lesslog to 
compress the WAL segments or your base backup might be corrupt. The 
procedure to take a backup from the standby seems more complicated than 
taking it on the master - there are more steps to follow.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [9.1] unusable for large views

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 4:57 AM, Omar Bettin
o.bet...@informaticaindustriale.it wrote:
 I have tried 9.1.1 win64 version and when I am trying to declare a cursor
 for a very large view (lot of joins and aggregate functions),

 postgres is using around 3GB of memory and the query never returns.

Hmm.  A 59-table join is pretty enormous.

I wish we had a better way to handle these kinds of queries.  Odds are
good that the join order doesn't matter much, and in an ideal world we
would be able to notice that and just use some simple heuristic to
pick a tolerably good one.  As it is, I am a bit surprised to hear
that GEQO isn't bailing you out.

Can you EXPLAIN a query against that view, or does even that wipe out?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


R: [HACKERS] [9.1] unusable for large views (SOLVED)

2011-10-24 Thread Omar Bettin
...sorry guys...

It was a misconfiguration of the database.

9.1.1 is working well.

It is ~4% faster than 9.0.5 for the same query.

Thanks a lot.

Regards




-----Original Message-----
From: Pavel Stehule [mailto:pavel.steh...@gmail.com]
Sent: Monday, 24 October 2011 12:13
To: Omar Bettin
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [9.1] unusable for large views

Hello

please, send a result of explain analyze on 9.1.1 and older

please, use http://explain.depesz.com/

Regards

Pavel Stehule

2011/10/24 Omar Bettin o.bet...@informaticaindustriale.it:
 Hello,



 I have tried 9.1.1 win64 version and when I am trying to declare a cursor
 for a very large view (lot of joins and aggregate functions),

 postgres is using around 3GB of memory and the query never returns.



 Same problem selecting from the view without a cursor.



 Same query worked fine from 8.3.3 to 9.0.5.



 Should I change some configuration params to have the same behavior as
 previous versions?







 Tried on Win2008 server R2  64bit 8GB RAM.

 also on Win7 64bit 8GB RAM.

 default postgresql.conf



 Regards,





 The view (!)



 CREATE OR REPLACE VIEW articoli_rip_ven20110227_view AS

  SELECT a.pk_id, a.articolo, a.articolo_f, a.origine, ( SELECT
 ditte.attivita

FROM ditte

   WHERE ditte.partitario = a.cod_fornit) AS dummy_1, a.descrizion,
 a.taglia, a.peso_spec, a.giacenza_colli, a.giacenza, rim.colli AS
 rimanenza_colli, rim.quantita AS rimanenza_qta, rimass.colli AS
 rimanenza_ass_colli, rimass.quantita AS rimanenza_ass_qta,
 COALESCE(b20.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b21.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b22.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b23.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b24.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b25.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b26.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b27.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b28.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b29.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b30.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b31.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b32.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b33.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b34.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b35.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b36.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b37.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b38.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b39.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b40.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b41.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b42.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b43.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b44.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b45.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b46.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b47.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b48.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b49.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b50.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b51.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b52.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b53.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b54.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b55.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b56.quantita, NULL::numeric, 0::numeric)::double precision +
 (COALESCE(rim.quantita, NULL::numeric, 0::numeric) +
 COALESCE(rimass.quantita, NULL::numeric, 0::numeric))::double precision -
 (COALESCE(b1.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b2.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b3.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b4.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b5.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b6.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b7.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b8.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b9.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b10.quantita, NULL::numeric, 0::numeric)::double 

Re: [HACKERS] autovacuum and orphaned large objects

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 10:25 AM, Euler Taveira de Oliveira
eu...@timbira.com wrote:
 On 24-10-2011 10:57, Robert Haas wrote:

 I think the main reason why vacuumlo is a contrib module rather than
 in core is that it is just a heuristic, and it might not be what
 everyone wants to do.  You could store a bunch of large objects in the
 database and use the returned OIDs to generate links that you email to
 users, and then when the user clicks on the link we retrieve the
 corresponding LO and send it to the user over HTTP.  In that design,
 there are no tables in the database at all, yet the large objects
 aren't orphaned.

 Wow, what a strange method to solve a problem and possibly bloat your
 database. No, I'm not suggesting that we forbid it. The proposed method
 could clean up orphaned LOs in 95% (if not 99%) of the use cases.

 I've never heard of someone using LOs the way you describe. It seems strange
 that someone would distribute an OID number but not store a reference to it
 in the same database. Yes, it is a possibility, but ...

I guess we could make it an optional behavior, but once you go that
far then you have to wonder whether what's really needed here is a
general-purpose task scheduler.  I mean, the autovacuum launcher's
idea about how often to vacuum the database won't necessarily match
the user's idea of how often they want to vacuum away large objects -
and if the user is doing something funky (like storing arrays of large
object OIDs, or inexplicably storing them using numeric or int8) then
putting it in the backend removes a considerable amount of
flexibility.  Another case where vacuumlo will fall over is if you
have a very, very large table with an OID column, but with lots of
duplicate values so that the number of OIDs actually referenced is
much smaller.  You might end up doing a table scan on the large table
every time this logic kicks in, and that might suck.

I'm sort of unexcited about the idea of doing a lot of engineering
around this; it seems to me that the only reasons we still have a
separate large object facility rather than just letting everyone go
through regular tables with toastable columns are (1) the size limit
is 2GB rather than 1GB and (2) you can read and write parts of objects
rather than the whole thing.  If we're going to do some more
engineering here, I'd rather set our sights a little higher.
Complaints I often hear about the large object machinery include (1)
2GB is still not enough, (2) 4 billion large objects is not enough,
(3) the performance is inadequate, particularly with large numbers of
large objects from possibly-unrelated subsystems slammed into a single
table, and (4) it would be nice to be able to do partial reads and writes
on any toastable field, not just large objects.  I'm not saying that
the problem you're complaining about isn't worth fixing in the
abstract, and if it seemed like a nice, clean fix I'd be all in favor,
but I just don't think it's going to be very simple, and for the
amount of work involved I'd rather get a little bit more bang for the
buck.

Of course, you don't have to agree with me on any of this; I'm just
giving you my take on it.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


R: [HACKERS] [9.1] unusable for large views (SOLVED)

2011-10-24 Thread Omar Bettin
Hi Tom,

...for about two hours I have been trying to say that the problem has been
solved, but I do not see my messages on the mailing list...

Anyway,
the problem was a bad installation of the database (pgsql functions).

9.1.1 is working well.

It is 4% to 8% faster than 9.0.5.

Thanks a lot to everyone.

Regards,

Omar


-----Original Message-----
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
Sent: Monday, 24 October 2011 16:46
To: Omar Bettin
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [9.1] unusable for large views

Omar Bettin o.bet...@informaticaindustriale.it writes:
 I have tried 9.1.1 win64 version and when I am trying to declare a cursor
 for a very large view (lot of joins and aggregate functions),

 postgres is using around 3GB of memory and the query never returns.

Could we see a self-contained test case?  I'm not about to try to
reverse-engineer the schema that goes with such a monster query.

http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 11:33 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 24.10.2011 15:29, Fujii Masao wrote:

 In your patch, FPW is always WAL-logged at startup even when FPW has
 not been changed since last shutdown. I don't think that's required.
 I changed the recovery code so that it keeps track of last FPW indicated
 by WAL record. Then, at end of startup, if that FPW is equal to FPW
 specified in postgresql.conf (which means that FPW has not been changed
 since last shutdown or crash), WAL-logging of FPW is skipped. This change
 prevents unnecessary WAL-logging. Thought?

 One problem with this whole FPW-tracking is that pg_lesslog makes it fail.
 I'm not sure what we need to do about that - maybe just add a warning to the
 docs. But it leaves a bit of a bad feeling in my mouth. Usually we try to make
 features work orthogonally, without dependencies to other settings. Now this
 feature requires that full_page_writes is turned on in the master, and also
 that you don't use pg_lesslog to compress the WAL segments or your base
 backup might be corrupt. The procedure to take a backup from the standby
 seems more complicated than taking it on the master - there are more steps
 to follow.

Doing it on the master isn't as easy as I'd like it to be, either.

But it's not really clear how to make it simpler.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [9.1] unusable for large views

2011-10-24 Thread Tom Lane
Omar Bettin o.bet...@informaticaindustriale.it writes:
 I have tried 9.1.1 win64 version and when I am trying to declare a cursor
 for a very large view (lot of joins and aggregate functions),

 postgres is using around 3GB of memory and the query never returns.

Could we see a self-contained test case?  I'm not about to try to
reverse-engineer the schema that goes with such a monster query.

http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum and orphaned large objects

2011-10-24 Thread Tom Lane
Euler Taveira de Oliveira eu...@timbira.com writes:
 The main point of autovacuum is maintenance tasks. Currently, it executes 
 VACUUM and ANALYZE commands. I want to propose that we incorporate vacuumlo 
 functionality into it.

I'm not terribly thrilled with that because (a) large objects seem like
mostly a legacy feature from here, and (b) it's hard to see how to
implement it without imposing overhead on everybody, whether they use
LOs or not.  This is especially problematic if you're proposing that
cleanup triggers not be required.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] termination of backend waiting for sync rep generates a junk log message

2011-10-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Oct 23, 2011 at 6:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 BTW, it strikes me that if we want to do something about that, it ought
 to be possible; but it has to be built into error handling, not a
 localized hack for sync rep.

 I actually think that emitting a NOTICE or WARNING and then slamming
 the connection shut is quite elegant,

No, it's a horrid crock, whose only saving grace is that it was
implementable with two or three lines localized to SyncRepWaitForLSN
... or at least, we thought that until Fujii-san started pointing out
the bugs in it.  It's not convenient for clients at all, it does not fit
well into the backend structure (which is the reason for the bugs), and
it forces session termination unnecessarily, or at least it would if
we'd been consistent and applied the method to query-cancel as well.

 as it seems to me that any
 client that is paranoid enough to care about sync rep had better
 already be handling the case of a connection loss during commit.

Agreed, but that is a problem that by definition we can't help with.
Also, the issue with connection loss is that you really can't know
whether your transaction got committed without reconnecting and looking
for evidence.  There is no reason at all to inject such uncertainty
into the cancel-SyncRepWaitForLSN case.  We know the transaction got
committed, and there's no reason to make the client guess about that,
nor to make it parse WARNING messages for which we didn't even get the
assignment of a unique SQLSTATE right (thus making the problem
insoluble anyhow).

 But I think that throwing an ERROR is likely to cause a LOT of client
 breakage, even if you have some special (human-invisible?) flag that
 indicates that you don't really mean it.  If we must do something
 other than simulating a server disconnect, letting the command
 completion message go through and annotating it with a NOTICE or
 WARNING seems preferable.

I think you're thinking narrowly of the SyncRepWaitForLSN case.  What
I'm trying to point out is that there's a boatload of post-commit code
which is capable of sometimes throwing errors, and that's not ever
going to go away completely.

It might be that it'd work to deal with this by reducing the reported
strength of all such cases from ERROR to WARNING.  Not sure that that's
a good idea, but it might work.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updated version of pg_receivexlog

2011-10-24 Thread Magnus Hagander
On Mon, Oct 24, 2011 at 13:46, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 +               /*
 +                * Looks like an xlog file. Parse it's position.

 s/it's/its/

 +                */
 +               if (sscanf(dirent->d_name, "%08X%08X%08X", &tli, &log,
 &seg) != 3)
 +               {
 +                       fprintf(stderr, _("%s: could not parse xlog
 filename \"%s\"\n"),
 +                                       progname, dirent->d_name);
 +                       disconnect_and_exit(1);
 +               }
 +               log *= XLOG_SEG_SIZE;

 That multiplication by XLOG_SEG_SIZE could overflow, if logid is very high.
 It seems completely unnecessary, anyway,

How do you mean completely unnecessary? We'd have to change the points
that use it to divide by XLOG_SEG_SIZE otherwise, no? That might be a
way to get around the overflow, but I'm not sure that's what you mean?
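
(One overflow-free shape that could take, sketched here with hypothetical
variable names: keep the log and segment numbers separate and compare them as
a pair, instead of scaling log by XLOG_SEG_SIZE.)

    uint32  high_log = 0;
    uint32  high_seg = 0;

    /* ... for each file name parsed into (log, seg) ... */
    if (log > high_log || (log == high_log && seg > high_seg))
    {
        high_log = log;
        high_seg = seg;
    }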


 s/IDENFITY_SYSTEM/IDENTIFY_SYSTEM/ (two occurrences)

Oops.


 In pg_basebackup, it would be a good sanity check to check that the systemid
 returned by IDENTIFY_SYSTEM in the main connection and the WAL-streaming
 connection match. Just to be sure that some connection pooler didn't hijack
 one of the connections and point to a different server. And better check
 timelineid too while you're at it.

That's a good idea. Will fix.


 How does this interact with synchronous replication? If a base backup that
 streams WAL is in progress, and you have synchronous_standby_names set to
 '*', I believe the in-progress backup will count as a standby for that
 purpose. That might give a false sense of security.

Ah yes. Did not think of that. Yes, it will have this problem.


 synchronous_standby_names='*' is prone to such confusion in general, but it
 seems that it's particularly surprising if a running pg_basebackup lets a
 commit in synchronous replication to proceed. Maybe we just need a warning
 in the docs. I think we should advise that synchronous_standby_names='*' is
 dangerous in general, and cite this as one reason for that.

Hmm. I think this is common enough that we want to make sure we avoid
it in code.

Could we pass a parameter from the client indicating to the master
that it refuses to be a sync slave? An optional keyword to the
START_REPLICATION command, perhaps?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Robert Haas
On Sun, Oct 23, 2011 at 7:01 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Oct 21, 2011 at 12:52 PM, Robert Haas robertmh...@gmail.com wrote:

 Also, this line is kind of expensive:

        if (!visibilitymap_test(scandesc->heapRelation,
                                ItemPointerGetBlockNumber(tid),
                                &node->ioss_VMBuffer))

 Around 2%.  But I don't see any way to avoid that, or even make it cheaper.

 Could we cache by ItemPointerGetBlockNumber(tid) the results of those
 tests, for groups of tids on the same index page?

 How useful this would be would depend on how well-clustered the table
 and index are.

I thought about that, but the existing code is so ridiculously cheap
that it's hard to believe a caching layer would save enough to pay for
itself.  I mean, I presume that the cost attributed to that line has
to be associated with either (a) one of the pointer deferences, (b)
the expense of evaluating ItemPointerGetBlockNumber(), (c) setting up
the function call, or perhaps (d) overhead incurred as a result of
branch mis-prediction.  The actual time spent *inside*
visibilitymap_test will be attributed to that function, not this one.

If you add up the time for this line and visibilitymap_test(), it's
like 10% of the runtime, which seems pretty significant.  But it's 10%
of the runtime that is spent on basically a handful of arithmetic
operations and then reading a byte from shared memory.  It's
astonishing to find that so expensive on a test with just one backend
running.  If you stick some kind of cache in there, it's going to
involve adding a branch that isn't there now, and I think that's going
to be pricey given how hot this code apparently is.

Also, I'm not sure it's safe.  Suppose that we lock the index page,
return a tuple, check the visibility map, and find the page all
visible.  Another transaction comes along, adds a tuple to the index
page, and clears the visibility map bit.  We then go back, relock the
index page, and return another tuple.  We'd better notice that the
visibility map bit has now been cleared, or we're in trouble.

I wonder if it might help to create some method for the index to
return all the matching keys on a single index page in one call.   If
you're dealing with an MVCC snapshot, any new tuples added after the
start of the scan can't be visible anyway.  That would save the
overhead of unlocking and relocking the buffer once per tuple, and
probably some overhead associated with validating and decoding the
possibly-changed page contents each time.  If you did it that way, it
would also be safe to do what you're proposing - if a bunch of the
tuples on the index page are also on the same heap page, you could do
one visibility map check for all of them.
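
If one did go that route, the per-heap-block cache could be as simple as this
(a sketch; the Last* fields are hypothetical additions to the scan state):

    BlockNumber blkno = ItemPointerGetBlockNumber(tid);

    if (blkno != node->ioss_LastVMBlock)        /* hypothetical cache fields */
    {
        node->ioss_LastVMResult =
            visibilitymap_test(scandesc->heapRelation, blkno,
                               &node->ioss_VMBuffer);
        node->ioss_LastVMBlock = blkno;
    }

    if (!node->ioss_LastVMResult)
    {
        /* not known all-visible: fetch the heap tuple to check visibility */
    }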

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updated version of pg_receivexlog

2011-10-24 Thread Jaime Casanova
On Mon, Oct 24, 2011 at 7:40 AM, Magnus Hagander mag...@hagander.net wrote:

 synchronous_standby_names='*' is prone to such confusion in general, but it
 seems that it's particularly surprising if a running pg_basebackup lets a
 commit in synchronous replication to proceed. Maybe we just need a warning
 in the docs. I think we should advise that synchronous_standby_names='*' is
 dangerous in general, and cite this as one reason for that.

 Hmm. I think this is common enough that we want to make sure we avoid
 it in code.

 Could we pass a parameter from the client indicating to the master
 that it refuses to be a sync slave? An optional keyword to the
 START_REPLICATION command, perhaps?


can't you execute set synchronous_commit to off/local for this connection?

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updated version of pg_receivexlog

2011-10-24 Thread Heikki Linnakangas

+   /*
+* Looks like an xlog file. Parse it's position.


s/it's/its/


+*/
+   if (sscanf(dirent->d_name, "%08X%08X%08X", &tli, &log, &seg) != 3)
+   {
+   fprintf(stderr, _("%s: could not parse xlog filename \"%s\"\n"),
+   progname, dirent->d_name);
+   disconnect_and_exit(1);
+   }
+   log *= XLOG_SEG_SIZE;


That multiplication by XLOG_SEG_SIZE could overflow, if logid is very 
high. It seems completely unnecessary, anyway,


s/IDENFITY_SYSTEM/IDENTIFY_SYSTEM/ (two occurrences)

In pg_basebackup, it would be a good sanity check to check that the 
systemid returned by IDENTIFY_SYSTEM in the main connection and the 
WAL-streaming connection match. Just to be sure that some connection 
pooler didn't hijack one of the connections and point to a different 
server. And better check timelineid too while you're at it.


How does this interact with synchronous replication? If a base backup 
that streams WAL is in progress, and you have synchronous_standby_names 
set to '*', I believe the in-progress backup will count as a standby for 
that purpose. That might give a false sense of security. 
synchronous_standby_names='*' is prone to such confusion in general, but 
it seems that it's particularly surprising if a running pg_basebackup 
lets a commit in synchronous replication to proceed. Maybe we just need 
a warning in the docs. I think we should advise that 
synchronous_standby_names='*' is dangerous in general, and cite this as 
one reason for that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] EXECUTE tab completion

2011-10-24 Thread Magnus Hagander
On Mon, Oct 24, 2011 at 01:26, Tom Lane t...@sss.pgh.pa.us wrote:
 Andreas Karlsson andr...@proxel.se writes:
 Thanks for cleaning up the code to some sanity, I should have done so
 myself when I noticed the problem.

 A new version is attached.

 Committed with minor adjustments --- I didn't see any need to make this
 wait for the next commitfest.

Thanks - I was planning to pick that one up along with my TABLE patch,
but was too busy with pgconf.eu over the past couple of weeks..

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] loss of transactions in streaming replication

2011-10-24 Thread Fujii Masao
On Fri, Oct 21, 2011 at 12:01 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Oct 20, 2011 at 9:51 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Oct 20, 2011 at 1:05 AM, Robert Haas robertmh...@gmail.com wrote:
 OK, so this is an artifact of the changes to make libpq communication
 bidirectional.  But I'm still confused about where the error is coming
 from.  In your OP, you wrote In 9.2dev and 9.1, when walreceiver
 detects an error while sending data to WAL stream, it always emits
 ERROR even if there are data available in the receive buffer.  So
 that implied to me that this is only going to trigger if you have a
 shutdown together with an awkwardly-timed error.  But your scenario
 for reproducing this problem doesn't seem to involve an error.

 Yes, my scenario doesn't cause any real error. My original description was
 misleading. The following would be closer to the truth:

     In 9.2dev and 9.1, when walreceiver detects the termination of the
     replication connection while sending data to the WAL stream, it always
     emits ERROR even if there are data available in the receive buffer.

 Ah, OK.  I think I now agree that this is a bug and that we should fix
 and back-patch.

Is the patch that I posted before well-formed enough to be adopted?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum and orphaned large objects

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 12:56 AM, Euler Taveira de Oliveira
eu...@timbira.com wrote:
 The main point of autovacuum is maintenance tasks. Currently, it executes
 VACUUM and ANALYZE commands. I want to propose that we incorporate vacuumlo
 functionality into it. While dealing with large objects (LO), we have the lo
 contrib module, which helps with LO maintenance but has some limitations (it
 does not handle the DROP TABLE and TRUNCATE cases), and vacuumlo, which does
 an excellent job but has to be executed outside the DBMS. The proposal is to
 clean up LOs when autovacuum triggers VACUUM; the LO cleanup routine would
 start after(?) the VACUUM command.

 In the near future I want to propose that orphaned LOs be cleaned up by VACUUM,
 but that's a story for another thread...

 Comments?

I think the main reason why vacuumlo is a contrib module rather than
in core is that it is just a heuristic, and it might not be what
everyone wants to do.  You could store a bunch of large objects in the
database and use the returned OIDs to generate links that you email to
users, and then when the user clicks on the link we retrieve the
corresponding LO and send it to the user over HTTP.  In that design,
there are no tables in the database at all, yet the large objects
aren't orphaned.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [9.1] unusable for large views

2011-10-24 Thread Jan Urbański
On 24/10/11 10:57, Omar Bettin wrote:
 
 [monster query]

I see that your problem is already solved, but incidentally I'm working
on a join order planning module and I'm looking for real-life examples
of humongous queries like that to benchmark against.

Any chance you could share the schema, or at least part of it, that goes
with this query? Or perhaps you have more of these queries?

Cheers,
Jan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Unreproducible bug in snapshot import code

2011-10-24 Thread Gurjeet Singh
Hi All,

I have not been able to reproduce this error, but wanted to report this
in case it might be useful.

Commit -id: 0f39d5050dc0dce99258381f33f1832c437aff85

Configure options:
--prefix=/mnt/storage/gurjeet/dev/builds//master/db --enable-debug
--enable-cassert CFLAGS=-O0 --enable-depend --enable-thread-safety
--with-openssl

Used VPATH to build postgres.

Attached files: terminal1.txt termonal2.txt config.log

Starting from line 89 of terminal1.txt we see this snippet:

postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction snapshot '02C8-1';
ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
postgres=# rollback;
ROLLBACK


As you can see the SET TRANSACTION SNAPSHOT was the first statement in
that transaction, and yet the ERROR message says that it is not. That
snapshot id was generated in another session (terminal2.txt), and was
generated outside any transaction; that's the only peculiar thing I can say
about that snapshot-id.

As is evident from the rest of the lines in terminal1.txt, I have not
been able to reproduce this error again; I tried even with a clean build of
the sources.

I have taken a look at the code and everything tells me that the error
should have been:

ERROR:  a snapshot-importing transaction must have isolation level
SERIALIZABLE or REPEATABLE READ

Maybe it was not a clean build the first time, and the code may have
wrongly linked with previously compiled .o files.

Regards,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
psql (9.2devel)
Type "help" for help.

postgres=# begin transaction;
BEGIN
postgres=# set iso

postgres=# set trans
transaction transaction_deferrable  transaction_isolation   
transaction_read_only   transform_null_equals   
postgres=# set transaction isolation level read only;
ERROR:  syntax error at or near only
LINE 1: set transaction isolation level read only;
 ^
postgres=# set transaction isolation level readonly;
ERROR:  syntax error at or near readonly
LINE 1: set transaction isolation level readonly;
^
postgres=# set transaction 
ISOLATION LEVEL  READ 
postgres=# set transaction ISOLATION LEVEL 
READ  REPEATABLESERIALIZABLE  
postgres=# set transaction ISOLATION LEVEL READ 
COMMITTEDUNCOMMITTED  
postgres=# set transaction ISOLATION LEVEL READ committed;
ERROR:  current transaction is aborted, commands ignored until end of 
transaction block
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# set transaction ISOLATION LEVEL READ committed;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# select 1;
 ?column? 
--
1
(1 row)

postgres=# set transaction ISOLATION LEVEL READ committed;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# set transaction ISOLATION LEVEL serializable;;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# select 1;
 ?column? 
--
1
(1 row)

postgres=# set transaction ISOLATION LEVEL serializable;;
ERROR:  SET TRANSACTION ISOLATION LEVEL must be called before any query
postgres=# rollback;
ROLLBACK
(reverse-i-search)`be': select count(*) from clustermem^Crs ;
postgres=# begin transaction;
BEGIN
postgres=# set transaction ISOLATION LEVEL repeatable read;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# set transaction ISOLATION LEVEL repeatable read;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# select 1;
 ?column? 
--
1
(1 row)

postgres=# set transaction ISOLATION LEVEL repeatable read;
ERROR:  SET TRANSACTION ISOLATION LEVEL must be called before any query
postgres=# set transaction ISOLATION LEVEL ;
READ  REPEATABLESERIALIZABLE  
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction snapshot '02C8-1';
ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction snapshot '02C9-1';
ERROR:  a snapshot-importing transaction must have isolation level SERIALIZABLE 
or REPEATABLE READ
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction ISOLATION LEVEL repeatable read;
SET
postgres=# set transaction snapshot '02C9-1';
SET
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction snapshot '02CA-1';
ERROR:  a snapshot-importing transaction must have isolation level SERIALIZABLE 
or REPEATABLE READ
postgres=# rollack;
ERROR:  syntax error at or near rollack
LINE 1: rollack;
^
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction ISOLATION LEVEL repeatable read;
SET
postgres=# 

R: [HACKERS] [9.1] unusable for large views (SOLVED)

2011-10-24 Thread Omar Bettin
...sorry guys...

It was a bad configuration of the database.

9.1.1 is working well.

It is 4% to 8% faster than 9.0.5.

Thanks a lot.

Regards

Omar

P.s.
attached EXPLAIN


Hmm.  A 59-table join is pretty enormous

and it is not the biggest one; basically it is the delivery notes for one day,
viewed vertically.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TABLE tab completion

2011-10-24 Thread Magnus Hagander
On Mon, Sep 26, 2011 at 10:37, Magnus Hagander mag...@hagander.net wrote:
 On Sun, Sep 25, 2011 at 15:06, Dean Rasheed dean.a.rash...@gmail.com wrote:
 On 24 September 2011 11:59, Magnus Hagander mag...@hagander.net wrote:
 TABLE tab completion in psql only completes to tables, not views. but
 the TABLE command works fine for both tables and views (and also
 sequences).

 Seems we should just complete it to relations and not tables - or can
 anyone see a particular reason why we shouldn't?


 Doesn't that mean that DROP TABLE tab would offer up views as well
 as tables, which would be incorrect?

 Meh - you are correct, of course. I guess that's why we have code review :-)

 So - not a oneliner, but how about something like this?

 (Happy to have someone point out a neater way of doing it, not
 entirely fluent in how we do the tab completion..)

Rebased on top of the changes Tom made to the infrastructure, and applied.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [9.1] unusable for large views (SOLVED)

2011-10-24 Thread Pavel Stehule
2011/10/24 Omar Bettin o.bet...@informaticaindustriale.it:
 ...sorry guys...

 It was a bad configuration of the database.

 9.1.1 is working well.

 It is 4% to 8% faster than 9.0.5.

 Thanks a lot.

 Regards

 Omar

 P.s.
 attached EXPLAIN


attachment is missing

Pavel



Hmm.  A 59-table join is pretty enormous

 and it is not the biggest one; basically it is the delivery notes for one day,
 viewed vertically.


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [9.1] unusable for large views

2011-10-24 Thread Pavel Stehule
Hello

please, send a result of explain analyze on 9.1.1 and older

please, use http://explain.depesz.com/
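
For example (the view name is taken from the earlier mail; on releases older
than 9.0, plain EXPLAIN ANALYZE without the parenthesized options does the job):

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM articoli_rip_ven20110227_view;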

Regards

Pavel Stehule

2011/10/24 Omar Bettin o.bet...@informaticaindustriale.it:
 Hello,



 I have tried the 9.1.1 win64 version, and when I try to declare a cursor
 for a very large view (lots of joins and aggregate functions),

 postgres uses around 3 GB of memory and the query never returns.



 Same problem when selecting from the view without a cursor.



 Same query worked fine from 8.3.3 to 9.0.5.



 Should I change some configuration params to have the same behavior as
 previous versions?







 Tried on Win2008 server R2  64bit 8GB RAM.

 also on Win7 64bit 8GB RAM.

 default postgresql.conf



 Regards,





 The view (!)



 CREATE OR REPLACE VIEW articoli_rip_ven20110227_view AS

  SELECT a.pk_id, a.articolo, a.articolo_f, a.origine, ( SELECT
 ditte.attivita

    FROM ditte

   WHERE ditte.partitario = a.cod_fornit) AS dummy_1, a.descrizion,
 a.taglia, a.peso_spec, a.giacenza_colli, a.giacenza, rim.colli AS
 rimanenza_colli, rim.quantita AS rimanenza_qta, rimass.colli AS
 rimanenza_ass_colli, rimass.quantita AS rimanenza_ass_qta,
 COALESCE(b20.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b21.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b22.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b23.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b24.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b25.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b26.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b27.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b28.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b29.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b30.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b31.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b32.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b33.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b34.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b35.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b36.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b37.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b38.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b39.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b40.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b41.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b42.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b43.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b44.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b45.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b46.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b47.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b48.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b49.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b50.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b51.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b52.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b53.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b54.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b55.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b56.quantita, NULL::numeric, 0::numeric)::double precision +
 (COALESCE(rim.quantita, NULL::numeric, 0::numeric) +
 COALESCE(rimass.quantita, NULL::numeric, 0::numeric))::double precision -
 (COALESCE(b1.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b2.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b3.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b4.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b5.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b6.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b7.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b8.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b9.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b10.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b11.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b12.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b13.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b14.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b15.quantita, NULL::numeric, 0::numeric)::double precision +
 

Re: [HACKERS] Unreproducible bug in snapshot import code

2011-10-24 Thread Tom Lane
Gurjeet Singh singh.gurj...@gmail.com writes:
 Starting from line 89 of terminal1.txt we see this snippet:

 postgres=# rollback;
 ROLLBACK
 postgres=# begin TRANSACTION ;
 BEGIN
 postgres=# set transaction snapshot '02C8-1';
 ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
 postgres=# rollback;
 ROLLBACK

 As is evident from the rest of the lines in terminal1.txt, I have not
 been able to reproduce this error again; I tried even with a clean build of
 the sources.

Given the weird capitalization of the BEGIN command, I'm guessing that
you used tab-completion to enter it.  I wonder if this could have been
affected by the execution of some query in support of tab completion?
I couldn't reproduce it either on the basis of that guess, though.
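
For reference, the intended usage pattern of the snapshot-import feature being
exercised here looks roughly like this (the snapshot identifier below is
illustrative, not a real one):

-- session 1: publish a snapshot
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();      -- returns an identifier such as '000003A1-1'

-- session 2: adopt it; this must be the transaction's first statement
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '000003A1-1';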

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Range Types - typo + NULL string constructor

2011-10-24 Thread Heikki Linnakangas

On 17.10.2011 01:09, Jeff Davis wrote:

On Sat, 2011-10-15 at 01:46 +0300, Heikki Linnakangas wrote:

* The binary i/o format includes the length of the lower and upper
bounds twice, once explicitly in range_send, and second time within the
send-function of the subtype. Seems wasteful.


Any ideas how to fix that? How else do I know how much the underlying
send function will consume?


Oh, never mind. I was misreading the code, it's not sending the length 
twice.



* range_constructor_internal - I think it would be better to move logic
to figure out the the arguments into the callers.


Done.


The comment above range_constructor0() is now outdated.


* The gist support functions frequently call range_deserialize(), which
does catalog lookups. Isn't that horrendously expensive?


Yes, it was. I have introduced a cached structure that avoids syscache
lookups when it's the same range as the last lookup (the common case).


Hmm, I don't think that's safe. After Oid wraparound, a range type oid 
might get reused for some other range type, and the cache would return 
stale values. Extremely unlikely to happen by accident, but could be 
exploited by an attacker.



* What exactly is canonical function supposed to return? It's not clear
what format one should choose as the canonical format when writing a
custom range type. And even for the built-in types, it would be good to
explain which types use which canonical format (I saw the discussions on
that, so I understand that might still be subject to change).


The canonical function is just supposed to return a new range such that
two equal values will have equal representations. I have listed the
built-in discrete range types and their canonical form.

As far as describing what a custom range type is supposed to use for the
canonical form, I don't know which part is exactly unclear. There aren't
too many rules to defining one -- the only guideline is that ranges of
equal value going in should be put in one canonical representation.


Ok. The name canonical certainly hints at that, but it would be good 
to explicitly state that guideline. As the text stands, it would seem 
that a canonical function that maps [1,7] to [1,8), and also vice 
versa, [1,8) to [1,7], would be valid. That would be pretty silly, 
but it would be good to say something like "The canonical output for two 
values that are equal, like [1,7] and [1,8), must be equal. It doesn't 
matter which representation you choose to be the canonical one, as long 
as two equal values with different formattings are always mapped to the 
same value with same formatting".
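
For instance, with the patch's built-in int4range, whose canonical form uses
'[)' bounds, the two spellings below denote the same value and print
identically (a quick illustration):

SELECT int4range(1, 7, '[]');                          -- displays as [1,8)
SELECT int4range(1, 7, '[]') = int4range(1, 8, '[)');  -- true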


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unreproducible bug in snapshot import code

2011-10-24 Thread Alvaro Herrera

Excerpts from Gurjeet Singh's message of lun oct 24 13:55:44 -0300 2011:

 Starting from line 89 of terminal1.txt we see this snippet:
 
 postgres=# rollback;
 ROLLBACK
 postgres=# begin TRANSACTION ;
 BEGIN
 postgres=# set transaction snapshot '02C8-1';
 ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
 postgres=# rollback;
 ROLLBACK
 
 
 As you can see the SET TRANSACTION SNAPSHOT was the first statement in
 that transaction, and yet the ERROR message says that it is not.

Maybe the tab-completion feature issued a query before the set
transaction command.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum and orphaned large objects

2011-10-24 Thread Euler Taveira de Oliveira

On 24-10-2011 10:57, Robert Haas wrote:

I think the main reason why vacuumlo is a contrib module rather than
in core is that it is just a heuristic, and it might not be what
everyone wants to do.  You could store a bunch of large objects in the
database and use the returned OIDs to generate links that you email to
users, and then when the user clicks on the link we retrieve the
corresponding LO and send it to the user over HTTP.  In that design,
there are no tables in the database at all, yet the large objects
aren't orphaned.

Wow, what a strange method to solve a problem and possibly bloat your 
database. No, I'm not suggesting that we forbid it. The proposed method could 
clean up orphaned LOs in 95% (if not 99%) of the use cases.


I've never heard of anyone using LOs the way you describe. It seems strange that 
someone would hand out an OID number but not store a reference to it in 
the same database. Yes, it is a possibility, but ...



--
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Monster query

2011-10-24 Thread Omar Bettin
Hi,

Since the data belongs to third parties, I will prepare a database suitable for the 
purpose.
In any case, the compressed backup will be around 20 MB.

Regards,

Omar Bettin


-----Original Message-----
From: Jan Urbański [mailto:wulc...@wulczer.org] 
Sent: Monday, October 24, 2011 18:56
To: Omar Bettin
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [9.1] unusable for large views

On 24/10/11 10:57, Omar Bettin wrote:
 
 [monster query]

I see that your problem is already solved, but incidentally I'm working on a 
join order planning module and I'm looking for real-life examples of humongous 
queries like that to benchmark against them.

Any chance you could share the schema, or at least part of it, that goes with 
this query? Or perhaps you have more of these queries?

Cheers,
Jan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Use new oom_score_adj without a new compile-time constant

2011-10-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 [ oom_score_adj business ]

 Did we do anything about this?  Anyone else have an opinion on what
 ought to be done?

I held off doing anything because it didn't seem like we had consensus.
OTOH, it may well be that it's not important enough to demand real
consensus, and he who does the work gets to make the choices.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum and orphaned large objects

2011-10-24 Thread Euler Taveira de Oliveira

On 24-10-2011 11:36, Tom Lane wrote:

Euler Taveira de Oliveira eu...@timbira.com  writes:

The main point of autovacuum is maintenance tasks. Currently, it executes
VACUUM and ANALYZE commands. I want to propose that we incorporate vacuumlo
functionality into it.


I'm not terribly thrilled with that because (a) large objects seem like
mostly a legacy feature from here, and


Right, but there isn't a solution for > 1 GB column data besides LO.


(b) it's hard to see how to
implement it without imposing overhead on everybody, whether they use
LOs or not.  This is especially problematic if you're proposing that
cleanup triggers not be required.

I was thinking about starting the LO cleanup after autovacuum finishes the 
VACUUM command (so no trigger, no new mechanism). And about the overhead 
imposed, it will only execute the cleanup code in the tables that have oid/lo 
columns (this information will be collected when the autovacuum collects table 
information).
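
To make that concrete, here is a rough SQL sketch of the vacuumlo-style
cleanup being discussed (the documents/content names are hypothetical; a real
implementation would visit every oid/lo column it finds):

-- Unlink every large object not referenced from the hypothetical
-- documents.content column.
SELECT lo_unlink(m.oid)
FROM pg_largeobject_metadata m
WHERE NOT EXISTS (SELECT 1 FROM documents d WHERE d.content = m.oid);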



--
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Monster query

2011-10-24 Thread Jan Urbański
On 24/10/11 19:16, Omar Bettin wrote:
 Hi,
 
 Since the data are of third parties, will prepare a database suitable for the 
 purpose.
 In any case, the compressed backup will be around 20 MB.

If you are able to prepare a database dump that doesn't contain private
data, it would be awesome.

If it's 20 MB please just post a link to the archive.

Thank you very much!
Jan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] termination of backend waiting for sync rep generates a junk log message

2011-10-24 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 It might be that it'd work to deal with this by reducing the
 reported strength of all such cases from ERROR to WARNING.  Not
 sure that that's a good idea, but it might work.
 
Throwing an error on commit of a transaction when its work has
actually been persisted seems as bad as failing to persist the work
of a transaction when the commit appears successful.  If we know we
have met the persistence guarantees and some problem occurs after
that, then it seems to me we should issue a warning and indicate
success.  If we don't know whether the work will be persisted, it
seems to me we can block until we find out (if that's feasible) or
break the connection without indicating one way or the other.  The
work of the transaction must either all be persisted or not (I'm
assuming this isn't a problem here), and any user must be prepared
to deal with a broken connection before response on a commit
attempt.
 
I don't know whether changing all failures beyond the persistence
point in a commit to warnings is sufficient, but I think it is
required for anything which doesn't break the connection.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Separating bgwriter and checkpointer

2011-10-24 Thread Heikki Linnakangas

On 19.10.2011 17:58, Simon Riggs wrote:

On Wed, Oct 19, 2011 at 3:29 PM, Fujii Masao masao.fu...@gmail.com  wrote:

On Wed, Oct 19, 2011 at 9:45 PM, Robert Haas robertmh...@gmail.com  wrote:

I don't really see any reason to break the monitoring view just
because we did some internal refactoring.  I'd rather have backward
compatibility.


Fair enough.

The patch doesn't change any documentation, but at least the description
of pg_stat_bgwriter seems to need to be changed.


Thanks for the review.

Will follow up on suggestions.


The patch looks sane, it's mostly just moving existing code around, but 
there's one thing that's been bothering me about this whole idea from 
the get-go:


If the bgwriter and checkpointer are two different processes, whenever 
bgwriter writes out a page it needs to send an fsync-request to the 
checkpointer. We avoided that when both functions were performed by the 
same process, but now we have to send and absorb a fsync-request message 
for every single write() that happens in the system, except for those 
done at checkpoints. Isn't that very expensive? Does it make the 
fsync-request queue a bottleneck on some workloads?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Use new oom_score_adj without a new compile-time constant

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 1:17 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 [ oom_score_adj business ]

 Did we do anything about this?  Anyone else have an opinion on what
 ought to be done?

 I held off doing anything because it didn't seem like we had consensus.
 OTOH, it may well be that it's not important enough to demand real
 consensus, and he who does the work gets to make the choices.

Half a loaf is better than none.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inserting heap tuples in bulk in COPY

2011-10-24 Thread Heikki Linnakangas

On 25.09.2011 16:03, Dean Rasheed wrote:

On 25 September 2011 09:43, Kohei KaiGai kai...@kaigai.gr.jp  wrote:

Hi Heikki,

I checked your patch, then I have a comment and two questions here.

2011/9/14 Heikki Linnakangas heikki.linnakan...@enterprisedb.com:


Attached is a new version of the patch. It is now complete, including WAL
replay code.


Hi,

I had a quick look at the patch as well and spotted an oversight: the
multi-insert code branch fails to invoke AFTER ROW triggers.


Thanks! Here's an updated version of the patch, fixing that, and all the 
other issues pointed out this far.


I extracted the code that sets oid and tuple headers, and invokes the 
toaster, into a new function that's shared by heap_insert() and 
heap_multi_insert(). Tom objected to merging heap_insert() and 
heap_multi_insert() into one complicated function, and I think he was 
right on that, but sharing this code to prepare a tuple still makes 
sense. IMHO it makes heap_insert() slightly more readable too.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***
*** 24,29 
--- 24,30 
   *		heap_getnext	- retrieve next tuple in scan
   *		heap_fetch		- retrieve tuple with given tid
   *		heap_insert		- insert tuple into a relation
+  *		heap_multi_insert - insert multiple tuples into a relation
   *		heap_delete		- delete a tuple from a relation
   *		heap_update		- replace a tuple in a relation with another tuple
   *		heap_markpos	- mark scan position
***
*** 79,84  static HeapScanDesc heap_beginscan_internal(Relation relation,
--- 80,87 
  		int nkeys, ScanKey key,
  		bool allow_strat, bool allow_sync,
  		bool is_bitmapscan);
+ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
+ 	TransactionId xid, CommandId cid, int options);
  static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
  ItemPointerData from, Buffer newbuf, HeapTuple newtup,
  bool all_visible_cleared, bool new_all_visible_cleared);
***
*** 1866,1920  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
  	Buffer		vmbuffer = InvalidBuffer;
  	bool		all_visible_cleared = false;
  
- 	if (relation->rd_rel->relhasoids)
- 	{
- #ifdef NOT_USED
- 		/* this is redundant with an Assert in HeapTupleSetOid */
- 		Assert(tup->t_data->t_infomask & HEAP_HASOID);
- #endif
- 
- 		/*
- 		 * If the object id of this tuple has already been assigned, trust the
- 		 * caller.	There are a couple of ways this can happen.  At initial db
- 		 * creation, the backend program sets oids for tuples. When we define
- 		 * an index, we set the oid.  Finally, in the future, we may allow
- 		 * users to set their own object ids in order to support a persistent
- 		 * object store (objects need to contain pointers to one another).
- 		 */
- 		if (!OidIsValid(HeapTupleGetOid(tup)))
- 			HeapTupleSetOid(tup, GetNewOid(relation));
- 	}
- 	else
- 	{
- 		/* check there is not space for an OID */
- 		Assert(!(tup->t_data->t_infomask & HEAP_HASOID));
- 	}
- 
- 	tup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
- 	tup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
- 	tup->t_data->t_infomask |= HEAP_XMAX_INVALID;
- 	HeapTupleHeaderSetXmin(tup->t_data, xid);
- 	HeapTupleHeaderSetCmin(tup->t_data, cid);
- 	HeapTupleHeaderSetXmax(tup->t_data, 0);		/* for cleanliness */
- 	tup->t_tableOid = RelationGetRelid(relation);
- 
  	/*
! 	 * If the new tuple is too big for storage or contains already toasted
! 	 * out-of-line attributes from some other relation, invoke the toaster.
  	 *
  	 * Note: below this point, heaptup is the data we actually intend to store
  	 * into the relation; tup is the caller's original untoasted data.
  	 */
! 	if (relation->rd_rel->relkind != RELKIND_RELATION)
! 	{
! 		/* toast table entries should never be recursively toasted */
! 		Assert(!HeapTupleHasExternal(tup));
! 		heaptup = tup;
! 	}
! 	else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
! 		heaptup = toast_insert_or_update(relation, tup, NULL, options);
! 	else
! 		heaptup = tup;
  
  	/*
  	 * We're about to do the actual insert -- but check for conflict first,
--- 1869,1882 
  	Buffer		vmbuffer = InvalidBuffer;
  	bool		all_visible_cleared = false;
  
  	/*
! 	 * Fill in tuple header fields, assign an OID, and toast the tuple if
! 	 * necessary.
  	 *
  	 * Note: below this point, heaptup is the data we actually intend to store
  	 * into the relation; tup is the caller's original untoasted data.
  	 */
! 	heaptup = heap_prepare_insert(relation, tup, xid, cid, options);
  
  	/*
  	 * We're about to do the actual insert -- but check for conflict first,
***
*** 2035,2041  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
  	 */
  	CacheInvalidateHeapTuple(relation, heaptup, NULL);
  
! 	pgstat_count_heap_insert(relation);
  
  	/*
  	 * If heaptup is a private copy, release it.  Don't 

Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 I had wondered whether it'd be worth optimizing that along the
 lines of slot_getallattrs().  But most indexes probably have only
 one column, or anyway not enough to make for a useful savings.
 
From a heavily-used production database:
 
cir=> select indnatts, count(*) from pg_index group by indnatts
order by indnatts;
 indnatts | count 
----------+-------
        1 |   200
        2 |   684
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)
 
This includes system table and TOAST table indexes (which seem to
have two columns).  There are over 400 user tables, each of which
has a primary key, so most primary keys in our database are more
than one column.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unreproducible bug in snapshot import code

2011-10-24 Thread Gurjeet Singh
On Mon, Oct 24, 2011 at 1:08 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:


 Excerpts from Gurjeet Singh's message of lun oct 24 13:55:44 -0300 2011:

  Starting from line 89 of terminal1.txt we see this snippet:
 
  postgres=# rollback;
  ROLLBACK
  postgres=# begin TRANSACTION ;
  BEGIN
  postgres=# set transaction snapshot '02C8-1';
  ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
  postgres=# rollback;
  ROLLBACK
 
 
  As you can see the SET TRANSACTION SNAPSHOT was the first statement
 in
  that transaction, and yet the ERROR message says that it is not.

 Maybe the tab-completion feature issued a query before the set
 transaction command.


I have tried reproducing the bug starting from 1 and 2 transactions before
the one shown in the snippet, and I used tab-completion to get the same
screen output as terminal1.txt, and yet it's not reproducible.

Regards,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Tom Lane t...@sss.pgh.pa.us wrote:
 I had wondered whether it'd be worth optimizing that along the
 lines of slot_getallattrs().  But most indexes probably have only
 one column, or anyway not enough to make for a useful savings.
 
 From a heavily-used production database:
 
 cir=> select indnatts, count(*) from pg_index group by indnatts
 order by indnatts;
  indnatts | count 
 ----------+-------
         1 |   200
         2 |   684
         3 |   155
         4 |    76
         5 |    43
         6 |    13
         7 |     2
         9 |     1
 (8 rows)
 
 This includes system table and TOAST table indexes (which seem to
 have two columns).

Yeah, TOAST indexes are 2-column.  It would be best to exclude those
from your counts, since it seems pretty unlikely that anyone will care
how fast nodeIndexonlyscan.c is for scans on toast tables.
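
For what it's worth, one way to redo that count while leaving the TOAST
indexes out (just one way to slice it, not the only one) would be:

SELECT i.indnatts, count(*)
FROM pg_index i
     JOIN pg_class c ON c.oid = i.indexrelid
     JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname <> 'pg_toast'
GROUP BY i.indnatts
ORDER BY i.indnatts;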

 There are over 400 user tables, each of which
 has a primary key, so most primary keys in our database are more
 than one column.

It doesn't look to me like the mean is above 2 (unless you have many
fewer toast tables than I suspect), so trying to optimize many-column
cases isn't going to help.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] (PATCH) Adding CORRESPONDING to Set Operations

2011-10-24 Thread Erik Rijkers
On Wed, October 19, 2011 15:01, Kerem Kat wrote:
 Adding CORRESPONDING to Set Operations
 Initial patch, filename: corresponding_clause_v2.patch

I had a quick look at the behaviour of this patch.

Btw, the examples in your email were typoed (one select is missing):

 SELECT 1 a, 2 b, 3 c UNION CORRESPONDING 4 b, 5 d, 6 c, 7 f;
should be:
  SELECT 1 a, 2 b, 3 c UNION CORRESPONDING select 4 b, 5 d, 6 c, 7 f;

and

 SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(b) 4 b, 5 d, 6 c, 7 f;
should be:
  SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(b) select 4 b, 5 d, 6 c, 7 f;




But there is also a small bug, I think: the order in the CORRESPONDING BY list 
should be followed,
according to the standard (foundation, p. 408):

2) If corresponding column list is specified, then let SL be a select list 
of those column
names explicitly appearing in the corresponding column list in the order 
that these
column names appear in the corresponding column list. Every column name 
in the
corresponding column list shall be a column name of both T1 and T2.

That would make this wrong, I think:

SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(c,b) select 5 d, 6 c, 7 f, 4 b ;

 b | c
---+---
 2 | 3
 4 | 6
(2 rows)

i.e., I think it should show columns in the order c, b (and not b, c); the 
order of the
CORRESPONDING BY phrase.

(but maybe I'm misreading the text of the standard; I find it often difficult 
to follow)


Thanks,


Erik Rijkers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 Yeah, TOAST indexes are 2-column.  It would be best to exclude
 those from your counts, since it seems pretty unlikely that anyone
 will care how fast nodeIndexonlyscan.c is for scans on toast
 tables.
 
User indexes (excluding toast):
 
 indnatts | count 
----------+-------
        1 |   200
        2 |   222
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)
 
System indexes (excluding toast):
 
 indnatts | count 
----------+-------
        1 |    46
        2 |    24
        3 |     9
        4 |     5
(4 rows)
 
 It doesn't look to me like the mean is above 2 (unless you have
 many fewer toast tables than I suspect), so trying to optimize
 many-column cases isn't going to help.
 
The mean is 2.4 (give or take a little depending on whether you
include system tables).  I have no idea where the optimization
becomes worthwhile, but the assertion that most indexes probably
have a single column worried me.  I'm sure there are databases where
that is true (especially for those who insist on adding a
meaningless surrogate key column to every table), but there are many
where it isn't true.  I would guess that our average of 2.4 is
higher than most, though.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Kevin Grittner
Copy/paste problems -- the first set includes the system tables
except for toast.  User tables would be the difference between the
results below.  Sorry.
 
-Kevin
 
 
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
  
 Yeah, TOAST indexes are 2-column.  It would be best to exclude
 those from your counts, since it seems pretty unlikely that
 anyone will care how fast nodeIndexonlyscan.c is for scans on
 toast tables.
  
 User indexes (excluding toast):
  
  indnatts | count 
 ----------+-------
         1 |   200
         2 |   222
         3 |   155
         4 |    76
         5 |    43
         6 |    13
         7 |     2
         9 |     1
 (8 rows)
  
 System indexes (excluding toast):
  
  indnatts | count 
 ----------+-------
         1 |    46
         2 |    24
         3 |     9
         4 |     5
 (4 rows)
  
 It doesn't look to me like the mean is above 2 (unless you have
 many fewer toast tables than I suspect), so trying to optimize
 many-column cases isn't going to help.
  
 The mean is 2.4 (give or take a little depending on whether you
 include system tables).  I have no idea where the optimization
 becomes worthwhile, but the assertion that most indexes probably
 have a single column worried me.  I'm sure there are databases
 where that is true (especially for those who insist on adding a
 meaningless surrogate key column to every table), but there are
 many where it isn't true.  I would guess that our average of 2.4
 is higher than most, though.
  
 -Kevin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 2:15 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 It doesn't look to me like the mean is above 2 (unless you have
 many fewer toast tables than I suspect), so trying to optimize
 many-column cases isn't going to help.

 The mean is 2.4 (give or take a little depending on whether you
 include system tables).  I have no idea where the optimization
 becomes worthwhile, but the assertion that most indexes probably
 have a single column worried me.  I'm sure there are databases where
 that is true (especially for those who insist on adding a
 meaningless surrogate key column to every table), but there are many
 where it isn't true.  I would guess that our average of 2.4 is
 higher than most, though.

Keep in mind that the existence of index-only scans is going to
encourage people to try to create covering indexes on columns that
aren't indexed today.  It's not clear how much of a win that will be;
for certain workloads, HOT might make it backfire spectacularly.

But even though Tom's statement that most indexes are one column might
be a slight exaggeration, I suspect it probably is true that the
optimizations he's talking about for large numbers of columns won't
produce any material benefit even for a 3 or 4 column index.  Which
makes me think maybe we should focus our efforts elsewhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 But even though Tom's statement that most indexes are one column might
 be a slight exaggeration, I suspect it probably is true that the
 optimizations he's talking about for large numbers of columns won't
 produce any material benefit even for a 3 or 4 column index.  Which
 makes me think maybe we should focus our efforts elsewhere.

Right.  If we thought the average was something like ten, it might be
worth pursuing optimizations similar to slot_getallattrs.  If it's
around two or three, almost certainly not.

Your point about people trying to create wider indexes to exploit
index-only scans is an interesting one, but I think it's premature to
optimize on the basis of hypotheses about what people might do in
future.

Not sure about your other idea of returning multiple tuples per
amgettuple call.  The trouble with that is that it will add complexity
(and hence cycles) at the nodeIndexscan level, because now nodeIndexscan
will have to buffer those tuples, keep track of whether it's fetching
forward or backward, etc etc.  Plus another layer of the same in
indexam.c (index_getnext etc).  I'm not at all convinced that it's
likely to be a net win.

I wonder how trustworthy the measure of the visibilitymap_test call site
as a consumer of cycles really is.  I've frequently noticed that
oprofile blames remarkably large fractions of the runtime on individual
statements that appear to be quite trivial.  I'm not sure if that
represents real hardware-level effects such as cache line switching,
or whether it's just measurement artifacts.  Keep in mind that
sampling-based measurements are always subject to sampling artifacts.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Josh Berkus
On 10/24/11 12:35 PM, Tom Lane wrote:
 Your point about people trying to create wider indexes to exploit
 index-only scans is an interesting one, but I think it's premature to
 optimize on the basis of hypotheses about what people might do in
 future.

I don't think that this is hypothetical at all.  I know *I'll* be doing
it, and we can expect users who are familiar with MySQL and Oracle to do
it as well.

No, it won't be the majority of our users, who are using ORMs and thus
don't really think about indexing at all.  But it will be a significant
number of users who are performance-sensitive ... such as most or all of
our data warehousing users.

Mind you, we're pretty much talking exclusively about users whose tables
don't fit in memory ... usually tables which are 10X or more the size of
memory.

One case which is going to be critical to test is the join table, i.e.
the table which supports many-to-many joins and consists only of keys
from the respective two other tables.
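
As a hypothetical sketch of that case: every index on such a join table
already covers all of its columns, so index-only scans could serve lookups
from either side without heap visits, visibility map permitting:

CREATE TABLE film_actor (
    film_id  integer NOT NULL,
    actor_id integer NOT NULL,
    PRIMARY KEY (film_id, actor_id)          -- covers lookups by film
);
CREATE INDEX film_actor_actor_film_idx
    ON film_actor (actor_id, film_id);       -- covers lookups by actor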

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Your point about people trying to create wider indexes to exploit
 index-only scans is an interesting one, but I think it's premature to
 optimize on the basis of hypotheses about what people might do in
 future.

Well, I don't think it's too much of a stretch to guess that people
will try to use covering indexes; that's common practice on other
products and a not-uncommon heartfelt request from
people with large (non-memory-resident) databases they want to migrate
to PostgreSQL.  Exactly to what degree they'll do that, and how well
it will work, is another question.  But I have little doubt that it
will be tried.

 Not sure about your other idea of returning multiple tuples per
 amgettuple call.  The trouble with that is that it will add complexity
 (and hence cycles) at the nodeIndexscan level, because now nodeIndexscan
 will have to buffer those tuples, keep track of whether it's fetching
 forward or backward, etc etc.  Plus another layer of the same in
 indexam.c (index_getnext etc).  I'm not at all convinced that it's
 likely to be a net win.

I definitely agree that you don't want two layers of caching, but I
don't see why we'd need that.  I wasn't thinking of changing
index_getnext() at all, but rather adding a new API that fills a
buffer (or a pair of buffers, maybe) with index tuples and heap TIDs.
It should spit them out in the same order that multiple
index_getnext() calls would have done and leave the scan position
wherever it would have ended up after a number of index_getnext_tid()
calls equal to the number of TIDs returned.  Any user of the API (and
it might be just nodeIndexonlyscan.c) would just need to keep track of
the number of tuples returned and the number consumed to date.

This actually gets into a wider architectural discussion, which is
whether the fact that the whole executor (modulo bitmap scans and a
few other special cases) operates on one tuple at a time is a good
design...  but my brain hurts just thinking about that.

 I wonder how trustworthy the measure of the visibilitymap_test call site
 as a consumer of cycles really is.  I've frequently noticed that
 oprofile blames remarkably large fractions of the runtime on individual
 statements that appear to be quite trivial.  I'm not sure if that
 represents real hardware-level effects such as cache line switching,
 or whether it's just measurement artifacts.  Keep in mind that
 sampling-based measurements are always subject to sampling artifacts.

I'm not sure either.  I guess we could try short-circuiting
visibilitymap_test and see what that does to performance (let's leave
correct answers out of it).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Kevin Grittner
Josh Berkus j...@agliodbs.com wrote:
 
 On 10/24/11 12:35 PM, Tom Lane wrote:
 Your point about people trying to create wider indexes to exploit
 index-only scans is an interesting one, but I think it's
 premature to optimize on the basis of hypotheses about what
 people might do in future.
 
 I don't think that this is hypothetical at all.  I know *I'll* be
 doing it, and we can expect users who are familiar with MySQL and
 Oracle to do it as well.
 
And Sybase, and MS SQL Server.  And others, most likely.  We've
never gotten around to narrowing the indexes to which we added extra
columns to overcome performance problems through covering index
techniques when we were using Sybase, so they're already here.  :-)
 
 One case which is going to be critical to test is the join
 table, i.e. the table which supports many-to-many joins and
 consists only of keys from the respective two other tables.
 
Yeah, that is an important use of covering indexes for us.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder how trustworthy the measure of the visibilitymap_test call site
 as a consumer of cycles really is.

 I'm not sure either.  I guess we could try short-circuiting
 visibilitymap_test and see what that does to performance (let's leave
 correct answers out of it).

That would conflate the cost of the call with the cost of the function.
Maybe you could try manually inlining the visibility test?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Idea: Always consistent in-database cache using SSI mechanisms

2011-10-24 Thread Alexander Korotkov
Hackers,

After Heikki's talk on PgConf.EU about SSI, an idea came to my mind.
Couldn't the predicate locking implementation in SSI be used for in-database
cache invalidation?
It could be possible to implement an in-database cache which acquires
predicate locks like SSI transactions do. In case of any conflict with other
transactions, the corresponding cache entry is invalidated. Therefore, it might be
possible to get the acceleration of caching without the risk of inconsistent
answers.
Actually, I don't understand SSI in detail, so I probably mess up things.
Does my idea make any sense?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Idea: Always consistent in-database cache using SSI mechanisms

2011-10-24 Thread Kevin Grittner
Alexander Korotkov aekorot...@gmail.com wrote:
 
 Coundn't be predicate locking implementation in SSI be used for
 in-database cache invalidation.
 
It would not necessarily be limited to *in-database* caches.  The
main thing would be to design a good API to the predicate locking
portion of SSI, which I think is about 80% of the SSI code.  Dan and
I both have an interest in such further use, and there have been
others who have talked about potential uses for the non-blocking
predicate locking.  I think the API would need to be based around a
listen/notify model.
 
 It could be possible to implement in-database cache which will
 acquire predicate locks like SSI transactions. In case of any
 conflich with other transactions corresponding cache invalidates.
 Therefore, it might be possible to get acceleration of caching
 without risk of inconsistent answers.
 
I had not thought of that potential use.  At first glance, I think
it has possibilities, but only if the above-mentioned API was
formalized *and* there was some way to configure a cluster for
serializable transactions only.  Long-range, I have hopes for
both.
 
 Actually, I don't understand SSI in details. So, probably I mess
 up things. Does my idea any matter?
 
Sure!  Besides having the available development time, I think the
biggest obstacle is having enough plausible use cases for predicate
lock access to do a good job defining the API.  While we made some
effort to keep the predicate locking and serializable behavior
separate in the implementation, it wasn't clear where the bright
line was, so there is bound to be some rearrangement needed when we
figure that out.  The more ideas we have in front of us on how
predicate locks might be useful, the better the API design is likely
to be.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI implementation question

2011-10-24 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 I don't understand the SSI code well enough to tell if this is
 sufficient or not, so I hope you guys will take a closer look at
 the issue when you have time.
 
I will definitely give it a look.  Right now we have a perfect
storm of time demands due to some recently-passed legislation
combined with the need to wrap up some Annual Plan items.  Within
a few weeks I should find time for a more careful review and
testing.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Idea: Always consistent in-database cache using SSI mechanisms

2011-10-24 Thread Alexander Korotkov
On Tue, Oct 25, 2011 at 1:46 AM, Kevin Grittner kevin.gritt...@wicourts.gov
 wrote:

 Alexander Korotkov aekorot...@gmail.com wrote:

  Coundn't be predicate locking implementation in SSI be used for
  in-database cache invalidation.

 It would not necessarily be limited to *in-database* caches.  The
 main thing would be to design a good API to the predicate locking
 portion of SSI, which I think is about 80% of the SSI code.  Dan and
 I both have an interest in such further use, and there have been
 others who have talked about potential uses for the non-blocking
 predicate locking.  I think the API would need to be based around a
 listen/notify model.

  It could be possible to implement in-database cache which will
  acquire predicate locks like SSI transactions. In case of any
  conflich with other transactions corresponding cache invalidates.
  Therefore, it might be possible to get acceleration of caching
  without risk of inconsistent answers.

 I had not thought of that potential use.  At first glance, I think
 it has possibilities, but only if the above-mentioned API was
 formalized *and* there was some way to configure a cluster for
 serializable transactions only.  Long-range, I have hopes for
 both.


Sure, it would be rather better to implement that through API.

  Actually, I don't understand SSI in details. So, probably I mess
  up things. Does my idea any matter?

 Sure!  Besides having the available development time, I think the
 biggest obstacle is having enough plausible use cases for predicate
 lock access to do a good job defining the API.  While we made some
 effort to keep the predicate locking and serializable behavior
 separate in the implementation, it wasn't clear where the bright
 line was, so there is bound to be some rearrangement needed when we
 figure that out.  The more ideas we have in front of us on how
 predicate locks might be useful, the better the API design is likely
 to be.


Thanks for feedback on my idea. I'll share ideas about more possible usage
of that API if I have any.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Online base backup from the hot-standby

2011-10-24 Thread Fujii Masao
Thanks for the review!

On Tue, Oct 25, 2011 at 12:24 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 24.10.2011 15:29, Fujii Masao wrote:

 +    listitem
 +     para
 +      Copy the pg_control file from the cluster directory to the global
 +      sub-directory of the backup. For example:
 + programlisting
 + cp $PGDATA/global/pg_control /mnt/server/backupdir/global
 + /programlisting
 +     /para
 +    /listitem

 Why is this step required? The control file is overwritten by information
 from the backup_label anyway, no?

Yes, when recovery starts, the control file is overwritten. But before that,
we retrieve the minimum recovery point from the control file. Then it's used
as the backup end location.

During recovery, pg_stop_backup() cannot write an end-of-backup record.
So, in a standby-only backup, another way to retrieve the backup end location
(instead of an end-of-backup record) is required. Ishiduka-san used the
control file for that, following your suggestion ;)
http://archives.postgresql.org/pgsql-hackers/2011-05/msg01405.php

 +    listitem
 +     para
 +      Again connect to the database as a superuser, and execute
 +      functionpg_stop_backup/. This terminates the backup mode, but
 does not
 +      perform a switch to the next WAL segment, create a backup history
 file and
 +      wait for all required WAL segments to be archived,
 +      unlike that during normal processing.
 +     /para
 +    /listitem

 How do you ensure that all the required WAL segments have been archived,
 then?

The patch doesn't provide any capability to ensure that, IOW assumes that's
a user responsibility. If a user wants to ensure that, he/she needs to calculate
the backup start and end WAL files from the result of pg_start_backup()
and pg_stop_backup() respectively, and needs to wait until those files have
appeared in the archive. Also if the required WAL file has not been archived
yet, a user might need to execute pg_switch_xlog() in the master.
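
For example, the WAL segment that has to be present in the archive can be
computed on the master from the location pg_stop_backup() returned (the
location below is illustrative):

-- On the master; '0/9000020' stands for whatever pg_stop_backup() returned.
SELECT pg_xlogfile_name('0/9000020');
-- If that segment has not reached the archive yet, pg_switch_xlog() on the
-- master forces the current segment out so it can be archived.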

If we change pg_stop_backup() so that, even during recovery, it waits until
all required WAL files have been archived, we would need to WAL-log
the completion of WAL archiving in the master. This enables the standby to
check whether specified WAL files have been archived. Should we change
the patch in this way? But even with that change, you might still need to execute
pg_switch_xlog() in the master, and pg_stop_backup() might keep
waiting forever if the master is not generating any new WAL.

 +   /orderedlist
 +    /para
 +
 +    para
 +     You cannot use the applicationpg_basebackup/ tool to take the
 backup
 +     from the standby.
 +    /para

 Why not? We have cascading replication now.

Because no one has implemented that feature.

Yeah, we have cascading replication, but without adopting the standby-only
backup patch, pg_basebackup cannot execute do_pg_start_backup() and
do_pg_stop_backup() during recovery. So the patch that
Ishiduka-san proposed can be seen as the first step toward extending
pg_basebackup so that it can take a backup from the standby.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Separating bgwriter and checkpointer

2011-10-24 Thread Simon Riggs
On Mon, Oct 24, 2011 at 11:40 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 The patch looks sane, it's mostly just moving existing code around, but
 there's one thing that's been bothering me about this whole idea from the
 get-go:

 If the bgwriter and checkpointer are two different processes, whenever
 bgwriter writes out a page it needs to send an fsync-request to the
 checkpointer. We avoided that when both functions were performed by the same
 process, but now we have to send and absorb a fsync-request message for
 every single write() that happens in the system, except for those done at
 checkpoints. Isn't that very expensive? Does it make the fsync-request queue
 a bottleneck on some workloads?

That is a reasonable question and one I considered.

I did some benchmarking earlier to see the overhead of that.
Basically, it's very small, much, much smaller than I thought.

The benefit of allowing the bgwriter to continue working during long
fsyncs easily outweighs the cost of sending more fsync requests. Both of
those overheads/problems occur at the same time, so the extra
overhead is always covered.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-10-24 Thread Fujii Masao
On Tue, Oct 25, 2011 at 12:33 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 One problem with this whole FPW-tracking is that pg_lesslog makes it fail.
 I'm not sure what we need to do about that - maybe just add a warning to the
 docs. But it leaves a bit bad feeling in my mouth. Usually we try to make
 features work orthogonally, without dependencies to other settings. Now this
 feature requires that full_page_writes is turned on in the master, and also
 that you don't use pg_lesslog to compress the WAL segments or your base
 backup might be corrupt.

Right, pg_lesslog users cannot use the documented procedure. They need to
follow a more complex one:

1. Execute pg_start_backup() in the master, and save its return value.
2. Wait until the backup starting checkpoint record has been replayed
in the standby. You can do this by comparing the return value of
pg_start_backup() with pg_last_replay_location().
3. Do the documented standby-only backup procedure.
4. Execute pg_stop_backup() in the master.

This is complicated, but I'm not sure how we can simplify it. Anyway, we can
document this procedure for pg_lesslog users. Should we?

 The procedure to take a backup from the standby
 seems more complicated than taking it on the master - there are more steps
 to follow.

Extending pg_basebackup so that it can take a backup from the standby would
make the procedure simpler to a certain extent, I think. Though a user
still needs
to enable FPW in the master and must not use pg_lesslog.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers