Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Greg Stark
On Wed, May 25, 2011 at 9:41 AM, Tom Lane  wrote:
> Yeah, I had been thinking about the latter point.  We could be
> conservative and just keep the reported tuple density the same (ie,
> update relpages to the new correct value, while setting reltuples so
> that the density ratio doesn't change).  But that has its own problems
> when the table contents *do* change.  What I'm currently imagining is
> to do a smoothed moving average, where we factor in the new density
> estimate with a weight dependent on the percentage of the table we did
> scan.  That is, the calculation goes something like
>
> old_density = old_reltuples / old_relpages
> new_density = counted_tuples / scanned_pages
> reliability = scanned_pages / new_relpages
> updated_density = old_density + (new_density - old_density) * reliability
> new_reltuples = updated_density * new_relpages

This amounts to assuming that the pages observed by the vacuum have
the density observed, and that the pages that weren't seen have the density
that was previously recorded in the reltuples/relpages stats. That seems like
a pretty solid approach to me. If the numbers were sane before, it
follows that they should be sane after the update.
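(For concreteness, that update expressed as a small C sketch; the function and
variable names are illustrative only, not the committed code, and guards
against zero page counts are omitted:)

/* Illustrative sketch of the smoothed reltuples update described above. */
static double
smoothed_reltuples(double old_reltuples, double old_relpages,
                   double counted_tuples, double scanned_pages,
                   double new_relpages)
{
    double      old_density = old_reltuples / old_relpages;
    double      new_density = counted_tuples / scanned_pages;
    double      reliability = scanned_pages / new_relpages;
    double      updated_density;

    /* weight the fresh observation by the fraction of the table scanned */
    updated_density = old_density + (new_density - old_density) * reliability;

    return updated_density * new_relpages;
}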

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Greg Stark
On Wed, May 25, 2011 at 10:05 PM, Greg Stark  wrote:
>> updated_density = old_density + (new_density - old_density) * reliability
>> new_reltuples = updated_density * new_relpages
>
> This amounts to assuming that the pages observed by the vacuum have
> the density observed, and that the pages that weren't seen have the density
> that was previously recorded in the reltuples/relpages stats.

In case it's not clear, Tom's expression for updated_density is
equivalent by simple algebra to:

updated_density = (old_density * (1-reliability)) + (new_density * reliability)
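Writing out the single step of algebra in the same notation:

updated_density = old_density + (new_density - old_density) * reliability
                = old_density - old_density * reliability + new_density * reliability
                = old_density * (1 - reliability) + new_density * reliability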

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Fujii Masao
On Thu, May 26, 2011 at 1:18 PM, Robert Haas  wrote:
>> The replay of the WAL record for A doesn't rely on the content of chunk 1
>> which B modified. So I don't think that "partial page writes" has such
>> a problem.
>> No?
>
> Sorry.  WAL records today DO rely on the prior state of the page.  If
> they didn't, we wouldn't need full page writes.  They don't rely on
> them terribly heavily - things like where pd_upper is pointing, and
> what the page LSN is.  But they do rely on them.

Yeah, I'm sure that a normal WAL record (neither a full page write nor
a "partial page write") relies on the prior state of the page. But the WAL
record for A is a "partial page write"; does that also rely on the prior
state?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:09 PM, Fujii Masao  wrote:
> On Wed, May 25, 2011 at 9:34 PM, Robert Haas  wrote:
>> On Tue, May 24, 2011 at 10:52 PM, Jeff Davis  wrote:
>>> On Tue, 2011-05-24 at 16:34 -0400, Robert Haas wrote:
>>>> As I think about it a bit more, we'd
>>>> need to XLOG not only the parts of the page we're actually modifying, but
>>>> any that the WAL record would need to be correct on replay.
>>>
>>> I don't understand that statement. Can you clarify?
>>
>> I'll try.  Suppose we have two WAL records A and B, with no
>> intervening checkpoint, that both modify the same page.  A reads chunk
>> 1 of that page and then modifies chunk 2.  B modifies chunk 1.  Now,
>> suppose we make A do a "partial page write" on chunk 2 only, and B do
>> the same for chunk 1.  At the point the system crashes, A and B are
>> both on disk, and the page has already been written to disk as well.
>> Replay begins from a checkpoint preceding A.
>>
>> Now, when we get to the record for A, what are we to do?  If it were a
>> full page image, we could just restore it, and everything would be
>> fine after that.  But if we replay the partial page write, we've got
>> trouble.  A will now see the state of chunk 1 as it existed after
>> the action protected by B occurred, and will presumably do the wrong
>> thing.
>
> If this is really true, full page writes would also cause a similar problem.
> No? Imagine the case where A reads page 1, then modifies page 2, and B
> modifies page 1. During recovery, A will see the state of page 1 as it existed
> after the action by B.

Yeah, but it won't matter, because the LSN interlock will prevent A
from taking any action.  If you only write parts of the page, though,
the concept of "the" LSN of the page becomes a bit murky, because you
may have different parts of the page from different points in the WAL
stream.  I believe it's possible to cope with that if we design it
carefully, but it does seem rather complex and error-prone (which is
not necessarily the best design for a recovery system, but hey).
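(For reference, the whole-page interlock being relied on here is the standard
check a redo routine makes before reapplying a record; a simplified sketch,
assuming the usual PageGetLSN/PageSetLSN macros and omitting buffer lookup and
error handling:)

/* Simplified sketch of the per-page LSN interlock during WAL replay.
 * 'lsn' is the location of the record being replayed. */
if (XLByteLE(lsn, PageGetLSN(page)))
{
    /* Page already reflects this record (or a later one); skip it. */
    UnlockReleaseBuffer(buffer);
}
else
{
    /* ... apply the change described by the record ... */
    PageSetLSN(page, lsn);
    MarkBufferDirty(buffer);
    UnlockReleaseBuffer(buffer);
}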

Anyway, you can either have the partial page write for A restore the
older LSN, or not.  If you do, then you have the problem as I
described it.  If you don't, then the effects of A vanish into the
ether.  Either way, it doesn't work.

> The replay of the WAL record for A doesn't rely on the content of chunk 1
> which B modified. So I don't think that "partial page writes" has such
> a problem.
> No?

Sorry.  WAL records today DO rely on the prior state of the page.  If
they didn't, we wouldn't need full page writes.  They don't rely on
them terribly heavily - things like where pd_upper is pointing, and
what the page LSN is.  But they do rely on them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 11:51 PM, Pavan Deolasee
 wrote:
>> Agreed.  The only thing I'm trying to do further is to avoid the need
>> for a reshuffle when the special LSN storage is reclaimed.
>
> Ah, OK. That was never clear from your initial emails, or maybe I
> misread them.

Sorry, I must not have explained it very well.  :-(

> So what you are saying is that by storing the LSN after the line pointer
> array, we might be able to reclaim the LSN storage without shuffling. That
> makes sense. Having said that, it doesn't excite me too much because I
> think we should do the dead line pointer reclaim operation during page
> pruning, and we are already holding the cleanup lock at that time and most
> likely doing a reshuffle anyway.

I'll give that a firm maybe.  If there is no reshuffle, then you can
do this with just an exclusive content lock.  Maybe that's worthless,
but I'm not certain of it.  I guess we might need to see how the code
shakes out.

Also, reshuffling might be more expensive.  I agree that if there are
new dead tuples on the page, then you're going to be paying that price
anyway; but if not, it might be avoidable.

> Also, a downside of storing the LSN after the line pointer array is that you
> may waste space because of alignment issues.

We could possibly store it unaligned and read it back two bytes at a
time.  Granted, that's not free.
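(Concretely, something along these lines; 'lsn_location' is a hypothetical
pointer to wherever on the page the value was stashed, and copying with memcpy
serves the same purpose as reading it a couple of bytes at a time:)

/* Hedged sketch: read a possibly-unaligned stored LSN without an
 * alignment-sensitive pointer dereference. */
XLogRecPtr  stored_lsn;

memcpy(&stored_lsn, lsn_location, sizeof(XLogRecPtr));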

> I also thought that the
> LSN might get in the way of extending the line pointer array, but probably
> that's not a big deal, since if there is free space in the page (and
> there should be if we are adding a new tuple), it should be available
> immediately after the LSN.

Yeah.  I'm not sure how icky that is, though.

> There are some other issues that we should think about too, like
> recording free space and managing the visibility map. The free space is
> recorded in the second pass today, but I don't see any reason why
> that can't be moved to the first pass. It's not clear, though, whether we
> should also record free space after a retail page vacuum or leave it as
> it is.

Not sure.  Any idea why it's like that, or why we might want to change it?

> For visibility maps, we should not update them while there are
> LP_DEAD line pointers on the page. Now that's not good because all
> tuples in the page may be visible, so we may lose some advantage, at
> least for a while; but if we mark the page all-visible, the vacuum scan
> would not find the dead line pointers in it, and that would leave
> dangling index pointers after an index vacuum.

Also, an index-only scan might return index tuples that are pointing
to dead line pointers.

Currently, I believe the only way a page can get marked all-visible is
by vacuum.  But if we make this change, then it would be possible for
a HOT cleanup to encounter a situation where all-visible could be set.
 We probably want to make that work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Pavan Deolasee
On Wed, May 25, 2011 at 11:39 PM, Robert Haas  wrote:
> On Wed, May 25, 2011 at 1:43 PM, Pavan Deolasee
>> Now, there
>> is no way you can store it after the line pointer array without moving
>> the live tuple somewhere else.
>
> So far I agree.  But don't we always defragment immediately after
> pruning dead tuples to line pointers?  The removal of even one tuple
> will give us more than enough space to store the LSN.
>

Yes, we do. But defragmenting means shuffling tuples around. So we agree
that to find space for the LSN, we might need to move the tuples
around.

>
> Agreed.  The only thing I'm trying to do further is to avoid the need
> for a reshuffle when the special LSN storage is reclaimed.

Ah, OK. That was never clear from your initial emails, or maybe I
misread them. So what you are saying is that by storing the LSN after the line pointer
array, we might be able to reclaim the LSN storage without shuffling. That
makes sense. Having said that, it doesn't excite me too much because I
think we should do the dead line pointer reclaim operation during page
pruning, and we are already holding the cleanup lock at that time and most
likely doing a reshuffle anyway.

Also, a downside of storing the LSN after the line pointer array is that you
may waste space because of alignment issues. I also thought that the
LSN might get in the way of extending the line pointer array, but probably
that's not a big deal, since if there is free space in the page (and
there should be if we are adding a new tuple), it should be available
immediately after the LSN.

There are some other issues that we should think about too, like
recording free space and managing the visibility map. The free space is
recorded in the second pass today, but I don't see any reason why
that can't be moved to the first pass. It's not clear, though, whether we
should also record free space after a retail page vacuum or leave it as
it is. For visibility maps, we should not update them while there are
LP_DEAD line pointers on the page. Now that's not good because all
tuples in the page may be visible, so we may lose some advantage, at
least for a while; but if we mark the page all-visible, the vacuum scan
would not find the dead line pointers in it, and that would leave
dangling index pointers after an index vacuum.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI predicate locking on heap -- tuple or row?

2011-05-25 Thread Kevin Grittner
> Dan Ports  wrote:
> On Tue, May 24, 2011 at 04:18:37AM -0500, Kevin Grittner wrote:
 
>> These proofs show that there is no legitimate cycle which could
>> cause an anomaly which the move from row-based to tuple-based
>> logic will miss. They don't prove that the change will generate
>> all the same serialization failures; and in fact, some false
>> positives are eliminated by the change.
>
> Yes, that's correct. That's related to the part in the proof where
> I claimed T3 couldn't have a conflict out *to some transaction T0
> that precedes T1*.
>
> I originally tried to show that T3 couldn't have any conflicts out
> that T2 didn't have, which would mean we got the same set of
> serialization failures, but that's not true. In fact, it's not too
> hard to come up with an example where there would be a
> serialization failure with the row version links, but not without.
> However, because the rw-conflict can't be pointing to a transaction
> that precedes T1 in the serial order, it won't create a cycle. In
> other words, there are serialization failures that won't happen
> anymore, but they were false positives.
 
Dan and I went around a couple times chasing down all code, comment,
and patch changes needed, resulting in the attached patch.  We found
and fixed the bug which originally manifested in a way which I
confused with a need for row locks, as well as another which was
nearby in the code.  We backed out the changes which were causing
merge problems for Robert, as those were part of the attempt at the
row locking (versus tuple locking).  We removed a function which is
no longer needed.  We adjusted the comments and an affected isolation
test.
 
As might be expected from removing an unnecessary feature, the lines
of code went down -- a net decrease of 93 lines.
 
This patch applies against HEAD, `make check-world` passes as does
`make -C src/test/isolation installcheck` and `make
installcheck-world` against a database with
default_transaction_isolation = 'serializable'.  Dan is running
stress testing against the patched version tonight with DBT-2.
 
These changes generate merge conflicts with the work I've done on
handling CLUSTER, DROP INDEX, etc.  It seems to me that the best
course would be to commit this, then I can rebase the other work and
post it.  Since these issues are orthogonal, it didn't seem like a
good idea to combine them in one patch, and this one seems more
urgent.
 
-Kevin




ssi-tuple-not-row-locking-3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Fujii Masao
On Wed, May 25, 2011 at 9:34 PM, Robert Haas  wrote:
> On Tue, May 24, 2011 at 10:52 PM, Jeff Davis  wrote:
>> On Tue, 2011-05-24 at 16:34 -0400, Robert Haas wrote:
>>> As I think about it a bit more, we'd
>>> need to XLOG not only the parts of the page we're actually modifying, but
>>> any that the WAL record would need to be correct on replay.
>>
>> I don't understand that statement. Can you clarify?
>
> I'll try.  Suppose we have two WAL records A and B, with no
> intervening checkpoint, that both modify the same page.  A reads chunk
> 1 of that page and then modifies chunk 2.  B modifies chunk 1.  Now,
> suppose we make A do a "partial page write" on chunk 2 only, and B do
> the same for chunk 1.  At the point the system crashes, A and B are
> both on disk, and the page has already been written to disk as well.
> Replay begins from a checkpoint preceding A.
>
> Now, when we get to the record for A, what are we to do?  If it were a
> full page image, we could just restore it, and everything would be
> fine after that.  But if we replay the partial page write, we've got
> trouble.  A will now see the state of chunk 1 as it existed after
> the action protected by B occurred, and will presumably do the wrong
> thing.

If this is really true, full page writes would also cause a similar problem.
No? Imagine the case where A reads page 1, then modifies page 2, and B
modifies page 1. During recovery, A will see the state of page 1 as it existed
after the action by B.

The replay of the WAL record for A doesn't rely on the content of chunk 1
which B modified. So I don't think that "partial page writes" has such
a problem.
No?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 8:37 PM, Michael Nolan  wrote:
> On Wed, May 25, 2011 at 11:18 AM, Robert Haas  wrote:
>>
>> I basically agree.  There have been several recent discussions of this
>> topic on both -hackers and -performance; it is likely that the TODO
>> needs to be updated with some more recent links.
>
> Anything to help the NKOTB to get up to speed would be appreciated, though I
> still think it is not just a 'caching' issue.
>
> The question I hesitated to ask in Ottawa was:  So, what information would
> you like and what would you do with it?

*scratches head*

I'm not sure I understand your question.  Can you elaborate?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner

2011-05-25 Thread Michael Nolan
On Wed, May 25, 2011 at 11:18 AM, Robert Haas  wrote:

>
> I basically agree.  There have been several recent discussions of this
> topic on both -hackers and -performance; it is likely that the TODO
> needs to be updated with some more recent links.
>

Anything to help the NKOTB to get up to speed would be appreciated, though I
still think it is not just a 'caching' issue.

The question I hesitated to ask in Ottawa was:  So, what information would
you like and what would you do with it?
--
Mike Nolan


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Greg Stark
On Tue, May 24, 2011 at 1:34 PM, Robert Haas  wrote:
> 1. Heikki suggested that instead of doing full page writes, we might
> try to write only the parts of the page that have changed.  For
> example, if we had 16 bits to play with in the page header (which we
> don't), then we could imagine the page as being broken up into 16
> 512-byte chunks, one per bit.  Each time we update the page, we write
> whatever subset of the 512-byte chunks we're actually modifying,
>

Alternatively, we could have change vectors which are something like
 which I think would be a lot less wasteful than
dumping 512-byte chunks. The main advantage of 512-byte chunks is that it's
easier to figure out which chunks to include in the output, and if
you're replacing the entire block it looks just like our existing
system, including not having to read in the page before writing. If we
output change vectors, then we need to do some arithmetic to figure out
when it makes sense to merge records and how much of the block we need
to be replacing before we just decide to include the whole block.
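(A purely hypothetical layout for one such change vector, just to make the
comparison with 512-byte chunks concrete; the exact fields Greg had in mind
were lost from the archived message above:)

/* Hypothetical change-vector layout: a small header naming which bytes
 * of the page changed, followed by the replacement bytes themselves. */
typedef struct PageChangeVector
{
    uint16      offset;         /* byte offset of the change within the page */
    uint16      length;         /* number of bytes replaced */
    /* 'length' bytes of new data follow */
} PageChangeVector;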


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade automatic testing

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 3:07 PM, Peter Eisentraut  wrote:
> On ons, 2011-04-27 at 18:14 -0400, Noah Misch wrote:
>> Enthusiastic +1 for this concept.  There's at least one rough edge: it fails
>> if you have another postmaster running on port 5432.
>
> This has now been addressed: pg_upgrade accepts PGPORT settings.
> Attached is a slightly updated patch that runs the test suite with a port of
> 65432, which you can override by setting PGPORT yourself.
>
> Anyway, is this something that people want in the repository?  It's not
> as polished as the pg_regress business, but it is definitely helpful.

Is this going to result in using the built binaries with the installed
libraries, a la Tom's recent complaint about the isolation tests?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 5:54 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Wed, May 25, 2011 at 1:04 PM, Tom Lane  wrote:
>>> Because the problem is not specific to TOAST tables.  As things
>>> currently stand, we will accept the word of an ANALYZE as gospel even if
>>> it scanned 1% of the table, and completely ignore the results from a
>>> VACUUM even if it scanned 99% of the table.  This is not sane.
>
>> I agree that if VACUUM scanned 99% of the table, it's probably fine to
>> use its numbers.  It's also fine to use the numbers from ANALYZE,
>> because those pages are chosen randomly.  What bothers me is the idea
>> of using a small *non-random* sample, and I'm not sure that
>> incorporating possibly-bogus results slowly is any better than
>> incorporating them quickly.
>
> The above is simply fuzzy thinking.  The fact that ANALYZE looked at a
> random subset of pages is *no guarantee whatsoever* that its results are
> highly accurate.  They might be more trustworthy than VACUUM's nonrandom
> sample of a similar number of pages, but it doesn't hold even a little
> bit of water to claim that we should believe ANALYZE completely and
> VACUUM not at all even when the latter has looked at a significantly
> larger sample of pages.

I think you're arguing against a straw-man.

> In any case, your line of thought doesn't help us for fixing the problem
> with toast tables, because we aren't going to start doing ANALYZEs on
> toast tables.

Can we simply use a constant for tuple density on TOAST tables?

> The bottom line here is that making use of stats we have is a lot better
> than not making use of them, even if they aren't entirely trustworthy.

http://dilbert.com/strips/comic/2008-05-07/

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Displaying optimized CASE expressions (was Re: [HACKERS] Nested CASE-WHEN scoping)

2011-05-25 Thread Tom Lane
I wrote:
> Heikki Linnakangas  writes:
>> BTW, i just stumbled into this:

>> postgres=# explain verbose SELECT CASE now() WHEN (29+random()::int4) 
>> THEN 'foo' ELSE 'bar' END;
>> ERROR:  unexpected CASE WHEN clause: 326

>> Looks like ruleutils.c is also not prepared for the case that the 
>> implicit equality operation gets inlined into something else than an OpExpr.

> Grumble ... I thought we'd fixed that ...

Yeah, you're right.  We've hacked that code so it can handle some
transformations that the optimizer might apply, but inlining some random
expression to replace the equality operator is far beyond what we can
hope to deal with.

For those following along at home, the point is that if the user writes

CASE test_expr WHEN cmp_expr THEN ...

the parser identifies the equality operator to use and produces
something that looks like this:

CASE test_expr WHEN CaseTestExpr = cmp_expr THEN ...

We really need ruleutils.c to generate the original form when it is
looking at a stored rule (eg a view), so it goes to some lengths to
recognize "CaseTestExpr = something" in a WHEN clause and only print the
"something".  However, this example shows that there's no chance of
always being able to do that when looking at an expression that's been
through the planner.

I think what we'd better do, if we don't see something that looks like
"CaseTestExpr = something", is just print whatever we do have in the
WHEN clause.  That will require inventing a print representation for
CaseTestExpr, since in most cases that's going to appear in there
somewhere.  I suggest we just print CASE_TEST_EXPR, but if anyone wants
to bikeshed, feel free ...

Note that if Heikki does what I suggested upthread, the display will
eventually probably look like "$nn" instead (since it'll be a Param not
a CaseTestExpr).  But that's 9.2 or later material.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Tim Uckun
> I thought the problem was that they upgraded the OS and now the encoding
> names changed, though they behaved the same.  Is that not what is
> happening?  Can they supply the values with different cases?
>


In my case I never touched the locale. It was set by the OS. I presume
this is true for most people.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Tom Lane
Robert Haas  writes:
> On Wed, May 25, 2011 at 1:04 PM, Tom Lane  wrote:
>> Because the problem is not specific to TOAST tables.  As things
>> currently stand, we will accept the word of an ANALYZE as gospel even if
>> it scanned 1% of the table, and completely ignore the results from a
>> VACUUM even if it scanned 99% of the table.  This is not sane.

> I agree that if VACUUM scanned 99% of the table, it's probably fine to
> use its numbers.  It's also fine to use the numbers from ANALYZE,
> because those pages are chosen randomly.  What bothers me is the idea
> of using a small *non-random* sample, and I'm not sure that
> incorporating possibly-bogus results slowly is any better than
> incorporating them quickly.

The above is simply fuzzy thinking.  The fact that ANALYZE looked at a
random subset of pages is *no guarantee whatsoever* that its results are
highly accurate.  They might be more trustworthy than VACUUM's nonrandom
sample of a similar number of pages, but it doesn't hold even a little
bit of water to claim that we should believe ANALYZE completely and
VACUUM not at all even when the latter has looked at a significantly
larger sample of pages.

In any case, your line of thought doesn't help us for fixing the problem
with toast tables, because we aren't going to start doing ANALYZEs on
toast tables.

The bottom line here is that making use of stats we have is a lot better
than not making use of them, even if they aren't entirely trustworthy.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Bruce Momjian
Tom Lane wrote:
> Robert Haas  writes:
> > On Wed, May 25, 2011 at 2:13 PM, Bruce Momjian  wrote:
> >> I thought the problem was that they upgraded the OS and now the encoding
> >> names changed, though they behaved the same.  Is that not what is
> >> happening?  Can they supply the values with different cases?
> 
> > Oh, hmm.  I don't know.
> 
> The original complaint wasn't too clear.  But I agree with Robert that
> case-insensitive comparison is about as far as it seems safe to go in
> trying to allow for variant spellings of locale names.  I would think
> that a platform that randomly whacks such names around is going to find
> a whole lot of software breaking, not only Postgres.

Agreed.  Unless an expert shows up and can tell us exactly how the
naming works, we have gone as far as we can go.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nested CASE-WHEN scoping

2011-05-25 Thread Tom Lane
Heikki Linnakangas  writes:
> On 25.05.2011 17:47, Tom Lane wrote:
>> [ scratches head ... ]  Why does the save/restore in ExecEvalCase not
>> take care of this?

> The mistake happens during planning, when the SQL function is inlined 
> and pre-evaluated.

OK, I see the problem now: we have a CaseTestExpr in the arguments of
the function, which we are unable to reduce to a constant, so it gets
substituted as-is into the body of the function during inlining.  And
then it's physically inside the CASE expression that's in the function
body, so it looks like it syntactically belongs to that expression,
which it doesn't.  You're probably right that this is impractical to fix
without redesigning the expression representation, and that
CoerceToDomainValue has got a similar issue.

My advice is to not change the parser output representation, though.
What I think you ought to do about this is to have the planner replace 
CaseTestExpr and CoerceToDomainValue with PARAM_EXEC Params during
expression preprocessing, and assign suitable Param numbers which it
sticks into the CaseExpr (resp CoerceToDomainExpr) so that those
expressions know which Param slot to fill at runtime.  The
const-simplification logic can avoid getting dumber by treating the
cases like known-Param-value cases.  I don't think you need to invent
something separate from the PARAM_EXEC infrastructure to handle these.

The main annoyance here is that standalone expressions may now need
Param slots to execute, which will complicate places that need to
evaluate them, but there's probably no way around that (a bespoke
mechanism would complicate those callers just as much, if the number
of value slots it needs is variable, which it will be).

> BTW, i just stumbled into this:

> postgres=# explain verbose SELECT CASE now() WHEN (29+random()::int4) 
> THEN 'foo' ELSE 'bar' END;
> ERROR:  unexpected CASE WHEN clause: 326

> Looks like ruleutils.c is also not prepared for the case that the 
> implicit equality operation gets inlined into something else than an OpExpr.

Grumble ... I thought we'd fixed that ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Tom Lane
Robert Haas  writes:
> On Wed, May 25, 2011 at 2:13 PM, Bruce Momjian  wrote:
>> I thought the problem was that they upgraded the OS and now the encoding
>> names changed, though they behaved the same.  Is that not what is
>> happening?  Can they supply the values with different cases?

> Oh, hmm.  I don't know.

The original complaint wasn't too clear.  But I agree with Robert that
case-insensitive comparison is about as far as it seems safe to go in
trying to allow for variant spellings of locale names.  I would think
that a platform that randomly whacks such names around is going to find
a whole lot of software breaking, not only Postgres.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Tom Lane
Robert Haas  writes:
> I think getting it working is probably a good first goal.  I am not
> really sure that we want to commit it that way, and I think my vote
> would be for you to work on the approach we discussed before rather
> than this one, but it's your project, and I think you'll probably
> learn enough in getting it working that it will be a step forward in
> any case.  The planner is complex enough that it's worth trying to get
> something that works, first, before trying to make it perfect.

Remember Brooks' advice: "Plan to throw one away.  You will anyhow."

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] about EDITOR_LINENUMBER_SWITCH

2011-05-25 Thread Tom Lane
Alvaro Herrera  writes:
> Excerpts from Tom Lane's message of mar may 24 17:11:17 -0400 2011:
>> Right.  It would also increase the cognitive load on the user to have
>> to remember the command-line go-to-line-number switch for his editor.
>> So I don't particularly want to redesign this feature.  However, I can
>> see the possible value of letting EDITOR_LINENUMBER_SWITCH be set from
>> the same place that you set EDITOR, which would suggest that we allow
>> the value to come from an environment variable.  I'm not sure whether
>> there is merit in allowing both that source and ~/.psqlrc, though
>> possibly for Windows users it might be easier if ~/.psqlrc worked.

> If we're going to increase the number of options in .psqlrc that do not
> work with older psql versions, can I please have .psqlrc-MAJORVERSION or
> some such?  Having 8.3's psql complain all the time because it doesn't
> understand "linestyle" is annoying.

1. I thought we already did have that.

2. In any case, EDITOR_LINENUMBER_SWITCH isn't a hazard for this,
because older versions will just think it's a variable without any
special meaning.

But the real question here is whether we want to change it to be also
(or instead?) an environment variable.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 3:05 PM, Simon Riggs  wrote:
> On Wed, May 25, 2011 at 3:41 PM, Simon Riggs  wrote:
>> On Wed, May 25, 2011 at 2:43 PM, Leonardo Francalanci  
>> wrote:
>>>> From: Simon Riggs
>>>> I can't find a clear discussion of what you are trying to do, and how,
>>>> just a URL back to a complex discussion on another topic.
>>>
>>>
>>> While trying to write a patch to allow changing an unlogged table into
>>> a logged one, I had to add another int field to xl_xact_commit.
>>> Robert Haas said:
>>>
>>>  "I have to admit I don't like this approach very much.  I can't see
>>> adding 4 bytes to every commit record for this feature."
>>>
>>>
>>> which is a correct remark.
>>>
>>> xl_xact_commit can contain some arrays (relation to drops,
>>> committed sub-trans, shared invalidation msgs). The length of
>>> these arrays is specified using 3 ints in the struct.
>>>
>>> So, to avoid adding more ints to the struct, it was suggested that I
>>> remove all the ints and use xl_xact_commit.xinfo to flag which
>>> arrays are, in fact, present.
>>>
>>> So the whole idea is:
>>>
>>> - remove nrels, nsubxacts and nmsgs from xl_xact_commit
>>> - use bits in xinfo to signal which arrays are present at the end
>>> of   xl_xact_commit
>>> - for each present array, add the length of the array (as int) at
>>> the end of    xl_xact_commit
>>> - add each present array after all the lengths
>>
>>
>> OK, thats clear. Thanks.
>>
>> That formatting sounds quite complex.
>>
>> I would propose we split this into 2 WAL records: xl_xact_commit and
>> xl_xact_commit_with_info
>>
>> xl_xact_commit doesn't have any flags, counts or arrays.
>>
>> xl_xact_commit_with_info always has all 3 counts, even if zero.
>> Arrays follow the main record
>>
>> I think it might also be possible to remove dbId and tsId from
>> xl_xact_commit if we use that definition.
>
> Yes, that's correct. We can remove them from the normal commit record
> when nmsgs == 0.
>
> I think we can get away with these 2 definitions, but pls check. Using
> static definitions works better for me because we can see what they
> contain, rather than having the info flags imply that the record can
> contain any permutation of settings when that's not really possible.
>
>  {
>        TimestampTz xact_time;      /* time of commit */
>        int         nsubxacts;      /* number of subtransaction XIDs */
>        /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
>  } xl_xact_commit;
>
>  {
>        TimestampTz xact_time;      /* time of commit */
>        uint32      xinfo;          /* info flags */
>        int         nrels;          /* number of RelFileNodes */
>        int         nsubxacts;      /* number of subtransaction XIDs */
>        int         nmsgs;          /* number of shared inval msgs */
>        Oid         dbId;           /* MyDatabaseId */
>        Oid         tsId;           /* MyDatabaseTableSpace */
>        /* Array of RelFileNode(s) to drop at commit */
>        RelFileNode xnodes[1];      /* VARIABLE LENGTH ARRAY */
>        /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
>        /* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */
>  } xl_xact_commit_with_info;

Wow, that is identical to the conclusion that I came to about 60 seconds ago.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade automatic testing

2011-05-25 Thread Peter Eisentraut
On ons, 2011-04-27 at 18:14 -0400, Noah Misch wrote:
> Enthusiastic +1 for this concept.  There's at least one rough edge: it fails
> if you have another postmaster running on port 5432.

This has now been addressed: pg_upgrade accepts PGPORT settings.
Attached is a slightly updated patch that runs the test suite with a port of
65432, which you can override by setting PGPORT yourself.

Anyway, is this something that people want in the repository?  It's not
as polished as the pg_regress business, but it is definitely helpful.

diff --git i/contrib/pg_upgrade/Makefile w/contrib/pg_upgrade/Makefile
index 8f3fd7c..9ec7bc0 100644
--- i/contrib/pg_upgrade/Makefile
+++ w/contrib/pg_upgrade/Makefile
@@ -21,3 +21,6 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+check: test.sh
+	MAKE=$(MAKE) bindir=$(bindir) $(SHELL) $<
diff --git i/contrib/pg_upgrade/test.sh w/contrib/pg_upgrade/test.sh
index e69de29..c675f26 100644
--- i/contrib/pg_upgrade/test.sh
+++ w/contrib/pg_upgrade/test.sh
@@ -0,0 +1,41 @@
+set -eux
+
+: ${MAKE=make}
+: ${PGPORT=65432}
+export PGPORT
+
+temp_root=$PWD/tmp_check
+temp_install=$temp_root/install
+bindir=$temp_install/$bindir
+PATH=$bindir:$PATH
+export PATH
+
+PGDATA=$temp_root/data
+export PGDATA
+rm -rf "$PGDATA" "$PGDATA".old
+
+logdir=$PWD/log
+rm -rf "$logdir"
+mkdir "$logdir"
+
+$MAKE -C ../.. install DESTDIR="$temp_install" 2>&1 | tee "$logdir/install.log"
+$MAKE -C ../pg_upgrade_support install DESTDIR="$temp_install" 2>&1 | tee -a "$logdir/install.log"
+$MAKE -C . install DESTDIR="$temp_install" 2>&1 | tee -a "$logdir/install.log"
+
+initdb 2>&1 | tee "$logdir/initdb1.log"
+pg_ctl start -l "$logdir/postmaster1.log" -w
+$MAKE -C ../.. installcheck 2>&1 | tee "$logdir/installcheck.log"
+pg_dumpall >"$temp_root"/dump1.sql
+pg_ctl -m fast stop
+
+mv "${PGDATA}" "${PGDATA}.old"
+
+initdb 2>&1 | tee "$logdir/initdb2.log"
+
+pg_upgrade -d "${PGDATA}.old" -D "${PGDATA}" -b "$bindir" -B "$bindir"
+
+pg_ctl start -l "$logdir/postmaster2.log" -w
+pg_dumpall >"$temp_root"/dump2.sql
+pg_ctl -m fast stop
+
+diff -q "$temp_root"/dump1.sql "$temp_root"/dump2.sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Cédric Villemain
2011/5/25 Alvaro Herrera :
> Excerpts from Cédric Villemain's message of mié may 25 13:24:01 -0400 2011:
>
>> > Well, we only actually need to store one number, because you can figure
>> > out a much more precise number-of-pages figure with pg_relation_size()
>> > divided by configured page size.
>
>> I may miss something but we need relation size in costsize.c even if
>> we have a reldensity (or we need a reltuples). Else what values should
>> be used to estimate the relation size ? (pg_relation_size() goes down
>> to kernel/fs to ask the stat.st_size, is it really what we want?)
>
> Actually yes, the planner does go to kernel to determine the current
> relation size, and then multiplies by density as computed from catalog
> data to figure out current reasonably accurate number of tuples.

Okay! I just read that part. Interesting.
(If I'm reading the code correctly, we find our last segment and then
seek to the end of that segment to get the size.)

It makes more sense, suddenly :)
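(Roughly, the arithmetic being described, as a sketch; the real planner code
has guards for zero or placeholder values that are omitted here:)

/* Sketch: current relation size from the kernel (via smgr), scaled by
 * the tuple density recorded in pg_class. */
BlockNumber curpages = RelationGetNumberOfBlocks(rel);
double      density  = rel->rd_rel->reltuples / (double) rel->rd_rel->relpages;
double      ntuples  = density * curpages;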

>
> --
> Álvaro Herrera 
> The PostgreSQL Company - Command Prompt, Inc.
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>



-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:41 AM, Simon Riggs  wrote:
> OK, thats clear. Thanks.
>
> That formatting sounds quite complex.
>
> I would propose we split this into 2 WAL records: xl_xact_commit and
> xl_xact_commit_with_info
>
> xl_xact_commit doesn't have any flags, counts or arrays.
>
> xl_xact_commit_with_info always has all 3 counts, even if zero.
> Arrays follow the main record
>
> I think it might also be possible to remove dbId and tsId from
> xl_xact_commit if we use that definition.

That's an interesting idea.  I hadn't noticed that if nmsgs=0 then
those fields aren't needed either.  Also, both nrels>0 and nmsgs>0
will probably only happen if some DDL has been done, which most
transactions probably don't do, so it would be logical to have one record
type that excludes nrels, nmsgs, dbId, tsId, and another record type
that includes all of those.  In fact, we could probably throw in xinfo
as well: if there's no DDL involved, we shouldn't need to set
XACT_COMPLETION_UPDATE_RELCACHE_FILE or
XACT_COMPLETION_FORCE_SYNC_COMMIT either.

But I'm not so sure about nsubxacts.  It seems to me that we might
lose a lot of the benefit of having two record types if we have to use
the larger one whenever nsubxacts>0.  I'm not exactly sure how common
subtransactions are, but if I had to guess I'd speculate that they are
far more frequent than DDL operations.  Maybe we should think about
making the xl_xact_commit record contain just xact_time, nsubxacts,
and a subxact array; and then have the xl_xact_commit_with_info (or
maybe xl_xact_commit_full, or whatever we decide to call it) contain
everything else.   That would shave off 20 bytes from the commit
records of any non-DDL operation, which would be pretty nice.
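(A hedged sketch of the compact record being proposed in the preceding
paragraph; the struct name and exact layout are assumptions, not committed
code:)

typedef struct xl_xact_commit_compact
{
    TimestampTz xact_time;      /* time of commit */
    int         nsubxacts;      /* number of subtransaction XIDs */
    /* array of committed subtransaction XIDs follows */
} xl_xact_commit_compact;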

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Simon Riggs
On Wed, May 25, 2011 at 3:41 PM, Simon Riggs  wrote:
> On Wed, May 25, 2011 at 2:43 PM, Leonardo Francalanci  
> wrote:
>>> From: Simon Riggs
>>> I can't find a clear discussion of what you are trying to do, and how,
>>> just a URL back to a complex discussion on another topic.
>>
>>
>> While trying to write a patch to allow changing an unlogged table into
>> a logged one, I had to add another int field to xl_xact_commit.
>> Robert Haas said:
>>
>>  "I have to admit I don't like this approach very much.  I can't see
>> adding 4 bytes to every commit record for this feature."
>>
>>
>> which is a correct remark.
>>
>> xl_xact_commit can contain some arrays (relation to drops,
>> committed sub-trans, shared invalidation msgs). The length of
>> these arrays is specified using 3 ints in the struct.
>>
>> So, to avoid adding more ints to the struct, it was suggested that I
>> remove all the ints and use xl_xact_commit.xinfo to flag which
>> arrays are, in fact, present.
>>
>> So the whole idea is:
>>
>> - remove nrels, nsubxacts and nmsgs from xl_xact_commit
>> - use bits in xinfo to signal which arrays are present at the end
>> of   xl_xact_commit
>> - for each present array, add the length of the array (as int) at
>> the end of    xl_xact_commit
>> - add each present array after all the lengths
>
>
> OK, thats clear. Thanks.
>
> That formatting sounds quite complex.
>
> I would propose we split this into 2 WAL records: xl_xact_commit and
> xl_xact_commit_with_info
>
> xl_xact_commit doesn't have any flags, counts or arrays.
>
> xl_xact_commit_with_info always has all 3 counts, even if zero.
> Arrays follow the main record
>
> I think it might also be possible to remove dbId and tsId from
> xl_xact_commit if we use that definition.

Yes, that's correct. We can remove them from the normal commit record
when nmsgs == 0.

I think we can get away with these 2 definitions, but pls check. Using
static definitions works better for me because we can see what they
contain, rather than having the info flags imply that the record can
contain any permutation of settings when that's not really possible.

  {
        TimestampTz xact_time;      /* time of commit */
        int         nsubxacts;      /* number of subtransaction XIDs */
        /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
  } xl_xact_commit;

  {
        TimestampTz xact_time;      /* time of commit */
        uint32      xinfo;          /* info flags */
        int         nrels;          /* number of RelFileNodes */
        int         nsubxacts;      /* number of subtransaction XIDs */
        int         nmsgs;          /* number of shared inval msgs */
        Oid         dbId;           /* MyDatabaseId */
        Oid         tsId;           /* MyDatabaseTableSpace */
        /* Array of RelFileNode(s) to drop at commit */
        RelFileNode xnodes[1];      /* VARIABLE LENGTH ARRAY */
        /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
        /* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */
  } xl_xact_commit_with_info;

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 2:13 PM, Bruce Momjian  wrote:
> Alvaro Herrera wrote:
>> Excerpts from Robert Haas's message of mié may 25 13:33:41 -0400 2011:
>> > On Wed, May 25, 2011 at 1:22 PM, Bruce Momjian  wrote:
>>
>> > > I can easily remove dashes before the compare if people like that idea
>> > > --- I think you could argue that a dash is not significant, unless "ab-c"
>> > > and "a-bc" are different locales.
>> >
>> > I think the more we mush that string around, the more chance we have
>> > of breaking something.  What's wrong with insisting that people set
>> > the value to the same thing?  Like, really the same?
>>
>> No objection here to that idea.
>
> I thought the problem was that they upgraded the OS and now the encoding
> names changed, though they behaved the same.  Is that not what is
> happening?  Can they supply the values with different cases?

Oh, hmm.  I don't know.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The way to know whether the standby has caught up with the master

2011-05-25 Thread David Fetter
On Wed, May 25, 2011 at 12:34:59PM +0100, Simon Riggs wrote:
> On Wed, May 25, 2011 at 6:16 AM, Heikki Linnakangas
>  wrote:
> 
> >> To achieve that, I'm thinking to change walsender so that, when the
> >> standby
> >> has caught up with the master, it sends back the message indicating that
> >> to
> >> the standby. And I'm thinking to add new function (or view like
> >> pg_stat_replication)
> >> available on the standby, which shows that info.
> >
> > By the time the standby has received that message, it might not be caught-up
> > anymore because new WAL might've been generated in the master already.
> 
> AFAICS, this is an exact case of the Byzantine General's problem.

Have they updated it to acknowledge that the city is now called
Istanbul?

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Vaibhav Kaushal
I think the 'where' command does the same. And the command showed something
which looked like it was part of the evaluation... it got me confused. Anyway,
thanks, Robert. I will check that too. I did not know the 'bt' command.
--
Sent from my Android
On 25 May 2011 23:02, "Robert Haas"  wrote:


[HACKERS] Minor issues with docs

2011-05-25 Thread Marco Nenciarini
While I was working on automatic translation of PostgreSQL's
documentation from SGML to XML, I found some minor issues.

In the file doc/src/sgml/ecpg.sgml there are many lines of C code
containing unescaped '<' characters.

In the file doc/src/sgml/array.sgml there is a tag which has a case
mismatch error with its end tag.

Attached you can find the patch, if you want to apply it.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it
From d22539bdb7cabcb6bfbf0ce1b80a59fbba283ca4 Mon Sep 17 00:00:00 2001
From: Marco Nenciarini 
Date: Wed, 25 May 2011 19:35:36 +0200
Subject: [PATCH] Fix minor issues with documentation markup


Signed-off-by: Marco Nenciarini 
---
 doc/src/sgml/array.sgml |2 +-
 doc/src/sgml/ecpg.sgml  |   20 ++--
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/array.sgml b/doc/src/sgml/array.sgml
index bb318c5..3508ba3 100644
--- a/doc/src/sgml/array.sgml
+++ b/doc/src/sgml/array.sgml
@@ -369,7 +369,7 @@ UPDATE sal_emp SET pay_by_quarter = ARRAY[25000,25000,27000,27000]
 
 UPDATE sal_emp SET pay_by_quarter[4] = 15000
 WHERE name = 'Bill';
-
+
 
   or updated in a slice:
 
diff --git a/doc/src/sgml/ecpg.sgml b/doc/src/sgml/ecpg.sgml
index 9c6ca4c..a8ffde5 100644
--- a/doc/src/sgml/ecpg.sgml
+++ b/doc/src/sgml/ecpg.sgml
@@ -4154,7 +4154,7 @@ switch (v.sqltype)
 {
 case ECPGt_char:
 memset(&var_buf, 0, sizeof(var_buf));
-memcpy(&var_buf, sqldata, (sizeof(var_buf) <= sqllen ? sizeof(var_buf) - 1 : sqllen));
+memcpy(&var_buf, sqldata, (sizeof(var_buf) &lt;= sqllen ? sizeof(var_buf) - 1 : sqllen));
 break;
 
 case ECPGt_int: /* integer */
@@ -4390,7 +4390,7 @@ main(void)
 
 case ECPGt_char:
 memset(&var_buf, 0, sizeof(var_buf));
-memcpy(&var_buf, sqldata, (sizeof(var_buf) <= sqllen ? sizeof(var_buf)-1 : sqllen));
+memcpy(&var_buf, sqldata, (sizeof(var_buf) &lt;= sqllen ? sizeof(var_buf)-1 : sqllen));
 break;
 
 case ECPGt_int: /* integer */
@@ -5871,39 +5871,39 @@ main(void)
 
 /* create */
 loid = lo_create(conn, 0);
-if (loid < 0)
+if (loid &lt; 0)
 printf("lo_create() failed: %s", PQerrorMessage(conn));
 
 printf("loid = %d\n", loid);
 
 /* write test */
 fd = lo_open(conn, loid, INV_READ|INV_WRITE);
-if (fd < 0)
+if (fd &lt; 0)
 printf("lo_open() failed: %s", PQerrorMessage(conn));
 
 printf("fd = %d\n", fd);
 
 rc = lo_write(conn, fd, buf, buflen);
-if (rc < 0)
+if (rc &lt; 0)
 printf("lo_write() failed\n");
 
 rc = lo_close(conn, fd);
-if (rc < 0)
+if (rc &lt; 0)
 printf("lo_close() failed: %s", PQerrorMessage(conn));
 
 /* read test */
 fd = lo_open(conn, loid, INV_READ);
-if (fd < 0)
+if (fd &lt; 0)
 printf("lo_open() failed: %s", PQerrorMessage(conn));
 
 printf("fd = %d\n", fd);
 
 rc = lo_read(conn, fd, buf2, buflen);
-if (rc < 0)
+if (rc &lt; 0)
 printf("lo_read() failed\n");
 
 rc = lo_close(conn, fd);
-if (rc < 0)
+if (rc &lt; 0)
 printf("lo_close() failed: %s", PQerrorMessage(conn));
 
 /* check */
@@ -5912,7 +5912,7 @@ main(void)
 
 /* cleanup */
 rc = lo_unlink(conn, loid);
-if (rc < 0)
+if (rc &lt; 0)
 printf("lo_unlink() failed: %s", PQerrorMessage(conn));
 
 EXEC SQL COMMIT;
-- 
1.7.5.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Bruce Momjian
Alvaro Herrera wrote:
> Excerpts from Robert Haas's message of mié may 25 13:33:41 -0400 2011:
> > On Wed, May 25, 2011 at 1:22 PM, Bruce Momjian  wrote:
> 
> > > I can easily remove dashes before the compare if people like that idea
> > > --- I think you could argue that a dash is not significant, unless "ab-c"
> > > and "a-bc" are different locales.
> > 
> > I think the more we mush that string around, the more chance we have
> > of breaking something.  What's wrong with insisting that people set
> > the value to the same thing?  Like, really the same?
> 
> No objection here to that idea.

I thought the problem was that they upgraded the OS and now the encoding
names changed, though they behaved the same.  Is that not what is
happening?  Can they supply the values with different cases?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 1:43 PM, Pavan Deolasee
 wrote:
> I think the point is you can not *always* put it just after the line
> pointer array without possibly shuffling the tuples. Remember we need
> to put the LSN when the dead line pointer is generated because we
> decided to prune away the dead tuple. Say, for example, the page is
> completely full and there are no dead line pointers and hence no LSN
> on the page. Also there is no free space after the line pointer array.
> Now say we prune dead tuples and generate dead line pointers, but the
> last line pointer in the array is still in-use and the first tuple
> immediately after the line pointer array is live. Since you generated
> dead line pointers you want to store the LSN on the page. Now, there
> is no way you can store it after the line pointer array without moving
> the live tuple somewhere else.

So far I agree.  But don't we always defragment immediately after
pruning dead tuples to line pointers?  The removal of even one tuple
will give us more than enough space to store the LSN.

> Let me summarize the sequence of operations and let me know if you
> still disagree with the general principle:
>
> 1. There are no dead line pointers in the page - we are good.
> 2. Few tuples become dead, HOT pruning is invoked either during normal
> operation or heap vacuum. The dead tuples are pruned away and
> truncated to dead line pointers. We already hold cleanup lock on the
> buffer. We set the flag in the page header and store the LSN (either
> at the end of line pointer array or at the end of the page)
> 3. Someday index vacuum is run and it removes the index pointers to
> the dead line pointers. We remember the start LSN of the index vacuum
> somewhere, maybe as a pg_class attribute (how does index vacuum get
> the list of dead line pointers is not material in the general scheme
> of things)
> 4. When the page is again chosen for pruning, we check if the flag is
> set in the header. If so, get the LSN stored in the page, check it
> against the last successful index vacuum LSN and if it precedes the
> index vacuum LSN, we turn the LP_DEAD line pointers to LP_UNUSED. The
> special LSN can be removed unless new LP_DEAD line pointers get
> generated during the pruning, otherwise its overwritten with the
> current LSN. Since we hold the buffer cleanup lock, the special LSN
> storage can be reclaimed by shuffling things around.

Agreed.  The only thing I'm trying to do further is to avoid the need
for a reshuffle when the special LSN storage is reclaimed.  For
example, consider:

1. There are three tuples on the page.  We are good.
2. Tuple #2 becomes dead.  The tuple is pruned to a line pointer.  The
page is defragmented.  At this point, it doesn't matter WHERE we put
the LSN - we are rearranging the whole page anyway.
3. Index vacuum is run.
4. Now we want to make the dead line pointer unused, and reclaim the
LSN storage.  If the LSN is stored at the end of the page, then we now
have to move all of the tuple data forward by 8 bytes.  But if it's
stored adjacent to the hole in the middle of the page, we need only
clear the page-header bits saying it's there (and maybe adjust
pd_lower).
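
To make that concrete, here is a toy standalone sketch of the bookkeeping.
This is not PostgreSQL's page layout code; ToyPage, PD_HAS_PRUNE_LSN and the
function names are invented purely for illustration.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TOY_PAGE_SIZE     8192
#define PD_HAS_PRUNE_LSN  0x0008     /* hypothetical pd_flags bit */

typedef struct ToyPage
{
    uint16_t pd_flags;
    uint16_t pd_lower;               /* end of the line pointer array */
    uint16_t pd_upper;               /* start of tuple data */
    char     data[TOY_PAGE_SIZE];
} ToyPage;

/* Store the LSN just after the line pointer array, inside the hole. */
static int
store_prune_lsn(ToyPage *page, uint64_t lsn)
{
    if (page->pd_upper - page->pd_lower < (int) sizeof(lsn))
        return 0;                    /* no room: defragment first, or skip */
    memcpy(page->data + page->pd_lower, &lsn, sizeof(lsn));
    page->pd_lower += sizeof(lsn);
    page->pd_flags |= PD_HAS_PRUNE_LSN;
    return 1;
}

/* Reclaiming the space just gives the bytes back to the hole: clear the
 * flag and (maybe) pull pd_lower back; no line pointers or tuples move. */
static void
reclaim_prune_lsn(ToyPage *page)
{
    if (page->pd_flags & PD_HAS_PRUNE_LSN)
    {
        page->pd_lower -= sizeof(uint64_t);
        page->pd_flags &= ~PD_HAS_PRUNE_LSN;
    }
}

int
main(void)
{
    ToyPage page = { 0, 128, 4096, {0} };

    store_prune_lsn(&page, 12345);
    printf("pd_lower after store: %d\n", page.pd_lower);
    reclaim_prune_lsn(&page);
    printf("pd_lower after reclaim: %d\n", page.pd_lower);
    return 0;
}

Contrast that with an LSN stored at the end of the page, where reclaiming it
means moving all of the tuple data forward.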

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] adding a new column in IDENTIFY_SYSTEM

2011-05-25 Thread Magnus Hagander
On Tue, May 24, 2011 at 22:31, Jaime Casanova  wrote:
> On Tue, May 24, 2011 at 8:52 PM, Fujii Masao  wrote:
>
>> +       primary_xlp_magic = atoi(PQgetvalue(res, 0, 2));
>>
>> You wrongly get the third field (i.e., current xlog location) as the
>> WAL version.
>> You should call PQgetvalue(res, 0, 3), instead.
>>
>>> errdetail("Expected 1 tuple with 3 fields, got %d tuples with %d fields.",
>>
>> You need to change the above message.
>>
>
> Fixed.
>
> About you comments on the check... if you read the thread, you will
> find that the whole reason for the field is future improvement, but
> everyone wanted some use of the field now... so I made a patch to use
> it in pg_basebackup before the transfer starts and avoid time and
> bandwidth waste, but Magnus prefers this in walreceiver...

The idea *at this point* was, I believe, to be able to provide a more
useful error message in the case of walreceiver.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Pavan Deolasee
On Wed, May 25, 2011 at 10:36 PM, Robert Haas  wrote:

>
>> I don't see how you'd store
>> anything in the hole without it being in a fixed place, where it would
>> eventually be hit by either the line pointer array or tuple data.
>
> The point is that it doesn't matter.  Suppose we put it just after the
>  line pointer array.

I think the point is you can not *always* put it just after the line
pointer array without possibly shuffling the tuples. Remember we need
to put the LSN when the dead line pointer is generated because we
decided to prune away the dead tuple. Say, for example, the page is
completely full and there are no dead line pointers and hence no LSN
on the page. Also there is no free space after the line pointer array.
Now say we prune dead tuples and generate dead line pointers, but the
last line pointer in the array is still in-use and the first tuple
immediately after the line pointer array is live. Since you generated
dead line pointers you want to store the LSN on the page. Now, there
is no way you can store it after the line pointer array without moving
the live tuple somewhere else.

That's the point Alvaro and I are making. Do you agree with that logic?
Now that does not matter, because we would always generate dead line
pointers while holding a buffer cleanup lock and hence we are free to
shuffle tuples around. Maybe we are digressing on a trivial detail
here, but I hope I got it correct.

> Any time we're thinking about extending the line
> pointer array, we already have an exclusive lock on the buffer.  And
> if we already have a exclusive lock on the buffer, then we can reclaim
> the dead line pointers and now we no longer need the saved LSN, so
> writing over it is perfectly fine.
>

The trouble is you may not be able to shrink the line pointer array.
But of course, you can reuse the reclaimed dead line pointers. I would
still advocate doing that during the pruning operation because we want
to emit WAL records for the operation.

> OK, I lied: if we have an exclusive buffer lock, but the last vacuum
> either failed, or is still in progress, then the LSN might not be old
> enough for us to reclaim the dead line pointers yet.  So ideally we'd
> like to hold onto it.  We can do that by either (a) moving the LSN out
> another 6 bytes, if there's enough room; or (b) deciding not to put
> the new tuple on this page, after all.  There's no situation in which
> we absolutely HAVE to get another tuple onto this particular page.  We
> can just decide that the effective size of a page that contains dead
> line pointers is effectively 8 bytes less.  The alternative is to eat
> up 8 bytes of space on ALL pages, whether they contain dead line
> pointers or not.
>

I think we are on the same page as far as storing the LSN if and only if
it's required. But what was not convincing is the argument that you can
*always* find free space for the LSN without moving things around.

Let me summarize the sequence of operations and let me know if you
still disagree with the general principle:

1. There are no dead line pointers in the page - we are good.
2. Few tuples become dead, HOT pruning is invoked either during normal
operation or heap vacuum. The dead tuples are pruned away and
truncated to dead line pointers. We already hold cleanup lock on the
buffer. We set the flag in the page header and store the LSN (either
at the end of line pointer array or at the end of the page)
3. Someday index vacuum is run and it removes the index pointers to
the dead line pointers. We remember the start LSN of the index vacuum
somewhere, maybe as a pg_class attribute (how does index vacuum get
the list of dead line pointers is not material in the general scheme
of things)
4. When the page is again chosen for pruning, we check if the flag is
set in the header. If so, get the LSN stored in the page, check it
against the last successful index vacuum LSN and if it precedes the
index vacuum LSN, we turn the LP_DEAD line pointers to LP_UNUSED. The
special LSN can be removed unless new LP_DEAD line pointers get
generated during the pruning, otherwise its overwritten with the
current LSN. Since we hold the buffer cleanup lock, the special LSN
storage can be reclaimed by shuffling things around.
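
To make step 4 concrete, here is a rough standalone sketch of just that
decision. Everything is a toy stand-in (types, names, no locking, no
WAL-logging); it only illustrates the LSN comparison that lets LP_DEAD line
pointers be recycled.

#include <stdint.h>
#include <stdio.h>

enum { LP_UNUSED, LP_NORMAL, LP_DEAD };

typedef struct ToyItem { int lp_flags; } ToyItem;

static void
prune_page(ToyItem *items, int nitems,
           int page_has_lsn, uint64_t page_lsn,
           uint64_t last_index_vacuum_lsn)
{
    if (page_has_lsn && page_lsn < last_index_vacuum_lsn)
    {
        /* Index pointers to these slots were already removed by the
         * index vacuum that started after the LSN was stored, so the
         * dead line pointers can be recycled. */
        for (int i = 0; i < nitems; i++)
            if (items[i].lp_flags == LP_DEAD)
                items[i].lp_flags = LP_UNUSED;
    }
    /* ...normal pruning continues; if it creates new LP_DEAD pointers
     * the page LSN is stored again, otherwise it can be dropped... */
}

int
main(void)
{
    ToyItem items[3] = {{LP_NORMAL}, {LP_DEAD}, {LP_DEAD}};

    /* page LSN 100 precedes the last index vacuum LSN 200 */
    prune_page(items, 3, 1, 100, 200);
    printf("%d %d %d\n", items[0].lp_flags, items[1].lp_flags, items[2].lp_flags);
    return 0;
}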

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 1:37 PM, Hitoshi Harada  wrote:
>> How do you decide whether or not to push down?
>
> Yeah, that's the problem. In addition to the conditions of join-qual
> == grouping key && outer is unique on qual, we need some criteria for
> whether it should be done. At first I started to think I could compare
> the cost of two different plan nodes, generated by calling
> subquery_planner() twice. But now my plan is to apply some heuristic
> like "join qual selectivity is less than 10%" or so. I don't like
> magic numbers either, but given Query restructuring instead of
> PlannerInfo (which means we cannot use Path) it is the only way left.
> To get it working is my first goal anyway.

I think getting it working is probably a good first goal.  I am not
really sure that we want to commit it that way, and I think my vote
would be for you to work on the approach we discussed before rather
than this one, but it's your project, and I think you'll probably
learn enough in getting it working that it will be a step forward in
any case.  The planner is complex enough that it's worth trying to get
something that works, first, before trying to make it perfect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] about EDITOR_LINENUMBER_SWITCH

2011-05-25 Thread Alvaro Herrera
Excerpts from Tom Lane's message of mar may 24 17:11:17 -0400 2011:

> Right.  It would also increase the cognitive load on the user to have
> to remember the command-line go-to-line-number switch for his editor.
> So I don't particularly want to redesign this feature.  However, I can
> see the possible value of letting EDITOR_LINENUMBER_SWITCH be set from
> the same place that you set EDITOR, which would suggest that we allow
> the value to come from an environment variable.  I'm not sure whether
> there is merit in allowing both that source and ~/.psqlrc, though
> possibly for Windows users it might be easier if ~/.psqlrc worked.

If we're going to increase the number of options in .psqlrc that do not
work with older psql versions, can I please have .psqlrc-MAJORVERSION or
some such?  Having 8.3's psql complain all the time because it doesn't
understand "linestyle" is annoying.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Alvaro Herrera
Excerpts from Robert Haas's message of mié may 25 13:33:41 -0400 2011:
> On Wed, May 25, 2011 at 1:22 PM, Bruce Momjian  wrote:

> > I can easily remove dashes before the compare if people like that idea
> > --- I think you could argue that a dash is not significant, unless "ab-c"
> > and "a-bc" are different locales.
> 
> I think the more we mush that string around, the more chance we have
> of breaking something.  What's wrong with insisting that people set
> the value to the same thing?  Like, really the same?

No objection here to that idea.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Ross J. Reedstrom
On Mon, May 23, 2011 at 11:08:40PM -0400, Robert Haas wrote:
> 
> I don't really like the idea of adding a GUC for this, unless we
> convince ourselves that nothing else is sensible.  I mean, that leads
> to conversations like this:
> 
> Newbie: My query is slow.
> Hacker: Turn on enable_magic_pixie_dust and it'll get better.
> Newbie: Oh, yeah, that is better.  Why isn't this turned on by default, 
> anyway?
> Hacker: Well, on pathological queries, it makes planning take way too
> long, so we think it's not really safe to enable it by default.
> Newbie: Wait... I thought you just told me to enable it.  It's not safe?
> Hacker: Well, sort of.  I mean, uh... hey, look, an elephant!
> 

ROTFL! This needs to go on the wiki somewhere, in a discussion of why HACKERS
rejects tunable knobs as often as it does.

Ross
-- 
Ross Reedstrom, Ph.D. reeds...@rice.edu
Systems Engineer & Admin, Research Scientist    phone: 713-348-6166
Connexions  http://cnx.org                      fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E  F888 D3AE 810E 88F0 BEDE

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Alvaro Herrera
Excerpts from Cédric Villemain's message of mié may 25 13:24:01 -0400 2011:

> > Well, we only actually need to store one number, because you can figure
> > out a much more precise number-of-pages figure with pg_relation_size()
> > divided by configured page size.

> I may be missing something, but we need the relation size in costsize.c even
> if we have a reldensity (or we need reltuples). Else what values should
> be used to estimate the relation size? (pg_relation_size() goes down
> to the kernel/fs to ask for stat.st_size; is that really what we want?)

Actually yes, the planner does go to kernel to determine the current
relation size, and then multiplies by density as computed from catalog
data to figure out current reasonably accurate number of tuples.
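
For illustration, that estimate boils down to something like the following
(a standalone sketch with made-up names, not the planner's actual code):

#include <stdio.h>

static double
estimated_ntuples(double file_bytes, double blcksz,
                  double reltuples, double relpages)
{
    double cur_pages = file_bytes / blcksz;      /* asked of the kernel */
    double density   = reltuples / relpages;     /* from the catalog */

    return cur_pages * density;
}

int
main(void)
{
    /* table grew from 100 to 120 pages since the stats were last updated */
    printf("%.0f\n", estimated_ntuples(120 * 8192.0, 8192.0, 10000.0, 100.0));
    return 0;
}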

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Hitoshi Harada
2011/5/26 Robert Haas :
> On Wed, May 25, 2011 at 10:35 AM, Hitoshi Harada  wrote:
>> 2011/5/25 Hitoshi Harada :
>>> So I'm still
>>> thinking which of pulling up and parameterized scan is better.
>>
>> After more investigation I came up with third idea, pushing down
>> RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
>> but it seems to me like it is slightly easier than pulling up
>> agg-subquery. The main point is that when you want to pull up, you
>> must care if the final output would be correct with other join
>> relations than the aggregate-related one. In contrast, when pushing
>> down the target relation to agg-subquery it is simple to ensure the
>> result.
>>
>> I'm looking around pull_up_subqueries() in subquery_planner() to add
>> the pushing down logic. It could be possible to do it around
>> make_one_rel() but I bet query structure changes are doable against
>> Query, not PlannerInfo.
>
> How do you decide whether or not to push down?

Yeah, that's the problem. In addition to the conditions of join-qual
== grouping key && outer is unique on qual, we need some criteria for
whether it should be done. At first I started to think I could compare
the cost of two different plan nodes, generated by calling
subquery_planner() twice. But now my plan is to apply some heuristic
like "join qual selectivity is less than 10%" or so. I don't like
magic numbers either, but given Query restructuring instead of
PlannerInfo (which means we cannot use Path) it is the only way left.
To get it working is my first goal anyway.

Regards,

-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 1:22 PM, Bruce Momjian  wrote:
> Alvaro Herrera wrote:
>> Excerpts from Bruce Momjian's message of mar may 24 15:59:59 -0400 2011:
>>
>> > > I think you misread what I wrote, or I misexplained it, but never
>> > > mind.  Matching locale names case-insensitively sounds reasonable to
>> > > me, unless someone has reason to believe it will blow up.
>> >
>> > OK, that's what I needed to hear.  I have applied the attached patch,
>> > but only to 9.1 because  of the risk of breakage. (This was only the
>> > first bug report of this, and we aren't 100% certain about the case
>> > issue.)
>>
>> Hmm, I know the locale can be spelled "en_US.UTF-8" (note dash) in some
>> systems.  So if you have a locale alias that makes en_US.utf8 the same
>> as en_US.UTF-8, the patched code would still not work.  I wonder if
>> there's a need for canonicalization somewhere.  Or maybe this is just
>> not worth the trouble.
>
> I can easily remove dashes before the compare if people like that idea
> --- I think you could argue that a dash is not significant, unless "ab-c"
> and "a-bc" are different locales.

I think the more we mush that string around, the more chance we have
of breaking something.  What's wrong with insisting that people set
the value to the same thing?  Like, really the same?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 12:34 PM, Vaibhav Kaushal
 wrote:
> But somehow ExecEvalVar is being called. When I changed the logic of its
> working to use slot_getattr instead of cute_datum_array for the first run /
> call, everything just worked!

This is what gdb is for... set a breakpoint on that function using "b
ExexEvalVar", and when it hits the breakpoint, use "bt" to see where
the call came from.
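
Roughly, something like this (assuming a build with debug symbols and that
you attach to the backend that runs the query):

$ gdb /path/to/postgres -p <backend pid>
(gdb) break ExecEvalVar
(gdb) continue
  ... run the query from psql; when the breakpoint fires ...
(gdb) backtrace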

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Cédric Villemain
2011/5/25 Alvaro Herrera :
> Excerpts from Kevin Grittner's message of mié may 25 12:37:24 -0400 2011:
>> Tom Lane  wrote:
>>
>> > I don't know what client-side code might be looking at
>> > relpages/reltuples.
>>
>> I know that I find reltuples useful for getting an "accurate enough"
>> sense of rows in a table (or set of tables) without resorting to
>> count(*).  I'd be OK with any two of pages, tuples, and density; but
>> dropping to one would make databases harder to manage, IMV.
>>
>> Personally, I don't have much code that uses those columns;
>> eliminating an existing column wouldn't involve much pain for me as
>> long as it could still be derived.
>
> Well, we only actually need to store one number, because you can figure
> out a much more precise number-of-pages figure with pg_relation_size()
> divided by configured page size.
>
> (We feel free to hack around catalogs in other places, and we regularly
> break pgAdmin and the like when we drop columns -- people just live with
> it and update their tools.  I don't think it's such a big deal in this
> particular case.)

I may be missing something, but we need the relation size in costsize.c even
if we have a reldensity (or we need reltuples). Else what values should
be used to estimate the relation size? (pg_relation_size() goes down
to the kernel/fs to ask for stat.st_size; is that really what we want?)


>
> --
> Álvaro Herrera 
> The PostgreSQL Company - Command Prompt, Inc.
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 1:15 PM, Robert Haas  wrote:
> I agree that if VACUUM scanned 99% of the table, it's probably fine to
> use its numbers.  It's also fine to use the numbers from ANALYZE,
> because those pages are chosen randomly.  What bothers me is the idea
> of using a small *non-random* sample, and I'm not sure that
> incorporating possibly-bogus results slowly is any better than
> incorporating them quickly.

In particular, unless I'm misremembering, VACUUM *always* scans the
first few pages of the table, until it sees enough consecutive
all-visible bits that it decides to start skipping.  If I'm right
about that, then those pages could easily end up being overweighted
when VACUUM does the counting; especially if ANALYZE or an actual
full-table vacuum aren't allowed to snap the count back to reality.
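
Roughly, the skipping behavior I'm describing looks like this (a simplified
standalone sketch, not the actual lazy_scan_heap() logic; the threshold
value is purely illustrative):

#include <stdio.h>

#define SKIP_THRESHOLD 32                /* illustrative only */

static void
scan_heap(const int *all_visible, int nblocks)
{
    int run = 0;                         /* consecutive all-visible pages */

    for (int blk = 0; blk < nblocks; blk++)
    {
        run = all_visible[blk] ? run + 1 : 0;

        /* Skipping starts only after a long enough all-visible run, so
         * the first pages of the table always get scanned (and counted). */
        if (run > SKIP_THRESHOLD)
            continue;

        printf("scanning block %d\n", blk);
    }
}

int
main(void)
{
    int vm[100];

    for (int i = 0; i < 100; i++)
        vm[i] = 1;                       /* pretend every page is all-visible */
    scan_heap(vm, 100);                  /* still scans blocks 0..31 */
    return 0;
}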

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 1:06 PM, Greg Smith  wrote:
> On 05/24/2011 04:34 PM, Robert Haas wrote:
>>
>> we could name this feature "partial full page writes" and enable it
>> either with a setting of full_page_writes=partial
>
> +1 to overloading the initial name, but only if the parameter is named
> 'maybe', 'sometimes', or 'perhaps'.

Perfect!

> I've been looking into a similar refactoring of the names here, where we
> bundle all of these speed over safety things (fsync, full_page_writes, etc.)
> into one control so they're easier to turn off at once.  Not sure if it
> should be named "web_scale" or "do_you_feel_lucky_punk".

Actually, I suggested that same idea to you, or someone, a while back,
only I was serious.  crash_safety=off.  I never got around to fleshing
out the details, though.

>> 3. Going a bit further, Greg proposed the idea of ripping out our
>> current WAL infrastructure altogether and instead just having one WAL
>> record that says "these byte ranges on this page changed to have these
>> new contents".
>
> The main thing that makes this idea particularly interesting to me, over the
> other two, is that it might translate well into the idea of using
> sync_file_range to aim for a finer fsync call on Linux than is currently
> possible.

Hmm, maybe.  But it's possible that the dirty blocks are the first and
last ones in the file.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nested CASE-WHEN scoping

2011-05-25 Thread Heikki Linnakangas

On 25.05.2011 20:11, Tom Lane wrote:

Heikki Linnakangas  writes:

On 25.05.2011 17:47, Tom Lane wrote:

[ scratches head ... ]  Why does the save/restore in ExecEvalCase not
take care of this?



The mistake happens during planning, when the SQL function is inlined
and pre-evaluated.


Hm.  I'm inclined to think this may be more of a bug in the inlining
process than anything else.


Well, if you want to get away without the capability to reference 
multiple CaseTestExprs at a time, you'll have to detect the danger and 
abort the inlining process. That's a bit pessimal, although this is a 
pretty artificial case in the first place so maybe we don't care much.


(I'm still going to need more placeholder slots to handle IN and 
BETWEEN. Of course, I can just copy-paste CaseTestExpr into something 
like InTestExpr and BetweenTestExpr, but it seems like it would be good 
to unite all that infrastructure)


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Vaibhav Kaushal
But somehow ExecEvalVar is being called. When I changed the logic of its
working to use slot_getattr instead of cute_datum_array for the first run /
call, everything just worked!

This surely indicates that the function does get called at least once
before being called by the executor for the qual check. This is what got me
confused - where does the function get called from? The gdb trace says its
first call comes through ExecutePlan, but the results say something else.
Moreover, gdb still confuses me. Is there some GUI for gdb? :(

The only thing I am trying to make sure of is that ExecEvalVar gets the
values from cute_datum_array only when called during a scan, never before
it. Maybe there is another way. I think I have to rethink this.
--
Sent from my Android
On 25 May 2011 20:34, "Tom Lane"  wrote:
> Alvaro Herrera  writes:
>> Excerpts from Vaibhav Kaushal's message of mié may 25 05:52:32 -0400 2011:
>>> If the above is confusing, I just want to ask: "Is expression evaluator,
>>> even in part responsible for {PLANNEDSTMT creation?"
>
>> Yeah, as far as I understood Tom's talk, the expr evaluator is used to
>> reduce some expressions to constants and such.
>
> The planner would never call it with an expression containing a Var,
> though.
>
> regards, tom lane


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Bruce Momjian
Alvaro Herrera wrote:
> Excerpts from Bruce Momjian's message of mar may 24 15:59:59 -0400 2011:
> 
> > > I think you misread what I wrote, or I misexplained it, but never
> > > mind.  Matching locale names case-insensitively sounds reasonable to
> > > me, unless someone has reason to believe it will blow up.
> > 
> > OK, that's what I needed to hear.  I have applied the attached patch,
> > but only to 9.1 because  of the risk of breakage. (This was only the
> > first bug report of this, and we aren't 100% certain about the case
> > issue.)
> 
> Hmm, I know the locale can be spelled "en_US.UTF-8" (note dash) in some
> systems.  So if you have a locale alias that makes en_US.utf8 the same
> as en_US.UTF-8, the patched code would still not work.  I wonder if
> there's a need for canonicalization somewhere.  Or maybe this is just
> not worth the trouble.

I can easily remove dashes before the compare if people like that idea
--- I think you could argue that a dash is not significant, unless "ab-c"
and "a-bc" are different locales.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Greg Smith

On 05/24/2011 04:34 PM, Robert Haas wrote:

we could name this feature "partial full page writes" and enable it
either with a setting of full_page_writes=partial


+1 to overloading the initial name, but only if the parameter is named 
'maybe', 'sometimes', or 'perhaps'.


I've been looking into a similar refactoring of the names here, where we 
bundle all of these speed over safety things (fsync, full_page_writes, 
etc.) into one control so they're easier to turn off at once.  Not sure 
if it should be named "web_scale" or "do_you_feel_lucky_punk".



3. Going a bit further, Greg proposed the idea of ripping out our
current WAL infrastructure altogether and instead just having one WAL
record that says "these byte ranges on this page changed to have these
new contents".


The main thing that makes this idea particularly interesting to me, over 
the other two, is that it might translate well into the idea of using 
sync_file_range to aim for a finer fsync call on Linux than is currently 
possible.
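
For reference, a minimal Linux-only sketch of what such a finer-grained
flush could look like (illustrative usage only; the file name is
hypothetical):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int
main(void)
{
    int fd = open("relation_segment", O_RDWR);   /* hypothetical file */

    if (fd < 0)
        return 1;

    /* Start writeback of bytes 8192..16383 only, and wait for it,
     * instead of fsync()ing the whole file. */
    if (sync_file_range(fd, 8192, 8192,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        perror("sync_file_range");
    return 0;
}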


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 1:04 PM, Tom Lane  wrote:
> [ shrug... ]  When you don't have complete information, it's *always*
> the case that you will sometimes make a mistake.  That's not
> justification for paralysis, especially not when the existing code is
> demonstrably broken.
>
> What occurs to me after thinking a bit more is that the existing tuple
> density is likely to be only an estimate, too (one coming from the last
> ANALYZE, which could very well have scanned even less of the table than
> VACUUM did).  So what I now think is that both VACUUM and ANALYZE ought
> to use a calculation of the above form to arrive at a new value for
> pg_class.reltuples.  In both cases it would be pretty easy to track the
> number of pages we looked at while counting tuples, so the same raw
> information is available.
>
>> I am wondering, though, why we're not just inserting a special-purpose
>> hack for TOAST tables.
>
> Because the problem is not specific to TOAST tables.  As things
> currently stand, we will accept the word of an ANALYZE as gospel even if
> it scanned 1% of the table, and completely ignore the results from a
> VACUUM even if it scanned 99% of the table.  This is not sane.

I agree that if VACUUM scanned 99% of the table, it's probably fine to
use its numbers.  It's also fine to use the numbers from ANALYZE,
because those pages are chosen randomly.  What bothers me is the idea
of using a small *non-random* sample, and I'm not sure that
incorporating possibly-bogus results slowly is any better than
incorporating them quickly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Alvaro Herrera
Excerpts from Robert Haas's message of mié may 25 12:54:28 -0400 2011:

> I don't know.  That's maybe better, but I'd be willing to wager that
> in some cases it will just slow down the rate at which we converge to
> a completely incorrect value, while in other cases it'll fail to
> update the data when it really has changed.

For regular tables I don't think there's a real problem because it'll
get fixed next time a full scan happens anyway.  For toast tables, I
think the set of operations is limited enough that this is easy to prove
correct (or fixed so that it is) -- no HOT updates (in fact no updates
at all), etc.

BTW one thing we haven't fixed at all is how HOT updates affect
vacuuming behavior ...

> I am wondering, though, why we're not just inserting a special-purpose
> hack for TOAST tables.  Your email seems to indicate that regular
> tables are already handled well enough, and certainly if we only whack
> around the TOAST behavior it's much less likely to fry anything.

Well, having two paths one of which is uncommonly used means that it
will get all the bugs.  As with the xl_commit WAL record comment from
Simon, I'd rather stick with having a single one.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nested CASE-WHEN scoping

2011-05-25 Thread Tom Lane
Heikki Linnakangas  writes:
> On 25.05.2011 17:47, Tom Lane wrote:
>> [ scratches head ... ]  Why does the save/restore in ExecEvalCase not
>> take care of this?

> The mistake happens during planning, when the SQL function is inlined 
> and pre-evaluated.

Hm.  I'm inclined to think this may be more of a bug in the inlining
process than anything else.  I have to run off for a doctor's
appointment, but will look at this closer later today.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nested CASE-WHEN scoping

2011-05-25 Thread Heikki Linnakangas

On 25.05.2011 17:47, Tom Lane wrote:

Heikki Linnakangas  writes:

While looking at fixing the multiple-evaluation issue in IN and BETWEEN
discussed a while ago, I realized that the current assumption that only
one CaseTestExpr placeholder needs to be valid at any given time is not
true.


[ scratches head ... ]  Why does the save/restore in ExecEvalCase not
take care of this?


The mistake happens during planning, when the SQL function is inlined 
and pre-evaluated. It's a bit hard to see what happened once the 
planning is finished because the whole expression is folded into a 
constant, but here's goes:


The original expression is:

  CASE now() WHEN 29 THEN 'foo' ELSE 'bar' END;

In parse analysis, it is turned into this:

  CASE WHEN CaseTestExpr = 29 THEN 'foo' ELSE 'bar' END;

where CaseTestExpr stands for the now(). Next the planner tries to 
simplify the WHEN condition, "CaseTestExpr = 29". The equality operator 
is implemented by the evileq(timestamptz, int4) SQL function, defined as:


  CASE $2 WHEN length($1::text) THEN true ELSE false END;

That SQL-function is transformed at parse analysis into:

  CASE WHEN CaseTestExpr = length($1::text) THEN true ELSE false END;

This CaseTestExpr stands for the Param to the function, $2. When that 
transformed SQL function body is inlined into the outer WHEN clause, 
"CaseTestExpr = 29", and Params are substituted, it becomes:


 CASE WHEN CaseTestExpr = length(CaseTestExpr::text) THEN true ELSE false END.

(you can see the expression tree for that if you print out 'newexpr' in 
inline_function(), just after the call to substitute_actual_parameters())


At this point it's easy to see that we have screwed up. The first 
CaseTestExpr stands for the inner CASE-value, which is $2, which stands 
for 29, and the second CaseTestExpr stands for the *outer* CASE-value, 
which is supposed to be now(). The planner cannot distinguish between 
the two anymore.


Both CaseTestExprs are then incorrectly replaced with constant 29, and 
the whole expression is constant-folded into 'bar'.



So I'm going to put the BETWEEN/IN fix aside for now, and refactor the
placeholder infrastructure to handle several simultaneous placeholders,


That sounds like a mess, and I'm not even convinced it'll solve the
problem ...


Hmm, it would solve the above if we can keep the CaseTestExprs 
separate. It's not quite as straightforward as I originally thought, as 
the parse analysis of the inlined SQL function needs to use placeholder 
numbers that are different from those used in the outer context. But it 
seems doable.



BTW, i just stumbled into this:

postgres=# explain verbose SELECT CASE now() WHEN (29+random()::int4) 
THEN 'foo' ELSE 'bar' END;

ERROR:  unexpected CASE WHEN clause: 326

Looks like ruleutils.c is also not prepared for the case that the 
implicit equality operation gets inlined into something other than an OpExpr.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 12:48 PM, Alvaro Herrera
 wrote:
> Excerpts from Robert Haas's message of mié may 25 12:27:38 -0400 2011:
>> On Wed, May 25, 2011 at 10:07 AM, Pavan Deolasee
>>  wrote:
>
>> > I think if we are reclaiming LP_DEAD line pointers only while
>> > defragmenting the page, we can always reclaim the space for the LSN,
>> > irrespective of where we store it. So maybe we should decide
>> > depending on what would matter for on-disk compatibility and whatever
>> > requires the least invasive changes. I don't know what that is yet.
>>
>> If we store the LSN at the beginning (page header) or end (special
>> space) of the page, then we'll have to adjust the location of the data
>> which follows it or precedes it if and when we want to reclaim the
>> space it occupies.  But if we store it in the hole in the middle
>> somewhere (probably by reducing the size of the hole, not by actually
>> referencing the area between pd_lower and pd_upper) then no tuples or
>> item pointers have to move around when we want to reclaim the space.
>> That way, we need only an exclusive lock (not a cleanup lock), and we
>> don't need to memmove() anything.
>
> You can store anything in the "hole" in the data area (currently we only
> store tuple data), but then you'd have to store the offset to where you
> stored it in some fixed location, and make sure you adjust pd_upper/lower
> so that the next tuple does not overwrite it.  You'd still need space
> in either page header or special space.

You only need one bit of space in the page header, to indicate whether
the LSN is present or not.  And we have unused bit space there.

> I don't see how you'd store
> anything in the hole without it being in a fixed place, where it would
> eventually be hit by either the line pointer array or tuple data.

The point is that it doesn't matter.  Suppose we put it just after the
line pointer array.  Any time we're thinking about extending the line
pointer array, we already have an exclusive lock on the buffer.  And
if we already have a exclusive lock on the buffer, then we can reclaim
the dead line pointers and now we no longer need the saved LSN, so
writing over it is perfectly fine.

OK, I lied: if we have an exclusive buffer lock, but the last vacuum
either failed, or is still in progress, then the LSN might not be old
enough for us to reclaim the dead line pointers yet.  So ideally we'd
like to hold onto it.  We can do that by either (a) moving the LSN out
another 6 bytes, if there's enough room; or (b) deciding not to put
the new tuple on this page, after all.  There's no situation in which
we absolutely HAVE to get another tuple onto this particular page.  We
can just decide that the effective size of a page that contains dead
line pointers is effectively 8 bytes less.  The alternative is to eat
up 8 bytes of space on ALL pages, whether they contain dead line
pointers or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Tom Lane
Robert Haas  writes:
> On Wed, May 25, 2011 at 12:41 PM, Tom Lane  wrote:
>> Yeah, I had been thinking about the latter point.  We could be
>> conservative and just keep the reported tuple density the same (ie,
>> update relpages to the new correct value, while setting reltuples so
>> that the density ratio doesn't change).  But that has its own problems
>> when the table contents *do* change.  What I'm currently imagining is
>> to do a smoothed moving average, where we factor in the new density
>> estimate with a weight dependent on the percentage of the table we did
>> scan.  That is, the calculation goes something like
>> 
>> old_density = old_reltuples / old_relpages
>> new_density = counted_tuples / scanned_pages
>> reliability = scanned_pages / new_relpages
>> updated_density = old_density + (new_density - old_density) * reliability
>> new_reltuples = updated_density * new_relpages
>> 
>> We could slow the moving-average convergence even further when
>> reliability is small; for instance if you were really paranoid you might
>> want to use the square of reliability in line 4.  That might be
>> overdoing it, though.

> I don't know.  That's maybe better, but I'd be willing to wager that
> in some cases it will just slow down the rate at which we converge to
> a completely incorrect value, while in other cases it'll fail to
> update the data when it really has changed.

[ shrug... ]  When you don't have complete information, it's *always*
the case that you will sometimes make a mistake.  That's not
justification for paralysis, especially not when the existing code is
demonstrably broken.

What occurs to me after thinking a bit more is that the existing tuple
density is likely to be only an estimate, too (one coming from the last
ANALYZE, which could very well have scanned even less of the table than
VACUUM did).  So what I now think is that both VACUUM and ANALYZE ought
to use a calculation of the above form to arrive at a new value for
pg_class.reltuples.  In both cases it would be pretty easy to track the
number of pages we looked at while counting tuples, so the same raw
information is available.

> I am wondering, though, why we're not just inserting a special-purpose
> hack for TOAST tables.

Because the problem is not specific to TOAST tables.  As things
currently stand, we will accept the word of an ANALYZE as gospel even if
it scanned 1% of the table, and completely ignore the results from a
VACUUM even if it scanned 99% of the table.  This is not sane.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Alvaro Herrera
Excerpts from Kevin Grittner's message of mié may 25 12:37:24 -0400 2011:
> Tom Lane  wrote:
>  
> > I don't know what client-side code might be looking at
> > relpages/reltuples.
>  
> I know that I find reltuples useful for getting an "accurate enough"
> sense of rows in a table (or set of tables) without resorting to
> count(*).  I'd be OK with any two of pages, tuples, and density; but
> dropping to one would make databases harder to manage, IMV.
>  
> Personally, I don't have much code that uses those columns;
> eliminating an existing column wouldn't involve much pain for me as
> long as it could still be derived.

Well, we only actually need to store one number, because you can figure
out a much more precise number-of-pages figure with pg_relation_size()
divided by configured page size.

(We feel free to hack around catalogs in other places, and we regularly
break pgAdmin and the like when we drop columns -- people just live with
it and update their tools.  I don't think it's such a big deal in this
particular case.)

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 12:41 PM, Tom Lane  wrote:
> Yeah, I had been thinking about the latter point.  We could be
> conservative and just keep the reported tuple density the same (ie,
> update relpages to the new correct value, while setting reltuples so
> that the density ratio doesn't change).  But that has its own problems
> when the table contents *do* change.  What I'm currently imagining is
> to do a smoothed moving average, where we factor in the new density
> estimate with a weight dependent on the percentage of the table we did
> scan.  That is, the calculation goes something like
>
> old_density = old_reltuples / old_relpages
> new_density = counted_tuples / scanned_pages
> reliability = scanned_pages / new_relpages
> updated_density = old_density + (new_density - old_density) * reliability
> new_reltuples = updated_density * new_relpages
>
> We could slow the moving-average convergence even further when
> reliability is small; for instance if you were really paranoid you might
> want to use the square of reliability in line 4.  That might be
> overdoing it, though.

I don't know.  That's maybe better, but I'd be willing to wager that
in some cases it will just slow down the rate at which we converge to
a completely incorrect value, while in other cases it'll fail to
update the data when it really has changed.

I am wondering, though, why we're not just inserting a special-purpose
hack for TOAST tables.  Your email seems to indicate that regular
tables are already handled well enough, and certainly if we only whack
around the TOAST behavior it's much less likely to fry anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Alvaro Herrera
Excerpts from Robert Haas's message of mié may 25 12:27:38 -0400 2011:
> On Wed, May 25, 2011 at 10:07 AM, Pavan Deolasee
>  wrote:

> > I think if we are reclaiming LP_DEAD line pointers only while
> > defragmenting the page, we can always reclaim the space for the LSN,
> > irrespective of where we store it. So maybe we should decide
> > depending on what would matter for on-disk compatibility and whatever
> > requires the least invasive changes. I don't know what that is yet.
> 
> If we store the LSN at the beginning (page header) or end (special
> space) of the page, then we'll have to adjust the location of the data
> which follows it or precedes it if and when we want to reclaim the
> space it occupies.  But if we store it in the hole in the middle
> somewhere (probably by reducing the size of the hole, not by actually
> referencing the area between pd_lower and pd_upper) then no tuples or
> item pointers have to move around when we want to reclaim the space.
> That way, we need only an exclusive lock (not a cleanup lock), and we
> don't need to memmove() anything.

You can store anything in the "hole" in the data area (currently we only
store tuple data), but then you'd have to store the offset to where you
stored it in some fixed location, and make sure you adjust pd_upper/lower
so that the next tuple does not overwrite it.  You'd still need space
in either page header or special space.  I don't see how you'd store
anything in the hole without it being in a fixed place, where it would
eventually be hit by either the line pointer array or tuple data.


As far as the general idea of this thread goes, I remember thinking
about exactly this topic a couple of months ago -- I didn't get this far
though, so thanks for fleshing out some details and getting the ball
rolling.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:17 AM, Greg Smith  wrote:
>> This data would probably need to be kept separately for each table or
>> index, as some tables or indexes
>> may be mostly or fully in cache or on faster physical media than others,
>> although in the absence of other
>> data about a specific table or index, data about other relations in the
>> same tablespace might be of some use.
>
> This is the important part.  Model how the data needs to get stored such
> that the optimizer can make decisions using it, and I consider it easy to
> figure out how it will get populated later.

I basically agree.  There have been several recent discussions of this
topic on both -hackers and -performance; it is likely that the TODO
needs to be updated with some more recent links.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "errno" not set in case of "libm" functions (HPUX)

2011-05-25 Thread Ibrar Ahmed
Please find the updated patch. I have added this "+Olibmerrno" compile flag
check in configure/configure.in file.


OS

HP-UX B.11.31 U ia64

without patch
---
postgres=# select acos(2);
 acos
--
  NaN
(1 row)


with patch
---
postgres=# select acos(2);
ERROR:  input is out of range



On Tue, May 24, 2011 at 11:13 PM, Tom Lane  wrote:

> Heikki Linnakangas  writes:
> > So the default is indeed non-standard. But I wonder if we should use -Aa
> > instead?
>
> Probably not; at least on older HPUX versions, -Aa turns off access to
> assorted stuff that we do want, eg "long long".  "man cc" on my box
> saith
>
> -Amode Specify the compilation standard to be used by the
>compiler.  mode can be one of the following letters:
>
>   c(Default) Compile in a mode compatible with
>HP-UX releases prior to 7.0.  (See The C
>Programming Language, First Edition by
>Kernighan and Ritchie).  This option also
>defines the symbol _HPUX_SOURCE and allows the
>user to access macros and typedefs provided by
>the HPUX Operating System. The default
>compilation mode may change in future releases.
>
>   aCompile under ANSI mode (ANSI programming
>language C standard ISO 9899:1990).  When
>compiling under ANSI mode, the header files
>would define only those names (macros and
>typedefs) specified by the Standard. To access
>macros and typedefs that are not defined by the
>ANSI Standard but are provided by the HPUX
>Operating System, define the symbol
>_HPUX_SOURCE; or use the extension option
>described below.
>
>   eExtended ANSI mode.  Same as -Aa -D_HPUX_SOURCE
>+e.  This would define the names (macros and
>typedefs) provided by the HPUX Operating System
>and, in addition, allow the following
>extensions: $ characters in identifier names,
>sized enums, sized bit-fields, and 64-bit
>integral type long long.  Additional extensions
>may be added to this option in the future.
>
> The +e option is elsewhere stated to mean
>
>  +eEnables HP value-added features while compiling
>in ANSI C mode, -Aa.  This option is ignored
>with -Ac because these features are already
>provided.  Features enabled:
>
> o  Long pointers
> o  Integral type specifiers can appear in
>enum declarations.
> o  The $ character can appear in
>identifier names.
> o  Missing parameters on intrinsic calls
>
> which isn't 100% consistent with what it says under -Ae, so maybe some
> additional experimentation is called for.  But anyway, autoconf appears
> to think that -Ae is preferable to the combination -Aa -D_HPUX_SOURCE
> (that choice is coming from autoconf not our own code); so I'm not
> optimistic that we can get more-standard behavior by overriding that.
>
>regards, tom lane
>



-- 
   Ibrar Ahmed
diff --git a/configure b/configure
index 3b23c46..d448534 100755
--- a/configure
+++ b/configure
@@ -4468,6 +4468,66 @@ if test x"$pgac_cv_prog_cc_cflags__qnoansialias" = x"yes"; then
   CFLAGS="$CFLAGS -qnoansialias"
 fi
 
+elif test "$PORTNAME" = "hpux"; then
+  # HP-UX libm functions on 'Integrity server' do not set errno by default,
+  # for errno setting, compile with the +Olibmerrno option.
+  { $as_echo "$as_me:$LINENO: checking whether $CC supports +Olibmerrno" >&5
+$as_echo_n "checking whether $CC supports +Olibmerrno... " >&6; }
+if test "${pgac_cv_prog_cc_cflags_pOlibmerrno+set}" = set; then
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS +Olibmerrno"
+cat >conftest.$ac_ext <<_ACEOF
+/* confdefs.h.  */
+_ACEOF
+cat confdefs.h >>conftest.$ac_ext
+cat >>conftest.$ac_ext <<_ACEOF
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+rm -f conftest.$ac_objext
+if { (ac_try="$ac_compile"
+case "(($ac_try" in
+  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+  *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
+$as_echo "$ac_try_echo") >&5
+  (eval "$ac_compile") 2>conftest.er1
+  ac

Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Vaibhav Kaushal
@pavan - thanks a lot. Will try it when I go to the desktop next time.

--
Sent from my Android
On 25 May 2011 15:33, "Pavan Deolasee"  wrote:
> On Wed, May 25, 2011 at 3:22 PM, Vaibhav Kaushal <
> vaibhavkaushal...@gmail.com> wrote:
>
>> I see that the target list to be scanned is handled by "ExecTargetList"
>> function.
>>
>> I am not so sure about this because this function is not listed in the list
>> of functions which GDB shows me (./configure --enable-debug && make clean &&
>> make && make install). Rest everything else (almost / perhaps) is shown!
>> (Can anyone tell why? :( )
>>
>>
> You would need to turn optimization off by passing "-O0" flag to the
> compiler. Otherwise static functions may get optimized and you may not see
> them in gdb stack.
>
> CFLAGS="-O0" ./configure --enable-debug
>
> Thanks,
> Pavan
>
> --
> Pavan Deolasee
> EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Tom Lane
Robert Haas  writes:
> On Wed, May 25, 2011 at 11:47 AM, Tom Lane  wrote:
>> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
>> entries.  We could have it count the number of pages it skipped, rather
>> than just keeping a bool, and then scale up the rel_tuples count to be
>> approximately right by assuming the skipped pages have tuple density
>> similar to the scanned ones.

> This approach doesn't seem like a good idea to me.  The skipped
> portions of the table are the ones that haven't been modified in a
> while, so this is or embeds an assumption that the tuples in the hot
> and cold portions of the table are the same size.  That might be true,
> but it isn't hard to think of cases where it might not be.  There
> could also very easily be sampling error, because it's easy to imagine
> a situation where 99% of the table is getting skipped.

Yeah, I had been thinking about the latter point.  We could be
conservative and just keep the reported tuple density the same (ie,
update relpages to the new correct value, while setting reltuples so
that the density ratio doesn't change).  But that has its own problems
when the table contents *do* change.  What I'm currently imagining is
to do a smoothed moving average, where we factor in the new density
estimate with a weight dependent on the percentage of the table we did
scan.  That is, the calculation goes something like

old_density = old_reltuples / old_relpages
new_density = counted_tuples / scanned_pages
reliability = scanned_pages / new_relpages
updated_density = old_density + (new_density - old_density) * reliability
new_reltuples = updated_density * new_relpages

We could slow the moving-average convergence even further when
reliability is small; for instance if you were really paranoid you might
want to use the square of reliability in line 4.  That might be
overdoing it, though.
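
For concreteness, a minimal standalone sketch of that update rule (the names
are illustrative only, not actual PostgreSQL code):

#include <stdio.h>

static double
new_reltuples(double old_reltuples, double old_relpages,
              double counted_tuples, double scanned_pages,
              double new_relpages)
{
    double old_density = old_reltuples / old_relpages;
    double new_density = counted_tuples / scanned_pages;
    double reliability = scanned_pages / new_relpages;

    /* the paranoid variant would use reliability * reliability here */
    double updated_density =
        old_density + (new_density - old_density) * reliability;

    return updated_density * new_relpages;
}

int
main(void)
{
    /* 1000-page table previously estimated at 100 tuples/page; VACUUM
     * scanned 50 pages and counted 2000 tuples (40 tuples/page) there.
     * The estimate moves only slightly, to 97 tuples/page. */
    printf("%.0f\n", new_reltuples(100000, 1000, 2000, 50, 1000));
    return 0;
}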

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Alvaro Herrera
Excerpts from Bruce Momjian's message of mar may 24 15:59:59 -0400 2011:

> > I think you misread what I wrote, or I misexplained it, but never
> > mind.  Matching locale names case-insensitively sounds reasonable to
> > me, unless someone has reason to believe it will blow up.
> 
> OK, that's what I needed to hear.  I have applied the attached patch,
> but only to 9.1 because  of the risk of breakage. (This was only the
> first bug report of this, and we aren't 100% certain about the case
> issue.)

Hmm, I know the locale can be spelled "en_US.UTF-8" (note dash) in some
systems.  So if you have a locale alias that makes en_US.utf8 the same
as en_US.UTF-8, the patched code would still not work.  I wonder if
there's a need for canonicalization somewhere.  Or maybe this is just
not worth the trouble.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Kevin Grittner
Tom Lane  wrote:
 
> I don't know what client-side code might be looking at
> relpages/reltuples.
 
I know that I find reltuples useful for getting an "accurate enough"
sense of rows in a table (or set of tables) without resorting to
count(*).  I'd be OK with any two of pages, tuples, and density; but
dropping to one would make databases harder to manage, IMV.
 
Personally, I don't have much code that uses those columns;
eliminating an existing column wouldn't involve much pain for me as
long as it could still be derived.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:07 AM, Pavan Deolasee
 wrote:
>> I'm confused.  A major point of the approach I was proposing was to
>> avoid having to move tuples around.
>
> Well, I am not sure how you can always guarantee to make space
> available for the LSN without moving tuples, irrespective of where
> you store it.  But probably it's not important as we discussed below.

Well, if we just defragged the page, then we should be guaranteed that
pd_lower + 8 < pd_upper.  No?

> I think if we are reclaiming LP_DEAD line pointers only while
> defragmenting the page, we can always reclaim the space for the LSN,
> irrespective of where we store it. So maybe we should decide
> depending on what would matter for on-disk compatibility and whatever
> requires the least invasive changes. I don't know what that is yet.

If we store the LSN at the beginning (page header) or end (special
space) of the page, then we'll have to adjust the location of the data
which follows it or precedes it if and when we want to reclaim the
space it occupies.  But if we store it in the hole in the middle
somewhere (probably by reducing the size of the hole, not by actually
referencing the area between pd_lower and pd_upper) then no tuples or
item pointers have to move around when we want to reclaim the space.
That way, we need only an exclusive lock (not a cleanup lock), and we
don't need to memmove() anything.
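
(A minimal, self-contained C sketch of that idea, using an invented page struct rather than the real PageHeaderData; it only shows that claiming and releasing the 8 bytes is a matter of bumping pd_lower, with no tuple data moving.)

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FAKE_BLCKSZ 8192
#define PD_HAS_DEAD_LINE_PTR_LSN 0x0001

/* Invented page layout, just enough to illustrate the idea. */
typedef struct
{
    uint16_t pd_lower;   /* byte offset of the end of the line pointer array */
    uint16_t pd_upper;   /* byte offset of the start of tuple data */
    uint16_t pd_flags;
    char     data[FAKE_BLCKSZ - 6];
} FakePage;

/*
 * Stash an 8-byte LSN in the hole just after the line pointer array.
 * Only pd_lower and a flag change; memcpy sidesteps alignment, and
 * nothing else on the page moves, so an exclusive lock (rather than a
 * cleanup lock) would be enough.
 */
bool
page_set_dead_lsn(FakePage *page, uint64_t lsn)
{
    if (page->pd_upper < page->pd_lower + sizeof(uint64_t))
        return false;    /* no room; caller can simply skip the hint */
    memcpy(page->data + page->pd_lower, &lsn, sizeof(lsn));
    page->pd_lower += sizeof(uint64_t);
    page->pd_flags |= PD_HAS_DEAD_LINE_PTR_LSN;
    return true;
}

/* Reclaiming the 8 bytes is equally cheap: retreat pd_lower, clear the bit. */
void
page_clear_dead_lsn(FakePage *page)
{
    if (page->pd_flags & PD_HAS_DEAD_LINE_PTR_LSN)
    {
        page->pd_lower -= sizeof(uint64_t);
        page->pd_flags &= ~PD_HAS_DEAD_LINE_PTR_LSN;
    }
}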

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Tom Lane
Alvaro Herrera  writes:
> Excerpts from Tom Lane's message of mié may 25 11:47:52 -0400 2011:
>> I can see two basic approaches we might take here:
>> 
>> 1. Modify autovacuum to use something from the stats collector, rather
>> than reltuples, to make its decisions.  I'm not too clear on why
>> reltuples is being used there anyway; is there some good algorithmic or
>> statistical reason why AV should be looking at a number from the last
>> vacuum?

> It uses reltuples simply because it was what the original contrib code
> was using.  Since pgstat was considerably weaker at the time, reltuples
> might have been the only thing available.  It's certainly the case that
> pgstat has improved a lot since autovacuum got in, and some things have
> been revised but not this one.

On reflection I'm hesitant to do this, especially for a backpatched bug
fix, because it would be changing the feedback loop behavior for
autovacuum scheduling.  That could have surprising consequences.

>> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
>> entries.  We could have it count the number of pages it skipped, rather
>> than just keeping a bool, and then scale up the rel_tuples count to be
>> approximately right by assuming the skipped pages have tuple density
>> similar to the scanned ones.

> Hmm, interesting idea.  This would be done only for toast tables, or all
> tables?

I'm thinking just do it for all.  The fact that these numbers don't
necessarily update after a vacuum is certainly surprising in and of
itself, and it did not work that way before the VM patch went in.
I'm concerned about other stuff besides AV not dealing well with
obsolete values.

> At this point I only wonder why we store tuples & pages rather than just
> live tuple density.

It's just for backwards compatibility.  I've thought about doing that in
the past, but I don't know what client-side code might be looking at
relpages/reltuples.  It's not like collapsing them into one field would
save much, anyway.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 11:47 AM, Tom Lane  wrote:
> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
> entries.  We could have it count the number of pages it skipped, rather
> than just keeping a bool, and then scale up the rel_tuples count to be
> approximately right by assuming the skipped pages have tuple density
> similar to the scanned ones.

This approach doesn't seem like a good idea to me.  The skipped
portions of the table are the ones that haven't been modified in a
while, so this is or embeds an assumption that the tuples in the hot
and cold portions of the table are the same size.  That might be true,
but it isn't hard to think of cases where it might not be.  There
could also very easily be sampling error, because it's easy to imagine
a situation where 99% of the table is getting skipped.  Any error that
creeps into the estimate is going to get scaled up along with the
estimate itself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing overhead of frequent table locks

2011-05-25 Thread Bruce Momjian
Simon Riggs wrote:
> On Tue, May 24, 2011 at 6:37 PM, Robert Haas  wrote:
> 
> >> That being said, it's a slight extra cost for all fast-path lockers
> >> to benefit the strong lockers, so I'm not prepared to guess whether
> >> it will pay off.
> >
> > Yeah.  Basically this entire idea is about trying to make life easier
> > for weak lockers at the expense of making it more difficult for strong
> > lockers.  I think that's a good trade-off in general, but we might
> > need to wait until we have an actual implementation to judge whether
> > we've turned the dial too far.
> 
> I like this overall concept and like the way this has been described
> with strong and weak locks. It seems very useful to me, since temp
> tables can be skipped. That leaves shared DDL and we have done lots to
> reduce the lock levels held and are looking at further reductions
> also. I think even quite extensive delays are worth the trade-off.
> 
> I'd been looking at this also, though hadn't mentioned it previously
> because I found an Oracle patent that discusses dynamically turning on
> and off locking. So that's something to be aware of. IMHO if we
> discuss this in terms of sharing/not sharing locking information then
> it is sufficient to avoid the patent. That patent also discusses the
> locking state change needs to wait longer than required.

FYI, I thought we had agreed not to look at any patents because looking
at them might cause us more problems than not looking at them.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:35 AM, Hitoshi Harada  wrote:
> 2011/5/25 Hitoshi Harada :
>> So I'm still
>> thinking which of pulling up and parameterized scan is better.
>
> After more investigation I came up with third idea, pushing down
> RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
> but it seems to me like it is slightly easier than pulling up
> agg-subquery. The main point is that when you want to pull up, you
> must care if the final output would be correct with other join
> relations than the aggregate-related one. In contrast, when pushing
> down the target relation to agg-subquery it is simple to ensure the
> result.
>
> I'm looking around pull_up_subqueries() in subquery_planner() to add
> the pushing down logic. It could be possible to do it around
> make_one_rel() but I bet query structure changes are doable against
> Query, not PlannerInfo.

How do you decide whether or not to push down?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:13 AM, Bruce Momjian  wrote:
>> > Idempotent does seem like the most promising idea.
>>
>> I tend to agree with you, but I'm worried it won't actually work out
>> to a win.  By the time we augment the records with enough additional
>> information we may have eaten up a lot of the benefit we were hoping
>> to get.
>
> This is where I was confused.  Our bad case now is when someone modifies
> one row on a page between checkpoints --- instead of writing 400 bytes,
> we write 8400.  What portion of between-checkpoint activity writes more
> than a few rows to a page?  I didn't think many, except for COPY.
> Ideally we could switch in and out of this mode per page, but that seems
> super-complicated.

Well, an easy to understand example would be a page that gets repeated
HOT updates.  We'll do this: add a tuple, add a tuple, add a tuple,
add a tuple, HOT cleanup, add a tuple, add a tuple, add a tuple, add a
tuple, HOT cleanup... and so on.  In the worst case, that could be
done many, many times between checkpoints that might be up to an hour
apart.  The problem can also occur (with a little more difficulty)
even without HOT.  Imagine a small table with lots of inserts and
deletes.  Page fills up, some rows are deleted, vacuum frees up space,
page fills up again, some more rows are deleted, vacuum frees up space
again, and so on.

But you raise an interesting point, which is that it might also be
possible to reduce the impact of write-ahead logging in other ways.
For example, if we're doing a large COPY into a table, we could buffer
up a full block of tuples and then just emit an FPI for the page.
This would likely be cheaper than logging each tuple individually.

In fact, you could imagine keeping a queue of pending WAL for each
block in shared buffers.  You don't really need that WAL to be
consolidated into a single stream until either (a) you need to write
the block or (b) you need to commit the transaction.  When one of
those things happens, you can decide at that point whether it's
cheaper to emit the individual records or do some sort of
consolidation.  Doing it in exactly that way is probably impractical,
because every backend that wants to commit would have to make a sweep
of every buffer it's dirtied and see if any of them still contain WAL
that needs to be shoved into the main queue, and that would probably
suck, especially for small transactions.  But maybe there's some
variant that could be made to work.
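
(Purely a design sketch of the per-buffer queue being imagined here, with invented names; nothing like this exists in the tree. The only point is that consolidation can be deferred until the block is written or the transaction commits, and the decision can be made on total size.)

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define FAKE_FPI_SIZE 8192     /* rough cost of one full-page image */

/* One not-yet-emitted WAL record for a particular buffer. */
typedef struct PendingWAL
{
    struct PendingWAL *next;
    size_t             len;
    char               data[];  /* the record payload */
} PendingWAL;

/* Per-buffer queue of pending records. */
typedef struct
{
    PendingWAL *head;
    size_t      total_len;
} BufferWALQueue;

void
queue_record(BufferWALQueue *q, const char *rec, size_t len)
{
    PendingWAL *p = malloc(offsetof(PendingWAL, data) + len);

    if (p == NULL)
        return;                 /* a real implementation would not just drop it */
    memcpy(p->data, rec, len);
    p->len = len;
    p->next = q->head;
    q->head = p;
    q->total_len += len;
}

/* At flush/commit time: emit the queued records, or one page image instead? */
int
cheaper_to_emit_fpi(const BufferWALQueue *q)
{
    return q->total_len > FAKE_FPI_SIZE;
}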

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Alvaro Herrera
Excerpts from Tom Lane's message of mié may 25 11:47:52 -0400 2011:

> I think I see what must be going on here: that toast table must contain
> a long run of all-visible-according-to-the-VM pages (which is hardly a
> surprising situation).  This results in VACUUM choosing not to update
> the pg_class entry:
> 
> /*
>  * Update statistics in pg_class.  But only if we didn't skip any pages;
>  * the tuple count only includes tuples from the pages we've visited, and
>  * we haven't frozen tuples in unvisited pages either.  The page count is
>  * accurate in any case, but because we use the reltuples / relpages ratio
>  * in the planner, it's better to not update relpages either if we can't
>  * update reltuples.
>  */
> if (vacrelstats->scanned_all)
> vac_update_relstats(onerel,
> vacrelstats->rel_pages, vacrelstats->rel_tuples,
> vacrelstats->hasindex,
> FreezeLimit);
> 
> For an ordinary table this wouldn't be fatal because we'll still do an
> ANALYZE from time to time, and that will update the entries with new
> (approximate) values.  But we never run ANALYZE on toast tables.

Ouch.

> I can see two basic approaches we might take here:
> 
> 1. Modify autovacuum to use something from the stats collector, rather
> than reltuples, to make its decisions.  I'm not too clear on why
> reltuples is being used there anyway; is there some good algorithmic or
> statistical reason why AV should be looking at a number from the last
> vacuum?

It uses reltuples simply because it was what the original contrib code
was using.  Since pgstat was considerably weaker at the time, reltuples
might have been the only thing available.  It's certainly the case that
pgstat has improved a lot since autovacuum got in, and some things have
been revised but not this one.

> 2. Revise the vacuum code so that it doesn't skip updating the pg_class
> entries.  We could have it count the number of pages it skipped, rather
> than just keeping a bool, and then scale up the rel_tuples count to be
> approximately right by assuming the skipped pages have tuple density
> similar to the scanned ones.

Hmm, interesting idea.  This would be done only for toast tables, or all
tables?

At this point I only wonder why we store tuples & pages rather than just
live tuple density.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Volunteering as Commitfest Manager

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:32 AM, Simon Riggs  wrote:
> On Wed, May 25, 2011 at 3:24 PM, Tom Lane  wrote:
>> Simon Riggs  writes:
>>> I hear that CF manager is a difficult role for a single individual.
>>> So it makes sense to split that role between multiple people.
>>
>>> I volunteer to be the CF manager for Replication, and also for
>>> Performance. ...
>>> Patches need an eventual committer anyway, so this is a reasonable way
>>> of having the process be managed by the eventual committer.
>>
>> ISTM that we really want the CF manager to be somebody who is *not*
>> directly involved in reviewing or committing patches.  It's a distinct
>> skill set, so there is no reason why it's a good idea for a committer
>> to do it.  And we need to get the CF work load more spread out, not more
>> concentrated.
>
> I agree it makes sense if a non-committer performs the role. If a
> committer does take the role, I would volunteer to split the role and
> for me to work on the suggested areas.

I think it would be really valuable for you to get more involved in
reviewing some of the patches, whether you end up doing any of the
CommitFest management tasks or not.  In the past, your involvement has
been light; but we have a limited number of people who are qualified
to do detailed technical reviews of complicated patches, and more help
is very much needed.

I share the opinion expressed upthread that in an ideal world, we
would not have major reviewers or committers also doing CommitFest
management work, but in some sense that's an ideal world that we don't
live in.  Greg Smith and Kevin Grittner, both of whom have
successfully managed CommitFests in the past, also contribute in many
other ways, and I would not want to say either that those other
contributions disqualify them from managing CommitFests, or that
because they are managing the CommitFest they should suspend their
other involvement.  Either rule would be a random bureaucratic
obstacle that accomplishes nothing useful.  So I don't see the fact
that you are a committer as a reason why you can't or shouldn't help
with CommitFest management - if, for example, you don't want to review
patches yourself, but you do want to help organize other people to
volunteer to review patches, we should accept the contribution that
you're willing to make rather than insisting that you should make some
other contribution instead.

So, what I think is important is to understand exactly what you'd like
to volunteer to do, and see whether that fits in well with what other
people want to do.  Typically, at the beginning of the CommitFest,
there are a bunch of people who volunteer to be assigned a patch.
This task is likely best done by one person who has the entire
CommitFest in mind - in particular, I have found that it's a good idea
to tackle the most significant patches first; the little stuff can be
picked off at the end without much trouble, and is rarely what holds
the process up.  However, all of the other CommitFest management tasks
can be easily split across multiple people.  The main task, and where
the vast majority of the work goes, is in following up on patches
where discussion has died off or failed to reach a conclusion, and
encouraging the patch author to resubmit or the reviewer to re-review
or a committer to consider it for commit or just marking it Returned
with Feedback if there's insufficient consensus and/or time and/or
motivation to get it finished up in a month's time.  We've allowed
this to become the problem of a single and rather monolithic
individual, but that system sucks, and I have no desire to perpetuate
it.  It sucks for that individual because it's a lot of work, and it's
difficult to keep many unrelated patches in your head at the same
time; and it sucks for everyone else, because about half the time that
one monolithic individual finds that the rest of their life interferes
with PostgreSQL, or the other things they are doing for PostgreSQL
interfere with their role as CommitFest manager, and they don't do it
as well as it needs to be done to really make things run smoothly.

Therefore, if you'd like to volunteer to help keep on top of the
replication and performance patches, and make sure that they are
getting followed up on regularly, I think that would be excellent.  We
tried before having a position of "patch-chaser" or "assistant
commitfest manager by topic" for the September of 2009 CommitFest, but
it wasn't entirely successful.  It's time to revisit that concept,
because we're not always going to find one person who can devote 40+
hours to managing a CommitFest, and even when we can, it's not exactly
accomplishing our goal of decentralizing the work.  It's better than
the "Tom do everything" approach, and we do have a lot of people
helping review now, but there are still two or three people per
CommitFest burning a lot of midnight oil.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Nested CASE-WHEN scoping

2011-05-25 Thread Tom Lane
Heikki Linnakangas  writes:
> While looking at fixing the multiple-evaluation issue in IN and BETWEEN 
> discussed a while ago, I realized that the current assumption that only 
> one CaseTestExpr placeholder needs to be valid at any given time is not 
> true.

[ scratches head ... ]  Why does the save/restore in ExecEvalCase not
take care of this?

> So I'm going to put the BETWEEN/IN fix aside for now, and refactor the 
> placeholder infrastructure to handle several simultaneous placeholders, 

That sounds like a mess, and I'm not even convinced it'll solve the
problem ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Hitoshi Harada
2011/5/26 Tom Lane :
> Hitoshi Harada  writes:
>> 2011/5/25 Hitoshi Harada :
>>> So I'm still
>>> thinking which of pulling up and parameterized scan is better.
>
>> After more investigation I came up with third idea, pushing down
>> RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
>
> Yes, it sure does.  Won't that typically change the number of rows
> produced by the subquery's jointree?  And thus possibly change the
> aggregate's result?

No, because it will be restricted to the assumption that the join qual
matches the grouping key and the outer relation is unique on the var in
that join qual. Pushing down the baserel to the sub-aggregate ends up
being equivalent to pulling up the sub-aggregate if there are only two
relations (which is my first example). The difference is whether the
optimizer needs to care about other relations which may change the
number of rows produced.

Regards,

-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Tom Lane
Florian Helmberger  writes:
> On 25.05.11 04:47, Tom Lane wrote:
>> Florian Helmberger  writes:
>>> I'm running a production database with PostgreSQL 9.0.3 (64-bit) on
>>> Debian 5.0.4 and have an issue with a TOAST table and far too frequent
>>> autovacuum runs.
>>> 
>>> I think I've pinned the problem down to the values pg_class holds for
>>> the affected TOAST table:
>>> 
>>> relpages  | 433596
>>> reltuples | 1868538
>>> 
>>> These values are significantly too low. Interestingly, the autovacuum
>>> logout reports the correct values:
>>> 
>>> pages: 0 removed, 34788136 remain
>>> tuples: 932487 removed, 69599038 remain
>>> 
>>> but these aren't stored in pg_class after each run.

>> That's exceedingly weird.  Do the pg_stat_all_tables columns update
>> after autovacuums on that table?

> Yes they do:

I think I see what must be going on here: that toast table must contain
a long run of all-visible-according-to-the-VM pages (which is hardly a
surprising situation).  This results in VACUUM choosing not to update
the pg_class entry:

/*
 * Update statistics in pg_class.  But only if we didn't skip any pages;
 * the tuple count only includes tuples from the pages we've visited, and
 * we haven't frozen tuples in unvisited pages either.  The page count is
 * accurate in any case, but because we use the reltuples / relpages ratio
 * in the planner, it's better to not update relpages either if we can't
 * update reltuples.
 */
if (vacrelstats->scanned_all)
vac_update_relstats(onerel,
vacrelstats->rel_pages, vacrelstats->rel_tuples,
vacrelstats->hasindex,
FreezeLimit);

For an ordinary table this wouldn't be fatal because we'll still do an
ANALYZE from time to time, and that will update the entries with new
(approximate) values.  But we never run ANALYZE on toast tables.

And this would *still* be okay, because as noted in the comment, the
planner only depends on the ratio being roughly correct, not on either
individual value being current.  But autovacuum didn't get the memo;
it thinks it can use reltuples to make decisions.

I can see two basic approaches we might take here:

1. Modify autovacuum to use something from the stats collector, rather
than reltuples, to make its decisions.  I'm not too clear on why
reltuples is being used there anyway; is there some good algorithmic or
statistical reason why AV should be looking at a number from the last
vacuum?

2. Revise the vacuum code so that it doesn't skip updating the pg_class
entries.  We could have it count the number of pages it skipped, rather
than just keeping a bool, and then scale up the rel_tuples count to be
approximately right by assuming the skipped pages have tuple density
similar to the scanned ones.

Thoughts?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Simon Riggs
On Wed, May 25, 2011 at 2:43 PM, Leonardo Francalanci  wrote:
>> Da: Simon Riggs 
>> I can't find a clear  discussion of what you are trying to do, and how,
>> just a URL back to a  complex discussion on another topic.
>
>
> While trying to write a patch to allow changing an unlogged table into
> a logged one, I had to add another int field to xl_xact_commit.
> Robert Haas said:
>
>  "I have to admit I don't like this approach very much.  I can't see
> adding 4 bytes to every commit record for this feature."
>
>
> which is a correct remark.
>
> xl_xact_commit can contain some arrays (relation to drops,
> committed sub-trans, shared invalidation msgs). The length of
> these arrays is specified using 3 ints in the struct.
>
> So, to avoid adding more ints to the struct, I've been suggested to
> remove all the ints, and use   xl_xact_commit.xinfo to flag which
> arrays are, in fact, present.
>
> So the whole idea is:
>
> - remove nrels, nsubxacts and nmsgs from xl_xact_commit
> - use bits in xinfo to signal which arrays are present at the end
> of   xl_xact_commit
> - for each present array, add the length of the array (as int) at
> the end of    xl_xact_commit
> - add each present array after all the lengths


OK, thats clear. Thanks.

That formatting sounds quite complex.

I would propose we split this into 2 WAL records: xl_xact_commit and
xl_xact_commit_with_info

xl_xact_commit doesn't have any flags, counts or arrays.

xl_xact_commit_with_info always has all 3 counts, even if zero.
Arrays follow the main record.

I think it might also be possible to remove dbId and tsId from
xl_xact_commit if we use that definition.
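
(To make the proposed split concrete, a hypothetical pair of record layouts; these are invented definitions, not the actual structs in the tree, and the field types are simplified.)

#include <stdint.h>

/* Common case: no files to drop, no subxacts, no invalidation messages. */
typedef struct
{
    int64_t xact_time;              /* commit timestamp */
} fake_xact_commit;

/* Rare case: all three counts are always present, even if zero;
 * the corresponding arrays follow the struct in the WAL record. */
typedef struct
{
    int64_t  xact_time;
    uint32_t dbId;                  /* could live only in this variant */
    uint32_t tsId;
    uint32_t xinfo;                 /* flag bits */
    int32_t  nrels;                 /* files to drop at commit */
    int32_t  nsubxacts;             /* committed subtransaction XIDs */
    int32_t  nmsgs;                 /* shared invalidation messages */
} fake_xact_commit_with_info;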

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing overhead of frequent table locks

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 8:56 AM, Simon Riggs  wrote:
> On Wed, May 25, 2011 at 1:44 PM, Robert Haas  wrote:
>> On Wed, May 25, 2011 at 8:27 AM, Simon Riggs  wrote:
>>> I got a bit lost with the description of a potential solution. It
>>> seemed like you were unaware that there is a local lock and a shared
>>> lock table, maybe just me?
>>
>> No, I'm not unaware of the local lock table.  The point of this
>> proposal is to avoid fighting over the LWLocks that protect the shared
>> hash table by allowing some locks to be taken without touching it.
>
> Yes, I got that. I just couldn't work out where mmap came in.

I don't think we were planning to use mmap anywhere.

>>> Design seemed relatively easy from there: put local lock table in
>>> shared memory for all procs. We then have a use_strong_lock at proc
>>> and at transaction level. Anybody that wants a strong lock first sets
>>> use_strong_lock at proc and transaction level, then copies all local
>>> lock data into shared lock table, double checking
>>> TransactionIdIsInProgress() each time. Then queues for lock using the
>>> now fully set up shared lock table. When transaction with strong lock
>>> completes we do not attempt to reset transaction level boolean, only
>>> at proc level, since DDL often occurs in groups and we want to avoid
>>> flip-flopping quickly between lock share states. Cleanup happens by
>>> regularly by bgwriter, perhaps every 10 seconds or so. All locks are
>>> still visible for pg_locks.
>>
>> I'm not following this...
>
> Which bit aren't you following? It's a design outline for how to
> implement, deliberately brief to allow a discussion of design
> alternatives.

Well, OK:

1. I don't think putting the local lock table in shared memory is a
good idea both for performance (keeping it uncontended has value) and
memory management (it would increase shared memory requirements quite
a bit).  The design Noah and I were discussing upthread uses only a
small and fixed amount of shared memory for each backend, and
preserves the local lock table as an unshared resource.

2. You haven't explained the procedure for acquire a weak lock at all,
and in particular how a weak locker would be able to quickly determine
whether a conflicting strong lock was potentially present.

3. I don't understand the proposed procedure for acquiring a strong
lock either; in particular, I don't see why
TransactionIdIsInProgress() would be relevant.  The lock manager
doesn't really do anything with transaction IDs now, and you haven't
offered any explanation of why that would be necessary or advisable.

4. You haven't explained what the transaction-level boolean would
actually do.  It's not even clear whether you intend for that to be
kept in local or shared memory.  It's also unclear what you intend for
bgwriter to clean up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Tom Lane
Hitoshi Harada  writes:
> 2011/5/25 Hitoshi Harada :
>> So I'm still
>> thinking which of pulling up and parameterized scan is better.

> After more investigation I came up with third idea, pushing down
> RangeTblEntry to aggregate subquery. This sounds like a crazy idea,

Yes, it sure does.  Won't that typically change the number of rows
produced by the subquery's jointree?  And thus possibly change the
aggregate's result?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Tom Lane
Alvaro Herrera  writes:
> Excerpts from Vaibhav Kaushal's message of mié may 25 05:52:32 -0400 2011:
>> If the above is confusing, I just want to ask: "Is expression evaluator,
>> even in part responsible for {PLANNEDSTMT creation?"

> Yeah, as far as I understood Tom's talk, the expr evaluator is used to
> reduce some expressions to constants and such.

The planner would never call it with an expression containing a Var,
though.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Alvaro Herrera
Excerpts from Vaibhav Kaushal's message of mié may 25 05:52:32 -0400 2011:

> If the above is confusing, I just want to ask: "Is expression evaluator,
> even in part responsible for {PLANNEDSTMT creation?"

Yeah, as far as I understood Tom's talk, the expr evaluator is used to
reduce some expressions to constants and such.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Hitoshi Harada
2011/5/25 Hitoshi Harada :
> So I'm still
> thinking which of pulling up and parameterized scan is better.

After more investigation I came up with third idea, pushing down
RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
but it seems to me like it is slightly easier than pulling up
agg-subquery. The main point is that when you want to pull up, you
must care if the final output would be correct with other join
relations than the aggregate-related one. In contrast, when pushing
down the target relation to agg-subquery it is simple to ensure the
result.

I'm looking around pull_up_subqueries() in subquery_planner() to add
the pushing down logic. It could be possible to do it around
make_one_rel() but I bet query structure changes are doable against
Query, not PlannerInfo.

I appreciate any comments.

Regards,

-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing overhead of frequent table locks

2011-05-25 Thread Tom Lane
Simon Riggs  writes:
> On Wed, May 25, 2011 at 1:44 PM, Robert Haas  wrote:
>> On Wed, May 25, 2011 at 8:27 AM, Simon Riggs  wrote:
>>> Design seemed relatively easy from there: put local lock table in
>>> shared memory for all procs. We then have a use_strong_lock at proc
>>> and at transaction level. Anybody that wants a strong lock first sets
>>> use_strong_lock at proc and transaction level, then copies all local
>>> lock data into shared lock table,

>> I'm not following this...

> Which bit aren't you following? It's a design outline for how to
> implement, deliberately brief to allow a discussion of design
> alternatives.

What I'm not following is how moving the local lock table into shared
memory can possibly be a good idea.  The reason we invented the local
lock table in the first place (we didn't use to have one) is so that a
process could do some manipulations without touching shared memory.
(Notably, it is currently nearly free, and certainly lock-free, to
re-request a lock type you already hold.  This is not an infrequent
case.)  That advantage will go away if you do this.
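
(A toy illustration of that point, with invented structures: re-requesting a lock this backend already holds is satisfied entirely from private memory, so no LWLock on the shared table is ever taken. This is not the real LOCALLOCK code, just the idea.)

#include <stdint.h>

#define MAX_TOY_LOCKS 64

/* Backend-private record of a held lock. */
typedef struct
{
    uint32_t relid;      /* stand-in for the full lock tag */
    int      lockmode;
    int      nholds;     /* how many times this backend holds it */
} ToyLocalLock;

static ToyLocalLock toy_locks[MAX_TOY_LOCKS];
static int toy_nlocks;

/* Returns 1 if the request was satisfied locally, 0 if we would have had
 * to go to the shared lock table. */
int
toy_lock_acquire(uint32_t relid, int lockmode)
{
    int i;

    for (i = 0; i < toy_nlocks; i++)
    {
        if (toy_locks[i].relid == relid && toy_locks[i].lockmode == lockmode)
        {
            toy_locks[i].nholds++;   /* fast path: no shared memory touched */
            return 1;
        }
    }
    /* slow path: acquire via the shared table, then remember it locally */
    if (toy_nlocks < MAX_TOY_LOCKS)
    {
        toy_locks[toy_nlocks].relid = relid;
        toy_locks[toy_nlocks].lockmode = lockmode;
        toy_locks[toy_nlocks].nholds = 1;
        toy_nlocks++;
    }
    return 0;
}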

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Volunteering as Commitfest Manager

2011-05-25 Thread Simon Riggs
On Wed, May 25, 2011 at 3:24 PM, Tom Lane  wrote:
> Simon Riggs  writes:
>> I hear that CF manager is a difficult role for a single individual.
>> So it makes sense to split that role between multiple people.
>
>> I volunteer to be the CF manager for Replication, and also for
>> Performance. ...
>> Patches need an eventual committer anyway, so this is a reasonable way
>> of having the process be managed by the eventual committer.
>
> ISTM that we really want the CF manager to be somebody who is *not*
> directly involved in reviewing or committing patches.  It's a distinct
> skill set, so there is no reason why it's a good idea for a committer
> to do it.  And we need to get the CF work load more spread out, not more
> concentrated.

I agree it makes sense if a non-committer performs the role. If a
committer does take the role, I would volunteer to split the role and
for me to work on the suggested areas.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Leonardo Francalanci
Sorry, email sent without body.

Fixed some English mistakes.

commitlog_lessbytes02.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Volunteering as Commitfest Manager

2011-05-25 Thread Tom Lane
Simon Riggs  writes:
> I hear that CF manager is a difficult role for a single individual.
> So it makes sense to split that role between multiple people.

> I volunteer to be the CF manager for Replication, and also for
> Performance. ...
> Patches need an eventual committer anyway, so this is a reasonable way
> of having the process be managed by the eventual committer.

ISTM that we really want the CF manager to be somebody who is *not*
directly involved in reviewing or committing patches.  It's a distinct
skill set, so there is no reason why it's a good idea for a committer
to do it.  And we need to get the CF work load more spread out, not more
concentrated.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner

2011-05-25 Thread Greg Smith

Michael Nolan wrote:
> Based on last year's discussion of this TODO item, it seems thoughts
> have been focused on estimating how much data is being satisfied from
> PG's shared buffers.  However, I think that's only part of the problem.


Sure, but neither it nor what you're talking about are the real gating 
factor on making an improvement here.  Figuring out how to expose all 
this information to the optimizer so it can use it when planning is the 
hard part.  Collecting a read time profile is just one of the many ways 
you can estimate what's in cache, and each of the possible methods has 
good and bad trade-offs.  I've been suggesting that people assume that's 
a solved problem--I'm pretty sure what you're proposing was done by Greg 
Stark once and a prototype built even--and instead ask what you're going 
to do next if you had this data.


> This data would probably need to be kept separately for each table or
> index, as some tables or indexes may be mostly or fully in cache or on
> faster physical media than others, although in the absence of other
> data about a specific table or index, data about other relations in
> the same tablespace might be of some use.


This is the important part.  Model how the data needs to get stored such 
that the optimizer can make decisions using it, and I consider it easy 
to figure out how it will get populated later.  There are actually 
multiple ways to do it, and it may end up being something people plug-in 
an implementation that fits their workload into, rather than just having 
one available.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Bruce Momjian
Robert Haas wrote:
> That idea has the additional advantage of confusing the level of
> detail of our WAL logging (minimal vs. archive vs. hot standby) with
> the mechanism used to protect against torn pages (full page writes vs.
> idempotent WAL records vs. prayer).  When they set it wrong and
> destroy their system, we can tell them it's their own fault for not
> configuring the system properly!  Bwahahahaha!

I love it!  Create confusing configuration options and blame the user!

> In all seriousness, I can't imagine that we'd make this
> user-configurable in the first place, since that would amount to
> having two sets of WAL records each of which would be even less well
> tested than what we have now; and for a project this complex, we
> probably shouldn't even consider changing things that seem to work now
> unless the new system is clearly better than the old.
> 
> > Idempotent does seem like the most promising idea.
> 
> I tend to agree with you, but I'm worried it won't actually work out
> to a win.  By the time we augment the records with enough additional
> information we may have eaten up a lot of the benefit we were hoping
> to get.

This is where I was confused.  Our bad case now is when someone modifies
one row on a page between checkpoints --- instead of writing 400 bytes,
we write 8400.  What portion of between-checkpoint activity writes more
than a few rows to a page?  I didn't think many, except for COPY. 
Ideally we could switch in and out of this mode per page, but that seems
super-complicated.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Pavan Deolasee
On Wed, May 25, 2011 at 5:57 PM, Robert Haas  wrote:

>
> Alternatively, it's possible that we'd be better off vacuuming the
> table more often (say, autovacuum_vacuum_scale_factor=0.10 or 0.08 or
> something) but only doing the index scans every once in a while when
> enough dead line pointers have accumulated.

That's precisely the reason I suggested separating heap and index
vacuums instead of a tight integration as we have now. If we don't
spool the dead line pointers in a separate area though, we would need
to make sure that index vacuum runs through the heap first to collect
the dead line pointers and then remove the corresponding index
pointers. We would need to also take into consideration the
implications on visibility map for any such scheme to work correctly
and efficiently.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Volunteering as Commitfest Manager

2011-05-25 Thread Stephen Frost
* Simon Riggs (si...@2ndquadrant.com) wrote:
> So it makes sense to split that role between multiple people.

I agree that it'd be good to have multiple people supporting the CF.
I've been considering volunteering to be part of such a group for a
given CF.

> I volunteer to be the CF manager for Replication, and also for
> Performance.

I dislike the idea of splitting up the duties along these lines.
However, if others are willing to handle it that way and we can get
coverage for all the CF categories, I wouldn't object.  The reason
that I dislike this is simply that the actual work of the CF manager
isn't really distinguished in any way based on if it's a Replication
patch or a Performance patch or some other patch.  Most of the CF work
is about following up with reviewers and authors and committers.

I feel this kind of "I'd prefer to be working on what interests me/what
I'm knowledgeable in" concept is typically addressed through the
self-volunteer reviewers, where someone will mark themselves down as a
reviewer for a specific patch because it's what they're interested in.
Additionally, reviewers, when volunteering, can, and often do, ask for
specific kinds of patches (though it's usually driven more by 'size' or
'complexity' than category).  That feels like a better place to put this
than at the CF manager level.

> Patches need an eventual committer anyway, so this is a reasonable way
> of having the process be managed by the eventual committer.

I also don't like the idea that committers, when supporting a
commitfest, would only be involved/working on specific categories of
patches.  I feel that part of the idea of a commitfest is to have
individuals, particularly committers, looking at patches which fall
outside of what they're currently working on (since they can/could
commit those things more-or-less anytime).

My thinking (and I have no idea how others feel or if anyone's even
considered it this way) is simply that part of the responsibility of a
committer would be that they help get non-committer patches, which are
good for PG, submitted through the right process, etc, but which may
not be of specific interest to any given committer, committed.  If a
patch is already of interest to a committer because it hits on a
hot-spot with them then that patch doesn't really *need* a CF to get
done.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Pavan Deolasee
On Wed, May 25, 2011 at 7:20 PM, Robert Haas  wrote:
> On Wed, May 25, 2011 at 7:07 AM, Pavan Deolasee
>  wrote:
>>> But instead of allocating permanent space in the page header, which would
>>> both reduce (admittedly only by 8 bytes) the amount of space available
>>> for tuples, and more significantly have the effect of breaking on-disk
>>> compatibility, I'm wondering if we could get by with making space for
>>> that extra LSN only when it's actually present. In other words, when
>>> it's present, we set a bit PD_HAS_DEAD_LINE_PTR_LSN or somesuch,
>>> increment pd_upper, and use the extra space to store the LSN.  There
>>> is an alignment problem to worry about there but that shouldn't be a
>>> huge issue.
>>
>> That might work but would require us to move tuples around when the first
>> dead line pointer gets generated in the page.
>
> I'm confused.  A major point of the approach I was proposing was to
> avoid having to move tuples around.
>

Well, I am not sure how you can always guarantee to make space
available for the LSN without moving tuples, irrespective of where
you store it.  But probably it's not important as we discussed below.

>> You may argue that we should
>> be holding a cleanup-lock when that happens and the dead line pointer
>> creation is always followed by a call to PageRepairFragmentation(), so it
>> should be easier to make space for the LSN.
>
> I'm not sure if this is the same thing you're saying, but certainly
> the only time we need to make space for this value is when we've just
> remove tuples from the page and defragmented, and at that point there
> should certainly be 8 bytes free somewhere.
>

Agree.

>> Instead of storing the LSN after the page header, would it be easier to set
>> pd_special and store the LSN at the end of the page ?
>
> I was proposing storing it after the line pointer array, not after the
> page header.  If we store it at the end of the page, I suspect we're
> going to basically end up allocating permanent space for it, because
> otherwise we'll have to shift all the tuple data forward and backward
> by 8 bytes when we allocate or deallocate space for this.  Now, maybe
> that's OK: I'm not sure.  But it's something to think about carefully.
>  If we are going to allocate permanent space, the special space seems
> better than the page header, because we should be able to make that
> work without on-disk compatibility, and because AFAIUI we only need
> the space for heap pages, not index pages.
>

I think if we are reclaiming LP_DEAD line pointers only while
defragmenting the page, we can always reclaim the space for the LSN,
irrespective of where we store it. So maybe we should decide
depending on what would matter for on-disk compatibility and whatever
requires the least invasive changes. I don't know what that is yet.


>
>> If so, how do we handle the case where after restart the page may get LSN
>> less than the index vacuum LSN if the index vacuum happened before the
>> crash/stop ?
>
> Well, on a crash, the unlogged relations get truncated, and their
> indexes also, so no problem.  On a clean shutdown, I guess we need to
> arrange to save the counter across restarts.

Oh ok. I was not aware that unlogged tables get truncated. I think we
can just restore from the value stored for last successful index
vacuum (after incrementing it may be). That should be possible.

>
> Take a look at the existing logic around GetXLogRecPtrForTemp().
> That's slightly different, because there we don't even need to be
> consistent across backends.  We just need an increasing sequence of
> values.  For unlogged relations things are a bit more complex - but it
> seems manageable.

Ok. Will look at it.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The way to know whether the standby has caught up with the master

2011-05-25 Thread Tom Lane
Heikki Linnakangas  writes:
> On 25.05.2011 07:42, Fujii Masao wrote:
>> To achieve that, I'm thinking to change walsender so that, when the standby
>> has caught up with the master, it sends back the message indicating that to
>> the standby. And I'm thinking to add new function (or view like
>> pg_stat_replication)
>> available on the standby, which shows that info.

> By the time the standby has received that message, it might not be 
> caught-up anymore because new WAL might've been generated in the master 
> already.

Even assuming that you believe this is a useful capability, there is no
need to change walsender.  It *already* sends the current-end-of-WAL in
every message, which indicates precisely whether the message contains
all of available WAL data.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Leonardo Francalanci


commitlog_lessbytes02.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 7:07 AM, Pavan Deolasee
 wrote:
>> But instead of allocating permanent space in the page header, which would
>> both reduce (admittedly only by 8 bytes) the amount of space available
>> for tuples, and more significantly have the effect of breaking on-disk
>> compatibility, I'm wondering if we could get by with making space for
>> that extra LSN only when it's actually present. In other words, when
>> it's present, we set a bit PD_HAS_DEAD_LINE_PTR_LSN or somesuch,
>> increment pd_upper, and use the extra space to store the LSN.  There
>> is an alignment problem to worry about there but that shouldn't be a
>> huge issue.
>
> That might work but would require us to move tuples around when the first
> dead line pointer gets generated in the page.

I'm confused.  A major point of the approach I was proposing was to
avoid having to move tuples around.

> You may argue that we should
> be holding a cleanup-lock when that happens and the dead line pointer
> creation is always followed by a call to PageRepairFragmentation(), so it
> should be easier to make space for the LSN.

I'm not sure if this is the same thing you're saying, but certainly
the only time we need to make space for this value is when we've just
remove tuples from the page and defragmented, and at that point there
should certainly be 8 bytes free somewhere.

> Instead of storing the LSN after the page header, would it be easier to set
> pd_special and store the LSN at the end of the page ?

I was proposing storing it after the line pointer array, not after the
page header.  If we store it at the end of the page, I suspect we're
going to basically end up allocating permanent space for it, because
otherwise we'll have to shift all the tuple data forward and backward
by 8 bytes when we allocate or deallocate space for this.  Now, maybe
that's OK: I'm not sure.  But it's something to think about carefully.
 If we are going to allocate permanent space, the special space seems
better than the page header, because we should be able to make that
work without on-disk compatibility, and because AFAIUI we only need
the space for heap pages, not index pages.

> I think that should be not so difficult to handle. I think handling this by
> special space mechanism might be less complicated.

A permanent space allocation will certainly be simpler.  I'm just not
sure how much we care about giving up 8 bytes that could potentially
be used to store tuple data.

>> If the LSN stored in the page precedes the
>> start-of-last-successful-index-vacuum LSN, and if, further, we can get
>> a buffer cleanup lock on the page, then we can do a HOT cleanup and
>> life is good.  Otherwise, we can either (1) just forget about the
>> most-recent-dead-line-pointer LSN - not ideal but not catastrophic
>> either - or (2) if the start-of-last-successful-vacuum-LSN is old
>> enough, we could overwrite an LP_DEAD line pointer in place.
>
> I don't think we need the cleanup lock to turn the LP_DEAD line pointers to
> LP_UNUSED since that does not involve moving tuples around. So a simple
> EXCLUSIVE lock should be enough. But we would need to WAL log the operation
> of turning DEAD to UNUSED, so it would be simpler to consolidate this in HOT
> pruning. There could be exceptions such as, say large number of DEAD line
> pointers are accumulated towards the end and reclaiming those would free up
> substantial space in the page. But may be we can use those conditions to
> invoke HOT prune instead of handling them separately.

Makes sense.

>> Another issue is that this causes problems for temporary and unlogged
>> tables, because no WAL records are generated and, therefore, the LSN
>> does not advance.  This is also a problem for GIST indexes; Heikki
>> fixed temporary GIST indexes by generating fake LSNs off of a
>> backend-local counter.  Unlogged GIST indexes are currently not
>> supported.  I think what we need to do is create an API to which you
>> can pass a relation and get an LSN.  If it's a permanent relation, you
>> get a regular LSN.  If it's a temporary relation, you get a fake LSN
>> based on a backend-local counter.  If it's an unlogged relation, you
>> get a fake LSN based on a shared-memory counter that is reset on
>> restart.  If we can encapsulate that properly, it should provide both
>> what we need to make this idea work and allow a somewhat graceful fix
>> for GIST-vs-unlogged problem.
>
> Can you explain more how things would work for unlogged tables ? Do we use
> the same shared memory counter for tracking last successful index vacuum ?

Yes.

> If so, how do we handle the case where after restart the page may get LSN
> less than the index vacuum LSN if the index vacuum happened before the
> crash/stop ?

Well, on a crash, the unlogged relations get truncated, and their
indexes also, so no problem.  On a clean shutdown, I guess we need to
arrange to save the counter across restarts.

Take a look at the existing logic around GetXLogRecPtrForTemp().
That's slightly different, because there we don't even need to be
consistent across backends.  We just need an increasing sequence of
values.  For unlogged relations things are a bit more complex - but it
seems manageable.
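
(A hedged sketch of the relation-to-LSN API described above; every name below is invented, and a real version would need proper locking or atomics on the shared counter plus code to save it across clean shutdowns.)

#include <stdint.h>

typedef uint64_t FakeLSN;

typedef enum
{
    REL_PERMANENT,
    REL_UNLOGGED,
    REL_TEMP
} FakeRelPersistence;

static FakeLSN  backend_local_counter;     /* per backend, like GetXLogRecPtrForTemp */
static FakeLSN *shared_unlogged_counter;   /* in shared memory; reset after a crash */

/* Stand-in for asking the WAL machinery for the current insert position. */
extern FakeLSN current_wal_insert_lsn(void);

FakeLSN
get_lsn_for_relation(FakeRelPersistence persistence)
{
    switch (persistence)
    {
        case REL_PERMANENT:
            return current_wal_insert_lsn();     /* a real WAL position */
        case REL_TEMP:
            return ++backend_local_counter;      /* only needs to increase locally */
        case REL_UNLOGGED:
            return ++(*shared_unlogged_counter); /* shared; needs synchronization */
    }
    return 0;
}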

Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Leonardo Francalanci
> Da: Simon Riggs 
> I can't find a clear  discussion of what you are trying to do, and how,
> just a URL back to a  complex discussion on another topic.


While trying to write a patch to allow changing an unlogged table into
a logged one, I had to add another int field to xl_xact_commit.
Robert Haas said:

 "I have to admit I don't like this approach very much.  I can't see 
adding 4 bytes to every commit record for this feature."


which is a correct remark.

xl_xact_commit can contain some arrays (relation to drops, 
committed sub-trans, shared invalidation msgs). The length of
these arrays is specified using 3 ints in the struct. 

So, to avoid adding more ints to the struct, I've been suggested to
remove all the ints, and use   xl_xact_commit.xinfo to flag which
arrays are, in fact, present. 

So the whole idea is:

- remove nrels, nsubxacts and nmsgs from xl_xact_commit
- use bits in xinfo to signal which arrays are present at the end
of   xl_xact_commit 
- for each present array, add the length of the array (as int) at
the end of xl_xact_commit
- add each present array after all the lengths
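
(For illustration, a hypothetical decoding of that layout; the names and bit values are invented and do not match the patch.)

#include <stddef.h>
#include <stdint.h>

#define XINFO_HAS_RELS     0x01   /* relations-to-drop array present */
#define XINFO_HAS_SUBXACTS 0x02   /* committed-subxact array present */
#define XINFO_HAS_INVALS   0x04   /* shared-inval message array present */

/* Fixed part of the record; the per-array int32 lengths (only for the
 * arrays flagged in xinfo) come next, then the arrays themselves. */
typedef struct
{
    int64_t  xact_time;
    uint32_t xinfo;
} FakeCommitRecord;

/* How many bytes of length fields follow the fixed part? */
size_t
length_block_size(uint32_t xinfo)
{
    size_t n = 0;

    if (xinfo & XINFO_HAS_RELS)
        n++;
    if (xinfo & XINFO_HAS_SUBXACTS)
        n++;
    if (xinfo & XINFO_HAS_INVALS)
        n++;
    return n * sizeof(int32_t);
}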


> I can't see any analysis that shows  whether this would be effective in
> reducing space of WAL records and what the  impacts might be.


The space of WAL records should always be less than or equal to what it
is without the patch: in the worst-case scenario, all the ints are present
(as they would be without the patch). In the best case, which I guess is
the more common one, each xl_xact_commit will be 3 ints shorter.

I don't think the effect will be "perceivable": the whole idea is to allow
more variable-length structures in xl_xact_commit in the future,
without increasing the space on disk.

> The patch contains very close to zero  comments.


I tried to add some.
 
 

Leonardo

commitlog_lessbytes01.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Simon Riggs
On Wed, May 25, 2011 at 1:27 PM, Robert Haas  wrote:

>> At the moment we scan indexes if we have > 0 rows to remove, which is
>> probably wasteful. Perhaps it would be better to keep a running total
>> of rows to remove, by updating pg_stats, then when we hit a certain
>> threshold in total we can do the index scan. So we don't need to
>> remember the TIDs, just remember how many there were and use that to
>> avoid cleaning too vigorously.
>
> That occurred to me, too.  If we're being launched by autovacuum then
> we know that a number of updates and deletes equal ~20% (or whatever
> autovacuum_vacuum_scale_factor is set to) of the table size have
> occurred since the last autovacuum.  But it's possible that many of
> those were HOT updates, in which case the number of index entries to
> be cleaned up might be much less than 20% of the table size.
> Alternatively, it's possible that we'd be better off vacuuming the
> table more often (say, autovacuum_vacuum_scale_factor=0.10 or 0.08 or
> something) but only doing the index scans every once in a while when
> enough dead line pointers have accumulated.  After all, it's the first
> heap pass that frees up most of the space; cleaning dead line pointers
> seems a bit less urgent.  But not having done any real analysis of how
> this would work out in practice, I'm not sure whether it's a good idea
> or not.

We know whether a TID was once in the index or not, so we can keep an
exact count. HOT doesn't come into it.

We can remove TIDs from the index as well without VACUUM, during btree
split avoidance. We can optimise the second scan by skipping htids no
longer present in the index, though we'd need a spare bit to mark that
usage, which I'm not sure we have.
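
Concretely, the trigger being discussed might boil down to something like
this (a hypothetical sketch; the function, the counter argument and the 2%
figure are all made up for illustration):

#include <stdbool.h>

/* scan the indexes only once enough dead TIDs have accumulated */
#define INDEX_CLEANUP_SCALE_FACTOR 0.02

static bool
index_vacuum_needed(double dead_index_tids, double reltuples)
{
    return dead_index_tids > INDEX_CLEANUP_SCALE_FACTOR * reltuples;
}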

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Simon Riggs
On Tue, May 24, 2011 at 9:34 PM, Robert Haas  wrote:

> I'm not sure that these ideas are much good, but
> for the sake of posterity:

Both (1) and (2) seem promising to me.

Heikki mentioned (2) would only be effective if we managed to change
*all* WAL records. ISTM likely that we would find that difficult
and/or time consuming and/or buggy.

I would qualify that by saying *all* WAL record types for a certain
page type, such as all btree index pages. Not that much better, since
heap and btree are the big ones, ISTM.

(1) seems like we could do it incrementally if we supported both
partial and full page writes at the same time. That way we could work on
the most frequent record types over time. Not sure if that is
possible, but seems worth considering.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing overhead of frequent table locks

2011-05-25 Thread Simon Riggs
On Wed, May 25, 2011 at 1:44 PM, Robert Haas  wrote:
> On Wed, May 25, 2011 at 8:27 AM, Simon Riggs  wrote:
>> I got a bit lost with the description of a potential solution. It
>> seemed like you were unaware that there is a local lock table and a shared
>> lock table, maybe just me?
>
> No, I'm not unaware of the local lock table.  The point of this
> proposal is to avoid fighting over the LWLocks that protect the shared
> hash table by allowing some locks to be taken without touching it.

Yes, I got that. I just couldn't work out where mmap came in.

>> Design seemed relatively easy from there: put local lock table in
>> shared memory for all procs. We then have a use_strong_lock at proc
>> and at transaction level. Anybody that wants a strong lock first sets
>> use_strong_lock at proc and transaction level, then copies all local
>> lock data into shared lock table, double checking
>> TransactionIdIsInProgress() each time. Then queues for lock using the
>> now fully set up shared lock table. When transaction with strong lock
>> completes we do not attempt to reset transaction level boolean, only
>> at proc level, since DDL often occurs in groups and we want to avoid
>> flip-flopping quickly between lock share states. Cleanup happens
>> regularly via bgwriter, perhaps every 10 seconds or so. All locks are
>> still visible for pg_locks.
>
> I'm not following this...

Which bit aren't you following? It's a design outline for how to
implement, deliberately brief to allow a discussion of design
alternatives.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing overhead of frequent table locks

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 8:27 AM, Simon Riggs  wrote:
> I got a bit lost with the description of a potential solution. It
> seemed like you were unaware that there is a local lock table and a shared
> lock table, maybe just me?

No, I'm not unaware of the local lock table.  The point of this
proposal is to avoid fighting over the LWLocks that protect the shared
hash table by allowing some locks to be taken without touching it.

> Design seemed relatively easy from there: put local lock table in
> shared memory for all procs. We then have a use_strong_lock at proc
> and at transaction level. Anybody that wants a strong lock first sets
> use_strong_lock at proc and transaction level, then copies all local
> lock data into shared lock table, double checking
> TransactionIdIsInProgress() each time. Then queues for lock using the
> now fully set up shared lock table. When transaction with strong lock
> completes we do not attempt to reset transaction level boolean, only
> at proc level, since DDL often occurs in groups and we want to avoid
> flip-flopping quickly between lock share states. Cleanup happens
> regularly via bgwriter, perhaps every 10 seconds or so. All locks are
> still visible for pg_locks.

I'm not following this...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Robert Haas
On Tue, May 24, 2011 at 11:52 PM, Bruce Momjian  wrote:
> Robert Haas wrote:
>> 2. The other fairly obvious alternative is to adjust our existing WAL
>> record types to be idempotent - i.e. to not rely on the existing page
>> contents.  For XLOG_HEAP_INSERT, we currently store the target tid and
>> the tuple contents.  I'm not sure if there's anything else, but we
>> would obviously need the offset where the new tuple should be written,
>> which we currently infer from reading the existing page contents.  For
>> XLOG_HEAP_DELETE, we store just the TID of the target tuple; we would
>> certainly need to store its offset within the block, and maybe the
>> infomask.  For XLOG_HEAP_UPDATE, we'd need the old and new offsets and
>> perhaps also the old and new infomasks.  Assuming that's all we need
>> and I'm not missing anything (which I won't bet on), that means we'd
>> be adding, say, 4 bytes per insert or delete and 8 bytes per update.
>> So, if checkpoints are spread out widely enough that there will be
>> more than ~2K operations per page between checkpoints, then it makes
>> more sense to just do a full page write and call it good.  If not,
>> this idea might have legs.
>
> I vote for "wal_level = idempotent" because so few people will know what
> idempotent means.  ;-)

That idea has the additional advantage of confusing the level of
detail of our WAL logging (minimal vs. archive vs. hot standby) with
the mechanism used to protect against torn pages (full page writes vs.
idempotent WAL records vs. prayer).  When they set it wrong and
destroy their system, we can tell them it's their own fault for not
configuring the system properly!  Bwahahahaha!

In all seriousness, I can't imagine that we'd make this
user-configurable in the first place, since that would amount to
having two sets of WAL records, each of which would be even less well
tested than what we have now; and for a project this complex, we
probably shouldn't even consider changing things that seem to work now
unless the new system is clearly better than the old.

> Idempotent does seem like the most promising idea.

I tend to agree with you, but I'm worried it won't actually work out
to a win.  By the time we augment the records with enough additional
information we may have eaten up a lot of the benefit we were hoping
to get.
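
To make the size trade-off concrete, an idempotent heap-insert record
might have to grow roughly like this (a sketch only; the "idempotent"
variant and its extra fields are hypothetical, and the existing struct is
simplified):

#include "postgres.h"
#include "access/htup.h"

/* today: replay reads the old page to decide where the tuple data goes */
typedef struct xl_heap_insert_current
{
    xl_heaptid  target;         /* inserted tuple id */
    /* xl_heap_header and tuple data follow */
} xl_heap_insert_current;

/* idempotent variant: carry enough placement info to redo the insert
 * without consulting the prior page contents */
typedef struct xl_heap_insert_idempotent
{
    xl_heaptid  target;         /* inserted tuple id */
    uint16      data_offset;    /* where in the page the tuple data goes */
    uint16      flags;          /* e.g. whether the page must be initialized */
    /* xl_heap_header and tuple data follow */
} xl_heap_insert_idempotent;

That is the ~4 extra bytes per insert mentioned in the quoted text above;
deletes and updates would grow along the same lines.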

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Robert Haas
On Tue, May 24, 2011 at 10:52 PM, Jeff Davis  wrote:
> On Tue, 2011-05-24 at 16:34 -0400, Robert Haas wrote:
>> As I think about it a bit more, we'd
>> need to XLOG not only the parts of the page we're actually modifying, but
>> any parts that the WAL record would need in order to be correct on replay.
>
> I don't understand that statement. Can you clarify?

I'll try.  Suppose we have two WAL records A and B, with no
intervening checkpoint, that both modify the same page.  A reads chunk
1 of that page and then modifies chunk 2.  B modifies chunk 1.  Now,
suppose we make A do a "partial page write" on chunk 2 only, and B do
the same for chunk 1.  At the point the system crashes, A and B are
both on disk, and the page has already been written to disk as well.
Replay begins from a checkpoint preceding A.

Now, when we get to the record for A, what are we to do?  If it were a
full page image, we could just restore it, and everything would be
fine after that.  But if we replay the partial page write, we've got
trouble.  The replay of A will now see the state of chunk 1 as it existed
after the action protected by B occurred, and will presumably do the wrong
thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

