Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?

2011-12-30 Thread Peter Eisentraut
On Wed, 2011-08-24 at 11:24 -0700, Daniel Farina wrote:
> At Heroku we use CREATE INDEX CONCURRENTLY with great success, but
> recently when frobbing around some indexes I realized that there is no
> equivalent for DROP INDEX, and this is a similar but lesser problem
> (as CREATE INDEX takes much longer), as DROP INDEX takes an ACCESS
> EXCLUSIVE lock on the parent table while doing the work to unlink
> files, which nominally one would think to be trivial, but I assure you
> it is not, at times, even for indexes that are a handful of gigabytes
> (let's say a dozen or fewer).  By non-trivial, I mean it can take 30+
> seconds, though less than a couple of minutes.  The storage layer
> (starting from the higher levels of abstraction) is XFS, a somewhat
> trivial LVM setup, mdraid (8-way), Amazon EBS (NBD?).
> 
> I was poking around at tablecmds and index.c and wonder if a similar
> two-pass approach as used by CREATE INDEX CONCURRENTLY can be used to
> create a DROP INDEX CONCURRENTLY, and if there would be any interest
> in accepting such a patch.

Hmm, it seems I just independently came up with this same concept.  My
problem is that if a CREATE INDEX CONCURRENTLY fails, you need an
exclusive lock on the table just to clean that up.  If the table is
under constant load, you can't easily do that.  So a two-pass DROP INDEX
CONCURRENTLY might have been helpful for me.
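
For concreteness, a minimal sketch of what the two-pass approach could
look like, modeled on the two transactions that CREATE INDEX
CONCURRENTLY already uses.  The helper functions here are hypothetical
placeholders, not existing backend functions:

#include "postgres.h"
#include "access/xact.h"        /* StartTransactionCommand, etc. */

/* Hypothetical helpers -- placeholders, not existing PostgreSQL code */
extern void mark_index_invalid(Oid indexOid);   /* set indisvalid = false */
extern void wait_for_old_snapshots(void);
extern void drop_index_physically(Oid indexOid);

/*
 * Sketch of a two-pass concurrent index drop.  Pass 1 takes only a
 * lock that does not conflict with reads or writes, flags the index
 * as invalid so that new plans stop using it, and commits.  Once no
 * transaction that could still see the index as valid remains, pass 2
 * unlinks the files without holding ACCESS EXCLUSIVE on the parent
 * table for the duration of the file removal.
 */
void
drop_index_concurrently(Oid indexOid)
{
    /* Pass 1: mark invalid under a non-blocking lock, then commit */
    mark_index_invalid(indexOid);
    CommitTransactionCommand();

    /* Wait out transactions that might still be scanning the index */
    wait_for_old_snapshots();

    /* Pass 2: new transaction; physically drop the now-unused index */
    StartTransactionCommand();
    drop_index_physically(indexOid);
    CommitTransactionCommand();
}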





Re: [HACKERS] Progress on fast path sorting, btree index creation time

2011-12-30 Thread Merlin Moncure
On Fri, Dec 30, 2011 at 2:30 PM, Peter Geoghegan  wrote:
> On 30 December 2011 19:46, Merlin Moncure  wrote:
>> On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan  
>> wrote:
>>> * A spreadsheet that shows the results of re-running my earlier heap
>>> tuple sorting benchmark with this new patch. The improvement in the
>>> query that orders by 2 columns is all that is pertinent there, when
>>> considering the value of (1) and the sense in standing still for
>>> controversy A.
>>>
>>> * A spreadsheet that shows the difference in index creation times,
>>> generated with the help of the new python script.
>>
>> very nice.  let me save everyone the effort of opening his
>> spreadsheets (which by the way both show 'HEAD/unoptimized' --
>> probably not what you meant): he's showing a consistent ~50% reduction
>> in running time of sort driven queries -- that's money.
>
> Sorry, I think you may have misinterpreted the results, which is my
> fault - I introduced a formatting error. In the case of the "btree"
> spreadsheet, the first query on each sheet should be "create index
> test on orderlines (prod_id);", and not "select * from orderlines
> order by prod_id". The idea is to compare the results from each set of
> binaries across pages of the spreadsheet (note that there are two
> tabs). You should not compare anything between the two spreadsheets.
> Revised btree results attached. The heap results that I posted do not
> have any formatting errors, so they have not been revised.

right -- my bad.  Still, that's a 31-37% reduction -- pretty nice.

merlin



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2011-12-30 Thread Peter Geoghegan
On 30 December 2011 19:46, Merlin Moncure  wrote:
> On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan  
> wrote:
>> * A spreadsheet that shows the results of re-running my earlier heap
>> tuple sorting benchmark with this new patch. The improvement in the
>> query that orders by 2 columns is all that is pertinent there, when
>> considering the value of (1) and the sense in standing still for
>> controversy A.
>>
>> * A spreadsheet that shows the difference in index creation times,
>> generated with the help of the new python script.
>
> very nice.  let me save everyone the effort of opening his
> spreadsheets (which by the way both show 'HEAD/unoptimized' --
> probably not what you meant): he's showing a consistent ~50% reduction
> in running time of sort driven queries -- that's money.

Sorry, I think you may have misinterpreted the results, which is my
fault - I introduced a formatting error. In the case of the "btree"
spreadsheet, the first query on each sheet should be "create index
test on orderlines (prod_id);", and not "select * from orderlines
order by prod_id". The idea is to compare the results from each set of
binaries across pages of the spreadsheet (note that there are two
tabs). You should not compare anything between the two spreadsheets.
Revised btree results attached. The heap results that I posted do not
have any formatting errors, so they have not been revised.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


results_server_btree_revised.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2011-12-30 Thread Merlin Moncure
On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan  wrote:
> * A spreadsheet that shows the results of re-running my earlier heap
> tuple sorting benchmark with this new patch. The improvement in the
> query that orders by 2 columns is all that is pertinent there, when
> considering the value of (1) and the sense in standing still for
> controversy A.
>
> * A spreadsheet that shows the difference in index creation times,
> generated with the help of the new python script.

very nice.  let me save everyone the effort of opening his
spreadsheets (which by the way both show 'HEAD/unoptimized' --
probably not what you meant): he's showing a consistent ~50% reduction
in running time of sort driven queries -- that's money.

merlin



Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-30 Thread Jeff Janes
On 12/29/11, Ants Aasma  wrote:

> Unless I'm missing something, double-writes are needed for all writes,
> not only the first page after a checkpoint. Consider this sequence of
> events:
>
> 1. Checkpoint
> 2. Double-write of page A (DW buffer write, sync, heap write)
> 3. Sync of heap, releasing DW buffer for new writes.
>  ... some time goes by
> 4. Regular write of page A
> 5. OS writes one part of page A
> 6. Crash!
>
> Now recovery comes along, page A is broken in the heap with no
> double-write buffer backup nor anything to recover it by in the WAL.

Isn't 3 the very definition of a checkpoint, meaning that 4 is not
really a regular write, but rather the first one after a checkpoint?

But it doesn't seem safe to me to restore a page from the DW buffer
and then apply WAL records to it that predate the version of the page
in the buffer.

Cheers,

Jeff



Re: [HACKERS] spinlocks on powerpc

2011-12-30 Thread Andrew Dunstan



On 12/30/2011 11:23 AM, Tom Lane wrote:

> Manabu Ori  writes:
>> 2011/12/30 Tom Lane
>>> The info that I've found says that the hint exists beginning in POWER6,
>>> and there were certainly 64-bit Power machines before that.  However,
>>> it might be that the only machines that actually spit up on the hint bit
>>> (rather than ignore it) were 32-bit, in which case this would be a
>>> usable heuristic.  Not sure how we can research that ... do we want to
>>> just assume the kernel guys know what they're doing?
>
>> I'm a bit confused and might be missing the point, but...
>> If we can decide whether to use the hint operand when we build
>> postgres, I think it's better to check whether we can compile and run
>> a sample program with the lwarx hint operand than to rely on some
>> arbitrary defines, such as FOO_PPC64 or something.
>
> Well, there are two different conditions we have to deal with:
>
> (1) does gcc+assembler understand the hint operand for lwarx?
> This we can reasonably check with configure, since it's a property
> of the build environment.
>
> (2) does the machine where the executable will run understand the
> hint bit, or failing that at least treat it as a no-op?  We cannot
> determine that at configure time, unless we can fall back on some
> approximate proxy condition like testing 64-bit vs 32-bit.
>
> (I see that the kernel boys dodged point 1 by writing the lwarx
> instruction as a numeric constant, but that seems far too ugly
> and fragile for my taste.  In any case point 2 is the big issue.)
>
> If you don't like the 64-bit hack or something much like it,
> I think we have got three other alternatives:
>
> * Do nothing, ie reject the patch.
>
> * Push the problem onto the user by offering a configure option.
> I don't care for this in the least, notably because packagers
> such as Linux distros couldn't safely enable the option, so in
> practice it would be unavailable to a large fraction of users.
>
> * Perform a runtime test.  I'm not sure if there's a better way,
> but if nothing else we could fork a subprocess during postmaster
> start, have it try an lwarx with hint bit, observe whether it dumps
> core, and set a flag to tell future TAS calls whether to use the hint
> bit.  Ick.  In any case, adding a conditional branch to the TAS code
> would lose some of the performance benefit of the patch.  Given that
> you don't get any benefit at all until you have a large number of
> cores, this would be a net loss for a lot of people.
>
> None of those look better than an approximate proxy condition
> to me.




#3 in particular is unspeakably ugly.

cheers

andrew




Re: [HACKERS] spinlocks on powerpc

2011-12-30 Thread Peter Eisentraut
On Fri, 2011-12-30 at 14:47 +0900, Manabu Ori wrote:
> If we can decide whether to use the hint operand when we build
> postgres, I think it's better to check if we can compile and run
> a sample code with lwarx hint operand than to refer to some
> arbitrary defines, such as FOO_PPC64 or something.
> 
But you can't be sure that the host you are running this on has the same
capability as the build system.  Packaging systems only classify
architectures into broad categories such as "i386" or "powerpc" or maybe
"powerpc64".  So a package built for "powerpc64" has to run on all
powerpc64 hosts.

Imagine you are using some Pentium instruction and run the program on
an 80486.  It's the same architecture as far as the kernel, package
management, etc. are concerned, but your program will break.





Re: [HACKERS] spinlocks on powerpc

2011-12-30 Thread Tom Lane
Manabu Ori  writes:
> 2011/12/30 Tom Lane 
>> The info that I've found says that the hint exists beginning in POWER6,
>> and there were certainly 64-bit Power machines before that.  However,
>> it might be that the only machines that actually spit up on the hint bit
>> (rather than ignore it) were 32-bit, in which case this would be a
>> usable heuristic.  Not sure how we can research that ... do we want to
>> just assume the kernel guys know what they're doing?

> I'm a bit confused and might be missing the point, but...

> If we can decide whether to use the hint operand when we build
> postgres, I think it's better to check whether we can compile and run
> a sample program with the lwarx hint operand than to rely on some
> arbitrary defines, such as FOO_PPC64 or something.

Well, there are two different conditions we have to deal with:

(1) does gcc+assembler understand the hint operand for lwarx?
This we can reasonably check with configure, since it's a property
of the build environment.

(2) does the machine where the executable will run understand the
hint bit, or failing that at least treat it as a no-op?  We cannot
determine that at configure time, unless we can fall back on some
approximate proxy condition like testing 64-bit vs 32-bit.

(I see that the kernel boys dodged point 1 by writing the lwarx
instruction as a numeric constant, but that seems far too ugly
and fragile for my taste.  In any case point 2 is the big issue.)
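
Just to illustrate point 1, the configure probe could be a sketch as
simple as this (GCC inline asm; the trailing ",1" is the hint operand,
and compiling it proves only that the build toolchain accepts the
syntax):

int
main(void)
{
    int     lock = 0;
    int     res;

    /*
     * lwarx with the hint operand; this assembles only if the
     * toolchain understands the extended mnemonic.
     */
    __asm__ __volatile__(
        "   lwarx   %0,0,%1,1   \n"
        : "=&r"(res)
        : "r"(&lock)
        : "memory");
    return 0;
}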

If you don't like the 64-bit hack or something much like it,
I think we have got three other alternatives:

* Do nothing, ie reject the patch.

* Push the problem onto the user by offering a configure option.
I don't care for this in the least, notably because packagers
such as Linux distros couldn't safely enable the option, so in
practice it would be unavailable to a large fraction of users.

* Perform a runtime test.  I'm not sure if there's a better way,
but if nothing else we could fork a subprocess during postmaster
start, have it try an lwarx with hint bit, observe whether it dumps
core, and set a flag to tell future TAS calls whether to use the hint
bit.  Ick.  In any case, adding a conditional branch to the TAS code
would lose some of the performance benefit of the patch.  Given that
you don't get any benefit at all until you have a large number of
cores, this would be a net loss for a lot of people.

None of those look better than an approximate proxy condition
to me.
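
For what it's worth, the runtime test might be sketched like this
(plain POSIX fork/waitpid; it assumes a machine that rejects the hint
bit kills the child with SIGILL, and treats any abnormal exit as
"don't use the hint"):

#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static bool
lwarx_hint_is_usable(void)
{
    pid_t       pid = fork();

    if (pid == 0)
    {
        /* Child: try the instruction; dies with SIGILL if rejected */
        int     lock = 0;
        int     res;

        __asm__ __volatile__("lwarx %0,0,%1,1"
                             : "=&r"(res) : "r"(&lock) : "memory");
        _exit(0);
    }
    else if (pid > 0)
    {
        int     status;

        waitpid(pid, &status, 0);
        return WIFEXITED(status) && WEXITSTATUS(status) == 0;
    }
    return false;               /* fork failed: play it safe */
}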

regards, tom lane



Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-30 Thread Aidan Van Dyk
On Thu, Dec 29, 2011 at 11:44 AM, Kevin Grittner
 wrote:

> You wind up with a database free of torn pages before you apply WAL.
> full_page_writes to the WAL are not needed as long as double-write is
> used for any pages which would have been written to the WAL.  If
> checksums were written to the double-buffer metadata instead of
> adding them to the page itself, this could be implemented alone.  It
> would probably allow a modest speed improvement over using
> full_page_writes and would eliminate those full-page images from the
> WAL files, making them smaller.

Correct.  So now lots of people seem to be jumping on the double-write
bandwagon and looking at some of the things it promises: all writes
are durable.

This solves 2 big issues:
   - Removes the torn-page problem
   - Removes FPWs from WAL

That up front looks pretty attractive.  But we need to look at the
tradeoffs, and then decide (benchmark, anyone?).

Remember, PostgreSQL is a double-write system right now.  The 1st,
checksummed write is the FPW in WAL.  It's fsynced.  And the 2nd
synced write happens when the file is synced during checkpoint.

So PostgreSQL currently has an optimization: not every write has
*requirements* for atomic, instant durability.  PostgreSQL gets to do
lots of writes to the OS cache and *not* request that they be
instantly synced.  Then at some point, when it's ready to clear the
1st checksummed write, it makes sure every write is synced.  And lots
of work went into PG recently to get even better at the collection of
writes/syncs that happen at checkpoint time, to take even bigger
advantage of the fact that it's better to write everything in a file
first, then call a single sync.

So, moving to this new double-write-area bandwagon, we move from a
"WAL FPW synced at commit, collect as many other writes as possible,
then a final sync" type of system to a system where *EVERY* write
requires syncs of 2 separate 8K writes at buffer write-out time.  We
avoid the FPW at commit (yes, that's nice for latency), and we
guarantee every buffer written is consistent (that fixes our
hint-bit-only dirty writes from being torn).  And we do that at a
cost of every buffer write requiring 2 fsyncs, in a serial fashion.
Come checkpoint, I'm wondering...
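
Concretely, each buffer write-out under such a scheme would go roughly
like this sketch (plain POSIX calls; error handling omitted, and the
two file descriptors are assumed to be already open):

#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

static void
double_write_page(int dw_fd, off_t dw_off,
                  int heap_fd, off_t heap_off, const char *page)
{
    /* 1st write: page image into the double-write area, made durable */
    pwrite(dw_fd, page, BLCKSZ, dw_off);
    fsync(dw_fd);

    /*
     * 2nd write: the real heap location, also made durable so that
     * the double-write slot can be reused.  Two serial fsyncs per
     * buffer write-out.
     */
    pwrite(heap_fd, page, BLCKSZ, heap_off);
    fsync(heap_fd);
}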

Again, all that to avoid a single "optimization" that PostgreSQL
currently has:
1) writes for hint-bit-only buffers don't need to be durable

And the problem that optimization introduces:
1) since they aren't guaranteed durable, we can't believe a checksum



-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-30 Thread Kevin Grittner
> Simon Riggs  wrote:
> Kevin Grittner  wrote:
 
>> if there is no checksum in the page itself, you can put one in the
>> double-write metadata.
 
> However, I don't see that it provides protection across non-crash
> write problems. We know we have these since many systems have run
> without a crash for years and yet still experience corrupt data.
 
Agreed.  I don't think anyone has tried to assert it solves the same
problems that checksums solve -- it is a high-performance way to
solve some of the problems that an in-page checksum *creates* without
breaking pg_upgrade.
 
> Double writes do not require page checksums but neither do they
> replace page checksums.
 
To nit-pick: double writes require a page checksum, but (as Heikki
pointed out) they don't require it to be stored in the page.  If
there *is* one stored in the page, it probably makes sense to use it.
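 
For illustration, the per-page metadata in the double-write area might
look something like this sketch (the relfilenode and block number are
what Heikki described; the rest is illustrative, not from any actual
patch):

#include "postgres.h"
#include "storage/block.h"              /* BlockNumber */
#include "storage/relfilenode.h"        /* RelFileNode */

/*
 * Sketch of a double-write slot header.  The checksum lives here, not
 * in the page, so the on-disk page format (and pg_upgrade) is
 * untouched.  Recovery would verify the checksum before trusting the
 * 8K page image that follows the header.
 */
typedef struct DoubleWriteSlot
{
    RelFileNode node;           /* relation the page belongs to */
    BlockNumber blkno;          /* block number within that relation */
    uint32      checksum;       /* checksum of the page image */
} DoubleWriteSlot;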
 
> So I think we need page checksums plus either FPWs or double
> writes.
 
Adding checksums by themselves creates a risk of false-positive
corrupted-page indications following an OS or hardware crash.
Additional FPWs or a new double-write mechanism are two of a myriad
of possible solutions to that.  If it is going to be addressed for
9.2, I believe they're the two most reasonable, especially from the
POV of pg_upgrade.
 
So, while they should be separate patches, they complement each other;
each makes the other perform better, and they should share some code.
 
-Kevin



Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-30 Thread Kevin Grittner
> Nicolas Barbier  wrote:
> 2011/12/30 Ants Aasma :
>> Kevin Grittner  wrote:
>>
>>> positives. To get this right for a checksum in the page header,
>>> double-write would need to be used for all cases where
>>> full_page_writes now are used (i.e., the first write of a page
>>> after a checkpoint), and for all unlogged writes (e.g.,
>>> hint-bit-only writes). There would be no correctness problem for
>>> always using double-write, but it would be unnecessary overhead
>>> for other page writes, which I think we can avoid.
>>
>> Unless I'm missing something, double-writes are needed for all
>> writes, not only the first page after a checkpoint. Consider this
>> sequence of events:
>>
>> 1. Checkpoint
>> 2. Double-write of page A (DW buffer write, sync, heap write)
>> 3. Sync of heap, releasing DW buffer for new writes.
>> ... some time goes by
>> 4. Regular write of page A
>> 5. OS writes one part of page A
>> 6. Crash!
>>
>> Now recovery comes along, page A is broken in the heap with no
>> double-write buffer backup nor anything to recover it by in the
>> WAL.
> 
> I guess the assumption is that the write in (4) is either backed by
> the WAL, or made safe by double writing. ISTM that such reasoning
> is only correct if the change that is expressed by the WAL record
> can be applied in the context of inconsistent (i.e., partially
> written) pages, which I assume is not the case (excuse my ignorance
> regarding such basic facts).
> 
> So I think you are right.
 
Hmm.  It appears that I didn't think that through all the way.  I see
two alternatives.
 
(1)  We don't eliminate full_page_writes and we only need to use
double-writes for unlogged writes.
 
(2)  We double-write all writes and on recovery we only apply WAL to
a page from pd_lsn onward.  We would start from the same point and
follow the same rules except that when we read a page and find a
pd_lsn past the location of the record we are applying, we do nothing
because we are 100% sure everything to that point is safely written
and not torn.  full_page_writes to WAL would not be needed.
 
-Kevin



Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-30 Thread Simon Riggs
On Thu, Dec 29, 2011 at 4:44 PM, Kevin Grittner
 wrote:
>> Heikki Linnakangas  wrote:
>> On 28.12.2011 01:39, Simon Riggs wrote:
>>> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
>>>  wrote:
>>>> On 25.12.2011 15:01, Kevin Grittner wrote:
>>>
>>> I don't believe that. Double-writing is a technique to avoid
>>> torn pages, but it requires a checksum to work. This chicken-
>>> and-egg problem requires the checksum to be implemented first.
>>
>> I don't think double-writes require checksums on the data pages
>> themselves, just on the copies in the double-write buffers. In
>> the double-write buffer, you'll need some extra information per-
>> page anyway, like a relfilenode and block number that indicates
>> which page it is in the buffer.
>
> You are clearly right -- if there is no checksum in the page itself,
> you can put one in the double-write metadata.  I've never seen that
> discussed before, but I'm embarrassed that it never occurred to me.

Heikki's idea for double writes works well.  It solves the problem of
torn pages in a way that would make FPWs redundant.

However, I don't see that it provides protection across non-crash
write problems. We know we have these since many systems have run
without a crash for years and yet still experience corrupt data.

Double writes do not require page checksums but neither do they
replace page checksums.

So I think we need page checksums plus either FPWs or double writes.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [RFC] grants vs. inherited tables

2011-12-30 Thread Marko Kreen
On Thu, Dec 29, 2011 at 11:11:22PM -0300, Alvaro Herrera wrote:
> 
> Excerpts from Marko Kreen's message of jue dic 29 15:22:49 -0300 2011:
> > 
> > On Thu, Dec 29, 2011 at 03:12:50PM -0300, Alvaro Herrera wrote:
> > > Excerpts from Marko Kreen's message of jue dic 29 15:04:49 -0300 2011:
> > > > 3) Have a way to format aclitem into something
> > > >that can used to create GRANT statement easily.  Eg:
> > > > 
> > > >  pg_get_privilege_info(
> > > >   IN priv aclitem,
> > > >   OUT rolename text,
> > > >   OUT privlist text[],
> > > >   OUT privlist_with_grant_option text[]);
> > > > 
> > > >This allows doing complex introspection in pl/pgsql
> > > >and also helps tools that want to re-create table structure
> > > >in other databases.
> > > 
> > > aclexplode?
> > 
> > I guess that decides the name.  :)
> 
> I have the (hopefully wrong) impression that you're missing the fact
> that it already exists, at least in 9.0.

You are right, I missed it.  For quite an obvious reason:

  $ grep -ri aclexplode doc/
  $

Is there a good reason why it's undocumented?  Internal/unstable API?
I'd better avoid it then.  But I would like to have this or a similar
function as part of the public API.

Although this also points to an obvious area that I shouldn't
have missed - the grants can be seen in information_schema...

I guess (3) is covered then.

-- 
marko

