Re: [HACKERS] Trust intermediate CA for client certificates

2013-03-09 Thread Ian Pilcher
On 03/07/2013 12:42 PM, Ray Stell wrote:
 What Tom said works for me.  Here is a page that gives an example and I think 
 it demonstrates that the root CA does not allow everybody in the gate, the 
 chain has to be in place:
  
 http://stackoverflow.com/questions/1456034/trouble-understanding-ssl-certificate-chain-verification

That page doesn't even mention PostgreSQL.

 You can use the openssl verify command to test that the root is not wide
 open on its own.

The issue is the behavior of the PostgreSQL server.  openssl verify is
germane only in that it points to the source of the problem -- OpenSSL's
insistence on ultimately validating all certificates against a self-
signed root CA.  This requires that the root CA certificate be present
in root.crt, which causes the server to accept connections from all
clients that can present a certificate chain leading to that root CA.

If you don't believe me, test with the attached files, which implement
the following hierarchy.

                +---------+
                | Root CA |
                +---------+
                  /     \
                 /       \
                /         \
       +-----------+   +-----------+
       | Server CA |   | Client CA |
       +-----------+   +-----------+
          /     \               \
         /       \               \
        /         \               \
  +----------+  +--------+     +--------+
  | postgres |  | Bad    |     | Good   |
  | (server) |  | client |     | client |
  +----------+  +--------+     +--------+

The goal is to configure the server such that the good client will
be allowed to connect (because its certificate is signed by the Client
CA), but the bad client will not be allowed to connect (because its
certificate is not signed by the Client CA).

You will find the following:

1. You cannot simply use client-ca.crt as $PGDATA/root.crt.  OpenSSL
   will not validate a client certificate without access to the root CA
   certificate.

2. To enable client connections, you must add the root CA certificate
   to $PGDATA/root.crt -- cat client-ca.crt root-ca.crt > root.crt.

3. Once the root CA certificate is trusted, however, the bad client
   can also connect by using a certificate chain that includes the
   Server CA certificate -- cat bad-client.crt server-ca.crt >
   ~/.postgresql/postgresql.crt.

After looking at be-secure.c and investigating the way that OpenSSL
validates certificates, I do not believe that there is any way of
achieving the desired behavior with the current codebase.
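
To see the OpenSSL behavior in isolation (outside the server), the chain
can be checked programmatically.  The following standalone C sketch -- not
PostgreSQL code, and only lightly error-checked -- loads client-ca.crt and
root-ca.crt into the trusted store, mirroring a root.crt built from both,
and then verifies bad-client.crt while supplying server-ca.crt as an
untrusted intermediate, i.e. the same chain the bad client can present to
the server.  If verification succeeds, the problem described above is
reproduced:

#include <stdio.h>
#include <openssl/pem.h>
#include <openssl/x509.h>
#include <openssl/x509_vfy.h>

static X509 *
load_cert(const char *path)
{
    FILE       *fp = fopen(path, "r");
    X509       *cert;

    if (fp == NULL)
        return NULL;
    cert = PEM_read_X509(fp, NULL, NULL, NULL);
    fclose(fp);
    return cert;
}

int
main(void)
{
    X509_STORE *store = X509_STORE_new();
    X509_STORE_CTX *ctx = X509_STORE_CTX_new();
    STACK_OF(X509) *untrusted = sk_X509_new_null();
    X509       *bad_client = load_cert("bad-client.crt");

    /* trusted certificates, as they would appear in $PGDATA/root.crt */
    X509_STORE_add_cert(store, load_cert("client-ca.crt"));
    X509_STORE_add_cert(store, load_cert("root-ca.crt"));

    /* untrusted intermediate supplied by the connecting client */
    sk_X509_push(untrusted, load_cert("server-ca.crt"));

    X509_STORE_CTX_init(ctx, store, bad_client, untrusted);
    if (X509_verify_cert(ctx) == 1)
        printf("bad client verifies once the root CA is trusted\n");
    else
        printf("bad client rejected: %s\n",
               X509_verify_cert_error_string(X509_STORE_CTX_get_error(ctx)));
    return 0;
}

Dropping root-ca.crt from the store makes even the good client fail to
verify, which matches finding 1 above.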

Adding pgsql-hackers to see if there is any interest in a patch to add
this functionality.

-- 

Ian Pilcher arequip...@gmail.com
Sometimes there's nothing left to do but crash and burn...or die trying.



root-ca.crt
Description: application/pkix-cert


server-ca.crt
Description: application/pkix-cert


client-ca.crt
Description: application/pkix-cert


bad-client.crt
Description: application/pkix-cert

[HACKERS] Re: Why do we still perform a check for pre-sorted input within qsort variants?

2013-03-09 Thread Dann Corbit
-Original Message-
From: gsst...@gmail.com [mailto:gsst...@gmail.com] On Behalf Of Greg Stark
Sent: Friday, March 08, 2013 4:59 PM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort 
variants?

On Fri, Mar 8, 2013 at 7:43 PM, Dann Corbit dcor...@connx.com wrote:
 Checking for pre-sorted input will not make the routine faster on average.
 However, it prevents a common perverse case where the runtime goes 
 quadratic, so sorting 10^6 elements will take K*10^12th operations when the 
 bad condition does occur.
 Checking for pre-sorted data can be thought of as an insurance policy.  It's 
 kind of a pain to pay the premiums, but you sure are glad you have it when 
 an accident occurs.
 Because the check is linear, and the algorithm is O(n*log(n)), the cost is 
 not terribly significant.

 Well pre-sorted inputs are not the only quadratic case. If we really wanted to
 eliminate quadratic cases we could implement the pivot choice algorithm that
 guarantees n*log(n) worst-case.

There are three things that give qsort fits:
1.  Large ascending partitions.
2.  Large descending partitions.
3.  Bad choice of the median.
If you examine the link in the previous article I gave:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html
It has a subfolder located by following this link:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/
In that folder is a file called qsortpdq.c
In file qsortpdq.c is a function:
void qsortPDQ(void *a, size_t n, size_t es,
              int (*cmp) (const void *, const void *));
which is an interface that checks for ascending partitions, descending 
partitions, and does a median of 3 choice for the pivot.
Now, even this routine will have very rare inputs that can cause it to go 
quadratic.  The probability of one of these particular inputs is very low.
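
For readers who don't want to dig through qsortpdq.c, here is a rough
standalone sketch (not the qsortpdq.c code itself) of two of the guards
being described: a linear check for an already-ascending run, and a
median-of-three pivot choice.  A descending check is the mirror image of
is_ascending():

#include <stddef.h>

typedef int (*cmp_fn) (const void *, const void *);

/* returns 1 if the n elements of size es starting at a are already ascending */
static int
is_ascending(const char *a, size_t n, size_t es, cmp_fn cmp)
{
    size_t      i;

    for (i = 1; i < n; i++)
        if (cmp(a + (i - 1) * es, a + i * es) > 0)
            return 0;
    return 1;
}

/* classic median-of-three: returns whichever of lo, mid, hi holds the median */
static const char *
median3(const char *lo, const char *mid, const char *hi, cmp_fn cmp)
{
    if (cmp(lo, mid) < 0)
    {
        if (cmp(mid, hi) < 0)
            return mid;                     /* lo < mid < hi */
        return cmp(lo, hi) < 0 ? hi : lo;   /* median of lo and hi */
    }
    if (cmp(lo, hi) < 0)
        return lo;                          /* mid <= lo < hi */
    return cmp(mid, hi) < 0 ? hi : mid;     /* median of mid and hi */
}

The ascending check costs one comparison per element, which is where the
linear-time insurance premium described above comes from.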

If you want 100% guarantee against quadratic behavior, then qsortpdq can be 
wrapped in heapsort with a recursion depth check.  That sort would never go 
quadratic.  That method is called introspective sort.
Here are some links that describe introspective sort:
http://en.wikipedia.org/wiki/Introsort
https://secweb.cs.odu.edu/~zeil/cs361/web/website/Lectures/heapsort/pages/introspect.html
And this is the real thing, right from the horse's mouth:
http://www.cs.rpi.edu/~musser/gp/introsort.ps
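
A minimal sketch of that depth-check wrapping, loosely following introsort:
quicksort recursion proceeds normally until a depth budget of about
2*log2(n) is used up, at which point the current partition is handed to
heapsort.  The partition_mo3() and heapsort_range() helpers are assumed
here and not shown; cmp_fn is the usual qsort-style comparator:

#include <stddef.h>
#include <math.h>

typedef int (*cmp_fn) (const void *, const void *);

/* assumed helpers, not shown: a median-of-three partition returning the
 * pivot's final index, and a heapsort over the same (base, n, es, cmp)
 * interface */
extern size_t partition_mo3(char *a, size_t n, size_t es, cmp_fn cmp);
extern void heapsort_range(char *a, size_t n, size_t es, cmp_fn cmp);

static void
introsort_rec(char *a, size_t n, size_t es, cmp_fn cmp, int depth_limit)
{
    while (n > 16)              /* small runs are left for a final insertion sort */
    {
        size_t      p;

        if (depth_limit-- == 0)
        {
            heapsort_range(a, n, es, cmp);      /* give up on quicksort here */
            return;
        }
        p = partition_mo3(a, n, es, cmp);

        /* recurse into the smaller side, iterate on the larger one */
        if (p < n - p - 1)
        {
            introsort_rec(a, p, es, cmp, depth_limit);
            a += (p + 1) * es;
            n -= p + 1;
        }
        else
        {
            introsort_rec(a + (p + 1) * es, n - p - 1, es, cmp, depth_limit);
            n = p;
        }
    }
}

void
introsort(void *a, size_t n, size_t es, cmp_fn cmp)
{
    if (n > 1)
        introsort_rec((char *) a, n, es, cmp, 2 * (int) log2((double) n));
    /* a final insertion sort over the nearly-sorted array would go here */
}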

There is no such thing as a quicksort that never goes quadratic.  It was
formally proven.  I cannot find the proof that I once read, but the article
"A Killer Adversary for Quicksort" by M. D. McIlroy answers pretty nearly.

Here is the summary from that paper:

SUMMARY
Quicksort can be made to go quadratic by constructing input on the fly in 
response to the sequence of
items compared. The technique is illustrated by a specific adversary for the 
standard C qsort function.
The general method works against any implementation of quicksort - even a
randomizing one - that satisfies certain very mild and realistic assumptions.

 But that will increase the average run-time. If we're not doing that then I
 think your whole argument falls apart. We do care about the average case as
 well as the worst-case.
The makefile in folder:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/
will build the sort system and test it.
The makefile in folder:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html
will build the simpler sort system and test it.

I would suggest, in addition, the use of this routine to create a tough data 
set for robustness:
http://www.cs.dartmouth.edu/~doug/aqsort.c

 There's been a *ton* of research on sorting. I find it hard to believe there
 isn't a pretty solid consensus on which of these defense mechanisms is the
 best trade-off.
I'm partial to chapter 13 of C Unleashed in that regard.  As far as the
trade-offs go, the problem is that they really are trade-offs.  However, the
cost is quite small (benchmark and see) and the risk of not performing the
checks is enormous.  Mostly sorted data happens all the time in real life, as
do other perverse distributions that cause problems for sorting algorithms.

My own personal recommendation would be to include every single safeguard (all 
three data checks and a recursion depth check as well).
That's what we do here (I work at a database tools company).



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Identity projection

2013-03-09 Thread Amit kapila
On Friday, March 08, 2013 11:21 PM Heikki Linnakangas wrote:
On 12.02.2013 11:03, Amit Kapila wrote:
 + /*
 +  * equivalent_tlists
 +  *      returns whether two target lists are equivalent
 +  *
 +  * We consider two target lists equivalent if both have
 +  * only Var entries and resjunk of each target entry is same.
 +  *
 +  * This function is used to decide whether to create a result node.
 +  * We need to ensure that each entry is Var as below node will not be
 +  * projection capable, so it will not be able to compute expressions.
 +  */
 + bool
 + equivalent_tlists(List *tlist1, List *tlist2)
 + {
 +     ListCell   *lp,
 +                *lc;
 +
 +     if (list_length(tlist1) != list_length(tlist2))
 +         return false;   /* tlists not same length */
 +
 +     forboth(lp, tlist1, lc, tlist2)
 +     {
 +         TargetEntry *tle1 = (TargetEntry *) lfirst(lp);
 +         TargetEntry *tle2 = (TargetEntry *) lfirst(lc);
 +
 +         if (tle1->resjunk != tle2->resjunk)
 +             return false;   /* tlist doesn't match junk status */
 +
 +         if (tle1->expr && IsA(tle1->expr, Var) &&
 +             tle2->expr && IsA(tle2->expr, Var))
 +         {
 +             Var        *var1 = (Var *) tle1->expr;
 +             Var        *var2 = (Var *) tle2->expr;
 +
 +             if (var1->varattno != var2->varattno)
 +                 return false;   /* different order */
 +         }
 +         else
 +             return false;
 +     }
 +     return true;
 + }

 Hmm, shouldn't that also check Var.varno and Var.varlevelsup? I tried
 really hard to come up with a query that would fail because of that, but
 I wasn't able to find one. Nevertheless, seems like those fields should
 be checked.

I had referred to the functions trivial_subqueryscan() and
tlist_matches_tupdesc(), which do something similar. I rechecked today and
found that both of these functions have Asserts for Var.varno and
Var.varlevelsup.
If you think that for the current case we should have a check rather than an
Assert, then I can update the patch accordingly.
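
For reference, the additional checks under discussion might look something
like this, slotted into the Var branch of equivalent_tlists() above (just a
sketch of the idea, not the updated patch):

            if (var1->varno != var2->varno ||
                var1->varlevelsup != var2->varlevelsup)
                return false;   /* Vars refer to different range table entries */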

 On a more trivial note, equivalent_tlists() seems too generic for this.
 I'd expect two tlists that contain e.g an identical Const node to be
 equivalent, but this returns false for it. I'd suggest calling it
 something like tlist_needs_projection() instead. Also, it probably
 belongs in tlist.c.

You are right. I shall update this in the patch once the above point about the
usage of varno and varlevelsup is confirmed.

With Regards,
Amit Kapila.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Ever seen transient garbage results from DELETE RETURNING?

2013-03-09 Thread Tom Lane
While hacking on the writable-foreign-tables patch, my attention was
drawn to what seems to be a pre-existing bug.  Consider the section
of ExecDelete() that computes the results for DELETE RETURNING:

    /* Process RETURNING if present */
    if (resultRelInfo->ri_projectReturning)
    {
        /*
         * We have to put the target tuple into a slot, which means first we
         * gotta fetch it.  We can use the trigger tuple slot.
         */
        TupleTableSlot *slot = estate->es_trig_tuple_slot;
        TupleTableSlot *rslot;
        HeapTupleData deltuple;
        Buffer      delbuffer;

        if (oldtuple != NULL)
        {
            deltuple.t_data = oldtuple;
            deltuple.t_len = HeapTupleHeaderGetDatumLength(oldtuple);
            ItemPointerSetInvalid(&(deltuple.t_self));
            deltuple.t_tableOid = InvalidOid;
            delbuffer = InvalidBuffer;
        }
        else
        {
            deltuple.t_self = *tupleid;
            if (!heap_fetch(resultRelationDesc, SnapshotAny,
                            &deltuple, &delbuffer, false, NULL))
                elog(ERROR, "failed to fetch deleted tuple for DELETE RETURNING");
        }

        if (slot->tts_tupleDescriptor != RelationGetDescr(resultRelationDesc))
            ExecSetSlotDescriptor(slot, RelationGetDescr(resultRelationDesc));
        ExecStoreTuple(&deltuple, slot, InvalidBuffer, false);

        rslot = ExecProcessReturning(resultRelInfo->ri_projectReturning,
                                     slot, planSlot);

        ExecClearTuple(slot);
        if (BufferIsValid(delbuffer))
            ReleaseBuffer(delbuffer);

        return rslot;
    }

In the normal code path where we're deleting from a plain table, we
fetch back the just-deleted row (using SnapshotAny), compute the
RETURNING expressions, then release the pin on the row's buffer.
But suppose that some of the RETURNING expressions are simple Vars
of pass-by-reference data types.  What we will end up with after
ExecProcessReturning is a virtual tuple (that is, an array of
Datum values), some of which are pass-by-reference pointers pointing
right into the disk buffer.  As soon as we ReleaseBuffer, those are
dangling pointers, because the disk buffer is now subject to change.
If the buffer did change before the RETURNING results got sent off
to the client, we'd return garbage, or maybe even crash trying to
interpret the fields.

The obvious mechanism for this, that the shared buffer's content
gets replaced by another page, is probably pretty improbable because
our own query has just made at least two separate touches of the page;
it should have a high enough LRU count to protect it for a little
while.  The only other possibility I can think of is that HOT pruning
or VACUUM comes along and shuffles data on the page.  This is probably
not very likely either, because our own query would have done a round
of HOT pruning when it first touched the page; barring some commits
in between, there wouldn't be much reason for the page to change further.

So this seems to be a live bug but one that's quite unlikely to be
seen in the field, and it wouldn't be reproducible at all either.
I'm curious if anyone has seen symptoms that might match this.

The fix is simple: insert an ExecMaterializeSlot(rslot) before releasing
the underlying data.  It will add a few cycles but there seems no help
for that; the whole point is we gotta copy data off the disk buffer
before releasing that pin.
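
For clarity, a sketch of how the tail of the block quoted above might read
with that call added (assuming the surrounding code is otherwise unchanged):

        rslot = ExecProcessReturning(resultRelInfo->ri_projectReturning,
                                     slot, planSlot);

        /* copy any pass-by-reference Datums out of the disk buffer */
        ExecMaterializeSlot(rslot);

        ExecClearTuple(slot);
        if (BufferIsValid(delbuffer))
            ReleaseBuffer(delbuffer);

        return rslot;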

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-09 Thread Simon Riggs
On 8 March 2013 03:31, Bruce Momjian br...@momjian.us wrote:

 I also see the checksum patch is taking a beating.  I wanted to step
 back and ask what percentage of known corruptions cases will this
 checksum patch detect?  What percentage of these corruptions would
 filesystem checksums have detected?

 Also, don't all modern storage drives have built-in checksums, and
 report problems to the system administrator?  Does smartctl help report
 storage corruption?

 Let me take a guess at answering this --- we have several layers in a
 database server:

 1 storage
 2 storage controller
 3 file system
 4 RAM
 5 CPU

 My guess is that storage checksums only cover layer 1, while our patch
 covers layers 1-3, and probably not 4-5 because we only compute the
 checksum on write.

 If that is correct, the open question is what percentage of corruption
 happens in layers 1-3?

Yes, checksums patch is taking a beating, and so it should. If we find
a reason to reject, we should.

CPU and RAM error checking are pretty standard now. Storage isn't
necessarily the same. The figures we had from the Google paper early
in development showed it was worth checksumming storage, but not
memory. I did originally argue for memory also, but there was
insufficient evidence of utility.

At the moment, we only reject blocks if the header is damaged. That
covers basic sanity checks on about 10 bytes near the start of every
block. Given that some errors might still be allowed through, let's say
that covers just 8 bytes of the block. Checksums cover the whole block
and detect most errors (99.999%), which means that we will detect errors
across all 8192 bytes of the block. That makes checksums approximately
1000 times better at spotting corruption than not using
them. Or put it another way, if you don't use checksums, by the time
you see a single corrupt block header you will on average have lost
about 500 blocks/4MB of user data. That doesn't sound too bad, but if
your database has been giving wrong answers during the period those
blocks went bad, you could be looking at a significant number of
reads/writes gone bad, since updates would spread corruption to other
rows and data would be retrieved incorrectly over a long period.

I agree with Robert's comments. This isn't a brilliant design, it's a
brilliant stop-gap until we get a better design. However, that is a
whole chunk of work away, with pg_upgrade handling on-disk page
rewrites, plus some as yet undecided redesign of the way hint bits
work. It's a long way off.

There are performance wrinkles also, no question. For some
applications, not losing data is worth the hit.

Given the patch offers choice to users, I think it's acceptable to look
towards committing it.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Btrfs clone WIP patch

2013-03-09 Thread Greg Smith

On 3/1/13 1:40 AM, Jonathan Rogers wrote:

I've been thinking about both of these issues and decided to try a
different approach. This patch adds GUC options for two external
commands


This is a reasonable approach for a proof of concept patch.  I like the 
idea you're playing with here, as a useful boost for both production 
deployments and development synchronization jobs.  There are some major 
coding hurdles for this idea to go beyond proof of concept into 
committed feature though, so I'm just going to dump all the obvious ones 
I can see on you too.  Feel free to ignore those and focus on the fun 
usability / capability parts for a while first.  Just want you to 
realize what work is in your way, but not visible to you necessarily.


With so many GUCs already, adding more of them to support something with 
a fairly narrow user base will be hard to justify committing.  This sort 
of thing seems like you could load it as an extension instead.  I don't 
think this is worth doing yet.  For now the GUC approach is fine, and it 
keeps your code sample small.


I'm going to start with how to continue validating this idea is useful 
without worrying about those, to get some more testing/benchmarking 
samples, and for all of that job what you're doing so far is fine.  The 
road toward something to commit is going to take a lot more complicated 
code eventually though.  There's no big conceptual leap involved, it's 
just where the changes need to go and how much they need to touch to do 
this accurately in all cases isn't as simple as your demo.  Handling 
unusual situations and race conditions turns out to be almost all the 
code in what you're replacing, and you'll need similar work in the 
replacements.


 I have just been experimenting with ZFS and it does not seem to have any
 capability or interface for cloning ordinary files or directories so the
 configuration is not as straightforward. However, I was able to set up a
 Postgres cluster as a hierarchy of ZFS file systems in the same pool
 with each directory under base being a separate file system and
 configure Postgres to call shell scripts which call zfs snapshot and
 clone commands to do the cloning and deleting.

To make things easier for a reviewer, it's useful to include working 
examples like this as part of your submission.  For this one that would 
include:


- A sample script that creates a test pool
- Working postgresql.conf changes compatible with ZFS
- A test program showing how to exercise the feature

I think I can see how to construct such an example for the btrfs 
version, but having you show that explicitly (preferably with a whole 
sample session executing it) will also help reviewers.  Remember:  if 
you want to get your submission off to a good start, the reviewer should 
be able to run your sample test, see the code work, and do something fun 
within a few seconds of compiling it.  Make that easy for them, and your 
reviewer will start with a good impression of you and a positive outlook 
for the change.


Now onto the code nitpicking!

= Extension vs. GUC =

In addition to not polluting the postgresql.conf.sample, there's another 
reason this might make for better extension material eventually.  Adding 
new execv calls that are building strings like this is just generally a 
risky thing.  It would be nice from a security perspective if that 
entire mechanism wasn't even introduced into the server at all unless 
someone loaded the extension.


An extension implementation will end up being more code, both to add a
useful hook to replace these calls and for the extension packaging
itself.  Having a bit more code in contrib/ but less in src &
postgresql.conf is probably a net positive trade though.


= Diff reduction / code refactoring =

Looks like you added a "File Operation Options" entry into guc.c but
then didn't use it here?  I would just keep this in an existing category
for now, try to reduce the diff length of the proof of concept version
as much as possible in the beginning.


On the topic of smaller diffs, the similar cut and paste sections of the 
two entries that both do fork/exec/waitpid should really be refactored 
into one function.  The archiver does something similar for running 
archive_command, there may be some reuse or refactoring of its code to 
present this interface.


Again, this sort of refactoring is not necessary as a POC patch.  But it 
will probably come up if this moves toward commit candidate.
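
To make the suggestion concrete, the shared helper might look something
like the sketch below; run_file_op_command() is a hypothetical name,
ereport() error handling is elided, and the real patch may of course
structure this differently:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdbool.h>

static bool
run_file_op_command(const char *command)
{
    pid_t       pid;
    int         status;

    pid = fork();
    if (pid < 0)
        return false;                       /* fork failed */

    if (pid == 0)
    {
        /* child: hand the command string to the shell */
        execl("/bin/sh", "sh", "-c", command, (char *) NULL);
        _exit(127);                         /* exec failed */
    }

    /* parent: wait for the child and succeed only on exit status 0 */
    if (waitpid(pid, &status, 0) != pid)
        return false;

    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

The clone and the remove call sites would then differ only in how they
build the command string.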


= Recursion and directory navigation =


In either case, the directories are copied recursively while the
Postgres internal copydir function does not recurse. I don't think that
should be a problem since there shouldn't be nested directories in the
first place.


copydir takes an option for whether it should recurse or not.

The rm side of this makes me twitch for a number of reasons.  First off,
there's just the general scariness of the concept of shelling out to run
rm recursively with some text string you build.

Re: [HACKERS] [pgsql-advocacy] Call for Google Summer of Code mentors, admins

2013-03-09 Thread Thom Brown
On 9 March 2013 01:01, Josh Berkus j...@agliodbs.com wrote:
 Thom.

 I don't mind being an admin again.

 Can you gather together all of the projects suggested on this thread and
 use them to create updated text for the GSOC page?  If you don't have
 web repo access, I can create a patch, but if you can do the text, that
 would be a big help.

Okay, I've pushed some changes to the repo for 2013 GSoC, and purged
the varnish cache so that it's visible immediately:
http://www.postgresql.org/developer/summerofcode/

I've also created a new GSoC 2013 wiki page with mentor volunteers and
project ideas submitted so far:
https://wiki.postgresql.org/wiki/GSoC_2013

--
Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-09 Thread Fujii Masao
On Sat, Mar 9, 2013 at 1:31 PM, Michael Paquier
michael.paqu...@gmail.com wrote:


 On Sat, Mar 9, 2013 at 1:37 AM, Fujii Masao masao.fu...@gmail.com wrote:

 + <para>
 +  Concurrent indexes based on a <literal>PRIMARY KEY</> or an <literal>
 +  EXCLUSION</> constraint need to be dropped with <literal>ALTER TABLE

 Typo: s/EXCLUSION/EXCLUDE

 Thanks. This is corrected.


 I encountered a segmentation fault when I ran REINDEX CONCURRENTLY.
 The test case to reproduce the segmentation fault is:

 1. Install btree_gist
 2. Run btree_gist's regression test (i.e., make installcheck)
 3. Log in contrib_regression database after the regression test
 4. Execute REINDEX TABLE CONCURRENTLY moneytmp

 Oops. I simply forgot to take into account the case of system attributes
 when building column names in index_concurrent_create. Fixed in new version
 attached.

Thanks for updating the patch!

I found a problem: the patch changed the behavior of
ALTER TABLE SET TABLESPACE so that it also moves
the indexes on the specified table to the new tablespace. Per the
documentation of ALTER TABLE, this is not the right behavior.

I think that it's worth adding a new option for concurrent rebuilding
to the reindexdb command. It's better to implement this separately
from the core patch, though.

You need to add the description of locking of REINDEX CONCURRENTLY
into mvcc.sgml, I think.

+   Rebuild a table concurrently:
+
+<programlisting>
+REINDEX TABLE CONCURRENTLY my_broken_table;

Obviously REINDEX cannot rebuild a table ;)

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-09 Thread Fujii Masao
On Fri, Mar 8, 2013 at 1:46 AM, Andres Freund and...@2ndquadrant.com wrote:
 Why do you want to temporarily mark it as valid? I don't see any
 requirement that it is set to that during validate_index() (which imo is
 badly named, but...).
 I'd just set it to valid in the same transaction that does the swap.

+1. I cannot yet see why the isprimary flag needs to be set even
on the invalid index. With the current patch, we can easily get into an
inconsistent situation, i.e., a table having more than one primary
key index.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why do we still perform a check for pre-sorted input within qsort variants?

2013-03-09 Thread Greg Stark
On Sat, Mar 9, 2013 at 10:32 AM, Dann Corbit dcor...@connx.com wrote:
 There is no such thing as a quicksort that never goes quadratic.  It was 
 formally proven

The median of medians selection of the pivot gives you O(n*log(n)).

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-09 Thread Fujii Masao
On Sun, Mar 10, 2013 at 3:48 AM, Fujii Masao masao.fu...@gmail.com wrote:
 Thanks for updating the patch!

-    SELECT reltoastidxid
-    FROM info_rels i JOIN pg_catalog.pg_class c
-      ON i.reloid = c.oid));
+    SELECT indexrelid
+    FROM info_rels i
+      JOIN pg_catalog.pg_class c
+        ON i.reloid = c.oid
+      JOIN pg_catalog.pg_index p
+        ON i.reloid = p.indrelid
+    WHERE p.indexrelid >= %u, FirstNormalObjectId));

This new SQL doesn't seem to be right. The old one doesn't pick up any indexes
other than the toast index, but the new one seems to.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Why do we still perform a check for pre-sorted input within qsort variants?

2013-03-09 Thread Dann Corbit
-Original Message-
From: gsst...@gmail.com [mailto:gsst...@gmail.com] On Behalf Of Greg Stark
Sent: Saturday, March 09, 2013 11:39 AM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort 
variants?

On Sat, Mar 9, 2013 at 10:32 AM, Dann Corbit dcor...@connx.com wrote:
 There is no such thing as a quicksort that never goes quadratic.  It was 
 formally proven

The median of medians selection of the pivot gives you O(n*log(n)).


No.  It does make O(n*n)  far less probable, but it does not eliminate it.  If 
it were possible, then introspective sort would be totally without purpose. 
 And yet introspective sort is the algorithm of choice for the ANSI/ISO C++ 
standard directly because it will never go quadratic.
Bentley and McIlroy's "Engineering a Sort Function" says this about the
final qsort source code model chosen for their implementation, in a footnote:
Of course, quadratic behavior is still possible. One can generate fiendish 
inputs by bugging Quicksort: Consider
key values to be unknown initially. In the code for selecting a partition 
element, assign values in increasing
order as unknown keys are encountered. In the partitioning code, make unknown 
keys compare high.

Interestingly, some standard library qsort routines will go quadratic if simply 
given a set that contains random 0 and 1 values for the input set.

It has been formally proven that no matter what method is used to choose the
median (other than calculating the exact median of every partition, which would
make quicksort slower than heapsort) there exists an adversary distribution 
that makes quicksort go quadratic. (unfortunately, I can't find the proof or I 
would give you the citation).  Median selection that is randomized does not 
select the true median unless it operates over all the elements of the entire 
partition.  With some small probability, it makes the worst possible choice for 
the median.  It also does not matter if you choose the first, last and middle 
elements for a median of three (or other evenly selected points for larger 
sample sets) or if you select the sample with randomization.  Both methods have 
adversaries.  Now, the quickselect algorithm can select the median in O(n) but 
that is again an average.  There are also adversary distributions for 
quickselect.

On the other hand, various safeguards can make O(n*n) extremely improbable 
{even probabilities approaching zero}.  There is a tradeoff with more and more 
complicated safeguards.  If I choose large subsets of a partition and calculate 
the true median of the subset, it will make the routine safer from quadratic 
behavior but it will also make it slower.  At some point, you end up with a 
routine no faster than heapsort, while also being much more complex and 
therefore more difficult to maintain.  Hence, the necessity of introspective 
sort.

On the other hand, there are also data specific sorts that have very special 
properties.  There are cache conscious versions of burstsort that can sort 
strings much faster than quicksort or even radix sort according to 
measurements.  There is a special version of LSD radix sort that will sort an
array in one counting pass and N distribution passes, where N is the number of
digit positions at the chosen radix.
So, for instance, if you are sorting 4 byte integers and your radix is 256 {one 
byte} then you can sort in 5 passes.  One pass to count the bytes in each 
integer by position, and 4 passes to distribute the data.  This can also be 
done in place.  MSD radix sort can also be used as an outer sort container 
which sorts by distribution until the partitions are small enough and then 
switches to introspective sort (which eventually switches to insertion sort).  
Radix sorts can also be made totally generic via callback routines like 
quicksort.  Instead of a comparison operator, the callback gets a request for the
radix bits for a specific bin number and then returns the radix count bits for 
that specific bin.  For unsigned char and radix 256, this is trivially 
character [i] from the string for bin [i], for example.
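
As a concrete illustration of that 5-pass scheme (one counting pass plus
four distribution passes), here is a minimal sketch for 32-bit unsigned
keys with radix 256; it uses a scratch buffer rather than the in-place
variant mentioned above:

#include <stdint.h>
#include <stdlib.h>

static void
radix_sort_u32(uint32_t *a, size_t n)
{
    size_t      count[4][256] = {{0}};
    uint32_t   *tmp = malloc(n * sizeof(uint32_t));
    uint32_t   *src = a;
    uint32_t   *dst = tmp;
    size_t      i;
    int         pass;

    if (n < 2 || tmp == NULL)
    {
        free(tmp);
        return;
    }

    /* one counting pass: histogram every byte position at once */
    for (i = 0; i < n; i++)
        for (pass = 0; pass < 4; pass++)
            count[pass][(a[i] >> (8 * pass)) & 0xFF]++;

    /* four distribution passes, least significant byte first */
    for (pass = 0; pass < 4; pass++)
    {
        size_t      offset[256];
        size_t      sum = 0;
        int         b;
        uint32_t   *swap;

        for (b = 0; b < 256; b++)
        {
            offset[b] = sum;
            sum += count[pass][b];
        }
        for (i = 0; i < n; i++)
            dst[offset[(src[i] >> (8 * pass)) & 0xFF]++] = src[i];

        swap = src;
        src = dst;
        dst = swap;
    }

    /* after an even number of passes the sorted data is back in a */
    free(tmp);
}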

I guess improving sort routines all boils down to how much energy you want to 
put into it.  Donald Knuth once stated that according to measurement, about 40% 
of business related mainframe computer cycles are involved in ordering data.  
So it is clearly an important problem.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Why do we still perform a check for pre-sorted input within qsort variants?

2013-03-09 Thread Dann Corbit
A Machine-Checked Proof of the Average-Case Complexity of Quicksort in Coq
By Eelis van der Weegen and James McKinna
Institute for Computing and Information Sciences
Radboud University Nijmegen
Heijendaalseweg 135, 6525 AJ Nijmegen, The Netherlands

Contains a formal proof, validated by machine logic, that quicksort is 
quadratic in the worst case and also a formal proof of the average runtime 
efficiency.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why do we still perform a check for pre-sorted input within qsort variants?

2013-03-09 Thread Greg Stark
On Sat, Mar 9, 2013 at 8:52 PM, Dann Corbit dcor...@connx.com wrote:
 Median of medians selection of the pivot gives you O(n*log(n)).

 No.  It does make O(n*n)  far less probable, but it does not eliminate it.  
 If it were possible, then introspective sort would be totally without purpose.

No really, quicksort with median of medians pivot selection is most
definitely O(n*log(n)) worst case. This is textbook stuff. In fact even the
introspective sort paper mentions it as one of the options to fail
over to if the partition size isn't decreasing rapidly enough.

The problem is that median of medians is O(n) rather than O(1). That
doesn't change the O() growth rate since there will be log(n)
iterations. But it means it contributes to the constant factor and the
end result ends up being a constant factor larger than heap sort or
merge sort. That also explains how your reference on the quicksort
adversary doesn't apply. It works by ordering elements you haven't
compared yet and assumes that all but O(1) elements will still be
eligible for reordering.
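
For anyone who wants to play with it, here is a rough sketch of that
recursive median-of-fives pivot selection, specialized to int for brevity;
the index it returns holds a value that can then be used as the partition
pivot in quicksort or quickselect:

static void
swap_int(int *x, int *y)
{
    int         t = *x;

    *x = *y;
    *y = t;
}

static void
insertion_sort_range(int *a, int lo, int hi)
{
    int         i, j;

    for (i = lo + 1; i <= hi; i++)
        for (j = i; j > lo && a[j - 1] > a[j]; j--)
            swap_int(&a[j - 1], &a[j]);
}

/* returns an index in [lo, hi] whose element is the median of medians */
static int
median_of_medians(int *a, int lo, int hi)
{
    int         n = hi - lo + 1;
    int         i, num_medians = 0;

    if (n <= 5)
    {
        insertion_sort_range(a, lo, hi);
        return lo + n / 2;
    }

    /* gather the median of each group of five at the front of the range */
    for (i = lo; i <= hi; i += 5)
    {
        int         sub_hi = (i + 4 <= hi) ? i + 4 : hi;

        insertion_sort_range(a, i, sub_hi);
        swap_int(&a[lo + num_medians], &a[i + (sub_hi - i) / 2]);
        num_medians++;
    }

    /* recurse on the collected group medians */
    return median_of_medians(a, lo, lo + num_medians - 1);
}

The extra sorting of the five-element groups is exactly the O(n) work per
partitioning level that inflates the constant factor.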

In any case I think you're basically right. What we have is basically
a broken introspective sort that does more work than necessary and
then handles fewer cases than it should. Implementing a better
introspection that detects all perverse cases and does so with a lower
overhead than the current check is a fine idea.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Why do we still perform a check for pre-sorted input within qsort variants?

2013-03-09 Thread Dann Corbit
Yes, you are right.  I knew of a median of medians technique for pivot 
selection and I mistook that for the median of medians median selection 
algorithm (which it definitely isn't).
I was not aware of a true linear time selection of the median algorithm (which
is what median of medians accomplishes).  The fastest median selection
algorithm that I was aware of was quickselect, which is only linear on average.
I think that your analysis is correct, at any rate.

I also think I will enjoy learning and experimenting with the median of medians 
algorithm.
I found a link about it here:
http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm

-Original Message-
From: gsst...@gmail.com [mailto:gsst...@gmail.com] On Behalf Of Greg Stark
Sent: Saturday, March 09, 2013 1:21 PM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort 
variants?

On Sat, Mar 9, 2013 at 8:52 PM, Dann Corbit dcor...@connx.com wrote:
 Median of medians selection of the pivot gives you O(n*log(n)).

 No.  It does make O(n*n)  far less probable, but it does not eliminate it.  
 If it were possible, then introspective sort would be totally without purpose.

No really, quicksort with median of medians pivot selection is most definitely 
O(n*log(n)) worst case. This is textbook stuff. In fact even the introspective 
sort paper mentions it as one of the options to fail over to if the partition 
size isn't decreasing rapidly enough.

The problem is that median of medians is O(n) rather than O(1). That doesn't 
change the O() growth rate since there will be log(n) iterations. But it means 
it contributes to the constant factor and the end result ends up being a 
constant factor larger than heap sort or merge sort. That also explains how 
your reference on the quicksort adversary doesn't apply. It works by ordering 
elements you haven't compared yet and assumes that all but O(1) elements will 
still be eligible for reordering.

In any case I think you're basically right. What we have is basically a broken 
introspective sort that does more work than necessary and then handles fewer 
cases than it should. Implementing a better introspection that detects all 
perverse cases and does so with a lower overhead than the current check is a 
fine idea.


--
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why do we still perform a check for pre-sorted input within qsort variants?

2013-03-09 Thread Greg Stark
On Sat, Mar 9, 2013 at 10:22 PM, Dann Corbit dcor...@connx.com wrote:
 Yes, you are right.  I knew of a median of medians technique for pivot 
 selection and I mistook that for the median of medians median selection 
 algorithm (which it definitely isn't).
 I was not aware of a true linear time selection of the median algorithm 
 (which is what median of medians accomplishes).  The fastest median selection
 algorithm that I was aware of was quickselect, which is only linear on 
 average.
 I think that your analysis is correct, at any rate.

Hm, I was using the terminology differently than the Wikipedia page. I
was referring to the recursive median of 5s used as the pivot
selection as median of medians. And I still called Quicksort or
Quickselect using that pivot Quicksort or Quickselect with that
specific pivot choice algorithm.

When using that pivot choice Quicksort is O(n*log(n)) and Quickselect
(Median of Medians on Wikipedia) is O(n). But the constant factor
becomes larger than if the pivot choice algorithm is O(1). I suppose
it's more interesting in the case of Quickselect since there's no
other alternative algorithms that could be used that have better
constant factors whereas for sorting we have other options.


I wonder if it makes sense to use timsort as the fallback for
quicksort if the partition sizes are skewed. Timsort is specifically
designed to handle presorted inputs well.  On the other hand it is
itself a hybrid sort so it might be getting overly complex to make it
part of a hybrid algorithm.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Why do we still perform a check for pre-sorted input within qsort variants?

2013-03-09 Thread Dann Corbit
-Original Message-
From: gsst...@gmail.com [mailto:gsst...@gmail.com] On Behalf Of Greg Stark
Sent: Saturday, March 09, 2013 5:16 PM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort 
variants?

On Sat, Mar 9, 2013 at 10:22 PM, Dann Corbit dcor...@connx.com wrote:
 Yes, you are right.  I knew of a median of medians technique for pivot 
 selection and I mistook that for the median of medians median selection 
 algorithm (which it definitely isn't).
 I was not aware of a true linear time selection of the median algorithm 
 (which is what median of medians accomplishes).  The fastest median selection
 algorithm that I was aware of was quickselect, which is only linear on 
 average.
 I think that your analysis is correct, at any rate.

Hm, I was using the terminology differently than the Wikipedia page. I was 
referring to the recursive median of 5s used as the pivot selection as median 
of medians. And I still called Quicksort or Quickselect using that pivot 
Quicksort or Quickselect with that specific pivot choice algorithm.

When using that pivot choice Quicksort is O(n*log(n)) and Quickselect (Median 
of Medians on Wikipedia) is O(n). But the constant factor becomes larger than 
if the pivot choice algorithm is O(1). I suppose it's more interesting in the 
case of Quickselect since there's no other alternative algorithms that could be 
used that have better constant factors whereas for sorting we have other 
options.


I wonder if it makes sense to use timsort as the fallback for quicksort if the 
partition sizes are skewed. Timsort is specifically designed to handle 
presorted inputs well.  On the other hand it is itself a hybrid sort so it 
might be getting overly complex to make it part of a hybrid algorithm.

--

My opinion (and it is no better than anyone else's) is that for a database you 
have to be very defensive in programming.
Since database systems can contain any possible distribution (and over time 
they likely will encounter almost every possibility) it is important to prevent 
unusual inputs from causing disaster.

The reason we added introspection to the sort here is that we already had
quicksort with an ordered-partition check along with a median of three sampled
from three medians of three (not to mention the standard recursion into the
smallest partition first and a switch to insertion sort at small enough
partition sizes).  Sure enough, there was a customer who had a data
distribution that caused bad behavior.

We have customers with mainframe data where a single table can be 24 GB (I know 
this offhand, there are sure to be some that are larger), so a bad behavior on 
a sort *will* be noticed.

Sorry about the oddball quoting.  I will look at fixing it so that my posts are 
not so hard to grok.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Btrfs clone WIP patch

2013-03-09 Thread Jonathan Rogers
Greg Smith wrote:
 I think I can see how to construct such an example for the btrfs
 version, but having you show that explicitly (preferably with a whole
 sample session executing it) will also help reviewers.  Remember:  if
 you want to get your submission off to a good start, the reviewer should
 be able to run your sample test, see the code work, and do something fun
 within a few seconds of compiling it.  Make that easy for them, and your
 reviewer will start with a good impression of you and a positive outlook
 for the change.

Yes, an example is very simple with Btrfs, since it only requires one
GUC variable and that the cluster be created on a Btrfs file system.
Constructing a ZFS example is decidedly non-trivial and I'm starting to
question whether it's worth it.

 
 Now onto the code nitpicking!
 
 = Extension vs. GUC =
 
 In addition to not polluting the postgresql.conf.sample, there's another
 reason this might make for better extension material eventually.  Adding
 new execv calls that are building strings like this is just generally a
 risky thing.  It would be nice from a security perspective if that
 entire mechanism wasn't even introduced into the server at all unless
 someone loaded the extension.
 
 An extension implementation will end up being more code, both to add a
 useful hook to replace these calls and for the extension packaging
 itself.  Having a bit more code in contrib/ but less in src &
 postgresql.conf is probably a net positive trade though.

I will look into how to write an extension.

 
 = Diff reduction / code refactoring =
 
 Looks like you added a "File Operation Options" entry into guc.c but
 then didn't use it here?  I would just keep this in an existing category
 for now, try to reduce the diff length of the proof of concept version
 as much as possible in the beginning.

Yes, that is a good idea.

 
 On the topic of smaller diffs, the similar cut and paste sections of the
 two entries that both do fork/exec/waitpid should really be refactored
 into one function.  The archiver does something similar for running
 archive_command, there may be some reuse or refactoring of its code to
 present this interface.

I'll look at archive_command to see what might be in common.

 
 Again, this sort of refactoring is not necessary as a POC patch.  But it
 will probably come up if this moves toward commit candidate.
 
 = Recursion and directory navigation =
 
 In either case, the directories are copied recursively while the
 Postgres internal copydir function does not recurse. I don't think that
 should be a problem since there shouldn't be nested directories in the
 first place.
 
 copydir takes an option for whether it should recurse or not.
 
 The rm side of this makes me twitch for a number of reasons.  First off,
 there's just the general scariness of the concept of shelling out to run
 rm recursively with some text string you build.  The worst time I saw a
 bug in that sort of code destroyed a terabyte, and the last time I saw
 such a bug was only a week ago.  Validation you're doing the right thing
 is always job #1 in removing files.

I needed to call an external command to remove a directory only when
experimenting on ZFS since the regular implementation works fine on
Btrfs. Unlike Btrfs, ZFS does not have any capability to clone
individual files. Therefore, a directory under base/ has to start out as
a ZFS snapshot, which can be cloned with the zfs clone command to
become a snapshot directory with exactly the same contents. To remove
the clone, the zfs destroy command has to be called on the clone.
AFAICT, clone and remove operations on ZFS always operate on a whole
directory at a time.

 
 The same sort of issue is in your external_copydir.  Iterating into
 subdirectories when it doesn't happen now just isn't safe, even though
 the one expected case you're hooking won't be any different.  You really
 can't just do that.  Would this work instead, and is there any concern
 about files that start with a .?
 
 cp * --reflink=auto
 
 Regardless, you need to keep most of the structure to copydir anyway.
 Error handling, handling cancellation, and fsync calls are all vital
 things.  You probably have to make the forked command copy a single file
 at a time to get the same interrupt handling behavior.
 

In an earlier implementation, I did call cp --reflink=auto once per
regular file, preserving the behavior of copydir. On Btrfs, this works
well, though slightly slower due to extra processes. AFAIK, there's no
way to do something equivalent on ZFS without coming up with a much more
complicated scheme involving both links and clones.

I don't think it will be possible to implement a scheme that works on
ZFS and addresses your concerns about file and directory handling that
is not many times more complex than what I have so far. OTOH, I think
the approach I have already implemented which calls an external command
for each regular file to copy might be acceptable. Since I don't
personally have much 

Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-09 Thread Michael Paquier
On Sun, Mar 10, 2013 at 4:50 AM, Fujii Masao masao.fu...@gmail.com wrote:

 On Sun, Mar 10, 2013 at 3:48 AM, Fujii Masao masao.fu...@gmail.com
 wrote:
  Thanks for updating the patch!

 -    SELECT reltoastidxid
 -    FROM info_rels i JOIN pg_catalog.pg_class c
 -      ON i.reloid = c.oid));
 +    SELECT indexrelid
 +    FROM info_rels i
 +      JOIN pg_catalog.pg_class c
 +        ON i.reloid = c.oid
 +      JOIN pg_catalog.pg_index p
 +        ON i.reloid = p.indrelid
 +    WHERE p.indexrelid >= %u, FirstNormalObjectId));

 This new SQL doesn't seem to be right. The old one doesn't pick up any indexes
 other than the toast index, but the new one seems to.

Indeed, it was selecting all indexes...
I replaced it with this query, restricting the selection to the indexes of
toast relations:
-    SELECT reltoastidxid
-    FROM info_rels i JOIN pg_catalog.pg_class c
-      ON i.reloid = c.oid));
+    SELECT indexrelid
+    FROM pg_index
+    WHERE indrelid IN (SELECT reltoastrelid
+      FROM pg_class
+      WHERE oid >= %u
+        AND reltoastrelid != %u),
+    FirstNormalObjectId, InvalidOid));
Will send patch soon...
-- 
Michael


Re: [HACKERS] odd behavior in materialized view

2013-03-09 Thread Fujii Masao
On Fri, Mar 8, 2013 at 2:42 AM, Kevin Grittner kgri...@ymail.com wrote:
 Fujii Masao masao.fu...@gmail.com wrote:

 I found one typo in the document of MV. Please see the attached
 patch.

 Pushed.  Thanks!

Thanks!

I found that pg_dump always fails against a 9.2 or earlier server because
of the MV patch.

$ pg_dump
pg_dump: [archiver (db)] query failed: pg_dump: [archiver (db)] query was:

Attached patch fixes this problem.

Regards,

-- 
Fujii Masao


pg_dump_bugfix_matview.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers