Re: [HACKERS] Trust intermediate CA for client certificates
On 03/07/2013 12:42 PM, Ray Stell wrote:
> What Tom said works for me. Here is a page that gives an example, and I
> think it demonstrates that the root CA does not allow everybody in the
> gate; the chain has to be in place:
> http://stackoverflow.com/questions/1456034/trouble-understanding-ssl-certificate-chain-verification

That page doesn't even mention PostgreSQL.

> You can use the openssl verify command to test that the root is not
> wide open on its own.

The issue is the behavior of the PostgreSQL server. openssl verify is germane only in that it points to the source of the problem -- OpenSSL's insistence on ultimately validating all certificates against a self-signed root CA. This requires that the root CA certificate be present in root.crt, which causes the server to accept connections from all clients that can present a certificate chain leading to that root CA.

If you don't believe me, test with the attached files, which implement the following hierarchy:

                     +-----------+
                     |  Root CA  |
                     +-----------+
                      /         \
           +-----------+       +-----------+
           | Server CA |       | Client CA |
           +-----------+       +-----------+
                 |               /       \
          +----------+    +--------+  +--------+
          | postgres |    |  Bad   |  |  Good  |
          | (server) |    | client |  | client |
          +----------+    +--------+  +--------+

The goal is to configure the server such that the good client will be allowed to connect (because its certificate is signed by the Client CA), but the bad client will not be allowed to connect (because its certificate is not signed by the Client CA).

You will find the following:

1. You cannot simply use client-ca.crt as $PGDATA/root.crt. OpenSSL will not validate a client certificate without access to the root CA certificate.

2. To enable client connections, you must add the root CA certificate to $PGDATA/root.crt -- cat client-ca.crt root-ca.crt > root.crt.

3. Once the root CA certificate is trusted, however, the bad client can also connect, by using a certificate chain that includes the Server CA certificate -- cat bad-client.crt server-ca.crt > ~/.postgresql/postgresql.crt.
After looking at be-secure.c and investigating the way that OpenSSL validates certificates, I do not believe that there is any way of achieving the desired behavior with the current codebase. Adding pgsql-hackers to see if there is any interest in a patch to add this functionality.

-- 
Ian Pilcher                                         arequip...@gmail.com
Sometimes there's nothing left to do but crash and burn...or die trying.

root-ca.crt      Description: application/pkix-cert
server-ca.crt    Description: application/pkix-cert
client-ca.crt    Description: application/pkix-cert
bad-client.crt   Description: application/pkix-cert
[HACKERS] Re: Why do we still perform a check for pre-sorted input within qsort variants?
-----Original Message-----
From: gsst...@gmail.com [mailto:gsst...@gmail.com] On Behalf Of Greg Stark
Sent: Friday, March 08, 2013 4:59 PM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?

On Fri, Mar 8, 2013 at 7:43 PM, Dann Corbit <dcor...@connx.com> wrote:
>> Checking for pre-sorted input will not make the routine faster on
>> average. However, it prevents a common perverse case where the runtime
>> goes quadratic, so sorting 10^6 elements will take K*10^12 operations
>> when the bad condition does occur. Checking for pre-sorted data can be
>> thought of as an insurance policy. It's kind of a pain to pay the
>> premiums, but you sure are glad you have it when an accident occurs.
>> Because the check is linear, and the algorithm is O(n*log(n)), the
>> cost is not terribly significant.
>
> Well, pre-sorted inputs are not the only quadratic case. If we really
> wanted to eliminate quadratic cases we could implement the pivot choice
> algorithm that guarantees n*log(n) worst-case.

There are three things that give qsort fits:

1. Large ascending partitions.
2. Large descending partitions.
3. Bad choice of the median.

If you examine the link I gave in the previous article:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html
it has a subfolder located by following this link:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/

In that folder is a file called qsortpdq.c. In file qsortpdq.c is a function:

void qsortPDQ(void *a, size_t n, size_t es, int (*cmp) (const void *, const void *));

which is an interface that checks for ascending partitions, checks for descending partitions, and does a median-of-3 choice for the pivot. Now, even this routine will have very rare inputs that can cause it to go quadratic. The probability of one of these particular inputs is very low.
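The linear pre-sorted check Dann describes as an insurance policy has a very small footprint. Here is a minimal stand-alone sketch of the idea (illustrative only -- this is neither the qsortpdq.c nor the PostgreSQL implementation):

```c
#include <stddef.h>

/*
 * Return 1 if the array is already sorted under cmp, else 0.
 * One linear scan: O(n) cost against the sort's O(n*log(n)),
 * but it short-circuits the classic quadratic pre-sorted case.
 */
static int
is_presorted(const void *a, size_t n, size_t es,
             int (*cmp) (const void *, const void *))
{
    const char *base = (const char *) a;
    size_t      i;

    for (i = 1; i < n; i++)
        if (cmp(base + (i - 1) * es, base + i * es) > 0)
            return 0;
    return 1;
}
```

A sort wrapper would call this first and return immediately on success; a symmetric reverse scan catches descending input the same way.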
If you want a 100% guarantee against quadratic behavior, then qsortpdq can be wrapped in heapsort with a recursion depth check. That sort would never go quadratic. That method is called introspective sort. Here are some links that describe introspective sort:

http://en.wikipedia.org/wiki/Introsort
https://secweb.cs.odu.edu/~zeil/cs361/web/website/Lectures/heapsort/pages/introspect.html

And this is the real thing, right from the horse's mouth:
http://www.cs.rpi.edu/~musser/gp/introsort.ps

There is no such thing as a quicksort that never goes quadratic. It was formally proven. I cannot find the proof that I once read, but the article "A Killer Adversary for Quicksort" by M. D. McIlroy answers pretty nearly. Here is the summary from that paper:

   SUMMARY: Quicksort can be made to go quadratic by constructing input
   on the fly in response to the sequence of items compared. The
   technique is illustrated by a specific adversary for the standard C
   qsort function. The general method works against any implementation
   of quicksort -- even a randomizing one -- that satisfies certain
   very mild and realistic assumptions.

> But that will increase the average run-time. If we're not doing that
> then I think your whole argument falls apart. We do care about the
> average case as well as the worst-case.

The makefile in folder http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/ will build the sort system and test it. The makefile in folder http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html will build the simpler sort system and test it. I would suggest, in addition, the use of this routine to create a tough data set for robustness testing:
http://www.cs.dartmouth.edu/~doug/aqsort.c

> There's been a *ton* of research on sorting. I find it hard to believe
> there isn't a pretty solid consensus on which of these defense
> mechanisms is the best trade-off.

I'm partial to chapter 13 of "C Unleashed" in that regard. As for the trade-offs, the problem is that they really are trade-offs.
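The heapsort-wrapped quicksort Dann describes is easy to sketch. The following toy introsort over ints (an illustration of the idea, not the standard library version) gives quicksort a depth budget of roughly 2*log2(n) and falls back to heapsort when the budget is exhausted, capping the worst case at O(n*log(n)):

```c
#include <stddef.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Textbook sift-down heapsort: the O(n*log(n)) fallback. */
static void heapsort_int(int *a, size_t n)
{
    size_t start, r, c;

    for (start = n / 2; start-- > 0; )          /* build max-heap */
        for (r = start; (c = 2 * r + 1) < n; r = c) {
            if (c + 1 < n && a[c + 1] > a[c]) c++;
            if (a[r] >= a[c]) break;
            swap_int(&a[r], &a[c]);
        }
    while (n > 1) {                             /* pop max repeatedly */
        swap_int(&a[0], &a[--n]);
        for (r = 0; (c = 2 * r + 1) < n; r = c) {
            if (c + 1 < n && a[c + 1] > a[c]) c++;
            if (a[r] >= a[c]) break;
            swap_int(&a[r], &a[c]);
        }
    }
}

static void introsort_rec(int *a, size_t n, int depth)
{
    while (n > 1) {
        if (depth-- <= 0) {             /* partitioning gone bad: bail */
            heapsort_int(a, n);
            return;
        }
        /* Median-of-3: put the median of first/middle/last at a[n-1]. */
        size_t m = n / 2, i, store = 0;
        if (a[m] < a[0])     swap_int(&a[m], &a[0]);
        if (a[n - 1] < a[0]) swap_int(&a[n - 1], &a[0]);
        if (a[m] < a[n - 1]) swap_int(&a[m], &a[n - 1]);
        /* Lomuto partition around the pivot now at a[n-1]. */
        int pivot = a[n - 1];
        for (i = 0; i + 1 < n; i++)
            if (a[i] < pivot)
                swap_int(&a[store++], &a[i]);
        swap_int(&a[store], &a[n - 1]);         /* pivot to final place */
        introsort_rec(a, store, depth);         /* recurse on left half */
        a += store + 1;                         /* loop on right half */
        n -= store + 1;
    }
}

void introsort(int *a, size_t n)
{
    int    depth = 0;
    size_t m;

    for (m = n; m > 1; m >>= 1)
        depth += 2;                 /* budget ~ 2*floor(log2(n)) */
    introsort_rec(a, n, depth);
}
```

Adversarial input simply burns through the depth budget and lands in heapsort; well-behaved input never pays more than the quicksort constant factor.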
However, the cost is quite small (benchmark and see) and the risk of not performing the checks is enormous. Mostly sorted data happens all the time in real life, as do other perverse distributions that cause problems for sorting algorithms.

My own personal recommendation would be to include every single safeguard (all three data checks and a recursion depth check as well). That's what we do here (I work at a database tools company).

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Identity projection
On Friday, March 08, 2013 11:21 PM Heikki Linnakangas wrote:
> On 12.02.2013 11:03, Amit Kapila wrote:
>> + /*
>> +  * equivalent_tlists
>> +  *      returns whether two target lists are equivalent
>> +  *
>> +  * We consider two target lists equivalent if both have
>> +  * only Var entries and resjunk of each target entry is same.
>> +  *
>> +  * This function is used to decide whether to create a result node.
>> +  * We need to ensure that each entry is Var as below node will not be
>> +  * projection capable, so it will not be able to compute expressions.
>> +  */
>> + bool
>> + equivalent_tlists(List *tlist1, List *tlist2)
>> + {
>> +     ListCell   *lp,
>> +                *lc;
>> +
>> +     if (list_length(tlist1) != list_length(tlist2))
>> +         return false;           /* tlists not same length */
>> +
>> +     forboth(lp, tlist1, lc, tlist2)
>> +     {
>> +         TargetEntry *tle1 = (TargetEntry *) lfirst(lp);
>> +         TargetEntry *tle2 = (TargetEntry *) lfirst(lc);
>> +
>> +         if (tle1->resjunk != tle2->resjunk)
>> +             return false;       /* tlist doesn't match junk status */
>> +
>> +         if (tle1->expr && IsA(tle1->expr, Var) &&
>> +             tle2->expr && IsA(tle2->expr, Var))
>> +         {
>> +             Var        *var1 = (Var *) tle1->expr;
>> +             Var        *var2 = (Var *) tle2->expr;
>> +
>> +             if (var1->varattno != var2->varattno)
>> +                 return false;   /* different order */
>> +         }
>> +         else
>> +             return false;
>> +     }
>> +     return true;
>> + }
>
> Hmm, shouldn't that also check Var.varno and Var.varlevelsup? I tried
> really hard to come up with a query that would fail because of that,
> but I wasn't able to find one. Nevertheless, seems like those fields
> should be checked.

I had referred to the functions trivial_subqueryscan() and tlist_matches_tupdesc(), which do something similar. I rechecked today and found that both of these functions have an Assert for Var.varno and Var.varlevelsup. If you think that for the current case we should have a check rather than an Assert, then I can update the patch for the same.

> On a more trivial note, equivalent_tlists() seems too generic for
> this. I'd expect two tlists that contain e.g. an identical Const node
> to be equivalent, but this returns false for it.
> I'd suggest calling it something like tlist_needs_projection()
> instead. Also, it probably belongs in tlist.c.

You are right. I shall update this in the patch once the above point about the usage of varno and varlevelsup is confirmed.

With Regards,
Amit Kapila.
[HACKERS] Ever seen transient garbage results from DELETE RETURNING?
While hacking on the writable-foreign-tables patch, my attention was drawn to what seems to be a pre-existing bug. Consider the section of ExecDelete() that computes the results for DELETE RETURNING:

    /* Process RETURNING if present */
    if (resultRelInfo->ri_projectReturning)
    {
        /*
         * We have to put the target tuple into a slot, which means first we
         * gotta fetch it.  We can use the trigger tuple slot.
         */
        TupleTableSlot *slot = estate->es_trig_tuple_slot;
        TupleTableSlot *rslot;
        HeapTupleData deltuple;
        Buffer        delbuffer;

        if (oldtuple != NULL)
        {
            deltuple.t_data = oldtuple;
            deltuple.t_len = HeapTupleHeaderGetDatumLength(oldtuple);
            ItemPointerSetInvalid(&(deltuple.t_self));
            deltuple.t_tableOid = InvalidOid;
            delbuffer = InvalidBuffer;
        }
        else
        {
            deltuple.t_self = *tupleid;
            if (!heap_fetch(resultRelationDesc, SnapshotAny,
                            &deltuple, &delbuffer, false, NULL))
                elog(ERROR, "failed to fetch deleted tuple for DELETE RETURNING");
        }

        if (slot->tts_tupleDescriptor != RelationGetDescr(resultRelationDesc))
            ExecSetSlotDescriptor(slot, RelationGetDescr(resultRelationDesc));
        ExecStoreTuple(&deltuple, slot, InvalidBuffer, false);

        rslot = ExecProcessReturning(resultRelInfo->ri_projectReturning,
                                     slot, planSlot);

        ExecClearTuple(slot);
        if (BufferIsValid(delbuffer))
            ReleaseBuffer(delbuffer);

        return rslot;
    }

In the normal code path where we're deleting from a plain table, we fetch back the just-deleted row (using SnapshotAny), compute the RETURNING expressions, then release the pin on the row's buffer. But suppose that some of the RETURNING expressions are simple Vars of pass-by-reference data types. What we will end up with after ExecProcessReturning is a virtual tuple (that is, an array of Datum values) some of which are pass-by-reference pointers pointing right into the disk buffer. As soon as we ReleaseBuffer, those are dangling pointers, because the disk buffer is now subject to change.
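The failure mode is the generic dangling-pointer-into-a-released-buffer pattern, and the cure is the generic one: deep-copy before giving up the buffer. A stand-alone C sketch of the shape of the problem (the names here are illustrative, not PostgreSQL's executor API):

```c
#include <stdlib.h>
#include <string.h>

/* A toy "virtual tuple slot": the datum may point into a shared
 * buffer that can be rewritten once its pin is released. */
typedef struct
{
    const char *datum;
} ToySlot;

/* Materialize: copy the referenced bytes into slot-owned storage,
 * analogous to calling ExecMaterializeSlot() before ReleaseBuffer().
 * Returns the allocation so the caller can free it. */
static char *
toy_materialize(ToySlot *slot)
{
    char *copy = malloc(strlen(slot->datum) + 1);

    strcpy(copy, slot->datum);
    slot->datum = copy;         /* slot no longer points at the buffer */
    return copy;
}
```

Without the copy, overwriting the shared buffer (the analogue of the page being pruned or evicted) would corrupt whatever the slot later returns.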
If the buffer did change before the RETURNING results got sent off to the client, we'd return garbage, or maybe even crash trying to interpret the fields. The obvious mechanism for this -- that the shared buffer's content gets replaced by another page -- is probably pretty improbable, because our own query has just made at least two separate touches of the page; it should have a high enough LRU count to protect it for a little while. The only other possibility I can think of is that HOT pruning or VACUUM comes along and shuffles data on the page. This is probably not very likely either, because our own query would have done a round of HOT pruning when it first touched the page; barring some commits in between, there wouldn't be much reason for the page to change further.

So this seems to be a live bug, but one that's quite unlikely to be seen in the field, and it wouldn't be reproducible at all either. I'm curious if anyone has seen symptoms that might match this.

The fix is simple: insert an ExecMaterializeSlot(rslot) before releasing the underlying data. It will add a few cycles, but there seems no help for that; the whole point is we gotta copy data off the disk buffer before releasing that pin.

regards, tom lane
Re: [HACKERS] Enabling Checksums
On 8 March 2013 03:31, Bruce Momjian <br...@momjian.us> wrote:
> I also see the checksum patch is taking a beating. I wanted to step
> back and ask what percentage of known corruption cases will this
> checksum patch detect? What percentage of these corruptions would
> filesystem checksums have detected? Also, don't all modern storage
> drives have built-in checksums, and report problems to the system
> administrator? Does smartctl help report storage corruption?
>
> Let me take a guess at answering this --- we have several layers in a
> database server:
>
>       1 storage
>       2 storage controller
>       3 file system
>       4 RAM
>       5 CPU
>
> My guess is that storage checksums only cover layer 1, while our patch
> covers layers 1-3, and probably not 4-5 because we only compute the
> checksum on write. If that is correct, the open question is what
> percentage of corruption happens in layers 1-3?

Yes, the checksums patch is taking a beating, and so it should. If we find a reason to reject it, we should.

CPU and RAM error checking are pretty standard now. Storage isn't necessarily the same. The figures we had from the Google paper early in development showed it was worth checksumming storage, but not memory. I did originally argue for memory also, but there was insufficient evidence of utility.

At the moment, we only reject blocks if the header is damaged. That covers basic sanity checks on about 10 bytes near the start of every block. Given that some errors might still be allowed through, let's say that covers just 8 bytes of the block. Checksums cover the whole block and detect most errors, 99.999% of them, which means that we will detect errors on 8192 bytes of the block. So checksums are approximately 1000 times better at spotting corruption than not using them. Or, put another way: if you don't use checksums, by the time you see a single corrupt block header you will on average have lost about 500 blocks/4MB of user data.
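The coverage argument -- a checksum over all 8192 bytes versus sanity checks on roughly 10 header bytes -- holds for any whole-page hash. The sketch below uses FNV-1a purely for illustration; it is not the algorithm the checksums patch uses:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 8192

/*
 * FNV-1a over a full page image.  Every step (xor with a byte, then
 * multiply by an odd prime mod 2^32) is injective, so any single-bit
 * flip anywhere in the 8192 bytes changes the result -- whereas
 * header-only sanity checks can only notice damage in the few bytes
 * they actually inspect.
 */
static uint32_t
page_checksum(const uint8_t *page, size_t len)
{
    uint32_t h = 2166136261u;   /* FNV 32-bit offset basis */
    size_t   i;

    for (i = 0; i < len; i++) {
        h ^= page[i];
        h *= 16777619u;         /* FNV 32-bit prime */
    }
    return h;
}
```

A verifier would store the checksum in the page header at write time and recompute it at read time; a mismatch means the page changed somewhere between leaving the buffer cache and coming back.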
That doesn't sound too bad, but if your database has been giving wrong answers during the period those blocks went bad, you could be looking at a significant number of reads/writes gone bad, since updates would spread corruption to other rows and data would be retrieved incorrectly over a long period.

I agree with Robert's comments. This isn't a brilliant design; it's a brilliant stop-gap until we get a better design. However, that is a whole chunk of work away, with pg_upgrade handling of on-disk page rewrites, plus some as yet undecided redesign of the way hint bits work. It's a long way off.

There are performance wrinkles also, no question. For some applications, not losing data is worth the hit. Given that the patch offers the choice to users, I think it's acceptable to look towards committing it.

-- 
Simon Riggs                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Btrfs clone WIP patch
On 3/1/13 1:40 AM, Jonathan Rogers wrote:
> I've been thinking about both of these issues and decided to try a
> different approach. This patch adds GUC options for two external
> commands.

This is a reasonable approach for a proof-of-concept patch. I like the idea you're playing with here, as a useful boost for both production deployments and development synchronization jobs. There are some major coding hurdles for this idea to go beyond proof of concept into committed feature though, so I'm just going to dump all the obvious ones I can see on you too. Feel free to ignore those and focus on the fun usability/capability parts for a while first. I just want you to realize what work is in your way, even if it's not visible to you yet.

With so many GUCs already, adding more of them to support something with a fairly narrow user base will be hard to justify committing. This sort of thing seems like you could load it as an extension instead. I don't think this is worth doing yet, though. For now the GUC approach is fine, and it keeps your code sample small.

I'm going to start with how to continue validating this idea is useful without worrying about those, to get some more testing/benchmarking samples; for all of that job, what you're doing so far is fine. The road toward something to commit is going to take a lot more complicated code eventually though. There's no big conceptual leap involved; it's just that where the changes need to go, and how much they need to touch to do this accurately in all cases, isn't as simple as your demo. Handling unusual situations and race conditions turns out to be almost all the code in what you're replacing, and you'll need similar work in the replacements.

> I have just been experimenting with ZFS and it does not seem to have
> any capability or interface for cloning ordinary files or directories,
> so the configuration is not as straightforward.
> However, I was able to set up a Postgres cluster as a hierarchy of ZFS
> file systems in the same pool, with each directory under base being a
> separate file system, and configure Postgres to call shell scripts
> which call zfs snapshot and clone commands to do the cloning and
> deleting.

To make things easier for a reviewer, it's useful to include working examples like this as part of your submission. For this one that would include:

- A sample script that creates a test pool
- Working postgresql.conf changes compatible with ZFS
- A test program showing how to exercise the feature

I think I can see how to construct such an example for the btrfs version, but having you show that explicitly (preferably with a whole sample session executing it) will also help reviewers. Remember: if you want to get your submission off to a good start, the reviewer should be able to run your sample test, see the code work, and do something fun within a few seconds of compiling it. Make that easy for them, and your reviewer will start with a good impression of you and a positive outlook for the change.

Now onto the code nitpicking!

= Extension vs. GUC =

In addition to not polluting postgresql.conf.sample, there's another reason this might make for better extension material eventually. Adding new execv calls that are building strings like this is just generally a risky thing. It would be nice from a security perspective if that entire mechanism weren't even introduced into the server at all unless someone loaded the extension. An extension implementation will end up being more code, both to add a useful hook to replace these calls and for the extension packaging itself. Having a bit more code in contrib/ but less in src/ is probably a net positive trade though.

= Diff reduction / code refactoring =

Looks like you added a "File Operation Options" entry into guc.c but then didn't use it here?
I would just keep this in an existing category for now; try to reduce the diff length of the proof-of-concept version as much as possible in the beginning.

On the topic of smaller diffs, the similar cut-and-paste sections of the two entries that both do fork/exec/waitpid should really be refactored into one function. The archiver does something similar for running archive_command; there may be some reuse or refactoring of its code possible to present this interface. Again, this sort of refactoring is not necessary in a POC patch, but it will probably come up if this moves toward commit candidate.

= Recursion and directory navigation =

> In either case, the directories are copied recursively while the
> Postgres internal copydir function does not recurse. I don't think
> that should be a problem since there shouldn't be nested directories
> in the first place.

copydir takes an option for whether it should recurse or not.

The rm side of this makes me twitch for a number of reasons. First off, there's just the general scariness of the concept of shelling out to run rm recursively with some text string you
Re: [HACKERS] [pgsql-advocacy] Call for Google Summer of Code mentors, admins
On 9 March 2013 01:01, Josh Berkus <j...@agliodbs.com> wrote:
> Thom. I don't mind being an admin again. Can you gather together all
> of the projects suggested on this thread and use them to create
> updated text for the GSOC page? If you don't have web repo access, I
> can create a patch, but if you can do the text, that would be a big
> help.

Okay, I've pushed some changes to the repo for 2013 GSoC, and purged the varnish cache so that it's visible immediately:
http://www.postgresql.org/developer/summerofcode/

I've also created a new GSoC 2013 wiki page with mentor volunteers and project ideas submitted so far:
https://wiki.postgresql.org/wiki/GSoC_2013

-- 
Thom
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On Sat, Mar 9, 2013 at 1:31 PM, Michael Paquier <michael.paqu...@gmail.com> wrote:
> On Sat, Mar 9, 2013 at 1:37 AM, Fujii Masao <masao.fu...@gmail.com> wrote:
>> +     <para>
>> +      Concurrent indexes based on a <literal>PRIMARY KEY</> or an
>> +      <literal>EXCLUSION</> constraint need to be dropped with
>> +      <literal>ALTER TABLE
>>
>> Typo: s/EXCLUSION/EXCLUDE
>
> Thanks. This is corrected.
>
>> I encountered a segmentation fault when I ran REINDEX CONCURRENTLY.
>> The test case to reproduce the segmentation fault is:
>> 1. Install btree_gist
>> 2. Run btree_gist's regression test (i.e., make installcheck)
>> 3. Log in to the contrib_regression database after the regression test
>> 4. Execute REINDEX TABLE CONCURRENTLY moneytmp
>
> Oops. I simply forgot to take into account the case of system
> attributes when building column names in index_concurrent_create.
> Fixed in new version attached.

Thanks for updating the patch!

I found the problem that the patch changes the behavior of ALTER TABLE SET TABLESPACE so that it also moves the indexes on the specified table to the new tablespace. Per the documentation of ALTER TABLE, this is not the right behavior.

I think that it's worth adding a new option for concurrent rebuilding into the reindexdb command. It's better to implement this separately from the core patch, though.

You need to add a description of the locking of REINDEX CONCURRENTLY into mvcc.sgml, I think.

+   Rebuild a table concurrently:
+
+<programlisting>
+REINDEX TABLE CONCURRENTLY my_broken_table;

Obviously REINDEX cannot rebuild a table ;)

Regards,

-- 
Fujii Masao
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On Fri, Mar 8, 2013 at 1:46 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> Why do you want to temporarily mark it as valid? I don't see any
> requirement that it is set to that during validate_index() (which imo
> is badly named, but...). I'd just set it to valid in the same
> transaction that does the swap.

+1. I still cannot see why the isprimary flag needs to be set even on the invalid index. With the current patch, we can easily get into an inconsistent situation, i.e., a table having more than one primary key index.

Regards,

-- 
Fujii Masao
Re: [HACKERS] Why do we still perform a check for pre-sorted input within qsort variants?
On Sat, Mar 9, 2013 at 10:32 AM, Dann Corbit <dcor...@connx.com> wrote:
> There is no such thing as a quicksort that never goes quadratic. It
> was formally proven.

The median of medians selection of the pivot gives you O(n*log(n)).

-- 
greg
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On Sun, Mar 10, 2013 at 3:48 AM, Fujii Masao <masao.fu...@gmail.com> wrote:
> Thanks for updating the patch!

-             SELECT reltoastidxid
-             FROM info_rels i JOIN pg_catalog.pg_class c
-             ON i.reloid = c.oid));
+             SELECT indexrelid
+             FROM info_rels i
+             JOIN pg_catalog.pg_class c
+             ON i.reloid = c.oid
+             JOIN pg_catalog.pg_index p
+             ON i.reloid = p.indrelid
+             WHERE p.indexrelid >= %u ", FirstNormalObjectId));

This new SQL doesn't seem to be right. The old one doesn't pick up any indexes other than the toast index, but the new one seems to.

Regards,

-- 
Fujii Masao
[HACKERS] Re: Why do we still perform a check for pre-sorted input within qsort variants?
-----Original Message-----
From: gsst...@gmail.com [mailto:gsst...@gmail.com] On Behalf Of Greg Stark
Sent: Saturday, March 09, 2013 11:39 AM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?

On Sat, Mar 9, 2013 at 10:32 AM, Dann Corbit <dcor...@connx.com> wrote:
>> There is no such thing as a quicksort that never goes quadratic. It
>> was formally proven.
>
> The median of medians selection of the pivot gives you O(n*log(n)).

No. It does make O(n*n) far less probable, but it does not eliminate it. If it were possible, then introspective sort would be totally without purpose. And yet introspective sort is the algorithm of choice for the ANSI/ISO C++ standard, directly because it will never go quadratic.

Bentley and McIlroy's "Engineering a Sort Function" says this about their final qsort source code model chosen for their implementation, in a footnote:

   Of course, quadratic behavior is still possible. One can generate
   fiendish inputs by bugging Quicksort: Consider key values to be
   unknown initially. In the code for selecting a partition element,
   assign values in increasing order as unknown keys are encountered.
   In the partitioning code, make unknown keys compare high.

Interestingly, some standard library qsort routines will go quadratic if simply given a set that contains random 0 and 1 values for the input set.

It has been formally proven that no matter what method is used to choose the median (other than calculating the exact median of every partition, which would make quicksort slower than heapsort), there exists an adversary distribution that makes quicksort go quadratic. (Unfortunately, I can't find the proof or I would give you the citation.) Median selection that is randomized does not select the true median unless it operates over all the elements of the entire partition. With some small probability, it makes the worst possible choice for the median.
It also does not matter whether you choose the first, last, and middle elements for a median of three (or other evenly selected points for larger sample sets) or you select the sample with randomization. Both methods have adversaries. Now, the quickselect algorithm can select the median in O(n), but that is again an average. There are also adversary distributions for quickselect.

On the other hand, various safeguards can make O(n*n) extremely improbable {even probabilities approaching zero}. There is a tradeoff with more and more complicated safeguards. If I choose large subsets of a partition and calculate the true median of the subset, it will make the routine safer from quadratic behavior, but it will also make it slower. At some point, you end up with a routine no faster than heapsort, while also being much more complex and therefore more difficult to maintain. Hence the necessity of introspective sort.

On the other hand, there are also data-specific sorts that have very special properties. There are cache-conscious versions of burstsort that can sort strings much faster than quicksort or even radix sort, according to measurements. There is a special version of LSD radix sort that will sort an array in one counting pass and N distribution passes, where N is the number of radix-sized digits in the key. So, for instance, if you are sorting 4-byte integers and your radix is 256 {one byte}, then you can sort in 5 passes: one pass to count the bytes in each integer by position, and 4 passes to distribute the data. This can also be done in place.

MSD radix sort can also be used as an outer sort container which sorts by distribution until the partitions are small enough and then switches to introspective sort (which eventually switches to insertion sort). Radix sorts can also be made totally generic via callback routines, like quicksort. Instead of a comparison operator, the callback gets a request for the radix bits for a specific bin number and then returns the radix-count bits for that specific bin.
For unsigned char and radix 256, this is trivially character [i] from the string for bin [i], for example.

I guess improving sort routines all boils down to how much energy you want to put into it. Donald Knuth once stated that, according to measurement, about 40% of business-related mainframe computer cycles are involved in ordering data. So it is clearly an important problem.
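The five-pass scheme described in the message above (one counting pass that histograms all four byte positions, then four distribution passes) can be sketched for 32-bit unsigned keys. This version uses a scratch array for clarity; the in-place variant mentioned above is more involved:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * LSD radix sort of uint32 keys, radix 256: pass 1 counts every byte
 * position in a single sweep, passes 2-5 distribute by byte 0..3.
 * Each distribution pass is stable, so after the last (most
 * significant) byte the whole array is sorted.
 */
static void
radix_sort_u32(uint32_t *a, size_t n)
{
    size_t    count[4][256] = {{0}};
    uint32_t *tmp = malloc(n * sizeof(uint32_t));
    size_t    i;
    int       pass;

    if (tmp == NULL)
        return;                 /* illustrative code: bail quietly */

    for (i = 0; i < n; i++)     /* pass 1: histogram all four bytes */
        for (pass = 0; pass < 4; pass++)
            count[pass][(a[i] >> (8 * pass)) & 0xFF]++;

    for (pass = 0; pass < 4; pass++) {  /* passes 2-5: distribute */
        size_t offs[256], sum = 0;

        for (i = 0; i < 256; i++) {     /* prefix sums -> bin offsets */
            offs[i] = sum;
            sum += count[pass][i];
        }
        for (i = 0; i < n; i++)
            tmp[offs[(a[i] >> (8 * pass)) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof(uint32_t));
    }
    free(tmp);
}
```

Note the pass count is 1 + (key bytes), independent of n, which is why this wins on large arrays of fixed-width keys.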
[HACKERS] Re: Why do we still perform a check for pre-sorted input within qsort variants?
"A Machine-Checked Proof of the Average-Case Complexity of Quicksort in Coq"
By Eelis van der Weegen and James McKinna
Institute for Computing and Information Sciences, Radboud University Nijmegen
Heijendaalseweg 135, 6525 AJ Nijmegen, The Netherlands

Contains a formal proof, validated by machine logic, that quicksort is quadratic in the worst case, and also a formal proof of the average runtime efficiency.
Re: [HACKERS] Why do we still perform a check for pre-sorted input within qsort variants?
On Sat, Mar 9, 2013 at 8:52 PM, Dann Corbit dcor...@connx.com wrote: Median of medians selection of the pivot gives you O(n*log(n)). No. It does make O(n*n) far less probable, but it does not eliminate it. If it were possible, then introspective sort would be totally without purpose.

No really, quicksort with median-of-medians pivot selection is most definitely O(n*log(n)) worst case. This is textbook stuff. In fact, even the introspective sort paper mentions it as one of the options to fail over to if the partition size isn't decreasing rapidly enough. The problem is that median of medians is O(n) rather than O(1). That doesn't change the O() growth rate, since there will be log(n) iterations, but it does contribute to the constant factor, and the end result is a constant factor larger than heap sort or merge sort.

That also explains why your reference on the quicksort adversary doesn't apply: the adversary works by ordering elements you haven't compared yet and assumes that all but O(1) elements will still be eligible for reordering.

In any case, I think you're basically right. What we have is basically a broken introspective sort that does more work than necessary and then handles fewer cases than it should. Implementing a better introspection that detects all perverse cases and does so with a lower overhead than the current check is a fine idea.

-- greg
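The textbook median-of-medians ("BFPRT") selection Greg refers to might look roughly like the sketch below. All names are illustrative and this is not PostgreSQL code; it returns the k-th smallest element (0-based) of a[0..n-1] in O(n) worst-case time, permuting the array as it goes. Used as the pivot choice inside quicksort, it is exactly what makes the O(n*log(n)) worst-case bound hold:

```c
#include <stddef.h>
#include <stdint.h>

static void
swap_u32(uint32_t *x, uint32_t *y)
{
    uint32_t t = *x;
    *x = *y;
    *y = t;
}

static void
insertion_sort_u32(uint32_t *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        for (size_t j = i; j > 0 && a[j - 1] > a[j]; j--)
            swap_u32(&a[j - 1], &a[j]);
}

uint32_t
mom_select(uint32_t *a, size_t n, size_t k)
{
    while (n > 5)
    {
        /* Sort each group of five and gather the group medians at
         * the front of the array. */
        size_t groups = 0;

        for (size_t i = 0; i + 5 <= n; i += 5)
        {
            insertion_sort_u32(a + i, 5);
            swap_u32(&a[groups], &a[i + 2]);
            groups++;
        }

        /* Recursively take the median of the medians as the pivot;
         * this is the O(n) (rather than O(1)) pivot-choice cost. */
        uint32_t pivot = mom_select(a, groups, groups / 2);

        /* Three-way partition around the pivot (duplicate-safe). */
        size_t lt = 0, gt = n, i = 0;

        while (i < gt)
        {
            if (a[i] < pivot)
                swap_u32(&a[i++], &a[lt++]);
            else if (a[i] > pivot)
                swap_u32(&a[i], &a[--gt]);
            else
                i++;
        }

        if (k < lt)
            n = lt;             /* answer is in the low part */
        else if (k < gt)
            return pivot;       /* answer is the pivot itself */
        else
        {
            a += gt;            /* answer is in the high part */
            k -= gt;
            n -= gt;
        }
    }
    insertion_sort_u32(a, n);
    return a[k];
}
```

The guarantee comes from the pivot always landing near the middle: at least roughly 30% of the elements fall on each side, so partition sizes shrink geometrically, at the price of the linear pivot-selection work Greg describes.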
[HACKERS] Re: Why do we still perform a check for pre-sorted input within qsort variants?
Yes, you are right. I knew of a median-of-medians technique for pivot selection and I mistook that for the median-of-medians median selection algorithm (which it definitely isn't). I was not aware of a true linear-time median selection algorithm (which is what median of medians accomplishes). The fastest median selection algorithm that I was aware of was quickselect, which is only linear on average. I think that your analysis is correct, at any rate. I also think I will enjoy learning and experimenting with the median of medians algorithm. I found a link about it here: http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm
Re: [HACKERS] Why do we still perform a check for pre-sorted input within qsort variants?
On Sat, Mar 9, 2013 at 10:22 PM, Dann Corbit dcor...@connx.com wrote: Yes, you are right. ... I think that your analysis is correct, at any rate.

Hm, I was using the terminology differently from the Wikipedia page. I was referring to the recursive median of 5s used for pivot selection as "median of medians", and I still called Quicksort or Quickselect using that pivot Quicksort or Quickselect with that specific pivot choice algorithm. When using that pivot choice, Quicksort is O(n*log(n)) and Quickselect ("Median of Medians" on Wikipedia) is O(n), but the constant factor becomes larger than if the pivot choice algorithm were O(1). I suppose it's more interesting in the case of Quickselect, since there are no alternative algorithms with better constant factors, whereas for sorting we have other options.

I wonder if it makes sense to use timsort as the fallback for quicksort if the partition sizes are skewed. Timsort is specifically designed to handle presorted inputs well. On the other hand, it is itself a hybrid sort, so it might be getting overly complex to make it part of a hybrid algorithm.

-- greg
[HACKERS] Re: Why do we still perform a check for pre-sorted input within qsort variants?
-Original Message- From: gsst...@gmail.com [mailto:gsst...@gmail.com] On Behalf Of Greg Stark Sent: Saturday, March 09, 2013 5:16 PM To: Dann Corbit Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?

My opinion (and it is no better than anyone else's) is that for a database you have to be very defensive in programming.
Since database systems can contain any possible distribution (and over time they will likely encounter almost every possibility), it is important to prevent unusual inputs from causing disaster. The reason we added introspection to the sort here is that we already had quicksort with an ordered-partition check along with a median-of-three sample from three medians of three (not to mention the standard recursion into the smallest partition first and the switch to insertion sort at a small enough partition size). Sure enough, there was a customer who had a data distribution that caused bad behavior. We have customers with mainframe data where a single table can be 24 GB (I know this offhand; there are sure to be some that are larger), so bad behavior on a sort *will* be noticed.

Sorry about the oddball quoting. I will look at fixing it so that my posts are not so hard to grok.
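The "median of three sample from three medians of three" pivot choice mentioned above is often called a ninther (pseudo-median of nine). A minimal sketch, with illustrative names of my own choosing:

```c
#include <stddef.h>
#include <stdint.h>

/* Median of three values. */
uint32_t
med3(uint32_t a, uint32_t b, uint32_t c)
{
    if (a < b)
        return (b < c) ? b : (a < c ? c : a);
    return (a < c) ? a : (b < c ? c : b);
}

/* Ninther: the median of three medians of three, taken from nine
 * samples spread across a[0..n-1].  For tiny n the sample positions
 * may coincide, which is harmless. */
uint32_t
ninther(const uint32_t *a, size_t n)
{
    size_t step = n / 8;

    return med3(med3(a[0], a[step], a[2 * step]),
                med3(a[3 * step], a[4 * step], a[5 * step]),
                med3(a[6 * step], a[7 * step], a[n - 1]));
}
```

This costs only a handful of comparisons per partition (O(1) pivot choice), which is why, unlike median-of-medians selection, it cannot eliminate the quadratic adversary, only make it much less likely.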
Re: [HACKERS] Btrfs clone WIP patch
Greg Smith wrote: I think I can see how to construct such an example for the btrfs version, but having you show that explicitly (preferably with a whole sample session executing it) will also help reviewers. Remember: if you want to get your submission off to a good start, the reviewer should be able to run your sample test, see the code work, and do something fun within a few seconds of compiling it. Make that easy for them, and your reviewer will start with a good impression of you and a positive outlook for the change.

Yes, an example is very simple with Btrfs, since it only requires one GUC variable and that the cluster be created on a Btrfs file system. Constructing one for ZFS is decidedly non-trivial, and I'm starting to question whether it's worth it.

Now onto the code nitpicking!

= Extension vs. GUC =

In addition to not polluting postgresql.conf.sample, there's another reason this might make for better extension material eventually. Adding new execv calls that build strings like this is just generally a risky thing. It would be nice from a security perspective if that entire mechanism weren't even introduced into the server at all unless someone loaded the extension. An extension implementation will end up being more code, both to add a useful hook to replace these calls and for the extension packaging itself. Having a bit more code in contrib/ but less in src/ and postgresql.conf is probably a net positive trade, though. I will look into how to write an extension.

= Diff reduction / code refactoring =

Looks like you added a File Operation Options entry into guc.c but then didn't use it here? I would just keep this in an existing category for now; try to reduce the diff length of the proof-of-concept version as much as possible in the beginning. Yes, that is a good idea.

On the topic of smaller diffs, the similar cut-and-paste sections of the two entries that both do fork/exec/waitpid should really be refactored into one function.
The archiver does something similar for running archive_command; there may be some reuse or refactoring of its code to present this interface. I'll look at archive_command to see what might be in common. Again, this sort of refactoring is not necessary for a POC patch, but it will probably come up if this moves toward being a commit candidate.

= Recursion and directory navigation =

In either case, the directories are copied recursively, while the Postgres internal copydir function does not recurse. I don't think that should be a problem, since there shouldn't be nested directories in the first place. copydir takes an option for whether it should recurse or not.

The rm side of this makes me twitch for a number of reasons. First off, there's just the general scariness of the concept of shelling out to run rm recursively with some text string you build. The worst time I saw a bug in that sort of code it destroyed a terabyte, and the last time I saw such a bug was only a week ago. Validating that you're doing the right thing is always job #1 when removing files.

I needed to call an external command to remove a directory only when experimenting on ZFS, since the regular implementation works fine on Btrfs. Unlike Btrfs, ZFS does not have any capability to clone individual files. Therefore, a directory under base/ has to start out as a ZFS snapshot, which can be cloned with the zfs clone command to become a directory with exactly the same contents. To remove the clone, the zfs destroy command has to be called on it. AFAICT, clone and remove operations on ZFS always operate on a whole directory at a time.

The same sort of issue is in your external_copydir. Iterating into subdirectories when it doesn't happen now just isn't safe, even though the one expected case you're hooking won't be any different. You really can't just do that. Would this work instead, and is there any concern about files that start with a .?
cp * --reflink=auto

Regardless, you need to keep most of the structure of copydir anyway. Error handling, handling cancellation, and fsync calls are all vital things. You probably have to make the forked command copy a single file at a time to get the same interrupt handling behavior.

In an earlier implementation, I did call cp --reflink=auto once per regular file, preserving the behavior of copydir. On Btrfs, this works well, though slightly slower due to the extra processes. AFAIK, there's no way to do something equivalent on ZFS without coming up with a much more complicated scheme involving both links and clones. I don't think it will be possible to implement a scheme that works on ZFS and addresses your concerns about file and directory handling that is not many times more complex than what I have so far. OTOH, I think the approach I have already implemented, which calls an external command for each regular file to copy, might be acceptable. Since I don't personally have much
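The per-file external copy discussed above could look roughly like the sketch below: fork/exec "cp --reflink=auto" once per regular file and wait for each child, so the caller keeps per-file control (and could check for interrupts between files, as copydir does). This is a hypothetical illustration, not the patch's actual code; it assumes GNU cp is on PATH, and --reflink=auto degrades to an ordinary copy on file systems without cloning support:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Clone (or copy) a single regular file via an external cp.
 * Returns 0 on success, -1 on failure. */
int
clone_one_file(const char *fromfile, const char *tofile)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        /* Child: pass an argument vector directly, so no shell is
         * involved and no command string is ever built. */
        execlp("cp", "cp", "--reflink=auto", fromfile, tofile,
               (char *) NULL);
        _exit(127);             /* exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Passing an argv rather than a shell string sidesteps the string-building scariness raised in the review, and waiting per file preserves the one-file-at-a-time error and cancellation behavior of copydir.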
Re: [HACKERS] Support for REINDEX CONCURRENTLY
On Sun, Mar 10, 2013 at 4:50 AM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Mar 10, 2013 at 3:48 AM, Fujii Masao masao.fu...@gmail.com wrote: Thanks for updating the patch!

-                 SELECT reltoastidxid
-                 FROM info_rels i JOIN pg_catalog.pg_class c
-                 ON i.reloid = c.oid));
+                 SELECT indexrelid
+                 FROM info_rels i
+                 JOIN pg_catalog.pg_class c
+                 ON i.reloid = c.oid
+                 JOIN pg_catalog.pg_index p
+                 ON i.reloid = p.indrelid
+                 WHERE p.indexrelid >= %u", FirstNormalObjectId));

This new SQL doesn't seem to be right. The old one doesn't pick up any indexes other than the toast index, but the new one seems to. Indeed, it was selecting all indexes... I replaced it with this query, reducing the selection to indexes of toast relations:

-                 SELECT reltoastidxid
-                 FROM info_rels i JOIN pg_catalog.pg_class c
-                 ON i.reloid = c.oid));
+                 SELECT indexrelid
+                 FROM pg_index
+                 WHERE indrelid IN (SELECT reltoastrelid
+                                    FROM pg_class
+                                    WHERE oid >= %u
+                                    AND reltoastrelid != %u)",
+                 FirstNormalObjectId, InvalidOid));

Will send patch soon... -- Michael
Re: [HACKERS] odd behavior in materialized view
On Fri, Mar 8, 2013 at 2:42 AM, Kevin Grittner kgri...@ymail.com wrote: Fujii Masao masao.fu...@gmail.com wrote: I found one typo in the document of MV. Please see the attached patch. Pushed. Thanks! Thanks!

I found that pg_dump always fails against a 9.2 or earlier server because of the MV patch. $ pg_dump pg_dump: [archiver (db)] query failed: pg_dump: [archiver (db)] query was: Attached patch fixes this problem. Regards, -- Fujii Masao pg_dump_bugfix_matview.patch Description: Binary data