Re: [HACKERS] SSI rw-conflicts and 2PC
On 14.02.2012 04:57, Dan Ports wrote:

Looking over the SSI 2PC code recently, I noticed that I overlooked a case that could lead to non-serializable behavior after a crash. When we PREPARE a serializable transaction, we store part of the SERIALIZABLEXACT in the statefile (in addition to the list of SIREAD locks). One of the pieces of information we record is whether the transaction had any conflicts in or out. The problem is that that can change if a new conflict occurs after the transaction has prepared. Here's an example of the problem (based on the receipt-report test):

-- Setup
CREATE TABLE ctl (k text NOT NULL PRIMARY KEY, deposit_date date NOT NULL);
INSERT INTO ctl VALUES ('receipt', DATE '2008-12-22');
CREATE TABLE receipt (receipt_no int NOT NULL PRIMARY KEY, deposit_date date NOT NULL, amount numeric(13,2));

-- T2
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO receipt VALUES (3, (SELECT deposit_date FROM ctl WHERE k = 'receipt'), 4.00);
PREPARE TRANSACTION 't2';

-- T3
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE ctl SET deposit_date = DATE '2008-12-23' WHERE k = 'receipt';
COMMIT;

-- T1
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM ctl WHERE k = 'receipt';
SELECT * FROM receipt WHERE deposit_date = DATE '2008-12-22';
COMMIT;

Running this sequence of transactions normally, T1 will be rolled back because of the pattern of conflicts T1 - T2 - T3, as we'd expect. This should still be true even if we restart the database before executing the last transaction -- but it's not. The problem is that, when T2 prepared, it had no conflicts, so we recorded that in the statefile. The T2 - T3 conflict happened later, so we didn't know about it during recovery. I discussed this a bit with Kevin and we agreed that this is important to fix, since it's a false negative that violates serializability. The question is how to fix it. There are a couple of options...
The easiest answer would be to just treat every prepared transaction found during recovery as though it had a conflict in and out. This is roughly a one-line change, and it's certainly safe. But the downside is that this is pretty restrictive: after recovery, we'd have to abort any serializable transaction that tries to read anything that a prepared transaction wrote, or modify anything that it read, until that transaction is either committed or rolled back.

+1 for this solution.

To do better than that, we want to know accurately whether the prepared transaction had a conflict with a transaction that prepared or committed before the crash. We could do this if we had a way to append a record to the 2PC statefile of an already-prepared transaction -- then we'd just add a new record indicating the conflict. Of course, we don't have a way to do that. It'd be tricky to add support for this, since it has to be crash-safe, so the question is whether the improved precision justifies the complexity it would require.

Not worth the complexity, IMO. Perhaps it would be simpler to add the extra information to the commit records of the transactions that commit after the first transaction is prepared. In the commit record, you would include a list of prepared transactions that this transaction conflicted with. During recovery, you would collect those lists in memory, and use them at the end of recovery to flag the conflicts in prepared transactions that are still in prepared state.

A third option is to observe that the only conflicts *in* that matter for a recovered prepared transaction are from other prepared transactions. So we could have prepared transactions include in their statefile the xids of any prepared transactions they conflicted with at prepare time, and match them up during recovery to reconstruct the graph. This is a middle ground between the other two options. It doesn't require modifying the statefile after prepare.
However, conflicts *out* to non-prepared transactions do matter, and this doesn't record those, so we'd have to do the conservative thing -- which means that after recovery, no one can read anything a prepared transaction wrote. This would be fairly simple to do, but I'm not sure it's worth it, either. The nasty thing about this whole thing is precisely that no one can read anything the prepared transaction wrote, so making the conflict-in bookkeeping more accurate doesn't seem very helpful. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
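As an aside, the "dangerous structure" behind the receipt example can be modeled as a tiny rw-conflict graph. This is an illustrative Python sketch, not PostgreSQL code; the transaction names and the `pivots` helper are invented for this example:

```python
# Toy model of the receipt example's rw-antidependencies.
# An edge R -> W means reader R has a rw-conflict out to writer W
# (R read data that W later wrote). SSI's dangerous structure centers
# on a "pivot": a transaction with both a conflict in and a conflict out.

def pivots(rw_out):
    """Return transactions that have both an incoming and outgoing conflict."""
    has_out = set(rw_out)
    has_in = {w for writers in rw_out.values() for w in writers}
    return has_out & has_in

# Before the crash: T1 read receipt rows that prepared T2 wrote, and T2
# read the ctl row that T3 updated -- the T1 - T2 - T3 pattern above.
rw_out = {"T1": {"T2"}, "T2": {"T3"}}
assert pivots(rw_out) == {"T2"}  # a pivot exists, so something must abort

# After a crash, T2's statefile says "no conflicts" (the T2 -> T3 edge
# arose after PREPARE), so the edge is lost and no pivot is detected:
rw_out_recovered = {"T1": {"T2"}}
assert pivots(rw_out_recovered) == set()  # the false negative

# The proposed one-line fix: pretend every recovered prepared transaction
# already has a conflict in and out, forcing the conservative aborts.
```

The model only shows why losing the post-PREPARE edge produces a false negative; the real check in predicate.c also considers commit ordering.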
Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
(2012/02/14 15:15), Shigeru Hanada wrote: (2012/02/13 20:50), Etsuro Fujita wrote: The patches have been applied, but role-related regression tests failed in my environment. I fixed it in a similar fashion to /src/test/regress/sql/foreign_data.sql. Please find attached an updated patch for the regression tests. Good catch, thanks. I'll revise the pgsql_fdw tests a little more. BTW, what do you think about this? http://archives.postgresql.org/pgsql-hackers/2012-01/msg00229.php I'm sorry that I've left the thread unfinished... I've given up on proposing join push-down of foreign tables for 9.2, because it will take a while to achieve general semantics mapping for join push-down and WHERE clause push-down. For 9.2, I'm proposing pgsql_fdw, which has WHERE clause push-down for built-in elements that are free from collation. I'd like to go back to that item after 9.2 development enters beta or RC, hopefully :) OK. But my question was about the PlanForeignScan API. As discussed in that thread, it would have to change the PlanForeignScan API to let the FDW generate multiple paths and dump them all to add_path instead of returning a FdwPlan struct. With this change, I think it would also have to add a new FDW API that is called from create_foreignscan_plan() and lets the FDW generate a foreign scan plan for the base relation scanned by the best path chosen by the postgres optimizer. What do you think about it? Best regards, Etsuro Fujita
Re: [HACKERS] Memory usage during sorting
On Sat, Feb 11, 2012 at 11:34 AM, Jeff Janes jeff.ja...@gmail.com wrote: On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada umi.tan...@gmail.com wrote: On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes jeff.ja...@gmail.com wrote: The attached patch allows it to reuse that memory. On my meager system it reduced the building of an index on an integer column in a skinny 200 million row totally randomly ordered table by about 3% from a baseline of 25 minutes. Just to give a standard review, this patch is a one-line change, applies cleanly, and builds OK. I'm not quite sure what exactly you're trying to accomplish, but it seems to me that it's avoiding the first dumptuples cycle by forcing availMem = 0 even if it's negative. Yes. Currently when it switches to the TSS_BUILDRUNS part of a tape-sort, it starts by calling WRITETUP a large number of times consecutively, to work off the memory deficit incurred by the 3 blocks per tape of tape overhead, and then after that calls WRITETUP about once per puttuple. Under my patch, it would only call WRITETUP about once per puttuple, right from the beginning. I read your comments as saying it'd avoid alternating reads and writes back and forth over scattered memory, thrashing the memory cache during the merge phase, but actually it doesn't affect the merge phase, only the init-dump phase, does it? It affects the building of the runs. But this building of the runs is not a simple dump, it is itself a mini merge phase, in which it merges the existing in-memory priority queue against the still-incoming tuples from the node which invoked the sort. By using less memory than it could, this means that the resulting runs are smaller than they could be, and so will sometimes necessitate an additional layer of merging later on. (This effect is particularly large for the very first run being built. Generally by merging incoming tuples into the memory-tuples, you can create runs that are 1.7 times the size of what fits in memory.
By wasting some memory, we are getting 1.7 times the size of a smaller starting point. But for the first run, it is worse than that. Most of the benefit that leads to that 1.7 multiplier comes at the very early stage of each run-build. But by initially using the full memory, then writing out a bunch of tuples without doing any merge of the incoming, we have truncated the part that gives the most benefit.) My analysis that the freed memory is never reused (because we refuse to reuse it ourselves and it is too fragmented to be reused by anyone else, like the palloc or VM system) only applies to the run-building phase. So "never" was a bit of an overstatement. By the time the last initial run is completely written out to tape, the heap used for the priority queue should be totally empty. So at this point the allocator would have the chance to congeal all of the fragmented memory back into larger chunks, or maybe it parcels the allocations back out again in an order so that the unused space is contiguous and could be meaningfully paged out. But is it worth worrying about how much we fragment memory and whether we overshoot our promises by 10 or 20%? If so, I'm not so convinced your benchmark gave a 3% gain by this change. Correct me as I'm probably wrong. I've now done more complete testing. Building an index on a 200,000,000 row table with an integer column populated in random order with integers from 1..500,000,000, non-unique, on a machine with 2GB of RAM and 600MB of shared_buffers. It improves things by 6-7 percent at one end of the working mem size range; the rest are in the noise except at 77936 KB, where it reproducibly makes things 4% worse, for reasons I haven't figured out. So maybe the best thing to do is, rather than micromanaging memory usage, simply don't set maintenance_work_mem way too low. (But that is the default.)
I've tested here with only a million rows of mixed integer/text (table size is 80MB or so) with default settings, running CREATE INDEX continuously, but couldn't find a performance improvement. The numbers vary from -2% to +2%, which I think is just noise. While I agree with your point that avoiding the first dump would make sense, I guess it depends on the situation; if the dump goes with lots of tuples (which should happen when availMem is big), writing many tuples at a time will be faster than writing little by little later. I'm not sure about the conclusion, but given this discussion, I'm inclined to mark this Returned with Feedback. Thanks, -- Hitoshi Harada
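The "mini merge phase" Jeff describes is classic replacement selection: merging still-incoming tuples into the in-memory priority queue yields runs longer than memory. This standalone Python sketch (not the tuplesort.c code) demonstrates the effect; the classic expected run length for random input is about 2x memory, while the email's 1.7x figure is what was observed in practice:

```python
import heapq
import random

def replacement_selection(values, mem):
    """Form sorted runs using a bounded priority queue of `mem` entries.

    Heap entries are (run_number, value): an incoming value smaller than
    the one just written cannot extend the current run, so it is tagged
    for the next run and sorts after everything in the current one.
    """
    it = iter(values)
    heap = [(0, v) for _, v in zip(range(mem), it)]  # fill memory
    heapq.heapify(heap)
    runs, current, current_run = [], [], 0
    while heap:
        run, v = heapq.heappop(heap)
        if run != current_run:           # current run exhausted
            runs.append(current)
            current, current_run = [], run
        current.append(v)                # "write v to tape"
        nxt = next(it, None)
        if nxt is not None:              # merge in a still-incoming tuple
            heapq.heappush(heap, (run if nxt >= v else run + 1, nxt))
    runs.append(current)
    return runs

random.seed(42)
data = [random.random() for _ in range(20000)]
runs = replacement_selection(data, 100)
avg = len(data) / len(runs)
print(f"{len(runs)} runs, average length {avg:.0f} vs memory of 100")
```

Each run comes out sorted and, on random input, roughly twice the size of the 100-entry "memory" -- which is why shrinking the effective memory during run-building shrinks the runs and can force an extra merge pass.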
Re: [v9.2] LEAKPROOF attribute of FUNCTION (Re: [HACKERS] [v9.2] Fix Leaky View Problem)
2012/2/14 Robert Haas robertmh...@gmail.com: On Mon, Feb 13, 2012 at 7:51 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: I rebased the patch due to the updates of pg_proc.h. Please see the newer one. Thanks, Thanks, committed. I think, though, that some further adjustment is needed here, because you currently can't do ALTER FUNCTION ... NO LEAKPROOF, which seems unacceptable. It's fairly clear why not, though: you get a grammar conflict, because the parser allows this: create or replace function z() returns int as $$select 1$$ language sql set transaction not deferrable; However, since that syntax doesn't actually work, I'm thinking we could just refactor things a bit to reject that at the parser stage. The attached patch adopts that approach. Anyone have a better idea? I could not find out where the grammar conflict originates, although it does not conflict with any options within ALTER FUNCTION. Do you think the idea of ALTER ... NOT LEAKPROOF should be integrated within the v9.2 timeline also? I also think we ought to stick create_function_3 into one of the parallel groups in the regression tests, if possible. Can you investigate that? Not yet. This test does not have dependencies on other tests, so I'm optimistic that create_function_3 can run concurrently. Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp
Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
(2012/02/14 17:40), Etsuro Fujita wrote: OK. But my question was about the PlanForeignScan API. Sorry for misunderstanding. :( As discussed at that thread, it would have to change the PlanForeignScan API to let the FDW generate multiple paths and dump them all to add_path instead of returning a FdwPlan struct. With this change, I think it would also have to add a new FDW API that is called from create_foreignscan_plan() and lets the FDW generate a foreign scan plan for the base relation scanned by the best path chosen by the postgres optimizer. What do you think about it? Though I have only random thoughts about this issue at the moment... Generating multiple valuable Paths for a scan of a foreign table by an FDW would be useful, but changing PlanForeignScan to return a list of FdwPlans in 9.2 seems too hasty. It would need more consideration of a general interface for the possible results, such as:

* Full output (no WHERE push-down) is expensive on both the remote side and the transfer.
* Filtered output (WHERE push-down) has cheap total cost when only a few rows come through the filter.
* Ordered output (ORDER BY push-down) is expensive on the remote side, but has a chance to omit an upper Sort node.
* Aggregated output (GROUP BY push-down) is expensive on the remote side, but has a chance to omit an upper Agg node, and reduces data transfer.
* Limited output (LIMIT/OFFSET push-down) can reduce data transfer, and has a chance to omit an upper Limit node.

Currently FDWs can consider only the first two, AFAIK. If an FDW generates multiple FdwPlans (Full and Filtered) and sets different start-up and total costs on them (maybe the former has higher start-up and lower total cost than the latter), the planner would choose the better one for the whole plan. In addition to changing FdwRoutine, it seems worth changing FdwPlan too so that FDWs can return more information to the planner, such as pathkeys and rows, for each possible path. In short, I have some ideas for enhancing foreign table scans, but IMO they are half-baked and we don't have enough time to achieve them for 9.2.
Regards, -- Shigeru Hanada
Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
(2012/02/14 19:42), Shigeru Hanada wrote: (2012/02/14 17:40), Etsuro Fujita wrote: As discussed at that thread, it would have to change the PlanForeignScan API to let the FDW generate multiple paths and dump them all to add_path instead of returning a FdwPlan struct. With this change, I think it would also have to add a new FDW API that is called from create_foreignscan_plan() and lets the FDW generate a foreign scan plan for the base relation scanned by the best path chosen by the postgres optimizer. What do you think about it? In short, I have some ideas for enhancing foreign table scans, but IMO they are half-baked and we don't have enough time to achieve them for 9.2. OK. Thank you for your answer. Best regards, Etsuro Fujita
Re: [HACKERS] pg_stats_recovery view
Excerpts from Jaime Casanova's message of mar feb 14 04:10:58 -0300 2012: On Thu, Feb 2, 2012 at 2:32 AM, Magnus Hagander mag...@hagander.net wrote: I haven't looked through the code in detail, but one direct comment: do we really need/want to send this through the stats collector? It will only ever have one sender - perhaps we should just either store it in shared memory or store it locally and only send it on demand? fyi, i intend to send a reworked patch later today, it will store the info locally and send it on demand. about the _short_desc functions, i added that because i wanted to understand what was happening during recovery and the wal_record_type (xl_info) being a number is not that clear Maybe it'd be clearer if you display it in hex and filter out just the bits that are interesting for this use? IIRC xl_info carries some other bits than the ones to identify the record type, which could be confusing. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [v9.2] LEAKPROOF attribute of FUNCTION (Re: [HACKERS] [v9.2] Fix Leaky View Problem)
On Tue, Feb 14, 2012 at 4:55 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: I could not find out where the grammar conflict originates, although it does not conflict with any options within ALTER FUNCTION. Do you think the idea of ALTER ... NOT LEAKPROOF should be integrated within the v9.2 timeline also? Yes. Did you notice that I attached a patch to make that work? I'll commit that today or tomorrow unless someone comes up with a better solution. I also think we ought to stick create_function_3 into one of the parallel groups in the regression tests, if possible. Can you investigate that? Not yet. This test does not have dependencies on other tests, so I'm optimistic that create_function_3 can run concurrently. Me, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [v9.2] LEAKPROOF attribute of FUNCTION (Re: [HACKERS] [v9.2] Fix Leaky View Problem)
2012/2/14 Robert Haas robertmh...@gmail.com: On Tue, Feb 14, 2012 at 4:55 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: I could not find out where the grammar conflict originates, although it does not conflict with any options within ALTER FUNCTION. Do you think the idea of ALTER ... NOT LEAKPROOF should be integrated within the v9.2 timeline also? Yes. Did you notice that I attached a patch to make that work? I'll commit that today or tomorrow unless someone comes up with a better solution. Yes. I'll be available to work on the feature based on this patch. It was a headache of mine to implement an ALTER statement to add/remove the leakproof attribute. I also think we ought to stick create_function_3 into one of the parallel groups in the regression tests, if possible. Can you investigate that? Not yet. This test does not have dependencies on other tests, so I'm optimistic that create_function_3 can run concurrently. Me, too. I tried to move create_function_3 into the group of create_view and create_index, and it works correctly. Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp
Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
Shigeru Hanada shigeru.han...@gmail.com writes: (2012/02/14 17:40), Etsuro Fujita wrote: As discussed at that thread, it would have to change the PlanForeignScan API to let the FDW generate multiple paths and dump them all to add_path instead of returning a FdwPlan struct. Generating multiple valuable Paths for a scan of a foreign table by an FDW would be useful, but changing PlanForeignScan to return a list of FdwPlans in 9.2 seems too hasty. I would really like to see that happen in 9.2, because the longer we let that mistake live, the harder it will be to change. More and more FDWs are getting written. I don't think it's that hard to do: we just have to agree that PlanForeignScan should return void and call add_path for itself, possibly more than once. If we do that, I'm inclined to think we could get rid of the separate Node type FdwPlan, and just incorporate List *fdw_private into ForeignPath and ForeignScan. This does mean that FDWs will be a bit more tightly coupled to the planner, because they'll have to change whenever we add new fields to struct Path; but that is not really something that happens often. regards, tom lane
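The shape of the proposed API change can be sketched as a toy model. This is plain Python, not the real C planner interface; the class, the cost numbers, and the predicate string are all invented for illustration -- only the names `add_path`, `PlanForeignScan`, and `fdw_private` come from the discussion:

```python
# Toy model: instead of returning one FdwPlan, the FDW's planning hook
# calls add_path() once per strategy it can offer, and the core planner
# keeps the cheapest. fdw_private carries per-path FDW state, as Tom
# suggests folding into ForeignPath/ForeignScan.

class ForeignPath:
    def __init__(self, desc, startup_cost, total_cost, fdw_private=None):
        self.desc = desc
        self.startup_cost = startup_cost
        self.total_cost = total_cost
        self.fdw_private = fdw_private

rel_paths = []

def add_path(path):          # stand-in for the planner's add_path()
    rel_paths.append(path)

def plan_foreign_scan():     # stand-in for an FDW's PlanForeignScan hook
    # Offer both strategies instead of committing to one up front:
    add_path(ForeignPath("full remote scan", 5.0, 1000.0))
    add_path(ForeignPath("WHERE push-down", 20.0, 120.0,
                         fdw_private=["deposit_date = DATE '2008-12-22'"]))

plan_foreign_scan()
best = min(rel_paths, key=lambda p: p.total_cost)
print(best.desc)  # prints "WHERE push-down"
```

The point of the design is visible even in miniature: once the hook can submit several costed paths, Hanada's ordered/aggregated/limited variants become just more `add_path` calls, and the planner's normal cost comparison arbitrates among them.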
Re: [HACKERS] SSI rw-conflicts and 2PC
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 14.02.2012 04:57, Dan Ports wrote: Looking over the SSI 2PC code recently, I noticed that I overlooked a case that could lead to non-serializable behavior after a crash. When we PREPARE a serializable transaction, we store part of the SERIALIZABLEXACT in the statefile (in addition to the list of SIREAD locks). One of the pieces of information we record is whether the transaction had any conflicts in or out. The problem is that that can change if a new conflict occurs after the transaction has prepared. I discussed this a bit with Kevin and we agreed that this is important to fix, since it's a false negative that violates serializability. The question is how to fix it. There are a couple of options... The easiest answer would be to just treat every prepared transaction found during recovery as though it had a conflict in and out. This is roughly a one-line change, and it's certainly safe. Dan, could you post such a patch, please? But the downside is that this is pretty restrictive: after recovery, we'd have to abort any serializable transaction that tries to read anything that a prepared transaction wrote, or modify anything that it read, until that transaction is either committed or rolled back. +1 for this solution. +1 for 9.2 and backpatching this; with the notion that we might be able to do better in some later release. (A TODO entry?) Should we add anything to the docs to warn people that if they crash with serializable prepared transactions pending, they will see this behavior until the prepared transactions are either committed or rolled back, either by the transaction manager or through manual intervention? Perhaps it would be simpler to add the extra information to the commit records of the transactions that commit after the first transaction is prepared. In the commit record, you would include a list of prepared transactions that this transaction conflicted with. 
During recovery, you would collect those lists in memory, and use them at the end of recovery to flag the conflicts in prepared transactions that are still in prepared state. That indeed seems simpler. I'm not even sure that you would need to build a list and process it at the end; couldn't this be done as the commit records are replayed? Keep in mind that if the prepared transaction is not still pending, the information can be safely ignored, and if it *is* still pending you don't need to know *which* transaction it had the conflict with, because it will certainly have committed before the start of any post-recovery transaction. A third option is to observe that the only conflicts *in* that matter for a recovered prepared transaction are from other prepared transactions. So we could have prepared transactions include in their statefile the xids of any prepared transactions they conflicted with at prepare time, and match them up during recovery to reconstruct the graph. This is a middle ground between the other two options. It doesn't require modifying the statefile after prepare. However, conflicts *out* to non-prepared transactions do matter, and this doesn't record those, so we'd have to do the conservative thing -- which means that after recovery, no one can read anything a prepared transaction wrote. This would be fairly simple to do, but I'm not sure it's worth it, either. The nasty thing about this whole thing is precisely that no one can read anything the prepared transaction wrote, so making the conflict-in bookkeeping more accurate doesn't seem very helpful. Yeah, the benefit of this would be marginal without solving the other side of the problem; but if we're adding TODO entries for this area, perhaps they should be two separate entries, because either side of this could be done without touching the other.
To summarize the above discussion, there is a bug that can be hit when using both SSI and 2PC if a crash or shutdown occurs while any serializable prepared transactions are pending and certain other conditions are met. The proposed quick fix would be to cause a serialization failure after recovery on any attempt by a serializable transaction to read data written by a serializable prepared transaction that was pending when a crash or shutdown occurred, and on any attempt by a serializable transaction to do a write which conflicts with a predicate lock acquired by such a prepared transaction. This would tend to be more than a little inconvenient until the prepared statements pending at crash or shutdown were all committed or rolled back. A more sophisticated solution is available that could be implemented in 9.3 or later. -Kevin
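Heikki's commit-record idea, with Kevin's refinement of flagging conflicts as records are replayed rather than collecting a list, can be sketched in miniature. This is an illustrative Python model, not PostgreSQL's recovery code; the record shapes, field names, and xids are all invented:

```python
# Toy replay loop: each commit record optionally carries the xids of
# still-prepared transactions the committer had rw-conflicts with.
# Replay flags the conflict immediately; the direction and other
# bookkeeping the real code would need are omitted.

# Prepared transactions reconstructed from 2PC statefiles at recovery
# start; their statefiles said "no conflicts" at PREPARE time.
prepared = {"t2": {"has_conflict": False}}

wal = [
    # t3's commit record records its conflict with prepared t2
    # (the T2 -> T3 conflict from the receipt example).
    {"type": "commit", "xid": "t3", "conflicts_with_prepared": ["t2"]},
    # t9 committed or aborted before the crash, so per Kevin's note
    # this entry can be safely ignored during replay.
    {"type": "commit", "xid": "t4", "conflicts_with_prepared": ["t9"]},
]

for rec in wal:
    for xid in rec.get("conflicts_with_prepared", []):
        if xid in prepared:
            # No need to remember *which* transaction conflicted: it
            # certainly committed before any post-recovery transaction.
            prepared[xid]["has_conflict"] = True

print(prepared)  # t2 now carries the conflict the statefile missed
```

This recovers exactly the information the statefile-only approach loses: conflicts that arose between PREPARE and the crash, at the cost of widening the commit record.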
Re: [HACKERS] SSI rw-conflicts and 2PC
Kevin Grittner kevin.gritt...@wicourts.gov wrote: This would tend to be more than a little inconvenient until the prepared statements pending at crash or shutdown were all committed or rolled back. [sigh] Probably obvious, but to avoid confusion: s/prepared statements/prepared transactions/ -Kevin
Re: [HACKERS] pg_test_fsync performance
On Mon, Feb 13, 2012 at 09:54:06PM -0500, Bruce Momjian wrote: On Mon, Feb 13, 2012 at 08:28:03PM -0500, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: Instead of or in addition to a fixed number of operations per test, maybe we should cut off each test after a certain amount of wall-clock time, like 15 seconds. +1, I was about to suggest the same thing. Running any of these tests for a fixed number of iterations will result in drastic degradation of accuracy as soon as the machine's behavior changes noticeably from what you were expecting. Run them for a fixed time period instead. Or maybe do a few, then check elapsed time and estimate a number of iterations to use, if you're worried about the cost of doing gettimeofday after each write. Good idea, and it worked out very well. I changed the -o loops parameter to -s seconds, which calls alarm() after (default) 2 seconds, and then once the operation completes, computes a duration per operation. Updated patch applied, with an additional fix for the usage message, and use of macros for start/stop testing. I like this method much better because not only does it speed up the test, but it also allows the write test, which completes very quickly, to run longer and report more accurate numbers. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
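The fixed-duration approach can be sketched in miniature. This Python sketch uses a wall-clock deadline rather than pg_test_fsync's alarm() signal, and the function name and batch size are invented; it illustrates Tom's point about not paying timer overhead on every operation:

```python
import time

def time_ops(op, seconds=2.0, batch=64):
    """Run op() for roughly `seconds` of wall-clock time and report the
    operation count and per-op duration. The clock is checked only once
    per `batch` calls, so gettimeofday-style overhead stays out of the
    measurement -- the 'do a few, then check elapsed time' variant."""
    count = 0
    start = time.perf_counter()
    deadline = start + seconds
    while True:
        for _ in range(batch):
            op()
        count += batch
        now = time.perf_counter()
        if now >= deadline:
            break
    return count, (now - start) / count

# Fast operations simply accumulate more iterations in the same window,
# which is why the quick write test gets more accurate numbers.
ops, per_op = time_ops(lambda: None, seconds=0.1)
print(f"{ops} ops, {per_op * 1e9:.0f} ns/op")
```

A fixed time budget per test also bounds total runtime regardless of how slow the machine's fsync turns out to be, which is the accuracy point Tom raises about fixed iteration counts.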
[HACKERS] Initial 9.2 pgbench write results
Last year at this time, I was investigating things like ext3 vs xfs, how well Linux's dirty_bytes parameter worked, and how effective a couple of patches were on throughput and latency. The only patch that ended up applied for 9.1 was for fsync compaction. That was measurably better in terms of eliminating backend syncs altogether, and it also pulled up average TPS a bit on the database scales I picked out to test it on. That rambling group of test sets is available at http://highperfpostgres.com/pgbench-results/index.htm For the first round of 9.2 testing under a write-heavy load, I started with 9.0 via the yum.postgresql.org packages for SL6, upgraded to 9.1 from there, and then used a source code build of 9.2 HEAD as of Feb 11 (58a9596ed4a509467e1781b433ff9c65a4e5b5ce). Attached is an Excel spreadsheet showing the major figures, along with a CSV formatted copy of that data too. Results that are ready so far are available at http://highperfpostgres.com/results-write-9.2-cf4/index.htm Most of that is good; here's the best and worst parts of the news in compact form:

scale=500, db is 46% of RAM
Version  Avg TPS
9.0      1961
9.1      2255
9.2      2525

scale=1000, db is 94% of RAM; clients=4
Version  TPS
9.0      535
9.1      491  (-8.4% relative to 9.0)
9.2      338  (-31.2% relative to 9.1)

There's usually a tipping point with pgbench results, where the characteristics change quite a bit as the database exceeds total RAM size. You can see the background writer statistics change quite a bit around there too. Last year the sharpest part of that transition happened when exceeding total RAM; now it's happening just below that. This test set takes about 26 hours to run in the stripped down form I'm comparing, which doesn't even bother trying larger than RAM scales like 2000 or 3000 that might also be helpful. Most of the runtime is spent on the larger scale database tests, which unfortunately are the interesting ones this year.
I'm torn at this point between chasing down where this regression came from, moving forward with testing the new patches proposed for this CF, and seeing if this regression also holds with SSD storage. Obvious big commit candidates to bisect this over are the bgwriter/checkpointer split (Nov 1) and the group commit changes (Jan 30). Now I get to pay for not having set this up to run automatically each week since earlier in the 9.2 development cycle. If someone else wants to try and replicate the bad part of this, the best guess for how is using the same minimal postgresql.conf changes I have here, and picking your database scale so that the test database just barely fits into RAM. pgbench gives roughly 16MB of data per unit of scale, and scale=1000 is 15GB; percentages above are relative to the 16GB of RAM in my server. Client count should be small; the number of physical cores is probably a good starting point (that's 4 in my system, I didn't test below that). At higher client counts, the general scalability improvements in 9.2 negate some of this downside. = Server config = The main change to the 8 hyperthreaded core test server (Intel i7-870) for this year is bumping it from 8GB to 16GB of RAM, which effectively doubles the scale I can reach before things slow dramatically. It's also been updated to run Scientific Linux 6.0, giving a slightly later kernel. That kernel does have different defaults for dirty_background_ratio and dirty_ratio; they're 10% and 20% now (compared to 5%/10% in last year's tests). Drive set for tests I'm publishing so far is basically the same: 4-port Areca card with 256MB battery-backed cache, 3 disk RAID0 for the database, single disk for the WAL, all cheap 7200 RPM drives. The OS is on a separate drive, not connected to the caching controller. That's also where the pgbench latency data is written to. The idea is that this will be similar to having around 10 drives in a production server, where you'll also be using RAID1 for redundancy.
I have some numbers brewing for this system running with an Intel 320 series SSD, too, but they're not ready yet. = Test setup = pgbench-tools has been upgraded to break down its graphs per test set now, and there's even a configuration option to present that in a tab-like interface using client-side Javascript. Thanks to Ben Bleything for that one. Minimal changes were made to the postgresql.conf: shared_buffers=2GB, checkpoint_segments=64, and I left wal_buffers at its default so that 9.1 got credit for that going up. See http://highperfpostgres.com/results-write-9.2-cf4/541/pg_settings.txt for a full list of changes, drive mount options, and important kernel settings. Much of that data wasn't collected in last year's pgbench-tools runs. = Results commentary = For the most part the 9.2 results are quite good. The increase at high client counts is solid, as expected from all the lock refactoring this release has gotten. The smaller than RAM
Re: [HACKERS] Initial 9.2 pgbench write results
On 02/14/2012 01:45 PM, Greg Smith wrote: scale=1000, db is 94% of RAM; clients=4

Version  TPS
9.0      535
9.1      491  (-8.4% relative to 9.0)
9.2      338  (-31.2% relative to 9.1)

A second pass through this data noted that the maximum number of buffers cleaned by the background writer is <=2785 in 9.0/9.1, while it goes as high as 17345 in 9.2. The background writer is so busy now it hits the max_clean limit around 147 times in the slower[1] of the 9.2 runs. That's an average of once every 4 seconds, quite frequent, whereas max_clean rarely happens in the comparable 9.0/9.1 results. This is starting to point my finger more toward this being an unintended consequence of the background writer/checkpointer split. Thinking out loud about solutions before the problem is even nailed down, I wonder if we should consider lowering bgwriter_lru_maxpages now in the default config? In older versions, the page cleaning work had at most a 50% duty cycle; it was only running when checkpoints were not. If we wanted to keep the ceiling on background writer cleaning at the same level in the default configuration, that would require dropping bgwriter_lru_maxpages from 100 to 50. That would roughly be the same amount of maximum churn. It's obviously more complicated than that, but I think there's a defensible position along those lines to consider. As a historical aside, I wonder how much this behavior might have been to blame for my failing to get spread checkpoints to show a positive outcome during 9.1 development. The way that was written also kept the cleaner running during checkpoints. I didn't measure those two changes individually as much as I did the combination. [1] I normally do 3 runs of every scale/client combination, and find that more useful than a single run lasting 3X as long. The first of the 3 runs I do at any scale is usually a bit faster than the later two, presumably due to table and/or disk fragmentation.
I've tried to make this less of a factor in pgbench-tools by iterating through all requested client counts first, before beginning a second run of those scale/client combinations. So if the two client counts were 4 and 8, it would be 4/8/4/8/4/8, which works much better than 4/4/4/8/8/8 in terms of fragmentation impacting the average result. Whether it would be better or worse to eliminate this difference by rebuilding the whole database multiple times for each scale is complicated. I happen to like seeing the results with a bit more fragmentation mixed in, to see how they compare with the fresh database. Since more rebuilds would also make these tests take much longer than they already do, that's the tie-breaker that's led to the current testing schedule being the preferred one. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random_page_cost vs seq_page_cost
On Tue, Feb 07, 2012 at 07:58:28PM -0500, Bruce Momjian wrote: I was initially concerned that tuning advice in this part of the docs would look out of place, but now see the 25% shared_buffers recommendation, and it looks fine, so we are OK. (Should we caution against more than 8GB of shared buffers? I don't see that in the docs.) I agree we are overdue for a better explanation of random page cost, so I agree with your direction. I did a little word-smithing to tighten up your text; feel free to discard what you don't like: Random access to mechanical disk storage is normally much more expensive than four times sequential access. However, a lower default is used (4.0) because the majority of random accesses to disk, such as indexed reads, are assumed to be in cache. The default value can be thought of as modeling random access as 40 times slower than sequential, while expecting 90% of random reads to be cached. If you believe a 90% cache rate is an incorrect assumption for your workload, you can increase random_page_cost to better reflect the true cost of random storage reads. Correspondingly, if your data is likely to be completely in cache, such as when the database is smaller than the total server memory, decreasing random_page_cost can be appropriate. Storage that has a low random read cost relative to sequential, e.g. solid-state drives, might also be better modeled with a lower value for random_page_cost. Patch applied for random_page_cost docs. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
ISTM I found the problem. This piece of code is triggering the error. It assumes each page of the corresponding level to have an initialized buffer. That should be true because we're inserting index tuples from the top down while splits propagate from the bottom up.

if (!found)
{
    /*
     * Node buffer should exist at this point. If it didn't exist before,
     * the insertion that caused the page to split should've created it.
     */
    elog(ERROR, "node buffer of page being split (%u) does not exist",
         blocknum);
}

But this assumption becomes false when we turn buffering off in the root page. So, the root page can produce pages without initialized buffers when it splits.

/*
 * Does specified level have buffers? (Beware of multiple evaluation of
 * arguments.)
 */
#define LEVEL_HAS_BUFFERS(nlevel, gfbb) \
    ((nlevel) != 0 && (nlevel) % (gfbb)->levelStep == 0 && \
     (nlevel) != (gfbb)->rootitem->level)

So, I think we should just do a silent return from the function instead of triggering the error. Patch is attached. -- With best regards, Alexander Korotkov.

*** a/src/backend/access/gist/gistbuildbuffers.c
--- b/src/backend/access/gist/gistbuildbuffers.c
*** 607,617 **** gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
  	if (!found)
  	{
  		/*
! 		 * Node buffer should exist at this point. If it didn't exist before,
! 		 * the insertion that caused the page to split should've created it.
  		 */
! 		elog(ERROR, "node buffer of page being split (%u) does not exist",
! 			 blocknum);
  	}

  	/*
--- 607,616 ----
  	if (!found)
  	{
  		/*
! 		 * Page without buffer could be produced by split of root page. So
! 		 * we've just nothing to do here when there is no buffer.
  		 */
! 		return;
  	}

  	/*
Re: [HACKERS] When do we lose column names?
Andrew Dunstan and...@dunslane.net writes: On 02/13/2012 11:00 AM, Tom Lane wrote: This is surely fixable by passing a bit more information down. If you (Andrew) have something that covers everything but this issue, pass it over and I'll take a whack at it. What I have fixes one of the three cases I have identified that appear to need fixing, the one that caused the json tests to crash with your original partial patch. See https://bitbucket.org/adunstan/pgdevel/changesets/tip/rowexprs. I won't have time to return to this for a few days. The two remaining cases should be fairly independent I think. If you don't get to this before then I'll look at the flatten_join_alias_vars_mutator case again and try to fix it based on the above, and then give you what I've got. OK, I fixed this up and committed it. I made some cosmetic changes (the most notable being that the definition of RowExpr is really changing here, and so should its comment). The adjust_appendrel_attrs situation was fixed by passing in the PlannerInfo, which is something that probably should have been made available all along --- there are very few nontrivial functions in the planner that don't need it. I'm still a bit annoyed by the behavior I mentioned here, http://archives.postgresql.org/pgsql-hackers/2011-11/msg01031.php that we don't get real column names from an unflattened VALUES RTE. Might be worth looking into that, but I don't have time for it. regards, tom lane
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
Alexander Korotkov aekorot...@gmail.com writes: ISTM I found the problem. This piece of code is triggering the error. It assumes each page of the corresponding level to have an initialized buffer. That should be true because we're inserting index tuples from the top down while splits propagate from the bottom up. But this assumption becomes false when we turn buffering off in the root page. So, the root page can produce pages without initialized buffers when it splits. Hmm ... can we tighten the error check rather than just remove it? It feels less than safe to assume that a hash-entry-not-found condition *must* reflect a corner-case situation like that. At the very least I'd like to see it verify that we'd turned off buffering before deciding this is OK. Better, would it be practical to make dummy entries in the hash table even after turning buffers off, so that the logic here becomes if (!found) error; else if (entry is dummy) return without doing anything; else proceed? regards, tom lane
Re: [HACKERS] pg_test_fsync performance
Bruce Momjian br...@momjian.us writes: On Mon, Feb 13, 2012 at 08:28:03PM -0500, Tom Lane wrote: +1, I was about to suggest the same thing. Running any of these tests for a fixed number of iterations will result in drastic degradation of accuracy as soon as the machine's behavior changes noticeably from what you were expecting. Run them for a fixed time period instead. Or maybe do a few, then check elapsed time and estimate a number of iterations to use, if you're worried about the cost of doing gettimeofday after each write. Good idea, and it worked out very well. I changed the -o loops parameter to -s seconds which calls alarm() after (default) 2 seconds, and then once the operation completes, computes a duration per operation. I was kind of wondering how portable alarm() is, and the answer according to the buildfarm is that it isn't. regards, tom lane
Re: [HACKERS] pg_test_fsync performance
On Tue, Feb 14, 2012 at 05:59:06PM -0500, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: On Mon, Feb 13, 2012 at 08:28:03PM -0500, Tom Lane wrote: +1, I was about to suggest the same thing. Running any of these tests for a fixed number of iterations will result in drastic degradation of accuracy as soon as the machine's behavior changes noticeably from what you were expecting. Run them for a fixed time period instead. Or maybe do a few, then check elapsed time and estimate a number of iterations to use, if you're worried about the cost of doing gettimeofday after each write. Good idea, and it worked out very well. I changed the -o loops parameter to -s seconds which calls alarm() after (default) 2 seconds, and then once the operation completes, computes a duration per operation. I was kind of wondering how portable alarm() is, and the answer according to the buildfarm is that it isn't. I'm using the following simplistic alarm() implementation for win32: https://github.com/markokr/libusual/blob/master/usual/signal.c#L21 This works with the fake sigaction()/SIGALARM hack below it, to remember the function to call. Good enough for simple stats printing, and avoids win32-specific code spreading around. -- marko
Re: [HACKERS] SSI rw-conflicts and 2PC
On Tue, Feb 14, 2012 at 10:04:15AM +0200, Heikki Linnakangas wrote: Perhaps it would be simpler to add the extra information to the commit records of the transactions that commit after the first transaction is prepared. In the commit record, you would include a list of prepared transactions that this transaction conflicted with. During recovery, you would collect those lists in memory, and use them at the end of recovery to flag the conflicts in prepared transactions that are still in prepared state. Yeah, doing it that way might be a better strategy if we wanted to go that route. I hadn't really considered it because I'm not that familiar with the xlog code (plus, the commit record already contains a variable length field, making it that much more difficult to add another). Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
Re: [HACKERS] SSI rw-conflicts and 2PC
On Tue, Feb 14, 2012 at 09:27:58AM -0600, Kevin Grittner wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 14.02.2012 04:57, Dan Ports wrote: The easiest answer would be to just treat every prepared transaction found during recovery as though it had a conflict in and out. This is roughly a one-line change, and it's certainly safe. Dan, could you post such a patch, please? Sure. See attached. Should we add anything to the docs to warn people that if they crash with serializable prepared transactions pending, they will see this behavior until the prepared transactions are either committed or rolled back, either by the transaction manager or through manual intervention? Hmm, it occurs to me that if we have to abort a transaction due to a serialization failure involving a prepared transaction, we might want to include the prepared transaction's gid in the errdetail. This seems like it'd be especially useful for prepared transactions. We can generally pick the transaction to abort to ensure the safe retry property -- if that transaction is immediately retried, it won't fail with the same conflict -- but we can't always guarantee that when prepared transactions are involved. And prepared transactions already have a convenient, user-visible ID we can report. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index b75b73a..b102e19 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -4733,14 +4733,11 @@ AtPrepare_PredicateLocks(void)
 	xactRecord->flags = MySerializableXact->flags;

 	/*
-	 * Tweak the flags. Since we're not going to output the inConflicts and
-	 * outConflicts lists, if they're non-empty we'll represent that by
-	 * setting the appropriate summary conflict flags.
+	 * Note that we don't include the list of conflicts in or out in
+	 * the statefile, because new conflicts can be added even after the
+	 * transaction prepares. We'll just make a conservative assumption
+	 * during recovery instead.
 	 */
-	if (!SHMQueueEmpty(&MySerializableXact->inConflicts))
-		xactRecord->flags |= SXACT_FLAG_SUMMARY_CONFLICT_IN;
-	if (!SHMQueueEmpty(&MySerializableXact->outConflicts))
-		xactRecord->flags |= SXACT_FLAG_SUMMARY_CONFLICT_OUT;

 	RegisterTwoPhaseRecord(TWOPHASE_RM_PREDICATELOCK_ID, 0,
 						   &record, sizeof(record));
@@ -4875,15 +4872,6 @@ predicatelock_twophase_recover(TransactionId xid, uint16 info,
 		sxact->SeqNo.lastCommitBeforeSnapshot = RecoverySerCommitSeqNo;

-		/*
-		 * We don't need the details of a prepared transaction's conflicts,
-		 * just whether it had conflicts in or out (which we get from the
-		 * flags)
-		 */
-		SHMQueueInit(&(sxact->outConflicts));
-		SHMQueueInit(&(sxact->inConflicts));
-
 		/*
 		 * Don't need to track this; no transactions running at the time the
 		 * recovered xact started are still active, except possibly other
@@ -4905,6 +4893,17 @@ predicatelock_twophase_recover(TransactionId xid, uint16 info,
 			(MaxBackends + max_prepared_xacts));
 		}

+		/*
+		 * We don't know whether the transaction had any conflicts or
+		 * not, so we'll conservatively assume that it had both a
+		 * conflict in and a conflict out, and represent that with the
+		 * summary conflict flags.
+		 */
+		SHMQueueInit(&(sxact->outConflicts));
+		SHMQueueInit(&(sxact->inConflicts));
+		sxact->flags |= SXACT_FLAG_SUMMARY_CONFLICT_IN;
+		sxact->flags |= SXACT_FLAG_SUMMARY_CONFLICT_OUT;
+
 		/* Register the transaction's xid */
 		sxidtag.xid = xid;
 		sxid = (SERIALIZABLEXID *) hash_search(SerializableXidHash,
Re: [HACKERS] CUDA Sorting
On 13/02/2012 08:26, Greg Smith wrote: On 02/11/2012 08:14 PM, Gaetano Mendola wrote: The trend is to have servers capable of running CUDA providing GPU via external hardware (PCI Express interface with PCI Express switches), look for example at the PowerEdge C410x PCIe Expansion Chassis from DELL. The C410X adds 16 PCIe slots to a server, housed inside a separate 3U enclosure. That's a completely sensible purchase if your goal is to build a computing cluster, where a lot of work is handed off to a set of GPUs. I think that's even less likely to be a cost-effective option for a database server. Adding a single dedicated GPU installed in a server to accelerate sorting is something that might be justifiable, based on your benchmarks. This is a much more expensive option than that though. Details at http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants to see just how big this external box is. I did some experiments timing the sort done with CUDA and the sort done with pg_qsort:

                     CUDA       pg_qsort
33 million integers: ~ 900 ms   ~ 6000 ms
1 million integers:  ~ 21 ms    ~ 162 ms
100k integers:       ~ 2 ms     ~ 13 ms

The CUDA time already includes the copy operations (host-device, device-host). As GPU I was using a C2050, and the CPU doing the pg_qsort was an Intel(R) Xeon(R) CPU X5650 @ 2.67GHz. That's really interesting, and the X5650 is by no means a slow CPU. So this benchmark is providing a lot of CPU power yet still seeing over a 6X speedup in sort times. It sounds like the PCI Express bus has gotten fast enough that the time to hand data over and get it back again can easily be justified for medium to large sized sorts. It would be helpful to take this patch and confirm whether it scales when used in parallel. Easiest way to do that would be to use the pgbench -f feature, which allows running an arbitrary number of some query at once. Seeing whether this acceleration continued to hold as the number of clients increases is a useful data point.
Is it possible for you to break down where the time is being spent? For example, how much of this time is consumed in the GPU itself, compared to time spent transferring data between CPU and GPU? I'm also curious where the bottleneck is at with this approach. If it's the speed of the PCI-E bus for smaller data sets, adding more GPUs may never be practical. If the bus can handle quite a few of these at once before it saturates, it might be possible to overload a single GPU. That seems like it would be really hard to reach for database sorting though; I can't really justify my gut feel for that being true though. There you go (times are in ms):

Size      H-D       SORT      D-H       TOTAL
64        0.209824  0.479392  0.013856  0.703072
128       0.098144  0.41744   0.01312   0.528704
256       0.096832  0.420352  0.013696  0.53088
512       0.097568  0.3952    0.014464  0.507232
1024      0.09872   0.396608  0.014624  0.509952
2048      0.101344  0.56224   0.016896  0.68048
4096      0.106176  0.562976  0.02016   0.689312
8192      0.116512  0.571264  0.02672   0.714496
16384     0.136096  0.587584  0.040192  0.763872
32768     0.179296  0.658112  0.066304  0.903712
65536     0.212352  0.84816   0.118016  1.178528
131072    0.317056  1.1465    0.22784   1.691396
262144    0.529376  1.82237   0.42512   2.776866
524288    0.724032  2.39834   0.64576   3.768132
1048576   1.11162   3.51978   1.12176   5.75316
2097152   1.95939   5.93434   2.06992   9.96365
4194304   3.76192   10.6011   4.10614   18.46916
8388608   7.16845   19.9245   7.93741   35.03036
16777216  13.8693   38.7413   15.4073   68.0179
33554432  27.3017   75.6418   30.6646   133.6081
67108864  54.2171   151.192   60.327    265.7361

pg_sort
64        0.01
128       0.01
256       0.021000
512       0.128000
1024      0.092000
2048      0.196000
4096      0.415000
8192      0.883000
16384     1.881000
32768     3.96
65536     8.432000
131072    17.951000
262144    37.14
524288    78.32
1048576   163.276000
2097152   339.118000
4194304   693.223000
8388608   1423.142000
16777216  2891.218000
33554432  5910.851000
67108864  11980.93

As you can notice the times with CUDA are lower than the timing I have reported in my previous post because the server was doing something else in the meanwhile; I have repeated those benchmarks with the server completely unused. And this is the boost as in pg_sort/cuda:

64        0.0142232943
128       0.018914175
256       0.039556962
512       0.2070058671
1024      0.1804091365
2048      0.2880319774
4096      0.6078524674
8192      1.2372357578
16384     2.4637635625
32768     4.4106972133
65536     7.1742037525
131072    10.5090706139
262144    13.3719091955
524288    20.5834084369
1048576   28.2516043357
2097152   33.9618513296
4194304   37.5247168794
8388608   40.5135716561
16777216  42.4743633661
33554432  44.2394809896
67108864  45.1499777411

I've never seen a PostgreSQL server capable of running CUDA, and I don't expect that to change. That sounds like: I think there is a world market for
Re: [HACKERS] pg_test_fsync performance
On Wed, Feb 15, 2012 at 01:35:05AM +0200, Marko Kreen wrote: On Tue, Feb 14, 2012 at 05:59:06PM -0500, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: On Mon, Feb 13, 2012 at 08:28:03PM -0500, Tom Lane wrote: +1, I was about to suggest the same thing. Running any of these tests for a fixed number of iterations will result in drastic degradation of accuracy as soon as the machine's behavior changes noticeably from what you were expecting. Run them for a fixed time period instead. Or maybe do a few, then check elapsed time and estimate a number of iterations to use, if you're worried about the cost of doing gettimeofday after each write. Good idea, and it worked out very well. I changed the -o loops parameter to -s seconds which calls alarm() after (default) 2 seconds, and then once the operation completes, computes a duration per operation. I was kind of wondering how portable alarm() is, and the answer according to the buildfarm is that it isn't. I'm using the following simplistic alarm() implementation for win32: https://github.com/markokr/libusual/blob/master/usual/signal.c#L21 This works with the fake sigaction()/SIGALARM hack below it, to remember the function to call. Good enough for simple stats printing, and avoids win32-specific code spreading around. Wow, I wasn't even aware this compiled on Win32; I thought it was ifdef'ed out. Anyway, I am looking at SetTimer as a way of making this work. (Me wonders if the GoGrid Windows images have compilers.) I see backend/port/win32/timer.c, so I might go with a simple create-a-thread, sleep(2), set-flag, exit solution. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
[HACKERS] client performance v.s. server statistics
Hi, I am checking a performance problem encountered after porting an old embedded DB to PostgreSQL. While the system is real-time sensitive, we are concerned about per-query cost. In our environment sequential scanning (select * from ...) of a table with tens of thousands of records costs 1 - 2 seconds, regardless of using the ODBC driver or the timing result shown in the psql client (which, in turn, relies on libpq). However, using EXPLAIN ANALYZE, or checking the statistics in the pg_stat_statement view, the query costs less than 100ms. So, is the client interface (ODBC, libpq) cost mainly due to TCP? Have pg_stat_statement or EXPLAIN ANALYZE included the cost of copying tuples from shared buffers to result sets? Could you experts share your views on this big gap? And any suggestions to optimise? P.S. In our original embedded DB a fastpath interface is provided to read directly from shared memory for the records, thus providing extremely realtime access (of course sacrificing some other features such as consistency). Best regards, Han
Re: [HACKERS] When do we lose column names?
On 02/14/2012 05:39 PM, Tom Lane wrote: OK, I fixed this up and committed it. I made some cosmetic changes (the most notable being that the definition of RowExpr is really changing here, and so should its comment). The adjust_appendrel_attrs situation was fixed by passing in the PlannerInfo, which is something that probably should have been made available all along --- there are very few nontrivial functions in the planner that don't need it. Great, many thanks for finishing this up. I'm still a bit annoyed by the behavior I mentioned here, http://archives.postgresql.org/pgsql-hackers/2011-11/msg01031.php that we don't get real column names from an unflattened VALUES RTE. Might be worth looking into that, but I don't have time for it. A TODO maybe? cheers andrew
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Fri, Feb 10, 2012 at 10:30 AM, Peter Geoghegan pe...@2ndquadrant.com wrote: [ new patch ] I spent quite a bit of time looking at this today - the patch specifically, and the issue of making quicksort fast more generally. It seemed to me that if we're going to have separate copies of the quicksort code for tuple sorting, we might as well go whole hog and specialize those copies to the particular needs of tuplesort.c as much as possible. Accordingly, I whacked the code around so that it knows that it is sorting SortTuple objects and need not conditionalize at runtime on the size of the objects being swapped. You suggested upthread that this might be worthwhile, and it seems that it is, so I think we should do it. Your patch removes the CHECK_FOR_INTERRUPTS() call from comparetup_heap, which is no good. However, since I'd already decided to specialize the copies of quicksort intended for sorting as much as possible, it made sense to me to refactor things so that the qsort routine itself, rather than the comparator, is responsible for calling CHECK_FOR_INTERRUPTS(). This slightly reduces the number of times we CHECK_FOR_INTERRUPTS(), but never allows more than a few comparisons before doing it. I find that your pg_always_inline macro is equivalent to just plain inline on my system (MacOS X v10.6.8, gcc 4.2.1). It seems to need something like this: +#elif __GNUC__ +#define pg_always_inline inline __attribute__((always_inline)) ...but I'm not very happy about relying on that, because I don't know that it will work on every gcc version (never mind non-gcc compilers), and I'm not convinced it's actually improving performance even on this one. The documentation seems to indicate that this is intended to force inlining even when not optimizing, which may have something to do with the lack of effect: that's not really the point here anyway. 
What I did instead is to replace template_qsort_arg.h with a script called gen_qsort_tuple.pl, which generates a file called qsort_tuple.c that tuplesort.c then #includes. This seems more flexible to me than the macro-based approach. In particular, it allows me to generate versions of qsort with different call signatures. The attached patch generates two: static void qsort_tuple(SortTuple *a, size_t n, SortTupleComparator cmp_tuple, Tuplesortstate *state); static void qsort_ssup(SortTuple *a, size_t n, SortSupport ssup); The first of these is a drop-in replacement for qsort_arg() - any tuplesort can use it, not just heap sorts. But it is faster than qsort_arg() because of the specializations for the SortTuple data type. The second handles the special case where we are sorting by a single key that has an associated SortSupport object. In this case we don't need to carry the overhead of passing around the Tuplesortstate and dereferencing it, nor do we need the SortTupleComparator: we can just pass the SortSupport itself. Maybe there's a way to get this effect using macros, but I couldn't figure it out. At any rate, at least for the single-key case, this approach effectively forces the comparator to be inlined without requiring pg_always_inline. With this patch, I get the following results, as compared with your 2012-02-10 version and master, using the same test cases I tested before. 
select * from nodups order by g offset 10001;
tps on master: 289.471274, 289.967984, 289.595958
tps on 2012-02-10 version: 359.150280, 356.284723, 356.888900
tps on attached version: 388.212793, 386.085083, 386.867478

select * from twocol order by a, b offset 1;
tps on master: 261.676611, 260.440886, 259.529362
tps on 2012-02-10 version: 283.941312, 279.981723, 283.140208
tps on attached version: 283.146463, 278.344827, 280.727798

select * from f8 order by g offset 1;
tps on master: 228.299924, 222.650355, 227.408506
tps on 2012-02-10 version: 260.289273, 257.181035, 256.377456
tps on attached version: 276.985299, 275.341575, 274.428095

There's some variation (which I can't account for) between the results on master now and the results on master before - possibly just code shifting around between cache lines due to unrelated changes, or maybe some inadvertent change in my test setup. But it looks to me like your 2012-02-10 version, without any type-specific optimizations, does pretty much just as well on multi-key sorting as your previous version, which had them - or if there is a difference, it's pretty small. Overall, I think the numbers for the version I'm attaching here look pretty good: the single-key performance is clearly better than your last version, and the multi-key performance is very slightly worse. I think that slight worsening is a good trade-off, though, because this version can use qsort_tuple() for all kinds of tuplesorts, not just heap tuplesorts. Still, it seems like we ought to be able to do even better: the multi-key specialization that you had in your patch can be coded in this framework, too, and in theory those are independent of the
Re: [HACKERS] client performance v.s. server statistics
Hi, I have tried a unix domain socket and the performance is similar to the TCP socket. It is a MIPS architecture so memory copy to/from the kernel can occupy much time, and apparently using a unix domain socket is no different than TCP in terms of memory copy. But the ten-fold gap between the client side statistic and the server side statistics is still unbelievable. So I want to know what operations exactly are involved in the server side statistics in EXPLAIN ANALYZE. I may check the code later on when I get time. For the query itself, it was just for performance comparison. There are other index based queries, which are of course much faster, but they still result in a similar ten-fold gap between client side and server side statistics. I am thinking of a client interface that does not involve the kernel; is there such an option, or do I have to develop one from scratch? Best regards, Han

On Wed, Feb 15, 2012 at 1:23 PM, Amit Kapila amit.kap...@huawei.com wrote: So, is the client interface (ODBC, libpq) cost mainly due to TCP? The difference you are seeing as compared to your embedded DB mainly seems to be due to TCP. One optimization you can use is the Unix-domain socket mode of PostgreSQL. You can refer to the unix_socket_directory parameter in postgresql.conf and other related parameters. I am suggesting you this as earlier you were using an embedded DB, so your client/server should be on the same machine. If this is no longer the case then it will not work. Can you please clarify some more things, like: 1. After doing the sequence scan, do you need all the records in the client for which the seq. scan is happening? If fewer records, then why have you not created an index? 2. What is the exact scenario for fetching records?

From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Zhou Han Sent: Wednesday, February 15, 2012 9:30 AM To: pgsql-hackers@postgresql.org Subject: [HACKERS] client performance v.s. server statistics

Hi, I am checking a performance problem encountered after porting an old embedded DB to PostgreSQL. While the system is real-time sensitive, we are concerned about per-query cost. In our environment sequential scanning (select * from ...) for a table with tens of thousands of records costs 1 - 2 seconds, regardless of using the ODBC driver or the timing result shown in the psql client (which, in turn, relies on libpq). However, using EXPLAIN ANALYZE, or checking the statistics in the pg_stat_statement view, the query costs less than 100ms. So, is the client interface (ODBC, libpq) cost mainly due to TCP? Has pg_stat_statement or EXPLAIN ANALYZE included the cost of copying tuples from shared buffers to result sets? Could you experts share your views on this big gap? And any suggestions to optimise? P.S. In our original embedded DB a fastpath interface is provided to read directly from shared memory for the records, thus providing extremely realtime access (of course sacrificing some other features such as consistency). Best regards, Han