Re: [HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements
On Sun, Jul 14, 2013 at 03:36:09AM -0300, Fabrízio de Royes Mello wrote:

Next, changes in src/backend, starting with parser changes: the patch adds IF_P NOT EXISTS variants for various productions. For example: snip I think opt_if_not_exists should be used for the others as well.

I could not use opt_if_not_exists because bison emits an error:

/usr/bin/bison -d -o gram.c gram.y
gram.y: conflicts: 10 shift/reduce
gram.y: expected 0 shift/reduce conflicts
make[3]: *** [gram.c] Error 1

I really don't know how to solve this problem. I'm just doing adjustments like that:

This probably isn't solvable, which is why the coding is duplicated in many existing places. The issue is that by using opt_if_not_exists you make it so that bison has to decide much earlier which rule it is parsing. Bison only has one token of lookahead, and if that's not enough you get errors. BTW, bison dumps a large file describing all its states; you should be able to work out from that where the exact problem lies.

Have a nice day,
-- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Removing Inner Joins
Sent from my iPad On 10-Jul-2013, at 13:11, Hannu Krosing ha...@2ndquadrant.com wrote: On 07/10/2013 09:18 AM, Atri Sharma wrote: Can you please post an example of such a join removal? I mean a query before and after the removal. Thanks, Courtesy Robert Haas: SELECT foo.x, foo.y, foo.z FROM foo WHERE foo.x = bar.x Conditions: 1) foo.x is not null. I guess that this is also not needed. you can just remove rows where foo.x is null That is, replace the join with foo.x is not null 2) foo (x) is a foreign key referencing bar (x). We can ignore bar completely in this case i.e. avoid scanning bar. Regards, Atri -- Regards, Atri l'apprenant I discussed with RhodiumToad and was exploring potential design methods with which we can solve the above problem. My focus is to add support for foreign key detection in planner and allow planner to make decisions based on it. It wouldn't be too much of a cost to maintain the foreign key column and the referenced table. The main issue, as pointed out by RHodiumToad is primarily that, between the generation of the plan, which is made with accordance to the foreign key presence, and the execution of the plan, we may get into an inconsistent state when the corresponding row is deleted or constraints are changed and fk trigger has not yet run and detected those changes. I was thinking of row level locks,which are done by the fk trigger.Is there any way we can modify the fk trigger to forcibly run? Or add an 'looked at' bit, which is reset when a table/row is changed, and set by the fk trigger when it runs on that table? I am just thinking wild here, please let me know your thoughts, feedback and ideas. Regards, Atri -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing Inner Joins
On 07/14/2013 06:10 PM, Atri Sharma wrote: Sent from my iPad On 10-Jul-2013, at 13:11, Hannu Krosing ha...@2ndquadrant.com wrote: On 07/10/2013 09:18 AM, Atri Sharma wrote: Can you please post an example of such a join removal? I mean a query before and after the removal. Thanks, Courtesy Robert Haas: SELECT foo.x, foo.y, foo.z FROM foo WHERE foo.x = bar.x Conditions: 1) foo.x is not null. I guess that this is also not needed. you can just remove rows where foo.x is null That is, replace the join with foo.x is not null 2) foo (x) is a foreign key referencing bar (x). We can ignore bar completely in this case i.e. avoid scanning bar. Regards, Atri -- Regards, Atri l'apprenant I discussed with RhodiumToad and was exploring potential design methods with which we can solve the above problem. My focus is to add support for foreign key detection in planner and allow planner to make decisions based on it. It wouldn't be too much of a cost to maintain the foreign key column and the referenced table. The main issue, as pointed out by RHodiumToad is primarily that, between the generation of the plan, which is made with accordance to the foreign key presence, and the execution of the plan, we may get into an inconsistent state when the corresponding row is deleted or constraints are changed and fk trigger has not yet run and detected those changes. Is this not all transactional and taken care of by MVCC ? That is, the problem can only happen for prepared plans, which need to have invalidation in case of underlaying DDL / schema changes anyway ? Or are you worried about the case where the FK constraint is delayed and thus the plan can be invalid between the change and running of FK trigger in the same transaction ? I was thinking of row level locks,which are done by the fk trigger.Is there any way we can modify the fk trigger to forcibly run? Or add an 'looked at' bit, which is reset when a table/row is changed, and set by the fk trigger when it runs on that table? I am just thinking wild here, please let me know your thoughts, feedback and ideas. Regards, Atri -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing Inner Joins
Sent from my iPad On 14-Jul-2013, at 22:12, Hannu Krosing ha...@2ndquadrant.com wrote: On 07/14/2013 06:10 PM, Atri Sharma wrote: Sent from my iPad On 10-Jul-2013, at 13:11, Hannu Krosing ha...@2ndquadrant.com wrote: On 07/10/2013 09:18 AM, Atri Sharma wrote: Can you please post an example of such a join removal? I mean a query before and after the removal. Thanks, Courtesy Robert Haas: SELECT foo.x, foo.y, foo.z FROM foo WHERE foo.x = bar.x Conditions: 1) foo.x is not null. I guess that this is also not needed. you can just remove rows where foo.x is null That is, replace the join with foo.x is not null 2) foo (x) is a foreign key referencing bar (x). We can ignore bar completely in this case i.e. avoid scanning bar. Regards, Atri -- Regards, Atri l'apprenant I discussed with RhodiumToad and was exploring potential design methods with which we can solve the above problem. My focus is to add support for foreign key detection in planner and allow planner to make decisions based on it. It wouldn't be too much of a cost to maintain the foreign key column and the referenced table. The main issue, as pointed out by RHodiumToad is primarily that, between the generation of the plan, which is made with accordance to the foreign key presence, and the execution of the plan, we may get into an inconsistent state when the corresponding row is deleted or constraints are changed and fk trigger has not yet run and detected those changes. Is this not all transactional and taken care of by MVCC ? That is, the problem can only happen for prepared plans, which need to have invalidation in case of underlaying DDL / schema changes anyway ? Or are you worried about the case where the FK constraint is delayed and thus the plan can be invalid between the change and running of FK trigger in the same transaction ? Yes, that is precisely what I am concerned about.Thanks for wording it so nicely! Regards, Atri -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsql: Optimize pglz compressor for small inputs.
Heikki,

* Heikki Linnakangas (heikki.linnakan...@iki.fi) wrote: This patch alleviates that in two ways. First, instead of storing pointers in the hash table, store 16-bit indexes into the hist_entries array. That slashes the size of the hash table to 1/2 or 1/4 of the original, depending on the pointer width. Secondly, adjust the size of the hash table based on input size. For very small inputs, you don't need a large hash table to avoid collisions.

The coverity scanner has a bit of an issue with this patch which, at least on first blush, looks like a valid concern. While the change in pg_lzcompress.c:pglz_find_match() to loop on:

while (hent != INVALID_ENTRY_PTR)
{
    const char *ip = input;
    const char *hp = hent->pos;

looks good, and INVALID_ENTRY_PTR is the address of the first entry in the array (and can't be NULL), towards the end of the loop we do:

hent = hent->next;
if (hent)
...

Should we really be checking for 'hent != INVALID_ENTRY_PTR' here? If not, and hent really can end up as NULL, then we're going to segfault on the next loop due to the unchecked 'hent->pos' early in the loop. If hent can never be NULL, then we probably don't need this check at all.

Thanks,

Stephen
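A standalone sketch of the loop shape being discussed, with invented, simplified names rather than the actual pg_lzcompress.c code: the chain is terminated by a sentinel entry (the address of the first array slot), so every test that decides whether to keep walking must use that same sentinel; a plain NULL check can let a NULL link through, and the next iteration then dereferences it.

typedef struct HistEntry
{
    struct HistEntry *next;     /* next entry with the same hash */
    const char *pos;            /* where this entry points into the input */
} HistEntry;

#define SENTINEL(entries) (&(entries)[0])   /* stands in for INVALID_ENTRY_PTR */

static int
longest_match_in_chain(HistEntry *entries, HistEntry *hent,
                       const char *input, int maxlen)
{
    int         best = 0;

    while (hent != SENTINEL(entries))
    {
        const char *hp = hent->pos; /* only safe for a real entry */
        int         len = 0;

        while (len < maxlen && hp[len] == input[len])
            len++;
        if (len > best)
            best = len;

        hent = hent->next;      /* terminated by the same sentinel test above */
    }
    return best;
}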
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 7/13/13 12:13 PM, Fabien COELHO wrote: My 0.02€: if it means adding complexity to the pgbench code, I think that it is not worth it. The point of pgbench is to look at a steady state, not to end in the most graceful possible way as far as measures are concerned. That's how some people use pgbench. I'm just as likely to use it to characterize system latency. If there's a source of latency that's specific to the pgbench code, I want that out of there even if it's hard. But we don't have to argue about that because it isn't. The attached new patch seems to fix the latency spikes at the end, with -2 lines of new code! With that resolved I did a final pass across the rate limit code too, attached as a v14 and ready for a committer. I don't really care what order these two changes are committed, there's no hard dependency, but I would like to see them both go in eventually. No functional code was changed from your v13 except for tweaking the output. The main thing I did was expand/edit comments and rename a few variables to try and make this easier to read. If you have any objections to my cosmetic changes feel free to post an update. I've put a good bit of time into trying to simplify this further, thinking it can't really be this hard. But this seems to be the minimum complexity that still works given the mess of the pgbench state machine. Every change I try now breaks something. To wrap up the test results motivating my little pgbench-delay-finish patch, the throttled cases that were always giving 100ms of latency clustered at the end here now look like this: average rate limit lag: 0.181 ms (max 53.108 ms) tps = 10088.727398 (including connections establishing) tps = 10105.775864 (excluding connections establishing) There are still some of these cases where latency spikes, but they're not as big and they're randomly distributed throughout the run. The problem I had with the ones at the end is how they tended to happen a few times in a row. I kept seeing multiple of these ~50ms lulls adding up to a huge one, because the source of the lag kept triggering at every connection close. pgbench was already cleaning up all of its connections at the end, after all the transactions were finished. It looks safe to me to just rely on that for calling PQfinish in the normal case. And calls to client_done already label themselves ok or abort, the code just didn't do anything with that state before. I tried adding some more complicated state tracking, but that adds complexity while doing the exact same thing as the simple implementation I did. The only part of your code change I reverted was altering the latency log transaction timestamps to read like 1373821907.65702 instead of 1373821907 65702. Both formats were considered when I added that feature, and I completely understand why you'd like to change it. One problem is that doing so introduces a new class of float parsing and rounding issues to consumers of that data. I'd rather not revisit that without a better reason to break the output format. Parsing tools like my pgbench-tools already struggle trying to support multiple versions of pgbench, and I don't think there's enough benefit to the float format to bother breaking them today. 
-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c new file mode 100644 index 08095a9..8dc81e5 *** a/contrib/pgbench/pgbench.c --- b/contrib/pgbench/pgbench.c *** preparedStatementName(char *buffer, int *** 862,875 static bool clientDone(CState *st, bool ok) { ! /* !* When the connection finishes normally, don't call PQfinish yet. !* PQfinish can cause significant delays in other clients that are !* still running. Rather than finishing all of them here, in the !* normal case clients are instead closed in bulk by disconnect_all, !* after they have all stopped. !*/ ! if ((st-con != NULL) ok) { PQfinish(st-con); st-con = NULL; --- 862,870 static bool clientDone(CState *st, bool ok) { ! (void) ok; /* unused */ ! ! if (st-con != NULL) { PQfinish(st-con); st-con = NULL; diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index 80203d6..da88bd7 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -137,6 +137,12 @@ intunlogged_tables = 0; double sample_rate = 0.0; /* + * When threads are throttled to a given rate limit, this is the target delay + * to reach that rate in usec. 0 is the default and means no throttling. + */ +int64 throttle_delay = 0; + +/* * tablespace selection */ char *tablespace = NULL;
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 6/16/13 10:27 AM, Heikki Linnakangas wrote: Yeah, the checkpoint scheduling logic doesn't take into account the heavy WAL activity caused by full page images... Rationalizing a bit, I could even argue to myself that it's a *good* thing. At the beginning of a checkpoint, the OS write cache should be relatively empty, as the checkpointer hasn't done any writes yet. So it might make sense to write a burst of pages at the beginning, to partially fill the write cache first, before starting to throttle. But this is just handwaving - I have no idea what the effect is in real life. That's exactly right. When a checkpoint finishes the OS write cache is clean. That means all of the full-page writes aren't even hitting disk in many cases. They just pile up in the OS dirty memory, often sitting there all the way until when the next checkpoint fsyncs start. That's why I never wandered down the road of changing FPW behavior. I have never seen a benchmark workload hit a write bottleneck until long after the big burst of FPW pages is over. I could easily believe that there are low-memory systems where the FPW write pressure becomes a problem earlier. And slim VMs make sense as the place this behavior is being seen at. I'm a big fan of instrumenting the code around a performance change before touching anything, as a companion patch that might make sense to commit on its own. In the case of a change to FPW spacing, I'd want to see some diagnostic output in something like pg_stat_bgwriter that tracks how many FPW pages are being modified. A pgstat_bgwriter.full_page_writes counter would be perfect here, and then graph that data over time as the benchmark runs. Another thought is that rather than trying to compensate for that effect in the checkpoint scheduler, could we avoid the sudden rush of full-page images in the first place? The current rule for when to write a full page image is conservative: you don't actually need to write a full page image when you modify a buffer that's sitting in the buffer cache, if that buffer hasn't been flushed to disk by the checkpointer yet, because the checkpointer will write and fsync it later. I'm not sure how much it would smoothen WAL write I/O, but it would be interesting to try. There I also think the right way to proceed is instrumenting that area first. A long time ago, Itagaki wrote a patch to sort the checkpoint writes: www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp. He posted very promising performance numbers, but it was dropped because Tom couldn't reproduce the numbers, and because sorting requires allocating a large array, which has the risk of running out of memory, which would be bad when you're trying to checkpoint. I updated and re-reviewed that in 2011: http://www.postgresql.org/message-id/4d31ae64.3000...@2ndquadrant.com and commented on why I think the improvement was difficult to reproduce back then. The improvement didn't follow for me either. It would take a really amazing bit of data to get me to believe write sorting code is worthwhile after that. On large systems capable of dirtying enough blocks to cause a problem, the operating system and RAID controllers are already sorting block. And *that* sorting is also considering concurrent read requests, which are a lot more important to an efficient schedule than anything the checkpoint process knows about. The database doesn't have nearly enough information yet to compete against OS level sorting. Bad point of my patch is longer checkpoint. 
Checkpoint time was increased about 10% - 20%. But it can work correctry on schedule-time in checkpoint_timeout. Please see checkpoint result (http://goo.gl/NsbC6). For a fair comparison, you should increase the checkpoint_completion_target of the unpatched test, so that the checkpoints run for roughly the same amount of time with and without the patch. Otherwise the benefit you're seeing could be just because of a more lazy checkpoint. Heikki has nailed the problem with the submitted dbt-2 results here. If you spread checkpoints out more, you cannot fairly compare the resulting TPS or latency numbers anymore. Simple example: 20 minute long test. Server A does a checkpoint every 5 minutes. Server B has modified parameters or server code such that checkpoints happen every 6 minutes. If you run both to completion, A will have hit 4 checkpoints that flush the buffer cache, B only 3. Of course B will seem faster. It didn't do as much work. pgbench_tools measures the number of checkpoints during the test, as well as the buffer count statistics. If those numbers are very different between two tests, I have to throw them out as unfair. A lot of things that seem promising turn out to have this sort of problem. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via
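A rough sketch of the kind of instrumentation suggested above: a counter bumped wherever a full-page image is emitted, periodically handed off so its rate can be graphed across a benchmark run. The names here are hypothetical; the real thing would go through the pgstat machinery (e.g. a pg_stat_bgwriter column) rather than a bare static variable.

#include <stdint.h>

/* Hypothetical per-backend counter of full-page images written to WAL. */
static uint64_t fpw_pages_written = 0;

/* Called from the point that decides a full-page image must be emitted. */
static inline void
count_full_page_write(void)
{
    fpw_pages_written++;
}

/* Called periodically, alongside the other stats reports: hand the count
 * off and reset it, so a sampling tool can graph FPW volume over time. */
static inline uint64_t
fetch_and_reset_fpw_count(void)
{
    uint64_t    n = fpw_pages_written;

    fpw_pages_written = 0;
    return n;
}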
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Hello Greg, But we don't have to argue about that because it isn't. The attached new patch seems to fix the latency spikes at the end, with -2 lines of new code! Hmmm... that looks like not too much complexity:-) With that resolved I did a final pass across the rate limit code too, attached as a v14 and ready for a committer. You attached my v13. Could you send your v14? -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 6/27/13 11:08 AM, Robert Haas wrote: I'm pretty sure Greg Smith tried it the fixed-sleep thing before and it didn't work that well. That's correct, I spent about a year whipping that particular horse and submitted improvements on it to the community. http://www.postgresql.org/message-id/4d4f9a3d.5070...@2ndquadrant.com and its updates downthread are good ones to compare this current work against. The important thing to realize about just delaying fsync calls is that it *cannot* increase TPS throughput. Not possible in theory, obviously doesn't happen in practice. The most efficient way to write things out is to delay those writes as long as possible. The longer you postpone a write, the more elevator sorting and write combining you get out of the OS. This is why operating systems like Linux come tuned for such delayed writes in the first place. Throughput and latency are linked; any patch that aims to decrease latency will probably slow throughput. Accordingly, the current behavior--no delay--is already the best possible throughput. If you apply a write timing change and it seems to increase TPS, that's almost certainly because it executed less checkpoint writes. It's not a fair comparison. You have to adjust any delaying to still hit the same end point on the checkpoint schedule. That's what my later submissions did, and under that sort of controlled condition most of the improvements went away. Now, I still do really believe that better spacing of fsync calls helps latency in the real world. Far as I know the server that I developed that patch for originally in 2010 is still running with that change. The result is not a throughput change though; there is a throughput drop with a latency improvement. That is the unbreakable trade-off in this area if all you touch is scheduling. The reason why I was ignoring this discussion and working on pgbench throttling until now is that you need to measure latency at a constant throughput to advance here on this topic, and that's exactly what the new pgbench feature enables. If we can take the current checkpoint scheduler and an altered one, run both at exactly the same rate, and one gives lower latency, now we're onto something. It's possible to do that with DBT-2 as well, but I wanted something really simple that people could replicate results with in pgbench. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 7/14/13 2:48 PM, Fabien COELHO wrote: You attached my v13. Could you send your v14? Correct patch (and the little one from me again) attached this time. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c new file mode 100644 index 08095a9..6564057 *** a/contrib/pgbench/pgbench.c --- b/contrib/pgbench/pgbench.c *** int unlogged_tables = 0; *** 137,142 --- 137,148 doublesample_rate = 0.0; /* + * When threads are throttled to a given rate limit, this is the target delay + * to reach that rate in usec. 0 is the default and means no throttling. + */ + int64 throttle_delay = 0; + + /* * tablespace selection */ char *tablespace = NULL; *** typedef struct *** 200,210 --- 206,218 int listen; /* 0 indicates that an async query has been * sent */ int sleeping; /* 1 indicates that the client is napping */ + boolthrottling; /* whether nap is for throttling */ int64 until; /* napping until (usec) */ Variable *variables; /* array of variable definitions */ int nvariables; instr_time txn_begin; /* used for measuring transaction latencies */ instr_time stmt_begin; /* used for measuring statement latencies */ + boolis_throttled; /* whether transaction should be throttled */ int use_file; /* index in sql_files for this client */ boolprepared[MAX_FILES]; } CState; *** typedef struct *** 222,227 --- 230,238 instr_time *exec_elapsed; /* time spent executing cmds (per Command) */ int*exec_count; /* number of cmd executions (per Command) */ unsigned short random_state[3]; /* separate randomness for each thread */ + int64 throttle_trigger; /* previous/next throttling (us) */ + int64 throttle_lag; /* total transaction lag behind throttling */ + int64 throttle_lag_max; /* max transaction lag */ } TState; #define INVALID_THREAD((pthread_t) 0) *** typedef struct *** 230,235 --- 241,248 { instr_time conn_time; int xacts; + int64 throttle_lag; + int64 throttle_lag_max; } TResult; /* *** usage(void) *** 353,358 --- 366,372 -n, --no-vacuum do not run VACUUM before tests\n -N, --skip-some-updates skip updates of pgbench_tellers and pgbench_branches\n -r, --report-latencies report average latency per command\n +-R, --rate SPEC target rate in transactions per second\n -s, --scale=NUM report this scale factor in output\n -S, --select-onlyperform SELECT-only transactions\n -t, --transactions number of transactions each client runs *** doCustom(TState *thread, CState *st, ins *** 900,916 { PGresult *res; Command **commands; top: commands = sql_files[st-use_file]; if (st-sleeping) { /* are we sleeping? */ instr_time now; INSTR_TIME_SET_CURRENT(now); ! if (st-until = INSTR_TIME_GET_MICROSEC(now)) st-sleeping = 0; /* Done sleeping, go ahead with next command */ else return true;/* Still sleeping, nothing to do here */ } --- 914,973 { PGresult *res; Command **commands; + booltrans_needs_throttle = false; top: commands = sql_files[st-use_file]; + /* +* Handle throttling once per transaction by sleeping. It is simpler +* to do this here rather than at the end, because so much complicated +* logic happens below when statements finish. +*/ + if (throttle_delay ! st-is_throttled) + { + /* +* Use inverse transform sampling to randomly generate a delay, such +* that the series of delays will approximate a Poisson distribution +* centered on the throttle_delay time. +* +* If transactions are too slow or a given wait is shorter than +* a
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 7/3/13 9:39 AM, Andres Freund wrote: I wonder how much of this could be gained by doing a sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing the original checkpoint-pass through the buffers or when fsyncing the files.

The fsync calls decompose into the queued set of block writes. If they all need to go out eventually to finish a checkpoint, the most efficient way from a throughput perspective is to dump them all at once. I'm not sure sync_file_range targeting checkpoint writes will turn out any differently than block sorting. Let's say the database tries to get involved in forcing a particular write order that way. Right now it's going to be making that ordering decision without the benefit of also knowing what blocks are being read. That makes it hard to do better than the OS, which knows a different--and potentially more useful in a read-heavy environment--set of information about all the pending I/O. And it would be very expensive to make all the backends start sharing information about what they read to ever pull that logic into the database.

It's really easy to wander down the path where you assume you must know more than the OS does, which leads to things like direct I/O. I am skeptical of that path in general. I really don't want Postgres to be competing with the innovation rate in Linux kernel I/O if we can ride it instead.

One idea I was thinking about that overlaps with a sync_file_range refactoring is simply tracking how many blocks have been written to each relation. If there was a rule like "fsync any relation that's gotten more than 100 8K writes", we'd never build up the sort of backlog that causes the worst latency issues. You really need to start tracking the file range there, just to fairly account for multiple writes to the same block. One of the reasons I don't mind all the work I'm planning to put into block write statistics is that I think that will make it easier to build this sort of facility too. The original page write and the fsync call that eventually flushes it out are very disconnected right now, and file range data seems the right missing piece to connect them well.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
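A hypothetical sketch of that "fsync any relation that's gotten more than 100 8K writes" rule, with invented names rather than PostgreSQL's actual fsync-queue code. As noted above, counting raw writes over-counts repeated writes to the same block, which is why tracking file ranges would ultimately be needed.

#include <stdbool.h>

#define EARLY_FSYNC_THRESHOLD 100   /* 8K writes absorbed before forcing a flush */

typedef struct PendingRelSync
{
    int         fd;                 /* open file for the relation segment */
    int         writes_since_sync;  /* writes queued since the last fsync */
} PendingRelSync;

/* Record one block write; returns true when the caller should fsync now,
 * keeping the unflushed backlog for this relation bounded. */
static bool
note_block_write(PendingRelSync *rel)
{
    rel->writes_since_sync++;
    if (rel->writes_since_sync >= EARLY_FSYNC_THRESHOLD)
    {
        rel->writes_since_sync = 0;
        return true;
    }
    return false;
}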
Re: [HACKERS] Materialized views WIP patch
While doing some post-commit review of the 9.3 materialized view feature, I noticed a few loose ends: On Thu, Jan 24, 2013 at 01:09:28PM -0500, Noah Misch wrote: Note that [...] ALTER TABLE ... RENAME CONSTRAINT [is] currently supported for MVs by ALTER TABLE but not by ALTER MATERIALIZED VIEW. There's no documented support for table constraints on MVs, but UNIQUE constraints are permitted: [local] test=# alter materialized view mymv add unique (c); ALTER MATERIALIZED VIEW [local] test=# alter materialized view mymv add check (c 0); ERROR: mymv is not a table [local] test=# alter materialized view mymv add primary key (c); ERROR: mymv is not a table or foreign table The above points still apply. Also, could you explain the use of RelationCacheInvalidateEntry() in ExecRefreshMatView()? CacheInvalidateRelcacheByRelid() followed by CommandCounterIncrement() is the typical pattern; this is novel. I suspect, though, neither is necessary now that the relcache does not maintain populated status based on a fork size reading. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 14/07/2013 20:13, Greg Smith wrote: The most efficient way to write things out is to delay those writes as long as possible. That doesn't smell right to me. It might be that delaying allows more combining and allows the kernel to see more at once and optimise it, but I think the counter-argument is that it is an efficiency loss to have either CPU or disk idle waiting on the other. It cannot make sense from a throughput point of view to have disks doing nothing and then become overloaded so they are a bottleneck (primarily seeking) and the CPU does nothing. Now I have NOT measured behaviour but I'd observe that we see disks that can stream 100MB/s but do only 5% of that if they are doing random IO. Some random seeks during sync can't be helped, but if they are done when we aren't waiting for sync completion then they are in effect free. The flip side is that we can't really know whether they will get merged with adjacent writes later so its hard to schedule them early. But we can observe that if we have a bunch of writes to adjacent data then a seek to do the write is effectively amortised across them. So it occurs to me that perhaps we can watch for patterns where we have groups of adjacent writes that might stream, and when they form we might schedule them to be pushed out early (if not immediately), ideally out as far as the drive (but not flushed from its cache) and without forcing all other data to be flushed too. And perhaps we should always look to be getting drives dedicated to dbms to do something, even if it turns out to have been redundant in the end. That's not necessarily easy on Linux without using a direct unbuffered IO but to me that is Linux' problem. For a start its not the only target system, and having feedback 'we need' from db and mail system groups to the NT kernels devs hasn't hurt, and it never hurt Solaris to hear what Oracle and Sybase devs felt they needed either. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
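For what it's worth, Linux already exposes a primitive for exactly this kind of nudge: sync_file_range(2) with SYNC_FILE_RANGE_WRITE starts writeback for a byte range and returns immediately, without evicting anything from cache and without implying durability. A minimal sketch follows (Linux-specific; when to call it, and for which ranges, is exactly the open policy question in this thread).

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/* Ask the kernel to start writeback for [offset, offset + nbytes) and
 * return immediately.  This only nudges I/O scheduling: nothing is
 * flushed to stable storage and nothing else is forced out with it. */
static int
start_writeback(int fd, off_t offset, off_t nbytes)
{
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0)
    {
        perror("sync_file_range");
        return -1;
    }
    return 0;
}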
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Hello Greg, Correct patch (and the little one from me again) attached this time. Please find an updated v15 with only comment changes: * The new comment about is_throttled was inverted wrt the meaning of the variable, at least to my understanding. * ISTM that the impact of the chosen 1000 should appear somewhere. * The trans_need_throttle comment was slightly wrong: the first transaction is also throttled, when initially entering doCustom. About 123456 12345 vs 123456.012345: My data parser is usually gnuplot or my eyes, and in both cases the later option is better:-) About the little patch: Parameter ok should be renamed to something meaningful (say do_finish?). Also, it seems that when timer is exceeded in doCustom it is called with true, but maybe you intended that it should be called with false?? Moreover, under normal circumstances (throttling is significantly below the maximum rate) PQfinish will be called directly by threadRun while interrupting sleeping threads. Also, it is important to remove the connection because it serves as a marker to know whether a client must run or not. So basically I do not understand how it works. Note that it does not mean that it does not work, it just means that I do not really understand:-) -- Fabien.diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index 8dc81e5..4b379ab 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -137,6 +137,12 @@ int unlogged_tables = 0; double sample_rate = 0.0; /* + * When threads are throttled to a given rate limit, this is the target delay + * to reach that rate in usec. 0 is the default and means no throttling. + */ +int64 throttle_delay = 0; + +/* * tablespace selection */ char *tablespace = NULL; @@ -200,11 +206,13 @@ typedef struct int listen; /* 0 indicates that an async query has been * sent */ int sleeping; /* 1 indicates that the client is napping */ + boolthrottling; /* whether nap is for throttling */ int64 until; /* napping until (usec) */ Variable *variables; /* array of variable definitions */ int nvariables; instr_time txn_begin; /* used for measuring transaction latencies */ instr_time stmt_begin; /* used for measuring statement latencies */ + bool is_throttled; /* whether transaction throttling is done */ int use_file; /* index in sql_files for this client */ bool prepared[MAX_FILES]; } CState; @@ -222,6 +230,9 @@ typedef struct instr_time *exec_elapsed; /* time spent executing cmds (per Command) */ int *exec_count; /* number of cmd executions (per Command) */ unsigned short random_state[3]; /* separate randomness for each thread */ + int64 throttle_trigger; /* previous/next throttling (us) */ + int64 throttle_lag; /* total transaction lag behind throttling */ + int64 throttle_lag_max; /* max transaction lag */ } TState; #define INVALID_THREAD ((pthread_t) 0) @@ -230,6 +241,8 @@ typedef struct { instr_time conn_time; int xacts; + int64 throttle_lag; + int64 throttle_lag_max; } TResult; /* @@ -353,6 +366,7 @@ usage(void) -n, --no-vacuum do not run VACUUM before tests\n -N, --skip-some-updates skip updates of pgbench_tellers and pgbench_branches\n -r, --report-latencies report average latency per command\n + -R, --rate SPEC target rate in transactions per second\n -s, --scale=NUM report this scale factor in output\n -S, --select-onlyperform SELECT-only transactions\n -t, --transactions number of transactions each client runs @@ -895,17 +909,62 @@ doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile, AggVa { PGresult *res; Command **commands; + 
booltrans_needs_throttle = false; top: commands = sql_files[st-use_file]; + /* + * Handle throttling once per transaction by sleeping. It is simpler + * to do this here rather than at the end, because so much complicated + * logic happens below when statements finish. + */ + if (throttle_delay ! st-is_throttled) + { + /* + * Use inverse transform sampling to randomly generate a delay, such + * that the series of delays will approximate a Poisson distribution + * centered on the throttle_delay time. + * + * 1000 implies a 6.9 (-log(1/1000)) to 0.0 (log 1.0) delay multiplier. + * + * If transactions are too slow or a given wait is shorter than + * a transaction, the next transaction will start right away. + */ + int64 wait = (int64) + throttle_delay * -log(getrand(thread, 1, 1000)/1000.0); + + thread-throttle_trigger += wait; + + st-until = thread-throttle_trigger; + st-sleeping = 1; + st-throttling = true; + st-is_throttled = true; + if (debug) + fprintf(stderr, client %d throttling INT64_FORMAT us\n, + st-id, wait); + } + if (st-sleeping) { /* are we
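For readers skimming the patch, the throttling comment above boils down to one calculation. Here is a standalone sketch of the inverse-transform sampling, independent of pgbench's getrand() and thread state (illustration only, not the patch code): exponentially distributed waits with a given mean make transaction starts approximate a Poisson process at the target rate.

#include <math.h>
#include <stdint.h>
#include <stdlib.h>

/* The uniform draw is clamped to [1/1000, 1], so the multiplier ranges
 * from about 6.9 (-log(1/1000)) down to 0, matching the patch comment. */
static int64_t
next_throttle_wait(int64_t mean_usec)
{
    double      u = (double) (rand() % 1000 + 1) / 1000.0;

    return (int64_t) (mean_usec * -log(u));
}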
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 7/11/13 8:29 AM, KONDO Mitsumasa wrote: I use a linear combination method to consider the total checkpoint schedule, covering both the write phase and the fsync phase. The v3 patch considered only the fsync phase, the v4 patch considered the write phase and the fsync phase, and the v5 patch considers only the fsync phase.

Your v5 now looks like my "Self-tuning checkpoint sync spread" series: https://commitfest.postgresql.org/action/patch_view?id=514 which I did after deciding write phase delays didn't help. It looks to me like some, maybe all, of your gain is coming from how any added delays spread out the checkpoints. The self-tuning part I aimed at was trying to stay on exactly the same checkpoint end time even with the delays in place. I got that part to work, but the performance gain went away once the schedule was a fair comparison. You are trying to solve a very hard problem.

How long are you running your dbt-2 tests for? I didn't see that listed anywhere.

** Average checkpoint duration (sec) (not including loading time)
                       | write_duration | sync_duration | total  |
 fsync v3-0.7          | 296.6          | 251.8898      | 548.48 | OK
 fsync v3-0.9          | 292.086        | 276.4525      | 568.53 | OK
 fsync v3-0.7_disabled | 303.5706       | 155.6116      | 459.18 | OK
 fsync v4-0.7          | 273.8338       | 355.6224      | 629.45 | OK
 fsync v4-0.9          | 329.0522       | 231.77        | 560.82 | OK

I graphed the total times against the resulting NOTPM values and attached that. I expect transaction rate to increase along with the time between checkpoints, and that's what I see here. The fsync v4-0.7 result is worse than the rest for some reason, but all the rest line up nicely. Notice how fsync v3-0.7_disabled has the lowest total time between checkpoints, at 459.18. That is why it has the most I/O and therefore runs more slowly than the rest. If you take your fsync v3-0.7_disabled and increase checkpoint_segments and/or checkpoint_timeout until that test is averaging about 550 seconds between checkpoints, NOTPM should also increase. That's interesting to know, but you don't need any change to Postgres for that. That's what always happens when you have fewer checkpoints per run.

If you get a checkpoint time table like this where the total duration is very close--within +/-20 seconds is the sort of noise I would expect there--at that point I would say you have all your patches on the same checkpoint schedule. And then you can compare the NOTPM numbers usefully. When the checkpoint times are in a large range like 459.18 to 629.45 in this table, as my graph shows, the associated NOTPM numbers are going to be based on that time. I would recommend taking a snapshot of pg_stat_bgwriter before and after the test runs, and then showing the difference between all of those numbers too. If the test runs for a while--say 30 minutes--the total number of checkpoints should be very close too.

* Test Server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB (PC3-10600R-9)
Disk: 146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB
(Add) Set off energy efficient function in BIOS and OS.

Excellent, here I have a DL160 G6 with 2 processors, 72GB of RAM, and that same P410 controller + 4 disks. I've been meaning to get DBT-2 running on there usefully; your research gives me a reason to do that.

You seem to be in a rush due to the commitfest schedule. I have some bad news for you there. You're not going to see a change here committed in this CF based on where it's at, so you might as well think about the best longer term plan.
I would be shocked if anything came out of this in less than 3 months really. That's the shortest amount of time I've ever done something useful in this area. Each useful benchmarking run takes me about 3 days of computer time, it's not a very fast development cycle. Even if all of your results were great, we'd need to get someone to duplicate them on another server, and we'd need to make sure they didn't make other workloads worse. DBT-2 is very useful, but no one is going to get a major change to the write logic in the database committed based on one benchmark. Past changes like this have used both DBT-2 and a large number of pgbench tests to get enough evidence of improvement to commit. I can help with that part when you get to something I haven't tried already. I am very interesting in improving this area, it just takes a lot of work to do it. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com attachment: NOTPM-Checkpoints.png -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On 7/14/13 5:28 PM, james wrote: Some random seeks during sync can't be helped, but if they are done when we aren't waiting for sync completion then they are in effect free. That happens sometimes, but if you measure you'll find this doesn't actually occur usefully in the situation everyone dislikes. In a write heavy environment where the database doesn't fit in RAM, backends and/or the background writer are constantly writing data out to the OS. WAL is going out constantly as well, and in many cases that's competing for the disks too. The most popular blocks in the database get high usage counts and they never leave shared_buffers except at checkpoint time. That's easy to prove to yourself with pg_buffercache. And once the write cache fills, every I/O operation is now competing. There is nothing happening for free. You're stealing I/O from something else any time you force a write out. The optimal throughput path for checkpoints turns out to be delaying every single bit of I/O as long as possible, in favor of the [backend|bgwriter] writes and WAL. Whenever you delay a buffer write, you have increased the possibility that someone else will write the same block again. And the buffers being written by the checkpointer are, on average, the most popular ones in the database. Writing any of them to disk pre-emptively has high odds of writing the same block more than once per checkpoint. And that easy to measure waste--it shows as more writes/transaction in pg_stat_bgwriter--it hurts throughput more than every reduction in seek overhead you might otherwise get from early writes. The big gain isn't chasing after cheap seeks. The best path is the one that decreases the total volume of writes. We played this game with the background writer work for 8.3. The main reason the one committed improved on the original design is that it completely eliminated doing work on popular buffers in advance. Everything happens at the last possible time, which is the optimal throughput situation. The 8.1/8.2 BGW used to try and write things out before they were strictly necessary, in hopes that that I/O would be free. But it rarely was, while there was always a cost to forcing them to disk early. And that cost is highest when you're talking about the higher usage blocks the checkpointer tends to write. When in doubt, always delay the write in hopes it will be written to again and you'll save work. So it occurs to me that perhaps we can watch for patterns where we have groups of adjacent writes that might stream, and when they form we might schedule them... Stop here. I mentioned something upthread that is worth repeating. The checkpointer doesn't know what concurrent reads are happening. We can't even easily make it know, not without adding a whole new source of IPC and locking contention among clients. Whatever scheduling decision the checkpointer might make with its limited knowledge of system I/O is going to be poor. You might find a 100% write benchmark that it helps, but those are not representative of the real world. In any mixed read/write case, the operating system is likely to do better. That's why things like sorting blocks sometimes seem to help someone, somewhere, with one workload, but then aren't repeatable. We can decide to trade throughput for latency by nudging the OS to deal with its queued writes more regularly. That will result in more total writes, which is the reason throughput drops. 
But the idea that PostgreSQL is going to do a better global job of I/O scheduling, that road is a hard one to walk. It's only going to happen if we pull all of the I/O into the database *and* do a better job on the entire process than the existing OS kernel does. That sort of dream, of outperforming the filesystem, it is very difficult to realize. There's a good reason that companies like Oracle stopped pushing so hard on recommending raw partitions. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] ECPG timestamp '%j'
Michael,

While looking at complaints from the Coverity scanner system, it looks like it's detected a case in ECPG where we provide a day-of-year format option (%j), but we never actually calculate what the day of year *is*, resulting in an uninitialized value. Other parts of the code (non-ECPG) appear to realize that timestamp2tm doesn't fill in day-of-year in the struct, and they calculate it afterwards. Perhaps ECPG needs to adopt that approach also, perhaps either in dttofmtasc_replace() or PGTYPEStimestamp_fmt_asc()?

I was able to get what I believe is an incorrect result through a bit of hacking on the ECPG test cases:

timestamp_fmt_asc: 0: 10922-abc%

after adding:

out = (char*) malloc(32);
i = PGTYPEStimestamp_fmt_asc(ts1, out, 31, "%j-abc%%");
printf("timestamp_fmt_asc: %d: %s\n", i, out);
free(out);

into pgtypeslib/dt_test.pgc. If you don't have time to look into this, let me know and I'll try and get back to it soon.

Thanks!

Stephen
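To illustrate the missing piece only: day-of-year has to be derived from the broken-down date before %j can print anything meaningful. This is plain C, not the ECPG fix itself (which would presumably live in dttofmtasc_replace() or PGTYPEStimestamp_fmt_asc()).

/* Day of year (1..366) from a year/month/day triple. */
static int
day_of_year(int year, int month, int day)
{
    /* cumulative days before each month in a non-leap year, 1-indexed */
    static const int before[] =
    {0, 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334};
    int         yday = before[month] + day;
    int         is_leap = (year % 4 == 0 && (year % 100 != 0 || year % 400 == 0));

    if (is_leap && month > 2)
        yday++;
    return yday;            /* e.g. day_of_year(2013, 7, 14) == 195 */
}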
Re: [HACKERS] Changing recovery.conf parameters into GUCs
On 7/10/13 9:39 AM, Heikki Linnakangas wrote: On 10.07.2013 02:54, Josh Berkus wrote: One bit of complexity I'd like to see go away is that we have two trigger files: one to put a server into replication, and one to promote it. The promotion trigger file is a legacy of warm standby, I believe. Maybe, now that we have pg_ctl promote available, we can eliminate the promotion trigger? No, see http://www.postgresql.org/message-id/5112a54b.8090...@vmware.com. Right, you had some good points in there. STONITH is so hard already, we need to be careful about eliminating options there. All the summaries added here have actually managed to revive this one usefully early in the release cycle! Well done. I just tried to apply Michael's 20130325_recovery_guc_v3.patch and the bit rot isn't that bad either. 5 rejection files, nothing big in them. The only overlap between the recovery.conf GUC work and refactoring the trigger file is that the trigger file is referenced in there, and we really need to point it somewhere to be most useful. Personally, I don't care if we support the old recovery.conf-trigger behavior as long as I'm not forced to use it. If you accept Heikki's argument for why the file can't go away altogether, and it's pretty reasonable, I think two changes reach a point where everyone can live with this: -We need a useful default filename for trigger_file to point at. -pg_ctl failover creates that file. As for the location to default to, the pg_standby docs suggest /tmp/pgsql.trigger.5432 while the Binary Replication Tutorial on the wiki uses a RedHat directory layout $PGDATA/failover The main reason I've preferred something in the data directory is that triggering a standby is too catastrophic for me to be comfortable putting it in /tmp. Any random hooligan with a shell account can trigger a standby and break its replication. Putting the unix socket into /tmp only works because the server creates the file as part of startup. Here that's not possible, because creating the trigger is the signalling mechanism. I don't think there needs to be a CLI interface for putting the alternate possible text into the trigger--that you can ask for 'fast' startup. It's nice to have available as an expert, but it's fine for that to be harder to do. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pg_memory_barrier() doesn't compile, let alone work, for me
Commit 9a20a9b2 breaks the build for me, using gcc on HPPA: xlog.c:2182: macro `pg_memory_barrier' used without args xlog.c:2546: macro `pg_memory_barrier' used without args make[4]: *** [xlog.o] Error 1 The reason for this is that the fallback definition of pg_memory_barrier has been wrong since day one; it needs this fix: -#define pg_memory_barrier(x) \ +#define pg_memory_barrier() \ However, fixing that doesn't yield much joy; initdb stalls and then crashes with PANIC: stuck spinlock (40054a88) detected at xlog.c:2182 The reason for that is that the code does not bother to initialize dummy_spinlock anywhere. It might accidentally fail to fail on machines where the unlocked state of a spinlock is all-zeroes (given a compiler that's not picky about the incorrect macro usage) ... but HPPA is not such a machine. Rather than trying to think of a fix for that, I'm of the opinion that we should rip this out. The fallback implementation of pg_memory_barrier ought to be pg_compiler_barrier(), on the theory that non-mainstream architectures don't have weak memory ordering anyway, or if they do you need to do some work to get PG to work on them. Or maybe we ought to stop pretending that the code is likely to work on arbitrary machines, and just #error if there's not a supplied machine-specific macro. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
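For context, the fallback under discussion is roughly shaped like the following (a sketch, not a verbatim copy of barrier.h). Both details show up here: the macro must take no arguments, and dummy_spinlock has to be explicitly initialized somewhere before the first barrier, since an all-zeroes slock_t is not the unlocked state on every platform, HPPA being the counterexample. Where that initialization call would live, and its name, are invented for illustration.

#include "storage/s_lock.h"

slock_t     dummy_spinlock;

#define pg_memory_barrier() \
    do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)

/* The missing piece: run once per process before the first barrier. */
void
init_dummy_spinlock(void)
{
    S_INIT_LOCK(&dummy_spinlock);
}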
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On Sunday, July 14, 2013, Greg Smith wrote: On 6/27/13 11:08 AM, Robert Haas wrote: I'm pretty sure Greg Smith tried it the fixed-sleep thing before and it didn't work that well. That's correct, I spent about a year whipping that particular horse and submitted improvements on it to the community. http://www.postgresql.org/message-id/4D4F9A3D.5070700@2ndquadrant.com and its updates downthread are good ones to compare this current work against. The important thing to realize about just delaying fsync calls is that it *cannot* increase TPS throughput. Not possible in theory, obviously doesn't happen in practice. The most efficient way to write things out is to delay those writes as long as possible. The longer you postpone a write, the more elevator sorting and write combining you get out of the OS. This is why operating systems like Linux come tuned for such delayed writes in the first place. Throughput and latency are linked; any patch that aims to decrease latency will probably slow throughput.

Do common low level IO benchmarking tools cover this territory? I've looked at Bonnie, which seems to be the most famous one, and it doesn't look like it covers effectiveness of write combining at all.

I've done my own casual benchmarking, and the results were astonishingly bad for the OS/FS. If I over-wrote 1024*1024 blocks of 8KB in random order and then fsynced the 8GB of data (divided into 8x1GB files, in deference to PG segment size) it took way longer than if I did the overwrite in block order and then fsynced that. This was a gift-horse machine not speced out to be a database server, but the Linux kernel is still the kernel regardless of the hardware it sits on, so I don't know how much that should matter.

To be clear, the writes did not take longer, it was the fsyncs that took longer. All writes were successfully absorbed into memory promptly.

Alas, I no longer have access to a machine which can absorb 8GB of writes into RAM without thinking twice and which I can use for casual experimentation.

Cheers,

Jeff
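A sketch of the kind of casual test described above: overwrite N 8KB blocks of an existing file in sequential or random order, then time the fsync. A single file is used for brevity (the test above used 8 x 1GB files), the calls are plain POSIX, and random() is drawn with replacement rather than as a shuffled permutation, so numbers from something like this are only suggestive.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCKSZ 8192

static double
now_sec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int
main(int argc, char **argv)
{
    long        nblocks = 131072;   /* 1GB of 8KB blocks by default */
    int         random_order;
    int         fd;
    char        buf[BLOCKSZ];
    double      start;

    if (argc < 3)
    {
        fprintf(stderr, "usage: %s existing-file seq|rand [nblocks]\n", argv[0]);
        return 1;
    }
    random_order = (strcmp(argv[2], "rand") == 0);
    if (argc > 3)
        nblocks = atol(argv[3]);

    fd = open(argv[1], O_WRONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    memset(buf, 'x', sizeof(buf));

    /* absorb all the writes into the OS cache first */
    for (long i = 0; i < nblocks; i++)
    {
        long        blk = random_order ? random() % nblocks : i;

        if (pwrite(fd, buf, BLOCKSZ, (off_t) blk * BLOCKSZ) != BLOCKSZ)
        {
            perror("pwrite");
            return 1;
        }
    }

    /* then time only the flush, which is where the difference shows up */
    start = now_sec();
    if (fsync(fd) != 0)
        perror("fsync");
    printf("fsync after %ld %s-order writes: %.3f s\n",
           nblocks, random_order ? "random" : "sequential", now_sec() - start);
    close(fd);
    return 0;
}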
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On Sunday, July 14, 2013, Greg Smith wrote: On 7/14/13 5:28 PM, james wrote: Some random seeks during sync can't be helped, but if they are done when we aren't waiting for sync completion then they are in effect free. That happens sometimes, but if you measure you'll find this doesn't actually occur usefully in the situation everyone dislikes. In a write heavy environment where the database doesn't fit in RAM, backends and/or the background writer are constantly writing data out to the OS. WAL is going out constantly as well, and in many cases that's competing for the disks too. While I think it is probably true that many systems don't separate WAL from non-WAL to different IO controllers, is it true that many systems that are in need of heavy IO tuning don't do so? I thought that that would be the first stop for any DBA of an highly IO-write constrained database. The most popular blocks in the database get high usage counts and they never leave shared_buffers except at checkpoint time. That's easy to prove to yourself with pg_buffercache. And once the write cache fills, every I/O operation is now competing. There is nothing happening for free. You're stealing I/O from something else any time you force a write out. The optimal throughput path for checkpoints turns out to be delaying every single bit of I/O as long as possible, in favor of the [backend|bgwriter] writes and WAL. Whenever you delay a buffer write, you have increased the possibility that someone else will write the same block again. And the buffers being written by the checkpointer are, on average, the most popular ones in the database. Writing any of them to disk pre-emptively has high odds of writing the same block more than once per checkpoint. Should the checkpointer make multiple passes over the buffer pool, writing out the high usage_count buffers first, because no one else is going to do it, and then going back for the low usage_count buffers in the hope they were already written out? On the other hand, if the checkpointer writes out a low-usage buffer, why would anyone else need to write it again soon? If it were likely to get dirtied often, it wouldn't be low usage. If it was dirtied rarely, it wouldn't be dirty anymore once written. Cheers, Jeff
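Purely to make the "multiple passes" question concrete, here is a speculative sketch with invented structures, not PostgreSQL's buffer manager: pass 1 writes dirty buffers above a usage-count threshold, on the theory that nobody else will ever write those, and pass 2 writes the rest.

typedef struct BufDesc
{
    int         usage_count;
    int         is_dirty;
} BufDesc;

static void
write_buffer(int buf_id)
{
    /* stand-in for the real write-and-queue-fsync path */
    (void) buf_id;
}

static void
checkpoint_two_pass(BufDesc *bufs, int nbuffers, int hot_threshold)
{
    for (int pass = 0; pass < 2; pass++)
    {
        for (int i = 0; i < nbuffers; i++)
        {
            int         hot = (bufs[i].usage_count >= hot_threshold);

            /* pass 0: hot buffers nobody else will write; pass 1: the rest */
            if (bufs[i].is_dirty && (pass == 0 ? hot : !hot))
                write_buffer(i);
        }
    }
}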
Re: [HACKERS] pg_memory_barrier() doesn't compile, let alone work, for me
On Sun, Jul 14, 2013 at 8:41 PM, Tom Lane t...@sss.pgh.pa.us wrote: Commit 9a20a9b2 breaks the build for me, using gcc on HPPA:

xlog.c:2182: macro `pg_memory_barrier' used without args
xlog.c:2546: macro `pg_memory_barrier' used without args
make[4]: *** [xlog.o] Error 1

The reason for this is that the fallback definition of pg_memory_barrier has been wrong since day one; it needs this fix:

-#define pg_memory_barrier(x) \
+#define pg_memory_barrier() \

Uggh, sorry.

However, fixing that doesn't yield much joy; initdb stalls and then crashes with PANIC: stuck spinlock (40054a88) detected at xlog.c:2182 The reason for that is that the code does not bother to initialize dummy_spinlock anywhere. It might accidentally fail to fail on machines where the unlocked state of a spinlock is all-zeroes (given a compiler that's not picky about the incorrect macro usage) ... but HPPA is not such a machine.

This would not be hard to fix, I think.

Rather than trying to think of a fix for that, I'm of the opinion that we should rip this out. The fallback implementation of pg_memory_barrier ought to be pg_compiler_barrier(), on the theory that non-mainstream architectures don't have weak memory ordering anyway, or if they do you need to do some work to get PG to work on them. Or maybe we ought to stop pretending that the code is likely to work on arbitrary machines, and just #error if there's not a supplied machine-specific macro.

Well, pg_memory_barrier() isn't even equivalent to pg_compiler_barrier() on x86, which has among the strongest memory orderings out there, so I think your first idea is a non-starter. A compiler barrier on x86 will fence reads from reads and writes from writes, but it will not prevent a read from being speculated before a subsequent write, which a full barrier must do. I'm pretty sure we've got latent memory-ordering risks in our existing code which we just haven't detected and fixed yet. Consider, for example, this exciting code from GetNewTransactionId:

myproc->subxids.xids[nxids] = xid;
mypgxact->nxids = nxids + 1;

I don't believe that's technically safe even on an architecture like x86, because the compiler could decide to reorder those assignments. Of course there is probably no reason to do so, and even if it does you'd have to get really unlucky to see a user-visible failure, and if you did you'd probably misguess the cause. However, I'm guessing that as we start using memory barriers more, and as we speed up the system generally, we're going to see bugs like this that are currently hidden start to become visible from time to time. Especially in view of that, I think it's wise to be pessimistic rather than optimistic about what barriers are needed, when we don't know for certain. That way, when we get a memory-ordering-related bug report, we can at least hope that the problem must be something specific to the problem area rather than a general failure to use the right memory barrier primitives.

My preference would be to fix this in a narrow way, by initializing dummy_spinlock somewhere. But if not, then I think #error is the only safe way to go.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
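To make the ordering concern concrete, here is a sketch of the usual fix for that pattern, using invented stand-in structs rather than the real PGPROC/PGXACT definitions, and assuming the pg_write_barrier()/pg_read_barrier() primitives from barrier.h: the xid must become visible before the updated count does, and a reader needs the mirror-image barrier.

#include "storage/barrier.h"    /* for pg_write_barrier(); in-tree header */

/* Minimal stand-ins for the fields involved; only the pattern matters. */
typedef struct { unsigned int xids[64]; } FakeSubXids;
typedef struct { FakeSubXids subxids; } FakeProc;
typedef struct { int nxids; } FakePgXact;

static void
publish_subxid(FakeProc *myproc, FakePgXact *mypgxact, unsigned int xid)
{
    int         nxids = mypgxact->nxids;

    myproc->subxids.xids[nxids] = xid;
    pg_write_barrier();         /* store the xid before publishing the count */
    mypgxact->nxids = nxids + 1;

    /* a reader doing the reverse needs pg_read_barrier() between loading
     * nxids and scanning the xids array */
}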
Re: [HACKERS] review: Non-recursive processing of AND/OR lists
On Wed, Jul 10, 2013 at 9:02 PM, Josh Berkus j...@agliodbs.com wrote: I think it's a waste of code to try to handle bushy trees. A list is not a particularly efficient representation of the pending list; this will probably be slower than recursing in the common case. I'd suggest keeping the logic to handle left-deep trees, which I find rather elegant, but ditching the pending list. Is there going to be further discussion of this patch, or do I return it? Considering it's not been updated, nor my comments responded to, in almost two weeks, I think we return it at this point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]
On Sun, Jul 07, 2013 at 06:37:26PM -0700, David Fetter wrote: On Sat, Jul 06, 2013 at 11:49:21AM +0100, Dean Rasheed wrote: Overall I think this patch offers useful additional functionality, in compliance with the SQL spec, which should be handy to simplify complex grouping queries. As I understand this feature, it is syntactic sugar for the typical case of an aggregate with a strict transition function. For example, min(x) FILTER (WHERE y > 0) is rigorously equivalent to min(CASE WHEN y > 0 THEN x END). Every SQL aggregate is strict (ISO 9075-2 section 4.15.4), so for standard queries it is *only* syntactic sugar. In PostgreSQL, it lets you do novel things with, for example, array_agg(). Is that accurate? I think this is ready for committer. The patch was thorough. I updated applicable comments, revisited some cosmetic choices, and made these functional changes: 1. The pg_stat_statements query jumble should incorporate the filter. 2. The patch did not update costing. I made it add the cost of the filter expression the same way we add the cost of the argument expressions. This makes min(x) FILTER (WHERE y > 0) match min(case when y > 0 then x end) in terms of cost, which is a fair start. At some point, we could do better by reducing the argument cost by the filter selectivity. A few choices/standard interpretations may deserve discussion. The patch makes filter clauses contribute to the subquery level chosen to be the aggregation query. This is observable through the behavior of these two standard-conforming queries: select (select count(outer_c) from (values (1)) t0(inner_c)) from (values (2),(3)) t1(outer_c); -- outer query is aggregation query select (select count(outer_c) filter (where inner_c > 0) from (values (1)) t0(inner_c)) from (values (2),(3)) t1(outer_c); -- inner query is aggregation query I believe SQL (ISO 9075-2 section 6.9 SR 3,4,6) does require this. Since that still isn't crystal-clear to me, I mention it in case anyone has a different reading. Distinct from that, albeit in a similar vein, SQL does not permit outer references in a filter clause. This patch permits them; I think that qualifies as a reasonable PostgreSQL extension. --- a/doc/src/sgml/keywords.sgml +++ b/doc/src/sgml/keywords.sgml @@ -3200,7 +3200,7 @@ </row> <row> <entry><token>OVER</token></entry> -<entry>reserved (can be function or type)</entry> +<entry>non-reserved</entry> I committed this one-line correction separately. --- a/src/backend/optimizer/plan/planagg.c +++ b/src/backend/optimizer/plan/planagg.c @@ -314,7 +314,7 @@ find_minmax_aggs_walker(Node *node, List **context) ListCell *l; Assert(aggref->agglevelsup == 0); - if (list_length(aggref->args) != 1 || aggref->aggorder != NIL) + if (list_length(aggref->args) != 1 || aggref->aggorder != NIL || aggref->agg_filter != NULL) return true; /* it couldn't be MIN/MAX */ /* note: we do not care if DISTINCT is mentioned ... */ I twitched upon reading this, because neither ORDER BY nor FILTER preclude the aggregate being MIN or MAX. Perhaps Andrew can explain why he put aggorder there back in 2009. All I can figure is that writing max(c ORDER BY x) is so unlikely that we'd too often waste the next syscache lookup. But the same argument would apply to DISTINCT. With FILTER, the rationale is entirely different. The aggregate could well be MIN/MAX; we just haven't implemented the necessary support elsewhere in this file. See attached patch revisions. 
The first patch edits find_minmax_aggs_walker() per my comments just now. The second is an update of your FILTER patch with the changes to which I alluded above; it applies atop the first patch. Would you verify that I didn't ruin anything? Barring problems, will commit. Are you the sole named author of this patch? That's what the CF page says, but that wasn't fully clear to me from the list discussions. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com *** a/src/backend/optimizer/plan/planagg.c --- b/src/backend/optimizer/plan/planagg.c *** *** 314,328 find_minmax_aggs_walker(Node *node, List **context) ListCell *l; Assert(aggref->agglevelsup == 0); ! if (list_length(aggref->args) != 1 || aggref->aggorder != NIL) return true; /* it couldn't be MIN/MAX */ ! /* note: we do not care if DISTINCT is mentioned ... */ ! curTarget = (TargetEntry *) linitial(aggref->args); aggsortop = fetch_agg_sort_op(aggref->aggfnoid); if (!OidIsValid(aggsortop)) return true; /* not a MIN/MAX aggregate */ if (contain_mutable_functions((Node *) curTarget->expr)) return true; /* not potentially
Re: [HACKERS] Eliminating PD_ALL_VISIBLE, take 2
On Fri, Jul 5, 2013 at 4:18 PM, Jeff Davis pg...@j-davis.com wrote: On Tue, 2013-07-02 at 10:12 -0700, Jeff Davis wrote: Regardless, this is at least a concrete issue that I can focus on, and I appreciate that. Are scans of small tables the primary objection to this patch, or are there others? If I solve it, will this patch make real progress? I had an idea here: What if we keep PD_ALL_VISIBLE, but make it more like other hints, and make it optional? If a page is all visible, either or both of the VM bit or PD_ALL_VISIBLE could be set (please suspend disbelief for a moment). Then, we could use a heuristic, like setting PD_ALL_VISIBLE in the first 256 pages of a relation; but in later pages, only setting it if the page is already dirty for some other reason. That has the following benefits: 1. Eliminates the worry over contention related to scans, because we wouldn't need to use the VM for small tables. And I don't think anyone was worried about using the VM on a large table scan. 2. Still avoids dirtying lots of pages after a data load. I'm not worried if a few MB of data is rewritten on a large table. 3. Eliminates the complex way in which we (ab)use WAL and the recovery mechanism to keep PD_ALL_VISIBLE and the VM bit in sync. Of course, there's a reason that PD_ALL_VISIBLE is not like a normal hint: we need to make sure that inserts/updates/deletes clear the VM bit. But my patch already addresses that by keeping the VM page pinned. I'm of the opinion that we ought to extract the parts of the patch that hold the VM pin for longer, review those separately, and if they're good and desirable, apply them. Although that optimization becomes more necessary if we were to adopt your proposal than it is now, it's really separate from this patch. Given that VM pin caching can be done with or without removing PD_ALL_VISIBLE, it seems to me that the fair comparison is between master + VM pin caching and master + VM pin caching + remove PD_ALL_VISIBLE. Comparing the latter vs. unpatched master seems to me to be confusing the issue. That has some weaknesses: as Heikki pointed out[1], you can defeat the cache of the pin by randomly seeking between 512MB regions during an update (would only be noticeable if it's all in shared buffers already, of course). But even in that case, it was a fairly modest degradation (20%), so overall this seems like a fairly minor drawback. I am not convinced. I thought about the problem of repeatedly switching pinned VM pages during the index-only scans work, and decided that we could live with it because, if the table was large enough that we were pinning VM pages frequently, we were also avoiding I/O. Of course, this is a logical fallacy, since the table could easily be large enough to have quite a few VM pages and yet small enough to fit in RAM. And, indeed, at least in the early days, an index scan could beat out an index-only scan by a significant margin on a memory-resident table, precisely because of the added cost of the VM lookups. I haven't benchmarked lately so I don't know for sure whether that's still the case, but I bet it is. From your other email: I have a gut feeling that the complexity we go through to maintain PD_ALL_VISIBLE is unnecessary and will cause us problems later. If we could combine freezing and marking all-visible, and use WAL for PD_ALL_VISIBLE in a normal fashion, then I'd be content with that. I think this idea is worth exploring, although I fear the overhead is likely to be rather large. We could find out, though. 
Suppose we simply change XLOG_HEAP2_VISIBLE to emit FPIs for the heap pages; how much does that slow down vacuuming a large table into which many pages have been bulk loaded? Sadly, I bet it's rather a lot, but I'd like to be wrong. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
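To spell out the heuristic Jeff proposes above (set the hint eagerly only early in the relation, otherwise only when the page is already dirty), a rough standalone sketch follows. The function and constant names are purely illustrative and are not part of PostgreSQL's buffer-manager API:

    /* Illustrative sketch only -- not PostgreSQL code. */
    #include <stdbool.h>
    #include <stdio.h>

    #define EAGER_ALL_VISIBLE_PAGES 256   /* the "first 256 pages" threshold suggested above */

    /* Decide whether to set PD_ALL_VISIBLE on a page just found to be all-visible. */
    static bool
    should_set_pd_all_visible(unsigned blkno, bool page_already_dirty)
    {
        if (blkno < EAGER_ALL_VISIBLE_PAGES)
            return true;               /* small tables: scans can skip the VM entirely */
        return page_already_dirty;     /* large tables: never dirty a page just for the hint */
    }

    int
    main(void)
    {
        printf("block 10, clean:   %d\n", should_set_pd_all_visible(10, false));    /* 1 */
        printf("block 5000, clean: %d\n", should_set_pd_all_visible(5000, false));  /* 0 */
        printf("block 5000, dirty: %d\n", should_set_pd_all_visible(5000, true));   /* 1 */
        return 0;
    }

The idea is that small tables never need the visibility map for scans, while in large tables the hint rides along on writes that were going to happen anyway.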
Re: [HACKERS] Add regression tests for COLLATE
On Sat, Jul 6, 2013 at 9:27 AM, Andrew Dunstan and...@dunslane.net wrote: Or maybe just invent a magic result file name such as none or do_not_run. I'm not keen to have us build up a large patchwork of regression tests that run on some platforms and not on others, though. Me, neither. But I don't really see a plausible alternative in this particular case, since we're trying to test a feature that only works in the presence of specific platform support. I don't think that having no buildfarm testing of collation support is better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]
On Sun, Jul 14, 2013 at 10:15:12PM -0400, Noah Misch wrote: On Sun, Jul 07, 2013 at 06:37:26PM -0700, David Fetter wrote: On Sat, Jul 06, 2013 at 11:49:21AM +0100, Dean Rasheed wrote: Overall I think this patch offers useful additional functionality, in compliance with the SQL spec, which should be handy to simplify complex grouping queries. As I understand this feature, it is syntactic sugar for the typical case of an aggregate with a strict transition function. For example, min(x) FILTER (WHERE y > 0) is rigorously equivalent to min(CASE WHEN y > 0 THEN x END). Every SQL aggregate is strict (ISO 9075-2 section 4.15.4), so for standard queries it is *only* syntactic sugar. In PostgreSQL, it lets you do novel things with, for example, array_agg(). Is that accurate? I think this is ready for committer. The patch was thorough. I updated applicable comments, revisited some cosmetic choices, and made these functional changes: 1. The pg_stat_statements query jumble should incorporate the filter. 2. The patch did not update costing. I made it add the cost of the filter expression the same way we add the cost of the argument expressions. This makes min(x) FILTER (WHERE y > 0) match min(case when y > 0 then x end) in terms of cost, which is a fair start. At some point, we could do better by reducing the argument cost by the filter selectivity. Thanks! A few choices/standard interpretations may deserve discussion. The patch makes filter clauses contribute to the subquery level chosen to be the aggregation query. This is observable through the behavior of these two standard-conforming queries: select (select count(outer_c) from (values (1)) t0(inner_c)) from (values (2),(3)) t1(outer_c); -- outer query is aggregation query select (select count(outer_c) filter (where inner_c > 0) from (values (1)) t0(inner_c)) from (values (2),(3)) t1(outer_c); -- inner query is aggregation query I believe SQL (ISO 9075-2 section 6.9 SR 3,4,6) does require this. Since that still isn't crystal-clear to me, I mention it in case anyone has a different reading. Distinct from that, albeit in a similar vein, SQL does not permit outer references in a filter clause. This patch permits them; I think that qualifies as a reasonable PostgreSQL extension. Thanks again. --- a/doc/src/sgml/keywords.sgml +++ b/doc/src/sgml/keywords.sgml @@ -3200,7 +3200,7 @@ </row> <row> <entry><token>OVER</token></entry> -<entry>reserved (can be function or type)</entry> +<entry>non-reserved</entry> I committed this one-line correction separately. --- a/src/backend/optimizer/plan/planagg.c +++ b/src/backend/optimizer/plan/planagg.c @@ -314,7 +314,7 @@ find_minmax_aggs_walker(Node *node, List **context) ListCell *l; Assert(aggref->agglevelsup == 0); - if (list_length(aggref->args) != 1 || aggref->aggorder != NIL) + if (list_length(aggref->args) != 1 || aggref->aggorder != NIL || aggref->agg_filter != NULL) return true; /* it couldn't be MIN/MAX */ /* note: we do not care if DISTINCT is mentioned ... */ I twitched upon reading this, because neither ORDER BY nor FILTER preclude the aggregate being MIN or MAX. Perhaps Andrew can explain why he put aggorder there back in 2009. All I can figure is that writing max(c ORDER BY x) is so unlikely that we'd too often waste the next syscache lookup. But the same argument would apply to DISTINCT. With FILTER, the rationale is entirely different. The aggregate could well be MIN/MAX; we just haven't implemented the necessary support elsewhere in this file. Excellent reasoning, and good catch. See attached patch revisions. 
The first patch edits find_minmax_aggs_walker() per my comments just now. The second is an update of your FILTER patch with the changes to which I alluded above; it applies atop the first patch. Would you verify that I didn't ruin anything? Barring problems, will commit. Are you the sole named author of this patch? That's what the CF page says, but that wasn't fully clear to me from the list discussions. While Andrew's help was invaluable and pervasive, I wrote the patch, and all flaws in it are my fault. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
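As a worked illustration of the equivalence discussed in this thread, the following sketch assumes a server with the FILTER patch applied; the throwaway table and data are invented for the example:

    -- Illustration only: a disposable table to exercise the equivalence.
    CREATE TEMP TABLE t (x int, y int);
    INSERT INTO t VALUES (1, 5), (2, -3), (NULL, 7), (4, 0);

    -- For a strict aggregate like min(), FILTER is pure syntactic sugar:
    SELECT min(x) FILTER (WHERE y > 0)     AS with_filter,   -- 1
           min(CASE WHEN y > 0 THEN x END) AS with_case      -- 1
    FROM t;

    -- For array_agg(), which does not skip NULL inputs, FILTER does more than
    -- the CASE rewrite can: the CASE version injects NULLs for filtered-out rows.
    SELECT array_agg(x) FILTER (WHERE y > 0)     AS with_filter, -- two elements, from the rows where y > 0
           array_agg(CASE WHEN y > 0 THEN x END) AS with_case    -- four elements, NULL where y <= 0
    FROM t;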
Re: [HACKERS] Proposal - Support for National Characters functionality
On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Yes, what I know almost all use utf8 without problems. Long time I didn't see any request for multi encoding support. Well, not *everything* can be represented as UTF-8; I think this is particularly an issue with Asian languages. If we chose to do it, I think that per-column encoding support would end up looking a lot like per-column collation support: it would be yet another per-column property along with typoid, typmod, and typcollation. I'm not entirely sure it's worth it, although FWIW I do believe Oracle has something like this. At any rate, it seems like quite a lot of work. Another idea would be to do something like what we do for range types - i.e. allow a user to declare a type that is a differently-encoded version of some base type. But even that seems pretty hard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
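For reference, the per-column collation feature that a per-column encoding property would structurally resemble already looks like this today (illustrative only; collation names depend on the locales installed on the platform):

    -- Each text column carries its own collation, much as a per-column
    -- encoding would have to be carried alongside typoid/typmod/typcollation.
    CREATE TABLE names (
        id       serial PRIMARY KEY,
        name_en  text COLLATE "en_US",
        name_raw text COLLATE "C"
    );

    -- The per-column default can also be overridden per expression:
    SELECT name_en FROM names ORDER BY name_en COLLATE "C";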
Re: [HACKERS] Proposal: template-ify (binary) extensions
On Fri, Jul 5, 2013 at 4:27 PM, Markus Wanner mar...@bluegap.ch wrote: One way to resolve that - and that seems to be the direction Dimitri's patch is going - is to make the extension depend on its template. (And thereby breaking the mental model of a template, IMO. In the spirit of that mental model, one could argue that code for stored procedures shouldn't be duplicated, but instead just reference its ancestor.) The security problems with this model have been discussed in every email thread about the extension template feature. There is a lot of (well-founded, IMHO) resistance to the idea of letting users install C code via an SQL connection. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] review: Non-recursive processing of AND/OR lists
On Sun, Jul 14, 2013 at 8:27 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, Jul 10, 2013 at 9:02 PM, Josh Berkus j...@agliodbs.com wrote: I think it's a waste of code to try to handle bushy trees. A list is not a particularly efficient representation of the pending list; this will probably be slower than recursing in the common case. I'd suggest keeping the logic to handle left-deep trees, which I find rather elegant, but ditching the pending list. Somehow I find it hard to believe that recursing would be more efficient than processing the items right there. The recursion is not direct either; transformExprRecurse() is going to call this function again, but after a few more switch-case comparisons. Agreed that there's overhead in allocating list items, but is it more overhead than pushing functions on the call stack? Not sure, so I leave it to others who understand such things better than I do. If by common-case you mean a list of just one logical AND/OR operator, then I agree that creating and destroying a list may incur overhead that is relatively very expensive. To that end, I have altered the patch, attached, to not build a pending list until we encounter a node with root_expr_kind in a right branch. We're getting bushy-tree processing with very little extra code, but if you deem it not worthwhile or adding complexity, please feel free to rip it out. Is there going to be further discussion of this patch, or do I return it? Considering it's not been updated, nor my comments responded to, in almost two weeks, I think we return it at this point. Sorry, I didn't notice that this patch was put back in 'Waiting on Author' state. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EnterpriseDB Inc. non_recursive_and_or_transformation_v6.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
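For readers following the left-deep versus bushy distinction, here is a rough standalone model of the non-recursive idea: walk down the left spine in a loop and keep deferred right-hand subtrees on an explicit pending stack. It is deliberately generic; it is not the patch's transformExprRecurse() integration, and unlike the v6 patch described above it uses the pending stack even for purely left-deep trees:

    /* Standalone model of non-recursive AND-list flattening; not PostgreSQL code. */
    #include <stdio.h>

    typedef struct Node {
        int          is_and;        /* 1 = AND node, 0 = leaf */
        struct Node *left, *right;
        const char  *name;          /* leaf label */
    } Node;

    static void
    flatten_and(Node *node)
    {
        Node *pending[64];          /* deferred right subtrees; fixed size for brevity */
        int   npending = 0;

        for (;;)
        {
            /* walk down the left spine, deferring each right child */
            while (node->is_and)
            {
                pending[npending++] = node->right;
                node = node->left;
            }

            printf("arg: %s\n", node->name);    /* leftmost remaining leaf */

            if (npending == 0)
                break;
            node = pending[--npending];         /* resume with the most recent right child */
        }
    }

    int
    main(void)
    {
        Node a = {0, NULL, NULL, "a"}, b = {0, NULL, NULL, "b"}, c = {0, NULL, NULL, "c"};
        Node ab  = {1, &a,  &b, NULL};
        Node abc = {1, &ab, &c, NULL};          /* (a AND b) AND c -- a left-deep tree */

        flatten_and(&abc);                      /* prints a, b, c in left-to-right order */
        return 0;
    }

The machine call stack stays flat no matter how deeply the expression nests; the bookkeeping moves into the explicit pending array instead, which is the trade-off being debated in this thread.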
Re: [HACKERS] Proposal - Support for National Characters functionality
On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Yes, what I know almost all use utf8 without problems. Long time I didn't see any request for multi encoding support. Well, not *everything* can be represented as UTF-8; I think this is particularly an issue with Asian languages. If we chose to do it, I think that per-column encoding support would end up looking a lot like per-column collation support: it would be yet another per-column property along with typoid, typmod, and typcollation. I'm not entirely sure it's worth it, although FWIW I do believe Oracle has something like this. Yes, the idea is that users will be able to declare columns of type NCHAR or NVARCHAR which will use the pre-determined encoding type. If we say that NCHAR is UTF-8 then the NCHAR column will be of UTF-8 encoding irrespective of the database encoding. It will be up to us to restrict what Unicode encodings we want to support for NCHAR/NVARCHAR columns. This is based on my interpretation of the SQL standard. As you allude to above, Oracle has a similar behaviour (they support UTF-16 as well). Support for UTF-16 will be difficult without linking with some external libraries such as ICU. At any rate, it seems like quite a lot of work. Thanks for putting my mind at ease ;-) Rgds, Arul Shaji Another idea would be to do something like what we do for range types - i.e. allow a user to declare a type that is a differently-encoded version of some base type. But even that seems pretty hard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers