Re: [HACKERS] updated emacs configuration
On Thu, Jun 13, 2013 at 6:27 PM, Peter Eisentraut pete...@gmx.net wrote: First, I propose adding a .dir-locals.el file to the top-level directory with basic emacs settings. These get applied automatically. This especially covers the particular tab and indentation settings that PostgreSQL uses. With this, casual developers will not need to modify any of their emacs settings. Yes please. I've had the pgsql stuff in my .emacs for-ever (ever since I was a student and compelled to do homework on Postgres) and knew the magical rules about naming the directory, but it always felt so dirty and very much a 'you need to know the trick' level of intimacy. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Add transforms feature
Peter Eisentraut pete...@gmx.net wrote: A transform is an SQL object that supplies the functions for converting between data types and procedural languages. For example, a transform could arrange that hstore is converted to an appropriate hash or dictionary object in PL/Perl or PL/Python. Nice! Continued from the 2013-01 commit fest. All known open issues have been fixed. You kept the PGXS-style makefile... -- Sent from my phone; excuse the brevity.
[HACKERS] Reduce maximum error in tuples estimation after vacuum.
Hello, PostgreSQL estimates the number of live tuples after a vacuum that has left some buffers unscanned. This estimation does well in most cases, but produces a completely different result when there is a strong imbalance in tuple density. For example,

create table t (a int, b int);
insert into t (select a, (random() * 10)::int
               from generate_series((select count(*) from t) + 1, 1000000) a);
update t set b = b + 1 where a < (select count(*) from t) * 0.7;
vacuum t;
delete from t where a < (select count(*) from t) * 0.99;

After this, pg_stat_user_tables.n_live_tup shows 417670, which is 41 times larger than the real number of rows, 10001. And what makes it worse, neither autovacuum nor autoanalyze will run until n_dead_tup goes above roughly 8 times the real number of tuples in the table under the default settings.

| postgres=# select n_live_tup, n_dead_tup
|              from pg_stat_user_tables where relname='t';
|  n_live_tup | n_dead_tup
| ------------+------------
|      417670 |          0
|
| postgres=# select reltuples from pg_class where relname='t';
|  reltuples
| -----------
|     417670
|
| postgres=# select count(*) from t;
|  count
| -------
|  10001

Using n_dead_tup before vacuuming seems to make it better, but I heard that that plan was abandoned for some reason I don't know. So I've come up with another plan - using the FSM to estimate the tuple density in unscanned pages. The point is to make the estimation rely on the uniformity of tuple length instead of tuple density. This change seems to keep the error within a few times the real number of tuples. Additional page reads for the FSM are about a 4000th (SlotsPerFSMPage) of the skipped pages, and I suppose this is tolerable during vacuum. The overall algorithm can be illustrated as below:

- Sum up used bytes, max offnum (PageGetMaxOffsetNumber), maximum free bytes for tuple data, and free bytes after page vacuum, through all scanned pages.
- Sum up the free bytes informed by the FSM through all skipped pages.
- Calculate the mean tuple length from the overall used bytes, the sum of max offnums, and the scanned pages.
- Guess the tuple density in skipped pages using the overall free bytes from the FSM and the mean tuple length calculated above.
- Finally, feed the estimated number of live tuples BEFORE vacuum into vac_estimate_reltuples.

Of course this method is affected by imbalance in tuple LENGTH, but the error also seems to be kept within a few times the number of tuples. For rows of invariable length, the test against head shows the following, where "tups est" is pg_class.reltuples and "tups real" is count(*).

 del% | pages | n_live_tup | tups est | tups real | est/real | bufs
------+-------+------------+----------+-----------+----------+------
 0.90 |  4425 |     100001 |   470626 |    100001 |    4.706 | 3985
 0.95 |  4425 |      50001 |   441196 |     50001 |    8.824 | 4206
 0.99 |  4425 |     417670 |   417670 |     10001 |   41.763 | 4383

and with the patch

 0.90 |  4425 |     106169 |   106169 |    100001 |    1.062 | 3985
 0.95 |  4425 |      56373 |    56373 |     50001 |    1.127 | 4206
 0.99 |  4425 |      10001 |    16535 |     10001 |    1.653 | 4383

What do you think about this?

=

The attached files are:
- vacuum_est_improve_20130614.patch: the patch for this proposal
- vactest.sql: SQL script to cause the situation
- vactest.sh: test script to find the errors relating to this patch
- test_result.txt: all of the test results for the various deletion ratios which the test script above yields
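The per-page arithmetic in the steps above can be condensed into a small standalone function. This is an illustrative sketch only, not code from the patch; the function name and parameters are invented here for the example:

```c
/*
 * Illustrative sketch (NOT the patch itself) of the proposed estimate:
 * derive a mean tuple length from the scanned pages, then guess how many
 * tuples fit in the skipped pages given the free space the FSM reports
 * for them.  All names below are invented for this example.
 */
static double
estimate_live_tuples(double scanned_tuples,      /* live tuples seen */
					 double scanned_used_bytes,  /* tuple bytes seen */
					 double skipped_pages,       /* pages not scanned */
					 double fsm_free_bytes,      /* FSM free space there */
					 double usable_bytes_per_page)
{
	/* mean tuple length, assuming lengths are roughly uniform */
	double		mean_tuple_len = scanned_used_bytes / scanned_tuples;

	/* bytes presumably occupied by tuples in the skipped pages */
	double		skipped_data_bytes = skipped_pages * usable_bytes_per_page
									 - fsm_free_bytes;

	return scanned_tuples + skipped_data_bytes / mean_tuple_len;
}
```

For instance, with 1000 scanned tuples occupying 100000 bytes, 50 skipped pages of 8000 usable bytes each, and 300000 free bytes reported by the FSM, the sketch yields 2000 live tuples.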
regards,
-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index d6d20fd..1e581c1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1280,7 +1280,8 @@ acquire_sample_rows(Relation onerel, int elevel,
 		*totalrows = vac_estimate_reltuples(onerel, true,
 											totalblocks,
 											bs.m,
-											liverows);
+											liverows,
+											onerel->rd_rel->reltuples);
 		if (bs.m > 0)
 			*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
 		else
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 641c740..4bdf0c1 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -501,10 +501,10 @@
 double
 vac_estimate_reltuples(Relation relation, bool is_analyze,
 					   BlockNumber total_pages,
 					   BlockNumber scanned_pages,
-					   double scanned_tuples)
+					   double scanned_tuples,
+					   double old_rel_tuples)
 {
 	BlockNumber old_rel_pages = relation->rd_rel->relpages;
-	double		old_rel_tuples = relation->rd_rel->reltuples;
 	double		old_density;
 	double		new_density;
 	double		multiplier;
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 7e46f9e..80304a6 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -396,7 +396,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 	double		num_tuples,
[HACKERS] Add visibility map information to pg_freespace.
Hello, I've added visibility map information to pg_freespace for my utility. It looks like this:

postgres=# select * from pg_freespace('t'::regclass);
 blkno | avail | all_visible
-------+-------+-------------
     0 |  7424 | t
     1 |  7424 | t
     2 |  7424 | t
     3 |  7424 | t
     4 |  7424 | t
     5 |  7424 | t
     6 |  7424 | t
     7 |  7424 | t
...

What do you think about this?

regards,
-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.0.sql b/contrib/pg_freespacemap/pg_freespacemap--1.0.sql
index 2adb52a..e38b466 100644
--- a/contrib/pg_freespacemap/pg_freespacemap--1.0.sql
+++ b/contrib/pg_freespacemap/pg_freespacemap--1.0.sql
@@ -9,12 +9,17 @@
 RETURNS int2
 AS 'MODULE_PATHNAME', 'pg_freespace'
 LANGUAGE C STRICT;
 
+CREATE FUNCTION pg_is_all_visible(regclass, bigint)
+RETURNS bool
+AS 'MODULE_PATHNAME', 'pg_is_all_visible'
+LANGUAGE C STRICT;
+
 -- pg_freespace shows the recorded space avail at each block in a relation
 CREATE FUNCTION
-  pg_freespace(rel regclass, blkno OUT bigint, avail OUT int2)
+  pg_freespace(rel regclass, blkno OUT bigint, avail OUT int2, all_visible OUT bool)
 RETURNS SETOF RECORD
 AS $$
-  SELECT blkno, pg_freespace($1, blkno) AS avail
+  SELECT blkno, pg_freespace($1, blkno) AS avail, pg_is_all_visible($1, blkno) AS all_visible
   FROM generate_series(0, pg_relation_size($1) / current_setting('block_size')::bigint - 1) AS blkno;
 $$
 LANGUAGE SQL;
diff --git a/contrib/pg_freespacemap/pg_freespacemap.c b/contrib/pg_freespacemap/pg_freespacemap.c
index f6f7d2e..de4eff7 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.c
+++ b/contrib/pg_freespacemap/pg_freespacemap.c
@@ -10,17 +10,20 @@
 #include "funcapi.h"
 #include "storage/freespace.h"
+#include "access/visibilitymap.h"
 
 PG_MODULE_MAGIC;
 
 Datum		pg_freespace(PG_FUNCTION_ARGS);
+Datum		pg_is_all_visible(PG_FUNCTION_ARGS);
 
 /*
  * Returns the amount of free space on a given page, according to the
  * free space map.
 */
 PG_FUNCTION_INFO_V1(pg_freespace);
+PG_FUNCTION_INFO_V1(pg_is_all_visible);
 
 Datum
 pg_freespace(PG_FUNCTION_ARGS)
@@ -38,7 +41,32 @@ pg_freespace(PG_FUNCTION_ARGS)
 				 errmsg("invalid block number")));
 
 	freespace = GetRecordedFreeSpace(rel, blkno);
-
 	relation_close(rel, AccessShareLock);
 	PG_RETURN_INT16(freespace);
 }
+
+Datum
+pg_is_all_visible(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	int64		blkno = PG_GETARG_INT64(1);
+	Buffer		vmbuffer = InvalidBuffer;
+	int			all_visible;
+	Relation	rel;
+
+	rel = relation_open(relid, AccessShareLock);
+
+	if (blkno < 0 || blkno > MaxBlockNumber)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid block number")));
+
+	all_visible = visibilitymap_test(rel, blkno, &vmbuffer);
+	if (vmbuffer != InvalidBuffer)
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+	relation_close(rel, AccessShareLock);
+	PG_RETURN_BOOL(all_visible);
+}
diff --git a/contrib/pg_freespacemap/pg_freespacemap.control b/contrib/pg_freespacemap/pg_freespacemap.control
index 34b695f..395350a 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.control
+++ b/contrib/pg_freespacemap/pg_freespacemap.control
@@ -1,5 +1,5 @@
 # pg_freespacemap extension
-comment = 'examine the free space map (FSM)'
+comment = 'examine the free space map (FSM) and visibility map (VM)'
 default_version = '1.0'
 module_pathname = '$libdir/pg_freespacemap'
 relocatable = true
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
(2013/06/12 23:07), Robert Haas wrote: On Mon, Jun 10, 2013 at 3:48 PM, Simon Riggs si...@2ndquadrant.com wrote: On 10 June 2013 11:51, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: I created a patch which improves the checkpoint I/O scheduler for stable transaction responses. Looks like good results, with good measurements. Should be an interesting discussion. +1. I suspect we want to poke at the algorithms a little here and maybe see if we can do this without adding new GUCs. Also, I think this is probably two separate patches, in the end. But the direction seems good to me.

Thank you for the comments! I have separated my patch into checkpoint-write and checkpoint-fsync parts. As you say, my patch has a lot of new GUCs, and I don't think they can be decided automatically. It is difficult for one checkpoint scheduler to suit all environments - virtual servers, public cloud servers, embedded servers, and so on. So the default parameter settings make it behave the same as before. Setting the parameters is primitive and difficult, but if we can set them correctly, the scheduler suits many environments and will not cause unintended behavior. I will try to take a version with fewer GUCs into consideration. And if you have a good idea, please discuss it here!

Best Regards,
-- 
Mitsumasa KONDO
NTT Open Source Software Center

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..0c0f215 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -141,9 +141,12 @@ static CheckpointerShmemStruct *CheckpointerShmem;
 /*
  * GUC parameters
  */
+int			CheckPointerWriteDelay = 200;
 int			CheckPointTimeout = 300;
 int			CheckPointWarning = 30;
 double		CheckPointCompletionTarget = 0.5;
+double		CheckPointSmoothTarget = 0.0;
+double		CheckPointSmoothMargin = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -715,7 +718,7 @@ CheckpointWriteDelay(int flags, double progress)
 		 * Checkpointer and bgwriter are no longer related so take the Big
 		 * Sleep.
 		 */
-		pg_usleep(100000L);
+		pg_usleep(CheckPointerWriteDelay * 1000L);
 	}
 	else if (--absorb_counter <= 0)
 	{
@@ -742,14 +745,36 @@ IsCheckpointOnSchedule(double progress)
 {
 	XLogRecPtr	recptr;
 	struct timeval now;
-	double		elapsed_xlogs,
+	double		original_progress,
+				elapsed_xlogs,
 				elapsed_time;
 
 	Assert(ckpt_active);
 
-	/* Scale progress according to checkpoint_completion_target. */
-	progress *= CheckPointCompletionTarget;
+	/* This variable is used by the smooth checkpoint schedule. */
+	original_progress = progress * CheckPointCompletionTarget;
+
+	/* Scale progress according to checkpoint_completion_target and checkpoint_smooth_target. */
+	if (progress >= CheckPointSmoothTarget)
+	{
+		/* Normal checkpoint schedule. */
+		progress *= CheckPointCompletionTarget;
+	}
+	else
+	{
+		/*
+		 * Smooth checkpoint schedule.
+		 *
+		 * At the start of a checkpoint, the I/O load average tends to be
+		 * high and transactions execute slowly.  This schedule reduces both
+		 * and improves I/O response.  As 'progress' approaches
+		 * CheckPointSmoothTarget, it converges to the normal checkpoint
+		 * schedule.  If you want a smoother checkpoint schedule, set
+		 * CheckPointSmoothTarget higher.
+		 */
+		progress *= ((CheckPointSmoothTarget - progress) / CheckPointSmoothTarget) *
+					(CheckPointSmoothMargin + 1 - CheckPointCompletionTarget) +
+					CheckPointCompletionTarget;
+	}
 
 	/*
 	 * Check against the cached value first. Only do the more expensive
 	 * calculations once we reach the target previously calculated.
@@ -779,6 +804,14 @@ IsCheckpointOnSchedule(double progress)
 		ckpt_cached_elapsed = elapsed_xlogs;
 		return false;
 	}
+	else if (original_progress < elapsed_xlogs)
+	{
+		ckpt_cached_elapsed = elapsed_xlogs;
+
+		/* smooth checkpoint write */
+		pg_usleep(CheckPointerWriteDelay * 1000L);
+		return false;
+	}
 }
 
 /*
@@ -793,6 +826,14 @@ IsCheckpointOnSchedule(double progress)
 		ckpt_cached_elapsed = elapsed_time;
 		return false;
 	}
+	else if (original_progress < elapsed_time)
+	{
+		ckpt_cached_elapsed = elapsed_time;
+
+		/* smooth checkpoint write */
+		pg_usleep(CheckPointerWriteDelay * 1000L);
+		return false;
+	}
 
 	/* It looks like we're on schedule. */
 	return true;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..d41dc17 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"checkpointer_write_delay", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+			gettext_noop("Checkpointer sleep time during dirty buffer writes in a checkpoint."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&CheckPointerWriteDelay,
+		200, 10, 1,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared
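For readers skimming the hunks above, the patch's progress scaling can be condensed into a small standalone function. This is my transcription for illustration only, with abbreviated names; it is not code from the patch:

```c
/*
 * Transcription (for illustration, not from the patch) of the proposed
 * scheduling rule: below checkpoint_smooth_target the progress multiplier
 * starts above checkpoint_completion_target and decays back toward it as
 * progress grows, throttling writes early in the checkpoint.
 */
static double
scale_progress(double progress, double completion_target,
			   double smooth_target, double smooth_margin)
{
	if (progress >= smooth_target)
		return progress * completion_target;	/* normal schedule */

	/* smooth schedule: multiplier decays toward completion_target */
	return progress *
		(((smooth_target - progress) / smooth_target) *
		 (smooth_margin + 1.0 - completion_target) + completion_target);
}
```

With smooth_target = 0 (the default) this reduces to the current behavior, progress * completion_target; with smooth_target = 0.5 and completion_target = 0.5, a progress of 0.25 is scaled to 0.1875 instead of 0.125, so the checkpointer believes it is further ahead and writes more slowly early on.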
Re: [HACKERS] Reduce maximum error in tuples estimation after vacuum.
Sorry, I made a mistake. Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp: The overall algorithm can be illustrated as below: - summing up used bytes, max offnum (PageGetMaxOffsetNumber), Not max offnum - the number of linp's used after page vacuum. maximum free bytes for tuple data, and free bytes after page vacuum, through all scanned pages. - summing up free bytes informed by the FSM through all skipped pages. - Calculate the mean tuple length from the overall used bytes and sum of max offnums, and scanned pages. Here also the same: not the sum of max offnums but the total of used entries (linp's). - Guess tuple density in skipped pages using the overall free bytes from the FSM and the mean tuple length calculated above. - Finally, feed the estimated number of live tuples BEFORE vacuum into vac_estimate_reltuples. regards, -- Kyotaro Horiguchi
Re: [HACKERS] Patch for fail-back without fresh backup
On Fri, Jun 14, 2013 at 10:11 AM, Samrat Revagade revagade.sam...@gmail.com wrote: Hello, We have already started a discussion on pgsql-hackers about the problem of taking a fresh backup during the failback operation; here is the link: http://www.postgresql.org/message-id/caf8q-gxg3pqtf71nvece-6ozraew5pwhk7yqtbjgwrfu513...@mail.gmail.com Let me again summarize the problem we are trying to address. When the master fails, the last few WAL files may not reach the standby. But the master may have gone ahead and made changes to its local file system after flushing WAL to local storage. So the master contains some file-system-level changes that the standby does not have, and at this point the data directory of the master is ahead of the standby's. Subsequently, the standby will be promoted as the new master. Later, when the old master wants to be a standby of the new master, it can't just join the setup, since there is an inconsistency between the two servers; we need to take a fresh backup from the new master. This can happen in both synchronous and asynchronous replication. A fresh backup is also needed in the case of a clean switch-over, because in the current HEAD the master does not wait for the standby to receive all the WAL up to the shutdown checkpoint record before shutting down the connection. Fujii Masao has already submitted a patch to handle the clean switch-over case, but the problem remains for the failback case. The process of taking a fresh backup is very time consuming when databases are of very big sizes, say several TB, and when the servers are connected over a relatively slow link. This would break the service level agreement of the disaster recovery system. So there is a need to improve the disaster recovery process in PostgreSQL. One way to achieve this is to maintain consistency between master and standby, which avoids the need for a fresh backup.
So our proposal for this problem is that the master must not make any file system level changes without confirming that the corresponding WAL record has been replicated to the standby. An alternative proposal (which will probably just reveal my lack of understanding about what is or isn't possible with WAL): provide a way to restart the master so that it rolls back the WAL changes that the slave hasn't seen. There are many suggestions and objections on pgsql-hackers about this problem. The brief summary is as follows:
Re: [HACKERS] Patch for fail-back without fresh backup
That will not happen if there is inconsistency in between both the servers. Please refer to the discussions on the link provided in the first post: http://www.postgresql.org/message-id/caf8q-gxg3pqtf71nvece-6ozraew5pwhk7yqtbjgwrfu513...@mail.gmail.com Regards, Samrat Revgade
Re: [HACKERS] Patch for fail-back without fresh backup
On 14.06.2013 12:11, Samrat Revagade wrote: We have already started a discussion on pgsql-hackers for the problem of taking fresh backup during the failback operation here is the link for that: http://www.postgresql.org/message-id/caf8q-gxg3pqtf71nvece-6ozraew5pwhk7yqtbjgwrfu513...@mail.gmail.com Let me again summarize the problem we are trying to address. When the master fails, last few WAL files may not reach the standby. But the master may have gone ahead and made changes to its local file system after flushing WAL to the local storage. So master contains some file system level changes that standby does not have. At this point, the data directory of master is ahead of standby's data directory. Subsequently, the standby will be promoted as new master. Later when the old master wants to be a standby of the new master, it can't just join the setup since there is inconsistency in between these two servers. We need to take the fresh backup from the new master. This can happen in both the synchronous as well as asynchronous replication. Did you see the thread on the little tool I wrote called pg_rewind? http://www.postgresql.org/message-id/519df910.4020...@vmware.com It solves that problem, for both clean and unexpected shutdown. It needs some more work and a lot more testing, but requires no changes to the backend. Robert Haas pointed out in that thread that it has a problem with hint bits that are not WAL-logged, but it will still work if you also enable the new checksums feature, which forces hint bit updates to be WAL-logged. Perhaps we could add a GUC to enable hint bits to be WAL-logged, regardless of checksums, to make pg_rewind work. I think that's a more flexible approach to solve this problem. It doesn't require an online feedback loop from the standby to master, for starters. - Heikki
Re: [HACKERS] Patch for fail-back without fresh backup
On Fri, Jun 14, 2013 at 4:12 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Robert Haas pointed out in that thread that it has a problem with hint bits that are not WAL-logged, I liked that tool a lot until Robert pointed out the above problem. I thought this was a show stopper because I can't really see any way to circumvent it unless we enable checksums or explicitly WAL-log hint bits. but it will still work if you also enable the new checksums feature, which forces hint bit updates to be WAL-logged. Are we expecting a lot of people to run their clusters with checksums on? Sorry, I haven't followed the checksum discussions and don't know how much overhead it causes. But if the general expectation is that checksums will be turned on most often, I agree pg_rewind is probably good enough. Perhaps we could add a GUC to enable hint bits to be WAL-logged, regardless of checksums, to make pg_rewind work. Wouldn't that be too costly? I mean, in the worst case every hint bit on a page may get updated separately. If each such update is WAL-logged, we are looking at a lot more unnecessary WAL traffic. I think that's a more flexible approach to solve this problem. It doesn't require an online feedback loop from the standby to master, for starters. I agree. That's a big advantage of pg_rewind. Unfortunately, it can't work with 9.3 and below because of the hint bits issue; otherwise it would have been even more cool. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
Re: [HACKERS] Patch for fail-back without fresh backup
On Fri, Jun 14, 2013 at 2:51 PM, Benedikt Grundmann bgrundm...@janestreet.com wrote: A alternative proposal (which will probably just reveal my lack of understanding about what is or isn't possible with WAL). Provide a way to restart the master so that it rolls back the WAL changes that the slave hasn't seen. WAL records in PostgreSQL can only be used for physical redo. They can not be used for undo. So what you're suggesting is not possible though I am sure a few other databases do that. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
Re: [HACKERS] Patch for fail-back without fresh backup
On 14.06.2013 14:06, Pavan Deolasee wrote: On Fri, Jun 14, 2013 at 4:12 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Robert Haas pointed out in that thread that it has a problem with hint bits that are not WAL-logged, I liked that tool a lot until Robert pointed out the above problem. I thought this was a show stopper because I can't really see any way to circumvent it unless we enable checksums or explicitly WAL-log hint bits. but it will still work if you also enable the new checksums feature, which forces hint bit updates to be WAL-logged. Are we expecting a lot of people to run their clusters with checksums on? Sorry, I haven't followed the checksum discussions and don't know how much overhead it causes. But if the general expectation is that checksums will be turned on most often, I agree pg_rewind is probably good enough. Well, time will tell I guess. The biggest overhead with the checksums is exactly the WAL-logging of hint bits. Perhaps we could add a GUC to enable hint bits to be WAL-logged, regardless of checksums, to make pg_rewind work. Wouldn't that be too costly? I mean, in the worst case every hint bit on a page may get updated separately. If each such update is WAL-logged, we are looking at a lot more unnecessary WAL traffic. Yep, same as with checksums. I was not very enthusiastic about the checksums patch because of that, but a lot of people are willing to pay that price. Maybe we can figure out a way to reduce that cost in 9.4. It'd benefit the checksums greatly. For pg_rewind, we wouldn't actually need a full-page image for hint bit updates, just a small record saying hey, I touched this page. And you'd only need to write that the first time a page is touched after a checkpoint. I think that's a more flexible approach to solve this problem. It doesn't require an online feedback loop from the standby to master, for starters. I agree. That's a big advantage of pg_rewind. 
Unfortunately, it can't work with 9.3 and below because of the hint bits issue, otherwise it would have been even more cool. The proposed patch is clearly not 9.3 material either. If anything, there's a much better chance that we could still sneak in a GUC to allow hint bits to be WAL-logged without checksums in 9.3. All the code is there; it'd just be a new GUC to control it separately from checksums. - Heikki
Re: [HACKERS] Patch for fail-back without fresh backup
On Fri, Jun 14, 2013 at 12:20 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: For pg_rewind, we wouldn't actually need a full-page image for hint bit updates, just a small record saying hey, I touched this page. And you'd only need to write that the first time a page is touched after a checkpoint. I would expect that to be about the same cost though. The latency of the fsync on the WAL record before being able to flush the buffer is the biggest cost. The proposed patch is clearly not 9.3 material either. If anything, there's a much better chance that we could still sneak in a GUC to allow hint bits to be WAL-logged without checksums in 9.3. All the code is there; it'd just be a new GUC to control it separately from checksums. On the other hand, if you're going to WAL-log the hint bits, why not enable checksums? Do we allow turning off checksums after a database is initdb'd? IIRC we can't turn it on later, but I don't see why we couldn't turn them off. -- greg
Re: [HACKERS] MD5 aggregate
On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com wrote: Attached is a patch implementing a new aggregate function md5_agg() to compute the aggregate MD5 sum across a number of rows. This is something I've wished for a number of times. I think the primary use case is to do a quick check that 2 tables, possibly on different servers, contain the same data, using a query like

SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

or

SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

these would be equivalent to

SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo;

but without the excessive memory consumption for the intermediate concatenated string, and the resulting 1GB table size limit. It's more efficient to calculate per-row md5, and then sum() them. This avoids the need for ORDER BY. -- marko
[HACKERS] Issue with PGC_BACKEND parameters
I had observed a problem with PGC_BACKEND parameters while testing the patch for the ALTER SYSTEM command.

Problem statement: If I change a PGC_BACKEND parameter directly in postgresql.conf and then do pg_reload_conf() and reconnect, it will still show the old value.

Detailed steps:
1. Start the server with default settings
2. Connect a client
3. show log_connections; -- it will show off; this is correct.
4. Change log_connections in postgresql.conf to on
5. Issue select pg_reload_conf() in the client (which was started in step 2)
6. Connect a new client
7. show log_connections; -- it will show off; this is incorrect.

The problem is in step 7: it should show on. This problem occurs only on Windows. The reason is that on Windows, when a new session is started, the changed parameters are loaded into the new backend via the global/config_exec_params file. The flow is SubPostmasterMain() -> read_nondefault_variables() -> set_config_option(). In the code below in function set_config_option(), it will not allow a PGC_BACKEND variable to be changed, and the comments even say that only the postmaster is allowed to change it and that the change will propagate to subsequently started backends - but this is not TRUE for Windows.

switch (record->context)
{
	..
	..
	case PGC_BACKEND:
		if (context == PGC_SIGHUP)
		{
			/*
			 * If a PGC_BACKEND parameter is changed in the config file,
			 * we want to accept the new value in the postmaster (whence
			 * it will propagate to subsequently-started backends), but
			 * ignore it in existing backends. This is a tad klugy, but
			 * necessary because we don't re-read the config file during
			 * backend start.
			 */
			if (IsUnderPostmaster)
				return -1;
		}
}

I think that to fix the issue we need to pass into set_config_option() the information about whether a PGC_BACKEND parameter is allowed to change. One way is to pass a new parameter. Kindly let me know your suggestions.

With Regards,
Amit Kapila.
Re: [HACKERS] SPGist triple parity concept doesn't work
Anyway I now think that we might be better off with the other idea of abandoning an insertion and retrying if we get a lock conflict. Done - look at the patch. I was faced with the fact that my mail is considered spam by postgresql.org, so I repeat some thoughts from the previous mail: I considered the idea of forbidding placement of a child on the same page as its parent, but this implementation a) could significantly increase the size of the index, b) doesn't solve Greg's point. We definitely need a new idea for the locking protocol, and I'll return to this problem in autumn (sorry, I haven't time in summer to do this research). -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ spgist_deadlock-1.patch.gz Description: Unix tar archive
Re: [HACKERS] [PATCH] Add transforms feature
On 6/14/13 3:46 AM, Cédric Villemain wrote: You kept PGXS style makefile... I know, but that's a separate issue that hasn't been decided yet.
Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib
On 6/13/13 9:20 PM, amul sul wrote: Agree, only if we consider that these contrib modules are always going to be deployed with postgresql. But what if a user is going to install such a module elsewhere, i.e. not from the contrib directory of the pg source? Why would anyone do that?
Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib
On Fri, Jun 14, 2013 at 9:35 PM, Peter Eisentraut pete...@gmx.net wrote: On 6/13/13 9:20 PM, amul sul wrote: Agree, only if we consider that these contrib modules are always going to be deployed with postgresql. But what if a user is going to install such a module elsewhere, i.e. not from the contrib directory of the pg source? Why would anyone do that? Is he perhaps saying install such a module *from* elsewhere? Like directly from the source directory of a module, using something like the following:

cd /path/to/module-source
make USE_PGXS=1 PG_CONFIG=/path/to/pg_config
make USE_PGXS=1 PG_CONFIG=/path/to/pg_config install

when the user does not work with the pg source directly and does not have postgresql-contrib installed? Am I missing something here? -- Amit Langote
Re: [HACKERS] Patch for fail-back without fresh backup
Heikki Linnakangas hlinnakan...@vmware.com writes: Well, time will tell I guess. The biggest overhead with the checksums is exactly the WAL-logging of hint bits. Refresh my memory as to why we need to WAL-log hints for checksumming? I just had my nose in the part of the checksum patch that tediously copies entire pages out of shared buffers to avoid possible instability of the hint bits while we checksum and write the page. Given that we're paying that cost, I don't see why we'd need to do any extra WAL-logging (above and beyond the log-when-freeze cost that we have to pay already). But I've not absorbed any caffeine yet today, so maybe I'm just missing it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
Marko Kreen mark...@gmail.com writes: On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com wrote: Attached is a patch implementing a new aggregate function md5_agg() to compute the aggregate MD5 sum across a number of rows. It's more efficient to calculate per-row md5, and then sum() them. This avoids the need for ORDER BY. Good point. The aggregate md5 function also fails to distinguish the case where we have 'xyzzy' followed by 'xyz' in two adjacent rows from the case where they contain 'xyz' followed by 'zyxyz'. Now, as against that, you lose any sensitivity to the ordering of the values. Personally I'd be a bit inclined to xor the per-row md5's rather than sum them, but that's a small matter. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
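Tom's two observations — the delimiter-less concatenation ambiguity, and the order-insensitivity you get from summing per-row digests — can be demonstrated with a small Python sketch (purely illustrative, not the proposed patch; the 128-bit modulo sum is just one possible combining rule):

```python
import hashlib

def md5_concat(rows):
    # naive aggregate: one running digest over the undelimited row data
    h = hashlib.md5()
    for r in rows:
        h.update(r.encode())
    return h.hexdigest()

def md5_sum(rows):
    # per-row md5, combined with a sum modulo 2**128 (order-insensitive)
    total = sum(int(hashlib.md5(r.encode()).hexdigest(), 16) for r in rows)
    return "%032x" % (total % (1 << 128))

# 'xyzzy','xyz' and 'xyz','zyxyz' both concatenate to 'xyzzyxyz'...
assert md5_concat(["xyzzy", "xyz"]) == md5_concat(["xyz", "zyxyz"])
# ...but their per-row sums differ, while row order no longer matters
assert md5_sum(["xyzzy", "xyz"]) != md5_sum(["xyz", "zyxyz"])
assert md5_sum(["xyzzy", "xyz"]) == md5_sum(["xyz", "xyzzy"])
```

The last assertion is the trade-off Tom names: the combined digest no longer depends on row order at all.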
Re: [HACKERS] Patch for fail-back without fresh backup
On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote: Hello, We have already started a discussion on pgsql-hackers for the problem of taking a fresh backup during the failback operation; here is the link for that: http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb jgwrfu513...@mail.gmail.com Let me again summarize the problem we are trying to address. When the master fails, the last few WAL files may not reach the standby. But the master may have gone ahead and made changes to its local file system after flushing WAL to the local storage. So the master contains some file system level changes that the standby does not have. At this point, the data directory of the master is ahead of the standby's data directory. Subsequently, the standby will be promoted as the new master. Later when the old master wants to be a standby of the new master, it can't just join the setup since there is an inconsistency between these two servers. We need to take a fresh backup from the new master. This can happen in both synchronous and asynchronous replication. A fresh backup is also needed in case of a clean switch-over, because in the current HEAD the master does not wait for the standby to receive all the WAL up to the shutdown checkpoint record before shutting down the connection. Fujii Masao has already submitted a patch to handle the clean switch-over case, but the problem still remains for the failback case. The process of taking a fresh backup is very time consuming when databases are of very big sizes, say several TBs, and when the servers are connected over a relatively slow link. This would break the service level agreement of the disaster recovery system. So there is a need to improve the process of disaster recovery in PostgreSQL. One way to achieve this is to maintain consistency between master and standby, which avoids the need for a fresh backup. 
So our proposal on this problem is that we must ensure that the master does not make any file system level changes without confirming that the corresponding WAL record is replicated to the standby. How will you take care of the extra WAL on the old master during recovery? If it replays WAL which has not reached the new master, it can be a problem. With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch for fail-back without fresh backup
On 14.06.2013 16:08, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: Well, time will tell I guess. The biggest overhead with the checksums is exactly the WAL-logging of hint bits. Refresh my memory as to why we need to WAL-log hints for checksumming? Torn pages:

1. Backend sets a hint bit, dirtying the buffer.
2. Checksum is calculated, and buffer is written out to disk.
3. Crash.

If the page is torn, the checksum won't match. Without checksums, a torn page is not a problem with hint bits, as a single bit can't be torn and the page is otherwise intact. But with checksums, it causes a checksum failure. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
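The failure mode Heikki describes can be sketched with a toy page model (an assumed 4-byte CRC header and 512-byte write granularity, purely for illustration; PostgreSQL's real page format and checksum algorithm differ):

```python
import zlib

PAGE, SECTOR = 8192, 512

def with_checksum(body: bytes) -> bytes:
    # toy page: 4-byte checksum header followed by the body
    return zlib.crc32(body).to_bytes(4, "little") + body

def verify(page: bytes) -> bool:
    return page[:4] == zlib.crc32(page[4:]).to_bytes(4, "little")

body_old = bytes(PAGE - 4)
on_disk = with_checksum(body_old)          # fully written image, verifies fine
assert verify(on_disk)

body_new = bytearray(body_old)
body_new[4000] |= 0x80                     # hint bit set in a later sector
page_new = with_checksum(bytes(body_new))

# crash mid-write: only the first sector of the new image reaches disk,
# so the new checksum now sits in front of the old data
torn = page_new[:SECTOR] + on_disk[SECTOR:]
assert not verify(torn)   # checksum failure, though only a hint bit was lost
```

Without the checksum header the torn image would still be usable — it differs from a valid page by a single hint bit — which is why unlogged hint-bit writes were harmless before checksums.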
Re: [HACKERS] Patch for fail-back without fresh backup
Heikki Linnakangas hlinnakan...@vmware.com writes: On 14.06.2013 16:08, Tom Lane wrote: Refresh my memory as to why we need to WAL-log hints for checksumming? Torn pages: So it's not that we actually need to log the individual hint bit changes, it's that we need to WAL-log a full page image on the first update after a checkpoint, so as to recover from torn-page cases. Which one are we doing? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On Fri, Jun 14, 2013 at 2:14 PM, Tom Lane t...@sss.pgh.pa.us wrote: Marko Kreen mark...@gmail.com writes: On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com wrote: Attached is a patch implementing a new aggregate function md5_agg() to compute the aggregate MD5 sum across a number of rows. It's more efficient to calculate per-row md5, and then sum() them. This avoids the need for ORDER BY. Good point. The aggregate md5 function also fails to distinguish the case where we have 'xyzzy' followed by 'xyz' in two adjacent rows from the case where they contain 'xyz' followed by 'zyxyz'. Now, as against that, you lose any sensitivity to the ordering of the values. Personally I'd be a bit inclined to xor the per-row md5's rather than sum them, but that's a small matter. regards, tom lane xor works but only if each row is different (e.g. at the very least all columns together make a unique key). -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch for fail-back without fresh backup
On 2013-06-14 09:08:15 -0400, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: Well, time will tell I guess. The biggest overhead with the checksums is exactly the WAL-logging of hint bits. Refresh my memory as to why we need to WAL-log hints for checksumming? I just had my nose in the part of the checksum patch that tediously copies entire pages out of shared buffers to avoid possible instability of the hint bits while we checksum and write the page. I am really rather uncomfortable with that piece of code, and I hacked it up after Jeff Janes had reported a bug there (the one aborting WAL replay too early...). So I am very happy that you are looking at it. Jeff Davis and I were talking about whether the usage of PGXACT->delayChkpt makes the whole thing sufficiently safe at pgcon - we couldn't find any real danger but... Given that we're paying that cost, I don't see why we'd need to do any extra WAL-logging (above and beyond the log-when-freeze cost that we have to pay already). But I've not absorbed any caffeine yet today, so maybe I'm just missing it. The usual torn page spiel I think. If we crash while only one half of the page made it to disk we would get spurious checksum failures from there on. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib
On 06/14/2013 08:35 AM, Peter Eisentraut wrote: On 6/13/13 9:20 PM, amul sul wrote: Agree, only if we consider these contrib module is always gonna deployed with the postgresql. But, what if user going to install such module elsewhere i.e. not from contrib directory of pg source. Why would anyone do that? Maybe they wouldn't. I do think we need to make sure that we have at least buildfarm coverage of pgxs module building and testing. I have some coverage of a few extensions I have written, which exercise that, so maybe that will suffice. If not, maybe we need to have one module that only builds via pgxs and is built after an install (i.e. not via the standard contrib build). I don't really like the directory layout we use for these modules anyway, so I'm not sure they constitute best practice for extension builders. Lately I have been using an extension skeleton that looks something like this:

License
Readme.md
META.json (for pgxn)
extension.control
Makefile
doc/extension.md (soft linked to ../Readme.md)
src/extension.c
sql/extension.sql
test/sql/extension.sql
test/expected/extension.out

cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch for fail-back without fresh backup
On 14.06.2013 16:21, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 14.06.2013 16:08, Tom Lane wrote: Refresh my memory as to why we need to WAL-log hints for checksumming? Torn pages: So it's not that we actually need to log the individual hint bit changes, it's that we need to WAL-log a full page image on the first update after a checkpoint, so as to recover from torn-page cases. Which one are we doing? Correct. We're doing the latter, see XLogSaveBufferForHint(). - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch for fail-back without fresh backup
On 2013-06-14 09:21:52 -0400, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 14.06.2013 16:08, Tom Lane wrote: Refresh my memory as to why we need to WAL-log hints for checksumming? Torn pages: So it's not that we actually need to log the individual hint bit changes, it's that we need to WAL-log a full page image on the first update after a checkpoint, so as to recover from torn-page cases. Which one are we doing? MarkBufferDirtyHint() loggs an FPI (just not via a BKP block) via XLogSaveBufferForHint() iff XLogCheckBuffer() says we need to by comparing GetRedoRecPtr() with the page's lsn. Otherwise we don't do anything besides marking the buffer dirty. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
* Tom Lane (t...@sss.pgh.pa.us) wrote: Marko Kreen mark...@gmail.com writes: On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com wrote: Attached is a patch implementing a new aggregate function md5_agg() to compute the aggregate MD5 sum across a number of rows. It's more efficient to calculate per-row md5, and then sum() them. This avoids the need for ORDER BY. Good point. The aggregate md5 function also fails to distinguish the case where we have 'xyzzy' followed by 'xyz' in two adjacent rows from the case where they contain 'xyz' followed by 'zyxyz'. Now, as against that, you lose any sensitivity to the ordering of the values. Personally I'd be a bit inclined to xor the per-row md5's rather than sum them, but that's a small matter. Where I'd take this is actually in a completely different direction: I'd like the aggregate to be able to match the results of running the 'md5sum' unix utility on a file that's been COPY'd out. Yes, that means we'd need a way to ask "what would this row look like if it was sent through COPY with these parameters", but I've long wanted that also. No, no clue about how to put all that together. Yes, having this would be better than nothing, so I'm still for adding this even if we can't make it match COPY output. :) Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Patch for fail-back without fresh backup
On 14.06.2013 16:15, Andres Freund wrote: On 2013-06-14 09:08:15 -0400, Tom Lane wrote: I just had my nose in the part of the checksum patch that tediously copies entire pages out of shared buffers to avoid possible instability of the hint bits while we checksum and write the page. I am really rather uncomfortable with that piece of code, and I hacked it up after Jeff Janes had reported a bug there (the one aborting WAL replay too early...). So I am very happy that you are looking at it. Hmm. In XLogSaveBufferForHint(): * Note that this only works for buffers that fit the standard page model, * i.e. those for which buffer_std == true The free-space map uses non-standard pages, and calls MarkBufferDirtyHint(). Isn't that completely broken for the FSM? If I'm reading it correctly, what will happen is that replay will completely zero out all FSM pages that have been touched. All the FSM data is between pd_lower and pd_upper, which on standard pages is the hole. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
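Heikki's concern can be illustrated with a toy model of hole-compressed full-page images (the pd_lower/pd_upper values are invented for illustration; the point is only that hole compression assumes the gap between them is empty):

```python
# A "standard" page keeps all data outside [pd_lower, pd_upper), so an FPI
# can store just the head and tail and let replay re-zero the gap. FSM pages
# keep their data *inside* that gap, so the same compression destroys it.
PAGE = 8192
pd_lower, pd_upper = 24, 8100   # invented values for illustration

def take_fpi_standard(page: bytes):
    # store only the parts outside the hole
    return page[:pd_lower], page[pd_upper:]

def replay_fpi(head: bytes, tail: bytes) -> bytes:
    return head + bytes(PAGE - len(head) - len(tail)) + tail

fsm_page = (bytes(pd_lower)
            + b"\x07" * (pd_upper - pd_lower)   # FSM slots live in the "hole"
            + bytes(PAGE - pd_upper))
replayed = replay_fpi(*take_fpi_standard(fsm_page))

assert replayed != fsm_page                                       # data lost
assert replayed[pd_lower:pd_upper] == bytes(pd_upper - pd_lower)  # zeroed out
```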
Re: [HACKERS] MD5 aggregate
On 06/14/2013 09:40 AM, Stephen Frost wrote: * Tom Lane (t...@sss.pgh.pa.us) wrote: Marko Kreen mark...@gmail.com writes: On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com wrote: Attached is a patch implementing a new aggregate function md5_agg() to compute the aggregate MD5 sum across a number of rows. It's more efficient to calculate per-row md5, and then sum() them. This avoids the need for ORDER BY. Good point. The aggregate md5 function also fails to distinguish the case where we have 'xyzzy' followed by 'xyz' in two adjacent rows from the case where they contain 'xyz' followed by 'zyxyz'. Now, as against that, you lose any sensitivity to the ordering of the values. Personally I'd be a bit inclined to xor the per-row md5's rather than sum them, but that's a small matter. Where I'd take this is actually in a completely different direction.. I'd like the aggregate to be able to match the results of running the 'md5sum' unix utility on a file that's been COPY'd out. Yes, that means we'd need a way to get back what would this row look like if it was sent through COPY with these parameters, but I've long wanted that also. No, no clue about how to put all that together. Yes, having this would be better than nothing, so I'm still for adding this even if we can't make it match COPY output. :) I'd rather go the other way, processing the records without having to process them otherwise at all. Turning things into text must slow things down, surely. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch for fail-back without fresh backup
On 2013-06-14 09:21:52 -0400, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 14.06.2013 16:08, Tom Lane wrote: Refresh my memory as to why we need to WAL-log hints for checksumming? Torn pages: So it's not that we actually need to log the individual hint bit changes, it's that we need to WAL-log a full page image on the first update after a checkpoint, so as to recover from torn-page cases. Which one are we doing? From quickly looking at the code again I think the MarkBufferDirtyHint() code makes at least one assumption that isn't correct in the face of checksums. It tests for the need to dirty the page with:

if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) != (BM_DIRTY | BM_JUST_DIRTIED))

*before* taking a lock. A comment explains why that is safe: * Since we make this test unlocked, there's a chance we * might fail to notice that the flags have just been cleared, and failed * to reset them, due to memory-ordering issues. That's fine for the classical usecase without checksums, but what about the following scenario:

1) page is dirtied, FPI is logged
2) SetHintBits gets called on the same page, holding only a share lock
3) checkpointer/bgwriter/... writes out the page, clearing the dirty flag
4) checkpoint finishes, updates redo ptr
5) SetHintBits actually modifies the hint bits
6) SetHintBits calls MarkBufferDirtyHint, which doesn't notice that the page isn't dirty anymore and thus doesn't check whether something needs to get logged.

At this point we have a page that has been modified without an FPI. But it's not marked dirty, so it won't be written out without further cause. Which might be fine since there's no cause to write out the page and there probably won't be anyone doing that without logging an FPI independently. Can anybody see a scenario where this is actually dangerous? 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch for fail-back without fresh backup
On 14.06.2013 17:01, Andres Freund wrote: At this point we have a page that has been modified without an FPI. But it's not marked dirty, so it won't be written out without further cause. Which might be fine since there's no cause to write out the page and there probably won't be anyone doing that without logging an FPI independently. Can anybody see a scenario where this is actually dangerous? The code also relies on that being safe during recovery: * If we're in recovery we cannot dirty a page because of a hint. * We can set the hint, just not dirty the page as a result so the * hint is lost when we evict the page or shutdown. * * See src/backend/storage/page/README for longer discussion. */ if (RecoveryInProgress()) return; I can't immediately see a problem with that. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch for fail-back without fresh backup
On 2013-06-14 16:58:38 +0300, Heikki Linnakangas wrote: On 14.06.2013 16:15, Andres Freund wrote: On 2013-06-14 09:08:15 -0400, Tom Lane wrote: I just had my nose in the part of the checksum patch that tediously copies entire pages out of shared buffers to avoid possible instability of the hint bits while we checksum and write the page. I am really rather uncomfortable with that piece of code, and I hacked it up after Jeff Janes had reported a bug there (The one aborting WAL replay to early...). So I am very happy that you are looking at it. Hmm. In XLogSaveBufferForHint(): * Note that this only works for buffers that fit the standard page model, * i.e. those for which buffer_std == true The free-space-map uses non-standard pages, and MarkBufferDirtyHint(). Isn't that completely broken for the FSM? If I'm reading it correctly, what will happen is that replay will completely zero out all FSM pages that have been touched. All the FSM data is between pd_lower and pd_upper, which on standard pages is the hole. Jeff Davis has a patch pending (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std flag down to MarkBufferDirtyHint() for exactly that reason. I thought we were on track committing that, but rereading the thread it doesn't look that way. Jeff, care to update that patch? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 6/12/13 3:19 AM, Fabien COELHO wrote: If you are still worried: if you run the very same command without throttling and measure the same latency, does the same thing happen at the end? My guess is that it should be yes. If it is no, I'll try out pgbench-tools. It looks like it happens rarely for one client without the rate limit, but that increases to every time for multiple clients with limiting in place. pgbench-tools just graphs the output from the latency log. Here's a setup that runs the test I'm doing:

$ createdb pgbench
$ pgbench -i -s 10 pgbench
$ pgbench -S -c 25 -T 30 -l pgbench
$ tail -n 40 pgbench_log*

Sometimes there are no slow entries, but I've seen this once so far:

0 21822 1801 0 1371217462 945264
1 21483 1796 0 1371217462 945300
8 20891 1931 0 1371217462 945335
14 20520 2084 0 1371217462 945374
15 20517 1991 0 1371217462 945410
16 20393 1928 0 1371217462 945444
17 20183 2000 0 1371217462 945479
18 20277 2209 0 1371217462 945514
23 20316 2114 0 1371217462 945549
22 20267 250128 0 1371217463 193656

The third column is the latency for that transaction. Notice how it's a steady ~2000 us except for the very last transaction, which takes 250,128 us. That's the weird thing; these short SELECT statements should never take that long. It suggests there's something weird happening with how the client exits, probably that its latency number is being computed after more work than it should. Here's what a rate limited run looks like for me. 
Note that I'm still using the version I re-submitted since that's where I ran into this issue; I haven't merged your changes to split the rate among each client here--which means this is 400 TPS per client == 10,000 TPS total:

$ pgbench -S -c 25 -T 30 -R 400 -l pgbench
$ tail -n 40 pgbench_log

7 12049 2070 0 1371217859 195994
22 12064 2228 0 1371217859 196115
18 11957 1570 0 1371217859 196243
23 12130 989 0 1371217859 196374
8 11922 1598 0 1371217859 196646
11 12229 4833 0 1371217859 196702
21 11981 1943 0 1371217859 196754
20 11930 1026 0 1371217859 196799
14 11990 13119 0 1371217859 208014
^^^ fast section vvv delayed section
1 11982 91926 0 1371217859 287862
2 12033 116601 0 1371217859 308644
6 12195 115957 0 1371217859 308735
17 12130 114375 0 1371217859 308776
0 12026 115507 0 1371217859 308822
3 11948 118228 0 1371217859 308859
4 12061 113484 0 1371217859 308897
5 12110 113586 0 1371217859 308933
9 12032 117744 0 1371217859 308969
10 12045 114626 0 1371217859 308989
12 11953 113372 0 1371217859 309030
13 11883 114405 0 1371217859 309066
15 12018 116069 0 1371217859 309101
16 11890 115727 0 1371217859 309137
19 12140 114006 0 1371217859 309177
24 11884 115782 0 1371217859 309212

There's almost 90,000 usec of latency showing up between epoch time 1371217859.208014 and 1371217859.287862 here. What's weird about it is that the longer the test runs, the larger the gap is. If collecting the latency data itself caused the problem, that would make sense, so maybe this is related to flushing that out to disk. If you want to look just at the latency numbers without the other columns in the way you can use:

cat pgbench_log.* | awk {'print $3'}

That is how I was evaluating the smoothness of the rate limit, by graphing those latency values. pgbench-tools takes those and a derived TPS/s number and plots them, which made it easier for me to spot this weirdness. 
But I've already moved onto analyzing the raw latency data instead, I can see the issue without the graph once I've duplicated the conditions. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
* Andrew Dunstan (and...@dunslane.net) wrote: I'd rather go the other way, processing the records without having to process them otherwise at all. Turning things into text must slow things down, surely. That's certainly an interesting idea also.. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] MD5 aggregate
On 14 June 2013 14:14, Tom Lane t...@sss.pgh.pa.us wrote: Marko Kreen mark...@gmail.com writes: On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com wrote: Attached is a patch implementing a new aggregate function md5_agg() to compute the aggregate MD5 sum across a number of rows. It's more efficient to calculate per-row md5, and then sum() them. This avoids the need for ORDER BY. Good point. The aggregate md5 function also fails to distinguish the case where we have 'xyzzy' followed by 'xyz' in two adjacent rows from the case where they contain 'xyz' followed by 'zyxyz'. Well, if you aggregated foo.*::text as in my original example, then the textual representation of the row would protect you from that. But yes, if you were just doing it with a single text column that might be a risk. Now, as against that, you lose any sensitivity to the ordering of the values. Personally I'd be a bit inclined to xor the per-row md5's rather than sum them, but that's a small matter. But this would be a much riskier thing to do with a single column, because if you updated multiple rows in the same way (e.g., UPDATE t SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if there were an even number of matches. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add visibility map information to pg_freespace.
Kyotaro HORIGUCHI wrote: Hello, I've added visibility map information to pg_freespace for my utility. This makes sense to me. I only lament the fact that this makes the module a misnomer. Do we want to 1) rename the module (how inconvenient), 2) create a separate module for this (surely not warranted), or 3) accept it and move on? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Buildfarm client 4.11 released
Version 4.11 of the PostgreSQL Buildfarm client has been released. It can be downloaded from http://www.pgbuildfarm.org/downloads/releases/build-farm-4_11.tgz

Changes since 4.10:

* Turn down module cleanup verbosity
* Add check for rogue postmasters.
* Add pseudo-branch targets HEAD_PLUS_LATEST and HEAD_PLUS_LATEST2.
* Use Digest::SHA instead of Digest::SHA1.
* Make directory handling more robust in git code.
* Move web transaction into a module procedure.
* Switch to using the porcelain format of git status.
* Provide parameter for core file patterns.
* Use a command file for gdb instead of the -ex option

The web transaction and Digest::SHA changes have allowed the removal of a couple of long-standing uglinesses on the system. In almost all cases, the config parameter aux_path and the separate run_web_transaction.pl script are now redundant (the exception is older Msys systems). Enjoy cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add visibility map information to pg_freespace.
On 2013-06-14 10:22:19 -0400, Alvaro Herrera wrote: Kyotaro HORIGUCHI wrote: Helle, I've added visibility map information to pg_freespace for my utility. This makes sense to me. +1 I only lament the fact that this makes the module a misnomer. Do we want to 1) rename the module (how inconvenient), 2) create a separate module for this (surely not warranted), or 3) accept it and move on? 3). All the others seem to inflict unneccesary pain for not all that much gain. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch for fail-back without fresh backup
On Fri, Jun 14, 2013 at 2:21 PM, Tom Lane t...@sss.pgh.pa.us wrote: So it's not that we actually need to log the individual hint bit changes, it's that we need to WAL-log a full page image on the first update after a checkpoint, so as to recover from torn-page cases. Which one are we doing? Wal logging a full page image after a checkpoint wouldn't actually be enough since subsequent hint bits will dirty the page and not wal log anything creating a new torn page risk. FPI are only useful if all the subsequent updates are wal logged. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch for fail-back without fresh backup
Greg Stark st...@mit.edu writes: On Fri, Jun 14, 2013 at 2:21 PM, Tom Lane t...@sss.pgh.pa.us wrote: So it's not that we actually need to log the individual hint bit changes, it's that we need to WAL-log a full page image on the first update after a checkpoint, so as to recover from torn-page cases. Which one are we doing? Wal logging a full page image after a checkpoint wouldn't actually be enough since subsequent hint bits will dirty the page and not wal log anything creating a new torn page risk. FPI are only useful if all the subsequent updates are wal logged. No, there's no new torn page risk, because any crash recovery would replay starting from the checkpoint. You might lose the subsequently-set hint bits, but that's okay. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
Dean Rasheed dean.a.rash...@gmail.com writes: On 14 June 2013 14:14, Tom Lane t...@sss.pgh.pa.us wrote: Personally I'd be a bit inclined to xor the per-row md5's rather than sum them, but that's a small matter. But this would be a much riskier thing to do with a single column, because if you updated multiple rows in the same way (e.g., UPDATE t SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if there were an even number of matches. I was implicitly thinking that the sum would be a modulo sum so that the final result is still the size of an md5 signature. If that's true, then leaking bits via carry out is just as bad as xor's deficiencies. Now, you could certainly make it a non-modulo sum and not lose any information to carries, if you're willing to do the arithmetic in NUMERIC and have a variable-width result. Sounds a bit slow though. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
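Dean's cancellation example and Tom's modulo sum can both be shown in a few lines of Python (a sketch only, treating each per-row md5 as a 128-bit integer):

```python
import hashlib
from functools import reduce
from operator import xor

def h(row: str) -> int:
    return int(hashlib.md5(row.encode()).hexdigest(), 16)

def xor_agg(rows):
    return reduce(xor, map(h, rows), 0)

def sum_agg(rows):
    return sum(map(h, rows)) % (1 << 128)   # modulo sum, fixed-width result

before = ["bar", "bar", "baz"]
after = ["foo", "foo", "baz"]   # UPDATE t SET x='foo' WHERE x='bar'

# the pair of identical rows cancels under xor, hiding the update...
assert xor_agg(before) == xor_agg(after) == h("baz")
# ...while the modulo sum still sees the change
assert sum_agg(before) != sum_agg(after)
```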
Re: [HACKERS] MD5 aggregate
On 14 June 2013 15:19, Stephen Frost sfr...@snowman.net wrote: * Andrew Dunstan (and...@dunslane.net) wrote: I'd rather go the other way, processing the records without having to process them otherwise at all. Turning things into text must slow things down, surely. That's certainly an interesting idea also.. md5_agg(record) ? Yes, I like it. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 2013-06-14 15:49:31 +0100, Dean Rasheed wrote: On 14 June 2013 15:19, Stephen Frost sfr...@snowman.net wrote: * Andrew Dunstan (and...@dunslane.net) wrote: I'd rather go the other way, processing the records without having to process them otherwise at all. Turning things into text must slow things down, surely. That's certainly an interesting idea also.. md5_agg(record) ? Yes, I like it. It's more complex than just memcmp()ing HeapTupleData though. At least if the Datum contains varlena columns there are so many different representations (short, long, compressed, external, external compressed) of the same data that an md5 without normalizing that wouldn't be very interesting. So you would at least need a normalizing version of toast_flatten_tuple() that also deals with short/long varlenas. But even after that, you would still need to deal with Datums that can have different representations (like short numerics, old style hstore, ...). It might be more realistic to use the binary output functions, but I am not sure whether all of those are sufficiently reproducible. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] MD5 aggregate
On 06/14/2013 04:47 PM, Tom Lane wrote: Dean Rasheed dean.a.rash...@gmail.com writes: On 14 June 2013 14:14, Tom Lane t...@sss.pgh.pa.us wrote: Personally I'd be a bit inclined to xor the per-row md5's rather than sum them, but that's a small matter. But this would be a much riskier thing to do with a single column, because if you updated multiple rows in the same way (e.g., UPDATE t SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if there were an even number of matches. I was implicitly thinking that the sum would be a modulo sum so that the final result is still the size of an md5 signature. If that's true, then leaking bits via carry out is just as bad as xor's deficiencies. Now, you could certainly make it a non-modulo sum and not lose any information to carries, if you're willing to do the arithmetic in NUMERIC and have a variable-width result. Sounds a bit slow though. What skytools/pgq/londiste uses for comparing tables on master and slave is query like this select sum(hashtext(t.*::text)) from yourtable t; This is non-modulo sum and does not use md5 but relies on whatever the hashtext() du jour is :) So it is not comparable to anything external (like the md5sum compatible idea above) but is usually good enough for fast checks of compatible tables. As tables are unordered by definition anyway, this should be good enough for most SQL. The speed comes from both fast(er) hashtext() function and avoiding the sort. -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)
On Thu, 2013-06-13 at 20:09 -0400, Tom Lane wrote: What I propose we do about this is reduce backend/storage/page/checksum.c to something like #include postgres.h #include storage/checksum.h #include storage/checksum_impl.h moving all the code currently in the file into a new .h file. Then, any external programs such as pg_filedump can use the checksum code by including checksum_impl.h. This is essentially the same thing we did with the CRC support functionality some time ago. Thank you for taking care of that. After seeing that it needed to be in a header file, I was going to try doing it all as macros. I have a question about the commit though: shouldn't both functions be static if they are in a .h file? Otherwise, it could lead to naming conflicts. I suppose it's wrong to include the implementation file twice, but it still might be confusing if someone tries. Two ideas that come to mind are: * make both static and then have a trivial wrapper in checksum.c * export one or both functions, but use #ifndef CHECKSUM_IMPL_H to prevent redefinition Also, we have the cut-point between checksum.c and bufpage.c at the wrong place. IMO we should move PageCalcChecksum16 in toto into checksum.c (or really now into checksum_impl.h), because that and not just checksum_block() is the functionality that is wanted. Agreed. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)
Jeff Davis pg...@j-davis.com writes: I have a question about the commit though: shouldn't both functions be static if they are in a .h file? Otherwise, it could lead to naming conflicts. I suppose it's wrong to include the implementation file twice, but it still might be confusing if someone tries. Two ideas that come to mind are: * make both static and then have a trivial wrapper in checksum.c * export one or both functions, but use #ifndef CHECKSUM_IMPL_H to prevent redefinition Ah, you are right, I forgot the #ifndef CHECKSUM_IMPL_H dance. Will fix in a bit. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] refresh materialized view concurrently
Attached is a patch for REFRESH MATERIALIZED VIEW CONCURRENTLY for 9.4 CF1. The goal of this patch is to allow a refresh without interfering with concurrent reads, using transactional semantics. It is my hope to get this committed during this CF to allow me to focus on incremental maintenance for the rest of the release cycle. I didn't need to touch very much outside of matview-specific files for this. My biggest concern is that I needed two small functions which did *exactly* what some static functions in ri_triggers.c were doing and couldn't see where the best place to share them from was. For the moment I just duplicated them, but my hope would be that they could be put in a suitable location and called from both places, rather than duplicating the 30-some lines of code. The function signatures are: void quoteOneName(char *buffer, const char *name) void quoteRelationName(char *buffer, Relation rel) Comments in the patch describe the technique used for the transactional refresh, but I'm not sure how easy it is to understand the technique from the comments. Here is a demonstration of the basic technique, using a table to mock the materialized view so it can be run directly. --- -- -- Setup -- drop table if exists n, nt, nd cascade; drop table if exists nm; create table n (id int not null primary key, val text); insert into n values (1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'five'), (6, null), (7, null), (8, null), (9, null); -- We use a table to mock this materialized view definition: -- create materialized view nm as select * from n; create table nm as select * from n; insert into n values (10, 'ten'), (11, null); update n set val = 'zwei' where id = 2; update n set val = null where id = 3; update n set id = 44, val = 'forty-four' where id = 4; update n set val = 'seven' where id = 7; delete from n where id = 5; delete from n where id = 8; vacuum analyze; -- -- Sample of internal processing for REFRESH MV CONCURRENTLY. 
-- begin; create temp table nt as select * from n; analyze nt; create temp table nd as SELECT x.ctid as tid, y FROM nm x FULL JOIN n y ON (y.id OPERATOR(pg_catalog.=) x.id) WHERE (y.*) IS DISTINCT FROM (x.*) ORDER BY tid; analyze nd; delete from nm where ctid in (select tid from nd where tid is not null and y is not distinct from null); update nm x set id = (d.y).id, val = (d.y).val from nd d where d.tid is not null and x.ctid = d.tid; insert into nm select (y).* from nd where tid is null; commit; -- -- Check that results match. -- select * from n order by id; select * from nm order by id; --- I also tried a million-row materialized view with the patch to see what the performance was like on a large table with just a few changes. I was surprised that a small change-set like this was actually faster than replacing the heap, at least on my machine. Obviously, when a larger number of rows are affected the transactional CONCURRENTLY option will be slower; this is not intended in any way as a performance-enhancing feature, that was just a happy surprise in testing. --- -- drop from previous test drop table if exists testv cascade; -- create and populate permanent table create table testv (id int primary key, val text); insert into testv select n, cash_words((floor(random() * 1) / 100)::text::money) from (select generate_series(1, 200, 2)) s(n); update testv set val = NULL where id = 547345; create materialized view matv as select * from testv; create unique index matv_id on matv (id); vacuum analyze matv; delete from testv where id = 16405; insert into testv values (393466, cash_words((floor(random() * 1) / 100)::text::money)); update testv set val = cash_words((floor(random() * 1) / 100)::text::money) where id = 1947141; refresh materialized view concurrently matv; --- People may be surprised to see this using SPI even more than ri_triggers.c does. I think this is the safest and most maintainable approach, although I welcome alternative suggestions.
-- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company*** a/doc/src/sgml/mvcc.sgml --- b/doc/src/sgml/mvcc.sgml *** *** 928,935 ERROR: could not serialize access due to read/write dependencies among transact /para para ! This lock mode is not automatically acquired on tables by any ! productnamePostgreSQL/productname command. /para /listitem /varlistentry --- 928,934 /para para ! Acquired by commandREFRESH MATERIALIZED VIEW CONCURRENTLY/command. /para /listitem /varlistentry ***
Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)
On 2013-06-14 11:59:04 -0400, Tom Lane wrote: Jeff Davis pg...@j-davis.com writes: I have a question about the commit though: shouldn't both functions be static if they are in a .h file? Otherwise, it could lead to naming conflicts. I suppose it's wrong to include the implementation file twice, but it still might be confusing if someone tries. Two ideas that come to mind are: * make both static and then have a trivial wrapper in checksum.c * export one or both functions, but use #ifndef CHECKSUM_IMPL_H to prevent redefinition Ah, you are right, I forgot the #ifndef CHECKSUM_IMPL_H dance. Will fix in a bit. That won't help against errors if it's included in two different files/translation units though. I don't really see a valid case where it could validly be included multiple times in one TU? If anything we should #error in that case, but I am not sure it's worth bothering. E.g. in rmgrlist.h we have the following comment: /* there is deliberately not an #ifndef RMGRLIST_H here */ and I think the reasoning behind that comment applies here as well. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Patch for fail-back without fresh backup
On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote: Jeff Davis has a patch pending (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std flag down to MarkBufferDirtyHint() for exactly that reason. I thought we were on track committing that, but rereading the thread it doesn't look that way. Jeff, care to update that patch? Rebased and attached. Changed so all callers use buffer_std=true except those in freespace.c and fsmpage.c. Simon, did you (or anyone else) have an objection to this patch? If not, I'll go ahead and commit it tomorrow morning. Regards, Jeff Davis *** a/src/backend/access/hash/hash.c --- b/src/backend/access/hash/hash.c *** *** 287,293 hashgettuple(PG_FUNCTION_ARGS) /* * Since this can be redone later if needed, mark as a hint. */ ! MarkBufferDirtyHint(buf); } /* --- 287,293 /* * Since this can be redone later if needed, mark as a hint. */ ! MarkBufferDirtyHint(buf, true); } /* *** a/src/backend/access/heap/pruneheap.c --- b/src/backend/access/heap/pruneheap.c *** *** 262,268 heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin, { ((PageHeader) page)-pd_prune_xid = prstate.new_prune_xid; PageClearFull(page); ! MarkBufferDirtyHint(buffer); } } --- 262,268 { ((PageHeader) page)-pd_prune_xid = prstate.new_prune_xid; PageClearFull(page); ! MarkBufferDirtyHint(buffer, true); } } *** a/src/backend/access/nbtree/nbtinsert.c --- b/src/backend/access/nbtree/nbtinsert.c *** *** 413,421 _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel, * crucial. Be sure to mark the proper buffer dirty. */ if (nbuf != InvalidBuffer) ! MarkBufferDirtyHint(nbuf); else ! MarkBufferDirtyHint(buf); } } } --- 413,421 * crucial. Be sure to mark the proper buffer dirty. */ if (nbuf != InvalidBuffer) ! MarkBufferDirtyHint(nbuf, true); else ! 
MarkBufferDirtyHint(buf, true); } } } *** a/src/backend/access/nbtree/nbtree.c --- b/src/backend/access/nbtree/nbtree.c *** *** 1052,1058 restart: opaque-btpo_cycleid == vstate-cycleid) { opaque-btpo_cycleid = 0; ! MarkBufferDirtyHint(buf); } } --- 1052,1058 opaque-btpo_cycleid == vstate-cycleid) { opaque-btpo_cycleid = 0; ! MarkBufferDirtyHint(buf, true); } } *** a/src/backend/access/nbtree/nbtutils.c --- b/src/backend/access/nbtree/nbtutils.c *** *** 1789,1795 _bt_killitems(IndexScanDesc scan, bool haveLock) if (killedsomething) { opaque-btpo_flags |= BTP_HAS_GARBAGE; ! MarkBufferDirtyHint(so-currPos.buf); } if (!haveLock) --- 1789,1795 if (killedsomething) { opaque-btpo_flags |= BTP_HAS_GARBAGE; ! MarkBufferDirtyHint(so-currPos.buf, true); } if (!haveLock) *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 7681,7692 XLogRestorePoint(const char *rpName) * records. In that case, multiple copies of the same block would be recorded * in separate WAL records by different backends, though that is still OK from * a correctness perspective. - * - * Note that this only works for buffers that fit the standard page model, - * i.e. those for which buffer_std == true */ XLogRecPtr ! XLogSaveBufferForHint(Buffer buffer) { XLogRecPtr recptr = InvalidXLogRecPtr; XLogRecPtr lsn; --- 7681,7689 * records. In that case, multiple copies of the same block would be recorded * in separate WAL records by different backends, though that is still OK from * a correctness perspective. */ XLogRecPtr ! XLogSaveBufferForHint(Buffer buffer, bool buffer_std) { XLogRecPtr recptr = InvalidXLogRecPtr; XLogRecPtr lsn; *** *** 7708,7714 XLogSaveBufferForHint(Buffer buffer) * and reset rdata for any actual WAL record insert. */ rdata[0].buffer = buffer; ! rdata[0].buffer_std = true; /* * Check buffer while not holding an exclusive lock. --- 7705,7711 * and reset rdata for any actual WAL record insert. */ rdata[0].buffer = buffer; ! 
rdata[0].buffer_std = buffer_std; /* * Check buffer while not holding an exclusive lock. *** a/src/backend/commands/sequence.c --- b/src/backend/commands/sequence.c *** *** 1118,1124 read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple) HeapTupleHeaderSetXmax(seqtuple-t_data, InvalidTransactionId); seqtuple-t_data-t_infomask = ~HEAP_XMAX_COMMITTED; seqtuple-t_data-t_infomask |= HEAP_XMAX_INVALID; ! MarkBufferDirtyHint(*buf); } seq = (Form_pg_sequence) GETSTRUCT(seqtuple); --- 1118,1124 HeapTupleHeaderSetXmax(seqtuple-t_data,
Re: [HACKERS] Patch for fail-back without fresh backup
On 2013-06-14 09:21:12 -0700, Jeff Davis wrote: On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote: Jeff Davis has a patch pending (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std flag down to MarkBufferDirtyHint() for exactly that reason. I thought we were on track committing that, but rereading the thread it doesn't look that way. Jeff, care to update that patch? Rebased and attached. Changed so all callers use buffer_std=true except those in freespace.c and fsmpage.c. Simon, did you (or anyone else) have an objection to this patch? If not, I'll go ahead and commit it tomorrow morning. I'd like to see a comment around the memcpys in XLogSaveBufferForHint() that mentions that they are safe in a non std buffer due to XLogCheckBuffer setting an appropriate hole/offset. Or make an explicit change of the copy algorithm there. Btw, if you touch that code, I'd vote for renaming XLOG_HINT to XLOG_FPI or something like that. I find the former name confusing... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pg_restore -l with a directory archive
Hi, When I ran pg_restore -l with the directory archive input, I found that its format is wrongly reported as UNKNOWN. $ pg_dump -F d -f hoge $ pg_restore -l hoge ; ; Archive created at Sat Jun 15 01:38:14 2013 ; dbname: postgres ; TOC Entries: 9 ; Compression: -1 ; Dump Version: 1.12-0 ; Format: UNKNOWN ; Integer: 4 bytes ; Offset: 8 bytes ; Dumped from database version: 9.3beta1 ; Dumped by pg_dump version: 9.3beta1 ; ; ; Selected TOC Entries: In this case, the format should be reported as DIRECTORY. The attached patch fixes this problem. Regards, -- Fujii Masao pg_restore_tocsummary_v1.patch Description: Binary data
Re: [HACKERS] MD5 aggregate
On 14 June 2013 16:09, Hannu Krosing ha...@2ndquadrant.com wrote: What skytools/pgq/londiste uses for comparing tables on master and slave is query like this select sum(hashtext(t.*::text)) from yourtable t; This is non-modulo sum and does not use md5 but relies on whatever the hashtext() du jour is :) So it is not comparable to anything external (like the md5sum compatible idea above) but is usually good enough for fast checks of compatible tables. As tables are unordered by definition anyway, this should be good enough for most SQL. The speed comes from both fast(er) hashtext() function and avoiding the sort. That sounds like a pretty good approach. We could do that if we had a version of md5() that returned numeric. My impression is that numeric computations are pretty fast compared to the sorting overhead. On the other hand, if there is a usable index, select md5_agg(..) from (sub-query) will do an index scan rather than a sort, making it much faster than using an ORDER BY in the aggregate. Regards, Dean
Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)
Andres Freund and...@2ndquadrant.com writes: On 2013-06-14 11:59:04 -0400, Tom Lane wrote: Ah, you are right, I forgot the #ifndef CHECKSUM_IMPL_H dance. Will fix in a bit. That won't help against errors if it's included in two different files/translation units though. Good point, but there's not any real reason to do that --- only checksum.h should ever be #include'd in more than one file. Any program using this stuff is expected to #include checksum_impl.h in exactly one place. So maybe it's fine as-is. E.g. in rmgrlist.h we have the following comment: /* there is deliberately not an #ifndef RMGRLIST_H here */ and I think the reasoning behind that comment applies here as well. Well, that's a different case: there, and also in kwlist.h, there's an idea that it could actually be useful to #include the file more than once, redefining the PG_RMGR() macro each time. There's no such use case that I can see for checksum_impl.h. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote: Ack. I've revised the patch to always have the GUC (for now), default to false, and if configure can't find posix_fallocate (or the user disables it by way of pg_config_manual.h) then it remains a GUC that simply can't be changed. Why have a GUC here at all? Perhaps this was already discussed, and I missed it? Is it just for testing purposes, or did you intend for it to be in the final version? If it's supported, it seems like we always want it. I doubt there are cases where it hurts performance; but if there are, it's pretty hard for a DBA to know what those cases are, anyway. Also: * The other code assumes that no errno means ENOSPC. We should be consistent about that assumption, and do the same thing in the fallocate case. * You check for the presence of posix_fallocate at configure time, but don't #ifdef the call site. It looks like you removed this from the v2 patch, was there a reason for that? Won't that cause build errors for platforms without it? I started looking at this patch and it looks like we are getting a consensus that it's the right approach. Microbenchmarks appear to show a benefit, and (thanks to Noah's comment) it seems like the change is safe. Are there any remaining questions or objections? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] request a new feature in fuzzystrmatch
On Tue, Jun 11, 2013 at 3:19 PM, Liming Hu dawnin...@gmail.com wrote: On Tue, Jun 11, 2013 at 2:56 PM, Joe Conway m...@joeconway.com wrote: On 06/11/2013 02:23 PM, Liming Hu wrote: On Tue, Jun 11, 2013 at 1:57 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Liming Hu escribió: I have implemented the code according to Joe's suggestion, and put the code at: https://github.com/liminghu/fuzzystrmatch/tree/fuzzystrmatchv1.1 Please submit a proper patch so it can be seen on our mailing list archives. Hi Alvaro, I am kind of new to the Postgresql hacker community, Can you please help me on submitting the patch? Hi Liming, I might be able to help, but it will be at least a couple of days before I have the time to look at this, Joe ok, thanks, I will wait. Hi Joe, Do you have some time in the weekend to help me submit the patch? Thanks, Liming -- Joe Conway credativ LLC: http://www.credativ.us Linux, PostgreSQL, and general Open Source Training, Service, Consulting, 24x7 Support -- Liming Hu cell: (435)-512-4190 Seattle Washington
Re: [HACKERS] request a new feature in fuzzystrmatch
On Fri, Jun 14, 2013 at 10:08:24AM -0700, Liming Hu wrote: On Tue, Jun 11, 2013 at 3:19 PM, Liming Hu dawnin...@gmail.com wrote: On Tue, Jun 11, 2013 at 2:56 PM, Joe Conway m...@joeconway.com wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 06/11/2013 02:23 PM, Liming Hu wrote: On Tue, Jun 11, 2013 at 1:57 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Liming Hu escribió: I have implemented the code according to Joe's suggestion, and put the code at: https://github.com/liminghu/fuzzystrmatch/tree/fuzzystrmatchv1.1 Please submit a proper patch so it can be seen on our mailing list archives. Hi Alvaro, I am kind of new to the Postgresql hacker community, Can you please help me on submit the patch? Hi Liming, I might be able to help, but it will be at least a couple of days before I have the time to look at this, Joe ok, thanks, I will wait. Hi Joe, Do you have some time in the weekend to help me submit the patch? Thanks, Liming Liming, Is your git skill good enough to create a patch vs. PostgreSQL's git master? If so, send that and once it's hit the mailing list, record same on commitfest.postgresql.org in the current open commitfest. If not, let us know where in that process you got stuck. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] request a new feature in fuzzystrmatch
On 06/14/2013 10:11 AM, David Fetter wrote: ok, thanks, I will wait. Hi Joe, Do you have some time in the weekend to help me submit the patch? Thanks, Liming Liming, Is your git skill good enough to create a patch vs. PostgreSQL's git master? If so, send that and once it's hit the mailing list, record same on commitfest.postgresql.org in the current open commitfest. If not, let us know where in that process you got stuck. Cheers, David. This sounds like a wiki page FAQ in the making. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] request a new feature in fuzzystrmatch
On Fri, Jun 14, 2013 at 10:14:14AM -0700, Joshua D. Drake wrote: On 06/14/2013 10:11 AM, David Fetter wrote: ok, thanks, I will wait. Hi Joe, Do you have some time in the weekend to help me submit the patch? Thanks, Liming Liming, Is your git skill good enough to create a patch vs. PostgreSQL's git master? If so, send that and once it's hit the mailing list, record same on commitfest.postgresql.org in the current open commitfest. If not, let us know where in that process you got stuck. This sounds like a wiki page FAQ in the making. With utmost respect, this sounds like several pages should be consolidated into one and clarified. Yet another page will just make matters more confusing. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On 6/14/13 1:06 PM, Jeff Davis wrote: Why have a GUC here at all? Perhaps this was already discussed, and I missed it? Is it just for testing purposes, or did you intend for it to be in the final version? You have guessed correctly! I suggested it stay in there only to make review benchmarking easier. I started looking at this patch and it looks like we are getting a consensus that it's the right approach. Microbenchmarks appear to show a benefit, and (thanks to Noah's comment) it seems like the change is safe. Are there any remaining questions or objections? I'm planning to duplicate Jon's test program on a few machines here, and then see if that turns into a useful latency improvement for clients. I'm trying to get this pgbench rate limit stuff working first though, because one of the tests I had in mind for WAL creation overhead would benefit from it. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Tue, 2013-06-11 at 12:58 -0400, Stephen Frost wrote: My main question is really- would this be useful for extending *relations*? Apologies if it's already been discussed; I do plan to go back and read the threads about this more fully, but I wanted to voice my support for using posix_fallocate, when available, in general. +1, though separate from this patch. Andres also pointed out that we can try to track a point in the file that is below any place where a zero page might still exist. That will allow us to call zero pages invalid unless they are related to a recent extension, which is a weakness in the current checksums code. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] another error perhaps to be enhanced
ERROR: index foo_idx We should probably add the schema. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
Re: [HACKERS] Add visibility map information to pg_freespace.
On Fri, Jun 14, 2013 at 7:23 AM, Andres Freund and...@2ndquadrant.com wrote: 3). All the others seem to inflict unnecessary pain for not all that much gain. +1. You might want to add a historical note about the name to the pg_freespace documentation, though. -- Peter Geoghegan
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
I don't have this resolved yet, but I think I've identified the cause. Updating here mainly so Fabien doesn't duplicate my work trying to track this down. I'm going to keep banging at this until it's resolved now that I got this far. Here's a slow transaction: 1371226017.568515 client 1 executing \set naccounts 10 * :scale 1371226017.568537 client 1 throttling 6191 us 1371226017.747858 client 1 executing \setrandom aid 1 :naccounts 1371226017.747872 client 1 sending SELECT abalance FROM pgbench_accounts WHERE aid = 268721; 1371226017.789816 client 1 receiving That confirms it is getting stuck at the throttling step. Looks like the code pauses there because it's trying to overload the sleeping state that was already in pgbench, but handle it in a special way inside of doCustom(), and that doesn't always work. The problem is that pgbench doesn't always stay inside doCustom when a client sleeps. It exits there to poll for incoming messages from the other clients, via select() on a shared socket. It's not safe to assume doCustom will be running regularly; that's only true if clients keep returning messages. So as long as other clients keep banging on the shared socket, doCustom is called regularly, and everything works as expected. But at the end of the test run that happens less often, and that's when the problem shows up. pgbench already has a \sleep command, and the way that delay is handled happens inside threadRun() instead. The pausing of the rate limit throttle needs to operate in the same place. I have to redo a few things to confirm this actually fixes the issue, as well as look at Fabien's later updates to this since I wandered off debugging. I'm sure it's in the area of code I'm poking at now though. 
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] another error perhaps to be enhanced
On Jun 14, 2013, at 13:38, Joshua D. Drake j...@commandprompt.com wrote: ERROR: index foo_idx We should probably add the schema. I've noticed similar issues with functions. I'd like to see those schema-qualified as well. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
Re: [HACKERS] another error perhaps to be enhanced
I think you'll need to better describe what you mean here. -- Peter Geoghegan
Re: [HACKERS] another error perhaps to be enhanced
On 06/14/2013 10:47 AM, Peter Geoghegan wrote: I think you'll need to better describe what you mean here.

postgres=# create schema foo;
CREATE SCHEMA
postgres=# create schema bar;
CREATE SCHEMA
postgres=# create table foo.foo(id serial);
NOTICE: CREATE TABLE will create implicit sequence "foo_id_seq" for serial column "foo.id"
CREATE TABLE
postgres=# create table bar.bar(id serial);
NOTICE: CREATE TABLE will create implicit sequence "bar_id_seq" for serial column "bar.id"
CREATE TABLE
postgres=# create index one_idx on foo.foo(id);
CREATE INDEX
postgres=# create index one_idx on bar.bar(id);
CREATE INDEX

Now, with the error previously shown, which one_idx needs to be reindexed? JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
Re: [HACKERS] another error perhaps to be enhanced
On Fri, Jun 14, 2013 at 10:54 AM, Joshua D. Drake j...@commandprompt.com wrote: Now, with the error previously shown, which one_idx needs to be reindexed? Well, you didn't show an actual error message. But if you \set VERBOSITY verbose within psql while connected to a 9.3 server, you'll get fully qualified details of the constraint blamed for the error, if any. Example:

postgres=# insert into a(a, b) values (3, 'test');
ERROR: 23505: duplicate key value violates unique constraint "a_pkey"
DETAIL: Key (a)=(3) already exists.
SCHEMA NAME: public
TABLE NAME: a
CONSTRAINT NAME: a_pkey
LOCATION: _bt_check_unique, nbtinsert.c:398

-- Peter Geoghegan
Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)
On Fri, 2013-06-14 at 13:21 -0400, Greg Smith wrote: I'm planning to duplicate Jon's test program on a few machines here, and then see if that turns into a useful latency improvement for clients. I'm trying to get this pgbench rate limit stuff working first though, because one of the tests I had in mind for WAL creation overhead would benefit from it. Great, I'll wait on those results. Regards, Jeff Davis
Re: [HACKERS] another error perhaps to be enhanced
On 06/14/2013 11:01 AM, Peter Geoghegan wrote: On Fri, Jun 14, 2013 at 10:54 AM, Joshua D. Drake j...@commandprompt.com wrote: Now, with the error previously shown, which one_idx needs to be reindexed? Well, you didn't show an actual error message.

ERROR: index foo_idx

Is not an error message? Granted I didn't show the whole error message, but my point is, it should ALWAYS be fully qualified.

But if you \set VERBOSITY verbose within psql while connected to a 9.3 server, you'll get fully qualified details of the constraint blamed for the error, if any. Example:

postgres=# insert into a(a, b) values (3, 'test');
ERROR: 23505: duplicate key value violates unique constraint "a_pkey"
DETAIL: Key (a)=(3) already exists.
SCHEMA NAME: public
TABLE NAME: a
CONSTRAINT NAME: a_pkey
LOCATION: _bt_check_unique, nbtinsert.c:398

I was looking in the logs. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
pgbench already has a \sleep command, and the way that delay is handled happens inside threadRun() instead. The pausing of the rate limit throttle needs to operate in the same place. It does operate at the same place. The throttling is performed by inserting a sleep first thing when starting a new transaction. So if there is a weirdness, it should show as well without throttling but with a fixed \sleep instead? -- Fabien.
Re: [HACKERS] Hard to Use WAS: Hard limit on WAL space
On 06/12/2013 02:03 PM, Joshua D. Drake wrote: What concerns me is we seem to be trying to make this easy. It isn't supposed to be easy. This is hard stuff. Smart people built it and it takes a smart person to run it. When did it become a bad thing to be something that smart people need to run? 1997, last I checked. Our unofficial motto: PostgreSQL: making very hard things possible, and simple things hard. It *is* hard. But that's because we've *made* it hard to understand and manage, not because the problem is inherently hard. For example: can you explain to me in 10 words or less how to monitor to see if archiving is falling behind? I'll bet you can't, and that's because we've provided no reliable way to do so. It's normal when you're developing features for the ability to utilize them to go from hacker -> high-end user -> regular user. We suck at moving to that last stage, partly because whenever someone on this list introduces the idea of making a feature not just great but easy to use, people actually object to the idea that anything should be easy to use. It's like we're afraid of being polluted by the unwashed DevOps masses. In the meantime, Mongo kicks our butts at new user adoption. Why? Their features suck, but the features they do have are easy to use. You'd think we would have learned something from MySQL. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Hard to Use WAS: Hard limit on WAL space
On 06/14/2013 11:16 AM, Josh Berkus wrote: On 06/12/2013 02:03 PM, Joshua D. Drake wrote: What concerns me is we seem to be trying to make this easy. It isn't supposed to be easy. This is hard stuff. Smart people built it and it takes a smart person to run it. When did it become a bad thing to be something that smart people need to run? 1997, last I checked. Our unofficial motto: PostgreSQL: making very hard things possible, and simple things hard. It *is* hard. But that's because we've *made* it hard to understand and manage, not because the problem is inherently hard. For example: can you explain to me in 10 words or less how to monitor to see if archiving is falling behind? I'll bet you can't, and that's because we've provided no reliable way to do so. Hey, I never said we shouldn't have a complete feature set. I agree with you. IMO it should not have even been committed without the ability to actually know what is going on, and we have had it since (in theory) 8.1? My primary concern is: Don't make it stupid. I liked Claudio's comment, "More than easy, it should be obvious." It should be obvious from a review of the documentation how to manage this stuff. It isn't, and worse, even if we wrote the documentation it still isn't, because the feature is not complete. With great power comes great responsibility :P JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
Re: [HACKERS] SPGist triple parity concept doesn't work
Teodor Sigaev teo...@sigaev.ru writes: Anyway I now think that we might be better off with the other idea of abandoning an insertion and retrying if we get a lock conflict. done, look at the patch. Looks good, committed with some cosmetic adjustments. We definitely need a new idea for the locking protocol and I'll return to this problem in autumn (sorry, I haven't time in summer to do this research). OK. I think the performance of this way will be okay, actually, in most cases anyhow. It'll do till we have a better idea. regards, tom lane
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Hello Greg,

I think that the weirdness really comes from the way transaction times are measured, their interactions with throttling, and latent bugs in the code. One issue is that the throttling time was included in the measure, but not the first time, because txn_begin is not set at the beginning of doCustom. Also, flag st->listen is set to 1 but *never* set back to 0...

sh> grep listen pgbench.c
	int			listen;
	if (st->listen)
		st->listen = 1;
		st->listen = 1;
		st->listen = 1;
		st->listen = 1;
		st->listen = 1;
		st->listen = 1;

ISTM that I can fix the weirdness by inserting an ugly "goto top;", but I would feel better about it by removing all gotos and reordering some actions in doCustom in a more logical way. However that would be a bigger patch.

Please find attached 2 patches:

- the first is the full throttle patch which ensures that txn_begin is taken at a consistent point, after throttling, which requires resetting listen. There is an ugly goto. I've also put times in a consistent format in the log, 789.012345 instead of 789 12345.

- the second patch just shows the diff between v10 and the first one.

-- Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 8c202bf..dc4f819 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -137,6 +137,12 @@
 int			unlogged_tables = 0;
 double		sample_rate = 0.0;
 
 /*
+ * When threads are throttled to a given rate limit, this is the target delay
+ * to reach that rate in usec. 0 is the default and means no throttling.
+ */
+int64		throttle_delay = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -205,6 +211,7 @@
 	int			nvariables;
 	instr_time	txn_begin;		/* used for measuring transaction latencies */
 	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	bool		throttled;		/* whether current transaction was throttled */
 	int			use_file;		/* index in sql_files for this client */
 	bool		prepared[MAX_FILES];
 } CState;
@@ -222,6 +229,10 @@
 	instr_time *exec_elapsed;	/* time spent executing cmds (per Command) */
 	int		   *exec_count;		/* number of cmd executions (per Command) */
 	unsigned short random_state[3];		/* separate randomness for each thread */
+	int64		throttle_trigger;	/* previous/next throttling (us) */
+	int64		throttle_lag;		/* total transaction lag behind throttling */
+	int64		throttle_lag_max;	/* max transaction lag */
+
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -230,6 +241,8 @@
 typedef struct
 {
 	instr_time	conn_time;
 	int			xacts;
+	int64		throttle_lag;
+	int64		throttle_lag_max;
 } TResult;
 
 /*
@@ -355,6 +368,8 @@ usage(void)
 		   "  -n           do not run VACUUM before tests\n"
 		   "  -N           do not update tables \"pgbench_tellers\" and \"pgbench_branches\"\n"
 		   "  -r           report average latency per command\n"
+		   "  -R SPEC, --rate SPEC\n"
+		   "               target rate in transactions per second\n"
 		   "  -s NUM       report this scale factor in output\n"
 		   "  -S           perform SELECT-only transactions\n"
 		   "  -t NUM       number of transactions each client runs (default: 10)\n"
@@ -898,17 +913,56 @@ doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile, AggVa
 {
 	PGresult   *res;
 	Command   **commands;
+	bool		do_throttle = false;
 
 top:
 	commands = sql_files[st->use_file];
 
+	/* handle throttling once per transaction by inserting a sleep.
+	 * this is simpler than doing it at the end.
+	 */
+	if (throttle_delay && !st->throttled)
+	{
+		/* compute delay to approximate a Poisson distribution
+		 * 100 => 13.8 .. 0 multiplier
+		 * 10 => 11.5 .. 0
+		 * 1 => 9.2 .. 0
+		 * 1000 => 6.9 .. 0
+		 * if transactions are too slow or a given wait shorter than
+		 * a transaction, the next transaction will start right away.
+		 */
+		int64		wait = (int64)
+			throttle_delay * -log(getrand(thread, 1, 1000)/1000.0);
+
+		thread->throttle_trigger += wait;
+
+		st->until = thread->throttle_trigger;
+		st->sleeping = 1;
+		st->throttled = true;
+		if (debug)
+			fprintf(stderr, "client %d throttling " INT64_FORMAT " us\n",
+					st->id, wait);
+	}
+
 	if (st->sleeping)
 	{							/* are we sleeping? */
 		instr_time	now;
+		int64		now_us;
 
 		INSTR_TIME_SET_CURRENT(now);
-		if (st->until <= INSTR_TIME_GET_MICROSEC(now))
+		now_us = INSTR_TIME_GET_MICROSEC(now);
+		if (st->until <= now_us)
+		{
 			st->sleeping = 0;	/* Done sleeping, go ahead with next command */
+			if (throttle_delay && st->state == 0)
+			{
+				/* measure lag of throttled transaction */
+				int64		lag = now_us - st->until;
+
+				thread->throttle_lag += lag;
+				if (lag > thread->throttle_lag_max)
+					thread->throttle_lag_max = lag;
+			}
+		}
 		else
 			return true;		/*
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
On 6/14/13 3:50 PM, Fabien COELHO wrote: I think that the weirdness really comes from the way transaction times are measured, their interactions with throttling, and latent bugs in the code.

measurement times, no; interactions with throttling, no. If it was either of those I'd have finished this off days ago. Latent bugs, possibly. We may discover there's nothing wrong with your code at the end here, that it just makes hitting this bug more likely. Unfortunately today is the day *some* bug is popping up, and I want to get it squashed before I'll be happy. The lag is actually happening during a kernel call that isn't working as expected. I'm not sure whether this bug was there all along if \sleep was used, or if it's specific to the throttle sleep.

Also, flag st->listen is set to 1 but *never* set back to 0...

I noticed that st->listen was weird too, and that's on my short list of suspicious things I haven't figured out yet. I added a bunch more logging as pgbench steps through its work to track down where it's stuck at. Until the end all transactions look like this:

1371238832.084783 client 10 throttle lag 2 us
1371238832.084783 client 10 executing \setrandom aid 1 :naccounts
1371238832.084803 client 10 sending SELECT abalance FROM pgbench_accounts WHERE aid = 753099;
1371238832.084840 calling select
1371238832.086539 client 10 receiving
1371238832.086539 client 10 finished

All clients who hit lag spikes at the end are going through this sequence instead:

1371238832.085912 client 13 throttle lag 790 us
1371238832.085912 client 13 executing \setrandom aid 1 :naccounts
1371238832.085931 client 13 sending SELECT abalance FROM pgbench_accounts WHERE aid = 564894;
1371238832.086592 client 13 receiving
1371238832.086662 calling select
1371238832.235543 client 13 receiving
1371238832.235543 client 13 finished

Note the "calling select" here that wasn't in the normal length transaction before it. The client is receiving something here, but rather than it finishing the transaction it falls through and ends up at the select() system call outside of doCustom. All of the clients that are sleeping when the system slips into one of these long select() calls are getting stuck behind it. I'm not 100% sure, but I think this only happens when all remaining clients are sleeping. Here's another one, it hits the receive that doesn't finish the transaction earlier (1371238832.086587) but then falls into the same select() call at 1371238832.086662:

1371238832.085884 client 12 throttle lag 799 us
1371238832.085884 client 12 executing \setrandom aid 1 :naccounts
1371238832.085903 client 12 sending SELECT abalance FROM pgbench_accounts WHERE aid = 299080;
1371238832.086587 client 12 receiving
1371238832.086662 calling select
1371238832.231032 client 12 receiving
1371238832.231032 client 12 finished

Investigation is still going here...

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)
I think that the weirdness really comes from the way transaction times are measured, their interactions with throttling, and latent bugs in the code.

measurement times, no; interactions with throttling, no. If it was either of those I'd have finished this off days ago. Latent bugs, possibly. We may discover there's nothing wrong with your code at the end here,

To summarize my point: I think my v10 code does not take into account all of the strangeness in doCustom, and I'm pretty sure that there is no point in including throttle sleeps in latency measures, which was more or less the case. So it is somehow a bug which only shows up if you look at the latency measures, but the tps are fine.

that it just makes hitting this bug more likely. Unfortunately today is the day *some* bug is popping up, and I want to get it squashed before I'll be happy. The lag is actually happening during a kernel call that isn't working as expected. I'm not sure whether this bug was there all along if \sleep was used, or if it's specific to the throttle sleep.

The throttle sleep is inserted out of the state machine. That is why in the test patch I added a goto to ensure that it is always taken at the right time, that is when state==0 and before txn_begin is set, and not possibly between other states when doCustom happens to be recalled after a return.

I added a bunch more logging as pgbench steps through its work to track down where it's stuck at. Until the end all transactions look like this:

1371238832.084783 client 10 throttle lag 2 us
1371238832.084783 client 10 executing \setrandom aid 1 :naccounts
1371238832.084803 client 10 sending SELECT abalance FROM pgbench_accounts WHERE aid = 753099;
1371238832.084840 calling select
1371238832.086539 client 10 receiving
1371238832.086539 client 10 finished

All clients who hit lag spikes at the end are going through this sequence instead:

1371238832.085912 client 13 throttle lag 790 us
1371238832.085912 client 13 executing \setrandom aid 1 :naccounts
1371238832.085931 client 13 sending SELECT abalance FROM pgbench_accounts WHERE aid = 564894;
1371238832.086592 client 13 receiving
1371238832.086662 calling select
1371238832.235543 client 13 receiving
1371238832.235543 client 13 finished

Note the "calling select" here that wasn't in the normal length transaction before it. The client is receiving something here, but rather than it finishing the transaction it falls through and ends up at the select() system call outside of doCustom. All of the clients that are sleeping when the system slips into one of these long select() calls are getting stuck behind it. I'm not 100% sure, but I think this only happens when all remaining clients are sleeping.

Note: in both the slow cases there is a "receiving" between "sending" and "select". This suggests that the "goto top" at the very end of doCustom is followed in one case but not the other. ISTM that there is a timeout passed to select which is computed based on the current sleeping time of each client. I'm pretty sure that is not a well-tested path... -- Fabien.
[HACKERS] [RFC] Minmax indexes
Hi, This is a preliminary proposal for Minmax indexes. I'm experimenting with the code, but it's too crude to post yet, so here's a document explaining what they are and how they work, so that reviewers can poke holes in it and the design can be improved. My intention is to have a patch to show for CF2, so please do have a look at this and comment. This is part of the AXLE project http://www.axleproject.eu and the intention is to support tables of very large size. In a quick experiment, I have a table of ~12 GB and its corresponding index is 65 kB in size, making the time to do the equivalent of a seqscan a small fraction of that taken by a real seqscan. This technique sits between a bitmap scan of a normal btree, and a seqscan: the minmax index tells the bitmap heap scan what pages to seqscan, allowing it to skip a large fraction of pages that are known not to contain tuples matching the query quals. This is a huge win for large data warehouses. Without further ado, here's what I propose.

Minmax Range Indexes
--------------------

Minmax indexes are a new access method intended to enable very fast scanning of extremely large tables. The essential idea of a minmax index is to keep track of the min() and max() values in consecutive groups of heap pages (page ranges). These values can be used by constraint exclusion to avoid scanning such pages, depending on query quals. The main drawback of this is having to update the stored min/max values of each page range as tuples are inserted into them. Other database systems already have this feature.
Some examples:

* Oracle Exadata calls this storage indexes http://richardfoote.wordpress.com/category/storage-indexes/
* Netezza has zone maps http://nztips.com/2010/11/netezza-integer-join-keys/
* Infobright has this automatically within their data packs http://www.infobright.org/Blog/Entry/organizing_data_and_more_about_rough_data_contest/
* MonetDB seems to have it http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662 Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS

Grammar
-------

To create a minmax index, we use

CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);

Partial indexes are not supported; since an index is concerned with minimum and maximum values of the involved columns across all the pages in the table, it doesn't make sense to exclude values. Another way to see partial indexes here would be those that only considered some pages in the table instead of all of them; but this would be difficult to implement and manage and, most likely, pointless. Expressional indexes can probably be supported in the future, but we disallow them initially for conceptual simplicity. Having multiple minmax indexes in the same table is acceptable, though most of the time it would make more sense to have a single index covering all the interesting columns. Multiple indexes might be useful for columns added later.

Access Method Design
--------------------

Since item pointers are not stored inside indexes of this type, it is not possible to support the amgettuple interface. Instead, we only provide amgetbitmap support; scanning a relation using this index requires a recheck node on top. The amgetbitmap routine would return a TIDBitmap comprising all the pages in those page groups that comply with the query quals; the recheck node prunes tuples that are not visible per snapshot and those that are not visible per query quals.
For each supported datatype, we need an opclass with the following catalog entries:

- support functions (pg_amproc)
  * pg_proc entry for min()
  * pg_proc entry for max()
- support operators (pg_amop): same as btree (<, <=, =, >=, >)

The min() and max() support functions are used during index construction. The support operators are used in the optimizer, so that the index is chosen when queries on the indexed table are planned. (Also, we use them in the amgetbitmap routine, to compare ScanKeys and decide whether to emit a certain block or not.)

In each index tuple (corresponding to one page range), we store:

- first block this tuple applies to
- last block this tuple applies to
- for each indexed column:
  * min() value across all tuples in the range
  * max() value across all tuples in the range
  * nulls present in any tuple?

With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length types (timestamptz, bigint), each tuple would be 524 bytes in length, which seems reasonable. Of course, larger columns are possible, such as varchar, but creating minmax indexes on such columns seems of little practical usefulness. This maximum index tuple size is calculated as:

BlockNumber (4 bytes) * 2 + data value (8 bytes) * 32 * 2 + null bitmap (4 bytes)

Block ranges mentioned in index entries shouldn't overlap. However, there can be gaps where some pages have no covering index entry. (In particular, the last few pages of the table would commonly not be summarized.) In order to scan
Re: [HACKERS] single-user vs standalone in docs and messages
On Thu, Jun 13, 2013 at 6:10 PM, Jeff Janes jeff.ja...@gmail.com wrote: Some places in the docs and elog hints refer to standalone backends, while the official name as used in app-postgres.html is single-user mode, and in fact standalone does not appear on that page. This tries to standardize the other locations to use single-user. I think I did the right thing with the message translation files, but I can't figure out how to test that. I made no attempt to change code-comments, just the user-facing parts. I think you could tell people to use single-user mode instead of a standalone backend, but telling them to use a single-user backend just seems weird. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] stray SIGALRM
In 9.3beta1, a backend will receive a SIGALRM after authentication_timeout seconds, even if authentication has been successful. Most of the time this doesn't hurt anyone, but there are cases, such as when the backend is doing the open() of a backend copy, when it breaks things and results in an error getting reported to the client. In particular, if you're doing a copy from a FIFO, it is normal for open() to block until the process at the other end has data ready, so you're very likely to have it interrupted by the SIGALRM and fail. To see the SIGALRM just run psql then determine your backend's pid, attach an strace to it, and wait 60 seconds, or whatever you've got authentication_timeout set to. This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a side-effect of speeding things up by getting rid of setitimer() calls; it's not obvious what's a good way to fix it without losing the benefits of that commit. Thanks Alvaro and Andres for helping me get from why is my copy getting these signals to understanding what's actually going on. Richard -- Richard Poole http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] logical changeset generation v5
Hi! I am rather pleased to announce the next version of the changeset extraction patchset. Thanks to help from a large number of people I think we are slowly getting to the point where it is getting committable. Since the last submitted version (20121115002746.ga7...@awork2.anarazel.de) a large number of fixes and the results of a good amount of review have been added to the tree. All bugs known to me have been fixed.

Fixes include:
* synchronous replication support
* don't peg the xmin for user tables, do it only for catalog ones
* arbitrarily large transaction support by spilling large transactions to disk
* spill snapshots to disk, so we can restart without waiting for a new snapshot to be built
* don't read all WAL from the establishment of a logical slot
* tests via SQL interface to changeset extraction

The todo list includes:
* morph the logical slot interface into being replication slots that can also be used by streaming replication
* move some more code from snapbuild.c to decode.c to remove a largely duplicated switch
* do some more header/comment cleanup and clarification
* move pg_receivellog into its own directory in src/bin or contrib/
* user/developer level documentation

The patch series currently has two interfaces to logical decoding. One - which is primarily useful for pg_regress style tests and playing around - is SQL based, the other one uses a walsender replication connection.
A quick demonstration of the SQL interface (server needs to be started with wal_level = logical and max_logical_slots > 0):

=# CREATE EXTENSION test_logical_decoding;
=# SELECT * FROM init_logical_replication('regression_slot', 'test_decoding');
    slotname     | xlog_position
-----------------+---------------
 regression_slot | 0/17D5908
(1 row)

=# CREATE TABLE foo(id serial primary key, data text);
=# INSERT INTO foo(data) VALUES(1);
=# UPDATE foo SET id = -id, data = ':'||data;
=# DELETE FROM foo;
=# DROP TABLE foo;
=# SELECT * FROM start_logical_replication('regression_slot', 'now', 'hide-xids', '0');
 location  | xid | data
-----------+-----+------
 0/17D59B8 | 695 | BEGIN
 0/17D59B8 | 695 | COMMIT
 0/17E8B58 | 696 | BEGIN
 0/17E8B58 | 696 | table foo: INSERT: id[int4]:1 data[text]:1
 0/17E8B58 | 696 | COMMIT
 0/17E8CA8 | 697 | BEGIN
 0/17E8CA8 | 697 | table foo: UPDATE: old-pkey: id[int4]:1 new-tuple: id[int4]:-1 data[text]::1
 0/17E8CA8 | 697 | COMMIT
 0/17E8E50 | 698 | BEGIN
 0/17E8E50 | 698 | table foo: DELETE: id[int4]:-1
 0/17E8E50 | 698 | COMMIT
 0/17E9058 | 699 | BEGIN
 0/17E9058 | 699 | COMMIT
(13 rows)

=# SELECT * FROM pg_stat_logical_decoding;
    slot_name    |    plugin     | database | active | xmin | restart_decoding_lsn
-----------------+---------------+----------+--------+------+----------------------
 regression_slot | test_decoding |    12042 | f      |  695 | 0/17D58D0
(1 row)

=# SELECT * FROM stop_logical_replication('regression_slot');
 stop_logical_replication
--------------------------
 0

The walsender interface has the same calls:

INIT_LOGICAL_REPLICATION 'slot' 'plugin';
START_LOGICAL_REPLICATION 'slot' restart_lsn [(option value)*];
STOP_LOGICAL_REPLICATION 'slot';

The only difference is that START_LOGICAL_REPLICATION can stream changes and it can support synchronous replication. The output seen in the 'data' column is produced by a so called 'output plugin' which users of the facility can write to suit their needs.
They can be written by implementing 5 functions in the shared object that's passed to init_logical_replication() above:

* pg_decode_init (optional)
* pg_decode_begin_txn
* pg_decode_change
* pg_decode_commit_txn
* pg_decode_cleanup (optional)

The most interesting function, pg_decode_change, gets passed a structure containing old/new versions of the row, the 'struct Relation' belonging to it and metainformation about the transaction. The output plugin can rely on syscache lookups et al. to decode the changed tuple in whatever fashion it wants. I'd like to invite reviewers to first look at:

* the output plugin interface
* the walsender/SRF interface
* patch 12 which contains most of the code

When reading the code, the information flow during decoding might be interesting: --- +---+ | XLogReader| +---+ | XLOG Records | v +---+ | decode.c | +---+ | | | | v | +---+| | snapbuild.c | HeapTupleData +---+| | | catalog snapshots | | | v v +---+ |reorderbuffer.c| +---+ | HeapTuple Metadata
[HACKERS] GIN improvements part2: fast scan
Hackers,

the attached patch implements the fast scan technique for GIN. This is the second patch of the GIN improvements; see the 1st one here:
http://www.postgresql.org/message-id/capphfduxv-il7aedwpw0w5fxrwgakfxijwm63_hzujacrxn...@mail.gmail.com

This patch allows skipping parts of posting trees when scanning them is not necessary. In particular, it solves the frequent_term & rare_term problem of FTS. It introduces a new interface method, pre_consistent, which behaves like consistent, but:
1) allows false positives on input (check[])
2) is allowed to return false positives

Some example: frequent_term & rare_term becomes pretty fast.

create table test as (select to_tsvector('english', 'bbb') as v from generate_series(1,100));
insert into test (select to_tsvector('english', 'ddd') from generate_series(1,10));
create index test_idx on test using gin (v);

postgres=# explain analyze select * from test where v @@ to_tsquery('english', 'bbb & ddd');
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=942.75..7280.63 rows=5000 width=17) (actual time=0.458..0.461 rows=10 loops=1)
   Recheck Cond: (v @@ '''bbb'' & ''ddd'''::tsquery)
   ->  Bitmap Index Scan on test_idx  (cost=0.00..941.50 rows=5000 width=0) (actual time=0.449..0.449 rows=10 loops=1)
         Index Cond: (v @@ '''bbb'' & ''ddd'''::tsquery)
 Total runtime: 0.516 ms
(5 rows)

-- With best regards, Alexander Korotkov.

gin_fast_scan.1.patch.gz Description: GNU Zip compressed data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
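The pre_consistent idea can be sketched independently of GIN itself. This is not the patch's code, just an illustration of why allowing false positives lets the scan prune: for an AND query, keys that have not been fetched yet are assumed present, so a false answer is definitive and the corresponding posting-tree range never has to be read.

```c
#include <assert.h>
#include <stdbool.h>

#define NKEYS 2
#define KEY_FREQUENT 0
#define KEY_RARE     1

/* Exact consistency for an AND query: the item matches only if every
 * key is present. */
static bool and_consistent(const bool check[NKEYS])
{
    for (int i = 0; i < NKEYS; i++)
        if (!check[i])
            return false;
    return true;
}

/* Pre-consistent variant: keys not yet looked up ('known[i]' false) are
 * treated as potential matches (a false positive on input).  A false
 * result is therefore conclusive, and the scan may skip the part of the
 * frequent term's posting tree where the rare term is known absent. */
static bool and_pre_consistent(const bool check[NKEYS],
                               const bool known[NKEYS])
{
    for (int i = 0; i < NKEYS; i++)
        if (known[i] && !check[i])
            return false;
    return true;
}
```

With `frequent_term & rare_term`, the scan walks the small rare-term list and calls the pre-consistent check to decide whether any stretch of the huge frequent-term posting tree can possibly contribute; stretches where the rare key is definitively absent are skipped wholesale.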
Re: [HACKERS] stray SIGALRM
Richard Poole rich...@2ndquadrant.com writes: In 9.3beta1, a backend will receive a SIGALRM after authentication_timeout seconds, even if authentication has been successful. Most of the time this doesn't hurt anyone, but there are cases, such as when the backend is doing the open() of a backend copy, when it breaks things and results in an error getting reported to the client. In particular, if you're doing a copy from a FIFO, it is normal for open() to block until the process at the other end has data ready, so you're very likely to have it interrupted by the SIGALRM and fail. To see the SIGALRM just run psql then determine your backend's pid, attach an strace to it, and wait 60 seconds, or whatever you've got authentication_timeout set to. This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a side-effect of speeding things up by getting rid of setitimer() calls; it's not obvious what's a good way to fix it without losing the benefits of that commit. Ugh. It doesn't sound very practical to try to guarantee that every single kernel call in the backend is set up to recover from EINTR, even though ideally they should all be able to cope. Maybe we have to revert those signal-handling changes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
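The alternative Tom alludes to, making every kernel call cope with EINTR, is the classic retry idiom shown below. This is a self-contained sketch, not backend code: the interruptible syscall is simulated with a stub so the pattern can be demonstrated without an actual signal.

```c
#include <assert.h>
#include <errno.h>

static int fake_calls;   /* how many times the fake syscall ran */
static int fail_times;   /* fail with EINTR this many times first */

/* Stand-in for open(2) on a FIFO: blocks until data is ready, so a stray
 * SIGALRM can interrupt it and make it fail with EINTR. */
static int fake_open(void)
{
    fake_calls++;
    if (fail_times > 0)
    {
        fail_times--;
        errno = EINTR;
        return -1;
    }
    return 42;           /* pretend file descriptor */
}

/* The EINTR-safe wrapper every interruptible call would need: retry as
 * long as the failure was only a signal interruption. */
static int open_retry(void)
{
    int fd;

    do
        fd = fake_open();
    while (fd < 0 && errno == EINTR);

    return fd;
}
```

The difficulty the thread identifies is not writing one such loop but guaranteeing it around every blocking call in the backend, which is why reverting the setitimer() change is on the table.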
Re: [HACKERS] [RFC] Minmax indexes
Alvaro, This sounds really interesting, and I can see the possibilities. However ... Value changes in columns that are part of a minmax index, and tuple insertion in summarized pages, would invalidate the stored min/max values. To support this, each minmax index has a validity map; a range can only be considered in a scan if it hasn't been invalidated by such changes (A range not considered in the scan needs to be returned in whole regardless of the stored min/max values, that is, it cannot be pruned per query quals). The validity map is very similar to the visibility map in terms of performance characteristics: quick enough that it's not contentious, allowing updates and insertions to proceed even when data values violate the minmax index conditions. An invalidated range can be made valid by re-summarization (see below). This begins to sound like these indexes are only useful on append-only tables. Not that there aren't plenty of those, but ... Re-summarization is relatively expensive, because the complete page range has to be scanned. Why? Why can't we just update the affected pages in the index? To avoid this, a table having a minmax index would be configured so that inserts only go to the page(s) at the end of the table; this avoids frequent invalidation of ranges in the middle of the table. We provide a table reloption that tweaks the FSM behavior, so that summarized pages are not candidates for insertion. We haven't had an index type which modifies table insertion behavior before, and I'm not keen to start now; imagine having two indexes on the same table each with their own, conflicting, requirements. This is sounding a lot more like a candidate for our prospective pluggable storage manager. Also, the above doesn't help us at all with UPDATEs. 
If we're going to start adding reloptions for specific table behavior, I'd rather think of all of the optimizations we might have for a prospective append-only table and bundle those, rather than tying it to whether a certain index exists or not. Also, I hate the name ... if this feature goes ahead, I'm going to be lobbying to change it. But that's pretty minor compared to the update issues. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
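For readers new to the proposal, the scan-side mechanics Alvaro describes reduce to a very small amount of logic. The sketch below is illustrative only (the actual patch's on-disk layout and validity map differ): one min/max summary per block range, and a validity flag that, when cleared, forces the range to be scanned regardless of its stale bounds.

```c
#include <assert.h>
#include <stdbool.h>

/* One summary per range of heap blocks. */
typedef struct RangeSummary {
    int  min, max;
    bool valid;   /* cleared when an update/insert may have broken it */
} RangeSummary;

/* For a qual "col = key": does this block range need to be read at all?
 * An invalidated range cannot be pruned; a valid one can be skipped
 * whenever the key falls outside [min, max]. */
static bool range_needs_scan(const RangeSummary *r, int key)
{
    if (!r->valid)
        return true;
    return key >= r->min && key <= r->max;
}
```

The contentious part of the thread is not this scan logic but how the valid flag gets set and cleared under concurrent inserts and updates, and whether insertion behavior should be restricted to keep mid-table ranges valid.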
[HACKERS] GIN improvements part 3: ordering in index
Hackers,

the attached patch implements ordering inside a GIN index. This is the third patch of the GIN improvements; see the previous two:
http://www.postgresql.org/message-id/capphfduxv-il7aedwpw0w5fxrwgakfxijwm63_hzujacrxn...@mail.gmail.com
http://www.postgresql.org/message-id/CAPpHfdvftaJq7www381naLw1=4u0h+qpxgwvnhceb9hmvyw...@mail.gmail.com

This patch introduces a new GIN interface method which takes the same arguments as consistent but returns float8:

float8 gin_ordering(bool check[], StrategyNumber n, Datum query, int32 nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool nullFlags[], Datum addInfo[], bool addInfoIsNull[])

This patch implements the gingettuple method, which can return ordering data using the KNN infrastructure. It also introduces an operator for FTS which supports ordering in a GIN index. Some example:

postgres=# explain analyze select * from dblp_titles2 where tsvector @@ to_tsquery('english', 'statistics') order by tsvector to_tsquery('english', 'statistics') limit 10;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=12.00..48.22 rows=10 width=136) (actual time=6.999..7.120 rows=10 loops=1)
   ->  Index Scan using dblp_titles2_idx on dblp_titles2  (cost=12.00..43003.03 rows=11868 width=136) (actual time=6.996..7.115 rows=10 loops=1)
         Index Cond: (tsvector @@ '''statist'''::tsquery)
         Order By: (tsvector '''statist'''::tsquery)
 Total runtime: 7.556 ms
(5 rows)

-- With best regards, Alexander Korotkov.
Re: [HACKERS] extensible external toast tuple support
On 2013-05-31 23:42:51 -0400, Robert Haas wrote: On Thu, May 30, 2013 at 7:42 AM, Andres Freund and...@2ndquadrant.com wrote:

In http://archives.postgresql.org/message-id/20130216164231.GA15069%40awork2.anarazel.de I presented the need for 'indirect' toast tuples which point into memory instead of a toast table. In the comments to that proposal, and in off-list and in-person talks, the wish to make that a more general concept has been voiced.

The previous patch used varattrib_1b_e.va_len_1be to discern between different types of external tuples. That obviously only works if the data sizes of all possibly stored datum types are distinct, which isn't nice. So what the newer patch now does is rename that field to 'va_tag' and decide based on that what kind of Datum we have. To get the actual length of that datum there now is a VARTAG_SIZE() macro which maps the tags back to sizes. To keep on-disk compatibility, the size of an external toast tuple containing a varatt_external is used as its tag value.

This should allow for fairly easy development of a new compression scheme for out-of-line toast tuples. It will *not* work for compressed inline tuples (i.e. VARATT_4B_C). I am not convinced that that is a problem or, if it is, that it cannot be solved separately.

FWIW, in some quick microbenchmarks I couldn't find any performance difference due to the slightly more complex size computation, which I do *not* find surprising. Opinions?

Seems pretty sensible to me. The patch is obviously WIP but the direction seems fine to me.

Here's the updated version. It shouldn't contain any obvious WIP pieces anymore, although I think it needs some more documentation.
I am just not sure where to add it yet, postgres.h seems like a bad place :/

Greetings, Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From 654e24e9a615dcacea4d9714cf8cdbf6953983d5 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Tue, 11 Jun 2013 23:25:26 +0200
Subject: [PATCH] Add support for multiple kinds of external toast datums

There are several use cases where our current representation of external toast datums is limiting:
* adding new compression schemes
* avoidance of repeated detoasting
* externally decoded toast tuples

For that, support 'tags' on external (varattrib_1b_e) varlenas, repurposing the current va_len_1be field to store the tag (or type) of a varlena. To determine the actual length, a macro VARTAG_SIZE(tag) is added which can be used to map from a tag to the actual length.

This patch adds support for 'indirect' tuples which point to some externally allocated memory containing a toast tuple. It also implements the stub for a different compression algorithm.
--- src/backend/access/heap/tuptoaster.c | 100 +++ src/include/c.h | 2 + src/include/postgres.h | 83 + 3 files changed, 153 insertions(+), 32 deletions(-) diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c index fc37ceb..99044d0 100644 --- a/src/backend/access/heap/tuptoaster.c +++ b/src/backend/access/heap/tuptoaster.c @@ -128,7 +128,7 @@ heap_tuple_fetch_attr(struct varlena * attr) struct varlena * heap_tuple_untoast_attr(struct varlena * attr) { - if (VARATT_IS_EXTERNAL(attr)) + if (VARATT_IS_EXTERNAL_ONDISK(attr)) { /* * This is an externally stored datum --- fetch it back from there @@ -145,6 +145,15 @@ heap_tuple_untoast_attr(struct varlena * attr) pfree(tmp); } } + else if (VARATT_IS_EXTERNAL_INDIRECT(attr)) + { + struct varatt_indirect redirect; + VARATT_EXTERNAL_GET_POINTER(redirect, attr); + attr = (struct varlena *)redirect.pointer; + Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr)); + + attr = heap_tuple_untoast_attr(attr); + } else if (VARATT_IS_COMPRESSED(attr)) { /* @@ -191,7 +200,7 @@ heap_tuple_untoast_attr_slice(struct varlena * attr, char *attrdata; int32 attrsize; - if (VARATT_IS_EXTERNAL(attr)) + if (VARATT_IS_EXTERNAL_ONDISK(attr)) { struct varatt_external toast_pointer; @@ -204,6 +213,13 @@ heap_tuple_untoast_attr_slice(struct varlena * attr, /* fetch it back (compressed marker will get set automatically) */ preslice = toast_fetch_datum(attr); } + else if (VARATT_IS_EXTERNAL_INDIRECT(attr)) + { + struct varatt_indirect redirect; + VARATT_EXTERNAL_GET_POINTER(redirect, attr); + return heap_tuple_untoast_attr_slice(redirect.pointer, + sliceoffset, slicelength); + } else preslice = attr; @@ -267,7 +283,7 @@ toast_raw_datum_size(Datum value) struct varlena *attr = (struct varlena *) DatumGetPointer(value); Size result; - if (VARATT_IS_EXTERNAL(attr)) + if (VARATT_IS_EXTERNAL_ONDISK(attr)) { /* va_rawsize is the size of the original datum -- including header */ struct
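The tag-to-size trick at the heart of the patch is easy to model outside the backend. The sketch below uses simplified stand-in names (my_external, MYTAG_SIZE, etc., none of which are the real postgres.h identifiers): the 1-byte field that used to store a length now stores a tag, a macro maps tags back to payload sizes, and on-disk compatibility falls out of making the on-disk tag numerically equal to the size it used to store.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the two external-datum payloads in the patch. */
struct my_external { int relid; int valueid; };  /* on-disk toast pointer */
struct my_indirect { void *pointer; };           /* in-memory pointer */

/* For on-disk compatibility, the tag of the on-disk variant equals the
 * size that the old length byte used to hold; new variants get small
 * values that cannot collide with it. */
#define MYTAG_ONDISK   ((int) sizeof(struct my_external))
#define MYTAG_INDIRECT 1

/* The analogue of VARTAG_SIZE(): map a tag back to the payload size.
 * Unknown tags yield 0, which a real implementation would treat as
 * corruption. */
#define MYTAG_SIZE(tag) \
    ((tag) == MYTAG_ONDISK   ? sizeof(struct my_external) : \
     (tag) == MYTAG_INDIRECT ? sizeof(struct my_indirect) : (size_t) 0)
```

This also shows why the earlier scheme (distinguishing variants purely by stored size) breaks down: two payload structs of equal size would be indistinguishable, whereas distinct tags always are.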
Re: [HACKERS] [RFC] Minmax indexes
Josh Berkus j...@agliodbs.com writes: To avoid this, a table having a minmax index would be configured so that inserts only go to the page(s) at the end of the table; this avoids frequent invalidation of ranges in the middle of the table. We provide a table reloption that tweaks the FSM behavior, so that summarized pages are not candidates for insertion. We haven't had an index type which modifies table insertion behavior before, and I'm not keen to start now; imagine having two indexes on the same table each with their own, conflicting, requirements. I agree; such a restriction is a nonstarter for a secondary index. I don't believe that hacking the FSM would be sufficient to guarantee the required behavior, either. We've talked a lot about index-organized tables in the past. How much of the use case for this would be subsumed by a feature like that? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] extensible external toast tuple support
Andres Freund escribió: Here's the updated version. It shouldn't contain any obvious WIP pieces anymore, although I think it needs some more documentation. I am just not sure where to add it yet, postgres.h seems like a bad place :/

How about a new file, say src/include/access/toast.h?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] extensible external toast tuple support
On 2013-06-14 19:14:15 -0400, Alvaro Herrera wrote: Andres Freund escribió: Here's the updated version. It shouldn't contain any obvious WIP pieces anymore, although I think it needs some more documentation. I am just not sure where to add it yet, postgres.h seems like a bad place :/ How about a new file, say src/include/access/toast.h?

Well, the question is if that buys us all that much; we need the varlena definitions to be available pretty much everywhere. Except for section 3 of postgres.h (which we have reduced to be pretty darn small these days), pretty much all of it is concerned with Datums, a good part of them being varlenas. We could move section 1) into its own file and unconditionally include it...

Greetings, Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pluggable compression support
On 06/14/2013 04:01 PM, Andres Freund wrote: It still contains a guc as described in the above message to control the algorithm used for compressing new tuples but I think we should remove that guc after testing. Did you add the storage attribute? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pluggable compression support
On 2013-06-14 17:12:01 -0700, Josh Berkus wrote: On 06/14/2013 04:01 PM, Andres Freund wrote: It still contains a guc as described in the above message to control the algorithm used for compressing new tuples but I think we should remove that guc after testing. Did you add the storage attribute? No. I think as long as we only have pglz and one new algorithm (even if that is lz4 instead of the current snappy) we should just always use the new algorithm. Unless I missed it nobody seemed to have voiced a contrary position? For testing/evaluation the guc seems to be sufficient. If we want to make it configurable on a per column basis I think the way to go is to add a new column to pg_attribute and split compression related things out of attstorage into attcompression. That's a fair amount of work and it includes a minor compatibility break in the catalog format, so I'd prefer not to do it until there's a good reason to do so. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PATCH] Revive line type
Complete the implementations of line_in, line_out, line_recv, line_send. Remove comments and error messages about the line type not being implemented. Add regression tests for existing line operators and functions.
---
This just revives existing functionality; it doesn't add anything new. One thing that the original code did not settle was how to convert a line in the form Ax+By+C=0 to the two-points output form. Obviously, you can just pick two random points on the line, but I wonder whether there is a more standard solution.

 doc/src/sgml/datatype.sgml                 |  34 +++-
 doc/src/sgml/func.sgml                     |   6 +
 src/backend/utils/adt/geo_ops.c            | 108 +
 src/include/catalog/pg_type.h              |   3 +-
 src/include/utils/geo_decls.h              |   7 -
 src/test/regress/expected/geometry.out     |   3 -
 src/test/regress/expected/line.out         | 243
 src/test/regress/expected/sanity_check.out |   3 +-
 src/test/regress/output/misc.source        |   3 +-
 src/test/regress/parallel_schedule         |   2 +-
 src/test/regress/serial_schedule           |   1 +
 src/test/regress/sql/geometry.sql          |   4 -
 src/test/regress/sql/line.sql              |  77 +
 13 files changed, 408 insertions(+), 86 deletions(-)
 create mode 100644 src/test/regress/expected/line.out
 create mode 100644 src/test/regress/sql/line.sql

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index f73e6b2..ecbbdd8 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -3066,7 +3066,7 @@ titleGeometric Types/title
 row
 entrytypeline/type/entry
 entry32 bytes/entry
-entryInfinite line (not fully implemented)/entry
+entryInfinite line/entry
 entry((x1,y1),(x2,y2))/entry
 /row
 row
@@ -3142,6 +3142,38 @@ titlePoints/title
 /sect2
 sect2
+titleLines/title
+
+indexterm
+ primaryline/primary
+/indexterm
+
+para
+ Lines (typeline/type) are specified by pairs of points.
+ Values of type typeline/type are specified using any of the following + syntaxes: + +synopsis +[ ( replaceablex1/replaceable , replaceabley1/replaceable ) , ( replaceablex2/replaceable , replaceabley2/replaceable ) ] +( ( replaceablex1/replaceable , replaceabley1/replaceable ) , ( replaceablex2/replaceable , replaceabley2/replaceable ) ) + ( replaceablex1/replaceable , replaceabley1/replaceable ) , ( replaceablex2/replaceable , replaceabley2/replaceable ) +replaceablex1/replaceable , replaceabley1/replaceable , replaceablex2/replaceable , replaceabley2/replaceable +/synopsis + + where + literal(replaceablex1/replaceable,replaceabley1/replaceable)/literal + and + literal(replaceablex2/replaceable,replaceabley2/replaceable)/literal + are two (different) points on the line. +/para + +para + Lines are output using the first syntax. The points used in the output + are not necessarily the points used on input. +/para + /sect2 + + sect2 titleLine Segments/title indexterm diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 4c5af4b..835a189 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -8070,6 +8070,12 @@ titleGeometric Type Conversion Functions/title entryliteralcircle(polygon '((0,0),(1,1),(2,0))')/literal/entry /row row +entryliteralfunctionline(typepoint/type, typepoint/type)/function/literal/entry +entrytypeline/type/entry +entrypoints to line/entry +entryliterallseg(point '(-1,0)', point '(1,0)')/literal/entry + /row + row entry indexterm primarylseg/primary diff --git a/src/backend/utils/adt/geo_ops.c b/src/backend/utils/adt/geo_ops.c index ad18cf0..61a1900 100644 --- a/src/backend/utils/adt/geo_ops.c +++ b/src/backend/utils/adt/geo_ops.c @@ -933,13 +933,8 @@ Datum line_in(PG_FUNCTION_ARGS) { -#ifdef ENABLE_LINE_TYPE char *str = PG_GETARG_CSTRING(0); -#endif LINE *line; - -#ifdef ENABLE_LINE_TYPE - /* when fixed, modify not implemented, catalog/pg_type.h and SGML */ LSEGlseg; int isopen; char *s; @@ -950,15 +945,13 @@ 
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg(invalid input syntax for type line: \%s\, str))); + if (FPeq(lseg.p[0].x, lseg.p[1].x) FPeq(lseg.p[0].y, lseg.p[1].y)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), +errmsg(invalid line specification: must be two distinct points))); + line = (LINE *) palloc(sizeof(LINE)); line_construct_pts(line, lseg.p[0], lseg.p[1]); -#else - ereport(ERROR, -
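On the open question in the patch description (how to turn Ax+By+C=0 back into two points), one deterministic answer, offered here only as a sketch rather than what the patch does, is to step one unit along an axis: when B is nonzero the line is y = -(Ax+C)/B, so take x = 0 and x = 1; when B is zero it is the vertical line x = -C/A.

```c
#include <assert.h>
#include <math.h>

typedef struct { double x, y; } Pt;

/* Produce two distinct points on the line Ax + By + C = 0.
 * Assumes A and B are not both zero (not a valid line otherwise). */
static void line_to_points(double A, double B, double C, Pt *p1, Pt *p2)
{
    if (B != 0.0)
    {
        /* y = -(A*x + C)/B; evaluate at x = 0 and x = 1 */
        p1->x = 0.0; p1->y = -C / B;
        p2->x = 1.0; p2->y = -(A + C) / B;
    }
    else
    {
        /* vertical line x = -C/A; any two y values will do */
        p1->x = -C / A; p1->y = 0.0;
        p2->x = -C / A; p2->y = 1.0;
    }
}
```

Being deterministic, this also gives stable regression-test output, which random points would not; it does not address whether some more canonical two-point form (e.g. the point closest to the origin plus a unit direction) would be preferable.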
Re: [HACKERS] [RFC] Minmax indexes
On Fri, Jun 14, 2013 at 11:28 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:

Re-summarization is relatively expensive, because the complete page range has to be scanned.

That doesn't sound too bad to me. It just means there's a downside to having larger page ranges. I would expect the page ranges to be something in the ballpark of 32 pages -- scanning 32 pages to resummarize doesn't sound that painful, but sounds like it's large enough that the resulting index would be a reasonable size.

But I don't understand why an insert would invalidate a tuple. An insert can just update the min and max incrementally. It's a delete that invalidates the range, but as you note it doesn't really invalidate it, just marks it as needing a refresh -- and even then only if the value being deleted is equal to either the min or max.

Same-size page ranges? Current related literature seems to consider that each index entry in a minmax index must cover the same number of pages. There doesn't seem to be a

I assume the reason for this in the literature is the need to quickly find the summary for a given page when you're handling an insert or delete. If you have some kind of metadata structure that lets you find it (which I gather is what the validity map is?) then you wouldn't need it. But that seems like a difficult cost to justify compared to just having a 1:1 mapping from block to bitmap tuple.

-- greg

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
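Greg's maintenance argument can be stated as code. This is an illustrative sketch, not the patch: an insert only ever widens the summary, so it is cheap and never needs invalidation; a delete needs to flag the range for re-summarization only when it removes a boundary value.

```c
#include <assert.h>
#include <stdbool.h>

/* One min/max summary for a block range. */
typedef struct Summary {
    int  min, max;
    bool stale;   /* set when the bounds can no longer be trusted */
} Summary;

/* Insert: widen the bounds incrementally; never invalidates. */
static void summary_insert(Summary *s, int v)
{
    if (v < s->min) s->min = v;
    if (v > s->max) s->max = v;
}

/* Delete: the bounds stay conservative (the range may merely be wider
 * than necessary), so only deleting a boundary value warrants marking
 * the summary for eventual re-summarization. */
static void summary_delete(Summary *s, int v)
{
    if (v == s->min || v == s->max)
        s->stale = true;
}
```

A stale summary is still safe to use for pruning in the conservative direction; re-summarization merely tightens it, which is why "needing a refresh" is weaker than "invalid".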
Re: [HACKERS] pluggable compression support
No. I think as long as we only have pglz and one new algorithm (even if that is lz4 instead of the current snappy) we should just always use the new algorithm. Unless I missed it nobody seemed to have voiced a contrary position? For testing/evaluation the guc seems to be sufficient. Then it's not pluggable, is it? It's upgradable compression support, if anything. Which is fine, but let's not confuse people. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pluggable compression support
On 2013-06-14 17:35:02 -0700, Josh Berkus wrote: No. I think as long as we only have pglz and one new algorithm (even if that is lz4 instead of the current snappy) we should just always use the new algorithm. Unless I missed it nobody seemed to have voiced a contrary position? For testing/evaluation the guc seems to be sufficient. Then it's not pluggable, is it? It's upgradable compression support, if anything. Which is fine, but let's not confuse people.

The point is that it's pluggable on the storage level, in the sense that several different algorithms can coexist and new ones can relatively easily be added. That part is what seems to have blocked progress for quite a while now, so fixing that seems to be the interesting thing. I am happy enough to do the work of making it configurable if we want it to be... But I have zap interest in doing it and then throwing it away in the end because we decide we don't need it.

Greetings, Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic background workers
On Sat, Jun 15, 2013 at 6:00 AM, Robert Haas robertmh...@gmail.com wrote: The second patch, dynamic-bgworkers-v1.patch, revises the background worker API to allow background workers to be started dynamically. This requires some communication channel from ordinary workers to the postmaster, because it is the postmaster that must ultimately start the newly-registered workers. However, that communication channel has to be designed pretty carefully, lest a shared memory corruption take out the postmaster and lead to inadvertent failure to restart after a crash. Here's how I implemented that: there's an array in shared memory of a size equal to max_worker_processes. This array is separate from the backend-private list of workers maintained by the postmaster, but the two are kept in sync. When a new background worker registration is added to the shared data structure, the backend adding it uses the existing pmsignal mechanism to kick the postmaster, which then scans the array for new registrations. I have attempted to make the code that transfers the shared_memory state into the postmaster's private state as paranoid as humanly possible. The precautions taken are documented in the comments. Conversely, when a background worker flagged as BGW_NEVER_RESTART is considered for restart (and we decide against it), the corresponding slot in the shared memory array is marked as no longer in use, allowing it to be reused for a new registration. Since the postmaster cannot take locks, synchronization between the postmaster and other backends using the shared memory segment has to be lockless. This mechanism is also documented in the comments. An lwlock is used to prevent two backends that are both registering a new worker at about the same time from stomping on each other, but the postmaster need not care about that lwlock. This patch also extends worker_spi as a demonstration of the new interface. 
With this patch, you can CREATE EXTENSION worker_spi and then call worker_spi_launch(int4) to launch a new background worker, or combine it with generate_series() to launch a bunch at once. Then you can kill them off with pg_terminate_backend() and start some new ones. That, in my humble opinion, is pretty cool. This looks really interesting, +1. I'll test the patch if possible next week. -- Michael
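The registration handoff Robert describes can be modeled in miniature. The sketch below is a deliberately simplified single-process stand-in, not the patch's code: a fixed array plays the role of the shared-memory segment, a backend claims a free slot, and the postmaster later scans for registrations it has not yet copied into its private list. The real patch additionally needs the pmsignal kick, lockless postmaster-side reads, and an lwlock between concurrently registering backends, all of which are elided here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_WORKER_PROCESSES 8

typedef struct WorkerSlot {
    bool in_use;               /* claimed by a backend */
    bool seen_by_postmaster;   /* copied into postmaster's private list */
    char name[32];
} WorkerSlot;

/* Stand-in for the shared-memory array (zero-initialized). */
static WorkerSlot slots[MAX_WORKER_PROCESSES];

/* Backend side: claim a free slot, fill it in, then (in the real thing)
 * set in_use last and signal the postmaster.  Returns the slot index,
 * or -1 if max_worker_processes slots are all taken. */
static int register_worker(const char *name)
{
    for (int i = 0; i < MAX_WORKER_PROCESSES; i++)
    {
        if (!slots[i].in_use)
        {
            strncpy(slots[i].name, name, sizeof(slots[i].name) - 1);
            slots[i].seen_by_postmaster = false;
            slots[i].in_use = true;
            return i;
        }
    }
    return -1;
}

/* Postmaster side: on each kick, scan the whole array and absorb any
 * registration it has not seen yet.  Returns how many were new. */
static int absorb_new_workers(void)
{
    int n = 0;

    for (int i = 0; i < MAX_WORKER_PROCESSES; i++)
    {
        if (slots[i].in_use && !slots[i].seen_by_postmaster)
        {
            slots[i].seen_by_postmaster = true;   /* copy to private list */
            n++;
        }
    }
    return n;
}
```

Clearing in_use when a BGW_NEVER_RESTART worker is retired is what lets a slot be recycled for a later registration, exactly as the patch describes.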
Re: [HACKERS] pluggable compression support
On Fri, Jun 14, 2013 at 8:45 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-06-14 17:35:02 -0700, Josh Berkus wrote: No. I think as long as we only have pglz and one new algorithm (even if that is lz4 instead of the current snappy) we should just always use the new algorithm. Unless I missed it nobody seemed to have voiced a contrary position? For testing/evaluation the guc seems to be sufficient. Then it's not pluggable, is it? It's upgradable compression support, if anything. Which is fine, but let's not confuse people. The point is that it's pluggable on the storage level in the sense of that several different algorithms can coexist and new ones can relatively easily added. That part is what seems to have blocked progress for quite a while now. So fixing that seems to be the interesting thing. I am happy enough to do the work of making it configurable if we want it to be... But I have zap interest of doing it and throw it away in the end because we decide we don't need it.

I don't think we need it. I think what we need to decide is which algorithm is legally OK to use. And then put it in.

In the past, we've had a great deal of speculation about that legal question from people who are not lawyers. Maybe it would be valuable to get some opinions from people who ARE lawyers. Tom and Heikki both work for real big companies which, I'm guessing, have substantial legal departments; perhaps they could pursue getting the algorithms of possible interest vetted. Or, I could try to find out whether it's possible to do something similar through EnterpriseDB.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pluggable compression support
On 06/14/2013 06:56 PM, Robert Haas wrote: On Fri, Jun 14, 2013 at 8:45 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-06-14 17:35:02 -0700, Josh Berkus wrote: No. I think as long as we only have pglz and one new algorithm (even if that is lz4 instead of the current snappy) we should just always use the new algorithm. Unless I missed it nobody seemed to have voiced a contrary position? For testing/evaluation the guc seems to be sufficient. Then it's not pluggable, is it? It's upgradable compression support, if anything. Which is fine, but let's not confuse people. The point is that it's pluggable on the storage level in the sense of that several different algorithms can coexist and new ones can relatively easily added. That part is what seems to have blocked progress for quite a while now. So fixing that seems to be the interesting thing. I am happy enough to do the work of making it configurable if we want it to be... But I have zap interest of doing it and throw it away in the end because we decide we don't need it. I don't think we need it. I think what we need is to decide is which algorithm is legally OK to use. And then put it in. In the past, we've had a great deal of speculation about that legal question from people who are not lawyers. Maybe it would be valuable to get some opinions from people who ARE lawyers. Tom and Heikki both work for real big companies which, I'm guessing, have substantial legal departments; perhaps they could pursue getting the algorithms of possible interest vetted. Or, I could try to find out whether it's possible do something similar through EnterpriseDB.

We have IP legal representation through Software in the Public Interest, who pretty much specialize in this type of thing. Should I follow up? If so, I need a summary of the exact question, including licenses etc.

JD

-- Command Prompt, Inc.
- http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers