Re: [HACKERS] updated emacs configuration

2013-06-14 Thread Daniel Farina
On Thu, Jun 13, 2013 at 6:27 PM, Peter Eisentraut pete...@gmx.net wrote:
 First, I propose adding a .dir-locals.el file to the top-level directory
 with basic emacs settings.  These get applied automatically.  This
 especially covers the particular tab and indentation settings that
 PostgreSQL uses.  With this, casual developers will not need to modify
 any of their emacs settings.

Yes please.  I've had the pgsql stuff in my .emacs for-ever (ever since I
was a student and compelled to do homework on Postgres) and knew the
magical rules about naming the directory, but it always felt so dirty
and very much a 'you need to know the trick' level of intimacy.




Re: [HACKERS] [PATCH] Add transforms feature

2013-06-14 Thread Cédric Villemain



Peter Eisentraut pete...@gmx.net wrote:
A transform is an SQL object that supplies two functions for converting
between data types and procedural languages.  For example, a transform
could arrange that hstore is converted to an appropriate hash or
dictionary object in PL/Perl or PL/Python.

Nice !

Continued from 2013-01 commit fest.  All known open issues have been
fixed.

You kept PGXS style makefile...

--
Sent from my phone; please excuse the brevity.




[HACKERS] Reduce maximum error in tuples estimation after vacuum.

2013-06-14 Thread Kyotaro HORIGUCHI
Hello, 

PostgreSQL estimates the number of live tuples after a vacuum that
has left some buffers unscanned. This estimation does well in most
cases, but it produces a completely different result when there is a
strong imbalance in tuple density.

For example,

create table t (a int, b int);
insert into t (select a, (random() * 10)::int from generate_series((select 
count(*) from t) + 1, 1000000) a);
update t set b = b + 1 where a < (select count(*) from t) * 0.7;
vacuum t;
delete from t where a < (select count(*) from t) * 0.99;

After this, pg_stat_user_tables.n_live_tup shows 417670, which is
about 41 times larger than the real number of rows, 10001. And what
makes it worse, neither autovacuum nor autoanalyze will run until
n_dead_tup goes above roughly 8 times the real number of tuples in
the table, with the default settings.


| postgres=# select n_live_tup, n_dead_tup
|from pg_stat_user_tables where relname='t';
|  n_live_tup | n_dead_tup 
| ------------+------------
|      417670 |          0
| 
| postgres=# select reltuples from pg_class where relname='t';
|  reltuples 
| -----------
|     417670
| 
| postgres=# select count(*) from t;
|  count 
| -------
|  10001

Using n_dead_tup before vacuuming seems to make this better, but I
heard that that plan was abandoned for some reason I don't know.

So I've come up with another plan: use the FSM to estimate the
tuple density in unscanned pages. The point is to make the
estimation rely on the uniformity of tuple length instead of
tuple density. This change seems to keep the error within a few
times the real tuple count. The additional page reads for the FSM
are about 1/4000 (SlotsPerFSMPage) of the skipped pages, and I
suppose this is tolerable during vacuum.

The overall algorithm can be illustrated as below:

 - summing up used bytes, max offnum (PageGetMaxOffsetNumber),
   maximum free bytes for tuple data, and free bytes after page
   vacuum, through all scanned pages.

 - summing up free bytes informed by the FSM through all skipped
   pages.

 - Calculate the mean tuple length from the overall used bytes,
   the sum of max offnums, and the scanned pages.

 - Guess the tuple density in skipped pages using the overall
   free bytes from the FSM and the mean tuple length calculated
   above.

 - Finally, feed the estimated number of live tuples BEFORE the
   vacuum into vac_estimate_reltuples.

Of course this method is affected by imbalance in tuple LENGTH,
but the error also seems to be kept within a few times the number
of tuples.
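
As a rough numeric sketch of the arithmetic above (the numbers are invented
for illustration and are not taken from the patch or the test results):

-- mean tuple length: total used bytes / total used line pointers
select 7200000.0 / 80000 as mean_tuple_len;                     -- 90 bytes

-- tuples guessed for one skipped page from its FSM free space,
-- assuming an 8192-byte block, a 24-byte page header and 3500 free bytes
select trunc((8192 - 24 - 3500) / 90.0) as est_tuples_on_page;  -- 51

Summing such per-page guesses over the skipped pages, plus the tuples actually
counted in the scanned pages, gives the pre-vacuum estimate that is fed to
vac_estimate_reltuples.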

For rows of invariable length, the test against HEAD shows the following,
where "tups est" is pg_class.reltuples and "tups real" is count(*).

del% | pages | n_live_tup | tups est | tups real | est/real | bufs 
-----+-------+------------+----------+-----------+----------+------
 0.9 |  4425 |     100001 |   470626 |    100001 |    4.706 | 3985
0.95 |  4425 |      50001 |   441196 |     50001 |    8.824 | 4206
0.99 |  4425 |     417670 |   417670 |     10001 |   41.763 | 4383

and with the patch

 0.9 |  4425 |     106169 |   106169 |    100001 |    1.062 | 3985
0.95 |  4425 |      56373 |    56373 |     50001 |    1.127 | 4206
0.99 |  4425 |      10001 |    16535 |     10001 |    1.653 | 4383


What do you think about this?

=

The attached files are:

  - vacuum_est_improve_20130614.patch: the patch for this proposal

  - vactest.sql: SQL script to cause the situation

  - vactest.sh: test script to find the errors related to this patch.

  - test_result.txt: all of the test results for the various deletion
    ratios which the test script above yields.

regards,
-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index d6d20fd..1e581c1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1280,7 +1280,8 @@ acquire_sample_rows(Relation onerel, int elevel,
 	*totalrows = vac_estimate_reltuples(onerel, true,
 		totalblocks,
 		bs.m,
-		liverows);
+		liverows,
+		onerel->rd_rel->reltuples);
 	if (bs.m > 0)
 		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
 	else
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 641c740..4bdf0c1 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -501,10 +501,10 @@ double
 vac_estimate_reltuples(Relation relation, bool is_analyze,
 	   BlockNumber total_pages,
 	   BlockNumber scanned_pages,
-	   double scanned_tuples)
+	   double scanned_tuples,
+	   double old_rel_tuples)
 {
 	BlockNumber old_rel_pages = relation->rd_rel->relpages;
-	double		old_rel_tuples = relation->rd_rel->reltuples;
 	double		old_density;
 	double		new_density;
 	double		multiplier;
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 7e46f9e..80304a6 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -396,7 +396,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 	double		num_tuples,
 			

[HACKERS] Add visibility map information to pg_freespace.

2013-06-14 Thread Kyotaro HORIGUCHI
Hello,

I've added visibility map information to pg_freespace for my
utility.

It looks like this:

postgres=# select * from pg_freespace('t'::regclass);
 blkno | avail | all_visible 
-------+-------+-------------
     0 |  7424 | t
     1 |  7424 | t
     2 |  7424 | t
     3 |  7424 | t
     4 |  7424 | t
     5 |  7424 | t
     6 |  7424 | t
     7 |  7424 | t
...
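
For instance (assuming the patched extension is installed; this query is just
an illustration, not part of the patch), the new column makes it easy to see
what fraction of a table's pages are currently all-visible:

postgres=# select sum(case when all_visible then 1 else 0 end)::float / count(*) as all_visible_fraction from pg_freespace('t'::regclass);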

What do you think about this?


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.0.sql b/contrib/pg_freespacemap/pg_freespacemap--1.0.sql
index 2adb52a..e38b466 100644
--- a/contrib/pg_freespacemap/pg_freespacemap--1.0.sql
+++ b/contrib/pg_freespacemap/pg_freespacemap--1.0.sql
@@ -9,12 +9,17 @@ RETURNS int2
 AS 'MODULE_PATHNAME', 'pg_freespace'
 LANGUAGE C STRICT;
 
+CREATE FUNCTION pg_is_all_visible(regclass, bigint)
+RETURNS bool
+AS 'MODULE_PATHNAME', 'pg_is_all_visible'
+LANGUAGE C STRICT;
+
 -- pg_freespace shows the recorded space avail at each block in a relation
 CREATE FUNCTION
-  pg_freespace(rel regclass, blkno OUT bigint, avail OUT int2)
+  pg_freespace(rel regclass, blkno OUT bigint, avail OUT int2, all_visible OUT bool)
 RETURNS SETOF RECORD
 AS $$
-  SELECT blkno, pg_freespace($1, blkno) AS avail
+  SELECT blkno, pg_freespace($1, blkno) AS avail, pg_is_all_visible($1, blkno) AS all_visible
   FROM generate_series(0, pg_relation_size($1) / current_setting('block_size')::bigint - 1) AS blkno;
 $$
 LANGUAGE SQL;
diff --git a/contrib/pg_freespacemap/pg_freespacemap.c b/contrib/pg_freespacemap/pg_freespacemap.c
index f6f7d2e..de4eff7 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.c
+++ b/contrib/pg_freespacemap/pg_freespacemap.c
@@ -10,17 +10,20 @@
 
 #include "funcapi.h"
 #include "storage/freespace.h"
+#include "access/visibilitymap.h"
 
 
 PG_MODULE_MAGIC;
 
 Datum		pg_freespace(PG_FUNCTION_ARGS);
+Datum		pg_is_all_visible(PG_FUNCTION_ARGS);
 
 /*
  * Returns the amount of free space on a given page, according to the
  * free space map.
  */
 PG_FUNCTION_INFO_V1(pg_freespace);
+PG_FUNCTION_INFO_V1(pg_is_all_visible);
 
 Datum
 pg_freespace(PG_FUNCTION_ARGS)
@@ -38,7 +41,32 @@ pg_freespace(PG_FUNCTION_ARGS)
  errmsg("invalid block number")));
 
 	freespace = GetRecordedFreeSpace(rel, blkno);
-
 	relation_close(rel, AccessShareLock);
 	PG_RETURN_INT16(freespace);
 }
+
+Datum
+pg_is_all_visible(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	int64		blkno = PG_GETARG_INT64(1);
+	Buffer  vmbuffer = InvalidBuffer;
+	int			all_visible;
+	Relation	rel;
+
+	rel = relation_open(relid, AccessShareLock);
+
+	if (blkno < 0 || blkno > MaxBlockNumber)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+	all_visible = visibilitymap_test(rel, blkno, &vmbuffer);
+	if (vmbuffer != InvalidBuffer)
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+	relation_close(rel, AccessShareLock);
+	PG_RETURN_BOOL(all_visible);
+}
diff --git a/contrib/pg_freespacemap/pg_freespacemap.control b/contrib/pg_freespacemap/pg_freespacemap.control
index 34b695f..395350a 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.control
+++ b/contrib/pg_freespacemap/pg_freespacemap.control
@@ -1,5 +1,5 @@
 # pg_freespacemap extension
-comment = 'examine the free space map (FSM)'
+comment = 'examine the free space map (FSM) and visibility map (VM)'
 default_version = '1.0'
 module_pathname = '$libdir/pg_freespacemap'
 relocatable = true



Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-14 Thread KONDO Mitsumasa

(2013/06/12 23:07), Robert Haas wrote:

On Mon, Jun 10, 2013 at 3:48 PM, Simon Riggs si...@2ndquadrant.com wrote:

On 10 June 2013 11:51, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote:

I create patch which is improvement of checkpoint IO scheduler for stable
transaction responses.


Looks like good results, with good measurements. Should be an
interesting discussion.


+1.

I suspect we want to poke at the algorithms a little here and maybe
see if we can do this without adding new GUCs.  Also, I think this is
probably two separate patches, in the end.  But the direction seems
good to me.

Thank you for the comment!

I separated my patch into a checkpoint-write part and a checkpoint-fsync part.
As you say, my patch has a lot of new GUCs. I don't think they can all be
decided automatically. However, it is difficult to make the checkpoint
scheduler suitable for every environment, such as virtual servers, public
cloud servers, and embedded servers. So the default parameter settings work
the same as before. Setting the parameters is primitive and difficult, but if
we can set them correctly, the scheduler suits a lot of environments and will
not behave in unintended ways.


I will try to take into consideration a version with fewer GUCs, and if you
have a good idea, please discuss it!
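
For illustration only (these parameter values are made up, not defaults from
the patch): with checkpoint_completion_target = 0.5, a smooth target of 0.3
and a margin of 0, a checkpoint that is 10% done would be scheduled against

select 0.1 * (((0.3 - 0.1) / 0.3) * (0.0 + 1 - 0.5) + 0.5);   -- about 0.083

instead of the plain 0.1 * 0.5 = 0.05, so (if I read the patch correctly) the
scheduler considers itself further ahead early in the checkpoint and writes
less aggressively, converging to the normal schedule as progress approaches
the smooth target.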


Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..0c0f215 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -141,9 +141,12 @@ static CheckpointerShmemStruct *CheckpointerShmem;
 /*
  * GUC parameters
  */
+int			CheckPointerWriteDelay = 200;
 int			CheckPointTimeout = 300;
 int			CheckPointWarning = 30;
 double		CheckPointCompletionTarget = 0.5;
+double		CheckPointSmoothTarget = 0.0;
+double		CheckPointSmoothMargin = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -715,7 +718,7 @@ CheckpointWriteDelay(int flags, double progress)
 		 * Checkpointer and bgwriter are no longer related so take the Big
 		 * Sleep.
 		 */
-		pg_usleep(100000L);
+		pg_usleep(CheckPointerWriteDelay * 1000L);
 	}
 	else if (--absorb_counter = 0)
 	{
@@ -742,14 +745,36 @@ IsCheckpointOnSchedule(double progress)
 {
 	XLogRecPtr	recptr;
 	struct timeval now;
-	double		elapsed_xlogs,
+	double		original_progress,
+			elapsed_xlogs,
 elapsed_time;
 
 	Assert(ckpt_active);
 
-	/* Scale progress according to checkpoint_completion_target. */
-	progress *= CheckPointCompletionTarget;
+	/* This variable is used by smooth checkpoint schedule.*/
+	original_progress = progress * CheckPointCompletionTarget;
 
+	/* Scale progress according to checkpoint_completion_target and checkpoint_smooth_target. */
+	if (progress >= CheckPointSmoothTarget)
+	{
+		/* Normal checkpoint schedule. */
+		progress *= CheckPointCompletionTarget;
+	}
+	else
+	{
+		/*
+		 * Smooth checkpoint schedule.
+		 *
+		 * When initial checkpoint, it tends to be high IO load average
+		 * and slow executing transactions. This schedule reduces them
+		 * and improves IO response. As 'progress' approaches CheckPointSmoothTarget,
+		 * it becomes near the normal checkpoint schedule. If you want a more
+		 * smooth checkpoint schedule, set CheckPointSmoothTarget higher.
+		 */
+		progress *= ((CheckPointSmoothTarget - progress) / CheckPointSmoothTarget) *
+(CheckPointSmoothMargin + 1 - CheckPointCompletionTarget) +
+CheckPointCompletionTarget;
+	}
 	/*
 	 * Check against the cached value first. Only do the more expensive
 	 * calculations once we reach the target previously calculated. Since
@@ -779,6 +804,14 @@ IsCheckpointOnSchedule(double progress)
 			ckpt_cached_elapsed = elapsed_xlogs;
 			return false;
 		}
+		else if (original_progress < elapsed_xlogs)
+		{
+			ckpt_cached_elapsed = elapsed_xlogs;
+
+			/* smooth checkpoint write */
+			pg_usleep(CheckPointerWriteDelay * 1000L);
+			return false;
+		}
 	}
 
 	/*
@@ -793,6 +826,14 @@ IsCheckpointOnSchedule(double progress)
 		ckpt_cached_elapsed = elapsed_time;
 		return false;
 	}
+	else if (original_progress < elapsed_time)
+	{
+		ckpt_cached_elapsed = elapsed_time;
+
+		/* smooth checkpoint write */
+		pg_usleep(CheckPointerWriteDelay * 1000L);
+		return false;
+	}
 
 	/* It looks like we're on schedule. */
 	return true;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..d41dc17 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"checkpointer_write_delay", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+			gettext_noop("checkpointer sleep time during dirty buffers write in checkpoint."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&CheckPointerWriteDelay,
+		200, 10, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared 

Re: [HACKERS] Reduce maximum error in tuples estimation after vacuum.

2013-06-14 Thread Kyotaro HORIGUCHI
Sorry, I made a mistake.

Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp:

 Overall algorithm could be illistrated as below,

  - summing up used bytes, max offnum(PageGetMaxOffsetNumber),

Not the max offnum, but the number of linp's used after page vacuum.

maximum free bytes for tuple data , and free bytes after page
vacuum through all scanned pages.

  - summing up free bytes informed by FSM through all skipped
pages.

  - Calculate mean tuple length from the overall used bytes and
sum of max offnums, and scanned pages.

The same applies here: not the sum of max offnums but the total of used
entries (linps).

  - Guess tuple density in skipped pages using overall free bytes
from FSM and the mean tuple length calculated above.

  - Finally, feed estimated number of the live tuples BEFORE
vacuum into vac_estimate_reltuples.

regards,

-- 
Kyotaro Horiguchi


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Benedikt Grundmann
On Fri, Jun 14, 2013 at 10:11 AM, Samrat Revagade revagade.sam...@gmail.com
 wrote:

 Hello,


 We have already started a discussion on pgsql-hackers for the problem of
 taking fresh backup during the failback operation here is the link for that:




 http://www.postgresql.org/message-id/caf8q-gxg3pqtf71nvece-6ozraew5pwhk7yqtbjgwrfu513...@mail.gmail.com



 Let me again summarize the problem we are trying to address.



 When the master fails, last few WAL files may not reach the standby. But
 the master may have gone ahead and made changes to its local file system
 after flushing WAL to the local storage.  So master contains some file
 system level changes that standby does not have.  At this point, the data
 directory of master is ahead of standby's data directory.

 Subsequently, the standby will be promoted as new master.  Later when the
 old master wants to be a standby of the new master, it can't just join the
 setup since there is inconsistency in between these two servers. We need to
 take the fresh backup from the new master.  This can happen in both the
 synchronous as well as asynchronous replication.



 Fresh backup is also needed in case of clean switch-over because in the
 current HEAD, the master does not wait for the standby to receive all the
 WAL up to the shutdown checkpoint record before shutting down the
 connection. Fujii Masao has already submitted a patch to handle clean
 switch-over case, but the problem is still remaining for failback case.



 The process of taking fresh backup is very time consuming when databases
 are of very big sizes, say several TB's, and when the servers are connected
 over a relatively slower link.  This would break the service level
 agreement of disaster recovery system.  So there is need to improve the
 process of disaster recovery in PostgreSQL.  One way to achieve this is to
 maintain consistency between master and standby which helps to avoid need
 of fresh backup.



 So our proposal on this problem is that we must ensure that master should
 not make any file system level changes without confirming that the
 corresponding WAL record is replicated to the standby.



An alternative proposal (which will probably just reveal my lack of
understanding about what is or isn't possible with WAL).  Provide a way to
restart the master so that it rolls back the WAL changes that the slave
hasn't seen.

 There are many suggestions and objections on pgsql-hackers about this problem.
 The brief summary is as follows:





Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Samrat Revagade
That will not happen if there is an inconsistency between the two servers.

Please refer to the discussions on the link provided in the first post:

http://www.postgresql.org/message-id/caf8q-gxg3pqtf71nvece-6ozraew5pwhk7yqtbjgwrfu513...@mail.gmail.com

Regards,

Samrat Revagade


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Heikki Linnakangas

On 14.06.2013 12:11, Samrat Revagade wrote:

We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:

http://www.postgresql.org/message-id/caf8q-gxg3pqtf71nvece-6ozraew5pwhk7yqtbjgwrfu513...@mail.gmail.com

Let me again summarize the problem we are trying to address.

When the master fails, last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after flushing WAL to the local storage.  So master contains some file
system level changes that standby does not have.  At this point, the data
directory of master is ahead of standby's data directory.

Subsequently, the standby will be promoted as new master.  Later when the
old master wants to be a standby of the new master, it can't just join the
setup since there is inconsistency in between these two servers. We need to
take the fresh backup from the new master.  This can happen in both the
synchronous as well as asynchronous replication.


Did you see the thread on the little tool I wrote called pg_rewind?

http://www.postgresql.org/message-id/519df910.4020...@vmware.com

It solves that problem, for both clean and unexpected shutdown. It needs 
some more work and a lot more testing, but requires no changes to the 
backend. Robert Haas pointed out in that thread that it has a problem 
with hint bits that are not WAL-logged, but it will still work if you 
also enable the new checksums feature, which forces hint bit updates to 
be WAL-logged. Perhaps we could add a GUC to enable hint bits to be 
WAL-logged, regardless of checksums, to make pg_rewind work.


I think that's a more flexible approach to solve this problem. It 
doesn't require an online feedback loop from the standby to master, for 
starters.


- Heikki




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Pavan Deolasee
On Fri, Jun 14, 2013 at 4:12 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

  Robert Haas pointed out in that thread that it has a problem with hint
 bits that are not WAL-logged,


I liked that tool a lot until Robert pointed out the above problem. I
thought this is a show stopper because I can't really see any way to
circumvent it unless we enable checksums or explicitly WAL log hint bits.


 but it will still work if you also enable the new checksums feature, which
 forces hint bit updates to be WAL-logged.


Are we expecting a lot of people to run their clusters with checksums on ?
Sorry, I haven't followed the checksum discussions and don't know how much
overhead it causes. But if the general expectation is that checksums will
be turned on most often, I agree pg_rewind is probably good enough.


 Perhaps we could add a GUC to enable hint bits to be WAL-logged,
 regardless of checksums, to make pg_rewind work.


Wouldn't that be too costly ? I mean, in the worst case every hint bit on a
page may get updated separately. If each such update is WAL logged, we are
looking for a lot more unnecessary WAL traffic.


 I think that's a more flexible approach to solve this problem. It doesn't
 require an online feedback loop from the standby to master, for starters.


I agree. That's a big advantage of pg_rewind. Unfortunately, it can't work
with 9.3 and below because of the hint bits issue, otherwise it would have
been even more cool.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Pavan Deolasee
On Fri, Jun 14, 2013 at 2:51 PM, Benedikt Grundmann 
bgrundm...@janestreet.com wrote:


 A alternative proposal (which will probably just reveal my lack of
 understanding about what is or isn't possible with WAL).  Provide a way to
 restart the master so that it rolls back the WAL changes that the slave
 hasn't seen.


WAL records in PostgreSQL can only be used for physical redo. They can not
be used for undo. So what you're suggesting is not possible though I am
sure a few other databases do that.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Heikki Linnakangas

On 14.06.2013 14:06, Pavan Deolasee wrote:

On Fri, Jun 14, 2013 at 4:12 PM, Heikki Linnakangashlinnakan...@vmware.com

wrote:



  Robert Haas pointed out in that thread that it has a problem with hint
bits that are not WAL-logged,


I liked that tool a lot until Robert pointed out the above problem. I
thought this is a show stopper because I can't really see any way to
circumvent it unless we enable checksums or explicitly WAL log hint bits.


but it will still work if you also enable the new checksums feature, which
forces hint bit updates to be WAL-logged.


Are we expecting a lot of people to run their clusters with checksums on ?
Sorry, I haven't followed the checksum discussions and don't know how much
overhead it causes. But if the general expectation is that checksums will
be turned on most often, I agree pg_rewind is probably good enough.


Well, time will tell I guess. The biggest overhead with the checksums is 
exactly the WAL-logging of hint bits.



Perhaps we could add a GUC to enable hint bits to be WAL-logged,
regardless of checksums, to make pg_rewind work.


Wouldn't that be too costly ? I mean, in the worst case every hint bit on a
page may get updated separately. If each such update is WAL logged, we are
looking for a lot more unnecessary WAL traffic.


Yep, same as with checksums. I was not very enthusiastic about the 
checksums patch because of that, but a lot of people are willing to pay 
that price. Maybe we can figure out a way to reduce that cost in 9.4. 
It'd benefit the checksums greatly.


For pg_rewind, we wouldn't actually need a full-page image for hint bit 
updates, just a small record saying "hey, I touched this page". And 
you'd only need to write that the first time a page is touched after a 
checkpoint.



I think that's a more flexible approach to solve this problem. It doesn't
require an online feedback loop from the standby to master, for starters.


I agree. That's a big advantage of pg_rewind. Unfortunately, it can't work
with 9.3 and below because of the hint bits issue, otherwise it would have
been even more cool.


The proposed patch is clearly not 9.3 material either. If anything, 
there's a much better chance that we could still sneak in a GUC to allow 
hint bits to be WAL-logged without checksums in 9.3. All the code is 
there, it'd just be a new GUC to control it separately from checksums.


- Heikki




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Greg Stark
On Fri, Jun 14, 2013 at 12:20 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 For pg_rewind, we wouldn't actually need a full-page image for hint bit
 updates, just a small record saying "hey, I touched this page". And you'd
 only need to write that the first time a page is touched after a checkpoint.

I would expect that to be about the same cost though. The latency for
the fsync on the wal record before being able to flush the buffer is
the biggest cost.


 The proposed patch is clearly not 9.3 material either. If anything, there's
 a much better chance that we could still sneak in a GUC to allow hint bits
 to be WAL-logged without checksums in 9.3. All the code is there, it'd just
 be a new GUC to control it separately from checksums.

On the other hand if you're going to wal log the hint bits why not
enable checksums?

Do we allow turning off checksums after a database is initdb'd? IIRC
we can't turn it on later but I don't see why we couldn't turn them
off.


-- 
greg




Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Marko Kreen
On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com wrote:
 Attached is a patch implementing a new aggregate function md5_agg() to
 compute the aggregate MD5 sum across a number of rows. This is
 something I've wished for a number of times. I think the primary use
 case is to do a quick check that 2 tables, possibly on different
 servers, contain the same data, using a query like

   SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

 or

   SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

 these would be equivalent to

   SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo;

 but without the excessive memory consumption for the intermediate
 concatenated string, and the resulting 1GB table size limit.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.
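
For example, something along these lines (a sketch only, not from the patch;
it folds the first 16 hex digits of each per-row md5 into a bigint, and sum()
over bigint returns numeric, so it cannot overflow):

  SELECT sum(('x' || substr(md5(foo::text), 1, 16))::bit(64)::bigint) FROM foo;

XORing the per-row values instead, as suggested downthread, would need a small
custom aggregate.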

-- 
marko




[HACKERS] Issue with PGC_BACKEND parameters

2013-06-14 Thread Amit Kapila
I had observed one problem with PGC_BACKEND parameters while testing patch
for ALTER SYSTEM command.

Problem statement: If I change PGC_BACKEND parameters directly in
postgresql.conf and then do pg_reload_conf() and reconnect, it will 
   still show the old value.
Detailed steps
1. Start server with default settings
2. Connect Client
3. show log_connections; -- it will show as off, this is correct.
4. Change log_connections in postgresql.conf to on
5. issue command select pg_reload_conf() in client (which is started in
step-2)
6. Connect a new client
7. show log_connections; -- it will show as off, this is incorrect.

The problem is in step-7, it should show as on.

This problem occurs only on Windows.
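
For reference, the same report as a bare psql sequence (a hypothetical
session; it assumes log_connections is flipped from off to on in
postgresql.conf between the two SHOW commands):

show log_connections;      -- 'off'
select pg_reload_conf();   -- run in the already-connected session
-- open a NEW connection, then:
show log_connections;      -- still 'off' on Windows, per the report above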

The reason for this problem is that on Windows, when a new session is
started, it loads the changed parameters into the new backend via the
global/config_exec_params file. The flow is
SubPostmasterMain() -> read_nondefault_variables() -> set_config_option().

In the code below, in function set_config_option(), changing a PGC_BACKEND
variable is not allowed, and the comments even mention that only the
postmaster is allowed to change it and that the change will propagate to
subsequently started backends, but this is not TRUE for Windows.

switch (record->context)
{
..
..
case PGC_BACKEND:
	if (context == PGC_SIGHUP)
	{
		/*
		 * If a PGC_BACKEND parameter is changed in the config file,
		 * we want to accept the new value in the postmaster (whence
		 * it will propagate to subsequently-started backends), but
		 * ignore it in existing backends.  This is a tad klugy, but
		 * necessary because we don't re-read the config file during
		 * backend start.
		 */
		if (IsUnderPostmaster)
			return -1;
	}

}


I think that to fix the issue we need to pass into set_config_option()
the information about whether a PGC_BACKEND parameter is allowed to change.
One way is to pass a new parameter.

Kindly let me know your suggestions. 

With Regards,
Amit Kapila.





Re: [HACKERS] SPGist triple parity concept doesn't work

2013-06-14 Thread Teodor Sigaev



Anyway I now think that we might be better off with the other idea of
abandoning an insertion and retrying if we get a lock conflict.


done, look at the patch.

I was faced with the fact that my mail is considered spam by postgresql.org, so 
I repeat some thoughts from the previous mail:


I considered the idea of forbidding placement of a child on the same page as its 
parent, but this implementation a) could significantly increase the size of the 
index, and b) doesn't solve Greg's point.


We definitely need a new idea for the locking protocol and I'll return to this 
problem in autumn (sorry, I haven't time in summer to do this research).


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


spgist_deadlock-1.patch.gz
Description: Unix tar archive



Re: [HACKERS] [PATCH] Add transforms feature

2013-06-14 Thread Peter Eisentraut
On 6/14/13 3:46 AM, Cédric Villemain wrote:
 You kept PGXS style makefile...

I know, but that's a separate issue that hasn't been decided yet.




Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib

2013-06-14 Thread Peter Eisentraut
On 6/13/13 9:20 PM, amul sul wrote:
 Agreed, only if we consider that these contrib modules are always going to be 
 deployed with PostgreSQL.
 But what if a user is going to install such a module elsewhere, i.e. not from 
 the contrib directory of the pg source?

Why would anyone do that?




Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib

2013-06-14 Thread Amit Langote
On Fri, Jun 14, 2013 at 9:35 PM, Peter Eisentraut pete...@gmx.net wrote:
 On 6/13/13 9:20 PM, amul sul wrote:
 Agreed, only if we consider that these contrib modules are always going to be 
 deployed with PostgreSQL.
 But what if a user is going to install such a module elsewhere, i.e. not from 
 the contrib directory of the pg source?

 Why would anyone do that?

Is he perhaps saying install such a module *from* elsewhere? Like
directly from the source directory of a module, using something like the
following:

cd /path/to/module-source
make USE_PGXS=1 PG_CONFIG=/path/to/pg_config
make USE_PGXS=1 PG_CONFIG=/path/to/pg_config install

When the user does not work with the pg source directly and does not have
postgresql-contrib installed?
Am I missing something here?

--
Amit Langote




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 Well, time will tell I guess. The biggest overhead with the checksums is 
 exactly the WAL-logging of hint bits.

Refresh my memory as to why we need to WAL-log hints for checksumming?
I just had my nose in the part of the checksum patch that tediously
copies entire pages out of shared buffers to avoid possible instability
of the hint bits while we checksum and write the page.  Given that we're
paying that cost, I don't see why we'd need to do any extra WAL-logging
(above and beyond the log-when-freeze cost that we have to pay already).
But I've not absorbed any caffeine yet today, so maybe I'm just missing
it.

regards, tom lane




Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com 
 wrote:
 Attached is a patch implementing a new aggregate function md5_agg() to
 compute the aggregate MD5 sum across a number of rows.

 It's more efficient to calculate per-row md5, and then sum() them.
 This avoids the need for ORDER BY.

Good point.  The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.
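
That xor variant could be had with a tiny custom aggregate, for example (a
sketch only; bit_xor_agg is a made-up name, int8xor is the function behind
the # operator):

  CREATE AGGREGATE bit_xor_agg (bigint)
    (sfunc = int8xor, stype = bigint, initcond = '0');

  SELECT bit_xor_agg(('x' || substr(md5(foo::text), 1, 16))::bit(64)::bigint)
  FROM foo;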

regards, tom lane




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Amit Kapila
On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
 Hello,

 We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:
 

http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb
jgwrfu513...@mail.gmail.com
 
 Let me again summarize the problem we are trying to address.
 
 When the master fails, last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after  flushing WAL to the local storage.  So master contains some file
system level changes that standby does not have.  At this point, the data
directory of  master is ahead of standby's data directory.
 Subsequently, the standby will be promoted as new master.  Later when the
old master wants to be a standby of the new master, it can't just join the 
 setup since there is inconsistency in between these two servers. We need
to take the fresh backup from the new master.  This can happen in both the 
 synchronous as well as asynchronous replication.
 
 Fresh backup is also needed in case of clean switch-over because in the
current HEAD, the master does not wait for the standby to receive all the
WAL 
 up to the shutdown checkpoint record before shutting down the connection.
Fujii Masao has already submitted a patch to handle clean switch-over case, 
 but the problem is still remaining for failback case.
 
 The process of taking fresh backup is very time consuming when databases
are of very big sizes, say several TB's, and when the servers are connected 
 over a relatively slower link.  This would break the service level
agreement of disaster recovery system.  So there is need to improve the
process of 
 disaster recovery in PostgreSQL.  One way to achieve this is to maintain
consistency between master and standby which helps to avoid need of fresh 
 backup.
 
 So our proposal on this problem is that we must ensure that master should
not make any file system level changes without confirming that the 
 corresponding WAL record is replicated to the standby.
 
How will you take care of the extra WAL on the old master during recovery? If it
replays WAL which has not reached the new master, it can be a problem.

With Regards,
Amit Kapila.





Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Heikki Linnakangas

On 14.06.2013 16:08, Tom Lane wrote:

Heikki Linnakangashlinnakan...@vmware.com  writes:

Well, time will tell I guess. The biggest overhead with the checksums is
exactly the WAL-logging of hint bits.


Refresh my memory as to why we need to WAL-log hints for checksumming?


Torn pages:

1. Backend sets a hint bit, dirtying the buffer.
2. Checksum is calculated, and buffer is written out to disk.
3. crash

If the page is torn, the checksum won't match. Without checksums, a torn 
page is not a problem with hint bits, as a single bit can't be torn and 
the page is otherwise intact. But with checksums, it causes a checksum 
failure.


- Heikki




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 14.06.2013 16:08, Tom Lane wrote:
 Refresh my memory as to why we need to WAL-log hints for checksumming?

 Torn pages:

So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?

regards, tom lane




Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Benedikt Grundmann
On Fri, Jun 14, 2013 at 2:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Marko Kreen mark...@gmail.com writes:
  On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com
 wrote:
  Attached is a patch implementing a new aggregate function md5_agg() to
  compute the aggregate MD5 sum across a number of rows.

  It's more efficient to calculate per-row md5, and then sum() them.
  This avoids the need for ORDER BY.

 Good point.  The aggregate md5 function also fails to distinguish the
 case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
 from the case where they contain 'xyz' followed by 'zyxyz'.

 Now, as against that, you lose any sensitivity to the ordering of the
 values.

 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.

 regards, tom lane


xor works but only if each row is different (e.g. at the very least all
columns together make a unique key).








Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Andres Freund
On 2013-06-14 09:08:15 -0400, Tom Lane wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
  Well, time will tell I guess. The biggest overhead with the checksums is 
  exactly the WAL-logging of hint bits.
 
 Refresh my memory as to why we need to WAL-log hints for checksumming?
 I just had my nose in the part of the checksum patch that tediously
 copies entire pages out of shared buffers to avoid possible instability
 of the hint bits while we checksum and write the page.

I am really rather uncomfortable with that piece of code, and I hacked
it up after Jeff Janes had reported a bug there (the one aborting WAL
replay too early...). So I am very happy that you are looking at it.

Jeff Davis and I were talking about whether the usage of
PGXAC-delayChkpt makes the whole thing sufficiently safe at pgcon - we
couldn't find any real danger but...

 Given that we're
 paying that cost, I don't see why we'd need to do any extra WAL-logging
 (above and beyond the log-when-freeze cost that we have to pay already).
 But I've not absorbed any caffeine yet today, so maybe I'm just missing
 it.

The usual torn page spiel I think. If we crash while only one half of
the page made it to disk we would get spurious checksum failures from
there on.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib

2013-06-14 Thread Andrew Dunstan


On 06/14/2013 08:35 AM, Peter Eisentraut wrote:

On 6/13/13 9:20 PM, amul sul wrote:

Agreed, only if we consider that these contrib modules are always going to be 
deployed with PostgreSQL.
But what if a user is going to install such a module elsewhere, i.e. not from 
the contrib directory of the pg source?

Why would anyone do that?




Maybe they wouldn't.

I do think we need to make sure that we have at least buildfarm coverage 
of pgxs module building and testing. I have some coverage of a few 
extensions I have written, which exercise that, so maybe that will 
suffice. If not, maybe we need to have one module that only builds via 
pgxs and is built after an install (i.e. not via the standard contrib 
build).


I don't really like the directory layout we use for these modules 
anyway, so I'm not sure they constitute best practice for extension 
builders. Lately I have been using an extension skeleton that looks 
something like this:


License
Readme.md
META.json (for pgxn)
extension.control
Makefile
doc/extension.md (soft linked to ../Readme.md)
src/extension.c
sql/extension.sql
test/sql/extension.sql
test/expected/extension.out


cheers

andrew





Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Heikki Linnakangas

On 14.06.2013 16:21, Tom Lane wrote:

Heikki Linnakangashlinnakan...@vmware.com  writes:

On 14.06.2013 16:08, Tom Lane wrote:

Refresh my memory as to why we need to WAL-log hints for checksumming?



Torn pages:


So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?


Correct. We're doing the latter, see XLogSaveBufferForHint().

- Heikki




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Andres Freund
On 2013-06-14 09:21:52 -0400, Tom Lane wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
  On 14.06.2013 16:08, Tom Lane wrote:
  Refresh my memory as to why we need to WAL-log hints for checksumming?
 
  Torn pages:
 
 So it's not that we actually need to log the individual hint bit
 changes, it's that we need to WAL-log a full page image on the first
 update after a checkpoint, so as to recover from torn-page cases.
 Which one are we doing?

MarkBufferDirtyHint() logs an FPI (just not via a BKP block) via
XLogSaveBufferForHint() iff XLogCheckBuffer() says we need to by
comparing GetRedoRecPtr() with the page's lsn.
Otherwise we don't do anything besides marking the buffer dirty.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Marko Kreen mark...@gmail.com writes:
  On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com 
  wrote:
  Attached is a patch implementing a new aggregate function md5_agg() to
  compute the aggregate MD5 sum across a number of rows.
 
  It's more efficient to calculate per-row md5, and then sum() them.
  This avoids the need for ORDER BY.
 
 Good point.  The aggregate md5 function also fails to distinguish the
 case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
 from the case where they contain 'xyz' followed by 'zyxyz'.
 
 Now, as against that, you lose any sensitivity to the ordering of the
 values.
 
 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.

Where I'd take this is actually in a completely different direction..
I'd like the aggregate to be able to match the results of running the
'md5sum' unix utility on a file that's been COPY'd out.  Yes, that means
we'd need a way to get back "what would this row look like if it was
sent through COPY with these parameters", but I've long wanted that
also.

No, no clue about how to put all that together.  Yes, having this would
be better than nothing, so I'm still for adding this even if we can't
make it match COPY output. :)

Thanks,

Stephen




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Heikki Linnakangas

On 14.06.2013 16:15, Andres Freund wrote:

On 2013-06-14 09:08:15 -0400, Tom Lane wrote:

I just had my nose in the part of the checksum patch that tediously
copies entire pages out of shared buffers to avoid possible instability
of the hint bits while we checksum and write the page.


I am really rather uncomfortable with that piece of code, and I hacked
it up after Jeff Janes had reported a bug there (The one aborting WAL
replay to early...). So I am very happy that you are looking at it.


Hmm. In XLogSaveBufferForHint():


 * Note that this only works for buffers that fit the standard page model,
 * i.e. those for which buffer_std == true


The free-space-map uses non-standard pages, and MarkBufferDirtyHint(). 
Isn't that completely broken for the FSM? If I'm reading it correctly, 
what will happen is that replay will completely zero out all FSM pages 
that have been touched. All the FSM data is between pd_lower and 
pd_upper, which on standard pages is the hole.


- Heikki




Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Andrew Dunstan


On 06/14/2013 09:40 AM, Stephen Frost wrote:

* Tom Lane (t...@sss.pgh.pa.us) wrote:

Marko Kreen mark...@gmail.com writes:

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

Good point.  The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

Where I'd take this is actually in a completely different direction..
I'd like the aggregate to be able to match the results of running the
'md5sum' unix utility on a file that's been COPY'd out.  Yes, that means
we'd need a way to get back what would this row look like if it was
sent through COPY with these parameters, but I've long wanted that
also.

No, no clue about how to put all that together.  Yes, having this would
be better than nothing, so I'm still for adding this even if we can't
make it match COPY output. :)






I'd rather go the other way, processing the records without having to 
process them otherwise at all. Turning things into text must slow things 
down, surely.


cheers

andrew




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Andres Freund
On 2013-06-14 09:21:52 -0400, Tom Lane wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
  On 14.06.2013 16:08, Tom Lane wrote:
  Refresh my memory as to why we need to WAL-log hints for checksumming?
 
  Torn pages:
 
 So it's not that we actually need to log the individual hint bit
 changes, it's that we need to WAL-log a full page image on the first
 update after a checkpoint, so as to recover from torn-page cases.
 Which one are we doing?


From quickly looking at the code again I think the MarkBufferDirtyHint()
code makes at least one assumption that isn't correct in the fact of
checksums.

It tests for the need to dirty the page with:
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))

*before* taking a lock. A comment explains why that is safe:

 * Since we make this test unlocked, there's a chance we
 * might fail to notice that the flags have just been cleared, and 
failed
 * to reset them, due to memory-ordering issues.

That's fine for the classical usecase without checksums but what about
the following scenario:

1) page is dirtied, FPI is logged
2) SetHintBits gets called on the same page, holding only a share lock
3) checkpointer/bgwriter/... writes out the the page, clearing the dirty
   flag
4) checkpoint finishes, updates redo ptr
5) SetHintBits actually modifies the hint bits
6) SetHintBits calls MarkBufferDirtyHint which doesn't notice that the
   page isn't dirty anymore and thus doesn't check whether something
   needs to get logged.

At this point we have a page that has been modified without an FPI. But
it's not marked dirty, so it won't be written out without further
cause. Which might be fine since there's no cause to write out the page
and there probably won't be anyone doing that without logging an FPI
independently.
Can anybody see a scenario where this is actually dangerous?

Since

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Heikki Linnakangas

On 14.06.2013 17:01, Andres Freund wrote:

At this point we have a page that has been modified without an FPI. But
it's not marked dirty, so it won't be written out without further
cause. Which might be fine since there's no cause to write out the page
and there probably won't be anyone doing that without logging an FPI
independently.
Can anybody see a scenario where this is actually dangerous?


The code also relies on that being safe during recovery:


 * If we're in recovery we cannot dirty a page because 
of a hint.
 * We can set the hint, just not dirty the page as a 
result so the
 * hint is lost when we evict the page or shutdown.
 *
 * See src/backend/storage/page/README for longer 
discussion.
 */
if (RecoveryInProgress())
return;


I can't immediately see a problem with that.

- Heikki




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Andres Freund
On 2013-06-14 16:58:38 +0300, Heikki Linnakangas wrote:
 On 14.06.2013 16:15, Andres Freund wrote:
 On 2013-06-14 09:08:15 -0400, Tom Lane wrote:
 I just had my nose in the part of the checksum patch that tediously
 copies entire pages out of shared buffers to avoid possible instability
 of the hint bits while we checksum and write the page.
 
 I am really rather uncomfortable with that piece of code, and I hacked
 it up after Jeff Janes had reported a bug there (The one aborting WAL
 replay to early...). So I am very happy that you are looking at it.
 
 Hmm. In XLogSaveBufferForHint():
 
  * Note that this only works for buffers that fit the standard page model,
  * i.e. those for which buffer_std == true
 
 The free-space-map uses non-standard pages, and MarkBufferDirtyHint(). Isn't
 that completely broken for the FSM? If I'm reading it correctly, what will
 happen is that replay will completely zero out all FSM pages that have been
 touched. All the FSM data is between pd_lower and pd_upper, which on
 standard pages is the hole.

Jeff Davis has a patch pending
(1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
were on track committing that, but rereading the thread it doesn't look
that way.

Jeff, care to update that patch?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Greg Smith

On 6/12/13 3:19 AM, Fabien COELHO wrote:

If you are still worried: if you run the very same command without
throttling and measure the same latency, does the same thing happens at
the end? My guess is that it should be yes. If it is no, I'll try out
pgbench-tools.


It looks like it happens rarely for one client without the rate limit, 
but that increases to every time for multiple clients with limiting in 
place.  pgbench-tools just graphs the output from the latency log. 
Here's a setup that runs the test I'm doing:


$ createdb pgbench
$ pgbench -i -s 10 pgbench
$ pgbench -S -c 25 -T 30 -l pgbench && tail -n 40 pgbench_log*

Sometimes there's no slow entries. but I've seen this once so far:

0 21822 1801 0 1371217462 945264
1 21483 1796 0 1371217462 945300
8 20891 1931 0 1371217462 945335
14 20520 2084 0 1371217462 945374
15 20517 1991 0 1371217462 945410
16 20393 1928 0 1371217462 945444
17 20183 2000 0 1371217462 945479
18 20277 2209 0 1371217462 945514
23 20316 2114 0 1371217462 945549
22 20267 250128 0 1371217463 193656

The third column is the latency for that transaction.  Notice how it's a 
steady ~2000 us except for the very last transaction, which takes 
250,128 us.  That's the weird thing; these short SELECT statements 
should never take that long.  It suggests there's something weird 
happening with how the client exits, probably that its latency number is 
being computed after more work than it should.


Here's what a rate limited run looks like for me.  Note that I'm still 
using the version I re-submitted since that's where I ran into this 
issue, I haven't merged your changes to split the rate among each client 
here--which means this is 400 TPS per client == 10000 TPS total:


$ pgbench -S -c 25 -T 30 -R 400 -l pgbench && tail -n 40 pgbench_log

7 12049 2070 0 1371217859 195994
22 12064 2228 0 1371217859 196115
18 11957 1570 0 1371217859 196243
23 12130 989 0 1371217859 196374
8 11922 1598 0 1371217859 196646
11 12229 4833 0 1371217859 196702
21 11981 1943 0 1371217859 196754
20 11930 1026 0 1371217859 196799
14 11990 13119 0 1371217859 208014
^^^ fast section
vvv delayed section
1 11982 91926 0 1371217859 287862
2 12033 116601 0 1371217859 308644
6 12195 115957 0 1371217859 308735
17 12130 114375 0 1371217859 308776
0 12026 115507 0 1371217859 308822
3 11948 118228 0 1371217859 308859
4 12061 113484 0 1371217859 308897
5 12110 113586 0 1371217859 308933
9 12032 117744 0 1371217859 308969
10 12045 114626 0 1371217859 308989
12 11953 113372 0 1371217859 309030
13 11883 114405 0 1371217859 309066
15 12018 116069 0 1371217859 309101
16 11890 115727 0 1371217859 309137
19 12140 114006 0 1371217859 309177
24 11884 115782 0 1371217859 309212

There's almost 90,000 usec of latency showing up between epoch time 
1371217859.208014 and 1371217859.287862 here.  What's weird about it is 
that the longer the test runs, the larger the gap is.  If collecting the 
latency data itself caused the problem, that would make sense, so maybe 
this is related to flushing that out to disk.


If you want to look just at the latency numbers without the other 
columns in the way you can use:


cat pgbench_log.* | awk {'print $3'}

That is how I was evaluating the smoothness of the rate limit, by 
graphing those latency values.  pgbench-tools takes those and a derived 
TPS/s number and plots them, which made it easier for me to spot this 
weirdness.  But I've already moved onto analyzing the raw latency data 
instead, I can see the issue without the graph once I've duplicated the 
conditions.
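
One more way to poke at this without pgbench-tools is to pull the
per-transaction log into a table and query it.  A rough sketch, assuming the
plain per-transaction log layout shown above (6 columns) and a made-up file
path:

create temp table tx_log (client int, tx_no int, latency_us bigint,
                          file_no int, epoch_sec bigint, epoch_usec bigint);
copy tx_log from '/tmp/pgbench_log.12345' with (format text, delimiter ' ');

-- the handful of slowest transactions and when they finished
select epoch_sec, epoch_usec, client, latency_us
  from tx_log order by latency_us desc limit 10;

(copy from a server-side file needs superuser; \copy from psql works the same.)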


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Stephen Frost
* Andrew Dunstan (and...@dunslane.net) wrote:
 I'd rather go the other way, processing the records without having
 to process them otherwise at all. Turning things into text must slow
 things down, surely.

That's certainly an interesting idea also..

Thanks,

Stephen




Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Dean Rasheed
On 14 June 2013 14:14, Tom Lane t...@sss.pgh.pa.us wrote:
 Marko Kreen mark...@gmail.com writes:
 On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com 
 wrote:
 Attached is a patch implementing a new aggregate function md5_agg() to
 compute the aggregate MD5 sum across a number of rows.

 It's more efficient to calculate per-row md5, and then sum() them.
 This avoids the need for ORDER BY.

 Good point.  The aggregate md5 function also fails to distinguish the
 case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
 from the case where they contain 'xyz' followed by 'zyxyz'.


Well, if you aggregated foo.*::text as in my original example, then
the textual representation of the row would protect you from that. But
yes, if you were just doing it with a single text column that might be
a risk.
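
Just to spell that out with a contrived example: the plain concatenation
really is ambiguous, while the row's text form keeps the column boundary
visible (values invented purely for illustration):

select md5('xyzzy' || 'xyz') = md5('xyz' || 'zyxyz');   -- true
select row('xyzzy','xyz')::text, row('xyz','zyxyz')::text;
-- (xyzzy,xyz) vs (xyz,zyxyz), so their md5s differ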


 Now, as against that, you lose any sensitivity to the ordering of the
 values.

 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.


But this would be a much riskier thing to do with a single column,
because if you updated multiple rows in the same way (e.g., UPDATE t
SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
there were an even number of matches.
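
A contrived demonstration of that cancellation, using bit strings to do the
128-bit xor (the table and update are hypothetical; the point is just that
two identical digests wipe each other out):

select ('x' || md5('foo'))::bit(128) # ('x' || md5('foo'))::bit(128);
-- all zero bits: any even number of identically-updated rows would
-- simply vanish from an xor-based aggregate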

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add visibility map information to pg_freespace.

2013-06-14 Thread Alvaro Herrera
Kyotaro HORIGUCHI wrote:
 Hello,
 
 I've added visibility map information to pg_freespace for my
 utility.

This makes sense to me.  I only lament the fact that this makes the
module a misnomer.  Do we want to 1) rename the module (how
inconvenient), 2) create a separate module for this (surely not
warranted), or 3) accept it and move on?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Buildfarm client 4.11 released

2013-06-14 Thread Andrew Dunstan


Version 4.11 of the PostgreSQL Buildfarm client has been released. It 
can be downloaded from 
http://www.pgbuildfarm.org/downloads/releases/build-farm-4_11.tgz


Changes since 4.10:

 * Turn down module cleanup verbosity
 * Add check for rogue postmasters.
 * Add pseudo-branch targets HEAD_PLUS_LATEST and HEAD_PLUS_LATEST2.
 * Use Digest::SHA instead of Digest::SHA1.
 * Make directory handling more robust in git code.
 * Move web transaction into a module procedure.
 * Switch to using the porcelain format of git status.
 * Provide parameter for core file patterns.
 * Use a command file for gdb instead of the -ex option


The web transaction and Digest::SHA changes have allowed the removal of 
a couple of long-standing uglinesses on the system. In almost all cases, 
the config parameter aux_path and the separate run_web_transaction.pl 
script are now redundant (the exception is older Msys systems).


Enjoy

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add visibility map information to pg_freespace.

2013-06-14 Thread Andres Freund
On 2013-06-14 10:22:19 -0400, Alvaro Herrera wrote:
 Kyotaro HORIGUCHI wrote:
  Hello,
  
  I've added visibility map information to pg_freespace for my
  utility.
 
 This makes sense to me.

+1

 I only lament the fact that this makes the
 module a misnomer.  Do we want to 1) rename the module (how
 inconvenient), 2) create a separate module for this (surely not
 warranted), or 3) accept it and move on?

3). All the others seem to inflict unnecessary pain for not all that
much gain.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Greg Stark
On Fri, Jun 14, 2013 at 2:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 So it's not that we actually need to log the individual hint bit
 changes, it's that we need to WAL-log a full page image on the first
 update after a checkpoint, so as to recover from torn-page cases.
 Which one are we doing?

Wal logging a full page image after a checkpoint wouldn't actually be
enough since subsequent hint bits will dirty the page and not wal log
anything, creating a new torn page risk. FPI are only useful if all the
subsequent updates are wal logged.




-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Fri, Jun 14, 2013 at 2:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 So it's not that we actually need to log the individual hint bit
 changes, it's that we need to WAL-log a full page image on the first
 update after a checkpoint, so as to recover from torn-page cases.
 Which one are we doing?

 Wal logging a full page image after a checkpoint wouldn't actually be
 enough since subsequent hint bits will dirty the page and not wal log
 anything creating a new torn page risk. FPI are only useful if all the
 subsequent updates are wal logged.

No, there's no new torn page risk, because any crash recovery would
replay starting from the checkpoint.  You might lose the
subsequently-set hint bits, but that's okay.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Tom Lane
Dean Rasheed dean.a.rash...@gmail.com writes:
 On 14 June 2013 14:14, Tom Lane t...@sss.pgh.pa.us wrote:
 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.

 But this would be a much riskier thing to do with a single column,
 because if you updated multiple rows in the same way (e.g., UPDATE t
 SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
 there were an even number of matches.

I was implicitly thinking that the sum would be a modulo sum so that the
final result is still the size of an md5 signature.  If that's true,
then leaking bits via carry out is just as bad as xor's deficiencies.
Now, you could certainly make it a non-modulo sum and not lose any
information to carries, if you're willing to do the arithmetic in
NUMERIC and have a variable-width result.  Sounds a bit slow though.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Dean Rasheed
On 14 June 2013 15:19, Stephen Frost sfr...@snowman.net wrote:
 * Andrew Dunstan (and...@dunslane.net) wrote:
 I'd rather go the other way, processing the records without having
 to process them otherwise at all. Turning things into text must slow
 things down, surely.

 That's certainly an interesting idea also..


md5_agg(record) ?

Yes, I like it.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Andres Freund
On 2013-06-14 15:49:31 +0100, Dean Rasheed wrote:
 On 14 June 2013 15:19, Stephen Frost sfr...@snowman.net wrote:
  * Andrew Dunstan (and...@dunslane.net) wrote:
  I'd rather go the other way, processing the records without having
  to process them otherwise at all. Turning things into text must slow
  things down, surely.
 
  That's certainly an interesting idea also..
 
 
 md5_agg(record) ?
 
 Yes, I like it.

It's more complex than just memcmp()ing HeapTupleData though. At least
if the Datum contains varlena columns there are so many different
representations (short, long, compressed, external, external compressed)
of the same data that an md5 without normalizing them wouldn't be very
interesting.
So you would at least need a normalizing version of
toast_flatten_tuple() that also deals with short/long varlenas. But even
after that, you would still need to deal with Datums that can have
different representation (like short numerics, old style hstore, ...).

It might be more realistic to use the binary output functions, but I am
not sure whether all of those are sufficiently reproducible.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Hannu Krosing
On 06/14/2013 04:47 PM, Tom Lane wrote:
 Dean Rasheed dean.a.rash...@gmail.com writes:
 On 14 June 2013 14:14, Tom Lane t...@sss.pgh.pa.us wrote:
 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.
 But this would be a much riskier thing to do with a single column,
 because if you updated multiple rows in the same way (e.g., UPDATE t
 SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
 there were an even number of matches.
 I was implicitly thinking that the sum would be a modulo sum so that the
 final result is still the size of an md5 signature.  If that's true,
 then leaking bits via carry out is just as bad as xor's deficiencies.
 Now, you could certainly make it a non-modulo sum and not lose any
 information to carries, if you're willing to do the arithmetic in
 NUMERIC and have a variable-width result.  Sounds a bit slow though.
What skytools/pgq/londiste uses for comparing tables on master
and slave is a query like this

select sum(hashtext(t.*::text)) from yourtable t;

This is non-modulo sum and does not use md5 but relies on
whatever the hashtext() du jour is :)

So it is not comparable to anything external (like the md5sum
compatible idea above) but is usually good enough for fast
checks of compatible tables.

As tables are unordered by definition anyway, this should be
good enough for most SQL.

The speed comes from both fast(er) hashtext() function and
avoiding the sort.
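
As a usage note, the same idea also localizes differences cheaply if you
bucket the sum before comparing; something like the following, assuming an
integer key column id and 64 buckets picked arbitrarily:

select id % 64 as bucket, sum(hashtext(t.*::text)) as chk
  from yourtable t
 group by 1 order by 1;

Run that on both nodes and only the buckets whose chk differs need a
row-by-row look.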

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-14 Thread Jeff Davis
On Thu, 2013-06-13 at 20:09 -0400, Tom Lane wrote:
 What I propose we do about this is reduce backend/storage/page/checksum.c
 to something like
 
 #include "postgres.h"
 #include "storage/checksum.h"
 #include "storage/checksum_impl.h"
 
 moving all the code currently in the file into a new .h file.  Then,
 any external programs such as pg_filedump can use the checksum code
 by including checksum_impl.h.  This is essentially the same thing we
 did with the CRC support functionality some time ago.

Thank you for taking care of that. After seeing that it needed to be in
a header file, I was going to try doing it all as macros.

I have a question about the commit though: shouldn't both functions be
static if they are in a .h file? Otherwise, it could lead to naming
conflicts. I suppose it's wrong to include the implementation file
twice, but it still might be confusing if someone tries. Two ideas that
come to mind are:
  * make both static and then have a trivial wrapper in checksum.c
  * export one or both functions, but use #ifndef CHECKSUM_IMPL_H to
prevent redefinition

 Also, we have the cut-point between checksum.c and bufpage.c at the
 wrong place.  IMO we should move PageCalcChecksum16 in toto into
 checksum.c (or really now into checksum_impl.h), because that and not
 just checksum_block() is the functionality that is wanted.

Agreed.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-14 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 I have a question about the commit though: shouldn't both functions be
 static if they are in a .h file? Otherwise, it could lead to naming
 conflicts. I suppose it's wrong to include the implementation file
 twice, but it still might be confusing if someone tries. Two ideas that
 come to mind are:
   * make both static and then have a trivial wrapper in checksum.c
   * export one or both functions, but use #ifndef CHECKSUM_IMPL_H to
 prevent redefinition

Ah, you are right, I forgot the #ifndef CHECKSUM_IMPL_H dance.  Will fix
in a bit.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] refresh materialized view concurrently

2013-06-14 Thread Kevin Grittner
Attached is a patch for REFRESH MATERIALIZED VIEW CONCURRENTLY for
9.4 CF1.  The goal of this patch is to allow a refresh without
interfering with concurrent reads, using transactional semantics.

It is my hope to get this committed during this CF to allow me to
focus on incremental maintenance for the rest of the release cycle.

I didn't need to touch very much outside of matview-specific files
for this.  My biggest concern is that I needed two small functions
which did *exactly* what some static functions in ri_triggers.c
were doing and couldn't see where the best place to share them from
was.  For the moment I just duplicated them, but my hope would be
that they could be put in a suitable location and called from both
places, rather than duplicating the 30-some lines of code.  The
function signatures are:

void quoteOneName(char *buffer, const char *name)
void quoteRelationName(char *buffer, Relation rel)

Comments in the patch describe the technique used for the
transactional refresh, but I'm not sure how easy it is to
understand the technique from the comments.  Here is a
demonstration of the basic technique, using a table to mock the
materialized view so it can be run directly.

---

--
-- Setup
--
drop table if exists n, nt, nd cascade;
drop table if exists nm;

create table n (id int not null primary key, val text);
insert into n values
  (1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'five'),
  (6, null), (7, null), (8, null), (9, null);
-- We use a table to mock this materialized view definition:
--   create materialized view nm as select * from n;
create table nm as select * from n;
insert into n values (10, 'ten'), (11, null);
update n set val = 'zwei' where id = 2;
update n set val = null where id = 3;
update n set id = 44, val = 'forty-four' where id = 4;
update n set val = 'seven' where id = 7;
delete from n where id = 5;
delete from n where id = 8;

vacuum analyze;

--
-- Sample of internal processing for REFRESH MV CONCURRENTLY.
--
begin;
create temp table nt as select * from n;
analyze nt;
create temp table nd as
  SELECT x.ctid as tid, y
    FROM nm x
    FULL JOIN n y ON (y.id OPERATOR(pg_catalog.=) x.id)
    WHERE (y.*) IS DISTINCT FROM (x.*)
    ORDER BY tid;
analyze nd;

delete from nm where ctid in
  (select tid from nd
    where tid is not null and y is not distinct from null);
update nm x set id = (d.y).id, val = (d.y).val from nd d
  where d.tid is not null and x.ctid = d.tid;
insert into nm select (y).* from nd where tid is null;
commit;

--
-- Check that results match.
--
select * from n order by id;
select * from nm order by id;

---

I also tried a million-row materialized view with the patch to see
what the performace was like on a large table with just a few
changes.  I was surprised that a small change-set like this was
actually faster than replacing the heap, at least on my machine.
Obviously, when a larger number of rows are affected the
transactional CONCURRENTLY option will be slower, and this is not
intended in any way as a performace-enhancing feature, that was
just a happy surprise in testing.

---

-- drop from previous test
drop table if exists testv cascade;

-- create and populate permanent table
create table testv (id int primary key, val text);
insert into testv
  select n, cash_words((floor(random() * 1) / 100)::text::money)
  from (select generate_series(1, 2000000, 2)) s(n);
update testv
  set val = NULL
  where id = 547345;

create materialized view matv as select * from testv;
create unique index matv_id on matv (id);
vacuum analyze matv;

delete from testv where id = 16405;
insert into testv
  values (393466, cash_words((floor(random() * 1) / 100)::text::money));
update testv
  set val = cash_words((floor(random() * 1) / 100)::text::money)
  where id = 1947141;

refresh materialized view concurrently matv;

---

People may be surprised to see this using SPI even more than
ri_triggers.c does.  I think this is the safest and most
maintainable approach, although I welcome alternative suggestions.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
*** a/doc/src/sgml/mvcc.sgml
--- b/doc/src/sgml/mvcc.sgml
***
*** 928,935  ERROR:  could not serialize access due to read/write dependencies among transact
  /para
  
  para
!  This lock mode is not automatically acquired on tables by any
!  productnamePostgreSQL/productname command.
  /para
 /listitem
/varlistentry
--- 928,934 
  /para
  
  para
!  Acquired by commandREFRESH MATERIALIZED VIEW CONCURRENTLY/command.
  /para
 /listitem
/varlistentry
*** 

Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-14 Thread Andres Freund
On 2013-06-14 11:59:04 -0400, Tom Lane wrote:
 Jeff Davis pg...@j-davis.com writes:
  I have a question about the commit though: shouldn't both functions be
  static if they are in a .h file? Otherwise, it could lead to naming
  conflicts. I suppose it's wrong to include the implementation file
  twice, but it still might be confusing if someone tries. Two ideas that
  come to mind are:
* make both static and then have a trivial wrapper in checksum.c
* export one or both functions, but use #ifndef CHECKSUM_IMPL_H to
  prevent redefinition
 
 Ah, you are right, I forgot the #ifndef CHECKSUM_IMPL_H dance.  Will fix
 in a bit.

That won't help against errors if it's included in two different
files/translation units though. I don't really see a valid case where it
could validly be included multiple times in one TU?
If anything we should #error in that case, but I am not sure it's worth
bothering.
E.g. in rmgrlist.h we have the following comment:
/* there is deliberately not an #ifndef RMGRLIST_H here */
and I think the reasoning behind that comment applies here as well.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Jeff Davis
On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote:
 Jeff Davis has a patch pending
 (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
 flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
 were on track committing that, but rereading the thread it doesn't look
 that way.
 
 Jeff, care to update that patch?

Rebased and attached. Changed so all callers use buffer_std=true except
those in freespace.c and fsmpage.c.

Simon, did you (or anyone else) have an objection to this patch? If not,
I'll go ahead and commit it tomorrow morning.

Regards,
Jeff Davis

*** a/src/backend/access/hash/hash.c
--- b/src/backend/access/hash/hash.c
***
*** 287,293  hashgettuple(PG_FUNCTION_ARGS)
  			/*
  			 * Since this can be redone later if needed, mark as a hint.
  			 */
! 			MarkBufferDirtyHint(buf);
  		}
  
  		/*
--- 287,293 
  			/*
  			 * Since this can be redone later if needed, mark as a hint.
  			 */
! 			MarkBufferDirtyHint(buf, true);
  		}
  
  		/*
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***
*** 262,268  heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
  		{
  			((PageHeader) page)-pd_prune_xid = prstate.new_prune_xid;
  			PageClearFull(page);
! 			MarkBufferDirtyHint(buffer);
  		}
  	}
  
--- 262,268 
  		{
  			((PageHeader) page)-pd_prune_xid = prstate.new_prune_xid;
  			PageClearFull(page);
! 			MarkBufferDirtyHint(buffer, true);
  		}
  	}
  
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***
*** 413,421  _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  	 * crucial. Be sure to mark the proper buffer dirty.
  	 */
  	if (nbuf != InvalidBuffer)
! 		MarkBufferDirtyHint(nbuf);
  	else
! 		MarkBufferDirtyHint(buf);
  }
  			}
  		}
--- 413,421 
  	 * crucial. Be sure to mark the proper buffer dirty.
  	 */
  	if (nbuf != InvalidBuffer)
! 		MarkBufferDirtyHint(nbuf, true);
  	else
! 		MarkBufferDirtyHint(buf, true);
  }
  			}
  		}
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***
*** 1052,1058  restart:
  opaque-btpo_cycleid == vstate-cycleid)
  			{
  opaque-btpo_cycleid = 0;
! MarkBufferDirtyHint(buf);
  			}
  		}
  
--- 1052,1058 
  opaque-btpo_cycleid == vstate-cycleid)
  			{
  opaque-btpo_cycleid = 0;
! MarkBufferDirtyHint(buf, true);
  			}
  		}
  
*** a/src/backend/access/nbtree/nbtutils.c
--- b/src/backend/access/nbtree/nbtutils.c
***
*** 1789,1795  _bt_killitems(IndexScanDesc scan, bool haveLock)
  	if (killedsomething)
  	{
  		opaque-btpo_flags |= BTP_HAS_GARBAGE;
! 		MarkBufferDirtyHint(so-currPos.buf);
  	}
  
  	if (!haveLock)
--- 1789,1795 
  	if (killedsomething)
  	{
  		opaque-btpo_flags |= BTP_HAS_GARBAGE;
! 		MarkBufferDirtyHint(so-currPos.buf, true);
  	}
  
  	if (!haveLock)
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 7681,7692  XLogRestorePoint(const char *rpName)
   * records. In that case, multiple copies of the same block would be recorded
   * in separate WAL records by different backends, though that is still OK from
   * a correctness perspective.
-  *
-  * Note that this only works for buffers that fit the standard page model,
-  * i.e. those for which buffer_std == true
   */
  XLogRecPtr
! XLogSaveBufferForHint(Buffer buffer)
  {
  	XLogRecPtr	recptr = InvalidXLogRecPtr;
  	XLogRecPtr	lsn;
--- 7681,7689 
   * records. In that case, multiple copies of the same block would be recorded
   * in separate WAL records by different backends, though that is still OK from
   * a correctness perspective.
   */
  XLogRecPtr
! XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
  {
  	XLogRecPtr	recptr = InvalidXLogRecPtr;
  	XLogRecPtr	lsn;
***
*** 7708,7714  XLogSaveBufferForHint(Buffer buffer)
  	 * and reset rdata for any actual WAL record insert.
  	 */
  	rdata[0].buffer = buffer;
! 	rdata[0].buffer_std = true;
  
  	/*
  	 * Check buffer while not holding an exclusive lock.
--- 7705,7711 
  	 * and reset rdata for any actual WAL record insert.
  	 */
  	rdata[0].buffer = buffer;
! 	rdata[0].buffer_std = buffer_std;
  
  	/*
  	 * Check buffer while not holding an exclusive lock.
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
***
*** 1118,1124  read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple)
  		HeapTupleHeaderSetXmax(seqtuple-t_data, InvalidTransactionId);
  		seqtuple-t_data-t_infomask = ~HEAP_XMAX_COMMITTED;
  		seqtuple-t_data-t_infomask |= HEAP_XMAX_INVALID;
! 		MarkBufferDirtyHint(*buf);
  	}
  
  	seq = (Form_pg_sequence) GETSTRUCT(seqtuple);
--- 1118,1124 
  		HeapTupleHeaderSetXmax(seqtuple-t_data, 

Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Andres Freund
On 2013-06-14 09:21:12 -0700, Jeff Davis wrote:
 On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote:
  Jeff Davis has a patch pending
  (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
  flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
  were on track committing that, but rereading the thread it doesn't look
  that way.
  
  Jeff, care to update that patch?
 
 Rebased and attached. Changed so all callers use buffer_std=true except
 those in freespace.c and fsmpage.c.
 
 Simon, did you (or anyone else) have an objection to this patch? If not,
 I'll go ahead and commit it tomorrow morning.

I'd like to see a comment around the memcpys in XLogSaveBufferForHint()
that mentions that they are safe in a non std buffer due to
XLogCheckBuffer setting an appropriate hole/offset. Or make an explicit
change of the copy algorithm there.

Btw, if you touch that code, I'd vote for renaming XLOG_HINT to XLOG_FPI
or something like that. I find the former name confusing...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_restore -l with a directory archive

2013-06-14 Thread Fujii Masao
Hi,

When I ran pg_restore -l with the directory archive input, I found that
its format is wrongly reported as UNKNOWN.

$ pg_dump -F d -f hoge
$ pg_restore -l hoge
;
; Archive created at Sat Jun 15 01:38:14 2013
; dbname: postgres
; TOC Entries: 9
; Compression: -1
; Dump Version: 1.12-0
; Format: UNKNOWN
; Integer: 4 bytes
; Offset: 8 bytes
; Dumped from database version: 9.3beta1
; Dumped by pg_dump version: 9.3beta1
;
;
; Selected TOC Entries:


In this case, the format should be reported as DIRECTORY.
The attached patch fixes this problem.

Regards,

-- 
Fujii Masao


pg_restore_tocsummary_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Dean Rasheed
On 14 June 2013 16:09, Hannu Krosing ha...@2ndquadrant.com wrote:
 What skytools/pgq/londiste uses for comparing tables on master
 and slave is query like this

 select sum(hashtext(t.*::text)) from yourtable t;

 This is non-modulo sum and does not use md5 but relies on
 whatever the hashtext() du jour is :)

 So it is not comparable to anything external (like the md5sum
 compatible idea above) but is usually good enough for fast
 checks of compatible tables.

 As tables are unordered by definition anyway, this should be
 good enough for most SQL.

 The speed comes from both fast(er) hashtext() function and
 avoiding the sort.


That sounds like a pretty good approach. We could do that if we had a
version of md5() that returned numeric. My impression is that numeric
computations are pretty fast compared to the sorting overhead.

On the other hand, if there is a usable index, select md5_agg(..) from
(sub-query) will do an index scan rather than a sort, making it much
faster than using an ORDER BY in the aggregate.
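
Until there is an md5-like function returning numeric, one order-independent
stopgap (a sketch only, with t standing for whatever table or subquery is
being checked) is to fold a 64-bit slice of each row's md5 into a sum;
sum(bigint) returns numeric, so it cannot overflow:

select sum(('x' || substr(md5(t.*::text), 1, 16))::bit(64)::bigint) from t;

Not md5sum-compatible, obviously, but it needs neither a sort nor an index.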

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-14 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-06-14 11:59:04 -0400, Tom Lane wrote:
 Ah, you are right, I forgot the #ifndef CHECKSUM_IMPL_H dance.  Will fix
 in a bit.

 That won't help against errors if it's included in two different
 files/translation units though.

Good point, but there's not any real reason to do that --- only
checksum.h should ever be #include'd in more than one file.  Any program
using this stuff is expected to #include checksum_impl.h in exactly one
place.  So maybe it's fine as-is.

 E.g. in rmgrlist.h we have the following comment:
 /* there is deliberately not an #ifndef RMGRLIST_H here */
 and I think the reasoning behind that comment applies here as well.

Well, that's a different case: there, and also in kwlist.h, there's an
idea that it could actually be useful to #include the file more than
once, redefining the PG_RMGR() macro each time.  There's no such use
case that I can see for checksum_impl.h.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-14 Thread Jeff Davis
On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote:
 Ack.  I've revised the patch to always have the GUC (for now), default
 to false, and if configure can't find posix_fallocate (or the user
 disables it by way of pg_config_manual.h) then it remains a GUC that
 simply can't be changed.

Why have a GUC here at all? Perhaps this was already discussed, and I
missed it? Is it just for testing purposes, or did you intend for it to
be in the final version?

If it's supported, it seems like we always want it. I doubt there are
cases where it hurts performance; but if there are, it's pretty hard for
a DBA to know what those cases are, anyway.

Also:

* The other code assumes that no errno means ENOSPC. We should be
consistent about that assumption, and do the same thing in the fallocate
case.

* You check for the presence of posix_fallocate at configure time, but
don't #ifdef the call site. It looks like you removed this from the v2
patch, was there a reason for that? Won't that cause build errors for
platforms without it?

I started looking at this patch and it looks like we are getting a
consensus that it's the right approach. Microbenchmarks appear to show a
benefit, and (thanks to Noah's comment) it seems like the change is
safe. Are there any remaining questions or objections?

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] request a new feature in fuzzystrmatch

2013-06-14 Thread Liming Hu
On Tue, Jun 11, 2013 at 3:19 PM, Liming Hu dawnin...@gmail.com wrote:
 On Tue, Jun 11, 2013 at 2:56 PM, Joe Conway m...@joeconway.com wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 On 06/11/2013 02:23 PM, Liming Hu wrote:
 On Tue, Jun 11, 2013 at 1:57 PM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
 Liming Hu escribió:

 I have implemented the code according to Joe's suggestion, and
 put the code at:
 https://github.com/liminghu/fuzzystrmatch/tree/fuzzystrmatchv1.1



 Please submit a proper patch so it can be seen on our mailing list
 archives.

 Hi Alvaro,

 I am kind of new to the Postgresql hacker community, Can you
 please help me on submit the patch?

 Hi Liming,

 I might be able to help, but it will be at least a couple of days
 before I have the time to look at this,

 Joe


 ok, thanks, I will wait.
Hi Joe,

Do you have some time in the weekend to help me submit the patch?
Thanks,

Liming

 Liming
 - --
 Joe Conway
 credativ LLC: http://www.credativ.us
 Linux, PostgreSQL, and general Open Source
 Training, Service, Consulting, & 24x7 Support



 --
 Liming Hu
 cell: (435)-512-4190
 Seattle Washington



--
Liming Hu
cell: (435)-512-4190
Seattle Washington


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] request a new feature in fuzzystrmatch

2013-06-14 Thread David Fetter
On Fri, Jun 14, 2013 at 10:08:24AM -0700, Liming Hu wrote:
 On Tue, Jun 11, 2013 at 3:19 PM, Liming Hu dawnin...@gmail.com wrote:
  On Tue, Jun 11, 2013 at 2:56 PM, Joe Conway m...@joeconway.com wrote:
  -BEGIN PGP SIGNED MESSAGE-
  Hash: SHA1
 
  On 06/11/2013 02:23 PM, Liming Hu wrote:
  On Tue, Jun 11, 2013 at 1:57 PM, Alvaro Herrera
  alvhe...@2ndquadrant.com wrote:
  Liming Hu escribió:
 
  I have implemented the code according to Joe's suggestion, and
  put the code at:
  https://github.com/liminghu/fuzzystrmatch/tree/fuzzystrmatchv1.1
 
 
 
  Please submit a proper patch so it can be seen on our mailing list
  archives.
 
  Hi Alvaro,
 
  I am kind of new to the Postgresql hacker community, Can you
  please help me on submit the patch?
 
  Hi Liming,
 
  I might be able to help, but it will be at least a couple of days
  before I have the time to look at this,
 
  Joe
 
 
  ok, thanks, I will wait.
 Hi Joe,
 
 Do you have some time in the weekend to help me submit the patch?
 Thanks,
 
 Liming

Liming,

Is your git skill good enough to create a patch vs. PostgreSQL's git
master?  If so, send that and once it's hit the mailing list, record
same on commitfest.postgresql.org in the current open commitfest.  If
not, let us know where in that process you got stuck.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] request a new feature in fuzzystrmatch

2013-06-14 Thread Joshua D. Drake


On 06/14/2013 10:11 AM, David Fetter wrote:


ok, thanks, I will wait.

Hi Joe,

Do you have some time in the weekend to help me submit the patch?
Thanks,

Liming


Liming,

Is your git skill good enough to create a patch vs. PostgreSQL's git
master?  If so, send that and once it's hit the mailing list, record
same on commitfest.postgresql.org in the current open commitfest.  If
not, let us know where in that process you got stuck.

Cheers,
David.



This sounds like a wiki page FAQ in the making.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] request a new feature in fuzzystrmatch

2013-06-14 Thread David Fetter
On Fri, Jun 14, 2013 at 10:14:14AM -0700, Joshua D. Drake wrote:
 On 06/14/2013 10:11 AM, David Fetter wrote:
 
 ok, thanks, I will wait.
 Hi Joe,
 
 Do you have some time in the weekend to help me submit the patch?
 Thanks,
 
 Liming
 
 Liming,
 
 Is your git skill good enough to create a patch vs. PostgreSQL's git
 master?  If so, send that and once it's hit the mailing list, record
 same on commitfest.postgresql.org in the current open commitfest.  If
 not, let us know where in that process you got stuck.
 
 
 This sounds like a wiki page FAQ in the making.

With utmost respect, this sounds like several pages should be
consolidated into one and clarified.  Yet another page will just make
matters more confusing.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-14 Thread Greg Smith

On 6/14/13 1:06 PM, Jeff Davis wrote:

Why have a GUC here at all? Perhaps this was already discussed, and I
missed it? Is it just for testing purposes, or did you intend for it to
be in the final version?


You have guessed correctly!  I suggested it stay in there only to make 
review benchmarking easier.



I started looking at this patch and it looks like we are getting a
consensus that it's the right approach. Microbenchmarks appear to show a
benefit, and (thanks to Noah's comment) it seems like the change is
safe. Are there any remaining questions or objections?


I'm planning to duplicate Jon's test program on a few machines here, and 
then see if that turns into a useful latency improvement for clients. 
I'm trying to get this pgbench rate limit stuff working first though, 
because one of the tests I had in mind for WAL creation overhead would 
benefit from it.


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-14 Thread Jeff Davis
On Tue, 2013-06-11 at 12:58 -0400, Stephen Frost wrote:
 My main question is really- would this be useful for extending
 *relations*?  Apologies if it's already been discussed; I do plan to go
 back and read the threads about this more fully, but I wanted to voice
 my support for using posix_fallocate, when available, in general.

+1, though separate from this patch.

Andres also pointed out that we can try to track a point in the file
that is below any place where a zero page might still exist. That will
allow us to call zero pages invalid unless they are related to a recent
extension, which is a weakness in the current checksums code.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] another error perhaps to be enhanced

2013-06-14 Thread Joshua D. Drake


ERROR:  index foo_idx

We should probably add the schema.

JD
--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add visibility map information to pg_freespace.

2013-06-14 Thread Peter Geoghegan
On Fri, Jun 14, 2013 at 7:23 AM, Andres Freund and...@2ndquadrant.com wrote:
 3). All the others seem to inflict unneccesary pain for not all that
 much gain.

+1. You might want to add a historical note about the name to the
pg_freespace documentation, though.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Greg Smith
I don't have this resolved yet, but I think I've identified the cause. 
Updating here mainly so Fabien doesn't duplicate my work trying to track 
this down.  I'm going to keep banging at this until it's resolved now 
that I got this far.


Here's a slow transaction:

1371226017.568515 client 1 executing \set naccounts 100000 * :scale
1371226017.568537 client 1 throttling 6191 us
1371226017.747858 client 1 executing \setrandom aid 1 :naccounts
1371226017.747872 client 1 sending SELECT abalance FROM pgbench_accounts 
WHERE aid = 268721;

1371226017.789816 client 1 receiving

That confirms it is getting stuck at the throttling step.  Looks like 
the code pauses there because it's trying to overload the sleeping 
state that was already in pgbench, but handle it in a special way inside 
of doCustom(), and that doesn't always work.


The problem is that pgbench doesn't always stay inside doCustom when a 
client sleeps.  It exits there to poll for incoming messages from the 
other clients, via select() on a shared socket.  It's not safe to assume 
doCustom will be running regularly; that's only true if clients keep 
returning messages.


So as long as other clients keep banging on the shared socket, doCustom 
is called regularly, and everything works as expected.  But at the end 
of the test run that happens less often, and that's when the problem 
shows up.


pgbench already has a \sleep command, and the way that delay is 
handled happens inside threadRun() instead.  The pausing of the rate 
limit throttle needs to operate in the same place.  I have to redo a few 
things to confirm this actually fixes the issue, as well as look at 
Fabien's later updates to this since I wandered off debugging.  I'm sure 
it's in the area of code I'm poking at now though.


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] another error perhaps to be enhanced

2013-06-14 Thread Michael Glaesemann


On Jun 14, 2013, at 13:38, Joshua D. Drake j...@commandprompt.com wrote:

 
 ERROR:  index foo_idx
 
 We should probably add the schema.

I've noticed similar issues with functions. I'd like to see those 
schema-qualified as well.


 
 JD
 -- 
 Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
 PostgreSQL Support, Training, Professional Services and Development
 High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
 For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats
 
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] another error perhaps to be enhanced

2013-06-14 Thread Peter Geoghegan
I think you'll need to better describe what you mean here.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] another error perhaps to be enhanced

2013-06-14 Thread Joshua D. Drake


On 06/14/2013 10:47 AM, Peter Geoghegan wrote:


I think you'll need to better describe what you mean here.



postgres=# create schema foo;
CREATE SCHEMA
postgres=# create schema bar;
CREATE SCHEMA
postgres=# create table foo.foo(id serial);
NOTICE:  CREATE TABLE will create implicit sequence foo_id_seq for 
serial column foo.id

CREATE TABLE
postgres=# create table bar.bar(id serial);
NOTICE:  CREATE TABLE will create implicit sequence bar_id_seq for 
serial column bar.id

CREATE TABLE
postgres=# create index one_idx on foo.foo(id);
CREATE INDEX
postgres=# create index one_idx on bar.bar(id);
CREATE INDEX
postgres=#


Now, with the error previously shown, which one_idx needs to be reindexed?
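
(Today the disambiguation has to be done by hand, with something along the
lines of:

select n.nspname, c.relname
  from pg_class c
  join pg_namespace n on n.oid = c.relnamespace
 where c.relname = 'one_idx' and c.relkind = 'i';

which is exactly the extra step a schema-qualified message would save.)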

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] another error perhaps to be enhanced

2013-06-14 Thread Peter Geoghegan
On Fri, Jun 14, 2013 at 10:54 AM, Joshua D. Drake j...@commandprompt.com 
wrote:
 Now, with the error previously shown, which one_idx needs to be reindexed?

Well, you didn't show an actual error message. But if you \set
VERBOSITY verbose within psql while connected to a 9.3 server, you'll
get fully qualified details of the constraint blamed for the error, if
any. Example:

postgres=# insert into a(a, b) values (3, 'test');
ERROR:  23505: duplicate key value violates unique constraint a_pkey
DETAIL:  Key (a)=(3) already exists.
SCHEMA NAME:  public
TABLE NAME:  a
CONSTRAINT NAME:  a_pkey
LOCATION:  _bt_check_unique, nbtinsert.c:398


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-14 Thread Jeff Davis
On Fri, 2013-06-14 at 13:21 -0400, Greg Smith wrote:
 I'm planning to duplicate Jon's test program on a few machines here, and 
 then see if that turns into a useful latency improvement for clients. 
 I'm trying to get this pgbench rate limit stuff working first though, 
 because one of the tests I had in mind for WAL creation overhead would 
 benefit from it.

Great, I'll wait on those results.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] another error perhaps to be enhanced

2013-06-14 Thread Joshua D. Drake


On 06/14/2013 11:01 AM, Peter Geoghegan wrote:


On Fri, Jun 14, 2013 at 10:54 AM, Joshua D. Drake j...@commandprompt.com 
wrote:

Now, with the error previously shown, which one_idx needs to be reindexed?


Well, you didn't show an actual error message.



ERROR:  index foo_idx

Is not an error message? Granted I didn't show the whole error message 
but my point is, it should ALWAYS be fully qualified.




But if you \set
VERBOSITY verbose within psql while connected to a 9.3 server, you'll
get fully qualified details of the constraint blamed for the error, if
any. Example:

postgres=# insert into a(a, b) values (3, 'test');
ERROR:  23505: duplicate key value violates unique constraint a_pkey
DETAIL:  Key (a)=(3) already exists.
SCHEMA NAME:  public
TABLE NAME:  a
CONSTRAINT NAME:  a_pkey
LOCATION:  _bt_check_unique, nbtinsert.c:398




I was looking in the logs.


JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Fabien COELHO


pgbench already has a \sleep command, and the way that delay is 
handled happens inside threadRun() instead.  The pausing of the rate 
limit throttle needs to operate in the same place.


It does operate at the same place. The throttling is performed by 
inserting a sleep first thing when starting a new transaction. So if 
there is a weirdness, it should show as well without throttling but with a 
fixed \sleep instead?


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hard to Use WAS: Hard limit on WAL space

2013-06-14 Thread Josh Berkus
On 06/12/2013 02:03 PM, Joshua D. Drake wrote:
 What concerns me is we seem to be trying to make this easy. It isn't
 supposed to be easy. This is hard stuff. Smart people built it and it
 takes a smart person to run it. When did it become a bad thing to be
 something that smart people need to run?

1997, last I checked.

Our unofficial motto: PostgreSQL: making very hard things possible, and
simple things hard.

It *is* hard.  But that's because we've *made* it hard to understand and
manage, not because the problem is inherently hard.  For example: can
you explain to me in 10 words or less how to monitor to see if archiving
is falling behind?  I'll bet you can't, and that's because we've
provided no reliable way to do so.
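
(For the record, about the best answer today is to count the .ready files
piling up yourself; it needs superuser and assumes the stock 9.x layout with
status files under pg_xlog/archive_status:

select count(*) as walfiles_awaiting_archive
  from pg_ls_dir('pg_xlog/archive_status') f
 where f like '%.ready';

which rather proves the point that there is no obvious, supported way.)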

It's normal when you're developing features for the ability to utilize
them to go from hacker -> high-end user -> regular user.  We suck at
moving to that last stage, partly because whenever someone on this list
introduces the idea of making a feature not just great but easy to use,
people actually object to the idea that anything should be easy to use.
  It's like we're afraid of being polluted by the unwashed DevOps masses.

In the meantime, Mongo kicks our butts at new user adoption.  Why?  Their
features suck, but the features they do have are easy to use.  You'd
think we would have learned something from MySQL.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hard to Use WAS: Hard limit on WAL space

2013-06-14 Thread Joshua D. Drake


On 06/14/2013 11:16 AM, Josh Berkus wrote:


On 06/12/2013 02:03 PM, Joshua D. Drake wrote:

What concerns me is we seem to be trying to make this easy. It isn't
supposed to be easy. This is hard stuff. Smart people built it and it
takes a smart person to run it. When did it become a bad thing to be
something that smart people need to run?


1997, last I checked.

Our unofficial motto: PostgreSQL: making very hard things possible, and
simple things hard.

It *is* hard.  But that's because we've *made* it hard to understand and
manage, not because the problem is inherently hard.  For example: can
you explain to me in 10 words or less how to monitor to see if archiving
is falling behind?  I'll bet you can't, and that's because we've
provided no reliable way to do so.


Hey, I never said we shouldn't have a complete feature set. I agree with 
you. IMO it should not have even been committed without the ability to 
actually know what is going on and we have had it since (in theory) 8.1?


My primary concern is: Don't make it stupid.

I liked Claudio's comment, "More than easy, it should be obvious".

It should be obvious from a review of the documentation how to manage 
this stuff. It isn't, and worse even if we wrote the documentation it 
still isn't because the feature is not complete.


With great power comes great responsibility :P

JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SPGist triple parity concept doesn't work

2013-06-14 Thread Tom Lane
Teodor Sigaev teo...@sigaev.ru writes:
 Anyway I now think that we might be better off with the other idea of
 abandoning an insertion and retrying if we get a lock conflict.

 done, look at the patch.

Looks good, committed with some cosmetic adjustments.

 We definitely need a new idea of locking protocol and I'll return to this
 problem in the autumn (sorry, I haven't time in summer to do this
 research).

OK.  I think the performance of this way will be okay, actually, in most
cases anyhow.  It'll do till we have a better idea.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Fabien COELHO


Hello Greg,

I think that the weirdness really comes from the way transaction times 
are measured, their interactions with throttling, and latent bugs in the 
code.


One issue is that the throttling time was included in the measurement, but not 
the first time because txn_begin is not set at the beginning of 
doCustom.


Also, flag st->listen is set to 1 but *never* set back to 0...
  sh> grep listen pgbench.c
int listen;
if (st->listen)
st->listen = 1;
st->listen = 1;
st->listen = 1;
st->listen = 1;
st->listen = 1;
st->listen = 1;

ISTM that I can fix the weirdness by inserting an ugly goto top;, but 
I would feel better about it by removing all gotos and reordering some 
actions in doCustom in a more logical way. However that would be a bigger 
patch.


Please find attached 2 patches:

 - the first is the full throttle patch which ensures that the
   txn_begin is taken at a consistent point, after throttling,
   which requires resetting listen. There is an ugly goto.
   I've also put times in a consistent format in the log,
   789.012345 instead of 789 12345.

 - the second patch just shows the diff between v10 and the first one.

--
Fabien.diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 8c202bf..dc4f819 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -137,6 +137,12 @@ int			unlogged_tables = 0;
 double		sample_rate = 0.0;
 
 /*
+ * When threads are throttled to a given rate limit, this is the target delay
+ * to reach that rate in usec.  0 is the default and means no throttling.
+ */
+int64		throttle_delay = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -205,6 +211,7 @@ typedef struct
 	int			nvariables;
 	instr_time	txn_begin;		/* used for measuring transaction latencies */
 	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	bool		throttled;  /* whether current transaction was throttled */
 	int			use_file;		/* index in sql_files for this client */
 	bool		prepared[MAX_FILES];
 } CState;
@@ -222,6 +229,10 @@ typedef struct
 	instr_time *exec_elapsed;	/* time spent executing cmds (per Command) */
 	int		   *exec_count;		/* number of cmd executions (per Command) */
 	unsigned short random_state[3];		/* separate randomness for each thread */
+int64   throttle_trigger;  /* previous/next throttling (us) */
+	int64   throttle_lag;  /* total transaction lag behind throttling */
+	int64   throttle_lag_max;  /* max transaction lag */
+
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -230,6 +241,8 @@ typedef struct
 {
 	instr_time	conn_time;
 	int			xacts;
+	int64   throttle_lag;
+	int64   throttle_lag_max;
 } TResult;
 
 /*
@@ -355,6 +368,8 @@ usage(void)
 		 -n   do not run VACUUM before tests\n
 		 -N   do not update tables \pgbench_tellers\ and \pgbench_branches\\n
 		 -r   report average latency per command\n
+		 -R SPEC, --rate SPEC\n
+		  target rate in transactions per second\n
 		 -s NUM   report this scale factor in output\n
 		 -S   perform SELECT-only transactions\n
 	   -t NUM   number of transactions each client runs (default: 10)\n
@@ -898,17 +913,56 @@ doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile, AggVa
 {
 	PGresult   *res;
 	Command   **commands;
+	booldo_throttle = false;
 
 top:
 	commands = sql_files[st-use_file];
 
+	/* handle throttling once per transaction by inserting a sleep.
+	 * this is simpler than doing it at the end.
+	 */
+	if (throttle_delay && ! st->throttled)
+	{
+		/* compute delay to approximate a Poisson distribution
+		 * 100 = 13.8 .. 0 multiplier
+		 *  10 = 11.5 .. 0
+		 *   1 =  9.2 .. 0
+		 *1000 =  6.9 .. 0
+		 * if transactions are too slow or a given wait shorter than
+		 * a transaction, the next transaction will start right away.
+		 */
+		int64 wait = (int64)
+			throttle_delay * -log(getrand(thread, 1, 1000)/1000.0);
+
+		thread->throttle_trigger += wait;
+
+		st->until = thread->throttle_trigger;
+		st->sleeping = 1;
+		st->throttled = true;
+		if (debug)
+			fprintf(stderr, "client %d throttling "INT64_FORMAT" us\n",
+	st->id, wait);
+	}
+
 	if (st->sleeping)
 	{			/* are we sleeping? */
 		instr_time	now;
+		int64 now_us;
 
 		INSTR_TIME_SET_CURRENT(now);
-		if (st->until <= INSTR_TIME_GET_MICROSEC(now))
+		now_us = INSTR_TIME_GET_MICROSEC(now);
+		if (st->until <= now_us)
+		{
 			st->sleeping = 0;	/* Done sleeping, go ahead with next command */
+			if (throttle_delay && st->state==0)
+			{
+				/* measure lag of throttled transaction */
+				int64 lag = now_us - st->until;
+				thread->throttle_lag += lag;
+				if (lag > thread->throttle_lag_max)
+					thread->throttle_lag_max = lag;
+			}
+		}
 		else
 			return true;		/* 

Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Greg Smith

On 6/14/13 3:50 PM, Fabien COELHO wrote:

I think that the weirdness really comes from the way transactions times
are measured, their interactions with throttling, and latent bugs in the
code.


measurement times, no; interactions with throttling, no.  If it was 
either of those I'd have finished this off days ago.  Latent bugs, 
possibly.  We may discover there's nothing wrong with your code at the 
end here, that it just makes hitting this bug more likely. 
Unfortunately today is the day *some* bug is popping up, and I want to 
get it squashed before I'll be happy.


The lag is actually happening during a kernel call that isn't working as 
expected.  I'm not sure whether this bug was there all along if \sleep 
was used, or if it's specific to the throttle sleep.


 Also, flag st->listen is set to 1 but *never* set back to 0...

I noticed that st->listen was weird too, and that's on my short list of 
suspicious things I haven't figured out yet.


I added a bunch more logging as pgbench steps through its work to track 
down where it's stuck at.  Until the end all transactions look like this:


1371238832.084783 client 10 throttle lag 2 us
1371238832.084783 client 10 executing \setrandom aid 1 :naccounts
1371238832.084803 client 10 sending SELECT abalance FROM 
pgbench_accounts WHERE aid = 753099;

1371238832.084840 calling select
1371238832.086539 client 10 receiving
1371238832.086539 client 10 finished

All clients who hit lag spikes at the end are going through this 
sequence instead:


1371238832.085912 client 13 throttle lag 790 us
1371238832.085912 client 13 executing \setrandom aid 1 :naccounts
1371238832.085931 client 13 sending SELECT abalance FROM 
pgbench_accounts WHERE aid = 564894;

1371238832.086592 client 13 receiving
1371238832.086662 calling select
1371238832.235543 client 13 receiving
1371238832.235543 client 13 finished

Note the calling select here that wasn't in the normal length 
transaction before it.  The client is receiving something here, but 
rather than it finishing the transaction it falls through and ends up at 
the select() system call outside of doCustom.  All of the clients that 
are sleeping when the system slips into one of these long select() calls 
are getting stuck behind it.  I'm not 100% sure, but I think this only 
happens when all remaining clients are sleeping.


Here's another one, it hits the receive that doesn't finish the 
transaction earlier (1371238832.086587) but then falls into the same 
select() call at 1371238832.086662:


1371238832.085884 client 12 throttle lag 799 us
1371238832.085884 client 12 executing \setrandom aid 1 :naccounts
1371238832.085903 client 12 sending SELECT abalance FROM 
pgbench_accounts WHERE aid = 299080;

1371238832.086587 client 12 receiving
1371238832.086662 calling select
1371238832.231032 client 12 receiving
1371238832.231032 client 12 finished

Investigation is still going here...

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Fabien COELHO



I think that the weirdness really comes from the way transactions times
are measured, their interactions with throttling, and latent bugs in the
code.


measurement times, no; interactions with throttling, no.  If it was either of 
those I'd have finished this off days ago.  Latent bugs, possibly.  We may 
discover there's nothing wrong with your code at the end here,


To summarize my point: I think my v10 code does not take into account all 
of the strangeness in doCustom, and I'm pretty sure that there is no point
in including throttle sleeps into latency measures, which was more or less 
the case. So it is somehow a bug which only shows up if you look at the 
latency measures, but the tps are fine.


that it just makes hitting this bug more likely. Unfortunately today is 
the day *some* bug is popping up, and I want to get it squashed before 
I'll be happy.


The lag is actually happening during a kernel call that isn't working as 
expected.  I'm not sure whether this bug was there all along if \sleep was 
used, or if it's specific to the throttle sleep.


The throttle sleep is inserted out of the state machine. That is why in 
the test patch I added a goto to ensure that it is always taken at the 
right time, that is when state==0 and before txn_begin is set, and not 
possibly between other states when doCustom happens to be recalled after a 
return.


I added a bunch more logging as pgbench steps through its work to track down 
where it's stuck at.  Until the end all transactions look like this:


1371238832.084783 client 10 throttle lag 2 us
1371238832.084783 client 10 executing \setrandom aid 1 :naccounts
1371238832.084803 client 10 sending SELECT abalance FROM pgbench_accounts 
WHERE aid = 753099;

1371238832.084840 calling select
1371238832.086539 client 10 receiving
1371238832.086539 client 10 finished

All clients who hit lag spikes at the end are going through this sequence
instead:


1371238832.085912 client 13 throttle lag 790 us
1371238832.085912 client 13 executing \setrandom aid 1 :naccounts
1371238832.085931 client 13 sending SELECT abalance FROM pgbench_accounts 
WHERE aid = 564894;

1371238832.086592 client 13 receiving
1371238832.086662 calling select
1371238832.235543 client 13 receiving
1371238832.235543 client 13 finished


Note the calling select here that wasn't in the normal length transaction 
before it.  The client is receiving something here, but rather than it 
finishing the transaction it falls through and ends up at the select() system 
call outside of doCustom.  All of the clients that are sleeping when the 
system slips into one of these long select() calls are getting stuck behind 
it.  I'm not 100% sure, but I think this only happens when all remaining 
clients are sleeping.


Note: in both the slow cases there is a receiving between sending and 
select. This suggests that the goto top at the very end of doCustom is 
followed in one case but not the other.


ISTM that there is a timeout passed to select() which is computed based on 
the current sleeping time of each client. I'm pretty sure that's not a well 
tested path...


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [RFC] Minmax indexes

2013-06-14 Thread Alvaro Herrera
Hi,

This is a preliminary proposal for Minmax indexes.  I'm experimenting
with the code, but it's too crude to post yet, so here's a document
explaining what they are and how they work, so that reviewers can poke
holes to have the design improved.  My intention is to have a patch to
show for CF2, so please do have a look at this and comment.

This is part of the AXLE project http://www.axleproject.eu and the
intention is to support tables of very large size.  In a quick
experiment, I have a table of ~12 GB and its corresponding index is 65
kB in size, making the time to do the equivalent of a seqscan a small
fraction of that taken by a real seqscan.  This technique sits between a
bitmap scan of a normal btree, and a seqscan: the minmax index tells the
bitmap heap scan what pages to seqscan, allowing it to skip a large
fraction of pages that are known not to contain tuples matching the
query quals.  This is a huge win for large data warehouses.  

Without further ado, here's what I propose.


Minmax Range Indexes


Minmax indexes are a new access method intended to enable very fast scanning of
extremely large tables.

The essential idea of a minmax index is to keep track of the min() and max()
values in consecutive groups of heap pages (page ranges).  These values can be
used by constraint exclusion to avoid scanning such pages, depending on query
quals.
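
As a minimal sketch of the idea (the struct and function names below are
invented for illustration, not the patch's actual code), the per-range check
for a qual like "col > key" only needs the stored min/max to prove that no
tuple in the range can possibly match before the range may be skipped:

#include <stdbool.h>

typedef struct MinmaxRange
{
	unsigned int	first_block;	/* first heap block covered by this entry */
	unsigned int	last_block;		/* last heap block covered by this entry */
	double			min_val;		/* smallest indexed value seen in the range */
	double			max_val;		/* largest indexed value seen in the range */
} MinmaxRange;

static bool
range_may_match_gt(const MinmaxRange *range, double key)
{
	/* if even the largest value is <= key, nothing in the range can match */
	return range->max_val > key;
}

Ranges for which this returns false never need to be read from the heap;
all other ranges are scanned and rechecked tuple by tuple.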

The main drawback of this is having to update the stored min/max values of each
page range as tuples are inserted into them.

Other database systems already have this feature. Some examples:

* Oracle Exadata calls this storage indexes
  http://richardfoote.wordpress.com/category/storage-indexes/

* Netezza has zone maps
  http://nztips.com/2010/11/netezza-integer-join-keys/

* Infobright has this automatically within their data packs
  
http://www.infobright.org/Blog/Entry/organizing_data_and_more_about_rough_data_contest/

* MonetDB seems to have it
  http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
  Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS

Grammar
---

To create a minmax index, we use

  CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);

Partial indexes are not supported; since an index is concerned with minimum and
maximum values of the involved columns across all the pages in the table, it
doesn't make sense to exclude values.  Another way to see partial indexes
here would be those that only considered some pages in the table instead of all
of them; but this would be difficult to implement and manage and, most likely,
pointless.

Expressional indexes can probably be supported in the future, but we disallow
them initially for conceptual simplicity.

Having multiple minmax indexes in the same table is acceptable, though most of
the time it would make more sense to have a single index covering all the
interesting columns.  Multiple indexes might be useful for columns added later.

Access Method Design


Since item pointers are not stored inside indexes of this type, it is not
possible to support the amgettuple interface.  Instead, we only provide
amgetbitmap support; scanning a relation using this index requires a recheck
node on top.  The amgetbitmap routine would return a TIDBitmap comprising all
the pages in those page groups that comply with the query quals; the recheck
node prunes tuples that are not visible per snapshot and those that are not visible
per query quals.
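
In the same illustrative spirit as the sketch above (invented names,
simplified types; not the actual implementation), the amgetbitmap-style loop
then simply adds every heap block of a qualifying range to the bitmap and
leaves the per-tuple filtering to the recheck node:

static void
minmax_build_bitmap(int nranges,
					const unsigned int *first_block,
					const unsigned int *last_block,
					const double *max_val, double key,
					void (*add_block) (unsigned int blkno))
{
	int			i;

	for (i = 0; i < nranges; i++)
	{
		/* for a "col > key" qual: skip ranges whose max cannot exceed key */
		if (max_val[i] > key)
		{
			unsigned int blk;

			for (blk = first_block[i]; blk <= last_block[i]; blk++)
				add_block(blk);
		}
	}
}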

For each supported datatype, we need an opclass with the following catalog
entries:

- support functions (pg_amproc)
  * pg_proc entry for min()
  * pg_proc entry for max()
- support operators (pg_amop): same as btree (<, <=, =, >=, >)

The min() and max() support functions are used during index construction.
The support operators are used in the optimizer, so that the index is chosen
when queries on the indexed table are planned.  (Also, we use them in the
amgetbitmap routine, to compare ScanKeys and decide whether to emit a certain
block or not).

In each index tuple (corresponding to one page range), we store:
- first block this tuple applies to
- last block this tuple applies to
- for each indexed column:
  * min() value across all tuples in the range
  * max() value across all tuples in the range
  * nulls present in any tuple?

With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
types (timestamptz, bigint), each tuple would be 524 bytes in length, which
seems reasonable.  Of course, larger columns are possible, such as varchar, but
creating minmax indexes on such columns seems of little practical usefulness.

This maximum index tuple size is calculated as:
BlockNumber (4 bytes) * 2 + data value (8 bytes) * 32 * 2 + null bitmap (4 bytes)


Block ranges mentioned in index entries shouldn't overlap. However, there can
be gaps where some pages have no covering index entry. (In particular, the last
few pages of the table would commonly not be summarized.)

In order to scan 

Re: [HACKERS] single-user vs standalone in docs and messages

2013-06-14 Thread Robert Haas
On Thu, Jun 13, 2013 at 6:10 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Some places in the docs and elog hints refer to standalone backends, while
 the official name as used in app-postgres.html is single-user mode, and in
 fact standalone does not appear on that page.

 This tries to standardize the other locations to use single-user.  I think
 I did the right thing with the message translation files, but I can't figure
 out how to test that.

 I made no attempt to change code-comments, just the user-facing parts.

I think you could tell people to use single-user mode instead of a
standalone backend, but telling them to use a single-user backend
just seems weird.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] stray SIGALRM

2013-06-14 Thread Richard Poole
In 9.3beta1, a backend will receive a SIGALRM after authentication_timeout
seconds, even if authentication has been successful. Most of the time
this doesn't hurt anyone, but there are cases, such as when the backend
is doing the open() of a backend copy, when it breaks things and results
in an error getting reported to the client. In particular, if you're doing
a copy from a FIFO, it is normal for open() to block until the process at
the other end has data ready, so you're very likely to have it interrupted
by the SIGALRM and fail.
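
For illustration only (this is not code from the backend), the usual shape of
an EINTR-safe retry loop around open() looks like the sketch below; the
difficulty is that every blocking call site in the backend would need
equivalent treatment:

#include <errno.h>
#include <fcntl.h>

static int
open_retry_eintr(const char *path, int flags)
{
	int			fd;

	do
	{
		/* retry when the call is interrupted by a signal such as SIGALRM */
		fd = open(path, flags);
	} while (fd < 0 && errno == EINTR);

	return fd;
}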

To see the SIGALRM just run psql then determine your backend's pid,
attach an strace to it, and wait 60 seconds, or whatever you've got
authentication_timeout set to.

This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a
side-effect of speeding things up by getting rid of setitimer() calls;
it's not obvious what's a good way to fix it without losing the benefits
of that commit.

Thanks Alvaro and Andres for helping me get from why is my copy getting
these signals to understanding what's actually going on.

Richard

-- 
Richard Poole http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] logical changeset generation v5

2013-06-14 Thread Andres Freund
Hi!

I am rather pleased to announce the next version of the changeset
extraction patchset. Thanks to help from a large number of people I
think we are slowly getting to the point where it is getting
committable.

Since the last submitted version
(20121115002746.ga7...@awork2.anarazel.de) a large number of fixes and
the result of a good amount of review has been added to the tree. All
bugs known to me have been fixed.

Fixes include:
* synchronous replication support
* don't peg the xmin for user tables, do it only for catalog ones.
* arbitrarily large transaction support by spilling large transactions
  to disk
* spill snapshots to disk, so we can restart without waiting for a new
  snapshot to be built
* Don't read all WAL from the establishment of a logical slot
* tests via SQL interface to changeset extraction

The todo list includes:
* morph the logical slot interface into being replication slots that
  can also be used by streaming replication
* move some more code from snapbuild.c to decode.c to remove a largely
  duplicated switch
* do some more header/comment cleanup  clarification
* move pg_receivellog into its own directory in src/bin or contrib/.
* user/developer level documentation

The patch series currently has two interfaces to logical decoding. One -
which is primarily useful for pg_regress style tests and playing around
- is SQL based, the other one uses a walsender replication connection.

A quick demonstration of the SQL interface (server needs to be started
with wal_level = logical and max_logical_slots > 0):
=# CREATE EXTENSION test_logical_decoding;
=# SELECT * FROM init_logical_replication('regression_slot', 'test_decoding');
slotname | xlog_position 
-+---
 regression_slot | 0/17D5908
(1 row)

=# CREATE TABLE foo(id serial primary key, data text);

=# INSERT INTO foo(data) VALUES(1);

=# UPDATE foo SET id = -id, data = ':'||data;

=# DELETE FROM foo;

=# DROP TABLE foo;

=# SELECT * FROM start_logical_replication('regression_slot', 'now', 
'hide-xids', '0');
 location  | xid |  data
---+-+
 0/17D59B8 | 695 | BEGIN
 0/17D59B8 | 695 | COMMIT
 0/17E8B58 | 696 | BEGIN
 0/17E8B58 | 696 | table foo: INSERT: id[int4]:1 data[text]:1
 0/17E8B58 | 696 | COMMIT
 0/17E8CA8 | 697 | BEGIN
 0/17E8CA8 | 697 | table foo: UPDATE: old-pkey: id[int4]:1 new-tuple: 
id[int4]:-1 data[text]::1
 0/17E8CA8 | 697 | COMMIT
 0/17E8E50 | 698 | BEGIN
 0/17E8E50 | 698 | table foo: DELETE: id[int4]:-1
 0/17E8E50 | 698 | COMMIT
 0/17E9058 | 699 | BEGIN
 0/17E9058 | 699 | COMMIT
(13 rows)

=# SELECT * FROM pg_stat_logical_decoding ;
slot_name|plugin | database | active | xmin | 
restart_decoding_lsn 
-+---+--++--+--
 regression_slot | test_decoding |12042 | f  |  695 | 0/17D58D0
(1 row)

=# SELECT * FROM stop_logical_replication('regression_slot');
 stop_logical_replication
--
0

The walsender interface has the same calls
INIT_LOGICAL_REPLICATION 'slot' 'plugin';
START_LOGICAL_REPLICATION 'slot' restart_lsn [(option value)*];
STOP_LOGICAL_REPLICATION 'slot';

The only difference is that START_LOGICAL_REPLICATION can stream changes
and it can support synchronous replication.

The output seen in the 'data' column is produced by a so called 'output
plugin' which users of the facility can write to suit their needs. They
can be written by implementing 5 functions in the shared object that's
passed to init_logical_replication() above:
* pg_decode_init (optional)
* pg_decode_begin_txn
* pg_decode_change
* pg_decode_commit_txn
* pg_decode_cleanup (optional)

The most interesting function, pg_decode_change, gets passed a structure
containing old/new versions of the row, the 'struct Relation' belonging
to it and metainformation about the transaction.

The output plugin can rely on syscache lookups et al. to decode the
changed tuple in whatever fashion it wants.

I'd like to invite reviewers to first look at:
* the output plugin interface
* the walsender/SRF interface
* patch 12 which contains most of the code

When reading the code, the information flow during decoding might be
interesting:
---
  +---+
  | XLogReader|
  +---+
  |
XLOG Records
  |
  v
  +---+
  | decode.c  |
  +---+
 |   |
 |   |
 v   |
+---+|
| snapbuild.c   |  HeapTupleData
+---+|
 |   |
  catalog snapshots  |
 |   |
 v   v
  +---+
  |reorderbuffer.c|
  +---+
 |
HeapTuple  Metadata
 

[HACKERS] GIN improvements part2: fast scan

2013-06-14 Thread Alexander Korotkov
Hackers,

Attached is a patch implementing the fast scan technique for GIN. This is the
second patch of the GIN improvements; see the 1st one here:
http://www.postgresql.org/message-id/capphfduxv-il7aedwpw0w5fxrwgakfxijwm63_hzujacrxn...@mail.gmail.com
This patch allows skipping parts of posting trees when their scan is not
necessary. In particular, it solves the frequent_term & rare_term problem of
FTS.
It introduces a new interface method, pre_consistent, which behaves like
consistent, but:
1) allows false positives on input (check[])
2) is allowed to return false positives

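A purely illustrative sketch of what such a pre_consistent check could do for
a plain AND-of-terms query (the argument list here is simplified and is not
the patch's actual signature): because false positives are allowed on both
input and output, it only has to return false when the item certainly cannot
match, which is what lets the scan skip most of the frequent term's posting
tree.

#include <stdbool.h>

static bool
and_query_pre_consistent(const bool *check, int nkeys)
{
	int			i;

	for (i = 0; i < nkeys; i++)
	{
		if (!check[i])
			return false;		/* a required entry is definitely absent */
	}
	return true;				/* may match; the real consistent check runs later */
}
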
Some example: frequent_term & rare_term becomes pretty fast.

create table test as (select to_tsvector('english', 'bbb') as v from
generate_series(1,100));
insert into test (select to_tsvector('english', 'ddd') from
generate_series(1,10));
create index test_idx on test using gin (v);

postgres=# explain analyze select * from test where v @@
to_tsquery('english', 'bbb & ddd');
  QUERY PLAN
---
 Bitmap Heap Scan on test  (cost=942.75..7280.63 rows=5000 width=17)
(actual time=0.458..0.461 rows=10 loops=1)
   Recheck Cond: (v @@ '''bbb'' & ''ddd'''::tsquery)
   -  Bitmap Index Scan on test_idx  (cost=0.00..941.50 rows=5000 width=0)
(actual time=0.449..0.449 rows=10 loops=1)
 Index Cond: (v @@ '''bbb'' & ''ddd'''::tsquery)
 Total runtime: 0.516 ms
(5 rows)


--
With best regards,
Alexander Korotkov.


gin_fast_scan.1.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] stray SIGALRM

2013-06-14 Thread Tom Lane
Richard Poole rich...@2ndquadrant.com writes:
 In 9.3beta1, a backend will receive a SIGALRM after authentication_timeout
 seconds, even if authentication has been successful. Most of the time
 this doesn't hurt anyone, but there are cases, such as when the backend
 is doing the open() of a backend copy, when it breaks things and results
 in an error getting reported to the client. In particular, if you're doing
 a copy from a FIFO, it is normal for open() to block until the process at
 the other end has data ready, so you're very likely to have it interrupted
 by the SIGALRM and fail.

 To see the SIGALRM just run psql then determine your backend's pid,
 attach an strace to it, and wait 60 seconds, or whatever you've got
 authentication_timeout set to.

 This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a
 side-effect of speeding things up by getting rid of setitimer() calls;
 it's not obvious what's a good way to fix it without losing the benefits
 of that commit.

Ugh.  It doesn't sound very practical to try to guarantee that every
single kernel call in the backend is set up to recover from EINTR,
even though ideally they should all be able to cope.  Maybe we have to
revert those signal-handling changes.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Minmax indexes

2013-06-14 Thread Josh Berkus
Alvaro,

This sounds really interesting, and I can see the possibilities.
However ...

 Value changes in columns that are part of a minmax index, and tuple insertion
 in summarized pages, would invalidate the stored min/max values.  To support
 this, each minmax index has a validity map; a range can only be considered in
 a scan if it hasn't been invalidated by such changes (A range not considered
 in the scan needs to be returned in whole regardless of the stored min/max
 values, that is, it cannot be pruned per query quals).  The validity map is
 very similar to the visibility map in terms of performance characteristics:
 quick enough that it's not contentious, allowing updates and insertions to
 proceed even when data values violate the minmax index conditions.  An
 invalidated range can be made valid by re-summarization (see below).

This begins to sound like these indexes are only useful on append-only
tables.  Not that there aren't plenty of those, but ...

 Re-summarization is relatively expensive, because the complete page range has
 to be scanned.

Why?  Why can't we just update the affected pages in the index?

  To avoid this, a table having a minmax index would be
 configured so that inserts only go to the page(s) at the end of the table;
 this avoids frequent invalidation of ranges in the middle of the table.  We
 provide a table reloption that tweaks the FSM behavior, so that summarized
 pages are not candidates for insertion.

We haven't had an index type which modifies table insertion behavior
before, and I'm not keen to start now; imagine having two indexes on the
same table each with their own, conflicting, requirements.  This is
sounding a lot more like a candidate for our prospective pluggable
storage manager.  Also, the above doesn't help us at all with UPDATEs.

If we're going to start adding reloptions for specific table behavior,
I'd rather think of all of the optimizations we might have for a
prospective append-only table and bundle those, rather than tying it
to whether a certain index exists or not.

Also, I hate the name ... if this feature goes ahead, I'm going to be
lobbying to change it.  But that's pretty minor compared to the update
issues.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] GIN improvements part 3: ordering in index

2013-06-14 Thread Alexander Korotkov
Hackers,

Attached is a patch implementing ordering inside a GIN index. This is the
third patch of the GIN improvements; see the previous two:
http://www.postgresql.org/message-id/capphfduxv-il7aedwpw0w5fxrwgakfxijwm63_hzujacrxn...@mail.gmail.com
http://www.postgresql.org/message-id/CAPpHfdvftaJq7www381naLw1=4u0h+qpxgwvnhceb9hmvyw...@mail.gmail.com

This patch introduces a new interface method of GIN which takes the same
arguments as consistent but returns float8.
float8 gin_ordering(bool check[], StrategyNumber n, Datum query, int32
nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool
nullFlags[], Datum addInfo[], bool addInfoIsNull[])
This patch implements the gingettuple method, which can return ordering data
using the KNN infrastructure. It also introduces an operator for FTS which
supports ordering in a GIN index. Some example:

postgres=# explain analyze select * from dblp_titles2 where tsvector @@
to_tsquery('english', 'statistics') order by tsvector 
to_tsquery('english', 'statistics') limit 10;
   QUERY
PLAN
-
 Limit  (cost=12.00..48.22 rows=10 width=136) (actual time=6.999..7.120
rows=10 loops=1)
   -  Index Scan using dblp_titles2_idx on dblp_titles2
 (cost=12.00..43003.03 rows=11868 width=136) (actual time=6.996..7.115
rows=10 loops=1)
 Index Cond: (tsvector @@ '''statist'''::tsquery)
 Order By: (tsvector  '''statist'''::tsquery)
 Total runtime: 7.556 ms
(5 rows)

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] extensible external toast tuple support

2013-06-14 Thread Andres Freund
On 2013-05-31 23:42:51 -0400, Robert Haas wrote:
 On Thu, May 30, 2013 at 7:42 AM, Andres Freund and...@2ndquadrant.com wrote:
  In
  http://archives.postgresql.org/message-id/20130216164231.GA15069%40awork2.anarazel.de
  I presented the need for 'indirect' toast tuples which point into memory
  instead of a toast table. In the comments to that proposal, off-list and
  in-person talks the wish to make that a more general concept has
  been voiced.
 
  The previous patch used varattrib_1b_e.va_len_1be to discern between
  different types of external tuples. That obviously only works if the
  data sizes of all possibly stored datum types are distinct which isn't
  nice. So what the newer patch now does is to rename that field into
  'va_tag' and decide based on that what kind of Datum we have. To get the
  actual length of that datum there now is a VARTAG_SIZE() macro which
  maps the tags back to size.
  To keep on-disk compatibility the size of an external toast tuple
  containing a varatt_external is used as its tag value.
 
  This should allow for fairly easy development of a new compression
  scheme for out-of-line toast tuples. It will *not* work for compressed
  inline tuples (i.e. VARATT_4B_C). I am not convinced that that is a
  problem or that if it is, that it cannot be solved separately.
 
  FWIW, in some quick microbenchmarks I couldn't find any performance
  difference due to the slightly more complex size computation which I do
  *not* find surprising.
 
  Opinions?
 
 Seems pretty sensible to me.  The patch is obviously WIP but the
 direction seems fine to me.

Here's the updated version. It shouldn't contain any obvious WIP pieces
anymore, although I think it needs some more documentation. I am just
not sure where to add it yet, postgres.h seems like a bad place :/

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 654e24e9a615dcacea4d9714cf8cdbf6953983d5 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Tue, 11 Jun 2013 23:25:26 +0200
Subject: [PATCH] Add support for multiple kinds of external toast datums

There are several usecases where our current representation of external toast
datums is limiting:
* adding new compression schemes
* avoidance of repeated detoasting
* externally decoded toast tuples

For that support 'tags' on external (varattrib_1b_e) varlenas which recoin the
current va_len_1be field to store the tag (or type) of a varlena. To determine
the actual length a macro VARTAG_SIZE(tag) is added which can be used to map
from a tag to the actual length.

This patch adds support for 'indirect' tuples which point to some externally
allocated memory containing a toast tuple. It also implements the stub for a
different compression algorithm.
---
 src/backend/access/heap/tuptoaster.c | 100 +++
 src/include/c.h  |   2 +
 src/include/postgres.h   |  83 +
 3 files changed, 153 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index fc37ceb..99044d0 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -128,7 +128,7 @@ heap_tuple_fetch_attr(struct varlena * attr)
 struct varlena *
 heap_tuple_untoast_attr(struct varlena * attr)
 {
-	if (VARATT_IS_EXTERNAL(attr))
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
 	{
 		/*
 		 * This is an externally stored datum --- fetch it back from there
@@ -145,6 +145,15 @@ heap_tuple_untoast_attr(struct varlena * attr)
 			pfree(tmp);
 		}
 	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		struct varatt_indirect redirect;
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *)redirect.pointer;
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		attr = heap_tuple_untoast_attr(attr);
+	}
 	else if (VARATT_IS_COMPRESSED(attr))
 	{
 		/*
@@ -191,7 +200,7 @@ heap_tuple_untoast_attr_slice(struct varlena * attr,
 	char	   *attrdata;
 	int32		attrsize;
 
-	if (VARATT_IS_EXTERNAL(attr))
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
 	{
 		struct varatt_external toast_pointer;
 
@@ -204,6 +213,13 @@ heap_tuple_untoast_attr_slice(struct varlena * attr,
 		/* fetch it back (compressed marker will get set automatically) */
 		preslice = toast_fetch_datum(attr);
 	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		struct varatt_indirect redirect;
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		return heap_tuple_untoast_attr_slice(redirect.pointer,
+			 sliceoffset, slicelength);
+	}
 	else
 		preslice = attr;
 
@@ -267,7 +283,7 @@ toast_raw_datum_size(Datum value)
 	struct varlena *attr = (struct varlena *) DatumGetPointer(value);
 	Size		result;
 
-	if (VARATT_IS_EXTERNAL(attr))
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
 	{
 		/* va_rawsize is the size of the original datum -- including header */
 		struct 

Re: [HACKERS] [RFC] Minmax indexes

2013-06-14 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 To avoid this, a table having a minmax index would be
 configured so that inserts only go to the page(s) at the end of the table;
 this avoids frequent invalidation of ranges in the middle of the table.  We
 provide a table reloption that tweaks the FSM behavior, so that summarized
 pages are not candidates for insertion.

 We haven't had an index type which modifies table insertion behavior
 before, and I'm not keen to start now; imagine having two indexes on the
 same table each with their own, conflicting, requirements.

I agree; such a restriction is a nonstarter for a secondary index.  I
don't believe that hacking the FSM would be sufficient to guarantee the
required behavior, either.

We've talked a lot about index-organized tables in the past.  How much
of the use case for this would be subsumed by a feature like that?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extensible external toast tuple support

2013-06-14 Thread Alvaro Herrera
Andres Freund wrote:

 Here's the updated version. It shouldn't contain any obvious WIP pieces
 anymore, although I think it needs some more documentation. I am just
 not sure where to add it yet, postgres.h seems like a bad place :/

How about a new file, say src/include/access/toast.h?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extensible external toast tuple support

2013-06-14 Thread Andres Freund
On 2013-06-14 19:14:15 -0400, Alvaro Herrera wrote:
 Andres Freund wrote:
 
  Here's the updated version. It shouldn't contain any obvious WIP pieces
  anymore, although I think it needs some more documentation. I am just
  not sure where to add it yet, postgres.h seems like a bad place :/
 
 How about a new file, say src/include/access/toast.h?

Well, the question is whether that buys us all that much; we need the varlena
definitions to be available pretty much everywhere. Except for section 3
- which we reduced to be pretty darn small these days - of postgres.h,
pretty much all of it is concerned with Datums, a good part of them being
varlenas.
We could move section 1) into its own file and unconditionally include
it...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pluggable compression support

2013-06-14 Thread Josh Berkus
On 06/14/2013 04:01 PM, Andres Freund wrote:
 It still contains a guc as described in the above message to control the
 algorithm used for compressing new tuples but I think we should remove
 that guc after testing.

Did you add the storage attribute?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pluggable compression support

2013-06-14 Thread Andres Freund
On 2013-06-14 17:12:01 -0700, Josh Berkus wrote:
 On 06/14/2013 04:01 PM, Andres Freund wrote:
  It still contains a guc as described in the above message to control the
  algorithm used for compressing new tuples but I think we should remove
  that guc after testing.
 
 Did you add the storage attribute?

No. I think as long as we only have pglz and one new algorithm (even if
that is lz4 instead of the current snappy) we should just always use the
new algorithm. Unless I missed it nobody seemed to have voiced a
contrary position?
For testing/evaluation the guc seems to be sufficient.

If we want to make it configurable on a per column basis I think the way
to go is to add a new column to pg_attribute and split compression
related things out of attstorage into attcompression.
That's a fair amount of work and it includes a minor compatibility break
in the catalog format, so I'd prefer not to do it until there's a good
reason to do so.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] Revive line type

2013-06-14 Thread Peter Eisentraut
Complete the implementations of line_in, line_out, line_recv,
line_send.  Remove comments and error messages about the line type not
being implemented.  Add regression tests for existing line operators
and functions.
---
This just revives existing functionality, doesn't add anything new.
One thing that the original code did not settle was how to convert a
line in form Ax+By+C=0 to the two-points output form.  Obviously, you
can just pick two random points on the line, but I wonder whether there
is a more standard solution.
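
For what it's worth, a simple (if arbitrary) way to pick the two points,
sketched here only to illustrate the math rather than as the patch's code, is
to fix two x values when B is nonzero and two y values for a vertical line:

static void
line_to_points(double A, double B, double C,
			   double *x1, double *y1, double *x2, double *y2)
{
	if (B != 0.0)
	{
		/* pick x = 0 and x = 1, solve A*x + B*y + C = 0 for y */
		*x1 = 0.0;
		*y1 = -C / B;
		*x2 = 1.0;
		*y2 = -(A + C) / B;
	}
	else
	{
		/* vertical line: x is fixed at -C/A, pick any two distinct y values */
		*x1 = -C / A;
		*y1 = 0.0;
		*x2 = -C / A;
		*y2 = 1.0;
	}
}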

 doc/src/sgml/datatype.sgml |   34 +++-
 doc/src/sgml/func.sgml |6 +
 src/backend/utils/adt/geo_ops.c|  108 +
 src/include/catalog/pg_type.h  |3 +-
 src/include/utils/geo_decls.h  |7 -
 src/test/regress/expected/geometry.out |3 -
 src/test/regress/expected/line.out |  243 
 src/test/regress/expected/sanity_check.out |3 +-
 src/test/regress/output/misc.source|3 +-
 src/test/regress/parallel_schedule |2 +-
 src/test/regress/serial_schedule   |1 +
 src/test/regress/sql/geometry.sql  |4 -
 src/test/regress/sql/line.sql  |   77 +
 13 files changed, 408 insertions(+), 86 deletions(-)
 create mode 100644 src/test/regress/expected/line.out
 create mode 100644 src/test/regress/sql/line.sql

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index f73e6b2..ecbbdd8 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -3066,7 +3066,7 @@ <title>Geometric Types</title>
      <row>
       <entry><type>line</type></entry>
       <entry>32 bytes</entry>
-      <entry>Infinite line (not fully implemented)</entry>
+      <entry>Infinite line</entry>
       <entry>((x1,y1),(x2,y2))</entry>
      </row>
      <row>
@@ -3142,6 +3142,38 @@ <title>Points</title>
    </sect2>
 
    <sect2>
+    <title>Lines</title>
+
+    <indexterm>
+     <primary>line</primary>
+    </indexterm>
+
+    <para>
+     Lines (<type>line</type>) are specified by pairs of points.
+     Values of type <type>line</type> are specified using any of the following
+     syntaxes:
+
+<synopsis>
+[ ( <replaceable>x1</replaceable> , <replaceable>y1</replaceable> ) , ( <replaceable>x2</replaceable> , <replaceable>y2</replaceable> ) ]
+( ( <replaceable>x1</replaceable> , <replaceable>y1</replaceable> ) , ( <replaceable>x2</replaceable> , <replaceable>y2</replaceable> ) )
+  ( <replaceable>x1</replaceable> , <replaceable>y1</replaceable> ) , ( <replaceable>x2</replaceable> , <replaceable>y2</replaceable> )
+    <replaceable>x1</replaceable> , <replaceable>y1</replaceable>   ,   <replaceable>x2</replaceable> , <replaceable>y2</replaceable>
+</synopsis>
+
+     where
+     <literal>(<replaceable>x1</replaceable>,<replaceable>y1</replaceable>)</literal>
+     and
+     <literal>(<replaceable>x2</replaceable>,<replaceable>y2</replaceable>)</literal>
+     are two (different) points on the line.
+    </para>
+
+    <para>
+     Lines are output using the first syntax.  The points used in the output
+     are not necessarily the points used on input.
+    </para>
+   </sect2>
+
+   <sect2>
     <title>Line Segments</title>
 
    <indexterm>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4c5af4b..835a189 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -8070,6 +8070,12 @@ <title>Geometric Type Conversion Functions</title>
        <entry><literal>circle(polygon '((0,0),(1,1),(2,0))')</literal></entry>
       </row>
       <row>
+       <entry><literal><function>line(<type>point</type>, <type>point</type>)</function></literal></entry>
+       <entry><type>line</type></entry>
+       <entry>points to line</entry>
+       <entry><literal>lseg(point '(-1,0)', point '(1,0)')</literal></entry>
+      </row>
+      <row>
        <entry>
         <indexterm>
          <primary>lseg</primary>
diff --git a/src/backend/utils/adt/geo_ops.c b/src/backend/utils/adt/geo_ops.c
index ad18cf0..61a1900 100644
--- a/src/backend/utils/adt/geo_ops.c
+++ b/src/backend/utils/adt/geo_ops.c
@@ -933,13 +933,8 @@
 Datum
 line_in(PG_FUNCTION_ARGS)
 {
-#ifdef ENABLE_LINE_TYPE
char   *str = PG_GETARG_CSTRING(0);
-#endif
LINE   *line;
-
-#ifdef ENABLE_LINE_TYPE
-	/* when fixed, modify "not implemented", catalog/pg_type.h and SGML */
LSEGlseg;
int isopen;
char   *s;
@@ -950,15 +945,13 @@
 				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 				 errmsg("invalid input syntax for type line: \"%s\"", str)));
 
+	if (FPeq(lseg.p[0].x, lseg.p[1].x) && FPeq(lseg.p[0].y, lseg.p[1].y))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+				 errmsg("invalid line specification: must be two distinct points")));
+
line = (LINE *) palloc(sizeof(LINE));
line_construct_pts(line, lseg.p[0], lseg.p[1]);
-#else
-   ereport(ERROR,
-   

Re: [HACKERS] [RFC] Minmax indexes

2013-06-14 Thread Greg Stark
On Fri, Jun 14, 2013 at 11:28 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Re-summarization is relatively expensive, because the complete page range has
 to be scanned.

That doesn't sound too bad to me. It just means there's a downside to
having larger page ranges. I would expect the page ranges to be
something in the ballpark of 32 pages --  scanning 32 pages to
resummarize doesn't sound that painful but sounds like it's large
enough that the resulting index would be a reasonable size.

But I don't understand why an insert would invalidate a tuple. An insert
can just update the min and max incrementally. It's a delete that
invalidates the range, but as you note it doesn't really invalidate it,
just marks it as needing a refresh -- and even then only if the value
being deleted is equal to either the min or max.
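
A minimal sketch of that distinction (invented names, not the proposed
implementation): an insert can only widen the stored summary, while only
deleting a current boundary value can make it stale.

#include <stdbool.h>

typedef struct RangeSummary
{
	double		min_val;
	double		max_val;
	bool		needs_resummarize;
} RangeSummary;

static void
summary_note_insert(RangeSummary *s, double newval)
{
	/* widening the summary keeps it valid, no rescan needed */
	if (newval < s->min_val)
		s->min_val = newval;
	if (newval > s->max_val)
		s->max_val = newval;
}

static void
summary_note_delete(RangeSummary *s, double oldval)
{
	/* only removing a boundary value can leave the summary too wide */
	if (oldval <= s->min_val || oldval >= s->max_val)
		s->needs_resummarize = true;
}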

 Same-size page ranges?
 Current related literature seems to consider that each index entry in a
 minmax index must cover the same number of pages.  There doesn't seem to be a

I assume the reason for this in the literature is the need to quickly
find the summary for a given page when you're handling an insert or
delete. If you have some kind of metadata structure that lets you
find it (which I gather is what the validity map is?) then you
wouldn't need it. But that seems like a difficult cost to justify
compared to just having a 1:1 mapping from block to bitmap tuple.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pluggable compression support

2013-06-14 Thread Josh Berkus

 No. I think as long as we only have pglz and one new algorithm (even if
 that is lz4 instead of the current snappy) we should just always use the
 new algorithm. Unless I missed it nobody seemed to have voiced a
 contrary position?
 For testing/evaluation the guc seems to be sufficient.

Then it's not pluggable, is it?  It's upgradable compression
support, if anything.  Which is fine, but let's not confuse people.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pluggable compression support

2013-06-14 Thread Andres Freund
On 2013-06-14 17:35:02 -0700, Josh Berkus wrote:
 
  No. I think as long as we only have pglz and one new algorithm (even if
  that is lz4 instead of the current snappy) we should just always use the
  new algorithm. Unless I missed it nobody seemed to have voiced a
  contrary position?
  For testing/evaluation the guc seems to be sufficient.
 
 Then it's not pluggable, is it?  It's upgradable compression
 support, if anything.  Which is fine, but let's not confuse people.

The point is that it's pluggable on the storage level in the sense
that several different algorithms can coexist and new ones can be
relatively easily added.
That part is what seems to have blocked progress for quite a while
now. So fixing that seems to be the interesting thing.

I am happy enough to do the work of making it configurable if we want it
to be... But I have zap interest of doing it and throw it away in the
end because we decide we don't need it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dynamic background workers

2013-06-14 Thread Michael Paquier
On Sat, Jun 15, 2013 at 6:00 AM, Robert Haas robertmh...@gmail.com wrote:

 The second patch, dynamic-bgworkers-v1.patch, revises the background
 worker API to allow background workers to be started dynamically.
 This requires some communication channel from ordinary workers to the
 postmaster, because it is the postmaster that must ultimately start
 the newly-registered workers.  However, that communication channel has
 to be designed pretty carefully, lest a shared memory corruption take
 out the postmaster and lead to inadvertent failure to restart after a
 crash.  Here's how I implemented that: there's an array in shared
 memory of a size equal to max_worker_processes.  This array is
 separate from the backend-private list of workers maintained by the
 postmaster, but the two are kept in sync.  When a new background
 worker registration is added to the shared data structure, the backend
 adding it uses the existing pmsignal mechanism to kick the postmaster,
 which then scans the array for new registrations.  I have attempted to
 make the code that transfers the shared_memory state into the
 postmaster's private state as paranoid as humanly possible.  The
 precautions taken are documented in the comments.  Conversely, when a
 background worker flagged as BGW_NEVER_RESTART is considered for
 restart (and we decide against it), the corresponding slot in the
 shared memory array is marked as no longer in use, allowing it to be
 reused for a new registration.

 Since the postmaster cannot take locks, synchronization between the
 postmaster and other backends using the shared memory segment has to
 be lockless.  This mechanism is also documented in the comments.  An
 lwlock is used to prevent two backends that are both registering a new
 worker at about the same time from stomping on each other, but the
 postmaster need not care about that lwlock.
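
A rough sketch of that handoff, with invented names and none of the real
patch's details (memory barriers, the pmsignal call, restart bookkeeping),
just to illustrate why the in-use flag is set last:

#include <stdio.h>
#include <stdbool.h>

#define MAX_WORKER_SLOTS 16				/* stands in for max_worker_processes */

typedef struct BgWorkerSlot
{
	volatile bool in_use;				/* set last by the registering backend */
	char		worker_name[64];
	/* ... entry point, restart policy, start time, etc. ... */
} BgWorkerSlot;

static BgWorkerSlot worker_slots[MAX_WORKER_SLOTS];	/* lives in shared memory */

/* caller is assumed to hold the registration lwlock mentioned above */
static int
register_worker_slot(const char *name)
{
	int			i;

	for (i = 0; i < MAX_WORKER_SLOTS; i++)
	{
		if (!worker_slots[i].in_use)
		{
			snprintf(worker_slots[i].worker_name,
					 sizeof(worker_slots[i].worker_name), "%s", name);
			/*
			 * A write barrier belongs here, so the postmaster never sees
			 * in_use = true before the slot contents are filled in.
			 */
			worker_slots[i].in_use = true;
			/* then signal the postmaster so it rescans the array */
			return i;
		}
	}
	return -1;						/* no free slot */
}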

 This patch also extends worker_spi as a demonstration of the new
 interface.  With this patch, you can CREATE EXTENSION worker_spi and
 then call worker_spi_launch(int4) to launch a new background worker,
 or combine it with generate_series() to launch a bunch at once.  Then
 you can kill them off with pg_terminate_backend() and start some new
 ones.  That, in my humble opinion, is pretty cool.

This looks really interesting, +1. I'll test the patch if possible next
week.
-- 
Michael


Re: [HACKERS] pluggable compression support

2013-06-14 Thread Robert Haas
On Fri, Jun 14, 2013 at 8:45 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-06-14 17:35:02 -0700, Josh Berkus wrote:

  No. I think as long as we only have pglz and one new algorithm (even if
  that is lz4 instead of the current snappy) we should just always use the
  new algorithm. Unless I missed it nobody seemed to have voiced a
  contrary position?
  For testing/evaluation the guc seems to be sufficient.

 Then it's not pluggable, is it?  It's upgradable compression
 support, if anything.  Which is fine, but let's not confuse people.

 The point is that it's pluggable on the storage level in the sense of
 that several different algorithms can coexist and new ones can
 relatively easily added.
 That part is what seems to have blocked progress for quite a while
 now. So fixing that seems to be the interesting thing.

 I am happy enough to do the work of making it configurable if we want it
 to be... But I have zap interest of doing it and throw it away in the
 end because we decide we don't need it.

I don't think we need it.  I think what we need to do is decide which
algorithm is legally OK to use.  And then put it in.

In the past, we've had a great deal of speculation about that legal
question from people who are not lawyers.  Maybe it would be valuable
to get some opinions from people who ARE lawyers.  Tom and Heikki both
work for real big companies which, I'm guessing, have substantial
legal departments; perhaps they could pursue getting the algorithms of
possible interest vetted.  Or, I could try to find out whether it's
possible do something similar through EnterpriseDB.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pluggable compression support

2013-06-14 Thread Joshua D. Drake


On 06/14/2013 06:56 PM, Robert Haas wrote:


On Fri, Jun 14, 2013 at 8:45 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2013-06-14 17:35:02 -0700, Josh Berkus wrote:



No. I think as long as we only have pglz and one new algorithm (even if
that is lz4 instead of the current snappy) we should just always use the
new algorithm. Unless I missed it nobody seemed to have voiced a
contrary position?
For testing/evaluation the guc seems to be sufficient.


Then it's not pluggable, is it?  It's upgradable compression
support, if anything.  Which is fine, but let's not confuse people.


The point is that it's pluggable on the storage level in the sense of
that several different algorithms can coexist and new ones can
relatively easily added.
That part is what seems to have blocked progress for quite a while
now. So fixing that seems to be the interesting thing.

I am happy enough to do the work of making it configurable if we want it
to be... But I have zap interest of doing it and throw it away in the
end because we decide we don't need it.


I don't think we need it.  I think what we need is to decide is which
algorithm is legally OK to use.  And then put it in.

In the past, we've had a great deal of speculation about that legal
question from people who are not lawyers.  Maybe it would be valuable
to get some opinions from people who ARE lawyers.  Tom and Heikki both
work for real big companies which, I'm guessing, have substantial
legal departments; perhaps they could pursue getting the algorithms of
possible interest vetted.  Or, I could try to find out whether it's
possible do something similar through EnterpriseDB.


We have IP legal representation through Software in the Public Interest, 
which pretty much specializes in this type of thing.


Should I follow up? If so, I need a summary of the exact question 
including licenses etc.


JD








--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

