Re: [HACKERS] updated emacs configuration

2013-06-14 Thread Daniel Farina
On Thu, Jun 13, 2013 at 6:27 PM, Peter Eisentraut pete...@gmx.net wrote:
 First, I propose adding a .dir-locals.el file to the top-level directory
 with basic emacs settings.  These get applied automatically.  This
 especially covers the particular tab and indentation settings that
 PostgreSQL uses.  With this, casual developers will not need to modify
 any of their emacs settings.

Yes please.  I've had the pgsql stuff in my .emacs for-ever (ever since I
was a student and compelled to do homework on Postgres) and knew the
magical rules about naming the directory, but it always felt so dirty
and very much a 'you need to know the trick' level of intimacy.




Re: [HACKERS] [PATCH] Add transforms feature

2013-06-14 Thread Cédric Villemain



Peter Eisentraut pete...@gmx.net wrote:
A transform is an SQL object that supplies two functions for converting
between data types and procedural languages.  For example, a transform
could arrange that hstore is converted to an appropriate hash or
dictionary object in PL/Perl or PL/Python.

Nice !

Continued from 2013-01 commit fest.  All known open issues have been
fixed.

You kept PGXS style makefile...

--
Sent from my phone; please excuse the brevity.




[HACKERS] Reduce maximum error in tuples estimation after vacuum.

2013-06-14 Thread Kyotaro HORIGUCHI
Hello, 

PostgreSQL estimates the number of live tuples after a vacuum that
has left some buffers unscanned. This estimation does well in most
cases, but it produces a completely different result when there is a
strong imbalance in tuple density.

For example,

create table t (a int, b int);
insert into t (select a, (random() * 10)::int from generate_series((select 
count(*) from t) + 1, 1000000) a);
update t set b = b + 1 where a < (select count(*) from t) * 0.7;
vacuum t;
delete from t where a < (select count(*) from t) * 0.99;

After this, pg_stat_user_tables.n_live_tup shows 417670, which is
about 41 times larger than the real number of rows, 10001. And what
makes it worse, neither autovacuum nor autoanalyze will run until
n_dead_tup goes above roughly 8 times the real number of tuples in
the table, with the default settings.


| postgres=# select n_live_tup, n_dead_tup
|from pg_stat_user_tables where relname='t';
|  n_live_tup | n_dead_tup 
| ------------+------------
|      417670 |          0
| 
| postgres=# select reltuples from pg_class where relname='t';
|  reltuples 
| -----------
|     417670
| 
| postgres=# select count(*) from t;
|  count 
| -------
|  10001

Using n_dead_tup before vacuuming seems to make this better, but I
heard that that plan was abandoned for some reason I don't know.

So I've come up with another plan: use the FSM to estimate the
tuple density in unscanned pages. The point is to make the
estimation rely on the uniformity of tuple length instead of
tuple density. This change seems to keep the error within a few
times the real tuple count. The additional page reads for the FSM
are about 1/4000 (SlotsPerFSMPage) of the skipped pages, and I
suppose this is tolerable during vacuum.

The overall algorithm can be illustrated as below:

 - summing up used bytes, max offnum (PageGetMaxOffsetNumber),
   maximum free bytes for tuple data, and free bytes after page
   vacuum, through all scanned pages.

 - summing up free bytes informed by the FSM through all skipped
   pages.

 - Calculate the mean tuple length from the overall used bytes,
   the sum of max offnums, and the scanned pages.

 - Guess the tuple density in skipped pages using the overall
   free bytes from the FSM and the mean tuple length calculated
   above.

 - Finally, feed the estimated number of live tuples BEFORE the
   vacuum into vac_estimate_reltuples.

Of course this method is affected by imbalance in tuple LENGTH,
but the error also seems to be kept within a few times the number
of tuples.
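
As a rough numeric sketch of the arithmetic above (the numbers are invented
for illustration and are not taken from the patch or the test results):

-- mean tuple length: total used bytes / total used line pointers
select 7200000.0 / 80000 as mean_tuple_len;                     -- 90 bytes

-- tuples guessed for one skipped page from its FSM free space,
-- assuming an 8192-byte block, a 24-byte page header and 3500 free bytes
select trunc((8192 - 24 - 3500) / 90.0) as est_tuples_on_page;  -- 51

Summing such per-page guesses over the skipped pages, plus the tuples actually
counted in the scanned pages, gives the pre-vacuum estimate that is fed to
vac_estimate_reltuples.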

For rows of invariable length, the test against HEAD shows the following,
where "tups est" is pg_class.reltuples and "tups real" is count(*).

del% | pages | n_live_tup | tups est | tups real | est/real | bufs 
-----+-------+------------+----------+-----------+----------+------
 0.9 |  4425 |     100001 |   470626 |    100001 |    4.706 | 3985
0.95 |  4425 |      50001 |   441196 |     50001 |    8.824 | 4206
0.99 |  4425 |     417670 |   417670 |     10001 |   41.763 | 4383

and with the patch

 0.9 |  4425 |     106169 |   106169 |    100001 |    1.062 | 3985
0.95 |  4425 |      56373 |    56373 |     50001 |    1.127 | 4206
0.99 |  4425 |      10001 |    16535 |     10001 |    1.653 | 4383


What do you think about this?

=

The attached files are:

  - vacuum_est_improve_20130614.patch: the patch for this proposal

  - vactest.sql: SQL script to cause the situation

  - vactest.sh: test script to find the errors related to this patch.

  - test_result.txt: all of the test results for the various deletion
    ratios which the test script above yields.

regards,
-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index d6d20fd..1e581c1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1280,7 +1280,8 @@ acquire_sample_rows(Relation onerel, int elevel,
 	*totalrows = vac_estimate_reltuples(onerel, true,
 		totalblocks,
 		bs.m,
-		liverows);
+		liverows,
+		onerel->rd_rel->reltuples);
 	if (bs.m > 0)
 		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
 	else
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 641c740..4bdf0c1 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -501,10 +501,10 @@ double
 vac_estimate_reltuples(Relation relation, bool is_analyze,
 	   BlockNumber total_pages,
 	   BlockNumber scanned_pages,
-	   double scanned_tuples)
+	   double scanned_tuples,
+	   double old_rel_tuples)
 {
 	BlockNumber old_rel_pages = relation->rd_rel->relpages;
-	double		old_rel_tuples = relation->rd_rel->reltuples;
 	double		old_density;
 	double		new_density;
 	double		multiplier;
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 7e46f9e..80304a6 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -396,7 +396,11 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 	double		num_tuples,
 			

[HACKERS] Add visibility map information to pg_freespace.

2013-06-14 Thread Kyotaro HORIGUCHI
Hello,

I've added visibility map information to pg_freespace for my
utility.

It looks like this:

postgres=# select * from pg_freespace('t'::regclass);
 blkno | avail | all_visible 
-------+-------+-------------
     0 |  7424 | t
     1 |  7424 | t
     2 |  7424 | t
     3 |  7424 | t
     4 |  7424 | t
     5 |  7424 | t
     6 |  7424 | t
     7 |  7424 | t
...
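
For instance (assuming the patched extension is installed; this query is just
an illustration, not part of the patch), the new column makes it easy to see
what fraction of a table's pages are currently all-visible:

postgres=# select sum(case when all_visible then 1 else 0 end)::float / count(*) as all_visible_fraction from pg_freespace('t'::regclass);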

What do you think about this?


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/contrib/pg_freespacemap/pg_freespacemap--1.0.sql b/contrib/pg_freespacemap/pg_freespacemap--1.0.sql
index 2adb52a..e38b466 100644
--- a/contrib/pg_freespacemap/pg_freespacemap--1.0.sql
+++ b/contrib/pg_freespacemap/pg_freespacemap--1.0.sql
@@ -9,12 +9,17 @@ RETURNS int2
 AS 'MODULE_PATHNAME', 'pg_freespace'
 LANGUAGE C STRICT;
 
+CREATE FUNCTION pg_is_all_visible(regclass, bigint)
+RETURNS bool
+AS 'MODULE_PATHNAME', 'pg_is_all_visible'
+LANGUAGE C STRICT;
+
 -- pg_freespace shows the recorded space avail at each block in a relation
 CREATE FUNCTION
-  pg_freespace(rel regclass, blkno OUT bigint, avail OUT int2)
+  pg_freespace(rel regclass, blkno OUT bigint, avail OUT int2, all_visible OUT bool)
 RETURNS SETOF RECORD
 AS $$
-  SELECT blkno, pg_freespace($1, blkno) AS avail
+  SELECT blkno, pg_freespace($1, blkno) AS avail, pg_is_all_visible($1, blkno) AS all_visible
   FROM generate_series(0, pg_relation_size($1) / current_setting('block_size')::bigint - 1) AS blkno;
 $$
 LANGUAGE SQL;
diff --git a/contrib/pg_freespacemap/pg_freespacemap.c b/contrib/pg_freespacemap/pg_freespacemap.c
index f6f7d2e..de4eff7 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.c
+++ b/contrib/pg_freespacemap/pg_freespacemap.c
@@ -10,17 +10,20 @@
 
 #include "funcapi.h"
 #include "storage/freespace.h"
+#include "access/visibilitymap.h"
 
 
 PG_MODULE_MAGIC;
 
 Datum		pg_freespace(PG_FUNCTION_ARGS);
+Datum		pg_is_all_visible(PG_FUNCTION_ARGS);
 
 /*
  * Returns the amount of free space on a given page, according to the
  * free space map.
  */
 PG_FUNCTION_INFO_V1(pg_freespace);
+PG_FUNCTION_INFO_V1(pg_is_all_visible);
 
 Datum
 pg_freespace(PG_FUNCTION_ARGS)
@@ -38,7 +41,32 @@ pg_freespace(PG_FUNCTION_ARGS)
  errmsg("invalid block number")));
 
 	freespace = GetRecordedFreeSpace(rel, blkno);
-
 	relation_close(rel, AccessShareLock);
 	PG_RETURN_INT16(freespace);
 }
+
+Datum
+pg_is_all_visible(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	int64		blkno = PG_GETARG_INT64(1);
+	Buffer  vmbuffer = InvalidBuffer;
+	int			all_visible;
+	Relation	rel;
+
+	rel = relation_open(relid, AccessShareLock);
+
+	if (blkno < 0 || blkno > MaxBlockNumber)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid block number")));
+
+	all_visible = visibilitymap_test(rel, blkno, &vmbuffer);
+	if (vmbuffer != InvalidBuffer)
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+	relation_close(rel, AccessShareLock);
+	PG_RETURN_BOOL(all_visible);
+}
diff --git a/contrib/pg_freespacemap/pg_freespacemap.control b/contrib/pg_freespacemap/pg_freespacemap.control
index 34b695f..395350a 100644
--- a/contrib/pg_freespacemap/pg_freespacemap.control
+++ b/contrib/pg_freespacemap/pg_freespacemap.control
@@ -1,5 +1,5 @@
 # pg_freespacemap extension
-comment = 'examine the free space map (FSM)'
+comment = 'examine the free space map (FSM) and visibility map (VM)'
 default_version = '1.0'
 module_pathname = '$libdir/pg_freespacemap'
 relocatable = true



Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-14 Thread KONDO Mitsumasa

(2013/06/12 23:07), Robert Haas wrote:

On Mon, Jun 10, 2013 at 3:48 PM, Simon Riggs si...@2ndquadrant.com wrote:

On 10 June 2013 11:51, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote:

I create patch which is improvement of checkpoint IO scheduler for stable
transaction responses.


Looks like good results, with good measurements. Should be an
interesting discussion.


+1.

I suspect we want to poke at the algorithms a little here and maybe
see if we can do this without adding new GUCs.  Also, I think this is
probably two separate patches, in the end.  But the direction seems
good to me.

Thank you for the comment!

I separated my patch into a checkpoint-write part and a checkpoint-fsync part.
As you say, my patch has a lot of new GUCs. I don't think they can all be
decided automatically. However, it is difficult to make the checkpoint
scheduler suitable for every environment, such as virtual servers, public
cloud servers, and embedded servers. So the default parameter settings work
the same as before. Setting the parameters is primitive and difficult, but if
we can set them correctly, the scheduler suits a lot of environments and will
not behave in unintended ways.


I will try to take into consideration a version with fewer GUCs, and if you
have a good idea, please discuss it!
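
For illustration only (these parameter values are made up, not defaults from
the patch): with checkpoint_completion_target = 0.5, a smooth target of 0.3
and a margin of 0, a checkpoint that is 10% done would be scheduled against

select 0.1 * (((0.3 - 0.1) / 0.3) * (0.0 + 1 - 0.5) + 0.5);   -- about 0.083

instead of the plain 0.1 * 0.5 = 0.05, so (if I read the patch correctly) the
scheduler considers itself further ahead early in the checkpoint and writes
less aggressively, converging to the normal schedule as progress approaches
the smooth target.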


Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..0c0f215 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -141,9 +141,12 @@ static CheckpointerShmemStruct *CheckpointerShmem;
 /*
  * GUC parameters
  */
+int			CheckPointerWriteDelay = 200;
 int			CheckPointTimeout = 300;
 int			CheckPointWarning = 30;
 double		CheckPointCompletionTarget = 0.5;
+double		CheckPointSmoothTarget = 0.0;
+double		CheckPointSmoothMargin = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -715,7 +718,7 @@ CheckpointWriteDelay(int flags, double progress)
 		 * Checkpointer and bgwriter are no longer related so take the Big
 		 * Sleep.
 		 */
-		pg_usleep(100000L);
+		pg_usleep(CheckPointerWriteDelay * 1000L);
 	}
 	else if (--absorb_counter = 0)
 	{
@@ -742,14 +745,36 @@ IsCheckpointOnSchedule(double progress)
 {
 	XLogRecPtr	recptr;
 	struct timeval now;
-	double		elapsed_xlogs,
+	double		original_progress,
+			elapsed_xlogs,
 elapsed_time;
 
 	Assert(ckpt_active);
 
-	/* Scale progress according to checkpoint_completion_target. */
-	progress *= CheckPointCompletionTarget;
+	/* This variable is used by smooth checkpoint schedule.*/
+	original_progress = progress * CheckPointCompletionTarget;
 
+	/* Scale progress according to checkpoint_completion_target and checkpoint_smooth_target. */
+	if (progress >= CheckPointSmoothTarget)
+	{
+		/* Normal checkpoint schedule. */
+		progress *= CheckPointCompletionTarget;
+	}
+	else
+	{
+		/*
+		 * Smooth checkpoint schedule.
+		 *
+		 * When initial checkpoint, it tends to be high IO load average
+		 * and slow executing transactions. This schedule reduces them
+		 * and improves IO response. As 'progress' approaches CheckPointSmoothTarget,
+		 * it becomes near the normal checkpoint schedule. If you want a more
+		 * smooth checkpoint schedule, set CheckPointSmoothTarget higher.
+		 */
+		progress *= ((CheckPointSmoothTarget - progress) / CheckPointSmoothTarget) *
+(CheckPointSmoothMargin + 1 - CheckPointCompletionTarget) +
+CheckPointCompletionTarget;
+	}
 	/*
 	 * Check against the cached value first. Only do the more expensive
 	 * calculations once we reach the target previously calculated. Since
@@ -779,6 +804,14 @@ IsCheckpointOnSchedule(double progress)
 			ckpt_cached_elapsed = elapsed_xlogs;
 			return false;
 		}
+		else if (original_progress < elapsed_xlogs)
+		{
+			ckpt_cached_elapsed = elapsed_xlogs;
+
+			/* smooth checkpoint write */
+			pg_usleep(CheckPointerWriteDelay * 1000L);
+			return false;
+		}
 	}
 
 	/*
@@ -793,6 +826,14 @@ IsCheckpointOnSchedule(double progress)
 		ckpt_cached_elapsed = elapsed_time;
 		return false;
 	}
+	else if (original_progress < elapsed_time)
+	{
+		ckpt_cached_elapsed = elapsed_time;
+
+		/* smooth checkpoint write */
+		pg_usleep(CheckPointerWriteDelay * 1000L);
+		return false;
+	}
 
 	/* It looks like we're on schedule. */
 	return true;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..d41dc17 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"checkpointer_write_delay", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+			gettext_noop("checkpointer sleep time during dirty buffers write in checkpoint."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&CheckPointerWriteDelay,
+		200, 10, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared 

Re: [HACKERS] Reduce maximum error in tuples estimation after vacuum.

2013-06-14 Thread Kyotaro HORIGUCHI
Sorry, I made a mistake.

Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp:

 Overall algorithm could be illistrated as below,

  - summing up used bytes, max offnum(PageGetMaxOffsetNumber),

Not the max offnum, but the number of linp's used after page vacuum.

maximum free bytes for tuple data , and free bytes after page
vacuum through all scanned pages.

  - summing up free bytes informed by FSM through all skipped
pages.

  - Calculate mean tuple length from the overall used bytes and
sum of max offnums, and scanned pages.

The same applies here: not the sum of max offnums but the total of used
entries (linps).

  - Guess tuple density in skipped pages using overall free bytes
from FSM and the mean tuple length calculated above.

  - Finally, feed estimated number of the live tuples BEFORE
vacuum into vac_estimate_reltuples.

regards,

-- 
Kyotaro Horiguchi


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Benedikt Grundmann
On Fri, Jun 14, 2013 at 10:11 AM, Samrat Revagade revagade.sam...@gmail.com
 wrote:

 Hello,


 We have already started a discussion on pgsql-hackers for the problem of
 taking fresh backup during the failback operation here is the link for that:




 http://www.postgresql.org/message-id/caf8q-gxg3pqtf71nvece-6ozraew5pwhk7yqtbjgwrfu513...@mail.gmail.com



 Let me again summarize the problem we are trying to address.



 When the master fails, last few WAL files may not reach the standby. But
 the master may have gone ahead and made changes to its local file system
 after flushing WAL to the local storage.  So master contains some file
 system level changes that standby does not have.  At this point, the data
 directory of master is ahead of standby's data directory.

 Subsequently, the standby will be promoted as new master.  Later when the
 old master wants to be a standby of the new master, it can't just join the
 setup since there is inconsistency in between these two servers. We need to
 take the fresh backup from the new master.  This can happen in both the
 synchronous as well as asynchronous replication.



 Fresh backup is also needed in case of clean switch-over because in the
 current HEAD, the master does not wait for the standby to receive all the
 WAL up to the shutdown checkpoint record before shutting down the
 connection. Fujii Masao has already submitted a patch to handle clean
 switch-over case, but the problem is still remaining for failback case.



 The process of taking fresh backup is very time consuming when databases
 are of very big sizes, say several TB's, and when the servers are connected
 over a relatively slower link.  This would break the service level
 agreement of disaster recovery system.  So there is need to improve the
 process of disaster recovery in PostgreSQL.  One way to achieve this is to
 maintain consistency between master and standby which helps to avoid need
 of fresh backup.



 So our proposal on this problem is that we must ensure that master should
 not make any file system level changes without confirming that the
 corresponding WAL record is replicated to the standby.



An alternative proposal (which will probably just reveal my lack of
understanding about what is or isn't possible with WAL).  Provide a way to
restart the master so that it rolls back the WAL changes that the slave
hasn't seen.

 There are many suggestions and objections on pgsql-hackers about this problem.
 The brief summary is as follows:





Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Samrat Revagade
That will not happen if there is an inconsistency between the two servers.

Please refer to the discussions on the link provided in the first post:

http://www.postgresql.org/message-id/caf8q-gxg3pqtf71nvece-6ozraew5pwhk7yqtbjgwrfu513...@mail.gmail.com

Regards,

Samrat Revagade


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Heikki Linnakangas

On 14.06.2013 12:11, Samrat Revagade wrote:

We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:

http://www.postgresql.org/message-id/caf8q-gxg3pqtf71nvece-6ozraew5pwhk7yqtbjgwrfu513...@mail.gmail.com

Let me again summarize the problem we are trying to address.

When the master fails, last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after flushing WAL to the local storage.  So master contains some file
system level changes that standby does not have.  At this point, the data
directory of master is ahead of standby's data directory.

Subsequently, the standby will be promoted as new master.  Later when the
old master wants to be a standby of the new master, it can't just join the
setup since there is inconsistency in between these two servers. We need to
take the fresh backup from the new master.  This can happen in both the
synchronous as well as asynchronous replication.


Did you see the thread on the little tool I wrote called pg_rewind?

http://www.postgresql.org/message-id/519df910.4020...@vmware.com

It solves that problem, for both clean and unexpected shutdown. It needs 
some more work and a lot more testing, but requires no changes to the 
backend. Robert Haas pointed out in that thread that it has a problem 
with hint bits that are not WAL-logged, but it will still work if you 
also enable the new checksums feature, which forces hint bit updates to 
be WAL-logged. Perhaps we could add a GUC to enable hint bits to be 
WAL-logged, regardless of checksums, to make pg_rewind work.


I think that's a more flexible approach to solve this problem. It 
doesn't require an online feedback loop from the standby to master, for 
starters.


- Heikki




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Pavan Deolasee
On Fri, Jun 14, 2013 at 4:12 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

  Robert Haas pointed out in that thread that it has a problem with hint
 bits that are not WAL-logged,


I liked that tool a lot until Robert pointed out the above problem. I
thought this is a show stopper because I can't really see any way to
circumvent it unless we enable checksums or explicitly WAL log hint bits.


 but it will still work if you also enable the new checksums feature, which
 forces hint bit updates to be WAL-logged.


Are we expecting a lot of people to run their clusters with checksums on ?
Sorry, I haven't followed the checksum discussions and don't know how much
overhead it causes. But if the general expectation is that checksums will
be turned on most often, I agree pg_rewind is probably good enough.


 Perhaps we could add a GUC to enable hint bits to be WAL-logged,
 regardless of checksums, to make pg_rewind work.


Wouldn't that be too costly ? I mean, in the worst case every hint bit on a
page may get updated separately. If each such update is WAL logged, we are
looking for a lot more unnecessary WAL traffic.


 I think that's a more flexible approach to solve this problem. It doesn't
 require an online feedback loop from the standby to master, for starters.


I agree. That's a big advantage of pg_rewind. Unfortunately, it can't work
with 9.3 and below because of the hint bits issue, otherwise it would have
been even more cool.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Pavan Deolasee
On Fri, Jun 14, 2013 at 2:51 PM, Benedikt Grundmann 
bgrundm...@janestreet.com wrote:


 A alternative proposal (which will probably just reveal my lack of
 understanding about what is or isn't possible with WAL).  Provide a way to
 restart the master so that it rolls back the WAL changes that the slave
 hasn't seen.


WAL records in PostgreSQL can only be used for physical redo. They can not
be used for undo. So what you're suggesting is not possible though I am
sure a few other databases do that.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Heikki Linnakangas

On 14.06.2013 14:06, Pavan Deolasee wrote:

On Fri, Jun 14, 2013 at 4:12 PM, Heikki Linnakangashlinnakan...@vmware.com

wrote:



  Robert Haas pointed out in that thread that it has a problem with hint
bits that are not WAL-logged,


I liked that tool a lot until Robert pointed out the above problem. I
thought this is a show stopper because I can't really see any way to
circumvent it unless we enable checksums or explicitly WAL log hint bits.


but it will still work if you also enable the new checksums feature, which
forces hint bit updates to be WAL-logged.


Are we expecting a lot of people to run their clusters with checksums on ?
Sorry, I haven't followed the checksum discussions and don't know how much
overhead it causes. But if the general expectation is that checksums will
be turned on most often, I agree pg_rewind is probably good enough.


Well, time will tell I guess. The biggest overhead with the checksums is 
exactly the WAL-logging of hint bits.



Perhaps we could add a GUC to enable hint bits to be WAL-logged,
regardless of checksums, to make pg_rewind work.


Wouldn't that be too costly ? I mean, in the worst case every hint bit on a
page may get updated separately. If each such update is WAL logged, we are
looking for a lot more unnecessary WAL traffic.


Yep, same as with checksums. I was not very enthusiastic about the 
checksums patch because of that, but a lot of people are willing to pay 
that price. Maybe we can figure out a way to reduce that cost in 9.4. 
It'd benefit the checksums greatly.


For pg_rewind, we wouldn't actually need a full-page image for hint bit 
updates, just a small record saying "hey, I touched this page". And 
you'd only need to write that the first time a page is touched after a 
checkpoint.



I think that's a more flexible approach to solve this problem. It doesn't
require an online feedback loop from the standby to master, for starters.


I agree. That's a big advantage of pg_rewind. Unfortunately, it can't work
with 9.3 and below because of the hint bits issue, otherwise it would have
been even more cool.


The proposed patch is clearly not 9.3 material either. If anything, 
there's a much better chance that we could still sneak in a GUC to allow 
hint bits to be WAL-logged without checksums in 9.3. All the code is 
there, it'd just be a new GUC to control it separately from checksums.


- Heikki




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Greg Stark
On Fri, Jun 14, 2013 at 12:20 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 For pg_rewind, we wouldn't actually need a full-page image for hint bit
 updates, just a small record saying "hey, I touched this page". And you'd
 only need to write that the first time a page is touched after a checkpoint.

I would expect that to be about the same cost though. The latency for
the fsync on the wal record before being able to flush the buffer is
the biggest cost.


 The proposed patch is clearly not 9.3 material either. If anything, there's
 a much better chance that we could still sneak in a GUC to allow hint bits
 to be WAL-logged without checksums in 9.3. All the code is there, it'd just
 be a new GUC to control it separately from checksums.

On the other hand if you're going to wal log the hint bits why not
enable checksums?

Do we allow turning off checksums after a database is initdb'd? IIRC
we can't turn it on later but I don't see why we couldn't turn them
off.


-- 
greg




Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Marko Kreen
On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com wrote:
 Attached is a patch implementing a new aggregate function md5_agg() to
 compute the aggregate MD5 sum across a number of rows. This is
 something I've wished for a number of times. I think the primary use
 case is to do a quick check that 2 tables, possibly on different
 servers, contain the same data, using a query like

   SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

 or

   SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

 these would be equivalent to

   SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo;

 but without the excessive memory consumption for the intermediate
 concatenated string, and the resulting 1GB table size limit.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.
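
For example, something along these lines (a sketch only, not from the patch;
it folds the first 16 hex digits of each per-row md5 into a bigint, and sum()
over bigint returns numeric, so it cannot overflow):

  SELECT sum(('x' || substr(md5(foo::text), 1, 16))::bit(64)::bigint) FROM foo;

XORing the per-row values instead, as suggested downthread, would need a small
custom aggregate.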

-- 
marko




[HACKERS] Issue with PGC_BACKEND parameters

2013-06-14 Thread Amit Kapila
I had observed one problem with PGC_BACKEND parameters while testing patch
for ALTER SYSTEM command.

Problem statement: If I change PGC_BACKEND parameters directly in
postgresql.conf and then do pg_reload_conf() and reconnect, it will 
   still show the old value.
Detailed steps
1. Start server with default settings
2. Connect Client
3. show log_connections; -- it will show as off, this is correct.
4. Change log_connections in postgresql.conf to on
5. issue command select pg_reload_conf() in client (which is started in
step-2)
6. Connect a new client
7. show log_connections; -- it will show as off, this is incorrect.

The problem is in step-7, it should show as on.

This problem occurs only on Windows.
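
For reference, the same report as a bare psql sequence (a hypothetical
session; it assumes log_connections is flipped from off to on in
postgresql.conf between the two SHOW commands):

show log_connections;      -- 'off'
select pg_reload_conf();   -- run in the already-connected session
-- open a NEW connection, then:
show log_connections;      -- still 'off' on Windows, per the report above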

The reason for this problem is that on Windows, when a new session is
started, it loads the changed parameters into the new backend via the
global/config_exec_params file. The flow is
SubPostmasterMain() -> read_nondefault_variables() -> set_config_option().

In the code below, in function set_config_option(), changing a PGC_BACKEND
variable is not allowed, and the comments even mention that only the
postmaster is allowed to change it and that the change will propagate to
subsequently started backends, but this is not TRUE for Windows.

switch (record->context)
{
..
..
case PGC_BACKEND:
	if (context == PGC_SIGHUP)
	{
		/*
		 * If a PGC_BACKEND parameter is changed in the config file,
		 * we want to accept the new value in the postmaster (whence
		 * it will propagate to subsequently-started backends), but
		 * ignore it in existing backends.  This is a tad klugy, but
		 * necessary because we don't re-read the config file during
		 * backend start.
		 */
		if (IsUnderPostmaster)
			return -1;
	}

}


I think that to fix the issue we need to pass into set_config_option()
the information about whether a PGC_BACKEND parameter is allowed to change.
One way is to pass a new parameter.

Kindly let me know your suggestions. 

With Regards,
Amit Kapila.





Re: [HACKERS] SPGist triple parity concept doesn't work

2013-06-14 Thread Teodor Sigaev



Anyway I now think that we might be better off with the other idea of
abandoning an insertion and retrying if we get a lock conflict.


done, look at the patch.

I was faced with the fact that my mail is considered spam by postgresql.org, so 
I repeat some thoughts from the previous mail:


I considered the idea of forbidding placement of a child on the same page as its 
parent, but this implementation a) could significantly increase the size of the 
index, and b) doesn't solve Greg's point.


We definitely need a new idea for the locking protocol and I'll return to this 
problem in autumn (sorry, I haven't time in summer to do this research).


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


spgist_deadlock-1.patch.gz
Description: Unix tar archive



Re: [HACKERS] [PATCH] Add transforms feature

2013-06-14 Thread Peter Eisentraut
On 6/14/13 3:46 AM, Cédric Villemain wrote:
 You kept PGXS style makefile...

I know, but that's a separate issue that hasn't been decided yet.




Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib

2013-06-14 Thread Peter Eisentraut
On 6/13/13 9:20 PM, amul sul wrote:
 Agreed, only if we consider that these contrib modules are always going to be 
 deployed with PostgreSQL.
 But what if a user is going to install such a module elsewhere, i.e. not from 
 the contrib directory of the pg source?

Why would anyone do that?




Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib

2013-06-14 Thread Amit Langote
On Fri, Jun 14, 2013 at 9:35 PM, Peter Eisentraut pete...@gmx.net wrote:
 On 6/13/13 9:20 PM, amul sul wrote:
 Agreed, only if we consider that these contrib modules are always going to be 
 deployed with PostgreSQL.
 But what if a user is going to install such a module elsewhere, i.e. not from 
 the contrib directory of the pg source?

 Why would anyone do that?

Is he perhaps saying install such a module *from* elsewhere? Like
directly from the source directory of a module, using something like the
following:

cd /path/to/module-source
make USE_PGXS=1 PG_CONFIG=/path/to/pg_config
make USE_PGXS=1 PG_CONFIG=/path/to/pg_config install

When the user does not work with the pg source directly and does not have
postgresql-contrib installed?
Am I missing something here?

--
Amit Langote




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 Well, time will tell I guess. The biggest overhead with the checksums is 
 exactly the WAL-logging of hint bits.

Refresh my memory as to why we need to WAL-log hints for checksumming?
I just had my nose in the part of the checksum patch that tediously
copies entire pages out of shared buffers to avoid possible instability
of the hint bits while we checksum and write the page.  Given that we're
paying that cost, I don't see why we'd need to do any extra WAL-logging
(above and beyond the log-when-freeze cost that we have to pay already).
But I've not absorbed any caffeine yet today, so maybe I'm just missing
it.

regards, tom lane




Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com 
 wrote:
 Attached is a patch implementing a new aggregate function md5_agg() to
 compute the aggregate MD5 sum across a number of rows.

 It's more efficient to calculate per-row md5, and then sum() them.
 This avoids the need for ORDER BY.

Good point.  The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.
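
That xor variant could be had with a tiny custom aggregate, for example (a
sketch only; bit_xor_agg is a made-up name, int8xor is the function behind
the # operator):

  CREATE AGGREGATE bit_xor_agg (bigint)
    (sfunc = int8xor, stype = bigint, initcond = '0');

  SELECT bit_xor_agg(('x' || substr(md5(foo::text), 1, 16))::bit(64)::bigint)
  FROM foo;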

regards, tom lane




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Amit Kapila
On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
 Hello,

 We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:
 

http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb
jgwrfu513...@mail.gmail.com
 
 Let me again summarize the problem we are trying to address.
 
 When the master fails, last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after  flushing WAL to the local storage.  So master contains some file
system level changes that standby does not have.  At this point, the data
directory of  master is ahead of standby's data directory.
 Subsequently, the standby will be promoted as new master.  Later when the
old master wants to be a standby of the new master, it can't just join the 
 setup since there is inconsistency in between these two servers. We need
to take the fresh backup from the new master.  This can happen in both the 
 synchronous as well as asynchronous replication.
 
 Fresh backup is also needed in case of clean switch-over because in the
current HEAD, the master does not wait for the standby to receive all the
WAL 
 up to the shutdown checkpoint record before shutting down the connection.
Fujii Masao has already submitted a patch to handle clean switch-over case, 
 but the problem is still remaining for failback case.
 
 The process of taking fresh backup is very time consuming when databases
are of very big sizes, say several TB's, and when the servers are connected 
 over a relatively slower link.  This would break the service level
agreement of disaster recovery system.  So there is need to improve the
process of 
 disaster recovery in PostgreSQL.  One way to achieve this is to maintain
consistency between master and standby which helps to avoid need of fresh 
 backup.
 
 So our proposal on this problem is that we must ensure that master should
not make any file system level changes without confirming that the 
 corresponding WAL record is replicated to the standby.
 
How will you take care of the extra WAL on the old master during recovery? If it
replays WAL which has not reached the new master, it can be a problem.

With Regards,
Amit Kapila.





Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Heikki Linnakangas

On 14.06.2013 16:08, Tom Lane wrote:

Heikki Linnakangashlinnakan...@vmware.com  writes:

Well, time will tell I guess. The biggest overhead with the checksums is
exactly the WAL-logging of hint bits.


Refresh my memory as to why we need to WAL-log hints for checksumming?


Torn pages:

1. Backend sets a hint bit, dirtying the buffer.
2. Checksum is calculated, and buffer is written out to disk.
3. crash

If the page is torn, the checksum won't match. Without checksums, a torn 
page is not a problem with hint bits, as a single bit can't be torn and 
the page is otherwise intact. But with checksums, it causes a checksum 
failure.


- Heikki




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 14.06.2013 16:08, Tom Lane wrote:
 Refresh my memory as to why we need to WAL-log hints for checksumming?

 Torn pages:

So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?

regards, tom lane




Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Benedikt Grundmann
On Fri, Jun 14, 2013 at 2:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Marko Kreen mark...@gmail.com writes:
  On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com
 wrote:
  Attached is a patch implementing a new aggregate function md5_agg() to
  compute the aggregate MD5 sum across a number of rows.

  It's more efficient to calculate per-row md5, and then sum() them.
  This avoids the need for ORDER BY.

 Good point.  The aggregate md5 function also fails to distinguish the
 case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
 from the case where they contain 'xyz' followed by 'zyxyz'.

 Now, as against that, you lose any sensitivity to the ordering of the
 values.

 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.

 regards, tom lane


xor works but only if each row is different (e.g. at the very least all
columns together make a unique key).








Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Andres Freund
On 2013-06-14 09:08:15 -0400, Tom Lane wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
  Well, time will tell I guess. The biggest overhead with the checksums is 
  exactly the WAL-logging of hint bits.
 
 Refresh my memory as to why we need to WAL-log hints for checksumming?
 I just had my nose in the part of the checksum patch that tediously
 copies entire pages out of shared buffers to avoid possible instability
 of the hint bits while we checksum and write the page.

I am really rather uncomfortable with that piece of code, and I hacked
it up after Jeff Janes had reported a bug there (the one aborting WAL
replay too early...). So I am very happy that you are looking at it.

Jeff Davis and I were talking about whether the usage of
PGXAC-delayChkpt makes the whole thing sufficiently safe at pgcon - we
couldn't find any real danger but...

 Given that we're
 paying that cost, I don't see why we'd need to do any extra WAL-logging
 (above and beyond the log-when-freeze cost that we have to pay already).
 But I've not absorbed any caffeine yet today, so maybe I'm just missing
 it.

The usual torn page spiel I think. If we crash while only one half of
the page made it to disk we would get spurious checksum failures from
there on.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib

2013-06-14 Thread Andrew Dunstan


On 06/14/2013 08:35 AM, Peter Eisentraut wrote:

On 6/13/13 9:20 PM, amul sul wrote:

Agreed, only if we consider that these contrib modules are always going to be 
deployed with PostgreSQL.
But what if a user is going to install such a module elsewhere, i.e. not from 
the contrib directory of the pg source?

Why would anyone do that?




Maybe they wouldn't.

I do think we need to make sure that we have at least buildfarm coverage 
of pgxs module building and testing. I have some coverage of a few 
extensions I have written, which exercise that, so maybe that will 
suffice. If not, maybe we need to have one module that only builds via 
pgxs and is built after an install (i.e. not via the standard contrib 
build).


I don't really like the directory layout we use for these modules 
anyway, so I'm not sure they constitute best practice for extension 
builders. Lately I have been using an extension skeleton that looks 
something like this:


License
Readme.md
META.json (for pgxn)
extension.control
Makefile
doc/extension.md (soft linked to ../Readme.md)
src/extension.c
sql/extension.sql
test/sql/extension.sql
test/expected/extension.out


cheers

andrew





Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Heikki Linnakangas

On 14.06.2013 16:21, Tom Lane wrote:

Heikki Linnakangashlinnakan...@vmware.com  writes:

On 14.06.2013 16:08, Tom Lane wrote:

Refresh my memory as to why we need to WAL-log hints for checksumming?



Torn pages:


So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?


Correct. We're doing the latter, see XLogSaveBufferForHint().

- Heikki




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Andres Freund
On 2013-06-14 09:21:52 -0400, Tom Lane wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
  On 14.06.2013 16:08, Tom Lane wrote:
  Refresh my memory as to why we need to WAL-log hints for checksumming?
 
  Torn pages:
 
 So it's not that we actually need to log the individual hint bit
 changes, it's that we need to WAL-log a full page image on the first
 update after a checkpoint, so as to recover from torn-page cases.
 Which one are we doing?

MarkBufferDirtyHint() logs an FPI (just not via a BKP block) via
XLogSaveBufferForHint() iff XLogCheckBuffer() says we need to by
comparing GetRedoRecPtr() with the page's lsn.
Otherwise we don't do anything besides marking the buffer dirty.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Marko Kreen mark...@gmail.com writes:
  On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com 
  wrote:
  Attached is a patch implementing a new aggregate function md5_agg() to
  compute the aggregate MD5 sum across a number of rows.
 
  It's more efficient to calculate per-row md5, and then sum() them.
  This avoids the need for ORDER BY.
 
 Good point.  The aggregate md5 function also fails to distinguish the
 case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
 from the case where they contain 'xyz' followed by 'zyxyz'.
 
 Now, as against that, you lose any sensitivity to the ordering of the
 values.
 
 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.

Where I'd take this is actually in a completely different direction..
I'd like the aggregate to be able to match the results of running the
'md5sum' unix utility on a file that's been COPY'd out.  Yes, that means
we'd need a way to get back "what would this row look like if it was
sent through COPY with these parameters", but I've long wanted that
also.

No, no clue about how to put all that together.  Yes, having this would
be better than nothing, so I'm still for adding this even if we can't
make it match COPY output. :)

Thanks,

Stephen




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Heikki Linnakangas

On 14.06.2013 16:15, Andres Freund wrote:

On 2013-06-14 09:08:15 -0400, Tom Lane wrote:

I just had my nose in the part of the checksum patch that tediously
copies entire pages out of shared buffers to avoid possible instability
of the hint bits while we checksum and write the page.


I am really rather uncomfortable with that piece of code, and I hacked
it up after Jeff Janes had reported a bug there (The one aborting WAL
replay to early...). So I am very happy that you are looking at it.


Hmm. In XLogSaveBufferForHint():


 * Note that this only works for buffers that fit the standard page model,
 * i.e. those for which buffer_std == true


The free-space-map uses non-standard pages, and MarkBufferDirtyHint(). 
Isn't that completely broken for the FSM? If I'm reading it correctly, 
what will happen is that replay will completely zero out all FSM pages 
that have been touched. All the FSM data is between pd_lower and 
pd_upper, which on standard pages is the hole.


- Heikki




Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Andrew Dunstan


On 06/14/2013 09:40 AM, Stephen Frost wrote:

* Tom Lane (t...@sss.pgh.pa.us) wrote:

Marko Kreen mark...@gmail.com writes:

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

Good point.  The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

Where I'd take this is actually in a completely different direction..
I'd like the aggregate to be able to match the results of running the
'md5sum' unix utility on a file that's been COPY'd out.  Yes, that means
we'd need a way to get back what would this row look like if it was
sent through COPY with these parameters, but I've long wanted that
also.

No, no clue about how to put all that together.  Yes, having this would
be better than nothing, so I'm still for adding this even if we can't
make it match COPY output. :)






I'd rather go the other way, processing the records without having to 
process them otherwise at all. Turning things into text must slow things 
down, surely.


cheers

andrew




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Andres Freund
On 2013-06-14 09:21:52 -0400, Tom Lane wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
  On 14.06.2013 16:08, Tom Lane wrote:
  Refresh my memory as to why we need to WAL-log hints for checksumming?
 
  Torn pages:
 
 So it's not that we actually need to log the individual hint bit
 changes, it's that we need to WAL-log a full page image on the first
 update after a checkpoint, so as to recover from torn-page cases.
 Which one are we doing?


From quickly looking at the code again I think the MarkBufferDirtyHint()
code makes at least one assumption that isn't correct in the fact of
checksums.

It tests for the need to dirty the page with:
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))

*before* taking a lock. A comment explains why that is safe:

 * Since we make this test unlocked, there's a chance we
 * might fail to notice that the flags have just been cleared, and 
failed
 * to reset them, due to memory-ordering issues.

That's fine for the classical usecase without checksums but what about
the following scenario:

1) page is dirtied, FPI is logged
2) SetHintBits gets called on the same page, holding only a share lock
3) checkpointer/bgwriter/... writes out the the page, clearing the dirty
   flag
4) checkpoint finishes, updates redo ptr
5) SetHintBits actually modifies the hint bits
6) SetHintBits calls MarkBufferDirtyHint which doesn't notice that the
   page isn't dirty anymore and thus doesn't check whether something
   needs to get logged.

At this point we have a page that has been modified without an FPI. But
it's not marked dirty, so it won't be written out without further
cause. Which might be fine since there's no cause to write out the page
and there probably won't be anyone doing that without logging an FPI
independently.
Can anybody see a scenario where this is actually dangerous?

Since

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Heikki Linnakangas

On 14.06.2013 17:01, Andres Freund wrote:

At this point we have a page that has been modified without an FPI. But
it's not marked dirty, so it won't be written out without further
cause. Which might be fine since there's no cause to write out the page
and there probably won't be anyone doing that without logging an FPI
independently.
Can anybody see a scenario where this is actually dangerous?


The code also relies on that being safe during recovery:


 * If we're in recovery we cannot dirty a page because 
of a hint.
 * We can set the hint, just not dirty the page as a 
result so the
 * hint is lost when we evict the page or shutdown.
 *
 * See src/backend/storage/page/README for longer 
discussion.
 */
if (RecoveryInProgress())
return;


I can't immediately see a problem with that.

- Heikki




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Andres Freund
On 2013-06-14 16:58:38 +0300, Heikki Linnakangas wrote:
 On 14.06.2013 16:15, Andres Freund wrote:
 On 2013-06-14 09:08:15 -0400, Tom Lane wrote:
 I just had my nose in the part of the checksum patch that tediously
 copies entire pages out of shared buffers to avoid possible instability
 of the hint bits while we checksum and write the page.
 
 I am really rather uncomfortable with that piece of code, and I hacked
 it up after Jeff Janes had reported a bug there (The one aborting WAL
 replay to early...). So I am very happy that you are looking at it.
 
 Hmm. In XLogSaveBufferForHint():
 
  * Note that this only works for buffers that fit the standard page model,
  * i.e. those for which buffer_std == true
 
 The free-space-map uses non-standard pages, and MarkBufferDirtyHint(). Isn't
 that completely broken for the FSM? If I'm reading it correctly, what will
 happen is that replay will completely zero out all FSM pages that have been
 touched. All the FSM data is between pd_lower and pd_upper, which on
 standard pages is the hole.

Jeff Davis has a patch pending
(1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
were on track committing that, but rereading the thread it doesn't look
that way.

Jeff, care to update that patch?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Greg Smith

On 6/12/13 3:19 AM, Fabien COELHO wrote:

If you are still worried: if you run the very same command without
throttling and measure the same latency, does the same thing happens at
the end? My guess is that it should be yes. If it is no, I'll try out
pgbench-tools.


It looks like it happens rarely for one client without the rate limit, 
but that increases to every time for multiple clients with limiting in 
place.  pgbench-tools just graphs the output from the latency log. 
Here's a setup that runs the test I'm doing:


$ createdb pgbench
$ pgbench -i -s 10 pgbench
$ pgbench -S -c 25 -T 30 -l pgbench && tail -n 40 pgbench_log*

Sometimes there's no slow entries. but I've seen this once so far:

0 21822 1801 0 1371217462 945264
1 21483 1796 0 1371217462 945300
8 20891 1931 0 1371217462 945335
14 20520 2084 0 1371217462 945374
15 20517 1991 0 1371217462 945410
16 20393 1928 0 1371217462 945444
17 20183 2000 0 1371217462 945479
18 20277 2209 0 1371217462 945514
23 20316 2114 0 1371217462 945549
22 20267 250128 0 1371217463 193656

The third column is the latency for that transaction.  Notice how it's a 
steady ~2000 us except for the very last transaction, which takes 
250,128 us.  That's the weird thing; these short SELECT statements 
should never take that long.  It suggests there's something weird 
happening with how the client exits, probably that its latency number is 
being computed after more work than it should.


Here's what a rate limited run looks like for me.  Note that I'm still 
using the version I re-submitted since that's where I ran into this 
issue, I haven't merged your changes to split the rate among each client 
here--which means this is 400 TPS per client == 10000 TPS total:


$ pgbench -S -c 25 -T 30 -R 400 -l pgbench && tail -n 40 pgbench_log

7 12049 2070 0 1371217859 195994
22 12064 2228 0 1371217859 196115
18 11957 1570 0 1371217859 196243
23 12130 989 0 1371217859 196374
8 11922 1598 0 1371217859 196646
11 12229 4833 0 1371217859 196702
21 11981 1943 0 1371217859 196754
20 11930 1026 0 1371217859 196799
14 11990 13119 0 1371217859 208014
^^^ fast section
vvv delayed section
1 11982 91926 0 1371217859 287862
2 12033 116601 0 1371217859 308644
6 12195 115957 0 1371217859 308735
17 12130 114375 0 1371217859 308776
0 12026 115507 0 1371217859 308822
3 11948 118228 0 1371217859 308859
4 12061 113484 0 1371217859 308897
5 12110 113586 0 1371217859 308933
9 12032 117744 0 1371217859 308969
10 12045 114626 0 1371217859 308989
12 11953 113372 0 1371217859 309030
13 11883 114405 0 1371217859 309066
15 12018 116069 0 1371217859 309101
16 11890 115727 0 1371217859 309137
19 12140 114006 0 1371217859 309177
24 11884 115782 0 1371217859 309212

There's almost 90,000 usec of latency showing up between epoch time 
1371217859.208014 and 1371217859.287862 here.  What's weird about it is 
that the longer the test runs, the larger the gap is.  If collecting the 
latency data itself caused the problem, that would make sense, so maybe 
this is related to flushing that out to disk.


If you want to look just at the latency numbers without the other 
columns in the way you can use:


cat pgbench_log.* | awk {'print $3'}

That is how I was evaluating the smoothness of the rate limit, by 
graphing those latency values.  pgbench-tools takes those and a derived 
TPS/s number and plots them, which made it easier for me to spot this 
weirdness.  But I've already moved onto analyzing the raw latency data 
instead, I can see the issue without the graph once I've duplicated the 
conditions.
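
One more way to poke at this without pgbench-tools is to pull the
per-transaction log into a table and query it.  A rough sketch, assuming the
plain per-transaction log layout shown above (6 columns) and a made-up file
path:

create temp table tx_log (client int, tx_no int, latency_us bigint,
                          file_no int, epoch_sec bigint, epoch_usec bigint);
copy tx_log from '/tmp/pgbench_log.12345' with (format text, delimiter ' ');

-- the handful of slowest transactions and when they finished
select epoch_sec, epoch_usec, client, latency_us
  from tx_log order by latency_us desc limit 10;

(copy from a server-side file needs superuser; \copy from psql works the same.)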


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Stephen Frost
* Andrew Dunstan (and...@dunslane.net) wrote:
 I'd rather go the other way, processing the records without having
 to process them otherwise at all. Turning things into text must slow
 things down, surely.

That's certainly an interesting idea also..

Thanks,

Stephen




Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Dean Rasheed
On 14 June 2013 14:14, Tom Lane t...@sss.pgh.pa.us wrote:
 Marko Kreen mark...@gmail.com writes:
 On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com 
 wrote:
 Attached is a patch implementing a new aggregate function md5_agg() to
 compute the aggregate MD5 sum across a number of rows.

 It's more efficient to calculate per-row md5, and then sum() them.
 This avoids the need for ORDER BY.

 Good point.  The aggregate md5 function also fails to distinguish the
 case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
 from the case where they contain 'xyz' followed by 'zyxyz'.


Well, if you aggregated foo.*::text as in my original example, then
the textual representation of the row would protect you from that. But
yes, if you were just doing it with a single text column that might be
a risk.
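
Just to spell that out with a contrived example: the plain concatenation
really is ambiguous, while the row's text form keeps the column boundary
visible (values invented purely for illustration):

select md5('xyzzy' || 'xyz') = md5('xyz' || 'zyxyz');   -- true
select row('xyzzy','xyz')::text, row('xyz','zyxyz')::text;
-- (xyzzy,xyz) vs (xyz,zyxyz), so their md5s differ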


 Now, as against that, you lose any sensitivity to the ordering of the
 values.

 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.


But this would be a much riskier thing to do with a single column,
because if you updated multiple rows in the same way (e.g., UPDATE t
SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
there were an even number of matches.
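
A contrived demonstration of that cancellation, using bit strings to do the
128-bit xor (the table and update are hypothetical; the point is just that
two identical digests wipe each other out):

select ('x' || md5('foo'))::bit(128) # ('x' || md5('foo'))::bit(128);
-- all zero bits: any even number of identically-updated rows would
-- simply vanish from an xor-based aggregate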

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add visibility map information to pg_freespace.

2013-06-14 Thread Alvaro Herrera
Kyotaro HORIGUCHI wrote:
 Hello,
 
 I've added visibility map information to pg_freespace for my
 utility.

This makes sense to me.  I only lament the fact that this makes the
module a misnomer.  Do we want to 1) rename the module (how
inconvenient), 2) create a separate module for this (surely not
warranted), or 3) accept it and move on?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Buildfarm client 4.11 released

2013-06-14 Thread Andrew Dunstan


Version 4.11 of the PostgreSQL Buildfarm client has been released. It 
can be downloaded from 
http://www.pgbuildfarm.org/downloads/releases/build-farm-4_11.tgz


Changes since 4.10:

 * Turn down module cleanup verbosity
 * Add check for rogue postmasters.
 * Add pseudo-branch targets HEAD_PLUS_LATEST and HEAD_PLUS_LATEST2.
 * Use Digest::SHA instead of Digest::SHA1.
 * Make directory handling more robust in git code.
 * Move web transaction into a module procedure.
 * Switch to using the porcelain format of git status.
 * Provide parameter for core file patterns.
 * Use a command file for gdb instead of the -ex option


The web transaction and Digest::SHA changes have allowed the removal of 
a couple of long-standing uglinesses on the system. In almost all cases, 
the config parameter aux_path and the separate run_web_transaction.pl 
script are now redundant (the exception is older Msys systems).


Enjoy

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add visibility map information to pg_freespace.

2013-06-14 Thread Andres Freund
On 2013-06-14 10:22:19 -0400, Alvaro Herrera wrote:
 Kyotaro HORIGUCHI wrote:
  Hello,
  
  I've added visibility map information to pg_freespace for my
  utility.
 
 This makes sense to me.

+1

 I only lament the fact that this makes the
 module a misnomer.  Do we want to 1) rename the module (how
 inconvenient), 2) create a separate module for this (surely not
 warranted), or 3) accept it and move on?

3). All the others seem to inflict unnecessary pain for not all that
much gain.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Greg Stark
On Fri, Jun 14, 2013 at 2:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 So it's not that we actually need to log the individual hint bit
 changes, it's that we need to WAL-log a full page image on the first
 update after a checkpoint, so as to recover from torn-page cases.
 Which one are we doing?

Wal logging a full page image after a checkpoint wouldn't actually be
enough since subsequent hint bits will dirty the page and not wal log
anything, creating a new torn page risk. FPI are only useful if all the
subsequent updates are wal logged.




-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Fri, Jun 14, 2013 at 2:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 So it's not that we actually need to log the individual hint bit
 changes, it's that we need to WAL-log a full page image on the first
 update after a checkpoint, so as to recover from torn-page cases.
 Which one are we doing?

 Wal logging a full page image after a checkpoint wouldn't actually be
 enough since subsequent hint bits will dirty the page and not wal log
 anything creating a new torn page risk. FPI are only useful if all the
 subsequent updates are wal logged.

No, there's no new torn page risk, because any crash recovery would
replay starting from the checkpoint.  You might lose the
subsequently-set hint bits, but that's okay.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Tom Lane
Dean Rasheed dean.a.rash...@gmail.com writes:
 On 14 June 2013 14:14, Tom Lane t...@sss.pgh.pa.us wrote:
 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.

 But this would be a much riskier thing to do with a single column,
 because if you updated multiple rows in the same way (e.g., UPDATE t
 SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
 there were an even number of matches.

I was implicitly thinking that the sum would be a modulo sum so that the
final result is still the size of an md5 signature.  If that's true,
then leaking bits via carry out is just as bad as xor's deficiencies.
Now, you could certainly make it a non-modulo sum and not lose any
information to carries, if you're willing to do the arithmetic in
NUMERIC and have a variable-width result.  Sounds a bit slow though.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Dean Rasheed
On 14 June 2013 15:19, Stephen Frost sfr...@snowman.net wrote:
 * Andrew Dunstan (and...@dunslane.net) wrote:
 I'd rather go the other way, processing the records without having
 to process them otherwise at all. Turning things into text must slow
 things down, surely.

 That's certainly an interesting idea also..


md5_agg(record) ?

Yes, I like it.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Andres Freund
On 2013-06-14 15:49:31 +0100, Dean Rasheed wrote:
 On 14 June 2013 15:19, Stephen Frost sfr...@snowman.net wrote:
  * Andrew Dunstan (and...@dunslane.net) wrote:
  I'd rather go the other way, processing the records without having
  to process them otherwise at all. Turning things into text must slow
  things down, surely.
 
  That's certainly an interesting idea also..
 
 
 md5_agg(record) ?
 
 Yes, I like it.

It's more complex than just memcmp()ing HeapTupleData though. At least
if the Datum contains varlena columns there are so many different
representations (short, long, compressed, external, external compressed)
of the same data that an md5 without normalizing them wouldn't be very
interesting.
So you would at least need a normalizing version of
toast_flatten_tuple() that also deals with short/long varlenas. But even
after that, you would still need to deal with Datums that can have
different representation (like short numerics, old style hstore, ...).

It might be more realistic to use the binary output functions, but I am
not sure whether all of those are sufficiently reproducible.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Hannu Krosing
On 06/14/2013 04:47 PM, Tom Lane wrote:
 Dean Rasheed dean.a.rash...@gmail.com writes:
 On 14 June 2013 14:14, Tom Lane t...@sss.pgh.pa.us wrote:
 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.
 But this would be a much riskier thing to do with a single column,
 because if you updated multiple rows in the same way (e.g., UPDATE t
 SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
 there were an even number of matches.
 I was implicitly thinking that the sum would be a modulo sum so that the
 final result is still the size of an md5 signature.  If that's true,
 then leaking bits via carry out is just as bad as xor's deficiencies.
 Now, you could certainly make it a non-modulo sum and not lose any
 information to carries, if you're willing to do the arithmetic in
 NUMERIC and have a variable-width result.  Sounds a bit slow though.
What skytools/pgq/londiste uses for comparing tables on master
and slave is a query like this

select sum(hashtext(t.*::text)) from yourtable t;

This is non-modulo sum and does not use md5 but relies on
whatever the hashtext() du jour is :)

So it is not comparable to anything external (like the md5sum
compatible idea above) but is usually good enough for fast
checks of compatible tables.

As tables are unordered by definition anyway, this should be
good enough for most SQL.

The speed comes from both fast(er) hashtext() function and
avoiding the sort.
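
As a usage note, the same idea also localizes differences cheaply if you
bucket the sum before comparing; something like the following, assuming an
integer key column id and 64 buckets picked arbitrarily:

select id % 64 as bucket, sum(hashtext(t.*::text)) as chk
  from yourtable t
 group by 1 order by 1;

Run that on both nodes and only the buckets whose chk differs need a
row-by-row look.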

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-14 Thread Jeff Davis
On Thu, 2013-06-13 at 20:09 -0400, Tom Lane wrote:
 What I propose we do about this is reduce backend/storage/page/checksum.c
 to something like
 
 #include "postgres.h"
 #include "storage/checksum.h"
 #include "storage/checksum_impl.h"
 
 moving all the code currently in the file into a new .h file.  Then,
 any external programs such as pg_filedump can use the checksum code
 by including checksum_impl.h.  This is essentially the same thing we
 did with the CRC support functionality some time ago.

Thank you for taking care of that. After seeing that it needed to be in
a header file, I was going to try doing it all as macros.

I have a question about the commit though: shouldn't both functions be
static if they are in a .h file? Otherwise, it could lead to naming
conflicts. I suppose it's wrong to include the implementation file
twice, but it still might be confusing if someone tries. Two ideas that
come to mind are:
  * make both static and then have a trivial wrapper in checksum.c
  * export one or both functions, but use #ifndef CHECKSUM_IMPL_H to
prevent redefinition

 Also, we have the cut-point between checksum.c and bufpage.c at the
 wrong place.  IMO we should move PageCalcChecksum16 in toto into
 checksum.c (or really now into checksum_impl.h), because that and not
 just checksum_block() is the functionality that is wanted.

Agreed.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-14 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 I have a question about the commit though: shouldn't both functions be
 static if they are in a .h file? Otherwise, it could lead to naming
 conflicts. I suppose it's wrong to include the implementation file
 twice, but it still might be confusing if someone tries. Two ideas that
 come to mind are:
   * make both static and then have a trivial wrapper in checksum.c
   * export one or both functions, but use #ifndef CHECKSUM_IMPL_H to
 prevent redefinition

Ah, you are right, I forgot the #ifndef CHECKSUM_IMPL_H dance.  Will fix
in a bit.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] refresh materialized view concurrently

2013-06-14 Thread Kevin Grittner
Attached is a patch for REFRESH MATERIALIZED VIEW CONCURRENTLY for
9.4 CF1.  The goal of this patch is to allow a refresh without
interfering with concurrent reads, using transactional semantics.

It is my hope to get this committed during this CF to allow me to
focus on incremental maintenance for the rest of the release cycle.

I didn't need to touch very much outside of matview-specific files
for this.  My biggest concern is that I needed two small functions
which did *exactly* what some static functions in ri_triggers.c
were doing and couldn't see where the best place to share them from
was.  For the moment I just duplicated them, but my hope would be
that they could be put in a suitable location and called from both
places, rather than duplicating the 30-some lines of code.  The
function signatures are:

void quoteOneName(char *buffer, const char *name)
void quoteRelationName(char *buffer, Relation rel)

Comments in the patch describe the technique used for the
transactional refresh, but I'm not sure how easy it is to
understand the technique from the comments.  Here is a
demonstration of the basic technique, using a table to mock the
materialized view so it can be run directly.

---

--
-- Setup
--
drop table if exists n, nt, nd cascade;
drop table if exists nm;

create table n (id int not null primary key, val text);
insert into n values
  (1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'five'),
  (6, null), (7, null), (8, null), (9, null);
-- We use a table to mock this materialized view definition:
--   create materialized view nm as select * from n;
create table nm as select * from n;
insert into n values (10, 'ten'), (11, null);
update n set val = 'zwei' where id = 2;
update n set val = null where id = 3;
update n set id = 44, val = 'forty-four' where id = 4;
update n set val = 'seven' where id = 7;
delete from n where id = 5;
delete from n where id = 8;

vacuum analyze;

--
-- Sample of internal processing for REFRESH MV CONCURRENTLY.
--
begin;
create temp table nt as select * from n;
analyze nt;
create temp table nd as
  SELECT x.ctid as tid, y
    FROM nm x
    FULL JOIN n y ON (y.id OPERATOR(pg_catalog.=) x.id)
    WHERE (y.*) IS DISTINCT FROM (x.*)
    ORDER BY tid;
analyze nd;

delete from nm where ctid in
  (select tid from nd
    where tid is not null and y is not distinct from null);
update nm x set id = (d.y).id, val = (d.y).val from nd d
  where d.tid is not null and x.ctid = d.tid;
insert into nm select (y).* from nd where tid is null;
commit;

--
-- Check that results match.
--
select * from n order by id;
select * from nm order by id;

---

I also tried a million-row materialized view with the patch to see
what the performace was like on a large table with just a few
changes.  I was surprised that a small change-set like this was
actually faster than replacing the heap, at least on my machine.
Obviously, when a larger number of rows are affected the
transactional CONCURRENTLY option will be slower, and this is not
intended in any way as a performace-enhancing feature, that was
just a happy surprise in testing.

---

-- drop from previous test
drop table if exists testv cascade;

-- create and populate permanent table
create table testv (id int primary key, val text);
insert into testv
  select n, cash_words((floor(random() * 1) / 100)::text::money)
  from (select generate_series(1, 2000000, 2)) s(n);
update testv
  set val = NULL
  where id = 547345;

create materialized view matv as select * from testv;
create unique index matv_id on matv (id);
vacuum analyze matv;

delete from testv where id = 16405;
insert into testv
  values (393466, cash_words((floor(random() * 1) / 100)::text::money));
update testv
  set val = cash_words((floor(random() * 1) / 100)::text::money)
  where id = 1947141;

refresh materialized view concurrently matv;

---

People may be surprised to see this using SPI even more than
ri_triggers.c does.  I think this is the safest and most
maintainable approach, although I welcome alternative suggestions.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
*** a/doc/src/sgml/mvcc.sgml
--- b/doc/src/sgml/mvcc.sgml
***
*** 928,935  ERROR:  could not serialize access due to read/write dependencies among transact
  /para
  
  para
!  This lock mode is not automatically acquired on tables by any
!  productnamePostgreSQL/productname command.
  /para
 /listitem
/varlistentry
--- 928,934 
  /para
  
  para
!  Acquired by commandREFRESH MATERIALIZED VIEW CONCURRENTLY/command.
  /para
 /listitem
/varlistentry
*** 

Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-14 Thread Andres Freund
On 2013-06-14 11:59:04 -0400, Tom Lane wrote:
 Jeff Davis pg...@j-davis.com writes:
  I have a question about the commit though: shouldn't both functions be
  static if they are in a .h file? Otherwise, it could lead to naming
  conflicts. I suppose it's wrong to include the implementation file
  twice, but it still might be confusing if someone tries. Two ideas that
  come to mind are:
* make both static and then have a trivial wrapper in checksum.c
* export one or both functions, but use #ifndef CHECKSUM_IMPL_H to
  prevent redefinition
 
 Ah, you are right, I forgot the #ifndef CHECKSUM_IMPL_H dance.  Will fix
 in a bit.

That won't help against errors if it's included in two different
files/translation units though. I don't really see a valid case where it
could validly be included multiple times in one TU?
If anything we should #error in that case, but I am not sure it's worth
bothering.
E.g. in rmgrlist.h we have the following comment:
/* there is deliberately not an #ifndef RMGRLIST_H here */
and I think the reasoning behind that comment applies here as well.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Jeff Davis
On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote:
 Jeff Davis has a patch pending
 (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
 flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
 were on track committing that, but rereading the thread it doesn't look
 that way.
 
 Jeff, care to update that patch?

Rebased and attached. Changed so all callers use buffer_std=true except
those in freespace.c and fsmpage.c.

Simon, did you (or anyone else) have an objection to this patch? If not,
I'll go ahead and commit it tomorrow morning.

Regards,
Jeff Davis

*** a/src/backend/access/hash/hash.c
--- b/src/backend/access/hash/hash.c
***
*** 287,293  hashgettuple(PG_FUNCTION_ARGS)
  			/*
  			 * Since this can be redone later if needed, mark as a hint.
  			 */
! 			MarkBufferDirtyHint(buf);
  		}
  
  		/*
--- 287,293 
  			/*
  			 * Since this can be redone later if needed, mark as a hint.
  			 */
! 			MarkBufferDirtyHint(buf, true);
  		}
  
  		/*
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***
*** 262,268  heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
  		{
  			((PageHeader) page)-pd_prune_xid = prstate.new_prune_xid;
  			PageClearFull(page);
! 			MarkBufferDirtyHint(buffer);
  		}
  	}
  
--- 262,268 
  		{
  			((PageHeader) page)-pd_prune_xid = prstate.new_prune_xid;
  			PageClearFull(page);
! 			MarkBufferDirtyHint(buffer, true);
  		}
  	}
  
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***
*** 413,421  _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  	 * crucial. Be sure to mark the proper buffer dirty.
  	 */
  	if (nbuf != InvalidBuffer)
! 		MarkBufferDirtyHint(nbuf);
  	else
! 		MarkBufferDirtyHint(buf);
  }
  			}
  		}
--- 413,421 
  	 * crucial. Be sure to mark the proper buffer dirty.
  	 */
  	if (nbuf != InvalidBuffer)
! 		MarkBufferDirtyHint(nbuf, true);
  	else
! 		MarkBufferDirtyHint(buf, true);
  }
  			}
  		}
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***
*** 1052,1058  restart:
  opaque-btpo_cycleid == vstate-cycleid)
  			{
  opaque-btpo_cycleid = 0;
! MarkBufferDirtyHint(buf);
  			}
  		}
  
--- 1052,1058 
  opaque-btpo_cycleid == vstate-cycleid)
  			{
  opaque-btpo_cycleid = 0;
! MarkBufferDirtyHint(buf, true);
  			}
  		}
  
*** a/src/backend/access/nbtree/nbtutils.c
--- b/src/backend/access/nbtree/nbtutils.c
***
*** 1789,1795  _bt_killitems(IndexScanDesc scan, bool haveLock)
  	if (killedsomething)
  	{
  		opaque-btpo_flags |= BTP_HAS_GARBAGE;
! 		MarkBufferDirtyHint(so-currPos.buf);
  	}
  
  	if (!haveLock)
--- 1789,1795 
  	if (killedsomething)
  	{
  		opaque-btpo_flags |= BTP_HAS_GARBAGE;
! 		MarkBufferDirtyHint(so-currPos.buf, true);
  	}
  
  	if (!haveLock)
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 7681,7692  XLogRestorePoint(const char *rpName)
   * records. In that case, multiple copies of the same block would be recorded
   * in separate WAL records by different backends, though that is still OK from
   * a correctness perspective.
-  *
-  * Note that this only works for buffers that fit the standard page model,
-  * i.e. those for which buffer_std == true
   */
  XLogRecPtr
! XLogSaveBufferForHint(Buffer buffer)
  {
  	XLogRecPtr	recptr = InvalidXLogRecPtr;
  	XLogRecPtr	lsn;
--- 7681,7689 
   * records. In that case, multiple copies of the same block would be recorded
   * in separate WAL records by different backends, though that is still OK from
   * a correctness perspective.
   */
  XLogRecPtr
! XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
  {
  	XLogRecPtr	recptr = InvalidXLogRecPtr;
  	XLogRecPtr	lsn;
***
*** 7708,7714  XLogSaveBufferForHint(Buffer buffer)
  	 * and reset rdata for any actual WAL record insert.
  	 */
  	rdata[0].buffer = buffer;
! 	rdata[0].buffer_std = true;
  
  	/*
  	 * Check buffer while not holding an exclusive lock.
--- 7705,7711 
  	 * and reset rdata for any actual WAL record insert.
  	 */
  	rdata[0].buffer = buffer;
! 	rdata[0].buffer_std = buffer_std;
  
  	/*
  	 * Check buffer while not holding an exclusive lock.
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
***
*** 1118,1124  read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple)
  		HeapTupleHeaderSetXmax(seqtuple-t_data, InvalidTransactionId);
  		seqtuple-t_data-t_infomask = ~HEAP_XMAX_COMMITTED;
  		seqtuple-t_data-t_infomask |= HEAP_XMAX_INVALID;
! 		MarkBufferDirtyHint(*buf);
  	}
  
  	seq = (Form_pg_sequence) GETSTRUCT(seqtuple);
--- 1118,1124 
  		HeapTupleHeaderSetXmax(seqtuple-t_data, 

Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-14 Thread Andres Freund
On 2013-06-14 09:21:12 -0700, Jeff Davis wrote:
 On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote:
  Jeff Davis has a patch pending
  (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
  flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
  were on track committing that, but rereading the thread it doesn't look
  that way.
  
  Jeff, care to update that patch?
 
 Rebased and attached. Changed so all callers use buffer_std=true except
 those in freespace.c and fsmpage.c.
 
 Simon, did you (or anyone else) have an objection to this patch? If not,
 I'll go ahead and commit it tomorrow morning.

I'd like to see a comment around the memcpys in XLogSaveBufferForHint()
that mentions that they are safe in a non std buffer due to
XLogCheckBuffer setting an appropriate hole/offset. Or make an explicit
change of the copy algorithm there.

Btw, if you touch that code, I'd vote for renaming XLOG_HINT to XLOG_FPI
or something like that. I find the former name confusing...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_restore -l with a directory archive

2013-06-14 Thread Fujii Masao
Hi,

When I ran pg_restore -l with the directory archive input, I found that
its format is wrongly reported as UNKNOWN.

$ pg_dump -F d -f hoge
$ pg_restore -l hoge
;
; Archive created at Sat Jun 15 01:38:14 2013
; dbname: postgres
; TOC Entries: 9
; Compression: -1
; Dump Version: 1.12-0
; Format: UNKNOWN
; Integer: 4 bytes
; Offset: 8 bytes
; Dumped from database version: 9.3beta1
; Dumped by pg_dump version: 9.3beta1
;
;
; Selected TOC Entries:


In this case, the format should be reported as DIRECTORY.
The attached patch fixes this problem.

Regards,

-- 
Fujii Masao


pg_restore_tocsummary_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Dean Rasheed
On 14 June 2013 16:09, Hannu Krosing ha...@2ndquadrant.com wrote:
 What skytools/pgq/londiste uses for comparing tables on master
 and slave is query like this

 select sum(hashtext(t.*::text)) from yourtable t;

 This is non-modulo sum and does not use md5 but relies on
 whatever the hashtext() du jour is :)

 So it is not comparable to anything external (like the md5sum
 compatible idea above) but is usually good enough for fast
 checks of compatible tables.

 As tables are unordered by definition anyway, this should be
 good enough for most SQL.

 The speed comes from both fast(er) hashtext() function and
 avoiding the sort.


That sounds like a pretty good approach. We could do that if we had a
version of md5() that returned numeric. My impression is that numeric
computations are pretty fast compared to the sorting overhead.

On the other hand, if there is a usable index, select md5_agg(..) from
(sub-query) will do an index scan rather than a sort, making it much
faster than using an ORDER BY in the aggregate.
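
Until there is an md5-like function returning numeric, one order-independent
stopgap (a sketch only, with t standing for whatever table or subquery is
being checked) is to fold a 64-bit slice of each row's md5 into a sum;
sum(bigint) returns numeric, so it cannot overflow:

select sum(('x' || substr(md5(t.*::text), 1, 16))::bit(64)::bigint) from t;

Not md5sum-compatible, obviously, but it needs neither a sort nor an index.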

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-14 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-06-14 11:59:04 -0400, Tom Lane wrote:
 Ah, you are right, I forgot the #ifndef CHECKSUM_IMPL_H dance.  Will fix
 in a bit.

 That won't help against errors if it's included in two different
 files/translation units though.

Good point, but there's not any real reason to do that --- only
checksum.h should ever be #include'd in more than one file.  Any program
using this stuff is expected to #include checksum_impl.h in exactly one
place.  So maybe it's fine as-is.

 E.g. in rmgrlist.h we have the following comment:
 /* there is deliberately not an #ifndef RMGRLIST_H here */
 and I think the reasoning behind that comment applies here as well.

Well, that's a different case: there, and also in kwlist.h, there's an
idea that it could actually be useful to #include the file more than
once, redefining the PG_RMGR() macro each time.  There's no such use
case that I can see for checksum_impl.h.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-14 Thread Jeff Davis
On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote:
 Ack.  I've revised the patch to always have the GUC (for now), default
 to false, and if configure can't find posix_fallocate (or the user
 disables it by way of pg_config_manual.h) then it remains a GUC that
 simply can't be changed.

Why have a GUC here at all? Perhaps this was already discussed, and I
missed it? Is it just for testing purposes, or did you intend for it to
be in the final version?

If it's supported, it seems like we always want it. I doubt there are
cases where it hurts performance; but if there are, it's pretty hard for
a DBA to know what those cases are, anyway.

Also:

* The other code assumes that no errno means ENOSPC. We should be
consistent about that assumption, and do the same thing in the fallocate
case.

* You check for the presence of posix_fallocate at configure time, but
don't #ifdef the call site. It looks like you removed this from the v2
patch, was there a reason for that? Won't that cause build errors for
platforms without it?

I started looking at this patch and it looks like we are getting a
consensus that it's the right approach. Microbenchmarks appear to show a
benefit, and (thanks to Noah's comment) it seems like the change is
safe. Are there any remaining questions or objections?

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] request a new feature in fuzzystrmatch

2013-06-14 Thread Liming Hu
On Tue, Jun 11, 2013 at 3:19 PM, Liming Hu dawnin...@gmail.com wrote:
 On Tue, Jun 11, 2013 at 2:56 PM, Joe Conway m...@joeconway.com wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 On 06/11/2013 02:23 PM, Liming Hu wrote:
 On Tue, Jun 11, 2013 at 1:57 PM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
 Liming Hu escribió:

 I have implemented the code according to Joe's suggestion, and
 put the code at:
 https://github.com/liminghu/fuzzystrmatch/tree/fuzzystrmatchv1.1



 Please submit a proper patch so it can be seen on our mailing list
 archives.

 Hi Alvaro,

 I am kind of new to the Postgresql hacker community, Can you
 please help me on submit the patch?

 Hi Liming,

 I might be able to help, but it will be at least a couple of days
 before I have the time to look at this,

 Joe


 ok, thanks, I will wait.
Hi Joe,

Do you have some time in the weekend to help me submit the patch?
Thanks,

Liming

 Liming
 - --
 Joe Conway
 credativ LLC: http://www.credativ.us
 Linux, PostgreSQL, and general Open Source
 Training, Service, Consulting, & 24x7 Support



 --
 Liming Hu
 cell: (435)-512-4190
 Seattle Washington



--
Liming Hu
cell: (435)-512-4190
Seattle Washington


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] request a new feature in fuzzystrmatch

2013-06-14 Thread David Fetter
On Fri, Jun 14, 2013 at 10:08:24AM -0700, Liming Hu wrote:
 On Tue, Jun 11, 2013 at 3:19 PM, Liming Hu dawnin...@gmail.com wrote:
  On Tue, Jun 11, 2013 at 2:56 PM, Joe Conway m...@joeconway.com wrote:
  -BEGIN PGP SIGNED MESSAGE-
  Hash: SHA1
 
  On 06/11/2013 02:23 PM, Liming Hu wrote:
  On Tue, Jun 11, 2013 at 1:57 PM, Alvaro Herrera
  alvhe...@2ndquadrant.com wrote:
  Liming Hu escribió:
 
  I have implemented the code according to Joe's suggestion, and
  put the code at:
  https://github.com/liminghu/fuzzystrmatch/tree/fuzzystrmatchv1.1
 
 
 
  Please submit a proper patch so it can be seen on our mailing list
  archives.
 
  Hi Alvaro,
 
  I am kind of new to the Postgresql hacker community, Can you
  please help me on submit the patch?
 
  Hi Liming,
 
  I might be able to help, but it will be at least a couple of days
  before I have the time to look at this,
 
  Joe
 
 
  ok, thanks, I will wait.
 Hi Joe,
 
 Do you have some time in the weekend to help me submit the patch?
 Thanks,
 
 Liming

Liming,

Is your git skill good enough to create a patch vs. PostgreSQL's git
master?  If so, send that and once it's hit the mailing list, record
same on commitfest.postgresql.org in the current open commitfest.  If
not, let us know where in that process you got stuck.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] request a new feature in fuzzystrmatch

2013-06-14 Thread Joshua D. Drake


On 06/14/2013 10:11 AM, David Fetter wrote:


ok, thanks, I will wait.

Hi Joe,

Do you have some time in the weekend to help me submit the patch?
Thanks,

Liming


Liming,

Is your git skill good enough to create a patch vs. PostgreSQL's git
master?  If so, send that and once it's hit the mailing list, record
same on commitfest.postgresql.org in the current open commitfest.  If
not, let us know where in that process you got stuck.

Cheers,
David.



This sounds like a wiki page FAQ in the making.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] request a new feature in fuzzystrmatch

2013-06-14 Thread David Fetter
On Fri, Jun 14, 2013 at 10:14:14AM -0700, Joshua D. Drake wrote:
 On 06/14/2013 10:11 AM, David Fetter wrote:
 
 ok, thanks, I will wait.
 Hi Joe,
 
 Do you have some time in the weekend to help me submit the patch?
 Thanks,
 
 Liming
 
 Liming,
 
 Is your git skill good enough to create a patch vs. PostgreSQL's git
 master?  If so, send that and once it's hit the mailing list, record
 same on commitfest.postgresql.org in the current open commitfest.  If
 not, let us know where in that process you got stuck.
 
 
 This sounds like a wiki page FAQ in the making.

With utmost respect, this sounds like several pages should be
consolidated into one and clarified.  Yet another page will just make
matters more confusing.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-14 Thread Greg Smith

On 6/14/13 1:06 PM, Jeff Davis wrote:

Why have a GUC here at all? Perhaps this was already discussed, and I
missed it? Is it just for testing purposes, or did you intend for it to
be in the final version?


You have guessed correctly!  I suggested it stay in there only to make 
review benchmarking easier.



I started looking at this patch and it looks like we are getting a
consensus that it's the right approach. Microbenchmarks appear to show a
benefit, and (thanks to Noah's comment) it seems like the change is
safe. Are there any remaining questions or objections?


I'm planning to duplicate Jon's test program on a few machines here, and 
then see if that turns into a useful latency improvement for clients. 
I'm trying to get this pgbench rate limit stuff working first though, 
because one of the tests I had in mind for WAL creation overhead would 
benefit from it.


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-14 Thread Jeff Davis
On Tue, 2013-06-11 at 12:58 -0400, Stephen Frost wrote:
 My main question is really- would this be useful for extending
 *relations*?  Apologies if it's already been discussed; I do plan to go
 back and read the threads about this more fully, but I wanted to voice
 my support for using posix_fallocate, when available, in general.

+1, though separate from this patch.

Andres also pointed out that we can try to track a point in the file
that is below any place where a zero page might still exist. That will
allow us to call zero pages invalid unless they are related to a recent
extension, which is a weakness in the current checksums code.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] another error perhaps to be enhanced

2013-06-14 Thread Joshua D. Drake


ERROR:  index foo_idx

We should probably add the schema.

JD
--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add visibility map information to pg_freespace.

2013-06-14 Thread Peter Geoghegan
On Fri, Jun 14, 2013 at 7:23 AM, Andres Freund and...@2ndquadrant.com wrote:
 3). All the others seem to inflict unneccesary pain for not all that
 much gain.

+1. You might want to add a historical note about the name to the
pg_freespace documentation, though.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Greg Smith
I don't have this resolved yet, but I think I've identified the cause. 
Updating here mainly so Fabien doesn't duplicate my work trying to track 
this down.  I'm going to keep banging at this until it's resolved now 
that I got this far.


Here's a slow transaction:

1371226017.568515 client 1 executing \set naccounts 100000 * :scale
1371226017.568537 client 1 throttling 6191 us
1371226017.747858 client 1 executing \setrandom aid 1 :naccounts
1371226017.747872 client 1 sending SELECT abalance FROM pgbench_accounts 
WHERE aid = 268721;

1371226017.789816 client 1 receiving

That confirms it is getting stuck at the throttling step.  Looks like 
the code pauses there because it's trying to overload the sleeping 
state that was already in pgbench, but handle it in a special way inside 
of doCustom(), and that doesn't always work.


The problem is that pgbench doesn't always stay inside doCustom when a 
client sleeps.  It exits there to poll for incoming messages from the 
other clients, via select() on a shared socket.  It's not safe to assume 
doCustom will be running regularly; that's only true if clients keep 
returning messages.


So as long as other clients keep banging on the shared socket, doCustom 
is called regularly, and everything works as expected.  But at the end 
of the test run that happens less often, and that's when the problem 
shows up.


pgbench already has a \sleep command, and the way that delay is 
handled happens inside threadRun() instead.  The pausing of the rate 
limit throttle needs to operate in the same place.  I have to redo a few 
things to confirm this actually fixes the issue, as well as look at 
Fabien's later updates to this since I wandered off debugging.  I'm sure 
it's in the area of code I'm poking at now though.


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] another error perhaps to be enhanced

2013-06-14 Thread Michael Glaesemann


On Jun 14, 2013, at 13:38, Joshua D. Drake j...@commandprompt.com wrote:

 
 ERROR:  index foo_idx
 
 We should probably add the schema.

I've noticed similar issues with functions. I'd like to see those 
schema-qualified as well.


 
 JD
 -- 
 Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
 PostgreSQL Support, Training, Professional Services and Development
 High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
 For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats
 
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] another error perhaps to be enhanced

2013-06-14 Thread Peter Geoghegan
I think you'll need to better describe what you mean here.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] another error perhaps to be enhanced

2013-06-14 Thread Joshua D. Drake


On 06/14/2013 10:47 AM, Peter Geoghegan wrote:


I think you'll need to better describe what you mean here.



postgres=# create schema foo;
CREATE SCHEMA
postgres=# create schema bar;
CREATE SCHEMA
postgres=# create table foo.foo(id serial);
NOTICE:  CREATE TABLE will create implicit sequence foo_id_seq for 
serial column foo.id

CREATE TABLE
postgres=# create table bar.bar(id serial);
NOTICE:  CREATE TABLE will create implicit sequence bar_id_seq for 
serial column bar.id

CREATE TABLE
postgres=# create index one_idx on foo.foo(id);
CREATE INDEX
postgres=# create index one_idx on bar.bar(id);
CREATE INDEX
postgres=#


Now, with the error previously shown, which one_idx needs to be reindexed?
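
(Today the disambiguation has to be done by hand, with something along the
lines of:

select n.nspname, c.relname
  from pg_class c
  join pg_namespace n on n.oid = c.relnamespace
 where c.relname = 'one_idx' and c.relkind = 'i';

which is exactly the extra step a schema-qualified message would save.)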

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] another error perhaps to be enhanced

2013-06-14 Thread Peter Geoghegan
On Fri, Jun 14, 2013 at 10:54 AM, Joshua D. Drake j...@commandprompt.com 
wrote:
 Now, with the error previously shown, which one_idx needs to be reindexed?

Well, you didn't show an actual error message. But if you \set
VERBOSITY verbose within psql while connected to a 9.3 server, you'll
get fully qualified details of the constraint blamed for the error, if
any. Example:

postgres=# insert into a(a, b) values (3, 'test');
ERROR:  23505: duplicate key value violates unique constraint a_pkey
DETAIL:  Key (a)=(3) already exists.
SCHEMA NAME:  public
TABLE NAME:  a
CONSTRAINT NAME:  a_pkey
LOCATION:  _bt_check_unique, nbtinsert.c:398


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-14 Thread Jeff Davis
On Fri, 2013-06-14 at 13:21 -0400, Greg Smith wrote:
 I'm planning to duplicate Jon's test program on a few machines here, and 
 then see if that turns into a useful latency improvement for clients. 
 I'm trying to get this pgbench rate limit stuff working first though, 
 because one of the tests I had in mind for WAL creation overhead would 
 benefit from it.

Great, I'll wait on those results.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] another error perhaps to be enhanced

2013-06-14 Thread Joshua D. Drake


On 06/14/2013 11:01 AM, Peter Geoghegan wrote:


On Fri, Jun 14, 2013 at 10:54 AM, Joshua D. Drake j...@commandprompt.com 
wrote:

Now, with the error previously shown, which one_idx needs to be reindexed?


Well, you didn't show an actual error message.



ERROR:  index foo_idx

Is not an error message? Granted I didn't show the whole error message 
but my point is, it should ALWAYS be fully qualified.




But if you \set
VERBOSITY verbose within psql while connected to a 9.3 server, you'll
get fully qualified details of the constraint blamed for the error, if
any. Example:

postgres=# insert into a(a, b) values (3, 'test');
ERROR:  23505: duplicate key value violates unique constraint a_pkey
DETAIL:  Key (a)=(3) already exists.
SCHEMA NAME:  public
TABLE NAME:  a
CONSTRAINT NAME:  a_pkey
LOCATION:  _bt_check_unique, nbtinsert.c:398




I was looking in the logs.


JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Fabien COELHO


pgbench already has a \sleep command, and the way that delay is 
handled happens inside threadRun() instead.  The pausing of the rate 
limit throttle needs to operate in the same place.


It does operate at the same place. The throttling is performed by 
inserting a sleep first thing when starting a new transaction. So if 
there is a weirdness, it should show as well without throttling but with a 
fixed \sleep instead?


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hard to Use WAS: Hard limit on WAL space

2013-06-14 Thread Josh Berkus
On 06/12/2013 02:03 PM, Joshua D. Drake wrote:
 What concerns me is we seem to be trying to make this easy. It isn't
 supposed to be easy. This is hard stuff. Smart people built it and it
 takes a smart person to run it. When did it become a bad thing to be
 something that smart people need to run?

1997, last I checked.

Our unofficial motto: PostgreSQL: making very hard things possible, and
simple things hard.

It *is* hard.  But that's because we've *made* it hard to understand and
manage, not because the problem is inherently hard.  For example: can
you explain to me in 10 words or less how to monitor to see if archiving
is falling behind?  I'll bet you can't, and that's because we've
provided no reliable way to do so.
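
(For the record, about the best answer today is to count the .ready files
piling up yourself; it needs superuser and assumes the stock 9.x layout with
status files under pg_xlog/archive_status:

select count(*) as walfiles_awaiting_archive
  from pg_ls_dir('pg_xlog/archive_status') f
 where f like '%.ready';

which rather proves the point that there is no obvious, supported way.)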

It's normal when you're developing features for the ability to utilize
them to go from hacker -> high-end user -> regular user.  We suck at
moving to that last stage, partly because whenever someone on this list
introduces the idea of making a feature not just great but easy to use,
people actually object to the idea that anything should be easy to use.
  It's like we're afraid of being polluted by the unwashed DevOps masses.

In the meantime, Mongo kicks our butts at new user adoption.  Why?  Their
features suck, but the features they do have are easy to use.  You'd
think we would have learned something from MySQL.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hard to Use WAS: Hard limit on WAL space

2013-06-14 Thread Joshua D. Drake


On 06/14/2013 11:16 AM, Josh Berkus wrote:


On 06/12/2013 02:03 PM, Joshua D. Drake wrote:

What concerns me is we seem to be trying to make this easy. It isn't
supposed to be easy. This is hard stuff. Smart people built it and it
takes a smart person to run it. When did it become a bad thing to be
something that smart people need to run?


1997, last I checked.

Our unofficial motto: PostgreSQL: making very hard things possible, and
simple things hard.

It *is* hard.  But that's because we've *made* it hard to understand and
manage, not because the problem is inherently hard.  For example: can
you explain to me in 10 words or less how to monitor to see if archiving
is falling behind?  I'll bet you can't, and that's because we've
provided no reliable way to do so.


Hey, I never said we shouldn't have a complete feature set. I agree with 
you. IMO it should not have even been committed without the ability to 
actually know what is going on and we have had it since (in theory) 8.1?


My primary concern is: Don't make it stupid.

I liked Claudio's comment, "More than easy, it should be obvious".

It should be obvious from a review of the documentation how to manage 
this stuff. It isn't, and worse even if we wrote the documentation it 
still isn't because the feature is not complete.


With great power comes great responsibility :P

JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SPGist triple parity concept doesn't work

2013-06-14 Thread Tom Lane
Teodor Sigaev teo...@sigaev.ru writes:
 Anyway I now think that we might be better off with the other idea of
 abandoning an insertion and retrying if we get a lock conflict.

 done, look at the patch.

Looks good, committed with some cosmetic adjustments.

 We definitely need a new idea of locking protocol and I'll return to this
 problem in the autumn (sorry, I haven't time in summer to do this
 research).

OK.  I think the performance of this way will be okay, actually, in most
cases anyhow.  It'll do till we have a better idea.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Fabien COELHO


Hello Greg,

I think that the weirdness really comes from the way transaction times 
are measured, their interactions with throttling, and latent bugs in the 
code.


One issue is that the throttling time was included in the measurement, but not 
the first time because txn_begin is not set at the beginning of 
doCustom.


Also, flag st->listen is set to 1 but *never* set back to 0...
  sh> grep listen pgbench.c
int listen;
if (st->listen)
st->listen = 1;
st->listen = 1;
st->listen = 1;
st->listen = 1;
st->listen = 1;
st->listen = 1;

ISTM that I can fix the weirdness by inserting an ugly goto top;, but 
I would feel better about it by removing all gotos and reordering some 
actions in doCustom in a more logical way. However that would be a bigger 
patch.


Please find attached 2 patches:

 - the first is the full throttle patch which ensures that the
   txn_begin is taken at a consistent point, after throttling,
   which requires resetting listen. There is an ugly goto.
   I've also put times in a consistent format in the log,
   789.012345 instead of 789 12345.

 - the second patch just shows the diff between v10 and the first one.

--
Fabien.diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 8c202bf..dc4f819 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -137,6 +137,12 @@ int			unlogged_tables = 0;
 double		sample_rate = 0.0;
 
 /*
+ * When threads are throttled to a given rate limit, this is the target delay
+ * to reach that rate in usec.  0 is the default and means no throttling.
+ */
+int64		throttle_delay = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -205,6 +211,7 @@ typedef struct
 	int			nvariables;
 	instr_time	txn_begin;		/* used for measuring transaction latencies */
 	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	bool		throttled;  /* whether current transaction was throttled */
 	int			use_file;		/* index in sql_files for this client */
 	bool		prepared[MAX_FILES];
 } CState;
@@ -222,6 +229,10 @@ typedef struct
 	instr_time *exec_elapsed;	/* time spent executing cmds (per Command) */
 	int		   *exec_count;		/* number of cmd executions (per Command) */
 	unsigned short random_state[3];		/* separate randomness for each thread */
+int64   throttle_trigger;  /* previous/next throttling (us) */
+	int64   throttle_lag;  /* total transaction lag behind throttling */
+	int64   throttle_lag_max;  /* max transaction lag */
+
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -230,6 +241,8 @@ typedef struct
 {
 	instr_time	conn_time;
 	int			xacts;
+	int64   throttle_lag;
+	int64   throttle_lag_max;
 } TResult;
 
 /*
@@ -355,6 +368,8 @@ usage(void)
 		 -n   do not run VACUUM before tests\n
 		 -N   do not update tables \pgbench_tellers\ and \pgbench_branches\\n
 		 -r   report average latency per command\n
+		 -R SPEC, --rate SPEC\n
+		  target rate in transactions per second\n
 		 -s NUM   report this scale factor in output\n
 		 -S   perform SELECT-only transactions\n
 	   -t NUM   number of transactions each client runs (default: 10)\n
@@ -898,17 +913,56 @@ doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile, AggVa
 {
 	PGresult   *res;
 	Command   **commands;
+	booldo_throttle = false;
 
 top:
 	commands = sql_files[st-use_file];
 
+	/* handle throttling once per transaction by inserting a sleep.
+	 * this is simpler than doing it at the end.
+	 */
+	if (throttle_delay && ! st->throttled)
+	{
+		/* compute delay to approximate a Poisson distribution
+		 * 100 = 13.8 .. 0 multiplier
+		 *  10 = 11.5 .. 0
+		 *   1 =  9.2 .. 0
+		 *1000 =  6.9 .. 0
+		 * if transactions are too slow or a given wait shorter than
+		 * a transaction, the next transaction will start right away.
+		 */
+		int64 wait = (int64)
+			throttle_delay * -log(getrand(thread, 1, 1000)/1000.0);
+
+		thread->throttle_trigger += wait;
+
+		st->until = thread->throttle_trigger;
+		st->sleeping = 1;
+		st->throttled = true;
+		if (debug)
+			fprintf(stderr, "client %d throttling "INT64_FORMAT" us\n",
+	st->id, wait);
+	}
+
 	if (st->sleeping)
 	{			/* are we sleeping? */
 		instr_time	now;
+		int64 now_us;
 
 		INSTR_TIME_SET_CURRENT(now);
-		if (st->until <= INSTR_TIME_GET_MICROSEC(now))
+		now_us = INSTR_TIME_GET_MICROSEC(now);
+		if (st->until <= now_us)
+		{
 			st->sleeping = 0;	/* Done sleeping, go ahead with next command */
+			if (throttle_delay && st->state==0)
+			{
+				/* measure lag of throttled transaction */
+				int64 lag = now_us - st->until;
+				thread->throttle_lag += lag;
+				if (lag > thread->throttle_lag_max)
+					thread->throttle_lag_max = lag;
+			}
+		}
 		else
 			return true;		/* 

Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Greg Smith

On 6/14/13 3:50 PM, Fabien COELHO wrote:

I think that the weirdness really comes from the way transactions times
are measured, their interactions with throttling, and latent bugs in the
code.


measurement times, no; interactions with throttling, no.  If it was 
either of those I'd have finished this off days ago.  Latent bugs, 
possibly.  We may discover there's nothing wrong with your code at the 
end here, that it just makes hitting this bug more likely. 
Unfortunately today is the day *some* bug is popping up, and I want to 
get it squashed before I'll be happy.


The lag is actually happening during a kernel call that isn't working as 
expected.  I'm not sure whether this bug was there all along if \sleep 
was used, or if it's specific to the throttle sleep.


 Also, flag st->listen is set to 1 but *never* set back to 0...

I noticed that st->listen was weird too, and that's on my short list of 
suspicious things I haven't figured out yet.


I added a bunch more logging as pgbench steps through its work to track 
down where it's stuck at.  Until the end all transactions look like this:


1371238832.084783 client 10 throttle lag 2 us
1371238832.084783 client 10 executing \setrandom aid 1 :naccounts
1371238832.084803 client 10 sending SELECT abalance FROM 
pgbench_accounts WHERE aid = 753099;

1371238832.084840 calling select
1371238832.086539 client 10 receiving
1371238832.086539 client 10 finished

All clients who hit lag spikes at the end are going through this 
sequence instead:


1371238832.085912 client 13 throttle lag 790 us
1371238832.085912 client 13 executing \setrandom aid 1 :naccounts
1371238832.085931 client 13 sending SELECT abalance FROM 
pgbench_accounts WHERE aid = 564894;

1371238832.086592 client 13 receiving
1371238832.086662 calling select
1371238832.235543 client 13 receiving
1371238832.235543 client 13 finished

Note the calling select here that wasn't in the normal length 
transaction before it.  The client is receiving something here, but 
rather than it finishing the transaction it falls through and ends up at 
the select() system call outside of doCustom.  All of the clients that 
are sleeping when the system slips into one of these long select() calls 
are getting stuck behind it.  I'm not 100% sure, but I think this only 
happens when all remaining clients are sleeping.


Here's another one, it hits the receive that doesn't finish the 
transaction earlier (1371238832.086587) but then falls into the same 
select() call at 1371238832.086662:


1371238832.085884 client 12 throttle lag 799 us
1371238832.085884 client 12 executing \setrandom aid 1 :naccounts
1371238832.085903 client 12 sending SELECT abalance FROM 
pgbench_accounts WHERE aid = 299080;

1371238832.086587 client 12 receiving
1371238832.086662 calling select
1371238832.231032 client 12 receiving
1371238832.231032 client 12 finished

Investigation is still going here...

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-14 Thread Fabien COELHO



I think that the weirdness really comes from the way transactions times
are measured, their interactions with throttling, and latent bugs in the
code.


measurement times, no; interactions with throttling, no.  If it was either of 
those I'd have finished this off days ago.  Latent bugs, possibly.  We may 
discover there's nothing wrong with your code at the end here,


To summarize my point: I think my v10 code does not take into account all 
of the strangeness in doCustom, and I'm pretty sure that there is no point
in including throttle sleeps into latency measures, which was more or less 
the case. So it is somehow a bug which only shows up if you look at the 
latency measures, but the tps are fine.


that it just makes hitting this bug more likely. Unfortunately today is 
the day *some* bug is popping up, and I want to get it squashed before 
I'll be happy.


The lag is actually happening during a kernel call that isn't working as 
expected.  I'm not sure whether this bug was there all along if \sleep was 
used, or if it's specific to the throttle sleep.


The throttle sleep is inserted out of the state machine. That is why in 
the test patch I added a goto to ensure that it is always taken at the 
right time, that is when state==0 and before txn_begin is set, and not 
possibly between other states when doCustom happens to be recalled after a 
return.


I added a bunch more logging as pgbench steps through its work to track down 
where it's stuck at.  Until the end all transactions look like this:


1371238832.084783 client 10 throttle lag 2 us
1371238832.084783 client 10 executing \setrandom aid 1 :naccounts
1371238832.084803 client 10 sending SELECT abalance FROM pgbench_accounts 
WHERE aid = 753099;

1371238832.084840 calling select
1371238832.086539 client 10 receiving
1371238832.086539 client 10 finished

All clients who hit lag spikes at the end are going through this sequence
instead:


1371238832.085912 client 13 throttle lag 790 us
1371238832.085912 client 13 executing \setrandom aid 1 :naccounts
1371238832.085931 client 13 sending SELECT abalance FROM pgbench_accounts 
WHERE aid = 564894;

1371238832.086592 client 13 receiving
1371238832.086662 calling select
1371238832.235543 client 13 receiving
1371238832.235543 client 13 finished


Note the calling select here that wasn't in the normal length transaction 
before it.  The client is receiving something here, but rather than it 
finishing the transaction it falls through and ends up at the select() system 
call outside of doCustom.  All of the clients that are sleeping when the 
system slips into one of these long select() calls are getting stuck behind 
it.  I'm not 100% sure, but I think this only happens when all remaining 
clients are sleeping.


Note: in both the slow cases there is a receiving between sending and 
select. This suggests that the goto top at the very end of doCustom is 
followed in one case but not the other.


ISTM that there is a timeout passed to select() which is computed based on 
the current sleeping time of each client. I'm pretty sure that's not a well 
tested path...


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [RFC] Minmax indexes

2013-06-14 Thread Alvaro Herrera
Hi,

This is a preliminary proposal for Minmax indexes.  I'm experimenting
with the code, but it's too crude to post yet, so here's a document
explaining what they are and how they work, so that reviewers can poke
holes to have the design improved.  My intention is to have a patch to
show for CF2, so please do have a look at this and comment.

This is part of the AXLE project http://www.axleproject.eu and the
intention is to support tables of very large size.  In a quick
experiment, I have a table of ~12 GB and its corresponding index is 65
kB in size, making the time to do the equivalent of a seqscan a small
fraction of that taken by a real seqscan.  This technique sits between a
bitmap scan of a normal btree, and a seqscan: the minmax index tells the
bitmap heap scan what pages to seqscan, allowing it to skip a large
fraction of pages that are known not to contain tuples matching the
query quals.  This is a huge win for large data warehouses.  

Without further ado, here's what I propose.


Minmax Range Indexes


Minmax indexes are a new access method intended to enable very fast scanning of
extremely large tables.

The essential idea of a minmax index is to keep track of the min() and max()
values in consecutive groups of heap pages (page ranges).  These values can be
used by constraint exclusion to avoid scanning such pages, depending on query
quals.
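
As a minimal sketch of the idea (the struct and function names below are
invented for illustration, not the patch's actual code), the per-range check
for a qual like "col > key" only needs the stored min/max to prove that no
tuple in the range can possibly match before the range may be skipped:

#include <stdbool.h>

typedef struct MinmaxRange
{
	unsigned int	first_block;	/* first heap block covered by this entry */
	unsigned int	last_block;		/* last heap block covered by this entry */
	double			min_val;		/* smallest indexed value seen in the range */
	double			max_val;		/* largest indexed value seen in the range */
} MinmaxRange;

static bool
range_may_match_gt(const MinmaxRange *range, double key)
{
	/* if even the largest value is <= key, nothing in the range can match */
	return range->max_val > key;
}

Ranges for which this returns false never need to be read from the heap;
all other ranges are scanned and rechecked tuple by tuple.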

The main drawback of this is having to update the stored min/max values of each
page range as tuples are inserted into them.

Other database systems already have this feature. Some examples:

* Oracle Exadata calls this storage indexes
  http://richardfoote.wordpress.com/category/storage-indexes/

* Netezza has zone maps
  http://nztips.com/2010/11/netezza-integer-join-keys/

* Infobright has this automatically within their data packs
  
http://www.infobright.org/Blog/Entry/organizing_data_and_more_about_rough_data_contest/

* MonetDB seems to have it
  http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
  Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS

Grammar
---

To create a minmax index, we use

  CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);

Partial indexes are not supported; since an index is concerned with minimum and
maximum values of the involved columns across all the pages in the table, it
doesn't make sense to exclude values.  Another way to see partial indexes
here would be those that only considered some pages in the table instead of all
of them; but this would be difficult to implement and manage and, most likely,
pointless.

Expressional indexes can probably be supported in the future, but we disallow
them initially for conceptual simplicity.

Having multiple minmax indexes in the same table is acceptable, though most of
the time it would make more sense to have a single index covering all the
interesting columns.  Multiple indexes might be useful for columns added later.

Access Method Design


Since item pointers are not stored inside indexes of this type, it is not
possible to support the amgettuple interface.  Instead, we only provide
amgetbitmap support; scanning a relation using this index requires a recheck
node on top.  The amgetbitmap routine would return a TIDBitmap comprising all
the pages in those page groups that comply with the query quals; the recheck
node prunes tuples that are not visible per snapshot and those that are not visible
per query quals.
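
In the same illustrative spirit as the sketch above (invented names,
simplified types; not the actual implementation), the amgetbitmap-style loop
then simply adds every heap block of a qualifying range to the bitmap and
leaves the per-tuple filtering to the recheck node:

static void
minmax_build_bitmap(int nranges,
					const unsigned int *first_block,
					const unsigned int *last_block,
					const double *max_val, double key,
					void (*add_block) (unsigned int blkno))
{
	int			i;

	for (i = 0; i < nranges; i++)
	{
		/* for a "col > key" qual: skip ranges whose max cannot exceed key */
		if (max_val[i] > key)
		{
			unsigned int blk;

			for (blk = first_block[i]; blk <= last_block[i]; blk++)
				add_block(blk);
		}
	}
}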

For each supported datatype, we need an opclass with the following catalog
entries:

- support functions (pg_amproc)
  * pg_proc entry for min()
  * pg_proc entry for max()
- support operators (pg_amop): same as btree (<, <=, =, >=, >)

The min() and max() support functions are used during index construction.
The support operators are used in the optimizer, so that the index is chosen
when queries on the indexed table are planned.  (Also, we use them in the
amgetbitmap routine, to compare ScanKeys and decide whether to emit a certain
block or not).

In each index tuple (corresponding to one page range), we store:
- first block this tuple applies to
- last block this tuple applies to
- for each indexed column:
  * min() value across all tuples in the range
  * max() value across all tuples in the range
  * nulls present in any tuple?

With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
types (timestamptz, bigint), each tuple would be 524 bytes in length, which
seems reasonable.  Of course, larger columns are possible, such as varchar, but
creating minmax indexes on such columns seems of little practical usefulness.

This maximum index tuple size is calculated as:
BlockNumber (4 bytes) * 2 + data value (8 bytes) * 32 * 2 + null bitmap (4 bytes)


Block ranges mentioned in index entries shouldn't overlap. However, there can
be gaps where some pages have no covering index entry. (In particular, the last
few pages of the table would commonly not be summarized.)

In order to scan 

Re: [HACKERS] single-user vs standalone in docs and messages

2013-06-14 Thread Robert Haas
On Thu, Jun 13, 2013 at 6:10 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Some places in the docs and elog hints refer to standalone backends, while
 the official name as used in app-postgres.html is single-user mode, and in
 fact standalone does not appear on that page.

 This tries to standardize the other locations to use single-user.  I think
 I did the right thing with the message translation files, but I can't figure
 out how to test that.

 I made no attempt to change code-comments, just the user-facing parts.

I think you could tell people to use single-user mode instead of a
standalone backend, but telling them to use a single-user backend
just seems weird.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] stray SIGALRM

2013-06-14 Thread Richard Poole
In 9.3beta1, a backend will receive a SIGALRM after authentication_timeout
seconds, even if authentication has been successful. Most of the time
this doesn't hurt anyone, but there are cases, such as when the backend
is doing the open() of a backend copy, when it breaks things and results
in an error getting reported to the client. In particular, if you're doing
a copy from a FIFO, it is normal for open() to block until the process at
the other end has data ready, so you're very likely to have it interrupted
by the SIGALRM and fail.
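
For illustration only (this is not code from the backend), the usual shape of
an EINTR-safe retry loop around open() looks like the sketch below; the
difficulty is that every blocking call site in the backend would need
equivalent treatment:

#include <errno.h>
#include <fcntl.h>

static int
open_retry_eintr(const char *path, int flags)
{
	int			fd;

	do
	{
		/* retry when the call is interrupted by a signal such as SIGALRM */
		fd = open(path, flags);
	} while (fd < 0 && errno == EINTR);

	return fd;
}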

To see the SIGALRM just run psql then determine your backend's pid,
attach an strace to it, and wait 60 seconds, or whatever you've got
authentication_timeout set to.

This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a
side-effect of speeding things up by getting rid of setitimer() calls;
it's not obvious what's a good way to fix it without losing the benefits
of that commit.

Thanks Alvaro and Andres for helping me get from why is my copy getting
these signals to understanding what's actually going on.

Richard

-- 
Richard Poole http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] logical changeset generation v5

2013-06-14 Thread Andres Freund
Hi!

I am rather pleased to announce the next version of the changeset
extraction patchset. Thanks to help from a large number of people I
think we are slowly getting to the point where it is getting
committable.

Since the last submitted version
(20121115002746.ga7...@awork2.anarazel.de) a large number of fixes and
the result of a good amount of review has been added to the tree. All
bugs known to me have been fixed.

Fixes include:
* synchronous replication support
* don't peg the xmin for user tables, do it only for catalog ones.
* arbitrarily large transaction support by spilling large transactions
  to disk
* spill snapshots to disk, so we can restart without waiting for a new
  snapshot to be built
* Don't read all WAL from the establishment of a logical slot
* tests via SQL interface to changeset extraction

The todo list includes:
* morph the logical slot interface into being replication slots that
  can also be used by streaming replication
* move some more code from snapbuild.c to decode.c to remove a largely
  duplicated switch
* do some more header/comment cleanup  clarification
* move pg_receivellog into its own directory in src/bin or contrib/.
* user/developer level documentation

The patch series currently has two interfaces to logical decoding. One -
which is primarily useful for pg_regress style tests and playing around
- is SQL based, the other one uses a walsender replication connection.

A quick demonstration of the SQL interface (server needs to be started
with wal_level = logical and max_logical_slots > 0):
=# CREATE EXTENSION test_logical_decoding;
=# SELECT * FROM init_logical_replication('regression_slot', 'test_decoding');
slotname | xlog_position 
-+---
 regression_slot | 0/17D5908
(1 row)

=# CREATE TABLE foo(id serial primary key, data text);

=# INSERT INTO foo(data) VALUES(1);

=# UPDATE foo SET id = -id, data = ':'||data;

=# DELETE FROM foo;

=# DROP TABLE foo;

=# SELECT * FROM start_logical_replication('regression_slot', 'now', 
'hide-xids', '0');
 location  | xid |  data
---+-+
 0/17D59B8 | 695 | BEGIN
 0/17D59B8 | 695 | COMMIT
 0/17E8B58 | 696 | BEGIN
 0/17E8B58 | 696 | table foo: INSERT: id[int4]:1 data[text]:1
 0/17E8B58 | 696 | COMMIT
 0/17E8CA8 | 697 | BEGIN
 0/17E8CA8 | 697 | table foo: UPDATE: old-pkey: id[int4]:1 new-tuple: 
id[int4]:-1 data[text]::1
 0/17E8CA8 | 697 | COMMIT
 0/17E8E50 | 698 | BEGIN
 0/17E8E50 | 698 | table foo: DELETE: id[int4]:-1
 0/17E8E50 | 698 | COMMIT
 0/17E9058 | 699 | BEGIN
 0/17E9058 | 699 | COMMIT
(13 rows)

=# SELECT * FROM pg_stat_logical_decoding ;
slot_name|plugin | database | active | xmin | 
restart_decoding_lsn 
-+---+--++--+--
 regression_slot | test_decoding |12042 | f  |  695 | 0/17D58D0
(1 row)

=# SELECT * FROM stop_logical_replication('regression_slot');
 stop_logical_replication
--
0

The walsender interface has the same calls
INIT_LOGICAL_REPLICATION 'slot' 'plugin';
START_LOGICAL_REPLICATION 'slot' restart_lsn [(option value)*];
STOP_LOGICAL_REPLICATION 'slot';

The only difference is that START_LOGICAL_REPLICATION can stream changes
and it can support synchronous replication.

The output seen in the 'data' column is produced by a so called 'output
plugin' which users of the facility can write to suit their needs. They
can be written by implementing 5 functions in the shared object that's
passed to init_logical_replication() above:
* pg_decode_init (optional)
* pg_decode_begin_txn
* pg_decode_change
* pg_decode_commit_txn
* pg_decode_cleanup (optional)

The most interesting function, pg_decode_change, gets passed a structure
containing old/new versions of the row, the 'struct Relation' belonging
to it and metainformation about the transaction.

The output plugin can rely on syscache lookups et al. to decode the
changed tuple in whatever fashion it wants.

I'd like to invite reviewers to first look at:
* the output plugin interface
* the walsender/SRF interface
* patch 12 which contains most of the code

When reading the code, the information flow during decoding might be
interesting:
---
  +---+
  | XLogReader|
  +---+
  |
XLOG Records
  |
  v
  +---+
  | decode.c  |
  +---+
 |   |
 |   |
 v   |
+---+|
| snapbuild.c   |  HeapTupleData
+---+|
 |   |
  catalog snapshots  |
 |   |
 v   v
  +---+
  |reorderbuffer.c|
  +---+
 |
HeapTuple  Metadata
 

[HACKERS] GIN improvements part2: fast scan

2013-06-14 Thread Alexander Korotkov
Hackers,

Attached is a patch implementing the fast scan technique for GIN. This is the
second patch of the GIN improvements; see the 1st one here:
http://www.postgresql.org/message-id/capphfduxv-il7aedwpw0w5fxrwgakfxijwm63_hzujacrxn...@mail.gmail.com
This patch allows skipping parts of posting trees when their scan is not
necessary. In particular, it solves the frequent_term & rare_term problem of
FTS.
It introduces a new interface method, pre_consistent, which behaves like
consistent, but:
1) allows false positives on input (check[])
2) is allowed to return false positives

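A purely illustrative sketch of what such a pre_consistent check could do for
a plain AND-of-terms query (the argument list here is simplified and is not
the patch's actual signature): because false positives are allowed on both
input and output, it only has to return false when the item certainly cannot
match, which is what lets the scan skip most of the frequent term's posting
tree.

#include <stdbool.h>

static bool
and_query_pre_consistent(const bool *check, int nkeys)
{
	int			i;

	for (i = 0; i < nkeys; i++)
	{
		if (!check[i])
			return false;		/* a required entry is definitely absent */
	}
	return true;				/* may match; the real consistent check runs later */
}
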
Some example: frequent_term & rare_term becomes pretty fast.

create table test as (select to_tsvector('english', 'bbb') as v from
generate_series(1,100));
insert into test (select to_tsvector('english', 'ddd') from
generate_series(1,10));
create index test_idx on test using gin (v);

postgres=# explain analyze select * from test where v @@
to_tsquery('english', 'bbb & ddd');
  QUERY PLAN
---
 Bitmap Heap Scan on test  (cost=942.75..7280.63 rows=5000 width=17)
(actual time=0.458..0.461 rows=10 loops=1)
   Recheck Cond: (v @@ '''bbb'' & ''ddd'''::tsquery)
   -  Bitmap Index Scan on test_idx  (cost=0.00..941.50 rows=5000 width=0)
(actual time=0.449..0.449 rows=10 loops=1)
 Index Cond: (v @@ '''bbb'' & ''ddd'''::tsquery)
 Total runtime: 0.516 ms
(5 rows)


--
With best regards,
Alexander Korotkov.


gin_fast_scan.1.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] stray SIGALRM

2013-06-14 Thread Tom Lane
Richard Poole rich...@2ndquadrant.com writes:
 In 9.3beta1, a backend will receive a SIGALRM after authentication_timeout
 seconds, even if authentication has been successful. Most of the time
 this doesn't hurt anyone, but there are cases, such as when the backend
 is doing the open() of a backend copy, when it breaks things and results
 in an error getting reported to the client. In particular, if you're doing
 a copy from a FIFO, it is normal for open() to block until the process at
 the other end has data ready, so you're very likely to have it interrupted
 by the SIGALRM and fail.

 To see the SIGALRM just run psql then determine your backend's pid,
 attach an strace to it, and wait 60 seconds, or whatever you've got
 authentication_timeout set to.

 This behaviour appears in 6ac7facdd3990baf47efc124e9d7229422a06452 as a
 side-effect of speeding things up by getting rid of setitimer() calls;
 it's not obvious what's a good way to fix it without losing the benefits
 of that commit.

Ugh.  It doesn't sound very practical to try to guarantee that every
single kernel call in the backend is set up to recover from EINTR,
even though ideally they should all be able to cope.  Maybe we have to
revert those signal-handling changes.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Minmax indexes

2013-06-14 Thread Josh Berkus
Alvaro,

This sounds really interesting, and I can see the possibilities.
However ...

 Value changes in columns that are part of a minmax index, and tuple insertion
 in summarized pages, would invalidate the stored min/max values.  To support
 this, each minmax index has a validity map; a range can only be considered in
 a scan if it hasn't been invalidated by such changes (A range not considered
 in the scan needs to be returned in whole regardless of the stored min/max
 values, that is, it cannot be pruned per query quals).  The validity map is
 very similar to the visibility map in terms of performance characteristics:
 quick enough that it's not contentious, allowing updates and insertions to
 proceed even when data values violate the minmax index conditions.  An
 invalidated range can be made valid by re-summarization (see below).

This begins to sound like these indexes are only useful on append-only
tables.  Not that there aren't plenty of those, but ...

 Re-summarization is relatively expensive, because the complete page range has
 to be scanned.

Why?  Why can't we just update the affected pages in the index?

  To avoid this, a table having a minmax index would be
 configured so that inserts only go to the page(s) at the end of the table;
 this avoids frequent invalidation of ranges in the middle of the table.  We
 provide a table reloption that tweaks the FSM behavior, so that summarized
 pages are not candidates for insertion.

We haven't had an index type which modifies table insertion behavior
before, and I'm not keen to start now; imagine having two indexes on the
same table each with their own, conflicting, requirements.  This is
sounding a lot more like a candidate for our prospective pluggable
storage manager.  Also, the above doesn't help us at all with UPDATEs.

If we're going to start adding reloptions for specific table behavior,
I'd rather think of all of the optimizations we might have for a
prospective append-only table and bundle those, rather than tying it
to whether a certain index exists or not.

Also, I hate the name ... if this feature goes ahead, I'm going to be
lobbying to change it.  But that's pretty minor compared to the update
issues.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] GIN improvements part 3: ordering in index

2013-06-14 Thread Alexander Korotkov
Hackers,

Attached is a patch implementing ordering inside a GIN index. This is the
third patch of the GIN improvements; see the previous two:
http://www.postgresql.org/message-id/capphfduxv-il7aedwpw0w5fxrwgakfxijwm63_hzujacrxn...@mail.gmail.com
http://www.postgresql.org/message-id/CAPpHfdvftaJq7www381naLw1=4u0h+qpxgwvnhceb9hmvyw...@mail.gmail.com

This patch introduces a new interface method of GIN which takes the same
arguments as consistent but returns float8.
float8 gin_ordering(bool check[], StrategyNumber n, Datum query, int32
nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool
nullFlags[], Datum addInfo[], bool addInfoIsNull[])
This patch implements the gingettuple method, which can return ordering data
using the KNN infrastructure. It also introduces an operator for FTS which
supports ordering in a GIN index. Some example:

postgres=# explain analyze select * from dblp_titles2 where tsvector @@
to_tsquery('english', 'statistics') order by tsvector 
to_tsquery('english', 'statistics') limit 10;
   QUERY
PLAN
-
 Limit  (cost=12.00..48.22 rows=10 width=136) (actual time=6.999..7.120
rows=10 loops=1)
   -  Index Scan using dblp_titles2_idx on dblp_titles2
 (cost=12.00..43003.03 rows=11868 width=136) (actual time=6.996..7.115
rows=10 loops=1)
 Index Cond: (tsvector @@ '''statist'''::tsquery)
 Order By: (tsvector  '''statist'''::tsquery)
 Total runtime: 7.556 ms
(5 rows)

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] extensible external toast tuple support

2013-06-14 Thread Andres Freund
On 2013-05-31 23:42:51 -0400, Robert Haas wrote:
 On Thu, May 30, 2013 at 7:42 AM, Andres Freund and...@2ndquadrant.com wrote:
  In
  http://archives.postgresql.org/message-id/20130216164231.GA15069%40awork2.anarazel.de
  I presented the need for 'indirect' toast tuples which point into memory
  instead of a toast table. In the comments to that proposal, off-list and
  in-person talks the wish to make that a more general concept has
  been voiced.
 
  The previous patch used varattrib_1b_e.va_len_1be to discern between
  different types of external tuples. That obviously only works if the
  data sizes of all possibly stored datum types are distinct which isn't
  nice. So what the newer patch now does is to rename that field into
  'va_tag' and decide based on that what kind of Datum we have. To get the
  actual length of that datum there now is a VARTAG_SIZE() macro which
  maps the tags back to size.
  To keep on-disk compatibility the size of an external toast tuple
  containing a varatt_external is used as its tag value.
 
  This should allow for fairly easy development of a new compression
  scheme for out-of-line toast tuples. It will *not* work for compressed
  inline tuples (i.e. VARATT_4B_C). I am not convinced that that is a
  problem or that if it is, that it cannot be solved separately.
 
  FWIW, in some quick microbenchmarks I couldn't find any performance
  difference due to the slightly more complex size computation which I do
  *not* find surprising.
 
  Opinions?
 
 Seems pretty sensible to me.  The patch is obviously WIP but the
 direction seems fine to me.

Here's the updated version. It shouldn't contain any obvious WIP pieces
anymore, although I think it needs some more documentation. I am just
not sure where to add it yet, postgres.h seems like a bad place :/

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 654e24e9a615dcacea4d9714cf8cdbf6953983d5 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Tue, 11 Jun 2013 23:25:26 +0200
Subject: [PATCH] Add support for multiple kinds of external toast datums

There are several usecases where our current representation of external toast
datums is limiting:
* adding new compression schemes
* avoidance of repeated detoasting
* externally decoded toast tuples

For that support 'tags' on external (varattrib_1b_e) varlenas which recoin the
current va_len_1be field to store the tag (or type) of a varlena. To determine
the actual length a macro VARTAG_SIZE(tag) is added which can be used to map
from a tag to the actual length.

This patch adds support for 'indirect' tuples which point to some externally
allocated memory containing a toast tuple. It also implements the stub for a
different compression algorithm.
---
 src/backend/access/heap/tuptoaster.c | 100 +++
 src/include/c.h  |   2 +
 src/include/postgres.h   |  83 +
 3 files changed, 153 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index fc37ceb..99044d0 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -128,7 +128,7 @@ heap_tuple_fetch_attr(struct varlena * attr)
 struct varlena *
 heap_tuple_untoast_attr(struct varlena * attr)
 {
-	if (VARATT_IS_EXTERNAL(attr))
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
 	{
 		/*
 		 * This is an externally stored datum --- fetch it back from there
@@ -145,6 +145,15 @@ heap_tuple_untoast_attr(struct varlena * attr)
 			pfree(tmp);
 		}
 	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		struct varatt_indirect redirect;
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *)redirect.pointer;
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		attr = heap_tuple_untoast_attr(attr);
+	}
 	else if (VARATT_IS_COMPRESSED(attr))
 	{
 		/*
@@ -191,7 +200,7 @@ heap_tuple_untoast_attr_slice(struct varlena * attr,
 	char	   *attrdata;
 	int32		attrsize;
 
-	if (VARATT_IS_EXTERNAL(attr))
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
 	{
 		struct varatt_external toast_pointer;
 
@@ -204,6 +213,13 @@ heap_tuple_untoast_attr_slice(struct varlena * attr,
 		/* fetch it back (compressed marker will get set automatically) */
 		preslice = toast_fetch_datum(attr);
 	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		struct varatt_indirect redirect;
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		return heap_tuple_untoast_attr_slice(redirect.pointer,
+			 sliceoffset, slicelength);
+	}
 	else
 		preslice = attr;
 
@@ -267,7 +283,7 @@ toast_raw_datum_size(Datum value)
 	struct varlena *attr = (struct varlena *) DatumGetPointer(value);
 	Size		result;
 
-	if (VARATT_IS_EXTERNAL(attr))
+	if (VARATT_IS_EXTERNAL_ONDISK(attr))
 	{
 		/* va_rawsize is the size of the original datum -- including header */
 		struct 

Re: [HACKERS] [RFC] Minmax indexes

2013-06-14 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 To avoid this, a table having a minmax index would be
 configured so that inserts only go to the page(s) at the end of the table;
 this avoids frequent invalidation of ranges in the middle of the table.  We
 provide a table reloption that tweaks the FSM behavior, so that summarized
 pages are not candidates for insertion.

 We haven't had an index type which modifies table insertion behavior
 before, and I'm not keen to start now; imagine having two indexes on the
 same table each with their own, conflicting, requirements.

I agree; such a restriction is a nonstarter for a secondary index.  I
don't believe that hacking the FSM would be sufficient to guarantee the
required behavior, either.

We've talked a lot about index-organized tables in the past.  How much
of the use case for this would be subsumed by a feature like that?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extensible external toast tuple support

2013-06-14 Thread Alvaro Herrera
Andres Freund wrote:

 Here's the updated version. It shouldn't contain any obvious WIP pieces
 anymore, although I think it needs some more documentation. I am just
 not sure where to add it yet, postgres.h seems like a bad place :/

How about a new file, say src/include/access/toast.h?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extensible external toast tuple support

2013-06-14 Thread Andres Freund
On 2013-06-14 19:14:15 -0400, Alvaro Herrera wrote:
 Andres Freund wrote:
 
  Here's the updated version. It shouldn't contain any obvious WIP pieces
  anymore, although I think it needs some more documentation. I am just
  not sure where to add it yet, postgres.h seems like a bad place :/
 
 How about a new file, say src/include/access/toast.h?

Well, the question is whether that buys us all that much; we need the varlena
definitions to be available pretty much everywhere. Except for section 3
- which we reduced to be pretty darn small these days - of postgres.h,
pretty much all of it is concerned with Datums, a good part of them being
varlenas.
We could move section 1) into its own file and unconditionally include
it...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pluggable compression support

2013-06-14 Thread Josh Berkus
On 06/14/2013 04:01 PM, Andres Freund wrote:
 It still contains a guc as described in the above message to control the
 algorithm used for compressing new tuples but I think we should remove
 that guc after testing.

Did you add the storage attribute?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pluggable compression support

2013-06-14 Thread Andres Freund
On 2013-06-14 17:12:01 -0700, Josh Berkus wrote:
 On 06/14/2013 04:01 PM, Andres Freund wrote:
  It still contains a guc as described in the above message to control the
  algorithm used for compressing new tuples but I think we should remove
  that guc after testing.
 
 Did you add the storage attribute?

No. I think as long as we only have pglz and one new algorithm (even if
that is lz4 instead of the current snappy) we should just always use the
new algorithm. Unless I missed it nobody seemed to have voiced a
contrary position?
For testing/evaluation the guc seems to be sufficient.

If we want to make it configurable on a per column basis I think the way
to go is to add a new column to pg_attribute and split compression
related things out of attstorage into attcompression.
That's a fair amount of work and it includes a minor compatibility break
in the catalog format, so I'd prefer not to do it until there's a good
reason to do so.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] Revive line type

2013-06-14 Thread Peter Eisentraut
Complete the implementations of line_in, line_out, line_recv,
line_send.  Remove comments and error messages about the line type not
being implemented.  Add regression tests for existing line operators
and functions.
---
This just revives existing functionality, doesn't add anything new.
One thing that the original code did not settle was how to convert a
line in form Ax+By+C=0 to the two-points output form.  Obviously, you
can just pick two random points on the line, but I wonder whether there
is a more standard solution.
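
For what it's worth, a simple (if arbitrary) way to pick the two points,
sketched here only to illustrate the math rather than as the patch's code, is
to fix two x values when B is nonzero and two y values for a vertical line:

static void
line_to_points(double A, double B, double C,
			   double *x1, double *y1, double *x2, double *y2)
{
	if (B != 0.0)
	{
		/* pick x = 0 and x = 1, solve A*x + B*y + C = 0 for y */
		*x1 = 0.0;
		*y1 = -C / B;
		*x2 = 1.0;
		*y2 = -(A + C) / B;
	}
	else
	{
		/* vertical line: x is fixed at -C/A, pick any two distinct y values */
		*x1 = -C / A;
		*y1 = 0.0;
		*x2 = -C / A;
		*y2 = 1.0;
	}
}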

 doc/src/sgml/datatype.sgml |   34 +++-
 doc/src/sgml/func.sgml |6 +
 src/backend/utils/adt/geo_ops.c|  108 +
 src/include/catalog/pg_type.h  |3 +-
 src/include/utils/geo_decls.h  |7 -
 src/test/regress/expected/geometry.out |3 -
 src/test/regress/expected/line.out |  243 
 src/test/regress/expected/sanity_check.out |3 +-
 src/test/regress/output/misc.source|3 +-
 src/test/regress/parallel_schedule |2 +-
 src/test/regress/serial_schedule   |1 +
 src/test/regress/sql/geometry.sql  |4 -
 src/test/regress/sql/line.sql  |   77 +
 13 files changed, 408 insertions(+), 86 deletions(-)
 create mode 100644 src/test/regress/expected/line.out
 create mode 100644 src/test/regress/sql/line.sql

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index f73e6b2..ecbbdd8 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -3066,7 +3066,7 @@ <title>Geometric Types</title>
      <row>
       <entry><type>line</type></entry>
       <entry>32 bytes</entry>
-      <entry>Infinite line (not fully implemented)</entry>
+      <entry>Infinite line</entry>
       <entry>((x1,y1),(x2,y2))</entry>
      </row>
      <row>
@@ -3142,6 +3142,38 @@ <title>Points</title>
    </sect2>
 
    <sect2>
+    <title>Lines</title>
+
+    <indexterm>
+     <primary>line</primary>
+    </indexterm>
+
+    <para>
+     Lines (<type>line</type>) are specified by pairs of points.
+     Values of type <type>line</type> are specified using any of the following
+     syntaxes:
+
+<synopsis>
+[ ( <replaceable>x1</replaceable> , <replaceable>y1</replaceable> ) , ( <replaceable>x2</replaceable> , <replaceable>y2</replaceable> ) ]
+( ( <replaceable>x1</replaceable> , <replaceable>y1</replaceable> ) , ( <replaceable>x2</replaceable> , <replaceable>y2</replaceable> ) )
+  ( <replaceable>x1</replaceable> , <replaceable>y1</replaceable> ) , ( <replaceable>x2</replaceable> , <replaceable>y2</replaceable> )
+    <replaceable>x1</replaceable> , <replaceable>y1</replaceable>   ,   <replaceable>x2</replaceable> , <replaceable>y2</replaceable>
+</synopsis>
+
+     where
+     <literal>(<replaceable>x1</replaceable>,<replaceable>y1</replaceable>)</literal>
+     and
+     <literal>(<replaceable>x2</replaceable>,<replaceable>y2</replaceable>)</literal>
+     are two (different) points on the line.
+    </para>
+
+    <para>
+     Lines are output using the first syntax.  The points used in the output
+     are not necessarily the points used on input.
+    </para>
+   </sect2>
+
+   <sect2>
     <title>Line Segments</title>
 
    <indexterm>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4c5af4b..835a189 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -8070,6 +8070,12 @@ <title>Geometric Type Conversion Functions</title>
        <entry><literal>circle(polygon '((0,0),(1,1),(2,0))')</literal></entry>
       </row>
       <row>
+       <entry><literal><function>line(<type>point</type>, <type>point</type>)</function></literal></entry>
+       <entry><type>line</type></entry>
+       <entry>points to line</entry>
+       <entry><literal>lseg(point '(-1,0)', point '(1,0)')</literal></entry>
+      </row>
+      <row>
        <entry>
         <indexterm>
          <primary>lseg</primary>
diff --git a/src/backend/utils/adt/geo_ops.c b/src/backend/utils/adt/geo_ops.c
index ad18cf0..61a1900 100644
--- a/src/backend/utils/adt/geo_ops.c
+++ b/src/backend/utils/adt/geo_ops.c
@@ -933,13 +933,8 @@
 Datum
 line_in(PG_FUNCTION_ARGS)
 {
-#ifdef ENABLE_LINE_TYPE
char   *str = PG_GETARG_CSTRING(0);
-#endif
LINE   *line;
-
-#ifdef ENABLE_LINE_TYPE
-	/* when fixed, modify "not implemented", catalog/pg_type.h and SGML */
LSEGlseg;
int isopen;
char   *s;
@@ -950,15 +945,13 @@
 				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 				 errmsg("invalid input syntax for type line: \"%s\"", str)));
 
+	if (FPeq(lseg.p[0].x, lseg.p[1].x) && FPeq(lseg.p[0].y, lseg.p[1].y))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+				 errmsg("invalid line specification: must be two distinct points")));
+
line = (LINE *) palloc(sizeof(LINE));
line_construct_pts(line, lseg.p[0], lseg.p[1]);
-#else
-   ereport(ERROR,
-   

Re: [HACKERS] [RFC] Minmax indexes

2013-06-14 Thread Greg Stark
On Fri, Jun 14, 2013 at 11:28 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Re-summarization is relatively expensive, because the complete page range has
 to be scanned.

That doesn't sound too bad to me. It just means there's a downside to
having larger page ranges. I would expect the page ranges to be
something in the ballpark of 32 pages --  scanning 32 pages to
resummarize doesn't sound that painful but sounds like it's large
enough that the resulting index would be a reasonable size.

But I don't understand why an insert would invalidate a tuple. An insert
can just update the min and max incrementally. It's a delete that
invalidates the range, but as you note it doesn't really invalidate it,
just marks it as needing a refresh -- and even then only if the value
being deleted is equal to either the min or max.
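
A minimal sketch of that distinction (invented names, not the proposed
implementation): an insert can only widen the stored summary, while only
deleting a current boundary value can make it stale.

#include <stdbool.h>

typedef struct RangeSummary
{
	double		min_val;
	double		max_val;
	bool		needs_resummarize;
} RangeSummary;

static void
summary_note_insert(RangeSummary *s, double newval)
{
	/* widening the summary keeps it valid, no rescan needed */
	if (newval < s->min_val)
		s->min_val = newval;
	if (newval > s->max_val)
		s->max_val = newval;
}

static void
summary_note_delete(RangeSummary *s, double oldval)
{
	/* only removing a boundary value can leave the summary too wide */
	if (oldval <= s->min_val || oldval >= s->max_val)
		s->needs_resummarize = true;
}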

 Same-size page ranges?
 Current related literature seems to consider that each index entry in a
 minmax index must cover the same number of pages.  There doesn't seem to be a

I assume the reason for this in the literature is the need to quickly
find the summary for a given page when you're handling an insert or
delete. If you have some kind of metadata structure that lets you
find it (which I gather is what the validity map is?) then you
wouldn't need it. But that seems like a difficult cost to justify
compared to just having a 1:1 mapping from block to bitmap tuple.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pluggable compression support

2013-06-14 Thread Josh Berkus

 No. I think as long as we only have pglz and one new algorithm (even if
 that is lz4 instead of the current snappy) we should just always use the
 new algorithm. Unless I missed it nobody seemed to have voiced a
 contrary position?
 For testing/evaluation the guc seems to be sufficient.

Then it's not pluggable, is it?  It's upgradable compression
support, if anything.  Which is fine, but let's not confuse people.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pluggable compression support

2013-06-14 Thread Andres Freund
On 2013-06-14 17:35:02 -0700, Josh Berkus wrote:
 
  No. I think as long as we only have pglz and one new algorithm (even if
  that is lz4 instead of the current snappy) we should just always use the
  new algorithm. Unless I missed it nobody seemed to have voiced a
  contrary position?
  For testing/evaluation the guc seems to be sufficient.
 
 Then it's not pluggable, is it?  It's upgradable compression
 support, if anything.  Which is fine, but let's not confuse people.

The point is that it's pluggable on the storage level in the sense
that several different algorithms can coexist and new ones can be
relatively easily added.
That part is what seems to have blocked progress for quite a while
now. So fixing that seems to be the interesting thing.

I am happy enough to do the work of making it configurable if we want it
to be... But I have zap interest of doing it and throw it away in the
end because we decide we don't need it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dynamic background workers

2013-06-14 Thread Michael Paquier
On Sat, Jun 15, 2013 at 6:00 AM, Robert Haas robertmh...@gmail.com wrote:

 The second patch, dynamic-bgworkers-v1.patch, revises the background
 worker API to allow background workers to be started dynamically.
 This requires some communication channel from ordinary workers to the
 postmaster, because it is the postmaster that must ultimately start
 the newly-registered workers.  However, that communication channel has
 to be designed pretty carefully, lest a shared memory corruption take
 out the postmaster and lead to inadvertent failure to restart after a
 crash.  Here's how I implemented that: there's an array in shared
 memory of a size equal to max_worker_processes.  This array is
 separate from the backend-private list of workers maintained by the
 postmaster, but the two are kept in sync.  When a new background
 worker registration is added to the shared data structure, the backend
 adding it uses the existing pmsignal mechanism to kick the postmaster,
 which then scans the array for new registrations.  I have attempted to
 make the code that transfers the shared_memory state into the
 postmaster's private state as paranoid as humanly possible.  The
 precautions taken are documented in the comments.  Conversely, when a
 background worker flagged as BGW_NEVER_RESTART is considered for
 restart (and we decide against it), the corresponding slot in the
 shared memory array is marked as no longer in use, allowing it to be
 reused for a new registration.

 Since the postmaster cannot take locks, synchronization between the
 postmaster and other backends using the shared memory segment has to
 be lockless.  This mechanism is also documented in the comments.  An
 lwlock is used to prevent two backends that are both registering a new
 worker at about the same time from stomping on each other, but the
 postmaster need not care about that lwlock.
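
A rough sketch of that handoff, with invented names and none of the real
patch's details (memory barriers, the pmsignal call, restart bookkeeping),
just to illustrate why the in-use flag is set last:

#include <stdio.h>
#include <stdbool.h>

#define MAX_WORKER_SLOTS 16				/* stands in for max_worker_processes */

typedef struct BgWorkerSlot
{
	volatile bool in_use;				/* set last by the registering backend */
	char		worker_name[64];
	/* ... entry point, restart policy, start time, etc. ... */
} BgWorkerSlot;

static BgWorkerSlot worker_slots[MAX_WORKER_SLOTS];	/* lives in shared memory */

/* caller is assumed to hold the registration lwlock mentioned above */
static int
register_worker_slot(const char *name)
{
	int			i;

	for (i = 0; i < MAX_WORKER_SLOTS; i++)
	{
		if (!worker_slots[i].in_use)
		{
			snprintf(worker_slots[i].worker_name,
					 sizeof(worker_slots[i].worker_name), "%s", name);
			/*
			 * A write barrier belongs here, so the postmaster never sees
			 * in_use = true before the slot contents are filled in.
			 */
			worker_slots[i].in_use = true;
			/* then signal the postmaster so it rescans the array */
			return i;
		}
	}
	return -1;						/* no free slot */
}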

 This patch also extends worker_spi as a demonstration of the new
 interface.  With this patch, you can CREATE EXTENSION worker_spi and
 then call worker_spi_launch(int4) to launch a new background worker,
 or combine it with generate_series() to launch a bunch at once.  Then
 you can kill them off with pg_terminate_backend() and start some new
 ones.  That, in my humble opinion, is pretty cool.

This looks really interesting, +1. I'll test the patch if possible next
week.
-- 
Michael


Re: [HACKERS] pluggable compression support

2013-06-14 Thread Robert Haas
On Fri, Jun 14, 2013 at 8:45 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-06-14 17:35:02 -0700, Josh Berkus wrote:

  No. I think as long as we only have pglz and one new algorithm (even if
  that is lz4 instead of the current snappy) we should just always use the
  new algorithm. Unless I missed it nobody seemed to have voiced a
  contrary position?
  For testing/evaluation the guc seems to be sufficient.

 Then it's not pluggable, is it?  It's upgradable compression
 support, if anything.  Which is fine, but let's not confuse people.

 The point is that it's pluggable on the storage level in the sense of
 that several different algorithms can coexist and new ones can
 relatively easily added.
 That part is what seems to have blocked progress for quite a while
 now. So fixing that seems to be the interesting thing.

 I am happy enough to do the work of making it configurable if we want it
 to be... But I have zap interest of doing it and throw it away in the
 end because we decide we don't need it.

I don't think we need it.  I think what we need to do is decide which
algorithm is legally OK to use.  And then put it in.

In the past, we've had a great deal of speculation about that legal
question from people who are not lawyers.  Maybe it would be valuable
to get some opinions from people who ARE lawyers.  Tom and Heikki both
work for real big companies which, I'm guessing, have substantial
legal departments; perhaps they could pursue getting the algorithms of
possible interest vetted.  Or, I could try to find out whether it's
possible do something similar through EnterpriseDB.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pluggable compression support

2013-06-14 Thread Joshua D. Drake


On 06/14/2013 06:56 PM, Robert Haas wrote:


On Fri, Jun 14, 2013 at 8:45 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2013-06-14 17:35:02 -0700, Josh Berkus wrote:



No. I think as long as we only have pglz and one new algorithm (even if
that is lz4 instead of the current snappy) we should just always use the
new algorithm. Unless I missed it nobody seemed to have voiced a
contrary position?
For testing/evaluation the guc seems to be sufficient.


Then it's not pluggable, is it?  It's upgradable compression
support, if anything.  Which is fine, but let's not confuse people.


The point is that it's pluggable on the storage level in the sense of
that several different algorithms can coexist and new ones can
relatively easily added.
That part is what seems to have blocked progress for quite a while
now. So fixing that seems to be the interesting thing.

I am happy enough to do the work of making it configurable if we want it
to be... But I have zap interest of doing it and throw it away in the
end because we decide we don't need it.


I don't think we need it.  I think what we need is to decide is which
algorithm is legally OK to use.  And then put it in.

In the past, we've had a great deal of speculation about that legal
question from people who are not lawyers.  Maybe it would be valuable
to get some opinions from people who ARE lawyers.  Tom and Heikki both
work for real big companies which, I'm guessing, have substantial
legal departments; perhaps they could pursue getting the algorithms of
possible interest vetted.  Or, I could try to find out whether it's
possible do something similar through EnterpriseDB.


We have IP legal representation through Software in the Public Interest, 
which pretty much specializes in this type of thing.


Should I follow up? If so, I need a summary of the exact question 
including licenses etc.


JD








--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

