Re: [HACKERS] Resource Owner reassign Locks

2015-06-07 Thread Jeff Janes
On Thu, Jun 21, 2012 at 5:32 AM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 On 18.06.2012 13:59, Heikki Linnakangas wrote:

 On 10.06.2012 23:39, Jeff Janes wrote:
 I found the interface between resowner.c and lock.c a bit confusing.
 resowner.c would sometimes call LockReassignCurrentOwner() to reassign
 all the locks, and sometimes it would call LockReassignCurrentOwner() on
 each individual lock, with the same net effect as calling
 LockReleaseCurrentOwner(). And it doesn't seem right for
 ResourceOwnerRemember/ForgetLock to have to accept a NULL owner.

 I rearranged that so that there's just a single
 LockReassignCurrentOwner() function, like before this patch. But it
 takes as an optional argument a list of locks held by the current
 resource owner. If the caller knows them, it can pass them to make the
 call faster, but if it doesn't it can just pass NULL and the function
 will traverse the hash table the old-fashioned way. I think that's a
 better API.

 Please take a look to see if I broke something.


 I hear no complaints, so committed. Thanks!


I'd like to advocate for back-patching this to 9.0, 9.1, and 9.2.  It has
run without problems for a while now, and it can be considered a bug that
systems with a very large number of objects cannot be upgraded in a
reasonable time.

There are possible ways to make the upgrade smoother by changing the new
system's pg_upgrade rather than the old system's server, but those methods
would not be simpler, and would not be as well tested.

I'll add this proposal to the commit fest.

Thanks,

Jeff


Re: [HACKERS] [CORE] Restore-reliability mode

2015-06-07 Thread Jeff Janes
On Sat, Jun 6, 2015 at 7:35 AM, Geoff Winkless pgsqlad...@geoff.dj wrote:

 On 6 June 2015 at 13:41, Sehrope Sarkuni sehr...@jackdb.com wrote:



 It's much easier to work into dev/test setups if there are system
 packages as it's just a config change to an existing script. Building
 from source would require a whole new workflow that I don't have time
 to incorporate.


 Really? You genuinely don't have time to paste, say:

 mkdir -p ~/src/pgdevel
 cd ~/src/pgdevel
 wget https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.bz2
 tar xjf postgresql-snapshot.tar.bz2
 mkdir bld
 cd bld
 ../postgresql-9.5devel/configure $(pg_config --configure | sed -e
 's/\(pg\|postgresql[-\/]\)\(doc-\)\?9\.[0-9]*\(dev\)\?/\1\29.5dev/g')
 make world
 make check
 make install-world


I think this is rather uncharitable.  You don't include yum, zypper, or
apt-get anywhere in there, and I vaguely recall it took me quite a bit of
time to install all the prereqs in order to get it to compile, several years
ago when I first started trying to compile it.  And then even more time to
get it to compile with several of the config flags I wanted.  And then even
more time to get the docs to compile.

And now, after all of that, when I try your code, it still doesn't
work.  The $(pg_config ...) substitution seems not to quote things the way
configure wants them quoted, or something.  And the package whose pg_config
I was using has more config options than I was set up for when compiling
myself, so I either have to manually remove the flags, or find more
dependencies (pam, xslt, ossp-uuid, tcl, tcl-dev, and counting).  This is
not very fun, and I didn't even need to get bureaucratic approval to do any
of this stuff, like a lot of people do.

And then when I try to install it, it tries to overwrite some of the files
which were initially installed by the package manager in /usr/lib.  That
doesn't seem good.

And how do I, as a hypothetical package manager user, start this puppy up?
Where is pg_ctlcluster?  How does one do pg_upgrade between a
package-controlled data directory and this new binary?

And then when I find a bug, how do I know it is a bug and not me doing
something wrong in the build process, and getting the wrong .so to load
with the wrong binary or something like that?


 and yet you think you have enough time to provide more than a "looks like
 it's working" report to the developers?


If it isn't working, reports of "it isn't working", with error messages, are
pretty helpful and don't take a whole lot of time.  If it is working,
reports of that probably aren't terribly helpful without a lot more work
being put into them.  But people might be willing to put a lot of work into,
say, performance regression testing, if that is their area of expertise and
they don't also have to put a lot of work into getting the software to
compile in the first place, which is not.

Now, I don't see a lot of evidence of beta testing from the public (i.e.
unfamiliar names) on the -hackers and -bugs lists.  But a lot of hackers
report things that a client or someone on IRC reported to them, so I'm
willing to believe that a lot of useful beta testing does go on, even
though I don't directly see it.  And if there were alpha releases, why
wouldn't it extend to those?



 (NB the sed for the pg_config line will probably need work, it looks like
 it should work on the two types of system I have here but I have to admit I
 changed the config line manually when I built it)


Right, and are the people who use apt-get to install everything likely to
know how to do that work?


Cheers,

Jeff


Re: [HACKERS] PoC: Partial sort

2015-06-07 Thread Peter Geoghegan
On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson andr...@proxel.se wrote:
 Are you planning to work on this patch for 9.6?

FWIW I hope so. It's a nice patch.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [CORE] Restore-reliability mode

2015-06-07 Thread Kevin Grittner
Joshua D. Drake j...@commandprompt.com wrote:
 On 06/06/2015 07:14 PM, Peter Geoghegan wrote:
 On Sat, Jun 6, 2015 at 7:07 PM, Robert Haas robertmh...@gmail.com wrote:

 Perhaps we're honoring this more in the breach than in the
 observance, but I'm not making up what Tom has said about this:

 http://www.postgresql.org/message-id/27310.1251410...@sss.pgh.pa.us

That's 9.0 release discussion:

| I think that the traditional criterion is that we don't release beta1
| as long as there are any known issues that might force an initdb.
| We were successful in avoiding a post-beta initdb this time, although
| IIRC the majority of release cycles have had one --- so maybe you
| could argue that that's not so important.  It would certainly be
| less important if we had working pg_migrator functionality to ease
| the pain of going from beta to final.

 http://www.postgresql.org/message-id/19174.1299782...@sss.pgh.pa.us

That's 9.1 release discussion:

| Historically we've declared it beta once we think we are done with
| initdb-forcing problems.

| In any case, the existence of pg_upgrade means that "might we need
| another initdb?" is not as strong a consideration as it once was, so
| I'm not sure if we should still use that as a criterion.  I don't know
| quite what "ready for beta" should mean otherwise, though.

 http://www.postgresql.org/message-id/3413.1301154...@sss.pgh.pa.us

Also 9.1, it is listed as one criterion:

| * No open issues that are expected to result in a catversion bump.
| (With pg_upgrade, this is not as critical as it used to be, but
| I still think catalog stability is a good indicator of a release's
| maturity)

 http://www.postgresql.org/message-id/3261.1401915...@sss.pgh.pa.us

Here we jump to 9.4 discussion:

|  Agreed. Additionally I also agree with Stefan that the price of a initdb
|  during beta isn't that high these days.
|
| Yeah, if nothing else it gives testers another opportunity to exercise
| pg_upgrade ;-).  The policy about post-beta1 initdb is "avoid if
| practical", not "avoid at all costs".

So I think these examples show that the policy has shifted from a
pretty strong requirement to an "it's probably nice if" status, with
some benefit seen, for pg_upgrade testing, in actually having a bump.

 Of course, not doing a catversion bump after beta1 doesn't necessarily
 have much value in and of itself. *Promising* to not do a catversion
 bump, and then usually keeping that promise definitely has a certain
 value, but clearly we are incapable of that.

As someone who was able to bring up a new production application on
8.2 because it was all redundant data and not yet mission-critical,
I appreciate that in very rare circumstances that combination could
have benefit.  But really, how often are people in that position?

 It seems to me that a cat bump during Alpha or Beta should be absolutely
 fine and reservedly fine respectively. Where we should absolutely not
 cat bump unless there is absolutely no other choice is during an RC.

+1 on all of that.  And for a while now we've been talking about an
alpha test release, right?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [Proposal] More Vacuum Statistics

2015-06-07 Thread Tomas Vondra

Hi,

On 06/05/15 14:10, Naoya Anzai wrote:

Thank you for the quick feedback, and I'm sorry for the slow response.
All of your opinions were very helpful to me.

I have looked at Greg's idea of "timing events".
http://www.postgresql.org/message-id/509300f7.5000...@2ndquadrant.com

Greg said at first,
"Parsing log files for commonly needed performance data is no fun."
Yes, I completely agree with him.



That looks like a nice idea, but I don't know why it has
not been committed yet. Does anybody know?


Most likely lack of time, I guess.



I have reworked my idea after hearing the hackers' opinions.


pg_stat_vacuum view


I understand it is not good to simply add more counters to
pg_stat_*_tables. For now, I'd like to suggest an extension
which can report vacuum statistics, like pg_stat_statements.


I don't see how you want to collect the necessary information from an
extension. pg_stat_statements gets most of its stats from the BufferUsage 
structure, but vacuum keeps all this internal, AFAIK.


So it'd be necessary to make this somehow public - either by creating 
something like BufferUsage with all the vacuum stats, or perhaps a set 
of callbacks (either a single log_relation_vacuum or different callbacks 
for tables and indexes).


IMHO the callbacks are a better idea - for example because it naturally 
handles database-wide vacuum. The global structure makes this difficult, 
because you'll only see data for all the vacuumed objects (or it'd have 
to track per-object stats internally, somehow).



VACUUM is one of the most important features in PostgreSQL, but
no special view for vacuum exists. Don't you think that is
inconvenient? At least, I am disgruntled that we need to
parse pg_log to tune VACUUM.


+1


My first design of the pg_stat_vacuum views is as follows.
(There are two views.)

pg_stat_vacuum_table
---
dbid
schemaname
relid
relname
elapsed
page_removed
page_remain
page_skipped
tuple_removed
tuple_remain
tuple_notremovable
buffer_hit
buffer_miss
buffer_dirty
avg_read
avg_write
vm_count
vac_start
vac_end
is_autovacuum

pg_stat_vacuum_index
---
dbid
schemaname
relid
indexrelid
indexname
elapsed
num_index_tuples
num_pages
tuples_removed
pages_deleted
pages_free
is_autovacuum

At present, I think the memory design of pg_stat_statements can
be adapted for this feature. And I think this module needs to
provide the following parameters, like pg_stat_statements.


I'm not really sure about this.

Firstly, the very first response from TL in this thread was that adding 
per-table counters is not a particularly good idea, as it'll bloat the 
statistics files. It's true you're not adding the data into the main 
stats files, but you effectively establish a new 'vertical partition' 
with one record per table/index. It might be worth the overhead, if it 
really brings useful functionality (especially if it's an opt-in feature, 
like pg_stat_statements).


Secondly, the main issue of this design IMHO is that it only tracks the 
very last vacuum run (or do I understand it wrong?). That means even if 
you snapshot the pg_stat_vacuum views, you won't know how many vacuums 
were executed in between (and the more frequently you snapshot, the 
greater the overhead). The other stats counters have the same issue, but 
the snapshotting works a bit better because the counters are cumulative 
(so you can easily do deltas etc.). But that's not the case here - 
certainly not with the timestamps, for example.


I don't think the vacuum start/end timestamps are particularly 
interesting, TBH - we already have them in pg_stat_all_tables anyway, 
including the vacuum_count etc. So I'd propose dropping the timestamps, 
possibly replacing them with a single 'elapsed time', and making all the 
counters cumulative (so that you can do snapshots and deltas).


I'm also wondering whether this should track the vacuum costs (because 
that determines how aggressive the vacuum is, and how much work will be 
done in a particular time), whether it was an anti-wraparound vacuum, 
whether ANALYZE was also performed, whether the autovacuum was interrupted 
because of user activity, etc.



pg_stat_vacuum.max(integer)
pg_stat_vacuum.save(boolean)
pg_stat_vacuum.excluded_dbnames(text)
pg_stat_vacuum.excluded_schemas(text)
pg_stat_vacuum.min_duration(integer)
... and so on.

To implement this feature, I have to collect vacuum stats in
each lazy_vacuum_* function, and I need to embed a hook point
where needed (probably at the end of lazy_vacuum_rel).
Would you hesitate to add a hook only for this purpose?


Aha! So you plan to use the callbacks.



A similar feature is already provided by the pg_statsinfo package.
But it is a full-stack package for PG stats, and introducing it
requires redesigning pg_log and designing a repository database.
And it is not a core extension for PostgreSQL.
(I don't mean to disparage pg_statsinfo;
  I think this package is a very convenient tool.)

Everyone will be able to do more easily tuning of 

Re: [HACKERS] checkpointer continuous flushing

2015-06-07 Thread Fabien COELHO


Hello Andres,


They pretty much can't if you flush things frequently. That's why I
think this won't be acceptable without the sorting in the checkpointer.



* VERSION 2 WORK IN PROGRESS.

The implementation is more a proof-of-concept meant to gather feedback than
clean code. What it does:

 - as version 1: simplified asynchronous flush based on Andres Freund's
   patch, with sync_file_range/posix_fadvise used to hint the OS that
   the buffer must be sent to disk now.

 - added: checkpoint buffer sorting based on a 2007 patch by Takahiro Itagaki
   but with a smaller and static buffer allocated once. Also,
   sorting is done by chunks in the current version.

 - also added: sync/advise calls are now merged if possible,
   so fewer calls are used, especially when buffers are sorted,
   but also if there are few files.


* PERFORMANCE TESTS

Impacts on pgbench -M prepared -N -P 1 scale 10  (simple update pgbench
with a mostly-write activity),  with checkpoint_completion_target=0.8
and shared_buffers=1GB.

Contrary to v1, I have not tested bgwriter flushing, as its impact
in the first round was close to nought. This does not mean that particular
loads might not benefit from, or be harmed by, flushing from the bgwriter.

- 100 tps throttled max 100 ms latency over 6400 seconds
  with checkpoint_timeout=30s

  flush | sort | late transactions
off |  off | 6.0 %
off |   on | 6.1 %
 on |  off | 0.4 %
 on |   on | 0.4 % (93% improvement)

- 100 tps throttled max 100 ms latency over 4000 seconds
  with checkpoint_timeout=10mn

  flush | sort | late transactions
off |  off | 1.5 %
off |   on | 0.6 % (?!)
 on |  off | 0.8 %
 on |   on | 0.6 % (60% improvement)

- 150 tps throttled max 100 ms latency over 19600 seconds (5.5 hours)
  with checkpoint_timeout=30s

  flush | sort | late transactions
off |  off | 8.5 %
off |   on | 8.1 %
 on |  off | 0.5 %
 on |   on | 0.4 % (95% improvement)

- full speed pgbench over 6400 seconds with checkpoint_timeout=30s

  flush | sort | tps performance over per second data
off |  off | 676 +- 230
off |   on | 683 +- 213
 on |  off | 712 +- 130
 on |   on | 725 +- 116 (7.2% avg/50% stddev improvements)

- full speed pgbench over 4000 seconds with checkpoint_timeout=10mn

  flush | sort | tps performance over per second data
off |  off | 885 +- 188
off |   on | 940 +- 120 (6%/36%!)
 on |  off | 778 +- 245 (hmmm... not very consistent?)
 on |   on | 927 +- 108 (4.5% avg/43% stddev improvements)

- full speed pgbench -j2 -c4 over 6400 seconds with checkpoint_timeout=30s

  flush | sort | tps performance over per second data
off |  off | 2012 +- 747
off |   on | 2086 +- 708
 on |  off | 2099 +- 459
 on |   on | 2114 +- 422 (5% avg/44% stddev improvements)


* CONCLUSION:

For all these HDD tests, when both options are activated the tps performance
is improved, the latency is reduced and the performance is more stable
(smaller standard deviation).

Overall the effects of the two options are, not surprisingly, largely (with 
exceptions) orthogonal:

 - latency is essentially improved (60 to 95% reduction) by flushing
 - throughput is improved (4 to 7% better) thanks to sorting

In detail, some loads may benefit more from only one option activated.
Also, on SSDs both options would probably have limited benefit.

Usual caveat: these are only benches on one host at a particular time and
location, which may or may not be reproducible, and which may not be
representative as such of any other load.  The good news is that all these
tests tell the same thing.


* LOOKING FOR THOUGHTS

- The bgwriter flushing option seems ineffective; should it be removed
  from the patch?

- Move fsync as early as possible, as suggested by Andres Freund?

In these tests, when the flush option is activated, the fsync duration
at the end of the checkpoint is small: of more than 5525 checkpoint
fsyncs, 0.5% are above 1 second when flush is on, but the figure rises
to 24% when it is off. This suggests that doing the fsync as soon as
possible would probably have no significant effect on these tests.

My opinion is that this should be left out for the nonce.


- Take into account tablespaces, as pointed out by Andres Freund?

The issue is that if writes are sorted, they are not distributed 
randomly over tablespaces, inducing lower performance on such systems.


How to do it: while scanning shared_buffers, count dirty buffers for each
tablespace. Then start as many threads as tablespaces, each one doing
its own independent throttling for a tablespace? For some obscure reason 
there are 2 tablespaces by default (pg_global and pg_default), so that 
would mean at least 2 threads.


Alternatively, maybe it can be done from one thread, but it would probably 
involve some strange hocus-pocus to switch frequently between tablespaces.


--
Fabien.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1da7dfb..2e6bb10 100644
--- a/doc/src/sgml/config.sgml
+++ 

Re: [HACKERS] PoC: Partial sort

2015-06-07 Thread Andreas Karlsson

On 09/15/2014 01:58 PM, Alexander Korotkov wrote:

On Sun, Sep 14, 2014 at 9:32 AM, Peter Geoghegan p...@heroku.com wrote:

I think we might be better off if a tuplesort function was called
shortly after tuplesort_begin_heap() is called. How top-n heap sorts
work is something that largely lives in tuplesort's head. Today, we
call tuplesort_set_bound() to hint to tuplesort "by the way, this is a
top-n heap sort applicable sort". I think that with this patch, we
should then hint (where applicable) "by the way, you won't actually be
required to sort those first n indexed attributes; rather, you can
expect to scan those in logical order. You may work the rest out
yourself, and may be clever about exploiting the sorted-ness of the
first few columns." The idea of managing a bunch of tiny sorts from
within ExecSort(), and calling the new function tuplesort_reset(), seems
questionable. tuplesortstate is supposed to be private/opaque to
nodeSort.c, and the current design strains that.

I'd like to keep nodeSort.c simple. I think it's pretty clear that the
guts of this do not belong within ExecSort(), in any case. Also, the
additions there should be much better commented, wherever they finally
end up.


As I understand it, you propose to encapsulate the partial sort algorithm in
tuplesort. However, in this case we would anyway need some significant change
to its interface: let tuplesort decide when it's able to return a tuple.
Otherwise, we would miss a significant part of the LIMIT clause optimization.
tuplesort_set_bound() can't solve all the cases. There could be other
planner nodes between the partial sort and the LIMIT.


Hi,

Are you planning to work on this patch for 9.6?

I generally agree with Peter that the changes to the sorting probably 
belong in the tuplesort code rather than in the executor. This way it 
should also be theoretically possible to support mark/restore.


Andreas




[HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-07 Thread Abhijit Menon-Sen
Hi.

This is a followup to a 2014-02 discussion that led to pg_stats_temp
being excluded from pg_basebackup. At the time, excluding pg_log as well
was discussed, but nothing eventually came of it.

Recently, one of our customers has had a basebackup fail because pg_log
contained files that were 8GB:

FATAL: archive member pg_log/postgresql-20150119.log too large for tar format

I think pg_basebackup should also skip pg_log entries, as it does for
pg_stats_temp and pg_replslot, etc. I've attached a patch along those
lines for discussion.

-- Abhijit

P.S. Aren't we leaking statrelpath?
From 8db162c1385b1cdd2b0e666975b76aa814f09f35 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Mon, 27 Apr 2015 12:58:52 +0530
Subject: Skip files in pg_log during basebackup

---
 src/backend/replication/basebackup.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 1e86e4c..cc75a03 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -27,6 +27,7 @@
 #include "nodes/pg_list.h"
 #include "pgtar.h"
 #include "pgstat.h"
+#include "postmaster/syslogger.h"
 #include "replication/basebackup.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
@@ -72,6 +73,9 @@ static bool backup_started_in_recovery = false;
 /* Relative path of temporary statistics directory */
 static char *statrelpath = NULL;
 
+/* Relative path to log directory */
+static char logpath[MAXPGPATH];
+
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -157,6 +161,18 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 		else
 			statrelpath = pgstat_stat_directory;
 
+		/*
+		 * Do the same for the log_directory.
+		 */
+
+		if (is_absolute_path(Log_directory) &&
+			path_is_prefix_of_path(DataDir, Log_directory))
+			snprintf(logpath, MAXPGPATH, "./%s", Log_directory + datadirpathlen + 1);
+		else if (strncmp(Log_directory, "./", 2) != 0)
+			snprintf(logpath, MAXPGPATH, "./%s", Log_directory);
+		else
+			strncpy(logpath, Log_directory, MAXPGPATH);
+
 		/* Add a node for the base directory at the end */
 		ti = palloc0(sizeof(tablespaceinfo));
 		ti->size = opt->progress ? sendDir(".", 1, true, tablespaces, true) : -1;
@@ -965,6 +981,19 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces,
 		}
 
 		/*
+		 * We can skip pg_log (or whatever log_directory is set to, if
+		 * it's under the data directory), but include it as an empty
+		 * directory anyway, so we get permissions right.
+		 */
+		if (strcmp(pathbuf, logpath) == 0)
+		{
			if (!sizeonly)
				_tarWriteHeader(pathbuf + basepathlen + 1, NULL, &statbuf);
+			size += 512;
+			continue;
+		}
+
+		/*
 		 * Skip pg_replslot, not useful to copy. But include it as an empty
 		 * directory anyway, so we get permissions right.
 		 */
-- 
1.9.1




Re: [HACKERS] skipping pg_log in basebackup

2015-06-07 Thread Abhijit Menon-Sen
At 2015-06-08 13:09:02 +0900, michael.paqu...@gmail.com wrote:

 It seems to be that:
 http://www.postgresql.org/message-id/cahgqgwh0okz6ckpjkcwojga3ejwffm1enrmro3dkdoteaai...@mail.gmail.com

(Note that this is about calculating the wrong size, whereas my bug is
about the file being too large to be written to a tar archive.)

 And a recent discussion about that is this one:
 http://www.postgresql.org/message-id/82897A1301080E4B8E461DDAA0FFCF142A1B2660@SYD1216

Oh, sorry, I somehow did miss that thread entirely. Thanks for the
pointer. (I've added Vaishnavi to the Cc: list here.)

I'm not convinced that we need a mechanism to let people exclude the
torrent files they've stored in their data directory, but if we have to
do it, the idea of having a GUC setting rather than specifying excludes
on the basebackup command line each time does have a certain appeal.

Anyone else interested in doing it that way?

-- Abhijit




[HACKERS] Collection of memory leaks for ECPG driver

2015-06-07 Thread Michael Paquier
Hi all,

Please find attached a patch aimed at fixing a couple of memory leaks
in the ECPG driver. Coverity found them (and sometimes I did, after
reading some other code paths).
Regards,
-- 
Michael
diff --git a/src/interfaces/ecpg/compatlib/informix.c b/src/interfaces/ecpg/compatlib/informix.c
index d6de3ea..c1e3dfb 100644
--- a/src/interfaces/ecpg/compatlib/informix.c
+++ b/src/interfaces/ecpg/compatlib/informix.c
@@ -672,6 +672,7 @@ intoasc(interval * i, char *str)
 	if (!str)
 		return -errno;
 
+	free(str);
 	return 0;
 }
 
diff --git a/src/interfaces/ecpg/ecpglib/connect.c b/src/interfaces/ecpg/ecpglib/connect.c
index 55c5680..12f445d 100644
--- a/src/interfaces/ecpg/ecpglib/connect.c
+++ b/src/interfaces/ecpg/ecpglib/connect.c
@@ -298,7 +298,8 @@ ECPGconnect(int lineno, int c, const char *name, const char *user, const char *p
 		envname = getenv("PG_DBPATH");
 		if (envname)
 		{
-			ecpg_free(dbname);
+			if (dbname)
+				ecpg_free(dbname);
 			dbname = ecpg_strdup(envname, lineno);
 		}
 
@@ -314,14 +315,19 @@ ECPGconnect(int lineno, int c, const char *name, const char *user, const char *p
 	/* check if the identifier is unique */
 	if (ecpg_get_connection(connection_name))
 	{
-		ecpg_free(dbname);
+		if (dbname)
+			ecpg_free(dbname);
 		ecpg_log("ECPGconnect: connection identifier %s is already in use\n",
  connection_name);
 		return false;
 	}
 
 	if ((this = (struct connection *) ecpg_alloc(sizeof(struct connection), lineno)) == NULL)
+	{
+		if (dbname)
+			ecpg_free(dbname);
 		return false;
+	}
 
 	if (dbname != NULL)
 	{
diff --git a/src/interfaces/ecpg/preproc/descriptor.c b/src/interfaces/ecpg/preproc/descriptor.c
index 053a7af..ebd95d3 100644
--- a/src/interfaces/ecpg/preproc/descriptor.c
+++ b/src/interfaces/ecpg/preproc/descriptor.c
@@ -175,6 +175,7 @@ output_get_descr(char *desc_name, char *index)
 	for (results = assignments; results != NULL; results = results->next)
 	{
 		const struct variable *v = find_variable(results->variable);
+		char *str_zero = mm_strdup("0");
 
 		switch (results->value)
 		{
@@ -188,7 +189,8 @@ output_get_descr(char *desc_name, char *index)
 break;
 		}
 		fprintf(yyout, "%s,", get_dtype(results->value));
-		ECPGdump_a_type(yyout, v->name, v->type, v->brace_level, NULL, NULL, -1, NULL, NULL, mm_strdup("0"), NULL, NULL);
+		ECPGdump_a_type(yyout, v->name, v->type, v->brace_level, NULL, NULL, -1, NULL, NULL, str_zero, NULL, NULL);
+		free(str_zero);
 	}
 	drop_assignments();
 	fputs("ECPGd_EODT);\n", yyout);
@@ -292,8 +294,12 @@ output_set_descr(char *desc_name, char *index)
 			case ECPGd_indicator:
 			case ECPGd_length:
 			case ECPGd_type:
-fprintf(yyout, "%s,", get_dtype(results->value));
-ECPGdump_a_type(yyout, v->name, v->type, v->brace_level, NULL, NULL, -1, NULL, NULL, mm_strdup("0"), NULL, NULL);
+{
+	char *str_zero = mm_strdup("0");
+	fprintf(yyout, "%s,", get_dtype(results->value));
+	ECPGdump_a_type(yyout, v->name, v->type, v->brace_level, NULL, NULL, -1, NULL, NULL, str_zero, NULL, NULL);
+	free(str_zero);
+}
 break;
 
 			default:
diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l
index c70f298..0453409 100644
--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -27,7 +27,7 @@
 extern YYSTYPE yylval;
 
 static int		xcdepth = 0;	/* depth of nesting in slash-star comments */
-static char	   *dolqstart;		/* current $foo$ quote start string */
+static char	   *dolqstart = NULL;	/* current $foo$ quote start string */
 static YY_BUFFER_STATE scanbufhandle;
 static char *scanbuf;
 
@@ -550,6 +550,8 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})(.*\\{space})*.
 			}
 <SQL>{dolqdelim} {
 token_start = yytext;
+if (dolqstart)
+	free(dolqstart);
 dolqstart = mm_strdup(yytext);
 BEGIN(xdolq);
 startlit();
diff --git a/src/interfaces/ecpg/preproc/variable.c b/src/interfaces/ecpg/preproc/variable.c
index 2ad7b5d..bb4b664 100644
--- a/src/interfaces/ecpg/preproc/variable.c
+++ b/src/interfaces/ecpg/preproc/variable.c
@@ -437,11 +437,13 @@ remove_variable_from_list(struct arguments ** list, struct variable * var)
 void
 dump_variables(struct arguments * list, int mode)
 {
-	char	   *str_zero = mm_strdup("0");
+	char	   *str_zero;
 
 	if (list == NULL)
 		return;
 
+	str_zero = mm_strdup("0");
+
 	/*
 	 * The list is build up from the beginning so lets first dump the end of
 	 * the list:



Re: [HACKERS] Restore-reliability mode

2015-06-07 Thread Peter Geoghegan
On Sat, Jun 6, 2015 at 12:58 PM, Noah Misch n...@leadboat.com wrote:
   - Call VALGRIND_MAKE_MEM_NOACCESS() on a shared buffer when its local pin
 count falls to zero.  Under CLOBBER_FREED_MEMORY, wipe a shared buffer
 when its global pin count falls to zero.

Did a patch for this ever materialize?


-- 
Peter Geoghegan




Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-07 Thread Michael Paquier
On Mon, Jun 8, 2015 at 12:42 PM, Abhijit Menon-Sen a...@2ndquadrant.com wrote:
 This is a followup to a 2014-02 discussion that led to pg_stats_temp
 being excluded from pg_basebackup. At the time, excluding pg_log as well
 was discussed, but nothing eventually came of it.

It seems to be that:
http://www.postgresql.org/message-id/cahgqgwh0okz6ckpjkcwojga3ejwffm1enrmro3dkdoteaai...@mail.gmail.com

 Recently, one of our customers has had a basebackup fail because pg_log
 contained files that were 8GB:
 FATAL: archive member pg_log/postgresql-20150119.log too large for tar format

 I think pg_basebackup should also skip pg_log entries, as it does for
 pg_stats_temp and pg_replslot, etc. I've attached a patch along those
 lines for discussion.

And a recent discussion about that is this one:
http://www.postgresql.org/message-id/82897A1301080E4B8E461DDAA0FFCF142A1B2660@SYD1216
Which brings up the point: some users may want to keep log files in a base
backup, and some users may want to skip some of them, and not only
pg_log. Hence we may want more flexibility than what is proposed here.
Regards,
-- 
Michael




Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-06-07 Thread Amit Kapila
On Mon, Jun 8, 2015 at 5:52 AM, Andrew Dunstan and...@dunslane.net wrote:

 On 06/05/2015 11:08 PM, Amit Kapila wrote:


 Okay, I think I can understand why you want to be cautious for
 having a different check for this path, but in that case there is a
 chance that recovery might fail when it will try to create a symlink
 for that file.  Shouldn't we then try to call this new function only
 when we are trying to restore from tablespace_map file and also
 is there a need of ifdef S_ISLINK in remove_tablespace_link?


 Shall I send an updated patch on these lines or do you want to
 proceed with v3 version?



 The point seems to me that we should not be removing anything that's not
an empty directory or symlink, and that nothing else has any business being
in pg_tblspc. If we encounter such a thing whose name conflicts with the
name of a tablespace link we wish to create, rather than quietly blowing it
away we should complain loudly, and error out, making it the user's
responsibility to clean up their mess. Am I missing something?


How about if it is just a flat file with same name as tablespace link,
why we want to give error for that case?  I think now it just don't do
anything with that file (unlink will fail with ENOENT and it will be
ignored, atleast thats the way currently it behaves in Windows) and
create a separate symlink with same name which seems okay to
me and in the change proposed by you it will give error, do you see
any reason for doing so?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Memory leak with XLogFileCopy since de768844 (WAL file with .partial)

2015-06-07 Thread Michael Paquier
On Fri, Jun 5, 2015 at 10:45 PM, Fujii Masao wrote:
 Why don't we call InstallXLogFileSegment() at the end of XLogFileCopy()?
 If we do that, the risk of the memory leak you're worried about will disappear entirely.

Yes, that looks fine, XLogFileCopy() would copy to a temporary file,
then install it definitely. Updated patch attached.
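The copy-to-temporary-then-install flow being agreed on here is a general crash-safety pattern. A minimal sketch, with illustrative names (this is not the actual xlog.c code, which also handles zero-filling reads and WAL-specific paths):

```python
import os
import tempfile

def copy_then_install(srcpath, dstpath, upto):
    """Copy the first `upto` bytes of srcpath (zero-padding the rest) into a
    temporary file, fsync it, then rename it into place, so a crash never
    leaves a partially written file under its final name."""
    with open(srcpath, "rb") as src:
        data = src.read(upto).ljust(upto, b"\0")  # rest is filled with zeros
    dstdir = os.path.dirname(dstpath) or "."
    fd, tmppath = tempfile.mkstemp(dir=dstdir)    # temp file on same filesystem
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.rename(tmppath, dstpath)               # the atomic "install" step
    except BaseException:
        os.unlink(tmppath)
        raise
```

The key design point, mirrored by InstallXLogFileSegment(), is that the final name only ever appears via rename of a fully written, synced file.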
-- 
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 666fa37..c03f70e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -807,7 +807,7 @@ static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	   bool find_free, XLogSegNo max_segno,
-	   bool use_lock);
+	   bool use_lock, int elevel);
 static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			 int source, bool notexistOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
@@ -3012,7 +3012,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	max_segno = logsegno + CheckPointSegments;
 	if (!InstallXLogFileSegment(installed_segno, tmppath,
 *use_existent, max_segno,
-use_lock))
+use_lock, LOG))
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3041,18 +3041,16 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 /*
  * Copy a WAL segment file in pg_xlog directory.
  *
- * dstfname		destination filename
  * srcfname		source filename
  * upto			how much of the source file to copy? (the rest is filled with
  *zeros)
+ * segno		identify segment to install.
  *
- * If dstfname is not given, the file is created with a temporary filename,
- * which is returned.  Both filenames are relative to the pg_xlog directory.
- *
- * NB: Any existing file with the same name will be overwritten!
+ * The file is first copied with a temporary filename, and then installed as
+ * a newly-created segment.
  */
-static char *
-XLogFileCopy(char *dstfname, char *srcfname, int upto)
+static void
+XLogFileCopy(char *srcfname, int upto, XLogSegNo segno)
 {
 	char		srcpath[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3150,25 +3148,9 @@ XLogFileCopy(char *dstfname, char *srcfname, int upto)
 
 	CloseTransientFile(srcfd);
 
-	/*
-	 * Now move the segment into place with its final name.  (Or just return
-	 * the path to the file we created, if the caller wants to handle the rest
-	 * on its own.)
-	 */
-	if (dstfname)
-	{
-		char		dstpath[MAXPGPATH];
-
-		snprintf(dstpath, MAXPGPATH, XLOGDIR "/%s", dstfname);
-		if (rename(tmppath, dstpath) < 0)
-			ereport(ERROR,
-	(errcode_for_file_access(),
-	 errmsg("could not rename file \"%s\" to \"%s\": %m",
-			tmppath, dstpath)));
-		return NULL;
-	}
-	else
-		return pstrdup(tmppath);
+	/* install the new file */
+	(void) InstallXLogFileSegment(segno, tmppath, false,
+  0, false, ERROR);
 }
 
 /*
@@ -3195,6 +3177,8 @@ XLogFileCopy(char *dstfname, char *srcfname, int upto)
  * place.  This should be TRUE except during bootstrap log creation.  The
  * caller must *not* hold the lock at call.
  *
+ * elevel: log level used by this routine.
+ *
  * Returns TRUE if the file was installed successfully.  FALSE indicates that
  * max_segno limit was exceeded, or an error occurred while renaming the
  * file into place.
@@ -3202,7 +3186,7 @@ XLogFileCopy(char *dstfname, char *srcfname, int upto)
 static bool
 InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	   bool find_free, XLogSegNo max_segno,
-	   bool use_lock)
+	   bool use_lock, int elevel)
 {
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
@@ -3247,7 +3231,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
-		ereport(LOG,
+		ereport(elevel,
 (errcode_for_file_access(),
  errmsg("could not link file \"%s\" to \"%s\" (initialization of log file): %m",
 		tmppath, path)));
@@ -3259,7 +3243,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
-		ereport(LOG,
+		ereport(elevel,
 (errcode_for_file_access(),
  errmsg("could not rename file \"%s\" to \"%s\" (initialization of log file): %m",
 		tmppath, path)));
@@ -3748,7 +3732,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 	if (endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(endlogSegNo, path,
-			   true, recycleSegNo, true))
+			   true, recycleSegNo, true, LOG))
 	{
 		ereport(DEBUG2,
 (errmsg(recycled transaction log file \%s\,
@@ -5227,8 +5211,6 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 */
 	if (endLogSegNo == startLogSegNo)
 	{
-		char	   *tmpfname;
-
 		XLogFileName(xlogfname, endTLI, endLogSegNo);
 
 		/*
@@ -5238,9 +5220,7 @@ 

Re: [HACKERS] pg_stat_archiver issue with aborted archiver

2015-06-07 Thread Michael Paquier
On Sun, Jun 7, 2015 at 1:11 AM, Julien Rouhaud
julien.rouh...@dalibo.com wrote:
 I just noticed that if the archiver aborts (for instance if the
 archive_command exited with a return code > 127), pg_stat_archiver won't
 report those failed attempts. This happens with both 9.4 and 9.5 branches.

 Please find attached a patch that fix this issue, based on current head.

The current code seems right to me. When the archive command dies
because of a signal (exit code > 128), the server should fail
immediately with FATAL and should not do any extra processing. It will
also try to archive again the same segment file after restart. When
trying again, if this time the failure is not caused by a signal but
still fails it will be reported to pg_stat_archiver.
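The behavior Michael describes can be sketched as a classification of the archive_command's exit status. This is only an illustration; `classify_archive_command` is a made-up name, not the pgarch.c function:

```python
import subprocess

def classify_archive_command(returncode):
    """subprocess reports death by signal as a negative returncode (shells
    map signal N to exit code 128 + N). A signal death is treated as fatal:
    the server exits and retries the same segment after restart, without
    counting it. An ordinary nonzero exit is a counted, retried failure."""
    if returncode < 0:
        return "fatal"     # died from a signal: archiver exits immediately
    if returncode != 0:
        return "failure"   # counted in pg_stat_archiver's failed columns
    return "success"

rc = subprocess.call(["sh", "-c", "exit 1"])   # simulated failing command
```

Under this reading, the patch's missing-failure report only concerns the signal path, which the server deliberately does not record.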
-- 
Michael




Re: [HACKERS] Reducing tuple overhead

2015-06-07 Thread Amit Kapila
On Sun, Jun 7, 2015 at 3:02 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 23 April 2015 at 17:24, Andres Freund and...@anarazel.de wrote:



 It's hard to see how to save space there without reference to a specific
use case. I see different solutions depending upon whether we assume a low
number of transactions or a high number of transactions.


I have tried to check theoretically, how much difference such a
change could give us.

Assuming BLK_SZ - 8192 bytes; Page header - 24 bytes; each
line pointer - 4 bytes; average tuple - 150 bytes, roughly 53
tuples could be accommodated in one page.  Now each of this
tuple contains 12 bytes transaction information (xmin, xmax,
cid/combocid).  Now considering that in average workload 4
transactions operate on a page at the same time (I think for a
workload like pgbench tpc-b, it shouldn't be more otherwise it
should have been visible in perf reports),  4 additional tuples [1]
could be accommodated on a page which is approximately 7% savings
in space (which in-turns means that much less I/O).  This gain
could vary based on tuple size, no. of transactions that can operate
on page, page size..

Some additional benefits that I could see from such a change:

1. I think we don't need to traverse the whole page while freezing,
so there should be some savings in freeze operation as well.

2. Now I think with this, we might be able to reduce WAL also
if we can avoid some transaction related info in the cases where
it is currently stored (update/delete).

3. Another gain could come if we want to add transaction information
in index segment as well, because if such information can be stored at
page level, then there won't be much impact in adding it there which
will help us in avoiding multiple-passes of Vaccum (heap and index
could be vacuumed separately which will definitely help in IO and
bloat reduction).

[1]
Calc for 4 additional tuples:
(saving by removing trans info from tuple - new space consumed by trans)
/new tuple size
(53 * 12 - 12 * 4) / (150 - 12) = 4.26
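The arithmetic above can be sanity-checked with a quick sketch using the message's assumed parameters (average tuple size, concurrent transactions per page, etc. are the message's assumptions, not measured values):

```python
BLK_SZ = 8192          # page size in bytes
PAGE_HEADER = 24       # page header size
LINE_POINTER = 4       # per-tuple line pointer
TUPLE = 150            # assumed average tuple size
XACT_INFO = 12         # xmin, xmax, cid/combocid per tuple
CONCURRENT = 4         # assumed transactions touching a page at once

tuples_per_page = (BLK_SZ - PAGE_HEADER) // (TUPLE + LINE_POINTER)
# Space freed by moving per-tuple xact info to CONCURRENT page-level slots:
freed = tuples_per_page * XACT_INFO - CONCURRENT * XACT_INFO
extra_tuples = freed / (TUPLE - XACT_INFO)   # new tuples are 12 bytes smaller
print(tuples_per_page, round(extra_tuples, 2))   # → 53 4.26
```

So roughly four additional tuples fit per page under these assumptions, matching the ~7% space saving claimed above.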

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Reducing tuple overhead

2015-06-07 Thread Simon Riggs
On 23 April 2015 at 17:24, Andres Freund and...@anarazel.de wrote:

 Split into a new thread, the other one is already growing fast
 enough. This discussion started at
 http://archives.postgresql.org/message-id/55391469.5010506%40iki.fi

 On April 23, 2015 6:48:57 PM GMT+03:00, Heikki Linnakangas 
 hlinn...@iki.fi wrote:
 Stop right there. You need to reserve enough space on the page to store
 
 an xmax for *every* tuple on the page. Because if you don't, what are
 you going to do when every tuple on the page is deleted by a different
 transaction.
 
 Even if you store the xmax somewhere else than the page header, you
 need
 to reserve the same amount of space for them, so it doesn't help at
 all.

 Depends on how you do it and what you optimize for (disk space, runtime,
 code complexity..).  You can e.g. use apply a somewhat similar trick to
 xmin/xmax as done to cmin/cmax; only that the data structure needs to be
 persistent.

 In fact, we already have combocid like structure for xids that's
 persistent - multixacts. We could just have one xid saved that's either
 xmin or xmax (indicated by bits) or a multixact.  When a tuple is
 updated/deleted whose xmin is still required we could replace the former
 xmin with a multixact, otherwise just change the flag that it's now a
 xmax without a xmin.  To check visibility and if the xid is a multixact
 we'd just have to look for the relevant member for the actual xmin and
 xmax.

 To avoid exessive overhead when a tuple is repeatedly updated within one
 session we could store some of the data in the combocid entry that we
 anyway need in that case.

 Whether that's feasible complexity wise is debatable, but it's certainly
 possible.


 I do wonder what, in realistic cases, is actually the bigger contributor
 to the overhead. The tuple header or the padding we liberally add in
 many cases...


It's hard to see how to save space there without reference to a specific
use case. I see different solutions depending upon whether we assume a low
number of transactions or a high number of transactions.

A case we could optimize for is insert-mostly tables. But in that case if
you get rid of the xmax then you still have to freeze the tuples later.

I would have thought a better optimization would be to use the xmax for the
xid epoch by default, so that such rows would never need freezing. Then at
least we are using the xmax for something useful in a larger insert-mostly
database rather than just leaving it at zero.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] could not truncate directory pg_subtrans: apparent wraparound

2015-06-07 Thread Thomas Munro
On Sat, Jun 6, 2015 at 4:51 PM, Thomas Munro
thomas.mu...@enterprisedb.com wrote:
 On Sat, Jun 6, 2015 at 1:25 PM, Alvaro Herrera alvhe...@2ndquadrant.com 
 wrote:
 Thomas Munro wrote:

 My idea was that if I could get oldestXact == next XID in
 TruncateSUBTRANS, then TransactionIdToPage(oldestXact) for a value of
 oldestXact that happens to be immediately after a page boundary (so
 that xid % 2048 == 0) might give a page number that is >=
 latest_page_number, causing SimpleLruTruncate to print that message.
 But I can't figure out how to get next XID == oldest XID, because
 vacuumdb --freeze --all consumes xids itself, so in my first attempt
 at this, next XID is always 3 ahead of the oldest XID when a
 checkpoint is run.

 vacuumdb starts by querying pg_database, which eats one XID.

 Vacuum itself only uses one XID when vac_truncate_clog() is called.
 This is called from vac_update_datfrozenxid(), which always happen at
 the end of each user-invoked VACUUM (so three times for vacuumdb if you
 have three databases); autovacuum does it also at the end of each run.
 Maybe you can get autovacuum to quit before doing it.

 OTOH, if the values in the pg_database entry do not change,
 vac_truncate_clog is not called, and thus vacuum would finish without
 consuming an XID.

 I have managed to reproduce it a few times but haven't quite found the
 right synchronisation hacks to make it reliable, so I'm not posting a
 repro script yet.

 I think it's a scary sounding message but very rare and entirely
 harmless (unless you really have wrapped around...).  The fix is
 probably something like: if oldest XID == next XID, then just don't
 call SimpleLruTruncate (truncation is deferred until the next
 checkpoint), or perhaps (if we can confirm this doesn't cause problems
 for dirty pages or that there can't be any dirty pages before cutoff
 page because of the preceding flush (as I suspect)) we could use
 cutoffPage = TransactionIdToPage(oldestXact - 1) if oldest == next, or
 maybe even always.

Here's a repro script and a suggested patch.  (What I said about dirty
pages in parentheses above was nonsense, I was confusing this with
something else.)
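The boundary condition being described can be illustrated with the SLRU page arithmetic (constants here are illustrative; pg_subtrans stores one 4-byte parent XID per transaction on 8 kB pages):

```python
SUBTRANS_XACTS_PER_PAGE = 8192 // 4   # 2048 xids per 8 kB page

def xid_to_page(xid):
    return xid // SUBTRANS_XACTS_PER_PAGE

# With no transactions running, the oldest XID can equal the next XID.
# If that value sits exactly on a page boundary, the truncation cutoff
# page is one past the latest page ever written, which SimpleLruTruncate
# misreads as an apparent wraparound.
next_xid = 6 * SUBTRANS_XACTS_PER_PAGE   # arbitrary boundary value
oldest_xact = next_xid
latest_page_number = xid_to_page(next_xid - 1)
cutoff_page = xid_to_page(oldest_xact)
assert cutoff_page > latest_page_number  # the "apparent wraparound" case
# One fix direction, per the message: compute the cutoff from oldest - 1.
safe_cutoff = xid_to_page(oldest_xact - 1)
assert safe_cutoff == latest_page_number
```

This is only the arithmetic behind the warning, not the repro script or patch attached to the message.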

-- 
Thomas Munro
http://www.enterprisedb.com


fix-bogus-subtrans-wraparound-error.patch
Description: Binary data


checkpoint-subtrans-boundary.sh
Description: Bourne shell script



Re: [HACKERS] Resource Owner reassign Locks

2015-06-07 Thread Andres Freund
On 2015-06-07 13:44:08 -0700, Jeff Janes wrote:
 I'd like to advocate for back-patching this to 9.0, 9.1, and 9.2.  It has
 run without problems for a while now, and it can be considered a bug that
 systems with a very large number of objects cannot be upgraded in a
 reasonable time.

+1

Greetings,

Andres Freund




Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-06-07 Thread Andrew Dunstan


On 06/05/2015 11:08 PM, Amit Kapila wrote:
On Fri, Jun 5, 2015 at 10:51 AM, Amit Kapila amit.kapil...@gmail.com 
mailto:amit.kapil...@gmail.com wrote:


On Fri, Jun 5, 2015 at 9:57 AM, Andrew Dunstan
and...@dunslane.net mailto:and...@dunslane.net wrote:


On 06/04/2015 11:35 PM, Amit Kapila wrote:


Theoretically, I don't see much problem by changing the checks
way you have done in patch, but it becomes different than what
we have in destroy_tablespace_directories() and it is slightly
changing the way check was originally done in
create_tablespace_directories(), basically original check
will try
unlink if lstat returns non-zero return code. If you want
to proceed
with the changed checks as in v3, then may be we can modify
comments on top of function remove_tablespace_symlink() which
indicates that it works like destroy_tablespace_directories().



The difference is that here we're getting the list from a base
backup and it seems to me the risk of having a file we don't
really want to unlink is significantly greater.


Okay, I think I can understand why you want to be cautious for
having a different check for this path, but in that case there is a
chance that recovery might fail when it will try to create a symlink
for that file.  Shouldn't we then try to call this new function only
when we are trying to restore from tablespace_map file and also
is there a need of ifdef S_ISLINK in remove_tablespace_link?


Shall I send an updated patch on these lines or do you want to
proceed with v3 version?




The point seems to me that we should not be removing anything that's not 
an empty directory or symlink, and that nothing else has any business 
being in pg_tblspc. If we encounter such a thing whose name conflicts 
with the name of a tablespace link we wish to create, rather than 
quietly blowing it away we should complain loudly, and error out, making 
it the user's responsibility to clean up their mess. Am I missing something?


cheers

andrew




Re: [HACKERS] could not truncate directory pg_subtrans: apparent wraparound

2015-06-07 Thread Thomas Munro
On Mon, Jun 8, 2015 at 12:29 PM, Thomas Munro
thomas.mu...@enterprisedb.com wrote:
 Here's a repro script and a suggested patch.

Argh... I realised immediately after sending this that subtransaction
truncation doesn't even use the oldest XID computed by vacuum, it uses
GetOldestXmin (the oldest transaction that was running when any
current transaction was started).  So here is an even simpler repro,
though the fix is the same (with different comments).

-- 
Thomas Munro
http://www.enterprisedb.com


checkpoint-subtrans-boundary-v2.sh
Description: Bourne shell script


fix-bogus-subtrans-wraparound-error-v2.patch
Description: Binary data



Re: [HACKERS] [CORE] Restore-reliability mode

2015-06-07 Thread Alvaro Herrera
Joshua D. Drake wrote:
 
 On 06/05/2015 08:07 PM, Bruce Momjian wrote:
 
  From my side, it is only recently I got some clear answers to my questions
 about how it worked. I think it is very important that major features have
 extensive README type documentation with them so the underlying principles 
 used
 in the development are clear. I would define the measure of a good feature 
 as
 whether another committer can read the code comments and get a good feel. A 
 bad
 feature is one where committers walk away from it, saying "I don't really
 get it and I can't read an explanation of why it does that." Tom's most significant
 contribution is his long descriptive comments on what the problem is that
 needs to be solved, the options and the method chosen. Clarity of thought is what
 solves bugs.
 
 Yes, I think we should have done that early-on for multi-xact, and I am
 hopeful we will learn to do that more often when complex features are
 implemented, or when we identify areas that are more complex than we
 thought.
 
 I see this idea of the README as very useful. There are far more people like
 me in this community than Simon or Alvaro. I can test, I can break things, I
 can script up a harness, but I need to understand HOW, and the README would
 help allow for that.

There is a src/backend/access/README.tuplock that attempts to describe
multixacts.  Is that not sufficient?

Now that I think about it, this file hasn't been updated with the latest
changes, so it's probably a bit outdated now.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Reducing tuple overhead

2015-06-07 Thread Peter Geoghegan
On Thu, Apr 30, 2015 at 6:54 AM, Robert Haas robertmh...@gmail.com wrote:
 The other, related problem is that the ordering operator might start
 to return different results than it did at index creation time.  For
 example, consider a btree index built on a text column.  Now consider
 'yum update'.  glibc gets updated, collation ordering of various
 strings change, and now you've got tuples that are in the wrong
 place in the index, because when the index was built, we thought A <
 B, but now we think B < A.  You would think the glibc maintainers
 might avoid such changes in minor releases, or that the Red Hat guys
 would avoid packaging and shipping those changes in minor releases,
 but you'd be wrong.

I would not think that. Unicode Technical Standard #10 states:


Collation order is not fixed.

Over time, collation order will vary: there may be fixes needed as
more information becomes available about languages; there may be new
government or industry standards for the language that require
changes; and finally, new characters added to the Unicode Standard
will interleave with the previously-defined ones. This means that
collations must be carefully versioned.


Also, in the paper Modern B-Tree Techniques, by Goetz Graefe, page
238, it states:


In many operating systems, appropriate functions are provided to
compute a normalized key from a localized string value, date value, or
time value. This functionality is used, for example, to list files in
a directory as appropriate for the local language. Adding
normalization for numeric data types is relatively straightforward, as
is concatenation of multiple normalized values. Database code must not
rely on such operating system code, however. The problem with relying
on operating systems support for database indexes is the update
frequency. An operating system might update its normalization code due
to an error or extension in the code or in the definition of a local
sort order; it is unacceptable, however, if such an update silently
renders existing large database indexes incorrect.


Unfortunately, it is simply not the case that we can rely on OS
collations being immutable. We have *no* contract with any C standard
library concerning collation stability whatsoever. I'm surprised that
we don't hear more about this kind of corruption.
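A small sketch of the point: codepoint (memcmp-style) order and a linguistic collation disagree on the same strings, so a btree built under one ordering is simply in the wrong order under the other. The French ordering in the comment is stated rather than computed, since installed locales vary by system:

```python
words = ["côte", "cote", "coté", "côté"]

# Codepoint order, i.e. what strcmp/memcmp on the encoded bytes would give:
by_codepoint = sorted(words)
print(by_codepoint)   # → ['cote', 'coté', 'côte', 'côté']

# A CLDR French collation instead yields: cote, côte, coté, côté.
# If a C library switched between such orderings in a minor update, every
# text btree built under the old ordering would be silently corrupt.
```

This is exactly why UTS #10 says collations "must be carefully versioned": the index's validity depends on the collation version it was built with.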
-- 
Peter Geoghegan

