Re: [HACKERS] [v9.4] row level security

2013-11-04 Thread Craig Ringer
On 09/04/2013 11:22 PM, Tom Lane wrote:
 AFAICT, to deal with update/delete the RLS patch needs to constrain order
 of qual application without the crutch of having a separate level of
 subquery; and it's that behavior that I have zero confidence in, either
 as to whether it works as submitted or as to our odds of not breaking it
 in the future.

Wouldn't CASE do that job, albeit in a somewhat ugly manner, and also
protect against malicious RLS functions?

For any subclause (a) OR (b) in a WHERE clause, you can instead write a
short-circuiting OR version as:

  CASE WHEN (a) THEN true ELSE (b) END

no?

So, given a row security condition like (rowowner = current_user AND
rls_malicious_leak()) on table test, this:

SELECT * FROM test WHERE client_leak_fn();

could be rewritten by row security as:

SELECT *
FROM test
WHERE

CASE WHEN
  CASE WHEN is_superuser() THEN true
  ELSE (rowowner = current_user AND rls_malicious_leak())
  END
THEN
  client_leak_fn()
END;


in other words: if the user is the superuser, don't evaluate the RLS
predicate, just assume they have the right to see the row. Otherwise
evaluate the RLS predicate to determine whether they can see the row. In
either case, if they have the right to see the row, run the original
WHERE clause.

My main concern is that it'd be relying on a guarantee that CASE is
always completely ordered, and that might not be ideal. It's also
hideously ugly, and future optimiser smarts could create unexpected issues.

Unless you propose the creation of a new short-circuit, left-to-right,
order-guaranteed OR operator, and think the planner/optimizer should be
taught special-case rules to prevent it from doing pull-up / push-down
or pre-evaluating subclauses for it, I suspect the notion of using
security barrier views or subqueries is still going to be the sanest way
to do it.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Something fishy happening on frogmouth

2013-11-04 Thread Heikki Linnakangas

On 01.11.2013 18:22, Noah Misch wrote:

On Fri, Nov 01, 2013 at 12:27:31AM -0400, Robert Haas wrote:

On Thu, Oct 31, 2013 at 7:48 PM, Heikki Linnakangas hlinnakan...@vmware.com 
wrote:

On 31.10.2013 16:43, Robert Haas wrote:

There should be no cases where the main shared memory
segment gets cleaned up and the dynamic shared memory segments do not.


1. initdb -D data1
2. initdb -D data2
3. postgres -D data1
4. killall -9 postgres
5. postgres -D data2

The System V shmem segment orphaned at step 4 will be cleaned up at step 5.
The DSM segment will not.


Note that dynamic_shared_memory_type='mmap' will give the desired behavior.


Hmm, here's another idea:

Postmaster creates the POSIX shared memory object at startup, by calling 
shm_open(), and immediately calls shm_unlink on it. That way, once all 
the processes have exited, the object will be removed automatically. 
Child processes inherit the file descriptor at fork(), and don't need to 
call shm_open, just mmap().


I'm not sure how dynamic these segments need to be, but if 1-2 such file 
descriptors are not enough, you could mmap() different offsets from the 
same shmem object for different segments.
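A minimal sketch of that pattern, assuming plain POSIX
shm_open()/shm_unlink()/mmap() semantics (illustrative only, not proposed
PostgreSQL code; link with -lrt on some platforms):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a segment whose name disappears immediately; the kernel keeps
 * the object alive while any file descriptor or mapping references it. */
static int
create_unlinked_segment(const char *name, off_t size)
{
    int     fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return -1;
    shm_unlink(name);               /* remove the name right away */
    if (ftruncate(fd, size) < 0)    /* size the object; children just mmap() */
    {
        close(fd);
        return -1;
    }
    return fd;                      /* inherited across fork() */
}

/* Carve a smaller segment out of the one object by mapping at an offset;
 * the offset must be a multiple of the page size. */
static void *
map_subsegment(int fd, off_t offset, size_t len)
{
    return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
}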


- Heikki




[HACKERS] What stopped SECURITY BARRIER views from being auto-updatable?

2013-11-04 Thread Craig Ringer
The current code just reads:

/*
 * For now, we also don't support security-barrier views, because of the
 * difficulty of keeping upper-level qual expressions away from
 * lower-level data.  This might get relaxed in future.
 */
if (RelationIsSecurityView(view))
    return gettext_noop("Security-barrier views are not automatically "
                        "updatable.");


which doesn't tell me tons. Is it the same problem RLS faces where you
can't just append the view predicate to the update predicate because you
may leak information if the update predicate gets evaluated before the
view predicate?

RLS attempts to solve that by wrapping the range table entry for the
relation in a subquery over the original table, qualified with the RLS
predicate, forcing its evaluation before the outer predicate of the
user-supplied query. That approach seems to be unpopular, but how much
of that is objection to the implementation - doing the rewriting in the
planner itself - and how much is objection to the principle?
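As a hedged illustration, borrowing the example names from the sibling RLS
thread (the patch's actual output may differ), the subquery wrapping turns

SELECT * FROM test WHERE client_leak_fn();

into roughly:

SELECT *
FROM (
    SELECT * FROM test
    WHERE rowowner = current_user   -- RLS predicate, inside the barrier
) AS test
WHERE client_leak_fn();             -- user-supplied qual, outside it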

If updatable views supported security barriers, it'd really help reduce
the amount of complex planner messing-around in the RLS patch, so this
is something I'm quite interested in tackling if anyone has a few
pointers on where to start.

It wouldn't be a total solution to RLS, as we'd still have the issue of
the rewriter not getting re-run on plan invalidation and re-planning,
which causes problems if the RLS clause was omitted the first time and
shouldn't be this time, or vice versa. We'd need to be able to store the
pre-rewrite parse tree and re-run the rewriter for plans where one or
more relations were flagged as having RLS.

From what Kohei KaiGai was saying, it sounds like we'd also need to
redefine how RLS and inheritance interact, so that a parent's RLS
predicate applies to all its children; one of the reasons he had to
do it in the optimizer/planner was the difficulty of handling
totally independent RLS expressions for child and parent tables at the
rewriter level.

Might this be a more palatable way forward than the fiddling with Vars
and doing query rewriting inside the optimizer that the current RLS
patch does?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] pg_receivexlog: fixed to work with logical segno 0

2013-11-04 Thread Heikki Linnakangas

On 01.11.2013 11:42, Mika Eloranta wrote:

pg_receivexlog calculated the xlog segment number incorrectly
when started after the previous instance was interrupted.

Resuming streaming only worked when the physical wal segment
counter was zero, i.e. for the first 256 segments or so.


Oops. Fixed, thanks for the report!

It's a bit scary that this bug went unnoticed for this long; it was 
introduced quite early in the 9.3 development cycle. It seems that I did 
all the testing of streaming timeline changes with pg_receivexlog later 
in the 9.3 cycle with segment numbers < 256, and no-one else has done 
long-running tests with pg_receivexlog either.


- Heikki




Re: [HACKERS] missing RelationCloseSmgr in FreeFakeRelcacheEntry?

2013-11-04 Thread Andres Freund
On 2013-11-04 09:38:27 +0200, Heikki Linnakangas wrote:
 On 29.10.2013 03:16, Andres Freund wrote:
 Hi,
 
 I've started a valgrind run earlier when trying to run the regression
 tests with valgrind --error-exitcode=122 (to cause the regression tests
 to fail visibly) but it crashed frequently...
 
 One of them was:
 ...
 Which seems to indicate a dangling ->rd_smgr pointer, confusing the heck
 out of me because I couldn't see how that'd happen.
 
 Looking a bit closer, it seems to me that the fake relcache
 infrastructure neglects the chance that something used the fake
 entry to read something, which will have done a RelationOpenSmgr(). That
 in turn will have set rel->rd_smgr->smgr_owner to rel.
 When we then pfree() the fake relation in FreeFakeRelcacheEntry we have
 a dangling pointer.
 
 Yeah, that's clearly a bug.
 
 diff --git a/src/backend/access/transam/xlogutils.c 
 b/src/backend/access/transam/xlogutils.c
 index 5429d5e..ee7c892 100644
 --- a/src/backend/access/transam/xlogutils.c
 +++ b/src/backend/access/transam/xlogutils.c
 @@ -433,6 +433,7 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
   void
   FreeFakeRelcacheEntry(Relation fakerel)
   {
 +   RelationCloseSmgr(fakerel);
  pfree(fakerel);
   }
 
 Hmm, I don't think that's a very good solution. Firstly, it will force the
 underlying files to be closed, hurting performance. Fake relcache entries
 are used in heapam when clearing visibility map bits, which might happen
 frequently enough for that to matter.

Well, it will only close them when they actually were smgropen()ed which
will not always be the case. Although it probably is in the visibility
map case.

Feels like premature optimization to me.

 Secondly, it will fail if you create
 two fake relcache entries for the same relfilenode. Freeing the first will
 close the smgr entry, and freeing the second will try to close the same smgr
 entry again. We never do that in the current code, but it seems like a
 possible source of bugs in the future.

I think if we go there we'll need refcounted FakeRelcacheEntry's
anyway. Otherwise we'll end up with a relation smgropen()ed multiple
times in the same backend and such which doesn't seem like a promising
course to me since the smgr itself isn't refcounted.

 How about the attached instead?

 diff --git a/src/backend/access/transam/xlogutils.c 
 b/src/backend/access/transam/xlogutils.c
 index 5429d5e..f732e71 100644
 --- a/src/backend/access/transam/xlogutils.c
 +++ b/src/backend/access/transam/xlogutils.c
 @@ -422,7 +422,11 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
 	rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
 	rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
 
 -	rel->rd_smgr = NULL;
 +	/*
 +	 * Open the underlying storage, but don't set rd_owner. We want the
 +	 * SmgrRelation to persist after the fake relcache entry is freed.
 +	 */
 +	rel->rd_smgr = smgropen(rnode, InvalidBackendId);
  
   return rel;
  }

I don't really like that - it means we'll leak open smgr handles
for every relation touched via smgr during recovery, since they will
never be closed unless the relation is dropped or such. And some
databases can have huge numbers of relations.
Since recovery is long-lived, that doesn't seem like a good idea; besides
the memory usage, it will also bloat smgr's internal hash, which
seems just as likely to hurt performance.

I think intentionally not using the owner mechanism is also dangerous -
what if we have longer-lived fake relcache entries and we've just
processed sinval messages? Then we'll have a ->rd_smgr pointing into
nowhere.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] missing locking in at least INSERT INTO view WITH CHECK

2013-11-04 Thread Andres Freund
On 2013-11-02 17:05:24 -0700, Kevin Grittner wrote:
 Andres Freund and...@2ndquadrant.com wrote:
 
  the matview patch (0002)
 
 This is definitely needed as a bug fix.  Will adjust comments and
 commit, back-patched to 9.3.

Thanks.

  Also attached is 0004 which just adds a heap_lock() around a
  newly created temporary table in the matview code which shouldn't
  be required for correctness but gives warm and fuzzy feelings as
  well as less debugging noise.
 
 Will think about this.  I agree it is probably worth doing
 something to reduce the noise when looking for cases that actually
 matter.

It's pretty much free, so I don't think there really is any reason to
deviate from other parts of the code. Note how e.g. copy_heap_data(),
DefineRelation() and ATRewriteTable() all lock the new relation, even if
it just has been created and is (and in the latter two cases will always
be) invisible.

  Wouldn't it be a good idea to tack such WARNINGs (in a proper and
  clean form) to index_open (checking the underlying relation is
  locked), relation_open(..., NoLock) (checking the relation has
  previously been locked) and maybe RelationIdGetRelation() when
  cassert is enabled? ISTM we frequently had bugs around this.
 
 It would be nice to have such omissions pointed out during early
 testing.

Will try to come up with something.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] buffile.c resource owner breakage on segment extension

2013-11-04 Thread Andres Freund
On 2013-11-01 15:28:54 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  While not particularly nice, given the API, it seems best for buffile.c
  to remember the resource owner used for the original segment and
  temporarily set that during the extension.
 
 Hm, yeah, that seems right.  It's just like repalloc keeping the memory
 chunk in its original context.  The comments here are a bit inadequate...

Thanks for committing, and sorry for the need to freshen up the
comments. I don't think I had ever opened buffile.c before, and thus
wasn't sure whether there was a better fix, so I didn't want to spend too
much time on it before somebody agreed it was the right fix.

Also, I was actually just trying to recover some data from a corrupted
database, and this bug was getting in my way ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Something fishy happening on frogmouth

2013-11-04 Thread Andres Freund
On 2013-11-04 10:27:47 +0200, Heikki Linnakangas wrote:
 On 01.11.2013 18:22, Noah Misch wrote:
 On Fri, Nov 01, 2013 at 12:27:31AM -0400, Robert Haas wrote:
 On Thu, Oct 31, 2013 at 7:48 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:
 On 31.10.2013 16:43, Robert Haas wrote:
 There should be no cases where the main shared memory
 segment gets cleaned up and the dynamic shared memory segments do not.
 
 1. initdb -D data1
 2. initdb -D data2
 3. postgres -D data1
 4. killall -9 postgres
 5. postgres -D data2
 
 The System V shmem segment orphaned at step 4 will be cleaned up at step 5.
 The DSM segment will not.
 
 Note that dynamic_shared_memory_type='mmap' will give the desired behavior.

Well, at the significant price of causing file I/O.

 Hmm, here's another idea:
 
 Postmaster creates the POSIX shared memory object at startup, by calling
 shm_open(), and immediately calls shm_unlink on it. That way, once all the
 processes have exited, the object will be removed automatically. Child
 processes inherit the file descriptor at fork(), and don't need to call
 shm_open, just mmap().

Uh. Won't that completely and utterly remove the point of dsm which is
that you can create segments *after* startup? We surely don't want to
start overallocating enough shmem so we don't ever dynamically need to
allocate segments.
Also, I don't think it's portable across platforms to access segments
that already have been unlinked.

I think this is looking for a solution to a problem that isn't actually
relevant, while disregarding the actual problem space.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption

2013-11-04 Thread Andres Freund
On 2013-11-02 15:29:36 -0400, Tom Lane wrote:
 Attached is a proposed patch for this.  It fixes most of the functions
 in printtup.c to use a per-row memory context.  (I did not bother to
 fix debugtup(), which is used only in standalone mode.  If you're doing
 queries large enough for mem leaks to be problematic in standalone mode,
 you're nuts.) 

FWIW, by that definition I have been nuts several times in the past -
without much choice, since I was recovering data from a corrupted cluster
and the database couldn't be started normally. This has now gotten worse
(even in the back branches) since debugtup won't clean up detoasted data
anymore.
But I guess the solution for that is to use COPY in those situations,
which shouldn't have that problem and should work. Also, easier to parse ;)

 My original thought had been to get rid of all pfree's of output function
 results, so as to make the world safe for returning constant strings when
 such functions find it appropriate.  However, I ran into a showstopper
 problem: SPI_getvalue(), which is little more than a wrapper around the
 appropriate type output function, is documented as returning a pfree'able
 string.  Some though not all of the callers in core and contrib take the
 hint and pfree the result, and I'm sure there are more in third-party
 extensions.  So if we want to allow output functions to return
 non-palloc'd strings, it seems we have to either change SPI_getvalue()'s
 API contract or insert a pstrdup() call in it.  Neither of these is
 attractive, mainly because the vast majority of output function results
 would still be palloc'd even if we made aggressive use of the option not
 to.  That means that if we did the former, light testing wouldn't
 necessarily show a problem if someone forgot to remove a pfree() after a
 SPI_getvalue() call, and it also means that if we did the latter, the
 pstrdup() would usually be a waste of cycles and space.

I guess one option for the future is to pstrdup() in SPI_getvalue()
and add a new function that tells whether the result can be freed or
not. Then callers that care can move over. Although I'd guess that many
of those who care already use SPI_getbinval() manually.
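For reference, a minimal sketch of the documented contract as callers use
it today (process_value() is a hypothetical consumer, not a real API):

#include "postgres.h"
#include "executor/spi.h"

extern void process_value(const char *value);

/* SPI_getvalue() is documented to return a palloc'd, pfree'able string,
 * so callers following the documented contract look roughly like this. */
static void
show_first_column(HeapTuple tuple, TupleDesc tupdesc)
{
    char   *val = SPI_getvalue(tuple, tupdesc, 1);

    if (val != NULL)
    {
        process_value(val);
        pfree(val);             /* legal per the documented contract */
    }
}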

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Removal of archive in wal_level

2013-11-04 Thread Michael Paquier
 Please find attached a patch doing what is written in the $subject.
With the documentation updated, this is even better...
Regards,
--
Michael
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index ccb76d8..0f20253 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -582,7 +582,7 @@ tar -cf backup.tar /usr/local/pgsql/data
 
    <para>
     To enable WAL archiving, set the <xref linkend="guc-wal-level">
-    configuration parameter to <literal>archive</> (or <literal>hot_standby</>),
+    configuration parameter to <literal>hot_standby</>,
     <xref linkend="guc-archive-mode"> to <literal>on</>,
     and specify the shell command to use in the <xref
     linkend="guc-archive-command"> configuration parameter.  In practice
@@ -1246,7 +1246,7 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'
   If more flexibility in copying the backup files is needed, a lower
   level process can be used for standalone hot backups as well.
   To prepare for low level standalone hot backups, set <varname>wal_level</> to
-  <literal>archive</> (or <literal>hot_standby</>), <varname>archive_mode</> to
+  <literal>hot_standby</>, <varname>archive_mode</> to
   <literal>on</>, and set up an <varname>archive_command</> that performs
   archiving only when a <emphasis>switch file</> exists.  For example:
<programlisting>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 77a9303..2c6a022 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1606,10 +1606,9 @@ include 'filename'
        <varname>wal_level</> determines how much information is written
        to the WAL. The default value is <literal>minimal</>, which writes
        only the information needed to recover from a crash or immediate
-       shutdown. <literal>archive</> adds logging required for WAL archiving,
-       and <literal>hot_standby</> further adds information required to run
-       read-only queries on a standby server.
-       This parameter can only be set at server start.
+       shutdown. <literal>hot_standby</> adds logging required for WAL
+       archiving and information required to run read-only queries on a
+       standby server. This parameter can only be set at server start.
       </para>
       <para>
        In <literal>minimal</> level, WAL-logging of some bulk
@@ -1625,22 +1624,17 @@ include 'filename'
        </simplelist>
        But minimal WAL does not contain
        enough information to reconstruct the data from a base backup and the
-       WAL logs, so either <literal>archive</> or <literal>hot_standby</>
-       level must be used to enable
+       WAL logs, so <literal>hot_standby</> level must be used to enable
        WAL archiving (<xref linkend="guc-archive-mode">) and streaming
        replication.
       </para>
       <para>
-       In <literal>hot_standby</> level, the same information is logged as
-       with <literal>archive</>, plus information needed to reconstruct
-       the status of running transactions from the WAL. To enable read-only
-       queries on a standby server, <varname>wal_level</> must be set to
-       <literal>hot_standby</> on the primary, and
-       <xref linkend="guc-hot-standby"> must be enabled in the standby. It is
-       thought that there is
-       little measurable difference in performance between using
-       <literal>hot_standby</> and <literal>archive</> levels, so feedback
-       is welcome if any production impacts are noticeable.
+       In <literal>hot_standby</> level, necessary information is logged
+       to reconstruct the status of running transactions from the WAL and
+       to enable read-only queries on a standby server. Note that
+       <varname>wal_level</> must be set to <literal>hot_standby</> on
+       the primary, and <xref linkend="guc-hot-standby"> must be enabled
+       in the standby.
       </para>
      </listitem>
    </varlistentry>
@@ -2198,8 +2192,8 @@ include 'filename'
        of connections, so the parameter cannot be set higher than
        <xref linkend="guc-max-connections">.  This parameter can only
        be set at server start. <varname>wal_level</> must be set
-       to <literal>archive</> or <literal>hot_standby</> to allow
-       connections from standby servers.
+       to <literal>hot_standby</> to allow connections from standby
+       servers.
       </para>
      </listitem>
    </varlistentry>
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f407753..af4 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -630,7 +630,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 	 * WAL record that will allow us to conflict with queries
 	 * running on standby.
 	 */
-	if (XLogStandbyInfoActive())
+	if (XLogIsNeeded())
 	{
 		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 073190f..81f7c75 100644
--- 

Re: [HACKERS] Something fishy happening on frogmouth

2013-11-04 Thread Heikki Linnakangas

On 04.11.2013 11:55, Andres Freund wrote:

On 2013-11-04 10:27:47 +0200, Heikki Linnakangas wrote:

Hmm, here's another idea:

Postmaster creates the POSIX shared memory object at startup, by calling
shm_open(), and immediately calls shm_unlink on it. That way, once all the
processes have exited, the object will be removed automatically. Child
processes inherit the file descriptor at fork(), and don't need to call
shm_open, just mmap().


Uh. Won't that completely and utterly remove the point of dsm which is
that you can create segments *after* startup? We surely don't want to
start overallocating enough shmem so we don't ever dynamically need to
allocate segments.


You don't need to allocate the shared memory beforehand, only create the 
file descriptor. Note that the size of the segment is not specified in the 
shm_open() call, but in the mmap() that comes later.


If we need a large number of small segments, so that it's not feasible 
to shm_open() them all at postmaster startup, you could shm_open() just 
one segment and carve out smaller segments from it by mmap()ing at 
different offsets.



Also, I don't think it's portable across platforms to access segments
that already have been unlinked.


See 
http://pubs.opengroup.org/onlinepubs/009695299/functions/shm_unlink.html: "If 
one or more references to the shared memory object exist when the object 
is unlinked, the name shall be removed before shm_unlink() returns, but 
the removal of the memory object contents shall be postponed until all 
open and map references to the shared memory object have been removed."


That doesn't explicitly say that a new shm_open() on the file descriptor 
must still work after shm_unlink, but I would be surprised if there is a 
platform where it doesn't.



I think this is looking for a solution to a problem that isn't actually
relevant, while disregarding the actual problem space.


I agree. What are these dynamic shared memory segments supposed to be 
used for?


- Heikki




Re: [HACKERS] Something fishy happening on frogmouth

2013-11-04 Thread Andres Freund
On 2013-11-04 13:13:27 +0200, Heikki Linnakangas wrote:
 On 04.11.2013 11:55, Andres Freund wrote:
 On 2013-11-04 10:27:47 +0200, Heikki Linnakangas wrote:
 Postmaster creates the POSIX shared memory object at startup, by calling
 shm_open(), and immediately calls shm_unlink on it. That way, once all the
 processes have exited, the object will be removed automatically. Child
 processes inherit the file descriptor at fork(), and don't need to call
 shm_open, just mmap().
 
 Uh. Won't that completely and utterly remove the point of dsm which is
 that you can create segments *after* startup? We surely don't want to
 start overallocating enough shmem so we don't ever dynamically need to
 allocate segments.

 You don't need to allocate the shared memory beforehand, only create the
 file descriptor. Note that the size of the segment is not specified in the
 shm_open() call, but in the mmap() that comes later.

 If we need a large number of small segments, so that it's not feasible to
 shm_open() them all at postmaster startup, you could shm_open() just one
 segment and carve out smaller segments from it by mmap()ing at different
 offsets.

That quickly will result in fragmentation which we don't have the tools
to handle.

 Also, I don't think it's portable across platforms to access segments
 that already have been unlinked.
 
 See
 http://pubs.opengroup.org/onlinepubs/009695299/functions/shm_unlink.html:
 "If one or more references to the shared memory object exist when the object
 is unlinked, the name shall be removed before shm_unlink() returns, but the
 removal of the memory object contents shall be postponed until all open and
 map references to the shared memory object have been removed."

We also support sysv shmem and have the same cleanup problem there.

 That doesn't explicitly say that a new shm_open() on the file descriptor
 must still work after shm_unlink, but I would be surprised if there is a
 platform where it doesn't.

Probably true.

 I think this is looking for a solution to a problem that isn't actually
 relevant, while disregarding the actual problem space.

To make that clearer: I think the discussions about making it impossible
to leak segments after rm -rf are the irrelevant problem.

 I agree. What are these dynamic shared memory segments supposed to be used
 for?

Parallel sort and stuff like that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] pg_receivexlog: fixed to work with logical segno 0

2013-11-04 Thread Mika Eloranta
On Nov 4, 2013, at 11:06, Heikki Linnakangas wrote:
 On 01.11.2013 11:42, Mika Eloranta wrote:
 pg_receivexlog calculated the xlog segment number incorrectly
 when started after the previous instance was interrupted.
 
 Resuming streaming only worked when the physical wal segment
 counter was zero, i.e. for the first 256 segments or so.
 
 Oops. Fixed, thanks for the report!
 
 It's a bit scary that this bug went unnoticed for this long; it was 
 introduced quite early in the 9.3 development cycle. It seems that I did all the 
 testing of streaming timeline changes with pg_receivexlog later in the 9.3 cycle 
 with segment numbers < 256, and no-one else has done long-running tests with 
 pg_receivexlog either.

Thanks for the fix, Heikki!

It sounds like PostgreSQL 9.3.x and/or pg_receivexlog is not yet used in 
a lot of places; otherwise this probably would have been found earlier.

Affected versions:

$ git tag --contains dfda6eba
REL9_3_0
REL9_3_1
REL9_3_BETA1
REL9_3_BETA2
REL9_3_RC1

What makes this a really sneaky and severe problem is the way it stays dormant 
for a period of time after a fresh db init or pg_upgrade. Here's how I bumped 
into it:

1. Old postgresql 9.2 db running, pg_receivexlog streaming extra backups to a 
remote box.
2. pg_upgrade to 9.3.1.
3. pg_receivexlog from the upgraded DB still works ok and handles restarts 
fine, because the xlog indexes were reset back to zero at pg_upgrade.
4. xlog history eventually grows over 256 * 16MB.
5. pg_receivexlog gets interrupted for whatever reason (gets stopped, killed, 
crashes, host is restarted).
6. A new pg_receivexlog instance fails to resume streaming and there is no easy 
workaround that would maintain an uninterrupted, gapless xlog history.

Initially, before I had analysed the problem any further, I had to stash the 
xlogs, restart pg_receivexlog and after that trigger new pg_basebackups.

Regardless of this bug, I find that pg_receivexlog (and pg_basebackup) are 
excellent tools and people should use them more!

PS. something like pg_receivexlog --start-pos=2D/1500 might be nice for 
overriding the streaming start position.

-- 
Mika Eloranta
Ohmu Ltd.  http://www.ohmu.fi/



Re: [HACKERS] missing RelationCloseSmgr in FreeFakeRelcacheEntry?

2013-11-04 Thread Heikki Linnakangas

On 04.11.2013 11:35, Andres Freund wrote:

On 2013-11-04 09:38:27 +0200, Heikki Linnakangas wrote:

Secondly, it will fail if you create
two fake relcache entries for the same relfilenode. Freeing the first will
close the smgr entry, and freeing the second will try to close the same smgr
entry again. We never do that in the current code, but it seems like a
possible source of bugs in the future.


I think if we go there we'll need refcounted FakeRelcacheEntry's
anyway. Otherwise we'll end up with a relation smgropen()ed multiple
times in the same backend and such which doesn't seem like a promising
course to me since the smgr itself isn't refcounted.


How about the attached instead?



diff --git a/src/backend/access/transam/xlogutils.c 
b/src/backend/access/transam/xlogutils.c
index 5429d5e..f732e71 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -422,7 +422,11 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
rel->rd_lockInfo.lockRelId.relId = rnode.relNode;

-   rel->rd_smgr = NULL;
+   /*
+    * Open the underlying storage, but don't set rd_owner. We want the
+    * SmgrRelation to persist after the fake relcache entry is freed.
+    */
+   rel->rd_smgr = smgropen(rnode, InvalidBackendId);

return rel;
  }


I don't really like that - it means we'll leak open smgr handles
for every relation touched via smgr during recovery, since they will
never be closed unless the relation is dropped or such. And some
databases can have huge numbers of relations.
Since recovery is long-lived, that doesn't seem like a good idea; besides
the memory usage, it will also bloat smgr's internal hash, which
seems just as likely to hurt performance.


That's the way SmgrRelations are supposed to be used. Opened on first 
use, and kept open forever. We never smgrclose() during normal operation 
either, unless the relation is dropped or something like that. And see 
how XLogReadBuffer() also calls smgropen() with no corresponding 
smgrclose().



I think intentionally not using the owner mechanism is also dangerous -
what if we have longer-lived fake relcache entries and we've just
processed sinval messages? Then we'll have a ->rd_smgr pointing into
nowhere.


Hmm, the startup process doesn't participate in sinval messaging at all, 
does it?


- Heikki




Re: [HACKERS] missing RelationCloseSmgr in FreeFakeRelcacheEntry?

2013-11-04 Thread Andres Freund
On 2013-11-04 14:37:53 +0200, Heikki Linnakangas wrote:
 On 04.11.2013 11:35, Andres Freund wrote:
 On 2013-11-04 09:38:27 +0200, Heikki Linnakangas wrote:
 diff --git a/src/backend/access/transam/xlogutils.c 
 b/src/backend/access/transam/xlogutils.c
 index 5429d5e..f732e71 100644
 --- a/src/backend/access/transam/xlogutils.c
 +++ b/src/backend/access/transam/xlogutils.c
 @@ -422,7 +422,11 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
 rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
 rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
 
 -   rel->rd_smgr = NULL;
 +   /*
 +    * Open the underlying storage, but don't set rd_owner. We want the
 +    * SmgrRelation to persist after the fake relcache entry is freed.
 +    */
 +   rel->rd_smgr = smgropen(rnode, InvalidBackendId);
 
 return rel;
   }
 
 I don't really like that - it means we'll leak open smgr handles
 for every relation touched via smgr during recovery, since they will
 never be closed unless the relation is dropped or such. And some
 databases can have huge numbers of relations.
 Since recovery is long-lived, that doesn't seem like a good idea; besides
 the memory usage, it will also bloat smgr's internal hash, which
 seems just as likely to hurt performance.
 
 That's the way SmgrRelations are supposed to be used. Opened on first use,
 and kept open forever. We never smgrclose() during normal operation either,
 unless the relation is dropped or something like that. And see how
 XLogReadBuffer() also calls smgropen() with no corresponding smgrclose().

Ok, that's a fair argument, although I'd say normal backends
won't open many relations in comparison to the startup process when
doing replication. But that certainly isn't a sufficient argument for
changing the logic in the back branches or even HEAD.

 I think intentionally not using the owner mechanism is also dangerous -
 what if we have longer-lived fake relcache entries and we've just
 processed sinval messages? Then we'll have a ->rd_smgr pointing into
 nowhere.

 Hmm, the startup process doesn't participate in sinval messaging at all,
 does it?

Well, not sinval but inval, in hot standby via commit messages.

What about just unowning the smgr entry with

    if (rel->rd_smgr != NULL)
        smgrsetowner(NULL, rel->rd_smgr);

when closing the fake relcache entry?

That shouldn't require any further change than updating the comment in
smgrsetowner() saying that this isn't something expected to happen
frequently.
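A sketch of how that unowning could look (an assumption, not a patch - and
since smgrsetowner() as it stands installs a new owner pointer, this clears
the back-pointer directly instead):

void
FreeFakeRelcacheEntry(Relation fakerel)
{
    /* make sure the SmgrRelation no longer points at the fake entry */
    if (fakerel->rd_smgr != NULL)
        fakerel->rd_smgr->smgr_owner = NULL;
    pfree(fakerel);
}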

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] missing RelationCloseSmgr in FreeFakeRelcacheEntry?

2013-11-04 Thread Andres Freund
On 2013-11-04 13:48:32 +0100, Andres Freund wrote:
  Hmm, the startup process doesn't participate in sinval messaging at all,
  does it?
 
 Well, not sinval but inval, in hot standby via commit messages.

Err, that's bullshit, sorry for that. We send the messages via sinval,
but never (probably at least?) process sinval messages.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Extension Templates S03E11

2013-11-04 Thread Stephen Frost
Dimitri,

* Dimitri Fontaine (dimi...@2ndquadrant.fr) wrote:
 So please find v15 of the patch attached to this email, that passes all
 previously done checks and this one too now.

Looks like there's been a bit of unfortunate bitrot due to Tom's change
to disable fancy output:

patching file src/test/regress/expected/sanity_check.out
Hunk #1 FAILED at 104.
Hunk #2 FAILED at 166.
2 out of 2 hunks FAILED -- saving rejects to file 
src/test/regress/expected/sanity_check.out.rej

Are there any other changes you have pending for this..?  Would be nice
to see the latest version which you've tested and which patches cleanly
against master... ;)

I'll still go ahead and start looking through this, per our discussion.

Thanks,

Stephen




Re: [HACKERS] Row-security writer-side checks proposal

2013-11-04 Thread Robert Haas
On Fri, Nov 1, 2013 at 3:52 AM, Craig Ringer cr...@2ndquadrant.com wrote:
 I've been looking some more into write-side checks in row-security and
 have a suggestion.

 Even though write-side checks are actually fairly separate to read
 checks, and can be done as another step, I'd like to think about them
 before the catalog format and syntax are settled. I think we need fields
 for write operations in pg_rowsecurity and the syntax to set them so
 that the catalog information can be used by triggers to enforce write
 checks. Even if, for the first cut, they're not supported by built-in
 auto-created triggers.

 Here's my proposal, let me know what you think:

 SET ROW SECURITY FOR { ALL COMMANDS | { SELECT, INSERT, UPDATE, DELETE }+ }

 in other words, you specify either:

 SET ROW SECURITY FOR ALL COMMANDS

I continue to think that this syntax is misguided.  For SELECT and
DELETE there is only read-side security, and for INSERT there is only
write-side security, so that's OK as far as it goes, but for UPDATE
both read-side security and write-side security are possible, and
there ought to be a way to get one without the other.  This syntax
won't support that cleanly.

I wonder whether it's worth thinking about the relationship between
the write-side security contemplated for this feature and the WITH
CHECK OPTION syntax that we have for auto-updatable views, which
serves more or less the same purpose.  I'm not sure that syntax is any
great shakes, but it's existing precedent of some form and could
perhaps at least be looked at as a source of inspiration.
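For reference, the existing precedent reads like this (standard syntax; the
table and predicate are illustrative):

CREATE VIEW my_rows AS
    SELECT * FROM accounts WHERE owner = current_user
    WITH CHECK OPTION;

-- INSERTs or UPDATEs through the view that would produce a row with
-- owner <> current_user are rejected: a write-side check.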

I would generally expect that most people would want either read-side
security for all commands or read- and write-side security for all
commands.  I think whatever syntax we come up with for this feature
ought to make each of those things straightforward to get.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Removal of archive in wal_level

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 5:57 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Please find attached a patch doing what is written in the $subject.
 With the documentation updated, this is even better...

I'm unconvinced that there's any value in this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [v9.4] row level security

2013-11-04 Thread Tom Lane
Craig Ringer cr...@2ndquadrant.com writes:
 On 09/04/2013 11:22 PM, Tom Lane wrote:
 AFAICT, to deal with update/delete the RLS patch needs to constrain order
 of qual application without the crutch of having a separate level of
 subquery; and it's that behavior that I have zero confidence in, either
 as to whether it works as submitted or as to our odds of not breaking it
 in the future.

 Wouldn't CASE do that job, albeit in a somewhat ugly manner, and also
 protect against malicious RLS functions?

You mean wrap all the query quals in a CASE?  Sure, if you didn't mind
totally destroying any optimization possibilities.  If you did that,
every table scan would become a seqscan and every join a nestloop.

regards, tom lane




Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption

2013-11-04 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-11-02 15:29:36 -0400, Tom Lane wrote:
 Attached is a proposed patch for this.  It fixes most of the functions
 in printtup.c to use a per-row memory context.  (I did not bother to
 fix debugtup(), which is used only in standalone mode.  If you're doing
 queries large enough for mem leaks to be problematic in standalone mode,
 you're nuts.) 

 FWIW, by that definition I have been nuts several times in the past -
 without much choice, since I was recovering data from a corrupted cluster
 and the database couldn't be started normally. This has now gotten worse
 (even in the back branches) since debugtup won't clean up detoasted data
 anymore.
 But I guess the solution for that is to use COPY in those situations,
 which shouldn't have that problem and should work. Also, easier to parse ;)

Considering the output from debugtup goes to stdout where it will be
intermixed with prompts etc, I'd have to think that COPY is a way better
solution for dealing with bulk data.

Really I'd like to see standalone mode, in its current form, go away
completely.  I had a prototype patch for allowing psql and other clients
to interface to a standalone backend.  I think getting that finished would
be a way more productive use of time than improving debugtup.

regards, tom lane




Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption

2013-11-04 Thread Andres Freund
On 2013-11-04 09:45:22 -0500, Tom Lane wrote:
 Really I'd like to see standalone mode, in its current form, go away
 completely.  I had a prototype patch for allowing psql and other clients
 to interface to a standalone backend.  I think getting that finished would
 be a way more productive use of time than improving debugtup.

+many

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Something fishy happening on frogmouth

2013-11-04 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-11-04 13:13:27 +0200, Heikki Linnakangas wrote:
 On 04.11.2013 11:55, Andres Freund wrote:
 Also, I don't think it's portable across platforms to access segments
 that already have been unlinked.

 See
 http://pubs.opengroup.org/onlinepubs/009695299/functions/shm_unlink.html:
 "If one or more references to the shared memory object exist when the object
 is unlinked, the name shall be removed before shm_unlink() returns, but the
 removal of the memory object contents shall be postponed until all open and
 map references to the shared memory object have been removed."

 We also support sysv shmem and have the same cleanup problem there.

And what about Windows?

regards, tom lane




Re: [HACKERS] [v9.4] row level security

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 9:37 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Craig Ringer cr...@2ndquadrant.com writes:
 On 09/04/2013 11:22 PM, Tom Lane wrote:
 AFAICT, to deal with update/delete the RLS patch needs to constrain order
 of qual application without the crutch of having a separate level of
 subquery; and it's that behavior that I have zero confidence in, either
 as to whether it works as submitted or as to our odds of not breaking it
 in the future.

 Wouldn't CASE do that job, albeit in a somewhat ugly manner, and also
 protect against malicious RLS functions?

 You mean wrap all the query quals in a CASE?  Sure, if you didn't mind
 totally destroying any optimization possibilities.  If you did that,
 every table scan would become a seqscan and every join a nestloop.

I'd still like to hear what's wrong with what I said here:

http://www.postgresql.org/message-id/ca+tgmoyr1phw3x9vnvuwdcfxkzk2p_jhtwc0fv2q58negcx...@mail.gmail.com

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption

2013-11-04 Thread Amit Kapila
On Mon, Nov 4, 2013 at 8:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Andres Freund and...@2ndquadrant.com writes:
 On 2013-11-02 15:29:36 -0400, Tom Lane wrote:
 Attached is a proposed patch for this.  It fixes most of the functions
 in printtup.c to use a per-row memory context.  (I did not bother to
 fix debugtup(), which is used only in standalone mode.  If you're doing
 queries large enough for mem leaks to be problematic in standalone mode,
 you're nuts.)

 FWIW, by that definition I have been nuts several time in the past -
 without much choice since I was recovering data from a corrupted cluster
 and the database couldn't be started normally. This now has gotten worse
 (even in the backbranches) since debugtup won't clean up detoasted data
 anymore.
 But I guess the solution for that is to use COPY in those situations
 which shouldn't have that problem and should work. Also, easier to parse ;)

 Considering the output from debugtup goes to stdout where it will be
 intermixed with prompts etc, I'd have to think that COPY is a way better
 solution for dealing with bulk data.

 Really I'd like to see standalone mode, in its current form, go away
 completely.  I had a prototype patch for allowing psql and other clients
 to interface to a standalone backend.  I think getting that finished would
 be a way more productive use of time than improving debugtup.

The last patch including Windows implementation was posted at below
link sometime back.
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853263C@szxeml509-mbs

If this is the right thing, I can rebase the patch and make it ready.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] dsm use of uint64

2013-11-04 Thread Robert Haas
On Fri, Nov 1, 2013 at 11:45 PM, Peter Eisentraut pete...@gmx.net wrote:
 On Mon, 2013-10-28 at 12:17 -0400, Robert Haas wrote:
 On Sun, Oct 27, 2013 at 11:34 PM, Noah Misch n...@leadboat.com wrote:
  On Fri, Oct 25, 2013 at 10:11:41PM -0400, Robert Haas wrote:
  When I wrote the dynamic shared memory patch, I used uint64 everywhere
  to measure sizes - rather than, as we do for the main shared memory
  segment, Size.  This now seems to me to have been the wrong decision;

 This change is now causing compiler warnings on 32-bit platforms.  You
 can see them here, for example:
 http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=lapwingdt=2013-11-01%2020%3A45%3A01stg=make

Ah.  This is because I didn't change the format code used to print the
arguments; it's still using UINT64_FORMAT, but the argument is now a
Size.  What's the right way to print out a Size, anyway?  Should I
just try %lu?  It seems that sysv_shmem.c uses %lu, but also casts the
value to (unsigned long); I could follow that precedent here, if
there's no consistency to what format is needed to print a size_t.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Removal of archive in wal_level

2013-11-04 Thread Peter Eisentraut
On 11/4/13, 8:58 AM, Robert Haas wrote:
 On Mon, Nov 4, 2013 at 5:57 AM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 Please find attached a patch doing what is written in the $subject.
 With the documentation updated, this is even better...
 
 I'm unconvinced that there's any value in this.

Yeah, the only thing this will accomplish is to annoy people who are
actually using that level.  It would be more interesting if we could get
rid of the wal_level setting altogether, but of course there are valid
reasons against that.




Re: [HACKERS] dsm use of uint64

2013-11-04 Thread Andres Freund
On 2013-11-04 10:46:06 -0500, Robert Haas wrote:
 On Fri, Nov 1, 2013 at 11:45 PM, Peter Eisentraut pete...@gmx.net wrote:
  On Mon, 2013-10-28 at 12:17 -0400, Robert Haas wrote:
  On Sun, Oct 27, 2013 at 11:34 PM, Noah Misch n...@leadboat.com wrote:
   On Fri, Oct 25, 2013 at 10:11:41PM -0400, Robert Haas wrote:
   When I wrote the dynamic shared memory patch, I used uint64 everywhere
   to measure sizes - rather than, as we do for the main shared memory
   segment, Size.  This now seems to me to have been the wrong decision;
 
  This change is now causing compiler warnings on 32-bit platforms.  You
  can see them here, for example:
  http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=lapwingdt=2013-11-01%2020%3A45%3A01stg=make
 
 Ah.  This is because I didn't change the format code used to print the
 arguments; it's still using UINT64_FORMAT, but the argument is now a
 Size.  What's the right way to print out a Size, anyway?

There isn't a nice one currently. glibc/gcc have %z, which automatically
has the right width, but that's not supported by Windows. I've been
wondering if we shouldn't add support for it, just like we have added
support for %m.

  Should I
 just try %lu?  It seems that sysv_shmem.c uses %lu, but also casts the
 value to (unsigned long); I could follow that precedent here, if
 there's no consistency to what format is needed to print a size_t.

Yes, you need a cast like that.
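Concretely, a minimal sketch of that precedent (illustrative only):

#include "postgres.h"

static void
report_size(Size size)
{
    /* cast to unsigned long and use %lu, as sysv_shmem.c does */
    elog(DEBUG1, "segment size is %lu bytes", (unsigned long) size);
}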

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Robert Haas
On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 29 October 2013 16:10, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci m_li...@yahoo.it 
 wrote:
 I don't see much interest in insert-efficient indexes.

 Presumably someone will get around to implementing a btree index
 insertion buffer one day. I think that would be a particularly
 compelling optimization for us, because we could avoid ever inserting
 index tuples that are already dead when the deferred insertion
 actually occurs.

 That's pretty much what the LSM-tree is.

What is pretty cool about this sort of thing is that there's no
intrinsic reason the insertion buffer needs to be block-structured or
disk-backed.  In theory, you can structure the in-memory portion of
the tree any way you like, using pointers and arbitrary-size memory
allocations and all that fun stuff.  You need to log that there's a
deferred insert (or commit to flushing the insertion buffer before
every commit, which would seem to miss the point) so that recovery can
reconstruct the in-memory data structure and flush it, but that's it:
the WAL format need not know any other details of the in-memory
portion of the tree.  I think that, plus the ability to use pointers
and so forth, might lead to significant performance gains.
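To make the shape of that concrete, a purely hypothetical sketch (none of
these names exist in PostgreSQL):

#include <stdlib.h>

/* In-memory, pointer-based insertion buffer; only the *fact* of each
 * deferred insert would be WAL-logged, so that recovery can rebuild and
 * flush the buffer without knowing its in-memory layout. */
typedef struct DeferredInsert
{
    void   *index_tuple;            /* tuple to insert into the btree later */
    struct DeferredInsert *next;
} DeferredInsert;

static DeferredInsert *insert_buffer;

static void
defer_index_insert(void *index_tuple)
{
    DeferredInsert *d = malloc(sizeof(DeferredInsert));

    if (d == NULL)
        return;                     /* error handling elided in this sketch */
    d->index_tuple = index_tuple;
    d->next = insert_buffer;
    insert_buffer = d;
    /* wal_log_deferred_insert(index_tuple) would go here: a hypothetical
     * WAL record recording the tuple, never the in-memory structure */
}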

In practice, the topology of our shared memory segment makes this a
bit tricky.  The problem isn't so much that it's fixed size as that it
lacks a real allocator, and that all the space used for shared_buffers
is nailed down and can't be borrowed for other purposes.  I'm very
interested in solving the problem of getting a real allocator for
shared memory because I think I need it for parallel sort, and even if
there's a way to avoid needing it there I have a strong feeling that
we'll want it for other applications of dynamic shared memory.  But it
should be possible to write the allocator in such a way that you just
give it a chunk of memory to manage, and it does, remaining agnostic
about whether the memory is from the main shared memory segment or a
dynamic one.

Of course, it's possible that even we do get a shared memory
allocator, a hypothetical person working on this project might prefer
to make the data block-structured anyway and steal storage from
shared_buffers.  So my aspirations in this area may not even be
relevant.  But I wanted to mention them, just in case anyone else is
thinking about similar things, so that we can potentially coordinate.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] dsm use of uint64

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 10:55 AM, Andres Freund and...@2ndquadrant.com wrote:
 Ah.  This is because I didn't change the format code used to print the
 arguments; it's still using UINT64_FORMAT, but the argument is now a
 Size.  What's the right way to print out a Size, anyway?

 There isn't a nice one currently. glibc/gcc have %z that automatically
 has the right width, but that's not supported by windows. I've been
 wondering if we shouldn't add support for that just like we have added
 support for %m.

  Should I
 just try %lu?  It seems that sysv_shmem.c uses %lu, but also casts the
 value to (unsigned long); I could follow that precedent here, if
 there's no consistency to what format is needed to print a size_t.

 Yes, you need a cast like that.

Whee, portability is fun.  OK, I changed it to work that way.
Hopefully that'll do the trick.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Claudio Freire
On Mon, Nov 4, 2013 at 1:09 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 29 October 2013 16:10, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci m_li...@yahoo.it 
 wrote:
 I don't see much interest in insert-efficient indexes.

 Presumably someone will get around to implementing a btree index
 insertion buffer one day. I think that would be a particularly
 compelling optimization for us, because we could avoid ever inserting
 index tuples that are already dead when the deferred insertion
 actually occurs.

 That's pretty much what the LSM-tree is.

 What is pretty cool about this sort of thing is that there's no
 intrinsic reason the insertion buffer needs to be block-structured or
 disk-backed.  In theory, you can structure the in-memory portion of
 the tree any way you like, using pointers and arbitrary-size memory
 allocations and all that fun stuff.  You need to log that there's a
 deferred insert (or commit to flushing the insertion buffer before
 every commit, which would seem to miss the point)


Such a thing would help COPY, so maybe it's worth a look


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire klaussfre...@gmail.com wrote:
 Such a thing would help COPY, so maybe it's worth a look

I have little doubt that a deferred insertion buffer of some kind
could help performance on some workloads, though I suspect the buffer
would have to be pretty big to make it worthwhile on a big COPY that
generates mostly-random insertions.  I think the question is not so
much whether it's worth doing but where anyone's going to find the
time to do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Claudio Freire
On Mon, Nov 4, 2013 at 1:27 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire klaussfre...@gmail.com 
 wrote:
 Such a thing would help COPY, so maybe it's worth a look

 I have little doubt that a deferred insertion buffer of some kind
 could help performance on some workloads, though I suspect the buffer
 would have to be pretty big to make it worthwhile on a big COPY that
 generates mostly-random insertions.  I think the question is not so
 much whether it's worth doing but where anyone's going to find the
 time to do it.


However, since an admin can increase work_mem for that COPY, using
work_mem for this would be reasonable, don't you agree?
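
To illustrate what I mean (table and file names made up), an admin could
scope the bump to the one transaction:

BEGIN;
SET LOCAL work_mem = '512MB';
COPY measurements FROM '/tmp/measurements.csv' WITH (FORMAT csv);
COMMIT;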


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl status with nonexistent data directory

2013-11-04 Thread Robert Haas
On Sat, Nov 2, 2013 at 3:32 PM, Peter Eisentraut pete...@gmx.net wrote:
 This doesn't seem right:

 $ pg_ctl -D /nowhere status
 pg_ctl: no server running

 It does exit with status 3, so it's not all that broken, but I think the
 error message could be more accurate.

I doubt anyone will object if you feel the urge to fix it.  I won't, anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 11:31 AM, Claudio Freire klaussfre...@gmail.com wrote:
 On Mon, Nov 4, 2013 at 1:27 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire klaussfre...@gmail.com 
 wrote:
 Such a thing would help COPY, so maybe it's worth a look

 I have little doubt that a deferred insertion buffer of some kind
 could help performance on some workloads, though I suspect the buffer
 would have to be pretty big to make it worthwhile on a big COPY that
 generates mostly-random insertions.  I think the question is not so
 much whether it's worth doing but where anyone's going to find the
 time to do it.


 However, since an admin can increase work_mem for that COPY, using
 work_mem for this would be reasonable, don't you agree?

Without implementing it and benchmarking the result, it's pretty hard
to know.  But if I were a betting man, I'd bet that's not nearly big
enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Andres Freund
On 2013-11-04 11:27:33 -0500, Robert Haas wrote:
 On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire klaussfre...@gmail.com 
 wrote:
  Such a thing would help COPY, so maybe it's worth a look
 
 I have little doubt that a deferred insertion buffer of some kind
 could help performance on some workloads, though I suspect the buffer
 would have to be pretty big to make it worthwhile on a big COPY that
 generates mostly-random insertions.

Even for random data presorting the to-be-inserted data appropriately
could result in much better io patterns.

 I think the question is not so
 much whether it's worth doing but where anyone's going to find the
 time to do it.

Yea :(

I think doing this outside of s_b will make stuff rather hard for
physical replication and crash recovery since we either will need to
flush the whole buffer at checkpoints - which is hard since the
checkpointer doesn't work inside individual databases - or we need to
persist the in-memory buffer across restart which also sucks.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 11:32 AM, Andres Freund and...@2ndquadrant.com wrote:
 I think doing this outside of s_b will make stuff rather hard for
 physical replication and crash recovery since we either will need to
 flush the whole buffer at checkpoints - which is hard since the
 checkpointer doesn't work inside individual databases - or we need to
 persist the in-memory buffer across restart which also sucks.

You might be right, but I think part of the value of LSM-trees is that
the in-memory portion of the data structure is supposed to be able to
be optimized for in-memory storage rather than on disk storage.  It
may be that block-structuring that data bleeds away much of the
performance benefit.  Of course, I'm talking out of my rear end here:
I don't really have a clue how these algorithms are supposed to work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removal of archive in wal_level

2013-11-04 Thread Stephen Frost
* Peter Eisentraut (pete...@gmx.net) wrote:
 On 11/4/13, 8:58 AM, Robert Haas wrote:
  On Mon, Nov 4, 2013 at 5:57 AM, Michael Paquier
  michael.paqu...@gmail.com wrote:
  Please find attached a patch doing what is written in the $subject.
  With the documentation updated, this is even better...
  
  I'm unconvinced that there's any value in this.
 
 Yeah, the only thing this will accomplish is to annoy people who are
 actually using that level.  It would be more interesting if we could get
 rid of the wal_level setting altogether, but of course there are valid
 reasons against that.

It would actually be valuable to 'upgrade' those people to
hot_standby, which is what I had kind of been hoping would happen
eventually.  I agree that there's no use for 'archive' today, but rather
than break existing configs that use it, just make 'archive' and
'hot_standby' mean the same thing.  In the end, I'd probably vote to
make 'hot_standby' the 'legacy/deprecated' term anyway.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Removal of archive in wal_level

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 11:45 AM, Stephen Frost sfr...@snowman.net wrote:
 * Peter Eisentraut (pete...@gmx.net) wrote:
 On 11/4/13, 8:58 AM, Robert Haas wrote:
  On Mon, Nov 4, 2013 at 5:57 AM, Michael Paquier
  michael.paqu...@gmail.com wrote:
  Please find attached a patch doing what is written in the $subject.
  With the documentation updated, this is even better...
 
  I'm unconvinced that there's any value in this.

 Yeah, the only thing this will accomplish is to annoy people who are
 actually using that level.  It would be more interesting if we could get
 rid of the wal_level setting altogether, but of course there are valid
 reasons against that.

 It would actually be valuable to 'upgrade' those people to
 hot_standby, which is what I had kind of been hoping would happen
 eventually.  I agree that there's no use for 'archive' today, but rather
 than break existing configs that use it, just make 'archive' and
 'hot_standby' mean the same thing.  In the end, I'd probably vote to
 make 'hot_standby' the 'legacy/deprecated' term anyway.

That strikes me as a better idea than what the patch actually does,
but I still think it's nanny-ism.  I don't believe we have the right
to second-guess the choices our users make in this area.  We can make
recommendations in the documentation, but at the end of the day if
users choose to use archive rather than hot_standby, we should respect
that choice, not break it because we think we know better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How can I build OSSP UUID support on Windows to avoid duplicate UUIDs?

2013-11-04 Thread Christopher Browne
On Thu, Oct 31, 2013 at 3:42 PM, Robert Haas robertmh...@gmail.com wrote:
On Thu, Oct 31, 2013 at 2:44 PM, Garick Hamlin gham...@isc.upenn.edu wrote:
 I think using /dev/urandom directly would be surprising. At least it
 would probably have taken me a while to figure out what was depleting
 the entropy pool here.

 Perhaps so; a bigger problem IMHO is that it's not portable. I think
 the only way to solve this problem is to import (or have an option to
 link with) a strong, sophisticated PRNG with much larger internal
 state than pg_lrand48, which uses precisely 48 bits of internal state.
 For this kind of thing, I'm fairly sure that we need something with
 at least 128 bits of internal state (as wide as the random value we
 want to generate) and I suspect it might be advantageous to have
 something a whole lot wider, maybe a few kB.

I mentioned the notion of building an entropy pool, into which one might
add various sorts of random inputs, under separate cover...

The last time I had need of a rather non-repeating RNG, I went with
a Fibonacci-based one, namely Mersenne Twister...

http://en.wikipedia.org/wiki/Mersenne_twister

The sample has 624 integers (presumably that means 624x32 bits) as its
internal state. Apparently not terribly suitable for cryptographic
purposes, but definitely highly non-repetitive, which is what we're
notably worried about for UUIDs.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: [HACKERS] pgsql: Remove internal uses of CTimeZone/HasCTZSet.

2013-11-04 Thread Noah Misch
On Fri, Nov 01, 2013 at 04:51:34PM +, Tom Lane wrote:
 Remove internal uses of CTimeZone/HasCTZSet.
 
 The only remaining places where we actually look at CTimeZone/HasCTZSet
 are abstime2tm() and timestamp2tm().  Now that session_timezone is always
 valid, we can remove these special cases.  The caller-visible impact of
 this is that these functions now always return a valid zone abbreviation
 if requested, whereas before they'd return a NULL pointer if a brute-force
 timezone was in use.  In the existing code, the only place I can find that
 changes behavior is to_char(), whose TZ format code will now print
 something useful rather than nothing for such zones.  (In the places where
 the returned zone abbreviation is passed to EncodeDateTime, the lack of
 visible change is because we've chosen the abbreviation used for these
 zones to match what EncodeTimezone would have printed.)

This changed EncodeDateTime() output for USE_SQL_DATES and USE_GERMAN_DATES
styles, because it inserts a space before tzn but does not insert a space
before EncodeTimezone() output.  Example:

  set datestyle = sql,mdy;
  select '2013-01-01'::timestamptz;

old output:

      timestamptz
------------------------
 01/01/2013 00:00:00+00

new output:

       timestamptz
-------------------------
 01/01/2013 00:00:00 +00

I assume this was unintended.

 This could be back-patched if we decide we'd like to fix to_char()'s
 behavior in the back branches, but there doesn't seem to be much
 enthusiasm for that at present.

Note that for the corresponding scenario in Oracle, the TZD format code
yields blank and the TZR format code yields -01:30.  (Oracle does not have
a plain TZ escape.)  So there's precedent for each choice of behavior, and I
don't see this as a back-patch candidate.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption

2013-11-04 Thread Tom Lane
Amit Kapila amit.kapil...@gmail.com writes:
 On Mon, Nov 4, 2013 at 8:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Really I'd like to see standalone mode, in its current form, go away
 completely.  I had a prototype patch for allowing psql and other clients
 to interface to a standalone backend.  I think getting that finished would
 be a way more productive use of time than improving debugtup.

 The last patch including Windows implementation was posted at below
 link sometime back.
 http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853263C@szxeml509-mbs

 If this is the right thing, I can rebase the patch and make it ready.

IIRC, the discussion stalled out because people had security concerns
and/or there wasn't consensus about how it should look at the user level.
There's probably not much point in polishing a patch until we have more
agreement about the high-level design.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql: Remove internal uses of CTimeZone/HasCTZSet.

2013-11-04 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 On Fri, Nov 01, 2013 at 04:51:34PM +, Tom Lane wrote:
 Remove internal uses of CTimeZone/HasCTZSet.

 This changed EncodeDateTime() output for USE_SQL_DATES and USE_GERMAN_DATES
 styles, because it inserts a space before tzn but does not insert a space
 before EncodeTimezone() output.  Example:

   set datestyle = sql,mdy;
   select '2013-01-01'::timestamptz;

 old output:

   timestamptz   
 
  01/01/2013 00:00:00+00

 new output:

timestamptz
 -
  01/01/2013 00:00:00 +00

 I assume this was unintended.

Yeah, I had seen some cases of that.  I don't find it of great concern.
We'd have to insert some ugly special-case code that looks at the zone
name to decide whether to insert a space or not, and I don't think it'd
actually be an improvement to do so.  (Arguably, these formats are
more consistent this way.)  Still, this change is probably a sufficient
reason not to back-patch this part of the fix.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Gavin Flower

On 05/11/13 05:35, Robert Haas wrote:

On Mon, Nov 4, 2013 at 11:32 AM, Andres Freund and...@2ndquadrant.com wrote:

I think doing this outside of s_b will make stuff rather hard for
physical replication and crash recovery since we either will need to
flush the whole buffer at checkpoints - which is hard since the
checkpointer doesn't work inside individual databases - or we need to
persist the in-memory buffer across restart which also sucks.

You might be right, but I think part of the value of LSM-trees is that
the in-memory portion of the data structure is supposed to be able to
be optimized for in-memory storage rather than on disk storage.  It
may be that block-structuring that data bleeds away much of the
performance benefit.  Of course, I'm talking out of my rear end here:
I don't really have a clue how these algorithms are supposed to work.

How about having a 'TRANSIENT INDEX' that only exists in memory, so 
there is no requirement to write it to disk or to replicate directly? 
This type of index would be very fast and easier to implement.  Recovery 
would involve rebuilding the index, and sharing would involve recreating 
on a slave.  Probably not appropriate for a primary index, but may be 
okay for secondary indexes used to speed specific queries.


This could be useful in some situations now, and allow time to get 
experience in how best to implement the basic concept.  Then a more 
robust solution using WAL etc can be developed later.


I suspect that such a TRANSIENT INDEX would still be useful even when a 
more robust in-memory index method was available, as I expect it would be 
faster to set up than a robust memory index - which might be good when 
you need one or more indexes for a short period of time, or when the 
index is so small that recreating it takes very little time (the total 
elapsed time might even be less than for a disk-backed one?).
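
To make the idea concrete, I'm picturing syntax along these lines (purely
hypothetical, of course - no such keyword exists today):

CREATE TRANSIENT INDEX orders_recent_idx ON orders (created_at);
-- after a crash, or on first use on a slave, the index would simply
-- be rebuilt rather than recovered from WAL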



Cheers,
Gavin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Simon Riggs
On 4 November 2013 16:09, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 29 October 2013 16:10, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci m_li...@yahoo.it 
 wrote:
 I don't see much interest in insert-efficient indexes.

 Presumably someone will get around to implementing a btree index
 insertion buffer one day. I think that would be a particularly
 compelling optimization for us, because we could avoid ever inserting
 index tuples that are already dead when the deferred insertion
 actually occurs.

 That's pretty much what the LSM-tree is.

 What is pretty cool about this sort of thing is that there's no
 intrinsic reason the insertion buffer needs to be block-structured or
 disk-backed.  In theory, you can structure the in-memory portion of
 the tree any way you like, using pointers and arbitrary-size memory
 allocations and all that fun stuff.  You need to log that there's a
 deferred insert (or commit to flushing the insertion buffer before
 every commit, which would seem to miss the point) so that recovery can
 reconstruct the in-memory data structure and flush it, but that's it:
 the WAL format need not know any other details of the in-memory
 portion of the tree.  I think that, plus the ability to use pointers
 and so forth, might lead to significant performance gains.

Sounds like it could be cool, yes.

 In practice, the topology of our shared memory segment makes this a
 bit tricky.  The problem isn't so much that it's fixed size as that it
 lacks a real allocator, and that all the space used for shared_buffers
 is nailed down and can't be borrowed for other purposes.  I'm very
 interested in solving the problem of getting a real allocator for
 shared memory because I think I need it for parallel sort

Agreed, you need a shared memory allocator for parallel query. It's
just too tempting to build a hash table in shared memory on one
thread, then use the hash table from multiple sessions as we do a
parallel scan. Roughly same thing for parallel sort - passing pointers
to complete data objects around is much easier than actually moving
the data between processes, which would slow things down. We do also
need generalised inter-node pipe but that's not the best solution to
every problem.

It's also a great way of controlling over-allocation of resources.

 , and even if
 there's a way to avoid needing it there I have a strong feeling that
 we'll want it for other applications of dynamic shared memory.  But it
 should be possible to write the allocator in such a way that you just
 give it a chunk of memory to manage, and it does, remaining agnostic
 about whether the memory is from the main shared memory segment or a
 dynamic one.

Agreed

 Of course, it's possible that even if we do get a shared memory
 allocator, a hypothetical person working on this project might prefer
 to make the data block-structured anyway and steal storage from
 shared_buffers.  So my aspirations in this area may not even be
 relevant.  But I wanted to mention them, just in case anyone else is
 thinking about similar things, so that we can potentially coordinate.

If anyone was going to work on LSM tree, I would advise building a
tree in shared/temp buffers first, then merging with the main tree.
The merge process could use the killed tuple approach to mark the
merging.

The most difficult thing about buffering the inserts is deciding which
poor sucker gets the task of cleaning up. That's probably better as an
off-line process, which is where the work comes in. Non shared
buffered approaches would add too much overhead to the main task.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Jeff Janes
On Mon, Nov 4, 2013 at 8:09 AM, Robert Haas robertmh...@gmail.com wrote:

 On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs si...@2ndquadrant.com wrote:
  On 29 October 2013 16:10, Peter Geoghegan p...@heroku.com wrote:
  On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci m_li...@yahoo.it
 wrote:
  I don't see much interest in insert-efficient indexes.
 
  Presumably someone will get around to implementing a btree index
  insertion buffer one day. I think that would be a particularly
  compelling optimization for us, because we could avoid ever inserting
  index tuples that are already dead when the deferred insertion
  actually occurs.
 
  That's pretty much what the LSM-tree is.

 What is pretty cool about this sort of thing is that there's no
 intrinsic reason the insertion buffer needs to be block-structured or
 disk-backed.


How do we commit to not spilling to disk, in the face of an unbounded
number of indexes existing and wanting to use this mechanism
simultaneously?  If it routinely needs to spill to disk, that would
probably defeat the purpose of having it in the first place, but committing
to never doing so seems to be extremely restrictive. As you say it is also
freeing, in terms of using pointers and such, but I think the restrictions
would outweigh the freedom.




  In theory, you can structure the in-memory portion of
 the tree any way you like, using pointers and arbitrary-size memory
 allocations and all that fun stuff.  You need to log that there's a
 deferred insert (or commit to flushing the insertion buffer before
 every commit, which would seem to miss the point) so that recovery can
 reconstruct the in-memory data structure and flush it, but that's it:
 the WAL format need not know any other details of the in-memory
 portion of the tree.  I think that, plus the ability to use pointers
 and so forth, might lead to significant performance gains.

 In practice, the topology of our shared memory segment makes this a
 bit tricky.  The problem isn't so much that it's fixed size as that it
 lacks a real allocator, and that all the space used for shared_buffers
 is nailed down and can't be borrowed for other purposes.


I think the fixed size is also a real problem, especially given the
ubiquitous advice not to exceed 2 to 8 GB.

Cheers,

Jeff


Re: [HACKERS] [BUGS] BUG #8542: Materialized View with another column_name does not work?

2013-11-04 Thread Kevin Grittner
Kevin Grittner kgri...@ymail.com wrote:
 Michael Paquier michael.paqu...@gmail.com wrote:

 I am not sure that adding a boolean flag introducing a concept
 related to matview inside checkRuleResultList is the best
 approach to solve that. checkRuleResultList is something related
 only to rules, and has nothing related to matviews in it yet.

 Well, I was tempted to keep that concept in DefineQueryRewrite()
 where the call is made, by calling  the new bool something like
 requireColumnNameMatch and not having checkRuleResultList()
 continue to use isSelect for that purpose at all.
 DefineQueryRewrite() already references RELKIND_RELATION once,
 RELKIND_MATVIEW twice, and RELKIND_VIEW three times, so it would
 hardly be introducing a new concept there.

Upon reflection, that seemed to be cleaner.  Pushed fix that way. 
Not much of a change from the previously posted patch, but attached
here in case anyone wants to argue against this approach.

Thanks for the report!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

*** a/src/backend/rewrite/rewriteDefine.c
--- b/src/backend/rewrite/rewriteDefine.c
***************
*** 44,50 ****
  
  
  static void checkRuleResultList(List *targetList, TupleDesc resultDesc,
! 					bool isSelect);
  static bool setRuleCheckAsUser_walker(Node *node, Oid *context);
  static void setRuleCheckAsUser_Query(Query *qry, Oid userid);
  
--- 44,50 ----
  
  
  static void checkRuleResultList(List *targetList, TupleDesc resultDesc,
! 					bool isSelect, bool requireColumnNameMatch);
  static bool setRuleCheckAsUser_walker(Node *node, Oid *context);
  static void setRuleCheckAsUser_Query(Query *qry, Oid userid);
  
*************** DefineQueryRewrite(char *rulename,
*** 355,361 ****
  		 */
  		checkRuleResultList(query->targetList,
  							RelationGetDescr(event_relation),
! 							true);
  
  		/*
  		 * ... there must not be another ON SELECT rule already ...
--- 355,363 ----
  		 */
  		checkRuleResultList(query->targetList,
  							RelationGetDescr(event_relation),
! 							true,
! 							event_relation->rd_rel->relkind !=
! 								RELKIND_MATVIEW);
  
  		/*
  		 * ... there must not be another ON SELECT rule already ...
*************** DefineQueryRewrite(char *rulename,
*** 484,490 ****
  					 errmsg("RETURNING lists are not supported in non-INSTEAD rules")));
  			checkRuleResultList(query->returningList,
  								RelationGetDescr(event_relation),
! 								false);
  		}
  	}
  
--- 486,492 ----
  					 errmsg("RETURNING lists are not supported in non-INSTEAD rules")));
  			checkRuleResultList(query->returningList,
  								RelationGetDescr(event_relation),
! 								false, false);
  		}
  	}
  
*************** DefineQueryRewrite(char *rulename,
*** 613,627 ****
   *		Verify that targetList produces output compatible with a tupledesc
   *
   * The targetList might be either a SELECT targetlist, or a RETURNING list;
!  * isSelect tells which.  (This is mostly used for choosing error messages,
!  * but also we don't enforce column name matching for RETURNING.)
   */
  static void
! checkRuleResultList(List *targetList, TupleDesc resultDesc, bool isSelect)
  {
  	ListCell   *tllist;
  	int			i;
  
  	i = 0;
  	foreach(tllist, targetList)
  	{
--- 615,634 ----
   *		Verify that targetList produces output compatible with a tupledesc
   *
   * The targetList might be either a SELECT targetlist, or a RETURNING list;
!  * isSelect tells which.  This is used for choosing error messages.
!  *
!  * A SELECT targetlist may optionally require that column names match.
   */
  static void
! checkRuleResultList(List *targetList, TupleDesc resultDesc, bool isSelect,
! 					bool requireColumnNameMatch)
  {
  	ListCell   *tllist;
  	int			i;
  
+ 	/* Only a SELECT may require a column name match. */
+ 	Assert(isSelect || !requireColumnNameMatch);
+ 
  	i = 0;
  	foreach(tllist, targetList)
  	{
*************** checkRuleResultList(List *targetList, TupleDesc resultDesc, bool isSelect)
*** 657,663 ****
  					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  					 errmsg("cannot convert relation containing dropped columns to view")));
  
! 		if (isSelect && strcmp(tle->resname, attname) != 0)
  			ereport(ERROR,
  					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
  					 errmsg("SELECT rule's target entry %d has different column name from \"%s\"", i, attname)));
--- 664,670 ----
  					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  					 errmsg("cannot convert relation containing dropped columns to view")));
  
! 		if (requireColumnNameMatch && strcmp(tle->resname, attname) != 0)
  			ereport(ERROR,
  					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
  					 errmsg("SELECT rule's target entry %d has different column name from \"%s\"", i, attname)));
*** a/src/test/regress/expected/matview.out
--- b/src/test/regress/expected/matview.out
*************** SELECT * FROM boxmv ORDER BY id;
*** 450,452 ****
--- 450,475 ----
  
  DROP TABLE boxes CASCADE;
  NOTICE:  drop cascades to materialized view 

Re: [HACKERS] GIN improvements part 1: additional information

2013-11-04 Thread Alexander Korotkov
On Mon, Oct 21, 2013 at 11:12 PM, Alexander Korotkov
aekorot...@gmail.com wrote:

 The attached version of the patch is debugged. I would like to note that
 the number of bugs was low and it wasn't very hard to debug. I've rerun
 the tests on it. You can see that the numbers have improved as a result
 of your refactoring.

          event          |     period
 -----------------------+------------------
  index_build           | 00:01:45.416822
  index_build_recovery  | 00:00:06
  index_update          | 00:05:17.263606
  index_update_recovery | 00:01:22
  search_new            | 00:24:07.706482
  search_updated        | 00:26:25.364494
 (6 rows)

      label      | blocks_mark
 ----------------+-------------
  search_new     |   847587636
  search_updated |   881210486
 (2 rows)

      label     |   size
 ---------------+-----------
  new           | 419299328
  after_updates | 715915264
 (2 rows)

 Besides simple bugs, there was a subtle bug in the incremental update of
 item indexes. I've added some more comments, including an ASCII picture
 of how item indexes are shifted.

 Now I'm trying to implement support for the old page format. Then we can
 decide which approach to use.


The attached version of the patch supports the old page format. The patch
still needs more documentation and probably refactoring, but I believe the
idea is clear and it can be reviewed. In the patch I had to revert the
change of null category placement, for compatibility with the old posting
list format.

--
With best regards,
Alexander Korotkov.


gin-packed-postinglists-12.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] List of binary-compatible data types

2013-11-04 Thread Josh Berkus
Folks,

From our docs:

Adding a column with a non-null default or changing the type of an
existing column will require the entire table and indexes to be
rewritten. As an exception, if the USING clause does not change the
column contents and the old type is either binary coercible to the new
type or an unconstrained domain over the new type, a table rewrite is
not needed ...

Which is nice, but nowhere do we present users with a set of
binary-compatible data types, even among the built-in types.  I'd
happily write this up, if I knew what the binary-compatible data types
*were*.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Claudio Freire
On Mon, Nov 4, 2013 at 5:01 PM, Simon Riggs si...@2ndquadrant.com wrote:
  Of course, it's possible that even if we do get a shared memory
  allocator, a hypothetical person working on this project might prefer
 to make the data block-structured anyway and steal storage from
 shared_buffers.  So my aspirations in this area may not even be
 relevant.  But I wanted to mention them, just in case anyone else is
 thinking about similar things, so that we can potentially coordinate.

 If anyone was going to work on LSM tree, I would advise building a
 tree in shared/temp buffers first, then merging with the main tree.
 The merge process could use the killed tuple approach to mark the
 merging.

 The most difficult thing about buffering the inserts is deciding which
 poor sucker gets the task of cleaning up. That's probably better as an
 off-line process, which is where the work comes in. Non shared
 buffered approaches would add too much overhead to the main task.

Thing is, if you want crash safety guarantees, you cannot use temp
(unlogged) buffers, and then you always have to flush to WAL at each
commit. If the staging index is shared, then it could mean a lot of
WAL (ie: probably around double the amount of WAL a regular b-tree
would generate).

Process-private staging trees that get merged on commit, ie:
transaction-scope staging trees, on the other hand, do not require WAL
logging, they can use temp buffers, and since they don't outlive the
transaction, it's quite obvious who does the merging (the committer).
Question is what kind of workload does that speed up with any
significance and whether the amount of work is worth that speedup on
those workloads.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] List of binary-compatible data types

2013-11-04 Thread Thom Brown
On 4 November 2013 21:58, Josh Berkus j...@agliodbs.com wrote:
 Folks,

 From our docs:

 Adding a column with a non-null default or changing the type of an
 existing column will require the entire table and indexes to be
 rewritten. As an exception, if the USING clause does not change the
 column contents and the old type is either binary coercible to the new
 type or an unconstrained domain over the new type, a table rewrite is
 not needed ...

 Which is nice, but nowhere do we present users with a set of
 binary-compatible data types, even among the built-in types.  I'd
 happily write this up, if I knew what the binary-compatible data types
 *were*.

You could try this:

SELECT
  castsource::regtype::text,
  array_agg(casttarget::regtype order by casttarget::regtype::text) casttargets
FROM pg_cast
WHERE castmethod = 'b'
GROUP BY 1
ORDER BY 1;

-- 
Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Row-security writer-side checks proposal

2013-11-04 Thread Craig Ringer
On 11/04/2013 09:55 PM, Robert Haas wrote:
 I continue to think that this syntax is misguided.  For SELECT and
 DELETE there is only read-side security, and for INSERT there is only
 write-side security, so that's OK as far as it goes, but for UPDATE
 both read-side security and write-side security are possible, and
 there ought to be a way to get one without the other.  This syntax
 won't support that cleanly.

That's what I was thinking earlier too - separate FOR READ and FOR
WRITE instead.

The reason I came back to insert/update/delete was that it's entirely
reasonable to want to prohibit deletes but permit updates to the same
tuple. Both are writes; both set xmax, it's just that one _replaces_ the
tuple, the other doesn't.

So really, there are four cases:

READ
WRITE INSERT
WRITE UPDATE
WRITE DELETE
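
A purely hypothetical syntax sketch, just to make those four cases
concrete (none of this exists, and the predicates are invented):

ALTER TABLE invoices SET ROW SECURITY
    FOR READ   TO (owner = current_user),
    FOR INSERT TO (owner = current_user),
    FOR UPDATE TO (owner = current_user),
    FOR DELETE TO (false);  -- updates permitted, deletes denied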

 I wonder whether it's worth thinking about the relationship between
 the write-side security contemplated for this feature iand the WITH
 CHECK OPTION syntax that we have for auto-updateable views, which
 serves more or less the same purpose.  I'm not sure that syntax is any
 great shakes, but it's existing precedent of some form and could
 perhaps at least be looked at as a source of inspiration.

I've been thinking about the overlap with WITH CHECK OPTION as well.

 I would generally expect that most people would want either read side
 security for all commands or read and write side security for all
 commands.  I think whatever syntax we come up with this feature ought
 to make each of those things straightforward to get.

but sometimes with different predicates for read and write, i.e. you can
see rows you can't modify or can insert rows / update rows that you
can't see after the change.

Similarly, saying you can update but not delete seems quite reasonable
to me.

On the other hand, we might choose to say if you want to do things with
that granularity use your own triggers to enforce it and provide only
READ and WRITE for RLS.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql: Remove internal uses of CTimeZone/HasCTZSet.

2013-11-04 Thread Noah Misch
On Mon, Nov 04, 2013 at 02:34:02PM -0500, Tom Lane wrote:
 Noah Misch n...@leadboat.com writes:
  On Fri, Nov 01, 2013 at 04:51:34PM +, Tom Lane wrote:
  Remove internal uses of CTimeZone/HasCTZSet.
 
  This changed EncodeDateTime() output for USE_SQL_DATES and USE_GERMAN_DATES
  styles, because it inserts a space before tzn but does not insert a space
  before EncodeTimezone() output.  Example:
 
set datestyle = sql,mdy;
select '2013-01-01'::timestamptz;
 
  old output:
 
        timestamptz
  ------------------------
   01/01/2013 00:00:00+00
 
  new output:
 
         timestamptz
  -------------------------
   01/01/2013 00:00:00 +00
 
  I assume this was unintended.
 
 Yeah, I had seen some cases of that.  I don't find it of great concern.
 We'd have to insert some ugly special-case code that looks at the zone
 name to decide whether to insert a space or not, and I don't think it'd
 actually be an improvement to do so.  (Arguably, these formats are
 more consistent this way.)  Still, this change is probably a sufficient
 reason not to back-patch this part of the fix.

I was prepared to suppose that no substantial clientele relies on the
to_char() TZ format code expanding to blank, the other behavior that changed
with this patch.  It's more of a stretch to figure applications won't stumble
over this new whitespace.  I do see greater consistency in the new behavior,
but changing type output is a big deal.  While preserving the old output does
require ugly special-case code, such code was in place for years until removal
by this commit and the commit following it.  Perhaps making the behavior
change is best anyway, but we should not conclude that decision on the basis
of its origin as a side effect of otherwise-desirable refactoring.
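
For anyone who wants to check their application, the affected case is easy
to reproduce, since the interval form of SET TIME ZONE is what creates a
brute-force zone:

SET TIME ZONE INTERVAL '-01:30' HOUR TO MINUTE;
SELECT to_char(now(), 'DD Mon YYYY HH24:MI:SS TZ');
-- the TZ escape formerly expanded to blank for such zones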

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] List of binary-compatible data types

2013-11-04 Thread Josh Berkus
Thom,


 SELECT
   castsource::regtype::text,
   array_agg(casttarget::regtype order by casttarget::regtype::text) 
 casttargets
 FROM pg_cast
 WHERE castmethod = 'b'
 GROUP BY 1
 ORDER BY 1;

Are we actually covering 100% of these for ALTER COLUMN now?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] List of binary-compatible data types

2013-11-04 Thread Josh Berkus
On 11/04/2013 05:21 PM, Josh Berkus wrote:
 Thom,
 
 
 SELECT
   castsource::regtype::text,
   array_agg(casttarget::regtype order by casttarget::regtype::text) 
 casttargets
 FROM pg_cast
 WHERE castmethod = 'b'
 GROUP BY 1
 ORDER BY 1;
 
 Are we actually covering 100% of these for ALTER COLUMN now?

Also, JSON <-> Text seems to be missing from the possible binary
conversions.  That's a TODO, I suppose.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.4] row level security

2013-11-04 Thread Craig Ringer
On 11/04/2013 11:17 PM, Robert Haas wrote:
 I'd still like to here what's wrong with what I said here:
 
 http://www.postgresql.org/message-id/ca+tgmoyr1phw3x9vnvuwdcfxkzk2p_jhtwc0fv2q58negcx...@mail.gmail.com

For me, just my understanding. I'm still too new to the planner and
rewriter to grasp that properly as written.

I was responding to Tom's objection, and he makes a good point about
CASE and optimisation. We have to be free to re-order and pre-evaluate
where LEAKPROOF flags make it safe and permissible, without ever
otherwise doing so. That makes the SECURITY BARRIER subquery look
better, since the limited pull-up / push-down is already implemented there.

Robert, any suggestions on how to approach what you propose? I'm pretty
new to the planner's guts, but I know there've been some complaints
about the way the current RLS code fiddles with Vars when it changes a
direct scan of a rel into a subquery scan.

The code in:

https://github.com/ringerc/postgres/blob/rls-9.4/src/backend/optimizer/prep/prepunion.c#L1647

and

https://github.com/ringerc/postgres/blob/rls-9.4/src/backend/optimizer/prep/prepunion.c#L1591

seems to be the one folks were complaining about earlier.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] List of binary-compatible data types

2013-11-04 Thread Noah Misch
On Mon, Nov 04, 2013 at 05:23:36PM -0800, Josh Berkus wrote:
 On 11/04/2013 05:21 PM, Josh Berkus wrote:
  Thom,
  
  
  SELECT
castsource::regtype::text,
array_agg(casttarget::regtype order by casttarget::regtype::text) 
  casttargets
  FROM pg_cast
  WHERE castmethod = 'b'
  GROUP BY 1
  ORDER BY 1;
  
  Are we actually covering 100% of these for ALTER COLUMN now?

Yes; ALTER TABLE ALTER TYPE refers to the same metadata as Thom's query.  If
you add to the list by issuing CREATE CAST ... WITHOUT FUNCTION, ALTER TABLE
will respect that, too.

 Also, JSON <-> Text seems to be missing from the possible binary
 conversions.  That's a TODO, I suppose.

Only json -> text, not json <-> text.  Note that you can add the cast
manually if you have an immediate need.
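
Something like this should suffice (a sketch only; it requires owning the
types or superuser, and the table and column in the ALTER are made up):

CREATE CAST (json AS text) WITHOUT FUNCTION;
ALTER TABLE documents ALTER COLUMN body TYPE text USING body::text;
-- with the binary cast in place, the ALTER should skip the rewrite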

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption

2013-11-04 Thread Amit Kapila
On Tue, Nov 5, 2013 at 12:52 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Amit Kapila amit.kapil...@gmail.com writes:
 On Mon, Nov 4, 2013 at 8:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Really I'd like to see standalone mode, in its current form, go away
 completely.  I had a prototype patch for allowing psql and other clients
 to interface to a standalone backend.  I think getting that finished would
 be a way more productive use of time than improving debugtup.

 The last patch including Windows implementation was posted at below
 link sometime back.
 http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853263C@szxeml509-mbs

 If this is the right thing, I can rebase the patch and make it ready.

 IIRC, the discussion stalled out because people had security concerns
 and/or there wasn't consensus about how it should look at the user level.
 There's probably not much point in polishing a patch until we have more
 agreement about the high-level design.

Okay. However, +1 for working on that idea, as a lot of people have shown
interest in it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WITHIN GROUP patch

2013-11-04 Thread Vik Fearing
On 11/04/2013 08:43 AM, Atri Sharma wrote:
 Please find attached our latest version of the patch. This version
 fixes the issues pointed out by the reviewers.

No, it doesn't.  The documentation still contains formatting and
grammatical errors, and the code comments still do not match their
surroundings.  (The use of "I" in the code comments is a point I have
conceded on IRC, but I stand by my other remarks.)

Don't bother submitting a new patch until I've posted my technical
review, but please fix these issues on your local copy.

-- 
Vik



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical column order and physical column order

2013-11-04 Thread David Rowley
On Mon, Nov 4, 2013 at 3:14 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:

 David Rowley wrote:
  I've just been looking at how alignment of columns in tuples can make the
  tuple larger than needed.

 This has been discussed at length previously, and there was a design
 proposed to solve this problem.  See these past discussions:

 http://archives.postgresql.org/pgsql-hackers/2006-12/msg00983.php
 http://archives.postgresql.org/pgsql-hackers/2007-02/msg01235.php
 http://archives.postgresql.org/pgsql-hackers/2008-11/msg01680.php

 I started work on this, and managed to get parts of it to work.  While
 doing so I realized that it was quite a lot more hideous than I had
 originally expected.  I published a tree at github:
 https://github.com/alvherre/postgres/tree/column


Thanks for the archive links... I read these last night and pulled out some
key pieces of information from some of the posts.

I should say that I've not dived into the code too much to see how hard it
would be, but my perhaps naive original idea would have been just to add
one column to pg_attribute to store the logical order and have attnum
store the physical order... This would have meant that, at the least, only
the following places would have had to take the change into account.


1. pg_dump needs to display columns in logical order both for create tables
and copy/insert statements.
2. INSERT INTO table values( ... ) (i.e without column names) needs to look
at logical order.
3. create table like table
4. backup and restore using copy
5. select * expand to column names

And of lesser importance, as I'd assume it would just be a change to an
ORDER BY clause in their queries of pg_attribute:

1. Display in clients... psql, PgAdmin

I thought the above would have been doable, and I did wonder what all the
fuss was about regarding bugs where the code could use the logical number
instead of attnum.

On reading the posts last night I can see that the idea was to add not 1
but 2 new fields to pg_attribute: one for the physical order and one for
the logical order. At first I didn't really understand, as I thought
attnum would always be the physical order. I didn't really know before
this that attnum was static... I did some tests where I dropped one of the
middle columns out of a table and then rewrote the table with CLUSTER, and
I see that the pg_attribute record is kept even though the remains of the
column values have been wiped out by the rewrite... Is this done because
things like indexes, foreign keys and sequences reference
{attrelid, attnum}? If so then I see why the 2 extra columns are needed,
and I guess that's where the extra complications come from.

So now I'm wondering, with my freshly clustered table which I dropped one
of the middle columns from before the cluster... my pg_attribute rows look
something like:

 relname |   attname    | attnum
---------+--------------+--------
 dropcol | tableoid     |     -7
 dropcol | cmax         |     -6
 dropcol | xmax         |     -5
 dropcol | cmin         |     -4
 dropcol | xmin         |     -3
 dropcol | ctid         |     -1
 dropcol | one          |      1
 dropcol | pg.dropped.2 |      2
 dropcol | three        |      3

and I would imagine, since the table has just been clustered, that the
columns are stored like {..., ctid, one, three}.
In this case how does PostgreSQL know that attnum 3 is the 2nd user column
in that table? Unless I have misunderstood something, there must be some
logic in there to skip dropped columns, and if so could it not just grab
the attphynum at that location, and then just modify the 1-5 places listed
above to sort on attlognum?
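
For what it's worth, the sort of query I'm picturing clients running is
along these lines, where attlognum is the proposed column and everything
else exists today:

SELECT attrelid::regclass AS relname, attname, attnum
FROM pg_attribute
WHERE attrelid = 'dropcol'::regclass
  AND NOT attisdropped
ORDER BY attnum;        -- today: physical order
-- ORDER BY attlognum;  -- proposal: logical/display order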

Regards

David Rowley


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Simon Riggs
On 4 November 2013 19:55, Gavin Flower gavinflo...@archidevsys.co.nz wrote:

 How about having a 'TRANSIENT INDEX' that only exists in memory, so there is
 no requirement to write it to disk or to replicate directly? This type of
 index would be very fast and easier to implement.  Recovery would involve
 rebuilding the index, and sharing would involve recreating on a slave.
 Probably not appropriate for a primary index, but may be okay for secondary
 indexes used to speed specific queries.

 This could be useful in some situations now, and allow time to get
 experience in how best to implement the basic concept.  Then a more robust
 solution using WAL etc can be developed later.

 I suspect that such a TRANSIENT INDEX would still be useful even when a more
 robust in-memory index method was available, as I expect it would be faster
 to set up than a robust memory index - which might be good when you need
 one or more indexes for a short period of time, or when the index is so
 small that recreating it takes very little time (the total elapsed time
 might even be less than for a disk-backed one?).

UNLOGGED indexes have been discussed over many years and there is
pretty good acceptance of the idea for some use cases.

The main tasks are
* marking the index so they are unavailable on a hot standby
* rebuilding the index in full at the end of recovery - requires an
additional process to rebuild them possibly in priority order
* make sure the above doesn't violate security
* marking the index so it can't be used in the planner until rebuild
is complete - subtle stuff

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shave a few instructions from child-process startup sequence

2013-11-04 Thread Gurjeet Singh
On Mon, Nov 4, 2013 at 12:20 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 But we're not buying much.  A few instructions during postmaster shutdown
 is entirely negligible.


The patch is for ClosePostmasterPorts(), which is called from every child
process startup sequence (as $subject also implies), not in postmaster
shutdown. I hope that adds some weight to the argument.

Best regards,
-- 
Gurjeet Singh   gurjeet.singh.im
EnterpriseDB Inc. www.enterprisedb.com


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Simon Riggs
On 30 October 2013 14:34, Yann Fontana yann.font...@gmail.com wrote:


 On 30 October 2013 11:23, Leonardo Francalanci m_li...@yahoo.it wrote:

  In terms of generality, do you think it's worth a man-year of developer
  effort to replicate what you have already achieved? Who would pay?


 I work on an application that does exactly what Leonardo described. We hit
 the exact same problem, and came up with the exact same solution (down to
 the 15-minute interval). But I have also worked on other various
 datamarts (all using Oracle), and they are all subject to this problem in
 some form: B-tree indexes slow down bulk data inserts too much and need to
 be disabled or dropped and then recreated after the load. In some cases this
 is done easily enough, in others it's more complicated (example: every day,
 a process imports from 1 million to 1 billion records into a table partition
 that may contain from 0 to 1 billion records. To be as efficient as
 possible, you need some logic to compare the number of rows to insert to the
 number of rows already present, in order to decide whether to drop the
 indexes or not).

 Basically, my point is that this is a common problem for datawarehouses and
 datamarts. In my view, indexes that don't require developers to work around
 poor insert performance would be a significant feature in a
 datawarehouse-ready DBMS.


Everybody on this thread is advised to look closely at Min Max indexes
before starting any further work.

MinMax will give us access to many new kinds of plan, plus they are
about as close to perfectly efficient with regard to inserts as it is
possible to get, by which I mean almost zero overhead.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers