Re: [HACKERS] discarding duplicate indexes

2012-12-20 Thread Gavin Flower

On 20/12/12 14:57, Josh Kupershmidt wrote:

CREATE TABLE test (id int);
CREATE INDEX test_idx1 ON test (id);
CREATE INDEX test_idx2 ON test (id);
I initially misread your example code, but after I realised my mistake, 
I thought of an alternative scenario that might be worth considering.


CREATE TABLE test (id int, sub int, payload text);
CREATE INDEX test_idx1 ON test (id, sub);
CREATE INDEX test_idx2 ON test (id);


Now test_idx2 is logically included in test_idx1, but if the majority of 
transactions only query on id, then test_idx2 would be better, as it ties up 
less RAM.


Cheers,
Gavin




Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 -> 9.1

2012-12-20 Thread Bruce Momjian
 19.12.2012, 21:47, Tom Lane t...@sss.pgh.pa.us:
  Kevin Grittner kgri...@mail.com writes:
 
   Groshev Andrey wrote:
     Mismatch of relation names: database database, old rel 
  public.lob.ВерсияВнешнегоДокумента$Документ_pkey, new rel 
  public.plob.ВерсияВнешнегоДокумента$Документ
   There is a limit on identifiers of 63 *bytes* (not characters)
   after which the name is truncated. In UTF8 encoding, the underscore
   would be in the 64th position.
 
  Hmm ... that is a really good point, except that you are not counting
  the "lob." or "plob." part, which we previously saw is part of the
  relation name not the schema name.  Counting that part, it's already
  overlimit, which seems to be proof that Andrey isn't using UTF8 but
  some single-byte encoding.
 
  Anyway, that would only explain the issue if pg_upgrade were somehow
  changing the database encoding, which surely we'd have heard complaints
  about already?  Or maybe this has something to do with pg_upgrade's
  client-side encoding rather than the server encoding...
 
  regards, tom lane

 I initialized the data directory using ru_RU.UTF8, but this database uses
 CP1251, i.e. one byte per character.

Agreed.  This is a complicated report because the identifiers:

*  contain periods
*  are long
*  are in cyrillic
*  don't use utf8
*  are very similar
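
For what it's worth, both the encoding and the identifier lengths are easy
to check from SQL; a minimal sketch using plain catalog queries (the LIKE
pattern is just the identifier from this report):

    -- check the database encodings (CP1251 vs. UTF8)
    SELECT datname, pg_encoding_to_char(encoding) FROM pg_database;

    -- compare character vs. byte length of the suspect identifiers;
    -- names are truncated at 63 *bytes* (NAMEDATALEN - 1)
    SELECT relname,
           char_length(relname)  AS chars,
           octet_length(relname) AS bytes
    FROM pg_class
    WHERE relname LIKE '%ВерсияВнешнегоДокумента%';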

However, I just can't see how these could be causing the problem. 
Looking at the 9.1 pg_upgrade code, we already know that there are the
same number of relations in old and new clusters, so everything must be
being restored.  And there is a lob.* and a plob.* that exist.  The C
code is also saying that the pg_class.oid of the lob.* in the old
database is the same as the plob.* in the new database.  The question
is how that is happening.

Can you email me privately the output of:

pg_dump --schema-only --binary-upgrade database

Thanks.  If you want to debug this yourself, check these lines in the
pg_dump output:

-- For binary upgrade, must preserve pg_class oids
SELECT 
binary_upgrade.set_next_index_pg_class_oid('786665369'::pg_catalog.oid);

ALTER TABLE ONLY lob.ВерсияВнешнегоДокумента$Документ
ADD CONSTRAINT plob.ВерсияВнешнегоДокумента$Документ PRIMARY KEY 
(@Файл, Страница);

See that 786665369?  That is the pg_class.oid of the plob in the old
cluster, and hopefully the new one.  Find where the lob*_pkey index is
created and get that oid.  The same name should map to the same
pg_class.oid in the old and new clusters, but it seems the new plob* oid
is matching the lob* oid in the old cluster.
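
If it helps, the oid lookup is a plain catalog query; run something like
this against both the old and the new cluster (the names are the ones
from this report):

    SELECT oid, relname
    FROM pg_class
    WHERE relname IN ('lob.ВерсияВнешнегоДокумента$Документ_pkey',
                      'plob.ВерсияВнешнегоДокумента$Документ');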

Also, pg_upgrade sorts everything by oid, so it can't be that somehow
pg_upgrade isn't ordering things right, and because we already passed
the oid check, we already know they have the same oid, but different
names.

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] discarding duplicate indexes

2012-12-20 Thread John R Pierce

On 12/20/2012 12:26 AM, Gavin Flower wrote:

CREATE TABLE test (id int, sub int, payload text);
CREATE INDEX test_idx1 ON test (id, sub);
CREATE INDEX test_idx2 ON test (id);


Now test_idx2 is logically included in test_idx1, but if the majority of 
transactions only query on id, then test_idx2 would be better, as it ties up 
less RAM.


if sub is an integer, that index isn't that much larger.  Both indexes 
need to index all the rows, and with the header and block overhead, the 
extra word isn't that big of a deal.  As long as there are some 
transactions using the other index, most of both of them will likely 
want to be in memory, so you'll end up using MORE memory.
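
For what it's worth, rather than guessing, the two indexes can be measured
directly; pg_relation_size() and pg_size_pretty() are stock functions, and
the index names are the ones from the example above:

    SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
    FROM pg_class
    WHERE relname IN ('test_idx1', 'test_idx2');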





Re: [HACKERS] Review of Row Level Security

2012-12-20 Thread Simon Riggs
On 20 December 2012 00:24, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Dec 19, 2012 at 12:54 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I can see a use case for not having security apply for users who have
 *only* INSERT privilege. This would allow people to run bulk loads of
 data into a table with row security. We should add that. That is not
 the common case, so with proper documentation that should be a useful
 feature without relaxing default security.

 Never applying security for INSERT and then forcing them to add BEFORE
 triggers if they want full security is neither secure nor performant.

 I think INSERT vs. not-INSERT is not the relevant distinction, because
 the question also arises for UPDATE.

Not sure I understand you. You suggested it was a valid use case for a
user to have only INSERT privilege and wish to bypass security checks.
I agreed and suggested it could be special-cased.

 In the UPDATE case, the question
 is whether the RLS qual should be checked only against the OLD tuple
 (to make sure that we can see the tuple to modify it) or also against
 the NEW tuple (to make sure that we're not modifying it to a form that
 we can no longer see).  In other words, the question is not "do we
 support all of the commands?" but rather "do we check not only the
 tuple read but also the tuple written?".  For INSERT, we only write a
 tuple, without reading.  For SELECT and DELETE, we only read a tuple,
 without writing a new one.  UPDATE does both a read and a write.

I'm not sure what this comment adds to the discussion. What you say is
understood.

 Previously, I suggested that we handle this by enforcing row-level
 security only on data read from the table - the OLD row, so to speak -
 and not on data written to the table - the NEW row, so to speak -
 because the latter case can be handled well enough by triggers.  (The
 OLD case cannot, because not seeing the row is different from erroring
 out when you do see it.)  There are other alternatives, like allowing
 the user to specify which behavior they want.  But I think that simply
 decreeing that the policy will apply not only to rows read but also
 rows written in all cases will be less flexible than we will
 ultimately want to be.

 As discussed, we should add a security feature that is secure by
 default. Adding options to make it less secure can follow initial
 commit. We might even make it in this release if the review of the
 main feature goes well.

 Saying that something is or is not secure is not meaningful without
 defining what you want to be secure against.  There's nothing
 insecure about checking only the tuples read; it's just a different
 (and useful) threat model.

There are three main points

* "Applies to all commands" should not be implemented via triggers.
Complex, slow, unacceptable thing to force upon users. Doing that begs
the question of why we would have the feature at all, since we already
have triggers and barrier views.

* the default for row security should be "applies to all commands".
Anything else may be useful in some cases, but is surprising to users
and requires careful thought to determine if it is appropriate.

* How to handle asymmetric row security policies? KaiGai has already
begun discussing problems caused by a security policy that differs
between reads/writes, on his latest patch post. That needs further
analysis to check that it actually makes sense to allow it, since it
is more complex. It would be better to fully analyse that situation
and post solutions, rather than simply argue it's OK. Kevin has made
good arguments to show there could be value in such a setup; nobody
has talked about banning it, but we do need analysis, suggested
syntax/mechanisms and extensive documentation to explain it etc.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Set visibility map bit after HOT prune

2012-12-20 Thread Simon Riggs
On 20 December 2012 00:43, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Dec 19, 2012 at 12:39 PM, Simon Riggs si...@2ndquadrant.com wrote:
 The benefit of saying that only UPDATEs clean the block is that this
 penalises only the workload making the mess, rather than everybody
 cleaning up repeatedly over one messy guy.

 Right, but there are plenty of situations where having everybody clean
 up after the messy guy is better than waiting around and hoping that
 Mom (aka vacuum) will do it.

The problems I see are that cleaning on SELECT is too frequent,
interferes with foreground performance and re-dirties data blocks too
often.

Waiting for Mom is configurable, since we can set parameters for
autovacuum. But we can't turn off the cleaning by SELECTs, which makes
the configurability of autovacuum somewhat moot.
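
(For reference, the autovacuum knobs in question are ordinary GUCs, e.g.:

    SELECT name, setting
    FROM pg_settings
    WHERE name IN ('autovacuum_naptime',
                   'autovacuum_vacuum_scale_factor',
                   'autovacuum_vacuum_cost_delay');

There is no equivalent setting to stop SELECTs from pruning.)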

We could also contact the Cleaner instead.

ISTM that if someone spots a block that could use cleanup, they mark
the block as BM_PIN_COUNT_WAITER, but don't set pid. Then when they
unpin the block they send a signal/queue work for a special cleaning
process to come in and do the work now that nobody is waiting. Logic
would allow VACUUMs to go past that by setting the pid. If we
prioritised cleanup onto blocks that are already dirty we would
minimise I/O.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 -> 9.1

2012-12-20 Thread Bruce Momjian
On Thu, Dec 20, 2012 at 08:55:16AM +0400, Groshev Andrey wrote:
 No, the old database does not use a table plob.*,
 only the primary key.
 
 --
 -- Name: plob.ВерсияВнешнегоДокумента$Документ; Type: CONSTRAINT; Schema: 
 public; Owner: postgres; Tablespace:
 --
 
 
 -- For binary upgrade, must preserve pg_class oids
 SELECT 
 binary_upgrade.set_next_index_pg_class_oid('786665369'::pg_catalog.oid);
 
 ALTER TABLE ONLY lob.ВерсияВнешнегоДокумента$Документ
 ADD CONSTRAINT plob.ВерсияВнешнегоДокумента$Документ PRIMARY KEY 
 (@Файл, Страница);

OK, now I know what is happening, though I can't figure out yet how you
got there.  Basically, when you create a primary key, the name you
supply goes into two places: pg_class, for the index, and pg_constraint,
for the constraint name.

What is happening is that you have a pg_class entry called lob.*_pkey
and a pg_constraint entry with plob.*.  You can verify it yourself by
running queries on the system tables.  Let me know if you want me to
show you the queries.
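
For the record, a sketch of one such query, joining pg_constraint to the
index's pg_class row (pg_constraint.conindid points at the index backing
the constraint; adjust as needed):

    -- find constraint-backed indexes whose pg_class name does not
    -- match the pg_constraint name
    SELECT c.relname  AS index_name,
           con.conname AS constraint_name
    FROM pg_constraint con
    JOIN pg_class c ON c.oid = con.conindid
    WHERE con.conindid <> 0
      AND c.relname <> con.conname;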

pg_dump dumps the pg_constraint name when recreating the index, while
pg_upgrade uses the pg_class name.  When you restore the database into
the new cluster, the pg_class index name is lost and the new primary key
gets identical pg_class and pg_constraint names.

I tried to recreate the problem with these commands:

test=# create table test (x int primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"test_pkey" for table "test"
CREATE TABLE
test=# alter index test_pkey rename to ptest;
ALTER INDEX
test=# select * from pg_constraint where conname = 'ptest';
 conname | connamespace |
---------+--------------+-
 ptest   | 2200         |
(1 row)

test=# select * from pg_class where relname = 'ptest';
 relname | relnamespace |
---------+--------------+-
 ptest   | 2200         |
(1 row)

As you can see, ALTER INDEX renamed both the pg_constraint and pg_class
names.  Is it possible someone manually updated the system table to
rename this primary key?  That would cause this error message.  The fix
is just to make sure they match.

Does pg_upgrade need to be modified to handle this case?  Are there
legitimate cases where they will not match and the index name will not
be preserved through a dump/restore?  This seems safe:

test=# alter table test add constraint zz primary key using index ii;
NOTICE:  ALTER TABLE / ADD CONSTRAINT USING INDEX will rename index
"ii" to "zz"
ALTER TABLE

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Set visibility map bit after HOT prune

2012-12-20 Thread Simon Riggs
On 20 December 2012 00:10, Pavan Deolasee pavan.deola...@gmail.com wrote:

 I just thought that we can fairly easily limit the damage if we are
 really worried about SELECTs being penalised. What if we set a
 configurable limit on *extra* things that a query may do which is
 otherwise not very useful for the query itself, but is useful to keep
 the system healthy and steady. HOT prune definitely counts as one of
 them and may be even setting of hint bits. (This is a topic for a
 separate thread though)

I like this idea.

transaction_cleanup_limit = -1 (default), 0, 1+

-1 means no limit on the number of cleanups in this transaction, which is
the current behaviour.
Other numbers are the number of cleanups that will be tolerated in
this transaction; once we hit the limit, we don't attempt cleanup
any more, we just get on with it. The limit is ignored for UPDATEs, since
they need to clear space for their work.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] EDB hosted buildfarm animals - extended downtime

2012-12-20 Thread Dave Page
Due to circumstances beyond my control (blame the power company), the
following buildfarm animals will be down from 27th December until 2nd
January:

baiji
mastodon
protosciurus
castoroides

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Amit Kapila
On Wednesday, December 19, 2012 9:30 PM Heikki Linnakangas wrote:
 In both checkpointer.c and bgwriter.c, we do this before entering the
 main loop:
 
 /*
   * Use the recovery target timeline ID during recovery
   */
  if (RecoveryInProgress())
  ThisTimeLineID = GetRecoveryTargetTLI();
 
 That seems reasonable. However, since it's only done once, when the
 process starts up, ThisTimeLineID is never updated in those processes,
 even though the startup process changes recovery target timeline.
 
 That actually seems harmless to me, and I also haven't heard of any
 complaints of misbehavior in 9.1 or 9.2 caused by that. I'm not sure
 why
 we bother to set ThisTimeLineID in those processes in the first place.

This is used in RemoveOldXlogFiles(), so if it's not set during recovery
and this function gets called, it might have some problem.
I think it could get called from CreateRestartPoint() during recovery.


With Regards,
Amit Kapila.





Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 -> 9.1

2012-12-20 Thread Groshev Andrey


20.12.2012, 13:00, Bruce Momjian br...@momjian.us:
 On Thu, Dec 20, 2012 at 08:55:16AM +0400, Groshev Andrey wrote:

  No, the old database does not use a table plob.*,
  only the primary key.

  --
  -- Name: plob.ВерсияВнешнегоДокумента$Документ; Type: CONSTRAINT; Schema: 
 public; Owner: postgres; Tablespace:
  --

  -- For binary upgrade, must preserve pg_class oids
  SELECT 
 binary_upgrade.set_next_index_pg_class_oid('786665369'::pg_catalog.oid);

  ALTER TABLE ONLY lob.ВерсияВнешнегоДокумента$Документ
  ADD CONSTRAINT plob.ВерсияВнешнегоДокумента$Документ PRIMARY KEY 
 (@Файл, Страница);

 OK, now I know what is happening, though I can't figure out yet how you
 got there.  Basically, when you create a primary key, the name you
 supply goes into two places, pg_class, for the index, and pg_constraint
 for the constraint name.

 What is happening is that you have a pg_class entry called lob.*_pkey
 and a pg_constraint entry with plob.*.  You can verify it yourself by
 running queries on the system tables.  Let me know if you want me to
 show you the queries.

 pg_dump dumps the pg_constraint name when recreating the index, while
 pg_upgrade uses the pg_class name.  When you restore the database into
 the new cluster, the pg_class index name is lost and the new primary key
 gets identical pg_class and pg_constraint names.


I had already begun to approach this idea, when I noticed that pgAdmin 
describes this index with the _pkey name, while pg_dump uses the plob.* one.
But your letter immediately pointed me to the end of my research :)

 I tried to recreate the problem with these commands:

 test=# create table test (x int primary key);
 NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
 "test_pkey" for table "test"
 CREATE TABLE
 test=# alter index test_pkey rename to ptest;
 ALTER INDEX
 test=# select * from pg_constraint where conname = 'ptest';
  conname | connamespace |
 ---------+--------------+-
  ptest   | 2200         |
 (1 row)

 test=# select * from pg_class where relname = 'ptest';
  relname | relnamespace |
 ---------+--------------+-
  ptest   | 2200         |
 (1 row)

 As you can see, ALTER INDEX renamed both the pg_constraint and pg_class
 names.  Is it possible someone manually updated the system table to
 rename this primary key?  That would cause this error message.  The fix
 is just to make sure they match.

 Does pg_upgrade need to be modified to handle this case?  

Unfortunately, my knowledge is not enough to talk about it.
I do not know what comes first in this case: pg_class, pg_constraint,
pg_catalog.pg_index or pg_catalog.pg_indexes.
Incidentally, in the last of these:

# select schemaname, tablename, indexname, tablespace from pg_catalog.pg_indexes
  where indexname like '%ВерсияВнешнегоДокумента$Документ%';
 schemaname |              tablename               |                  indexname                   | tablespace
------------+--------------------------------------+----------------------------------------------+------------
 public     | lob.ВерсияВнешнегоДокумента$Документ | lob.ВерсияВнешнегоДокумента$Документ_pkey    |
 public     | ВерсияВнешнегоДокумента$Документ     | ВерсияВнешнегоДокумента$Документ_pkey        |
 public     | ВерсияВнешнегоДокумента$Документ     | iВерсияВнешнегоДокумента$Документ-blb_header |
(3 rows)

If pg_upgrade had said that the old database is not in very good condition, I 
would have looked for a problem in the database, and not elsewhere.

  Are there legitimate cases where they will not match and the index name
  will not be preserved through a dump/restore?  This seems safe:

  test=# alter table test add constraint zz primary key using index ii;
  NOTICE:  ALTER TABLE / ADD CONSTRAINT USING INDEX will rename index
  "ii" to "zz"
  ALTER TABLE

 --
   Bruce Momjian  br...@momjian.us    http://momjian.us
   EnterpriseDB http://enterprisedb.com

   + It's impossible for everything to be true. +





Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2012-12-20 Thread Andres Freund
On 2012-12-20 02:54:48 +0100, Tomas Vondra wrote:
 On 19.12.2012 02:18, Andres Freund wrote:
  On 2012-12-17 00:31:00 +0100, Tomas Vondra wrote:
 
  I think except of the temp buffer issue mentioned below its ready.
 
  -DropRelFileNodeAllBuffers(RelFileNodeBackend rnode)
  +DropRelFileNodeAllBuffers(RelFileNodeBackend * rnodes, int nnodes)
   {
  -  int i;
  +  int i, j;
  +
  +  /* sort the list of rnodes */
  +  pg_qsort(rnodes, nnodes, sizeof(RelFileNodeBackend), rnode_comparator);
 
 /* If it's a local relation, it's localbuf.c's problem. */
  -  if (RelFileNodeBackendIsTemp(rnode))
  +  for (i = 0; i < nnodes; i++)
 {
  -  if (rnode.backend == MyBackendId)
  -  DropRelFileNodeAllLocalBuffers(rnode.node);
  -  return;
  +  if (RelFileNodeBackendIsTemp(rnodes[i]))
  +  {
  +  if (rnodes[i].backend == MyBackendId)
  +  DropRelFileNodeAllLocalBuffers(rnodes[i].node);
  +  }
 }
 
  While you deal with local buffers here you don't anymore in the big loop
  over shared buffers. That wasn't needed earlier since we just returned
  after noticing we have local relation, but thats not the case anymore.

 Hmm, but that would require us to handle the temp relations explicitly
 wherever we call DropRelFileNodeAllBuffers. Currently there are two such
 places - smgrdounlink() and smgrdounlinkall().

 By placing it into DropRelFileNodeAllBuffers() this code is shared and I
 think it's a good thing.

 But that does not mean the code is perfect - it was based on the
 assumption that if there's a mix of temp and regular relations, the temp
 relations will be handled in the first part and the rest in the second one.

 Maybe it'd be better to improve it so that the temp relations are
 removed from the array after the first part (and skip the second one if
 there are no remaining relations).

The problem is that afaik without the backend-local part a temporary
relation could match the same relfilenode as a full relation which
would have pretty bad consequences.

  Still surprised this is supposed to be faster than a memcmp, but as you
  seem to have measured it earlier..

 It surprised me too. These are the numbers with the current patch:

 1) one by one
 =============
            0    2    4    6    8    10   12   14   16   18   20
 ---------------------------------------------------------------
 current    15   22   28   34   41   75   77   82   92   99   106
 memcmp     16   23   29   36   44   122  125  128  153  154  158

 Until the number of indexes reaches ~10, the numbers are almost exactly
 the same. Then the bsearch branch kicks in and it's clear how much
 slower the memcmp comparator is.

 2) batches of 100
 =================
            0    2    4    6    8    10   12   14   16   18   20
 ---------------------------------------------------------------
 current    3    5    8    10   12   15   17   21   23   27   31
 memcmp     4    7    10   13   16   19   22   28   30   32   36

 Here the difference is much smaller, but even here the memcmp is
 consistently a bit slower.


 My theory is that while the current comparator starts with the most
 variable part (relation OID), and thus usually starts after the
 comparing the first 4B, the memcmp starts from the other end (tablespace
 and database) and therefore needs to compare more data.

Thats a good guess. I think its ok this way, but if you feel like
playing you could just change the order in the struct...

  +void smgrdounlinkall(SMgrRelation * rels, int nrels, bool isRedo)
  +{
  +  int i = 0;
  +  RelFileNodeBackend *rnodes;
  +  ForkNumber  forknum;
  +
  +  /* initialize an array which contains all relations to be dropped */
  +  rnodes = palloc(sizeof(RelFileNodeBackend) * nrels);
  +  for (i = 0; i < nrels; i++)
  +  {
  +  RelFileNodeBackend rnode = rels[i]->smgr_rnode;
  +  int which = rels[i]->smgr_which;
  +
  +  rnodes[i] = rnode;
  +
  +  /* Close the forks at smgr level */
  +  for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
  +  (*(smgrsw[which].smgr_close)) (rels[i], forknum);
  +  }
  +
  +  /*
  +   * Get rid of any remaining buffers for the relation.  bufmgr will just
  +   * drop them without bothering to write the contents.
  +   */
  +  DropRelFileNodeAllBuffers(rnodes, nrels);
 
  I think it might be a good idea to handle temp relations on/buffers on
  this level instead of trying to put it into
  DropRelFileNodeAllBuffers. Would also allow you to only build an array
  of RelFileNode's instead of RelFileNodeBackend's which might make it
  slightly faster.

 Hmmm, sadly that'd require duplication of code because it needs to be
 done in smgrdounlink too.

It should be fairly small though.

int nrels_nonlocal = 0;

for (i = 0; i < nrels; i++)
{
RelFileNodeBackend rnode = rels[i]->smgr_rnode;
int which = 

Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 -> 9.1

2012-12-20 Thread Bruce Momjian
On Thu, Dec 20, 2012 at 03:19:17PM +0400, Groshev Andrey wrote:
 
 
 20.12.2012, 13:00, Bruce Momjian br...@momjian.us:
  On Thu, Dec 20, 2012 at 08:55:16AM +0400, Groshev Andrey wrote:
 
   No, the old database does not use a table plob.*,
   only the primary key.
 
   --
   -- Name: plob.ВерсияВнешнегоДокумента$Документ; Type: CONSTRAINT; Schema: 
  public; Owner: postgres; Tablespace:
   --
 
   -- For binary upgrade, must preserve pg_class oids
   SELECT 
  binary_upgrade.set_next_index_pg_class_oid('786665369'::pg_catalog.oid);
 
   ALTER TABLE ONLY lob.ВерсияВнешнегоДокумента$Документ
   ADD CONSTRAINT plob.ВерсияВнешнегоДокумента$Документ PRIMARY KEY 
  (@Файл, Страница);
 
  OK, now I know what is happening, though I can't figure out yet how you
  got there.  Basically, when you create a primary key, the name you
  supply goes into two places, pg_class, for the index, and pg_constraint
  for the constraint name.
 
  What is happening is that you have a pg_class entry called lob.*_pkey
  and a pg_constraint entry with plob.*.  You can verify it yourself by
  running queries on the system tables.  Let me know if you want me to
  show you the queries.
 
  pg_dump dumps the pg_constraint name when recreating the index, while
  pg_upgrade uses the pg_class name.  When you restore the database into
  the new cluster, the pg_class index name is lost and the new primary key
  gets identical pg_class and pg_constraint names.
 
 
 I had already begun to approach this idea, when I noticed that pgAdmin 
 describes this index with the _pkey name, while pg_dump uses the plob.* one.
 But your letter immediately pointed me to the end of my research :)

Good.

  I tried to recreate the problem with these commands:
 
  test=# create table test (x int primary key);
  NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
  "test_pkey" for table "test"
  CREATE TABLE
  test=# alter index test_pkey rename to ptest;
  ALTER INDEX
  test=# select * from pg_constraint where conname = 'ptest';
   conname | connamespace |
  ---------+--------------+-
   ptest   | 2200         |
  (1 row)

  test=# select * from pg_class where relname = 'ptest';
   relname | relnamespace |
  ---------+--------------+-
   ptest   | 2200         |
  (1 row)
 
  As you can see, ALTER INDEX renamed both the pg_constraint and pg_class
  names.  Is it possible someone manually updated the system table to
  rename this primary key?  That would cause this error message.  The fix
   is just to make sure they match.
 
  Does pg_upgrade need to be modified to handle this case?  
 
 Unfortunately, my knowledge is not enough to talk about it.
 I do not know what comes first in this case: pg_class, pg_constraint,
 pg_catalog.pg_index or pg_catalog.pg_indexes.
 Incidentally, in the last of these:
 
 # select schemaname, tablename, indexname, tablespace from pg_catalog.pg_indexes
   where indexname like '%ВерсияВнешнегоДокумента$Документ%';
  schemaname |              tablename               |                  indexname                   | tablespace
 ------------+--------------------------------------+----------------------------------------------+------------
  public     | lob.ВерсияВнешнегоДокумента$Документ | lob.ВерсияВнешнегоДокумента$Документ_pkey    |
  public     | ВерсияВнешнегоДокумента$Документ     | ВерсияВнешнегоДокумента$Документ_pkey        |
  public     | ВерсияВнешнегоДокумента$Документ     | iВерсияВнешнегоДокумента$Документ-blb_header |
 (3 rows)
 
 If pg_upgrade had said that the old database is not in very good condition, I 
 would have looked for a problem in the database, and not elsewhere.

pg_catalog.pg_indexes is a view.  You want to modify pg_class to match
the pg_constraint name.  You might be able to just rename the index in
pgAdmin to match.
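
Based on the ALTER INDEX test shown earlier (it renames both the pg_class
and the pg_constraint entries), something like this should make them
match; untested against your data, so treat it as a sketch:

    ALTER INDEX "lob.ВерсияВнешнегоДокумента$Документ_pkey"
        RENAME TO "plob.ВерсияВнешнегоДокумента$Документ";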

Perhaps PGAdmin allowed this mismatch to happen?

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +





Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Heikki Linnakangas

On 20.12.2012 12:08, Amit Kapila wrote:

On Wednesday, December 19, 2012 9:30 PM Heikki Linnakangas wrote:

In both checkpointer.c and bgwriter.c, we do this before entering the
main loop:

 /*
   * Use the recovery target timeline ID during recovery
   */
  if (RecoveryInProgress())
  ThisTimeLineID = GetRecoveryTargetTLI();

That seems reasonable. However, since it's only done once, when the
process starts up, ThisTimeLineID is never updated in those processes,
even though the startup process changes recovery target timeline.

That actually seems harmless to me, and I also haven't heard of any
complaints of misbehavior in 9.1 or 9.2 caused by that. I'm not sure
why
we bother to set ThisTimeLineID in those processes in the first place.


This is used in RemoveOldXlogFiles(), so if during recovery when it's not
set and
this function gets called, it might have some problem.
I think it could get called from CreateRestartPoint() during recovery.


Hmm, right, it's used for this:

XLogFileName(lastoff, ThisTimeLineID, segno);

The resulting lastoff string, which is an xlog filename like 
00020008, is used to compare filenames of files in 
pg_xlog. However, the TLI part, the first 8 characters, is ignored for 
comparison purposes. In addition to that, 'lastoff' is printed in a 
DEBUG2 line, purely for debugging purposes.


So I think we're good on that front. But I'll add a comment there, and 
use 0 explicitly instead of ThisTimeLineID, for clarity.


- Heikki




Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 -> 9.1

2012-12-20 Thread Groshev Andrey


20.12.2012, 11:43, Bruce Momjian br...@momjian.us:
  19.12.2012, 21:47, Tom Lane t...@sss.pgh.pa.us:
  Kevin Grittner kgri...@mail.com writes:
   Groshev Andrey wrote:
     Mismatch of relation names: database database, old rel 
 public.lob.ВерсияВнешнегоДокумента$Документ_pkey, new rel 
 public.plob.ВерсияВнешнегоДокумента$Документ
   There is a limit on identifiers of 63 *bytes* (not characters)
   after which the name is truncated. In UTF8 encoding, the underscore
   would be in the 64th position.
  Hmm ... that is a really good point, except that you are not counting
  the "lob." or "plob." part, which we previously saw is part of the
  relation name not the schema name.  Counting that part, it's already
  overlimit, which seems to be proof that Andrey isn't using UTF8 but
  some single-byte encoding.

  Anyway, that would only explain the issue if pg_upgrade were somehow
  changing the database encoding, which surely we'd have heard complaints
  about already?  Or maybe this has something to do with pg_upgrade's
  client-side encoding rather than the server encoding...

  regards, tom lane
  I initialized the data directory using ru_RU.UTF8, but this database uses
 CP1251, i.e. one byte per character.

 Agreed.  This is a complicated report because the identifiers:

 *  contain periods
 *  are long
 *  are in cyrillic
 *  don't use utf8
 *  are very similar

 However, I just can't see how these could be causing the problem.
 Looking at the 9.1 pg_upgrade code, we already know that there are the
 same number of relations in old and new clusters, so everything must be
 being restored.  And there is a lob.* and a plob.* that exist.  The C
 code is also saying that the pg_class.oid of the lob.* in the old
  database is the same as the plob.* in the new database.  The question
  is how that is happening.

 Can you email me privately the output of:

 pg_dump --schema-only --binary-upgrade database

 Thanks.  If you want to debug this yourself, check these lines in the
 pg_dump output:

 -- For binary upgrade, must preserve pg_class oids
 SELECT 
 binary_upgrade.set_next_index_pg_class_oid('786665369'::pg_catalog.oid);

 ALTER TABLE ONLY lob.ВерсияВнешнегоДокумента$Документ
 ADD CONSTRAINT plob.ВерсияВнешнегоДокумента$Документ PRIMARY 
 KEY (@Файл, Страница);

 See that 786665369?  That is the pg_class.oid of the plob in the old
 cluster, and hopefully the new one.  Find where the lob*_pkey index is
  created and get that oid.  The same name should map to the same
  pg_class.oid in the old and new clusters, but it seems the new plob* oid
  is matching the lob* oid in the old cluster.

 Also, pg_upgrade sorts everything by oid, so it can't be that somehow
 pg_upgrade isn't ordering things right, and because we already passed
 the oid check, we already know they have the same oid, but different
 names.

 --
   Bruce Momjian  br...@momjian.us    http://momjian.us
   EnterpriseDB http://enterprisedb.com

   + It's impossible for everything to be true. +

Yes, one last question: how do I find out which version of the name should stay?
And of course, I forgot to say a great big thank you!




Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 -> 9.1

2012-12-20 Thread Bruce Momjian
On Thu, Dec 20, 2012 at 03:41:37PM +0400, Groshev Andrey wrote:
  See that 786665369?  That is the pg_class.oid of the plob in the old
  cluster, and hopefully the new one.  Find where the lob*_pkey index is
   created and get that oid.  The same name should map to the same
   pg_class.oid in the old and new clusters, but it seems the new plob* oid
   is matching the lob* oid in the old cluster.
 
  Also, pg_upgrade sorts everything by oid, so it can't be that somehow
  pg_upgrade isn't ordering things right, and because we already passed
  the oid check, we already know they have the same oid, but different
  names.
 
  --
    Bruce Momjian  br...@momjian.us    http://momjian.us
    EnterpriseDB http://enterprisedb.com
 
    + It's impossible for everything to be true. +
 
 Yes, was the last question. How to find out which version should stay?
 And of course, I forgot to say a great big thank you!

You can pick either name to be the right one;  they just have to match. 
The oids are fine.

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Re: patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap [Review]

2012-12-20 Thread Simon Riggs
On 13 October 2012 08:54, Amit kapila amit.kap...@huawei.com wrote:

 As per the last discussion for this patch, performance data needs to be
 provided before this patch's Review can proceed further.

 So as per your suggestion and from the discussions about this patch, I have
 collected the performance data as below:



 Results are taken with following configuration.
 1. Schema - UNLOGGED TABLE with 2,000,000 records, all columns of INT
 type.
 2. shared_buffers = 10GB
 3. All the performance result are taken with single connection.
 4. Performance is collected for INSERT operation (insert into temptable
 select * from inittable)

 Platform details:
 Operating System: Suse-Linux 10.2 x86_64
 Hardware : 4 core (Intel(R) Xeon(R) CPU L5408 @ 2.13GHz)
 RAM : 24GB

 Documents Attached:
 init.sh                     : creates the schema
 sql_used.sql                : SQL used for taking the results
 Trim_Nulls_Perf_Report.html : performance data


 Observations from Performance Results

 

 1. There is no performance change for columns that have all valid
 values (non-NULLs).

 2. There is a visible performance increase when the number of columns
 containing NULLs is more than ~60-70% in tables with a large number of
 columns.

 3. There are visible space savings when the number of columns containing
 NULLs is more than ~60-70% in tables with a large number of columns.


 Let me know if more performance data needs to be collected for this
 patch.


I can't make sense of your performance report. Because of that I can't
derive the same conclusions from it you do.

Can you explain the performance results in more detail, so we can see
what they mean? Like which are the patched, which are the unpatched
results? Which results are comparable, what the percentages mean etc..

We might then move quickly towards commit, or at least more tests.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] pg_top

2012-12-20 Thread Brett Maton
Hi List,

 

  This might not be the right place to post, but I've got a minor patch for
pg_top and would like to submit it for review.

 

  Basically, the rpm version in the repositories pg_top92-3.6.2 doesn't work
with Postgres 9.2

  

#define QUERY_PROCESSES \
    "SELECT procpid\n" \
    "FROM pg_stat_activity;"

 

It appears that procpid was renamed to pid at some point; the column
current_query also appears to have been shortened to query.

My patch updates a couple of queries to use the new, shorter column names.
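
For reference, the 9.2 renames look like this:

    -- pre-9.2
    SELECT procpid, current_query FROM pg_stat_activity;

    -- 9.2 and later
    SELECT pid, query FROM pg_stat_activity;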

 

Where, or to whom, should I submit the changes?

 

Thanks,

Brett

Brett Maton
t: +44 (0) 2392 658 264
m: +44 (0) 7427 610 472
e: mat...@ltresources.co.uk
http://www.ltresources.co.uk/

Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 -> 9.1

2012-12-20 Thread Bruce Momjian
On Wed, Dec 19, 2012 at 10:19:30PM -0500, Bruce Momjian wrote:
 On Wed, Dec 19, 2012 at 12:56:05PM -0500, Kevin Grittner wrote:
  Groshev Andrey wrote:
  
  Mismatch of relation names: database database, old rel 
public.lob.ВерсияВнешнегоДокумента$Документ_pkey, new rel 
public.plob.ВерсияВнешнегоДокумента$Документ
  
  There is a limit on identifiers of 63 *bytes* (not characters)
  after which the name is truncated. In UTF8 encoding, the underscore
  would be in the 64th position.
 
 OK, Kevin is certainly pointing out a bug in the pg_upgrade code, though
 I am unclear how it would exhibit the mismatch error reported.
 
 pg_upgrade uses NAMEDATALEN for database, schema, and relation name
 storage lengths.  While NAMEDATALEN works fine in the backend, it is
 possible that a frontend client, like pg_upgrade, could retrieve a name
 in the client encoding whose length exceeds NAMEDATALEN if the client
 encoding did not match the database encoding (or is it the cluster
 encoding for system tables).  This would cause truncation of these
 values.  The truncation would not cause crashes, but might cause
 failures by not being able to connect to overly-long database names, and
 it weakens the checking of relation/schema names --- the same check that
 is reported above.
 
 (I believe initdb.c also erroneously uses NAMEDATALEN.)

I have developed the attached patch to pg_strdup() the string returned
from libpq, rather than use a fixed NAMEDATALEN buffer to store the
value.  I am only going to apply this to 9.3 because I can't see this
causing problems except for weaker comparisons for very long identifiers
where the client encoding is longer than the server encoding, and
failures for very long database names, none of which we have gotten bug
reports about.

Turns out initdb.c was fine because it expects collation names to be
in ASCII only; I added a comment to that effect.

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c
new file mode 100644
index 2250442..5cb9b61
*** a/contrib/pg_upgrade/info.c
--- b/contrib/pg_upgrade/info.c
*** static void get_db_infos(ClusterInfo *cl
*** 23,29 
  static void get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo);
  static void free_rel_infos(RelInfoArr *rel_arr);
  static void print_db_infos(DbInfoArr *dbinfo);
! static void print_rel_infos(RelInfoArr *arr);
  
  
  /*
--- 23,29 
  static void get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo);
  static void free_rel_infos(RelInfoArr *rel_arr);
  static void print_db_infos(DbInfoArr *dbinfo);
! static void print_rel_infos(RelInfoArr *rel_arr);
  
  
  /*
*** create_rel_filename_map(const char *old_
*** 127,134 ****
  	map->new_relfilenode = new_rel->relfilenode;
  
  	/* used only for logging and error reporing, old/new are identical */
! 	snprintf(map->nspname, sizeof(map->nspname), "%s", old_rel->nspname);
! 	snprintf(map->relname, sizeof(map->relname), "%s", old_rel->relname);
  }
  
  
--- 127,134 ----
  	map->new_relfilenode = new_rel->relfilenode;
  
  	/* used only for logging and error reporing, old/new are identical */
! 	map->nspname = old_rel->nspname;
! 	map->relname = old_rel->relname;
  }
  
  
*** get_db_infos(ClusterInfo *cluster)
*** 220,227 ****
  	for (tupnum = 0; tupnum < ntups; tupnum++)
  	{
  		dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
! 		snprintf(dbinfos[tupnum].db_name, sizeof(dbinfos[tupnum].db_name), "%s",
! 				 PQgetvalue(res, tupnum, i_datname));
  		snprintf(dbinfos[tupnum].db_tblspace, sizeof(dbinfos[tupnum].db_tblspace), "%s",
  				 PQgetvalue(res, tupnum, i_spclocation));
  	}
--- 220,226 ----
  	for (tupnum = 0; tupnum < ntups; tupnum++)
  	{
  		dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
! 		dbinfos[tupnum].db_name = pg_strdup(PQgetvalue(res, tupnum, i_datname));
  		snprintf(dbinfos[tupnum].db_tblspace, sizeof(dbinfos[tupnum].db_tblspace), "%s",
  				 PQgetvalue(res, tupnum, i_spclocation));
  	}
*** get_rel_infos(ClusterInfo *cluster, DbIn
*** 346,355 ****
  		curr->reloid = atooid(PQgetvalue(res, relnum, i_oid));
  
  		nspname = PQgetvalue(res, relnum, i_nspname);
! 		strlcpy(curr->nspname, nspname, sizeof(curr->nspname));
  
  		relname = PQgetvalue(res, relnum, i_relname);
! 		strlcpy(curr->relname, relname, sizeof(curr->relname));
  
  		curr->relfilenode = atooid(PQgetvalue(res, relnum, i_relfilenode));
  
--- 345,354 ----
  		curr->reloid = atooid(PQgetvalue(res, relnum, i_oid));
  
  		nspname = PQgetvalue(res, relnum, i_nspname);
! 		curr->nspname = pg_strdup(nspname);
  
  		relname = PQgetvalue(res, relnum, i_relname);
! 		curr->relname = pg_strdup(relname);
  
  		curr->relfilenode = atooid(PQgetvalue(res, relnum, i_relfilenode));
  
*** free_db_and_rel_infos(DbInfoArr *db_arr)
*** 377,383 

Re: [HACKERS] pg_top

2012-12-20 Thread P. Christeas
On Thursday 20 December 2012, Brett Maton wrote:
 Hi List,
 
 
 
   This might not be the right place to post, but I've got a minor patch for
 pg_top and would like to submit it for review.
 
 
This project lives on GitHub:
https://github.com/markwkm/pg_top/

I think this commit fixed the issue you are talking about:
https://github.com/markwkm/pg_top/commit/a211ac3f161d7ec2752ed038009f93b4f470e86c



-- 
Say NO to spam and viruses. Stop using Microsoft Windows!




Re: [HACKERS] Switching timeline over streaming replication

2012-12-20 Thread Heikki Linnakangas

On 17.12.2012 15:05, Thom Brown wrote:

I just set up 120 chained standbys, and for some reason I'm seeing these
errors:

LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1
LOG:  record with zero length at 0/301EC10
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  restarted WAL streaming at 0/300 on timeline 1
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1
LOG:  new target timeline is 2
LOG:  restarted WAL streaming at 0/300 on timeline 2
LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 2
FATAL:  error reading result of streaming command: ERROR:  requested WAL
segment 00020003 has already been removed

ERROR:  requested WAL segment 00020003 has already been
removed
LOG:  started streaming WAL from primary at 0/300 on timeline 2
ERROR:  requested WAL segment 00020003 has already been
removed


I just committed a patch that should make the "requested WAL segment 
00020003 has already been removed" errors go away. The 
trick was for walsenders to not switch to the new timeline until at 
least one record has been replayed on it. That closes the window where 
the walsender already considers the new timeline to be the latest, but 
the WAL file has not been created yet.


- Heikki




Re: [HACKERS] pg_top

2012-12-20 Thread Brett Maton
Ah, I cloned the repository from git://git.postgresql.org/

  I'll have a look at Mark's repo.

Thanks,
Brett

-Original Message-
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of P. Christeas
Sent: 20 December 2012 12:33
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] pg_top

On Thursday 20 December 2012, Brett Maton wrote:
 Hi List,
 
 
 
   This might not be the right place to post, but I've got a minor 
 patch for pg_top and would like to submit it for review.
 
 
This project lives on GitHub:
https://github.com/markwkm/pg_top/

I think this commit fixed the issue you are talking about:
https://github.com/markwkm/pg_top/commit/a211ac3f161d7ec2752ed038009f93b4f470e86c



--
Say NO to spam and viruses. Stop using Microsoft Windows!




Re: [HACKERS] Switching timeline over streaming replication

2012-12-20 Thread Andres Freund
On 2012-12-20 14:45:05 +0200, Heikki Linnakangas wrote:
 On 17.12.2012 15:05, Thom Brown wrote:
 I just set up 120 chained standbys, and for some reason I'm seeing these
 errors:
 
 LOG:  replication terminated by primary server
 DETAIL:  End of WAL reached on timeline 1
 LOG:  record with zero length at 0/301EC10
 LOG:  fetching timeline history file for timeline 2 from primary server
 LOG:  restarted WAL streaming at 0/300 on timeline 1
 LOG:  replication terminated by primary server
 DETAIL:  End of WAL reached on timeline 1
 LOG:  new target timeline is 2
 LOG:  restarted WAL streaming at 0/300 on timeline 2
 LOG:  replication terminated by primary server
 DETAIL:  End of WAL reached on timeline 2
 FATAL:  error reading result of streaming command: ERROR:  requested WAL
 segment 00020003 has already been removed
 
 ERROR:  requested WAL segment 00020003 has already been
 removed
 LOG:  started streaming WAL from primary at 0/300 on timeline 2
 ERROR:  requested WAL segment 00020003 has already been
 removed

 I just committed a patch that should make the "requested WAL segment
 00020003 has already been removed" errors go away. The trick
 was for walsenders to not switch to the new timeline until at least one
 record has been replayed on it. That closes the window where the walsender
 already considers the new timeline to be the latest, but the WAL file has
 not been created yet.

I vote for introducing InvalidTimeLineID soon... 0 as an invalid
TimeLineID seems to spread and is annoying to grep for.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread Robert Haas
On Wed, Dec 19, 2012 at 10:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 valgrind comes with a tool called cachegrind which can emulate the
 cache algorithm on some variants of various cpus and produce reports.
 Can it be made to produce a report for a specific block of memory?

 I believe that oprofile can be persuaded to produce statistics about
 where in one's code are the most cache misses, not just the most
 wall-clock ticks; which would shed a lot of light on this question.
 However, my oprofile-fu doesn't quite extend to actually persuading it.

perf can certainly do this.

$ perf record -a -e cache-misses pgbench -n -S -T 30
...output elided...
$ perf report -d postgres | grep -v '^#' | head
  8.88%  postgres  base_yyparse
  7.05%  swapper   0x807c
  4.67%  postgres  SearchCatCache
  3.77%  pgbench   0x172dd58
  3.47%  postgres  hash_search_with_hash_value
  3.23%  postgres  AllocSetAlloc
  2.58%  postgres  core_yylex
  1.87%  postgres  LWLockAcquire
  1.83%  postgres  fmgr_info_cxt_security
  1.75%  postgres  0x4d1054

For comparison:

$ perf record -a -e cycles -d postgres pgbench -n -S -T 30
$ perf report -d postgres | grep -v '^#' | head
 6.54% postgres  AllocSetAlloc
 4.08%  swapper  0x4ce754
 3.60% postgres  hash_search_with_hash_value
 2.74% postgres  base_yyparse
 2.71% postgres  MemoryContextAllocZeroAligned
 2.57% postgres  MemoryContextAlloc
 2.36% postgres  SearchCatCache
 2.10% postgres  _bt_compare
 1.70% postgres  LWLockAcquire
 1.54% postgres  FunctionCall2Coll

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Cascading replication: should we detect/prevent cycles?

2012-12-20 Thread Robert Haas
On Wed, Dec 19, 2012 at 5:14 PM, Joshua Berkus j...@agliodbs.com wrote:
 What would such a test look like?  It's not obvious to me that there's any 
 rapid way for a user to detect this situation, without checking each server 
 individually.

Change something on the master and observe that none of the supposed
standbys notice?
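
(One concrete way to do that with stock functions, as a sketch: make a
trivial change on the master, then on each standby check that replay is
advancing and recent:

    SELECT pg_last_xlog_replay_location(),
           now() - pg_last_xact_replay_timestamp() AS replay_lag;

pg_last_xact_replay_timestamp() is available in 9.1 and later.)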

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Amit Kapila
On Thursday, December 20, 2012 5:12 PM Heikki Linnakangas wrote:
 On 20.12.2012 12:08, Amit Kapila wrote:
  On Wednesday, December 19, 2012 9:30 PM Heikki Linnakangas wrote:
  In both checkpointer.c and bgwriter.c, we do this before entering
 the
  main loop:
 
   /*
 * Use the recovery target timeline ID during recovery
 */
if (RecoveryInProgress())
ThisTimeLineID = GetRecoveryTargetTLI();
 
  That seems reasonable. However, since it's only done once, when the
  process starts up, ThisTimeLineID is never updated in those
 processes,
  even though the startup process changes recovery target timeline.
 
  That actually seems harmless to me, and I also haven't heard of any
  complaints of misbehavior in 9.1 or 9.2 caused by that. I'm not sure
  why
  we bother to set ThisTimeLineID in those processes in the first
 place.
 
  This is used in RemoveOldXlogFiles(), so if during recovery when it's
 not
  set and
  this function gets called, it might have some problem.
  I think it could get called from CreateRestartPoint() during
 recovery.
 
 Hmm, right, it's used for this:
 
   XLogFileName(lastoff, ThisTimeLineID, segno);
 
 
 So I think we're good on that front. But I'll add a comment there, and
 use 0 explicitly instead of ThisTimeLineID, for clarity.

True, it might not have any functional effect in RemoveOldXlogFiles().
However, it can also be used in PreallocXlogFiles() -> XLogFileInit().

With Regards,
Amit Kapila.





Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Simon Riggs
On 20 December 2012 13:19, Amit Kapila amit.kap...@huawei.com wrote:

 So I think we're good on that front. But I'll add a comment there, and
 use 0 explicitly instead of ThisTimeLineID, for clarity.

  True, it might not have any functional effect in RemoveOldXlogFiles().
  However, it can also be used in PreallocXlogFiles() -> XLogFileInit().

PreallocXlogFiles() should have been put to the sword long ago. It's a
performance tweak aimed at people without a performance problem in
this area.

I'll happily accept this excuse to remove it.

We can discuss whether any replacement is required.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Andres Freund
On 2012-12-20 13:32:36 +, Simon Riggs wrote:
 On 20 December 2012 13:19, Amit Kapila amit.kap...@huawei.com wrote:

  So I think we're good on that front. But I'll add a comment there, and
  use 0 explicitly instead of ThisTimeLineID, for clarity.
 
   True, it might not have any functional effect in RemoveOldXlogFiles().
   However, it can also be used in PreallocXlogFiles() -> XLogFileInit().

 PreallocXlogFiles() should have been put to the sword long ago. It's a
 performance tweak aimed at people without a performance problem in
 this area.

Creating, zeroing & fsync()'ing a 16MB file shouldn't be done in
individual transactions because it would increase latency noticeably. So
it seems it addresses a real performance problem. What do you dislike
about it?

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread Peter Eisentraut
On 12/18/12 5:10 PM, Robert Haas wrote:
 I can't help but suspect that the way we handle keywords today is
 monumentally inefficient.  The unreserved_keyword products, et al,
 just seem somehow badly wrong-headed.  We take the trouble to
 distinguish all of those cases so that we an turn around and not
 distinguish them.  I feel like there ought to be some way to use lexer
 states to handle this - if we're in a context where an unreserved
 keyword will be treated as an IDENT, then have the lexer return IDENT
 when it sees an unreserved keyword.  I might be wrong, but it seems
 like that would eliminate a whole lot of parser state transitions.
 However, even if I'm right, I have no idea how to implement it.  It
 just seems very wasteful that we have so many parser states that have
 no purpose other than (effectively) to convert an unreserved_keyword
 into an IDENT when the lexer could do the same thing much more cheaply
 given a bit more context.

Another way of attack along these lines might be to use the %glr-parser
and then try to cut back on all those redundant rules that were put in
to avoid conflicts.  The number of keyword categories and such could
perhaps also be reduced that way.

Of course, this is mostly speculation.




Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread Simon Riggs
On 18 December 2012 22:10, Robert Haas robertmh...@gmail.com wrote:

 Well that would be nice, but the problem is that I see no way to
 implement it.  If, with a unified parser, the parser is 14% of our
 source code, then splitting it in two will probably crank that number
 up well over 20%, because there will be duplication between the two.
 That seems double-plus un-good.

I don't think the size of the parser binary is that relevant. What is
relevant is how much of that is regularly accessed.

Increasing parser cache misses for DDL and increasing size of binary
overall are acceptable costs if we are able to swap out the unneeded
areas and significantly reduce the cache misses on the well travelled
portions of the parser.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Amit Kapila
On Thursday, December 20, 2012 7:03 PM Simon Riggs wrote:
 On 20 December 2012 13:19, Amit Kapila amit.kap...@huawei.com wrote:
 
  So I think we're good on that front. But I'll add a comment there, and
  use 0 explicitly instead of ThisTimeLineID, for clarity.
 
  True, it might not have any functional effect in RemoveOldXlogFiles().
  However it can be used in PreallocXlogFiles()->XLogFileInit() as well.
 
 PreallocXlogFiles() should have been put to the sword long ago. It's a
 performance tweak aimed at people without a performance problem in
 this area.

Yes, it seems to be for a rare scenario. 
 
 I'll happily accept this excuse to remove it.
 
 We can discuss whether any replacement is required.

We should think of alternatives if there is no other reasonable fix for the
problem.

With Regards,
Amit Kapila.









Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread Robert Haas
On Thu, Dec 20, 2012 at 8:55 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 18 December 2012 22:10, Robert Haas robertmh...@gmail.com wrote:
 Well that would be nice, but the problem is that I see no way to
 implement it.  If, with a unified parser, the parser is 14% of our
 source code, then splitting it in two will probably crank that number
 up well over 20%, because there will be duplication between the two.
 That seems double-plus un-good.

 I don't think the size of the parser binary is that relevant. What is
 relevant is how much of that is regularly accessed.

 Increasing parser cache misses for DDL and increasing size of binary
 overall are acceptable costs if we are able to swap out the unneeded
 areas and significantly reduce the cache misses on the well travelled
 portions of the parser.

I generally agree.  We don't want to bloat the size of the parser with
wild abandon, but yeah if we can reduce the cache misses on the
well-travelled portions that seems like it ought to help.  My previous
hacky attempt to quantify the potential benefit in this area was:

http://archives.postgresql.org/pgsql-hackers/2011-05/msg01008.php

On my machine there seemed to be a small but consistent win; on a very
old box Jeff Janes tried, it didn't seem like there was any benefit at
all.  Somehow, I have a feeling we're missing a trick here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] recent ALTER whatever .. SET SCHEMA refactoring

2012-12-20 Thread Robert Haas
The recent SET SCHEMA refactoring has changed the error message that
you get when trying to move a function into the schema that already
contains it.

For a table, as ever, you get:

rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# alter table foo set schema public;
ERROR:  table foo is already in schema public

Functions used to produce the same message, but not any more:

rhaas=# create function foo(a int) returns int as $$select 1$$ language sql;
CREATE FUNCTION
rhaas=# alter function foo(int) set schema public;
ERROR:  function foo already exists in schema public

The difference is that the first error message is complaining about an
operation that is a no-op, whereas the second one is complaining about
a name collision.  It seems a bit off in this case because the name
collision is between the function and itself, not the function and
some other object with a conflicting name.  The root of the problem is
that AlterObjectNamespace_internal generates both error messages and
does the checks in the correct order, but for functions,
AlterFunctionNamespace_oid has a second copy of the conflicting-names
check that is argument-type aware, which happens before the
same-schema check in AlterObjectNamespace_internal.

IMHO, we ought to fix this by getting rid of the check in
AlterFunctionNamespace_oid and adding an appropriate special case for
functions in AlterObjectNamespace_internal that knows how to do the
check correctly.  It's not a huge deal, but it seems confusing to have
AlterObjectNamespace_internal know how to do the check correctly in
some cases but not others.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread Andres Freund
On 2012-12-20 09:11:46 -0500, Robert Haas wrote:
 On Thu, Dec 20, 2012 at 8:55 AM, Simon Riggs si...@2ndquadrant.com wrote:
  On 18 December 2012 22:10, Robert Haas robertmh...@gmail.com wrote:
  Well that would be nice, but the problem is that I see no way to
  implement it.  If, with a unified parser, the parser is 14% of our
  source code, then splitting it in two will probably crank that number
  up well over 20%, because there will be duplication between the two.
  That seems double-plus un-good.
 
  I don't think the size of the parser binary is that relevant. What is
  relevant is how much of that is regularly accessed.
 
  Increasing parser cache misses for DDL and increasing size of binary
  overall are acceptable costs if we are able to swap out the unneeded
  areas and significantly reduce the cache misses on the well travelled
  portions of the parser.

 I generally agree.  We don't want to bloat the size of the parser with
 wild abandon, but yeah if we can reduce the cache misses on the
 well-travelled portions that seems like it ought to help.  My previous
 hacky attempt to quantify the potential benefit in this area was:

 http://archives.postgresql.org/pgsql-hackers/2011-05/msg01008.php

 On my machine there seemed to be a small but consistent win; on a very
 old box Jeff Janes tried, it didn't seem like there was any benefit at
 all.  Somehow, I have a feeling we're missing a trick here.

I don't think you will see too many cache misses on such a low number of
extremely simple statements, so it's not too surprising not to see that
big a difference with that.

Are we sure it's really cache misses and not just the actions performed
in the grammar that make bison code show up in profiles? I remember the
latter being the case...

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread Andres Freund
On 2012-12-20 15:45:47 +0100, Andres Freund wrote:
 On 2012-12-20 09:11:46 -0500, Robert Haas wrote:
  On Thu, Dec 20, 2012 at 8:55 AM, Simon Riggs si...@2ndquadrant.com wrote:
   On 18 December 2012 22:10, Robert Haas robertmh...@gmail.com wrote:
   Well that would be nice, but the problem is that I see no way to
   implement it.  If, with a unified parser, the parser is 14% of our
   source code, then splitting it in two will probably crank that number
   up well over 20%, because there will be duplication between the two.
   That seems double-plus un-good.
  
   I don't think the size of the parser binary is that relevant. What is
   relevant is how much of that is regularly accessed.
  
   Increasing parser cache misses for DDL and increasing size of binary
   overall are acceptable costs if we are able to swap out the unneeded
   areas and significantly reduce the cache misses on the well travelled
   portions of the parser.
 
  I generally agree.  We don't want to bloat the size of the parser with
  wild abandon, but yeah if we can reduce the cache misses on the
  well-travelled portions that seems like it ought to help.  My previous
  hacky attempt to quantify the potential benefit in this area was:
 
  http://archives.postgresql.org/pgsql-hackers/2011-05/msg01008.php
 
  On my machine there seemed to be a small but consistent win; on a very
  old box Jeff Janes tried, it didn't seem like there was any benefit at
  all.  Somehow, I have a feeling we're missing a trick here.

 I don't think you will see too many cache misses on such a low number of
 extremely simple statements, so it's not too surprising not to see that
 big a difference with that.
 
 Are we sure it's really cache misses and not just the actions performed
 in the grammar that make bison code show up in profiles? I remember the
 latter being the case...

Hm. A very, very quick perf stat -dvvv of pgbench -S -c 20 -j 20 -T 20 later:

     218350.885559 task-clock                #   10.095 CPUs utilized
         1,676,479 context-switches          #    0.008 M/sec
             2,396 cpu-migrations            #    0.011 K/sec
           796,038 page-faults               #    0.004 M/sec
   506,312,525,518 cycles                    #    2.319 GHz                     [20.00%]
   405,944,435,754 stalled-cycles-frontend   #   80.18% frontend cycles idle    [30.32%]
   236,712,872,641 stalled-cycles-backend    #   46.75% backend  cycles idle    [40.51%]
   193,459,120,458 instructions              #    0.38  insns per cycle
                                             #    2.10  stalled cycles per insn [50.70%]
    36,433,144,472 branches                  #  166.856 M/sec                   [51.12%]
     3,623,778,087 branch-misses             #    9.95% of all branches         [50.87%]
    50,344,123,581 L1-dcache-loads           #  230.565 M/sec                   [50.33%]
     5,548,192,686 L1-dcache-load-misses     #   11.02% of all L1-dcache hits   [49.69%]
     2,666,461,361 LLC-loads                 #   12.212 M/sec                   [35.63%]
       112,407,198 LLC-load-misses           #    4.22% of all LL-cache hits    [ 9.67%]

      21.629396701 seconds time elapsed

So there's definitely a noticeable rate of cache misses...

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Re: patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap [Review]

2012-12-20 Thread Amit Kapila
On Thursday, December 20, 2012 5:46 PM Simon Riggs wrote:
 On 13 October 2012 08:54, Amit kapila amit.kap...@huawei.com wrote:
 
  As per the last discussion for this patch, performance data needs to be
  provided before this patch's Review can proceed further.
 
  So as per your suggestion and from the discussions about this patch, I have
 
  1. There is no performance change for columns that have all valid
  values (non-NULLs).
 
  2. There is a visible performance increase when the number of columns
  containing NULLs is more than 60~70% in a table having a large number of
  columns.
 
  3. There are visible space savings when the number of columns containing
  NULLs is more than 60~70% in a table having a large number of columns.
 
  Let me know if there is more performance data that needs to be collected
  for this patch?
 
 I can't make sense of your performance report. Because of that I can't
 derive the same conclusions from it you do.
 
 Can you explain the performance results in more detail, so we can see
 what they mean? Like which are the patched, which are the unpatched
 results?
At the extreme left it is mentioned Original Code / Trim Trailing Nulls Patch.
In any case I have framed the results again as below:
1. Table with 800 columns 
A. INSERT tuples with 600 trailing nulls 
B. UPDATE last column value to non-null 
C. UPDATE last column value to null 
-----------------+---------------------+--------------------------
  Original Code  | Trim Trailing NULLs | Improvement (%)
 TPS  space used | TPS  space used     | Results
      (pages)    |      (pages)        |
-----------------+---------------------+--------------------------
1A:  0.2068  25  | 0.2302  23  | 10.1% tps, 11.1% space 
1B:  0.0448  50  | 0.0481  472223  |  6.8% tps,  5.6% space 
1C:  0.0433  75  | 0.0493  694445  | 12.2% tps,  7.4% space 

2. Table with 800 columns 
A. INSERT tuples with 300 trailing nulls 
B. UPDATE last column value to non-null 
C. UPDATE last column value to null 
-----------------+---------------------+--------------------------
  Original Code  | Trim Trailing NULLs | Improvement (%)
 TPS  space used | TPS  space used     | Results
      (pages)    |      (pages)        |
-----------------+---------------------+--------------------------
2A:  0.0280   67 | 0.0287   67 | 2.3% tps, 0% space 
2B:  0.0143  134 | 0.0152  134 | 5.3% tps, 0% space 
2C:  0.0145  200 | 0.0149  200 | 2.9% tps, 0% space 

3. Table with 300 columns 
A. INSERT tuples with 150 trailing nulls 
B. UPDATE last column value to non-null 
C. UPDATE last column value to null 
-----------------+---------------------+--------------------------
  Original Code  | Trim Trailing NULLs | Improvement (%)
 TPS  space used | TPS  space used     | Results
      (pages)    |      (pages)        |
-----------------+---------------------+--------------------------
3A:  0.2815  17  | 0.2899  17  | 2.9% tps, 0% space 
3B:  0.0851  34  | 0.0870  34  | 2.2% tps, 0% space 
3C:  0.0846  50  | 0.0852  50  | 0.7% tps, 0% space 

4. Table with 300 columns 
A. INSERT tuples with 250 trailing nulls 
B. UPDATE last column value to non-null 
C. UPDATE last column value to null 
-----------------+---------------------+--------------------------
  Original Code  | Trim Trailing NULLs | Improvement (%)
 TPS  space used | TPS  space used     | Results
      (pages)    |      (pages)        |
-----------------+---------------------+--------------------------
4A:  0.54477 | 0.5996   58824  |  09.2% tps, 11.8% space 
4B:  0.1251   135633 | 0.1232  127790  | -01.5% tps,  5.8% space 
4C:  0.1223   202299 | 0.1361  186613  |  10.1% tps,  7.5% space

Please let me know if it is still not clear.
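
(For readers new to the thread, the idea being measured is roughly the
following -- an illustrative sketch, not the submitted patch: when forming
a heap tuple, stop counting attributes at the last non-NULL column, so the
null bitmap only has to cover that many columns. Attributes beyond a
tuple's stored natts already read back as NULL, the same mechanism that
makes ALTER TABLE ... ADD COLUMN without a default cheap.)

	/* Sketch: number of attributes left after trimming trailing NULLs. */
	static int
	effective_natts(int natts, const bool *isnull)
	{
		while (natts > 0 && isnull[natts - 1])
			natts--;
		return natts;
	}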

With Regards,
Amit Kapila.






Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread Andres Freund
On 2012-12-20 15:51:37 +0100, Andres Freund wrote:
 On 2012-12-20 15:45:47 +0100, Andres Freund wrote:
  On 2012-12-20 09:11:46 -0500, Robert Haas wrote:
   On Thu, Dec 20, 2012 at 8:55 AM, Simon Riggs si...@2ndquadrant.com 
   wrote:
On 18 December 2012 22:10, Robert Haas robertmh...@gmail.com wrote:
Well that would be nice, but the problem is that I see no way to
implement it.  If, with a unified parser, the parser is 14% of our
source code, then splitting it in two will probably crank that number
up well over 20%, because there will be duplication between the two.
That seems double-plus un-good.
   
I don't think the size of the parser binary is that relevant. What is
relevant is how much of that is regularly accessed.
   
Increasing parser cache misses for DDL and increasing size of binary
overall are acceptable costs if we are able to swap out the unneeded
areas and significantly reduce the cache misses on the well travelled
portions of the parser.
  
   I generally agree.  We don't want to bloat the size of the parser with
   wild abandon, but yeah if we can reduce the cache misses on the
   well-travelled portions that seems like it ought to help.  My previous
   hacky attempt to quantify the potential benefit in this area was:
  
   http://archives.postgresql.org/pgsql-hackers/2011-05/msg01008.php
  
   On my machine there seemed to be a small but consistent win; on a very
   old box Jeff Janes tried, it didn't seem like there was any benefit at
   all.  Somehow, I have a feeling we're missing a trick here.
 
  I don't think you will see too many cache misses on such a low number of
  extremely simple statements, so it's not too surprising not to see that
  big a difference with that.
 
  Are we sure it's really cache misses and not just the actions performed
  in the grammar that make bison code show up in profiles? I remember the
  latter being the case...

 Hm. A very, very quick perf stat -dvvv of pgbench -S -c 20 -j 20 -T 20 later:

      218350.885559 task-clock                #   10.095 CPUs utilized
          1,676,479 context-switches          #    0.008 M/sec
              2,396 cpu-migrations            #    0.011 K/sec
            796,038 page-faults               #    0.004 M/sec
    506,312,525,518 cycles                    #    2.319 GHz                     [20.00%]
    405,944,435,754 stalled-cycles-frontend   #   80.18% frontend cycles idle    [30.32%]
    236,712,872,641 stalled-cycles-backend    #   46.75% backend  cycles idle    [40.51%]
    193,459,120,458 instructions              #    0.38  insns per cycle
                                              #    2.10  stalled cycles per insn [50.70%]
     36,433,144,472 branches                  #  166.856 M/sec                   [51.12%]
      3,623,778,087 branch-misses             #    9.95% of all branches         [50.87%]
     50,344,123,581 L1-dcache-loads           #  230.565 M/sec                   [50.33%]
      5,548,192,686 L1-dcache-load-misses     #   11.02% of all L1-dcache hits   [49.69%]
      2,666,461,361 LLC-loads                 #   12.212 M/sec                   [35.63%]
        112,407,198 LLC-load-misses           #    4.22% of all LL-cache hits    [ 9.67%]

       21.629396701 seconds time elapsed

 So there's definitely a noticeable rate of cache misses...

L1 misses:
# Samples: 997K of event 'L1-dcache-load-misses'
# Overhead   Command   Shared Object        Symbol
# ........  ........  .................  ............
...
 6.49%  postgres  postgres[.] SearchCatCache
 3.65%  postgres  postgres[.] base_yyparse
 3.48%  postgres  postgres[.] hash_search_with_hash_value
 3.41%  postgres  postgres[.] AllocSetAlloc
 1.84%  postgres  postgres[.] LWLockAcquire
 1.40%  postgres  postgres[.] fmgr_info_cxt_security
 1.36%  postgres  postgres[.] nocachegetattr
 1.23%  postgres  libc-2.13.so[.] _int_malloc
 1.20%  postgres  postgres[.] core_yylex
 1.15%  postgres  postgres[.] MemoryContextAllocZeroAligned
 0.94%  postgres  postgres[.] PostgresMain
 0.94%  postgres  postgres[.] MemoryContextAlloc
 0.91%  postgres  libc-2.13.so[.] __memcpy_ssse3_back
 0.89%  postgres  postgres[.] CatalogCacheComputeHashValue
 0.86%  postgres  postgres[.] PinBuffer
 0.86%  postgres  [kernel.kallsyms]   [k] __audit_syscall_entry
 0.80%  postgres  libc-2.13.so[.] __strcmp_sse42
 0.80%  postgres  postgres[.] _bt_compare
 0.78%  postgres  postgres[.] FunctionCall2Coll
 0.77%  postgres  libc-2.13.so[.] malloc
 0.73%  postgres  libc-2.13.so[.] __memset_sse2
 0.72%  postgres  postgres[.] GetSnapshotData
 0.69%  postgres  [kernel.kallsyms]   

Re: [HACKERS] discarding duplicate indexes

2012-12-20 Thread Josh Kupershmidt
On Thu, Dec 20, 2012 at 1:26 AM, Gavin Flower
gavinflo...@archidevsys.co.nz wrote:
 On 20/12/12 14:57, Josh Kupershmidt wrote:

 CREATE TABLE test (id int);
 CREATE INDEX test_idx1 ON test (id);
 CREATE INDEX test_idx2 ON test (id);

 I initially misread your example code, but after I realised my mistake, I
 thought of an alternative scenario that might be worth considering.

 CREATE TABLE test (id int, sub int, payload text);
 CREATE INDEX test_idx1 ON test (id, sub);
 CREATE INDEX test_idx2 ON test (id);


 Now test_idx2 is logically included in test_idx1, but if the majority of
 transactions only query on id, then test_idx2 would be better as it
 ties up less RAM

Well, this situation works without any LIKE ... INCLUDING INDEXES
surprises. If you
  CREATE TABLE test_copycat (LIKE test INCLUDING INDEXES);

you should see test_copycat created with both indexes, since
indexParams is considered for this deduplication.

Josh




Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread Andres Freund
On 2012-12-20 16:04:49 +0100, Andres Freund wrote:
 On 2012-12-20 15:51:37 +0100, Andres Freund wrote:
 When doing a source/assembly annotation in SearchCatCache about half of
 the misses are attributed to the memcpy() directly at the beginning.

Using a separate array for storing the arguments instead of copying &
modifying cache->cc_skey yields a 2% speedup in pgbench -S for me...

The attached patch is clearly not ready and I don't really have time &
energy to pursue it right now, but it seems interesting enough to
post. The approach seems solid and sensible although the implementation
is not (too much c&p, no comments).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 9ae1432..bee6f3d 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -73,7 +73,7 @@ static CatCacheHeader *CacheHdr = NULL;
 
 
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
-							 ScanKey cur_skey);
+						   ScanKey cur_skey, Datum *argument);
 static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache,
 								   HeapTuple tuple);
 
@@ -173,7 +173,7 @@ GetCCHashEqFuncs(Oid keytype, PGFunction *hashfunc, RegProcedure *eqfunc)
  * Compute the hash value associated with a given set of lookup keys
  */
 static uint32
-CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey)
+CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey, Datum *argument)
 {
 	uint32		hashValue = 0;
 	uint32		oneHash;
@@ -188,28 +188,28 @@ CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey)
 		case 4:
 			oneHash =
 				DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[3],
-												   cur_skey[3].sk_argument));
+												   argument[3]));
 			hashValue ^= oneHash << 24;
 			hashValue ^= oneHash >> 8;
 			/* FALLTHROUGH */
 		case 3:
 			oneHash =
 				DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[2],
-												   cur_skey[2].sk_argument));
+												   argument[2]));
 			hashValue ^= oneHash << 16;
 			hashValue ^= oneHash >> 16;
 			/* FALLTHROUGH */
 		case 2:
 			oneHash =
 				DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[1],
-												   cur_skey[1].sk_argument));
+												   argument[1]));
 			hashValue ^= oneHash << 8;
 			hashValue ^= oneHash >> 24;
 			/* FALLTHROUGH */
 		case 1:
 			oneHash =
 				DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[0],
-												   cur_skey[0].sk_argument));
+												   argument[0]));
 			hashValue ^= oneHash;
 			break;
 		default:
@@ -228,17 +228,14 @@ CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey)
 static uint32
 CatalogCacheComputeTupleHashValue(CatCache *cache, HeapTuple tuple)
 {
-	ScanKeyData cur_skey[CATCACHE_MAXKEYS];
+	Datum		arguments[CATCACHE_MAXKEYS];
 	bool		isNull = false;
 
-	/* Copy pre-initialized overhead data for scankey */
-	memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
-
 	/* Now extract key fields from tuple, insert into scankey */
 	switch (cache->cc_nkeys)
 	{
 		case 4:
-			cur_skey[3].sk_argument =
+			arguments[3] =
 				(cache->cc_key[3] == ObjectIdAttributeNumber)
 				? ObjectIdGetDatum(HeapTupleGetOid(tuple))
 				: fastgetattr(tuple,
@@ -248,7 +245,7 @@ CatalogCacheComputeTupleHashValue(CatCache *cache, HeapTuple tuple)
 			Assert(!isNull);
 			/* FALLTHROUGH */
 		case 3:
-			cur_skey[2].sk_argument =
+			arguments[2] =
 				(cache->cc_key[2] == ObjectIdAttributeNumber)
 				? ObjectIdGetDatum(HeapTupleGetOid(tuple))
 				: fastgetattr(tuple,
@@ -258,7 +255,7 @@ CatalogCacheComputeTupleHashValue(CatCache *cache, HeapTuple tuple)
 			Assert(!isNull);
 			/* FALLTHROUGH */
 		case 2:
-			cur_skey[1].sk_argument =
+			arguments[1] =
 				(cache->cc_key[1] == ObjectIdAttributeNumber)
 				? ObjectIdGetDatum(HeapTupleGetOid(tuple))
 				: fastgetattr(tuple,
@@ -268,7 +265,7 @@ CatalogCacheComputeTupleHashValue(CatCache *cache, HeapTuple tuple)
 			Assert(!isNull);
 			/* FALLTHROUGH */
 		case 1:
-			cur_skey[0].sk_argument =
+			arguments[0] =
 				(cache->cc_key[0] == ObjectIdAttributeNumber)
 				? ObjectIdGetDatum(HeapTupleGetOid(tuple))
 				: fastgetattr(tuple,
@@ -282,7 +279,7 @@ CatalogCacheComputeTupleHashValue(CatCache *cache, HeapTuple tuple)
 			break;
 	}
 
-	return CatalogCacheComputeHashValue(cache, cache->cc_nkeys, cur_skey);
+	return CatalogCacheComputeHashValue(cache, cache->cc_nkeys, cache->cc_skey, arguments);
 }
 
 
@@ -1058,6 +1055,7 @@ SearchCatCache(CatCache *cache,
 			   Datum v4)
 {
 	ScanKeyData cur_skey[CATCACHE_MAXKEYS];
+	Datum		arguments[CATCACHE_MAXKEYS];
 	uint32		hashValue;
 	Index		hashIndex;
 	dlist_iter	iter;
@@ -1080,16 +1078,15 @@ SearchCatCache(CatCache *cache,
 	/*
 	 * initialize the search key information
 	 */
-	memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
-	

Re: [HACKERS] operator dependency of commutator and negator, redux

2012-12-20 Thread Tom Lane
Brendan Jurd dire...@gmail.com writes:
 On 20 December 2012 11:51, Tom Lane t...@sss.pgh.pa.us wrote:
 While reconsidering the various not-too-satisfactory fixes we thought of
 back then, I had a sudden thought.  Instead of having a COMMUTATOR or
 NEGATOR forward reference create a shell operator and link to it,
 why not simply *ignore* such references?  Then when the second operator
 is defined, go ahead and fill in both links?

 Ignore with warning sounds pretty good.  So it would go something like this?

 # CREATE OPERATOR  (... COMMUTATOR );
 WARNING: COMMUTATOR  (foo, foo) undefined, ignoring.
 CREATE OPERATOR

 # CREATE OPERATOR  (... COMMUTATOR );
 CREATE OPERATOR

I was thinking a NOTICE at most.  If it's a WARNING then restoring
perfectly valid pg_dump files will result in lots of scary-looking
chatter.  You could make an argument for printing nothing at all,
but that would probably mislead people who'd fat-fingered their
COMMUTATOR entries.

regards, tom lane




Re: [HACKERS] Set visibility map bit after HOT prune

2012-12-20 Thread Robert Haas
On Wed, Dec 19, 2012 at 11:12 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 Yeah, VM buffer contention can become prominent if we break the
 invariant that page level bit status implies the vm bit status, at
 least when it's clear. OTOH, IMHO we need some mechanism to address the
 issue of aggressive clearing of the VM bits, but a very lame
 corresponding set operation.

I agree.

 Today we don't have much contention on
 the VM page, but we must be sacrificing its usability in return. IOS
 as well as vacuum optimizations using VMs will turn out to be not so
 useful for many workloads.

I have that fear too, but the evidence isn't really in yet.  The
testing that people have reported on this list has had mostly positive
outcomes.  Of course that doesn't mean that it will work as well in
the field as it did in the lab, but it doesn't mean that it won't,
either.

 I'm very reluctant to suggest that we can solve
 this by setting aside another page-level bit to track visibility of
 tuples for heapscans. Or even have a bit in the tuple header itself to
 track this information at that level to avoid repeated visibility
 checks for a tuple which is known to be visible to all current and
 future transactions.

This has been suggested before, as an alternative to freezing tuples.
It seems to have some potential although I think more thought and work
is needed to figure out exactly where to go with it.

 And we expect vacuums to be few or none. AFAIR in pgbench, it
 now takes hours for the accounts table to get chosen for vacuum and we
 should be happy about it. But IOS are almost impossible for pgbench
 kind of workloads today because of our aggressive strategy to clear
 the VM bits.

IMHO, it's probably fairly hopeless to make a pure pgbench workload
show a benefit from index-only scans.  A large table under a very
heavy, completely random write workload is just about the worst
possible case for index-only scans.  Index-only scans are a way of
avoiding unnecessary visibility checks when the target data hasn't
changed recently, not a magic bullet to escape all heap access.  If
the target data has changed, you're going to have to touch the heap.
And while I agree that we aren't aggressive enough in setting the VM
bits right now, I also think it wouldn't be too hard to go too far in
the opposite direction: we could easily spend more effort trying to
make index-only scans effective than we could ever hope to recoup from
the scans themselves.

Now, if you set up N threads of which 10% are doing regular pgbench
and the other 90% are doing pgbench -S, or something like that, then
you might start to hope for some benefit from index-only scans.  But I
think you might also GET some benefit in that case, even at steady
state.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Set visibility map bit after HOT prune

2012-12-20 Thread Robert Haas
On Thu, Dec 20, 2012 at 4:58 AM, Simon Riggs si...@2ndquadrant.com wrote:
 ISTM that if someone spots a block that could use cleanup, they mark
 the block as BM_PIN_COUNT_WAITER, but don't set pid. Then when they
 unpin the block they send a signal/queue work for a special cleaning
 process to come in and do the work now that nobody is waiting. Logic
 would allow VACUUMs to go past that by setting the pid. If we
 prioritised cleanup onto blocks that are already dirty we would
 minimise I/O.

I don't favor that particular signaling mechanism, but I agree that
there is quite a bit of potential utility in having foreground
processes notice that work (like a HOT prune, or setting the VM bit)
needs to be done and pass those requests off to a background process.
I'm hoping the new background worker framework in 9.3 will make that
sort of thing easier for people to play around with.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] operator dependency of commutator and negator, redux

2012-12-20 Thread Robert Haas
On Thu, Dec 20, 2012 at 10:52 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Brendan Jurd dire...@gmail.com writes:
 On 20 December 2012 11:51, Tom Lane t...@sss.pgh.pa.us wrote:
 While reconsidering the various not-too-satisfactory fixes we thought of
 back then, I had a sudden thought.  Instead of having a COMMUTATOR or
 NEGATOR forward reference create a shell operator and link to it,
 why not simply *ignore* such references?  Then when the second operator
 is defined, go ahead and fill in both links?

 Ignore with warning sounds pretty good.  So it would go something like this?

 # CREATE OPERATOR  (... COMMUTATOR );
 WARNING: COMMUTATOR  (foo, foo) undefined, ignoring.
 CREATE OPERATOR

 # CREATE OPERATOR  (... COMMUTATOR );
 CREATE OPERATOR

 I was thinking a NOTICE at most.  If it's a WARNING then restoring
 perfectly valid pg_dump files will result in lots of scary-looking
 chatter.  You could make an argument for printing nothing at all,
 but that would probably mislead people who'd fat-fingered their
 COMMUTATOR entries.

What about jiggering the dump so that only the second of the two
operators to be dumped includes the COMMUTATOR clause?  Or even adding
a separate ALTER OPERATOR  COMMUTATOR  statement (or something of
the sort) that pg_dump can emit as a separate item.  Even a NOTICE in
pg_dump seems like too much chatter (witness recent quieting of some
other NOTICE messages we've all grown tired of) but silently ignoring
the problem doesn't seem smart either, for the reason you mention.
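
To illustrate, such a dump item might look something like this
(hypothetical syntax, sketched only to show the shape of the idea; no such
statement exists today):

	ALTER OPERATOR === (point, point) SET (COMMUTATOR = ===);

pg_dump could then emit it after both CREATE OPERATOR commands, making the
ordering problem go away regardless of how the operators themselves are
sorted.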

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread Greg Stark
On Thu, Dec 20, 2012 at 3:18 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Greg Stark st...@mit.edu writes:
 But I'm not entirely convinced any of this is actually useful. Just
 because the transition table is large doesn't mean it's inefficient.

 That's a fair point.  However, I've often noticed base_yyparse() showing
 up rather high on profiles --- higher than seemed plausible at the time,
 given that its state-machine implementation is pretty tight.  Now I'm
 wondering whether that isn't coming from cache stalls from trying to
 touch all the requisite parts of the transition table.

For what it's worth the bloat isn't in the parser transition table at all:
516280 yy_transition
147208 yytable
147208 yycheck
146975 base_yyparse
17468 yypact
9571 core_yylex
8734 yystos
8734 yydefact

Unless I'm confused yy_transition is in fact the *lexer* transition
table. I'm not sure how to reconcile that with the profiling results
showing the cache misses in base_yyparse() though.
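
The per-symbol sizes above can be pulled out of the backend binary with
binutils; the thread doesn't say which command was used, but something
like this produces the same shape of output:

	nm --print-size --size-sort --radix=d src/backend/postgres | tail -8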


-- 
greg




Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 - 9.1

2012-12-20 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 As you can see, ALTER INDEX renamed both the pg_constraint and pg_class
 names.  Is it possible someone manually updated the system table to
 rename this primary key?  That would cause this error message.  The fix
 is to just to make sure they match.

 Does pg_upgrade need to be modified to handle this case?  Are there
 legitimate cases where they will not match and the index name will not
  be preserved through a dump/restore?  This seems safe:

It's not always been true that ALTER INDEX would try to rename
constraints to keep 'em in sync.  A quick check says that only 8.3 and
later do that.  I'm not sure though how a 9.0 database could get into
such a state without manual catalog hacking, since as you say a dump and
reload should have ended up with the index and constraint having the
same name again.

I'd be inclined not to worry about this in pg_upgrade, at least not till
we see a plausible scenario for the situation to arise without manual
catalog changes.

regards, tom lane




Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread Andres Freund
On 2012-12-20 15:58:12 +, Greg Stark wrote:
 On Thu, Dec 20, 2012 at 3:18 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Greg Stark st...@mit.edu writes:
  But I'm not entirely convinced any of this is actually useful. Just
  because the transition table is large doesn't mean it's inefficient.
 
  That's a fair point.  However, I've often noticed base_yyparse() showing
  up rather high on profiles --- higher than seemed plausible at the time,
  given that its state-machine implementation is pretty tight.  Now I'm
  wondering whether that isn't coming from cache stalls from trying to
  touch all the requisite parts of the transition table.

 For what it's worth the bloat isn't in the parser transition table at all:
 516280 yy_transition
 147208 yytable
 147208 yycheck
 146975 base_yyparse
 17468 yypact
 9571 core_yylex
 8734 yystos
 8734 yydefact

 Unless I'm confused yy_transition is in fact the *lexer* transition
 table. I'm not sure how to reconcile that with the profiling results
 showing the cache misses in base_yyparse() though.

The scanner is compiled together with the parser, so it's not unlikely
that the compiler bunches them together, making only base_yyparse visible
in the profile.
I quickly checked though, and the top three offenders for L1 caches are
in bison, not in the lexer... It's not implausible that the state
transitions in the lexer are better cached and/or predicted...

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Fujii Masao
On Thu, Dec 20, 2012 at 8:41 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 20.12.2012 12:08, Amit Kapila wrote:

 On Wednesday, December 19, 2012 9:30 PM Heikki Linnakangas wrote:

 In both checkpointer.c and bgwriter.c, we do this before entering the
 main loop:

  	/*
  	 * Use the recovery target timeline ID during recovery
  	 */
  	if (RecoveryInProgress())
  		ThisTimeLineID = GetRecoveryTargetTLI();

 That seems reasonable. However, since it's only done once, when the
 process starts up, ThisTimeLineID is never updated in those processes,
 even though the startup process changes recovery target timeline.

  That actually seems harmless to me, and I also haven't heard of any
  complaints of misbehavior in 9.1 or 9.2 caused by that. I'm not sure why
  we bother to set ThisTimeLineID in those processes in the first place.


  This is used in RemoveOldXlogFiles(), so if during recovery it's not set
  and this function gets called, it might have some problem.
 I think it could get called from CreateRestartPoint() during recovery.


 Hmm, right, it's used for this:

 XLogFileName(lastoff, ThisTimeLineID, segno);

 The resulting lastoff string, which is an xlog filename like
 000000020000000000000008, is used to compare filenames of files in
 pg_xlog. However, the tli part, the first 8 characters, is ignored for
 comparison purposes. In addition to that, 'lastoff' is printed in a DEBUG2
 line, purely for debugging purposes.

InstallXLogFileSegment() also uses ThisTimeLineID. But your recent commit
doesn't take care of it and prevents the standby from recycling the WAL files
properly. Specifically, the standby recycles the WAL file to the wrong name.

Regards,

-- 
Fujii Masao




Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 PreallocXlogFiles() should have been put to the sword long ago. It's a
 performance tweak aimed at people without a performance problem in
 this area.

This claim seems remarkably lacking in any supporting evidence.

I'll freely grant that PreallocXlogFiles could stand to be improved
(by which I mean made more aggressive, ie willing to create files more
often and/or further in advance).  I don't see how it follows that it's
okay to remove the functionality altogether.  To the extent that we can
make that activity happen in checkpointer rather than in foreground
processes, it's surely a good thing.

regards, tom lane




Re: [HACKERS] operator dependency of commutator and negator, redux

2012-12-20 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Dec 20, 2012 at 10:52 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I was thinking a NOTICE at most.  If it's a WARNING then restoring
 perfectly valid pg_dump files will result in lots of scary-looking
 chatter.  You could make an argument for printing nothing at all,
 but that would probably mislead people who'd fat-fingered their
 COMMUTATOR entries.

 What about jiggering the dump so that only the second of the two
 operators to be dumped includes the COMMUTATOR clause?

Seems messy and fragile.  In particular this'd represent a lot of work
in order to make it more likely that the restore malfunctions if someone
makes use of pg_restore's -l switch to reorder the entries.  Also it
would not retroactively fix the problem for restoring dumps made with
existing pg_dump versions.

 Even a NOTICE in
 pg_dump seems like too much chatter (witness recent quieting of some
 other NOTICE messages we've all grown tired of)

pg_dump has included set client_min_messages = warning in its output
for quite some time now.  So as long as we don't insist on making this
a WARNING, people won't see it in that usage.

regards, tom lane




[HACKERS] ALTER .. OWNER TO error mislabels schema as other object type

2012-12-20 Thread Robert Haas
This looks busted:

rhaas=# create role clerks;
CREATE ROLE
rhaas=# create role bob in role clerks;
CREATE ROLE
rhaas=# create schema foo;
CREATE SCHEMA
rhaas=# grant usage on schema foo to bob, clerks;
GRANT
rhaas=# create aggregate
foo.sum(basetype=text,sfunc=textcat,stype=text,initcond='');
CREATE AGGREGATE
rhaas=# alter aggregate foo.sum(text) owner to bob;
ALTER AGGREGATE
rhaas=# set role bob;
SET
rhaas= alter aggregate foo.sum(text) owner to clerks;
ERROR:  permission denied for function foo

Eh?  There's no function called foo.  There's a schema called foo,
which seems to be the real problem: clerks needs to have CREATE on foo
in order for bob to complete the rename.  But somehow the error
message is confused about what type of object it's dealing with.
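
For what it's worth, the underlying permission requirement can be
satisfied in the obvious way:

	GRANT CREATE ON SCHEMA foo TO clerks;

after which the ownership change should go through; the complaint here is
only about the misworded error message.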

[ Credit: The above example is adapted from an EDB-internal regression
test, the failure of which was what alerted me to this problem. ]

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Enabling Checksums

2012-12-20 Thread Martijn van Oosterhout
On Tue, Dec 18, 2012 at 04:06:02AM -0500, Greg Smith wrote:
 On 12/18/12 3:17 AM, Simon Riggs wrote:
 Clearly part of the response could involve pg_dump on the damaged
 structure, at some point.
 
 This is the main thing I wanted to try out more, once I have a
 decent corruption generation tool.  If you've corrupted a single
 record but can still pg_dump the remainder, that seems the best we
 can do to help people recover from that.  Providing some
 documentation on how to figure out what rows are in that block,
 presumably by using the contrib inspection tools, would be helpful
 too.

FWIW, Postgres is pretty resilient against corruption. I've maintained
a postgres db on a server with bad memory (don't ask) and since most
scrambling was in text strings, you just got funny output sometimes. The
most common failure was a memory allocation failure as postgres tried
to copy a datum whose length field was corrupted.

If things went really wonky you could identify the bad tuples by hand
and then delete them by ctid. Regular reindexing helped too.
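
The by-hand cleanup looks something like this (a sketch; the table name
and ctid are placeholders for whatever your inspection turns up):

	SELECT ctid, * FROM damaged_table;                  -- eyeball the bad row(s)
	DELETE FROM damaged_table WHERE ctid = '(1234,5)';  -- surgically remove one
	REINDEX TABLE damaged_table;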

All I'm saying is that a mode where you log a warning but proceed
anyway is useful.  It won't pin down the exact error, but it will tell
you where to look and help find the non-obvious corruption (so you can
possibly fix it by hand).

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Simon Riggs
On 20 December 2012 13:19, Amit Kapila amit.kap...@huawei.com wrote:

 True, it might not have any functional effect in RemoveOldXlogFiles().
 However it can be used in PreallocXlogFiles()->XLogFileInit() as well.

Which is never called in recovery because we never write WAL.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ALTER .. OWNER TO error mislabels schema as other object type

2012-12-20 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 This looks busted:

Between this and your previous example, it's becoming clear that the
recent refactorings of the ALTER code were not ready for prime time.
Perhaps we should just revert those instead of playing bug whack-a-mole.

regards, tom lane




Re: [HACKERS] Switching timeline over streaming replication

2012-12-20 Thread Fujii Masao
On Sat, Dec 15, 2012 at 9:36 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sat, Dec 8, 2012 at 12:51 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 On 06.12.2012 15:39, Amit Kapila wrote:

 On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:

 On 05.12.2012 14:32, Amit Kapila wrote:

 On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:

 After some diversions to fix bugs and refactor existing code, I've
 committed a couple of small parts of this patch, which just add some
 sanity checks to notice incorrect PITR scenarios. Here's a new
 version of the main patch based on current HEAD.


 After testing with the new patch, the following problems are observed.

 Defect - 1:

   1. start primary A
   2. start standby B following A
   3. start cascade standby C following B.
   4. start another standby D following C.
   5. Promote standby B.
   6. After successful timeline switch in cascade standby C & D,
      stop D.

   7. Restart D, Startup is successful and connecting to standby C.
   8. Stop C.
   9. Restart C, startup is failing.


 Ok, the error I get in that scenario is:

  C 2012-12-05 19:55:43.840 EET 9283 FATAL:  requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1
  C 2012-12-05 19:55:43.841 EET 9282 LOG:  startup process (PID 9283) exited with exit code 1
  C 2012-12-05 19:55:43.841 EET 9282 LOG:  aborting startup due to startup process failure



 That mismatch causes the error. I'd like to fix this by always treating
 the checkpoint record to be part of the new timeline. That feels more
 correct. The most straightforward way to implement that would be to peek
 at the xlog record before updating replayEndRecPtr and replayEndTLI. If
 it's a checkpoint record that changes TLI, set replayEndTLI to the new
 timeline before calling the redo-function. But it's a bit of a
 modularity violation to peek into the record like that.

 Or we could just revert the sanity check at beginning of recovery that
 throws the requested timeline 2 does not contain minimum recovery point
 0/3023F08 on timeline 1 error. The error I added to redo of checkpoint
 record that says unexpected timeline ID %u in checkpoint record, before
 reaching minimum recovery point %X/%X on timeline %u checks basically
 the same thing, but at a later stage. However, the way
 minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
 fix that.

 I'm thinking of something like the attached (with some more comments
 before committing). Thoughts?


  This has fixed the problem reported.
  However, I am not able to see whether there will be any problem if we
  remove the check requested timeline 2 does not contain minimum recovery
  point 0/3023F08 on timeline 1 at beginning of recovery and just update
  replayEndTLI with ThisTimeLineID?


 Well, it seems wrong for the control file to contain a situation like this:

 pg_control version number:932
 Catalog version number:   201211281
 Database system identifier:   5819228770976387006
 Database cluster state:   shut down in recovery
 pg_control last modified: pe  7. joulukuuta 2012 17.39.57
 Latest checkpoint location:   0/3023EA8
 Prior checkpoint location:0/260
 Latest checkpoint's REDO location:0/3023EA8
 Latest checkpoint's REDO WAL file:00020003
 Latest checkpoint's TimeLineID:   2
 ...
 Time of latest checkpoint:pe  7. joulukuuta 2012 17.39.49
 Min recovery ending location: 0/3023F08
 Min recovery ending loc's timeline:   1

 Note the latest checkpoint location and its TimelineID, and compare them
 with the min recovery ending location. The min recovery ending location is
 ahead of latest checkpoint's location; the min recovery ending location
 actually points to the end of the checkpoint record. But how come the min
 recovery ending location's timeline is 1, while the checkpoint record's
 timeline is 2.

  Now maybe that would happen to work if we remove the sanity check, but it still
 seems horribly confusing. I'm afraid that discrepancy will come back to
 haunt us later if we leave it like that. So I'd like to fix that.

 Mulling over this for some more, I propose the attached patch. With the
 patch, we peek into the checkpoint record, and actually perform the timeline
 switch (by changing ThisTimeLineID) before replaying it. That way the
 checkpoint record is really considered to be on the new timeline for all
 purposes. At the moment, the only difference that makes in practice is that
 we set replayEndTLI, and thus minRecoveryPointTLI, to the new TLI, but it
 feels logically more correct to do it that way.

 This patch has already been included in HEAD. Right?

 I found another requested timeline does not contain minimum recovery point
 error scenario in HEAD:

 1. Set up the master 'M', one standby 'S1', and one cascade standby 'S2'.
 2. Shutdown the master 'M' 

Re: [HACKERS] Set visibility map bit after HOT prune

2012-12-20 Thread Pavan Deolasee
On Thu, Dec 20, 2012 at 9:23 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Dec 19, 2012 at 11:12 PM, Pavan Deolasee


 I'm very reluctant to suggest that we can solve
 this my setting aside another page-level bit to track visibility of
 tuples for heapscans. Or even have a bit in the tuple header itself to
 track this information at that level to avoid repeated visibility
 check for a tuple which is known to be visible to all current and
 future transactions.

 This has been suggested before, as an alternative to freezing tuples.
 It seems to have some potential although I think more thought and work
 is needed to figure out exactly where to go with it.


Ok. Will try to read the archives to see what was actually suggested and
why it was put on the back burner. BTW, at the risk of being shot down
again, I wonder if we can push down the freeze operation to HOT prune
as well. A single WAL record can then record this action too. Also,
it saves us from repeated checks of transaction status flags in
heap_freeze_tuple(). Of course, we do all this only if HOT prune has
work of its own and we just try to piggyback.

I wonder if we should add a flag to heap_page_prune and try to do some
additional work if it's being called from lazy vacuum, such as setting
the VM bit and the tuple freeze. IIRC I had put something like that in
the early patches, but then ripped it out for simplicity. Maybe it's time
to play with that again.

In fact, I'd also suggested ripping out the line pointer scan in lazy
vacuum since it's preceded by a HOT prune which does the bulk of the work
anyway. I remember Tom taking objection to that, but can't remember
why. Will try to read up on the old thread again.

  And we expect vacuums to be few or none. AFAIR in pgbench, it
  now takes hours for the accounts table to get chosen for vacuum and we
 should be happy about it. But IOS are almost impossible for pgbench
 kind of workloads today because of our aggressive strategy to clear
 the VM bits.

 IMHO, it's probably fairly hopeless to make a pure pgbench workload
 show a benefit from index-only scans.  A large table under a very
 heavy, completely random write workload is just about the worst
 possible case for index-only scans.  Index-only scans are a way of
 avoiding unnecessary visibility checks when the target data hasn't
 changed recently, not a magic bullet to escape all heap access.  If
 the target data has changed, you're going to have to touch the heap.

Not always. Not clearing the VM bit at HOT update is one such idea we
discussed. Of course, there are open issues with that, but they are
not unsolvable. The advantage of not touching the heap is just too big to
ignore.

 And while I agree that we aren't aggressive enough in setting the VM
 bits right now, I also think it wouldn't be too hard to go too far in
 the opposite direction: we could easily spend more effort trying to
 make index-only scans effective than we could ever hope to recoup from
 the scans themselves.


I agree. I also started having that worry. We are at one extreme right
now and it might not help to swing to the other extreme. Looks like I'm
coming around to the idea of somehow detecting if the scan is happening on
the result relation of a ModifyTable and avoiding setting the VM bit in that
case.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee




Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Simon Riggs
On 20 December 2012 16:21, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 PreallocXlogFiles() should have been put to the sword long ago. It's a
 performance tweak aimed at people without a performance problem in
 this area.

 This claim seems remarkably lacking in any supporting evidence.

 I'll freely grant that PreallocXlogFiles could stand to be improved
 (by which I mean made more aggressive, ie willing to create files more
 often and/or further in advance).  I don't see how it follows that it's
 okay to remove the functionality altogether.  To the extent that we can
 make that activity happen in checkpointer rather than in foreground
 processes, it's surely a good thing.

More aggressive implies it is currently in some way aggressive.

Removing it will make as little difference as keeping it, so let it stay.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Fujii Masao
On Fri, Dec 21, 2012 at 1:46 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 20 December 2012 13:19, Amit Kapila amit.kap...@huawei.com wrote:

 True, it might not have any functional effect in RemoveOldXlogFiles().
 However it can be used in PreallocXlogFiles()->XLogFileInit() as well.

 Which is never called in recovery because we never write WAL.

No. CreateRestartPoint() calls PreallocXlogFiles(). Walreceiver may
write WAL, so PreallocXlogFiles() would be useful even during recovery
to some extent.

Regards,

-- 
Fujii Masao




Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Andres Freund
On 2012-12-20 16:46:21 +, Simon Riggs wrote:
 On 20 December 2012 13:19, Amit Kapila amit.kap...@huawei.com wrote:

  True, it might not have any functional effect in RemoveOldXlogFiles().
  However it can be used in PreallocXlogFiles()->XLogFileInit() as well.

 Which is never called in recovery because we never write WAL.

It's called from CreateRestartPoint. And the pre-inited files get used
by the walreceiver, and I would guess that's beneficial at least for sync
rep...

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] FDW: ForeignPlan and parameterized paths

2012-12-20 Thread Tom Lane
Ronan Dunklau rdunk...@gmail.com writes:
 I intentionally did the nestloop_params substitution after calling
 GetForeignPlan not before.  It's not apparent to me why it would be
 useful to do it before, because the FDW is going to have no idea what
 those params represent.  (Note that they represent values coming from
 some other, probably local, relation; not from the foreign table.)

  Even if the FDW has no idea what they represent, it can identify a
  clause of the form Var Operator Param, which allows it to store the param
  reference (paramid) for retrieving the param value at execution time.

I don't see any plausible reason for an FDW to special-case nestloop
params like that.  What an FDW should be looking for is clauses of the
form Var-of-foreign-table Operator Expression-not-involving-foreign-table,
and a Param is just one case of Expression-not-involving-foreign-table.
(Compare the handling of indexscan clauses: indxpath.c doesn't much care
what's on the righthand side of an indexable clause, so long as there
is no Var of the indexed table there.)

Moreover, in order to do effective parameterized-path creation in the
first place, the FDW's GetForeignPaths function will already have had
to recognize these same clauses in their original form.  If we do the
param substitution before calling GetForeignPlan, that will just mean
that the two functions can't share code anymore.

Or in short: the fact that the righthand-side expression gets replaced
(perhaps only partially) by a Param is an implementation detail of the
executor's expression evaluation methods.  The FDW shouldn't care about
that, only about the result of the expression.
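
For illustration, a minimal sketch (hypothetical helper, not from any
shipped FDW) of the kind of test being described: accept a binary OpExpr
whose left side is a Var of the foreign table and whose right side never
references that table, regardless of whether it later becomes a Param:

    #include "postgres.h"
    #include "nodes/relation.h"
    #include "optimizer/var.h"

    static bool
    is_shippable_binary_clause(RelOptInfo *baserel, Expr *clause)
    {
        OpExpr     *op;
        Node       *left;
        Node       *right;

        if (!IsA(clause, OpExpr))
            return false;
        op = (OpExpr *) clause;
        if (list_length(op->args) != 2)
            return false;

        left = (Node *) linitial(op->args);
        right = (Node *) lsecond(op->args);

        /* left side must be a Var belonging to the foreign table */
        if (!IsA(left, Var) ||
            !bms_is_member(((Var *) left)->varno, baserel->relids))
            return false;

        /* right side must not reference the foreign table at all */
        if (bms_overlap(pull_varnos(right), baserel->relids))
            return false;

        return true;
    }

A real FDW would also handle the commuted form (expression OP Var) and
check that the operator itself is safe to evaluate remotely.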

regards, tom lane




Re: [HACKERS] ALTER .. OWNER TO error mislabels schema as other object type

2012-12-20 Thread Robert Haas
On Thu, Dec 20, 2012 at 11:46 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 This looks busted:

 Between this and your previous example, it's becoming clear that the
 recent refactorings of the ALTER code were not ready for prime time.
 Perhaps we should just revert those instead of playing bug whack-a-mole.

Well, as yet, I have no clear evidence that there is any problem with
anything other than the error messages.  It seems like overkill to
revert the whole thing just for that.  Not to say that there might not
be further issues, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Set visibility map bit after HOT prune

2012-12-20 Thread Tom Lane
Pavan Deolasee pavan.deola...@gmail.com writes:
 Ok. Will try to read archives to see what was actually suggested and
 why it was put on back burner. BTW at the risk of being shot down
 again, I wonder if we can push down the freeze operation to HOT prune
 also.

Seems unlikely to be a win.  We only care about freezing tuples in the
context of being able to advance a relation-wide relfrozenxid horizon.
Freezing pages retail accomplishes nothing whatsoever towards that goal,
unless you have some way to know that no new freeze work will be needed
on the page before the next vacuum freeze happens.  Otherwise, you're
just moving portions of the work from background vacuuming into
foreground processes, with no benefit gained thereby.  In fact, you
might well be *creating* work that would otherwise not have had to be
done at all --- the tuple might get deleted before the next freeze
happens.

regards, tom lane




Re: [HACKERS] Set visibility map bit after HOT prune

2012-12-20 Thread Robert Haas
On Thu, Dec 20, 2012 at 11:49 AM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 I wonder if we should add a flag to heap_page_prune and try to do some
 additional work if it's being called from lazy vacuum, such as setting
 the VM bit and the tuple freeze. IIRC I had put something like that in
 the early patches, but then ripped it out for simplicity. Maybe it's time
 to play with that again.

That seems unlikely to be a good trade-off.  If VACUUM is going to do
extra stuff, it's better to have that in the vacuum-specific code,
rather than in code that is also traversed from other places.
Otherwise the conditional logic might impose a penalty on people who
aren't taking those branches.

 IMHO, it's probably fairly hopeless to make a pure pgbench workload
 show a benefit from index-only scans.  A large table under a very
 heavy, completely random write workload is just about the worst
 possible case for index-only scans.  Index-only scans are a way of
 avoiding unnecessary visibility checks when the target data hasn't
 changed recently, not a magic bullet to escape all heap access.  If
 the target data has changed, you're going to have to touch the heap.

 Not always. Not clearing the VM bit at HOT update is one such idea we
 discussed. Of course, there are open issues with that, but they are
 not unsolvable. The advantage of not touching heap is just too big to
 ignore.

I don't really agree.  Sure, not touching the heap is nice, but mostly
because you avoid pulling pages into shared_buffers that aren't
otherwise needed.  IIRC, an index-only scan isn't faster than an index
scan if all the necessary table and index pages are already cached.
Touching already-resident pages just isn't that expensive.  And of
course, if a page has recently suffered an insert, update, or delete,
it is more likely to be resident.  You can construct access patterns
where this isn't so - e.g. update the page, wait for it to get paged
out, and then SELECT from it with an index-only scan, wait for it to
get paged out again, etc. - but I'm not sure how much of a problem
that is in the real world.

 And while I agree that we aren't aggressive enough in setting the VM
 bits right now, I also think it wouldn't be too hard to go too far in
 the opposite direction: we could easily spend more effort trying to
 make index-only scans effective than we could ever hope to recoup from
 the scans themselves.

 I agree. I also started having that worry. We are at one extreme right
 now and it might not help to get to the other extreme. Looks like I'm
 coming along the idea of somehow detecting if the scan is happening on
 the result relation of a ModifyTable and avoid setting VM bit in that
 case.

It's unclear to me that that's the right way to slice it.  There are
several different sets of concerns here: (1) avoiding setting the
all-visible bit when it'll be cleared again just after, (2) avoiding
slowing down SELECT with HOT-pruning, and (3) avoiding slowing down
repeated SELECTs by NOT having the first one do HOT-pruning.  And
maybe others.  The right thing to do depends on which problems you
think are relatively more important.  That question might not even
have one right answer, but even if it does we don't have consensus on
what it is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Heikki Linnakangas

On 20.12.2012 18:19, Fujii Masao wrote:

InstallXLogFileSegment() also uses ThisTimeLineID. But your recent commit
doesn't take care of it and prevents the standby from recycling the WAL files
properly. Specifically, the standby recycles the WAL file to wrong name.


A-ha, good catch. So that's actually a live bug in 9.1 and 9.2 as well: 
after the recovery target timeline has changed, restartpoints will 
continue to preallocate/recycle WAL files for the old timeline. That's 
otherwise harmless, but the useless WAL files waste space, and 
walreceiver will have to always create new files.


So instead of always running with ThisTimeLineID = 0 in the checkpointer 
process, I guess we'll have to update it to the timeline being replayed, 
when creating a restartpoint.


- Heikki




Re: [HACKERS] Set visibility map bit after HOT prune

2012-12-20 Thread Pavan Deolasee
On Thu, Dec 20, 2012 at 10:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Pavan Deolasee pavan.deola...@gmail.com writes:
 Ok. Will try to read archives to see what was actually suggested and
 why it was put on back burner. BTW at the risk of being shot down
 again, I wonder if we can push down the freeze operation to HOT prune
 also.

 Seems unlikely to be a win.  We only care about freezing tuples in the
 context of being able to advance a relation-wide relfrozenxid horizon.
 Freezing pages retail accomplishes nothing whatsoever towards that goal,
 unless you have some way to know that no new freeze work will be needed
 on the page before the next vacuum freeze happens. Otherwise, you're
 just moving portions of the work from background vacuuming into
 foreground processes, with no benefit gained thereby.

If we can establish an invariant that an all-visible page is always
fully frozen, then vacuum freeze does not need to look at those pages
again. Another advantage is that we are already holding the right lock
and can piggyback the freeze on the cleanup WAL-logging, thus avoiding
re-dirtying of the page and additional WAL logging.

  In fact, you
 might well be *creating* work that would otherwise not have had to be
 done at all --- the tuple might get deleted before the next freeze
 happens.


Yeah, there will be cases where it might not add any value, or might
even add a little overhead. I don't know what will serve better on
average or for a majority of workloads, though. Vacuum freeze has been
known to add sudden and unexpected load on the system, so I thought this
might mitigate that to a certain extent.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee




Re: [HACKERS] Set visibility map bit after HOT prune

2012-12-20 Thread Pavan Deolasee
On Thu, Dec 20, 2012 at 10:59 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Dec 20, 2012 at 11:49 AM, Pavan Deolasee
 pavan.deola...@gmail.com wrote:
 I wonder if we should add a flag to heap_page_prune and try to do some
 additional work if it's being called from lazy vacuum, such as setting
 the VM bit and the tuple freeze. IIRC I had put something like that in
 the early patches, but then ripped it out for simplicity. Maybe it's time
 to play with that again.

 That seems unlikely to be a good trade-off.  If VACUUM is going to do
 extra stuff, it's better to have that in the vacuum-specific code,
 rather than in code that is also traversed from other places.
 Otherwise the conditional logic might impose a penalty on people who
 aren't taking those branches.


That's a call we need to take between code duplication and customised
execution. We do that all over the code. Not sure if it will be any
different here.


 It's unclear to me that that's the right way to slice it.  There are
 several different sets of concerns here: (1) avoiding setting the
 all-visible bit when it'll be cleared again just after, (2) avoiding
 slowing down SELECT with HOT-pruning, and (3) avoiding slowing down
 repeated SELECTs by NOT having the first one do HOT-pruning.  And
 maybe others.  The right thing to do depends on which problems you
 think are relatively more important.  That question might not even
 have one right answer, but even if it does we don't have consensus on
 what it is.

Hmm. We tossed around and discussed many interesting ideas in this
thread. It will be sad if none of them go anywhere. When I look at the
archives, I see we might have discussed some of these even in the past
but never got an agreement, because there is always a workload which may
not be served well by any specific idea. And many a time they are so
interrelated that we either have to do all or none. Unfortunately,
trying to do all is too much and too invasive most often.

Maybe what we need is an official experimental branch where such ideas
can be checked in, encouraging people to try out those branches in
their real-world tests or set up dedicated benchmark machines to run
regular tests. Tested and proven ideas can then be merged into the
main trunk. That will be the only way to know the efficacy of such ideas.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee




Re: [HACKERS] pg_top

2012-12-20 Thread John R Pierce

On 12/20/2012 4:17 AM, Brett Maton wrote:


It appears that procpid has been renamed to pid at some point, also 
the column current_query appears to have been shortened to query.


My patch updates a couple of queries to use the new shorter column names.




IMHO, any such fix should check the version, and use the old name for
< 9.2, and the new one for >= 9.2.
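
For what it's worth, a minimal sketch of that version check using libpq
(the helper name and column choices are illustrative, not pg_top's actual
code):

    #include <libpq-fe.h>

    /* Pick pg_stat_activity column names based on the server version.
     * PQserverVersion() returns e.g. 90200 for a 9.2.0 server. */
    static const char *
    activity_query_columns(PGconn *conn)
    {
        if (PQserverVersion(conn) >= 90200)
            return "pid, state, query";         /* 9.2 and later */
        else
            return "procpid, current_query";    /* pre-9.2; idle shows
                                                 * up as '<IDLE>' here */
    }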


/me tossed another mumbled curse at whoever changed that field name.




Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 - 9.1

2012-12-20 Thread Bruce Momjian
On Thu, Dec 20, 2012 at 11:08:58AM -0500, Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  As you can see, ALTER INDEX renamed both the pg_constraint and pg_class
  names.  Is it possible someone manually updated the system table to
  rename this primary key?  That would cause this error message.  The fix
  is to just to make sure they match.
 
  Does pg_upgrade need to be modified to handle this case?  Are there
  legitimate cases where they will not match and the index name will not
  be preserved through a dump/restore?  This seems safe:
 
 It's not always been true that ALTER INDEX would try to rename
 constraints to keep 'em in sync.  A quick check says that only 8.3 and
 later do that.  I'm not sure though how a 9.0 database could get into
 such a state without manual catalog hacking, since as you say a dump and
 reload should have ended up with the index and constraint having the
 same name again.
 
 I'd be inclined not to worry about this in pg_upgrade, at least not till
 we see a plausible scenario for the situation to arise without manual
 catalog changes.

Agreed.  I added a C comment so we don't forget about the matching
requirement.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Set visibility map bit after HOT prune

2012-12-20 Thread Tom Lane
Pavan Deolasee pavan.deola...@gmail.com writes:
 On Thu, Dec 20, 2012 at 10:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Seems unlikely to be a win.  We only care about freezing tuples in the
 context of being able to advance a relation-wide relfrozenxid horizon.
 Freezing pages retail accomplishes nothing whatsoever towards that goal,
 unless you have some way to know that no new freeze work will be needed
 on the page before the next vacuum freeze happens. Otherwise, you're
 just moving portions of the work from background vacuuming into
 foreground processes, with no benefit gained thereby.

 If we can establish an invariant that an all-visible page is always
 fully frozen, then vacuum freeze does not need to look at those pages
 again.

We're not going to do that, because it would require freezing tuples
immediately after they fall below the RecentGlobalXmin horizon.  This
would be a significant loss of capability from a forensic standpoint,
not to mention breaking existing applications that look at xmin to
determine whether a tuple has recently been updated.  Besides which,
I think it would result in a large increase in the WAL volume emitted
by prune operations (hint bit setting doesn't require WAL, unlike
freezing).  I don't believe for a minute your argument that it would
result in a net reduction in WAL.

regards, tom lane




Re: [HACKERS] Set visibility map bit after HOT prune

2012-12-20 Thread Robert Haas
On Thu, Dec 20, 2012 at 1:30 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Pavan Deolasee pavan.deola...@gmail.com writes:
 On Thu, Dec 20, 2012 at 10:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Seems unlikely to be a win.  We only care about freezing tuples in the
 context of being able to advance a relation-wide relfrozenxid horizon.
 Freezing pages retail accomplishes nothing whatsoever towards that goal,
 unless you have some way to know that no new freeze work will be needed
 on the page before the next vacuum freeze happens. Otherwise, you're
 just moving portions of the work from background vacuuming into
 foreground processes, with no benefit gained thereby.

 If we can establish an invariant that an all-visible page is always
 fully frozen, then vacuum freeze does not need to look at those pages
 again.

 We're not going to do that, because it would require freezing tuples
 immediately after they fall below the RecentGlobalXmin horizon.  This
 would be a significant loss of capability from a forensic standpoint,
 not to mention breaking existing applications that look at xmin to
 determine whether a tuple has recently been updated.  Besides which,
 I think it would result in a large increase in the WAL volume emitted
 by prune operations (hint bit setting doesn't require WAL, unlike
 freezing).  I don't believe for a minute your argument that it would
 result in a net reduction in WAL.

I don't think the above makes sense, because making a page all-visible
already requires emitting a WAL record.  Pavan didn't say "freeze the
page every time we set a hint bit"; he said "freeze the page every
time it gets marked all-visible".  And that's already WAL-logged.

Now, there is a downside: right now, we play a tricky little game
where we emit a WAL record for setting the visibility map bit, but we
don't actually set the LSN of the heap page.  It's OK because it's
harmless if the PD_ALL_VISIBLE bit makes it to disk and the
visibility-map doesn't, and also because the PD_ALL_VISIBLE bit can be
set without relying on the previous page contents.  But doing anything
more complicated with the same WAL record, like freezing, is likely to
require setting the LSN on the heap page.  And that will result in a
huge increase in WAL traffic when vacuuming an insert-only table.
Whee, crash recovery is fun.

With respect to the forensic problem, we've previously discussed
setting a HEAP_XMIN_FROZEN bit in the tuple header rather than
overwriting the xmin with FrozenXID.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GENERAL] trouble with pg_upgrade 9.0 - 9.1

2012-12-20 Thread Bruce Momjian

Applied.  We can mark this report closed.  Groshev, let us know if you
have any further problems.

---

On Thu, Dec 20, 2012 at 07:19:48AM -0500, Bruce Momjian wrote:
 On Wed, Dec 19, 2012 at 10:19:30PM -0500, Bruce Momjian wrote:
  On Wed, Dec 19, 2012 at 12:56:05PM -0500, Kevin Grittner wrote:
   Groshev Andrey wrote:
   
   Mismatch of relation names: database database, old rel 
 public.lob.ВерсияВнешнегоДокумента$Документ_pkey, new rel 
 public.plob.ВерсияВнешнегоДокумента$Документ
   
   There is a limit on identifiers of 63 *bytes* (not characters)
   after which the name is truncated. In UTF8 encoding, the underscore
   would be in the 64th position.
  
  OK, Kevin is certainly pointing out a bug in the pg_upgrade code, though
  I am unclear how it would exhibit the mismatch error reported.
  
  pg_upgrade uses NAMEDATALEN for database, schema, and relation name
  storage lengths.  While NAMEDATALEN works fine in the backend, it is
  possible that a frontend client, like pg_upgrade, could retrieve a name
  in the client encoding whose length exceeds NAMEDATALEN if the client
  encoding did not match the database encoding (or is it the cluster
  encoding for system tables).  This would cause truncation of these
  values.  The truncation would not cause crashes, but might cause
  failures by not being able to connect to overly-long database names, and
  it weakens the checking of relation/schema names --- the same check that
  is reported above.
  
  (I believe initdb.c also erroneously uses NAMEDATALEN.)
 
 I have developed the attached patch to pg_strdup() the string returned
 from libpq, rather than use a fixed NAMEDATALEN buffer to store the
 value.  I am only going to apply this to 9.3 because I can't see this
 causing problems except for weaker comparisons for very long identifiers
 where the client encoding is longer than the server encoding, and
 failures for very long database names, none of which we have gotten bug
 reports about.
 
 Turns out initdb.c was fine because it expects collation names to
 be only in ASCII;   I added a comment to that effect.
 
 -- 
   Bruce Momjian  br...@momjian.ushttp://momjian.us
   EnterpriseDB http://enterprisedb.com
 
   + It's impossible for everything to be true. +

 diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c
 new file mode 100644
 index 2250442..5cb9b61
 *** a/contrib/pg_upgrade/info.c
 --- b/contrib/pg_upgrade/info.c
 *** static void get_db_infos(ClusterInfo *cl
 *** 23,29 ****
   static void get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo);
   static void free_rel_infos(RelInfoArr *rel_arr);
   static void print_db_infos(DbInfoArr *dbinfo);
 ! static void print_rel_infos(RelInfoArr *arr);
   
   
   /*
 --- 23,29 ----
   static void get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo);
   static void free_rel_infos(RelInfoArr *rel_arr);
   static void print_db_infos(DbInfoArr *dbinfo);
 ! static void print_rel_infos(RelInfoArr *rel_arr);
   
   
   /*
 *** create_rel_filename_map(const char *old_
 *** 127,134 ****
       map->new_relfilenode = new_rel->relfilenode;
   
       /* used only for logging and error reporting, old/new are identical */
 !     snprintf(map->nspname, sizeof(map->nspname), "%s", old_rel->nspname);
 !     snprintf(map->relname, sizeof(map->relname), "%s", old_rel->relname);
   }
   
   
 --- 127,134 ----
       map->new_relfilenode = new_rel->relfilenode;
   
       /* used only for logging and error reporting, old/new are identical */
 !     map->nspname = old_rel->nspname;
 !     map->relname = old_rel->relname;
   }
   
   
 *** get_db_infos(ClusterInfo *cluster)
 *** 220,227 ****
       for (tupnum = 0; tupnum < ntups; tupnum++)
       {
           dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
 !         snprintf(dbinfos[tupnum].db_name, sizeof(dbinfos[tupnum].db_name), "%s",
 !                  PQgetvalue(res, tupnum, i_datname));
           snprintf(dbinfos[tupnum].db_tblspace, sizeof(dbinfos[tupnum].db_tblspace), "%s",
                    PQgetvalue(res, tupnum, i_spclocation));
       }
 --- 220,226 ----
       for (tupnum = 0; tupnum < ntups; tupnum++)
       {
           dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
 !         dbinfos[tupnum].db_name = pg_strdup(PQgetvalue(res, tupnum, i_datname));
           snprintf(dbinfos[tupnum].db_tblspace, sizeof(dbinfos[tupnum].db_tblspace), "%s",
                    PQgetvalue(res, tupnum, i_spclocation));
       }
 *** get_rel_infos(ClusterInfo *cluster, DbIn
 *** 346,355 ****
       curr->reloid = atooid(PQgetvalue(res, relnum, i_oid));
   
       nspname = PQgetvalue(res, relnum, i_nspname);
 !     strlcpy(curr->nspname, nspname, sizeof(curr->nspname));
   
       relname = 

Re: [HACKERS] Feature Request: pg_replication_master()

2012-12-20 Thread Bruce Momjian
On Wed, Dec 19, 2012 at 07:32:33PM -0500, Robert Haas wrote:
 On Wed, Dec 19, 2012 at 5:34 PM, Simon Riggs si...@2ndquadrant.com wrote:
  As ever, we spent much energy on debating backwards compatibility
  rather than just solving the problem it posed, which is fairly easy to
  solve.
 
 I'm still of the opinion (as were a lot of people on the previous
 thread, IIRC) that just making them GUCs and throwing backward
 compatibility under the bus is acceptable in this case.  Changes that
 break application code are anathema to me, because people can have a
 LOT of application code and updating it can be REALLY hard.  The same
 cannot be said about recovery.conf - you have at most one of those per
 standby, and if it needs to be changed in some way, you can do it with
 a very small Perl script.  Yes, third-party tools will need to be
 updated; that is surely a downside, but I think it might be a
 tolerable one in this case.

Agreed.  We have always had a more lax requirement for changing the
Postgres administration interface.  I am not saying to ignore backward
compatibility, but future admin interface clarity overrules backward
compatibility in most cases.  If people are really worried about this,
they can write a Perl script to convert from the old to new format.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Feature Request: pg_replication_master()

2012-12-20 Thread Bruce Momjian
On Wed, Dec 19, 2012 at 10:34:14PM +, Simon Riggs wrote:
 On 19 December 2012 22:19, Joshua Berkus j...@agliodbs.com wrote:
 
  It stalled because the patch author decided not to implement the
  request to detect recovery.conf in data directory, which allows
  backwards compatibility.
 
  Well, I don't think we had agreement on how important backwards 
  compatibility for recovery.conf was, particularly not on the whole 
  recovery.conf/recovery.done functionality and the weird formatting of 
  recovery.conf.
 
 As ever, we spent much energy on debating backwards compatibility
 rather than just solving the problem it posed, which is fairly easy to
 solve.

Let me also add that I am tired of having recovery.conf improvement
stalled by backward compatibility concerns.   At this point, let's just
trash recovery.conf backward compatibility and move on.  

And I don't want to hear complaints about tool breakage either.  These
are external tools, not shipped with community Postgres, and they will
just have to adjust.  I will be glad to beat all complainants into the
ground for the good of the community.  ;-)   We just can't operate like
this, and if we allowed these things to block us in the past, Postgres
would be a royal mess today!

At this point backward compatibility has paralyzed us from fixing a
recovery.conf API that everyone agrees is non-optimal, and this has
gone on for multiple major releases.  I don't care what we have to do,
just clean this up for 9.3!

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Review of Row Level Security

2012-12-20 Thread Stephen Frost
Kevin, all,

* Kevin Grittner (kgri...@mail.com) wrote:
 The more secure behavior is to allow entry of data which will not
 be visible to the person doing the entry.

wrt this- I'm inclined to agree with Kevin.  It's certainly common in
certain environments that you can write to a higher level than you can
read from.  Granting those writers access to read the data later would
be... difficult.

What we're really arguing about here, afaict, is what the default should
be.  In line with Kevin's comments and Tom's reading of the spec (along
with my own experience in these environments), I'd argue for the default
to allow writing rows you're not allowed to read.

It would certainly be ideal if we could support both options, on a
per-relation basis, when we release the overall feature.  It doesn't
feel like it'd be a lot of work to do that, but I've not been able to
follow this discussion until now.  Thankfully, I'm hopeful that I'm
going to have more time now to keep up with PG. :)

Thanks,

Stephen




Re: [HACKERS] Review of Row Level Security

2012-12-20 Thread Robert Haas
On Thu, Dec 20, 2012 at 4:35 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Not sure I understand you. You suggested it was a valid use case for a
 user to have only INSERT privilege and wish to bypass security checks.
 I agreed and suggested it could be special-cased.

That's not really what I intended to suggest.  I view checking an
inserted tuple and checking the new version of an updated tuple as of
a piece.  I would think we would check against the RLS quals in either
both of those situations or neither, not one without the other.

 * Applies to all commands should not be implemented via triggers.
 Complex, slow, unacceptable thing to force upon users. Doing that begs
 the question of why we would have the feature at all, since we already
 have triggers and barrier views.

I agree that it is questionable whether we need this feature given
that we already have security barrier views.  I don't agree that
performing security checks via triggers is unacceptably slow or
complex.  Rather, I would say it is flexible and can meet a variety of
needs, unlike this feature, which imposes much tighter constraints on
what you can and cannot check and in which situations.

 * the default for row security should be applies to all commands.
 Anything else may be useful in some cases, but is surprising to users
 and requires careful thought to determine if it is appropriate.

I (and several other people, it seems) do not agree.

 * How to handle asymmetric row security policies? KaiGai has already
 begun discussing problems caused by a security policy that differs
 between reads/writes, on his latest patch post. That needs further
 analysis to check that it actually makes sense to allow it, since it
 is more complex. It would be better to fully analyse that situation
 and post solutions, rather than simply argue it's OK. Kevin has made
 good arguments to show there could be value in such a setup; nobody
 has talked about banning it, but we do need analysis, suggested
 syntax/mechanisms and extensive documentation to explain it etc.

Frankly, in view of your comments above, I am starting to rethink
whether we want this at all.  I mean, if you've got security barrier
views, you can check the data being read.  If you've got triggers, you
can check the data being written.  So what's left?  There's something
notationally appealing about being able to apply a security policy to
a table rather than creating a separate view and telling people to use
the view in lieu of the table, but how much is that notational
convenience worth?  It has some value from the standpoint of
compatibility with other database products ... but probably not a
whole lot, since all the syntax we're inventing here is
PostgreSQL-specific anyway.  Your proposal to check both tuples read
and tuples written might add some value ... but unless there's an
as-yet-undemonstrated performance benefit, it is again mostly a
notational benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Feature Request: pg_replication_master()

2012-12-20 Thread Bruce Momjian
On Thu, Dec 20, 2012 at 02:29:49PM -0500, Bruce Momjian wrote:
 Let me also add that I am tired of having recovery.conf improvement
 stalled by backward compatibility concerns.   At this point, let's just
 trash recovery.conf backward compatibility and move on.  
 
 And I don't want to hear complaints about tool breakage either.  These
 are external tools, not shipped with community Postgres, and they will
 just have to adjust.  I will be glad to beat all complainants into the
 ground for the good of the community.  ;-)   We just can't operate like
 this, and if we allowed these things to block us in the past, Postgres
 would be a royal mess today!
 
 At this point backward compatibility has paralyzed us from fixing a
 recovery.conf API that everyone agrees is non-optimal, and this has
 gone on for multiple major releases.  I don't care what we have to do,
 just clean this up for 9.3!

Let me add two more things.  First, no matter how many people you
interact with who use Postgres, there are many more who you do not
interact with.   Please don't give excessive weight to those people you
know, or who use your tools, or who are your customers, and forget the
many more Postgres users who you will never know.  They are not using
your tools or scripts --- they just want Postgres to be simple to use.

Second, no matter how successful Postgres is now, there are many more
users in our future, and we have a responsibility to give them a
database that is as easy to configure as possible, without hampering
them with decisions to avoid disruption for current Postgres users.

I am not saying we should ignore current users, or our customers or our
tool users, but it is very clear to me that we have lost the proper
balance in this area.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Review of Row Level Security

2012-12-20 Thread Kohei KaiGai
2012/12/20 Stephen Frost sfr...@snowman.net:
 Kevin, all,

 * Kevin Grittner (kgri...@mail.com) wrote:
 The more secure behavior is to allow entry of data which will not
 be visible to the person doing the entry.

 wrt this- I'm inclined to agree with Kevin.  It's certainly common in
 certain environments that you can write to a higher level than you can
 read from.  Granting those writers access to read the data later would
 be... difficult.

 What we're really arguing about here, afaict, is what the default should
 be.  In line with Kevin's comments and Tom's reading of the spec (along
 with my own experience in these environments), I'd argue for the default
 to allow writing rows you're not allowed to read.

If the system ensures the writer's permission is always equivalent to or
more restrictive than the reader's permission, it also eliminates the
problem around asymmetric row-security policies between commands.
The problematic scenario was that updatable but invisible rows could be
exposed, for example via UPDATE with a RETURNING clause. If updatable
rows are always visible, it does not matter even if the user returns them.
Probably, we can implement it as ((row-security of select) AND (row-security
of update)), which always ensures the more restrictive row-security policy.

Thanks,

 It would certainly be ideal if we could support both options, on a
 per-relation basis, when we release the overall feature.  It doesn't
 feel like it'd be a lot of work to do that, but I've not been able to
 follow this discussion until now.  Thankfully, I'm hopeful that I'm
 going to have more time now to keep up with PG. :)

 Thanks,

 Stephen
-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] Review of Row Level Security

2012-12-20 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
  * Applies to all commands should not be implemented via triggers.
  Complex, slow, unacceptable thing to force upon users. Doing that begs
  the question of why we would have the feature at all, since we already
  have triggers and barrier views.

I would rather that neither requires writing custom triggers, and that
both are supported through this feature.

 I agree that it is questionable whether we need this feature given
 that we already have security barrier views.

This I don't agree with- the plan has long been to have PG-specific RLS
first and then to support SELinux capabilities on top of it.  We didn't
want to have SELinux-specific functionality that couldn't be achieved
without SELinux being involved, and I continue to agree with that.

There are many situations, environments, and individuals that would
view having to implement RLS through views and triggers as being
far-and-away too painful and error-prone to rely on.

Thanks,

Stephen




Re: [HACKERS] Review of Row Level Security

2012-12-20 Thread Kohei KaiGai
2012/12/20 Robert Haas robertmh...@gmail.com:
 On Thu, Dec 20, 2012 at 4:35 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Not sure I understand you. You suggested it was a valid use case for a
 user to have only INSERT privilege and wish to bypass security checks.
 I agreed and suggested it could be special-cased.

 That's not really what I intended to suggest.  I view checking an
 inserted tuple and checking the new version of an updated tuple as of
 a piece.  I would think we would check against the RLS quals in either
 both of those situations or neither, not one without the other.

 * Applies to all commands should not be implemented via triggers.
 Complex, slow, unacceptable thing to force upon users. Doing that begs
 the question of why we would have the feature at all, since we already
 have triggers and barrier views.

 I agree that it is questionable whether we need this feature given
 that we already have security barrier views.  I don't agree that
 performing security checks via triggers is unacceptably slow or
 complex.  Rather, I would say it is flexible and can meet a variety of
 needs, unlike this feature, which imposes much tighter constraints on
 what you can and cannot check and in which situations.

I'd like to ask Simon which point is more significant: the performance
penalty or the complex operations forced on users.
If the latter, FK constraints are a good example of automatically defined
triggers that apply their checks to inserted tuples and newer versions
of updated tuples.
Even though we need to consider how to handle row-security policies added
dynamically by an extension (e.g. sepgsql), I don't think we need
to force users to define triggers for each table with row security
as long as the system supports it.

 * the default for row security should be applies to all commands.
 Anything else may be useful in some cases, but is surprising to users
 and requires careful thought to determine if it is appropriate.

 I (and several other people, it seems) do not agree.

 * How to handle asymmetric row security policies? KaiGai has already
 begun discussing problems caused by a security policy that differs
 between reads/writes, on his latest patch post. That needs further
 analysis to check that it actually makes sense to allow it, since it
 is more complex. It would be better to fully analyse that situation
 and post solutions, rather than simply argue it's OK. Kevin has made
 good arguments to show there could be value in such a setup; nobody
 has talked about banning it, but we do need analysis, suggested
 syntax/mechanisms and extensive documentation to explain it etc.

 Frankly, in view of your comments above, I am starting to rethink
 whether we want this at all.  I mean, if you've got security barrier
 views, you can check the data being read.  If you've got triggers, you
 can check the data being written.  So what's left?

In some cases, it is not a reasonable choice to redefine the kind of
database object, or its name, away from what an existing application
assumes. That is why we need adaptive security features on regular
tables with minimal or no application changes.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] Feature Request: pg_replication_master()

2012-12-20 Thread Petr Jelinek
 Let me also add that I am tired of having recovery.conf improvement
 stalled by backward compatibility concerns.   At this point, let's just
 trash recovery.conf backward compatibility and move on.
 
 And I don't want to hear complaints about tool breakage either.  These
 are external tools, not shipped with community Postgres, and they will
 just have to adjust.  I will be glad to beat all complainants into the
 ground for the good of the community.  ;-)   We just can't operate like
 this, and if we allowed these things to block us in the past, Postgres
 would be a royal mess today!
 
 At this point backward compatibility has paralyzed us from fixing a
 recovery.conf API that everyone agrees is non-optimal, and this has
 gone on for multiple major releases.  I don't care what we have to do,
 just clean this up for 9.3!
 

+1

And the sooner we do it before release, the more time will those external
tools have to adjust.

Regards
Petr Jelinek





Re: [HACKERS] operator dependency of commutator and negator, redux

2012-12-20 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 a separate ALTER OPERATOR  COMMUTATOR  statement (or something of
 the sort) that pg_dump can emit as a separate item.  Even a NOTICE in

I like that capability, but it's not helping us in the backward
compatibility section where we will still read commutator declarations
as operator properties. And maintaining an extension with different
syntax for CREATE OPERATOR depending on the major version would be a
pain (it's the case already for create type by the way, painfully so).

So I think Tom's idea is better to fix the problem at hand.


About dropping the Operator Shell, we've been talking in the past about
adding more properties to our operators to allow for some more optimizer
tricks (reducing expressions to constants or straight variable
references at parse time, reducing joins, adding parametrized paths
etc).

I can think about associativity and a neutral element, but that's a
property of one operator only. Now there's the distributive property,
which holds between two different operators and maybe would
better be added as an ALTER OPERATOR statement rather than a possibly
forward reference when we come to that.

I'm not too sure about other concepts that we might want to tackle down
the road here; another angle would be support for parallelism,
where maybe operator properties could tell the planner how to spread a
complex WHERE clause or output column computation…


All in all, it looks to me like the current proposals on the table would
allow us to dispose of the Operator Shell idea entirely. If we ever need
it back, the ALTER OPERATOR trick looks like a better tool.


Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] strange OOM errors with EXECUTE in PL/pgSQL

2012-12-20 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 The reason this fails is that you've got a half-megabyte source string,
 and each of the 11000 plans that are due to be created from it saves
 its own copy of the source string.  Hence, 5500 megabytes needed just
 for source strings.

 We could possibly fix this by inventing some sort of reference-sharing
 arrangement (which'd be complicated and fragile) or by not storing the
 source strings with the plans (which'd deal a serious blow to our
 ability to provide helpful error messages).  Neither answer seems
 appealing.

I don't readily see how complicated and fragile it would be; it looks
like a hash table of symbols pointing to source strings with reference
counting, and each plan would need to reference that symbol. Now maybe
that's what you call complicated and fragile, and even if not, I'm not
really sure it would pull its weight. The use case of sending over and
over again *in a given session* the exact same query string without
using PREPARE/EXECUTE looks quite tiny.
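
For illustration, a minimal sketch of that kind of reference counting
(the names are hypothetical and this is not backend code; the hash lookup
that lets identical strings share one entry is omitted):

    #include <stdlib.h>
    #include <string.h>

    /* One shared copy of a query's source text; plans hold a pointer
     * to this instead of their own copy of the (possibly huge) string. */
    typedef struct SharedSource
    {
        int   refcount;   /* number of plans still pointing at the text */
        char *text;       /* the query source string, stored once */
    } SharedSource;

    static SharedSource *
    shared_source_create(const char *src)
    {
        SharedSource *s = malloc(sizeof(SharedSource));

        s->refcount = 1;
        s->text = strdup(src);
        return s;
    }

    static SharedSource *
    shared_source_ref(SharedSource *s)
    {
        s->refcount++;           /* another plan now shares the text */
        return s;
    }

    static void
    shared_source_unref(SharedSource *s)
    {
        if (--s->refcount == 0)  /* last plan dropped: free the text */
        {
            free(s->text);
            free(s);
        }
    }

The fragility lives in the bookkeeping: every code path that copies or
drops a plan must remember to ref/unref, or the text is leaked or freed
too early.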

-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] Review of Row Level Security

2012-12-20 Thread Kevin Grittner
Kohei KaiGai wrote:

 If the system ensures the writer's permission is always equivalent
 to or more restrictive than the reader's permission, it also eliminates
 the problem around asymmetric row-security policies between commands.

I'm not sure we're understanding each other; so far people who
favor asymmetric read and write predicates for row level security
have been arguing that people should be able to write tuples that
they can't read back. Of course you need to be able to read a tuple
to update it, but the argument is that you should be able to
configure security so that a role can (for example) update a row to
set a sealed flag, and then no longer have rights to read that
row (including for purposes of update). You can give away data
which is yours, but you can't then take it back if it's not.

 The problematic scenario was that updatable but invisible rows
 could be exposed;

I have not seen anyone argue that such behavior should be allowed.

 Probably, we can implement it as ((row-security of select) AND
 (row-security of update)) that ensures always restrictive
 row-security policy.

I hadn't been paying a lot of attention to this patch until I saw
the question about whether a user with a particular role could
write a row which would then not be visible to that role. I just
took a look at the patch.

I don't think I like ALTER TABLE as a syntax for row level
security. How about using existing GRANT syntax but allowing a
WHERE clause? That seems more natural to me, and it would make it
easy to apply the same conditions to multiple types of operations
when desired, but use different expressions when desired. Without
having spent a lot of time pondering it, I think that if row level
SELECT permissions exist, they would need to be met on the OLD
tuple to allow DELETE or UPDATE, and UPDATE row level permissions
would be applied to the NEW tuple.

So, Simon's use-case could be met with:

GRANT SELECT, INSERT, UPDATE, DELETE ON tabx
  TO org12user
  WHERE org = 12;

... and similar GRANTs.  A user who should be able to access rows
for a particular value of org would be granted the appropriate
permission. These could be combined by granting a role to another
role. To go back to a Wisconsin Courts example, staff in a county
might be granted rights to access data for that county, but
district roles could be set up and granted to court officials, who
need to be able to access data for all counties in their judicial
district, because judges fill in for each other across county
lines, but only within their own district.

My use-case could be met with:

GRANT SELECT, INSERT, UPDATE, DELETE ON addr
  TO general_staff
  WHERE NOT sealed;
GRANT SELECT, INSERT, UPDATE, DELETE ON addr
  TO sealed_addr_authority
  WHERE sealed;
GRANT general_staff TO sealed_addr_authority;

Note that I think that if one has multiple roles with row level
permissions on a table, access should be allowed if any of those
roles allows it.

I think that the above should be logically equivalent to (although
perhaps slightly less efficient at run-time):

GRANT SELECT, INSERT, UPDATE, DELETE ON addr
  TO general_staff
  WHERE NOT sealed;
GRANT SELECT, INSERT, UPDATE, DELETE ON addr
  TO sealed_addr_authority;

And just to round it out, that these could be applied to users
with:

GRANT general_staff TO bob;
GRANT sealed_addr_authority TO supervisor;
GRANT supervisor TO jean;

I realize this is a major syntactic departure from the current
patch, but it just seems a lot more natural and flexible.

-Kevin




Re: [HACKERS] Feature Request: pg_replication_master()

2012-12-20 Thread Joshua Berkus

 As ever, we spent much energy on debating backwards compatibility
 rather than just solving the problem it posed, which is fairly easy
 to
 solve.

Well, IIRC, the debate was primarily of *your* making.  Almost everyone else on 
the thread was fine with the original patch, and it was nearly done for 9.2 
before you stepped in.  I can't find anyone else on that thread who thought 
that backwards compatibility was more important than fixing the API.

--Josh




Re: [HACKERS] Feature Request: pg_replication_master()

2012-12-20 Thread Andres Freund
On 2012-12-18 19:43:09 -0800, Josh Berkus wrote:
 We should probably have a function, like pg_replication_master(), which
 gives the host address of the current master.  This would help DBAs for
 large replication clusters a lot.  Obviously, this would only work in
 streaming.

Do you want the node one step up or the top-level in the chain? Because
I don't think we can do the latter without complicating the replication
protocol noticeably.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] Switching timeline over streaming replication

2012-12-20 Thread Joshua Berkus


 I just committed a patch that should make the "requested WAL segment
 00020003 has already been removed" errors go away.  The
 trick was for walsenders to not switch to the new timeline until at
 least one record has been replayed on it.  That closes the window
 where the walsender already considers the new timeline to be the
 latest, but the WAL file has not been created yet.

OK, I'll download the snapshot in a couple of days and make sure this
didn't break something else.

--Josh




Re: [HACKERS] pg_top

2012-12-20 Thread Tom Lane
John R Pierce pie...@hogranch.com writes:
 /me tossed another mumbled curse at whomever changed that field name.

The reason for the field name change was that the semantics of the field
changed.  You typically ought to look at what the application is
actually doing with the field, not just do s/current_query/query/g
and expect that all will be well.  (In particular, if the app is looking
for "<IDLE>" or "<IDLE> in transaction" state markers, it's going to need
more adjustment than that.)

regards, tom lane




Re: [HACKERS] pg_top

2012-12-20 Thread Merlin Moncure
On Thu, Dec 20, 2012 at 4:17 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 John R Pierce pie...@hogranch.com writes:
 /me tossed another mumbled curse at whomever changed that field name.

 The reason for the field name change was that the semantics of the field
 changed.  You typically ought to look at what the application is
 actually doing with the field, not just do s/current_query/query/g
 and expect that all will be well.  (In particular, if the app is looking
 for "<IDLE>" or "<IDLE> in transaction" state markers, it's going to need
 more adjustment than that.)

IMSNHO, neither of these should have been changed, I would much rather
have seen a new view or some other way of opting into the new
functionality.

merlin




Re: [HACKERS] Cascading replication: should we detect/prevent cycles?

2012-12-20 Thread Joshua Berkus
Robert,

  What would such a test look like?  It's not obvious to me that
  there's any rapid way for a user to detect this situation, without
  checking each server individually.
 
 Change something on the master and observe that none of the supposed
 standbys notice?

That doesn't sound like an infallible test, or a 60-second one.

My point is that in a complex situation (imagine a shop with 9 replicated 
servers in 3 different cascaded groups, immediately after a failover of the 
original master), it would be easy for a sysadmin, responding to middle of the 
night page, to accidentally fat-finger an IP address and create a cycle instead 
of a new master.  And once he's done that, it's a longish troubleshooting 
process to figure out what's wrong and why writes aren't working, especially 
if he goes to bed and some other sysadmin picks up the "Writes failing to 
PostgreSQL" ticket.

*if* it's relatively easy for us to detect cycles (that's a big if, I'm not 
sure how we'd do it), then it would help a lot for us to at least emit a 
WARNING.  That would short-cut a lot of troubleshooting.
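
For what it's worth, if each standby could report its upstream node,
detecting a cycle would need no extra storage.  A minimal sketch under
that assumption (upstream() is hypothetical), using the classic
tortoise-and-hare walk:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct ReplNode ReplNode;

    /* hypothetical: the node this one replicates from, or NULL for a master */
    extern ReplNode *upstream(ReplNode *node);

    static bool
    replication_chain_has_cycle(ReplNode *start)
    {
        ReplNode   *slow = start;
        ReplNode   *fast = start;

        while (fast != NULL && upstream(fast) != NULL)
        {
            slow = upstream(slow);              /* advance one step */
            fast = upstream(upstream(fast));    /* advance two steps */
            if (slow == fast)
                return true;    /* the pointers met: there is a cycle */
        }
        return false;           /* reached a true master: no cycle */
    }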

--Josh Berkus




Re: [HACKERS] Feature Request: pg_replication_master()

2012-12-20 Thread Joshua Berkus
Andreas,
 
 Do you want the node one step up or the top-level in the chain?
 Because
 I don't think we can do the latter without complicating the
 replication
 protocol noticably.

Well, clearly a whole chain would be nice for the user.  But even just one step 
up would be very useful.

--Josh




Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread McDevitt, Charles
 
 Another way of attack along these lines might be to use the %glr-parser
 and then try to cut back on all those redundant rules that were put in
 to avoid conflicts.  The number of key word categories and such could
 perhaps also be reduced that way.
 
 Of course, this is mostly speculation.
 
 

The GLR output from Bison is licensed under the GPL (unlike the LALR output).
So using Bison's GLR mode isn't an option.




Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread Andres Freund
On 2012-12-20 23:12:55 +, McDevitt, Charles wrote:
  
  Another way of attack along these lines might be to use the %glr-parser
  and then try to cut back on all those redundant rules that were put in
  to avoid conflicts.  The number of key word categories and such could
  perhaps also be reduced that way.
  
  Of course, this is mostly speculation.
  
  
 
 The GLR output from Bison is licensed under the GPL (unlike the LALR output).
 So using Bison's GLR mode isn't an option.

That's not the case anymore:
http://www.gnu.org/software/bison/manual/html_node/Conditions.html

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread McDevitt, Charles
 
  The GLR output from Bison is licensed under the GPL (unlike the LALR 
  output).
  So using Bison's GLR mode isn't an option.
 
 That's not the case anymore:
 http://www.gnu.org/software/bison/manual/html_node/Conditions.html

Sorry!  My mistake... I didn't realize they changed the rules.
I'll be more careful to check these things in the future.





Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Fujii Masao
On Fri, Dec 21, 2012 at 2:45 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 20.12.2012 18:19, Fujii Masao wrote:

 InstallXLogFileSegment() also uses ThisTimeLineID. But your recent commit
 doesn't take care of it and prevents the standby from recycling the WAL
 files
 properly. Specifically, the standby recycles the WAL file to wrong name.


 A-ha, good catch. So that's actually a live bug in 9.1 and 9.2 as well:
 after the recovery target timeline has changed, restartpoints will continue
 to preallocate/recycle WAL files for the old timeline. That's otherwise
 harmless, but the useless WAL files waste space, and walreceiver will have
 to always create new files.

 So instead of always running with ThisTimeLineID = 0 in the checkpointer
 process, I guess we'll have to update it to the timeline being replayed,
 when creating a restartpoint.

Yes. Thanks for fixing that.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Parser Cruft in gram.y

2012-12-20 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2012-12-20 23:12:55 +, McDevitt, Charles wrote:
 Another way of attack along these lines might be to use the %glr-parser
 and then try to cut back on all those redundant rules that were put in
 to avoid conflicts.  The number of key words categories and such could
 perhaps also be reduced that way.

 The GLR output from Bison is licensed under the GPL (unlike the LALR output).
 So using Bison's GLR mode isn't an option.

 That's not the case anymore:
 http://www.gnu.org/software/bison/manual/html_node/Conditions.html

This does mean that we'd have to specify a minimum bison version of 2.2
in order to be squeaky-clean license-wise, if we went over to using the
GLR mode.  However, that would likely be a good idea anyway from a
technical standpoint --- the GLR mode may exist in ancient bison
versions, but who knows how bug-free it is.

Anyway, this is all merest speculation until somebody actually tries it
and sees if a performance gain is possible.  Having just re-read
the description of GLR mode, I wouldn't be surprised if any savings in
table size are squandered by its handling of ambiguous cases (i.e., the
need to track and merge multiple parser states).

regards, tom lane




Re: [HACKERS] Switching timeline over streaming replication

2012-12-20 Thread Thom Brown
On 20 December 2012 12:45, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 On 17.12.2012 15:05, Thom Brown wrote:

 I just set up 120 chained standbys, and for some reason I'm seeing these
 errors:

 LOG:  replication terminated by primary server
 DETAIL:  End of WAL reached on timeline 1
 LOG:  record with zero length at 0/301EC10
 LOG:  fetching timeline history file for timeline 2 from primary server
 LOG:  restarted WAL streaming at 0/300 on timeline 1
 LOG:  replication terminated by primary server
 DETAIL:  End of WAL reached on timeline 1
 LOG:  new target timeline is 2
 LOG:  restarted WAL streaming at 0/300 on timeline 2
 LOG:  replication terminated by primary server
 DETAIL:  End of WAL reached on timeline 2
 FATAL:  error reading result of streaming command: ERROR:  requested WAL
 segment 00020003 has already been removed

 ERROR:  requested WAL segment 00020003 has already been
 removed
 LOG:  started streaming WAL from primary at 0/300 on timeline 2
 ERROR:  requested WAL segment 00020003 has already been
 removed


 I just committed a patch that should make the "requested WAL segment
 00020003 has already been removed" errors go away. The
 trick was for walsenders to not switch to the new timeline until at least
 one record has been replayed on it. That closes the window where the
 walsender already considers the new timeline to be the latest, but the WAL
 file has not been created yet.


Now I'm getting this on all standbys after promoting the first standby in a
chain.

LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 1
LOG:  record with zero length at 0/301EC10
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  restarted WAL streaming at 0/300 on timeline 1
FATAL:  could not receive data from WAL stream:
LOG:  new target timeline is 2
FATAL:  could not connect to the primary server: FATAL:  the database
system is in recovery mode

LOG:  started streaming WAL from primary at 0/300 on timeline 2
TRAP: FailedAssertion(!(((sentPtr) = (SendRqstPtr))), File:
walsender.c, Line: 1425)
LOG:  server process (PID 19917) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted while in recovery at log time
2012-12-20 23:41:23 GMT
HINT:  If this has occurred more than once some data might be corrupted and
you might need to choose an earlier recovery target.
LOG:  entering standby mode
FATAL:  the database system is in recovery mode
LOG:  redo starts at 0/228
LOG:  consistent recovery state reached at 0/2E8
LOG:  database system is ready to accept read only connections
LOG:  record with zero length at 0/301EC70
LOG:  started streaming WAL from primary at 0/300 on timeline 2
LOG:  unexpected EOF on standby connection

And if I restart the new primary, the first new standby connected to it
shows:

LOG:  replication terminated by primary server
DETAIL:  End of WAL reached on timeline 2
FATAL:  error reading result of streaming command: server closed the
connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

LOG:  record with zero length at 0/301F1E0

However, all other standbys don't show any additional log output.

-- 
Thom


Re: [HACKERS] Review of Row Level Security

2012-12-20 Thread Simon Riggs
On 20 December 2012 21:50, Kevin Grittner kgri...@mail.com wrote:

 How about using existing GRANT syntax but allowing a
 WHERE clause?

It's a nice feature, but a completely different thing to what is being
discussed here.

Row security adds the ability to enforce a single coherent policy at
table level. It might be nice to have multiple, potentially
overlapping policies, but that would require significantly different
design and coding to what we have here. For me, enforcing a single
policy at table level helps to make it secure by being coherent and
understandable. So perhaps in later releases we might do the feature
you suggest.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Set visibility map bit after HOT prune

2012-12-20 Thread Jeff Janes
On Wednesday, December 19, 2012, Robert Haas wrote:

 On Wed, Dec 19, 2012 at 12:26 PM, Pavan Deolasee
 pavan.deola...@gmail.com wrote:

  I would like to run some pgbench tests where we get the system in a
  steady state such as all/most updates are HOT updates (not entirely
  unlikely scenario for many real life cases). And then try running some
  concurrent queries which can be executed via IOS. My gut feel is that,
  today we will see slow and continuous drop in performance for these
  queries because IOS will slowly stop working.

 If there are no vacuums, I agree.


If the table is randomly updated over its entire size, then pretty much
every block will be not-all-visible (and so disqualified from IOS) before
you hit the default 20% vacuum threshold.  I wonder if there ought not be
another vac threshold, based on vm density rather than estimated obsolete
tuple density.
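
As a stopgap with today's knobs, one can at least tighten the per-table
threshold, though it is still driven by the dead-tuple estimate rather
than vm density; a sketch with purely illustrative values:

ALTER TABLE pgbench_accounts
    SET (autovacuum_vacuum_scale_factor = 0.02,
         autovacuum_vacuum_threshold = 1000);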

Cheers,

Jeff


Re: [HACKERS] Set visibility map bit after HOT prune

2012-12-20 Thread Jeff Janes
On Thursday, December 20, 2012, Robert Haas wrote:


 IMHO, it's probably fairly hopeless to make a pure pgbench workload
 show a benefit from index-only scans.  A large table under a very
 heavy, completely random write workload is just about the worst
 possible case for index-only scans.  Index-only scans are a way of
 avoiding unnecessary visibility checks when the target data hasn't
 changed recently, not a magic bullet to escape all heap access.  If
 the target data has changed, you're going to have to touch the heap.
 And while I agree that we aren't aggressive enough in setting the VM
 bits right now, I also think it wouldn't be too hard to go too far in
 the opposite direction: we could easily spend more effort trying to
 make index-only scans effective than we could ever hope to recoup from
 the scans themselves.

 Now, if you set up N threads of which 10% are doing regular pgbench
 and the other 90% are doing pgbench -S, or something like that, then
 you might start to hope for some benefit from index-only scans.


I set this up before by dropping the primary key and instead building an
index on (aid, abalance), then running pgbench with a mixture of -f flags
corresponding to some -S-like and some default-like transactions.
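
Roughly, the setup was as follows (a sketch assuming stock pgbench object
names):

ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;
CREATE INDEX pgbench_accounts_aid_abalance
    ON pgbench_accounts (aid, abalance);

-- The -S-like script then touches only indexed columns
-- (pgbench script form; :aid is a pgbench variable):
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;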

On a freshly vacuumed table, I saw a hint of a performance boost even at a
50:50 ratio, and a clear boost at 3 -S to 1 default.

I ran this at a size where not even all the index fit in RAM, to maximize
the benefit of not having to visit the table.

However, the boost started going away due to vm clearance long before
autovacuum kicked in at default settings.

Cheers,

Jeff


Re: [HACKERS] [PERFORM] pgbench to the MAXINT

2012-12-20 Thread Gurjeet Singh
On Wed, Feb 16, 2011 at 8:15 AM, Greg Smith g...@2ndquadrant.com wrote:

 Tom Lane wrote:

 I think that might be a good idea --- it'd reduce the cross-platform
 variability of the results quite a bit, I suspect.  random() is not
 to be trusted everywhere, but I think erand48 is pretty much the same
 wherever it exists at all (and src/port/ provides it elsewhere).



  Given that pgbench will run with threads in some multi-worker
  configurations, after some more portability research I think odds are good
  we'd get nailed by http://sourceware.org/bugzilla/show_bug.cgi?id=10320:
  erand48 implementation not thread safe but POSIX says it should be.
  The AIX docs have a similar warning on them, so who knows how many
  versions of that library have the same issue.

 Maybe we could make sure the one in src/port/ is thread safe and make sure
 pgbench only uses it.  This whole area continues to be messy enough that I
 think the patch needs to brew for another CF before it will all be sorted
 out properly.  I'll mark it accordingly and can pick this back up later.


Hi Greg,

I spent some time rebasing this patch onto current master. Attached is
the patch, based on master from a couple of commits back.

Your concern of using erand48() has been resolved since pgbench now
uses thread-safe and concurrent pg_erand48() from src/port/.

The patch is very much what you had posted, except for a couple of
differences due to bit-rot. (i) I didn't have to #define MAX_RANDOM_VALUE64
since its cousin MAX_RANDOM_VALUE is not used by the code anymore, and (ii) I
used a ternary operator in the DDLs[] array to decide when to use bigint vs.
int columns.

Please review.

As for tests, I am currently running 'pgbench -i -s 21474' using
unpatched pgbench, and am recording the time taken; scale factor 21475 had
actually failed to do anything meaningful with unpatched pgbench. Next
I'll run with '-s 21475' on the patched version to see if it does the right
thing, and in acceptable time compared to '-s 21474'.

What tests would you and others like to see, to get some confidence in
the patch? The machine that I have access to has 62 GB of RAM, 16 cores
(64 hardware threads), and about 900 GB of disk space.

Linux host 3.2.6-3.fc16.ppc64 #1 SMP Fri Feb 17 21:41:20 UTC 2012 ppc64
ppc64 ppc64 GNU/Linux

Best regards,

PS: The primary source of the patch is this branch:
https://github.com/gurjeet/postgres/tree/64bit_pgbench
-- 
Gurjeet Singh

http://gurjeet.singh.im/


pgbencg-64-v6.patch
Description: Binary data



Re: [HACKERS] ThisTimeLineID in checkpointer and bgwriter processes

2012-12-20 Thread Amit Kapila
On Thursday, December 20, 2012 11:15 PM Heikki Linnakangas wrote:
 On 20.12.2012 18:19, Fujii Masao wrote:
  InstallXLogFileSegment() also uses ThisTimeLineID. But your recent commit
  doesn't take care of it and prevents the standby from recycling the WAL
  files properly. Specifically, the standby recycles the WAL file to the
  wrong name.
 
 A-ha, good catch. So that's actually a live bug in 9.1 and 9.2 as well:
 after the recovery target timeline has changed, restartpoints will
 continue to preallocate/recycle WAL files for the old timeline. That's
 otherwise harmless, but the useless WAL files waste space, and
 walreceiver will have to always create new files.
 
 So instead of always running with ThisTimeLineID = 0 in the checkpointer
 process, I guess we'll have to update it to the timeline being replayed,
 when creating a restartpoint.

Shouldn't there be an if (RecoveryInProgress()) check before assigning
RecoveryTargetTLI to ThisTimeLineID in CreateRestartPoint()?

With Regards,
Amit Kapila.





Re: [HACKERS] Review of Row Level Security

2012-12-20 Thread Kohei KaiGai
2012/12/20 Kevin Grittner kgri...@mail.com:
 Kohei KaiGai wrote:

 If system ensures writer's permission is always equivalent or
 more restrictive than reader's permission, it also eliminates the
 problem around asymmetric row-security policy between commands.

 I'm not sure we're understanding each other; so far people who
 favor asymmetric read and write predicates for row level security
 have been arguing that people should be able to write tuples that
 they can't read back. Of course you need to be able to read a tuple
 to update it, but the argument is that you should be able to
 configure security so that a role can (for example) update a row to
 set a sealed flag, and then no longer have rights to read that
 row (including for purposes of update). You can give away data
 which is yours, but you can't then take it back if it's not.

Hmm, for this example, the right behavior is to check rows against
((row-security of SELECT) AND (row-security of UPDATE)) at the scan
stage, but only (row-security of UPDATE) just before a row is updated.
If the row-security of SELECT is (sealed = false), a sealed row is not
visible and thus not a target of the UPDATE, but a user can turn on
the flag to make a row disappear from his own view.
Anyway, the row-security applied at the scan stage of the source of an
UPDATE/DELETE needs to be equivalent to or more restrictive than the
rules for SELECT. However, it may not be necessary to enforce that
restrictiveness at the write stage.

 Probably, we can implement it as ((row-security of select) AND
 (row-security of update)), which always ensures a restrictive
 row-security policy.

 I hadn't been paying a lot of attention to this patch until I saw
 the question about whether a user with a particular role could
 write a row which would then not be visible to that role. I just
 took a look at the patch.

 I don't think I like ALTER TABLE as a syntax for row level
 security. How about using existing GRANT syntax but allowing a
 WHERE clause? That seems more natural to me, and it would make it
 easy to apply the same conditions to multiple types of operations
 when desired, but use different expressions when desired.

It seems to me this syntax gives the impression that the row-security
feature is tightly coupled to the role mechanism, even though it need
not be. For example, we could set a row-security policy such as "the
primary key is an even/odd number", independent of the current role.

 Without
 having spent a lot of time pondering it, I think that if row level
 SELECT permissions exist, they would need to be met on the OLD
 tuple to allow DELETE or UPDATE, and UPDATE row level permissions
 would be applied to the NEW tuple.

Yes, I agree.
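
To make the asymmetric case concrete, in the GRANT-with-predicate style
floated above it might read (purely hypothetical syntax; nothing like
this is implemented):

-- Hypothetical: clerk can flip the flag, but cannot read sealed
-- rows back, and so cannot update them a second time either.
GRANT SELECT ON documents WHERE (NOT sealed) TO clerk;
GRANT UPDATE ON documents WHERE (owner = current_user) TO clerk;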

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp



