Re: [HACKERS] RLS Design

2014-07-04 Thread Kouhei Kaigai
Sorry for my late response; I'm now catching up on the discussion.

 * Robert Haas (robertmh...@gmail.com) wrote:
  On Tue, Jul 1, 2014 at 3:20 PM, Dean Rasheed dean.a.rash...@gmail.com
 wrote:
   If RLS quals are instead regarded as constraints on access, and
   multiple policies apply, then it seems that the quals should now be
   combined with AND rather than OR, right?
 
 I do feel that RLS quals are constraints on access, but I don't see how
 it follows that multiple quals should be AND'd together because of that.
 I view the RLS policies on each table as being independent and standing
 alone regarding what can be seen.  If you have access to a table today
 through policy A, and then later policy B is added, using AND would mean
 that the set of rows returned is less than if only policy A existed.
 That doesn't seem correct to me.
 
It seems to me that the direction in which constraints (RLS policies) work is the reverse.

When we have no RLS policy, 100% of rows are visible, aren't they?
Adding a constraint usually reduces the number of visible rows, or at
least keeps it the same. A constraint should never work in the direction
of increasing the number of visible rows.

If multiple RLS policies are connected with the OR operator, the first
policy works to reduce the number of visible rows, but the second policy
works in the reverse direction.

If we had OR'd RLS policies, how would they be merged with user-given
qualifiers?
For example, if the RLS policy of t1 is (t1.credential < get_user_credential)
and the user's query is:
  SELECT * FROM t1 WHERE t1.x = t1.x;
do you think the RLS policy should be merged in OR'd form?


  Yeah, maybe.  I intuitively feel that OR would be more useful, so it
  would be nice to find a design where that makes sense.  But it depends
  a lot, in my view, on what syntax we end up with.  For example,
  suppose we add just one command:
 
  ALTER TABLE table_name FILTER [ role_name | PUBLIC ] USING qual;
 
  If the given role inherits from multiple roles that have different
  filters, I think the user will naturally expect all of the filters to
  be applied.
 
 Agreed.
 
  But you could do it other ways.  For example:
 
  ALTER TABLE table_name [ NO ] ROW LEVEL SECURITY;
  ALTER TABLE table_name GRANT ROW ACCESS TO role_name USING qual;
 
  If a table is set to NO ROW LEVEL SECURITY then it behaves just like
  it does now: anyone who accesses it sees all the rows, restricted to
  those columns for which they have permission.  If the table is set to
  ROW LEVEL SECURITY then the default is to show no rows.  The second
  command then allows access to a subset of the rows for a given role
  name.  In this case, it is probably logical for access to be combined
  via OR.
 
 I can see value in having a table-level option to indicate if RLS is applied
 for that table or not, but I had been thinking we'd just automatically manage
 that.  That is to say that once you define an RLS policy for a table, we
 go look and see what policy should be applied in each case.  With the user
 able to control that, what happens if they turn on row security for the table
 and there are no policies?  All access would show the table as empty?  What
 if policies exist and they decide to 'turn off' RLS for the table - suddenly
 everyone can see all the rows?
 
 My answers to the above (which are making me like the idea more,
 actually...) would be:
 
 Yes, if they turn on RLS for the table and there aren't any policies, then
 the table appears empty for anyone with normal SELECT rights (table owner
 and superusers would still see everything).
 
 If policies exist and the user asks to turn off RLS, I'd throw an ERROR
 as there is a security risk there.  We could support a CASCADE option which
 would go and drop the policies from the table first.
 
Hmm... This approach starts from empty permissions and then adds permission
to reference a particular range of the configured table. That's one attitude.

However, I think it has a dark side we cannot ignore. Usually, the purpose
of a security mechanism is to ensure what is readable/writable according to
the rules. Once multiple RLS policies are merged in OR'd form, the results
are unpredictable.
Suppose there are two independent applications that use RLS on table X.
Even if application-1 wants only public rows to be visible, it may
expose credential or secret rows through interaction with an orthogonal
policy configured by application-2 (which may configure its policy
according to the source IP address, say). It seems to me that application-2
has partially invalidated the RLS policy configured by application-1.

I think an important characteristic is that things meant to be invisible
stay invisible even when multiple rules are configured.

 Otherwise, I'm generally liking Dean's thoughts in
 http://www.postgresql.org/message-id/CAEZATCVftksFH=X+9mVmBNMZo5KsUP+R
 k0kb4oro92jofjo...@mail.gmail.com
 along with the table-level enable RLS option.
 
 Are we getting to a point where there is 

Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-07-04 Thread Abhijit Menon-Sen
At 2014-07-04 14:38:27 +0900, masao.fu...@gmail.com wrote:

 But 0002-CompressBackupBlock_snappy_lz4_pglz-2.patch doesn't seem to
 be able to apply to HEAD cleanly.

Yes, and it needs quite some reformatting beyond fixing whitespace
damage too (long lines, comment formatting, consistent spacing etc.).

-- Abhijit




Re: [HACKERS] Issue while calling new PostgreSQL command from a Java Application

2014-07-04 Thread Ashutosh Bapat
You may have to add code to copy inp_str in _copyVacuumStmt(). See how a
character array is copied in the other _copy* functions.
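
For illustration, a rough sketch of the kind of change meant here (the
surrounding fields are abbreviated; COPY_* are the usual copyfuncs.c
macros, and this is not the literal function body):

static VacuumStmt *
_copyVacuumStmt(const VacuumStmt *from)
{
	VacuumStmt *newnode = makeNode(VacuumStmt);

	COPY_SCALAR_FIELD(options);
	/* ... the other existing fields ... */
	COPY_NODE_FIELD(relation);
	COPY_NODE_FIELD(va_cols);
	/* without this, copyObject() leaves the new field NULL in the copy */
	COPY_STRING_FIELD(inp_str);

	return newnode;
}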


On Fri, Jul 4, 2014 at 10:43 AM, Ashoke s.ash...@gmail.com wrote:

 Hi,

 --

 I have defined a new command my_command in PostgreSQL. This command takes
 the path of ANALYZE, and inside analyze.c I have a function to do some
 operations if it's my_command. This command takes the input arguments:
 table name, column name and an input string.

 my_command nation (n_nationkey) 'input string';

 When I run this command from the psql command line, it works as expected. But
 when I call the same command from a Java application, the variable that
 stores the input string is NULL.

 I printed the value of the input string in the gram.y file where I have
 defined my_command:
 fprintf(stderr, "I am inside gram.y %s\n", n->inp_str); and the input
 string is printed correctly.

 But when I print stmt->inp_str in the function standard_ProcessUtility()
 of utility.c for the case T_VacuumStmt, I get the value as NULL. This is
 as far as I could trace back from analyze.c.

 I am not sure how executing the same command from an application can make
 a difference.

 gram.y content gist:
 --

 MyStmt:
 my_keyword qualified_name name_list my_inp_str
 {
 VacuumStmt *n = makeNode(VacuumStmt);
 n->options = VACOPT_ANALYZE;
 n->freeze_min_age = -1;
 n->freeze_table_age = -1;
 n->relation = $2;
 n->va_cols = $3;
 n->inp_str = $4;
 fprintf(stderr, "I am inside gram.y %s\n", n->inp_str);

 $$ = (Node *)n;
 };

 char *inp_str is added to the struct VacuumStmt in parsenodes.h

 ---

 Only the newly added char *inp_str (which is what differs from plain ANALYZE)
 is NULL. I was able to retrieve the column name from va_cols.

 Any help is appreciated. Thanks!
 --
 Regards,
 Ashoke





-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Issue while calling new PostgreSQL command from a Java Application

2014-07-04 Thread Abhijit Menon-Sen
At 2014-07-04 10:43:12 +0530, s.ash...@gmail.com wrote:

 I am not sure how executing the same command from an application can
 make a difference.

It shouldn't make any difference, of course.

But since you're seeing the problem with new code, the overwhelming
probability is that there's an error in the new code. That being the
case, speculating about what might be going wrong without looking at
the code in question is a waste of time.

-- Abhijit




Re: [HACKERS] Can simplify 'limit 1' with slow function?

2014-07-04 Thread Martijn van Oosterhout
Fascinating.

On Fri, Jul 04, 2014 at 10:47:06AM +0800, gotoschool6g wrote:
 slow query(8531 ms):
 SELECT ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 
 40.12211338311868)')) FROM road order by id LIMIT 1;
 
 explain output:
 Limit  (cost=4653.48..4653.48 rows=1 width=3612)
   ->  Sort  (cost=4653.48..4683.06 rows=11832 width=3612)
         Sort Key: id
         ->  Seq Scan on road  (cost=0.00..4594.32 rows=11832 width=3612)
 
 fast query(16ms):
 select ST_Distance_Sphere(shape,ST_GeomFromText('POINT(116.41386186784513 
 40.12211338311868)')) from (SELECT shape FROM road order by id  LIMIT 1) a
 
 explain output:
 Subquery Scan on a  (cost=1695.48..1695.74 rows=1 width=3608)
   ->  Limit  (cost=1695.48..1695.48 rows=1 width=3612)
         ->  Sort  (cost=1695.48..1725.06 rows=11832 width=3612)
               Sort Key: road.id
               ->  Seq Scan on road  (cost=0.00..1636.32 rows=11832 width=3612)

So Postgres knows perfectly well that it's expensive, it just doesn't
appear to understand it has the option of moving the calculation above
the limit.

In this case though, it seems an index on road(id) would make it
instant in any case.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] pg_xlogdump --stats

2014-07-04 Thread Abhijit Menon-Sen
At 2014-06-30 05:19:10 +0000, dilip.ku...@huawei.com wrote:

 Please fix these issues and send the updated patch..
 
 I will continue reviewing the patch..

Did you get anywhere with the updated patch?

-- Abhijit




Re: [HACKERS] Cluster name in ps output

2014-07-04 Thread Fujii Masao
On Wed, Jul 2, 2014 at 3:45 AM, Vik Fearing vik.fear...@dalibo.com wrote:
 On 06/29/2014 02:25 PM, Andres Freund wrote:
 On 2014-06-29 11:11:14 +0100, Thomas Munro wrote:
  On 29 June 2014 10:55, Andres Freund and...@2ndquadrant.com wrote:
   So, I'd looked at it with an eye towards committing it and found some
   more things. I've now
   * added the restriction that the cluster name can only be ASCII. It's
 shown from several clusters with differing encodings, and it's shown
 in process titles, so that seems wise.
   * moved everything to the LOGGING_WHAT category
   * added minimal documentation to monitoring.sgml
   * expanded the config.sgml entry to mention the restriction on the 
   name.
   * Changed the GUC's short description
 
  Thank you.
 
   I also think, but haven't done so, we should add a double colon after
   the cluster name, so it's not:
  
   postgres: server1 stats collector process
   but
   postgres: server1: stats collector process
 
  +1

 Committed with the above changes. Thanks for the contribution!

 Is there a reason for not using this in synchronous_standby_names,
 perhaps falling back to application_name if not set?

You mean that if synchronous_standby_names is empty it should automatically
be set to cluster_name? Or do you mean that if application_name is not
set in primary_conninfo the standby should automatically use its cluster_name
as application_name in primary_conninfo? I'm afraid those may cause
trouble, such as a standby unexpectedly being treated as a synchronous one
even though the user wanted to set it up as an asynchronous one by emptying
synchronous_standby_names on the master.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Issue while calling new PostgreSQL command from a Java Application

2014-07-04 Thread Ashoke
Thank you, Ashutosh. That was the issue. But could you please explain why
it worked from the command line?


On Fri, Jul 4, 2014 at 11:49 AM, Ashutosh Bapat 
ashutosh.ba...@enterprisedb.com wrote:

  You may have to add code to copy inp_str in _copyVacuumStmt(). See how a
  character array is copied in the other _copy* functions.


-- 
Regards,
Ashoke


Re: [HACKERS] Cluster name in ps output

2014-07-04 Thread Vik Fearing
On 07/04/2014 08:50 AM, Fujii Masao wrote:
 On Wed, Jul 2, 2014 at 3:45 AM, Vik Fearing vik.fear...@dalibo.com wrote:
 Is there a reason for not using this in synchronous_standby_names,
 perhaps falling back to application_name if not set?
 
 You mean that if synchronous_standby_names is empty it should automatically
 be set to cluster_name? Or do you mean that if application_name is not
 set in primary_conninfo the standby should automatically use its cluster_name
 as application_name in primary_conninfo? I'm afraid those may cause
 trouble, such as a standby unexpectedly being treated as a synchronous one
 even though the user wanted to set it up as an asynchronous one by emptying
 synchronous_standby_names on the master.

No, I mean that synchronous_standby_names should look at cluster_name
first, and if it's not set then unfortunately look at application_name
for backward compatibility.

Using application_name for this always seems like a hack to me, and
cluster_name is a much better fit.  We should have created cluster_name
back when we created synchronous_standby_names.
-- 
Vik




Re: [HACKERS] Issue while calling new PostgreSQL command from a Java Application

2014-07-04 Thread Ashutosh Bapat
On Fri, Jul 4, 2014 at 12:30 PM, Ashoke s.ash...@gmail.com wrote:

 Thank you, Ashutosh. That was the issue. But could you please explain
 why it worked from the command line?


I do not know. Any time we add a member to a node and find its value
coming out NULL or 0 instead of the one set, the corresponding _copy*
function is the first suspect. You may be able to find out why it worked
from the command line and not through the connector by breaking on
copyObject() in both cases.






-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] pg_xlogdump --stats

2014-07-04 Thread Dilip kumar
On 04 July 2014 12:07, Abhijit Menon-Sen wrote:

 -Original Message-
 From: Abhijit Menon-Sen [mailto:a...@2ndquadrant.com]
 Sent: 04 July 2014 12:07
 To: Dilip kumar
 Cc: pgsql-hackers@postgresql.org; furu...@pm.nttdata.co.jp
 Subject: Re: [HACKERS] pg_xlogdump --stats
 
  At 2014-06-30 05:19:10 +0000, dilip.ku...@huawei.com wrote:
 
  Please fix these issues and send the updated patch..
 
  I will continue reviewing the patch..
 
 Did you get anywhere with the updated patch?
 

Patch looks fine to me, except for a few small comments.

1. Update this new option in the usage() function as well; it still has it
the old way: { -z, --stats[=record] }

{"stats", no_argument, NULL, 'z'},
{"record-stats", no_argument, NULL, 'Z'},

2. While applying stats-newopt.dif (after applying stat2.diff), there was an
error merging the sgml file:

patching file `doc/src/sgml/pg_xlogdump.sgml'
Hunk #1 FAILED at 181.
1 out of 1 hunk FAILED -- saving rejects to doc/src/sgml/pg_xlogdump.sgml.rej

Once you fix the above errors, I think the patch is OK from my side.

Thanks & Regards,
Dilip Kumar





Re: [HACKERS] pg_xlogdump --stats

2014-07-04 Thread Abhijit Menon-Sen
At 2014-07-04 08:38:17 +0000, dilip.ku...@huawei.com wrote:

 Once you fix the above errors, I think the patch is OK from my side.

I've attached two updated patches here, including the fix to the usage
message. I'll mark this ready for committer now. (The rm_identify patch
is posted separately from the xlogdump one only to make the whole thing
easier to follow.)

Thank you.

-- Abhijit
diff --git a/contrib/pg_xlogdump/pg_xlogdump.c b/contrib/pg_xlogdump/pg_xlogdump.c
index c555786..239321f 100644
--- a/contrib/pg_xlogdump/pg_xlogdump.c
+++ b/contrib/pg_xlogdump/pg_xlogdump.c
@@ -15,12 +15,28 @@
 #include <dirent.h>
 #include <unistd.h>
 
+#include "access/clog.h"
+#include "access/gin_private.h"
+#include "access/gist_private.h"
+#include "access/heapam_xlog.h"
+#include "access/heapam_xlog.h"
+#include "access/multixact.h"
+#include "access/nbtree.h"
+#include "access/spgist_private.h"
+#include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlogreader.h"
 #include "access/transam.h"
+#include "catalog/pg_control.h"
+#include "catalog/storage_xlog.h"
+#include "commands/dbcommands.h"
+#include "commands/sequence.h"
+#include "commands/tablespace.h"
 #include "common/fe_memutils.h"
 #include "getopt_long.h"
 #include "rmgrdesc.h"
+#include "storage/standby.h"
+#include "utils/relmapper.h"
 
 
 static const char *progname;
@@ -41,6 +57,8 @@ typedef struct XLogDumpConfig
 	int			stop_after_records;
 	int			already_displayed_records;
 	bool		follow;
+	bool		stats;
+	bool		stats_per_record;
 
 	/* filter options */
 	int			filter_by_rmgr;
@@ -48,6 +66,20 @@ typedef struct XLogDumpConfig
 	bool		filter_by_xid_enabled;
 } XLogDumpConfig;
 
+typedef struct Stats
+{
+	uint64		count;
+	uint64		rec_len;
+	uint64		fpi_len;
+} Stats;
+
+typedef struct XLogDumpStats
+{
+	uint64		count;
+	Stats		rmgr_stats[RM_NEXT_ID];
+	Stats		record_stats[RM_NEXT_ID][16];
+} XLogDumpStats;
+
 static void
 fatal_error(const char *fmt,...)
 __attribute__((format(PG_PRINTF_ATTRIBUTE, 1, 2)));
@@ -322,6 +354,50 @@ XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 }
 
 /*
+ * Store per-rmgr and per-record statistics for a given record.
+ */
+static void
+XLogDumpCountRecord(XLogDumpConfig *config, XLogDumpStats *stats, XLogRecPtr ReadRecPtr, XLogRecord *record)
+{
+	RmgrId		rmid;
+	uint8  		recid;
+
+	if (config->filter_by_rmgr != -1 &&
+		config->filter_by_rmgr != record->xl_rmid)
+		return;
+
+	if (config->filter_by_xid_enabled &&
+		config->filter_by_xid != record->xl_xid)
+		return;
+
+	stats->count++;
+
+	/* Update per-rmgr statistics */
+
+	rmid = record->xl_rmid;
+
+	stats->rmgr_stats[rmid].count++;
+	stats->rmgr_stats[rmid].rec_len +=
+		record->xl_len + SizeOfXLogRecord;
+	stats->rmgr_stats[rmid].fpi_len +=
+		record->xl_tot_len - (record->xl_len + SizeOfXLogRecord);
+
+	/*
+	 * Update per-record statistics, where the record is identified by a
+	 * combination of the RmgrId and the upper four bits of the xl_info
+	 * field (to give sixteen possible entries per RmgrId).
+	 */
+
+	recid = (record->xl_info & ~XLR_INFO_MASK) >> 4;
+
+	stats->record_stats[rmid][recid].count++;
+	stats->record_stats[rmid][recid].rec_len +=
+		record->xl_len + SizeOfXLogRecord;
+	stats->record_stats[rmid][recid].fpi_len +=
+		record->xl_tot_len - (record->xl_len + SizeOfXLogRecord);
+}
+
+/*
  * Print a record to stdout
  */
 static void
@@ -380,6 +456,498 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogRecPtr ReadRecPtr, XLogRecord
 	}
 }
 
+/*
+ * Given an xl_rmid and the high bits of xl_info, returns a string
+ * describing the record type.
+ */
+static const char *
+identify_record(RmgrId rmid, uint8 info)
+{
+	const RmgrDescData *desc = &RmgrDescTable[rmid];
+	const char *rec;
+
+	rec = psprintf("0x%x", info);
+
+	switch (rmid)
+	{
+		case RM_XLOG_ID:
+			switch (info)
+			{
+				case XLOG_CHECKPOINT_SHUTDOWN:
+					rec = "CHECKPOINT_SHUTDOWN";
+					break;
+				case XLOG_CHECKPOINT_ONLINE:
+					rec = "CHECKPOINT_ONLINE";
+					break;
+				case XLOG_NOOP:
+					rec = "NOOP";
+					break;
+				case XLOG_NEXTOID:
+					rec = "NEXTOID";
+					break;
+				case XLOG_SWITCH:
+					rec = "SWITCH";
+					break;
+				case XLOG_BACKUP_END:
+					rec = "BACKUP_END";
+					break;
+				case XLOG_PARAMETER_CHANGE:
+					rec = "PARAMETER_CHANGE";
+					break;
+				case XLOG_RESTORE_POINT:
+					rec = "RESTORE_POINT";
+					break;
+				case XLOG_FPW_CHANGE:
+					rec = "FPW_CHANGE";
+					break;
+				case XLOG_END_OF_RECOVERY:
+					rec = "END_OF_RECOVERY";
+					break;
+				case XLOG_FPI:
+					rec = "FPI";
+					break;
+			}
+			break;
+
+		case RM_XACT_ID:
+			switch (info)
+			{
+				case XLOG_XACT_COMMIT:
+					rec = "COMMIT";
+					break;
+				case XLOG_XACT_PREPARE:
+					rec = "PREPARE";
+					break;
+				case XLOG_XACT_ABORT:
+					rec = "ABORT";
+					break;
+				case XLOG_XACT_COMMIT_PREPARED:
+					rec = "COMMIT_PREPARED";
+					break;
+				case XLOG_XACT_ABORT_PREPARED:
+					rec = "ABORT_PREPARED";
+					break;
+				case XLOG_XACT_ASSIGNMENT:
+					rec = "ASSIGNMENT";
+					break;
+				case XLOG_XACT_COMMIT_COMPACT:
+

Re: [HACKERS] pg_xlogdump --stats

2014-07-04 Thread Andres Freund
On 2014-07-04 14:16:42 +0530, Abhijit Menon-Sen wrote:
 At 2014-07-04 08:38:17 +0000, dilip.ku...@huawei.com wrote:
 
  Once you fix the above errors, I think the patch is OK from my side.
 
 I've attached two updated patches here, including the fix to the usage
 message. I'll mark this ready for committer now. (The rm_identify patch
 is posted separately from the xlogdump one only to make the whole thing
 easier to follow.)

I'm pretty sure that most committers would want to apply the rm_identity
part in a separate commit first. Could you make it two patches, one
introducing rm_identity, and another with xlogdump using it?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_xlogdump --stats

2014-07-04 Thread Abhijit Menon-Sen
At 2014-07-04 10:54:08 +0200, and...@2ndquadrant.com wrote:

 I'm pretty sure that most committers would want to apply the
 rm_identity part in a separate commit first. Could you make it
 two patches, one introducing rm_identity, and another with
 xlogdump using it?

Sure, attached.

-- Abhijit
diff --git a/src/backend/access/rmgrdesc/clogdesc.c b/src/backend/access/rmgrdesc/clogdesc.c
index e82baa8..08f225d 100644
--- a/src/backend/access/rmgrdesc/clogdesc.c
+++ b/src/backend/access/rmgrdesc/clogdesc.c
@@ -40,3 +40,21 @@ clog_desc(StringInfo buf, XLogRecord *record)
 	else
 		appendStringInfoString(buf, "UNKNOWN");
 }
+
+const char *
+clog_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info)
+	{
+		case CLOG_ZEROPAGE:
+			id = "ZEROPAGE";
+			break;
+		case CLOG_TRUNCATE:
+			id = "TRUNCATE";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 0230716..38f3a39 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -42,3 +42,21 @@ dbase_desc(StringInfo buf, XLogRecord *record)
 	else
 		appendStringInfoString(buf, "UNKNOWN");
 }
+
+const char *
+dbase_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info)
+	{
+		case XLOG_DBASE_CREATE:
+			id = "CREATE";
+			break;
+		case XLOG_DBASE_DROP:
+			id = "DROP";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/rmgrdesc/gindesc.c b/src/backend/access/rmgrdesc/gindesc.c
index 12d68d7..115ad5b 100644
--- a/src/backend/access/rmgrdesc/gindesc.c
+++ b/src/backend/access/rmgrdesc/gindesc.c
@@ -187,3 +187,45 @@ gin_desc(StringInfo buf, XLogRecord *record)
 			break;
 	}
 }
+
+const char *
+gin_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info)
+	{
+		case XLOG_GIN_CREATE_INDEX:
+			id = "CREATE_INDEX";
+			break;
+		case XLOG_GIN_CREATE_PTREE:
+			id = "CREATE_PTREE";
+			break;
+		case XLOG_GIN_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_GIN_SPLIT:
+			id = "SPLIT";
+			break;
+		case XLOG_GIN_VACUUM_PAGE:
+			id = "VACUUM_PAGE";
+			break;
+		case XLOG_GIN_VACUUM_DATA_LEAF_PAGE:
+			id = "VACUUM_DATA_LEAF_PAGE";
+			break;
+		case XLOG_GIN_DELETE_PAGE:
+			id = "DELETE_PAGE";
+			break;
+		case XLOG_GIN_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+		case XLOG_GIN_INSERT_LISTPAGE:
+			id = "INSERT_LISTPAGE";
+			break;
+		case XLOG_GIN_DELETE_LISTPAGE:
+			id = "DELETE_LISTPAGE";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index cdac496..e4f288b 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -67,3 +67,24 @@ gist_desc(StringInfo buf, XLogRecord *record)
 			break;
 	}
 }
+
+const char *
+gist_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info)
+	{
+		case XLOG_GIST_PAGE_UPDATE:
+			id = "PAGE_UPDATE";
+			break;
+		case XLOG_GIST_PAGE_SPLIT:
+			id = "PAGE_SPLIT";
+			break;
+		case XLOG_GIST_CREATE_INDEX:
+			id = "CREATE_INDEX";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 86a0376..c58461c 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -20,3 +20,9 @@ void
 hash_desc(StringInfo buf, XLogRecord *record)
 {
 }
+
+const char *
+hash_identify(uint8 info)
+{
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 7df18fa..e6149be 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -202,3 +202,78 @@ heap2_desc(StringInfo buf, XLogRecord *record)
 	else
 		appendStringInfoString(buf, UNKNOWN);
 }
+
+const char *
+heap_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info & XLOG_HEAP_OPMASK)
+	{
+		case XLOG_HEAP_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HEAP_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HEAP_UPDATE:
+			id = "UPDATE";
+			break;
+		case XLOG_HEAP_HOT_UPDATE:
+			id = "HOT_UPDATE";
+			break;
+		case XLOG_HEAP_NEWPAGE:
+			id = "NEWPAGE";
+			break;
+		case XLOG_HEAP_LOCK:
+			id = "LOCK";
+			break;
+		case XLOG_HEAP_INPLACE:
+			id = "INPLACE";
+			break;
+	}
+
+	if (info & XLOG_HEAP_INIT_PAGE)
+		id = psprintf("%s+INIT", id);
+
+	return id;
+}
+
+const char *
+heap2_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info & XLOG_HEAP_OPMASK)
+	{
+		case XLOG_HEAP2_CLEAN:
+			id = "CLEAN";
+			break;
+		case XLOG_HEAP2_FREEZE_PAGE:
+			id = "FREEZE_PAGE";
+			break;
+		case XLOG_HEAP2_CLEANUP_INFO:
+			id = "CLEANUP_INFO";
+			break;
+		case XLOG_HEAP2_VISIBLE:
+			id = "VISIBLE";
+			break;
+		case XLOG_HEAP2_MULTI_INSERT:
+			id = "MULTI_INSERT";
+			break;
+		case XLOG_HEAP2_LOCK_UPDATED:
+			id = "LOCK_UPDATED";
+			break;
+		case XLOG_HEAP2_NEW_CID:
+			id = "NEW_CID";
+			break;
+		case XLOG_HEAP2_REWRITE:
+			id = "REWRITE";
+			break;
+	}
+
+	if (info & XLOG_HEAP_INIT_PAGE)
+		id = 

Re: [HACKERS] pg_xlogdump --stats

2014-07-04 Thread Andres Freund
Hi,

On 2014-07-04 14:42:03 +0530, Abhijit Menon-Sen wrote:
 Sure, attached.
 
 +const char *
 +heap_identify(uint8 info)
 +{
 + const char *id = NULL;
 +
 + switch (info & XLOG_HEAP_OPMASK)
 + {
 + case XLOG_HEAP_INSERT:
 + id = "INSERT";
 + break;
 + case XLOG_HEAP_DELETE:
 + id = "DELETE";
 + break;
 + case XLOG_HEAP_UPDATE:
 + id = "UPDATE";
 + break;
 + case XLOG_HEAP_HOT_UPDATE:
 + id = "HOT_UPDATE";
 + break;
 + case XLOG_HEAP_NEWPAGE:
 + id = "NEWPAGE";
 + break;
 + case XLOG_HEAP_LOCK:
 + id = "LOCK";
 + break;
 + case XLOG_HEAP_INPLACE:
 + id = "INPLACE";
 + break;
 + }
 +
 + if (info & XLOG_HEAP_INIT_PAGE)
 + id = psprintf("%s+INIT", id);
 +
 + return id;
 +}

So we're leaking memory here? For --stats that could well be relevant...
 +/*
 + * Display a single row of record counts and sizes for an rmgr or record.
 + */
 +static void
 +XLogDumpStatsRow(const char *name,
 +  uint64 n, double n_pct,
 +  uint64 rec_len, double rec_len_pct,
 +  uint64 fpi_len, double fpi_len_pct,
 +  uint64 total_len, double total_len_pct)
 +{
 + printf("%-27s %20zu (%6.02f) %20zu (%6.02f) %20zu (%6.02f) %20zu 
 (%6.02f)\n",
 +name, n, n_pct, rec_len, rec_len_pct, fpi_len, fpi_len_pct,
 +total_len, total_len_pct);
 +}

This (and following locations) is going to break on 32bit platforms. %z
indicates size_t, not 64bit. I think we're going to have to redefine the
PGAC_FUNC_SNPRINTF_LONG_LONG_INT_FORMAT callsite in configure.in to
define INT64_MODIFIER='ll/l/I64D'
and then define
#define INT64_FORMAT "%" CppAsString(INT64_MODIFIER) "d"
#define UINT64_FORMAT "%" CppAsString(INT64_MODIFIER) "u"
in c.h based on that, or something like it. This was written blindly, so
it might need further work.

Then you can use INT64_MODIFIER in your format strings. Won't be pretty,
but at least it'd work...
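
To illustrate (a sketch only; the exact spellings here are made up, not
committed code - assume configure substitutes the detected modifier):

/* in c.h, with the value chosen by configure: */
#define INT64_MODIFIER "ll"		/* or "l", or "I64" */
#define INT64_FORMAT "%" INT64_MODIFIER "d"
#define UINT64_FORMAT "%" INT64_MODIFIER "u"

/* callers can then splice the modifier into larger format strings: */
printf("%-27s %20" INT64_MODIFIER "u (%6.02f)\n", name, n, n_pct);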

 +static void
 +XLogDumpDisplayStats(XLogDumpConfig *config, XLogDumpStats *stats)
 +{

 + /*
 +  * 27 is strlen("Transaction/COMMIT_PREPARED"),
 +  * 20 is strlen("2^64"), 8 is strlen("(100.00%)")
 +  */

It's far from guaranteed that 27 will always suffice. I guess it's ok
because it doesn't cause bad breakage, but just some misalignment...

 + for (ri = 0; ri < RM_NEXT_ID; ri++)
 + {
 + uint64  count, rec_len, fpi_len;
 + const RmgrDescData *desc = &RmgrDescTable[ri];
 +
 + if (!config->stats_per_record)
 + {
 + count = stats->rmgr_stats[ri].count;
 + rec_len = stats->rmgr_stats[ri].rec_len;
 + fpi_len = stats->rmgr_stats[ri].fpi_len;
 +
 + XLogDumpStatsRow(desc->rm_name,
 +  count, 
 100*(double)count/total_count,
 +  rec_len, 
 100*(double)rec_len/total_rec_len,
 +  fpi_len, 
 100*(double)fpi_len/total_fpi_len,
 +  rec_len+fpi_len, 
 100*(double)(rec_len+fpi_len)/total_len);
 + }

Many missing spaces here.

 + else
 + {
 + for (rj = 0; rj < 16; rj++)
 + {
 + const char *id;
 +
 + count = stats->record_stats[ri][rj].count;
 + rec_len = stats->record_stats[ri][rj].rec_len;
 + fpi_len = stats->record_stats[ri][rj].fpi_len;
 +
 + /* Skip undefined combinations and ones that 
 didn't occur */
 + if (count == 0)
 + continue;
 +
 + id = desc->rm_identify(rj << 4);
 + if (id == NULL)
 + id = psprintf("0x%x", rj << 4);
 +
 + XLogDumpStatsRow(psprintf("%s/%s", 
 desc->rm_name, id),
 +  count, 
 100*(double)count/total_count,
 +  rec_len, 
 100*(double)rec_len/total_rec_len,
 +  fpi_len, 
 100*(double)fpi_len/total_fpi_len,
 +  
 rec_len+fpi_len, 100*(double)(rec_len+fpi_len)/total_len);
 + }
 + }
 + }

Some additional leaking here.

 

Re: [HACKERS] WAL replay bugs

2014-07-04 Thread Kyotaro HORIGUCHI
Hello, thank you for considering my comments, including somewhat
impractical ones.

I'll have a look at the latest patch soon.


At Fri, 4 Jul 2014 15:29:51 +0900, Michael Paquier michael.paqu...@gmail.com 
wrote in cab7npqt_fs_3jlnhywc6nzej4sbl6dgslfvcg0jbukgjep9...@mail.gmail.com
 OK, I have been working more on this patch, improving on-the-fly the
 following things on top of what Horiguchi-san has reported:
 - Moved sequence page opaque data to sequence.h, renaming it at the same time.
 - Improvement of page type identification, particularly for sequences
 using a correct opaque data structure. For gin the process is not that
 cool, but I guess that there is nothing much to do as it has no proper
 page identifier :(

Yeah, there seems to be no choice other than that.

 On Thu, Jul 3, 2014 at 7:34 PM, Kyotaro HORIGUCHI
 horiguchi.kyot...@lab.ntt.co.jp wrote:
  = bufcapt.c:
 
  - buffer_capture_remember() or so.
...
 Yes, it is worth mentioning and a comment in bufcapt.h seems enough.
 
  - buffer_capture_forget()
...
 Yes, this seems informative.
 
  - buffer_capture_is_changed()
...
 Hm, yes. This name looks fine as it remains static within bufcapt.c.
 
  == bufmgr.c:
 
  - ConditionalLockBuffer()
...
 Fixed.
 
  - LockBuffer()
...
   lwlock.h: #define LWLockHoldedExclusive(l) ((l)->exclusive > 0)
   lwlock.h: #define LWLockHoldedShared(l) ((l)->shared > 0)
 I don't think that there is much to gain with such macros, as of now
 LWLock->exclusive is only used in the code this patch introduces.

 Yeah, I think so, too :p That's simply for the case in which you
 thought that.

   If there isn't any particular concern, 'XXX:' should be removed.
 Well yes.

That's great.

  = bufpage.c:
  = bufcapt.h:
 
  - header comment
 
The file name is misspelled as 'bufcaptr.h'.
 Nicely spotted.

Thank you ;)

  - The includes in this header except for buf.h seem not to be
necessary.
 Yep.
 
  = init_env.sh:
 
  - pg_get_test_port()
    It determines the server port using PG_VERSION_NUM, but is it
    necessary? In addition, the theoretical maximum initial
    PGPORT seems to be 65535, but this script searches for a free port
    up to a number larger by 16 than the start, which could be
    beyond the edge.
 Hm, no. As of now, there is still some margin:
 PG_VERSION_NUM = 90500
 PG_VERSION_NUM % 16384 + 49152 = 57732

Yes, in practice it's no problem. I was talking about the theoretical
maximum value, seen without any presumption about the value of
PG_VERSION_NUM. In reality there's no problem before
PostgreSQL 9.82.88 comes along, and 10.0.0 doesn't cause a problem. So
I won't be too disappointed if it is not fixed.

  - pg_get_test_port()
 
    It stops with an error after 16 failures, but the message
    complains only about the final attempt. If you want to mention
    the port numbers, it should perhaps be 'port $PGPORTSTART-$PGPORT
    ..' or 'All 16 attempted ports failed' or something..
 Hm. You could complain about pg_upgrade as well now for the same
 thing. But I guess that it doesn't hurt to complain back to the caller
 about the range of ports already in use, so changed this way.

Yes, this comment also comes from a kind of
fastidiousness. I'm satisfied with it not being fixed, if you think so.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()

2014-07-04 Thread Amit Khandekar
On 3 July 2014 16:59, Simon Riggs si...@2ndquadrant.com wrote:


 I think we should say this though

 LockBufHdr(buf);
 valid = ((buf->flags & BM_VALID) != 0);
 if (valid)
 PinBuffer_Locked(buf);
 else
 UnlockBufHdr(buf);

 since otherwise we would access the buffer flags without the spinlock
 and we would leak a pin if the buffer was not valid


Ah right. That is essential.


Re: [HACKERS] WAL replay bugs

2014-07-04 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 3 Jul 2014 14:48:50 +0900, Michael Paquier michael.paqu...@gmail.com 
wrote in cab7npqq2y3qkapasac6oxxqtwbvkkxcrftua0w+dx-j3i-l...@mail.gmail.com
 TODO
...
Each type of page can be confirmed in the following way *if*
its type is previously *hinted*, except for gin.
 
btree: 32-bit magic at pd->opaque
gin  : No magic
gist : 16-bit magic at ((GISTPageOpaque *) pd->opaque)->gist_page_id
spgist   : 16-bit magic at ((SpGistPageOpaque *) pd->opaque)->spgist_page_id
hash : 16-bit magic at ((HashPageOpaque *) pd->opaque)->hasho_page_id
sequence : 16-bit magic at pd->opaque, the last 2 bytes of the page.
 
# Is it comprehensive? And correct?
 Sequence pages use the last 4 bytes. Have a look at sequence_magic in
 sequence.c.
 For btree pages we can use the last 2 bytes and a check on MAX_BT_CYCLE_ID.
 For gin, I'll investigate if it is possible to add an identifier like
 GIN_PAGE_ID, it would make the page analysis more consistent with the
 others. I am not sure for what the 8 bytes allocated for the special
 area are used now for though.
 
The majority is a 16-bit magic at the TAIL of the opaque struct. If
we can unify them into, say, a 16-bit magic at
*(uint16 *)(pd->opaque), the sizeof() labyrinth becomes stable and
simple, and other tools which need to identify the type of pages
will be happy. Do you think we can do that?
 Yes I think so. I'll raise a different thread though as this is a
 different problem than what this patch is targeting. I would even
 imagine a macro in bufpage.c able to handle that well.
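 (For illustration, such a macro might look like this - a hypothetical
 sketch assuming the unified 16-bit magic at the head of the special
 space; the name is made up:

 /* hypothetical, for bufpage.h */
 #define PageGetPageTypeMagic(page) \
 	(*((const uint16 *) PageGetSpecialPointer(page)))
 )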

Ok, that being the case, this topic should be stashed and I'll
look into it separately. Thank you.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Escaping from blocked send() reprised.

2014-07-04 Thread Kyotaro HORIGUCHI
Hello,

At Tue, 1 Jul 2014 21:21:38 +0200, Martijn van Oosterhout klep...@svana.org 
wrote in 20140701192138.gb20...@svana.org
 On Tue, Jul 01, 2014 at 12:26:43PM +0900, Kyotaro HORIGUCHI wrote:
   1. I think it's the case that there are platforms around where a
   signal won't cause send() to return EINTR and I'd be entirely
   unsurprised if SSL_write() doesn't necessarily return EINTR in that
   case.  I'm not sure what, if anything, we can do about that.
  
  man 2 send on FreeBSD has no description of EINTR.. And even
  on Linux, send won't return EINTR in most cases; at least I
  haven't seen it. So send() = -1 with EINTR seems to me to be only an
  equivalent of send() = 0. I have no idea what the
  implementer thought the difference was.
 
 Whether send() returns EINTR or not depends on whether the signal has
 been marked restartable or not. This is configurable per signal, see
 sigaction(). If the signal is marked to restart, the kernel returns
 ERESTARTHAND (IIRC) and the libc will redo the call internally.

Wow, thank you for the detailed information. I'll study that and take
it into account in future discussion.

 Default BSD does not return EINTR normally, but supports sigaction().

I guess it is for ease of keeping compatibility; that seems
reasonable.
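
(As a side note, a minimal sketch of the mechanism Martijn describes -
plain POSIX sigaction(), nothing PostgreSQL-specific:

#include <signal.h>

static volatile sig_atomic_t got_signal = 0;

static void
handler(int signo)
{
	got_signal = 1;				/* nothing async-signal-unsafe here */
}

static void
install_interrupting_handler(void)
{
	struct sigaction sa;

	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;			/* no SA_RESTART: a blocked send() then
								 * returns -1 with errno == EINTR instead
								 * of being restarted transparently */
	sa.sa_handler = handler;
	sigaction(SIGUSR1, &sa, NULL);
}
)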


have a nice day,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




[HACKERS] No toast table for pg_shseclabel but for pg_seclabel

2014-07-04 Thread Andres Freund
Hi,

postgres=# SELECT oid::regclass, reltoastrelid FROM pg_class WHERE relname IN 
('pg_seclabel', 'pg_shseclabel');
  oid  | reltoastrelid
---+---
 pg_seclabel   |  3598
 pg_shseclabel | 0
(2 rows)

Isn't that a somewhat odd choice? Why do we assume that there cannot be
lengthy seclabels on shared objects? Granted, most shared objects aren't
candidates for large amounts of data, but both users and databases don't
seem to fall into that category.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Escaping from blocked send() reprised.

2014-07-04 Thread Kyotaro HORIGUCHI
Hello, thank you for keeping this discussion moving.

  I think there's no such reasonable time. The behavior perhaps
  should be determined from another viewpoint.. One alternative would be
  to let pg_terminate_backend() have a parameter instructing forced
  shutdown (how to propagate it?), or to make a forced shutdown on
  duplicate invocation of pg_terminate_backend().
 
 Well, I think that when people call pg_terminate_backend() just once,
 they expect it to kill the target backend.  I think people will
 tolerate a short delay, like a few seconds; after all, there's no
 guarantee, even today, that the backend will hit a
 CHECK_FOR_INTERRUPTS() in less than a few hundred milliseconds.

Sure.

 But they are not going to want to have to take a second action
 to kill the backend - killing it once should be sufficient.

Hmm, that sounds persuasive. Well, do you think they would tolerate a
-force option? (Even though its technical practicality is not
clear.)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] gaussian distribution pgbench

2014-07-04 Thread Fabien COELHO


Yea. I certainly disagree with the patch in its current state because 
it copies the same 15 lines several times with a two-word difference. 
Independent of whether we want those options, I don't think that's going 
to fly.


I liked a simple static string for the different variants, which means 
replication. Factorizing out the (large) common part will mean malloc & 
sprintf. Well, why not.



OTOH, we've almost reached consensus on supporting the gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.



Sounds like a good plan.


Sigh. I'll do that as it seems to be a blocker...

The caveat that I have is that without these options there is:

(1) no report of the actual distributions in the final summary, which 
depend on the threshold value, and

(2) no included means to test the feature, so the first patch is less 
meaningful if the feature cannot be used simply and requires a custom 
script.


--
Fabien.




Re: [HACKERS] pg_xlogdump --stats

2014-07-04 Thread Abhijit Menon-Sen
At 2014-07-04 11:34:21 +0200, and...@2ndquadrant.com wrote:

 So we're leaking memory here? For --stats that could well be
 relevant...

Will fix (I think using a static buffer would be OK here).
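
(Roughly along these lines - a sketch of the idea only, not the final
fix:

const char *
heap_identify(uint8 info)
{
	static char idbuf[32];
	const char *id = NULL;

	switch (info & XLOG_HEAP_OPMASK)
	{
		/* ... cases as before ... */
	}

	if (id && (info & XLOG_HEAP_INIT_PAGE))
	{
		/* format into a static buffer instead of psprintf() per call */
		snprintf(idbuf, sizeof(idbuf), "%s+INIT", id);
		id = idbuf;
	}

	return id;
}
)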

 I think we're going to have to redefine the
 PGAC_FUNC_SNPRINTF_LONG_LONG_INT_FORMAT callsite in configure.in […]

OK, will do.

 It's far from guaranteed that 27 will always suffice. I guess it's ok
 because it doesn't cause bad breakage, but just some misalignment...

Yes, that was my thought too.

I could measure the lengths of things and align columns dynamically, but
I really doubt it's worth the effort in this case.

 Many missing spaces here. […]
 Some additional leaking here.

Will fix both.

 Looks like that should at least partially have been in the other
 patch?

Yes, sorry. Will fix.

(I'll set this back to waiting on author. Thanks for having a look.)

-- Abhijit




Re: [HACKERS] gaussian distribution pgbench

2014-07-04 Thread Andres Freund
On 2014-07-04 11:59:23 +0200, Fabien COELHO wrote:
 
 Yea. I certainly disagree with the patch in its current state because it
 copies the same 15 lines several times with a two-word difference.
 Independent of whether we want those options, I don't think that's going
 to fly.
 
 I liked a simple static string for the different variants, which means
 replication. Factorizing out the (large) common part will mean malloc &
 sprintf. Well, why not.

It sucks from a maintenance POV. And I don't see the overhead of malloc
being relevant here...

 OTOH, we've almost reached consensus on supporting the gaussian
 and exponential options in \setrandom. So I think that you should
 separate those two features into two patches, and we should apply
 the \setrandom one first. Then we can discuss whether the other patch
 should be applied or not.
 
 Sounds like a good plan.
 
 Sigh. I'll do that as it seems to be a blocker...

I think we also need documentation about the actual mathematical
behaviour of the randomness generators.

 + <para>
 +  With the gaussian option, the larger the <replaceable>threshold</>,
 +  the more frequently values close to the middle of the interval are 
 drawn,
 +  and the less frequently values close to the <replaceable>min</> and
 +  <replaceable>max</> bounds.
 +  In other words, the larger the <replaceable>threshold</>,
 +  the narrower the access range around the middle.
 +  the smaller the threshold, the smoother the access pattern
 +  distribution. The minimum threshold is 2.0 for performance.
 + </para>

The only way to actually understand the distribution here is to create a
table, insert random values, and then look at the result. That's not a
good thing.
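
For instance, the documentation could pin the behaviour down with the
actual formula. If the intent is a truncated normal distribution (my
reading of the threshold description above, not something the patch
states), that would be

    \mu = (min + max) / 2,    \sigma = (max - min) / (2 \cdot threshold)

truncated to [min, max]; the minimum threshold of 2 then keeps roughly
95% of the mass inside the interval before truncation.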

  The caveat that I have is that without these options there is:
 
  (1) no report of the actual distributions in the final summary, which
  depend on the threshold value, and
 
  (2) no included means to test the feature, so the first patch is less
  meaningful if the feature cannot be used simply and requires a custom script.

I personally agree that we likely want that as an additional
feature. Even if just because it makes the results easier to compare.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_xlogdump --stats

2014-07-04 Thread Andres Freund
On 2014-07-04 15:34:14 +0530, Abhijit Menon-Sen wrote:
 At 2014-07-04 11:34:21 +0200, and...@2ndquadrant.com wrote:
 
  So we're leaking memory here? For --stats that could well be
  relevant...
 
 Will fix (I think using a static buffer would be OK here).

In that place, yes. Then there are several other places where that
probably isn't going to work. It probably makes sense to check memory usage
after running it over a larger amount of WAL.

  It's far from guaranteed that 27 will always suffice. I guess it's ok
  because it doesn't cause bad breakage, but just some misalignment...
 
 Yes, that was my thought too.
 
 I could measure the lengths of things and align columns dynamically, but
 I really doubt it's worth the effort in this case.

Nah. Too much work for no gain ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] No toast table for pg_shseclabel but for pg_seclabel

2014-07-04 Thread Andres Freund
On 2014-07-04 11:50:17 +0200, Andres Freund wrote:
 Hi,
 
 postgres=# SELECT oid::regclass, reltoastrelid FROM pg_class WHERE relname IN 
 ('pg_seclabel', 'pg_shseclabel');
   oid  | reltoastrelid
 ---+---
  pg_seclabel   |  3598
  pg_shseclabel | 0
 (2 rows)
 
 Isn't that a somewhat odd choice? Why do we assume that there cannot be
 lengthy seclabels on shared objects? Granted, most shared objects aren't
 candidates for large amounts of data, but both users and databases don't
 seem to fall into that category.

Hm. It seems they were explicitly removed around
http://archives.postgresql.org/message-id/1309888389-sup-3853%40alvh.no-ip.org

I don't understand the reasoning there. There's a toast table for
non-shared objects. Why would we expect less data for the shared ones,
even though they're pretty basic objects and more likely to be used to
store policies and such?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_xlogdump --stats

2014-07-04 Thread gotoschool6g
 
 I'm pretty sure that most committers would want to apply the 
 rm_identity part in a separate commit first. Could you make it 
 two patches, one introducing rm_identity, and another with 
 xlogdump using it? 

Carping about the code:

const char *
clog_identify(uint8 info)
{ 
 switch (info)
 {
  case CLOG_ZEROPAGE:
   return "ZEROPAGE";
  case CLOG_TRUNCATE:
   return "TRUNCATE";
 }
 return NULL;
}

or

const char *
clog_identify(uint8 info)
{
if (info == CLOG_ZEROPAGE) return "ZEROPAGE";
if (info == CLOG_TRUNCATE) return "TRUNCATE";
return NULL;
}

is a bit faster than:

const char *
clog_identify(uint8 info)
{
 const char *id = NULL;

 switch (info)
 {
  case CLOG_ZEROPAGE:
   id = "ZEROPAGE";
   break;
  case CLOG_TRUNCATE:
   id = "TRUNCATE";
   break;
 }

 return id;
}

Re: [HACKERS] pg_xlogdump --stats

2014-07-04 Thread Andres Freund
On 2014-07-04 18:31:34 +0800, gotoschool6g wrote:
  
  I'm pretty sure that most committers would want to apply the 
  rm_identity part in a separate commit first. Could you make it 
  two patches, one introducing rm_identity, and another with 
  xlogdump using it? 
 
 Carping about the code:
 
 const char *
 clog_identify(uint8 info)
 { 
  switch (info)
  {
   case CLOG_ZEROPAGE:
 	return "ZEROPAGE";
   case CLOG_TRUNCATE:
 	return "TRUNCATE";
  }
  return NULL;
 }
 
 or
 
 const char *
 clog_identify(uint8 info)
 {
 if (info == CLOG_ZEROPAGE) return "ZEROPAGE";
 if (info == CLOG_TRUNCATE) return "TRUNCATE";
 return NULL;
 }
 
 is a bit faster than:

Any halfway decent compiler will not use a local variable here. Don't
think that matters much. Also the code isn't a performance bottleneck...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_receivexlog add synchronous mode

2014-07-04 Thread furuyao
  Thanks for reviewing the patch!
 
  I think that this refactoring patch is useful for improving source
  code readability and making the future patches simpler, whether we
  adopt your patch or not. So, barring any objections, I'm thinking to
  commit this refactoring patch.
 
 Committed!
This patch adds the --fsync-interval option.
The interface of the --fsync-interval option follows that of pg_recvlogical.c.

It does not decide whether to flush on a per-message basis;
it decides at the time receipt of messages stops.

If you specify zero for --fsync-interval, it flushes at the same timing as
the walreceiver.
If you do not specify --fsync-interval, it flushes only when switching
WAL files, as in the past.

Regards,

--
Furuya Osamu



pg_receivexlog-add-fsync-interval-v1.patch
Description: pg_receivexlog-add-fsync-interval-v1.patch



Re: [HACKERS] log_error_verbosity and unexpected errors

2014-07-04 Thread Oskari Saarenmaa

On 02/07/14 22:10, Tom Lane wrote:

Greg Stark st...@mit.edu writes:

I think log_error_verbosity is a strange variable. It's useless for
expected user-facing errors but essential for unexpected errors that
indicate bugs in the code -- and you can only have it on for
everything or off for everything.



I'm finding I usually want it set to 'verbose' for anything that
PANICs or is generated by an elog() but it's just noise for anything
generated by an ereport() and is ERROR or below.


[...]


[ thinks for a bit... ]  A slightly cleaner approach is to nominate
a specified set of SQLSTATEs, certainly including XX000 and perhaps
some others, as being ones that force verbose reporting.  That would
have the same practical effect as far as elogs go, but wouldn't break
the nominal functional equivalence.

And that brings up the previous work on SQLSTATE-dependent choices
about whether to log at all.  I remember a patch was submitted for
that but don't remember offhand why it didn't get committed.  ISTM
we should think about reviving that and making the choice be not just
log or not, but no log, terse log, normal log, verbose log.


I had a patch for making log_min_error_statement configurable per 
SQLSTATE in https://commitfest.postgresql.org/action/patch_view?id=1360 
but you pointed out various issues in it and I didn't have time to 
update it for 9.4.  I'm going to rewrite it based on the comments and 
submit it again for a 9.5 commitfest.


The same mechanism could be used to set verbosity per SQLSTATE.

/ Oskari





Re: [HACKERS] pg_xlogdump --stats

2014-07-04 Thread Abhijit Menon-Sen
At 2014-07-04 11:34:21 +0200, and...@2ndquadrant.com wrote:

 I think we're going to have to redefine the
 PGAC_FUNC_SNPRINTF_LONG_LONG_INT_FORMAT callsite in configure.in to
 define INT64_MODIFIER='ll/l/I64D'

I've attached a patch to do this, and also add INT64_MODIFIER to
pg_config.h.in: 0001-modifier.diff

I reran autoconf, and just for convenience I've attached the resulting
changes to configure: 0002-configure.diff

Then there are the rm_identify changes: 0003-rmid.diff

Finally, the xlogdump patch using INT64_MODIFIER: 0004-xlogdump.diff

I can confirm that this series applies in-order to master, and that the
result builds cleanly (including after each patch) on my machine, and
that the resulting pg_xlogdump works as expected.

NOTE: I do not know what to do about pg_config.h.win32. If someone tells
me what to do, I can submit another patch.

 Some additional leaking here.

Two of the extra calls to psprintf in pg_xlogdump happen at most
RM_MAX_ID*16 (i.e. O(record-types) not O(records)) times, and the other
two happen just before exit. It would be easy to use a static buffer and
snprintf, but I don't think it's worth doing in this case.

-- Abhijit, hoping with crossed fingers to not forget attachments now.
diff --git a/config/c-library.m4 b/config/c-library.m4
index 8f45010..4821a61 100644
--- a/config/c-library.m4
+++ b/config/c-library.m4
@@ -221,22 +221,22 @@ HAVE_POSIX_SIGNALS=$pgac_cv_func_posix_signals
 AC_SUBST(HAVE_POSIX_SIGNALS)])# PGAC_FUNC_POSIX_SIGNALS
 
 
-# PGAC_FUNC_SNPRINTF_LONG_LONG_INT_FORMAT
+# PGAC_FUNC_SNPRINTF_LONG_LONG_INT_MODIFIER
 # ---
-# Determine which format snprintf uses for long long int.  We handle
-# %lld, %qd, %I64d.  The result is in shell variable
-# LONG_LONG_INT_FORMAT.
+# Determine which length modifier snprintf uses for long long int.  We
+# handle ll, q, and I64.  The result is in shell variable
+# LONG_LONG_INT_MODIFIER.
 #
 # MinGW uses '%I64d', though gcc throws an warning with -Wall,
 # while '%lld' doesn't generate a warning, but doesn't work.
 #
-AC_DEFUN([PGAC_FUNC_SNPRINTF_LONG_LONG_INT_FORMAT],
-[AC_MSG_CHECKING([snprintf format for long long int])
-AC_CACHE_VAL(pgac_cv_snprintf_long_long_int_format,
-[for pgac_format in '%lld' '%qd' '%I64d'; do
+AC_DEFUN([PGAC_FUNC_SNPRINTF_LONG_LONG_INT_MODIFIER],
+[AC_MSG_CHECKING([snprintf length modifier for long long int])
+AC_CACHE_VAL(pgac_cv_snprintf_long_long_int_modifier,
+[for pgac_modifier in 'll' 'q' 'I64'; do
 AC_TRY_RUN([#include <stdio.h>
 typedef long long int ac_int64;
-#define INT64_FORMAT "$pgac_format"
+#define INT64_FORMAT "%${pgac_modifier}d"
 
 ac_int64 a = 2001;
 ac_int64 b = 4005;
@@ -258,19 +258,19 @@ int does_int64_snprintf_work()
 main() {
   exit(! does_int64_snprintf_work());
 }],
-[pgac_cv_snprintf_long_long_int_format=$pgac_format; break],
+[pgac_cv_snprintf_long_long_int_modifier=$pgac_modifier; break],
 [],
-[pgac_cv_snprintf_long_long_int_format=cross; break])
+[pgac_cv_snprintf_long_long_int_modifier=cross; break])
 done])dnl AC_CACHE_VAL
 
-LONG_LONG_INT_FORMAT=''
+LONG_LONG_INT_MODIFIER=''
 
-case $pgac_cv_snprintf_long_long_int_format in
+case $pgac_cv_snprintf_long_long_int_modifier in
   cross) AC_MSG_RESULT([cannot test (not on host machine)]);;
-  ?*)AC_MSG_RESULT([$pgac_cv_snprintf_long_long_int_format])
- LONG_LONG_INT_FORMAT=$pgac_cv_snprintf_long_long_int_format;;
+  ?*)AC_MSG_RESULT([$pgac_cv_snprintf_long_long_int_modifier])
+ LONG_LONG_INT_MODIFIER=$pgac_cv_snprintf_long_long_int_modifier;;
   *) AC_MSG_RESULT(none);;
-esac])# PGAC_FUNC_SNPRINTF_LONG_LONG_INT_FORMAT
+esac])# PGAC_FUNC_SNPRINTF_LONG_LONG_INT_MODIFIER
 
 
 # PGAC_FUNC_SNPRINTF_ARG_CONTROL

diff --git a/configure.in b/configure.in
index c938a5d..6afc818 100644
--- a/configure.in
+++ b/configure.in
@@ -1636,35 +1636,38 @@ fi
 # If we found long int is 64 bits, assume snprintf handles it.  If
 # we found we need to use long long int, better check.  We cope with
 # snprintfs that use %lld, %qd, or %I64d as the format.  If none of these
-# work, fall back to our own snprintf emulation (which we know uses %lld).
+# works, fall back to our own snprintf emulation (which we know uses %lld).
 
 if test $HAVE_LONG_LONG_INT_64 = yes ; then
   if test $pgac_need_repl_snprintf = no; then
-PGAC_FUNC_SNPRINTF_LONG_LONG_INT_FORMAT
-if test "$LONG_LONG_INT_FORMAT" = ""; then
+PGAC_FUNC_SNPRINTF_LONG_LONG_INT_MODIFIER
+if test "$LONG_LONG_INT_MODIFIER" = ""; then
   # Force usage of our own snprintf, since system snprintf is broken
   pgac_need_repl_snprintf=yes
-  LONG_LONG_INT_FORMAT='%lld'
+  LONG_LONG_INT_MODIFIER='ll'
 fi
   else
 # Here if we previously decided we needed to use our own snprintf
-LONG_LONG_INT_FORMAT='%lld'
+LONG_LONG_INT_MODIFIER='ll'
   fi
-  LONG_LONG_UINT_FORMAT=`echo $LONG_LONG_INT_FORMAT | sed 's/d$/u/'`
-  INT64_FORMAT="\"$LONG_LONG_INT_FORMAT\""
-  UINT64_FORMAT="\"$LONG_LONG_UINT_FORMAT\""
 

Re: [HACKERS] No toast table for pg_shseclabel but for pg_seclabel

2014-07-04 Thread Kohei KaiGai
There is no other reason than what Alvaro mentioned upthread.
We intended to store SELinux security labels (less than 100 bytes at most),
so I didn't think it would actually lead to any problem.

On the other hand, pg_seclabel was merged in a different development cycle.
We didn't have a deep discussion about whether pg_seclabel needed a toast
table; I added one mechanically.

It never meant that we need a toast table for local objects but don't need
one for shared database objects.

Thanks,

2014-07-04 19:11 GMT+09:00 Andres Freund and...@2ndquadrant.com:
 On 2014-07-04 11:50:17 +0200, Andres Freund wrote:
 Hi,

 postgres=# SELECT oid::regclass, reltoastrelid FROM pg_class WHERE relname 
 IN ('pg_seclabel', 'pg_shseclabel');
   oid  | reltoastrelid
 ---+---
  pg_seclabel   |  3598
  pg_shseclabel | 0
 (2 rows)

 Isn't that a somewhat odd choice? Why do we assume that there cannot be
 lengthy seclabels on shared objects? Granted, most shared objects aren't
 candidates for large amounts of data, but both users and databases don't
 seem to fall into that category.

 Hm. It seems they were explicitly removed around
 http://archives.postgresql.org/message-id/1309888389-sup-3853%40alvh.no-ip.org

 I don't understand the reasoning there. There's a toast table for
 non-shared objects. Why would we expect less data for the shared ones,
 even though they're pretty basic objects and more likely to be used to
 store policies and such?

 Greetings,

 Andres Freund

 --
  Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] introduce XLogLockBlockRangeForCleanup()

2014-07-04 Thread Abhijit Menon-Sen
Updated patch attached, thanks.

Amit: what's your conclusion from the review?

-- Abhijit
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 5f9fc49..dc90c02 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -501,33 +501,9 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
 	 * here during crash recovery.
 	 */
 	if (HotStandbyActiveInReplay())
-	{
-		BlockNumber blkno;
-
-		for (blkno = xlrec->lastBlockVacuumed + 1; blkno < xlrec->block; blkno++)
-		{
-			/*
-			 * We use RBM_NORMAL_NO_LOG mode because it's not an error
-			 * condition to see all-zero pages.  The original btvacuumpage
-			 * scan would have skipped over all-zero pages, noting them in FSM
-			 * but not bothering to initialize them just yet; so we mustn't
-			 * throw an error here.  (We could skip acquiring the cleanup lock
-			 * if PageIsNew, but it's probably not worth the cycles to test.)
-			 *
-			 * XXX we don't actually need to read the block, we just need to
-			 * confirm it is unpinned. If we had a special call into the
-			 * buffer manager we could optimise this so that if the block is
-			 * not in shared_buffers we confirm it as unpinned.
-			 */
-			buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno,
-			RBM_NORMAL_NO_LOG);
-			if (BufferIsValid(buffer))
-			{
-LockBufferForCleanup(buffer);
-UnlockReleaseBuffer(buffer);
-			}
-		}
-	}
+		XLogLockBlockRangeForCleanup(xlrec->node, MAIN_FORKNUM,
+	 xlrec->lastBlockVacuumed + 1,
+	 xlrec->block);
 
 	/*
 	 * If we have a full-page image, restore it (using a cleanup lock) and
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b7829ff..d84f83a 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -287,10 +287,6 @@ XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init)
  *
  * In RBM_ZERO and RBM_ZERO_ON_ERROR modes, if the page doesn't exist, the
  * relation is extended with all-zeroes pages up to the given block number.
- *
- * In RBM_NORMAL_NO_LOG mode, we return InvalidBuffer if the page doesn't
- * exist, and we don't check for all-zeroes.  Thus, no log entry is made
- * to imply that the page should be dropped or truncated later.
  */
 Buffer
 XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
@@ -331,8 +327,6 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 			log_invalid_page(rnode, forknum, blkno, false);
 			return InvalidBuffer;
 		}
-		if (mode == RBM_NORMAL_NO_LOG)
-			return InvalidBuffer;
 		/* OK to extend the file */
 		/* we do this in recovery only - no rel-extension lock needed */
 		Assert(InRecovery);
@@ -375,6 +369,62 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	return buffer;
 }
 
+/*
+ * XLogLockBlockRangeForCleanup is used in Hot Standby mode to emulate the
+ * locking effects imposed by VACUUM for nbtrees.
+ */
+void
+XLogLockBlockRangeForCleanup(RelFileNode rnode, ForkNumber forkNum,
+			 BlockNumber startBlkno,
+			 BlockNumber uptoBlkno)
+{
+	BlockNumber blkno;
+	BlockNumber lastblock;
+	BlockNumber	endingBlkno;
+	Buffer		buffer;
+	SMgrRelation smgr;
+
+	Assert(startBlkno != P_NEW);
+	Assert(uptoBlkno != P_NEW);
+
+	/* Open the relation at smgr level */
+	smgr = smgropen(rnode, InvalidBackendId);
+
+	/*
+	 * Create the target file if it doesn't already exist.  This lets us cope
+	 * if the replay sequence contains writes to a relation that is later
+	 * deleted.  (The original coding of this routine would instead suppress
+	 * the writes, but that seems like it risks losing valuable data if the
+	 * filesystem loses an inode during a crash.  Better to write the data
+	 * until we are actually told to delete the file.)
+	 */
+	smgrcreate(smgr, forkNum, true);
+
+	lastblock = smgrnblocks(smgr, forkNum);
+
+	endingBlkno = uptoBlkno;
+	if (lastblock < endingBlkno)
+		endingBlkno = lastblock;
+
+	for (blkno = startBlkno; blkno < endingBlkno; blkno++)
+	{
+		/*
+		 * All we need to do here is prove that we can lock each block
+		 * with a cleanup lock. It's not an error to see all-zero pages
+		 * here because the original btvacuumpage would not have thrown
+		 * an error either.
+		 *
+		 * We don't actually need to read the block, we just need to
+		 * confirm it is unpinned.
+		 */
+		buffer = GetBufferWithoutRelcache(rnode, forkNum, blkno);
+		if (BufferIsValid(buffer))
+		{
+			LockBufferForCleanup(buffer);
+			UnlockReleaseBuffer(buffer);
+		}
+	}
+}
 
 /*
  * Struct actually returned by XLogFakeRelcacheEntry, though the declared
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 07ea665..ce5ff70 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -222,8 +222,6 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
  * current physical EOF; that is likely to cause problems in md.c when
  

[HACKERS] [RFC: bug fix?] Connection attempt block forever when the synchronous standby is not running

2014-07-04 Thread MauMau

Hello,

My customer reported a strange connection hang problem.  He and I couldn't 
reproduce it.  I haven't been able to understand the cause, but I can think 
of one hypothesis.  Could you give me your opinions on whether my hypothesis 
is correct, and a direction on how to fix the problem?  I'm willing to 
submit a patch if necessary.



[Problem]
The customer is using synchronous streaming replication with PostgreSQL 
9.2.8.  The cluster consists of two nodes.


He performed an archive recovery test like this:

1. Take a base backup.  At that time, some notable settings in 
postgresql.conf are:

synchronous_standby_names = 'node2'
autovacuum = on
# synchronous_commit is commented out, so it's on by default

2. Some update operations.  I don't know what.

3. Shut down the primary and promote the standby.

4. Shut down the new primary.

5. Perform archive recovery.  That is, restore the base backup, create 
recovery.conf, and do pg_ctl start.


6. Immediately after the archive recovery is complete, connect to the 
database server and perform some queries to check user data.


The steps 5 and 6 are done in some recovery script.

However, the connection attempt in step 6 got stuck for 12 hours, and the 
test was canceled.  The stack trace was:


#0  0x003f4badf258 in poll () from /lib64/libc.so.6
#1  0x00619b94 in WaitLatchOrSocket ()
#2  0x00640c4c in SyncRepWaitForLSN ()
#3  0x00491c18 in RecordTransactionCommit ()
#4  0x00491d98 in CommitTransaction ()
#5  0x00493135 in CommitTransactionCommand ()
#6  0x0074938a in InitPostgres ()
#7  0x0066ddd7 in PostgresMain ()
#8  0x00627d81 in PostmasterMain ()
#9  0x005c4803 in main ()

The connection attempt is waiting for a reply from the standby.  This is 
strange, because we didn't anticipate that the connection establishment (and 
subsequent SELECT queries) would update something and write some WAL.  The 
doc says:


http://www.postgresql.org/docs/current/static/warm-standby.html#SYNCHRONOUS-REPLICATION

When requesting synchronous replication, each commit of a write transaction 
will wait until confirmation is received that the commit has been written to 
the transaction log on disk of both the primary and standby server.

...
Read only transactions and transaction rollbacks need not wait for replies 
from standby servers. Subtransaction commits do not wait for responses from 
standby servers, only top-level commits.



[Hypothesis]
Why does the connection processing emit WAL?

Probably, it did page-at-a-time vacuum during access to pg_database and 
pg_authid for client authentication.  src/backend/access/heap/README.HOT 
describes:


Effectively, space reclamation happens during tuple retrieval when the
page is nearly full (<10% free) and a buffer cleanup lock can be
acquired.  This means that UPDATE, DELETE, and SELECT can trigger space
reclamation, but often not during INSERT ... VALUES because it does
not retrieve a row.

But the customer could not reproduce the problem when he performed the same 
archive recovery from the same base backup again.  Why?  I guess the 
autovacuum daemon vacuumed the system catalogs before he attempted to 
connect to the database.


Is this correct?
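
If the hypothesis holds, the mechanism can be sketched like this (standalone
C, not PostgreSQL source; all names here are invented for illustration):

#include <stdbool.h>
#include <stdio.h>

static bool wrote_wal = true;				/* HOT pruning emitted a WAL record */
static bool sync_standby_configured = true;	/* synchronous_standby_names is set */
static bool synchronous_commit_on = true;	/* the default */

static void
commit_transaction(void)
{
	/* ... commit record written and flushed locally ... */
	if (wrote_wal && sync_standby_configured && synchronous_commit_on)
		printf("waiting for standby ack (blocks while no standby runs)\n");
}

int
main(void)
{
	commit_transaction();		/* even a "read only" session can get here */
	return 0;
}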


[How to fix]
Of course, adding -o '-c synchronous_commit=local' or -o '-c 
synchronous_standby_names=' to pg_ctl start in the recovery script would 
prevent the problem.


But isn't there anything to fix in PostgreSQL?  I think the doc needs 
improvement so that users won't misunderstand that only write transactions 
would block at commit.


Do you think something else should be done?  I guess pg_basebackup, 
pg_isready, and PQping() called in pg_ctl -w start/restart would block 
likewise, and I'm afraid users don't anticipate it.  pg_upgrade appears to 
set synchronous_commit to local when starting the database server.


Regards
MauMau





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Issue while calling new PostgreSQL command from a Java Application

2014-07-04 Thread Tom Lane
Ashoke s.ash...@gmail.com writes:
 Thank you Ashutosh. That was the issue. But could you please explain why
 it worked from command line?

Simple vs extended query protocol, probably --- the former avoids copying
the constructed parsetree, but I think the latter doesn't.  Or maybe the
JDBC driver tried to prepare the query; a prepared statement is most
certainly going to copy the parsetree.

In general, if you add a field to any node type, you'd better go through
backend/nodes/ and teach all the relevant functions about it.  What I tend
to do is grep for one of the existing fields in the struct and see which
functions that reference it need additions.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cluster name in ps output

2014-07-04 Thread Tom Lane
Vik Fearing vik.fear...@dalibo.com writes:
 You mean that if synchronous_standby_names is an empty it automatically
 should be set to cluster_name?

 No, I mean that synchronous_standby_names should look at cluster_name
 first, and if it's not set then unfortunately look at application_name
 for backward compatibility.

I think you're failing to explain yourself very well.  What you really
mean is that we should use cluster_name not application_name to determine
the name that a standby server reports to the master.

There's something to that, perhaps, but I think that the mechanism we use
for doing the reporting involves the application_name parameter passed
through libpq.  It might be a bit painful to change that, and I'm not
entirely sure it'd be less confusing.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] No toast table for pg_shseclabel but for pg_seclabel

2014-07-04 Thread Tom Lane
Kohei KaiGai kai...@kaigai.gr.jp writes:
 There is no other reason than what Alvaro mentioned upthread.
 We intended to store SELinux security labels (less than 100 bytes at most),
 so I didn't think it would actually lead to any problem.

 On the other hand, pg_seclabel was merged in a different development cycle.
 We didn't have a deep discussion about whether pg_seclabel needed a toast
 table; I added one mechanically.

So maybe we should get rid of the toast table for pg_seclabel.  One less
catalog table for a feature that hardly anyone is using seems like a fine
idea to me ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-07-04 Thread Abhijit Menon-Sen
At 2014-07-04 19:27:10 +0530, rahilasye...@gmail.com wrote:

 Please find attached patches with no whitespace error and improved
 formatting.

Thanks.

There are still numerous formatting changes required, e.g. spaces around
= and correct formatting of comments. And git diff --check still has
a few whitespace problems. I won't point these out one by one, but maybe
you should run pgindent.

 diff --git a/src/backend/access/transam/xlog.c 
 b/src/backend/access/transam/xlog.c
 index 3f92482..39635de 100644
 --- a/src/backend/access/transam/xlog.c
 +++ b/src/backend/access/transam/xlog.c
 @@ -60,6 +60,9 @@
  #include "storage/spin.h"
  #include "utils/builtins.h"
  #include "utils/guc.h"
 +#include "utils/pg_lzcompress.h"
 +#include "utils/pg_snappy.h"
 +#include "utils/pg_lz4.h"
  #include "utils/ps_status.h"
  #include "utils/relmapper.h"
  #include "utils/snapmgr.h"

This hunk still fails to apply to master (due to the subsequent
inclusion of memutils.h), but I just added it in by hand.

 +int  compress_backup_block = false;

Should be initialised to BACKUP_BLOCK_COMPRESSION_OFF as noted earlier.

 + /* Allocates memory for compressed backup blocks according to the 
 compression
 + * algorithm used.Once per session at the time of insertion of first XLOG
 + * record.
 + * This memory stays till the end of session. OOM is handled by making 
 the
 + * code proceed without FPW compression*/

I suggest something like this:

/*
 * Allocates pages to store compressed backup blocks, with the page
 * size depending on the compression algorithm selected. These pages
 * persist throughout the life of the backend. If the allocation
 * fails, we disable backup block compression entirely.
 */

But though the code looks better locally than before, the larger problem
is that this is still unsafe. As Pavan pointed out, XLogInsert is called
from inside critical sections, so we can't allocate memory here.

Could you look into his suggestions of other places to do the
allocation, please?

 + static char *compressed_pages[XLR_MAX_BKP_BLOCKS];
 + static bool compressed_pages_allocated = false;

These declarations can't just be in the middle of the function, they'll
have to move up to near the top of the closest enclosing scope (wherever
you end up doing the allocation).

 + if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF &&
 + compressed_pages_allocated!= true)

No need for != true with a boolean.

 + if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_SNAPPY)
 + buffer_size += snappy_max_compressed_length(BLCKSZ);
 + else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_LZ4)
 + buffer_size += LZ4_compressBound(BLCKSZ);
 + else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_PGLZ)
 + buffer_size += PGLZ_MAX_OUTPUT(BLCKSZ);

There's nothing wrong with this, but given that XLR_MAX_BKP_BLOCKS is 4,
I would just allocate pages of size BLCKSZ. But maybe that's just me.

 + bkpb->block_compression=BACKUP_BLOCK_COMPRESSION_OFF;

Wouldn't it be better to set

bkpb->block_compression = compress_backup_block;

once earlier instead of setting it that way once and setting it to
BACKUP_BLOCK_COMPRESSION_OFF in two other places?

 + if(VARSIZE(buf) < orig_len-2)
 + /* successful compression */
 + {
 + *len = VARSIZE(buf);
 + return (char *) buf;
 + }
 + else
 + return NULL;
 +}

That comment after the if just has to go. It's redundant given the
detailed explanation above anyway. Also, I'd strongly prefer checking
for failure rather than success here, i.e.

if (VARSIZE(buf) >= orig_len - 2)
return NULL;

*len = VARSIZE(buf); /* Doesn't this need + VARHDRSIZE? */

return (char *) buf;

I don't quite remember what I suggested last time, but if it was what's
in the patch now, I apologise.

 + /* Decompress if backup block is compressed*/
 + else if (VARATT_IS_COMPRESSED((struct varlena *) blk)
 +  
 bkpb.block_compression!=BACKUP_BLOCK_COMPRESSION_OFF)

If you're using VARATT_IS_COMPRESSED() to detect compression, don't you
need SET_VARSIZE_COMPRESSED() in CompressBackupBlock? pglz_compress()
does it for you, but the other two algorithms don't.

But now that you've added bkpb.block_compression, you should be able to
avoid VARATT_IS_COMPRESSED() altogether, unless I'm missing something.
What do you think?

 +/*
 + */
 +static const struct config_enum_entry backup_block_compression_options[] = {
 + {"off", BACKUP_BLOCK_COMPRESSION_OFF, false},
 + {"false", BACKUP_BLOCK_COMPRESSION_OFF, true},
 + {"no", BACKUP_BLOCK_COMPRESSION_OFF, true},
 + {"0", BACKUP_BLOCK_COMPRESSION_OFF, true},
 + {"pglz", BACKUP_BLOCK_COMPRESSION_PGLZ, true},
 + {"snappy", BACKUP_BLOCK_COMPRESSION_SNAPPY, true},
 + {"lz4", 

Re: [HACKERS] RLS Design

2014-07-04 Thread Stephen Frost
Kaigai,

On Thursday, July 3, 2014, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

 Sorry for my late responding, now I'm catching up the discussion.

  * Robert Haas (robertmh...@gmail.com) wrote:
   On Tue, Jul 1, 2014 at 3:20 PM, Dean Rasheed dean.a.rash...@gmail.com
  wrote:
If RLS quals are instead regarded as constraints on access, and
multiple policies apply, then it seems that the quals should now be
combined with AND rather than OR, right?
 
  I do feel that RLS quals are constraints on access, but I don't see how
  it follows that multiple quals should be AND'd together because of that.
  I view the RLS policies on each table as being independent and standing
  alone regarding what can be seen.  If you have access to a table today
  through policy A, and then later policy B is added, using AND would mean
  that the set of rows returned is less than if only policy A existed.
  That doesn't seem correct to me.
 
 It seems to me direction of the constraints (RLS-policy) works to is
 reverse.

 In case when we have no RLS-policy, 100% of rows are visible isn't it?


No, as outlined later, the table would appear empty if no policies exist
and RLS is enabled for the table.


 Addition of a constraint usually reduces the number of rows being visible,
 or same number of rows at least. Constraint shall never work to the
 direction
 to increase the number of rows being visible.


Can you clarify where this is coming from..?  It sounds like you're
referring to an existing implementation and, if so, it'd be good to get
more information on how that works exactly.


 If multiple RLS-policies are connected with OR-operator, the first policy
 works to the direction to reduce number of visible rows, but the second
 policy works to the reverse direction.


This isn't accurate, as mentioned. Each policy stands alone to define what
is visible through it and if no policy exists then no rows are visible.


 If we would have OR'd RLS-policy, how does it merged with user given
 qualifiers with?


The RLS quals are all applied together with OR's and the result is AND'd
with any user quals provided. This is only when multiple policies are being
applied for a given query and seems pretty straight forward to me.


 For example, if RLS-policy of t1 is (t1.credential  get_user_credential)
 and user's query is:
   SELECT * FROM t1 WHERE t1.x = t1.x;
 Do you think RLS-policy shall be merged with OR'd form?


Only the RLS policies are OR'd together, not user provided quals. The above
would result in:

Where t1.x = t1.x and (t1.credential  get_user_credential)

If another policy also applies for this query, such as t1.cred2 
get_user_credential then we would have:

Where t1.x = t1.x and (t1.credential  get_user_credential OR t1.cred2 
get_user_credential)

This is similar to how roles work- your overall access includes all access
granted to any roles you are a member of. You don't need SELECT rights
granted to every role you are a member of to select from the table.
Additionally, if an admin wants to AND the quals together then they can
simply create a policy which does that rather than have 2 policies.

  Yeah, maybe.  I intuitively feel that OR would be more useful, so it
   would be nice to find a design where that makes sense.  But it depends
   a lot, in my view, on what syntax we end up with.  For example,
   suppose we add just one command:
  
   ALTER TABLE table_name FILTER [ role_name | PUBLIC ] USING qual;
  
   If the given role inherits from multiple roles that have different
   filters, I think the user will naturally expect all of the filters to
   be applied.
 
  Agreed.
 
   But you could do it other ways.  For example:
  
   ALTER TABLE table_name [ NO ] ROW LEVEL SECURITY; ALTER TABLE
   table_name GRANT ROW ACCESS TO role_name USING qual;
  
   If a table is set to NO ROW LEVEL SECURITY then it behaves just like
   it does now: anyone who accesses it sees all the rows, restricted to
   those columns for which they have permission.  If the table is set to
   ROW LEVEL SECURITY then the default is to show no rows.  The second
   command then allows access to a subset of the rows for a give role
   name.  In this case, it is probably logical for access to be combined
   via OR.
 
  I can see value is having a table-level option to indicate if RLS is
 applied
  for that table or not, but I had been thinking we'd just automatically
 manage
  that.  That is to say that once you define an RLS policy for a table, we
  go look and see what policy should be applied in each case.  With the
 user
  able to control that, what happens if they say row security on the
 table
  and there are no policies?  All access would show the table as empty?
  What
  if policies exist and they decide to 'turn off' RLS for the table-
 suddenly
  everyone can see all the rows?
 
  My answers to the above (which are making me like the idea more,
  actually...) would be:
 
  Yes, if they turn on RLS for the 

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-07-04 Thread Abhijit Menon-Sen
At 2014-07-02 15:51:08 -0700, p...@heroku.com wrote:

 Attached revision factors in everyone's concerns here, I think.

Is anyone planning to review Peter's revised patch?

 These new measures make the coding somewhat more complex than that of
 the initial version, although overall the parser code added by this
 patch is almost entirely confined to code paths concerned only with
 producing diagnostic messages to help users.

Yes, the new patch looks quite a bit more involved than earlier, but if
that's what it takes to provide a useful HINT, I guess it's not too bad.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-07-04 Thread Abhijit Menon-Sen
At 2014-07-04 21:02:33 +0530, a...@2ndquadrant.com wrote:

  +/*
  + */
  +static const struct config_enum_entry backup_block_compression_options[] = 
  {

Oh, I forgot to mention that the configuration setting changes are also
pending. I think we had a working consensus to use full_page_compression
as the name of the GUC. As I understand it, that'll accept an algorithm
name as an argument while we're still experimenting, but eventually once
we select an algorithm, it'll become just a boolean (and then we don't
need to put algorithm information into BkpBlock any more either).

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlogdump --stats

2014-07-04 Thread Alvaro Herrera
Andres Freund wrote:
 On 2014-07-04 18:31:34 +0800, gotoschool6g wrote:

  Carp the code:
  
  const char *
  clog_identify(uint8 info)
  {
  	switch (info)
  	{
  		case CLOG_ZEROPAGE:
  			return "ZEROPAGE";
  		case CLOG_TRUNCATE:
  			return "TRUNCATE";
  			break;
  	}
  	return NULL;
  }

I agree that performance is secondary here.  The part that I don't quite
like in all these blocks is the fact that they initialize the return
value to NULL beforehand, and there's no 'default' label.  Now, I see
that pg_xlogdump is checking for NULL return, but I'm not sure this is
the best possible API.  Shouldn't we perhaps return a different string
that indicates there is no known description?

Also, are we certain that we're never going to need anything other than
the info to identify the record?  In practice we seem to follow that
rule currently, but it's not totally out of the question that someday
the rm_redo function uses more than just info to identify the record.
I wonder if it'd be better to pass the whole record instead and have the
rm_identify function pull out the info if it's all it needs, but leave
the possibility open that it could read, say, some header bytes in the
record data.

Also I didn't check how you handle the init bit in RM_HEAP and
RM_HEAP2.  Are we just saying that insert (init) is a different record
type from insert?  Maybe that's okay, but I'm not 100% sure.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_xlogdump --stats

2014-07-04 Thread Abhijit Menon-Sen
At 2014-07-04 13:43:46 -0400, alvhe...@2ndquadrant.com wrote:

 Now, I see that pg_xlogdump is checking for NULL return, but I'm not
 sure this is the best possible API. 

Note that the patched pg_xlogdump will display rm_name/0xNN for any
records that rm_identify returns a NULL for.

Earlier, when there was the possibility that new record types could be
added without updating pg_xlogdump's identification code, this made good
sense. Now one could argue that pg_xlogdump should fully depend on every
record in WAL being properly identified.

If that's the case, rm_identify could return UNKNOWN, and pg_xlogdump
could use the return value without checking. I considered that, but the
only thing that stopped me was the thought that if a weird record DOES
show up in WAL somehow, it'd be pretty handy to (a) know the value of
xl_info, and (b) see if there's more than one kind (per-rmid).

But I don't feel strongly enough about it that I'd object to displaying
just UNKNOWN.
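
For illustration, a hedged sketch of that fallback (standalone C, not the
patch's code; the helper name and the exact "%s/0x%02X" spelling are
assumptions):

#include <stdio.h>
#include <stdint.h>

static const char *
record_label(const char *rm_name, uint8_t info, const char *id,
			 char *buf, size_t buflen)
{
	if (id != NULL)
		return id;				/* rm_identify recognized the record */
	/* unknown record: keep the rmgr name and raw xl_info visible */
	snprintf(buf, buflen, "%s/0x%02X", rm_name, (unsigned) info);
	return buf;
}

int
main(void)
{
	char		buf[32];

	puts(record_label("Heap", 0x90, NULL, buf, sizeof(buf)));
	return 0;
}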

 I wonder if it'd be better to pass the whole record instead and have
 the rm_identify function pull out the info if it's all it needs, but
 leave the possibility open that it could read, say, some header bytes
 in the record data.

I don't *have* an XLogRecord at the point where I'm calling rm_identify.
I could call rm_identify earlier, and either store the name in the stats
structure, or hash the name and use the hash value to store that record
type's statistics.

We don't even have any other potential callers for rm_identify. Adding
it and making it significantly more difficult to use for the only code
that actually needs it… no, I pretty much hate that idea.

Personally, I think it's fine to keep rm_identify the way it is. It's
hardly very difficult code, and if the assumption that records can be
identified by xl_info ever becomes invalid, this isn't the only code
that will need to be changed either.

(Otherwise, I'd actually prefer to keep all the identification code in
pg_xlogdump, i.e. the pre-rm_identify version of my patch. At least that
would make it possible to maintain a version that could be built against
9.3, the way Marti did.)

 Also I didn't check how you handle the init bit in RM_HEAP and
 RM_HEAP2.  Are we just saying that insert (init) is a different
 record type from insert?

Yes, that's what the patch does currently. X vs. X+INIT.
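
For instance, distinguishing the two could look roughly like this (standalone
C sketch; the bit values follow heapam's conventions as I understand them,
stated here as an assumption rather than lifted from the patch):

#include <stdio.h>
#include <stdint.h>

#define XLOG_HEAP_OPMASK	0x70	/* bits selecting the heap operation */
#define XLOG_HEAP_INSERT	0x00
#define XLOG_HEAP_INIT_PAGE	0x80	/* record also initializes the page */

static const char *
heap_identify(uint8_t info)
{
	if ((info & XLOG_HEAP_OPMASK) == XLOG_HEAP_INSERT)
		return (info & XLOG_HEAP_INIT_PAGE) ? "INSERT+INIT" : "INSERT";
	return NULL;				/* other operations elided in this sketch */
}

int
main(void)
{
	printf("%s, %s\n", heap_identify(0x00), heap_identify(0x80));
	return 0;
}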

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] buildfarm and rolling release distros

2014-07-04 Thread Christoph Berg
Re: Craig Ringer 2014-07-02 53b39638.9010...@2ndquadrant.com
  +1.  The buildfarm has one such member already, anchovy, and I recall it
  having given at least one helpful forewarning.  It shows as Arch Linux
  testing [updated daily], which is sufficient annotation for me.  Its 
  failure
  rate has been low; member-caused failures due to ENOSPC and other miscellany
  are a good deal more common.
 
 Yep - I see early notice of new gcc special behaviour, etc as quite
 valuable.
 
 If we're dubious about a result, we wait a few days and see if it goes
 away on its own.

I was looking at the zoo some time ago and was surprised there's only
a single animal running Debian unstable/testing - and that's on ia64
which Debian killed some months ago because no one seemed to care
enough about maintaining the toolchain. Is this because
unstable/testing are considered rolling releases? (On a second look,
dugong seems to be running etch/4.0, which is... old.)

My plan was to set up two animals, amd64 and i386, using the compiler
flags we are using for the Debian packages, though I was still waiting
for a VM in a soon-to-be-there cluster at credativ. That would have
caught the ASLR problems we discussed some weeks ago quite early, I
guess.

Does it make sense to look into setting up that target?

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RLS Design

2014-07-04 Thread Kouhei Kaigai
 Kaigai,
 
 On Thursday, July 3, 2014, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 
 
   Sorry for my late responding, now I'm catching up the discussion.
 
    * Robert Haas (robertmh...@gmail.com) wrote:
 On Tue, Jul 1, 2014 at 3:20 PM, Dean Rasheed
 dean.a.rash...@gmail.com
wrote:
  If RLS quals are instead regarded as constraints on access,
 and
  multiple policies apply, then it seems that the quals should
 now be
  combined with AND rather than OR, right?
   
I do feel that RLS quals are constraints on access, but I don't
 see how
it follows that multiple quals should be AND'd together because
 of that.
I view the RLS policies on each table as being independent and
 standing
alone regarding what can be seen.  If you have access to a table
 today
through policy A, and then later policy B is added, using AND
 would mean
that the set of rows returned is less than if only policy A existed.
That doesn't seem correct to me.
   
   It seems to me direction of the constraints (RLS-policy) works to
 is reverse.
 
   In case when we have no RLS-policy, 100% of rows are visible isn't
 it?
 
 
 No, as outlined later, the table would appear empty if no policies exist
 and RLS is enabled for the table.
 
 
   Addition of a constraint usually reduces the number of rows being
 visible,
   or same number of rows at least. Constraint shall never work to
 the direction
   to increase the number of rows being visible.
 
 
 Can you clarify where this is coming from..?  It sounds like you're
 referring to an existing implementation and, if so, it'd be good to get
 more information on how that works exactly.
 

Oracle VPD - Multiple Policies for Each Table, View, or Synonym
http://docs.oracle.com/cd/B19306_01/network.102/b14266/apdvpoli.htm#i1008351

It says: "Note that all policies applied to a table are enforced with AND
syntax."

Not only Oracle VPD; it also fits the attitude of defense in depth.
Please assume a system that installs a network firewall, unix permissions,
and SELinux. If somebody wants to reference an information asset within
a file, he has to connect to the server from a network address allowed
by the firewall configuration, AND both DAC and MAC have to allow his
access.
Usually, we have to pass all of the access controls to reference the target
information, not just one of the access-control layers that are installed.


   For example, if RLS-policy of t1 is (t1.credential 
 get_user_credential)
   and user's query is:
 SELECT * FROM t1 WHERE t1.x = t1.x;
   Do you think RLS-policy shall be merged with OR'd form?
 
 
 Only the RLS policies are OR'd together, not user provided quals. The above
 would result in:
 
 Where t1.x = t1.x and (t1.credential  get_user_credential)
 
 If another policy also applies for this query, such as t1.cred2 
 get_user_credential then we would have:
 
 Where t1.x = t1.x and (t1.credential  get_user_credential OR t1.cred2 
 get_user_credential)
 
 This is similar to how roles work- your overall access includes all access
 granted to any roles you are a member of. You don't need SELECT rights granted
 to every role you are a member of to select from the table. Additionally,
 if an admin wants to AND the quals together then they can simply create
 a policy which does that rather than have 2 policies.
 
It seems to me a pain for database administration if we have to pay attention
to keeping each RLS policy from conflicting with the others. I expect 90% of
RLS policies will be configured for the PUBLIC user, to apply the same access
rules to everybody. In that case, the DBA has to ensure that the target table
has no policy, or that the existing policies do not conflict with the new
policy to be set.
I don't think it is a good idea to force these checks onto the DBA.


   Please assume here are two individual applications that use RLS
 on table-X.
   Even if application-1 want only rows being public become visible,
 it may
   expose credential or secret rows by interaction of orthogonal
 policy
   configured by application-2 (that may configure the policy
 according to the
   source ip-address). It seems to me application-2 partially
 invalidated the
   RLS-policy configured by application-1.
 
 
  You are suggesting instead that if application 2 sets up policies on the
 table and then application 1 adds another policy that it should reduce what
 application 2's users can see?  That doesn't make any sense to me.  I'd
 actually expect these applications to at least use different roles anyway,
 which means they could each have a single role specific policy which only
 returns what that application is allowed to see.
 
I don't think this assumption is reasonable.
Consider two applications: app-X is a database security product that applies
access control based on the remote IP address of the client, for any table
access by any database role; app-Y is a 

[HACKERS] DISTINCT with btree skip scan

2014-07-04 Thread Thomas Munro
Hello

As an exercise I hacked up the simplest code I could think of that would
demonstrate a faster DISTINCT based on skipping ahead to the next distinct
value in an index-only scan. Please see the attached (extremely buggy)
patch, and the example session below.  (It's against my natural instinct to
send such half-baked early hacking phase code to the list, but thought it
would make sense to demo the concept and then seek advice, warnings, cease
and desist notices etc before pressing on down that route!)  I would be
most grateful for any feedback you might have.

Best regards,

Thomas Munro

postgres=# create table foo (a int, b int, primary key (a, b));
CREATE TABLE
Time: 9.002 ms
postgres=# insert into foo select generate_series(1, 5000000) % 10,
generate_series(1, 5000000);
INSERT 0 5000000
Time: 35183.444 ms
postgres=# analyze foo;
ANALYZE
Time: 493.168 ms

postgres=# set enable_hashagg = true;
SET
Time: 0.274 ms
postgres=# explain select distinct a from foo;
┌───┐
│QUERY PLAN │
├───┤
│ HashAggregate  (cost=84624.00..84624.10 rows=10 width=4)  │
│   Group Key: a│
│   ->  Seq Scan on foo  (cost=0.00..72124.00 rows=5000000 width=4) │
│ Planning time: 0.065 ms   │
└───┘
(4 rows)

Time: 0.500 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 6 │
│ 8 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 9 │
│ 7 │
│ 0 │
└───┘
(10 rows)

Time: 2017.019 ms

postgres=# set enable_hashagg = false;
SET
Time: 0.302 ms
postgres=# explain select distinct a from foo;
┌─┐
│ QUERY PLAN
   │
├─┤
│ Index Only Scan for distinct prefix 1 using foo_pkey on foo
 (cost=0.43..263354.20 rows=10 width=4) │
│ Planning time: 0.063 ms
  │
└─┘
(2 rows)

Time: 0.443 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
└───┘
(10 rows)

Time: 0.565 ms
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 53cf96f..5f10d7f 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -29,6 +29,7 @@
  *		index_can_return	- does index support index-only scans?
  *		index_getprocid - get a support procedure OID
  *		index_getprocinfo - get a support procedure's lookup info
+ *		index_skip		- advance to the next key value in a scan
  *
  * NOTES
  *		This file contains the index_ routines which used
@@ -742,6 +743,37 @@ index_can_return(Relation indexRelation)
 	  PointerGetDatum(indexRelation)));
 }
 
+bool
+index_can_skip(Relation indexRelation)
+{
+	FmgrInfo   *procedure;
+
+	RELATION_CHECKS;
+
+	/* amcanskip is optional; assume FALSE if not provided by AM */
+	if (!RegProcedureIsValid(indexRelation->rd_am->amcanskip))
+		return false;
+
+	GET_REL_PROCEDURE(amcanskip);
+
+	return DatumGetBool(FunctionCall1(procedure,
+	  PointerGetDatum(indexRelation)));
+}
+
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+	FmgrInfo   *procedure;
+	SCAN_CHECKS;
+	GET_SCAN_PROCEDURE(amskip);
+	Datum result =
+		FunctionCall3(procedure,
+	  PointerGetDatum(scan),
+	  Int32GetDatum(direction),
+	  Int32GetDatum(prefix));
+	return DatumGetBool(result);
+}
+
 /* 
  *		index_getprocid
  *
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 36dc6c2..2f987e4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -434,6 +434,8 @@ btbeginscan(PG_FUNCTION_ARGS)
 	so->currPos.nextTupleOffset = 0;
 	so->markPos.nextTupleOffset = 0;
 
+	so->skipScanKey = NULL;
 
 	scan->xs_itupdesc = RelationGetDescr(rel);
 
 	scan->opaque = so;
@@ -509,6 +511,19 @@ btrescan(PG_FUNCTION_ARGS)
 }
 
 /*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+Datum
+btskip(PG_FUNCTION_ARGS)
+{
+	IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+	ScanDirection dir = (ScanDirection) PG_GETARG_INT32(1);
+	int			prefix = PG_GETARG_INT32(2);
+	bool result = _bt_skip(scan, dir, prefix);
+	PG_RETURN_BOOL(result);	
+}
+
+/*
  *	btendscan() -- close down a scan
  */
 Datum
@@ -1118,3 +1133,12 @@ btcanreturn(PG_FUNCTION_ARGS)
 {
 	PG_RETURN_BOOL(true);
 }
+
+/*
+ * btcanskip() -- Check whether btree indexes support skipping.
+ */
+Datum

Re: [HACKERS] DISTINCT with btree skip scan

2014-07-04 Thread Vik Fearing
On 07/05/2014 02:17 AM, Thomas Munro wrote:
 As an exercise I hacked up the simplest code I could think of that would
 demonstrate a faster DISTINCT based on skipping ahead to the next
 distinct value in an index-only scan. Please see the attached (extremely
 buggy) patch, and the example session below.  (It's against my natural
 instinct to send such half-baked early hacking phase code to the list,
 but thought it would make sense to demo the concept and then seek
 advice, warnings, cease and desist notices etc before pressing on down
 that route!)  I would be most grateful for any feedback you might have.

This is called a Loose Index Scan in our wiki[1] which I believe is
based on the terminology used for the same feature in MySQL although I
haven't and shan't check.

This is a feature I would really like to have.

Your benchmarks look promising but unfortunately it blows up for me so I
can't really test it.  I have not yet read the patch nor debugged to see
why, but please do keep up work on this.  People more expert than I can
tell you whether you're implementing it the right way.

[1] http://wiki.postgresql.org/wiki/Loose_indexscan
-- 
Vik


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Aggregate function API versus grouping sets

2014-07-04 Thread Vik Fearing
On 07/02/2014 10:15 PM, Andrew Gierth wrote:
 (Had one request so far for a mode() variant that returns the unique
 modal value if one exists, otherwise null; so the current set of
 ordered-set aggs by no means exhausts the possible applications.)

I resemble that remark. :)

But seriously, am I the only one who doesn't want some random value when
there is no dominant one?  And the fact that I can't create my own
without C code is a real bummer.
-- 
Vik


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pg_upgrade and toast tables bug discovered

2014-07-04 Thread Bruce Momjian
On Fri, Jul  4, 2014 at 12:01:37AM -0400, Bruce Momjian wrote:
  The most robust, but not trivial, approach seems to be to prevent toast
  table creation if there wasn't a set_next_toast_pg_class_oid(). Then,
  after all relations are created, iterate over all pg_class entries that
  possibly need toast tables and recheck if they now might need one.
 
 Wow, that is going to be kind of odd in that there really isn't a good
 way to create toast tables except perhaps add a dummy TEXT column and
 remove it.  There also isn't an easy way to not create a toast table,
 but also find out that one was needed.  I suppose we would have to
 insert some dummy value in the toast pg_class column and come back later
 to clean it up.
 
 I am wondering what the probability is of having a table that didn't need a
 toast table in the old cluster, needing one in the new cluster, and
 there being an oid collision.
 
 I think the big question is whether we want to go down that route.

Here is an updated patch.  It was a little tricky because if the
mismatched non-toast table is after the last old relation, you need to
test differently and emit a different error message as there is no old
relation to complain about.

As far as the reusing of oids, we don't set the oid counter until after
the restore, so any new unmatched toast table would be given a very low
oid.  Since we restore in oid order, for an oid to be assigned that was
used in the old cluster, it would have to be a very early relation, so I
think that reduces the odds considerably.  I am not inclined to do
anything more to avoid this until I see an actual error report ---
trying to address it might be invasive and introduce new errors.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c
new file mode 100644
index 6205c74..72f805c
*** a/contrib/pg_upgrade/info.c
--- b/contrib/pg_upgrade/info.c
*** gen_db_file_maps(DbInfo *old_db, DbInfo
*** 38,58 
   int *nmaps, const char *old_pgdata, const char *new_pgdata)
  {
  	FileNameMap *maps;
! 	int			relnum;
  	int			num_maps = 0;
  
  	maps = (FileNameMap *) pg_malloc(sizeof(FileNameMap) *
  	 old_db->rel_arr.nrels);
  
! 	for (relnum = 0; relnum < Min(old_db->rel_arr.nrels, new_db->rel_arr.nrels);
! 		 relnum++)
  	{
! 		RelInfo    *old_rel = old_db->rel_arr.rels[relnum];
! 		RelInfo    *new_rel = new_db->rel_arr.rels[relnum];
  
  		if (old_rel->reloid != new_rel->reloid)
! 			pg_fatal("Mismatch of relation OID in database \"%s\": old OID %d, new OID %d\n",
! 					 old_db->db_name, old_rel->reloid, new_rel->reloid);
  
  		/*
  		 * TOAST table names initially match the heap pg_class oid. In
--- 38,96 
   int *nmaps, const char *old_pgdata, const char *new_pgdata)
  {
  	FileNameMap *maps;
! 	int			old_relnum, new_relnum;
  	int			num_maps = 0;
  
  	maps = (FileNameMap *) pg_malloc(sizeof(FileNameMap) *
  	 old_db->rel_arr.nrels);
  
! 	/*
! 	 * The old database shouldn't have more relations than the new one.
! 	 * We force the new cluster to have a TOAST table if the old table
! 	 * had one.
! 	 */
! 	if (old_db->rel_arr.nrels > new_db->rel_arr.nrels)
! 		pg_fatal("old and new databases \"%s\" have a mismatched number of relations\n",
! 				 old_db->db_name);
! 
! 	/* Drive the loop using new_relnum, which might be higher. */
! 	for (old_relnum = new_relnum = 0; new_relnum < new_db->rel_arr.nrels;
! 		 new_relnum++)
  	{
! 		RelInfo    *old_rel;
! 		RelInfo    *new_rel = new_db->rel_arr.rels[new_relnum];
! 
! 		/*
! 		 * It is possible that the new cluster has a TOAST table for a table
! 		 * that didn't need one in the old cluster, e.g. 9.0 to 9.1 changed the
! 		 * NUMERIC length computation.  Therefore, if we have a TOAST table
! 		 * in the new cluster that doesn't match, skip over it and continue
! 		 * processing.  It is possible this TOAST table used an OID that was
! 		 * reserved in the old cluster, but we have no way of testing that,
! 		 * and we would have already gotten an error at the new cluster schema
! 		 * creation stage.  Fortunately, since we only restore the OID counter
! 		 * after schema restore, and restore in OID order, a conflict would
! 		 * only happen if the new TOAST table had a very low OID.
! 		 */
! 		if (old_relnum == old_db->rel_arr.nrels)
! 		{
! 			if (strcmp(new_rel->nspname, "pg_toast") == 0)
! 				continue;
! 			else
! 				pg_fatal("Extra non-TOAST relation found in database \"%s\": new OID %d\n",
! 						 old_db->db_name, new_rel->reloid);
! 		}
! 
! 		old_rel = old_db->rel_arr.rels[old_relnum];
  
  		if (old_rel->reloid != new_rel->reloid)
! 		{
! 			if (strcmp(new_rel->nspname, "pg_toast") == 0)
! 				continue;
! 			else
! 				pg_fatal("Mismatch of relation OID in database \"%s\": old OID %d, new OID %d\n",
! 						 old_db->db_name, old_rel->reloid, new_rel->reloid);
! 		}
  
  		/*
  		 * TOAST table names 

Re: [HACKERS] postgresql.auto.conf and reload

2014-07-04 Thread Amit Kapila
On Fri, Jun 27, 2014 at 9:11 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
 On Thu, Jun 26, 2014 at 1:49 PM, Christoph Berg c...@df7cb.de wrote:
  Re: Amit Kapila 2014-06-26 CAA4eK1+mUTjc=GXJK3bYtSwV2BmBni=
phevbqlqkhduv9cw...@mail.gmail.com
  
   How about adding a note in Alter System so that users are aware of
   such behaviour and can ensure that they don't have duplicate entries?
 
  If the behavior isn't going to change, that issue need to be
  documented, sure.

 I will send a patch to address this unless someone comes with a better
 way to address this and if no one objects to adding a note in Alter System
 documentation.

Please find the patch attached to address the above concern.  I have
updated docs, so that users can be aware of such behaviour.

I will add this patch to the next CF to avoid it getting lost; however, I believe
it should be done for 9.4.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


alter_system_postmaster_params_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers