Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 9, 2014 at 12:16 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com wrote: Apart from above, I think for this patch, a catversion bump is required as I have modified the system catalog. However I have not done the same in the patch as otherwise it will be a bit difficult to take performance data. One regression failed on Linux due to a spacing issue, which is fixed in the attached patch. I took another read through this patch. Here are some further review comments: 1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have both num_to_free and tmp_num_to_free. num_to_free is used to accumulate the total number of buffers that are freed in one cycle of BgMoveBuffersToFreelist(), which is reported for debug info (BGW_DEBUG), and tmp_num_to_free is used as a temporary number which is a count of the number of buffers to be freed in one sub-cycle (inner while loop). I'd get rid of tmp_num_to_free and move the declaration of num_to_free inside the outer loop. I'd also move the definitions of tmp_next_to_clean, tmp_recent_alloc, tmp_recent_backend_clocksweep into the innermost scope in which they are used. Okay, I have moved the tmp_* variables into the innermost scope. 2. Also in that function, I think the innermost bit of logic could be rewritten more compactly, and in such a way as to make it clearer for what set of instructions the buffer header will be locked. LockBufHdr(bufHdr); if (bufHdr->refcount == 0) { if (bufHdr->usage_count > 0) bufHdr->usage_count--; else add_to_freelist = true; } UnlockBufHdr(bufHdr); if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr)) num_to_free--; Changed as per suggestion. 3. This comment is now obsolete: + /* +* If bgwriterLatch is set, we need to waken the bgwriter, but we should +* not do so while holding freelist_lck; so set it after releasing the +* freelist_lck.
This is annoyingly tedious, but it happens at most once +* per bgwriter cycle, so the performance hit is minimal. +*/ + We're not actually holding any lock in need of releasing at that point in the code, so this can be shortened to If bgwriterLatch is set, we need to waken the bgwriter. Changed as per suggestion. * Ideally numFreeListBuffers should get called under freelist spinlock, That doesn't make any sense. numFreeListBuffers is a variable, so you can't call it. The value should be *read* under the spinlock, but it is. I think this whole comment can be deleted and replaced with If the number of free buffers has fallen below the low water mark, awaken the bgreclaimer to repopulate it. Changed as per suggestion. 4. StrategySyncStartAndEnd() is kind of a mess. One, it can return the same victim buffer that's being handed out - at almost the same time - to a backend running the clock sweep; if it does, they'll fight over the buffer. Two, the *end out parameter actually returns a count, not an endpoint. I think we should have BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the top of the inner loop rather than the bottom, and change StrategySyncStartAndEnd() so that it knows nothing about victimbuf_lck. Let's also change StrategyGetBuffer() to call StrategySyncNextVictimBuffer() so that the logic is centralized in one place, and rename StrategySyncStartAndEnd() to something that better matches its revised purpose. Changed as per suggestion. I have also updated StrategySyncNextVictimBuffer() such that it increments completePasses on completion of cycle as I think it is appropriate to update it, even when clock sweep is done by bgreclaimer. Maybe StrategyGetReclaimInfo(). I have changed it to StrategyGetFreelistAccessInfo() as it seems most other functions in freelist.c uses the names to sound something related to buffers. 5. Have you tested that this new bgwriter statistic is actually working? 
Because it looks to me like BgMoveBuffersToFreelist is changing BgWriterStats but never calling pgstat_send_bgwriter(), which I'm thinking will mean the counters accumulate forever inside the reclaimer but never get forwarded to the stats collector. pgstat_send_bgwriter() is called in the bgreclaimer loop (the caller of BgMoveBuffersToFreelist; this is similar to what bgwriter does). I have done a few tests with it before sending the previous patch. 6. StrategyInitialize() uses #defines for HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000) for clamping. Let's have constants for all of those (and omit mention of the specific values in the comments). Changed as per suggestion. 7. The line you've added to the definition of view pg_stat_bgwriter doesn't seem to be indented the same way as all the others. Tab vs. space problem? Fixed. Performance Data:
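The locking discipline Robert proposes in point 2 — decide under the buffer-header lock, but touch the freelist only after releasing it — can be sketched as a standalone program (hypothetical, heavily simplified stand-ins; the real code uses a spinlock via LockBufHdr/UnlockBufHdr and a shared freelist):

```c
#include <stdbool.h>

/* Simplified stand-in for a PostgreSQL buffer header. */
typedef struct BufDesc
{
    int  refcount;      /* pins held by backends */
    int  usage_count;   /* clock-sweep usage counter */
    bool locked;        /* stand-in for the buffer-header spinlock */
} BufDesc;

static void LockBufHdr(BufDesc *b)   { b->locked = true;  }
static void UnlockBufHdr(BufDesc *b) { b->locked = false; }

/* Hypothetical freelist push; returns true if the buffer was added. */
static bool
StrategyMoveBufferToFreeListEnd(BufDesc *b)
{
    (void) b;
    return true;
}

/*
 * One step of the reclaimer's inner loop, in the compact form Robert
 * suggests: decrement usage_count under the header lock, and only
 * touch the freelist after releasing it.  Returns true if the buffer
 * was moved to the freelist (and decrements *num_to_free).
 */
static bool
reclaim_step(BufDesc *bufHdr, int *num_to_free)
{
    bool add_to_freelist = false;

    LockBufHdr(bufHdr);
    if (bufHdr->refcount == 0)
    {
        if (bufHdr->usage_count > 0)
            bufHdr->usage_count--;
        else
            add_to_freelist = true;
    }
    UnlockBufHdr(bufHdr);

    if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr))
    {
        (*num_to_free)--;
        return true;
    }
    return false;
}
```

The point of the shape is that the header lock covers only the refcount/usage_count inspection, never the freelist operation.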
Re: [HACKERS] Scaling shared buffer eviction
On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: On 05/09/14 23:50, Amit Kapila wrote: On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood FWIW below are some test results on the 60 core beast with this patch applied to 9.4. I'll need to do more runs to iron out the variation, but it looks like the patch helps the standard (write heavy) pgbench workload a little, and clearly helps the read only case. Thanks for doing the test. I think if possible you can take the performance data with higher scale factor (4000) as it seems your m/c has 1TB of RAM. You might want to use latest patch I have posted today. Here's some fairly typical data from read-write and read-only runs at scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not seeing much variation between repeated read-write runs with the same config (which is nice - sleep 30 and explicit checkpoint call between each one seem to help there). Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be better at higher client counts than beta1 was... In terms of the effect of the patch - looks pretty similar to the scale 2000 results for read-write, but read-only is a different and more interesting story - unpatched 9.4 is noticeably impacted in the higher client counts, whereas the patched version scales as well (or even better perhaps) than in the scale 2000 case. Yeah, that's what I was expecting, the benefit of this patch will be more at higher client count when there is large data and all the data can fit in RAM . Many thanks for doing the performance test for patch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] FD_SETSIZE on Linux?
On 10 September 2014 00:21, Tom Lane t...@sss.pgh.pa.us wrote: Thom Brown t...@linux.com writes: I find this in pgbench.c: #ifdef FD_SETSIZE #define MAXCLIENTS (FD_SETSIZE - 10) #else #define MAXCLIENTS 1024 #endif FD_SETSIZE is supposed to be defined, according to the POSIX spec: The sys/select.h header shall define the following symbolic constant, which shall have a value suitable for use in #if preprocessing directives: FD_SETSIZE Maximum number of file descriptors in an fd_set structure. It looks like Linux sets it to 1024. On RHEL6, at least, I find this: $ grep -r FD_SETSIZE /usr/include /usr/include/linux/posix_types.h:#undef __FD_SETSIZE /usr/include/linux/posix_types.h:#define __FD_SETSIZE 1024 ... /usr/include/sys/select.h:#define FD_SETSIZE __FD_SETSIZE ... Ah yes, I have the same on Debian: /usr/include/linux/posix_types.h:#undef __FD_SETSIZE /usr/include/linux/posix_types.h:#define __FD_SETSIZE 1024 ... usr/include/x86_64-linux-gnu/sys/select.h:#define FD_SETSIZE __FD_SETSIZE /usr/include/x86_64-linux-gnu/bits/typesizes.h:#define __FD_SETSIZE 1024 ... I didn't think to look beyond Postgres' code. #ifdef WIN32 #define FD_SETSIZE 1024 /* set before winsock2.h is included */ #endif /* ! WIN32 */ Windows probably hasn't got sys/select.h at all, so it may not provide this symbol. Interestingly, it looks like POSIX also requires sys/time.h to define FD_SETSIZE. I wonder whether Windows has that header? It'd definitely be better to get this symbol from the system than assume 1024 will work. Okay, this now makes sense. It just takes the system value and reduces it by 10 to get the MAXCLIENTS value. Thanks for the explanation. Thom
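The pattern Tom describes — take the system's FD_SETSIZE when the header provides it, minus a slop of 10 descriptors, else fall back to 1024 — mirrors the pgbench.c #ifdef and can be checked in isolation (this is a sketch shaped like the pgbench code, not copied from it):

```c
#include <sys/select.h>   /* POSIX: shall define FD_SETSIZE */

/*
 * Same fallback dance as pgbench.c: prefer the system's fd_set limit,
 * reserving 10 descriptors for stdin/stdout/logs etc.
 */
#ifdef FD_SETSIZE
#define MAXCLIENTS (FD_SETSIZE - 10)
#else
#define MAXCLIENTS 1024
#endif

static int
max_clients(void)
{
    /* On glibc FD_SETSIZE is 1024, so this comes out as 1014. */
    return MAXCLIENTS;
}
```

Since POSIX requires sys/select.h to define FD_SETSIZE, the #else branch is effectively a safety net for non-POSIX toolchains such as Windows, where the thread notes pgbench defines FD_SETSIZE itself before including winsock2.h.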
Re: [HACKERS] LIMIT for UPDATE and DELETE
On 2014-09-10 04:25, Etsuro Fujita wrote: (2014/09/09 18:57), Heikki Linnakangas wrote: What's not clear to me is whether it makes sense to do 1) without 2)? Is UPDATE .. LIMIT without support for an ORDER BY useful enough? And if we apply this patch now, how much of it needs to be rewritten after 2)? If the answers are yes and not much, then we should review this patch now, and put 2) on the TODO list. Otherwise 2) should be done first. My answers are yes but completely rewritten. Any particular reason for you to say that? Because an UPDATE might have a RETURNING clause, all the updated tuples have to go through the ModifyTable node one at a time. I don't see why we couldn't LIMIT there after implementing #2. .marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] replication commands and log_statements
On 08/28/2014 11:38 AM, Fujii Masao wrote: On Thu, Jun 19, 2014 at 5:29 PM, Ian Barwick i...@2ndquadrant.com wrote: - minor rewording for the description, mentioning that statements will still be logged as DEBUG1 even if parameter set to 'off' (might prevent reports of the kind I set it to 'off', why am I still seeing log entries?). para Causes each replication command to be logged in the server log. See xref linkend=protocol-replication for more information about replication commands. The default value is literaloff/. When set to literaloff/, commands will be logged at log level literalDEBUG1/literal. Only superusers can change this setting. /para Yep, fixed. Attached is the updated version of the patch. I don't think it's necessary to mention that the commands will still be logged at DEBUG1 level. We log all kinds of crap at the various DEBUG levels, and they're not supposed to be used in normal operation. - I feel it would be more consistent to use the plural form for this parameter, i.e. log_replication_commands, in line with log_lock_waits, log_temp_files, log_disconnections etc. But log_statement is in the singular form. So I just used log_replication_command. For the consistency, maybe we need to change both parameters in the plural form? I don't have strong opinion about this. Yeah, we seem to be inconsistent. log_replication_commands would sound better to me in isolation, but then there is log_statement.. I'll mark this as Ready for Committer in the commitfest app (I assume you'll take care of committing this yourself when ready). - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
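The behaviour agreed above — log replication commands at LOG when the new GUC is on, and keep the old DEBUG1 logging otherwise — reduces to picking an elog level from a boolean. A minimal sketch (stand-alone; the level values match elog.h, but the function name is hypothetical):

```c
#include <stdbool.h>

/* Stand-ins for PostgreSQL's message levels (values as in elog.h). */
#define DEBUG1 14
#define LOG    15

static bool log_replication_commands = false;   /* the proposed GUC */

/*
 * Level at which a received replication command is logged: LOG when
 * the GUC is enabled, DEBUG1 otherwise, so the pre-patch behaviour
 * is preserved when the parameter is off.
 */
static int
replication_command_log_level(void)
{
    return log_replication_commands ? LOG : DEBUG1;
}
```

In the actual patch this appears inline as `ereport(log_replication_commands ? LOG : DEBUG1, ...)` in exec_replication_command().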
Re: [HACKERS] Scaling shared buffer eviction
On 10/09/14 18:54, Amit Kapila wrote: On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz mailto:mark.kirkw...@catalyst.net.nz wrote: In terms of the effect of the patch - looks pretty similar to the scale 2000 results for read-write, but read-only is a different and more interesting story - unpatched 9.4 is noticeably impacted in the higher client counts, whereas the patched version scales as well (or even better perhaps) than in the scale 2000 case. Yeah, that's what I was expecting, the benefit of this patch will be more at higher client count when there is large data and all the data can fit in RAM . Many thanks for doing the performance test for patch. No worries, this is looking like a patch we're going to apply to 9.4 to make the 60 core beast scale a bit better, so thanks very much for your work in this area. If you would like more tests run at higher scales let me know (we have two of these machines at pre-production state currently so I can run benchmarks as reqd)! Regards Mark -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
(2014/09/10 12:31), Fujii Masao wrote: On Wed, Sep 10, 2014 at 12:15 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: (2014/09/09 22:17), Fujii Masao wrote: Attached is the updated version of the patch. I took a quick review of the patch. It looks good to me, but one thing I'm concerned about is You wrote: The attached patch introduces the GIN index storage parameter PENDING_LIST_CLEANUP_SIZE which specifies the maximum size of the GIN pending list. If it's not set, work_mem is used as that maximum size, instead. So this patch doesn't break existing applications which currently use work_mem as the threshold of the cleanup operation of the pending list. They only have to leave PENDING_LIST_CLEANUP_SIZE unset. As you mentioned, I think it's important to consider the existing applications, but I'm wondering if it would be a bit confusing to users to have two parameters, Yep. PENDING_LIST_CLEANUP_SIZE and work_mem, for this setting. Wouldn't it be simpler to have only one parameter, PENDING_LIST_CLEANUP_SIZE? How about setting PENDING_LIST_CLEANUP_SIZE to work_mem as the default value when running the CREATE INDEX command? That's an idea. But there might be some users who want to change the cleanup size per session like they can do by setting work_mem, and your idea would prevent them from doing that... Why not use ALTER INDEX ... SET (PENDING_LIST_CLEANUP_SIZE= ...)? Maybe I'm missing something, though. So what about introducing pending_list_cleanup_size also as a GUC? That is, users can set the threshold by using either the reloption or the GUC. Yeah, that's an idea. So, I'd like to hear the opinions of others. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
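The fallback semantics under discussion — use the per-index reloption when set, else fall back to work_mem — is the common PostgreSQL pattern of a reloption whose default of -1 means "inherit the GUC". A rough sketch under assumed names (GinOptions and the field name here are hypothetical stand-ins, not the patch's actual structs):

```c
#include <stddef.h>

/* Session-level GUC fallback, in KB (work_mem's default is 4MB). */
static int work_mem = 4096;

typedef struct GinOptions
{
    int pending_list_cleanup_size;   /* -1 means "not set" */
} GinOptions;

/*
 * Effective pending-list cleanup threshold for one index: the
 * reloption wins if it was set; otherwise the session GUC applies.
 */
static int
gin_pending_list_limit(const GinOptions *opts)
{
    if (opts != NULL && opts->pending_list_cleanup_size >= 0)
        return opts->pending_list_cleanup_size;   /* per-index reloption */
    return work_mem;                              /* session-level fallback */
}
```

This shape is what lets existing applications keep their work_mem-driven behaviour simply by never setting the reloption, while still allowing per-index (or, with the proposed GUC, per-session) overrides.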
Re: [HACKERS] add modulo (%) operator to pgbench
Hello Robert, Sure, and I would have looked at that patch and complained that you were implementing a modulo operator with different semantics than C. And then we'd be right back where we are now. [...] Probably. To be clear about my intent, which is a summary of what you already know: I would like to be able to generate a reasonable non-uniform throttled workload with pgbench. I do think such a feature is worth having for anybody who would like to do something realistic with pgbench, and that it is in the general interest of postgresql to have such features. In particular, given the current state of abysmal performance under some trivial loads with pg, I really think that it is worth having a better tool, and I think that my effort with rate and progress helped to put these hidden problems into the light, and I do hope that they are going to be solved quickly. Now if I submitted a big patch with all the necessary features in it, someone would ask to break it into pieces. So they are submitted one by one (progress, throttling, limit, modulo, ...). Note I did not start with the non-uniform stuff, but Mitsumasa-san sent a gaussian distribution patch and I jumped onto the wagon to complement it with an exponential distribution patch. I knew when doing it that it was not enough, but as I said, one piece at a time, given the effort required to pass a simple patch. What is still needed for the overall purpose is the ability to scatter the distribution. This is really: (1) a positive remainder modulo, which is the trivial patch under discussion (2) a fast hash function of good-enough quality for the purpose, also a rather small patch, not submitted yet. (3) maybe the '|' operator to do a TP*-like non-uniform load, which is really periodic so I do not like it. A trivial patch, not submitted yet.
If you do not want one of these pieces (1 and 2), basically the interest of the gaussian/exponential addition is much reduced, and this is just a half-baked effort aborted because you did not want what was required to make it useful. Well, I can only disagree, but you are a committer and I'm not! -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
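The "positive remainder modulo" of point (1) differs from C's built-in `%`, which (since C99) truncates toward zero and therefore yields negative remainders for negative operands. A floored/Euclidean variant — the semantics the pgbench operator would need — is a two-liner:

```c
/*
 * C's % truncates toward zero, so -7 % 3 == -1 (C99 and later).
 * This variant always returns a result in [0, m) for m > 0, which
 * is what you want when mapping scattered values onto key ranges.
 */
static long
positive_mod(long n, long m)
{
    long r = n % m;
    return (r < 0) ? r + m : r;
}
```

With this, e.g. positive_mod(-7, 3) is 2 rather than -1, so hashed or shifted values land uniformly in a valid id range.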
Re: [HACKERS] [TODO] Process pg_hba.conf keywords as case-insensitive
Hello, I had a closer look at this patch. Finally I think that we need case-insensitive versions of get_role_id() and get_database_id() to accomplish this patch's objective. (This runs full scans on pg_database or pg_authid.) And I'd like to propose to change the token categorization from notation-based to how-to-treat-based. Concretely, this patch categorizes tokens using 'special quote is used' and 'quote from the first', but it seems to make the logic clearer to categorize them using 'case sensitive or not' and 'it represents a group name'. The attached patch is a revised version of your original patch with regard to the above point. (Sorry in advance that this is a quick hack; especially the code related to file inclusion is not tested at all.) I have tested this only at a superficial level, but it seems to work as expected. Under the new specifications, next_token will work as follows: - USER : token: USER , case-insensitive - USeR: token: USeR , case-SENSITIVE - +uSeR : token: +uSeR , case-SENSITIVE - +UsER : token: +UsEr , case-insensitive - USeR : token: USeR , case-insensitive - +USER : token: USER , case-insensitive, group_name - +uSeR : token: uSeR , case_SENSITIVE, group_name - +UsEr : token: UsEr , case-insensitive, group_name - + : token: + , (useless?) - @ : token: @ , (useless?) - @hoge: token: hoge, file_inclusion (not confirmed) There's a concern that case-insensitive matching is accomplished by a full scan on pg_database or pg_authid, so it would be slower than case-sensitive matching. This might not be acceptable to the community. And one known defect is that you will get a bit odd message if you put an hba line having keywords quoted or prefixed with '+', for example +locAl postgres +sUstRust The server complains for the line above that *| LOG: invalid connection type locAl | CONTEXT: line 84 of configuration file /home/horiguti/data/data_work/pg_hba.conf The prefixing '+' is omitted.
To correct this, either deparsing the token into the original string or storing the original string into tokens is needed, I think. What do you think about the changes, Viswanatham or all? regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index f480be8..db73dd9 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -1991,6 +1991,50 @@ get_database_oid(const char *dbname, bool missing_ok) return oid; } +/* + * get_database_oid_case_insensitive - given a database name, look up the + * OID in a case-insensitive manner. + * + * If missing_ok is false, throw an error if database name not found. If + * true, just return InvalidOid. + */ +Oid +get_database_oid_case_insensitive(const char *dbname, bool missing_ok) +{ + Relation relation; + SysScanDesc scandesc; + HeapTuple tuple; + Oid oid = InvalidOid; + + /* + * SysCache has no ability to do case-insensitive matching, so we have no + * means except scanning the whole systable.
 + */ + relation = heap_open(DatabaseRelationId, AccessShareLock); + + scandesc = systable_beginscan(relation, InvalidOid, false, + NULL, 0, NULL); + while (HeapTupleIsValid(tuple = systable_getnext(scandesc))) + { + Form_pg_database dbForm = (Form_pg_database) GETSTRUCT(tuple); + + if (pg_strcasecmp(dbname, dbForm->datname.data) == 0) + { + oid = HeapTupleGetOid(tuple); + break; + } + } + systable_endscan(scandesc); + heap_close(relation, AccessShareLock); + + if (!OidIsValid(oid) && !missing_ok) + ereport(ERROR, +(errcode(ERRCODE_UNDEFINED_DATABASE), + errmsg("database \"%s\" does not exist", + dbname))); + + return oid; +} /* * get_database_name - given a database OID, look up the name diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c index 84da823..2d3a059 100644 --- a/src/backend/libpq/hba.c +++ b/src/backend/libpq/hba.c @@ -60,9 +60,20 @@ typedef struct check_network_data bool result; /* set to true if match */ } check_network_data; - -#define token_is_keyword(t, k) (!t->quoted && strcmp(t->string, k) == 0) -#define token_matches(t, k) (strcmp(t->string, k) == 0) +typedef enum TokenType +{ + NORMAL, + GROUP_NAME, /* this token had leading '+' */ + FILE_INCLUSION, /* this token had leading '@' */ +} TokenType; + +#define token_is_keyword(tk, kw) \ + ((tk)->type != NORMAL || (tk)->case_sensitive ? false : \ + (pg_strcasecmp((tk)->string, (kw)) == 0)) +#define token_matches(t, k) \ + ((t)->type != NORMAL ? false : \ + ((t)->case_sensitive ? (strcmp((t)->string, (k)) == 0) : \ + (pg_strcasecmp((t)->string, (k)) == 0))) /* * A single string token lexed from the HBA config file, together with whether @@ -71,7 +82,8 @@ typedef struct check_network_data typedef struct HbaToken { char *string; - bool quoted; + TokenType type; + bool case_sensitive; } HbaToken; /* @@ -111,6 +123,7 @@ pg_isblank(const char
Re: [HACKERS] LIMIT for UPDATE and DELETE
(2014/09/10 16:57), Marko Tiikkaja wrote: On 2014-09-10 04:25, Etsuro Fujita wrote: (2014/09/09 18:57), Heikki Linnakangas wrote: What's not clear to me is whether it makes sense to do 1) without 2)? Is UPDATE .. LIMIT without support for an ORDER BY useful enough? And if we apply this patch now, how much of it needs to be rewritten after 2)? If the answers are yes and not much, then we should review this patch now, and put 2) on the TODO list. Otherwise 2) should be done first. My answers are yes but completely rewritten. Any particular reason for you to say that? Because an UPDATE might have a RETURNING clause, all the updated tuples have to go through the ModifyTable node one at a time. I don't see why we couldn't LIMIT there after implementing #2. The reason is because I think that after implementing #2, we should re-implement this feature by extending the planner to produce a plan tree such as ModifyTable+Limit+Append, maybe with LockRows below the Limit node. Such an approach would prevent duplication of the LIMIT code in ModifyTable, making the ModifyTable code simpler, I think. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [TODO] Process pg_hba.conf keywords as case-insensitive
Hmm... case-insensitive matching could get multiple matches, which should be an error, but I forgot to make it one. regards, 2014/09/10 17:54 Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp: And one known defect is that you will get a bit odd message if you put an hba line having keywords quoted or prefixed with '+', for example +locAl postgres +sUstRust The server complains for the line above that *| LOG: invalid connection type locAl | CONTEXT: line 84 of configuration file /home/horiguti/data/data_work/pg_hba.conf The prefixing '+' is omitted. To correct this, either deparsing the token into the original string or storing the original string into tokens is needed, I think. What do you think about the changes, Viswanatham or all? -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] [TODO] Process pg_hba.conf keywords as case-insensitive
On Sep10, 2014, at 10:54 , Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Under the new specifications, next_token will work as following, - USER : token: USER , case-insensitive - USeR: token: USeR , case-SENSITIVE - +uSeR : token: +uSeR , case-SENSITIVE - +UsER : token: +UsEr , case-insensitive - USeR : token: USeR , case-insensitive - +USER : token: USER , case-insensitive, group_name - +uSeR : token: uSeR , case_SENSITIVE, group_name - +UsEr : token: UsEr , case-insensitive, group_name - + : token: + , (useless?) - @ : token: @ , (useless?) - @hoge: token: hoge, file_inclusion (not confirmed) There's a concern that Case-insensitive matching is accomplished by full-scan on pg_database or pg_authid so it would be rather slow than case-sensitive matching. This might not be acceptable by the community. That does indeed sound bad. Couldn't we handle this the same way we handle SQL identifiers, i.e. simply downcase unquoted identifiers, and then compare case-sensitively? So foo, Foo and FOO would all match the user called foo, but Foo would match the user called Foo, and FOO the user called FOO. An unquoted + would cause whatever follows it to be interpreted as a group name, whereas a quoted + would simply become part of the user name (or group name, if there's an additional unquoted + before it). So +foo would refer to the group foo, +FOO to the group FOO, and ++A to the group +A. I haven't checked if such an approach would be sufficiently backwards-compatible, though. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
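Florian's proposal mirrors SQL identifier handling: fold unquoted tokens to lower case once at parse time, then compare everything case-sensitively, so no catalog full scan is needed. The folding step might look like this (a hypothetical helper, ASCII-only for brevity; real PostgreSQL downcasing also has locale considerations):

```c
#include <ctype.h>

/*
 * Fold an hba token in place, SQL-identifier style: unquoted tokens
 * are lower-cased, quoted ones are kept verbatim.  After folding,
 * plain strcmp() suffices everywhere, so "FOO", Foo and foo all
 * match the role foo, while "Foo" matches only the role Foo.
 */
static void
fold_hba_token(char *tok, int quoted)
{
    if (quoted)
        return;                 /* "Foo" stays Foo */
    for (; *tok; tok++)
        *tok = (char) tolower((unsigned char) *tok);
}
```

Note this sidesteps the multiple-match problem raised upthread: since lookup is by the folded exact string, at most one catalog row can match.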
Re: [HACKERS] LIMIT for UPDATE and DELETE
On 9/10/14 11:25 AM, Etsuro Fujita wrote: The reason is because I think that after implementing #2, we should re-implement this feature by extending the planner to produce a plan tree such as ModifyTable+Limit+Append, maybe with LockRows below the Limit node. Such an approach would prevent duplication of the LIMIT code in ModifyTable, making the ModifyTable code more simple, I think. You can already change *this patch* to do ModifyTable - Limit - LockRows. If we think that's what we want, we should rewrite this patch right now. This isn't a reason not to implement LIMIT without ORDER BY. Like I said upthread, I think LockRows is a bad idea, but I'll need to run some performance tests first. But whichever method we decide to implement for this patch shouldn't need to be touched when the changes to UPDATE land, so your reasoning is incorrect. .marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On Wed, Sep 10, 2014 at 5:35 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: (2014/09/10 12:31), Fujii Masao wrote: On Wed, Sep 10, 2014 at 12:15 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: (2014/09/09 22:17), Fujii Masao wrote: Attached is the updated version of the patch. I took a quick review of the patch. It looks good to me, but one thing I'm concerned about is You wrote: The attached patch introduces the GIN index storage parameter PENDING_LIST_CLEANUP_SIZE which specifies the maximum size of the GIN pending list. If it's not set, work_mem is used as that maximum size, instead. So this patch doesn't break existing applications which currently use work_mem as the threshold of the cleanup operation of the pending list. They only have to leave PENDING_LIST_CLEANUP_SIZE unset. As you mentioned, I think it's important to consider the existing applications, but I'm wondering if it would be a bit confusing to users to have two parameters, Yep. PENDING_LIST_CLEANUP_SIZE and work_mem, for this setting. Wouldn't it be simpler to have only one parameter, PENDING_LIST_CLEANUP_SIZE? How about setting PENDING_LIST_CLEANUP_SIZE to work_mem as the default value when running the CREATE INDEX command? That's an idea. But there might be some users who want to change the cleanup size per session like they can do by setting work_mem, and your idea would prevent them from doing that... Why not use ALTER INDEX ... SET (PENDING_LIST_CLEANUP_SIZE= ...)? Maybe I'm missing something, though. It takes an AccessExclusive lock and affects every session (not only the specified session). Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] LIMIT for UPDATE and DELETE
(2014/09/10 18:33), Marko Tiikkaja wrote: On 9/10/14 11:25 AM, Etsuro Fujita wrote: The reason is because I think that after implementing #2, we should re-implement this feature by extending the planner to produce a plan tree such as ModifyTable+Limit+Append, maybe with LockRows below the Limit node. Such an approach would prevent duplication of the LIMIT code in ModifyTable, making the ModifyTable code more simple, I think. You can already change *this patch* to do ModifyTable - Limit - LockRows. If we think that's what we want, we should rewrite this patch right now. I think it might be relatively easy to do that for single-table cases, but for inheritance cases, inheritance_planner needs work and I'm not sure how much work it would take ... Like I said upthread, I think LockRows is a bad idea, but I'll need to run some performance tests first. But whichever method we decide to implement for this patch shouldn't need to be touched when the changes to UPDATE land, so your reasoning is incorrect. Yeah, as you say, we need the performance tests, and I think it would depend on those results whether LockRows is a bad idea or not. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] LIMIT for UPDATE and DELETE
On 9/10/14 12:05 PM, Etsuro Fujita wrote: (2014/09/10 18:33), Marko Tiikkaja wrote: You can already change *this patch* to do ModifyTable - Limit - LockRows. If we think that's what we want, we should rewrite this patch right now. I think it might be relatively easy to do that for single-table cases, but for inheritance cases, inheritance_planner needs work and I'm not sure how much work it would take ... Uh. Yeah, I think I'm an idiot and you're right. I'll try and get some benchmarking done and see what comes out. .marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] replication commands and log_statements
Thanks for reviewing the patch! On Wed, Sep 10, 2014 at 4:57 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 08/28/2014 11:38 AM, Fujii Masao wrote: On Thu, Jun 19, 2014 at 5:29 PM, Ian Barwick i...@2ndquadrant.com wrote: - minor rewording for the description, mentioning that statements will still be logged as DEBUG1 even if parameter set to 'off' (might prevent reports of the kind I set it to 'off', why am I still seeing log entries?). para Causes each replication command to be logged in the server log. See xref linkend=protocol-replication for more information about replication commands. The default value is literaloff/. When set to literaloff/, commands will be logged at log level literalDEBUG1/literal. Only superusers can change this setting. /para Yep, fixed. Attached is the updated version of the patch. I don't think it's necessary to mention that the commands will still be logged at DEBUG1 level. We log all kinds of crap at the various DEBUG levels, and they're not supposed to be used in normal operation. Agreed. I removed that mention from the document. - I feel it would be more consistent to use the plural form for this parameter, i.e. log_replication_commands, in line with log_lock_waits, log_temp_files, log_disconnections etc. But log_statement is in the singular form. So I just used log_replication_command. For the consistency, maybe we need to change both parameters in the plural form? I don't have strong opinion about this. Yeah, we seem to be inconsistent. log_replication_commands would sound better to me in isolation, but then there is log_statement.. Agreed. I changed the GUC name to log_replication_commands. I'll mark this as Ready for Committer in the commitfest app (I assume you'll take care of committing this yourself when ready). Attached is the updated version of the patch. After at least one day I will commit the patch. 
Regards, -- Fujii Masao *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 4558,4563 FROM pg_stat_activity; --- 4558,4579 /listitem /varlistentry + varlistentry id=guc-log-replication-commands xreflabel=log_replication_commands + termvarnamelog_replication_commands/varname (typeboolean/type) + indexterm +primaryvarnamelog_replication_commands/ configuration parameter/primary + /indexterm + /term + listitem +para + Causes each replication command to be logged in the server log. + See xref linkend=protocol-replication for more information about + replication command. The default value is literaloff/. + Only superusers can change this setting. +/para + /listitem + /varlistentry + varlistentry id=guc-log-temp-files xreflabel=log_temp_files termvarnamelog_temp_files/varname (typeinteger/type) indexterm *** a/doc/src/sgml/protocol.sgml --- b/doc/src/sgml/protocol.sgml *** *** 1306,1311 To initiate streaming replication, the frontend sends the --- 1306,1313 of literaltrue/ tells the backend to go into walsender mode, wherein a small set of replication commands can be issued instead of SQL statements. Only the simple query protocol can be used in walsender mode. + Replication commands are logged in the server log when + xref linkend=guc-log-replication-commands is enabled. Passing literaldatabase/ as the value instructs walsender to connect to the database specified in the literaldbname/ parameter, which will allow the connection to be used for logical replication from that database. *** a/src/backend/replication/walsender.c --- b/src/backend/replication/walsender.c *** *** 108,113 bool am_db_walsender = false; /* Connected to a database? 
*/ --- 108,114 int max_wal_senders = 0; /* the maximum number of concurrent walsenders */ int wal_sender_timeout = 60 * 1000; /* maximum time to send one * WAL data message */ + bool log_replication_commands = false; /* * State for WalSndWakeupRequest *** *** 1268,1280 exec_replication_command(const char *cmd_string) MemoryContext old_context; /* * CREATE_REPLICATION_SLOT ... LOGICAL exports a snapshot until the next * command arrives. Clean up the old stuff if there's anything. */ SnapBuildClearExportedSnapshot(); - elog(DEBUG1, received replication command: %s, cmd_string); - CHECK_FOR_INTERRUPTS(); cmd_context = AllocSetContextCreate(CurrentMemoryContext, --- 1269,1287 MemoryContext old_context; /* + * Log replication command if log_replication_commands is enabled. + * Even when it's disabled, log the command with DEBUG1 level for + * backward compatibility. + */ + ereport(log_replication_commands ? LOG : DEBUG1, + (errmsg(received replication
Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf
On Wed, Sep 10, 2014 at 1:45 AM, Petr Jelinek p...@2ndquadrant.com wrote: Hi, I recently wanted several times to have a slave server prepared at a certain point in time, to reduce the time it takes for it to replay the remaining WALs (say I have pg_basebackup -x on a busy db, for example). In your example, you're thinking to perform the recovery after taking the backup and stop it at the consistent point (i.e., end of backup) by using the proposed feature? Then you're expecting that the future recovery will start from that consistent point, which will reduce the recovery time? This is true if a checkpoint is executed at the end of the backup. But there might be no occurrence of a checkpoint during the backup. In this case, even the future recovery would need to start from the very start of the backup. That is, we cannot reduce the recovery time. So, for your purpose, for example, you might also need to add a new option to pg_basebackup so that a checkpoint is executed at the end of the backup if the option is set. Currently the way to do it is to have pause_at_recovery_target set to true (the default), wait until pg accepts connections, and then shut it down. The issue is that this is ugly, and also there is a chance that somebody else connects and does bad things (tm) before my process does. So I wrote a simple patch that adds an option to shut down the cluster once the recovery target is reached. The server will still be able to continue WAL replay if needed later, or can be configured to start standalone. What about adding something like action_at_recovery_target=pause|shutdown instead of increasing the number of parameters? Regards, -- Fujii Masao
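If the action_at_recovery_target idea floated above were adopted, a recovery.conf might look like the following sketch. The parameter name is taken from the suggestion in this thread and is hypothetical, as are the archive path and the target time.

```
restore_command = 'cp /mnt/archive/%f "%p"'     # hypothetical archive location
recovery_target_time = '2014-09-10 12:00:00'    # hypothetical target
# One parameter instead of pause_at_recovery_target plus a new shutdown flag:
action_at_recovery_target = 'shutdown'
```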
Re: [HACKERS] pgcrypto: PGP armor headers
On 9/9/14 11:01 AM, I wrote: On 9/9/14 10:54 AM, Heikki Linnakangas wrote: So I think this (and the corresponding dearmor code too) should be refactored to use a StringInfo that gets enlarged as needed, instead of hoping to guess the size correctly beforehand. To ease review, might make sense to do that as a separate patch over current sources, and the main patch on top of that. Yeah, I thought the same thing while writing that code. Thanks, I'll do it that way. Attached is what I have right now. I started working on the decoding part, but it has this piece of code: /* decode crc */ if (b64_decode(p + 1, 4, buf) != 3) goto out; which makes the approach a bit uglier. If I did this the same way, I would have to create and destroy a StringInfo just for this operation, which seems ugly. So I wonder if I shouldn't try and instead keep the code closer to what it is in HEAD right now; I could call enlargeStringInfo() first, then hand out a pointer to b64_encode (or b64_decode()) and finally increment StringInfoData.len by how much was actually written. That would keep the code changes a lot smaller, too. Is either of these approaches anywhere near what you had in mind? I'm also not sure why we need to keep a copy of the base64 encoding/decoding logic instead of exporting it in utils/adt/encode.c. .marko *** a/contrib/pgcrypto/pgp-armor.c --- b/contrib/pgcrypto/pgp-armor.c *** *** 31,36 --- 31,38 #include postgres.h + #include lib/stringinfo.h + #include px.h #include pgp.h *** *** 38,58 * BASE64 - duplicated :( */ ! static const unsigned char _base64[] = ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/; ! static int ! b64_encode(const uint8 *src, unsigned len, uint8 *dst) { - uint8 *p, - *lend = dst + 76; const uint8 *s, *end = src + len; int pos = 2; unsigned long buf = 0; s = src; - p = dst; while (s end) { --- 40,58 * BASE64 - duplicated :( */ ! static const char _base64[] = ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/; ! 
static void ! b64_encode(const uint8 *src, unsigned len, StringInfo dst) { const uint8 *s, *end = src + len; int pos = 2; unsigned long buf = 0; + int linepos = 0; s = src; while (s end) { *** *** 65,93 b64_encode(const uint8 *src, unsigned len, uint8 *dst) */ if (pos 0) { ! *p++ = _base64[(buf 18) 0x3f]; ! *p++ = _base64[(buf 12) 0x3f]; ! *p++ = _base64[(buf 6) 0x3f]; ! *p++ = _base64[buf 0x3f]; pos = 2; buf = 0; } ! if (p = lend) { ! *p++ = '\n'; ! lend = p + 76; } } if (pos != 2) { ! *p++ = _base64[(buf 18) 0x3f]; ! *p++ = _base64[(buf 12) 0x3f]; ! *p++ = (pos == 0) ? _base64[(buf 6) 0x3f] : '='; ! *p++ = '='; } - - return p - dst; } /* probably should use lookup table */ --- 65,92 */ if (pos 0) { ! appendStringInfoCharMacro(dst, _base64[(buf 18) 0x3f]); ! appendStringInfoCharMacro(dst, _base64[(buf 12) 0x3f]); ! appendStringInfoCharMacro(dst, _base64[(buf 6) 0x3f]); ! appendStringInfoCharMacro(dst, _base64[buf 0x3f]); + linepos += 4; pos = 2; buf = 0; } ! if (linepos = 76) { ! appendStringInfoCharMacro(dst, '\n'); ! linepos = 0; } } if (pos != 2) { ! appendStringInfoCharMacro(dst, _base64[(buf 18) 0x3f]); ! appendStringInfoCharMacro(dst, _base64[(buf 12) 0x3f]); ! appendStringInfoCharMacro(dst, (pos == 0) ? _base64[(buf 6) 0x3f] : '='); ! appendStringInfoCharMacro(dst, '='); } } /* probably should use lookup table */ *** *** 160,174 b64_decode(const uint8 *src, unsigned len, uint8 *dst) } static unsigned - b64_enc_len(unsigned srclen) - { - /* -* 3 bytes will be converted to 4, linefeed after 76 chars -*/ - return (srclen + 2) * 4 / 3 + srclen / (76 * 3 / 4); - } - - static unsigned b64_dec_len(unsigned srclen) { return (srclen * 3) 2; ---
Re: [HACKERS] pgcrypto: PGP armor headers
On 09/10/2014 02:26 PM, Marko Tiikkaja wrote: On 9/9/14 11:01 AM, I wrote: On 9/9/14 10:54 AM, Heikki Linnakangas wrote: So I think this (and the corresponding dearmor code too) should be refactored to use a StringInfo that gets enlarged as needed, instead of hoping to guess the size correctly beforehand. To ease review, might make sense to do that as a separate patch over current sources, and the main patch on top of that. Yeah, I thought the same thing while writing that code. Thanks, I'll do it that way. Attached is what I have right now. I started working on the decoding part, but it has this piece of code: /* decode crc */ if (b64_decode(p + 1, 4, buf) != 3) goto out; which makes the approach a bit uglier. If I did this the same way, I would have to create and destroy a StringInfo just for this operation, which seems ugly. So I wonder if I shouldn't try and instead keep the code closer to what it is in HEAD right now; I could call enlargeStringInfo() first, then hand out a pointer to b64_encode() (or b64_decode()) and finally increment StringInfoData.len by how much was actually written. That would keep the code changes a lot smaller, too. Yeah, that sounds reasonable. I'm also not sure why we need to keep a copy of the base64 encoding/decoding logic instead of exporting it in utils/adt/encode.c. Good question... - Heikki
Re: [HACKERS] ALTER SYSTEM RESET?
On Wed, Sep 3, 2014 at 12:08 AM, Christoph Berg c...@df7cb.de wrote: Re: Vik Fearing 2014-09-02 5405d2d9.9050...@dalibo.com Uhm, are we agreed on the decision not to backpatch this? I would think this should have been part of the initial ALTER SYSTEM SET patch and thus should be backpatched to 9.4. I think it belongs in 9.4 as well, especially if we're having another beta. My original complaint was about 9.4, so I'd like to see it there, yes. ISTM that the consensus is to do the back-patch to 9.4. Does anyone object to the back-patch? If not, I will do that. Regards, -- Fujii Masao
Re: [HACKERS] pgcrypto: PGP armor headers
On 9/10/14 1:38 PM, Heikki Linnakangas wrote: On 09/10/2014 02:26 PM, Marko Tiikkaja wrote: So I wonder if I shouldn't try and instead keep the code closer to what it is in HEAD right now; I could call enlargeStringInfo() first, then hand out a pointer to b64_encode (or b64_decode()) and finally increment StringInfoData.len by how much was actually written. That would keep the code changes a lot smaller, too. Yeah, that sounds reasonable. OK, I've attemped to do that in the attached. I'm pretty sure I didn't get all of the overflows right, so someone should probably take a really good look at it. (I'm not too confident the original code got them right either, but whatever). Speaking of good looks, should I add it to the next commitfest as a new patch, or should we try and get someone to review it like this? I'm also not sure why we need to keep a copy of the base64 encoding/decoding logic instead of exporting it in utils/adt/encode.c. Good question... I've left this question unanswered for now. We can fix it later, independently of this patch. .marko *** a/contrib/pgcrypto/pgp-armor.c --- b/contrib/pgcrypto/pgp-armor.c *** *** 203,240 crc24(const uint8 *data, unsigned len) return crc 0xffL; } ! int ! pgp_armor_encode(const uint8 *src, unsigned len, uint8 *dst) { ! int n; ! uint8 *pos = dst; unsignedcrc = crc24(src, len); ! n = strlen(armor_header); ! memcpy(pos, armor_header, n); ! pos += n; ! n = b64_encode(src, len, pos); ! pos += n; ! if (*(pos - 1) != '\n') ! *pos++ = '\n'; ! *pos++ = '='; ! pos[3] = _base64[crc 0x3f]; ! crc = 6; ! pos[2] = _base64[crc 0x3f]; ! crc = 6; ! pos[1] = _base64[crc 0x3f]; ! crc = 6; ! pos[0] = _base64[crc 0x3f]; ! pos += 4; ! n = strlen(armor_footer); ! memcpy(pos, armor_footer, n); ! pos += n; ! ! return pos - dst; } static const uint8 * --- 203,239 return crc 0xffL; } ! void ! pgp_armor_encode(const uint8 *src, int len, StringInfo dst) { ! int res; ! unsignedb64len; unsignedcrc = crc24(src, len); ! 
appendStringInfoString(dst, armor_header); ! /* make sure we have enough room to b64_encode() */ ! b64len = b64_enc_len(len); ! if (b64len INT_MAX) ! ereport(ERROR, ! (errcode(ERRCODE_OUT_OF_MEMORY), !errmsg(out of memory))); ! enlargeStringInfo(dst, (int) b64len); ! res = b64_encode(src, len, (uint8 *) dst-data + dst-len); ! if (res b64len) ! elog(FATAL, overflow - encode estimate too small); ! dst-len += res; ! if (*(dst-data + dst-len - 1) != '\n') ! appendStringInfoChar(dst, '\n'); ! appendStringInfoChar(dst, '='); ! appendStringInfoChar(dst, _base64[(crc 18) 0x3f]); ! appendStringInfoChar(dst, _base64[(crc 12) 0x3f]); ! appendStringInfoChar(dst, _base64[(crc 6) 0x3f]); ! appendStringInfoChar(dst, _base64[crc 0x3f]); ! appendStringInfoString(dst, armor_footer); } static const uint8 * *** *** 309,315 find_header(const uint8 *data, const uint8 *datend, } int ! pgp_armor_decode(const uint8 *src, unsigned len, uint8 *dst) { const uint8 *p = src; const uint8 *data_end = src + len; --- 308,314 } int ! pgp_armor_decode(const uint8 *src, int len, StringInfo dst) { const uint8 *p = src; const uint8 *data_end = src + len; *** *** 319,324 pgp_armor_decode(const uint8 *src, unsigned len, uint8 *dst) --- 318,324 const uint8 *base64_end = NULL; uint8 buf[4]; int hlen; + int blen; int res = PXE_PGP_CORRUPT_ARMOR; /* armor start */ *** *** 360,382 pgp_armor_decode(const uint8 *src, unsigned len, uint8 *dst) crc = (((long) buf[0]) 16) + (((long) buf[1]) 8) + (long) buf[2]; /* decode data */ ! res = b64_decode(base64_start, base64_end - base64_start, dst); ! ! /* check crc */ ! if (res = 0 crc24(dst, res) != crc) ! res = PXE_PGP_CORRUPT_ARMOR; out: return res; } - - unsigned - pgp_armor_enc_len(unsigned len) - { - return b64_enc_len(len) + strlen(armor_header) + strlen(armor_footer) + 16; - } - - unsigned - pgp_armor_dec_len(unsigned len) - { - return b64_dec_len(len); - } --- 360,377 crc = (((long) buf[0]) 16) + (((long) buf[1]) 8) + (long) buf[2]; /* decode data */ ! 
blen = (int) b64_dec_len(len); ! enlargeStringInfo(dst, blen); ! res =
Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
Fujii Masao wrote: On Wed, Sep 10, 2014 at 12:15 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: PENDING_LIST_CLEANUP_SIZE and work_mem, for this setting. Wouldn't it be easier to use to have only one parameter, PENDING_LIST_CLEANUP_SIZE? How about setting PENDING_LIST_CLEANUP_SIZE to work_mem as the default value when running the CREATE INDEX command? That's an idea. But there might be some users who want to change the cleanup size per session like they can do by setting work_mem, and your idea would prevent them from doing that... So what about introducing pending_list_cleanup_size also as a GUC? That is, users can set the threshold by using either the reloption or the GUC. Yes, I think having both a GUC and a reloption makes sense -- the GUC applies to all indexes, and can be tweaked for individual indexes using the reloption. I'm not sure about the idea of being able to change it per session, though. Do you mean that you would like insert processes to use a very large value so that they can just append new values to the pending list, and have vacuum use a small value so that it cleans up as soon as it runs? Two things: 1. we could have an autovacuum_ reloption which only changes what autovacuum does; 2. we could have autovacuum run index cleanup actions separately from actual vacuuming. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
Thomas Munro wrote: @@ -2022,7 +2030,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode, bool noWait, */ test = heap_lock_tuple(relation, tuple, estate->es_output_cid, - lockmode, noWait, + lockmode, wait_policy, false, &buffer, &hufd); /* We now have two pins on the buffer, get rid of one */ ReleaseBuffer(buffer); Doesn't this heap_lock_tuple() need to check for a WouldBlock result? Not sure that this is the same case that you were trying to test in heap_lock_updated_tuple; I think this requires an update chain (so that EPQFetch is invoked) and some tuple later in the chain is locked by a third transaction. I attach some additional minor suggestions to your patch. Please feel free to reword comments differently if you think my wording isn't an improvements (or I've maked an english mistakes). -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pgbench throttling latency limit
On 09/09/2014 03:35 PM, Fabien COELHO wrote: Hello Heikki, I think we have to reconsider what we're reporting in 9.4, when --rate is enabled, even though it's already very late in the release cycle. It's a bad idea to change the definition of latency between 9.4 and 9.5, so let's get it right in 9.4. Indeed. As per the attached patch. I think we should commit this to 9.4. Any objections? Ok for me. Some more propositions about the doc below. I looked closer at the this, and per Jan's comments, realized that we don't log the lag time in the per-transaction log file. I think that's a serious omission; when --rate is used, the schedule lag time is important information to make sense of the result. I think we have to apply the attached patch, and backpatch to 9.4. (The documentation on the log file format needs updating) Also, this is bizarre: /* * Use inverse transform sampling to randomly generate a delay, such * that the series of delays will approximate a Poisson distribution * centered on the throttle_delay time. * * 1 implies a 9.2 (-log(1/1)) to 0.0 (log 1) delay * multiplier, and results in a 0.055 % target underestimation bias: * * SELECT 1.0/AVG(-LN(i/1.0)) FROM generate_series(1,1) AS i; * = 1.00055271703266335474 * * If transactions are too slow or a given wait is shorter than a * transaction, the next transaction will start right away. */ int64 wait = (int64) (throttle_delay * 1.00055271703 * -log(getrand(thread, 1, 1) / 1.0)); We're using getrand() to generate a uniformly distributed random value between 1 and 1, and then convert it to a double between 0.0 and 1.0. But getrand() is implemented by taking a double between 0.0 and 1.0 and converting it an integer, so we're just truncating the original floating point number unnecessarily. I think we should add a new function, getPoissonRand(), that uses pg_erand48() directly. We already have similiar getGaussianRand() and getExponentialRand() functions. 
Barring objections, I'll prepare another patch to do that, and backpatch to 9.4. - Heikki diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index 2f7d80e..f7a3a5f 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -210,10 +210,10 @@ typedef struct * sent */ int sleeping; /* 1 indicates that the client is napping */ bool throttling; /* whether nap is for throttling */ - int64 until; /* napping until (usec) */ Variable *variables; /* array of variable definitions */ int nvariables; - instr_time txn_begin; /* used for measuring transaction latencies */ + int64 txn_scheduled; /* scheduled start time of transaction (usec) */ + instr_time txn_begin; /* used for measuring schedule lag times */ instr_time stmt_begin; /* used for measuring statement latencies */ int64 txn_latencies; /* cumulated latencies */ int64 txn_sqlats; /* cumulated square latencies */ @@ -284,12 +284,17 @@ typedef struct long start_time; /* when does the interval start */ int cnt; /* number of transactions */ - double min_duration; /* min/max durations */ - double max_duration; - double sum; /* sum(duration), sum(duration^2) - for + + double min_latency; /* min/max latencies */ + double max_latency; + double sum_latency; /* sum(latency), sum(latency^2) - for * estimates */ - double sum2; + double sum2_latency; + double min_lag; + double max_lag; + double sum_lag; /* sum(lag) */ + double sum2_lag; /* sum(lag*lag) */ } AggVals; static Command **sql_files[MAX_FILES]; /* SQL script files */ @@ -968,12 +973,18 @@ agg_vals_init(AggVals *aggs, instr_time start) { /* basic counters */ aggs-cnt = 0;/* number of transactions */ - aggs-sum = 0;/* SUM(duration) */ - aggs-sum2 = 0;/* SUM(duration*duration) */ + aggs-sum_latency = 0; /* SUM(latency) */ + aggs-sum2_latency = 0;/* SUM(latency*latency) */ /* min and max transaction duration */ - aggs-min_duration = 0; - aggs-max_duration = 0; + aggs-min_latency = 0; + aggs-max_latency = 0; + + /* schedule lag counters */ + 
aggs-sum_lag = 0; + aggs-sum2_lag = 0; + aggs-min_lag = 0; + aggs-max_lag = 0; /* start of the current interval */ aggs-start_time = INSTR_TIME_GET_DOUBLE(start); @@ -1016,7 +1027,7 @@ top: thread-throttle_trigger += wait; - st-until = thread-throttle_trigger; + st-txn_scheduled = thread-throttle_trigger; st-sleeping = 1; st-throttling = true; st-is_throttled = true; @@ -1032,13 +1043,13 @@ top: INSTR_TIME_SET_CURRENT(now); now_us = INSTR_TIME_GET_MICROSEC(now); - if (st-until = now_us) + if (st-txn_scheduled = now_us) {
Re: [HACKERS] pgbench throttling latency limit
Hi, I found a typo in your patch. Please confirm. @line 239 - agg->sum2_lag = 0; + agg->sum_lag = 0; And a backpatch is welcome for me. Best Regards, -- Mitsumasa KONDO
Re: [HACKERS] pgbench throttling latency limit
Hello Heikki, I looked closer at this, and per Jan's comments, realized that we don't log the lag time in the per-transaction log file. I think that's a serious omission; when --rate is used, the schedule lag time is important information to make sense of the result. I think we have to apply the attached patch, and backpatch to 9.4. (The documentation on the log file format needs updating) Indeed. I think that people do not like it to change. I remember that I suggested to change timestamps to xxx.yyy instead of the unreadable xxx yyy, and was told not to, because some people have tools which process the output so the format MUST NOT CHANGE. So my behavior is to avoid touching anything in this area. I'm fine if you do it, though:-) However I have no time to have a precise look at your patch to cross-check it before Friday. Also, this is bizarre: int64 wait = (int64) (throttle_delay * 1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0)); We're using getrand() to generate a uniformly distributed random value between 1 and 10000, and then convert it to a double between 0.0 and 1.0. The reason for this conversion is to have randomness but to still avoid going to extreme multiplier values. The idea is to avoid a very large multiplier which would generate (even if it is not often) a 0 tps when 100 tps is required. The 1/10000 granularity is basically random but the multiplier stays bounded (max 9.2, so the minimum possible tps would be around 11 for a target of 100 tps, bar issues from the database for processing the transactions). So although this feature can be discussed and amended, it is fully voluntary. I think that it makes sense so I would prefer to keep it as is. Maybe the comments could be updated to be clearer. -- Fabien.
Re: [HACKERS] Spinlocks and compiler/memory barriers
On Tue, Sep 9, 2014 at 6:00 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-09 17:54:03 -0400, Robert Haas wrote: So, that's committed, then. Yay, finally. I think we should pick something that uses spinlocks and is likely to fail spectacularly if we haven't got this totally right yet, and de-volatilize it. And then watch to see what turns red in the buildfarm and/or which users start screaming. Good plan. I'm inclined to propose lwlock.c as a candidate, since that's very widely used and a place where we know there's significant contention. I suggest, additionally possibly, GetSnapshotData(). It's surely one of the hottest functions in postgres, and I've seen some performance increases from de-volatilizing it. IIRC it requires one volatile cast in one place to enforce that a variable is accessed only once. Not sure if we want to add such volatile casts or use something like Linux's ACCESS_ONCE macros for some common types. Helps to grep for places worthy of inspection. GetSnapshotData() isn't quite to the point, because it's using a volatile pointer, but not because of anything about spinlock manipulation. That would, perhaps, be a good test for barriers, but not for this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] Aussie timezone database changes incoming
In connection with a question asked today on pgsql-general, I had occasion to go check the release announcements for the IANA timezone database files, and it turns out that there are some big changes in 2014f: http://mm.icann.org/pipermail/tz-announce/2014-August/23.html The Russian changes are perhaps not such a big deal because they've done that sort of thing before, but this is an earful: Australian eastern time zone abbreviations are now AEST/AEDT not EST, and similarly for the other Australian zones. That is, for eastern standard and daylight saving time the abbreviations are AEST and AEDT instead of the former EST for both; similarly, ACST/ACDT, ACWST/ACWDT, and AWST/AWDT are now used instead of the former CST, CWST, and WST. This change does not affect UTC offsets, only time zone abbreviations. (Thanks to Rich Tibbett and many others.) I'm wondering how many Aussie applications are going to break when this goes in, and if we could/should do anything about it. One idea that comes to mind is to create an Australia_old tznames file containing the current Aussie zone abbreviations, so as to provide an easy way to maintain backwards compatibility at need (you'd select that as your timezone_abbreviations GUC setting). Anyone from down under care to remark about the actual usage of old and new abbreviations? regards, tom lane
Re: [HACKERS] add modulo (%) operator to pgbench
On Wed, Sep 10, 2014 at 4:48 AM, Fabien COELHO coe...@cri.ensmp.fr wrote: Note I did not start with the non-uniform stuff, but Mitsumasa-san sent a gaussian distribution patch and I jumped onto the wagon to complement it with an exponential distribution patch. I knew when doing it that it was not enough, but as I said, one piece at a time, given the effort required to pass a simple patch. What is still needed for the overall purpose is the ability to scatter the distribution. This is really: (1) a positive remainder modulo, which is the trivial patch under discussion (2) a fast but good-quality-for-the-purpose hash function, also a rather small patch, not submitted yet. (3) maybe the '|' operator to do a TP*-like non-uniform load, which is really periodic so I do not like it; a trivial patch, not submitted yet. If you do not want one of these pieces (1 & 2), basically the interest of the gaussian/exponential addition is much reduced, and this is just a half-baked effort aborted because you did not want what was required to make it useful. Well, I can only disagree, but you are a committer and I'm not! I am not objecting to the functionality; I'm objecting to bolting on ad-hoc operators one at a time. I think an expression syntax would let us do this in a much more scalable way. If I had time, I'd go do that, but I don't. We could add abs(x) and hash(x) and it would all be grand. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pgbench throttling latency limit
On 09/10/2014 05:57 PM, Fabien COELHO wrote: Hello Heikki, I looked closer at this, and per Jan's comments, realized that we don't log the lag time in the per-transaction log file. I think that's a serious omission; when --rate is used, the schedule lag time is important information to make sense of the result. I think we have to apply the attached patch, and backpatch to 9.4. (The documentation on the log file format needs updating) Indeed. I think that people do not like it to change. I remember that I suggested to change timestamps to xxx.yyy instead of the unreadable xxx yyy, and was told not to, because some people have tools which process the output so the format MUST NOT CHANGE. So my behavior is to avoid touching anything in this area. I'm fine if you do it, though:-) However I have no time to have a precise look at your patch to cross-check it before Friday. This is different from changing xxx yyy to xxx.yyy in two ways. First, this adds new information to the log that just isn't there otherwise, it's not just changing the format for the sake of it. Second, in this case it's easy to write a parser for the log format so that it works with the old and new formats. Many such programs would probably ignore any unexpected extra fields, as a matter of lazy programming, while changing the separator from a space to a dot would surely require changing every parsing program. We could leave out the lag fields, though, when --rate is not used. Without --rate, the lag is always zero anyway. That would keep the file format unchanged when you don't use the new --rate feature. I'm not sure if that would be better or worse for programs that might want to parse the information. Also, this is bizarre: int64 wait = (int64) (throttle_delay * 1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0)); We're using getrand() to generate a uniformly distributed random value between 1 and 10000, and then convert it to a double between 0.0 and 1.0. But getrand() is implemented by taking a double between 0.0 and 1.0 and converting it to an integer, so we're just truncating the original floating point number unnecessarily. I think we should add a new function, getPoissonRand(), that uses pg_erand48() directly. We already have similar getGaussianRand() and getExponentialRand() functions. 
The reason for this conversion is to have randomness but to still avoid going to extreme multiplier values. The idea is to avoid a very large multiplier which would generate (even if it is not often) a 0 tps when 100 tps is required. The 1/10000 granularity is basically random but the multiplier stays bounded (max 9.2, so the minimum possible tps would be around 11 for a target of 100 tps, bar issues from the database for processing the transactions). So although this feature can be discussed and amended, it is fully voluntary. I think that it makes sense so I would prefer to keep it as is. Maybe the comments could be updated to be clearer. Ok, yeah, the comments indeed didn't mention anything about that. I don't think such clamping is necessary. With that 9.2x clamp on the multiplier, the probability that any given transaction hits it is about 1/10000. And a delay 9.2 times the average is still quite reasonable, IMHO. The maximum delay on my laptop, when pg_erand48() returns DBL_MIN, seems to be about 700 x the average, which is still reasonable if you run a decent number of transactions. And of course, the probability of hitting such an extreme value is minuscule, so if you're just doing a few quick test runs with a small total number of transactions, you won't hit that. - Heikki
Re: [HACKERS] pgcrypto: PGP armor headers
On 09/10/2014 04:35 PM, Marko Tiikkaja wrote: Speaking of good looks, should I add it to the next commitfest as a new patch, or should we try and get someone to review it like this? Let's handle this in this commitfest. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory Alignment in Postgres
On Tue, Sep 9, 2014 at 10:08 AM, Arthur Silva arthur...@gmail.com wrote: I'm continuously studying the Postgres codebase. Hopefully I'll be able to make some contributions in the future. For now I'm intrigued by the extensive use of memory alignment. I'm sure there's some "legacy" and some "the architecture requires it" reasoning behind it. That aside, since it wastes space (a lot of space in some cases) there must be a tipping point somewhere. I'm sure one can prove aligned access is faster in a micro-benchmark, but I'm not sure that's the case in a DBMS like Postgres, especially in the page/rows area. Just for the sake of comparison, MySQL COMPACT storage (default and recommended since 5.5) doesn't align data at all. MySQL NDB uses a fixed 4-byte alignment. Not sure about Oracle and others. Is it worth the extra space on newer architectures (especially Intel)? Do you guys think this is something worth looking at? Yes. At least in my opinion, though, it's not a good project for a beginner. If you get your changes to take effect, you'll find that a lot of things will break in places that are not easy to find or fix. You're getting into really low-level areas of the system that get touched infrequently and require a lot of expertise in how things work today to adjust. The idea I've had before is to try to reduce the widest alignment we ever require from 8 bytes to 4 bytes. That is, look for types with typalign = 'd', and rewrite them to have typalign = 'i' by having them use two 4-byte loads to load an eight-byte value. In practice, I think this would probably save a high percentage of what can be saved, because 8-byte alignment implies a maximum of 7 bytes of wasted space, while 4-byte alignment implies a maximum of 3 bytes of wasted space. And it would probably be pretty cheap, too, because any type with less than 8 byte alignment wouldn't be affected at all, and even those types that were affected would only be slightly slowed down by doing two loads instead of one. 
In contrast, getting rid of alignment requirements completely would save a little more space, but probably at the cost of a lot more slowdown: any type with alignment requirements would have to fetch the value byte-by-byte instead of pulling the whole thing out at once. But there are a couple of obvious problems with this idea, too, such as: 1. It's really complicated and a ton of work. 2. It would break pg_upgrade pretty darn badly unless we employed some even-more-complex strategy to mitigate that. 3. The savings might not be enough to justify the effort. It might be interesting for someone to develop a tool measuring the number of bytes of alignment padding we lose per tuple or per page and gather some statistics on it on various databases. That would give us some sense as to the possible savings. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
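The "two 4-byte loads" idea Robert describes can be sketched like this. This is illustrative only, not PostgreSQL code; the function name is made up.

```c
#include <stdint.h>
#include <string.h>

/* Fetch an 8-byte value from a location guaranteed only 4-byte
 * alignment by doing two 4-byte copies instead of one 8-byte load.
 * Compilers typically turn these fixed-size memcpy calls into two
 * aligned 32-bit loads, avoiding a potentially-trapping misaligned
 * 64-bit access on strict-alignment platforms. */
static uint64_t
fetch_int64_4byte_aligned(const void *p)
{
    uint64_t    v;

    memcpy(&v, p, 4);
    memcpy((char *) &v + 4, (const char *) p + 4, 4);
    return v;
}
```

A type whose on-disk representation only needed 4-byte alignment (typalign = 'i') could use an accessor like this, at the cost of one extra load, in exchange for up to 4 fewer padding bytes per value.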
Re: [HACKERS] Escaping from blocked send() reprised.
Wrong thread... On 09/10/2014 03:04 AM, Kyotaro HORIGUCHI wrote: Hmm. Sorry, I misunderstood the specification. Your approach of coloring tokens seems right, but you have broken the parse logic by adding your code. Other than the mistakes others pointed out, I found that - non-SQL-ident-like tokens are ignored by their token style, quoted or not, so the following line works. | local All aLL trust I suppose this is not what you intended. This is because you have ignored the attribute of a token when comparing it as non-SQL-ident tokens. - '+' at the head of the sequence '+' is treated as the first character of the *quoted* string. e.g. +hoge is tokenized as +hoge:special_quoted. I found this is what was intended. This should be documented in comments. |2) users and user-groups only require special handling and behavior as follows |Normal user : | A. unquoted ( USER ) will be treated as user ( downcase ). | B. quoted ( USeR ) will be treated as USeR (case-sensitive). | C. quoted ( +USER ) will be treated as the normal user +USER (i.e. it will not be considered a user-group), and case-sensitively, as the string is quoted. This seems confusing with the B below. It seems this should be rearranged. | User Group : | A. unquoted ( +USERGROUP ) will be treated as +usergroup ( downcase ). | B. plus quoted ( +UserGROUP ) will be treated as +UserGROUP (case-sensitive). This is why you simply continued processing for '+' without discarding and skipping the '+', and without setting in_quote, so the following parser code does not work as intended. You should understand what the original code does and insert or modify logic without breaking its assumptions. regards, -- - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
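The user/user-group quoting rules enumerated above can be sketched as a small normalization function. This is a hypothetical illustration of the described behavior, not the patch's code; the names `UserToken` and `normalize_user_token` are made up.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

typedef struct
{
    bool        is_group;
    char        name[64];
} UserToken;

/* quoted: the whole token was quoted, so "+USER" is a plain user
 * literally named +USER.  plus_quoted: the part after an unquoted
 * '+' was quoted, so +"UserGROUP" is the group UserGROUP with case
 * preserved.  Unquoted tokens are down-cased. */
static UserToken
normalize_user_token(const char *tok, bool quoted, bool plus_quoted)
{
    UserToken   t = {false, ""};
    char       *p;

    if (!quoted && tok[0] == '+')
    {
        t.is_group = true;      /* group marker, not part of the name */
        tok++;
    }
    strncpy(t.name, tok, sizeof(t.name) - 1);
    if (!quoted && !plus_quoted)
        for (p = t.name; *p; p++)
            *p = (char) tolower((unsigned char) *p);
    return t;
}
```

Under this sketch, `USER` becomes the user `user`, `"USeR"` stays `USeR`, `"+USER"` is a plain user named `+USER`, `+USERGROUP` becomes the group `usergroup`, and `+"UserGROUP"` becomes the group `UserGROUP`, matching rules A-C above.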
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
Tomas == Tomas Vondra t...@fuzzy.cz writes: Tomas If we can get rid of the excessive ChainAggregate, that's Tomas certainly enough for now. I found an algorithm that should provably give the minimal number of sorts (I was afraid that problem would turn out to be NP-hard, but not so - it's solvable in P by reducing it to a problem of maximal matching in bipartite graphs). Updated patch should be forthcoming in a day or two. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Sep 9, 2014 at 2:00 PM, Robert Haas robertmh...@gmail.com wrote: Boiled down, what you're saying is that you might have a set that contains lots of duplicates in general, but not very many where the abbreviated-keys also match. Sure, that's true. Abbreviated keys are not used in the case where we do a (fully) opportunistic memcmp(), without really having any idea of whether or not it'll work out. Abbreviated keys aren't really relevant to that case, except perhaps in that we know we'll have statistics available for leading attributes, which will make the case less frequent in practice. But you might also not have that case, so I don't see where that gets us; the same worst-case test case Heikki developed the last time we relitigated this point is still relevant here. Well, I think that there should definitely be a distinction made between abbreviated and non-abbreviated cases; you could frequently have almost 100% certainty that each of those optimistic memcmp()s will work out with abbreviated keys. Low cardinality sets are very common. I'm not sure what your position on that is. My proposal to treat both of those cases (abbreviated with a cost model/cardinality statistics; non-abbreviated without) the same is based on different arguments for each case. In order to know how much we're giving up in that case, we need the exact number I asked you to provide in my previous email: the ratio of the cost of strcoll() to the cost of memcmp(). I see that you haven't chosen to provide that information in any of your four responses. Well, it's kind of difficult to give that number in a vacuum. I showed a case that had a large majority of opportunistic memcmp()s go to waste, while a small number were useful, which still put us ahead. I can look at Heikki's case with this again if you think that'll help. 
Heikki said that his case was all about wasted strxfrm()s, which are surely much more expensive than wasted memcmp()s, particularly when you consider temporal locality (we needed that memory to be stored in a cacheline for the immediately subsequent operation anyway, should the memcmp() thing not work out - the simple ratio that you're interested in may be elusive). In case I haven't been clear enough on this point, I re-emphasize that I do accept that for something like the non-abbreviated case, the opportunistic memcmp() thing must virtually be free if we're to proceed, since it is purely opportunistic. If it can be demonstrated that that isn't the case (and if that cannot be fixed by limiting it to CACHE_LINE_SIZE), clearly we should not proceed with opportunistic (non-abbreviated) memcmp()s. In fact, I think I'm holding it to a higher standard than you are - I believe that it had better be virtually free. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
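The opportunistic-memcmp() idea under discussion can be sketched as follows. This is an illustration of the technique, not the patch itself: byte-for-byte identical strings compare equal under any collation, so a cheap memcmp() can short-circuit the expensive strcoll().

```c
#include <locale.h>
#include <string.h>

/* Opportunistic equality check before a locale-aware comparison.
 * Costs one memcmp() when it fails, but saves a strcoll() whenever
 * the strings are bytewise equal -- the low-cardinality case Peter
 * describes, where equal keys are very common. */
static int
opportunistic_cmp(const char *a, const char *b)
{
    size_t      alen = strlen(a);
    size_t      blen = strlen(b);

    if (alen == blen && memcmp(a, b, alen) == 0)
        return 0;               /* cheap win: definitely equal */

    return strcoll(a, b);       /* fall back to the collation */
}
```

Note that the memcmp() result is only trusted when it reports equality; an inequality says nothing about collation order, which is why the strcoll() fallback is still required.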
Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]
2014-09-10 0:13 GMT+02:00 Andres Freund and...@2ndquadrant.com: On 2014-09-09 22:22:45 +0200, Andres Freund wrote: I plan to push this soon. Done. Thanks for the patch! Thank you very much Pavel Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 09/10/2014 01:49 AM, Tomas Vondra wrote: On 9.9.2014 16:09, Robert Haas wrote: On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra t...@fuzzy.cz wrote: So I only posted the separate patch for those who want to do a review, and then a complete patch with both parts combined. But it sure may be a bit confusing. Let's do this: post each new version of the patches only on this thread, as a series of patches that can be applied one after another. OK, attached. Apply in this order: 1) dense-alloc-v5.patch 2) hashjoin-nbuckets-v12.patch The dense-alloc-v5.patch looks good to me. I have committed that with minor cleanup (more comments below). I have not looked at the second patch. I also did a few 'minor' changes to the dense allocation patch, most notably: * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData The original naming seemed a bit awkward. That's too easy to confuse with regular MemoryContext and AllocChunk stuff. I renamed it to HashMemoryChunk. * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I also considered Heikki's suggestion that while rebatching, we can reuse chunks with a single large tuple, instead of copying it needlessly. My attempt however made the code much uglier (I almost used a GOTO, but my hands refused to type it ...). I'll look into that. I tried constructing a test case where the rehashing time would be significant enough for this to matter, but I couldn't find one. 
Even if the original estimate on the number of batches is way too small, we ramp up quickly to a reasonable size and the rehashing time is completely insignificant compared to all the other work. So I think we can drop that idea. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
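The dense allocation scheme being committed here (32 kB chunks, tuples larger than 1/4 of the chunk getting a dedicated chunk) can be sketched as follows. This is illustrative only; PostgreSQL's version allocates from a memory context rather than malloc(), and the names here are made up.

```c
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE          (32 * 1024)
#define OVERSIZE_THRESHOLD  (CHUNK_SIZE / 4)

typedef struct HashChunk
{
    struct HashChunk *next;
    size_t      used;           /* bytes handed out so far */
    size_t      maxlen;         /* capacity of data[] */
    char        data[];         /* tuples packed densely here */
} HashChunk;

static HashChunk *chunk_list = NULL;

static void *
dense_alloc(size_t size)
{
    HashChunk  *chunk;
    void       *ptr;

    if (size > OVERSIZE_THRESHOLD)
    {
        /* oversized tuple: give it its own chunk, linked in *behind*
         * the list head so the current chunk keeps filling up */
        chunk = malloc(offsetof(HashChunk, data) + size);
        chunk->used = chunk->maxlen = size;
        if (chunk_list != NULL)
        {
            chunk->next = chunk_list->next;
            chunk_list->next = chunk;
        }
        else
        {
            chunk->next = NULL;
            chunk_list = chunk;
        }
        return chunk->data;
    }

    /* start a new chunk if the current one can't fit this tuple */
    if (chunk_list == NULL || chunk_list->maxlen - chunk_list->used < size)
    {
        chunk = malloc(offsetof(HashChunk, data) + CHUNK_SIZE);
        chunk->used = 0;
        chunk->maxlen = CHUNK_SIZE;
        chunk->next = chunk_list;
        chunk_list = chunk;
    }

    ptr = chunk_list->data + chunk_list->used;
    chunk_list->used += size;   /* the real code would MAXALIGN(size) */
    return ptr;
}
```

The payoff is that small tuples carry no per-allocation header overhead, and rebatching can walk and free whole chunks instead of individual tuples.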
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Wed, Sep 10, 2014 at 1:36 PM, Peter Geoghegan p...@heroku.com wrote: In order to know how much we're giving up in that case, we need the exact number I asked you to provide in my previous email: the ratio of the cost of strcoll() to the cost of memcmp(). I see that you haven't chosen to provide that information in any of your four responses. Well, it's kind of difficult to give that number in a vacuum. No, not really. All you have to do is write a little test program to gather the information. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
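A little test program of the sort Robert suggests could look like this. It is a hypothetical sketch: the helper name, inputs, and iteration count are arbitrary, and a serious measurement would vary string length, locale, and how similar the operands are.

```c
#include <string.h>
#include <time.h>

/* Time `iters` comparisons of the same pair of strings, using either
 * strcoll() or memcmp(), and return the elapsed CPU seconds.  The
 * ratio of the two timings estimates the strcoll()/memcmp() cost
 * ratio asked for above. */
static double
time_cmp(int use_strcoll, const char *a, const char *b,
         size_t len, int iters)
{
    volatile int sink = 0;      /* keep the compiler from eliding the loop */
    clock_t     start = clock();
    int         i;

    for (i = 0; i < iters; i++)
        sink += use_strcoll ? strcoll(a, b) : memcmp(a, b, len);
    (void) sink;
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_cmp(0, ...)` and `time_cmp(1, ...)` on identical inputs and dividing the timings gives a first-order estimate of the ratio; results will differ substantially by locale and platform.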
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The dense-alloc-v5.patch looks good to me. I have committed that with minor cleanup (more comments below). I have not looked at the second patch. Gah. I was in the middle of doing this. Sigh. * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I think we should change the threshold here to 1/8th. The worst case memory wastage as-is is ~32k/5 > 6k. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench throttling latency limit
On 09/10/2014 11:28 AM, Heikki Linnakangas wrote: On 09/10/2014 05:57 PM, Fabien COELHO wrote: Hello Heikki, I looked closer at this, and per Jan's comments, realized that we don't log the lag time in the per-transaction log file. I think that's a serious omission; when --rate is used, the schedule lag time is important information to make sense of the result. I think we have to apply the attached patch, and backpatch to 9.4. (The documentation on the log file format needs updating) Indeed. I think that people do not like it to change. I remember that I suggested to change timestamps to xxx.yy instead of the unreadable xxxyyy, and was told not to, because some people have tools which process the output, so the format MUST NOT CHANGE. So my behavior is now to avoid touching anything in this area. I'm fine if you do it, though:-) However I have no time to have a precise look at your patch to cross-check it before Friday. This is different from changing xxx yyy to xxx.yyy in two ways. First, this adds new information to the log that just isn't there otherwise; it's not just changing the format for the sake of it. Second, in this case it's easy to write a parser for the log format so that it works with both the old and new formats. Many such programs would probably ignore any unexpected extra fields, as a matter of lazy programming, while changing the separator from space to a dot would surely require changing every parsing program. We could leave out the lag fields, though, when --rate is not used. Without --rate, the lag is always zero anyway. That would keep the file format unchanged when you don't use the new --rate feature. I'm not sure if that would be better or worse for programs that might want to parse the information. We could also leave the default output format as is and introduce another option with a % style format string. 
Jan Also, this is bizarre: int64 wait = (int64) (throttle_delay * 1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0)); We're using getrand() to generate a uniformly distributed random value between 1 and 10000, and then converting it to a double between 0.0 and 1.0. The reason for this conversion is to have randomness but still avoid going to extreme multiplier values. The idea is to avoid a very large multiplier which would generate (even if not often) 0 tps when 100 tps is required. The 1/10000 granularity is basically random, but the multiplier stays bounded (max 9.2, so the minimum possible tps would be around 11 for a target of 100 tps, barring issues from the database in processing the transactions). So although this feature can be discussed and amended, it is fully deliberate. I think that it makes sense, so I would prefer to keep it as is. Maybe the comments could be updated to be clearer. Ok, yeah, the comments indeed didn't mention anything about that. I don't think such clamping is necessary. With that 9.2x clamp on the multiplier, the probability that any given transaction hits it is about 1/10000. And a delay 9.2 times the average is still quite reasonable, IMHO. The maximum delay on my laptop, when pg_erand48() returns DBL_MIN, seems to be about 700 x the average, which is still reasonable if you run a decent number of transactions. And of course, the probability of hitting such an extreme value is minuscule, so if you're just doing a few quick test runs with a small total number of transactions, you won't hit that. - Heikki -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] removing volatile qualifiers from lwlock.c
As discussed on the thread Spinlocks and compiler/memory barriers, now that we've made the spinlock primitives function as compiler barriers (we think), it should be possible to remove volatile qualifiers from many places in the source code. The attached patch does this in lwlock.c. If the changes in commit 0709b7ee72e4bc71ad07b7120acd117265ab51d0 (and follow-on commits) are correct and complete, applying this shouldn't break anything, while possibly giving the compiler room to optimize things better than it does today. However, demonstrating the necessity of that commit for these changes seems to be non-trivial. I tried applying this patch and reverting commits 5b26278822c69dd76ef89fd50ecc7cdba9c3f035, b4c28d1b92c81941e4fc124884e51a7c110316bf, and 0709b7ee72e4bc71ad07b7120acd117265ab51d0 on a PPC64 POWER8 box with a whopping 192 hardware threads (thanks, IBM!). I then ran the regression tests repeatedly, and I ran several long pgbench runs with as many as 350 concurrent clients. No failures. So I'm posting this patch in the hope that others can help. The relevant tests are: 1. If you apply this patch to master and run tests of whatever kind strikes your fancy, does anything break under high concurrency? If it does, then the above commits weren't enough to make this safe on your platform. 2. If you apply this patch to master, revert the commits mentioned above, and again run tests, does anything now break? If it does (but the first tests were OK), then that shows that those commits did something useful on your platform. The change to make spinlocks act as compiler barriers provides the foundation for, hopefully, a cleaner code base and easier future work on scalability and performance projects, so help is much appreciated. 
Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index 7c96da5..66fb2e4 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -112,7 +112,7 @@ static lwlock_stats lwlock_stats_dummy; bool Trace_lwlocks = false; inline static void -PRINT_LWDEBUG(const char *where, const volatile LWLock *lock) +PRINT_LWDEBUG(const char *where, const LWLock *lock) { if (Trace_lwlocks) elog(LOG, %s(%s %d): excl %d shared %d head %p rOK %d, @@ -406,9 +406,7 @@ LWLock * LWLockAssign(void) { LWLock *result; - - /* use volatile pointer to prevent code rearrangement */ - volatile int *LWLockCounter; + int *LWLockCounter; LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int)); SpinLockAcquire(ShmemLock); @@ -429,9 +427,7 @@ int LWLockNewTrancheId(void) { int result; - - /* use volatile pointer to prevent code rearrangement */ - volatile int *LWLockCounter; + int *LWLockCounter; LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int)); SpinLockAcquire(ShmemLock); @@ -511,9 +507,8 @@ LWLockAcquireWithVar(LWLock *l, uint64 *valptr, uint64 val) /* internal function to implement LWLockAcquire and LWLockAcquireWithVar */ static inline bool -LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val) +LWLockAcquireCommon(LWLock *lock, LWLockMode mode, uint64 *valptr, uint64 val) { - volatile LWLock *lock = l; PGPROC *proc = MyProc; bool retry = false; bool result = true; @@ -525,7 +520,7 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val) PRINT_LWDEBUG(LWLockAcquire, lock); #ifdef LWLOCK_STATS - lwstats = get_lwlock_stats_entry(l); + lwstats = get_lwlock_stats_entry(lock); /* Count lock acquisition attempts */ if (mode == LW_EXCLUSIVE) @@ -642,13 +637,13 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val) * so that the lock manager or 
signal manager will see the received * signal when it next waits. */ - LOG_LWDEBUG(LWLockAcquire, T_NAME(l), T_ID(l), waiting); + LOG_LWDEBUG(LWLockAcquire, T_NAME(lock), T_ID(lock), waiting); #ifdef LWLOCK_STATS lwstats-block_count++; #endif - TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), T_ID(lock), mode); for (;;) { @@ -659,9 +654,9 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val) extraWaits++; } - TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), T_ID(lock), mode); - LOG_LWDEBUG(LWLockAcquire, T_NAME(l), T_ID(l), awakened); + LOG_LWDEBUG(LWLockAcquire, T_NAME(lock), T_ID(lock), awakened); /* Now loop back and try to acquire lock again. */ retry = true; @@ -675,10 +670,10 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val) /* We are done updating shared state of the lock itself. */ SpinLockRelease(lock-mutex); - TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), mode); +
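The property the patch relies on, that a lock acquire/release which is also a compiler barrier makes volatile qualifiers on the protected state unnecessary, can be illustrated with a C11 sketch. This is illustrative only; PostgreSQL's spinlocks are its own s_lock.h primitives, not C11 atomics, and the names here are made up.

```c
#include <stdatomic.h>

static atomic_flag mutex = ATOMIC_FLAG_INIT;
static int shared_counter = 0;      /* note: no volatile qualifier */

static void
spin_acquire(void)
{
    /* test-and-set has (at least) acquire semantics and acts as a
     * compiler barrier, so reads of shared_counter cannot be hoisted
     * above this point */
    while (atomic_flag_test_and_set(&mutex))
        ;                           /* spin */
}

static void
spin_release(void)
{
    /* clear has release semantics: the store to shared_counter
     * cannot be sunk below this point */
    atomic_flag_clear(&mutex);
}

static int
increment_counter(void)
{
    int         val;

    spin_acquire();
    val = ++shared_counter;
    spin_release();
    return val;
}
```

Before commit 0709b7ee, the spinlock macros were not guaranteed compiler barriers, so code had to access shared state through volatile pointers to prevent the compiler from caching values across the lock calls; with barrier semantics in the primitives, as above, that crutch can go.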
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 09/10/2014 09:31 PM, Robert Haas wrote: On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The dense-alloc-v5.patch looks good to me. I have committed that with minor cleanup (more comments below). I have not looked at the second patch. Gah. I was in the middle of doing this. Sigh. Sorry. * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I think we should change the threshold here to 1/8th. The worst case memory wastage as-is is ~32k/5 > 6k. Not sure how you arrived at that number. The worst case is that each 32k chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each chunk. That's 25% wastage. That's not too bad IMHO - the worst case waste of AllocSet is 50%. But if you want to twiddle it, feel free. Note that there's some interplay with the allocset code, like Tomas mentioned. If the threshold is 8k, palloc() will round up request sizes smaller than 8k anyway. That wastes some memory, nothing more serious than that, but it seems like a good idea to avoid it. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 10.9.2014 20:25, Heikki Linnakangas wrote: On 09/10/2014 01:49 AM, Tomas Vondra wrote: I also did a few 'minor' changes to the dense allocation patch, most notably: * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData The original naming seemed a bit awkward. That's too easy to confuse with regular MemoryContext and AllocChunk stuff. I renamed it to HashMemoryChunk. OK. * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I don't think that's a problem. I also considered Heikki's suggestion that while rebatching, we can reuse chunks with a single large tuple, instead of copying it needlessly. My attempt however made the code much uglier (I almost used a GOTO, but my hands refused to type it ...). I'll look into that. I tried constructing a test case where the rehashing time would be significant enough for this to matter, but I couldn't find one. Even if the original estimate on the number of batches is way too small, we ramp up quickly to a reasonable size and the rehashing time is completely insignificant compared to all the other work. So I think we can drop that idea. I don't think that had anything to do with rehashing. What you pointed out is that when rebatching, we have to keep ~50% of the tuples, which means 'copy into a new chunk, then throw away the old ones'. For large tuples (you mentioned 100MB tuples, IIRC), we needlessly allocate this amount of memory only to copy the single tuple and then throw away the old tuple. So (a) that's an additional memcpy, and (b) it allocates an additional 100MB of memory for a short period of time. 
Now, I'd guess when dealing with tuples this large, there will be more serious trouble elsewhere, but I don't want to neglect it. Maybe it's worth optimizing? Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 10.9.2014 20:31, Robert Haas wrote: On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The dense-alloc-v5.patch looks good to me. I have committed that with minor cleanup (more comments below). I have not looked at the second patch. Gah. I was in the middle of doing this. Sigh. * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I think we should change the threshold here to 1/8th. The worst case memory wastage as-is is ~32k/5 > 6k. So you'd lower the threshold to 4kB? That may lower the wastage in the chunks, but palloc will actually allocate 8kB anyway, wasting up to an additional 4kB. So I don't see how lowering the threshold to 1/8th improves the situation ... Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Wed, Sep 10, 2014 at 3:02 PM, Tomas Vondra t...@fuzzy.cz wrote: On 10.9.2014 20:31, Robert Haas wrote: On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The dense-alloc-v5.patch looks good to me. I have committed that with minor cleanup (more comments below). I have not looked at the second patch. Gah. I was in the middle of doing this. Sigh. * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I think we should change the threshold here to 1/8th. The worst case memory wastage as-is is ~32k/5 > 6k. So you'd lower the threshold to 4kB? That may lower the wastage in the chunks, but palloc will actually allocate 8kB anyway, wasting up to an additional 4kB. So I don't see how lowering the threshold to 1/8th improves the situation ... Ah, OK. Well, never mind then. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 10.9.2014 20:55, Heikki Linnakangas wrote: On 09/10/2014 09:31 PM, Robert Haas wrote: * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I think we should change the threshold here to 1/8th. The worst case memory wastage as-is is ~32k/5 > 6k. Not sure how you arrived at that number. The worst case is that each 32k chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each chunk. That's 25% wastage. That's not too bad IMHO - the worst case waste of AllocSet is 50%. But if you want to twiddle it, feel free. Note that there's some interplay with the allocset code, like Tomas mentioned. If the threshold is 8k, palloc() will round up request sizes smaller than 8k anyway. That wastes some memory, nothing more serious than that, but it seems like a good idea to avoid it. Yes, and pfree then keeps these blocks, which means a chance of keeping memory that won't be reused. That's why I chose the 8kB threshold. So I'd keep the 32kB/8kB, as proposed in the patch. But I don't see this as a big issue - my experience is that we either have tuples much smaller than 4kB, or much larger than 8kB. And even for the tuples between 4kB and 8kB, the waste is not that bad. regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 10.9.2014 20:25, Heikki Linnakangas wrote: On 09/10/2014 01:49 AM, Tomas Vondra wrote: I also did a few 'minor' changes to the dense allocation patch, most notably: * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData The original naming seemed a bit awkward. That's too easy to confuse with regular MemoryContext and AllocChunk stuff. I renamed it to HashMemoryChunk. BTW this breaks the second patch, which is allocating new chunks when resizing the hash table. Should I rebase the patch, or can the committer do s/MemoryChunk/HashMemoryChunk/ ? Assuming there are no more fixes needed, of course. Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] inherit support for foreign tables
I had a cursory look at this patch and the discussions around this. ISTM there are actually two new features in this: 1) allow CHECK constraints on foreign tables, and 2) support inheritance for foreign tables. How about splitting it into two? - Heikki
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra t...@fuzzy.cz wrote: On 10.9.2014 20:25, Heikki Linnakangas wrote: On 09/10/2014 01:49 AM, Tomas Vondra wrote: I also did a few 'minor' changes to the dense allocation patch, most notably: * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData The original naming seemed a bit awkward. That's too easy to confuse with regular MemoryContext and AllocChunk stuff. I renamed it to HashMemoryChunk. BTW this breaks the second patch, which is allocating new chunks when resizing the hash table. Should I rebase the patch, or can the committer do s/MemoryChunk/HashMemoryChunk/ ? Assuming there are no more fixes needed, of course. Rebasing it will save the committer time, which will increase the chances that someone will look at your patch. So it's highly recommended. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
On Wed, Sep 10, 2014 at 9:47 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I attach some additional minor suggestions to your patch. I don't see any attachment. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On Tue, Sep 9, 2014 at 6:03 PM, Petr Jelinek p...@2ndquadrant.com wrote: - The remaining functions are pq_init(), pq_comm_reset(), pq_flush(), pq_flush_if_writable(), pq_is_send_pending(), pq_putmessage(), pq_putmessage_noblock(), pq_startcopyout(), and pq_endcopyout(). These are the ones that I think are potentially interesting. I didn't choose to provide hooks for all of these in the submitted patch because they're not all needed for what I want to do here: pq_startcopyout() and pq_endcopyout() are only needed for V2 protocol support, which did not interest me (nor did COPY, really); pq_putmessage_noblock(), pq_flush_if_writable(), and pq_is_send_pending() are only used for the walsender protocol, which doesn't seem useful to redirect to a non-socket; and I just didn't happen to have any use for pq_init() or pq_comm_reset(). Hence what I ended up with. But, I could revisit that. Suppose I provide a structure with 10 function pointers for all ten of those functions, or maybe 9 since pq_init() is called so early that it's not likely we'll have control to put the hooks in place before that point, and anyway whatever code installs the hooks can do its own initialization then. Then make a global variable like pqSendMethods and #define pq_comm_reset() to be pqSendMethods->comm_reset(), pq_flush() to be pqSendMethods->flush(), and so on for all 9 or 10 methods. Then the pqmq code could just change pqSendMethods to point to its own method structure instead of the default one. Would that address this concern? It's more than 20 lines of code, but it's not TOO bad. Yes, although my issue with the hooks was not that you only provided them for 2 functions, but the fact that it had no structure and the implementation was "if hook set, do this, else do that", which I don't see as a good way of doing it. We've done it that way in a bunch of other places, like ExecutorRun().
An advantage of this approach (I think) is that jumping to a fixed address is faster than jumping through a function pointer - so with the approach I've taken here, the common case where we're talking to the client incurs only the overhead of a null-test, and the larger overhead of the function pointer jump is incurred only when the hook is in use. Maybe that's not enough of a difference to matter to anything, but I think the contention that I've invented some novel kind of interface here doesn't hold up to scrutiny. We have lots of hooks that work just like what I did here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
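The two dispatch styles being compared can be sketched as follows. All names here are hypothetical stand-ins, not the actual libpq symbols: a nullable hook pointer keeps the common case a direct call plus a null-test, while a method struct always jumps through a pointer but groups the related functions into one structured interface.

```c
#include <stddef.h>

static int sent_via_socket = 0;
static int sent_via_mq = 0;

static int
socket_putmessage(char msgtype, const char *s, size_t len)
{
    (void) msgtype; (void) s; (void) len;
    return ++sent_via_socket;       /* stand-in for the real socket path */
}

static int
mq_putmessage(char msgtype, const char *s, size_t len)
{
    (void) msgtype; (void) s; (void) len;
    return ++sent_via_mq;           /* stand-in for the shm_mq redirect */
}

/* Style A: a single nullable hook; the fast path is a direct call */
static int (*putmessage_hook) (char, const char *, size_t) = NULL;

static int
pq_putmessage_hooked(char msgtype, const char *s, size_t len)
{
    if (putmessage_hook != NULL)
        return putmessage_hook(msgtype, s, len);    /* redirected */
    return socket_putmessage(msgtype, s, len);      /* common case */
}

/* Style B: a struct of methods, like the pqSendMethods idea */
typedef struct
{
    int (*putmessage) (char, const char *, size_t);
    /* ... comm_reset, flush, flush_if_writable, etc. would go here ... */
} SendMethods;

static SendMethods socket_methods = { socket_putmessage };
static SendMethods mq_methods = { mq_putmessage };
static SendMethods *pq_send_methods = &socket_methods;

#define pq_putmessage_struct(t, s, l) \
    (pq_send_methods->putmessage((t), (s), (l)))
```

Style A is what the mail above defends (direct call in the common case); Style B is the structured alternative, where redirecting the whole protocol is one pointer assignment.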
Re: [HACKERS] Support for N synchronous standby servers
On 08/28/2014 10:10 AM, Michael Paquier wrote: + #synchronous_standby_num = -1 # number of standbys servers using sync rep To be honest, that's a horrible name for the GUC. Back when synchronous replication was implemented, we had looong discussions on this feature. It was called quorum commit back then. I'd suggest using the quorum term in this patch, too, that's a fairly well-known term in distributed computing for this. When synchronous replication was added, quorum was left out to keep things simple; the current feature set was the most we could all agree on to be useful. If you search the archives for quorum commit you'll see what I mean. There was a lot of confusion on what is possible and what is useful, but regarding this particular patch: people wanted to be able to describe more complicated scenarios. For example, imagine that you have a master and two standbys in the primary data center, and two more standbys in a different data center. It should be possible to specify that you must get acknowledgment from at least one standby in both data centers. Maybe you could hack that by giving the standbys in the same data center the same name, but it gets ugly, and it still won't scale to even more complex scenarios. Maybe that's OK - we don't necessarily need to solve all scenarios at once. But it's worth considering. BTW, how does this patch behave if there are multiple standbys connected with the same name? - Heikki
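The cross-data-center requirement Heikki describes is exactly what a single standby count cannot express. A grouped quorum check would look something like the sketch below — purely illustrative, not a proposed API: each group is a data center, and a commit is only released once every group has enough acknowledgments.

```c
#include <stdbool.h>

/*
 * Illustrative only: given the number of acknowledging synchronous
 * standbys in each group (e.g. each data center), require at least
 * 'needed' acks from every group before a commit is released.
 */
static bool
sync_quorum_satisfied(const int acks_per_group[], int ngroups, int needed)
{
    int     i;

    for (i = 0; i < ngroups; i++)
        if (acks_per_group[i] < needed)
            return false;
    return true;
}
```

A flat `synchronous_standby_num = 2` cannot distinguish "two acks from anywhere" from "one ack per data center", which is the scenario above.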
Re: [HACKERS] Memory Alignment in Postgres
On Wed, Sep 10, 2014 at 11:43:52AM -0400, Robert Haas wrote: But there are a couple of obvious problems with this idea, too, such as: 1. It's really complicated and a ton of work. 2. It would break pg_upgrade pretty darn badly unless we employed some even-more-complex strategy to mitigate that. 3. The savings might not be enough to justify the effort. It might be interesting for someone to develop a tool measuring the number of bytes of alignment padding we lose per tuple or per page and gather some statistics on it on various databases. That would give us some sense as to the possible savings. And will we ever implement a logical attribute system so we can reorder the stored attributes to minimize wasted space? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
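Bruce's question about logical attribute ordering can be made concrete with a toy layout calculation — a sketch, not PostgreSQL's actual tuple-forming code: pad each attribute to its alignment in declared order, then compare against the same attributes sorted by descending alignment, which never needs padding between attributes.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct
{
    size_t  size;   /* attribute's storage size */
    size_t  align;  /* required alignment, a power of two */
} Attr;

/* Total bytes used when attributes are laid out in the given order. */
static size_t
row_size(const Attr *attrs, int natts)
{
    size_t  off = 0;
    int     i;

    for (i = 0; i < natts; i++)
    {
        /* round offset up to the attribute's alignment (cf. TYPEALIGN) */
        off = (off + attrs[i].align - 1) & ~(attrs[i].align - 1);
        off += attrs[i].size;
    }
    return off;
}

/* qsort comparator: widest alignment first; every subsequent offset is
 * then already suitably aligned, so no padding is ever inserted. */
static int
by_align_desc(const void *a, const void *b)
{
    const Attr *x = a;
    const Attr *y = b;

    return (x->align < y->align) - (x->align > y->align);
}
```

For a row declared as (boolean, bigint, integer, bigint) this yields 32 bytes as declared but only 21 after sorting — 11 bytes of padding recovered per tuple, which is the kind of number the measuring tool Robert suggests would report in aggregate.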
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
Robert Haas wrote: On Wed, Sep 10, 2014 at 9:47 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I attach some additional minor suggestions to your patch. I don't see any attachment. Uh. Here it is. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 51d1889..2b336b0 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -4090,7 +4090,7 @@ get_mxact_status_for_lock(LockTupleMode mode, bool is_update) * cid: current command ID (used for visibility test, and stored into * tuple's cmax if lock is successful) * mode: indicates if shared or exclusive tuple lock is desired - * wait_policy: whether to block, ereport or skip if lock not available + * wait_policy: what to do if tuple lock is not available * follow_updates: if true, follow the update chain to also lock descendant * tuples. * diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index e4065c2..a718c0b 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -1928,6 +1928,9 @@ EvalPlanQual(EState *estate, EPQState *epqstate, * * Returns a palloc'd copy of the newest tuple version, or NULL if we find * that there is no newest version (ie, the row was deleted not updated). + * We also return NULL if the tuple is locked and the wait policy is to skip + * such tuples. + * * If successful, we have locked the newest tuple version, so caller does not * need to worry about it changing anymore.
* @@ -1994,7 +1997,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode, break; case LockWaitSkip: if (!ConditionalXactLockTableWait(SnapshotDirty.xmax)) - return NULL; /* skipping instead of waiting */ + return NULL; /* skip instead of waiting */ break; case LockWaitError: if (!ConditionalXactLockTableWait(SnapshotDirty.xmax)) @@ -2076,6 +2079,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode, /* tuple was deleted, so give up */ return NULL; +case HeapTupleWouldBlock: + elog(WARNING, "this is an odd case"); + return NULL; + default: ReleaseBuffer(buffer); elog(ERROR, "unrecognized heap_lock_tuple status: %u", diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c index f9feff4..990240b 100644 --- a/src/backend/executor/nodeLockRows.c +++ b/src/backend/executor/nodeLockRows.c @@ -179,7 +179,10 @@ lnext: if (copyTuple == NULL) { - /* Tuple was deleted or skipped (in SKIP LOCKED), so don't return it */ + /* + * Tuple was deleted; or it's locked and we're under SKIP + * LOCKED policy, so don't return it + */ goto lnext; } /* remember the actually locked tuple's TID */ diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c index 0f6e9b0..1f66928 100644 --- a/src/backend/parser/analyze.c +++ b/src/backend/parser/analyze.c @@ -2515,11 +2515,12 @@ applyLockingClause(Query *qry, Index rtindex, * a shared and exclusive lock at the same time; it'll end up being * exclusive anyway.) * - * We also consider that NOWAIT wins if it is specified multiple ways, - * otherwise SKIP LOCKED wins. This is a bit more debatable but - * raising an error doesn't seem helpful. (Consider for instance - * SELECT FOR UPDATE NOWAIT from a view that internally contains a - * plain FOR UPDATE spec.) + * Similarly, if the same RTE is specified with more than one lock wait + * policy, consider that NOWAIT wins over SKIP LOCKED, which in turn + * wins over waiting for the lock (the default).
This is a bit more + * debatable but raising an error doesn't seem helpful. (Consider for + * instance SELECT FOR UPDATE NOWAIT from a view that internally + * contains a plain FOR UPDATE spec.) * * And of course pushedDown becomes false if any clause is explicit. */ diff --git a/src/include/utils/lockwaitpolicy.h b/src/include/utils/lockwaitpolicy.h index f8ad07e..7f9a693 100644 --- a/src/include/utils/lockwaitpolicy.h +++ b/src/include/utils/lockwaitpolicy.h @@ -1,34 +1,31 @@ /*- - * * lockwaitpolicy.h - * Header file for the enum LockWaitPolicy enum, which is needed in - * several modules (parser, planner, executor, heap manager). + * Header file for LockWaitPolicy enum. * * Copyright (c) 2014, PostgreSQL Global Development Group * * src/include/utils/lockwaitpolicy.h - * *- */ #ifndef LOCKWAITPOLICY_H #define LOCKWAITPOLICY_H /* - * Policy for what to do when a row lock cannot be obtained immediately. - * - * The enum values defined here control how the parser treats multiple FOR - * UPDATE/SHARE
[HACKERS] Commitfest status
Another commitfest week has passed, and here we still are. There are now 24 patches in Needs Review state, and 8 in Ready for Committer. I'm not paying attention to the Waiting on Author patches - once we're close to zero on the other patches, those will be bounced back. The good news is that all but two patches have a reviewer assigned. The bad news is that the rest don't seem to be graduating from the Needs Review state. Reviewers: please review your patches. And then pick another patch to review; one of the two that have no reviewer assigned yet, or some other patch. It is only good to have more than one reviewer for the same patch. Patch authors: if your patch is not getting reviewed in a timely fashion, please send an off-list note to the reviewer after a few days and ask what the status is. The reviewer might not realize that you're waiting. - Heikki
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On Wed, Sep 10, 2014 at 4:01 PM, Robert Haas robertmh...@gmail.com wrote: Yes, although my issue with the hooks was not that you only provided them for 2 functions, but the fact that it had no structure and the implementation was "if hook set, do this, else do that", which I don't see as a good way of doing it. We've done it that way in a bunch of other places, like ExecutorRun(). An advantage of this approach (I think) is that jumping to a fixed address is faster than jumping through a function pointer - so with the approach I've taken here, the common case where we're talking to the client incurs only the overhead of a null-test, and the larger overhead of the function pointer jump is incurred only when the hook is in use. Maybe that's not enough of a difference to matter to anything, but I think the contention that I've invented some novel kind of interface here doesn't hold up to scrutiny. We have lots of hooks that work just like what I did here. Here's what the other approach looks like. I can't really see doing it this way and then only providing hooks for those two functions, so this is with hooks for all the send-side stuff. Original version: 9 files changed, 295 insertions(+), 3 deletions(-) This version: 9 files changed, 428 insertions(+), 47 deletions(-) There is admittedly a certain elegance to providing a complete set of hooks, so maybe this is the way to go. The remaining patches in the patch series work with either the old version or this one; the changes here don't affect anything else. Anyone else have an opinion on which way is better here? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company commit 76b9b171f81b1639e9d0379f5066212107a740f6 Author: Robert Haas rh...@postgresql.org Date: Wed May 28 19:30:21 2014 -0400 Support frontend-backend protocol communication using a shm_mq.
A background worker can use pq_redirect_to_shm_mq() to direct protocol that would normally be sent to the frontend to a shm_mq so that another process may read them. The receiving process may use pq_parse_errornotice() to parse an ErrorResponse or NoticeResponse from the background worker and, if it wishes, ThrowErrorData() to propagate the error (with or without further modification). V2: Lots more hooks, and a struct! diff --git a/src/backend/libpq/Makefile b/src/backend/libpq/Makefile index 8be0572..09410c4 100644 --- a/src/backend/libpq/Makefile +++ b/src/backend/libpq/Makefile @@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global # be-fsstubs is here for historical reasons, probably belongs elsewhere OBJS = be-fsstubs.o be-secure.o auth.o crypt.o hba.o ip.o md5.o pqcomm.o \ - pqformat.o pqsignal.o + pqformat.o pqmq.o pqsignal.o ifeq ($(with_openssl),yes) OBJS += be-secure-openssl.o diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 605d891..7c4252d 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -102,7 +102,6 @@ int Unix_socket_permissions; char *Unix_socket_group; - /* Where the Unix socket files are (list of palloc'd strings) */ static List *sock_paths = NIL; @@ -134,16 +133,38 @@ static bool DoingCopyOut; /* Internal functions */ -static void pq_close(int code, Datum arg); +static void socket_comm_reset(void); +static void socket_close(int code, Datum arg); +static void socket_set_nonblocking(bool nonblocking); +static int socket_flush(void); +static int socket_flush_if_writable(void); +static bool socket_is_send_pending(void); +static int socket_putmessage(char msgtype, const char *s, size_t len); +static void socket_putmessage_noblock(char msgtype, const char *s, size_t len); +static void socket_startcopyout(void); +static void socket_endcopyout(bool errorAbort); static int internal_putbytes(const char *s, size_t len); static int internal_flush(void); -static void pq_set_nonblocking(bool 
nonblocking); +static void socket_set_nonblocking(bool nonblocking); #ifdef HAVE_UNIX_SOCKETS static int Lock_AF_UNIX(char *unixSocketDir, char *unixSocketPath); static int Setup_AF_UNIX(char *sock_path); #endif /* HAVE_UNIX_SOCKETS */ +PQsendMethods PQsendSocketMethods; + +static PQsendMethods PqSendSocketMethods = { + socket_comm_reset, + socket_flush, + socket_flush_if_writable, + socket_is_send_pending, + socket_putmessage, + socket_putmessage_noblock, + socket_startcopyout, + socket_endcopyout +}; + /* * pq_init - initialize libpq at backend startup @@ -152,24 +173,25 @@ static int Setup_AF_UNIX(char *sock_path); void pq_init(void) { + PqSendMethods = PqSendSocketMethods; PqSendBufferSize = PQ_SEND_BUFFER_SIZE; PqSendBuffer = MemoryContextAlloc(TopMemoryContext, PqSendBufferSize); PqSendPointer = PqSendStart = PqRecvPointer = PqRecvLength = 0; PqCommBusy = false; DoingCopyOut = false; - on_proc_exit(pq_close, 0); + on_proc_exit(socket_close,
Re: [HACKERS] Memory Alignment in Postgres
On Wed, Sep 10, 2014 at 4:29 PM, Bruce Momjian br...@momjian.us wrote: On Wed, Sep 10, 2014 at 11:43:52AM -0400, Robert Haas wrote: But there are a couple of obvious problems with this idea, too, such as: 1. It's really complicated and a ton of work. 2. It would break pg_upgrade pretty darn badly unless we employed some even-more-complex strategy to mitigate that. 3. The savings might not be enough to justify the effort. It might be interesting for someone to develop a tool measuring the number of bytes of alignment padding we lose per tuple or per page and gather some statistics on it on various databases. That would give us some sense as to the possible savings. And will we ever implement a logical attribute system so we can reorder the stored attributes to minimize wasted space? You forgot to attach the patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 10.9.2014 21:34, Robert Haas wrote: On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra t...@fuzzy.cz wrote: On 10.9.2014 20:25, Heikki Linnakangas wrote: On 09/10/2014 01:49 AM, Tomas Vondra wrote: I also did a few 'minor' changes to the dense allocation patch, most notably: * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData The original naming seemed a bit awkward. That's too easy to confuse with regular MemoryContext and AllocChunk stuff. I renamed it to HashMemoryChunk. BTW this breaks the second patch, which is allocating new chunks when resizing the hash table. Should I rebase the patch, or can the committer do s/MemoryChunk/HashMemoryChunk/ ? Assuming there are no more fixes needed, of course. Rebasing it will save the committer time, which will increase the chances that someone will look at your patch. So it's highly recommended. OK. So here's v13 of the patch, reflecting this change. regards Tomas diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 781a736..d128e08 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es) if (es->format != EXPLAIN_FORMAT_TEXT) { ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es); + ExplainPropertyLong("Original Hash Buckets", + hashtable->nbuckets_original, es); ExplainPropertyLong("Hash Batches", hashtable->nbatch, es); ExplainPropertyLong("Original Hash Batches", hashtable->nbatch_original, es); ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es); } - else if (hashtable->nbatch_original != hashtable->nbatch) + else if ((hashtable->nbatch_original != hashtable->nbatch) || + (hashtable->nbuckets_original != hashtable->nbuckets)) { appendStringInfoSpaces(es->str, es->indent * 2); appendStringInfo(es->str, - "Buckets: %d Batches: %d (originally %d) Memory Usage: %ldkB\n", - hashtable->nbuckets, hashtable->nbatch, - hashtable->nbatch_original, spacePeakKb); + "Buckets: %d (originally %d) Batches: %d (originally %d) Memory Usage: %ldkB\n", + hashtable->nbuckets, hashtable->nbuckets_original, + hashtable->nbatch, hashtable->nbatch_original, spacePeakKb); } else { diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c index 3ef7cfb..4f00426 100644 --- a/src/backend/executor/nodeHash.c +++ b/src/backend/executor/nodeHash.c @@ -39,6 +39,7 @@ static void ExecHashIncreaseNumBatches(HashJoinTable hashtable); +static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable); static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node, int mcvsToUse); static void ExecHashSkewTableInsert(HashJoinTable hashtable, @@ -49,6 +50,9 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable); static void *dense_alloc(HashJoinTable hashtable, Size size); +/* Memory needed for optimal number of buckets. */ +#define BUCKETS_SPACE(htab) ((htab)->nbuckets_optimal * sizeof(HashJoinTuple)) + /* * ExecHash * @@ -117,6 +121,7 @@ MultiExecHash(HashState *node) /* It's a skew tuple, so put it into that hash table */ ExecHashSkewTableInsert(hashtable, slot, hashvalue, bucketNumber); +hashtable->skewTuples += 1; } else { @@ -127,6 +132,25 @@ MultiExecHash(HashState *node) } } + /* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */ + if (hashtable->nbuckets != hashtable->nbuckets_optimal) + { + /* We never decrease the number of buckets. */ + Assert(hashtable->nbuckets_optimal > hashtable->nbuckets); + +#ifdef HJDEBUG + printf("Increasing nbuckets %d => %d\n", + hashtable->nbuckets, hashtable->nbuckets_optimal); +#endif + + ExecHashIncreaseNumBuckets(hashtable); + } + + /* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */ + hashtable->spaceUsed += BUCKETS_SPACE(hashtable); + if (hashtable->spaceUsed > hashtable->spacePeak) + hashtable->spacePeak = hashtable->spaceUsed; + /* must provide our own instrumentation support */ if (node->ps.instrument) InstrStopNode(node->ps.instrument, hashtable->totalTuples); @@ -272,7 +296,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls) */ hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData)); hashtable->nbuckets = nbuckets; + hashtable->nbuckets_original = nbuckets; + hashtable->nbuckets_optimal = nbuckets; hashtable->log2_nbuckets = log2_nbuckets; + hashtable->log2_nbuckets_optimal = log2_nbuckets; hashtable->buckets = NULL; hashtable->keepNulls = keepNulls; hashtable->skewEnabled = false; @@ -286,6 +313,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls) hashtable->nbatch_outstart = nbatch; hashtable->growEnabled = true; hashtable->totalTuples = 0; + hashtable->skewTuples = 0; hashtable->innerBatchFile = NULL; hashtable->outerBatchFile = NULL;
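The core of the v13 change above can be paraphrased in a few lines. This is a simplified sketch of the patch's bookkeeping, not the patch itself: while the hash table is being built, only the *optimal* bucket count is advanced; the actual bucket array is rebuilt once, at the end of the build, if the two counts differ.

```c
#define NTUP_PER_BUCKET 10      /* illustrative target load factor */

typedef struct
{
    long    ntuples;            /* tuples inserted so far */
    int     nbuckets;           /* buckets currently allocated */
    int     nbuckets_optimal;   /* buckets we would like to have */
} HashSketch;

/*
 * Called per inserted tuple: we never shrink, only double the optimal
 * count while the load factor would exceed NTUP_PER_BUCKET.  The real
 * resize (ExecHashIncreaseNumBuckets in the patch) then happens just
 * once, after the build, if nbuckets != nbuckets_optimal.
 */
static void
note_inserted_tuple(HashSketch *h)
{
    h->ntuples++;
    while (h->ntuples > (long) h->nbuckets_optimal * NTUP_PER_BUCKET)
        h->nbuckets_optimal *= 2;
}
```

Deferring the resize this way means the bucket array is reallocated at most once per batch instead of repeatedly as the underestimated tuple count grows.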
Re: [HACKERS] Need Multixact Freezing Docs
On Fri, Sep 5, 2014 at 07:39:36PM -0400, Bruce Momjian wrote: On Wed, Sep 3, 2014 at 05:17:17PM -0400, Robert Haas wrote: I had a look at this and came upon a problem --- there is no multi-xid SQL data type, and in fact the system catalogs that store mxid values use xid, e.g.: relminmxid | xid | not null With no mxid data type, there is no way to do function overloading to cause age to call the mxid variant. Should we use an explicit mxid_age() function name? Add an mxid data type? Maybe both. But mxid_age() is probably the simpler way forward just to start. OK, patch applied using mxid_age() and no new data type. Applied. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] [BUGS] BUG #10823: Better REINDEX syntax.
On 09/08/2014 06:17 AM, Stephen Frost wrote: * Vik Fearing (vik.fear...@dalibo.com) wrote: On 09/02/2014 10:17 PM, Marko Tiikkaja wrote: Yeah, I think I like this better than allowing all of them without the database name. Why? It's just a noise word! Eh, because it ends up reindexing system tables too, which is probably not what new folks are expecting. No behavior is changed at all. REINDEX DATABASE dbname; has always hit the system tables. Since dbname can *only* be the current database, there's no logic nor benefit in requiring it to be specified. Also, it's not required when you say 'user tables', so it's similar to your user_tables v1 patch in that regard. The fact that REINDEX USER TABLES; is the only one that doesn't require the dbname seems very inconsistent and confusing. Yes, I will update the patch. Still planning to do this..? Marking this back to waiting-for-author. Yes, but probably not for this commitfest unfortunately. -- Vik
Re: [HACKERS] [BUGS] BUG #10823: Better REINDEX syntax.
* Vik Fearing (vik.fear...@dalibo.com) wrote: On 09/08/2014 06:17 AM, Stephen Frost wrote: * Vik Fearing (vik.fear...@dalibo.com) wrote: On 09/02/2014 10:17 PM, Marko Tiikkaja wrote: Yeah, I think I like this better than allowing all of them without the database name. Why? It's just a noise word! Eh, because it ends up reindexing system tables too, which is probably not what new folks are expecting. No behavior is changed at all. REINDEX DATABASE dbname; has always hit the system tables. Since dbname can *only* be the current database, there's no logic nor benefit in requiring it to be specified. Sure, but I think the point is that reindexing the system tables as part of a database-wide reindex is a *bad* thing which we shouldn't be encouraging by making it easier. I realize you're a bit 'stuck' here because we don't like the current behavior, but we don't want to change it either. Also, it's not required when you say 'user tables', so it's similar to your user_tables v1 patch in that regard. The fact that REINDEX USER TABLES; is the only one that doesn't require the dbname seems very inconsistent and confusing. I understand, but the alternative would be a 'reindex;' which *doesn't* reindex the system tables- would that be less confusing? Or getting rid of the current 'reindex database' which also reindexes system tables... Yes, I will update the patch. Still planning to do this..? Marking this back to waiting-for-author. Yes, but probably not for this commitfest unfortunately. Fair enough, I'll mark it 'returned with feedback'. Thanks! Stephen
Re: [HACKERS] pg_basebackup failed to back up large file
On Mon, Jun 9, 2014 at 11:17:41AM +0200, Magnus Hagander wrote: On Wednesday, June 4, 2014, Tom Lane t...@sss.pgh.pa.us wrote: Magnus Hagander mag...@hagander.net writes: On Tue, Jun 3, 2014 at 6:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: Another thought is we could make pg_basebackup simply skip any files that exceed RELSEG_SIZE, on the principle that you don't really need/want enormous log files to get copied anyhow. We'd still need the pax extension if the user had configured large RELSEG_SIZE, but having a compatible tar could be documented as a requirement of doing that. I think going all the way to pax is the proper long-term thing to do, at least if we can confirm it works in the main tar implementations. For backpatchable that seems more reasonable. It doesn't work today, and we just need to document that it doesn't, with larger RELSEG_SIZE. And then fix it properly for the future. Agreed, this would be a reasonable quick fix that we could replace in 9.5 or later. Fujii, are you going to be able to work on this with the now expanded scope? :) Is this a TODO or doc item? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
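The underlying limitation here is mechanical: the ustar header stores a member's size in a 12-byte field holding 11 octal digits, so anything of 8GB or more needs the pax extended-header extension discussed above. A hedged sketch of the check involved (the helper name is hypothetical, not pg_basebackup's actual code):

```c
#include <stdbool.h>
#include <stdint.h>

/* 11 octal digits: the maximum representable member size is 2^33 - 1,
 * i.e. 8GB minus one byte. */
#define USTAR_MAX_MEMBER_SIZE   ((UINT64_C(1) << 33) - 1)

/* Hypothetical helper: can this file go into a plain ustar archive
 * without pax extended headers? */
static bool
fits_in_ustar(uint64_t filesize)
{
    return filesize <= USTAR_MAX_MEMBER_SIZE;
}
```

With the default 1GB RELSEG_SIZE every relation segment fits comfortably; only unsegmented files (such as large server logs) or a much larger configured RELSEG_SIZE can cross the 8GB line.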
Re: [HACKERS] pg_control is missing a field for LOBLKSIZE
On Tue, Jun 17, 2014 at 08:46:00PM -0400, Bruce Momjian wrote: On Tue, Jun 17, 2014 at 07:12:16PM -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: On Tue, Jun 17, 2014 at 03:55:02PM -0400, Alvaro Herrera wrote: Can't you compare it to the historic default value? I mean, add an assumption that people thus far have never tweaked it. Well, if they did tweak it, then they would be unable to use pg_upgrade because it would complain about a mismatch if they actually matched the old and new servers. What about comparing to the symbolic value LOBLKSIZE? This would make pg_upgrade assume that the old installation had been tweaked the same as in its own build. This ends up being the same as what you said, ie, effectively no comparison ... but it might be less complicated to code/understand. OK, assume the compiled-in default is the value for an old cluster that has no value --- yeah, I could do that. Turns out I already had values that could be missing in the old cluster, so I just used the same format for this, rather than testing for LOBLKSIZE. Attached patch applied. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god.
+ diff --git a/contrib/pg_upgrade/controldata.c b/contrib/pg_upgrade/controldata.c new file mode 100644 index 9282b8e..d105a59 *** a/contrib/pg_upgrade/controldata.c --- b/contrib/pg_upgrade/controldata.c *** get_control_data(ClusterInfo *cluster, b *** 54,59 --- 54,60 bool got_ident = false; bool got_index = false; bool got_toast = false; + bool got_large_object = false; bool got_date_is_int = false; bool got_float8_pass_by_value = false; bool got_data_checksum_version = false; *** get_control_data(ClusterInfo *cluster, b *** 357,362 --- 358,374 cluster->controldata.toast = str2uint(p); got_toast = true; } + else if ((p = strstr(bufin, "Size of a large-object chunk:")) != NULL) + { + p = strchr(p, ':'); + + if (p == NULL || strlen(p) <= 1) + pg_fatal("%d: controldata retrieval problem\n", __LINE__); + + p++; /* removing ':' char */ + cluster->controldata.large_object = str2uint(p); + got_large_object = true; + } else if ((p = strstr(bufin, "Date/time type storage:")) != NULL) { p = strchr(p, ':'); *** get_control_data(ClusterInfo *cluster, b *** 475,480 --- 487,494 !got_tli || !got_align || !got_blocksz || !got_largesz || !got_walsz || !got_walseg || !got_ident || !got_index || !got_toast || + (!got_large_object && + cluster->controldata.cat_ver >= LARGE_OBJECT_SIZE_PG_CONTROL_VER) || !got_date_is_int || !got_float8_pass_by_value || !got_data_checksum_version) { pg_log(PG_REPORT, *** get_control_data(ClusterInfo *cluster, b *** 527,532 --- 541,550 if (!got_toast) pg_log(PG_REPORT, "maximum TOAST chunk size\n"); + if (!got_large_object && + cluster->controldata.cat_ver >= LARGE_OBJECT_SIZE_PG_CONTROL_VER) + pg_log(PG_REPORT, "large-object chunk size\n"); + if (!got_date_is_int) pg_log(PG_REPORT, "dates/times are integers?\n"); *** check_control_data(ControlData *oldctrl, *** 576,581 --- 594,602 if (oldctrl->toast == 0 || oldctrl->toast != newctrl->toast) pg_fatal("old and new pg_controldata maximum TOAST chunk sizes are invalid or do not match\n"); + if (oldctrl->large_object == 0 || oldctrl->large_object != newctrl->large_object) + pg_fatal("old and new pg_controldata large-object chunk sizes are invalid or do not match\n"); + if (oldctrl->date_is_int != newctrl->date_is_int) pg_fatal("old and new pg_controldata date/time storage types do not match\n"); diff --git a/contrib/pg_upgrade/pg_upgrade.h b/contrib/pg_upgrade/pg_upgrade.h new file mode 100644 index 1ac3394..0207391 *** a/contrib/pg_upgrade/pg_upgrade.h --- b/contrib/pg_upgrade/pg_upgrade.h *** extern char *output_files[]; *** 116,121 --- 116,127 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231 /* + * large object chunk size added to pg_controldata, + * commit 5f93c37805e7485488480916b4585e098d3cc883 + */ + #define LARGE_OBJECT_SIZE_PG_CONTROL_VER 942 + + /* * Each relation is represented by a relinfo structure. */ typedef struct *** typedef struct *** 203,208 --- 209,215 uint32 ident; uint32 index; uint32 toast; + uint32 large_object; bool date_is_int; bool float8_pass_by_value; bool data_checksum_version;
Re: [HACKERS] "cancelling statement due to user request" error occurs but the transaction has committed.
On Tue, Jun 10, 2014 at 10:30:24AM -0400, Robert Haas wrote: On Tue, Jun 10, 2014 at 10:18 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: I don't agree with this analysis. If the connection is closed after the client sends a COMMIT and before it gets a response, then the client must indeed be smart enough to figure out whether or not the commit happened. But if the server sends a response, the client should be able to rely on that response being correct. In this case, an ERROR is getting sent but the transaction is getting committed; yuck. I'm not sure whether the fix is right, but this definitely seems like a bug. In general, the only way to avoid that sort of behavior for a post-commit error would be to PANIC ... and even then, the transaction got committed, which might not be the expectation of a client that got an error message, even if it said PANIC. So this whole area is a minefield, and the only attractive thing we can do is to try to reduce the number of errors that can get thrown post-commit. We already, for example, do not treat post-commit file unlink failures as ERROR, though we surely would prefer to do that. We could treat it as a lost-communication scenario. The appropriate recovery actions from the client's point of view are identical. So from this standpoint, redefining SIGINT as not throwing an error when we're in post-commit seems like a good idea. I'm not endorsing any details of the patch here, but the 2-foot view seems generally sound. Cool, that makes sense to me also. Did we ever do anything about this? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] Inaccuracy in VACUUM's tuple count estimates
On Thu, Jun 12, 2014 at 01:40:59PM +0200, Andres Freund wrote: Hi Tom, On 2014-06-06 15:44:25 -0400, Tom Lane wrote: I figured it'd be easy enough to get a better estimate by adding another counter to count just LIVE and INSERT_IN_PROGRESS tuples (thus effectively assuming that in-progress inserts and deletes will both commit). Did you plan to backpatch that? My inclination would be no... I did that, and found that it helped Tim's test case not at all :-(. A bit of sleuthing revealed that HeapTupleSatisfiesVacuum actually returns INSERT_IN_PROGRESS for any tuple whose xmin isn't committed, regardless of whether the transaction has since marked it for deletion:

    /*
     * It'd be possible to discern between INSERT/DELETE in progress
     * here by looking at xmax - but that doesn't seem beneficial for
     * the majority of callers and even detrimental for some. We'd
     * rather have callers look at/wait for xmin than xmax. It's
     * always correct to return INSERT_IN_PROGRESS because that's
     * what's happening from the view of other backends.
     */
    return HEAPTUPLE_INSERT_IN_PROGRESS;

It did not use to blow this question off: back around 8.3 you got DELETE_IN_PROGRESS if the tuple had a delete pending. I think we need less laziness + fuzzy thinking here. Maybe we should have a separate HEAPTUPLE_INSERT_AND_DELETE_IN_PROGRESS result code? Is it *really* the case that callers other than VACUUM itself are okay with failing to make this distinction? I'm dubious: there are very few if any callers that treat the INSERT and DELETE cases exactly alike. My current position on this is that we should leave the code as is in < 9.4 and return HEAPTUPLE_INSERT_IN_PROGRESS for the 9.4/master. Would you be ok with that? The second best thing imo would be to discern and return HEAPTUPLE_INSERT_IN_PROGRESS/HEAPTUPLE_DELETE_IN_PROGRESS for the respective cases. Which way would you like to go? Did we ever come to a conclusion on this?
-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
[HACKERS] about half processes are blocked by btree, btree is bottleneck?
I use benchmarksql with more than 200 clients on pg 9.3.3. While the test is running, I collect all the process stacks. I found about 100 processes are blocked on btree insert, and another 100 are blocked on XLogInsert. Does btree have bad performance in concurrency scenarios?

Sum:66
#0 0x7f8273a77627 in semop () from /lib64/libc.so.6
#1 0x0060cda7 in PGSemaphoreLock ()
#2 0x006511a9 in LWLockAcquire ()
#3 0x004987f7 in _bt_relandgetbuf ()
#4 0x0049c116 in _bt_search ()
#5 0x00497e13 in _bt_doinsert ()
#6 0x0049af52 in btinsert ()
#7 0x0072dce4 in FunctionCall6Coll ()
#8 0x0049592e in index_insert ()
#9 0x00590ac5 in ExecInsertIndexTuples ()

Sum:36
#0 0x7f8273a77627 in semop () from /lib64/libc.so.6
#1 0x0060cda7 in PGSemaphoreLock ()
#2 0x006511a9 in LWLockAcquire ()
#3 0x00497e31 in _bt_doinsert ()
#4 0x0049af52 in btinsert ()
#5 0x0072dce4 in FunctionCall6Coll ()
#6 0x0049592e in index_insert ()
#7 0x00590ac5 in ExecInsertIndexTuples ()
Re: [HACKERS] Support for N synchronous standby servers
On Thu, Sep 11, 2014 at 5:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 08/28/2014 10:10 AM, Michael Paquier wrote: + #synchronous_standby_num = -1 # number of standbys servers using sync rep To be honest, that's a horrible name for the GUC. Back when synchronous replication was implemented, we had looong discussions on this feature. It was called quorum commit back then. I'd suggest using the quorum term in this patch, too; that's a fairly well-known term in distributed computing for this. I am open to any suggestions. Then what about the following parameter names?
- synchronous_quorum_num
- synchronous_standby_quorum
- synchronous_standby_quorum_num
- synchronous_quorum_commit
When synchronous replication was added, quorum was left out to keep things simple; the current feature set was the most we could all agree on to be useful. If you search the archives for quorum commit you'll see what I mean. There was a lot of confusion on what is possible and what is useful, but regarding this particular patch: people wanted to be able to describe more complicated scenarios. For example, imagine that you have a master and two standbys in the primary data center, and two more standbys in a different data center. It should be possible to specify that you must get acknowledgment from at least one standby in both data centers. Maybe you could hack that by giving the standbys in the same data center the same name, but it gets ugly, and it still won't scale to even more complex scenarios. Currently two nodes can only have the same priority if they have the same application_name, so we could for example add a new connstring parameter called, let's say, application_group, to define groups of nodes that will have the same priority (if a node does not define application_group, it defaults to application_name; if app_name is NULL, well, we don't care much, it cannot be a sync candidate). That's a first idea that we could use to control groups of nodes.
And we could switch syncrep.c to use application_group in s_s_names instead of application_name. That would be backward-compatible, and could open the door for more improvements for quorum commits as we could control groups of nodes. Well, this is a super-set of what application_name can already do, but there is no problem to identify single nodes of the same data center and how much they could be late in replication, so I think that this would be really user-friendly. An idea similar to that would be a base work for the next thing... See below. Now, in your case the two nodes on the second data center need to have the same priority either way. With this patch you can achieve that with the same node name. Where things are not that cool with this patch is something like that though:
- 5 slaves: 1 with master (node_local), 2 on a 2nd data center (node_center1), 2 last on a 3rd data center (node_center2)
- s_s_num = 3
- s_s_names = 'node_local,node_center1,node_center2'
In this case the nodes have the following priority:
- node_local = 1
- the 2 nodes with node_center1 = 2
- the 2 nodes with node_center2 = 3
In this {1,2,2,3,3} schema, the patch makes the system wait for node_local and the two nodes in node_center1, without caring about the ones in node_center2, as it will pick up only the nodes with the highest priority. If a user expects the system to wait for a node in node_center2 he'll be disappointed. That's perhaps where we could improve things, by adding an extra parameter able to control the priority ranks, say synchronous_priority_check:
- absolute|individual: wait for the first s_s_num nodes having the lowest priority; in this case we'll wait for {1,2,2}
- group: wait for only one node in each of the lowest s_s_num priorities; here we'll wait for {1,2,3}
Note that we may not even need this parameter if we assume by default that we wait for only one node in a given group that has the same priority. Maybe that's OK - we don't necessarily need to solve all scenarios at once.
But it's worth considering. Parametrizing and coverage of the user expectations are tricky. Either way not everybody can be happy :) There are even people unhappy now because we can only define one single sync node. BTW, how does this patch behave if there are multiple standbys connected with the same name? All the nodes have the same priority. For example, in the case of a cluster with 5 slaves having the same application name and s_s_num = 3, the first three nodes when scanning the WAL sender array are expected to return a COMMIT before committing locally:

=# show synchronous_standby_num;
 synchronous_standby_num
-------------------------
 3
(1 row)

=# show synchronous_standby_names;
 synchronous_standby_names
---------------------------
 node
(1 row)

=# SELECT application_name, client_port,
       pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
       sync_priority, sync_state
   FROM pg_stat_replication ORDER BY replay_delta ASC, appl
Re: [HACKERS] pg_basebackup vs. Windows and tablespaces
On 20 August 2014 19:49, Amit Kapila wrote: There are some comments I would like to share with you. 1. Rebase the patch to current GIT head. Done. + initStringInfo(&symlinkfbuf); I think the declaration and initialization of the symlinkfbuf string can be moved under the #ifdef WIN32 compile-time macro; for other platforms it's simply allocated and freed, which can be avoided. Agreed, I have changed the patch as per your suggestion. I have done the testing and the behavior is as per expectation. Do we need to make some documentation change? I mean, is this limitation on Windows mentioned anywhere? If no change is needed then I will move the patch to "Ready For Committer". Thanks Regards, Dilip
Re: [HACKERS] Aussie timezone database changes incoming
On 09/10/2014 11:23 PM, Tom Lane wrote: In connection with a question asked today on pgsql-general, I had occasion to go check the release announcements for the IANA timezone database files, and it turns out that there are some big changes in 2014f: http://mm.icann.org/pipermail/tz-announce/2014-August/23.html The Russian changes are perhaps not such a big deal because they've done that sort of thing before, but this is an earful: Australian eastern time zone abbreviations are now AEST/AEDT not EST, and similarly for the other Australian zones. That is, for eastern standard and daylight saving time the abbreviations are AEST and AEDT instead of the former EST for both; similarly, ACST/ACDT, ACWST/ACWDT, and AWST/AWDT are now used instead of the former CST, CWST, and WST. This change does not affect UTC offsets, only time zone abbreviations. (Thanks to Rich Tibbett and many others.) Oh, lovely. I shouldn't be surprised that Australia gets to change. While the cynic in me thinks this is the usual USA-is-the-center-of-the-universe-ism, in reality it makes sense given relative population and likely impact. I'm wondering how many Aussie applications are going to break when this goes in, and if we could/should do anything about it. One idea that comes to mind is to create an Australia_old tznames file containing the current Aussie zone abbreviations, so as to provide an easy way to maintain backwards compatibility at need (you'd select that as your timezone_abbreviations GUC setting). Anyone from down under care to remark about the actual usage of old and new abbreviations? Most systems I see work in UTC, but I don't actually work with many that're in Australia. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pg_basebackup vs. Windows and tablespaces
On Thu, Sep 11, 2014 at 9:10 AM, Dilip kumar dilip.ku...@huawei.com wrote: I have done the testing and the behavior is as per expectation. Do we need to make some documentation change? I mean, is this limitation on Windows mentioned anywhere? I don't think currently such a limitation is mentioned in the docs, however I think we can update the docs at the below locations: 1. In the description of pg_start_backup in the below page: http://www.postgresql.org/docs/devel/static/functions-admin.html#FUNCTIONS-ADMIN-BACKUP 2. In the explanation of Base Backup, basically under the heading Making a Base Backup Using the Low Level API at the below page: http://www.postgresql.org/docs/devel/static/continuous-archiving.html#BACKUP-BASE-BACKUP In general, we can explain the symlink_label file wherever we explain the backup_label file. If you think it is sufficient to explain symlink_label in the above places, then I can update the patch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Aussie timezone database changes incoming
The Russian changes are perhaps not such a big deal because they've done that sort of thing before, but this is an earful: Australian eastern time zone abbreviations are now AEST/AEDT not EST, and similarly for the other Australian zones. That is, for eastern standard and daylight saving time the abbreviations are AEST and AEDT instead of the former EST for both; similarly, ACST/ACDT, ACWST/ACWDT, and AWST/AWDT are now used instead of the former CST, CWST, and WST. This change does not affect UTC offsets, only time zone abbreviations. (Thanks to Rich Tibbett and many others.) [...] Anyone from down under care to remark about the actual usage of old and new abbreviations? AEST/AEDT/etc are the official abbreviations and are commonly used. They have been increasingly used over the last 20 years or so, and the EST/EDT stuff in the Olson tz database has been a source of annoyance for a very long time, eg: http://thread.gmane.org/gmane.comp.time.tz/2262 Quite likely this change will break stuff, but my feeling is more people will be cheering than screaming. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
I will repeat the above tests with high load on CPU and using the benchmark given by Fujii-san and post the results. Average % of CPU usage at user level for each of the compression algorithms is as follows.

Compression   Multiple    Single
Off           81.1338     81.1267
LZ4           81.0998     81.1695
Snappy        80.9741     80.9703
pglz          81.2353     81.2753

http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png
http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png

The numbers show the CPU utilization of Snappy is the least. The CPU utilization in increasing order is pglz > No compression > LZ4 > Snappy. The variance of the average CPU utilization numbers is very low. However, Snappy seems to be best when it comes to lesser utilization of CPU. As per the measurement results posted to date, LZ4 outperforms Snappy and pglz in terms of compression ratio and performance. However, CPU utilization numbers show Snappy utilizes the least amount of CPU, though the difference is not much. As there has been no consensus yet about which compression algorithm to adopt, is it better to make this decision independent of the FPW compression patch, as suggested earlier in this thread? FPW compression can be done using the built-in compression pglz, as it shows considerable performance over uncompressed WAL and a good compression ratio. Also, the patch to compress multiple blocks at once gives better compression as compared to a single block. ISTM that the performance overhead introduced by multiple-block compression is slightly higher than single-block compression, which can be tested again after modifying the patch to use pglz. Hence, this patch can be built using multiple-block compression. Thoughts? -- View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5818552.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.