Re: [HACKERS] Multithreaded SIGPIPE race in libpq on Solaris

2014-08-29 Thread Thomas Munro
On 29 August 2014 01:04, Thomas Munro mu...@ip9.org wrote:
 On 28 August 2014 23:45, Tom Lane t...@sss.pgh.pa.us wrote:
 I don't claim to be an expert on this stuff, but I had the idea that
 multithreaded environments were supposed to track signal state per-thread
 not just per-process, precisely because of issues like this.

 After some googling, I found reply #3 in
 https://community.oracle.com/thread/1950900?start=0&tstart=0 and
 various other sources which say that in Solaris version 10 they
 changed SIGPIPE delivery from per process (as specified by UNIX98) to
 per thread (as specified by POSIX:2001).  But we are on version 11, so
 my theory doesn't look great.  (Though 9 is probably still in use out
 there somewhere...)
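
(For context, the defense libpq itself employs on platforms with
per-thread delivery looks roughly like the sketch below -- a simplified
adaptation of the pq_block_sigpipe() logic in
src/interfaces/libpq/fe-secure.c, not the verbatim code.  Under
per-process delivery, as on Solaris 9, a SIGPIPE raised by another
thread can satisfy the sigpending() check or be eaten by the later
sigwait(), which is exactly the race being discussed.)

    #include <pthread.h>
    #include <signal.h>
    #include <stdbool.h>

    /*
     * Block SIGPIPE in the calling thread before writing to a socket that
     * may have been closed, and remember whether one was already pending,
     * so the caller knows whether it may consume a SIGPIPE afterwards
     * (via sigwait) or must leave it alone.
     */
    static int
    block_sigpipe(sigset_t *osigset, bool *sigpipe_pending)
    {
        sigset_t    sigpipe_sigset;
        sigset_t    sigset;

        sigemptyset(&sigpipe_sigset);
        sigaddset(&sigpipe_sigset, SIGPIPE);

        /* Block SIGPIPE and save the previous mask for a later reset */
        if (pthread_sigmask(SIG_BLOCK, &sigpipe_sigset, osigset) != 0)
            return -1;

        /* We can have a pending SIGPIPE only if it was blocked before */
        if (sigismember(osigset, SIGPIPE))
        {
            if (sigpending(&sigset) != 0)
                return -1;
            *sigpipe_pending = sigismember(&sigset, SIGPIPE);
        }
        else
            *sigpipe_pending = false;

        return 0;
    }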

I discovered that the machine we saw the problem on was running
Solaris 9 at the time, but has been upgraded since.  So I think my
sigwait race theory may have been correct, but we can just put this
down to a historical quirk and forget about it, because Solaris 9 is
basically deceased (extended support ends in October 2014).  Sorry
for the noise.

Best regards,
Thomas Munro


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why data of timestamptz does not store value of timezone passed to it?

2014-08-29 Thread arhipov


On 08/29/2014 05:28 AM, Tom Lane wrote:

k...@rice.edu writes:

On Thu, Aug 28, 2014 at 03:33:56PM -0400, Bruce Momjian wrote:

So the standard requires storing the original time zone in the data type?
I was not aware of that.

I do not have a copy of the SQL 92 spec, but several references to the
spec mention that it defined the time zone as a format SHH:MM where
S represents the sign (+ or -), which seems to be what PostgreSQL uses.

Yeah, the spec envisions timezone as being a separate numeric field
(ie, a numeric GMT offset) within a timestamp with time zone.  One of
the ways in which the spec's design is rather broken is that there's
no concept of real-world time zones with varying DST rules.

Anyway, I agree with the upthread comments that it'd have been better
if we'd used some other name for this datatype, and also that it's
at least ten years too late to revisit the choice :-(.

regards, tom lane



What about an alias for timestamptz? The current name is really confusing.
As for timestamp + time-zone (not just the offset) data type, it would 
be very useful. For example, in Java they have 5 time types: LocalDate 
for representing dates (date in Postgres), LocalTime for representing 
times (time in Postgres), LocalDateTime to represent a date with a time 
(timestamp in Postgres), Instant to represent a point on the time-line 
(timestamptz in Postgres) and ZonedDateTime that models a point on the 
time-line with a time-zone. Having a type for a time-zone itself would 
be useful as well.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-29 Thread Jeff Davis
On Fri, 2014-08-22 at 12:34 -0400, Robert Haas wrote:
 Still doesn't look good here.  On the same PPC64 machine I've been
 using for testing:

I have a new approach to the patch which is to call a callback at each
block allocation and child contexts inherit the callback from their
parents. The callback could be defined to simply dereference and
increment its argument, which would mean block allocations anywhere in
the hierarchy are incrementing the same variable, avoiding the need to
traverse the hierarchy.
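
A minimal sketch of that shape, with made-up type names standing in for
the real memory context machinery:

    #include <stddef.h>

    /* Hypothetical stand-ins, for illustration only */
    typedef void (*BlockAllocCallback) (void *arg);

    typedef struct CtxSketch
    {
        struct CtxSketch *parent;
        BlockAllocCallback cb;      /* inherited from parent at creation */
        void       *cb_arg;
    } CtxSketch;

    /* The callback described above: dereference the argument, increment it */
    static void
    count_block(void *arg)
    {
        (*(size_t *) arg)++;        /* or += block size */
    }

    /* Children inherit the callback, so a block allocation anywhere in the
     * hierarchy bumps the same counter without walking up to the parents. */
    static void
    ctx_init_child(CtxSketch *child, CtxSketch *parent)
    {
        child->parent = parent;
        child->cb = parent->cb;
        child->cb_arg = parent->cb_arg;
    }

    /* Hook point: called wherever a context grabs a new block from malloc() */
    static void
    on_block_alloc(CtxSketch *ctx)
    {
        if (ctx->cb != NULL)
            ctx->cb(ctx->cb_arg);
    }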

I've made some progress I think (thanks to both Robert and Tomas), but I
have a bit more microoptimization and testing to do. I'll mark it
returned with feedback for now, though if I find the time I'll do more
testing to see if the performance concerns are fully addressed.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] implement subject alternative names support for SSL connections

2014-08-29 Thread Heikki Linnakangas

On 08/28/2014 07:28 PM, Alexey Klyukin wrote:

On Mon, Aug 25, 2014 at 12:02 PM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:


On 08/24/2014 03:11 PM, Alexey Klyukin wrote:


On Wed, Aug 20, 2014 at 11:53 AM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:


  The patch doesn't seem to support wildcards in alternative names. Is

that on purpose?


Not really, we just did not have any use case for them, but it seems that
RFC 5280 does not disallow them:

  Finally, the semantics of subject alternative names that include
wildcard characters (e.g., as a placeholder for a set of names) are
not addressed by this specification.  Applications with specific
requirements MAY use such names, but they must define the semantics.

I've added support for them in the next iteration of the patch attached to
this email.


Hmm. So wildcards MAY be supported, but should we? I think we should 
follow the example of common browsers here, or OpenSSL or other SSL 
libraries; what do they do?


RFC 6125 section 6.4.4 Checking of Common Names says:


   As noted, a client MUST NOT seek a match for a reference identifier
   of CN-ID if the presented identifiers include a DNS-ID, SRV-ID,
   URI-ID, or any application-specific identifier types supported by the
   client.


So, to conform to that we shouldn't check the Common Name at all, if an 
alternative subject field is present.


(Relying on OpenSSL's hostname-checking function is starting to feel more 
and more appetizing. At least it won't be our problem then.)



It would be good to add a little helper function that does the NULL-check,
straight comparison, and wildcard check, for a single name. And then use
that for the Common Name and all the Alternatives. That'll ensure that all
the same rules apply whether the name is the Common Name or an Alternative
(assuming that the rules are supposed to be the same; I don't know if
that's true).


Thanks, common code has been moved into a separate new function.
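
For concreteness, such a single-name helper might look roughly like the
following -- a hypothetical sketch, not the function from the patch,
assuming the certificate name arrives as a length-counted string:

    #include <stdbool.h>
    #include <string.h>
    #include <strings.h>

    /*
     * Match one certificate name (Common Name or Subject Alternative Name)
     * against the host we connected to: NUL check, wildcard check, straight
     * comparison.  An embedded NUL byte is rejected because a name like
     * "good.com\0evil.com" would otherwise compare equal to "good.com".
     */
    static bool
    certname_matches_host(const char *certname, size_t certname_len,
                          const char *host)
    {
        /* NULL check: embedded NUL bytes mean a malicious certificate */
        if (strlen(certname) != certname_len)
            return false;

        /* Wildcard check: "*.example.com" matches one leading label */
        if (certname[0] == '*' && certname[1] == '.')
        {
            const char *dot = strchr(host, '.');

            return dot != NULL && strcasecmp(certname + 1, dot) == 0;
        }

        /* Straight comparison; hostnames are case-insensitive */
        return strcasecmp(certname, host) == 0;
    }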

Another question is how should we treat the certificates with no CN and
non-empty SAN?

Current code just bails out right after finding no CN section present, but
the RFC (5280) says:

Further, if the only subject identity included in the certificate is
an alternative name form (e.g., an electronic mail address), then the
subject distinguished name MUST be empty (an empty sequence), and the
subjectAltName extension MUST be present.

which to me sounds like the possibility of coming across such certificates
in the wild, although I personally see little use in them.


Yeah, I think a certificate without CN should be supported. See also RFC 
6125, section 4.1. Rules [for issuers of certificates]:



   5.  Even though many deployed clients still check for the CN-ID
   within the certificate subject field, certification authorities
   are encouraged to migrate away from issuing certificates that
   represent the server's fully qualified DNS domain name in a
   CN-ID.  Therefore, the certificate SHOULD NOT include a CN-ID
   unless the certification authority issues the certificate in
   accordance with a specification that reuses this one and that
   explicitly encourages continued support for the CN-ID identifier
   type in the context of a given application technology.


Certificates without a CN-ID are probably rare today, but they might 
start to appear in the future.


BTW, should we also support alternative subject names in the server, in 
client certificates? And if there are multiple names, which one takes 
effect? Perhaps better to leave that for a separate patch.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql \watch versus \timing

2014-08-29 Thread Heikki Linnakangas

On 08/28/2014 02:46 PM, Fujii Masao wrote:

On Tue, Aug 26, 2014 at 4:55 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

On 08/25/2014 10:48 PM, Heikki Linnakangas wrote:

Actually, perhaps it would be better to just copy-paste PSQLexec, and
modify the copy to suite \watch's needs. (PSQLexecWatch?
SendWatchQuery?). PSQLexec doesn't do much, and there isn't very much
overlap between what \watch wants and what other PSQLexec callers want.
\watch wants timing output, others don't. \watch doesn't want
transaction handling.


Agreed. Attached is the revised version of the patch. I implemented
PSQLexecWatch() which sends the query, prints the results and outputs
the query execution time (if \timing is enabled).
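
In outline (a simplified sketch using plain libpq and clock_gettime(),
not the actual psql code), PSQLexecWatch's job is just:

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>
    #include <libpq-fe.h>

    /* Send one query, print errors, report elapsed time when timing is on.
     * No transaction handling, per the discussion above. */
    static bool
    exec_watch(PGconn *conn, const char *query, bool timing)
    {
        struct timespec t0, t1;
        PGresult   *res;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        res = PQexec(conn, query);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (PQresultStatus(res) != PGRES_TUPLES_OK &&
            PQresultStatus(res) != PGRES_COMMAND_OK)
        {
            fprintf(stderr, "%s", PQerrorMessage(conn));
            PQclear(res);
            return false;
        }

        /* (psql prints the result table here) */
        PQclear(res);

        if (timing)
            printf("Time: %.3f ms\n",
                   (t1.tv_sec - t0.tv_sec) * 1000.0 +
                   (t1.tv_nsec - t0.tv_nsec) / 1000000.0);
        return true;
    }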

This patch was marked as ready for committer, but since I revised
the code very much, I marked this as needs review again.


This comment:


... We use PSQLexecWatch,
!	 * which is kind of cheating, but SendQuery doesn't let us suppress
!	 * autocommit behavior.


is a bit strange now. PSQLexecWatch isn't cheating like reusing PSQLexec 
was; its whole purpose is to run \watch queries.



/*
 * Set up cancellation of 'watch' via SIGINT.  We redo this each time
 * through the loop since it's conceivable something inside PSQLexec
 * could change sigint_interrupt_jmp.
 */


This should now say PSQLexecWatch.

Other than that, looks good to me.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] v4 protocol TODO item - Lazy fetch/stream of TOASTed values?

2014-08-29 Thread Heikki Linnakangas

On 08/28/2014 02:14 PM, Craig Ringer wrote:

Hi all

Per the protocol todo:

  https://wiki.postgresql.org/wiki/Todo#Wire_Protocol_Changes

do you think it's reasonable to allow for delayed sending of big varlena
values and arrays in the protocol?


Yes.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-08-29 Thread Heikki Linnakangas

On 08/27/2014 08:05 PM, Fabien COELHO wrote:



[...]


Yeah, something like that. I don't think it would be necessary to set
statement_timeout, you can inject that in your script or postgresql.conf if
you want. I don't think aborting a transaction that's already started is
necessary either. You could count it as LATE, but let it finish first.


I've implemented something along these simplified lines. The latency is
not limited as such, but slow (over the limit) queries are counted and
reported.


Ok, thanks.

This now begs the question:

In --rate mode, shouldn't the reported transaction latency also be 
calculated from the *scheduled* start time, not the time the transaction 
actually started? Otherwise we're using two different definitions of 
latency, one for the purpose of the limit, and another for reporting.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-08-29 Thread Etsuro Fujita

(2014/08/25 15:48), Etsuro Fujita wrote:

(2014/08/15 6:18), Rukh Meski wrote:

Based on the feedback on my previous patch, I've separated only the
LIMIT part into its own feature.  This version plays nicely with
inheritance.  The intended use is splitting up big UPDATEs and DELETEs
into batches more easily and efficiently.


Before looking into the patch, I'd like to know the use cases in more
details.


Thanks for the input, Amit, Kevin and Jeff!  I understand that the patch 
is useful.


I've looked at the patch a bit closely.  Here is my initial thought 
about the patch.


The patch places limit-counting inside ModifyTable, and works well for 
inheritance trees, but I'm not sure that that is the right way to go.  I 
think that this feature should be implemented in the way that we can 
naturally extend it to the ORDER-BY-LIMIT case in future.  But honestly 
the patch doesn't seem to take that into account; I might be missing 
something, though.  What plan do you have for the future extensibility?


I think that the approach discussed in [1] would be promising, so ISTM 
that it would be better to implement this feature by allowing for plans 
in the form of, e.g., ModifyTable+Limit+Append.


Thanks,

[1] http://www.postgresql.org/message-id/26819.1291133...@sss.pgh.pa.us

Best regards,
Etsuro Fujita


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-08-29 Thread Fabien COELHO


Hello Heikki,


This now begs the question:

In --rate mode, shouldn't the reported transaction latency also be calculated 
from the *scheduled* start time, not the time the transaction actually 
started? Otherwise we're using two different definitions of latency, one 
for the purpose of the limit, and another for reporting.


It could. Find a small patch **on top of v5** which does that. I've tried 
to update the documentation accordingly as well.


Note that the information is already there, as the average lag time is 
reported; ISTM that:


avg latency2 ~ avg lag + avg latency1

so it is just a matter of choice; both are ok somehow, and I would be 
fine with either.


--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 96e5fb9..40427a3 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -1125,12 +1125,24 @@ top:
 			commands[st->state + 1] == NULL)
 		{
 			instr_time	diff;
-			int64		latency, now;
+			int64		latency;
 
 			INSTR_TIME_SET_CURRENT(diff);
-			now = INSTR_TIME_GET_MICROSEC(diff);
-			INSTR_TIME_SUBTRACT(diff, st->txn_begin);
-			latency = INSTR_TIME_GET_MICROSEC(diff);
+
+			if (throttle_delay)
+			{
+				/* Under throttling, compute latency wrt the initial schedule,
+				 * not the actual transaction start.
+				 */
+				int64 now = INSTR_TIME_GET_MICROSEC(diff);
+				latency = now - thread->throttle_trigger;
+			}
+			else
+			{
+				INSTR_TIME_SUBTRACT(diff, st->txn_begin);
+				latency = INSTR_TIME_GET_MICROSEC(diff);
+			}
+
 			st->txn_latencies += latency;
 
 			/*
@@ -1144,16 +1156,8 @@ top:
 
 			/* record over the limit transactions if needed.
 			 */
-			if (latency_limit)
-			{
-				/* Under throttling, late means wrt the initial schedule,
-				 * not the actual transaction start
-				 */
-				if (throttle_delay)
-					latency = now - thread->throttle_trigger;
-				if (latency > latency_limit)
-					thread->latency_late++;
-			}
+			if (latency_limit && latency > latency_limit)
+				thread->latency_late++;
 		}
 
 		/*
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index f80116f..453ae4b 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -420,8 +420,10 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
         Show progress report every <literal>sec</> seconds.  The report
         includes the time since the beginning of the run, the tps since the
         last report, and the transaction latency average and standard
-        deviation since the last report.  Under throttling (<option>-R</>), it
-        also includes the average schedule lag time since the last report.
+        deviation since the last report.  Under throttling (<option>-R</>),
+        the latency is computed with respect to the transaction scheduled
+        start time, not the actual transaction beginning time, and it also
+        includes the average schedule lag time since the last report.
        </para>
      </listitem>
     </varlistentry>

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-08-29 Thread Marko Tiikkaja

On 8/29/14 12:20 PM, Etsuro Fujita wrote:

The patch places limit-counting inside ModifyTable, and works well for
inheritance trees, but I'm not sure that that is the right way to go.  I
think that this feature should be implemented in the way that we can
naturally extend it to the ORDER-BY-LIMIT case in future.  But honestly
the patch doesn't seem to take that into account; I might be missing
something, though.


The LIMIT part *has* to happen after the rows have been locked or it 
will work very surprisingly under concurrency (sort of like how FOR 
SHARE / FOR UPDATE worked before 9.0).  So either it has to be inside 
ModifyTable or the ModifyTable has to somehow pass something to a Limit 
node on top of it which would then realize that the tuples from 
ModifyTable aren't supposed to be sent to the client (unless there's a 
RETURNING clause).  I think it's a lot nicer to do the LIMITing inside 
ModifyTable, even though that duplicates a small portion of code that 
already exists in the Limit node.



What plan do you have for the future extensibility?

I think that the approach discussed in [1] would be promising, so ISTM
that it would be better to implement this feature by allowing for plans
in the form of, e.g., ModifyTable+Limit+Append.


I don't see an approach discussed there, just a listing of problems with 
no solutions.


This is just my personal opinion, but what I think should happen is:

  1) We put the LIMIT inside ModifyTable like this patch does.  This 
doesn't prevent us from doing ORDER BY in the future, but helps numerous 
people who today have to
  2) We allow ORDER BY on tables with no inheritance children using 
something similar to Rukh's previous patch.
  3) Someone rewrites how UPDATE works based on Tom's suggestion here: 
http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us, 
which allows us to support ORDER BY on all tables (or perhaps maybe not 
FDWs, I don't know how those work).  The LIMIT functionality in this 
patch is unaffected.


Now, I know some people disagree with me on whether step #2 is worth 
taking or not, but that's a separate discussion.  My point w.r.t. this 
patch still stands: I don't see any forwards compatibility problems with 
this approach, nor do I really see any viable alternatives either.



.marko


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-08-29 Thread Marko Tiikkaja

On 8/29/14 1:53 PM, I wrote:

This is just my personal opinion, but what I think should happen is:

1) We put the LIMIT inside ModifyTable like this patch does.  This
doesn't prevent us from doing ORDER BY in the future, but helps numerous
people who today have to


Oops, looks like I didn't finish my thought here.

.. but helps numerous people who today have to achieve the same thing 
via tedious, slow and problematic subqueries (or a choose-any-two 
combination of these).



.marko


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about coding of free space map

2014-08-29 Thread Heikki Linnakangas

On 08/26/2014 05:13 AM, Tatsuo Ishii wrote:

While looking into backend/storage/freespace/freespace.c, I noticed
that struct FSMAddress is passed to functions by value, rather than
reference. I thought our code practice is defining pointer to a struct
data and using the pointer for parameter passing etc.

typedef struct RelationData *Relation;

IMO it would be better for freespace.c to follow that practice.


There isn't really any strict coding rule on that. We pass RelFileNodes 
by value in many functions, for example.


IMHO it's clearer to pass small structs like this by value. For example, 
it irritates me that we tend to pass ItemPointers by reference, when 
it's a small struct that represents a single value. I think of it as an 
atomic value, like an integer, so it feels wrong to pass it by reference.
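
For concreteness, FSMAddress is just two ints, and freespace.c passes and
returns it by value; a simplified sketch of the pattern (modeled on
fsm_get_child, but not the actual code):

    #include <stdint.h>

    /* FSMAddress as declared in backend/storage/freespace/freespace.c */
    typedef struct
    {
        int         level;      /* level within the FSM tree */
        int         logpageno;  /* page number within the level */
    } FSMAddress;

    /*
     * Passing and returning by value: two ints travel in registers on
     * common ABIs, so there is no performance argument for a pointer,
     * and the struct reads like an atomic value.
     */
    static FSMAddress
    fsm_get_child(FSMAddress parent, uint16_t slot)
    {
        FSMAddress  child;

        child.level = parent.level - 1;
        /* illustrative fanout; the real code uses slots-per-FSM-page */
        child.logpageno = parent.logpageno * 4000 + slot;
        return child;
    }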


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Fix xpath() to return namespace definitions

2014-08-29 Thread Ali Akbar
Greetings,

Because of the memory bug in xmlCopyNode, this is a new patch with a
different method. Instead of using xmlCopyNode to bring the namespaces
back, we add the required namespaces to the node before dumping it to a
string, and clean them up afterwards (because the same node could be
dumped again in the next xpath result).

Thanks,
Ali Akbar


2014-07-11 15:36 GMT+07:00 Ali Akbar the.ap...@gmail.com:

 Greetings,

 Attached modified patch to handle xmlCopyNode returning NULL. The patch is
 larger because xmlerrcxt must be passed to xml_xmlnodetoxmltype (xmlerrcxt
 is needed for calling xml_ereport).

 From the libxml2 source, it turns out that the other cases in which
 xmlCopyNode will return NULL cannot happen. So in this patch I assume that
 the only case is memory exhaustion.

 But I found a bug in libxml2's code: because we call xmlCopyNode with 1
 as the second parameter, internally xmlCopyNode will call xmlStaticCopyNode
 recursively via xmlStaticCopyNodeList. And xmlStaticCopyNodeList doesn't
 check whether the return value of xmlStaticCopyNode is NULL. So if someday
 the recursion returns NULL (maybe because of memory exhaustion), it will
 SEGFAULT.

 Knowing this bug in libxml2's code, is this patch still acceptable?

 Thanks,
 Ali Akbar




-- 
Ali Akbar
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 422be69..f34c87e 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -141,9 +141,11 @@ static bool print_xml_decl(StringInfo buf, const xmlChar *version,
 			   pg_enc encoding, int standalone);
 static xmlDocPtr xml_parse(text *data, XmlOptionType xmloption_arg,
 		  bool preserve_whitespace, int encoding);
-static text *xml_xmlnodetoxmltype(xmlNodePtr cur);
+static text *xml_xmlnodetoxmltype(xmlNodePtr cur,
+	   PgXmlErrorContext *xmlerrcxt);
 static int xml_xpathobjtoxmlarray(xmlXPathObjectPtr xpathobj,
-	   ArrayBuildState **astate);
+	   ArrayBuildState **astate,
+	   PgXmlErrorContext *xmlerrcxt);
 #endif   /* USE_LIBXML */
 
 static StringInfo query_to_xml_internal(const char *query, char *tablename,
@@ -3590,27 +3592,109 @@ SPI_sql_row_to_xmlelement(int rownum, StringInfo result, char *tablename,
 
 #ifdef USE_LIBXML
 
+/* check ns definition of node and its children. If any one of ns is
+ * not defined in node and its children, but in the node's parent,
+ * copy the definition to node.
+ */
+static void
+xml_checkandcopyns(xmlNodePtr root,
+				   PgXmlErrorContext *xmlerrcxt,
+				   xmlNodePtr node,
+				   xmlNsPtr lastns_before)
+{
+	xmlNsPtr ns = NULL;
+	xmlNsPtr cur_ns;
+	xmlNodePtr cur;
+
+	if (node->ns != NULL)
+	{
+		/* check in new nses */
+		cur_ns = lastns_before == NULL ? node->nsDef : lastns_before->next;
+		while (cur_ns != NULL)
+		{
+			if (cur_ns->href != NULL)
+			{
+				if (((cur_ns->prefix == NULL) && (node->ns->prefix == NULL)) ||
+					((cur_ns->prefix != NULL) && (node->ns->prefix != NULL) &&
+					 xmlStrEqual(cur_ns->prefix, node->ns->prefix)))
+				{
+					ns = cur_ns;
+					break;
+				}
+			}
+			cur_ns = cur_ns->next;
+		}
+		if (ns == NULL) /* not in new nses */
+		{
+			ns = xmlSearchNs(NULL, node->parent, node->ns->prefix);
+
+			if (ns != NULL) {
+				ns = xmlNewNs(root, ns->href, ns->prefix);
+
+				if (ns == NULL && xmlerrcxt->err_occurred) {
+					xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
+								"could not allocate xmlNs");
+				}
+			}
+		}
+	}
+	/* check and copy ns for children recursively */
+	cur = node->children;
+	while (cur != NULL) {
+		xml_checkandcopyns(root, xmlerrcxt, cur, lastns_before);
+		cur = cur->next;
+	}
+}
+
 /*
  * Convert XML node to text (dump subtree in case of element,
  * return value otherwise)
  */
 static text *
-xml_xmlnodetoxmltype(xmlNodePtr cur)
+xml_xmlnodetoxmltype(xmlNodePtr cur, PgXmlErrorContext *xmlerrcxt)
 {
 	xmltype    *result;
 
 	if (cur->type == XML_ELEMENT_NODE)
 	{
 		xmlBufferPtr buf;
+		xmlNsPtr lastns_before;
+		xmlNsPtr ns;
+		xmlNsPtr next;
 
 		buf = xmlBufferCreate();
+
 		PG_TRY();
 		{
+			lastns_before = cur->nsDef;
+			if (lastns_before != NULL) {
+				while (lastns_before->next != NULL) {
+					lastns_before = lastns_before->next;
+				}
+			}
+			xml_checkandcopyns(cur, xmlerrcxt, cur, lastns_before);
+
 			xmlNodeDump(buf, NULL, cur, 0, 1);
 			result = xmlBuffer_to_xmltype(buf);
+
+			/* delete and free new nses */
+			ns = lastns_before == NULL ? cur->nsDef : lastns_before->next;
+			while (ns != NULL) {
+				next = ns->next;
+				xmlFree(ns);
+				ns = next;
+			}
+			if (lastns_before == NULL) {
+				cur->nsDef = NULL;
+			} else {
+				lastns_before->next = NULL;
+			}
 		}
 		PG_CATCH();
 		{
+			/* new namespaces will be freed while freeing the node, so we
+			 * don't free them here
+			 */
 			xmlBufferFree(buf);
 			PG_RE_THROW();
 		}
@@ -3656,7 +3740,8 @@ xml_xmlnodetoxmltype(xmlNodePtr cur)
  */
 static int
 xml_xpathobjtoxmlarray(xmlXPathObjectPtr xpathobj,
-	   ArrayBuildState **astate)
+	   ArrayBuildState 

[HACKERS] Misleading error message in logical decoding for binary plugins

2014-08-29 Thread Michael Paquier
Hi all,

Using a plugin producing binary output, I came across this error:
=# select data from pg_logical_slot_peek_changes('foo', NULL, NULL);
ERROR:  0A000: output plugin cannot produce binary output
LOCATION:  pg_logical_slot_get_changes_guts, logicalfuncs.c:404

Shouldn't the error message here be something like "binary output plugin
cannot produce textual output"? The plugin used in my case produces binary
output, but what is requested from it with pg_logical_slot_peek_changes is
textual output.

A patch is attached (with s/pluggin/plugin as a bonus). Comments welcome.
Regards,
-- 
Michael
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index a3a58e6..310b1e5 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -394,14 +394,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		MemoryContextSwitchTo(oldcontext);
 
 		/*
-		 * Check whether the output pluggin writes textual output if that's
+		 * Check whether the output plugin writes textual output if that's
 		 * what we need.
 		 */
 		if (!binary &&
 			ctx->options.output_type != OUTPUT_PLUGIN_TEXTUAL_OUTPUT)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("output plugin cannot produce binary output")));
+					 errmsg("binary output plugin cannot produce textual output")));
 
 		ctx->output_writer_private = p;
 
diff --git a/src/include/catalog/objectaccess.h b/src/include/catalog/objectaccess.h
index 4fdd056..1e1cb13 100644
--- a/src/include/catalog/objectaccess.h
+++ b/src/include/catalog/objectaccess.h
@@ -13,7 +13,7 @@
 /*
  * Object access hooks are intended to be called just before or just after
  * performing certain actions on a SQL object.  This is intended as
- * infrastructure for security or logging pluggins.
+ * infrastructure for security or logging plugins.
  *
  * OAT_POST_CREATE should be invoked just after the object is created.
  * Typically, this is done after inserting the primary catalog records and

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-08-29 Thread Alvaro Herrera
Mark Kirkwood wrote:
 On 29/08/14 08:56, Alvaro Herrera wrote:
 Robert Haas wrote:
 
 I agree that you might not like that.  But you might not like having
 the table vacuumed slower than the configured rate, either.  My
 impression is that the time between vacuums isn't really all that
 negotiable for some people.  I had one customer who had horrible bloat
 issues on a table that was vacuumed every minute; when we changed the
 configuration so that it was vacuumed every 15 seconds, those problems
 went away.
 
 Wow, that's extreme.  For that case you can set
 autovacuum_vacuum_cost_limit to 0, which disables the whole thing and
 lets vacuum run at full speed -- no throttling at all.  Would that
 satisfy the concern?
 
 Well no - you might have a whole lot of big tables that you want
 vacuum to not get too aggressive on, but a few small tables that are
 highly volatile. So you want *them* vacuumed really fast to prevent
 them becoming huge tables with only a few rows therein, but your
 system might not be able to handle *all* your tables being vacuumed
 at full speed.

I meant setting cost limit to 0 *for those tables* only, not for all of
them.

Anyway it seems to me there may be room for a new table storage
parameter, say autovacuum_do_balance, which determines whether to
participate in the balancing program or not.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misleading error message in logical decoding for binary plugins

2014-08-29 Thread Andres Freund
On 2014-08-29 22:42:46 +0900, Michael Paquier wrote:
 Hi all,
 
 Using a plugin producing binary output, I came across this error:
 =# select data from pg_logical_slot_peek_changes('foo', NULL, NULL);
 ERROR:  0A000: output plugin cannot produce binary output
 LOCATION:  pg_logical_slot_get_changes_guts, logicalfuncs.c:404
 
 Shouldn't the error message here be something like "binary output plugin
 cannot produce textual output"? The plugin used in my case produces binary
 output, but what is requested from it with pg_logical_slot_peek_changes is
 textual output.

I don't like the new message much. It's imo even more misleading than
the old message. How about
"output plugin produces binary output but the sink only accepts textual data"?

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misleading error message in logical decoding for binary plugins

2014-08-29 Thread Michael Paquier
On Fri, Aug 29, 2014 at 10:48 PM, Andres Freund and...@2ndquadrant.com
wrote:

 On 2014-08-29 22:42:46 +0900, Michael Paquier wrote:
  Hi all,
 
  Using a plugin producing binary output, I came across this error:
  =# select data from pg_logical_slot_peek_changes('foo', NULL, NULL);
  ERROR:  0A000: output plugin cannot produce binary output
  LOCATION:  pg_logical_slot_get_changes_guts, logicalfuncs.c:404
 
  Shouldn't the error message be here something like binary output plugin
  cannot produce textual output? The plugin used in my case produces
 binary
  output, but what is requested from it with pg_logical_slot_peek_changes
 is
  textual output.

 I don't like the new message much. It's imo even more misleading than
 the old message. How about
 "output plugin produces binary output but the sink only accepts textual
 data"?

Sounds fine to me. I am sure that native English speakers will come up
with additional suggestions as well.

Btw, I am having a hard time understanding in which case
OUTPUT_PLUGIN_BINARY_OUTPUT is actually useful. Looking at the code it is
only a subset of what OUTPUT_PLUGIN_TEXTUAL_OUTPUT can do, as textual can
generate both binary and textual output, while binary can only generate
binary output. The only place where an OUTPUT_PLUGIN_* flag is used is
in the code path for this very specific error message. On top of that, a
logical receiver receives data in textual form even if the decoder is
marked as binary when getting a message with PQgetCopyData.
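
For reference, the flag is declared exactly once, in the plugin's startup
callback; much as test_decoding's pg_decode_startup does for the textual
case, a binary plugin would do essentially this (callback name hypothetical):

    #include "postgres.h"
    #include "replication/output_plugin.h"
    #include "replication/logical.h"

    /* Startup callback: the one place a plugin declares its output type */
    static void
    my_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
                      bool is_init)
    {
        /* Declare binary output; the textual SQL functions will then
         * refuse this plugin, yielding the error discussed above. */
        opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
    }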

Perhaps I am missing something... But in this case I think that the
documentation should have a more detailed explanation about the output
format options. Currently only the option value is mentioned, and there is
nothing about what they actually do. This is error-prone for the user.
Regards,
-- 
Michael


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-29 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 I have a new approach to the patch which is to call a callback at each
 block allocation and child contexts inherit the callback from their
 parents. The callback could be defined to simply dereference and
 increment its argument, which would mean block allocations anywhere in
 the hierarchy are incrementing the same variable, avoiding the need to
 traverse the hierarchy.

Cute, but it's impossible to believe that a function call isn't going
to be *even more* overhead than what you've got now.  This could only
measure out as a win when the feature is going unused.

What about removing the callback per se and just keeping the argument,
as it were.  That is, a MemoryContext has a field of type size_t *
that is normally NULL, but if set to non-NULL, then we increment the
pointed-to value for pallocs and decrement for pfrees.

One thought in either case is that we don't have to touch the API for
MemoryContextCreate: rather, the feature can be enabled in a separate
call after making the context.
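
In sketch form (hypothetical names; the real MemoryContext struct and
palloc path differ):

    #include <stddef.h>

    /* Simplified context carrying the proposed accounting pointer */
    typedef struct CtxStub
    {
        size_t     *total_allocated;    /* normally NULL: accounting off */
    } CtxStub;

    /* Separate enabling call, so MemoryContextCreate's API is untouched */
    static void
    ctx_enable_accounting(CtxStub *cxt, size_t *counter)
    {
        cxt->total_allocated = counter;
    }

    /* In the allocation path: one test and one add, no function call */
    static void
    ctx_account_alloc(CtxStub *cxt, size_t nbytes)
    {
        if (cxt->total_allocated != NULL)
            *cxt->total_allocated += nbytes;    /* pfree path subtracts */
    }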

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removing dependency to wsock32.lib when compiling code on WIndows

2014-08-29 Thread Noah Misch
On Thu, Aug 21, 2014 at 11:29:31PM -0400, Noah Misch wrote:
 On Thu, Aug 21, 2014 at 09:18:22AM -0400, Andrew Dunstan wrote:
  What's happening about this? Buildfarm animal jacana is consistently
  red because of this.
 
 If nobody plans to do the aforementioned analysis in the next 4-7 days, I
 suggest we adopt one of Michael's suggestions: force configure to reach its
 old conclusion about getaddrinfo() on Windows.  Then the analysis can proceed
 on an unhurried schedule.

Done.

Incidentally, jacana takes an exorbitant amount of time, most recently 87
minutes, to complete the ecpg-check step.  Frogmouth takes 4 minutes, and none
of the other steps have such a large multiplier between the two animals.  That
pattern isn't new and appears on multiple branches.  I wonder if ecpg tickles
a performance bug in 64-bit MinGW-w64.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misleading error message in logical decoding for binary plugins

2014-08-29 Thread Andres Freund
On 2014-08-29 23:07:44 +0900, Michael Paquier wrote:
 On Fri, Aug 29, 2014 at 10:48 PM, Andres Freund and...@2ndquadrant.com
 wrote:
 
  On 2014-08-29 22:42:46 +0900, Michael Paquier wrote:
   Hi all,
  
   Using a plugin producing binary output, I came across this error:
   =# select data from pg_logical_slot_peek_changes('foo', NULL, NULL);
   ERROR:  0A000: output plugin cannot produce binary output
   LOCATION:  pg_logical_slot_get_changes_guts, logicalfuncs.c:404
  
   Shouldn't the error message here be something like "binary output plugin
   cannot produce textual output"? The plugin used in my case produces
   binary output, but what is requested from it with
   pg_logical_slot_peek_changes is textual output.
 
  I don't like the new message much. It's imo even more misleading than
  the old message. How about
  "output plugin produces binary output but the sink only accepts textual
  data"?
 
 Sounds fine for me. I am sure that native English speakers will come up as
 well with additional suggestions.
 
 Btw, I am having a hard time understanding in which case
 OUTPUT_PLUGIN_BINARY_OUTPUT is actually useful. Looking at the code it is
 only a subset of what OUTPUT_PLUGIN_TEXTUAL_OUTPUT can do as textual can
 generate both binary and textual output, while binary can only generate
 binary output.

No, a textual output plugin is *NOT* allowed to produce binary
output. That'd violate e.g. pg_logical_slot_peek_changes's return type
because it's only declared to return text.
If you compile with assertions enabled not adhering to that will
actually result in an error:
/*
 * Assert ctx->out is in database encoding when we're writing textual
 * output.
 */
if (!p->binary_output)
	Assert(pg_verify_mbstr(GetDatabaseEncoding(),
						   ctx->out->data, ctx->out->len,
						   false));

 The only code path where a flag OUTPUT_PLUGIN_* is used is
 in this code path for this very specific error message.

Well, LogicalDecodingContext->binary_output is checked too.

 On top of that, a
 logical receiver receives data in textual form even if the decoder is
 marked as binary when getting a message with PQgetCopyData.

I have no idea what you mean by that. The data is sent in the format the
output plugin writes the data in.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misleading error message in logical decoding for binary plugins

2014-08-29 Thread Michael Paquier
On Fri, Aug 29, 2014 at 11:15 PM, Andres Freund and...@2ndquadrant.com
wrote:

 No, a textual output plugin is *NOT* allowed to produce binary
 output. That'd violate e.g. pg_logical_slot_peek_changes's return type
 because it's only declared to return text.


A textual output plugin can call pg_logical_slot_peek_binary_changes and
pg_logical_slot_peek_changes as well,
and a binary output plugin can only call
pg_logical_slot_peek_binary_changes, and will error out with
pg_logical_slot_peek_changes:
=# select pg_create_logical_replication_slot('foo', 'test_decoding');
 pg_create_logical_replication_slot
------------------------------------
 (foo,0/16C6880)
(1 row)
=# create table aa as select 1;
SELECT 1
=# select substring(encode(data, 'escape'), 1, 20),
   substring(data, 1, 20)
 FROM pg_logical_slot_peek_binary_changes('foo', NULL, NULL);
  substring   | substring
----------------------+--------------------------------------------
 BEGIN 1000   | \x424547494e2031303030
 table public.aa: INS | \x7461626c65207075626c69632e61613a20494e53
 COMMIT 1000  | \x434f4d4d49542031303030
(3 rows)
=# select pg_logical_slot_peek_changes('foo', NULL, NULL, 'force-binary',
'true');
ERROR:  0A000: output plugin cannot produce binary output
LOCATION:  pg_logical_slot_get_changes_guts, logicalfuncs.c:404
=# select substring(data, 1, 20)
from pg_logical_slot_peek_binary_changes('foo', NULL, NULL,
'force-binary', 'true');
 substring
--------------------------------------------
 \x424547494e2031303030
 \x7461626c65207075626c69632e61613a20494e53
 \x434f4d4d49542031303030
(3 rows)

Is that expected?
-- 
Michael


Re: [HACKERS] LIMIT for UPDATE and DELETE

2014-08-29 Thread Tom Lane
Marko Tiikkaja ma...@joh.to writes:
 The LIMIT part *has* to happen after the rows have been locked or it 
 will work very surprisingly under concurrency (sort of like how FOR 
 SHARE / FOR UPDATE worked before 9.0).

Good point.

 So either it has to be inside 
 ModifyTable or the ModifyTable has to somehow pass something to a Limit 
 node on top of it

... or we add a LockRows node below the Limit node.  Yeah, that would make
UPDATE/LIMIT a tad slower, but I think that might be preferable to what
you're proposing anyway.  Raw speed of what is fundamentally a fringe
feature ought not trump every other concern.

 This is just my personal opinion, but what I think should happen is:

1) We put the LIMIT inside ModifyTable like this patch does.  This 
 doesn't prevent us from doing ORDER BY in the future, but helps numerous 
 people who today have to
2) We allow ORDER BY on tables with no inheritance children using 
 something similar to Rukh's previous patch.
3) Someone rewrites how UPDATE works based on Tom's suggestion here: 
 http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us, 
 which allows us to support ORDER BY on all tables (or perhaps maybe not 
 FDWs, I don't know how those work).  The LIMIT functionality in this 
 patch is unaffected.

I still think we should skip #2 and go directly to work on #3.  Getting
rid of the unholy mess that is inheritance_planner would be a very nice
thing.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about coding of free space map

2014-08-29 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 08/26/2014 05:13 AM, Tatsuo Ishii wrote:
 While looking into backend/storage/freespace/freespace.c, I noticed
 that struct FSMAddress is passed to functions by value, rather than
 reference.

 There isn't really any strict coding rule on that. We pass RelFileNode's 
 by value in many functions, for example.

The only reason RelFileNodes work like that is that Robert blithely
ignored the coding rule when he was revising things to pass those around.
I've never been terribly happy about it, but it wasn't important enough
to complain about.

The cases where it *would* be important enough to complain about would
be performance-critical code paths, which RelFileNode usages typically
don't appear in (if you're messing with one you're most likely going
to do a filesystem access).  I'd be unhappy though if someone wanted
to start passing ItemPointers by value.  I doubt we can rely on C
compilers to pass those as efficiently as they pass pointers.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misleading error message in logical decoding for binary plugins

2014-08-29 Thread Andres Freund
On 2014-08-29 23:31:49 +0900, Michael Paquier wrote:
 On Fri, Aug 29, 2014 at 11:15 PM, Andres Freund and...@2ndquadrant.com
 wrote:
 
  No, a textual output plugin is *NOT* allowed to produce binary
  output. That'd violate e.g. pg_logical_slot_peek_changes's return type
  because it's only declared to return text.
 
 
 A textual output plugin can call pg_logical_slot_peek_binary_changes and
 pg_logical_slot_peek_changes as well,

Well, for one, an output plugin doesn't call
pg_logical_slot_peek_binary_changes; it's the other way round. And sure:
every text is also binary. So that's just fine.

 and a binary output plugin can only call
 pg_logical_slot_peek_binary_changes, and will error out with
 pg_logical_slot_peek_changes:
 =# select pg_create_logical_replication_slot('foo', 'test_decoding');
  pg_create_logical_replication_slot
 ------------------------------------
  (foo,0/16C6880)
 (1 row)
 =# create table aa as select 1;
 SELECT 1
 =# select substring(encode(data, 'escape'), 1, 20),
substring(data, 1, 20)
  FROM pg_logical_slot_peek_binary_changes('foo', NULL, NULL);
   substring   | substring
 ----------------------+--------------------------------------------
  BEGIN 1000   | \x424547494e2031303030
  table public.aa: INS | \x7461626c65207075626c69632e61613a20494e53
  COMMIT 1000  | \x434f4d4d49542031303030
 (3 rows)
 =# select pg_logical_slot_peek_changes('foo', NULL, NULL, 'force-binary',
 'true');
 ERROR:  0A000: output plugin cannot produce binary output
 LOCATION:  pg_logical_slot_get_changes_guts, logicalfuncs.c:404
 =# select substring(data, 1, 20)
 from pg_logical_slot_peek_binary_changes('foo', NULL, NULL,
 'force-binary', 'true');
  substring
 --------------------------------------------
  \x424547494e2031303030
  \x7461626c65207075626c69632e61613a20494e53
  \x434f4d4d49542031303030
 (3 rows)
 
 Is that expected?

Yes. The output plugin declares whether it requires the *output method*
to support binary data. pg_logical_slot_peek_changes *can not* support
binary data because it outputs data as
text. pg_logical_slot_peek_binary_changes *can* support binary data
because it returns bytea (and thus it also can output text, because
that's essentially a subset of binary data).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removing dependency to wsock32.lib when compiling code on WIndows

2014-08-29 Thread Andrew Dunstan


On 08/29/2014 10:15 AM, Noah Misch wrote:

On Thu, Aug 21, 2014 at 11:29:31PM -0400, Noah Misch wrote:

On Thu, Aug 21, 2014 at 09:18:22AM -0400, Andrew Dunstan wrote:

What's happening about this? Buildfarm animal jacana is consistently
red because of this.

If nobody plans to do the aforementioned analysis in the next 4-7 days, I
suggest we adopt one of Michael's suggestions: force configure to reach its
old conclusion about getaddrinfo() on Windows.  Then the analysis can proceed
on an unhurried schedule.

Done.

Incidentally, jacana takes an exorbitant amount of time, most recently 87
minutes, to complete the ecpg-check step.  Frogmouth takes 4 minutes, and none
of the other steps have such a large multiplier between the two animals.  That
pattern isn't new and appears on multiple branches.  I wonder if ecpg tickles
a performance bug in 64-bit MinGW-w64.




Good pickup. I wonder what it could be.

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why data of timestamptz does not store value of timezone passed to it?

2014-08-29 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kgri...@ymail.com writes:

 But the standard doesn't say anything about storing a time zone
 *name* or *abbreviation* -- it requires that it be stored as UTC
 with the *offset* (in hours and minutes).  That makes it pretty
 close to what we have -- it's all about a difference in
 presentation.  And as far as I can see it completely dodges the
 kinds of problems you're talking about.

 However, the added field creates its own semantic problems.
 As an example, is 2014-08-28 18:00:00-04 the same as or different from
 2014-08-28 17:00:00-05?  If they're different, which one is less?
 If they're the same, what's the point of storing the extra field?
 And do you really like equal values that behave differently,
 not only for I/O but for operations such as EXTRACT()?

 (I imagine the SQL spec gives a ruling on this issue, which
 I'm too lazy to look up; my point is that whatever they did, it
 will be the wrong thing for some use-cases.)

I think (based on your earlier post) that we agree that it would have
been better to implement the type named in the standard according
to the definition given in the standard (and to use a new type name
for the more generally useful behavior PostgreSQL currently uses
for timestamptz), but that it's too late to go there now.  QUEL's
relational calculus is superior in just about every way to SQL, but
if we're going to go with the standard because it *is* a standard,
then let's freaking *do* it and extend where beneficial. Otherwise,
why switch from QUEL in the first place?

It was actually rather disappointing to hear that we had a
conforming implementation and changed away from it circa the 7.2
release; and even more disturbing to hear that decision is still
being defended on the grounds that there's no point providing
standard conforming behavior if we can think of different behavior
that we feel is more useful.  We should have both.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Why data of timestamptz does not store value of timezone passed to it?

2014-08-29 Thread Greg Stark
On Fri, Aug 29, 2014 at 4:03 PM, Kevin Grittner kgri...@ymail.com wrote:
 It was actually rather disappointing to hear that we had a
 conforming implementation and changed away from it circa the 7.2
 release; and even more disturbing to hear that decision is still
 being defended on the grounds that there's no point providing
 standard conforming behavior if we can think of different behavior
 that we feel is more useful.  We should have both.

I don't think the behaviour was standards-compliant in 7.2 either. For
that matter, I can't think of any circumstance where the standard
behaviour is useful. There's absolutely no way to write correct code
using it.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misleading error message in logical decoding for binary plugins

2014-08-29 Thread Michael Paquier
On Fri, Aug 29, 2014 at 11:39 PM, Andres Freund and...@2ndquadrant.com
wrote:

 Yes. The output plugin declares whether it requires the *output method*
 to support binary data. pg_logical_slot_peek_changes *can not* support
 binary data because it outputs data as
 text. pg_logical_slot_peek_binary_changes *can* support binary data
 because it returns bytea (and thus it also can output text, because
 that's essentially a subset of binary data).

Thanks for the explanations. This would merit more detail within the
docs: what those two modes actually do, and what differences, advantages
and disadvantages the user can expect when choosing one or the other at
the start of decoding...
-- 
Michael


Re: [HACKERS] Why data of timestamptz does not store value of timezone passed to it?

2014-08-29 Thread Tom Lane
Kevin Grittner kgri...@ymail.com writes:
 It was actually rather disappointing to hear that we had a
 conforming implementation and changed away from it circa the 7.2
 release;

That is not the case.  The existing implementation is work that Tom
Lockhart did around 6.3 or so.  It was called timestamp at the time,
and was renamed to timestamp with time zone in 7.2, in order to make
room for timestamp without time zone (which I think *is* spec compliant
or close enough).  That was probably an unfortunate choice; but at
no time was there code in PG that did what the spec says timestamp
with time zone should do.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Why data of timestamptz does not store value of timezone passed to it?

2014-08-29 Thread David G Johnston
On Fri, Aug 29, 2014 at 11:12 AM, Greg Stark [via PostgreSQL] 
ml-node+s1045698n5816903...@n5.nabble.com wrote:

 On Fri, Aug 29, 2014 at 4:03 PM, Kevin Grittner [hidden email] wrote:
  It was actually rather disappointing to hear that we had a
  conforming implementation and changed away from it circa the 7.2
  release; and even more disturbing to hear that decision is still
  being defended on the grounds that there's no point providing
  standard conforming behavior if we can think of different behavior
  that we feel is more useful.  We should have both.

 I don't think the behaviour was standards-compliant in 7.2 either. For
 that matter, I can't think of any circumstance where the standard
 behaviour is useful. There's absolutely no way to write correct code
 using it.



And forcing people to change their data types to migrate to PostgreSQL is
undesirable IF our type is usefully equivalent to others in the majority of
situations - though I don't know if that is actually the case.

David J.





[HACKERS] Re: Why data of timestamptz does not store value of timezone passed to it?

2014-08-29 Thread Greg Stark
On Fri, Aug 29, 2014 at 4:19 PM, David G Johnston
david.g.johns...@gmail.com wrote:
 And forcing people to change their data types to migrate to PostgreSQL is
 undesirable IF our type is usefully equivalent to others in the majority of
 situations - though I don't know if that is actually the case.

You know... I wonder if we have enough leverage in the standards
committee these days that we could usefully push that direction
instead of being pushed around. The standard timestamp with time zone
is not very useful and I'm sure the standards committee wouldn't mind
having a useful point-in-time data type.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] On partitioning

2014-08-29 Thread Alvaro Herrera
Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
reference Tom's post
http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us
which mentions the possibility of a different partitioning
implementation than what we have so far.  As it turns out, I've been
thinking about partitioning recently, so I thought I would share what
I'm thinking so that others can poke holes.  My intention is to try to
implement this as soon as possible.


Declarative partitioning


In this design, partitions are first-class objects, not normal tables in
inheritance hierarchies.  There are no pg_inherits entries involved at all.

Partitions are a physical implementation detail.  Therefore we do not allow
the owner to be changed, or permissions to be granted directly to partitions;
all these operations happen to the parent relation instead.

System Catalogs
---------------

In pg_class we have two additional relkind values:

* relkind RELKIND_PARTITIONED_REL 'P' indicates a partitioned relation.
  It is used to indicate a parent table, i.e. one the user can directly
  address in DML queries.  Such relations DO NOT have their own storage.
  These use the same rules as regular tables for access privileges,
  ownership and so on.

* relkind RELKIND_PARTITION 'p' indicates a partition within a partitioned
  relation (its parent).  These cannot be addressed directly in DML
  queries and only limited DDL support is provided.  They don't have
  their own pg_attribute entries either and therefore they are always
  identical in column definitions to the parent relation.  Since they
  are not accessible directly, there is no need for ACL considerations;
  the parent relation's owner is the owner, and grants are applied to
  the parent relation only.
  XXX --- is there a need for a partition having different column
  default values than its parent relation?

Partitions are numbered sequentially, normally from 1 onwards; but it is
valid to have negative partition numbers and 0.  Partitions don't have
names (except automatically generated ones for pg_class.relname, but
they are unusable in DDL).

Each partition is assigned an Expression that receives a tuple and
returns boolean.  This expression returns true if a given tuple belongs
into it, false otherwise.  If a tuple for a partitioned relation is run
through expressions of all partitions, exactly one should return true.
If none returns true, it might be because the partition has not been
created yet.  A user-facing error is raised in this case (Rationale: if
user creates a partitioned rel and there is no partition that accepts
some given tuple, it's the user's fault.)

Additionally, each partitioned relation may have a master expression.
This receives a tuple and returns an integer, which corresponds to the
number of the partition it belongs into.

There are two new system catalogs:

pg_partitioned_rel -- (prelrelid, prelexpr)
pg_partition       -- (partrelid, partseq, partexpr, partoverflow)

For partitioned rels that have prelexpr, we run that expression and
obtain the partition number; as a crosscheck we run partexpr and ensure
it returns true.  For partitioned rels that don't have prelexpr, we run
partexpr for each partition in turn until one returns true.  This means
that for a properly set up partitioned table, we need to run a single
expression on a tuple to find out what partition the tuple belongs into.
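
In pseudo-C (the types here are entirely hypothetical; nothing like this
exists in the tree yet), the lookup just described is:

    #include <stdbool.h>

    typedef bool (*PartExpr) (const void *tuple);
    typedef int (*MasterExpr) (const void *tuple);

    typedef struct PartitionedRel
    {
        MasterExpr  master;     /* pg_partitioned_rel.prelexpr, or NULL */
        PartExpr   *partexprs;  /* pg_partition.partexpr, one per partition */
        int        *partseqs;   /* pg_partition.partseq for each entry */
        int         npartitions;
    } PartitionedRel;

    /* Returns the partition number, or -1 when no partition accepts the
     * tuple (the user-facing error case described above). */
    static int
    find_partition(const PartitionedRel *prel, const void *tuple)
    {
        int         i;

        if (prel->master != NULL)
        {
            /* single expression; the cross-check against that partition's
             * partexpr is omitted here */
            return prel->master(tuple);
        }

        /* no master expression: probe each partition's expression in turn */
        for (i = 0; i < prel->npartitions; i++)
        {
            if (prel->partexprs[i](tuple))
                return prel->partseqs[i];
        }
        return -1;
    }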

Per-partition expressions are formed as each partition is created, and
are based on the user-supplied partitioning criterion.  Master
expressions are formed at relation creation time.  (XXX Can we change
the master expression later, as a result of some ALTER command?
Presumably this would mean that all partitions might need to be
rewritten.)

Triggers
--------

(These are user-defined triggers, not partitioning triggers.  In fact
there are no partitioning triggers at all.)

Triggers are attached to the parent relation, not to the specific
partition.  When a trigger function runs on a tuple inserted, updated or
modified on a partition, the data received by the trigger function makes
it appear that the tuple belongs to the parent relation.  There is no
need to let the trigger know which partition the tuple went in or came
from.   XXX is there a need to give it the partition number that the
tuple went in?


Syntax
------

CREATE TABLE xyz ( ... ) PARTITION BY RANGE ( a_expr )

This creates the main table only: no partitions are created automatically.

We do not support other types of partitioning at this stage.  We will
implement these later.

We do not currently support ALTER TABLE/PARTITION BY (i.e. partition a
table after the fact).  We leave this as a future improvement.

Allowed actions on RELKIND_PARTITIONED_REL:

* ALTER TABLE xyz CREATE PARTITION n
  This creates a new partition.
* ALTER TABLE xyz CREATE PARTITION FOR value
  Same as above; the partition number is determined automatically.

Allowed actions on a RELKIND_PARTITION:

Re: [HACKERS] On partitioning

2014-08-29 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 [ partition sketch ]

 In this design, partitions are first-class objects, not normal tables in
 inheritance hierarchies.  There are no pg_inherits entries involved at all.

Hm, actually I'd say they are *not* first class objects; the problem with
the existing design is exactly that child tables *are* first class
objects.  This is merely a terminology quibble though.

 * relkind RELKIND_PARTITION 'p' indicates a partition within a partitioned
   relation (its parent).  These cannot be addressed directly in DML
   queries and only limited DDL support is provided.  They don't have
   their own pg_attribute entries either and therefore they are always
   identical in column definitions to the parent relation.

Not sure that not storing the pg_attribute rows is a good thing; but
that's something that won't be clear till you try to code it.

 Each partition is assigned an Expression that receives a tuple and
 returns boolean.  This expression returns true if a given tuple belongs
 into it, false otherwise.

-1, in fact minus a lot.  One of the core problems of the current approach
is that the system, particularly the planner, hasn't got a lot of insight
into exactly what the partitioning scheme is in a partitioned table built
on inheritance.  If you allow the partitioning rule to be a black box then
that doesn't get any better.  I want to see a design wherein the system
understands *exactly* what the partitioning behavior is.  I'd start with
supporting range-based partitioning explicitly, and maybe we could add
other behaviors such as hashing later.

In particular, there should never be any question at all that there is
exactly one partition that a given row belongs to, not more, not less.
You can't achieve that with a set of independent filter expressions;
a meta-rule that says exactly one of them should return true is an
untrustworthy band-aid.

(This does not preclude us from mapping the tuple through the partitioning
rule and finding that the corresponding partition doesn't currently exist.
I think we could view the partitioning rule as a function from tuples to
partition numbers, and then we look in pg_class to see if such a partition
exists.)

 Additionally, each partitioned relation may have a master expression.
 This receives a tuple and returns an integer, which corresponds to the
 number of the partition it belongs into.

I guess this might be the same thing I'm arguing for, except that I say
it is not optional but is *the* way you define the partitioning.  And
I don't really want black-box expressions even in this formulation.
If you're looking for arbitrary partitioning rules, you can keep on
using inheritance.  The point of inventing partitioning, IMHO, is for
the system to have a lot more understanding of the behavior than is
possible now.

As an example of the point I'm trying to make, the planner should be able
to discard range-based partitions that are eliminated by a WHERE clause
with something a great deal cheaper than the theorem prover it currently
has to use for the purpose.  Black-box partitioning rules not only don't
improve that situation, they actually make it worse.

Other than that, this sketch seems reasonable ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-08-29 Thread Greg Stark
On Fri, Aug 29, 2014 at 4:56 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 For scan plans, we need to prepare Append lists which are used to scan
 for tuples in a partitioned relation.  We can setup fake constraint
 expressions based on the partitioning expressions, which let the planner
 discard unnecessary partitions by way of constraint exclusion.

 (In the future we might be interested in creating specialized plan and
 execution nodes that know more about partitioned relations, to avoid
 creating useless Append trees only to prune them later.)

This seems like a big part of the point of doing first class
partitions. If we have an equivalence class that specifies a constant
for all the variables in the master expression then we should be able
to look up the corresponding partition as an O(1) operation (or
O(log(n)) if it involves searching a list) rather than iterating over
all the partitions and trying to prove lots of exclusions. We might
even need a btree index to store the partitions so that we can handle
scaling up and still find the corresponding partitions quickly.
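For the simplest illustration, if hash partitioning were defined
transparently, the lookup really is a single O(1) computation; a sketch
(the mixing step is an arbitrary stand-in, not PostgreSQL's hash
function, and npartitions is assumed positive):

    #include <stdint.h>

    /* O(1) routing for hash partitioning */
    static inline int
    hash_partition_for(uint64_t key, int npartitions)
    {
        key *= UINT64_C(0x9e3779b97f4a7c15);    /* arbitrary 64-bit mix */
        return (int) (key % (uint64_t) npartitions);
    }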

And I think there are still unanswered questions about indexes. You
seem to be implying that users would be free to create any index they
want on any partition. It's probably going to be necessary to support
creating an index on the partitioned table which would create an index
on each of the partitions and, crucially, automatically create
corresponding indexes whenever new partitions are added.

That said, everything that's here sounds pretty spot-on to me.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-08-29 Thread Pavel Stehule
2014-08-29 18:35 GMT+02:00 Tom Lane t...@sss.pgh.pa.us:

 Alvaro Herrera alvhe...@2ndquadrant.com writes:
  [ partition sketch ]

  In this design, partitions are first-class objects, not normal tables in
  inheritance hierarchies.  There are no pg_inherits entries involved at
 all.

 Hm, actually I'd say they are *not* first class objects; the problem with
 the existing design is exactly that child tables *are* first class
 objects.  This is merely a terminology quibble though.


+1 .. even a few partitions slow down planning significantly, from 1 ms to
20 ms, which is an issue with very simple queries over the PK



  * relkind RELKIND_PARTITION 'p' indicates a partition within a
 partitioned
relation (its parent).  These cannot be addressed directly in DML
queries and only limited DDL support is provided.  They don't have
their own pg_attribute entries either and therefore they are always
identical in column definitions to the parent relation.

 Not sure that not storing the pg_attribute rows is a good thing; but
 that's something that won't be clear till you try to code it.

  Each partition is assigned an Expression that receives a tuple and
  returns boolean.  This expression returns true if a given tuple belongs
  into it, false otherwise.

 -1, in fact minus a lot.  One of the core problems of the current approach
 is that the system, particularly the planner, hasn't got a lot of insight
 into exactly what the partitioning scheme is in a partitioned table built
 on inheritance.  If you allow the partitioning rule to be a black box then
 that doesn't get any better.  I want to see a design wherein the system
 understands *exactly* what the partitioning behavior is.  I'd start with
 supporting range-based partitioning explicitly, and maybe we could add
 other behaviors such as hashing later.

 In particular, there should never be any question at all that there is
 exactly one partition that a given row belongs to, not more, not less.
 You can't achieve that with a set of independent filter expressions;
 a meta-rule that says exactly one of them should return true is an
 untrustworthy band-aid.

 (This does not preclude us from mapping the tuple through the partitioning
 rule and finding that the corresponding partition doesn't currently exist.
 I think we could view the partitioning rule as a function from tuples to
 partition numbers, and then we look in pg_class to see if such a partition
 exists.)

  Additionally, each partitioned relation may have a master expression.
  This receives a tuple and returns an integer, which corresponds to the
  number of the partition it belongs into.

 I guess this might be the same thing I'm arguing for, except that I say
 it is not optional but is *the* way you define the partitioning.  And
 I don't really want black-box expressions even in this formulation.
 If you're looking for arbitrary partitioning rules, you can keep on
 using inheritance.  The point of inventing partitioning, IMHO, is for
 the system to have a lot more understanding of the behavior than is
 possible now.

 As an example of the point I'm trying to make, the planner should be able
 to discard range-based partitions that are eliminated by a WHERE clause
 with something a great deal cheaper than the theorem prover it currently
 has to use for the purpose.  Black-box partitioning rules not only don't
 improve that situation, they actually make it worse.

 Other than that, this sketch seems reasonable ...

 regards, tom lane


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] On partitioning

2014-08-29 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 And I think there are still unanswered questions about indexes.

One other interesting thought that occurs to me: are we going to support
UPDATEs that cause a row to belong to a different partition?  If so, how
are we going to handle the update chain links?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-08-29 Thread Alvaro Herrera
Tom Lane wrote:
 Greg Stark st...@mit.edu writes:
  And I think there are still unanswered questions about indexes.
 
 One other interesting thought that occurs to me: are we going to support
 UPDATEs that cause a row to belong to a different partition?  If so, how
 are we going to handle the update chain links?

Bah, I didn't mention it?  My current thinking is that it would be
disallowed; if you have chosen your partitioning key well enough it
shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
Maybe we can allow it later, but for a first cut this seems more than
good enough.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-08-29 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Tom Lane wrote:
 One other interesting thought that occurs to me: are we going to support
 UPDATEs that cause a row to belong to a different partition?  If so, how
 are we going to handle the update chain links?

 Bah, I didn't mention it?  My current thinking is that it would be
 disallowed; if you have chosen your partitioning key well enough it
 shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
 Maybe we can allow it later, but for a first cut this seems more than
 good enough.

Hm.  I certainly agree that it's a case that could be disallowed for a
first cut, but it'd be nice to have some clue about how we might allow it
eventually.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-08-29 Thread Andres Freund
On 2014-08-29 13:15:16 -0400, Tom Lane wrote:
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
  Tom Lane wrote:
  One other interesting thought that occurs to me: are we going to support
  UPDATEs that cause a row to belong to a different partition?  If so, how
  are we going to handle the update chain links?
 
  Bah, I didn't mention it?  My current thinking is that it would be
  disallowed; if you have chosen your partitioning key well enough it
  shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
  Maybe we can allow it later, but for a first cut this seems more than
  good enough.
 
 Hm.  I certainly agree that it's a case that could be disallowed for a
 first cut, but it'd be nice to have some clue about how we might allow it
 eventually.

Not pretty, but we could set t_ctid to some 'magic' value when switching
partitions. Everything chasing ctid chains could then error out when
hitting an invisible row with such a t_ctid.  The use cases for doing such
updates really are more maintenance-style commands, so it's possibly not
too bad from a usability POV :(

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Inverse of pg_get_serial_sequence?

2014-08-29 Thread Andres Freund
Hi,

We have pg_get_serial_sequence() mapping (relation, column) to the
sequence. What I'm missing right now is the inverse. I.e. given a
sequence tell me the owner.
describe.c has a query for that, and it's not too hard to write, but it
still seems 'unfriendly' not to provide it.

Does anybody dislike adding a function for that?


I can't really think of a good name (not that pg_get_serial_sequence is
well named). pg_get_serial_sequence_owner(serial regclass, OUT rel
regclass, OUT colname name) maybe?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-08-29 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-08-29 13:15:16 -0400, Tom Lane wrote:
 Hm.  I certainly agree that it's a case that could be disallowed for a
 first cut, but it'd be nice to have some clue about how we might allow it
 eventually.

 Not pretty, but we could set t_ctid to some 'magic' value when switching
 partitions. Everything chasing ctid chains could then error out when
 hitting an invisible row with such a t_ctid.

An actual fix would presumably involve adding a partition number to the
ctid chain field in tuples in partitioned tables.  The reason I bring it
up now is that we'd have to commit to doing that (or at least leaving room
for it) in the first implementation, if we don't want to have an on-disk
compatibility break.

There is certainly room to argue that the value of this capability isn't
worth the disk space this solution would eat.  But we should have that
argument while the option is still feasible ...

 The use cases for doing such
 updates really are more maintenance-style commands, so it's possibly not
 too bad from a usability POV :(

I'm afraid that might just be wishful thinking.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.5] Custom Plan API

2014-08-29 Thread Robert Haas
On Wed, Aug 27, 2014 at 6:51 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
  I'd like to follow this direction, and start stripping the DDL support.

 ...please make it so.

 The attached patch eliminates DDL support.

 Instead of the new CREATE CUSTOM PLAN PROVIDER statement,
 it adds an internal function, register_custom_scan_provider,
 that takes a custom plan provider name and a callback function
 to add an alternative scan path (in the form of a CustomPath)
 while the query planner is finding the cheapest path to
 scan the target relation.
 Also, documentation stuff is revised according to the latest
 design.
 Any other stuff keeps the previous design.

Comments:

1. There seems to be no reason for custom plan nodes to have MultiExec
support; I think this is an area where extensibility is extremely
unlikely to work out.  The MultiExec mechanism is really only viable
between closely-cooperating nodes, like Hash and HashJoin, or
BitmapIndexScan, BitmapAnd, BitmapOr, and BitmapHeapScan; and arguably
those things could have been written as a single, more complex node.
Are we really going to want to support a custom plan that can
substitute for a Hash or BitmapAnd node?  I really doubt that's very
useful.

2. This patch is still sort of on the fence about whether we're
implementing custom plans (of any type) or custom scans (thus, of some
particular relation).  I previously recommended that we confine
ourselves initially to the task of adding custom *scans* and leave the
question of other kinds of custom plan nodes to a future patch.  After
studying the latest patch, I'm inclined to suggest a slightly revised
strategy.  This patch is really adding THREE kinds of custom objects:
CustomPlanState, CustomPlan, and CustomPath. CustomPlanState inherits
from ScanState, so it is not really a generic CustomPlan, but
specifically a CustomScan; likewise, CustomPlan inherits from Scan,
and is therefore a CustomScan, not a CustomPlan.  But CustomPath is
different: it's just a Path.  Even if we only have the hooks to inject
CustomPaths that are effectively scans at this point, I think that
part of the infrastructure could be somewhat generic.  Perhaps
eventually we have CustomPath which can generate either CustomScan or
CustomJoin which in turn could generate CustomScanState and
CustomJoinState.

For now, I propose that we rename CustomPlan and CustomPlanState to
CustomScan and CustomScanState, because that's what they are; but that
we leave CustomPath as-is.  For ease of review, I also suggest
splitting this into a series of three patches: (1) add support for
CustomPath; (2) add support for CustomScan and CustomScanState; (3)
ctidscan.

3. Is it really a good idea to invoke custom scan providers for RTEs
of every type?  It's pretty hard to imagine that a custom scan
provider can do anything useful with, say, RTE_VALUES.  Maybe an
accelerated scan of RTE_CTE or RTE_SUBQUERY is practical somehow, but
even that feels like an awfully big stretch.  At least until clear use
cases emerge, I'd be inclined to restrict this to RTE_RELATION scans
where rte->relkind != RELKIND_FOREIGN_TABLE; that is, put the logic in
set_plain_rel_pathlist() rather than set_rel_pathlist().

(We might even want to consider whether the hook in
set_plain_rel_pathlist() ought to be allowed to inject a non-custom
plan; e.g. substitute a scan of relation B for a scan of relation A.
For example, imagine that B contains all rows from A that satisfy some
predicate. This could even be useful for foreign tables; e.g.
substitute a scan of a local copy of a foreign table for a reference
to that table.  But I put all of these ideas in parentheses because
they're only good ideas to the extent that they don't sidetrack us too
much.)

4. Department of minor nitpicks.  You've got a random 'xs' in the
comments for ExecSupportsBackwardScan. And, in contrib/ctidscan,
ctidscan_path_methods, ctidscan_plan_methods, and
ctidscan_exec_methods can have static initializers; there's no need to
initialize them at run time in _PG_init().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-08-29 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
  Tom Lane wrote:
  One other interesting thought that occurs to me: are we going to support
  UPDATEs that cause a row to belong to a different partition?  If so, how
  are we going to handle the update chain links?
 
  Bah, I didn't mention it?  My current thinking is that it would be
  disallowed; if you have chosen your partitioning key well enough it
  shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
  Maybe we can allow it later, but for a first cut this seems more than
  good enough.
 
 Hm.  I certainly agree that it's a case that could be disallowed for a
 first cut, but it'd be nice to have some clue about how we might allow it
 eventually.

I hesitate to suggest this, but we have free flag bits in
MultiXactStatus.  We could use a specially marked multixact member to
indicate the OID of the target relation; perhaps set an infomask bit to
indicate that this has happened.  Of course, no HOT updates are possible
so I think it's okay from a heap_prune_chain perspective.  This abuses
the knowledge that OIDs and XIDs are both 32 bits long.  

Since nowhere else we have the space necessary to store the longer data
that a cross-partition update would require, I don't see anything else
ATM.  (For a moment I thought about abusing combo CIDs, but that doesn't
work because this needs to be persistent and visible from other
backends, neither of which is a quality of combocids.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-08-29 Thread Andres Freund
On 2014-08-29 13:29:19 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-08-29 13:15:16 -0400, Tom Lane wrote:
  Hm.  I certainly agree that it's a case that could be disallowed for a
  first cut, but it'd be nice to have some clue about how we might allow it
  eventually.
 
  Not pretty, but we could set t_ctid to some 'magic' value when switching
  partitions. Everything chasing ctid chains could then error out when
  hitting an invisible row with such a t_ctid.
 
 An actual fix would presumably involve adding a partition number to the
 ctid chain field in tuples in partitioned tables.  The reason I bring it
 up now is that we'd have to commit to doing that (or at least leaving room
 for it) in the first implementation, if we don't want to have an on-disk
 compatibility break.

Right. Just adding it unconditionally doesn't sound feasible to me. Our
per-row overhead is already too large. And it doesn't sound fun to have
the first-class partitions use a different heap tuple format than plain
relations.

What we could do is to add some sort of 'jump' tuple when moving a tuple
from one relation to another. So, when updating a tuple between
partitions we add another in the old partition with xmin_jump =
xmax_jump = xmax_old and have the jump tuple's content point to the new
relation.
Far from pretty, but it'd only matter overhead-wise when used.

  The use cases for doing such
  updates really are more maintenance-style commands, so it's possibly not
  too bad from a usability POV :(
 
 I'm afraid that might just be wishful thinking.

I admit that you might very well be right there :(

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-08-29 Thread Fabien COELHO


Hello Heikki,


[...] I would be fine with both.


After giving it some thought, ISTM better to choose consistency over 
intuition, and have latency under throttling always defined wrt the 
scheduled start time and not the actual start time, even if having a 
latency of 1 ms for an OLTP load might seem surprising to some.


The other one can be computed by subtracting the average lag time.

I attached a v6 which is a consolidated patch of v5 + the small update for
the latency definition.
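To make the two definitions concrete, a small sketch (hypothetical names;
all times in microseconds, as in pgbench):

    #include <stdint.h>

    typedef struct
    {
        int64_t scheduled_start;    /* slot assigned by --rate throttling */
        int64_t actual_start;       /* when the transaction really began */
        int64_t finish;             /* when it completed */
    } txn_times;

    /* latency wrt the schedule, as reported under throttling */
    static int64_t
    schedule_latency(const txn_times *t)
    {
        return t->finish - t->scheduled_start;
    }

    /* the intuitive latency is recovered by subtracting the lag */
    static int64_t
    execution_latency(const txn_times *t)
    {
        int64_t lag = t->actual_start - t->scheduled_start;

        return schedule_latency(t) - lag;   /* == finish - actual_start */
    }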


--
Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..40427a3 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,18 @@ double		sample_rate = 0.0;
 int64		throttle_delay = 0;
 
 /*
+ * Transactions which take longer than this limit are counted as late
+ * and reported as such, although they are completed anyway.
+ *
+ * When under throttling: execution time slots which are more than
+ * this late (in us) are simply skipped, and the corresponding transaction
+ * is counted as such... it is not even started;
+ * otherwise, transactions above the limit are counted as such, with the latency
+ * measured wrt the transaction schedule, not its actual start.
+ */
+int64		latency_limit = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -238,6 +250,8 @@ typedef struct
 	int64		throttle_trigger;		/* previous/next throttling (us) */
 	int64		throttle_lag;	/* total transaction lag behind throttling */
 	int64		throttle_lag_max;		/* max transaction lag */
+	int64		throttle_latency_skipped; /* lagging transactions skipped */
+	int64		latency_late; /* late transactions */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -250,6 +264,8 @@ typedef struct
 	int64		sqlats;
 	int64		throttle_lag;
 	int64		throttle_lag_max;
+	int64		throttle_latency_skipped;
+	int64		latency_late;
 } TResult;
 
 /*
@@ -367,6 +383,10 @@ usage(void)
 		   "  -f, --file=FILENAME      read transaction script from FILENAME\n"
 		 "  -j, --jobs=NUM           number of threads (default: 1)\n"
 		 "  -l, --log                write transaction times to log file\n"
+		 "  -L, --limit=NUM          count transactions lasting more than NUM ms.\n"
+		 "                           under throttling (--rate), transactions behind schedule\n"
+		 "                           more than NUM ms are skipped, and those finishing more\n"
+		 "                           than NUM ms after their scheduled start are counted.\n"
 		 "  -M, --protocol=simple|extended|prepared\n"
 		 "                           protocol for submitting queries (default: simple)\n"
 		 "  -n, --no-vacuum          do not run VACUUM before tests\n"
@@ -1016,6 +1036,24 @@ top:
 
 		thread->throttle_trigger += wait;
 
+		if (latency_limit)
+		{
+			instr_time	now;
+			int64		now_us;
+			INSTR_TIME_SET_CURRENT(now);
+			now_us = INSTR_TIME_GET_MICROSEC(now);
+			while (thread->throttle_trigger < now_us - latency_limit)
+			{
+				/* if too far behind, this slot is skipped, and we
+				 * iterate till the next nearly on time slot.
+				 */
+				int64 wait = (int64) (throttle_delay *
+					1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0));
+				thread->throttle_trigger += wait;
+				thread->throttle_latency_skipped++;
+			}
+		}
+
 		st->until = thread->throttle_trigger;
 		st->sleeping = 1;
 		st->throttling = true;
@@ -1080,15 +1118,31 @@ top:
 			thread-exec_count[cnum]++;
 		}
 
-		/* transaction finished: record latency under progress or throttling */
-		if ((progress || throttle_delay) && commands[st->state + 1] == NULL)
+		/* transaction finished: record latency under progress or throttling,
+		 * or if latency limit is set
+		 */
+		if ((progress || throttle_delay || latency_limit) &&
+			commands[st->state + 1] == NULL)
 		{
 			instr_time	diff;
 			int64		latency;
 
 			INSTR_TIME_SET_CURRENT(diff);
-			INSTR_TIME_SUBTRACT(diff, st->txn_begin);
-			latency = INSTR_TIME_GET_MICROSEC(diff);
+
+			if (throttle_delay)
+			{
+				/* Under throttling, compute latency wrt the initial schedule,
+				 * not the actual transaction start.
+				 */
+				int64 now = INSTR_TIME_GET_MICROSEC(diff);
+				latency = now - thread->throttle_trigger;
+			}
+			else
+			{
+				INSTR_TIME_SUBTRACT(diff, st->txn_begin);
+				latency = INSTR_TIME_GET_MICROSEC(diff);
+			}
+
 			st->txn_latencies += latency;
 
 			/*
@@ -1099,6 +1153,11 @@ top:
 			 * would take 256 hours.
 			 */
 			st->txn_sqlats += latency * latency;
+
+			/* record over-the-limit transactions if needed.
+			 */
+			if (latency_limit && latency > latency_limit)
+				thread->latency_late++;
 		}
 
 		/*
@@ -1294,7 +1353,7 @@ top:
 	}
 
 	/* Record transaction start time under logging, progress or throttling */
-	if ((logfile || progress || throttle_delay) && st->state == 0)
+	if ((logfile || progress || throttle_delay || latency_limit) && st->state == 0)
 		INSTR_TIME_SET_CURRENT(st->txn_begin);
 
 	/* Record statement start time if per-command latencies are requested */
@@ 

[HACKERS] clang warning on master

2014-08-29 Thread Pavel Stehule
Hi

I see a lot of warnings

[pavel@localhost postgresql]$ make all | grep warning
exec.c:280:2: warning: implicitly declaring library function 'strlcpy' with
type 'unsigned long (char *, const char *, unsigned long)'
strlcpy(link_buf, fname, sizeof(link_buf));
^
exec.c:280:2: note: please include the header <string.h> or explicitly
provide a declaration for 'strlcpy'
1 warning generated.
exec.c:280:2: warning: implicitly declaring library function 'strlcpy' with
type 'unsigned long (char *, const char *, unsigned long)'
strlcpy(link_buf, fname, sizeof(link_buf));
^
exec.c:280:2: note: please include the header <string.h> or explicitly
provide a declaration for 'strlcpy'
1 warning generated.
path.c:188:3: warning: implicitly declaring library function 'strlcpy' with
type 'unsigned long (char *, const char *, unsigned long)'
strlcpy(ret_path, head, MAXPGPATH);
^
path.c:188:3: note: please include the header <string.h> or explicitly
provide a declaration for 'strlcpy'
1 warning generated.
path.c:188:3: warning: implicitly declaring library function 'strlcpy' with
type 'unsigned long (char *, const char *, unsigned long)'
strlcpy(ret_path, head, MAXPGPATH);
^
path.c:188:3: note: please include the header <string.h> or explicitly
provide a declaration for 'strlcpy'
1 warning generated.
thread.c:77:2: warning: implicitly declaring library function 'strlcpy'
with type 'unsigned long (char *, const char *, unsigned long)'
strlcpy(strerrbuf, strerror(errnum), buflen);
^
thread.c:77:2: note: please include the header <string.h> or explicitly
provide a declaration for 'strlcpy'
1 warning generated.
pgtz.c:53:2: warning: implicitly declaring library function 'strlcpy' with
type 'unsigned long (char *, const char *, unsigned long)'
strlcpy(tzdir + strlen(tzdir), "/timezone", MAXPGPATH -
strlen(tzdir));
^
pgtz.c:53:2: note: please include the header <string.h> or explicitly
provide a declaration for 'strlcpy'
1 warning generated.
xlog.c:5627:4: warning: implicitly declaring library function 'strlcpy'
with type 'unsigned long (char *, const char *, unsigned long)'
strlcpy(recoveryStopName,
recordRestorePointData->rp_name, MAXFNAMELEN);

Regards

Pavel


Re: [HACKERS] On partitioning

2014-08-29 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-08-29 13:29:19 -0400, Tom Lane wrote:
 An actual fix would presumably involve adding a partition number to the
 ctid chain field in tuples in partitioned tables.  The reason I bring it
 up now is that we'd have to commit to doing that (or at least leaving room
 for it) in the first implementation, if we don't want to have an on-disk
 compatibility break.

 What we could do is to add some sort of 'jump' tuple when moving a tuple
 from one relation to another. So, when updating a tuple between
 partitions we add another in the old partition with xmin_jump =
 xmax_jump = xmax_old and have the jump tuple's content point to the new
 relation.

Hm, that might work.  It sounds more feasible than Alvaro's suggestion
of abusing cmax --- I don't think that field is free for use in this
context.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] clang warning on master

2014-08-29 Thread Tom Lane
Pavel Stehule pavel.steh...@gmail.com writes:
 I see a lot of warnings

 [pavel@localhost postgresql]$ make all | grep warning
 exec.c:280:2: warning: implicitly declaring library function 'strlcpy' with
 type 'unsigned long (char *, const char *, unsigned long)'
 strlcpy(link_buf, fname, sizeof(link_buf));
 ^
 exec.c:280:2: note: please include the header <string.h> or explicitly
 provide a declaration for 'strlcpy'
 1 warning generated.

[ raised eyebrow... ]  exec.c certainly includes <string.h> already
(via c.h).  I think you should complain to your compiler vendor.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-08-29 Thread Hannu Krosing
On 08/29/2014 07:15 PM, Tom Lane wrote:
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Tom Lane wrote:
 One other interesting thought that occurs to me: are we going to support
 UPDATEs that cause a row to belong to a different partition?  If so, how
 are we going to handle the update chain links?
 Bah, I didn't mention it?  My current thinking is that it would be
 disallowed; if you have chosen your partitioning key well enough it
 shouldn't be necessary.  As a workaround you can always DELETE/INSERT.
 Maybe we can allow it later, but for a first cut this seems more than
 good enough.
 Hm.  I certainly agree that it's a case that could be disallowed for a
 first cut, but it'd be nice to have some clue about how we might allow it
 eventually.
There needs to be some structure that is specific to partitions, rather
than to multiple plain tables, which would then be used for both update
chains and cross-partition indexes (as you seem to imply by jumping from
indexes to update chains a few posts back).

It would need to replace the plain tid (pagenr, tupnr) with a triple of
(partid, pagenr, tupnr).

Cross-partition indexes are especially needed if we want to allow putting
UNIQUE constraints on non-partition-key columns.
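A sketch of the widened pointer this implies (field names follow the
triple above, widths illustrative only; today's on-disk ItemPointerData
carries just a block number and a line pointer offset):

    #include <stdint.h>

    typedef struct PartitionTid
    {
        uint32_t partid;    /* which partition of the partitioned table */
        uint32_t pagenr;    /* block number inside that partition */
        uint16_t tupnr;     /* line pointer offset within the page */
    } PartitionTid;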

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Built-in binning functions

2014-08-29 Thread Pavel Stehule
Hi

I am looking at your last patch and I have a few questions and notes:

1. I think the restriction to only numeric types is not necessary -
although we can live without it - there are a lot of non-numeric
categories: chars, dates, ...

2. I am still strongly afraid of the searching method used - there is no
way to check the validity of the input. Did you check how much slower a
linear search is? You said you have real data and a real use case. The
method used can return a wrong result without any warning, which is
acceptable for an extension, but I am not sure it is for a core feature.
(See the sketch after this list.)

3. I am thinking about the name - I don't think varwidth_bucket is correct
-- in relation to the name width_bucket ... what about range_bucket or
scope_bucket?

The last patch is very simple; there are no new compilation or regression
test issues.
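A sketch of what I mean in point 2 (a hypothetical float8-only helper,
not the patch's code): a linear scan can at least detect unsorted
thresholds up to the matching bucket, which a binary search cannot do
without a separate pass:

    #include <stdbool.h>

    static bool
    bucket_linear_checked(double value, const double *thresholds, int n,
                          int *bucket)
    {
        int i;

        for (i = 0; i < n; i++)
        {
            if (i > 0 && thresholds[i] <= thresholds[i - 1])
                return false;           /* unsorted input detected */
            if (value < thresholds[i])
                break;
        }
        *bucket = i;                    /* bucket index in 0..n */
        return true;
    }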

Regards

Pavel





2014-08-25 16:15 GMT+02:00 Petr Jelinek p...@2ndquadrant.com:

 Hi,

 I finally had some time to get back to this.

 I attached version3 of the patch which fixes Tom's complaint about int8
 version by removing the int8 version as it does not seem necessary (the
 float8 can handle integers just fine).

 This patch now basically has just one optimized function for float8 and
 one for numeric datatypes, just like width_bucket.

  On 08/07/14 02:14, Tom Lane wrote:
 Also, I'm not convinced by this business of throwing an error for a
 NULL array element.  Per spec, null arguments to width_bucket()
 produce a null result, not an error --- shouldn't this flavor act
 similarly?  In any case I think the test needs to use
 array_contains_nulls() not just ARR_HASNULL.


 I really think there should be a difference between a NULL array and a
 NULL inside the array, and that a NULL inside the array is wrong input.
 I changed the check to array_contains_nulls() though.

  On 08/07/14 02:14, Tom Lane wrote:
 Lastly, the spec defines behaviors for width_bucket that allow either
 ascending or descending buckets.  We could possibly do something
 similar


 I am not sure it's worth it here as we require input to be sorted anyway.
 It might be worthwhile if we decided to do this as an aggregate (since
 there input would not be required to be presorted) instead of function but
 I am not sure if that's desirable either as that would limit usability to
 only the single main use-case (group by and count()).



 On 20/07/14 11:01, Simon Riggs wrote:

 On 16 July 2014 20:35, Pavel Stehule pavel.steh...@gmail.com wrote:


  On 08/07/14 02:14, Tom Lane wrote:


 I didn't see any discussion of the naming question in this thread.
 I'd like to propose that it should be just width_bucket(); we can
 easily determine which function is meant, considering that the
 SQL-spec variants don't take arrays and don't even have the same
 number of actual arguments.


 I did mention in submission that the names are up for discussion, I am
 all
 for naming it just width_bucket.


 I had this idea too - but I am not sure if it is a good idea. The distance
 between the ANSI SQL width_bucket and our enhancement is larger than in
 our implementation of median, for example.

 I can live with both names, but I prefer the current name.



 So I suggest that we use the more generic function name bin(), with a
 second choice of discretize()  (though that seems annoyingly easy to
 spell incorrectly)


 I really don't think bin() is a good choice here; bucket has the same
 meaning in this context, and bin is often used as shorthand for binary,
 which would be confusing here.

 Anyway I currently left the name as it is, I would not be against using
 width_bucket() or even just bucket(), not sure about the discretize()
 option.


 --
  Petr Jelinek  http://www.2ndQuadrant.com/

  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-29 Thread Tomas Vondra
On 29.8.2014 16:12, Tom Lane wrote:
 Jeff Davis pg...@j-davis.com writes:
 I have a new approach to the patch which is to call a callback at each
 block allocation and child contexts inherit the callback from their
 parents. The callback could be defined to simply dereference and
 increment its argument, which would mean block allocations anywhere in
 the hierarchy are incrementing the same variable, avoiding the need to
 traverse the hierarchy.

 Cute, but it's impossible to believe that a function call isn't going
 to be *even more* overhead than what you've got now.  This could only
 measure out as a win when the feature is going unused.
 
 What about removing the callback per se and just keeping the argument,
 as it were.  That is, a MemoryContext has a field of type size_t *
 that is normally NULL, but if set to non-NULL, then we increment the
 pointed-to value for pallocs and decrement for pfrees.

How is that going to work with two memory contexts (parent/child), both
requesting accounting? If I understand the proposal correctly, the child
will either inherit the pointer to the field in the parent (and then won't
get accounting for its own memory), or will get a private field (and thus
won't update the accounting of the parent context).

Or am I missing something?
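To make this concrete, a toy model (deliberately not the real
MemoryContext API) of the proposal:

    #include <stddef.h>

    typedef struct ToyContext
    {
        struct ToyContext *parent;
        size_t     *accounting;     /* NULL unless totals were requested */
    } ToyContext;

    static void
    account_block_alloc(ToyContext *cxt, size_t blksize)
    {
        /*
         * If the child inherits the parent's pointer, both bump a single
         * shared counter and the child has no total of its own; if the
         * child gets a private counter, the parent's total misses the
         * child's blocks.
         */
        if (cxt->accounting != NULL)
            *cxt->accounting += blksize;
    }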

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-08-29 Thread Hannu Krosing
On 08/29/2014 05:56 PM, Alvaro Herrera wrote:
 Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
 reference Tom's post
 http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us
 which mentions the possibility of a different partitioning
 implementation than what we have so far.  As it turns out, I've been
 thinking about partitioning recently, so I thought I would share what
 I'm thinking so that others can poke holes.  My intention is to try to
 implement this as soon as possible.


 Declarative partitioning
 ========================
 
 ...
 Still To Be Designed
 --------------------
 
 * Dependency issues
 * Are indexes/constraints inherited from the parent rel?
I'd say mostly yes.

There could be some extra constraint-exclusion-type magic for
conditional indexes, but the rest should probably come from the main table.

And there should be some kind of cross-partition indexes; as a
partitioning capability, this can probably wait for version 2.

 * Multiple keys?  
Why not. But probably just for hash partitioning.
 Subpartitioning? 
Probably not. If you need speed for huge numbers of partitions, use
Greg's idea of keeping the partitions in a tree (or just having a
partition index).
  Hash partitioning?
At some point definitely.


Also, one thing you left unmentioned is dropping (and perhaps also
truncating) a partition. We may still want to do historic data management
the same way we do it now, by just getting rid of the whole partition or
its data.

At some point we may also want to redistribute data between partitions,
maybe for the case where we end up with 90% of the data in one partition
due to a bad partitioning key or partitioning function choice. This is
again something that is hard now and can therefore be left to a later
version.

 Open Questions
 --------------

 *  What's the syntax to refer to specific partitions within a partitioned
table?
We could do TABLE xyz PARTITION n, but for example if in
the future we add hash partitioning, we might need some non-integer
addressing (OTOH assigning sequential numbers to hash partitions doesn't
seem so bad).  Discussing with users of other DBMSs partitioning feature,
one useful phrase is TABLE xyz PARTITION FOR value.
Or more generally

TABLE xyz PARTITION FOR/WHERE col1=val1, col2=val2, ...;



Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TODO] Track number of files ready to be archived in pg_stat_archiver

2014-08-29 Thread Julien Rouhaud

On 28/08/2014 05:58, Michael Paquier wrote:
 On Thu, Aug 28, 2014 at 7:37 AM, Julien Rouhaud 
 julien.rouh...@dalibo.com wrote:
 
 Attached v2 patch implements this approach. All the work is still
 done in pg_stat_get_archiver, as I don't think that having a
 specific function for that information would be really
 interesting.
 
 
 Please be sure to add that to the next commit fest. This is a
 feature most welcome within this system view. Regards,
 

I just added it.

Thanks.

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-08-29 Thread Alvaro Herrera
Hannu Krosing wrote:

 Cross-partition indexes are especially needed if we want to allow putting
 UNIQUE constraints on non-partition-key columns.

I'm not going to implement cross-partition indexes in the first patch.
They are a huge can of worms.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-08-29 Thread Robert Haas
On Fri, Aug 29, 2014 at 11:56 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 In this design, partitions are first-class objects, not normal tables in
 inheritance hierarchies.  There are no pg_inherits entries involved at all.

Whoa.  I always assumed that table inheritance was a stepping-stone to
real partitioning, and that real partitioning would be built on top of
table inheritance.  In particular, I assume that (as Itagaki
Takahiro's patch did those many years ago) we'd add some metadata
somewhere to allow fast tuple routing (for both pruning and
inserts/updates).  What's the benefit of inventing something new
instead?

I'm skeptical about your claim that there will be no pg_inherits
entries involved at all.  You need some way to know which partitions
go with which parent table.  You can store that many-to-one mapping
someplace other than pg_inherits, but it seems to me that that doesn't
buy you anything; they're just pg_inherits entries under some other
name.  Why reinvent that?

 Each partition is assigned an Expression that receives a tuple and
 returns boolean.  This expression returns true if a given tuple belongs
 into it, false otherwise.  If a tuple for a partitioned relation is run
 through expressions of all partitions, exactly one should return true.
 If none returns true, it might be because the partition has not been
 created yet.  A user-facing error is raised in this case (Rationale: if
 user creates a partitioned rel and there is no partition that accepts
 some given tuple, it's the user's fault.)

 Additionally, each partitioned relation may have a master expression.
 This receives a tuple and returns an integer, which corresponds to the
 number of the partition it belongs into.

I agree with Tom: this is a bad design.  In particular, if we want to
scale to large numbers of partitions (a principal weakness of the
present system) we need the operation of routing a tuple to a
partition to be as efficient as possible.  Range partitioning can be
O(lg n) where n is the number of partitions: store a list of the
boundaries and binary-search it.  List partitioning can be O(lg k)
where k is the number of values (which may be more than the number of
partitions) via a similar technique.  Hash partitioning can be O(1).
I'm not sure what other kind of partitioning anybody would want to do,
but it's likely that they *won't* want it to be O(1) in the number of
partitions.  So I'd say have *only* the master expression.

But, really, I don't think an expression is the right way to store
this; evaluating that repeatedly will, I think, still be too slow.
Think about what happens in PL/pgsql: minimizing the number of times
that you enter and exit the executor helps performance enormously,
even if the expressions are simple enough not to need planning.  I
think the representation should be more like an array of partition
boundaries and the pg_proc OID of a comparator.
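For instance, routing against a sorted boundary array is just an
upper-bound binary search; a sketch with int keys standing in for datums
and a hard-wired comparator:

    /* returns a partition number in 0..nbounds; bounds[] must be sorted */
    static int
    range_partition_for(int key, const int *bounds, int nbounds)
    {
        int lo = 0;
        int hi = nbounds;

        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;

            if (key < bounds[mid])
                hi = mid;
            else
                lo = mid + 1;
        }
        /* partition i holds keys in [bounds[i-1], bounds[i]) */
        return lo;
    }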

 Per-partition expressions are formed as each partition is created, and
 are based on the user-supplied partitioning criterion.  Master
 expressions are formed at relation creation time.  (XXX Can we change
 the master expression later, as a result of some ALTER command?
 Presumably this would mean that all partitions might need to be
 rewritten.)

This is another really important point.  If you store an opaque
expression mapping partitioning keys to partition numbers, you can't
do things like this efficiently.  With a more transparent
representation, like a sorted array of partition boundaries for range
partitioning, or a sorted array of hash values for consistent hashing,
you can do things like split and merge partitions efficiently,
minimizing rewriting.

 Planner
 -------

 A partitioned relation behaves just like a regular relation for purposes
 of planner.  XXX do we need special considerations regarding relation
 size estimation?

 For scan plans, we need to prepare Append lists which are used to scan
 for tuples in a partitioned relation.  We can setup fake constraint
 expressions based on the partitioning expressions, which let the planner
 discard unnecessary partitions by way of constraint exclusion.

So if we're going to do all this, why bother making the partitions
anything other than inheritance children?  There might be some benefit
in having the partitions be some kind of stripped-down object if we
could avoid some of these planner gymnastics and get, e.g. efficient
run-time partition pruning.  But if you're going to generate Append
plans and switch ResultRelInfos and stuff just as you would for an
inheritance hierarchy, why not just make it an inheritance hierarchy?

It seems pretty clear to me that we need partitioned tables to have
the same tuple descriptor throughout the relation, for efficient tuple
routing and so on.  But the other restrictions you're proposing to
impose on partitions have no obvious value that I can see.  We could
have a rule that when you inherit from a partition root, you can only
inherit from 

Re: [HACKERS] alter user set local_preload_libraries.

2014-08-29 Thread Peter Eisentraut
On 8/28/14 9:01 AM, Kyotaro HORIGUCHI wrote:
 I found this issue when trying per-pg_user (role) loading of
 auto_analyze and some tweaking tool. It is not necessarily set by
 the user on their own, but a function to decide whether to load some
 module based on the session user would be useful, at least for me :)

I think we could just set local_preload_libraries to PGC_USERSET and
document that subsequent changes won't take effect.  That's the same way
session_preload_libraries works.  That would avoid inventing another
very specialized GUC context.

If you're interested in improving this area, I also suggest you read the
thread of
http://www.postgresql.org/message-id/1349829917.29682.5.ca...@vanquo.pezone.net.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?

2014-08-29 Thread Peter Eisentraut
On 8/27/14 2:55 AM, Magnus Hagander wrote:
 I think the easy way of doing that is to just create an xlog.tar file.
 Since we already create base.tar and possibly n*tablespace.tar,
 adding one more file shouldn't be a big problem, and would make such
 an implementation much easier. Would be trivial to do .tar.gz for it
 as well, just like for the others.

That might be a way forward, but for someone who doesn't use tablespaces
and just wants an all-in-one backup, this change would make that more
cumbersome, because now you'd always have more than one file to deal with.

It might be worth considering a mode that combines all those tar files
into a super-tar.  I'm personally not a user of the tar mode, so I don't
know what a typical use would be, though.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?

2014-08-29 Thread Magnus Hagander
On Fri, Aug 29, 2014 at 10:34 PM, Peter Eisentraut pete...@gmx.net wrote:
 On 8/27/14 2:55 AM, Magnus Hagander wrote:
 I think the easy way of doing that is to just create an xlog.tar file.
 Since we already create base.tar and possibly n*tablespace.tar,
 adding one more file shouldn't be a big problem, and would make such
 an implementation much easier. Would be trivial to do .tar.gz for it
 as well, just like for the others.

 That might be a way forward, but for someone who doesn't use tablespaces
 and just wants and all-on-one backup, this change would make that more
 cumbersome, because now you'd always have more than one file to deal with.

It would in stream mode, which doesn't work at all.

I do agree with Robert's suggestion that we shouldn't remove file mode
right away - but we should change the default.


 It might be worth considering a mode that combines all those tar files
 into a super-tar.  I'm personally not a user of the tar mode, so I don't
 know what a typical use would be, though.

That would probably be useful, though a lot more difficult when you
consider two separate processes writing into the same tarfile. But I
agree that the format for the single-tablespace "just gimme a bloody
tarfile" case is quite inconvenient today, in that you need a directory
and we drop a base.tar in there. We should perhaps try to find a more
convenient way for that specific usecase, since it probably represents
the majority of users.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Why data of timestamptz does not store value of timezone passed to it?

2014-08-29 Thread Peter Eisentraut
On 8/29/14 11:27 AM, Greg Stark wrote:
 You know... I wonder if we have enough leverage in the standards
 committee these days that we could usefully push that direction
 instead of being pushed around. The standard timestamp with time zone
 is not very useful and I'm sure the standards committee wouldn't mind
 having a useful point-in-time data type.

Not likely unless Oracle or IBM have an existing implementation.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] delta relations in AFTER triggers

2014-08-29 Thread Kevin Grittner
Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 08/28/2014 12:03 AM, Kevin Grittner wrote:
 Heikki Linnakangas hlinnakan...@vmware.com wrote:
 I suggest adding a new hook to the ParseState struct, (p_rangevar_hook
 ?). The planner calls it whenever it sees a reference to a table, and
 the hook function returns back some sort of placeholder reference to the
 tuplestore. With variables, the hook returns a Param node, and at
 execution time, the executor calls the paramFetch hook to fetch the
 value of the param. For relations/tuplestores, I guess we'll need to
 invent something like a Param node, but for holding information about
 the relation. Like your TsrData struct, but without the pointer to the
 tuplestore. At execution time, in the SPI_execute call, you pass the
 pointer to the tuplestore in the ParamListInfo struct, like you pass
 parameter values.

 Does this make sense?

 I see your point, but SPI first has to be made aware of the
 tuplestores and their corresponding names and TupleDesc structures.
 Does it make sense to keep the SPI_register_tuplestore() and
 SPI_unregister_tuplestore() functions for the client side of the
 API, and pass things along to the parse analysis through execution
 phases using the techniques you describe?

 Sorry, I didn't understand that. What do you mean by "first", and "the
 client side of the API"? I don't see any need for the
 SPI_register_tuplestore() and and SPI_unregister_tuplestore() functions
 if you use the hooks.

If we were to go with the hooks as you propose, we would still need
to take the information from TriggerData and put it somewhere else
for the hook to reference.  The hooks are generalized for plpgsql,
not just for triggers, and it doesn't seem appropriate for them to
be fishing around in the TriggerData structure.  And what if we add
other sources for tuplestores?  The lookup during parse analysis
each time an apparent relation name is encountered must be simple
and fast.

I want named tuplestores to be easy to use from *all* PLs (for
trigger usage) as well as useful for other purposes people may want
to develop.  I had to change the hashkey for plpgsql's plan
caching, but that needs to be done regardless of the API (to
prevent problems in the obscure case that someone attaches the same
trigger function to the same table for the same events more than
once with different trigger names and different transition table
names).  If you ignore that, the *entire* change to use this in
plpgsql is to add these lines to plpgsql_exec_trigger():

    /*
     * Capture the NEW and OLD transition TABLE tuplestores (if specified for
     * this trigger).
     */
    if (trigdata->tg_newtable)
    {
        Tsr tsr = palloc(sizeof(TsrData));

        tsr->name = trigdata->tg_trigger->tgnewtable;
        tsr->tstate = trigdata->tg_newtable;
        tsr->tupdesc = trigdata->tg_relation->rd_att;
        tsr->relid = trigdata->tg_relation->rd_id;
        SPI_register_tuplestore(tsr);
    }
    if (trigdata->tg_oldtable)
    {
        Tsr tsr = palloc(sizeof(TsrData));

        tsr->name = trigdata->tg_trigger->tgoldtable;
        tsr->tstate = trigdata->tg_oldtable;
        tsr->tupdesc = trigdata->tg_relation->rd_att;
        tsr->relid = trigdata->tg_relation->rd_id;
        SPI_register_tuplestore(tsr);
    }

With the new SPI functions, the code to implement this in each
other PL should be about the same (possibly identical), and areas
using SPI only need similar code to make tuplestores visible to the
planner and usable in the executor if someone has another use for
this.  You just do the above once you have run SPI_connect() and
before preparing or executing any query that references the named
tuplestore.  It remains available on that SPI connection until
SPI_finish() is called or you explicitly unregister it (by name).

If we use the hooks, I think it will be several times as much code,
more invasive, and probably more fragile.  More importantly, these
hooks are not used by the other PLs included with core, and are not
used in any of the other core code, anywhere.  They all use SPI, so
they can do the above very easily, but adding infrastructure for
them to use the hooks would be a lot more work, and I'm not seeing
a corresponding benefit.

I think there is some room to change the API as used by the parser,
planner, and executor so that no changes are needed there if we add
some other registration mechanism besides SPI, but I think having a
registry for tuplestores that sort of takes the place of the
catalogs and related caches (but is far simpler, being process
local) is a better model than what you propose.

In summary, thinking of the definition of a named tuplestore as a
variable is the wrong parallel; it is more like a lightweight
relation, and the comparison should be to that, not to function
parameters or local variables.
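
To make the intended usage concrete, here is a sketch of what a
statement-level trigger could look like, assuming the SQL-standard
REFERENCING syntax this work is aimed at (all names below are purely
illustrative):

    CREATE TRIGGER accounts_audit
        AFTER UPDATE ON accounts
        REFERENCING OLD TABLE AS oldtab NEW TABLE AS newtab
        FOR EACH STATEMENT
        EXECUTE PROCEDURE audit_accounts();

    -- Inside audit_accounts(), the transition tables then read like
    -- ordinary (lightweight) relations:
    --     INSERT INTO audit_log (id, old_balance, new_balance)
    --     SELECT n.id, o.balance, n.balance
    --     FROM newtab n JOIN oldtab o USING (id);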

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Why data of timestamptz does not store value of timezone passed to it?

2014-08-29 Thread Gianni Ciolli
Hi Craig,

On Fri, Aug 29, 2014 at 10:17:17AM +0800, Craig Ringer wrote:
 (...) It should also discuss the approach of storing an (instant
 timestamptz, timezone text) or (instant timestamptz, tzoffset
 smallint) tuple for when unambiguous representation is required.
 
 (I guess I just volunteered myself to write a draft of that).

Please notice that smallint is too small for tzoffset: the tz database
contains historical offsets with second-level precision (local mean
time, before standardization), so no minutes-based value can represent
them, and an offset stored in seconds can reach +/-14:00 = +/-50400,
which overflows smallint.  For example:

  SELECT d AT TIME ZONE 'Europe/Berlin'
       - d AT TIME ZONE 'Europe/Paris'
  FROM (VALUES
          (date '1815-10-31')
        , (date '1897-02-19')
       ) AS f(d);
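
For the record, the text-column variant Craig describes could be
sketched like this (a minimal sketch only; table and column names are
illustrative, not a proposal):

    CREATE TABLE event (
        instant timestamptz NOT NULL,  -- unambiguous point on the time line
        tzname  text        NOT NULL   -- e.g. 'Europe/Paris'
    );

    -- reconstruct the wall-clock time as originally entered
    SELECT instant AT TIME ZONE tzname AS local_time
    FROM event;

Storing the zone name rather than an offset also sidesteps the
precision problem above, because the offset is recomputed from the tz
database on output.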

Cheers,
Dr. Gianni Ciolli - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
gianni.cio...@2ndquadrant.it | www.2ndquadrant.it




Re: [HACKERS] [RFC, POC] Don't require a NBuffer sized PrivateRefCount array of local buffer pins

2014-08-29 Thread Andres Freund
On 2014-08-26 22:04:03 -0400, Robert Haas wrote:
 That's all I see on a first-read through.

I think I fixed all of them. Thanks.

 There might be other
 issues, and I haven't checked through it in great detail for mundane
 bugs, but generally, I favor pressing on relatively rapidly toward a
 commit.  It seems highly likely that this idea is a big win, and if
 there's some situation in which it's a loss, we're more likely to find
 out with it in the tree (and thus likely to be tested by many more
 people) than by analysis from first principles.

Attached is the version I plan to commit after going over it again
tomorrow.

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From b48a28eee3518548740525243d1a165720a67231 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Fri, 29 Aug 2014 23:31:37 +0200
Subject: [PATCH] Make backend local tracking of buffer pins memory efficient.

Since the dawn of time (aka Postgres95) multiple pins of the same
buffer by one backend have been optimized not to modify the shared
refcount more than once. This optimization has always used an
NBuffers-sized array in each backend to keep track of that backend's
pins.

That array (PrivateRefCount) was one of the biggest per-backend memory
allocations, depending on the shared_buffers setting. Besides the
waste of memory, it has also proven to be a performance bottleneck
when assertions are enabled, as we check that no pins remain at the
end of a transaction. Also, on servers with lots of memory and a
correspondingly high shared_buffers setting, the number of random
memory accesses can lead to poor CPU cache efficiency.

For these reasons a backend's buffer pins are now tracked in a small,
statically sized array that overflows into a hash table when
necessary. Benchmarks have shown neutral to positive performance
results with considerably lower memory usage.

Patch by me, review by Robert Haas.
---
 contrib/pg_buffercache/pg_buffercache_pages.c |   2 +-
 src/backend/storage/buffer/buf_init.c |  39 +--
 src/backend/storage/buffer/bufmgr.c   | 418 --
 src/include/storage/bufmgr.h  |  19 --
 4 files changed, 391 insertions(+), 87 deletions(-)

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index b1be98f..d00f985 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -37,7 +37,7 @@ typedef struct
 	/*
 	 * An int32 is sufficiently large, as MAX_BACKENDS prevents a buffer from
 	 * being pinned by too many backends and each backend will only pin once
-	 * because of bufmgr.c's PrivateRefCount array.
+	 * because of bufmgr.c's PrivateRefCount infrastructure.
 	 */
 	int32		pinning_backends;
 } BufferCachePagesRec;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index e03394c..ff6c713 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -20,7 +20,6 @@
 
 BufferDesc *BufferDescriptors;
 char	   *BufferBlocks;
-int32	   *PrivateRefCount;
 
 
 /*
@@ -50,16 +49,9 @@ int32	   *PrivateRefCount;
  *
  * refcount --	Counts the number of processes holding pins on a buffer.
  *		A buffer is pinned during IO and immediately after a BufferAlloc().
- *		Pins must be released before end of transaction.
- *
- * PrivateRefCount -- Each buffer also has a private refcount that keeps
- *		track of the number of times the buffer is pinned in the current
- *		process.  This is used for two purposes: first, if we pin a
- *		a buffer more than once, we only need to change the shared refcount
- *		once, thus only lock the shared state once; second, when a transaction
- *		aborts, it should only unpin the buffers exactly the number of times it
- *		has pinned them, so that it will not blow away buffers of another
- *		backend.
+ *		Pins must be released before end of transaction.  For efficiency the
+ *		shared refcount isn't increased if an individual backend pins a buffer
+ *		multiple times. Check the PrivateRefCount infrastructure in bufmgr.c.
  */
 
 
@@ -130,31 +122,6 @@ InitBufferPool(void)
 }
 
 /*
- * Initialize access to shared buffer pool
- *
- * This is called during backend startup (whether standalone or under the
- * postmaster).  It sets up for this backend's access to the already-existing
- * buffer pool.
- *
- * NB: this is called before InitProcess(), so we do not have a PGPROC and
- * cannot do LWLockAcquire; hence we can't actually access stuff in
- * shared memory yet.  We are only initializing local data here.
- * (See also InitBufferPoolBackend, over in bufmgr.c.)
- */
-void
-InitBufferPoolAccess(void)
-{
-	/*
-	 * Allocate and zero local arrays of per-buffer info.
-	 */
-	PrivateRefCount = (int32 *) calloc(NBuffers, sizeof(int32));
-	if 

Re: [HACKERS] Inverse of pg_get_serial_sequence?

2014-08-29 Thread David G Johnston
Andres Freund-3 wrote
 Hi,
 
 We have pg_get_serial_sequence() mapping (relation, column) to the
 sequence. What I'm missing right now is the inverse. I.e. given a
 sequence tell me the owner.
 describe.c has a query for that, and it's not too hard to write, but it
 still seems 'unfriendly' not to provide it.
 
 Does anybody dislike adding a function for that?
 
 
 I can't really think of a good name (not that pg_get_serial_sequence is
 well named). pg_get_serial_sequence_owner(serial regclass, OUT rel
 regclass, OUT colname name) maybe?

On a pure consistency basis: pg_get_sequence_serial(...) [though probably
plural: _serials(...)]

I'd drop the serial part altogether for the more appropriate:

pg_get_sequence_ownedby(...)

Given that ALTER SEQUENCE ... OWNED BY ... is the corresponding SQL.

The inverse of what you proposed above would probably be more like:

pg_get_owned_sequence(...)

Reminder: sequences can be unowned.

Ownership and usage via a default are separate things though: do you need
to know all users of a sequence, or only the single one that is defined as
its owner?

pg_get_sequence_users(...) [or serials: as noted first]

David J.









Re: [HACKERS] Inverse of pg_get_serial_sequence?

2014-08-29 Thread Andres Freund
On 2014-08-29 17:55:38 -0700, David G Johnston wrote:
 Andres Freund-3 wrote
  Hi,
  
  We have pg_get_serial_sequence() mapping (relation, column) to the
  sequence. What I'm missing right now is the inverse. I.e. given a
  sequence tell me the owner.
  describe.c has a query for that, and it's not too hard to write, but it
  still seems 'unfriendly' not to provide it.
  
  Does anybody dislike adding a function for that?
  
  
  I can't really think of a good name (not that pg_get_serial_sequence is
  well named). pg_get_serial_sequence_owner(serial regclass, OUT rel
  regclass, OUT colname name) maybe?
 
 On a pure consistency basis: pg_get_sequence_serial(...) [though probably
 plural: _serials(...)]

Yea, but that's just horrid.

 I'd drop the serial part altogether for the more appropriate:
 
 pg_get_sequence_ownedby(...)

My problem is that it could be confused with the user owning the
sequence :/

 Reminder: sequences can be unowned.

Don't you say.

 Ownership and usage via default are separate things though: do you have need
 to know all users of a sequence or only the single one that is defined as
 it's owner?

I'd rather know all its users, but that's not really possible in the
general case without guessing. I'll settle for the column that's
declared as owning it. Even if we had an interface for guessing I'd not
want it to be the same as the one returning the declared owner.
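
For reference, the describe.c-style lookup is roughly this ('myseq'
standing in for the sequence of interest):

    SELECT d.refobjid::regclass AS tablename, a.attname AS colname
    FROM pg_depend d
    JOIN pg_attribute a ON a.attrelid = d.refobjid
                       AND a.attnum = d.refobjsubid
    WHERE d.objid = 'myseq'::regclass
      AND d.classid = 'pg_class'::regclass
      AND d.refclassid = 'pg_class'::regclass
      AND d.deptype = 'a';

Finding all users (every column whose default calls nextval() on the
sequence) would instead mean deparsing pg_attrdef entries, which is
the guessing referred to above.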

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inverse of pg_get_serial_sequence?

2014-08-29 Thread David G Johnston
Andres Freund-3 wrote
 On 2014-08-29 17:55:38 -0700, David G Johnston wrote:
 Andres Freund-3 wrote
 
 
 pg_get_sequence_ownedby(...)
 
 My problem is that that possibly be confused with the user owning the
 sequence :/

Though as soon as that person reads the output their misunderstanding would
be obvious.

I think it's fine, but ownedbycol or owningcol would be OK.

David J.







Re: [HACKERS] On partitioning

2014-08-29 Thread Amit Langote
On Sat, Aug 30, 2014 at 12:56 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
 reference Tom's post
 http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us
 which mentions the possibility of a different partitioning
 implementation than what we have so far.  As it turns out, I've been
 thinking about partitioning recently, so I thought I would share what
 I'm thinking so that others can poke holes.  My intention is to try to
 implement this as soon as possible.


+1.




Re: [HACKERS] ALTER SYSTEM RESET?

2014-08-29 Thread Amit Kapila
On Wed, Aug 27, 2014 at 7:16 PM, Fujii Masao masao.fu...@gmail.com wrote:
 The patch looks good to me. One minor comment is: probably you need to
 update the tab-completion code.

Thanks for the review.  I have updated the patch to support
tab-completion.
As this is a relatively minor change, I will mark it as
Ready For Committer rather than Needs Review.
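
For anyone following along, a brief sketch of the behaviour the patch
implements:

    ALTER SYSTEM SET work_mem = '64MB';  -- writes an entry to postgresql.auto.conf
    ALTER SYSTEM RESET work_mem;         -- removes just that entry
    ALTER SYSTEM RESET ALL;              -- removes all entries
    SELECT pg_reload_conf();             -- make the change take effect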


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


alter_system_reset.v5.patch
Description: Binary data
