Re: [HACKERS] PL/Python adding support for multi-dimensional arrays

2016-10-10 Thread Pavel Stehule
2016-10-11 7:49 GMT+02:00 Heikki Linnakangas :

> On 10/10/2016 08:42 PM, Pavel Stehule wrote:
>
>> 2016-10-10 12:31 GMT+02:00 Heikki Linnakangas :
>>
>> On 10/01/2016 02:45 AM, Jim Nasby wrote:
>>>
>>> On 9/29/16 1:51 PM, Heikki Linnakangas wrote:

 Now, back to multi-dimensional arrays. I can see that the Sequence
> representation is problematic, with arrays, because if you have a
> python
> list of lists, like [[1, 2]], it's not immediately clear if that's a
> one-dimensional array of tuples, or two-dimensional array of integers.
> Then again, we do have the type definitions available. So is it really
> ambiguous?
>
>
 [[1,2]] is a list of lists...
 In [4]: b=[[1,2]]

 In [5]: type(b)
 Out[5]: list

 In [6]: type(b[0])
 Out[6]: list

 If you want a list of tuples...
 In [7]: c=[(1,2)]

 In [8]: type(c)
 Out[8]: list

 In [9]: type(c[0])
 Out[9]: tuple


>>> Hmm, so we would start to treat lists and tuples differently? A Python
>>> list would be converted into an array, and a Python tuple would be
>>> converted into a composite type. That does make a lot of sense. The only
>>> problem is that it's not backwards-compatible. A PL/Python function that
>>> returns an SQL array of rows by returning a Python list of lists would
>>> start failing.
>>>
>>
>> Is it not possible to make the decision at the last moment - at the
>> PL/Postgres interface? There the expected type should be known.
>>
>
> Unfortunately there are cases that are fundamentally ambiguous.
>
> create type comptype as (intarray int[]);
> create function array_return() returns comptype[] as $$
>   return 1;
> $$ language plpython;
>
> What does the function return? It could be a two-dimensional array of
> comptype with a single-dimensional intarray, or a single-dimensional array
> of comptype with a two-dimensional intarray.
>
> We could resolve it for simpler cases, but not the general case. The
> simple cases would probably cover most things people do in practice. But if
> the distinction between a tuple and a list feels natural to Python
> programmers, I think it would be more clear in the long run to have people
> adjust their applications.
>

I agree. The distinction is natural - it is our issue that we don't
distinguish them strongly.

Regards

Pavel


>
> - Heikki
>
>


Re: [HACKERS] proposal: psql \setfileref

2016-10-10 Thread Pavel Stehule
2016-10-10 19:50 GMT+02:00 Pavel Stehule :

>
>
> 2016-10-10 15:17 GMT+02:00 Gilles Darold :
>
>> On 10/10/2016 at 14:38, Daniel Verite wrote:
>> >   Gilles Darold wrote:
>> >
>> >>postgres=# \setfileref b /dev/random
>> >>postgres=# insert into test (:b);
>> >>
>> >> The process needs to be killed using SIGTERM.
>> > This can be fixed by setting sigint_interrupt_enabled to true
>> > before operating on the file. Then ctrl-C would be able to cancel
>> > the command.
>>
>> If we do not prevent the user from being able to load special files, that
>> would be useful, yes.
>>
>
> I don't think special files make much sense here; I just had not had time to
> fix this issue. I will do it soon - I hope.
>

fresh patch - only regular files are allowed

Regards

Pavel


>
> Regards
>
> Pavel
>
>>
>> --
>> Gilles Darold
>> Consultant PostgreSQL
>> http://dalibo.com - http://dalibo.org
>>
>>
>
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 4806e77..9fb9cd9 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -2714,6 +2714,24 @@ testdb= \setenv LESS -imx4F
   
 
   
+\setfileref name [ value ]
+
+
+
+Sets the internal variable name as a reference to the file path
+set as value. To unset a variable, use the \unset command.
+
+File references are shown as a file path prefixed with the ^ character
+when using the \set command alone.
+
+Valid variable names can contain letters, digits, and underscores.
+See the section Variables below for details. Variable names are
+case-sensitive.
+
+
+  
+
+  
 \sf[+] function_description 
 
 
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index a9a2fdb..daa40b2 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1363,6 +1363,41 @@ exec_command(const char *cmd,
 		free(envval);
 	}
 
+	/* \setfileref - set variable by reference on file */
+	else if (strcmp(cmd, "setfileref") == 0)
+	{
+		char	   *name = psql_scan_slash_option(scan_state,
+  OT_NORMAL, NULL, false);
+
+		char	   *ref = psql_scan_slash_option(scan_state,
+		 OT_NORMAL, NULL, false);
+
+		success = false;
+
+		if (!name && !ref)
+		{
+			/* list all variables */
+			PrintSetFileRefVariables(pset.vars);
+			success = true;
+		}
+		else
+		{
+			if (!name || !ref)
+			{
+				psql_error("\\%s: missing required argument\n", cmd);
+				success = false;
+			}
+			else
+			{
+				if (!SetFileRef(pset.vars, name, ref))
+				{
+					psql_error("\\%s: error while setting variable\n", cmd);
+					success = false;
+				}
+			}
+		}
+	}
+
 	/* \sf -- show a function's source code */
 	else if (strcmp(cmd, "sf") == 0 || strcmp(cmd, "sf+") == 0)
 	{
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index a7789df..58b0065 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -12,6 +12,8 @@
 #include 
 #include 
 #include 
+#include 
+
 #ifndef WIN32
 #include /* for write() */
 #else
@@ -25,6 +27,7 @@
 #include "settings.h"
 #include "command.h"
 #include "copy.h"
+#include "catalog/pg_type.h"
 #include "crosstabview.h"
 #include "fe_utils/mbprint.h"
 
@@ -33,7 +36,6 @@ static bool ExecQueryUsingCursor(const char *query, double *elapsed_msec);
 static bool command_no_begin(const char *query);
 static bool is_select_command(const char *query);
 
-
 /*
  * openQueryOutputFile --- attempt to open a query output file
  *
@@ -109,6 +111,150 @@ setQFout(const char *fname)
 	return true;
 }
 
+void
+psql_reset_query_params(void)
+{
+	int			i;
+
+	for (i = 0; i < pset.nparams; i++)
+		if (pset.params[i] != NULL)
+		{
+			PQfreemem(pset.params[i]);
+			pset.params[i] = NULL;
+		}
+
+	pset.nparams = 0;
+}
+
+/*
+ * Load the content of the file referenced by a fileref variable into the
+ * query params buffer. When escaping is requested, text content is
+ * expected. Without escaping, bytea content is expected and the related
+ * bytea escaping is applied.
+ */
+static char *
+get_file_ref_content(const char *value, bool escape, bool as_ident)
+{
+	PQExpBufferData		buffer;
+	FILE			   *fd = NULL;
+	char			   *fname;
+	char			   *escaped_value;
+	char		line[1024];
+	size_t		size;
+	struct stat			fst;
+
+	fname = pstrdup(value);
+
+	expand_tilde(&fname);
+	if (!fname)
+	{
+		psql_error("missing valid path to file\n");
+		return NULL;
+	}
+
+	canonicalize_path(fname);
+
+	fd = fopen(fname, PG_BINARY_R);
+	if (!fd)
+	{
+		psql_error("%s: %s\n", fname, strerror(errno));
+		PQfreemem(fname);
+		return NULL;
+	}
+
+	/* ensure that the size of the file content is < 1GB */
+	if (fstat(fileno(fd), &fst) == -1)
+	{
+		psql_error("%s: %s\n", fname, strerror(errno));
+		PQfreemem(fname);
+		return NULL;
+	}
+
+	if (!S_ISREG(fst.st_mode))
+	{
+		psql_error("file referenced by a fileref variable is not a regular file\n");
+		PQfreemem(fname);
+		return NULL;
+	}
+
+	if (fst.st_size > 

[HACKERS] Comment typo

2016-10-10 Thread Amit Langote
Attached fixes what seems like a copy-pasto in pg_cast.h.

Thanks,
Amit
diff --git a/src/include/catalog/pg_cast.h b/src/include/catalog/pg_cast.h
index ee568d8..04d11c0 100644
--- a/src/include/catalog/pg_cast.h
+++ b/src/include/catalog/pg_cast.h
@@ -59,7 +59,7 @@ typedef enum CoercionCodes
 
 /*
  * The allowable values for pg_cast.castmethod are specified by this enum.
- * Since castcontext is stored as a "char", we use ASCII codes for human
+ * Since castmethod is stored as a "char", we use ASCII codes for human
  * convenience in reading the table.
  */
 typedef enum CoercionMethod



Re: [HACKERS] PL/Python adding support for multi-dimensional arrays

2016-10-10 Thread Heikki Linnakangas

On 10/10/2016 08:42 PM, Pavel Stehule wrote:

2016-10-10 12:31 GMT+02:00 Heikki Linnakangas :


On 10/01/2016 02:45 AM, Jim Nasby wrote:


On 9/29/16 1:51 PM, Heikki Linnakangas wrote:


Now, back to multi-dimensional arrays. I can see that the Sequence
representation is problematic, with arrays, because if you have a python
list of lists, like [[1, 2]], it's not immediately clear if that's a
one-dimensional array of tuples, or two-dimensional array of integers.
Then again, we do have the type definitions available. So is it really
ambiguous?



[[1,2]] is a list of lists...
In [4]: b=[[1,2]]

In [5]: type(b)
Out[5]: list

In [6]: type(b[0])
Out[6]: list

If you want a list of tuples...
In [7]: c=[(1,2)]

In [8]: type(c)
Out[8]: list

In [9]: type(c[0])
Out[9]: tuple



Hmm, so we would start to treat lists and tuples differently? A Python
list would be converted into an array, and a Python tuple would be
converted into a composite type. That does make a lot of sense. The only
problem is that it's not backwards-compatible. A PL/Python function that
returns an SQL array of rows by returning a Python list of lists would
start failing.


Is it not possible to make the decision at the last moment - at the
PL/Postgres interface? There the expected type should be known.


Unfortunately there are cases that are fundamentally ambiguous.

create type comptype as (intarray int[]);
create function array_return() returns comptype[] as $$
  return 1;
$$ language plpython;

What does the function return? It could be a two-dimensional array of
comptype with a single-dimensional intarray, or a single-dimensional array
of comptype with a two-dimensional intarray.


We could resolve it for simpler cases, but not the general case. The 
simple cases would probably cover most things people do in practice. But 
if the distinction between a tuple and a list feels natural to Python 
programmers, I think it would be more clear in the long run to have 
people adjust their applications.
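
To illustrate, with that rule an array of composites would be built from a
list of tuples. A hypothetical sketch (type and function names here are made
up for illustration):

create type pair as (a int, b int);
create function make_pairs() returns pair[] as $$
    # the outer list becomes the SQL array, each tuple a composite value
    return [(1, 2), (3, 4)]
$$ language plpythonu;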


- Heikki





Re: [HACKERS] Hash Indexes

2016-10-10 Thread Amit Kapila
On Mon, Oct 10, 2016 at 10:07 PM, Jeff Janes  wrote:
> On Mon, Oct 10, 2016 at 5:55 AM, Amit Kapila 
> wrote:
>>
>> On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila 
>> wrote:
>> > On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas 
>> > wrote:
>> >
>> >> Another thing I don't quite understand about this algorithm is that in
>> >> order to conditionally lock the target primary bucket page, we'd first
>> >> need to read and pin it.  And that doesn't seem like a good thing to
>> >> do while we're holding a shared content lock on the metapage, because
>> >> of the principle that we don't want to hold content locks across I/O.
>> >>
>> >
>>
>> Aren't we already doing this during BufferAlloc() when the buffer
>> selected by StrategyGetBuffer() is dirty?
>
>
> Right, you probably shouldn't allocate another buffer while holding a
> content lock on a different one, if you can help it.
>

I don't see the problem in that, but I guess the simple rule is that
we should not hold content locks for a long duration, which could
happen if we do I/O or need to allocate a new buffer.

> But, BufferAlloc
> doesn't do that internally, does it?
>

You are right that BufferAlloc() doesn't allocate a new buffer while
holding content lock on another buffer, but it does perform I/O while
holding content lock.

>  It is only a problem if you make it be
> one by the way you use it.  Am I missing something?
>
>>
>>
>> > I think we can release metapage content lock before reading the buffer.
>> >
>>
>> On thinking about this again, if we release the metapage content lock
>> before reading and pinning the primary bucket page, then we need to
>> take it again to verify if the split has happened during the time we
>> don't have a lock on a metapage.  Releasing and again taking content
>> lock on metapage is not
>> good from the performance aspect.  Do you have some other idea for this?
>
>
> Doesn't the relcache patch effectively deal with this?  If this is a
> sticking point, maybe the relcache patch could be incorporated into this
> one.
>

Yeah, the relcache patch would eliminate the need for metapage locking,
but that is not a blocking point.  As this patch is mainly to enable
WAL logging, there is no urgency to incorporate the relcache patch,
even if we have to go with an algorithm where we need to take the
metapage lock twice to verify the splits.  Having said that, I am
okay with it if Robert and/or others are also in favour of combining the
two patches (the patch in this thread and the cache-the-metapage patch).
If we don't want to hold a content lock across another ReadBuffer call,
then another option could be to modify the read algorithm as below:

read the metapage
compute bucket number for target hash key based on metapage contents
read the required block
loop:
    acquire shared content lock on metapage
    recompute bucket number for target hash key based on metapage contents
    if the recomputed block number is not same as the block number we read
        release meta page content lock
        read the recomputed block number
    else
        break;
if (we can't get a shared content lock on the target bucket without blocking)
    loop:
        release meta page content lock
        take a shared content lock on the target primary bucket page
        take a shared content lock on the metapage
        if (previously-computed target bucket has not been split)
            break;

The basic change here is that we first compute the target block number
*without* locking the metapage; then, after locking the metapage, if
the two don't match, we need to read the recomputed block number
again.

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Autovacuum launcher process launches worker process at high frequency

2016-10-10 Thread Masahiko Sawada
On Thu, Oct 6, 2016 at 12:11 AM, Jeff Janes  wrote:
> On Wed, Oct 5, 2016 at 7:28 AM, Masahiko Sawada 
> wrote:
>>
>> Hi all,
>>
>> I found the kind of strange behaviour of the autovacuum launcher
>> process when XID anti-wraparound vacuum.
>>
>> Suppose that a database (say test_db) whose datfrozenxid age is about
>> to reach autovacuum_freeze_max_age has two tables, T1 and T2.
>> T1 is very large and frequently updated, so vacuuming it takes a long
>> time.
>> T2 is a static, already-frozen table, so vacuum can skip scanning the
>> whole table.
>> And anti-wraparound vacuum has already been executed on the other databases.
>>
>> Once the age of datfrozenxid of test_db exceeds
>> autovacuum_freeze_max_age, the autovacuum launcher launches a worker
>> process in order to do an anti-wraparound vacuum on test_db.
>> A worker process assigned to test_db begins to vacuum T1, which takes a
>> long time.
>> Meanwhile another worker process is assigned to test_db, completes the
>> vacuum of T2, and exits.
>>
>> After a while, the autovacuum launcher launches a new worker again, and
>> that worker is assigned to test_db again.
>> But that worker exits quickly because there is no table that needs to be
>> vacuumed (T1 is being vacuumed by the other worker process).
>> When a new worker process starts, it sends a SIGUSR2 signal to the
>> launcher process to wake it up.
>> Although the launcher process executes WaitLatch() after launching a new
>> worker, it is woken up and soon launches yet another new worker process
>> again.
>
>
> See also this thread, which was never resolved:
>
> https://www.postgresql.org/message-id/flat/CAMkU%3D1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta%3DYPyFPQ%40mail.gmail.com#CAMkU=1yE4YyCC00W_GcNoOZ4X2qxF7x5DUAR_kMt-Ta=ypy...@mail.gmail.com
>
>
>
>>
>> As a result, the launcher launches new worker processes at an extremely
>> high frequency regardless of autovacuum_naptime, which increases CPU
>> usage.
>>
>> Why does the autovacuum worker need to wake up the launcher process after
>> it has started?
>>
>> autovacuum.c:L1604
>>  /* wake up the launcher */
>> if (AutoVacuumShmem->av_launcherpid != 0)
>> kill(AutoVacuumShmem->av_launcherpid, SIGUSR2);
>
>
>
> I think that that is so that the launcher can launch multiple workers in
> quick succession if it has fallen behind schedule. It can't launch them in a
> tight loop, because its signals to the postmaster would get merged into one
> signal, so it has to wait for one to get mostly set-up before launching the
> next.
>
> But it doesn't make any real difference to your scenario, as the short-lived
> worker will wake the launcher up a few microseconds later anyway, when it
> realizes it has no work to do and so exits.
>

Thank you for the reply.

I also thought that it would be better to have information about how many
tables there are in each database that have not been vacuumed yet.
But I'm not sure how to implement that, and the current optimistic
logic is safer in most situations.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] multivariate statistics (v19)

2016-10-10 Thread Tomas Vondra

Hi everyone,

thanks for the reviews. Let me sum the feedback so far, and outline my 
plans for the next patch version that I'd like to submit for CF 2016-11.



1) syntax changes

I agree with the changes proposed by Dean, although only a subset of the 
syntax is going to be supported until we add support for either join or 
partial statistics. So something like this:


 CREATE STATISTICS name
   [ WITH (options) ]
   ON (column1, column2 [, ...])
   FROM table

That should not be a difficult change.
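
A concrete instance of that syntax might look like this (the option names
here are purely illustrative, nothing is settled yet):

 CREATE STATISTICS s1
   WITH (ndistinct, dependencies)
   ON (a, b)
   FROM t1;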


2) catalog names

I'm not sure what the best names are, so I'm fine with using whatever
the consensus is.


That being said, I'm not sure I like extending the catalog to also 
support non-multivariate statistics (like for example statistics on 
expressions). While that would be a clearly useful feature, it seems 
like a slightly different use case and perhaps a separate catalog would 
be better. So maybe pg_statistic_ext is not the best name.



3) special data type(s) to store statistics

I agree using an opaque bytea value is not very nice. I see Heikki 
proposed using something like pg_node_tree, and maybe storing all the 
statistics in a single value.


I assume pg_node_tree was meant only as inspiration for how to build a 
pseudo-type on top of a varlena value. I agree that's a good idea, and I 
plan to do something like that - say adding pg_mcv, pg_histogram, 
pg_ndistinct and pg_dependencies data types.


Heikki also mentioned that maybe JSONB would be a good way to store the 
statistics. I don't think so - firstly, it only supports a subset of 
data types, so we'd be unable to store statistics for some data types 
(or we'd have to store them as text, which sucks). Also, there's a fair 
amount of smartness in how the statistics are stored (e.g. how the 
histogram bucket boundaries are deduplicated, or how the estimation uses 
the serialized representation directly). We'd lose all of that when 
using JSONB.


Similarly for storing all the statistics in a single value - I see no 
reason why keeping the statistics in separate columns would be a bad 
idea (after all, that's kinda the point of relational databases). Also, 
there are perfectly valid cases when the caller only needs a particular 
type of statistic - e.g. when estimating GROUP BY we'll only need the 
ndistinct coefficients. Why should we force the caller to fetch and 
detoast everything, and throw away probably 99% of that?


So my plan here is to define pseudo types similar to how pg_node_tree is 
defined. That does not seem like a tremendous amount of work.



4) functional dependencies

Several people mentioned they don't like how functional dependencies are 
detected at ANALYZE time, particularly that there's a sudden jump 
between 0 and 1. Instead, a continuous "dependency degree" between 0 and 
1 was proposed.


I'm fine with that, although that makes "clause reduction" (deciding 
that we don't need to estimate one of the clauses at all, as it's 
implied by some other clause) impossible. But that's fine, the 
functional dependencies will still be much less expensive than the other 
statistics.


I'm wondering how this will interact with transitivity, though. IIRC the 
current implementation is able to detect transitive dependencies and use 
that to reduce storage space etc.


In any case, this significantly complicates the functional dependencies, 
which were meant as a trivial type of statistics, mostly to establish 
the shared infrastructure. Which brings me to ndistinct.



5) ndistinct

So far, the ndistinct coefficients were lumped at the very end of the 
patch, and the statistic was only built but not used for any sort of 
estimation. I agree with Dean that perhaps it'd be better to move this 
to the very beginning, and use it as the simplest statistic to build the 
infrastructure instead of functional dependencies (which only gets truer 
due to the changes in functional dependencies, discussed in the 
preceding section).


I think it's probably a good idea and I plan to do that, so the patch 
series will probably look like this:


   * 001 - CREATE STATISTICS infrastructure with ndistinct coefficients
   * 002 - use ndistinct coefficients to improve GROUP BY estimates
   * 003 - use ndistinct coefficients in clausesel.c (not sure)
   * 004 - add functional dependencies (build + clausesel.c)
   * 005 - add multivariate MCV (build + clausesel.c)
   * 006 - add multivariate histograms (build + clausesel.c)

I'm not sure about using the ndistinct coefficients in clausesel.c to 
estimate regular conditions - it's the place for which ndistinct 
coefficients were originally proposed by Kyotaro-san, but I seem to 
remember it was non-trivial to choose the best statistics when there 
were other types of stats available. But I'll look into that.



6) combining statistics

I've decided not to re-submit this part of the patch until the basic 
functionality gets in. I do think it's a very useful feature (despite 
having my doubts about the 

[HACKERS] Re: "Re: Question about grant create on database and pg_dump/pg_dumpall

2016-10-10 Thread Noah Misch
On Thu, Sep 29, 2016 at 11:29:09AM -0400, Tom Lane wrote:
> The fundamental thing we have to do in order to move forward on this is
> to rethink what's the division of labor between pg_dump and pg_dumpall.
> I find the patch as presented quite unacceptable because it's made no
> effort to do that (or even to touch the documentation).
> 
> What do people think of this sketch:

This looks good so far.  It covers most of the problems from TODO item
"Refactor handling of database attributes between pg_dump and pg_dumpall".

> 1. pg_dump without --create continues to do what it does today, ie it just
> dumps objects within the database, assuming that database-level properties
> will already be set correctly for the target database.

dumpDatabase() isn't fully committed to that doctrine today; it skips most
database properties, but it does dump COMMENT ON DATABASE and any
pg_shseclabel entry for the database.  Would you make it stop doing that?

> One thing that would still be messy is that presumably "pg_dumpall -g"
> would issue ALTER ROLE SET commands, but it's unclear what to do with
> ALTER ROLE IN DATABASE SET commands.  Should those become part of
> "pg_dump --create"'s charter?  It seems like not, but I'm not certain.

"pg_dump --create" should emit ALTER ROLE IN DATABASE SET.  Regeardless,
pg_dump needs the roles in place for object ownership and ALTER DEFAULT
PRIVILEGES.  The latter is conceptually similar to ALTER ROLE IN DATABASE SET.
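
In other words, statements of this form (role, database, and setting names
here are only examples):

  ALTER ROLE alice IN DATABASE mydb SET search_path = app, public;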

> Another thing that requires some thought is that pg_dumpall is currently
> willing to dump ACLs and other properties for template1/template0, though
> it does not invoke pg_dump on them.  If we wanted to preserve that
> behavior while still moving the code that does those things to pg_dump,
> pg_dump would have to grow an option that would let it do that.  But
> I'm not sure how much of that behavior is actually sensible.

It's appropriate to restore the template1/template0 attributes one can cover
without connecting to the database (pg_database, pg_db_role_setting,
pg_shseclabel, pg_shdescription).  template0 modifications that require a
connection are essentially unsupported, so ignore those.  I think, ideally, a
full-cluster dump should include template1 contents.  For template1, then, one
would want pg_dump to emit everything except the CREATE DATABASE statement.
That's a regrettable wart.

Thanks,
nm




Re: [HACKERS] Supporting huge pages on Windows

2016-10-10 Thread Tsunakawa, Takayuki
From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
> Your ~2.4% number is similar to what was reported for Linux with 4GB
> shared_buffers:
> 
> https://www.postgresql.org/message-id/20130913234125.GC13697%40roobarb
> .crazydogs.org

I'm relieved to know that a similar figure was gained on Linux.  Thanks for the 
info.


> Later in that thread there was a report of a dramatic ~15% increase in "best
> result" TPS, but that was with 60GB of shared_buffers on a machine with
> 256GB of RAM:
> 
> https://www.postgresql.org/message-id/20131024060313.GA21888%40toroid.
> org

From: Andres Freund [mailto:and...@anarazel.de]
> FWIW, I've seen 2-3x increases with ~60GB of s_b.

Wow, nice figures.  It's unfortunate that I don't have such a big machine 
available at hand.

Regards
Takayuki Tsunakawa





Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-10 Thread Tomas Vondra

On 10/11/2016 04:07 AM, Andres Freund wrote:

On 2016-10-10 17:46:22 -0700, Andres Freund wrote:

TPC-DS (tpcds.ods)
--

In this case, I'd say the results are less convincing. There are quite a few
queries that got slower by ~10%, which is well above - for example queries
22 and 67. There are of course queries that got ~10% faster, and in total
the patched version executed more queries (so overall the result is slightly
positive, but not significantly).


That's interesting. I wonder whether that's plan changes just due to the
changing memory estimates, or what's causing that. I'll look into it.


Hm. Based on an initial look those queries aren't planned with any of
the affected codepaths.  Could this primarily be a question of
randomness? Would it perhaps make sense to run the tests in a comparable
order? Looking at tpcds.py and the output files, it seems that the query
order differs between the runs; that can easily explain a bigger
difference than the above. For me a scale=1 run creates a database of
approximately 4.5GB, thus with shared_buffers=1GB execution order is
likely to have a significant performance impact.



Yes, I see similar plans (no bitmap index scans or hash aggregates). But 
the difference is there, even when running the query alone (so it's not 
merely due to the randomized ordering).


I wonder whether this is again due to the compiler moving stuff around.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-10 Thread Andres Freund
On 2016-10-10 17:46:22 -0700, Andres Freund wrote:
> > TPC-DS (tpcds.ods)
> > --
> > 
> > In this case, I'd say the results are less convincing. There are quite a few
> > queries that got slower by ~10%, which is well above - for example queries
> > 22 and 67. There are of course queries that got ~10% faster, and in total
> > the patched version executed more queries (so overall the result is slightly
> > positive, but not significantly).
> 
> That's interesting. I wonder whether that's plan changes just due to the
> changing memory estimates, or what's causing that. I'll look into it.

Hm. Based on an initial look those queries aren't planned with any of
the affected codepaths.  Could this primarily be a question of
randomness? Would it perhaps make sense to run the tests in a comparable
order? Looking at tpcds.py and the output files, it seems that the query
order differs between the runs; that can easily explain a bigger
difference than the above. For me a scale=1 run creates a database of
approximately 4.5GB, thus with shared_buffers=1GB execution order is
likely to have a significant performance impact.

Greetings,

Andres Freund




Re: [HACKERS] Forbid use of LF and CR characters in database and role names

2016-10-10 Thread Noah Misch
On Sun, Oct 02, 2016 at 10:47:04PM +0900, Michael Paquier wrote:
> On Mon, Sep 12, 2016 at 11:38 AM, Michael Paquier
>  wrote:
> > On Mon, Sep 12, 2016 at 10:01 AM, Noah Misch  wrote:
> >> I discourage documenting LF/CR restrictions.  For the epsilon of readers 
> >> who
> >> would benefit from this knowledge, the error message suffices.  For 
> >> everyone
> >> else, it would just dilute the text.  (One could argue against other parts 
> >> of
> >> our documentation on this basis, but I'm not calling for such a study.  I'm
> >> just saying that today's lack of documentation on this topic is optimal.)
> >
> > Okay.
> >
> >>> > > I think the way forward here, if any, is to work on removing these
> >>> > > restrictions, not to keep sprinkling them around.
> >>> >
> >>> > If we were talking about pathnames containing spaces, I would agree,
> >>> > but I've never heard of a legitimate pathname containing CR or LF.  I
> >>> > can't see us losing much by refusing to allow such pathnames, except
> >>> > for security holes.
> >>
> >> (In other words, +1 to that.)
> >
> > Yep. To be honest, I see value in directly preventing the use of
> > newlines and carriage returns in database and role names, to avoid
> > users being bitten by custom scripts - for example, things written in
> > bash that manually scan pg_database, pg_roles, etc. And I have
> > seen such things over the years. Now it is true that the safeguards in
> > core are proving to be enough if only the in-core tools are used, but
> > that's not necessarily the case with all the things gravitating around
> > this community.
> 
> And seeing nothing happening here, I still don't know what to do with
> this patch. Thoughts?

I count one person disfavoring the patch concept of rejecting these characters
early, and I count two people, plus yourself as author, favoring it.
Therefore, the patch can move forward with the proposed design.  The next
step, then, is for the patch to park in a CommitFest until someone volunteers
to review the implementation details.

nm




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-10 Thread Andres Freund
Hi,

On 2016-10-11 02:38:26 +0200, Tomas Vondra wrote:
> Yes, I've done a bunch of TPC-H and TPC-DS tests on the patch, but it took
> quite a bit of time. These tests were done on a fairly small machine (the
> usual i5-2500k with 8GB of RAM), with only 1GB data sets for both
> benchmarks, to keep it completely in memory as I presume once we start
> hitting I/O, it becomes the dominant part.

Thanks!


> TPC-DS (tpcds.ods)
> --
> 
> In this case, I'd say the results are less convincing. There are quite a few
> queries that got slower by ~10%, which is well above - for example queries
> 22 and 67. There are of course queries that got ~10% faster, and in total
> the patched version executed more queries (so overall the result is slightly
> positive, but not significantly).

That's interesting. I wonder whether that's plan changes just due to the
changing memory estimates, or what's causing that. I'll look into it.

Greetings,

Andres Freund




[HACKERS] another typo in parallel.sgml

2016-10-10 Thread Tatsuo Ishii
I think I found a typo in parallel.sgml.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
index c80d42d..1e71529 100644
--- a/doc/src/sgml/parallel.sgml
+++ b/doc/src/sgml/parallel.sgml
@@ -241,7 +241,7 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
 than normal but would produce incorrect results.  Instead, the parallel
 portion of the plan must be what is known internally to the query
 optimizer as a partial plan; that is, it must constructed
-so that each process will which executes the plan will generate only a
+so that each process which executes the plan will generate only a
 subset of the output rows in such a way that each required output row
 is guaranteed to be generated by exactly one of the cooperating processes.
   



[HACKERS] FTS Configuration option

2016-10-10 Thread Artur Zakirov
Hello hackers,

Sometimes there is a need to collect lexemes from various dictionaries -
for example, if we have a column with text in various languages.

Let's say there is a new option JOIN. This option will allow the parser
to append lexemes from the current dictionary and go on to the next
dictionary to get more lexemes:

=> CREATE TEXT SEARCH CONFIGURATION multi_conf (COPY=simple);
=> ALTER TEXT SEARCH CONFIGURATION multi_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH german_ispell (JOIN), english_ispell, simple;

Here there are the following cases:
- lexeme found in german_ispell, not found in english_ispell:
return the lexeme from german_ispell.
- lexeme not found in german_ispell, found in english_ispell:
return the lexeme from english_ispell.
- lexeme found in neither dictionary: return the lexeme from simple.
- lexeme found in both dictionaries: return the lexemes from both.

Could such an option be useful to the community? The name of the option is
debatable.

Thank you!

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




[HACKERS] How to inspect tuples during execution of a plan?

2016-10-10 Thread Ernst-Georg Schmid
Hello all,

I'd like to inspect the content of tuples as they are sent during the
execution of a query in order to react to their values.

I guess I could do it with an FDW, but that's a bit clumsy, so I took a
look at the hooks instead; I have two questions:

1.) Would ExecutorRun_hook be the correct place to implement such a
'tuple inspector'?
2.) If yes, how? As far as I understand the source code, I would have
to provide my own implementation based on standard_ExecutorRun and
replace the ExecutePlan function with my own one that takes a look at
each 'slot' like so

if (estate->es_junkFilter != NULL)
    slot = ExecFilterJunk(estate->es_junkFilter, slot);

/* <-- tuple inspection here --> */

/*
 * If we are supposed to send the tuple somewhere, do so. (In
 * practice, this is probably always the case at this point.)
 */
if (sendTuples)
{

If there is a better way, please advise. I'm really a newbie to this.
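
For reference, the hook plumbing I have in mind looks roughly like this (only
a sketch; the per-tuple inspection itself would still need a private copy of
ExecutePlan, since that function is static in execMain.c):

#include "postgres.h"
#include "fmgr.h"
#include "executor/executor.h"

PG_MODULE_MAGIC;

static ExecutorRun_hook_type prev_ExecutorRun = NULL;

/* wrapper; would eventually call a patched copy of ExecutePlan */
static void
inspect_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, uint64 count)
{
	if (prev_ExecutorRun)
		prev_ExecutorRun(queryDesc, direction, count);
	else
		standard_ExecutorRun(queryDesc, direction, count);
}

void
_PG_init(void)
{
	prev_ExecutorRun = ExecutorRun_hook;
	ExecutorRun_hook = inspect_ExecutorRun;
}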

Best regards,

Ernst-Georg Schmid




[HACKERS] Remove vacuum_defer_cleanup_age

2016-10-10 Thread Josh Berkus
Folks,

Given that hot_standby_feedback is pretty bulletproof now, and given all of
the work on reducing replay conflicts, I think the utility of
vacuum_defer_cleanup_age is at an end.  I really meant to submit a patch
to remove it in 9.6, but it got away from me.

Any objections to removing the option in 10?

-- 
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)




Re: [HACKERS] Merge Join with an Index Optimization

2016-10-10 Thread Michael Malis
I discovered that the kind of join I proposed is called the leapfrog
triejoin: https://arxiv.org/pdf/1210.0481v5.pdf


[HACKERS] parallel.sgml

2016-10-10 Thread Tatsuo Ishii
While reading parallel.sgml, I came across the following sentence.

If this occurs, the leader will execute the portion of the plan
between below the Gather node entirely by itself,
almost as if the Gather node were not present.

Maybe "the portion of the plan between below" shoud have been
"the portion of the plan below"?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] FSM corruption leading to errors

2016-10-10 Thread Michael Paquier
On Mon, Oct 10, 2016 at 11:41 PM, Pavan Deolasee
 wrote:
>
>
> On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier 
> wrote:
>>
>>
>>
>> +   /*
>> +    * See comments in GetPageWithFreeSpace about handling outside the
>> +    * valid range blocks
>> +    */
>> +   nblocks = RelationGetNumberOfBlocks(rel);
>> +   while (target_block >= nblocks && target_block != InvalidBlockNumber)
>> +   {
>> +       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
>> +                                                    spaceNeeded);
>> +   }
>> Hm. This is just a workaround. Even if things are done this way the
>> FSM will remain corrupted.
>
>
> No, because the code above updates the FSM of those out-of-range blocks.
> But now that I look at it again, maybe this is not correct and it may get
> into an endless loop if the relation is repeatedly extended concurrently.

Ah yes, that's what the call for
RecordAndGetPageWithFreeSpace()/fsm_set_and_search() is for. I missed
that yesterday before sleeping.

>> And isn't that going to break once the
>> relation is extended again?
>
>
> Once the underlying bug is fixed, I don't see why it should break again. I
> added the above code mostly to deal with already-corrupt FSMs. Maybe we can
> just document this and leave it to the user to run some correctness checks (see
> below), especially given that the code is not cheap and adds overhead for
> everybody, irrespective of whether they have or will ever have a corrupt FSM.

Yep. I'd leave it for the release notes to hold a diagnostic method.
That's annoying, but this has been done in the past like for the
multixact issues..

>> I'd suggest instead putting in the release
>> notes a query that allows one to determine which relations are broken
>> and have them fixed directly. That's annoying, but it would be really
>> better than a workaround. One idea here is to use pg_freespace() and
>> see if it returns a non-zero value for an out-of-range block on a
>> standby.
>>
>
> Right, that's how I tested for broken FSMs. A challenge with any such query
> is that if the shared buffer copy of the FSM page is intact, then the query
> won't return problematic FSMs. Of course, if the fix is applied to the
> standby and it is restarted, then corrupt FSMs can be detected.

What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)
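
Something like this, perhaps (untested sketch; it assumes the pg_freespacemap
extension is installed, and flags relations whose FSM reports free space for
the first block past the end of the relation):

SELECT oid::regclass AS suspect_relation
FROM pg_class
WHERE relkind IN ('r', 'i', 't', 'm')
  AND pg_freespace(oid::regclass,
                   pg_relation_size(oid::regclass) /
                   current_setting('block_size')::bigint) > 0;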
-- 
Michael




Re: [HACKERS] PL/Python adding support for multi-dimensional arrays

2016-10-10 Thread Dave Cramer
On 10 October 2016 at 13:42, Pavel Stehule  wrote:

>
>
> 2016-10-10 12:31 GMT+02:00 Heikki Linnakangas :
>
>> On 10/01/2016 02:45 AM, Jim Nasby wrote:
>>
>>> On 9/29/16 1:51 PM, Heikki Linnakangas wrote:
>>>
 Now, back to multi-dimensional arrays. I can see that the Sequence
 representation is problematic, with arrays, because if you have a python
 list of lists, like [[1, 2]], it's not immediately clear if that's a
 one-dimensional array of tuples, or two-dimensional array of integers.
 Then again, we do have the type definitions available. So is it really
 ambiguous?

>>>
>>> [[1,2]] is a list of lists...
>>> In [4]: b=[[1,2]]
>>>
>>> In [5]: type(b)
>>> Out[5]: list
>>>
>>> In [6]: type(b[0])
>>> Out[6]: list
>>>
>>> If you want a list of tuples...
>>> In [7]: c=[(1,2)]
>>>
>>> In [8]: type(c)
>>> Out[8]: list
>>>
>>> In [9]: type(c[0])
>>> Out[9]: tuple
>>>
>>
>> Hmm, so we would start to treat lists and tuples differently? A Python
>> list would be converted into an array, and a Python tuple would be
>> converted into a composite type. That does make a lot of sense. The only
>> problem is that it's not backwards-compatible. A PL/Python function that
>> returns an SQL array of rows by returning a Python list of lists would
>> start failing.
>>
>
> Is it not possible to make the decision at the last moment - at the
> PL/Postgres interface? There the expected type should be known.
>
> Regards
>
> Pavel
>
>
>>
>> I think we should bite the bullet and do that anyway. As long as it's
>> clearly documented, and the error message you get contains a clear hint on
>> how to fix it, I don't think it would be too painful to adjust existing
>> applications.
>>
>> We could continue to accept a Python list for a plain composite type,
>> this would only affect arrays of composite types.
>>
>> I don't use PL/python much myself, so I don't feel qualified to make the
>> call, though. Any 3rd opinions?
>
>
Can't you determine the correct output based on the function output
definition ?

For instance if the function output was an array type then we would return
the list as an array
if the function output was a set of then we return tuples ?


Dave Cramer

da...@postgresintl.com
www.postgresintl.com


Re: [HACKERS] De-support SCO OpenServer/SCO UnixWare?

2016-10-10 Thread Thomas Munro
On Tue, Oct 11, 2016 at 11:03 AM, Tom Lane  wrote:
> Currently, we have the following OS-specific configure templates in
> src/template:
>
> aix
> cygwin
> darwin
> freebsd
> hpux
> linux
> netbsd
> openbsd
> sco
> solaris
> unixware
> win32
>
> There is at least one active buildfarm critter testing each of these
> except "sco" and "unixware".  I can't find any evidence in the buildfarm
> database that we've *ever* had a SCO OpenServer buildfarm member.  There
> were a couple of UnixWare members, but the last one was last heard from
> on 2012-10-30 (around when 9.2 was released).  Mentions of these platforms
> in the commit logs thin out to about nothing after 2008.
>
> I think we should stop supporting those two platforms; to wit, we should
> remove those template files and remove all supporting code and
> documentation references to those platforms.  We can always put that stuff
> back if someone shows up who's got enough interest to run a buildfarm
> member.  But right now I'd say our support for those platforms is likely
> to be hopelessly bit-rotted anyway, and it's more than likely that no one
> cares.

It looks like the official upgrade path for both is to the new
OpenServer 10 from SCO's acquirers:

http://xinuos.com/xinuos/news/247-news-xinuos-channel-buzz-openserver-10
http://www.xinuos.com/menu-products/openserver-10

It seems they are using virtualisation to help people migrate from the
ancient System V-based releases, but OpenServer 10 is a FreeBSD
derivative and has "support for the FreeBSD® Ports and Packages
Collection".  So we already support that without doing anything at
all.

The System V-based SCO operating systems are museum pieces and it
wouldn't make any sense for anyone to think about running PostgreSQL
10 on them.  +1 for dropping support.

-- 
Thomas Munro
http://www.enterprisedb.com




[HACKERS] De-support SCO OpenServer/SCO UnixWare?

2016-10-10 Thread Tom Lane
Currently, we have the following OS-specific configure templates in
src/template:

aix
cygwin
darwin
freebsd
hpux
linux
netbsd
openbsd
sco
solaris
unixware
win32

There is at least one active buildfarm critter testing each of these
except "sco" and "unixware".  I can't find any evidence in the buildfarm
database that we've *ever* had a SCO OpenServer buildfarm member.  There
were a couple of UnixWare members, but the last one was last heard from
on 2012-10-30 (around when 9.2 was released).  Mentions of these platforms
in the commit logs thin out to about nothing after 2008.

I think we should stop supporting those two platforms; to wit, we should
remove those template files and remove all supporting code and
documentation references to those platforms.  We can always put that stuff
back if someone shows up who's got enough interest to run a buildfarm
member.  But right now I'd say our support for those platforms is likely
to be hopelessly bit-rotted anyway, and it's more than likely that no one
cares.

regards, tom lane




Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)

2016-10-10 Thread Peter Geoghegan
On Mon, Oct 10, 2016 at 5:21 AM, Heikki Linnakangas  wrote:
> Admittedly that's confusing. Thinking about this some more, I came up with
> the attached. I removed the separate LogicalTapeAssignReadBufferSize() call
> altogether - the read buffer size is now passed as argument to the
> LogicalTapeRewindForRead() call.
>
> I split LogicalTapeRewind() into two: LogicalTapeRewindForRead() and
> LogicalTapeRewindForWrite(). A boolean argument like 'forWrite' is mentally
> challenging, because when reading a call site, you have to remember which
> way round the boolean is. And now you also pass the extra buffer_size
> argument when rewinding for reading, but not when writing.

I like the general idea here.

> I gave up on the logic to calculate the quotient and reminder of the
> available memory, and assigning one extra buffer to some of the tapes. I'm
> now just rounding it down to the nearest BLCKSZ boundary. That simplifies
> the code further, although we're now not using all the memory we could. I'm
> pretty sure that's OK, although I haven't done any performance testing of
> this.

The only thing I notice is that this new struct field could use a comment:

> --- a/src/backend/utils/sort/tuplesort.c
> +++ b/src/backend/utils/sort/tuplesort.c
> @@ -366,6 +366,8 @@ struct Tuplesortstate
> char   *slabMemoryEnd;  /* end of slab memory arena */
> SlabSlot   *slabFreeHead;   /* head of free list */
>
> +   size_t  read_buffer_size;
> +

Also, something about trace_sort here:

> +#ifdef TRACE_SORT
> +   if (trace_sort)
> +   elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among 
> %d input tapes",
> +(state->availMem) / 1024, numInputTapes);
> +#endif
> +
> +   state->read_buffer_size = state->availMem / numInputTapes;
> +   USEMEM(state, state->availMem);

Maybe you should just make the trace_sort output talk about blocks at
this point?

-- 
Peter Geoghegan




Re: [HACKERS] Switch to unnamed POSIX semaphores as our preferred sema code?

2016-10-10 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Oct 11, 2016 at 5:57 AM, Tom Lane  wrote:
>> (I'm a little suspicious that older variants of FreeBSD might not
>> have working sem_init, like the other *BSD variants, necessitating
>> a run-time test there.  But we'll cross that bridge when we come
>> to it.)

> The sem_init man page from FreeBSD 8.4[1] (EOL August 2015) and earlier said:
>  This implementation does not support shared semaphores, and reports this
>  fact by setting errno to EPERM.
> FreeBSD 9.0 (released January 2012) reimplemented semaphores and
> removed those words from that man page[2].

Yeah, in subsequent googling I found other mentions of this having been
added in FreeBSD 9.0.  But that will be more than 5 years old by the
time PG 10 gets out.

> All current releases[3] support it, though I guess there may be 8.4
> machines out there a year and a bit after EOL.

We don't have anything older than 9.0 in the buildfarm, which I take
to indicate that nobody particularly cares about older versions anymore.
I would just as soon not add a run-time test in configure (it breaks
cross-compiles), so I'd rather wait and see if anyone complains.

regards, tom lane




Re: [HACKERS] Supporting huge pages on Windows

2016-10-10 Thread Andres Freund
On 2016-10-11 09:57:48 +1300, Thomas Munro wrote:
> Later in that thread there was a report of a dramatic ~15% increase in
> "best result" TPS, but that was with 60GB of shared_buffers on a
> machine with 256GB of RAM:
> 
> https://www.postgresql.org/message-id/20131024060313.GA21888%40toroid.org

FWIW, I've seen 2-3x increases with ~60GB of s_b.

Andres




Re: [HACKERS] Supporting huge pages on Windows

2016-10-10 Thread Thomas Munro
On Wed, Sep 28, 2016 at 7:32 PM, Tsunakawa, Takayuki
 wrote:
> From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
>> >  huge_pages=off: 70412 tps
>> >  huge_pages=on : 72100 tps
>>
>> Hmm.  I guess it could be noise or random code rearrangement effects.
>
> I'm not sure the difference was just random noise, because running multiple
> sets of three pgbench runs (huge_pages = on, off, on, off, on...) produced
> similar results.  But I expected a bit greater improvement, say, +10%.  There
> may be a better benchmark model where large pages stand out, but I think
> pgbench is not so bad because its random data access would cause TLB cache misses.

Your ~2.4% number is similar to what was reported for Linux with 4GB
shared_buffers:

https://www.postgresql.org/message-id/20130913234125.GC13697%40roobarb.crazydogs.org

Later in that thread there was a report of a dramatic ~15% increase in
"best result" TPS, but that was with 60GB of shared_buffers on a
machine with 256GB of RAM:

https://www.postgresql.org/message-id/20131024060313.GA21888%40toroid.org

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] pg_dump getBlobs query broken for 7.3 servers

2016-10-10 Thread Greg Stark
On Mon, Oct 10, 2016 at 3:36 AM, Jim Nasby  wrote:
> FWIW, Greg Stark did a talk at PG Open about PG performance going back to at
> least 7.4. He did discuss what he had to do to get those versions to compile
> on modern tools, and has a set of patches that enable it. Unfortunately his
> slides aren't posted[1] so I can't provide further details than that.

The code is here:

https://github.com/gsstark/retropg

The build script is called "makeall" and it applies patches from the
"old-postgres-fixes" directory, though some of the smarts are in the
script, because it knows how to run older versions of the
configure script and it tries to fix up the ecpg parser duplicate
tokens separately. It saves a diff after applying the patches and
other fixups into the "net-diffs" directory, but I've never checked whether
those diffs would work cleanly on their own.


-- 
greg




Re: [HACKERS] Logical tape pause/resume

2016-10-10 Thread Peter Geoghegan
On Sun, Oct 9, 2016 at 11:52 PM, Heikki Linnakangas  wrote:
> Regardless of the number of tapes, the memory used for the tape buffers,
> while building the initial runs, is wasted. It's not entirely wasted when
> you have multiple merge passes, because without the buffer you need to read
> the last partial page back into memory, when you start to output the next
> run on it. But even in that case, I believe that memory would be better used
> for the quicksort, to create larger runs.

Okay. Somehow, I was confused on that point.

> Yeah, there might be value in having a cap, because operating the merge heap
> becomes more expensive when it becomes larger. Mainly because of cache
> misses. We should perhaps have a limit on the size of the merge heap, and
> use the same limit for the replacement selection. Although, the heap behaves
> slightly differently during replacement selection: During replacement
> selection, new values added to the heap tend to go to the bottom of the
> heap, but during merge, new values tend to go close to the top. The latter
> pattern incurs fewer cache misses.

I don't think it makes sense to bring replacement selection into it.
I'd vote for completely killing replacement selection at this point.
The merge heap work committed a few weeks back really weakened the
case for it, a case that the 9.6 work left hanging by a thread to
begin with. In general, having a replacement selection heap that is
larger can only be helpful when that means that we have enough
"juggling capacity" to reorder things into one (or relatively few)
long runs. For RS, the optimal size of the heap is good enough to do
any juggling that is possible to make runs as long as possible, but no
larger. The default replacement_sort_tuples setting (150,000) seems
very high to me, considered from that point of view, actually.

Multiple passes just don't seem to be that bad on modern hardware, in
the cases where they still happen. You don't see a big drop when
tuning down work_mem to just below the threshold at which the sort can
complete in one pass. And, multi-pass sorts still won't happen very
often in practice. Now, this work of yours risks slowing down multiple
pass sorts when that does happen, but I think it could still be worth
it.

> But that's orthogonal to the wasting-memory aspect of this. Even if a we
> have a cap of 100, it's good to not spend memory for the tape buffers of
> those 100 tapes, when building the initial runs.

You're right about that.

>> Okay. But, you haven't actually addressed the problem with non-trivial
>> amounts of memory being logically allocated ahead of time for no good
>> reason -- you've only address the constant factor (the overhead
>> per-tape).
>
>
> I thought I did. Can you elaborate?
>
> Are you referring to the fact that the array of LogicalTapes is still
> allocated ahead of time, with size maxTapes? True, it'd be nice to allocate
> the LogicalTape structs only when needed. But that's peanuts, compared to
> the tape buffers.

Is that all that is left to remove, in terms of wasteful USEMEM()
overhead? Then, yeah, I guess I am just talking about "peanuts", plus
the more significant issue of merge heap CPU cache efficiency.

>> I can definitely see value in refactoring, to make that code less
>> complicated -- I just don't think it's justified by the amount of
>> memory that is wasted on tapes.
>
>
> Ok, good. I think we're in agreement on doing this, then, even if we don't
> agree on the justification :-).

Agreed. :-)

> Yeah. I'm not sure how partitioning and all that would be done here. But I'm
> pretty sure this simpler logtape.c code is easier to work with, for
> partitioning too.

That's my intuition, too. Obviously my thoughts on partitioning are
still very hand-wavy, but I'm pretty sure that partitioning is the
future for the parallelization of sorting in the executor. Pushing
*everything* down is more important than actually having the sort be
as fast as possible there.

-- 
Peter Geoghegan




Re: [HACKERS] Switch to unnamed POSIX semaphores as our preferred sema code?

2016-10-10 Thread Thomas Munro
On Tue, Oct 11, 2016 at 5:57 AM, Tom Lane  wrote:
> So at this point it seems likely that stopping with Linux and FreeBSD
> is the thing to do, and as far as I can tell the code we have now is
> working with all variants of those that we have in the buildfarm.
> (I'm a little suspicious that older variants of FreeBSD might not
> have working sem_init, like the other *BSD variants, necessitating
> a run-time test there.  But we'll cross that bridge when we come
> to it.)

The sem_init man page from FreeBSD 8.4[1] (EOL August 2015) and earlier said:

 This implementation does not support shared semaphores, and reports this
 fact by setting errno to EPERM.

FreeBSD 9.0 (released January 2012) reimplemented semaphores and
removed those words from that man page[2].  All current releases[3]
support it, though I guess there may be 8.4 machines out there a year
and a bit after EOL.

[1] 
https://www.freebsd.org/cgi/man.cgi?query=sem_init&apropos=0&sektion=0&manpath=FreeBSD+8.4-RELEASE&arch=default&format=html
[2] 
https://www.freebsd.org/cgi/man.cgi?query=sem_init&apropos=0&sektion=0&manpath=FreeBSD+9.0-RELEASE&arch=default&format=html
[3] https://www.freebsd.org/releases/

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] pgbench vs. wait events

2016-10-10 Thread Robert Haas
On Mon, Oct 10, 2016 at 9:44 AM, Bruce Momjian  wrote:
> On Thu, Oct  6, 2016 at 02:38:56PM -0400, Robert Haas wrote:
>> I decided to do some testing on hydra (IBM-provided community
>> resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
>> the newly-enhanced wait event stuff to try to get an idea of what
>> we're waiting for during pgbench.  I did 30-minute pgbench runs with
>> various configurations, but all had max_connections = 200,
>> shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
>> off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
>> log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
>> on.  During each run, I ran this psql script in another window and
>> captured the output:
>
> This is a great study that shows how the new instrumentation has given
> us a new window into performance.  I am frankly surprised we got as far
> as we did in finding performance bottlenecks before we had this
> instrumentation.

Thanks, and +1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] proposal: psql \setfileref

2016-10-10 Thread Pavel Stehule
2016-10-10 15:17 GMT+02:00 Gilles Darold :

> Le 10/10/2016 à 14:38, Daniel Verite a écrit :
> >   Gilles Darold wrote:
> >
> >>postgres=# \setfileref b /dev/random
> >>postgres=# insert into test (:b);
> >>
> >> Process need to be killed using SIGTERM.
> > This can be fixed by setting sigint_interrupt_enabled to true
> > before operating on the file. Then ctrl-C would be able to cancel
> > the command.
>
> If we do not prevent the user to be able to load special files that
> would be useful yes.
>

I don't think special files make much sense here; I just had not had time to fix
this issue. I will do it soon, I hope.

Regards

Pavel

>
> --
> Gilles Darold
> Consultant PostgreSQL
> http://dalibo.com - http://dalibo.org
>
>


Re: [HACKERS] PL/Python adding support for multi-dimensional arrays

2016-10-10 Thread Pavel Stehule
2016-10-10 12:31 GMT+02:00 Heikki Linnakangas :

> On 10/01/2016 02:45 AM, Jim Nasby wrote:
>
>> On 9/29/16 1:51 PM, Heikki Linnakangas wrote:
>>
>>> Now, back to multi-dimensional arrays. I can see that the Sequence
>>> representation is problematic, with arrays, because if you have a python
>>> list of lists, like [[1, 2]], it's not immediately clear if that's a
>>> one-dimensional array of tuples, or two-dimensional array of integers.
>>> Then again, we do have the type definitions available. So is it really
>>> ambiguous?
>>>
>>
>> [[1,2]] is a list of lists...
>> In [4]: b=[[1,2]]
>>
>> In [5]: type(b)
>> Out[5]: list
>>
>> In [6]: type(b[0])
>> Out[6]: list
>>
>> If you want a list of tuples...
>> In [7]: c=[(1,2)]
>>
>> In [8]: type(c)
>> Out[8]: list
>>
>> In [9]: type(c[0])
>> Out[9]: tuple
>>
>
> Hmm, so we would start to treat lists and tuples differently? A Python
> list would be converted into an array, and a Python tuple would be
> converted into a composite type. That does make a lot of sense. The only
> problem is that it's not backwards-compatible. A PL/python function that
> returns an SQL array of rows, and does that by returning Python list of
> lists, it would start failing.
>

Is it not possible to make the decision at the last moment - at the PL/Postgres
interface? There the expected type should be known.

Regards

Pavel


>
> I think we should bite the bullet and do that anyway. As long as it's
> clearly documented, and the error message you get contains a clear hint on
> how to fix it, I don't think it would be too painful to adjust existing
> application.
>
> We could continue to accept a Python list for a plain composite type, this
> would only affect arrays of composite types.
>
> I don't use PL/python much myself, so I don't feel qualified to make the
> call, though. Any 3rd opinions?
>
> - Heikki
>
>


Re: [HACKERS] Switch to unnamed POSIX semaphores as our preferred sema code?

2016-10-10 Thread Tom Lane
Christoph Berg  writes:
> Another data point that's admittedly much more of a footnote than
> serious input to the original question is the following: Debian has a
> (so far mostly toy) port "hurd-i386" which is using the GNU hurd
> kernel along with the usual GNU userland that's also in use on Linux.

> This OS doesn't implement any semaphores yet (PG compiles, but initdb
> dies with ENOSYS immediately). On talking to the porters, they advised
> that POSIX semaphores would have the best chances to get implemented
> first, so I added USE_UNNAMED_POSIX_SEMAPHORES=1 to the architecture
> template to be prepared for that.

As of HEAD, that should happen automatically for anything using the
"linux" template.


I did some googling (but no actual testing) to try to find out the state
of POSIX sema support for the other platform templates:

aix

AIX doesn't seem to have support (reportedly, the functions exist but
always fail).

cygwin

Not clear whether unnamed semas work on this; I found conflicting reports.

darwin

Unnamed semas are known not to work here.

hpux

Reportedly, unnamed POSIX sema support exists on HPUX 11.x, but on 10.x
sem_init fails with ENOSYS.  We'd need a run-time test in configure to
see whether to use it.  Doubt it's worth the trouble.

netbsd

No support for cross-process unnamed semas.

openbsd

No support for cross-process unnamed semas.

sco

Doubt anyone cares.

solaris

Apparently supported in newer versions of Solaris; as with HPUX,
we might need a run-time configure probe to tell.  Again, without
specific evidence that it might be worth switching, I doubt it's
worth taking any trouble over.

unixware

Doubt anyone cares.

win32

No support.


So at this point it seems likely that stopping with Linux and FreeBSD
is the thing to do, and as far as I can tell the code we have now is
working with all variants of those that we have in the buildfarm.
(I'm a little suspicious that older variants of FreeBSD might not
have working sem_init, like the other *BSD variants, necessitating
a run-time test there.  But we'll cross that bridge when we come
to it.)

So, barring further input, this project is done.  I'll go update
the user docs to explain the new state of affairs.

regards, tom lane




Re: [HACKERS] Hash Indexes

2016-10-10 Thread Jeff Janes
On Mon, Oct 10, 2016 at 5:55 AM, Amit Kapila 
wrote:

> On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila 
> wrote:
> > On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas 
> wrote:
> >
> >> Another thing I don't quite understand about this algorithm is that in
> >> order to conditionally lock the target primary bucket page, we'd first
> >> need to read and pin it.  And that doesn't seem like a good thing to
> >> do while we're holding a shared content lock on the metapage, because
> >> of the principle that we don't want to hold content locks across I/O.
> >>
> >
>
> Aren't we already doing this during BufferAlloc() when the buffer
> selected by StrategyGetBuffer() is dirty?
>

Right, you probably shouldn't allocate another buffer while holding a
content lock on a different one, if you can help it. But, BufferAlloc
doesn't do that internally, does it?  It is only a problem if you make it
be one by the way you use it.  Am I missing something?


>
> > I think we can release metapage content lock before reading the buffer.
> >
>
> On thinking about this again, if we release the metapage content lock
> before reading and pinning the primary bucket page, then we need to
> take it again to verify if the split has happened during the time we
> don't have a lock on a metapage.  Releasing and again taking content
> lock on metapage is not
> good from the performance aspect.  Do you have some other idea for this?
>

Doesn't the relcache patch effectively deal with this?  If this is a
sticking point, maybe the relcache patch could be incorporated into this
one.

Cheers,

Jeff


Re: [HACKERS] PostgreSQL - Weak DH group

2016-10-10 Thread Heikki Linnakangas

On 10/06/2016 10:26 PM, Christoph Berg wrote:

Re: Heikki Linnakangas 2016-10-06 

I propose the attached patch. It gives up on trying to deal with multiple
key lengths (as noted earlier, OpenSSL just always passed keylength=1024, so
that was useless). Instead of using the callback, it just sets fixed DH
parameters with SSL_CTX_set_tmp_dh(), like we do for the ECDH curve. The DH
parameters are loaded from a file called "dh_params.pem" (instead of
"dh1024.pem"), if present, otherwise the built-in 2048 bit parameters are
used.


Shouldn't this be a GUC pointing to a configurable location like
ssl_cert_file? This way, people reading the security section of the
default postgresql.conf would notice that there's something (new) to
configure. (And I wouldn't want to start creating symlinks from PGDATA
to /etc/ssl/something again...)


Perhaps. The DH parameters are not quite like other configuration files, 
though. The actual parameters used don't matter, you just want to 
generate different parameters for every cluster. The key length of the 
parameters can be considered configuration, though.


- Heikki





Re: [HACKERS] vacuumdb -f and -j options (was Question / requests.)

2016-10-10 Thread Francisco Olarte
On Mon, Oct 10, 2016 at 3:04 PM, Amit Kapila  wrote:
> On Sun, Oct 9, 2016 at 10:59 PM, Francisco Olarte
>  wrote:
>> For me -f & -j is not perfect, but better than not having it. It can
>> deadlock when given certain sets of catalog tables, either by making
>> it go for the full db or by a perverse set of -t options. But any DBA
>> needing them together should, IMO, have resources to write ( or have
>> someone else write for him ) a 20-liner wrapping and feeding them via
>> -t. After all, not every tool/option is for everyone, and everything
>> has it prerequisites.
> Okay, but I think that doesn't mean it should deadlock when used by
> somewhat naive user.  I am not sure every user who wants to use -f and
> -j is smart enough to write a script as you are suggesting.  I think
> if more people see your proposal as meaningful and want to leave
> current usage of -f and -j as it is, then probably, we should issue a
> warning indicating such a risk.

I agree. It should NOT deadlock, but sadly it does. And disallowing it
feels wrong. I'm all in for emitting a warning whenever it is used, and
even for disallowing it when no table list is given, but I was trying to
avoid the suggestion of just disallowing it always because it may
deadlock ( even with a table list it may, as nothing forbids you from
entering a locky set of tables ). And some people wrote what I
interpreted as 'throw it out if it is not perfect, or put logic in to
make a full parallel run work partially in series so it does not
deadlock, or forbid it altogether' ( which is IMHO not easy, and I would
dislike it: if I order full parallel, vacuumdb does it, and if it
deadlocks it is because I wanted it to; if someone wants that, a new
'--auto-serial-as-needed' switch could be added ).

Francisco Olarte.




Re: [HACKERS] cygwin64 assertion failure

2016-10-10 Thread Tom Lane
Amit Kapila  writes:
> On Sun, Oct 9, 2016 at 10:23 PM, Andrew Dunstan  wrote:
>> lorikeet seems to be stuck running the parallel tests, after having failed
>> an assertion.

> It looks like this problem has been discussed before [1].  Shall we do
> what Tom has suggested there?
> [1] - https://www.postgresql.org/message-id/15344.1473974558%40sss.pgh.pa.us

Well, we now have *two* issues.  One is why is the Assert failing.
The other is why is the system failing to clean up afterwards --- you'd
expect that an Assert in a parallel worker would not be a case that the
developers never saw ;-).  I'd suggest fixing the latter first.

regards, tom lane




Re: [HACKERS] autonomous transactions

2016-10-10 Thread Merlin Moncure
On Thu, Oct 6, 2016 at 3:53 PM, Simon Riggs  wrote:
> On 6 October 2016 at 21:27, Robert Haas  wrote:
>> I think we should implement background transactions and call them
>> background transactions.  That allows us to expose additional
>> functionality which is useful, like the ability to kick something off
>> and check back later for the results.  There's no reason to call it
>> background transactions and also call it autonomous transactions: one
>> feature doesn't need two names.
>
> I'm happy to also invoke it via an alternate mechanism or API, so that
> it can continue to be used even if the above mechanism changes.
>
> We have no need to wait for the perfect solution, even assuming we
> would ever agree that just one exists.

-1 on implementing both autonomous and background transactions.  This
will confuse everyone.

The lingo here is not so important, I think.  What *is* important is
defining how the locking and execution rules should work, and the
implementation should flow from that.  Those rules should be informed
by competing implementations and how well they work.  +1 for any
solution that makes migration from other solutions to postgres easier.

bgworkers should be considered if you want things to run in parallel.
Reading the proposal (note, I may have missed it) it isn't clear to me
if you can have the parent and AT run a query at the same time.
Should this (parallel execution) be a design goal, then that's the end
of the story.

However, I don't think it is, TBH.  ISTM the expectation is
single-threaded behavior with finer-grained control of commits.  If we're
not 100% clear on this point one way or the other, then things are a
bit premature.  Maybe we are clear and I missed something?

One major advantage of the non-bgworker serialized execution approach is
that certain classes of deadlock are easier to detect or do not exist,
since there is only one execution state; AIUI it's impossible for two
transaction states to be simultaneously waiting, assuming the pl/pgsql
instructions are not run in parallel, with one exception: the AT trying
to acquire a lock exclusively held by the master.  If the AT blocks on
the parent, it ought to be O(1) and instant to detect that and roll it
back with the right supporting infrastructure in the lock manager.  It
also makes sharing execution state much easier, especially the parts
that look like "I'm waiting here until the other guy finishes", since
there's only one "guy".

How will advisory locks work? I think they'd block with bgworkers and
not block with non-bgworkers.  What about other session-based stuff
like prepared statements?  Expectations around those cases ought to
clarify the implementation.

merlin




Re: [HACKERS] FSM corruption leading to errors

2016-10-10 Thread Pavan Deolasee
On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier 
wrote:

>
>
> +   /*
> +* See comments in GetPageWithFreeSpace about handling outside the
> valid
> +* range blocks
> +*/
> +   nblocks = RelationGetNumberOfBlocks(rel);
> +   while (target_block >= nblocks && target_block != InvalidBlockNumber)
> +   {
> +   target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
> +   spaceNeeded);
> +   }
> Hm. This is just a workaround. Even if things are done this way the
> FSM will remain corrupted.


No, because the code above updates the FSM for those out-of-range
blocks. But now that I look at it again, maybe this is not correct and it
may get into an endless loop if the relation is repeatedly extended
concurrently.


> And isn't that going to break once the
> relation is extended again?


Once the underlying bug is fixed, I don't see why it should break again. I
added the above code mostly to deal with already-corrupt FSMs. Maybe we
can just document this and leave it to the user to run some correctness checks
(see below), especially given that the code is not cheap and adds overhead
for everybody, irrespective of whether they have or will ever have a corrupt
FSM.


> I'd suggest instead putting in the release
> notes a query that allows one to analyze what are the relations broken
> and directly have them fixed. That's annoying, but it would be really
> better than a workaround. One idea here is to use pg_freespace() and
> see if it returns a non-zero value for an out-of-range block on a
> standby.
>
>
Right, that's how I tested for broken FSMs. A challenge with any such query
is that if the shared buffer copy of the FSM page is intact, then the query
won't return problematic FSMs. Of course, if the fix is applied to the
standby and is restarted, then corrupt FSMs can be detected.


>
> At the same time, I have translated your script into a TAP test, I
> found that more useful when testing..
>
> Thanks for doing that.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] FSM corruption leading to errors

2016-10-10 Thread Michael Paquier
On Fri, Oct 7, 2016 at 11:50 PM, Anastasia Lubennikova
 wrote:
> Could you please add the patches to commitfest?
> I'm going to test them and write a review in a few days.

Here you go:
https://commitfest.postgresql.org/11/817/
-- 
Michael




Re: [HACKERS] FSM corruption leading to errors

2016-10-10 Thread Michael Paquier
On Mon, Oct 10, 2016 at 11:25 PM, Michael Paquier
 wrote:
> At the same time, I have translated your script into a TAP test, I
> found that more useful when testing..

Well... Here is the actual patch.
-- 
Michael
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index bbd90c9..a6c6760 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -328,7 +328,14 @@ FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
 			return;/* nothing to do; the FSM was already smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 		fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
-		MarkBufferDirtyHint(buf, false);
+
+		/*
+		 * Mark the buffer still holding references to truncated blocks as
+		 * dirty to be sure that this makes it to disk and is kept consistent
+		 * with the truncated relation.
+		 */
+		MarkBufferDirty(buf);
+
 		UnlockReleaseBuffer(buf);
 
 		new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1;
diff --git a/src/test/recovery/t/008_fsm_check.pl b/src/test/recovery/t/008_fsm_check.pl
new file mode 100644
index 000..2c58e31
--- /dev/null
+++ b/src/test/recovery/t/008_fsm_check.pl
@@ -0,0 +1,90 @@
+# Set of tests to check file-space map consistency on standbys
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1);
+
+$node_master->append_conf('postgresql.conf', qq{
+fsync = on
+wal_level = replica
+wal_log_hints = on
+max_prepared_transactions = 5
+autovacuum = off
+});
+
+# Create a master node and its standby, initializing both with
+# some data at the same time.
+$node_master->start;
+
+$node_master->backup('master_backup');
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, 'master_backup',
+	has_streaming => 1);
+$node_standby->start;
+
+$node_master->psql('postgres', qq{
+create table testtab (a int, b char(100));
+insert into testtab select generate_series(1,1000), 'foo';
+insert into testtab select generate_series(1,1000), 'foo';
+delete from testtab where ctid > '(8,0)';
+});
+
+# Take a lock on the table to prevent following vacuum from truncating it
+$node_master->psql('postgres', qq{
+begin;
+lock table testtab in row share mode;
+prepare transaction 'p1';
+});
+
+# Vacuum, update FSM without truncation
+$node_master->psql('postgres', 'vacuum verbose testtab');
+
+# Force a checkpoint
+$node_master->psql('postgres', 'checkpoint');
+
+# Now do some more insert/deletes, another vacuum to ensure FPW writes are
+# done
+$node_master->psql('postgres', qq{
+insert into testtab select generate_series(1,1000), 'foo';
+delete from testtab where ctid > '(8,0)';
+vacuum verbose testtab;
+});
+
+# Ensure all buffers are now clean on the standby
+$node_standby->psql('postgres', 'checkpoint');
+
+# Release the lock, vacuum again which should lead to truncation
+$node_master->psql('postgres', qq{
+rollback prepared 'p1';
+vacuum verbose testtab;
+});
+
+$node_master->psql('postgres', 'checkpoint');
+my $until_lsn =
+	$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+
+# Wait long enough for standby to receive and apply all WAL
+my $caughtup_query =
+	"SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
+$node_standby->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for standby to catch up";
+
+# now promote the standby
+$node_standby->promote;
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery()")
+  or die "Timed out while waiting for promotion of standby";
+$node_standby->psql('postgres', 'checkpoint');
+
+# restart to discard in-memory copy of FSM
+$node_standby->restart;
+
+# Insert should work on standby
+is($node_standby->psql('postgres',
+   qq{insert into testtab select generate_series(1,1000), 'foo';}),
+   0, 'INSERT succeeds with truncated relation FSM');



Re: [HACKERS] FSM corruption leading to errors

2016-10-10 Thread Michael Paquier
On Fri, Oct 7, 2016 at 2:59 AM, Pavan Deolasee  wrote:
> I investigated a bug report from one of our customers and it looked very
> similar to previous bug reports here [1], [2], [3] (and probably more). In
> these reports, the error looks something like this:
>
> ERROR:  could not read block 28991 in file "base/16390/572026": read only 0
> of 8192 bytes
>
> I traced it to the following code in MarkBufferDirtyHint().
>
> freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to
> the FSM. I think that's usually alright because FSM changes are not WAL
> logged and if FSM ever returns a block with less free space than the caller
> needs, the caller is usually prepared to update the FSM and request for a
> new block. But if it returns a block that is outside the size of the
> relation, then we've a trouble. The very next ReadBuffer() fails to handle
> such a block and throws the error.

To be honest, I have been seeing a very similar issue for a couple of
weeks now on some test beds on a couple of relations involving a
promoted standby, this error happening more frequently for relations
having a more aggressive autovacuum setup. I did not take the time to
look at that seriously, and I am very glad to see this email.

> When a relation is truncated, the FSM is truncated too to remove references
> to the heap blocks that are being truncated. But since the FSM buffer may
> not be marked DIRTY on the standby, if the buffer gets evicted from the
> buffer cache, the on-disk copy of the FSM page may be left with references
> to the truncated heap pages. When the standby is later promoted to be the
> master, and an insert/update is attempted to the table, the FSM may return a
> block that is outside the valid range of the relation. That results in the
> said error.

Oops.

> I believe the fix is very simple. The FSM change during truncation is
> critical and the buffer must be marked by MarkBufferDirty() i.e. those
> changes must make to the disk. I think it's alright not to WAL log them
> because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But it must
> not be lost across a checkpoint. Also, since it happens only during relation
> truncation, I don't see any problem from performance perspective.

Agreed. I happen to notice that the VM is similarly careful when it comes
to truncating it (visibilitymap_truncate).

> What bothers me is how to fix the problem for already affected standbys. If
> the FSM for some table is already corrupted at the standby, users won't
> notice it until the standby is promoted to be the new master. If the standby
> starts throwing errors suddenly after failover, it will be a very bad
> situation for the users, like we noticed with our customers. The fix is
> simple and users can just delete the FSM (and VACUUM the table), but that
> doesn't sound nice and they would not know until they see the problem.

+   /*
+* See comments in GetPageWithFreeSpace about handling outside the valid
+* range blocks
+*/
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+   target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+   spaceNeeded);
+   }
Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted. And isn't that going to break once the
relation is extended again? I'd suggest instead putting in the release
notes a query that allows one to analyze what are the relations broken
and directly have them fixed. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a
standby.
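
A sketch of such a check, using the pg_freespacemap extension and run on the
standby (or on the node after promotion); this is only an illustration, not
necessarily the exact query that would go into the release notes:

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- report FSM entries that point beyond the current end of the relation
SELECT c.oid::regclass AS relation, f.blkno, f.avail
  FROM pg_class c,
       LATERAL pg_freespace(c.oid::regclass) AS f
 WHERE c.relkind IN ('r', 'm', 't')
   AND f.avail > 0
   AND f.blkno >= pg_relation_size(c.oid) / current_setting('block_size')::bigint;

Any row returned by something like this would indicate an FSM still holding
references to truncated blocks.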

I have also been thinking about the comment you added and we could
just do something like that:
+   /*
+* Mark the buffer still holding references to truncated blocks as
+* dirty to be sure that this makes it to disk and is kept consistent
+* with the truncated relation.
+*/
+   MarkBufferDirty(buf)

> It might also be a good idea to inspect other callers of
> MarkBufferDirtyHint() and see if any of them is vulnerable, especially from
> standby perspective. I did one round, and couldn't see another problem.

Haven't looked at those yet, will do so tomorrow.

At the same time, I have translated your script into a TAP test, I
found that more useful when testing..
-- 
Michael




Re: [HACKERS] cygwin64 assertion failure

2016-10-10 Thread Amit Kapila
On Sun, Oct 9, 2016 at 10:23 PM, Andrew Dunstan  wrote:
> lorikeet seems to be stuck running the parallel tests, after having failed
> an assertion.
>

It looks like this problem has been discussed before [1].  Shall we do
what Tom has suggested there?


[1] - https://www.postgresql.org/message-id/15344.1473974558%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] pgbench vs. wait events

2016-10-10 Thread Bruce Momjian
On Thu, Oct  6, 2016 at 02:38:56PM -0400, Robert Haas wrote:
> Hi,
> 
> I decided to do some testing on hydra (IBM-provided community
> resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
> the newly-enhanced wait event stuff to try to get an idea of what
> we're waiting for during pgbench.  I did 30-minute pgbench runs with
> various configurations, but all had max_connections = 200,
> shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
> off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
> log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
> on.  During each run, I ran this psql script in another window and
> captured the output:

This is a great study that shows how the new instrumentation has given
us a new window into performance.  I am frankly surprised we got as far
as we did in finding performance bottlenecks before we had this
instrumentation.
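
For reference, sampling of the kind described above, using the wait_event
columns that pg_stat_activity gained in 9.6, looks roughly like this (a
sketch, not necessarily the exact script that was used):

SELECT wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
 WHERE wait_event IS NOT NULL
 GROUP BY wait_event_type, wait_event
 ORDER BY count(*) DESC;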

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




[HACKERS] FTS Configuration option

2016-10-10 Thread Artur Zakirov

Hello hackers,

Sometimes there is a need to collect lexemes from several dictionaries,
for example if we have a column with text in various languages.

Let's say there is a new option, JOIN. This option would allow the
parser to append the lexemes from the current dictionary and then go on
to the next dictionary:


=> CREATE TEXT SEARCH CONFIGURATION multi_conf (COPY=simple);
=> ALTER TEXT SEARCH CONFIGURATION multi_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH german_ispell (JOIN), english_ispell, simple;

There are the following cases:
- the lexeme is found in german_ispell, not found in english_ispell:
return the lexeme from german_ispell.
- not found in german_ispell, found in english_ispell:
return the lexeme from english_ispell.
- not found in either dictionary: return the lexeme from simple.
- found in both dictionaries: return the lexemes from both.
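
To illustrate the idea with ts_lexize() (the dictionary names are the ones
assumed in the configuration above and would have to exist; the JOIN
behavior itself is hypothetical, this is just a sketch):

SELECT ts_lexize('german_ispell', 'haus');    -- recognized by the German dictionary
SELECT ts_lexize('english_ispell', 'haus');   -- NULL, not recognized
-- With JOIN on german_ispell, a token known only to german_ispell would keep
-- its German lexemes, and a token known to both dictionaries would get the
-- lexemes from both, instead of stopping at the first match.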

Could such an option be useful to the community? The name of the option is
debatable.

Thank you!

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] proposal: psql \setfileref

2016-10-10 Thread Gilles Darold
On 10/10/2016 at 14:38, Daniel Verite wrote:
>   Gilles Darold wrote:
>
>>postgres=# \setfileref b /dev/random
>>postgres=# insert into test (:b);
>>
>> Process need to be killed using SIGTERM.
> This can be fixed by setting sigint_interrupt_enabled to true
> before operating on the file. Then ctrl-C would be able to cancel
> the command.

If we do not prevent the user from being able to load special files, that
would be useful, yes.

-- 
Gilles Darold
Consultant PostgreSQL
http://dalibo.com - http://dalibo.org





Re: [HACKERS] vacuumdb -f and -j options (was Question / requests.)

2016-10-10 Thread Amit Kapila
On Sun, Oct 9, 2016 at 10:59 PM, Francisco Olarte
 wrote:
> On Sat, Oct 8, 2016 at 2:22 PM, Michael Paquier
>  wrote:
>> On Sat, Oct 8, 2016 at 9:12 PM, Amit Kapila  wrote:
>
>>> After reading Francisco's proposal [1], I don't think it is directly
>>> trying to make -f and -j work together.  He is proposing to make it
>>> work by providing some new options.  As you are wondering upthread, I
>>> think it seems reasonable to disallow -f with parallel vacuuming if no
>>> tables are specified.
>
> For me -f & -j is not perfect, but better than not having it. It can
> deadlock when given certain sets of catalog tables, either by making
> it go for the full db or by a perverse set of -t options. But any DBA
> needing them together should, IMO, have resources to write ( or have
> someone else write for him ) a 20-liner wrapping and feeding them via
> -t. After all, not every tool/option is for everyone, and everything
> has it prerequisites.
>

Okay, but I think that doesn't mean it should deadlock when used by
somewhat naive user.  I am not sure every user who wants to use -f and
-j is smart enough to write a script as you are suggesting.  I think
if more people see your proposal as meaningful and want to leave
current usage of -f and -j as it is, then probably, we should issue a
warning indicating such a risk.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-10 Thread Amit Kapila
On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila  wrote:
> On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas  wrote:
>
>> Another thing I don't quite understand about this algorithm is that in
>> order to conditionally lock the target primary bucket page, we'd first
>> need to read and pin it.  And that doesn't seem like a good thing to
>> do while we're holding a shared content lock on the metapage, because
>> of the principle that we don't want to hold content locks across I/O.
>>
>

Aren't we already doing this during BufferAlloc() when the buffer
selected by StrategyGetBuffer() is dirty?

> I think we can release metapage content lock before reading the buffer.
>

On thinking about this again, if we release the metapage content lock
before reading and pinning the primary bucket page, then we need to
take it again to verify if the split has happened during the time we
don't have a lock on a metapage.  Releasing and again taking content
lock on metapage is not
good from the performance aspect.  Do you have some other idea for this?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] pg_upgrade 9.5 -> 9.6 fails when pg_largeobject is in separate tablespace

2016-10-10 Thread Andreas Joseph Krogh
On Sunday, 9 October 2016 at 23:43:23, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Oct 8, 2016 at 9:02 AM, Andreas Joseph Krogh wrote:
>> (I've set allow_system_table_mods=on in postgresql.conf)
>
> Any configuration that includes this step is considered unsupported by the
> PostgreSQL community.
>
> It might be a good idea if we supported storing large objects in an alternate
> tablespace, or in multiple tables in the same or different tablespaces.
> However, if you can only get there by enabling allow_system_table_mods, then we
> don't.

 
Note that pg_largeobject can be moved without
changing allow_system_table_mods, namely by starting in single-user mode, so I
really don't see why this is considered unsupported. I would assume that having
pg_largeobject in a separate tablespace is more and more common these days,
having real-cheap SAN vs. fast SSD for normal tables/indexes/WAL.
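
Roughly what is being described, as a sketch (run in single-user mode, or with
allow_system_table_mods=on; the tablespace name is just an example):

ALTER TABLE pg_catalog.pg_largeobject SET TABLESPACE lo_space;
-- the table's index and pg_largeobject_metadata can be moved the same way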
 
AFAICT the very existence of pg_largeobject is an implementation detail (and it
being a system catalog is considered a defect), so saying that by moving it
you're not able to use tools like pg_upgrade feels like being left out in the
cold...
 
Is fixing this in anyone's plans? Is this something we can pay to get fixed,
and if so, what would it take?
 
PS: I cannot see this shortcoming being documented anywhere in pg_upgrade's 
docs ( https://www.postgresql.org/docs/9.6/static/pgupgrade.html ), is it 
mentioned anywhere?
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


Re: [HACKERS] proposal: psql \setfileref

2016-10-10 Thread Daniel Verite
Gilles Darold wrote:

>postgres=# \setfileref b /dev/random
>postgres=# insert into test (:b);
> 
> Process need to be killed using SIGTERM.

This can be fixed by setting sigint_interrupt_enabled to true
before operating on the file. Then ctrl-C would be able to cancel
the command.

See comment in common.c, above the declaration of 
sigint_interrupt_enabled:

/*
 []
 * SIGINT is supposed to abort all long-running psql operations, not only
 * database queries.  In most places, this is accomplished by checking
 * cancel_pressed during long-running loops.  However, that won't work when
 * blocked on user input (in readline() or fgets()).  In those places, we
 * set sigint_interrupt_enabled TRUE while blocked, instructing the signal
 * catcher to longjmp through sigint_interrupt_jmp.  We assume readline and
 * fgets are coded to handle possible interruption.  (XXX currently this does
 * not work on win32, so control-C is less useful there)
 */


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




[HACKERS] Using pg_ctl promote -w in TAP tests

2016-10-10 Thread Michael Paquier
Hi all,

Now that we have support for the wait mode of pg_ctl promote, I think
that it would be a good idea to switch to it in the TAP tests. This
allows avoiding extra logic with poll_query_until() to be sure that a
promoted standby is ready for read-write queries.
See the patch attached.

Thanks,
-- 
Michael
diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index c7c3a8f..1c48261 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -161,12 +161,8 @@ sub promote_standby
 	  or die "Timed out while waiting for standby to receive and write WAL";
 
 	# Now promote slave and insert some new data on master, this will put
-	# the master out-of-sync with the standby. Wait until the standby is
-	# out of recovery mode, and is ready to accept read-write connections.
+	# the master out-of-sync with the standby.
 	$node_standby->promote;
-	$node_standby->poll_query_until('postgres',
-		"SELECT NOT pg_is_in_recovery()")
-	  or die "Timed out while waiting for promotion of standby";
 
 	# Force a checkpoint after the promotion. pg_rewind looks at the control
 	# file to determine what timeline the server is on, and that isn't updated
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 535d6c0..4362789 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -723,7 +723,7 @@ sub restart
 
 =item $node->promote()
 
-Wrapper for pg_ctl promote
+Wrapper for pg_ctl promote -w
 
 =cut
 
@@ -735,7 +735,8 @@ sub promote
 	my $logfile = $self->logfile;
 	my $name= $self->name;
 	print "### Promoting node \"$name\"\n";
-	TestLib::system_log('pg_ctl', '-D', $pgdata, '-l', $logfile, 'promote');
+	TestLib::system_log('pg_ctl', '-D', $pgdata, '-w', '-l', $logfile,
+		'promote');
 }
 
 # Internal routine to enable streaming replication on a standby node.
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index 3ee8df2..5f3b2fe 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -57,10 +57,7 @@ recovery_target_timeline='latest'
 $node_standby_2->restart;
 
 # Insert some data in standby 1 and check its presence in standby 2
-# to ensure that the timeline switch has been done. Standby 1 needs
-# to exit recovery first before moving on with the test.
-$node_standby_1->poll_query_until('postgres',
-	"SELECT pg_is_in_recovery() <> true");
+# to ensure that the timeline switch has been done.
 $node_standby_1->safe_psql('postgres',
 	"INSERT INTO tab_int VALUES (generate_series(1001,2000))");
 $until_lsn = $node_standby_1->safe_psql('postgres',



Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)

2016-10-10 Thread Heikki Linnakangas

On 10/06/2016 06:44 PM, Peter Geoghegan wrote:

While the fix you pushed was probably a good idea anyway, I still
think you should not use state->maxTapes to exhaustively call
LogicalTapeAssignReadBufferSize() on every tape, even non-active
tapes. That's the confusing part.


Admittedly that's confusing. Thinking about this some more, I came up 
with the attached. I removed the separate 
LogicalTapeAssignReadBufferSize() call altogether - the read buffer size 
is now passed as argument to the LogicalTapeRewindForRead() call.


I split LogicalTapeRewind() into two: LogicalTapeRewindForRead() and 
LogicalTapeRewindForWrite(). A boolean argument like 'forWrite' is 
mentally challenging, because when reading a call site, you have to 
remember which way round the boolean is. And now you also pass the extra 
buffer_size argument when rewinding for reading, but not when writing.


I gave up on the logic to calculate the quotient and reminder of the 
available memory, and assigning one extra buffer to some of the tapes. 
I'm now just rounding it down to the nearest BLCKSZ boundary. That 
simplifies the code further, although we're now not using all the memory 
we could. I'm pretty sure that's OK, although I haven't done any 
performance testing of this.


- Heikki

>From 9862ff0cd184afc50515854b267e3a6456547ac6 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 10 Oct 2016 15:20:05 +0300
Subject: [PATCH 1/1] Simplify the code for logical tape read buffers.

Pass the buffer size as argument to LogicalTapeRewindForRead, rather than
setting it earlier with the separate LogicTapeAssignReadBufferSize call.
This way, the buffer size is set closer to where it's actually used, which
makes the code easier to understand.

This makes the calculation for how much memory to use for the buffers less
precise. We now use the same amount of memory for every tape, rounded down
to the nearest BLCKSZ boundary, instead of using one more block for some
tapes, to get the total up to exact amount of memory available. That should
be OK, merging isn't too sensitive to the exact amount of memory used.
---
 src/backend/utils/sort/logtape.c   | 214 ++---
 src/backend/utils/sort/tuplesort.c | 119 ++---
 src/include/utils/logtape.h|   5 +-
 3 files changed, 137 insertions(+), 201 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 6cc06b2..25090c7 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -141,14 +141,6 @@ typedef struct LogicalTape
 	long		curBlockNumber; /* this block's logical blk# within tape */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
-
-	/*
-	 * Desired buffer size to use when reading.  To keep things simple, we use
-	 * a single-block buffer when writing, or when reading a frozen tape.  But
-	 * when we are reading and will only read forwards, we allocate a larger
-	 * buffer, determined by read_buffer_size.
-	 */
-	int			read_buffer_size;
 } LogicalTape;
 
 /*
@@ -609,7 +601,6 @@ LogicalTapeSetCreate(int ntapes)
 		lt->lastBlockBytes = 0;
 		lt->buffer = NULL;
 		lt->buffer_size = 0;
-		lt->read_buffer_size = BLCKSZ;
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
@@ -739,13 +730,20 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 }
 
 /*
- * Rewind logical tape and switch from writing to reading or vice versa.
+ * Rewind logical tape and switch from writing to reading.
  *
- * Unless the tape has been "frozen" in read state, forWrite must be the
- * opposite of the previous tape state.
+ * The tape must currently be in writing state, or "frozen" in read state.
+ *
+ * 'buffer_size' specifies how much memory to use for the read buffer.  It
+ * does not include the memory needed for the indirect blocks.  Regardless
+ * of the argument, the actual amount of memory used is between BLCKSZ and
+ * MaxAllocSize, and is a multiple of BLCKSZ.  The given value is rounded
+ * down and truncated to fit those constraints, if necessary.  If the tape
+ * is frozen, the 'buffer_size' argument is ignored, and a small BLCKSZ byte
+ * buffer is used.
  */
 void
-LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
+LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum, size_t buffer_size)
 {
 	LogicalTape *lt;
 	long		datablocknum;
@@ -753,88 +751,111 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = >tapes[tapenum];
 
-	if (!forWrite)
+	/*
+	 * Round and cap the buffer_size if needed.
+	 */
+	if (lt->frozen)
+		buffer_size = BLCKSZ;
+	else
 	{
-		if (lt->writing)
-		{
-			/*
-			 * Completion of a write phase.  Flush last partial data block,
-			 * flush any partial indirect blocks, rewind for normal
-			 * (destructive) read.
-			 */
-			if (lt->dirty)
-ltsDumpBuffer(lts, 

Re: [HACKERS] Is it time to kill support for very old servers?

2016-10-10 Thread Craig Ringer
On 10 October 2016 at 10:45, Jim Nasby  wrote:
> On 10/7/16 1:08 PM, Steve Crawford wrote:
>>
>> This is effectively a 5-year upgrade "grace period" *after* the EOL date
>> of a given version which seems plenty generous.
>
>
> IMHO we need to be careful here. It's not at all unusual to see servers
> running versions that are *far* older than that. It's certainly
> understandable that we're not actively supporting those versions any more,
> but we also don't want people to effectively be stranded on them because
> they can't even get the data out and back into a newer version. So I think
> pg_dump at least should try to support as far back as we can without jumping
> to lots of hoops in code. I think moving the limit to 8.0 is fine, but I'm
> not so comfortable with making that limit 9.1.

The oldest I deal with is 8.2, and that's enough of an aberration that
expecting them to go via 9.6 isn't wholly unreasonable. I agree 9.0 is
way too far.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Question / requests.

2016-10-10 Thread Francisco Olarte
On Mon, Oct 10, 2016 at 4:51 AM, Jim Nasby  wrote:
> On 10/5/16 9:58 AM, Francisco Olarte wrote:
>> Is the system catalog a bottleneck for people who has real use for
>> paralell vacuum? I mean, to me someone who does this must have a very
>> big db on a big iron. If that does not consist of thousands and
>> thousands of smallish relations, it will normally be some very big
>> tables and a much smaller catalog.
> Not necessarily. Anyone that makes extensive use of temp tables can end up
> with a very large (and bloated) pg_attribute. AFAIK you can actually create
> "temp" versions of any object that lives in a schema by specifying pg_temp
> as the schema, but in practice I don't think you'll really see anything
> other than pg_attribute get really large. So it would be nice if
> pg_attribute could be done in parallel, but I suspect it's one of the
> catalog tables that could be causing these problems.

This I see, but if you crunch on temp tables I'm not sure you should
do a full vacuum on the catalog ( I fear a full catalog vacuum is going to
lock DDL, and this kind of situation is better served by autovacuum
maintaining free space in the catalog so it gets to a stable size ).
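
Whether pg_attribute is actually growing, and whether dead rows are piling
up in it, is easy to check with the standard statistics views; a minimal
sketch (nothing here is specific to this thread):

SELECT pg_size_pretty(pg_total_relation_size('pg_catalog.pg_attribute')) AS size,
       n_live_tup, n_dead_tup
  FROM pg_stat_sys_tables
 WHERE relname = 'pg_attribute';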

I do not think it is necessary to make every operation as fast as
possible; I prefer a simpler system. I feel someone having a multi-terabyte
database which needs full vacuums in parallel due to its use patterns, and
who also crunches on a lot of temporary tables - a strange use pattern
which mandates full vacuums of pg_attribute ( I could concoct a situation
for these, but not easily ) - is a specialized power DBA who should be able
to easily script vacuum tasks taking the usage pattern into account much
better than any reasonably simple alternative. After all, -t is there and
can be repeated for a reason.

Francisco Olarte.




Re: [HACKERS] PL/Python adding support for multi-dimensional arrays

2016-10-10 Thread Heikki Linnakangas

On 10/01/2016 02:45 AM, Jim Nasby wrote:

On 9/29/16 1:51 PM, Heikki Linnakangas wrote:

Now, back to multi-dimensional arrays. I can see that the Sequence
representation is problematic, with arrays, because if you have a python
list of lists, like [[1, 2]], it's not immediately clear if that's a
one-dimensional array of tuples, or two-dimensional array of integers.
Then again, we do have the type definitions available. So is it really
ambiguous?


[[1,2]] is a list of lists...
In [4]: b=[[1,2]]

In [5]: type(b)
Out[5]: list

In [6]: type(b[0])
Out[6]: list

If you want a list of tuples...
In [7]: c=[(1,2)]

In [8]: type(c)
Out[8]: list

In [9]: type(c[0])
Out[9]: tuple


Hmm, so we would start to treat lists and tuples differently? A Python 
list would be converted into an array, and a Python tuple would be 
converted into a composite type. That does make a lot of sense. The only 
problem is that it's not backwards-compatible. A PL/python function that 
returns an SQL array of rows, and does that by returning Python list of 
lists, it would start failing.


I think we should bite the bullet and do that anyway. As long as it's 
clearly documented, and the error message you get contains a clear hint 
on how to fix it, I don't think it would be too painful to adjust 
existing applications.


We could continue to accept a Python list for a plain composite type, 
this would only affect arrays of composite types.
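
As a sketch of what the proposed convention would look like for an array of
composite values (hypothetical behavior under the proposal, not a statement
about what PL/Python does today; the type and function names are made up):

CREATE TYPE pair AS (a int, b int);

CREATE FUNCTION pairs() RETURNS pair[] AS $$
    # a Python list of tuples: a one-dimensional SQL array of 'pair' values
    return [(1, 2), (3, 4)]
$$ LANGUAGE plpythonu;

-- whereas returning [[1, 2], [3, 4]] (a list of lists) would be treated as a
-- two-dimensional array if the declared return type called for one.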


I don't use PL/python much myself, so I don't feel qualified to make the 
call, though. Any 3rd opinions?


- Heikki





Re: [HACKERS] Un-include access/heapam.h

2016-10-10 Thread Heikki Linnakangas

On 10/04/2016 07:39 AM, Amit Langote wrote:

I noticed that un-including access/heapam.h from the following files
leaves things just fine.

src/backend/commands/aggregatecmds.c
src/backend/commands/collationcmds.c
src/backend/commands/conversioncmds.c
src/backend/commands/lockcmds.c

It seems any calls into heapam.c that there used to be have since moved to
the corresponding backend/catalog/* files or elsewhere.

Attached a patch.


Thanks, applied.

- Heikki





Re: [HACKERS] B-tree index row size limit

2016-10-10 Thread Heikki Linnakangas

On 10/09/2016 02:39 PM, Florian Weimer wrote:

What it would it take to eliminate the B-tree index row size limit (or
rather, increase it to several hundred megabytes)?  I don't care about
performance for index-based lookups for overlong columns, I just want
to be able to load arbitrary data and index it.


A few ideas:

* Add support for "truncate" B-tree support functions. Long values 
wouldn't be stored, but they would be cut at a suitable length. This 
would complicate things when you have two values that only differ in the 
truncated-away portion. You'd need to still be able to order them 
correctly in the index, perhaps by fetching the actual value from the heap.


* Use TOAST for index datums. That would involve adding a whole new 
toast table for the index, with the index for the toast table.


* Have something like TOAST, implemented within the B-tree AM. When a 
large datum is stored, chop it into chunks that are stored in special 
"toast" pages in the index.


* Add smarts to the planner, to support using an expression index even 
if the predicate doesn't contain the expression verbatim. For example, 
if you have an index on SUBSTR(column, 100), and a predicate "column = 
'foo'", you could use the index, if the planner just knew enough about 
SUBSTR to realize that. (A manual approximation of this is sketched 
after this list.)


* Don't do it. Use a hash index instead. If all goes well, it'll be 
WAL-logged in PostgreSQL 10.
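
As an aside, the planner-smarts idea can be approximated by hand today, by
repeating the indexed expression in the query (a sketch; the table and column
names are made up):

CREATE INDEX docs_body_prefix_idx ON docs (substr(body, 1, 100));

SELECT *
  FROM docs
 WHERE substr(body, 1, 100) = substr('... some long value ...', 1, 100)
   AND body = '... some long value ...';

The prefix comparison lets the index narrow down the candidates, and the
equality on the full column rechecks them.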


- Heikki





Re: [HACKERS] Logical tape pause/resume

2016-10-10 Thread Heikki Linnakangas
Here are freshly rebased versions of these patches. No big changes, but 
edited some comments slightly.


- Heikki

>From 324706fb2e15facbc5686e98ee26bbdc5017366a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 10 Oct 2016 10:59:38 +0300
Subject: [PATCH 1/2] Simplify tape block format.

No more indirect blocks. The blocks form a linked list instead.

This saves some memory, because we don't need to have a buffer in memory to
hold the indirect block (or blocks). To reflect that, TAPE_BUFFER_OVERHEAD
is reduced from 3 to 1 buffer, which allows using more memory for building
the initial runs.
---
 src/backend/utils/sort/logtape.c   | 581 ++---
 src/backend/utils/sort/tuplesort.c |  17 +-
 src/include/utils/logtape.h|   2 +-
 3 files changed, 156 insertions(+), 444 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 6cc06b2..7e1f0d8 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -31,15 +31,8 @@
  * in BLCKSZ-size blocks.  Space allocation boils down to keeping track
  * of which blocks in the underlying file belong to which logical tape,
  * plus any blocks that are free (recycled and not yet reused).
- * The blocks in each logical tape are remembered using a method borrowed
- * from the Unix HFS filesystem: we store data block numbers in an
- * "indirect block".  If an indirect block fills up, we write it out to
- * the underlying file and remember its location in a second-level indirect
- * block.  In the same way second-level blocks are remembered in third-
- * level blocks, and so on if necessary (of course we're talking huge
- * amounts of data here).  The topmost indirect block of a given logical
- * tape is never actually written out to the physical file, but all lower-
- * level indirect blocks will be.
+ * The blocks in each logical tape form a chain, with a prev- and next-
+ * pointer in each block.
  *
  * The initial write pass is guaranteed to fill the underlying file
  * perfectly sequentially, no matter how data is divided into logical tapes.
@@ -87,58 +80,58 @@
 #include "utils/memutils.h"
 
 /*
- * Block indexes are "long"s, so we can fit this many per indirect block.
- * NB: we assume this is an exact fit!
- */
-#define BLOCKS_PER_INDIR_BLOCK	((int) (BLCKSZ / sizeof(long)))
-
-/*
- * We use a struct like this for each active indirection level of each
- * logical tape.  If the indirect block is not the highest level of its
- * tape, the "nextup" link points to the next higher level.  Only the
- * "ptrs" array is written out if we have to dump the indirect block to
- * disk.  If "ptrs" is not completely full, we store -1L in the first
- * unused slot at completion of the write phase for the logical tape.
+ * A TapeBlockTrailer is stored at the end of each BLCKSZ block.
+ *
+ * The first block of a tape has prev == -1.  The last block of a tape
+ * stores, in 'next', the number of valid bytes on the block minus BLCKSZ.
+ * That's always negative, so next < 0 indicates the last block.
  */
-typedef struct IndirectBlock
+typedef struct TapeBlockTrailer
 {
-	int			nextSlot;		/* next pointer slot to write or read */
-	struct IndirectBlock *nextup;		/* parent indirect level, or NULL if
-		 * top */
-	long		ptrs[BLOCKS_PER_INDIR_BLOCK];	/* indexes of contained blocks */
-} IndirectBlock;
+	long		prev;	/* previous block on this tape, or -1 on first block */
+	long		next;	/* next block on this tape, or # of valid bytes on last block */
+} TapeBlockTrailer;
+
+#define TapeBlockGetTrailer(buf) ((TapeBlockTrailer *) ((char *) buf + BLCKSZ - sizeof(TapeBlockTrailer)))
+#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
+
+#define TapeBlockGetNBytes(buf) (TapeBlockGetTrailer(buf)->next < 0 ? (TapeBlockGetTrailer(buf)->next + BLCKSZ) : TapeBlockPayloadSize)
+
+#define TapeBlockIsLast(buf) (TapeBlockGetTrailer(buf)->next < 0)
 
 /*
  * This data structure represents a single "logical tape" within the set
- * of logical tapes stored in the same file.  We must keep track of the
- * current partially-read-or-written data block as well as the active
- * indirect block level(s).
+ * of logical tapes stored in the same file.
+ *
+ * While writing, we hold the current partially-written data block in the
+ * buffer.  While reading, we can hold multiple blocks in the buffer.  Note
+ * that we don't retain the trailers of a block when it's read into the
+ * buffer.  The buffer therefore contains one large contiguous chunk of data
+ * from the tape.
  */
 typedef struct LogicalTape
 {
-	IndirectBlock *indirect;	/* bottom of my indirect-block hierarchy */
 	bool		writing;		/* T while in write phase */
 	bool		frozen;			/* T if blocks should not be freed when read */
 	bool		dirty;			/* does buffer need to be written? */
 
 	/*
-	 * The total data volume in the logical tape is numFullBlocks * BLCKSZ +
-	 * lastBlockBytes.  BUT: we do 

Re: [HACKERS] proposal: psql \setfileref

2016-10-10 Thread Gilles Darold
On 09/10/2016 at 11:48, Pavel Stehule wrote:
> hi
>
> 2016-10-04 9:18 GMT+02:00 Gilles Darold :
>
> On 03/10/2016 at 23:23, Gilles Darold wrote:
> > On 03/10/2016 at 23:03, Robert Haas wrote:
> >> On Mon, Oct 3, 2016 at 3:54 PM, Gilles Darold
> >> wrote:
> >>> 4) Another problem is that, as it is, this patch will allow
> >>> anyone to upload into a column the content of any system file
> >>> that can be read by the postgres system user, and then allow
> >>> non-system users to read its content.
> >> I thought this was a client-side feature, so that it would let a
> >> client upload any file that the client can read, but not things
> >> that can only be read by the postgres system user.
> >>
> > Yes, that's right, sorry for the noise, forget this fourth report.
> >
>
> After some more thought, there is still a security issue here. For a
> PostgreSQL user who also has login access to the server, it is
> possible to read any file that the postgres system user can read,
> especially a .pgpass or a recovery.conf containing a password.
>
>
> here is a new update - the mentioned issues are fixed, plus regression
> tests and docs

Looks very good to me, minus the following two points:

1) I think \setfileref should use the same output syntax as the \set command

postgres=# \setfileref a testfile.txt
postgres=# \setfileref
a = 'testfile.txt'

postgres=# \setfileref
...
a = ^'testfile.txt'

I think it would be better to prefix the variable value with the ^ too,
as in the \set report, even if we know, when using this command, that the
reported variables are file references.


2) You still allow special files to be used. I understand that this is
the user's responsibility, but forbidding them could be a wise precaution.

postgres=# \setfileref b /dev/random
postgres=# insert into test (:b);

The process needs to be killed using SIGTERM.

However, if this last point is not critical and should be handled by the
user, I think this patch is ready to be reviewed by a committer after the
first point is fixed.

Regards,

-- 
Gilles Darold
Consultant PostgreSQL
http://dalibo.com - http://dalibo.org



Re: [HACKERS] Logical tape pause/resume

2016-10-10 Thread Heikki Linnakangas

On 10/09/2016 03:27 AM, Peter Geoghegan wrote:

You shouldn't really waste 8% of the budget with low work_mem settings
with my cap patch applied, because the new cap never limits the number
of tapes. IIRC, the cap of 500 tapes doesn't start to matter until you
have about 1GB of work_mem. So, if there is still any waste at the low
end, that can only be solved by tweaking the main calculation within
tuplesort_merge_order(). (Also, to be clear to others following along:
that memory is never actually allocated, so it's only "wasted" from
the perspective of one sort operation alone).


Regardless of the number of tapes, the memory used for the tape buffers, 
while building the initial runs, is wasted. It's not entirely wasted 
when you have multiple merge passes, because without the buffer you need 
to read the last partial page back into memory, when you start to output 
the next run on it. But even in that case, I believe that memory would 
be better used for the quicksort, to create larger runs.



The cost of multiple passes is paid in sequential I/O of tape temp
files, which is now clearly a relatively low cost. OTOH, the cost of a
larger merge heap is paid in more CPU cache misses, which is a cost we
can feel quite severely. While it's really hard to come up with a
generic break-even point, I do think that there is value in capping
the number of tapes somewhere in the hundreds. It's not like a cap on
the number of tapes along the lines I've proposed was not thought
about from day one, by both Tom and Simon. Noah's relatively recent
MaxAllocSize work has given the issue new importance, though (the same
might have been said during the 9.6 replacement selection vs.
quicksort discussions, actually).


Yeah, there might be value in having a cap, because operating the merge 
heap becomes more expensive as it grows, mainly because of cache misses. 
We should perhaps have a limit on the size of the merge heap, and use the 
same limit for replacement selection. The heap does behave slightly 
differently during replacement selection, though: there, new values added 
to the heap tend to sink to the bottom, whereas during the merge new 
values tend to stay close to the top. The latter pattern incurs fewer 
cache misses.
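
As a rough illustration of that difference, here is a generic sketch (not 
tuplesort code) of replacing the top element of a binary min-heap; the 
number of levels the new value sinks through is what drives the cache 
misses.  During run building the incoming value is effectively random with 
respect to the heap and tends to sink deep; during the merge the next 
value from the same run is close to the one just removed and tends to stop 
near the top:

/* Replace the smallest element of a min-heap of ints and sift it down. */
static void
heap_replace_top(int *heap, int nelems, int newval)
{
    int     i = 0;

    heap[0] = newval;
    for (;;)
    {
        int     child = 2 * i + 1;      /* left child */

        if (child >= nelems)
            break;
        if (child + 1 < nelems && heap[child + 1] < heap[child])
            child++;                    /* pick the smaller child */
        if (heap[i] <= heap[child])
            break;                      /* heap property restored */

        /* swap and keep sinking; each level is a potential cache miss */
        {
            int     tmp = heap[i];

            heap[i] = heap[child];
            heap[child] = tmp;
        }
        i = child;
    }
}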


But that's orthogonal to the wasting-memory aspect of this. Even if we 
have a cap of 100, it's good not to spend memory on the tape buffers of 
those 100 tapes while building the initial runs.
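
(As a rough back-of-the-envelope figure, assuming the default BLCKSZ of 
8 kB: reserving one buffer per tape for 100 tapes takes about 800 kB out 
of the budget during run building, and with the pre-patch overhead of 
three buffers per tape it would be about 2.4 MB -- memory that could 
otherwise go into the quicksort.)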



When we finish writing an initial run to a tape, we keep the tape buffers
around. That's the reason we need to reserve that memory. But there's no
fundamental reason we have to do that. As I suggested in [2], we could flush
the buffers to disk, and only keep in memory the location of the last block
we wrote. If we need to write another run to the tape, we can reload the
last incomplete block from the disk, and continue writing.
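
A minimal sketch of that pause/resume idea, with invented names and none of 
logtape.c's block chaining or buffering details (pread/pwrite stand in for 
the buffile layer; error handling omitted):

#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192             /* assumed default block size */

/* All the state we need to keep for a paused tape. */
typedef struct PausedTape
{
    int     fd;                 /* underlying temp file */
    long    lastBlock;          /* block number of the partially filled block */
    int     nbytesUsed;         /* valid bytes in that block */
} PausedTape;

/* Finish a run: flush the partial block, so the block-sized buffer can be
 * released and its memory used for building the next run. */
static void
tape_pause(PausedTape *tape, const char *buffer)
{
    (void) pwrite(tape->fd, buffer, BLCKSZ, (off_t) tape->lastBlock * BLCKSZ);
}

/* Write another run to the same tape: reload the partial block and
 * continue appending at offset nbytesUsed. */
static void
tape_resume(PausedTape *tape, char *buffer)
{
    (void) pread(tape->fd, buffer, BLCKSZ, (off_t) tape->lastBlock * BLCKSZ);
}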


Okay. But you haven't actually addressed the problem of non-trivial
amounts of memory being logically allocated ahead of time for no good
reason -- you've only addressed the constant factor (the overhead
per tape).


I thought I did. Can you elaborate?

Are you referring to the fact that the array of LogicalTapes is still 
allocated ahead of time, with size maxTapes? True, it'd be nice to 
allocate the LogicalTape structs only when needed. But that's peanuts, 
compared to the tape buffers.
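
(Roughly speaking: a LogicalTape struct holds a handful of scalar fields, 
so it is on the order of tens of bytes, while each tape buffer is a full 
BLCKSZ -- 8 kB by default -- so the buffers dominate by a couple of orders 
of magnitude.)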



In addition to saving a little bit of memory, I'd like to do this
refactoring because it simplifies the code. It's code that has stayed
largely unchanged for the past 15 years, so I'm not too eager to touch it,
but looking at the changes coming with Peter's parallel tuplesort patch set,
I think this makes sense.


I can definitely see value in refactoring, to make that code less
complicated -- I just don't think it's justified by the amount of
memory that is wasted on tapes.


Ok, good. I think we're in agreement on doing this, then, even if we 
don't agree on the justification :-).



That said, I'm not especially worried about the complexity within the
logtape.c changes of the parallel CREATE INDEX patch alone. I'm much
more interested in having a logtape.c that could be more easily made
to support binary searching, etc, to find *partition* boundaries,
which my CREATE INDEX patch doesn't use or care about. This is needed
for tuplesort.c partition-based sorting. When parallel sort isn't just
used by CREATE INDEX, partitioning becomes important. And, with
partitioning, dynamic sampling is essential (I think that this will
end up living in logtape.c).


Yeah. I'm not sure how partitioning and all that would be done here. But 
I'm pretty sure this simpler logtape.c code is easier to work with, for 
partitioning too.


We might want to have each partition on a separate tape, for example, 
and therefore have a lot more tapes than currently. Not wasting memory 
on the tape buffers becomes important in that case.


- Heikki



--
Sent via pgsql-hackers