Re: no library dependency in Makefile?

2017-11-17 Thread 高增琦
I very much look forward to hearing everyone's views on this issue.

If the solution mentioned before is ok, I will start to complete it.

thanks

高增琦 wrote on Thu, Nov 16, 2017 at 20:51:

> LDFLAGS in the example changed to:
>
> '''
> override LDFLAGS := $(call expand_stlibs,$(STLIBS)) $(libpq_pgport) $(LDFLAGS)
> '''
>
> 2017-11-16 20:50 GMT+08:00 高增琦 :
>
>> Is this a problem or not?
>>
>>
>> A simple fix:
>> 1. add a STLIBS variable in Makefiles as a normal prerequisite
>> 
>> 2. use a GNU make function to generate '-Lxxx -lxxx' for the items in STLIBS
>>
>> For example: libpgfeutils.a in psql's Makefile:
>> '''
>> # function to generate '-Lxxx -lxxx'; may be put in another file
>> expand_stlibs = $(patsubst %,-L%,$(dir $(1))) $(patsubst lib%.a,-l%,$(notdir $(1)))
>>
>> # static lib
>> STLIBS := $(top_builddir)/src/fe_utils/libpgfeutils.a
>>
>> # add STLIBS as normal prerequisite
>> psql: $(OBJS) $(STLIBS) | submake-libpq submake-libpgport submake-libpgfeutils
>> $(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
>> '''
>>
>> 2017-11-15 16:10 GMT+08:00 高增琦 :
>>
>>> Hi,
>>>
>>> Recently, I found 'psql' is not rebuilt automatically after
>>> editing 'fe_utils/psqlscan.l'.
>>>
>>> The detail is:
>>> 'psqlscan.l' is part of 'libpgfeutils.a', which is built
>>> into 'psql' statically, but there is no dependency rule between
>>> them.
>>>
>>> It's OK for 'libpq' since 'libpq' is a dynamic library.
>>> For a static library such as 'libpgfeutils.a', should we
>>> add a dependency rule in the Makefile?
>>>
>>> --
>>> GaoZengqi
>>> pgf...@gmail.com
>>> zengqi...@gmail.com
>>>
>>
>>
>>
>> --
>> GaoZengqi
>> pgf...@gmail.com
>> zengqi...@gmail.com
>>
>
>
>
> --
> GaoZengqi
> pgf...@gmail.com
> zengqi...@gmail.com
>
-- 
GaoZengqi
pgf...@gmail.com
zengqi...@gmail.com


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

2017-11-17 Thread Peter Geoghegan
On Fri, Nov 17, 2017 at 12:48 PM, Tom Lane  wrote:
>> I'd like to hear some opinions on the feasibility of this approach.
>
> There is indeed a big planning problem here, but Robert's sketch is
> missing an important component of it: work_mem is not an output of cost
> estimates, it is an *input*.  For example, we can sort or hash-join in
> however much memory you want, but it's going to cost different amounts.
>
> I think what we're actually looking for is to find the breakpoints in
> the cost curve where it thinks it can switch to a different sorting
> or hashing model, and then to emit paths that assume work_mem just
> above each of those breakpoints.  But the code isn't set up like that
> now, not as to either planning or execution.

It might not be that hard to come up with a model for determining
which points on the curve are of interest. It seems easy to do this
for sorting, because it's actually a very simple curve. Once you're in
single pass territory, and provided you're still using at least a few
tens of megabytes of work_mem, the availability of work_mem seems to
only make a very small difference (temp file I/O is still essentially
sequential, and the logarithmic growth in comparisons as more runs
must be merged doesn't really bite you). Perhaps this also means that
you can expect to get away with moderately bad estimates there.

> Another problem with formulating it that way is that it suddenly puts
> a much higher premium on the planner's space estimates being right,
> which is something I don't have much faith in.  For instance, if the
> planner thinks that 1000kB is just enough to hold a hash table, and
> then when we run it we find out that we need a bit more space than that,
> do we really want the executor to switch to a batched join?  Probably not,
> especially not if having set the node's work_mem to 1010kB instead
> would've let it run to completion without batching.  Addressing that
> discrepancy might be where we need the dynamic "emergency memory request"
> mechanism that Peter was postulating.  But I'm not sure exactly how that
> works, because at the point where the executor realizes it's about to
> exceed the original space budget, it generally has little idea how much
> more it would need in order to avoid spilling the sort to disk or
> adding another round of batching.

I don't know either.

I think that it's reasonable for us to make it a goal of the executor
to have operations that have a smooth cost function, in order to
manage the risk of misestimation well, and to make it a goal to have
operations that are otherwise adaptive to misestimation. To a large
degree, my abandoned "quicksort with spillover" design from a couple
of years ago was written with this in mind (it avoided a sudden
discontinuity in the cost function of sort nodes, at the point where
you must spill for the first time). Another example of an adaptive
operation is "role reversal" for hash joins, where the executor flips
the inner and outer side during execution, at a point where it becomes
clear that the optimizer had it backwards, estimation-wise. There are
probably numerous other things like this that are possible...and maybe
even worthwhile.

In summary, I agree that we're going to have big problems if the
planner needs to have very accurate estimates to see a real benefit.
It seems possible that most of the benefit of "fixing work_mem" comes
from avoiding using a woefully inadequate amount of memory where
batching was clearly always going to be necessary. There may be
limited benefit to preventing batching in the first place. So while
there could also be room for an "emergency memory request" mechanism,
it's more of a nice-to-have.

> So it's really unclear to me what either the planner or executor API
> contracts for memory consumption ought to be if we're going to try to do
> this differently.  I agree there's a lot of potential for improvement if
> we can find a better solution, but we're going to need to put some serious
> thought into it.

The devil is in the details, of course. Vladimir said something about
customer issues with sizing work_mem on Google's cloud database
service, and it reminded me of my experiences with this while working
at Heroku. I tended to hear few complaints about it, but then there'd
sometimes be serious customer issues quite suddenly.

My theory is that there can be a turning point where demands on
work_mem increase, and there are suddenly more group aggregates than
hash aggregates (to a lesser extent, there may be fewer hash joins).
Now the database is using group aggregates that are quite a bit slower
than hash aggregates, while still using approximately the same amount
of memory as before. This creates significantly more pressure quite
suddenly, because the slower group aggregates hold on to that memory
for that much longer before releasing it.

I'm mostly concerned about avoiding instability like this. Users
greatly value predictable 

Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

2017-11-17 Thread Peter Geoghegan
On Fri, Nov 17, 2017 at 3:23 PM, Robert Haas  wrote:
> Hmm.  I wonder if you are correct that hashing is the special case.
> Hashing and sorting are of course the two main operations -- but
> there's materialize and anything else that uses a CTE, and maybe other
> stuff I'm not thinking about right now.  You might be right that hash
> is the one where it really matters, but it's probably worth a bit more
> reflection on where it matters most and for what reasons.

I'd rather be approximately correct than precisely wrong. I think that
the current work_mem model is precisely wrong. I'm conscious of the
fact that we are loath to create new GUCs (I sometimes think that
we're a bit too averse to doing so), but maybe there is room for
adding a second work_mem-alike GUC.

For now, I admit that I am applying fuzzy criteria, and that I could
easily have missed an important subtlety. Creating hash_mem instead of
sort_mem is a direction that is entirely debatable, and should
actually be debated. OTOH, it seems like a real problem that we don't
allow hashing to take full advantage of available main memory, and
*some* interim solution seems preferable to what we have now.
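
To make that slightly more concrete: I imagine a hash_mem entry would look
much like work_mem's existing entry in guc.c's ConfigureNamesInt[]. What
follows is only a sketch -- hash_mem and its backing variable don't exist,
and the wording, default, and limits are placeholders:

    {
        /*
         * Hypothetical sketch only: hash_mem does not exist.  This just
         * mirrors the existing work_mem entry in guc.c.
         */
        {"hash_mem", PGC_USERSET, RESOURCES_MEM,
            gettext_noop("Sets the maximum memory to be used for hash tables."),
            gettext_noop("Hash-based operations (e.g. hash join, hash "
                         "aggregation) would consult this instead of work_mem."),
            GUC_UNIT_KB
        },
        &hash_mem,                  /* hypothetical int variable */
        4096, 64, MAX_KILOBYTES,
        NULL, NULL, NULL
    },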

-- 
Peter Geoghegan



Re: default range partition and constraint exclusion

2017-11-17 Thread Robert Haas
On Fri, Nov 17, 2017 at 12:57 AM, Amit Langote
 wrote:
> While working on the patch for partition pruning for declarative
> partitioned tables, I noticed that default range partition will fail to be
> included in a plan in certain cases due to pruning by constraint exclusion.
> you'll notice that it doesn't explicitly say that the default partition
> allows rows where a is null or b is null or both are null.  Given that,
> constraint exclusion will end up concluding that the default partition's
> constraint is refuted by a = 2.
>
> The attached will make the constraint look like:

Uh, if the constraint exclusion logic we're using is drawing false
conclusions, we need to fix it so it doesn't, not change the
constraint so that the wrong logic gives the right answer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

2017-11-17 Thread Robert Haas
On Fri, Nov 17, 2017 at 1:22 PM, Peter Geoghegan  wrote:
> Right. The ability for sorts to do well with less memory is really
> striking these days. And though I didn't mean to seriously suggest it,
> a hash_mem GUC does seem like it solves some significant problems
> without much risk. I think it should be hash_mem, not sort_mem,
> because hashing seems more like the special case among operations that
> consume work_mem, and because sort_mem is already the old name for
> work_mem that is still accepted as a work_mem alias, and because
> hash_mem avoids any confusion about whether or not CREATE INDEX uses
> the new GUC (it clearly does not).

Hmm.  I wonder if you are correct that hashing is the special case.
Hashing and sorting are of course the two main operations -- but
there's materialize and anything else that uses a CTE, and maybe other
stuff I'm not thinking about right now.  You might be right that hash
is the one where it really matters, but it's probably worth a bit more
reflection on where it matters most and for what reasons.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Hash take II

2017-11-17 Thread Andres Freund
Hi,

On 2017-11-14 01:30:30 +1300, Thomas Munro wrote:
> New patch attached.

0002-Add-a-barrier-primitive-for-synchronizing-backends.patch
- Intro starts with:
+ *
+ * This implementation of barriers allows for static sets of participants
+ * known up front, or dynamic sets of participants which processes can join or
  that's a bit low-level, no? Should explain what they're for, not
  everyone's going to be familiar with the concept.

- The comment for BarrierInit() reads a bit weird:
+ * Initialize this barrier, setting a static number of participants that we
+ * will wait for at each computation phase.  To use a dynamic number of
+ * participants, this number should be zero, and BarrierAttach and
  Set a static number! Except maybe not?

- I'm not entirely convinced that we want the barrier debug stuff
  merged; what are your feelings about that?  It's like half the code,
  and adds some complexity to the non-debug code... If we indeed want to
  keep it, it probably should be documented in config.sgml? And get an
  entry in pg_config_manual.h?

- Can we add assertions ensuring nobody attaches to a static barrier?

- If I understand correctly, currently the first participant to arrive at
  a barrier is going to be selected, and the last one wakes everyone
  up. Wouldn't it be better to have the last-arrived participant be
  selected? It already got a scheduler timeslice, its memory is most
  likely to be in cache, etc. Given that in cases where selection plays a
  role it's going to be blocking everyone else, using the process most
  likely to finish first seems like a good idea.  I guess the
  BarrierDetach() implementation would be slightly more complex?
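
For reference, here is roughly how I read the intended usage, using only the
names visible in the patch (BarrierInit, BarrierAttach, BarrierArriveAndWait,
BarrierDetach) -- a minimal sketch, and the details may well differ from what
finally gets committed:

#include "storage/barrier.h"    /* header added by the patch; path assumed */

/*
 * Sketch only.  A "static" barrier is initialized with the participant count
 * known up front, e.g. BarrierInit(&barrier, nparticipants); a "dynamic" one
 * starts at zero and processes call BarrierAttach()/BarrierDetach() as they
 * come and go.  Either way, each phase looks something like this:
 */
static void
run_phase(Barrier *barrier)
{
    /* ... each participant does its share of the current phase ... */

    if (BarrierArriveAndWait(barrier, 0))   /* wait_event_info elided */
    {
        /*
         * Exactly one participant is "elected" here (currently the first to
         * arrive, per the comment above) and can do any serial work for the
         * phase before everyone proceeds.
         */
    }

    /* on return, all participants have entered the next phase together */
}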


Regards,

Andres



Re: [HACKERS] pg_upgrade to clusters with a different WAL segment size

2017-11-17 Thread Jeremy Schneider
On Mon, Nov 13, 2017 at 3:09 PM, Tom Lane  wrote:
>> On Tue, Nov 14, 2017 at 7:32 AM, Andres Freund  wrote:
>>> Even if that's the case, I fail to see why it'd be a good idea to have
>>> any sort of pg_upgrade integration here.  We should make pg_resetwal's
>>> checks for this good enough, and not conflate something unrelated with
>>> pg_upgrade goals.
>
> FWIW, I agree with Andres' position here.  I think the charter of
> pg_upgrade is to duplicate the old cluster as closely as it can,
> not to modify its configuration.  A close analogy is that it does not
> attempt to upgrade extension versions while migrating the cluster.

Having pg_upgrade simply allow an upgrade where the WAL sizes don't match
between the old cluster and the new cluster seems fairly trivial.  pg_upgrade
isn't changing anything; it's just *not* requiring the new and old clusters
to match in this way.  I'll admit I'm much newer to PostgreSQL than everyone
else on this list, but I haven't yet thought of any technical reason this
check is actually needed.  (Just the WAL sequencing/naming concern I outlined
earlier.)

Writing code to change the WAL size on an existing cluster will be a little
more complex.  We will need to delete all WAL files and recreate them at the
new sizes.  We will need to either (1) be absolutely sure that there was a
clean shutdown or (2) copy WAL data from the old files to the new files.  We
will need to think harder through the issue of gaps being introduced in the
sequence of WAL filenames and effects on backup/recovery.

These two ideas don't seem mutually exclusive to me.  If pg_upgrade allows
working with different WAL sizes, it doesn't mean we can't still introduce a
utility to change the WAL size on running clusters.

So yeah - having a utility command to change the WAL size on a running cluster
is a great idea.  But why are we wanting to maintain a check in pg_upgrade
which doesn't actually seem technically necessary?  Or am I missing something
that would break if the WAL sizes didn't match across two clusters when
pg_upgrade moved the data between them?

-Jeremy

-- 
http://about.me/jeremy_schneider

(Disclaimer: I work for AWS and my opinions are my own.)



Re: Fix number skipping in to_number

2017-11-17 Thread Tom Lane
I wrote:
> That leads me to the attached patch.  There is more that could be done
> here --- in particular, I'd like to see the character-not-byte-count
> rule extended to literal text.  But that seems like fit material for
> a different patch.

Attached is a patch that makes formatting.c more multibyte-aware;
it now handles multibyte characters as single NODE_TYPE_CHAR format
nodes, rather than one node per byte.  This doesn't really have much
impact on the output (to_char) side, but on the input side, it
greatly simplifies treating such characters as single characters
rather than multiple ones.  An example is that (in UTF8 encoding)
previously we got

u8=# select to_number('$12.34', '€99D99');
 to_number 
---
  0.34
(1 row)

because the literal euro sign is 3 bytes long and was thought to be
3 literal characters.  Now we get the expected result

u8=# select to_number('$12.34', '€99D99');
 to_number 
---
 12.34
(1 row)

Aside from skipping 1 input character (not byte) per literal format
character, I fixed the SKIP_THth macro, allowing to_date/to_timestamp to
also follow the rule of skipping whole characters not bytes for non-data
format patterns.  There might be some other places that need similar
adjustments, but I couldn't find any.

Not sure about whether/how to add regression tests for this; it's really
impossible to add specific tests in an ASCII-only test file.  Maybe we
could put a test or two into collate.linux.utf8.sql, but it seems a bit
off topic for that, and I think that test file hardly gets run anyway.

Note this needs to be applied over the patch I posted at
https://postgr.es/m/3626.1510949...@sss.pgh.pa.us
I intend to commit that fairly soon, but it's not in right now.

regards, tom lane

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index cb0dbf7..ec97de0 100644
*** a/src/backend/utils/adt/formatting.c
--- b/src/backend/utils/adt/formatting.c
*** typedef enum
*** 151,158 
  	FROM_CHAR_DATE_ISOWEEK		/* ISO 8601 week date */
  } FromCharDateMode;
  
- typedef struct FormatNode FormatNode;
- 
  typedef struct
  {
  	const char *name;
--- 151,156 
*** typedef struct
*** 162,174 
  	FromCharDateMode date_mode;
  } KeyWord;
  
! struct FormatNode
  {
! 	int			type;			/* node type			*/
! 	const KeyWord *key;			/* if node type is KEYWORD	*/
! 	char		character;		/* if node type is CHAR		*/
! 	int			suffix;			/* keyword suffix		*/
! };
  
  #define NODE_TYPE_END		1
  #define NODE_TYPE_ACTION	2
--- 160,172 
  	FromCharDateMode date_mode;
  } KeyWord;
  
! typedef struct
  {
! 	int			type;			/* NODE_TYPE_XXX, see below */
! 	const KeyWord *key;			/* if type is ACTION */
! 	char		character[MAX_MULTIBYTE_CHAR_LEN + 1];	/* if type is CHAR */
! 	int			suffix;			/* keyword prefix/suffix code, if any */
! } FormatNode;
  
  #define NODE_TYPE_END		1
  #define NODE_TYPE_ACTION	2
*** parse_format(FormatNode *node, const cha
*** 1282,1293 
  		}
  		else if (*str)
  		{
  			/*
  			 * Process double-quoted literal string, if any
  			 */
  			if (*str == '"')
  			{
! while (*(++str))
  {
  	if (*str == '"')
  	{
--- 1280,1294 
  		}
  		else if (*str)
  		{
+ 			int			chlen;
+ 
  			/*
  			 * Process double-quoted literal string, if any
  			 */
  			if (*str == '"')
  			{
! str++;
! while (*str)
  {
  	if (*str == '"')
  	{
*** parse_format(FormatNode *node, const cha
*** 1297,1307 
  	/* backslash quotes the next character, if any */
  	if (*str == '\\' && *(str + 1))
  		str++;
  	n->type = NODE_TYPE_CHAR;
! 	n->character = *str;
  	n->key = NULL;
  	n->suffix = 0;
  	n++;
  }
  			}
  			else
--- 1298,1311 
  	/* backslash quotes the next character, if any */
  	if (*str == '\\' && *(str + 1))
  		str++;
+ 	chlen = pg_mblen(str);
  	n->type = NODE_TYPE_CHAR;
! 	memcpy(n->character, str, chlen);
! 	n->character[chlen] = '\0';
  	n->key = NULL;
  	n->suffix = 0;
  	n++;
+ 	str += chlen;
  }
  			}
  			else
*** parse_format(FormatNode *node, const cha
*** 1312,1323 
   */
  if (*str == '\\' && *(str + 1) == '"')
  	str++;
  n->type = NODE_TYPE_CHAR;
! n->character = *str;
  n->key = NULL;
  n->suffix = 0;
  n++;
! str++;
  			}
  		}
  	}
--- 1316,1329 
   */
  if (*str == '\\' && *(str + 1) == '"')
  	str++;
+ chlen = pg_mblen(str);
  n->type = NODE_TYPE_CHAR;
! memcpy(n->character, str, chlen);
! n->character[chlen] = '\0';
  n->key = NULL;
  n->suffix = 0;
  n++;
! str += chlen;
  			}
  		}
  	}
*** dump_node(FormatNode *node, int max)
*** 1349,1355 
  			elog(DEBUG_elog_output, "%d:\t NODE_TYPE_ACTION '%s'\t(%s,%s)",
   a, n->key->name, DUMP_THth(n->suffix), DUMP_FM(n->suffix));
  		else if 

Re: [HACKERS] Parallel Hash take II

2017-11-17 Thread Peter Geoghegan
On Fri, Nov 17, 2017 at 1:55 PM, Andres Freund  wrote:
> Looking at 0005-Add-infrastructure-for-sharing-temporary-files-betwe.patch:
> - The created path/filenames seem really redundant:
>   base/pgsql_tmp/pgsql_tmp11160.9.sharedfileset.d/pgsql_tmp.o3of8.p0.0
>
>   Including pgsql_tmp no less than three times seems a bit absurd.
>
>   I'm quite inclined to just remove all but the first.

+1

> - It's not clear to me why it's correct to have the vfdP->fdstate & 
> FD_TEMPORARY
>   handling in FileClose() be independent of the file being deleted. At
>   the very least there needs to be a comment explaining why we chose
>   that behaviour.

Isn't that just because only one backend is supposed to delete the
file, but they all must close it and do temp_file_limit accounting?
Sorry if I missed something (my explanation seems too obvious to be
correct).

> - I think we need to document somewhere that the temp_file_limit in a
>   shared file set applies independently for each participant that's
>   writing something.  We also should discuss whether that's actually
>   sane behaviour.

This is already the documented behavior of temp_file_limit, fwiw.

Another question for Thomas: Is it okay that routines like
BufFileOpenShared() introduce new palloc()s (not repalloc()s) to
buffile.c, given that struct BufFile already contains this?:

/*
 * resowner is the ResourceOwner to use for underlying temp files.  (We
 * don't need to remember the memory context we're using explicitly,
 * because after creation we only repalloc our arrays larger.)
 */
ResourceOwner resowner;

Maybe we need to remember the original caller's memory context, too?
Either that, or the contract/comments for memory contexts need to be
revised.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel Hash take II

2017-11-17 Thread Andres Freund
Hi,

On 2017-11-14 01:30:30 +1300, Thomas Munro wrote:
> New patch attached.

(I've commit some of the preliminary work)

Looking at 0005-Add-infrastructure-for-sharing-temporary-files-betwe.patch:
- The created path/filenames seem really redundant:
  base/pgsql_tmp/pgsql_tmp11160.9.sharedfileset.d/pgsql_tmp.o3of8.p0.0

  Including pgsql_tmp no less than three times seems a bit absurd.

  I'm quite inclined to just remove all but the first.

- There seems to be a moment where we could leak temporary file
  directories:
File
SharedFileSetCreate(SharedFileSet *fileset, const char *name)
{
    char        path[MAXPGPATH];
    File        file;

    SharedFilePath(path, fileset, name);
    file = PathNameCreateTemporaryFile(path, false);

    /* If we failed, see if we need to create the directory on demand. */
    if (file <= 0)
    {
        char        tempdirpath[MAXPGPATH];
        char        filesetpath[MAXPGPATH];
        Oid         tablespace = ChooseTablespace(fileset, name);

        TempTablespacePath(tempdirpath, tablespace);
        SharedFileSetPath(filesetpath, fileset, tablespace);
        PathNameCreateTemporaryDir(tempdirpath, filesetpath);
        file = PathNameCreateTemporaryFile(path, true);
    }

    return file;
}

  The resowner handling is done in PathNameCreateTemporaryFile(). But if
  we fail after creating the directory we'll not have a resowner for
  that. That's probably not too bad.

- related to the last point, I'm kinda wondering why we need sub-fileset
  resowners? Given we're dealing with error paths in resowners I'm not
  quite seeing the point - we're not going to want to roll back
  sub-parts of a fileset, no?

- If we want to keep these resowners, shouldn't we unregister them in
  PathNameDeleteTemporaryFile?

- PathNameCreateTemporaryFile() and OpenTemporaryFile() now overlap
  quite a bit. Can't we rejigger things to base the second on the first?
  At the very least the comments need to work out the difference more
  closely.

- It's not clear to me why it's correct to have the vfdP->fdstate & FD_TEMPORARY
  handling in FileClose() be independent of the file being deleted. At
  the very least there needs to be a comment explaining why we chose
  that behaviour.

- I think we need to document somewhere that the temp_file_limit in a
  shared file set applies independently for each participant that's
  writing something.  We also should discuss whether that's actually
  sane behaviour.

Greetings,

Andres Freund



Re: PG10.1 autovac killed building extended stats

2017-11-17 Thread Justin Pryzby
On Fri, Nov 17, 2017 at 01:36:00PM -0300, Alvaro Herrera wrote:
> Justin Pryzby wrote:
> 
> > Core was generated by `postgres: autovacuum worker process   gtt '.
> > Program terminated with signal 11, Segmentation fault.
> > #0  statext_ndistinct_build (totalrows=300, numrows=300, rows=0x21df3e8,
> > attrs=<optimized out>, stats=0x0) at mvdistinct.c:103
> > 103                 item->attrs = bms_add_member(item->attrs,
> > 
> > (gdb) p stats
> > $5 = (VacAttrStats **) 0x0
> > => This is an error, no ??
> 
> Not necessarily, but then I think this previous code is busted:

> If I recall things correctly, the "continue" should be executed
> regardless of IsAutoVacuumWorkerProcess() (only the log should be
> conditional on that).  I'm not sure how I arrived at the current coding,
> which I added in bf2a691e02d7 -- probably just fuzzy thinking.

Is it useful to run with that change?

Justin



Re: pspg - psql pager

2017-11-17 Thread Pavel Stehule
2017-11-15 10:41 GMT+01:00 Pavel Stehule :

> Hi
>
> I finished a new pager that may be interesting for PostgreSQL expert users.
>
> It demonstrates some possibilities of a TUI, and I hope it shows some
> possible directions for future capabilities of psql.
>
> It is available as an rpm from our repository, or you can get the source
> code from GitHub:
>
> https://github.com/okbob/pspg
>
> I invite any discussion about this project.
>
> Pavel
>
> p.s. I hope this pager is useful - I know some users who have used it
> intensively for a few months. But the source code is of proof-of-concept
> quality; it should be cleaned up next year.
>

I know it is not related to Postgres, but I made some fixes on master,
and pspg can now be used with the mysql client too.

Regards

Pavel


Re: bgwriter_lru_maxpages range in postgresql.conf

2017-11-17 Thread Robert Haas
On Thu, Nov 16, 2017 at 4:40 PM, Jeff Janes  wrote:
> I changed it to "0 disables" to be more consistent with other locations.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [COMMITTERS] pgsql: Add hash partitioning.

2017-11-17 Thread Robert Haas
On Thu, Nov 16, 2017 at 9:37 AM, amul sul  wrote:
> Fixed in the 001 patch.
>
> IMHO, this function is not meant for a user, so instead of ereport() can't
> we simply return false?  FWIW, I have attached a 003 patch which replaces all
> ereport()s with return false.

I don't think just returning false is very helpful behavior, because
the user may not realize that the problem is that the function is
being called incorrectly rather than that the call is correct and the
answer is false.

I took your 0001 patch and made extensive modifications to it.  I
replaced your regression tests from 0002 with a new set that I wrote
myself.  The result is attached here.  This version makes different
decisions about how to handle the various problem cases than you did;
it returns NULL for a NULL input or an OID for which no relation
exists, and throws specific error messages for the other cases,
matching the parser error messages that CREATE TABLE would issue where
possible, but with a different error code.  It also checks that the
types match (which your patch did not, and which I'm fairly sure could
crash the server), makes the function work when invoked using the
explicit VARIADIC syntax (which seems fairly useless here but there's
no in-core precedent for a variadic function which doesn't support that
case), and fixes the function header comment to describe the behavior
we committed rather than the older behavior you had in earlier patch
versions.

As far as I can tell, this should nail things down pretty tight, but
it would be great if someone can try to find a case where it still
breaks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0001-Fix-multiple-problems-with-satisfies_hash_partition.patch
Description: Binary data


Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

2017-11-17 Thread Peter Geoghegan
On Fri, Nov 17, 2017 at 7:31 AM, Robert Haas  wrote:
> On Thu, Nov 16, 2017 at 11:50 AM, Serge Rielau  wrote:
>>
>> Just as you have, we have also considered holistic provisioning of work_mem 
>> across all consumers, but we find that to be too complex.
>> Having an “emergency fund” in shared memory is also an option, but I find it 
>> too limiting.
>
>
> I agree.

Yeah. I suspect that that idea is not ambitious enough to do a lot of
what we want, and yet is too ambitious to justify working on given its
limited shelf life.

> I think this is basically a planning problem.  For example, say we wanted to 
> have work_mem_per_query instead of work_mem_per_node.  There is an obvious 
> design: consider memory use as an independent dimension of merit during path 
> generation and comparison (less is better).  Discard candidate paths whose 
> memory use exceeds the work_mem_per_query budget unless there are no other 
> alternatives.  At the end of planning, pick the cheapest path that survived 
> the memory-budget filter.  Now, this has the problem that it would make 
> planning more expensive (because we'd hang on to more paths for longer) but 
> it solves a lot of other problems.  If there's no memory pressure, we can use 
> memory like mad even when it doesn't save much, but when we have to pick 
> between using more memory for one part of the plan and using more memory for 
> another part of the plan, the choice that does the best job reducing overall 
> execution time will win.  Awesome.

I'd like to hear some opinions on the feasibility of this approach.
Does David have anything to say about it, for example?

> We could also do more localized variants of this that don't provide hard 
> guarantees but do tend to avoid squandering resources.

That sounds like independent work, though it could be very useful.

> Yet another thing we could do is to try to get nodes to voluntarily use less 
> than work_mem when possible.  This is particularly an issue for sorts.  A 
> 2-batch hash join is so much more expensive than a single-batch hash join 
> that it's almost never going to make sense unless we have no realistic 
> alternative, although I suppose a 64-batch hash join might be not that 
> different from a 32-batch hash join.  But for sorts, given all Peter's work 
> in this area, I bet there are a lot of sorts that could budget a quarter or 
> less of work_mem and really not be hurt very much.  It depends somewhat on 
> how fast and how contended your I/O is, though, which we don't have an 
> especially good way to model.  I'm starting to wonder if that sort_mem GUC 
> might be a good idea... use that for sorts, and keep work_mem for everything 
> else.

Right. The ability for sorts to do well with less memory is really
striking these days. And though I didn't mean to seriously suggest it,
a hash_mem GUC does seem like it solves some significant problems
without much risk. I think it should be hash_mem, not sort_mem,
because hashing seems more like the special case among operations that
consume work_mem, and because sort_mem is already the old name for
work_mem that is still accepted as a work_mem alias, and because
hash_mem avoids any confusion about whether or not CREATE INDEX uses
the new GUC (it clearly does not).

Since I am primarily concerned about the difference in sensitivity to
the availability of memory that exists when comparing sorting and
hashing, and since a new GUC seems like it would noticeably improve
matters, I am beginning to take the idea of writing a hash_mem patch
for v11 seriously.

-- 
Peter Geoghegan



Re: Speed up the removal of WAL files

2017-11-17 Thread Fujii Masao
On Fri, Nov 17, 2017 at 5:20 PM, Tsunakawa, Takayuki
 wrote:
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
>> The original code recycles some of the to-be-removed files, but the patch
>> removes all the victims.  This may impact performance.
>
> Yes, I noticed it after submitting the patch and was wondering what to do.  
> Thinking simply, I think it would be just enough to replace 
> durable_unlink/durable_rename in RemoveXLogFile() with unlink/rename, and 
> sync the pg_wal directory once in RemoveNonParentXlogFiles() and 
> RemoveOldXlogFiles().  This will benefit the failover time when fast 
> promotion is not performed.  What do you think?

It does not seem like a good idea to just replace durable_rename() with rename()
in RemoveOldXlogFiles()->RemoveXlogFiles()->InstallXLogFileSegment(),
because that change could cause the following problem:

1. Checkpoint calls RemoveOldXlogFiles().
2. It recycles the WAL file AAA to BBB. The pg_wal directory has not been fsync'd yet.
3. Another transaction (Tx1) writes its WAL data into the (recycled) file BBB.
4. CRASH and RESTART.
5. The WAL file BBB disappears and you can see AAA,
   but AAA is not used in recovery. This loses the WAL written by
   transaction Tx1.

Regards,

-- 
Fujii Masao



Re: [HACKERS] Consistently catch errors from Python _New() functions

2017-11-17 Thread Tom Lane
Peter Eisentraut  writes:
> While reviewing some unrelated code, I noticed that we are handling
> error conditions from Python API functions such as PyList_New() and
> PyDict_New() in pretty random ways or not at all.  Here is a patch to
> fix that.

This needs some adjustment in the wake of 687f096ea.

I'm confused by the places that change PLy_elog calls to pass NULL:

-   PLy_elog(ERROR, "could not create globals");
+   PLy_elog(ERROR, NULL);

How is that an improvement?  Otherwise it looks reasonable.

regards, tom lane



Re: Fix number skipping in to_number

2017-11-17 Thread Tom Lane
I wrote:
> That leads me to the attached patch.  There is more that could be done
> here --- in particular, I'd like to see the character-not-byte-count
> rule extended to literal text.  But that seems like fit material for
> a different patch.

Hearing no further comments, I pushed that patch.

regards, tom lane



Re: PG10.1 autovac crashed building extended stats

2017-11-17 Thread Justin Pryzby
On Fri, Nov 17, 2017 at 01:27:49PM -0300, Alvaro Herrera wrote:
> Justin Pryzby wrote:
> > After adding extended/MV stats to a few of our tables a few days ago, it
> > looks like I hadn't been paying attention and this first crashed 2 nights
> > ago.  Why at 1am?  Not sure.  I have a "reindex" job which runs at 1am and
> > a vacuum/analyze job which runs at 2am, but I don't use cron to change
> > autovac/analyze thresholds.
> 
> Can you please show the definition of the table and of the extended
> stats?

gtt=# SELECT stxrelid::regclass, stxname, stxkind FROM pg_statistic_ext ORDER BY 1;
                     stxrelid                     |                          stxname                           | stxkind
--------------------------------------------------+-------------------------------------------------------------+---------
 daily_umts_eric_cell_traffic_hs_view_201603      | daily_umts_eric_cell_traffic_hs_view_201603_key_stats      | {d,f}
 daily_umts_eric_cell_traffic_hs_eul_view_201603  | daily_umts_eric_cell_traffic_hs_eul_view_201603_key_stats  | {d,f}
 daily_eric_umts_rnc_utrancell_view_201603        | daily_eric_umts_rnc_utrancell_view_201603_key_stats        | {d,f}
 daily_umts_eric_cell_traffic_hs_view_201504      | daily_umts_eric_cell_traffic_hs_view_201504_key_stats      | {d,f}
 daily_umts_eric_cell_traffic_hs_eul_view_201504  | daily_umts_eric_cell_traffic_hs_eul_view_201504_key_stats  | {d,f}
 daily_eric_umts_rnc_utrancell_view_201504        | daily_eric_umts_rnc_utrancell_view_201504_key_stats        | {d,f}
 daily_enodeb_ncell_view_201603                   | daily_enodeb_ncell_view_201603_key_stats                   | {d,f}
 daily_enodeb_ncell_view_201503                   | daily_enodeb_ncell_view_201503_key_stats                   | {d,f}
 daily_enodeb_ncell_view_201502                   | daily_enodeb_ncell_view_201502_key_stats                   | {d,f}
 daily_enodeb_ncell_view_201501                   | daily_enodeb_ncell_view_201501_key_stats                   | {d,f}
 daily_enodeb_baseband_view_201603                | daily_enodeb_baseband_view_201603_key_stats                | {d,f}
 daily_enodeb_baseband_view_201503                | daily_enodeb_baseband_view_201503_key_stats                | {d,f}
 daily_enodeb_baseband_view_201502                | daily_enodeb_baseband_view_201502_key_stats                | {d,f}
 daily_enodeb_baseband_view_201501                | daily_enodeb_baseband_view_201501_key_stats                | {d,f}
 daily_enodeb_cell_view_201603                    | daily_enodeb_cell_view_201603_key_stats                    | {d,f}
 daily_enodeb_cell_view_201502                    | daily_enodeb_cell_view_201502_key_stats                    | {d,f}
 daily_enodeb_201603                              | daily_enodeb_201603_key_stats                              | {d,f}
 daily_enodeb_201503                              | daily_enodeb_201503_key_stats                              | {d,f}
 daily_enodeb_201502                              | daily_enodeb_201502_key_stats                              | {d,f}
 daily_enodeb_201501                              | daily_enodeb_201501_key_stats                              | {d,f}
 daily_enodeb_cell_view_201710                    | x                                                          | {d,f}
 daily_cdr_pstn_user_201711                       | daily_cdr_pstn_user_201711_key_stats                       | {d,f}
 daily_umts_eric_cell_traffic_hs_eul_view_201711  | daily_umts_eric_cell_traffic_hs_eul_view_201711_key_stats  | {d,f}
 daily_umts_eric_cell_traffic_hs_view_201711      | daily_umts_eric_cell_traffic_hs_view_201711_key_stats      | {d,f}
 daily_eric_umts_rnc_utrancell_view_201711        | daily_eric_umts_rnc_utrancell_view_201711_key_stats        | {d,f}
 daily_enodeb_baseband_view_201711                | daily_enodeb_baseband_view_201711_key_stats                | {d,f}
 daily_enodeb_cell_view_201711                    | daily_enodeb_cell_view_201711_key_stats                    | {d,f}
 daily_enodeb_ncell_view_201711                   | daily_enodeb_ncell_view_201711_key_stats                   | {d,f}
 daily_enodeb_201711                              | daily_enodeb_201711_key_stats                              | {d,f}
(29 rows)

Here's the table which was 1) reindexed (including its toast) and 2)
autovacuumed (crashed):

gtt=# \d+ daily_enodeb_baseband_view_201711
              Table "public.daily_enodeb_baseband_view_201711"
   Column   |   Type   | Collation | Nullable | Default | Storage | Stats target | Description
 -----------+----------+-----------+----------+---------+---------+--------------+-------------
  device_id | smallint |           | not null |         | plain   | 400          |
  site_id   | smallint

Re: PG10.1 autovac killed building extended stats

2017-11-17 Thread Alvaro Herrera
Justin Pryzby wrote:

> Core was generated by `postgres: autovacuum worker process   gtt '.
> Program terminated with signal 11, Segmentation fault.
> #0  statext_ndistinct_build (totalrows=300, numrows=300, rows=0x21df3e8,
> attrs=<optimized out>, stats=0x0) at mvdistinct.c:103
> 103                 item->attrs = bms_add_member(item->attrs,
> 
> (gdb) p stats
> $5 = (VacAttrStats **) 0x0
> => This is an error, no ??

Not necessarily, but then I think this previous code is busted:

    /*
     * Check if we can build these stats based on the column analyzed. If
     * not, report this fact (except in autovacuum) and move on.
     */
    stats = lookup_var_attr_stats(onerel, stat->columns,
                                  natts, vacattrstats);
    if (!stats && !IsAutoVacuumWorkerProcess())
    {
        ereport(WARNING,
                (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
                 errmsg("statistics object \"%s.%s\" could not be computed for relation \"%s.%s\"",
                        stat->schema, stat->name,
                        get_namespace_name(onerel->rd_rel->relnamespace),
                        RelationGetRelationName(onerel)),
                 errtable(onerel)));
        continue;
    }

If I recall things correctly, the "continue" should be executed
regardless of IsAutoVacuumWorkerProcess() (only the log should be
conditional on that).  I'm not sure how I arrived at the current coding,
which I added in bf2a691e02d7 -- probably just fuzzy thinking.
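
In other words, something along these lines (a sketch of the control flow I
mean, not a tested fix):

    stats = lookup_var_attr_stats(onerel, stat->columns,
                                  natts, vacattrstats);
    if (!stats)
    {
        /* warn only outside autovacuum, but always skip this stats object */
        if (!IsAutoVacuumWorkerProcess())
            ereport(WARNING,
                    (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
                     errmsg("statistics object \"%s.%s\" could not be computed for relation \"%s.%s\"",
                            stat->schema, stat->name,
                            get_namespace_name(onerel->rd_rel->relnamespace),
                            RelationGetRelationName(onerel)),
                     errtable(onerel)));
        continue;
    }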

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PG10.1 autovac killed building extended stats

2017-11-17 Thread Alvaro Herrera
Justin Pryzby wrote:
> After adding extended/MV stats to a few of our tables a few days ago, it looks
> like I hadn't been paying attention and this first crashed 2 nights ago.  Why
> at 1am?  Not sure.  I have a "reindex" job which runs at 1am and a
> vacuum/analyze job which runs at 2am, but I don't use cron to change
> autovac/analyze thresholds.

Can you please show the definition of the table and of the extended
stats?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

2017-11-17 Thread Vladimir Rusinov
On Fri, Nov 17, 2017 at 3:31 PM, Robert Haas  wrote:

> On Thu, Nov 16, 2017 at 11:50 AM, Serge Rielau  wrote:
>
>> Just as you have, we have also considered holistic provisioning of
>> work_mem across all consumers, but we find that to be too complex.
>> Having an “emergency fund” in shared memory is also an option, but I find
>> it too limiting.
>>
>
> I agree.
>
> I think this is basically a planning problem.  For example, say we wanted
> to have work_mem_per_query instead of work_mem_per_node.  There is an
> obvious design: consider memory use as an independent dimension of merit
> during path generation and comparison (less is better).  Discard candidate
> paths whose memory use exceeds the work_mem_per_query budget unless there
> are no other alternatives.  At the end of planning, pick the cheapest path
> that survived the memory-budget filter.  Now, this has the problem that it
> would make planning more expensive (because we'd hang on to more paths for
> longer) but it solves a lot of other problems.  If there's no memory
> pressure, we can use memory like mad even when it doesn't save much, but
> when we have to pick between using more memory for one part of the plan and
> using more memory for another part of the plan, the choice that does the
> best job reducing overall execution time will win.  Awesome.
>
> We could also do more localized variants of this that don't provide hard
> guarantees but do tend to avoid squandering resources.  I don't think that
> we can directly incorporate memory use into cost because that will distort
> the costs of higher-level nodes in the plan tree; cost needs to mean
> execution time.  However, what we could do is refuse to replace a more
> expensive path in a relation's path list with a cheaper one when the
> savings are small and the cheaper path uses a lot more memory.  That way,
> you wouldn't replace a nested loop that costs a million units with a hash
> join that costs 999,999 units but uses a GB of RAM; you'd save the hash
> join for cases where we think it will help significantly.
>
> Yet another thing we could do is to try to get nodes to voluntarily use
> less than work_mem when possible.  This is particularly an issue for
> sorts.  A 2-batch hash join is so much more expensive than a single-batch
> hash join that it's almost never going to make sense unless we have no
> realistic alternative, although I suppose a 64-batch hash join might be not
> that different from a 32-batch hash join.  But for sorts, given all Peter's
> work in this area, I bet there are a lot of sorts that could budget a
> quarter or less of work_mem and really not be hurt very much.  It depends
> somewhat on how fast and how contended your I/O is, though, which we don't
> have an especially good way to model.  I'm starting to wonder if that
> sort_mem GUC might be a good idea... use that for sorts, and keep work_mem
> for everything else.
>
> If we really want to be able to dynamically react to change memory
> conditions, what we need is a forest of plans for a given query rather than
> just one.  Pick plan A if memory is limited, otherwise pick B.  Or use
> admission control.
>

FWIW, the lack of a per-connection and/or global memory limit for work_mem is
a major PITA when running a shared and/or large-scale setup.

Currently we are doing a poor job with the work_mem parameter because we
don't have a good way to let our customers increase it without also giving
them the ability to shoot themselves in the foot.
Even a simple parameter limiting the global total number of work_mem buffers
would help here.

-- 
Vladimir Rusinov
PostgreSQL SRE, Google Ireland

Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047




Re: Inlining functions with "expensive" parameters

2017-11-17 Thread Andres Freund


On November 17, 2017 5:15:57 AM PST, Paul Ramsey  wrote:
>On Thu, Nov 16, 2017 at 12:37 PM, Andres Freund  wrote:
>
>> Hi,
>>
>> On 2017-11-16 15:27:59 -0500, Tom Lane wrote:
>> > Andres Freund  writes:
>> > > On November 16, 2017 11:44:52 AM PST, Tom Lane wrote:
>> > >> Yeah, there's no mechanism like that now, but there could be.
>> >
>> > > Right, but it doesn't sound that hard to introduce. Basically there'd
>> > > need to be a WithParamValue node, that first evaluates parameters and
>> > > then executes the child expression. I'm thinking of doing this
>> > > hierarchically so there's less issues with the setting of the param
>> > > value being moved away from the child expression using it.
>> >
>> > Yeah.  If you also gave it the ability to optionally enforce strictness
>> > (ie, skip child expr and return NULL if any param is NULL) then we could
>> > do away with all of the parameter-based restrictions on inlining, since
>> > the semantics of parameter eval wouldn't change by inlining.
>>
>> Yep. And we quite easily can have a fastpath execution step that skips
>> these checks if not needed.
>>
>> I suspect there might still be some cases where it's worth either using
>> the current parameter replacement, or relying on eval_const_expressions
>> based infrastructure to directly "inline" reference parameters if safe.
>> The latter seems a bit nicer / more extensible.
>>
>> > I might be showing my grad school background here, but I'd be inclined
>> > to call it a LambdaExpr.  A name based on "with" would be fine in a
>> > green field, but IMO we've got too much baggage from nodes related to
>> > SQL WITH.
>>
>> That'd work as well for me.
>>
>
>Is there a github branch or an active patch set where this work is going on
>that I could watch and learn from?
>Thanks!

Oh, I wasn't planning to work on this anytime soon. From my end this was just 
spitballing what a solution might look like.

If you want to take a stab...
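
Purely as spitballing, the kind of node being discussed might look vaguely
like this -- nothing here exists, and every field is a guess:

typedef struct LambdaExpr
{
    Expr        xpr;
    List       *params;        /* PARAM_EXEC numbers to assign */
    List       *args;          /* expressions computing the parameter values */
    Expr       *expr;          /* child expression referencing those params */
    bool        strict;        /* skip child and return NULL on any NULL arg? */
} LambdaExpr;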

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

2017-11-17 Thread Robert Haas
On Thu, Nov 16, 2017 at 11:50 AM, Serge Rielau  wrote:

> Just as you have, we have also considered holistic provisioning of
> work_mem across all consumers, but we find that to be too complex.
> Having an “emergency fund” in shared memory is also an option, but I find
> it too limiting.
>

I agree.

I think this is basically a planning problem.  For example, say we wanted
to have work_mem_per_query instead of work_mem_per_node.  There is an
obvious design: consider memory use as an independent dimension of merit
during path generation and comparison (less is better).  Discard candidate
paths whose memory use exceeds the work_mem_per_query budget unless there
are no other alternatives.  At the end of planning, pick the cheapest path
that survived the memory-budget filter.  Now, this has the problem that it
would make planning more expensive (because we'd hang on to more paths for
longer) but it solves a lot of other problems.  If there's no memory
pressure, we can use memory like mad even when it doesn't save much, but
when we have to pick between using more memory for one part of the plan and
using more memory for another part of the plan, the choice that does the
best job reducing overall execution time will win.  Awesome.
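
To illustrate the shape of that filter (standalone toy code, not anything
from the planner -- the real Path/add_path machinery is deliberately not
used here):

#include <stddef.h>

/* Toy stand-ins for planner structures; illustration only. */
typedef struct ToyPath
{
    double      cost;           /* estimated execution cost */
    double      mem_kb;         /* estimated peak memory use, in kB */
} ToyPath;

/*
 * Pick the cheapest path whose memory estimate fits the per-query budget;
 * if nothing fits, fall back to the overall cheapest path.
 */
static ToyPath *
cheapest_within_budget(ToyPath *paths, int npaths, double budget_kb)
{
    ToyPath    *best_in_budget = NULL;
    ToyPath    *best_overall = NULL;

    for (int i = 0; i < npaths; i++)
    {
        ToyPath    *p = &paths[i];

        if (best_overall == NULL || p->cost < best_overall->cost)
            best_overall = p;
        if (p->mem_kb <= budget_kb &&
            (best_in_budget == NULL || p->cost < best_in_budget->cost))
            best_in_budget = p;
    }

    return best_in_budget ? best_in_budget : best_overall;
}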

We could also do more localized variants of this that don't provide hard
guarantees but do tend to avoid squandering resources.  I don't think that
we can directly incorporate memory use into cost because that will distort
the costs of higher-level nodes in the plan tree; cost needs to mean
execution time.  However, what we could do is refuse to replace a more
expensive path in a relation's path list with a cheaper one when the
savings are small and the cheaper path uses a lot more memory.  That way,
you wouldn't replace a nested loop that costs a million units with a hash
join that costs 999,999 units but uses a GB of RAM; you'd save the hash
join for cases where we think it will help significantly.

Yet another thing we could do is to try to get nodes to voluntarily use
less than work_mem when possible.  This is particularly an issue for
sorts.  A 2-batch hash join is so much more expensive than a single-batch
hash join that it's almost never going to make sense unless we have no
realistic alternative, although I suppose a 64-batch hash join might be not
that different from a 32-batch hash join.  But for sorts, given all Peter's
work in this area, I bet there are a lot of sorts that could budget a
quarter or less of work_mem and really not be hurt very much.  It depends
somewhat on how fast and how contended your I/O is, though, which we don't
have an especially good way to model.  I'm starting to wonder if that
sort_mem GUC might be a good idea... use that for sorts, and keep work_mem
for everything else.

If we really want to be able to dynamically react to change memory
conditions, what we need is a forest of plans for a given query rather than
just one.  Pick plan A if memory is limited, otherwise pick B.  Or use
admission control.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


PG10.1 autovac killed building extended stats

2017-11-17 Thread Justin Pryzby
After adding extended/MV stats to a few of our tables a few days ago, it looks
like I hadn't been paying attention and this first crashed 2 nights ago.  Why
at 1am?  Not sure.  I have a "reindex" job which runs at 1am and a
vacuum/analyze job which runs at 2am, but I don't use cron to change
autovac/analyze thresholds.

Nov 16 01:15:42 database kernel: postmaster[16144]: segfault at 0 ip 006f3e3e sp 7ffe2c8fc650 error 4 in postgres[40+693000]
[...]
Nov 17 01:15:41 database kernel: postmaster[7145]: segfault at 0 ip 006f3e3e sp 7ffe2c8fc650 error 4 in postgres[40+693000]

[pryzbyj@database ~]$ ps -fu postgres
UID        PID  PPID  C STIME TTY          TIME CMD
postgres  1757     1  0 Nov09 ?        00:20:43 /usr/pgsql-10/bin/postmaster -D /var/lib/pgsql/10/data
=> no longer running Alvaro's original patch on this server, which was also
first to experience the BRIN crash..

< 2017-11-16 01:15:43.103 -04  >LOG:  server process (PID 16144) was terminated by signal 11: Segmentation fault
< 2017-11-16 01:15:43.103 -04  >DETAIL:  Failed process was running: autovacuum: ANALYZE public.daily_enodeb_baseband_view_201711

< 2017-11-17 01:15:41.673 -04  >LOG:  server process (PID 7145) was terminated by signal 11: Segmentation fault
< 2017-11-17 01:15:41.673 -04  >DETAIL:  Failed process was running: autovacuum: ANALYZE public.daily_enodeb_baseband_view_201711

Here's a log of my reindex job for the last two nights' crashes.  It's
suspicious that the baseband table was reindexed ~15min before autovac crashed
during its processing, but it's also unclear to me why it would matter.

[...]
|Thu Nov 16 01:02:54 -04 2017: daily_enodeb_baseband_view_201711: daily_enodeb_baseband_view_201711_unique_idx(repack current)...
|Thu Nov 16 01:02:59 -04 2017: daily_enodeb_baseband_view_201711: pg_toast.pg_toast_691157026_index(reindex system)...
...
|Thu Nov 16 01:15:22 -04 2017: eric_hss_l2if_metrics_cum: eric_hss_l2if_metrics_cum_start_time_idx(repack non-partitioned)...
|Thu Nov 16 01:15:27 -04 2017: eric_hss_platform_metrics: eric_hss_platform_metrics_start_time_idx(repack non-partitioned)...
|WARNING:  terminating connection because of crash of another server process
|DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
|HINT:  In a moment you should be able to reconnect to the database and repeat your command.
|WARNING: Error creating index "public"."index_40552734": server closed the connection unexpectedly
|This probably means the server terminated abnormally
|before or while processing the request.
|WARNING: Skipping index swapping for "eric_hss_platform_metrics", since no new indexes built
|WARNING: repack failed for "eric_hss_platform_metrics_start_time_idx"
|psql: FATAL:  the database system is in recovery mode
|Filesystem      Size  Used Avail Use% Mounted on
|/dev/vdb        2.6T  2.1T  459G  83% /var/lib/pgsql

[...]
|Fri Nov 17 01:01:44 -04 2017: daily_enodeb_baseband_view_201711: daily_enodeb_baseband_view_201711_unique_idx(repack current)...
|Fri Nov 17 01:01:54 -04 2017: daily_enodeb_baseband_view_201711: pg_toast.pg_toast_691157026_index(reindex system)...
...
|Fri Nov 17 01:14:58 -04 2017: link_busy_hr: lbh_start_time_idx(repack non-partitioned)...
|Fri Nov 17 01:14:58 -04 2017: loaded_cdr_files: loaded_cdr_files_filename(repack non-partitioned)...
|WARNING:  terminating connection because of crash of another server process
|DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
|HINT:  In a moment you should be able to reconnect to the database and repeat your command.
|WARNING: Error creating index "public"."index_30971": server closed the connection unexpectedly
|This probably means the server terminated abnormally
|before or while processing the request.
|WARNING: Skipping index swapping for "loaded_cdr_files", since no new indexes built
|WARNING: repack failed for "loaded_cdr_files_filename"
|psql: FATAL:  the database system is in recovery mode

Core was generated by `postgres: autovacuum worker process   gtt '.
Program terminated with signal 11, Segmentation fault.
#0  statext_ndistinct_build (totalrows=300, numrows=300, rows=0x21df3e8,
attrs=<optimized out>, stats=0x0) at mvdistinct.c:103
103                 item->attrs = bms_add_member(item->attrs,

(gdb) p stats
$5 = (VacAttrStats **) 0x0
=> This is an error, no ??

(gdb) p item
$1 = (MVNDistinctItem *) 0x2090928
(gdb) p result
$6 = (MVNDistinct *) 0x2090918

(gdb) p item->attrs
$2 = (Bitmapset *) 0x0
(gdb) p rows
$3 = (HeapTuple *) 0x21df3e8
(gdb) p j
$8 = <optimized out>
(gdb) p attrs
$4 = <optimized out>

Let me know if there's anything more I should send.

Justin

(gdb) bt f
#0  

Re: [HACKERS] Transaction control in procedures

2017-11-17 Thread Peter Eisentraut
On 11/16/17 18:35, Simon Riggs wrote:
> For the first two answers above the answer was "currently executing
> statement", yet the third answer seems to be the procedure. So that is
> a slight discrepancy.

That's the way function execution, or really any nested execution,
currently works.

> ISTM we would like
> 
> 1) a way to cancel execution of a procedure
> 2) a way to set a timeout to cancel execution of a procedure
> 
> as well as
> 
> 1) a way to cancel execution of a statement that is running within a procedure
> 2) a way to set a timeout to cancel execution of a statement in a procedure
> 
> Visibility of what a routine is currently executing is the role of a
> debugger utility/API.

That would probably be nice, but it would be an entirely separate
undertaking.  In particular, getting insight into some kind of execution
stack would necessarily be language specific.  We do have some of that
for PL/pgSQL of course.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Repetitive code in RI triggers

2017-11-17 Thread Ildus Kurbangaliev
On Fri, 17 Nov 2017 15:05:31 +
Ildus Kurbangaliev  wrote:

> The following review has been posted through the commitfest
> application: make installcheck-world:  tested, failed
> Implements feature:   tested, failed
> Spec compliant:   tested, failed
> Documentation:tested, failed
> 
> The patch looks good. It just removes repetitive code and I think
> it's ready to commit.
> 
> The new status of this patch is: Ready for Committer

"tested, failed" should be read as "tested, passed". Forgot to check
them.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Repetitive code in RI triggers

2017-11-17 Thread Ildus Kurbangaliev
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   tested, failed
Spec compliant:   tested, failed
Documentation:tested, failed

The patch looks good. It just removes repetitive code and I think it's ready to 
commit.

The new status of this patch is: Ready for Committer


Re: [HACKERS] SERIALIZABLE on standby servers

2017-11-17 Thread Robert Haas
On Thu, Nov 16, 2017 at 5:52 PM, Simon Riggs  wrote:
>> I haven't looked at this again yet but a nearby thread reminded me of
>> another problem with this which I wanted to restate explicitly here in
>> the context of this patch.  Even without replication in the picture,
>> there is a race to reach ProcArrayEndTransaction() after
>> RecordTransactionCommit() runs, which means that the DO history
>> (normal primary server) and REDO history (recovery) don't always agree
>> on the order that transactions become visible.  With this patch, this
>> kind of diverging DO and REDO could allow undetectable read only
>> serialization anomalies.  I think that ProcArrayEndTransaction() and
>> RecordTransactionCommit() need to be made atomic in the simple case so
>> that DO and REDO agree.
>
> Not atomic, we just need to make ProcArrayEndTransaction() apply
> changes in the order of commits.
>
> I think that is more easily possible by reusing the
> SyncRepWaitForLSN() code, since that already orders things by LSN.
>
> So make all committers wait and then get WALwriter to wake people
> after ProcArrayEndTransaction() has been applied.

That doesn't solve any problem we have.  Making committers wait after
ProcArrayEndTransaction() wouldn't keep the effects of those
transactions from being visible to other transactions in the system.
And that's precisely the issue: on the primary, transactions become
visible to other transactions in the order in which they perform
ProcArrayEndTransaction(), but on the standby, they become visible in
the order in which they RecordTransactionCommit().  That difference is
a source of serialization anomalies.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



feature request: consume asynchronous notification via a function

2017-11-17 Thread Merlin Moncure
Hackers,

Currently the only way that I know of to consume async notifications
via SQL (as opposed to a client application) is via dblink_get_notify.
This method isn't very good; it requires some extra support coding,
eats a connection and a backend, and doesn't have any timeout
facilities.  The lack of a good facility to do this will become more
troublesome if/when Peter's recent fantastic work to implement stored
procedures in the database gets accepted; asynchronous notifications
could be a more efficient mechanism for backend processes to signal
each other than the current method of signalling via fields in a
table.

A good interface might look something like:
pg_get_notifications(
  TimeOut INT DEFAULT 0,
  notify_name  OUT TEXT,
  payload OUT TEXT,
  pid OUT INT) RETURNS SETOF RECORD AS...

The function would return immediately by default, or block until TimeOut
seconds had transpired.  We'd still have to poll internally so that
signals could be checked, etc., but this would be a nice way to consume
notifications without any dependencies -- what do you think?

merlin



Re: [HACKERS] Parallel Hash take II

2017-11-17 Thread Robert Haas
On Thu, Nov 16, 2017 at 6:42 PM, Andres Freund  wrote:
> What I'm basically worried about is that the *only* reason for some
> plans to choose to use parallelism is that essentially the effective
> amount of work_mem between the plans is that the parallel one uses
> (max_parallel_workers_per_gather + 1) * work_mem. Which might push
> queries to use parallelism even if it's not actually beneficial in
> reducing runtime.

I don't get it.  If we switch to using parallelism and the runtime
doesn't go down, that just means the costing is wrong.  The costing is
supposed to reflect the runtime.

It's true that adding parallel hash may tend to increase the amount of
memory that gets used in practice.  But it seems to me that any plan
that uses memory and is sometimes chosen over a non-memory-using plan
does that.

> Thomas' earlier comparison of this behaviour with e.g. parallel
> oblivious hash nodes does *NOT* seem apt to me. There's currently
> effectively no cost pressure for/against parallelism for those (even if
> there potentially should). Which means they do not favor parallel
> queries solely because they're allowed to use more memory, and thus it's
> far less likely that every one of those nodes uses the maximum allotted
> work_mem.

I agree that there is no cost pressure for or against parallelism.
The design which I have been pursuing with parallel query up to this
point is that cost represents execution time, so minimizing cost means
minimizing execution time, and that's the goal.  If we want, we can
put a tax on parallel query plans, so that they're only chosen when
they are *substantially* cheaper than non-parallel plans, or even that
a plan with a few workers is to be preferred over a plan with more
workers unless the gains are sufficiently substantial.  But I don't
think the right way to do that is to prevent a parallel hash from
using as much memory as a non-parallel hash in the same spot would
use.

Rather, what we could do is, for example, have a knob that multiplies
the cost of a partial path by a floating-point value when we insert a
Gather/Gather Merge node.  If you want to always pick the cheapest
path regardless of whether it uses parallelism, set the GUC to 1.0.
If you only want to pick a parallel query path if it's twice as cheap
as the best non-parallel path, set the GUC to 2.0.
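
For illustration only, here is a minimal C sketch of such a knob; the GUC name,
the helper function, and its placement are assumptions of this sketch, not
anything proposed in the thread (Path, startup_cost and total_cost are the
existing planner fields):

#include "postgres.h"
#include "nodes/relation.h"     /* Path and Cost, as located in this vintage */

/* Hypothetical GUC, invented for this sketch: 1.0 keeps current behaviour,
 * 2.0 requires a parallel plan to be at least twice as cheap to win. */
double      parallel_cost_handicap = 1.0;

/*
 * Inflate the cost of a Gather / Gather Merge path so that it only beats a
 * serial plan when the gap exceeds the configured factor.
 */
static void
apply_parallel_handicap(Path *gather_path)
{
    gather_path->startup_cost *= parallel_cost_handicap;
    gather_path->total_cost *= parallel_cost_handicap;
}

The usual add_path() competition would then do the rest: the Gather path
survives only if it is still cheaper than the serial alternatives after the
handicap is applied.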

> I think it's wrong to just multiply the amount of work_mem that way, and
> it'll bite us. Introducing a separate GUC, perhaps inheriting from
> work_mem if set to -1, that limits the amount of memory inside a
> parallel node seems saner. That value then would not be multiplied with
> the chosen worker number.

I don't mind having an option to override the amount of memory that
parallel hash is allowed to use, but I'm also not yet convinced that
we have a real problem that needs solving.

> As far as I understand it we currently cost a gather's startup cost
> once, not per worker.

Yes - because cost is supposed to measure execution time, and the
workers start all at once, not sequentially.  The startup time for the
workers as a whole doesn't get much longer as the number of workers
rises.  It might be that the formula should be 1000 + 50/worker or
something rather than a constant, but I doubt it matters very much.

> That startup cost effectively includes redundant
> work like materializing a table, building parallel oblivious hashtables,
> resorting the same table for a mergejoin etc...

Right, because that work all contributes to total execution time.
Actually, when the same work is going to be redone in multiple
workers, we should really inflate the cost, because if two workers
each sort the same table, it takes each of them longer than it would
take a single worker to sort the table for itself.  That's one of the
biggest problems with parallel costing right now, from what I've seen:
we're too willing to do the same work over in each of 4-6
participants, not realizing that they're going to contend for buffer
locks and I/O bandwidth.

> That's fine if your goal is solely to return a single query as fast as
> possible, even if doubling the resources will only give you a minimal
> cost advantage.  If instead your goal is to optimize both for individual
> query performance and overall system throughput that's obviously not
> good.

I agree.

> I think we should cost the startup cost of parallel nodes more like
> startup_cost * (max_parallel_workers_per_gather * 
> parallel_resource_efficiency + 1)
>
> which'd allow to tune query performance for using parallelism even for
> tiny benefits (parallel_resource_efficiency = 0), and conversely tune it
> so it only gets used if the overall loss of efficiency is small
> (parallel_resource_efficiency = 0.9 or such).

No, I think that's wrong.  I think you need to apply the adjustment at
the end of the (parallel) planning process, not incrementally to each
node.  Otherwise, the costs assigned to the individual nodes become
some unholy amalgam of 

Re: Inlining functions with "expensive" parameters

2017-11-17 Thread Paul Ramsey
On Thu, Nov 16, 2017 at 12:37 PM, Andres Freund  wrote:

> Hi,
>
> On 2017-11-16 15:27:59 -0500, Tom Lane wrote:
> > Andres Freund  writes:
> > > On November 16, 2017 11:44:52 AM PST, Tom Lane 
> wrote:
> > >> Yeah, there's no mechanism like that now, but there could be.
> >
> > > Right, but it doesn't sound that hard to introduce. Basically there'd
> need to be a WithParamValue node,  that first evaluates parameters and then
> executes the child expression. I'm thinking of doing this hierarchically so
> there's less issues with the setting of the param value being moved away
> from the child expression using it.
> >
> > Yeah.  If you also gave it the ability to optionally enforce strictness
> > (ie, skip child expr and return NULL if any param is NULL) then we could
> > do away with all of the parameter-based restrictions on inlining, since
> > the semantics of parameter eval wouldn't change by inlining.
>
> Yep. And we quite easily can have a fastpath execution step that skips
> these checks if not needed.
>
> I suspect there might still be some cases where it's worth either using
> the current parameter replacement, or rely on eval_const_expressions
> based infrastructure to directly "inline" reference parameters if
> safe. The latter seems a bit nicer / more extensible.
>
>
> > I might be showing my grad school background here, but I'd be inclined to
> > call it a LambdaExpr.  A name based on "with" would be fine in a green
> > field, but IMO we've got too much baggage from nodes related to SQL WITH.
>
> That'd work as well for me.
>

Is there a github branch or an active patch set where this work is going on
that I could watch and learn from?
Thanks!
P


> Greetings,
>
> Andres Freund
>


Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-11-17 Thread David Rowley
On 15 November 2017 at 06:49, Alvaro Herrera 
wrote:

> Here's the remaining bits, rebased.


I wonder if it should be this patch's job to alter the code in
get_relation_info() which causes the indexes not to be loaded for
partitioned tables:

    /*
     * Make list of indexes.  Ignore indexes on system catalogs if told to.
     * Don't bother with indexes for an inheritance parent, either.
     */
    if (inhparent ||
        (IgnoreSystemIndexes && IsSystemRelation(relation)))
        hasindex = false;
    else
        hasindex = relation->rd_rel->relhasindex;

A partitioned table will always go into the hasindex = false code path.

I'm kind of thinking this patch should change that. Even if the patch is
not making use of the indexes, you could argue that something
using set_rel_pathlist_hook might want to do something there, although
there are likely a bunch of counter-arguments too.

What do you think?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

2017-11-17 Thread David Rowley
On 17 November 2017 at 23:01, Amit Langote 
wrote:

> Please find attached updated patch set.   There are significant changes in
> this version as described below, including the support for hash
> partitioned tables.
>

Hi Amit,

Thanks for making those changes and adding the HASH partition support.

There's a good chance that I'm not going to get time to look at this until
maybe the last day of the month. I hope someone else can look over it in
the meantime.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [PATCH] A hook for session start

2017-11-17 Thread Fabrízio de Royes Mello
On Fri, Nov 17, 2017 at 3:33 AM, Ashutosh Sharma 
wrote:
>
> On Fri, Nov 17, 2017 at 9:08 AM, Fabrízio de Royes Mello
>  wrote:
> > Hi all,
> >
> > Attached new version of the patch fixing issues about installcheck and
some
> > assertions reported before (based on Michael Paquier code):
> >
>
> The assertion failure which I reported earlier [1] is fixed with the v11
> patch. FYI, in my earlier email I wrongly mentioned that the assertion
> failure was happening in a bgworker process (logical replication worker);
> in fact it was seen in the backend process.
>

I saw it in the "Autovacuum Launcher", more specifically during the "session
end hook", so I realized that the code checking whether the current backend
is a client backend wasn't safe, and I've fixed it.


> By installcheck, did you mean installcheck-force because installcheck
> rule is not defined in the Makefile.
>

I meant to simply disable 'installcheck' because we need
shared_preload_libraries to be properly configured. But you can use
'installcheck-force' if you manually configure and install the test module.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello


Re: [HACKERS] Partition-wise aggregation/grouping

2017-11-17 Thread Ashutosh Bapat
On Wed, Nov 15, 2017 at 5:31 PM, Jeevan Chalke
 wrote:
>
> OK. Done in the attached patch set.
>
> I have rebased all my patches on latest HEAD which is at
> 7518049980be1d90264addab003476ae105f70d4
>
> Thanks

These are review comments for the last set and I think most of them
apply to the new set as well.

Patches 0001 - 0005 refactor existing code. I haven't
reviewed them in detail to check whether we have missed anything in moving the
code, but they mostly look fine.

Comments on 0006
 /*
+ * cost_append
+ *  Determines and returns the cost of an Append node.
+ *
... clipped portion
+
+/* Add Append node overhead. */
+run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+

I am wondering whether it's really worth creating a new function for a single
line addition to create_append_path(). I think all we need to change in
create_append_path() is add (cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR *
tuples) to path->total_cost.


+/* Add MergeAppend node overhead like we do it for the Append node */
+run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+

With this change, the following comment is no longer true. Please remove it.
 * extracted tuple.  We don't charge cpu_tuple_cost because a MergeAppend
 * node doesn't do qual-checking or projection, so it has less overhead
 * than most plan nodes.
 */

+/*
+ * Arbitrarily use 50% of the cpu_tuple_cost to cost append node. Note that

Maybe reword it as " ... to cost per-tuple processing by an append node ..."

+ * this value should be multiplied with cpu_tuple_cost wherever applicable.
+ */
+#define DEFAULT_APPEND_COST_FACTOR 0.5

I am wondering whether we should just define
#define APPEND_TUPLE_COST (cpu_tuple_cost * 0.5)
and use this macro everywhere. What other use would DEFAULT_APPEND_COST_FACTOR
have besides being multiplied with cpu_tuple_cost?

 -- test partition matching with N-way join
 EXPLAIN (COSTS OFF)
 SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') =
t1.c GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
-  QUERY PLAN
---
+   QUERY PLAN
+
  Sort
Sort Key: t1.c, t3.c
->  HashAggregate
  Group Key: t1.c, t2.c, t3.c
- ->  Result
+ ->  Hash Join
+   Hash Cond: (t1.c = t2.c)
->  Append
- ->  Hash Join
-   Hash Cond: (t1.c = t2.c)

That's sad. Interestingly, this query has an aggregate, so the plan will use
partition-wise join again once the partition-wise aggregation patch is
applied. So it may be fine.

- Append  (cost=0.00..0.04 rows=2 width=32)
+ Append  (cost=0.00..0.05 rows=2 width=32)

- Append  (cost=0.00..0.04 rows=2 width=4)
+ Append  (cost=0.00..0.05 rows=2 width=4)

We do have some testcases which print costs. Interesting :). I don't have
any objection to this change.

Comments on 0007

+   
+Enables or disables the query planner's use of partition-wise grouping
+or aggregation, which allows  If partition-wise aggregation
does not result in the
+cheapest path, it will still spend time in creating these paths and
+consume memory which increase linearly with the number of partitions.
+The default is off.
+   
+  
+ 
+
Maybe we should word this in the same manner as partition-wise join, like:

Enables or disables the query planner's use of partition-wise grouping
or aggregation, which allows aggregation or grouping on partitioned
tables to be spread across the partitions. If the GROUP
BY clause includes the partition keys, the rows are aggregated at
each partition. Otherwise, partial aggregates computed for each
partition are required to be combined. Because partition-wise
aggregation/grouping can use significantly more CPU time and memory
during planning, the default is off.

+
+Partition-wise aggregates/grouping
+--

... clipped patch

+In above plan, aggregation is performed after append node which means that the
+whole table is an input for the aggregation node. However, with partition-wise
+aggregation, same query will have plane like:

s/plane/plan/

+ Append

... clipped patch

+PartialAggregate stage greatly reduces the number of groups and lose if we have
+lots of small groups.

To keep the discussion brief, I suggest we rewrite this paragraph as


If GROUP BY clause has all partition keys, all the rows that belong to a given
group come from a single partition and thus aggregates can be finalized
separately for each partition. When the number of groups is far lesser than the

Re: [HACKERS] Proposal: generic WAL compression

2017-11-17 Thread Arthur Silva
I wonder if the algorithm could be changed to operate with units of 2 or 4
bytes (instead of 1).
Since the algorithm's speed would be essentially doubled/quadrupled, it could
yield a better runtime/compression trade-off.

Does that make sense?

--
Arthur Silva


On Wed, Nov 1, 2017 at 12:43 AM, Oleg Ivanov 
wrote:

> Hackers,
>
> a few years ago generic WAL was proposed by Alexander Korotkov (
> https://www.postgresql.org/message-id/flat/CAPpHfdsXwZmojm6
> Dx%2BTJnpYk27kT4o7Ri6X_4OSWcByu1Rm%2BVA%40mail.gmail.com#CAP
> phfdsxwzmojm6dx+tjnpyk27kt4o7ri6x_4oswcbyu1rm...@mail.gmail.com). and was
> committed into PostgreSQL 9.6.
> One of the generic WAL advantages is the common interface for safe
> interaction with WAL for modules and extensions. The interface allows
> module to register the page, then change it, and then generic WAL generates
> and writes into xlog the algorithm of changing the old version of page into
> the new one. In the case of crash and recovery this algorithm may be
> applied.
>
> Bloom and RUM indexes use generic WAL. The issue is that the generic
> algorithm of turning the old page into the new one is not optimal in the
> sense of record size in xlog. Now the main idea of the approach is to find
> all positions in the page where new version is different from the original
> one and write these changes into xlog. It works well if the page was
> rewritten completely or if only a few bytes have been changed.
> Nevertheless, this approach produces too large WAL record in the case of
> inserting or deleting a few bytes into the page. In this case there are
> almost no positions where the old version and the new one are equal, so the
> record size is near the page size, but actual amount of changes in the page
> is small. This is exactly what happens often in RUM indexes.
>
> In order to overcome that issue, I would like to propose the patch, which
> provides possibility to use another approach of the WAL record
> construction. If another approach fails to make a short enough record, it
> rolls back to the original approach. The main idea of another approach is
> not to perform bytewise comparison of pages, but finding the minimal
> editing distance between two pages and the corresponding editing algorithm.
> In the general case, finding editing distance takes O(N^2) time and memory.
> But there is also an algorithm which requires only O(ND) time and O(D^2)
> memory, where D is the editing distance between two sequences. Also for
> given D algorithm may show that the minimal editing distance between two
> sequences is more than D in the same amount of time and memory.
>
> The special case of this algorithm which does not consider replace
> operations is described in the paper (http://www.xmailserver.org/diff2.pdf).
> The version of this algorithm which consumes O(ND) time and O(N) memory is
> used in diff console command, but for our purposes we don't need to
> increase the constant for the time in order to decrease memory complexity.
> For RUM indexes we usually have small enough editing distance (less than
> 64), so D^2 is not too much to store.
>
> The results of experiments:
>
> +--------------------------+--------------+--------------+--------------+--------------+
> | Name                     | WAL_diff(MB) | WAL_orig(MB) | Time_diff(s) | Time_orig(s) |
> +--------------------------+--------------+--------------+--------------+--------------+
> | rum: make installcheck   | 38.9         | 82.5         | 4.37         | 4.16         |
> | bloom: make installcheck | 1.0          | 1.0          | 0.66         | 0.53         |
> | test00.sql               | 20.5         | 51.0         | 1.86         | 1.41         |
> | test01.sql               | 207.1        | 219.7        | 8.06         | 6.89         |
> +--------------------------+--------------+--------------+--------------+--------------+
>
> We can see that the patch introduces a little slowdown, but compresses
> generic WAL efficiently for RUM indexes. I'm also going to do a few more
> experiments on this patch with other data.
>
> The patch was tested on Lubuntu 14.04, but should not contain any
> platform-specific items. The patch, the files and scripts for doing the
> experiments and performance tests are attached.
>
> Oleg Ivanov
> Postgres Professional
> The Russian PostgreSQL Company
>
>
>
>
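
For reference, the paper linked in the quoted message is Myers' "An O(ND)
Difference Algorithm and Its Variations". The sketch below, written for this
digest rather than taken from the patch, shows the distance-only core of that
algorithm in C, counting insertions and deletions but no replacements. It uses
only O(D) memory because it does not retain the per-step V arrays needed to
recover the actual edit script; retaining them is what gives the O(D^2) memory
figure mentioned above.

#include <stdlib.h>

/*
 * Return the minimal number of inserted/deleted bytes needed to turn
 * a[0..n-1] into b[0..m-1], or -1 if that distance exceeds max_d.
 */
static int
edit_distance(const char *a, int n, const char *b, int m, int max_d)
{
    int     size = 2 * max_d + 3;
    int     off = max_d + 1;        /* map diagonal k to array index off + k */
    int    *v = malloc(size * sizeof(int));
    int     result = -1;

    if (v == NULL)
        return -1;
    v[off + 1] = 0;                 /* seed so that d = 0, k = 0 starts at x = 0 */

    for (int d = 0; d <= max_d && result < 0; d++)
    {
        for (int k = -d; k <= d; k += 2)
        {
            int     x,
                    y;

            /* decide whether the best path reaches diagonal k by an
             * insertion (step down) or a deletion (step right) */
            if (k == -d || (k != d && v[off + k - 1] < v[off + k + 1]))
                x = v[off + k + 1];
            else
                x = v[off + k - 1] + 1;
            y = x - k;

            /* follow the diagonal ("snake") while bytes match */
            while (x < n && y < m && a[x] == b[y])
            {
                x++;
                y++;
            }
            v[off + k] = x;

            if (x >= n && y >= m)
            {
                result = d;         /* both sequences fully consumed */
                break;
            }
        }
    }

    free(v);
    return result;
}

A call such as edit_distance(old_page, BLCKSZ, new_page, BLCKSZ, 64) would
either return the minimal edit distance or -1, the latter matching the
"roll back to the original approach" behaviour described in the quoted
message when the distance exceeds the chosen threshold.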


Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-11-17 Thread David Rowley
On 15 November 2017 at 06:49, Alvaro Herrera 
wrote:

> Here's the remaining bits, rebased.


Hi,

I've not had time for a thorough look at this, but on a quick scan I
noticed that CompareIndexInfo() misses checking whether the index AM matches
the AM of the partitioned index.

Testing with:

create table p (a int not null) partition by range (a);
create table p1 partition of p for values from (1) to (10);
create table p2 partition of p for values from (10) to (20);
create index on p1 using btree (a);
create index on p2 using hash (a);
create index on p (a);

I see it ends up making use of the hash index on p2 to support the index
that's stored as a btree on the partitioned table. I think these should
match so that the operations we can perform on the index are all aligned.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pspg - psql pager

2017-11-17 Thread Devrim Gündüz

Hi,

[Removed my email address in the OP, which I am not using since 2010 ;) ]

On Thu, 2017-11-16 at 17:13 -0600, Merlin Moncure wrote:
> This would be a great package for inclusion in pgdg packaging as well.

I released this package on Sep 15th, but only just noticed the latest version.
Pavel usually creates tickets to ping me, so we are usually up-to-date. The
updated packages will appear in the repos in about 30 minutes.

Regards,
-- 
Devrim Gündüz
EnterpriseDB: https://www.enterprisedb.com
PostgreSQL Consultant, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR



Missing wal_receiver_status_interval in Subscribers section

2017-11-17 Thread Masahiko Sawada
Hi,

I found that the docs say in the 19.6.4 Subscribers section that "Note
that wal_receiver_timeout and wal_retrieve_retry_interval
configuration parameters affect the logical replication workers as
well", but subscribers actually use wal_receiver_status_interval as
well. So I think we should mention it too. Any thoughts?

The attached patch adds wal_receiver_status_interval to the doc.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 996e825..3c8c504 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3410,7 +3410,8 @@ ANY num_sync ( 

Re: [HACKERS] parallelize queries containing initplans

2017-11-17 Thread Amit Kapila
On Thu, Nov 16, 2017 at 10:44 PM, Robert Haas  wrote:
> On Thu, Nov 16, 2017 at 5:23 AM, Kuntal Ghosh
>  wrote:
>> I've tested the above-mentioned scenario with this patch and it is
>> working fine. Also, I've created a text column named 'vartext',
>> inserted some random length texts(max length 100) and tweaked the
>> above query as follows:
>> select ten,count(*) from tenk1 a where a.ten in (select
>> b.ten from tenk1 b where (select a.vartext from tenk1 c where c.ten =
>> a.ten limit 1) = b.vartext limit 1) group by a.ten;
>> This query is equivalent to select ten,count(*) from tenk1 group by
>> a.ten. It also produced the expected result without throwing any
>> error.
>
> Great!  I have committed the patch; thanks for testing.
>

Thanks.

> As I said in the commit message, there's a lot more work that could be
> done here.  I think we should consider trying to revise this whole
> system so that instead of serializing the values and passing them to
> the workers, we allocate an array of slots where each slot has a Datum
> flag, an isnull flag, and a dsa_pointer (which maybe could be union'd
> to the Datum?).  If we're passing a parameter by value, we could just
> store it in the Datum field; if it's null, we just set isnull.  If
> it's being passed by reference, we dsa_allocate() space for it, copy
> it into that space, and then store the dsa_pointer.
>
> The advantage of this is that it would be possible to imagine the
> contents of a slot changing while parallelism is running, which
> doesn't really work with the current serialized-blob representation.
> That would in turn allow us to imagine letting parallel-safe InitPlans
> being evaluated by the first participant that needs the value rather
> than before launching workers, which would be good, not only because
> of the possibility of deferring work for InitPlans attached at or
> above the Gather but also because it could be used for InitPlans below
> the Gather (as long as they don't depend on any parameters computed
> below the Gather).
>
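
For illustration only, here is a sketch of the per-parameter slot layout
described in the quoted paragraphs above. The struct and field names are
invented for this sketch and are not from any patch; only Datum and
dsa_pointer are existing PostgreSQL types.

#include "postgres.h"
#include "utils/dsa.h"

/*
 * One shared slot per parameter: a by-value parameter lives directly in the
 * Datum, while a by-reference one is copied into DSA and referenced.
 */
typedef struct SharedParamSlot
{
    bool        isnull;         /* parameter evaluates to NULL */
    bool        byval;          /* value is stored directly in u.value */
    union
    {
        Datum       value;      /* pass-by-value datum */
        dsa_pointer ptr;        /* dsa_allocate()'d copy of the value */
    }           u;
} SharedParamSlot;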

That would be cool, but I think finding whether it is dependent
on any parameter computed below the Gather could be tricky here.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Sv: Re: pspg - psql pager

2017-11-17 Thread Andreas Joseph Krogh
On Friday, 17 November 2017 at 10:09:53, Marco Nenciarini <
marco.nenciar...@2ndquadrant.it> wrote:
On 15/11/17 10:41, Pavel Stehule wrote:
 > Hi
 >
 > I finished new pager, that can be interesting for postgresql expert users.
 >

 It is now on apt.postgresql.org

   http://postgr.es/m/e1efc37-0007rv...@atalia.postgresql.org

 And will be in Debian Unstable soon as it is in the NEW queue for manual
 review from Debian's FTP Master.

    https://ftp-master.debian.org/new/pspg_0.5-1.html

 Regards,
 Marco
 
 
Coolio!
 
-- Andreas Joseph Krogh
 



Re: pspg - psql pager

2017-11-17 Thread Marco Nenciarini
On 15/11/17 10:41, Pavel Stehule wrote:
> Hi
> 
> I finished new pager, that can be interesting for postgresql expert users.
> 

It is now on apt.postgresql.org

  http://postgr.es/m/e1efc37-0007rv...@atalia.postgresql.org

And will be in Debian Unstable soon as it is in the NEW queue for manual
review from Debian's FTP Master.

   https://ftp-master.debian.org/new/pspg_0.5-1.html

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it





Re: Typo in comment

2017-11-17 Thread Etsuro Fujita

(2017/11/17 4:18), Robert Haas wrote:

On Thu, Nov 16, 2017 at 6:53 AM, Etsuro Fujita
  wrote:

Here is a patch for fixing a comment typo: s/datum1/datums1/.


Thanks, committed.


Thank you.

Best regards,
Etsuro Fujita



RE: Speed up the removal of WAL files

2017-11-17 Thread Tsunakawa, Takayuki
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> The original code recycles some of the to-be-removed files, but the patch
> removes all the victims.  This may impact performance.

Yes, I noticed it after submitting the patch and was wondering what to do.  
Thinking simply, I think it would be just enough to replace 
durable_unlink/durable_rename in RemoveXLogFile() with unlink/rename, and sync 
the pg_wal directory once in RemoveNonParentXlogFiles() and 
RemoveOldXlogFiles().  This will benefit the failover time when fast promotion 
is not performed.  What do you think?
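
For illustration, the approach above in plain POSIX calls (not actual
PostgreSQL code; the helper name is invented and error handling is omitted):
unlink each victim segment without per-file durability, then make the
directory-entry changes durable with a single fsync of the directory.

#include <fcntl.h>
#include <unistd.h>

static void
remove_segments_then_sync_dir(const char *waldir, char **paths, int npaths)
{
    int         dirfd;

    for (int i = 0; i < npaths; i++)
        unlink(paths[i]);               /* plain unlink, no per-file fsync */

    dirfd = open(waldir, O_RDONLY);     /* one directory fsync at the end */
    if (dirfd >= 0)
    {
        fsync(dirfd);
        close(dirfd);
    }
}

Since the directory is synced only once at the end, a crash in the middle can
at worst leave some victims still present, to be removed again on the next
pass, which matches the reasoning further below.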

BTW, RemoveNonParentXlogFiles() recycles only 10 WAL files and deletes all the
other files.  So the impact of the modification is limited.


> Likewise the original code is using durable_unlink to actually remove a
> file so separating unlink and fsync might resurrect the problem that should
> have been fixed by
> 1b02be21f271db6bd3cd43abb23fa596fcb6bac3 (I'm not sure what it was but you
> are one of the reviewers of it). I suppose that you need to explain the reason
> why this change doesn't risk anything.

It's safe because the directory is finally synced.  If the database server 
crashes before it deletes all WAL files, it performs the deletion again during 
the next recovery.

Regards
Takayuki Tsunakawa