Re: [HACKERS] [RFC] Should I embed or parameterize syscall/Win32 function names from error messages?

2017-02-05 Thread Tsunakawa, Takayuki
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> TBH, I think you are worried about the wrong thing here.  You could drop
> both of those errdetail calls altogether and be little worse off.  In the
> places where we have errdetail calls like "failed system call was xxx",
> the main point is to show the exact parameters that were given to the system
> call, and neither of these do that.  These errdetail messages would only
> be useful if the identical ereport errmsg might be issued for failures from
> different underlying Windows calls --- but I doubt that's what you're
> intending here.

Yes, that's what I'm intending to do.  To enable the user right "Lock pages in 
memory" on Windows, a few Win32 functions need to be called in turn.


> My problem with these messages is I am not sure what "memory user right"
> means.  Probably that just needs a bit of editing.  But I'd go for something
> like "could not do xxx: error code %lu", and not bother mentioning the system
> call name, unless failing to do so has some impact on whether we could
> understand what happened from a field report of this error message.

For the user, each step of enabling the user right is irrelevant.  What 
matters is only that the server could not enable the user right.  OTOH, the 
failed Win32 function may help us when talking with Microsoft to troubleshoot 
the problem.  So I used the same message in those ereport() calls, differing 
only in the function name, to eliminate duplicate translation work.


> (See the "Function Names" item in our message style guidelines for more
> about this issue.  Maybe we need to expand that item some more.)

The style guide does not necessarily require the function parameter values.

https://www.postgresql.org/docs/devel/static/error-style-guide.html

[Quote]
If it really seems necessary, mention the system call in the detail message. 
(In some cases, providing the actual values passed to the system call might be 
appropriate information for the detail message.)

postmaster.c doesn't display parameter values, either.

elog(LOG, "CreateProcess call failed: %m (error code %lu)",
     GetLastError());

Regards
Takayuki Tsunakawa



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [RFC] Should "SHOW huge_pages" display the effective value "off" when the huge page is unavailable?

2017-02-05 Thread Tsunakawa, Takayuki
Hello, all

Could you give me your opinions on whether the SHOW command should display the 
effective value or the specified value for huge_pages?  During the review of 
"Supporting huge_pages on Windows", which has now been moved to CommitFest 
2017-3, Magnus commented that the huge_pages variable should retain the value 
"try" when huge pages are not available on the machine and the server falls 
back to huge_pages=off.  The Linux version behaves that way.

I don't have a strong opinion on that, but I'm somewhat inclined to think it 
would be better to reflect the effective setting, i.e. have SHOW display 
huge_pages as off, not try.  Otherwise, the user cannot tell whether the huge 
page setting actually took effect.

One parameter that behaves similarly is wal_buffers.  When wal_buffers is set 
to -1 (default), "SHOW wal_buffers" displays the actual size, not -1.  But I 
didn't find any other parameters like this.

Regards
Takayuki Tsunakawa





Re: [HACKERS] Parallel Append implementation

2017-02-05 Thread Amit Khandekar
The v2 patch was not rebased over the latest master branch commits. Please
refer to the attached ParallelAppend_v3.patch instead.

On 6 February 2017 at 11:06, Amit Khandekar  wrote:
> Ashutosh Bapat  wrote:
>>> We may want to think about a third goal: preventing too many workers
>>> from executing the same plan. As per comment in get_parallel_divisor()
>>> we do not see any benefit if more than 4 workers execute the same
>>> node. So, an append node can distribute more than 4 worker nodes
>>> equally among the available subplans. It might be better to do that as
>>> a separate patch.
>>
>> I think that comment is for calculating leader contribution. It does
>> not say that 4 workers is too many workers in general.
>>
>> But yes, I agree, and I have it in mind as the next improvement.
>> Basically, it does not make sense to give more than 3 workers to a
>> subplan when parallel_workers for that subplan is 3. E.g., if
>> gather max workers is 10 and we have 2 Append subplans s1 and s2 with
>> parallel workers 3 and 5 respectively, then with the current patch
>> it will distribute 4 workers to each of these subplans. What we should
>> do is: once both of the subplans get 3 workers each, we should give
>> the 7th and 8th worker to s2.
>>
>> Now that I think of that, I think for implementing above, we need to
>> keep track of per-subplan max_workers in the Append path; and with
>> that, the bitmap will be redundant. Instead, it can be replaced with
>> max_workers. Let me check if it is easy to do that. We don't want to
>> have the bitmap if we are sure it would be replaced by some other data
>> structure.
>
> Attached is v2 patch, which implements above. Now Append plan node
> stores a list of per-subplan max worker count, rather than the
> Bitmapset. But still Bitmapset turned out to be necessary for
> AppendPath. More details are in the subsequent comments.
>
>
>>> Goal A : Allow non-partial subpaths in Partial Append.
>>> Goal B : Distribute workers across the Append subplans.
>>> Both of these require some kind of synchronization while choosing the
>>> next subplan. So, goal B is achieved by doing all the synchronization
>>> stuff. And implementation of goal A requires that goal B is
>>> implemented. So there is a dependency between these two goals. While
>>> implementing goal B, we should keep in mind that it should also work
>>> for goal A; it does not make sense later changing the synchronization
>>> logic in goal A.
>>>
>>> I am ok with splitting the patch into 2 patches :
>>> a) changes required for goal A
>>> b) changes required for goal B.
>>> But I think we should split it only when we are ready to commit them
>>> (commit for B, immediately followed by commit for A). Until then, we
>>> should consider both of these together because they are interconnected
>>> as explained above.
>>
>> For B, we need to know, how much gain that brings and in which cases.
>> If that gain is not worth the complexity added, we may have to defer
>> Goal B. Goal A would certainly be useful since it will improve
>> performance of the targetted cases. The synchronization required for
>> Goal A is simpler than that of B and thus if we choose to implement
>> only A, we can live with a simpler synchronization.
>
> For Goal A, the logic for a worker synchronously choosing a subplan will be:
> go to the next subplan; if that subplan has not already been assigned its
> max workers, choose it, otherwise go to the next subplan, and so on.
> For Goal B, the logic will be:
> among the subplans which are yet to reach their max workers, choose the
> subplan with the minimum number of workers currently assigned.
>
> I don't think there is a significant difference between the complexity
> of the above two algorithms, so I don't think complexity is a factor we
> can use to choose between them. We should choose the logic which has
> more potential for benefits. The logic for goal B will work for goal A
> as well. Secondly, if the subplans use different system resources, the
> resource contention might be less. One case: all subplans using
> different disks. Another case: some of the subplans may be using a
> foreign scan, so foreign server resources are engaged sooner. These
> benefits apply when the Gather max worker count is not sufficient to
> run all the subplans at full capacity. If it is sufficient, then the
> workers will be distributed over the subplans under either logic; only
> the order in which workers are assigned to subplans will differ.
>
> Also, I don't see a disadvantage if we follow the logic of Goal B.
>
>>
>> BTW, Right now, the patch does not consider non-partial paths for a
>> child which has partial paths. Do we know, for sure, that a path
>> containing partial paths for a child, which has it, is always going to
>> be cheaper than the one which includes 

Re: \if, \elseif, \else, \endif (was Re: [HACKERS] PSQL commands: \quit_if, \quit_unless)

2017-02-05 Thread Corey Huinker
>
> Hmmm. ISTM that control-C must at least reset the stack, otherwise there
> is no easy way to get out. What can be left to another patch is doing a
> control-C for the contents and then another one for the stack when there
> is no content.
>

And so it shall be.


> - put Fabien's tap test in place verbatim
>>
>
> Hmmm. That was really just a POC... I would suggest some more tests, eg:
>
>   # elif error
>   "\\if false\n\\elif error\n\\endif\n"
>
>   # ignore commands on error (stdout must be empty)
>   "\\if error\n\\echo NO\n\\else\n\\echo NO\n\\endif\n"
>

Those are already in the regression tests (around line 2763 of
expected/psql.out); are you saying we should have them in TAP as well?
Or should we only do TAP tests?

Anyway, here's the Ctrl-C behavior:

# \if true
new \if is true, executing commands
# \echo msg
msg
# ^C
escaped \if, executing commands
# \if false
new \if is false, ignoring commands until next \elif, \else, or \endif
# \echo msg
inside inactive branch, command ignored.
# ^C
escaped \if, executing commands
# \echo msg
msg
# \endif
encountered un-matched \endif
#


Ctrl-C exits do the same before/after state checks that \endif does, the
lone difference being that it "escaped" the \if rather than "exited" the
\if. Thanks to Daniel for pointing out where it should be handled, because
I wasn't going to figure that out on my own.
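
For readers following along, the branch-escaping behavior above can be modeled 
with a toy stack (a stand-alone sketch with made-up names; psql's actual 
implementation differs): commands run only while every enclosing branch is 
active, and Ctrl-C pops one level of \if just as \endif would.

```c
/*
 * Toy model of psql's nestable \if state, only to illustrate the Ctrl-C
 * behavior shown in the transcript above; names and representation are
 * made up and do not match psql's real code.
 */
typedef struct CondStack
{
    int     depth;          /* number of currently open \if blocks */
    int     inactive_from;  /* depth of outermost false branch, or 0 */
} CondStack;

/* Commands execute only when no enclosing branch is false. */
static int
branch_active(const CondStack *s)
{
    return s->inactive_from == 0;
}

/* \if: always counted (even inside a false branch), so nesting balances. */
static void
cond_if(CondStack *s, int expr_true)
{
    s->depth++;
    if (s->inactive_from == 0 && !expr_true)
        s->inactive_from = s->depth;
}

/* \endif: close one level; leaving the outermost false branch reactivates. */
static void
cond_endif(CondStack *s)
{
    if (s->inactive_from == s->depth)
        s->inactive_from = 0;
    if (s->depth > 0)
        s->depth--;
}

/* Ctrl-C: escape the innermost \if, same state change as \endif. */
static void
cond_cancel(CondStack *s)
{
    cond_endif(s);
}
```

This matches the transcript: escaping a false \if via Ctrl-C returns to 
executing commands, exactly as the "escaped \if, executing commands" line shows.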

v7's only major difference from v6 is the Ctrl-C branch escaping.
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index ae58708..c0ba4c4 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -2035,6 +2035,67 @@ hello 10
 
   
 
+  
+\if expr
+\elif expr
+\else
+\endif
+
+
+This group of commands implements nestable conditional blocks, like
+this:
+
+
+SELECT
+EXISTS(SELECT 1 FROM customer WHERE customer_id = 123) as is_customer,
+EXISTS(SELECT 1 FROM employee WHERE employee_id = 456) as is_employee
+\gset
+\if :is_customer
+SELECT * FROM customer WHERE customer_id = 123;
+\elif :is_employee
+\echo 'is not a customer but is an employee'
+SELECT * FROM employee WHERE employee_id = 456;
+\else
+\if yes
+\echo 'not a customer or employee'
+\else
+\echo 'this should never print'
+\endif
+\endif
+
+
+Conditional blocks must begin with a \if and end
+with an \endif, and the pairs must be found in
+the same source file. If an EOF is reached on the main file or an
+\include-ed file before all local
+\if-\endif are matched, then
+psql will raise an error.
+
+
+A conditional block can have any number of 
+\elif clauses, which may optionally be followed by a
+single \else clause.
+
+
+The \if and \elif commands
+read the rest of the line and evaluate it as a boolean expression.
+Currently, expressions are limited to a single unquoted string
+which is evaluated like other on/off options, so the valid values
+are any unambiguous, case-insensitive match for one of:
+true, false, 1, 0, on, off,
+yes, no.  So t, T, and tR
+will all match true.
+
+
+Lines within false branches are not evaluated in any way: queries are
+not sent to the server, non-conditional commands are not evaluated but
+bluntly ignored, nested if-expressions in such branches are also not
+evaluated but are tallied to check for proper nesting.
+
+
+  
 
   
 \ir or \include_relative 
filename
diff --git a/src/bin/psql/Makefile b/src/bin/psql/Makefile
index c53733f..7a418c6 100644
--- a/src/bin/psql/Makefile
+++ b/src/bin/psql/Makefile
@@ -61,8 +61,16 @@ uninstall:
 
 clean distclean:
rm -f psql$(X) $(OBJS) lex.backup
+   rm -rf tmp_check
 
 # files removed here are supposed to be in the distribution tarball,
 # so do not clean them in the clean/distclean rules
 maintainer-clean: distclean
rm -f sql_help.h sql_help.c psqlscanslash.c
+
+
+check:
+   $(prove_check)
+
+installcheck:
+   $(prove_installcheck)
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index f17f610..4a3e471 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -49,6 +49,7 @@
 #include "psqlscanslash.h"
 #include "settings.h"
 #include "variables.h"
+#include "fe_utils/psqlscan_int.h"
 
 /*
  * Editable database object types.
@@ -132,7 +133,7 @@ HandleSlashCmds(PsqlScanState scan_state,
status = PSQL_CMD_ERROR;
}
 
-   if (status != PSQL_CMD_ERROR)
+   if (status != PSQL_CMD_ERROR && pset.active_branch)
{
/* eat any remaining arguments after a valid command */
/* note we suppress evaluation of backticks here */
@@ -194,6 +195,30 @@ read_connect_arg(PsqlScanState 

Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-05 Thread Claudio Freire
On Mon, Feb 6, 2017 at 1:42 AM, Kouhei Kaigai  wrote:
> I also had the idea of adding extra space to the Instrumentation structure;
> however, that would require making Instrumentation a flexible-length
> structure according to the custom format of the CSP/FDW. Likely, it is not
> a good design.
> As long as an extension can retrieve its custom statistics from the DSM area
> it requested via ExecParallelEstimate(), I have no preference on the hook
> location.

That's what I had in mind: the hook happens there, but the extension
retrieves the information from some extension-specific DSM area, just
as it would on the ParallelFinish hook.

> One thing we may need to pay attention to is that some extension (not mine)
> may want to collect worker statistics regardless of Instrumentation (in
> other words, even if the plan is not under EXPLAIN ANALYZE).
> That is the reason why I didn't put a hook under
> ExecParallelRetrieveInstrumentation().

I don't think you should worry about that as long as it's a hypothetical case.

If/when some extension actually needs to do that, the design can be
discussed with a real use case at hand, and not a hypothetical one.




Re: [HACKERS] [RFC] Should I embed or parameterize syscall/Win32 function names from error messages?

2017-02-05 Thread Tom Lane
"Tsunakawa, Takayuki"  writes:
> I added a few ereport() calls that emit the same message except for the Win32 
> API name.  Which of the following do you think is the best?  I'd like to 
> follow the majority.

> [Option 1]
> ereport(elevel,
> (errmsg("could not enable Lock pages in memory user right"),
> errdetail("Failed system call was %s, error code %lu",
> "OpenProcessToken", GetLastError())));

> [Option 2]
> ereport(elevel,
> (errmsg("could not enable Lock Pages in Memory user right: error code 
> %lu", GetLastError()),
> errdetail("Failed system call was OpenProcessToken.")));

TBH, I think you are worried about the wrong thing here.  You could drop
both of those errdetail calls altogether and be little worse off.  In the
places where we have errdetail calls like "failed system call was xxx",
the main point is to show the exact parameters that were given to the
system call, and neither of these do that.  These errdetail messages
would only be useful if the identical ereport errmsg might be issued
for failures from different underlying Windows calls --- but I doubt
that's what you're intending here.

My problem with these messages is I am not sure what "memory user right"
means.  Probably that just needs a bit of editing.  But I'd go for
something like "could not do xxx: error code %lu", and not bother
mentioning the system call name, unless failing to do so has some
impact on whether we could understand what happened from a field
report of this error message.

(See the "Function Names" item in our message style guidelines
for more about this issue.  Maybe we need to expand that item
some more.)

regards, tom lane




Re: [HACKERS] [RFC] Should I embed or parameterize syscall/Win32 function names from error messages?

2017-02-05 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Michael Paquier
> I find it hard to have an opinion on the matter as a non-translator.
> Why not asking translators directly on pgsql-translators?
> 

I didn't think of pgsql-translators.  I'll ask the same question there.  Thanks.

Anyway, this is also a matter of source code style, and those who commit the 
code live here, so I think I need to hear opinions here, too.

Regards
Takayuki Tsunakawa
 



Re: [HACKERS] 0/NULL/NIL assignments in build_*_rel()

2017-02-05 Thread Ashutosh Bapat
On Mon, Feb 6, 2017 at 11:10 AM, Tom Lane  wrote:
> Ashutosh Bapat  writes:
>> On Mon, Feb 6, 2017 at 10:34 AM, Tom Lane  wrote:
>>> I'd vote for not.  The general programming style in the backend is to
>>> ignore the fact that makeNode() zeroes the node's storage and initialize
>>> all the fields explicitly.  The primary advantage of that, IMO, is that
>>> you can grep for references to a particular field and understand
>>> everyplace that affects it.  As an example of the value, if you want to
>>> add a field X and you try to figure out what you need to modify by
>>> grepping for references to existing field Y, with this proposal you would
>>> miss places that were node initializations unless Y *always* needed to be
>>> initialized nonzero.
>
>> In the case of adding a new field, I would rather rely on where the
>> structure itself is used rather than another member.
>
> Maybe.  There's certainly something to be said for developing a different
> style in which we only initialize fields that need to be nonzero ... but
> if we were to go for that, relnode.c would not be the only place to fix,
> or even the best place to start.
>

Compared to other Node structures, RelOptInfo somehow stood out
clearly for me in this respect, maybe because it's a large structure.
But yes, this might be a problem with other structures as well.

> Also, grepping for makeNode(FooStruct) works for cases where FooStructs
> are *only* built that way.  But there's more than one place in the backend
> where we build structs in other ways, typically by declaring one as a
> local variable.  It would take some effort to define a universal
> convention here and make sure all existing code follows it.
>

I think only makeNode() or palloc0(... * sizeof(nodename)) usages can
benefit from this. In other cases the fields need to be explicitly
initialized. Also, that would be a lot of code churn. Maybe we should
restrict this to some large Node structures like RelOptInfo.

>> Should we at least move those assignments into a separate function?
>
> As far as relnode.c goes, I don't really think that's an improvement,
> because it complicates dealing with fields that need to be initialized
> differently for baserels and joinrels.
>

I am thinking more along the lines of makeVar(): create a function
makeRelOptInfo() or such, which accepts values for the fields that need
assignment other than a zero value, and call it from build_join_rel() and
build_simple_rel(), passing joinrel- and baserel-specific values
respectively.
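
To illustrate what such a makeRelOptInfo()-style constructor could look like, 
here is a stand-alone sketch; the struct and field names are simplified 
stand-ins rather than the real RelOptInfo, and calloc() stands in for 
makeNode().

```c
#include <stdlib.h>

/* Simplified stand-in for RelOptInfo; the real structure is far larger. */
typedef struct RelSketch
{
    int     relid;      /* differs per call site */
    double  tuples;     /* differs per call site */
    double  rows;       /* left zero at build time */
    void   *pathlist;   /* left NULL (NIL) at build time */
} RelSketch;

/*
 * Constructor in the style of makeVar(): callers pass only the fields
 * needing a non-zero value; everything else is zeroed in one place,
 * the way makeNode() zeroes a node's storage.
 */
static RelSketch *
make_rel_sketch(int relid, double tuples)
{
    RelSketch  *rel = (RelSketch *) calloc(1, sizeof(RelSketch));

    rel->relid = relid;
    rel->tuples = tuples;
    return rel;
}
```

build_join_rel() and build_simple_rel() would then differ only in the argument 
lists they pass, rather than each repeating dozens of zero assignments.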

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Add pgstathashindex() to get hash index table statistics.

2017-02-05 Thread Amit Kapila
On Mon, Feb 6, 2017 at 10:06 AM, Ashutosh Sharma  wrote:
> On Sat, Feb 4, 2017 at 1:12 AM, Robert Haas  wrote:
>>
>> Committed with those changes.
>
Thanks for the above corrections and the commit. But there are a couple
of things that we might have to change once the patch for 'WAL in Hash
Indexes' gets checked in.

1) The test-case result needs to be changed because there won't be any
WARNING message: "WARNING:  hash indexes are not WAL-logged and their
use is discouraged".
>

I think this should be changed in WAL patch itself, no need to handle
it separately.

> 2) From the WAL patch for Hash Indexes onwards, we won't have any zero
> pages in hash indexes, so I don't think we need a column showing zero
> pages (zero_pages). When working on WAL in hash indexes, we found that
> the WAL routine XLogReadBufferExtended() does not expect a page to be
> completely zero; otherwise it returns InvalidBuffer. To fix this, we
> started initializing freed overflow pages and newly allocated bucket
> pages using _hash_pageinit(), hence I don't think there will be any
> zero pages from here onwards.
>

Can't we use PageIsEmpty() to show such information?
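
For reference, the distinction matters because an all-zeros page and an 
initialized-but-empty page look different at the header level. The following 
is a simplified stand-alone sketch with made-up names, modeled on (but not 
identical to) PostgreSQL's PageIsNew() (pd_upper == 0) and PageIsEmpty() 
(pd_lower <= SizeOfPageHeaderData) checks.

```c
#include <string.h>

#define SKETCH_PAGE_SIZE   8192
#define SKETCH_HEADER_SIZE 24     /* stand-in for SizeOfPageHeaderData */

/* Simplified stand-in for the page header's free-space bounds. */
typedef struct PageHeaderSketch
{
    unsigned short pd_lower;    /* offset to start of free space */
    unsigned short pd_upper;    /* offset to end of free space */
} PageHeaderSketch;

/* All-zero page: never initialized (what the zero_pages column counted). */
static int
page_is_new(const char *page)
{
    return ((const PageHeaderSketch *) page)->pd_upper == 0;
}

/* Initialized page with no items: header set up, but no line pointers. */
static int
page_is_empty(const char *page)
{
    return ((const PageHeaderSketch *) page)->pd_lower <= SKETCH_HEADER_SIZE;
}

/* Mimics _hash_pageinit()/PageInit(): zero, then set free-space bounds. */
static void
page_init(char *page)
{
    PageHeaderSketch *hdr = (PageHeaderSketch *) page;

    memset(page, 0, SKETCH_PAGE_SIZE);
    hdr->pd_lower = SKETCH_HEADER_SIZE;
    hdr->pd_upper = SKETCH_PAGE_SIZE;
}
```

Once every page passes through page_init()-style initialization, the "new" 
state disappears, but "empty" pages remain reportable.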



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [RFC] Should I embed or parameterize syscall/Win32 function names from error messages?

2017-02-05 Thread Michael Paquier
On Mon, Feb 6, 2017 at 2:56 PM, Tsunakawa, Takayuki
 wrote:
> I'm rather inclined to choose Option 1 to reduce message translation work.
> Actually, is Option 3 the best, since it aligns with the existing
> messages by putting the error code in the primary message?

I find it hard to have an opinion on the matter as a non-translator.
Why not asking translators directly on pgsql-translators?
-- 
Michael




[HACKERS] [RFC] Should I embed or parameterize syscall/Win32 function names from error messages?

2017-02-05 Thread Tsunakawa, Takayuki
Hello, all

Could you give me your opinions on the message style?  Recently, I got 
differing comments from Magnus and Alvaro during the review of "Supporting 
huge_pages on Windows", which has now been moved to CommitFest 2017-3.  To be 
more specific, I'm modifying src/backend/port/win32_shmem.c.  This file has 
existing messages like this:

[Existing message]
ereport(FATAL,
        (errmsg("could not create shared memory segment: error code %lu",
                GetLastError()),
         errdetail("Failed system call was CreateFileMapping(size=%zu, name=%s).",
                   size, szShareMem)));


I added a few ereport() calls that emit the same message except for the Win32 
API name.  Which of the following do you think is the best?  I'd like to follow 
the majority.

[Option 1]
ereport(elevel,
        (errmsg("could not enable Lock pages in memory user right"),
         errdetail("Failed system call was %s, error code %lu",
                   "OpenProcessToken", GetLastError())));

[Option 2]
ereport(elevel,
        (errmsg("could not enable Lock Pages in Memory user right: error code %lu",
                GetLastError()),
         errdetail("Failed system call was OpenProcessToken.")));

Alvaro thinks that Option 1 is better because it eliminates redundant 
translation work.  Magnus says Option 2 is better because it matches the style 
of existing messages in the same file.

[Magnus's comment]
this seems to be a new pattern of code -- for other similar cases it 
just writes LookupPrivilegeValue inside the patch itself. I'm guessing 
the idea was for translatability, but I think it's better we stick to 
the existing pattern.

[Alvaro's comment]
There are two reasons for doing things this way.  One is that you reduce the 
chances of a translator making a mistake with the function name (say just a 
typo, or in egregious cases they may even translate the function name).  The 
other is that if you have many of these messages, you only translate the 
generic part once instead of having the same message a handful of times, 
exactly identical but for the function name.
So please do apply that kind of pattern wherever possible.  We already have the 
proposed error message, twice.  No need for two more occurrences of it.


I'm rather inclined to choose Option 1 to reduce message translation work.
Actually, is Option 3 the best, since it aligns with the existing messages
by putting the error code in the primary message?

[Option 3]
ereport(elevel,
        (errmsg("could not enable Lock pages in memory user right: error code %lu",
                GetLastError()),
         errdetail("Failed system call was %s", "OpenProcessToken")));
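
To make the translation argument concrete, here is a stand-alone sketch (not 
PostgreSQL code; snprintf stands in for ereport's formatting): with the 
Option 1/3 pattern, every call site reuses one format string, so the message 
catalog contains a single translatable entry no matter how many Win32 
functions are reported.

```c
#include <stdio.h>

/*
 * Sketch only: the single shared format string below is what a translator
 * would see as one msgid; call sites differ only in the arguments.
 */
static void
format_detail(char *buf, size_t buflen, const char *funcname,
              unsigned long errcode)
{
    snprintf(buf, buflen, "Failed system call was %s, error code %lu",
             funcname, errcode);
}
```

Embedding the name instead (Option 2) would create one msgid per function, 
which is the duplicate translation work Alvaro is pointing at.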

Regards
Takayuki Tsunakawa





Re: [HACKERS] 3D Z-curve spatial index

2017-02-05 Thread David Fetter
On Mon, Feb 06, 2017 at 11:04:12AM +0700, Boris Muratshin wrote:
> The algorithm for 2D is described in articles (in Russian)
> https://habrahabr.ru/post/319810/ and
> https://habrahabr.ru/post/319096/ .
> Google Translate generates text that is (IMHO) suitable for understanding.

Welcome to the community!

I apologize for not being more clear.

Here, it is customary to send a patch rather than a web link when one
has code one wishes to share.  This is for several reasons:

1.  To establish that you are warranting that you have the right to
send the patch, i.e. that it is not legally encumbered in some way.

2.  To ensure that the patch stays in the archives, as large web sites
have gone away in the past, and will in the future.

3.  Because this is how the current development process works.

When I mentioned documentation, I was referring to the operation of
the SQL-callable interface, assuming that there is one.  When people
apply the patch, they need to have some idea what it is supposed to do
and how to make it do those things.  Any difference between what it is
supposed to do and what it actually does is a bug, whether in the
implementation, the documentation, or both.

These things and many others are in the 
https://wiki.postgresql.org/wiki/Developer_FAQ

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Amit Kapila
On Mon, Feb 6, 2017 at 9:47 AM, Pavan Deolasee  wrote:
>
>
> On Mon, Feb 6, 2017 at 9:41 AM, Amit Kapila  wrote:
>>
>>
>>
>> Hmm.  Consider that the first relcache invalidation occurs while
>> computing id_attrs, so now the retry logic will compute the correct
>> set of attrs (considering two indexes, if we take the example given by
>> Alvaro above).
>
>
> I don't quite get that. Since rd_idattr must be already cached at that point
> and we don't expect to process a relcache flush between successive calls to
> RelationGetIndexAttrBitmap(), we should return a consistent copy of
> rd_idattr.
>

Right, I was mistaken.  However, I am not sure if there is a need to
perform the retry logic in RelationGetIndexAttrBitmap(); it is safe to do
that at transaction end.  I feel that even though Tom's fix looks reliable
with respect to the problem reported in this thread, we should take
some time to evaluate the different proposals and see which one is the
best way to move forward.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] 0/NULL/NIL assignments in build_*_rel()

2017-02-05 Thread Tom Lane
Ashutosh Bapat  writes:
> On Mon, Feb 6, 2017 at 10:34 AM, Tom Lane  wrote:
>> I'd vote for not.  The general programming style in the backend is to
>> ignore the fact that makeNode() zeroes the node's storage and initialize
>> all the fields explicitly.  The primary advantage of that, IMO, is that
>> you can grep for references to a particular field and understand
>> everyplace that affects it.  As an example of the value, if you want to
>> add a field X and you try to figure out what you need to modify by
>> grepping for references to existing field Y, with this proposal you would
>> miss places that were node initializations unless Y *always* needed to be
>> initialized nonzero.

> In the case of adding a new field, I would rather rely on where the
> structure itself is used rather than another member.

Maybe.  There's certainly something to be said for developing a different
style in which we only initialize fields that need to be nonzero ... but
if we were to go for that, relnode.c would not be the only place to fix,
or even the best place to start.

Also, grepping for makeNode(FooStruct) works for cases where FooStructs
are *only* built that way.  But there's more than one place in the backend
where we build structs in other ways, typically by declaring one as a
local variable.  It would take some effort to define a universal
convention here and make sure all existing code follows it.

> Should we at least move those assignments into a separate function?

As far as relnode.c goes, I don't really think that's an improvement,
because it complicates dealing with fields that need to be initialized
differently for baserels and joinrels.

regards, tom lane




Re: [HACKERS] Parallel Append implementation

2017-02-05 Thread Amit Khandekar
Ashutosh Bapat  wrote:
>> We may want to think about a third goal: preventing too many workers
>> from executing the same plan. As per comment in get_parallel_divisor()
>> we do not see any benefit if more than 4 workers execute the same
>> node. So, an append node can distribute more than 4 worker nodes
>> equally among the available subplans. It might be better to do that as
>> a separate patch.
>
> I think that comment is for calculating leader contribution. It does
> not say that 4 workers is too many workers in general.
>
> But yes, I agree, and I have it in mind as the next improvement.
> Basically, it does not make sense to give more than 3 workers to a
> subplan when parallel_workers for that subplan is 3. E.g., if
> gather max workers is 10 and we have 2 Append subplans s1 and s2 with
> parallel workers 3 and 5 respectively, then with the current patch
> it will distribute 4 workers to each of these subplans. What we should
> do is: once both of the subplans get 3 workers each, we should give
> the 7th and 8th worker to s2.
>
> Now that I think of that, I think for implementing above, we need to
> keep track of per-subplan max_workers in the Append path; and with
> that, the bitmap will be redundant. Instead, it can be replaced with
> max_workers. Let me check if it is easy to do that. We don't want to
> have the bitmap if we are sure it would be replaced by some other data
> structure.

Attached is the v2 patch, which implements the above. Now the Append plan node
stores a list of per-subplan max worker counts, rather than the
Bitmapset. But the Bitmapset still turned out to be necessary for
AppendPath. More details are in the subsequent comments.


>> Goal A : Allow non-partial subpaths in Partial Append.
>> Goal B : Distribute workers across the Append subplans.
>> Both of these require some kind of synchronization while choosing the
>> next subplan. So, goal B is achieved by doing all the synchronization
>> stuff. And implementation of goal A requires that goal B is
>> implemented. So there is a dependency between these two goals. While
>> implementing goal B, we should keep in mind that it should also work
>> for goal A; it does not make sense to change the synchronization
>> logic later for goal A.
>>
>> I am ok with splitting the patch into 2 patches :
>> a) changes required for goal A
>> b) changes required for goal B.
>> But I think we should split it only when we are ready to commit them
>> (commit for B, immediately followed by commit for A). Until then, we
>> should consider both of these together because they are interconnected
>> as explained above.
>
> For B, we need to know, how much gain that brings and in which cases.
> If that gain is not worth the complexity added, we may have to defer
> Goal B. Goal A would certainly be useful since it will improve
> performance of the targeted cases. The synchronization required for
> Goal A is simpler than that of B and thus if we choose to implement
> only A, we can live with a simpler synchronization.

For Goal A, the logic for a worker synchronously choosing a subplan will be:
go to the next subplan; if that subplan has not already been assigned its
max workers, choose it, otherwise move on to the next subplan, and so
on.
For Goal B, the logic will be:
among the subplans that are yet to reach their max workers, choose the
subplan with the minimum number of workers currently assigned.
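The two selection rules above can be sketched in C. This is purely an illustrative sketch, not code from the patch: the struct and function names (PASubplanState, choose_next_subplan_*) are invented for this example, and a real implementation would also need a lock or spinlock around the shared per-subplan counters.

```c
#include <limits.h>

/* Hypothetical per-subplan state shared among parallel workers. */
typedef struct PASubplanState
{
	int			nworkers;		/* workers currently assigned */
	int			max_workers;	/* parallel_workers cap for this subplan */
} PASubplanState;

/*
 * Goal A: starting from a cursor, take the first subplan that has not yet
 * reached its cap; return -1 if every subplan is saturated.
 */
static int
choose_next_subplan_round_robin(PASubplanState *subs, int nsubs, int cursor)
{
	for (int i = 0; i < nsubs; i++)
	{
		int			idx = (cursor + i) % nsubs;

		if (subs[idx].nworkers < subs[idx].max_workers)
			return idx;
	}
	return -1;
}

/*
 * Goal B: among the unsaturated subplans, pick the one with the fewest
 * workers currently assigned; return -1 if every subplan is saturated.
 */
static int
choose_next_subplan_least_loaded(PASubplanState *subs, int nsubs)
{
	int			best = -1;
	int			best_workers = INT_MAX;

	for (int i = 0; i < nsubs; i++)
	{
		if (subs[i].nworkers < subs[i].max_workers &&
			subs[i].nworkers < best_workers)
		{
			best = i;
			best_workers = subs[i].nworkers;
		}
	}
	return best;
}
```

With the s1/s2 example above (caps 3 and 5, 8 workers), the least-loaded rule ends with exactly 3 workers on s1 and 5 on s2, whichever order the workers arrive in.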

I don't think there is a significant difference between the complexity
of the above two algorithms. So I think here the complexity does not
look like a factor based on which we can choose the particular logic.
We should choose the logic which has more potential for benefits. The
logic for goal B will work for goal A as well. And secondly, if the
subplans are using their own different system resources, the resource
contention might be less. One case is : all subplans using different
disks. Second case is : some of the subplans may be using a foreign
scan, so it would start using foreign server resources sooner. These
benefits apply when the Gather max workers count is not sufficient for
running all the subplans in their full capacity. If they are
sufficient, then the workers will be distributed over the subplans
under either logic; just the order of assignment of workers to
subplans will be different.

Also, I don't see a disadvantage if we follow the logic of Goal B.

>
> BTW, right now, the patch does not consider non-partial paths for a
> child which has partial paths. Do we know, for sure, that a path
> containing partial paths for a child that has them is always going to
> be cheaper than the one which includes the non-partial path? If not,
> should we build another path which contains non-partial paths for all
> child relations? This sounds like a 0/1 knapsack problem.

I didn't quite get this. We do create a non-partial Append path using
non-partial child paths anyway.

>
>>
>>
>>> Here are some review comments
>> I will handle the 

Re: [HACKERS] 0/NULL/NIL assignments in build_*_rel()

2017-02-05 Thread Ashutosh Bapat
On Mon, Feb 6, 2017 at 10:34 AM, Tom Lane  wrote:
> Ashutosh Bapat  writes:
>> Both build_simple_rel() and build_join_rel() allocate RelOptInfo using
>> makeNode(), which returns zeroed-out memory. The functions then
>> assign values like false, NULL, 0 or NIL, which essentially consist of
>> zero-valued bytes. This looks like needless work. So, we are spending
>> some CPU cycles unnecessarily in those assignments. That may not be
>> much time wasted, but whenever someone adds a field to RelOptInfo,
>> those functions need to be updated with a possibly zero-valued
>> assignment. That looks like an unnecessary maintenance burden. Should
>> we just drop all those zero-value assignments from there?
>
> I'd vote for not.  The general programming style in the backend is to
> ignore the fact that makeNode() zeroes the node's storage and initialize
> all the fields explicitly.  The primary advantage of that, IMO, is that
> you can grep for references to a particular field and understand
> everyplace that affects it.  As an example of the value, if you want to
> add a field X and you try to figure out what you need to modify by
> grepping for references to existing field Y, with this proposal you would
> miss places that were node initializations unless Y *always* needed to be
> initialized nonzero.

In the case of adding a new field, I would rather rely on where the
structure itself is used rather than another member. For example while
adding a field to RelOptInfo, to find out places to initialize it, I
would grep for makeNode(RelOptInfo) or such to find where a new node
gets allocated, rather than, say, relids. Grepping for usages of a
particular field might make the zero-valued assignments useful, but
that is not necessarily worth maintaining the code.

>
> I could be convinced that it is a good idea to depart from this theory
> in places where it makes a significant performance difference ... but
> you've not offered any reason to think relnode.c is one.
>

I don't think that will bring any measurable performance improvement,
unless we start creating millions of RelOptInfos in a query. My real
pain is

>> whenever someone adds a field to RelOptInfo,
>> those functions need to be updated with a possibly zero-valued
>> assignment. That looks like an unnecessary maintenance burden.

Should we at least move those assignments into a separate function?
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 0/NULL/NIL assignments in build_*_rel()

2017-02-05 Thread Tom Lane
Ashutosh Bapat  writes:
> Both build_simple_rel() and build_join_rel() allocate RelOptInfo using
> makeNode(), which returns zeroed-out memory. The functions then
> assign values like false, NULL, 0 or NIL, which essentially consist of
> zero-valued bytes. This looks like needless work. So, we are spending
> some CPU cycles unnecessarily in those assignments. That may not be
> much time wasted, but whenever someone adds a field to RelOptInfo,
> those functions need to be updated with a possibly zero-valued
> assignment. That looks like an unnecessary maintenance burden. Should
> we just drop all those zero-value assignments from there?

I'd vote for not.  The general programming style in the backend is to
ignore the fact that makeNode() zeroes the node's storage and initialize
all the fields explicitly.  The primary advantage of that, IMO, is that
you can grep for references to a particular field and understand
everyplace that affects it.  As an example of the value, if you want to
add a field X and you try to figure out what you need to modify by
grepping for references to existing field Y, with this proposal you would
miss places that were node initializations unless Y *always* needed to be
initialized nonzero.
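The grep-ability argument can be seen in a toy example. Everything below is illustrative: the struct, its fields, and the helper names are made up for this sketch (with calloc standing in for makeNode's zeroed palloc), not the real relnode.c code.

```c
#include <stdlib.h>
#include <stdbool.h>

/* Toy stand-in for RelOptInfo; fields are illustrative only. */
typedef struct ToyRelOptInfo
{
	int			relid;
	double		rows;
	bool		consider_startup;
	void	   *pathlist;		/* stands in for a NIL List */
} ToyRelOptInfo;

/*
 * Toy stand-in for makeNode(): like the real one, it returns zeroed
 * memory, so every field already reads as 0 / false / NULL.
 */
static ToyRelOptInfo *
make_toy_rel(void)
{
	return (ToyRelOptInfo *) calloc(1, sizeof(ToyRelOptInfo));
}

/*
 * The backend style defended above: spell out every field explicitly,
 * even the zero-valued ones, so that grepping for "consider_startup"
 * or "pathlist" finds this initializer along with every other use.
 */
static ToyRelOptInfo *
build_toy_rel(int relid)
{
	ToyRelOptInfo *rel = make_toy_rel();

	rel->relid = relid;
	rel->rows = 0;				/* redundant with calloc, kept for grep-ability */
	rel->consider_startup = false;
	rel->pathlist = NULL;
	return rel;
}
```

The redundant stores cost almost nothing at runtime; the trade-off discussed in this thread is purely between that grep-ability and the maintenance burden of listing every new field.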

I could be convinced that it is a good idea to depart from this theory
in places where it makes a significant performance difference ... but
you've not offered any reason to think relnode.c is one.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: pageinspect: Try to fix some bugs in previous commit.

2017-02-05 Thread Ashutosh Sharma
 I think UInt32GetDatum(metad->hashm_procid) looks fine, the reason
 being 'hashm_procid' is defined as a 32-bit unsigned integer, but then we
 may have to define procid as int8 in the SQL function.
>>>
>>> No, you're wrong.  The GetDatum macro you choose has to match the SQL
>>> type, not the type of the variable that you're passing to it.  For
>>> example, if you've got an "int" in the code and the SQL type is
>>> "int8", you have to use Int64GetDatum, not Int32GetDatum.  Otherwise,
>>> stuff breaks, because on some systems 64-bit integers are passed by
>>> reference, not by value, so the representation that Int32GetDatum
>>> produces isn't valid when interpreted by DatumGetInt64 later on.  The
>>> latter is expecting a pointer, but the former didn't produce one.
>>>
>>
>> Thank you very much for the detailed information and explanation. It is
>> really very helpful and understandable. But, as per your explanation,
>> the GetDatum we choose needs to match the SQL type, not the type of the
>> variable used in the code, and since I do not see any unsigned integer
>> SQL type in PostgreSQL, I am wondering why we have the
>> UInt32GetDatum and UInt64GetDatum macros.
>
> That's a pretty good question.  UInt64GetDatum is used in exactly one
> place (exec_stmt_getdiag) and there's really no reason why
> Int64GetDatum wouldn't be more appropriate.  UInt32GetDatum is used in
> a few more places, and some of those are used for hash values which
> are not exposed at the SQL level so they might be legitimate, but
> others like the ones in pageinspect look like fuzzy thinking that has
> only survived because it happens not to break anything.  I suppose if
> we wanted to be really rigorous about this sort of thing, we should
> change UInt32GetDatum to do something tangibly different from
> Int32GetDatum, like negate all the bits.  Then if somebody picked the
> wrong macro it would actually fail.  I'm not sure that's really the
> best place to spend our effort, though.  The moral of this episode is
> that it's important to at least get the right width.  Currently,
> getting the wrong signedness doesn't actually break anything.

Okay, understood. Thank you very much!
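Robert's width hazard can be sketched with a toy model. The names below (ToyDatum, toy_*_get_datum) are invented for this example and are not the real PostgreSQL macros; the sketch assumes a configuration where int8 is pass-by-reference, so the 64-bit conversion must allocate a copy and return its address. If code instead ran the value through the 32-bit by-value conversion, the later "get int64" step would dereference something that was never a pointer.

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy model: a Datum is pointer-sized, like PostgreSQL's Datum. */
typedef uintptr_t ToyDatum;

/* 32-bit values fit by value: the bits themselves are the datum. */
static ToyDatum
toy_int32_get_datum(int32_t x)
{
	return (ToyDatum) (uint32_t) x;
}

/*
 * In a pass-by-reference configuration, 64-bit values do not fit in the
 * datum, so the conversion allocates a copy and the datum is its address
 * (the real code uses palloc; malloc stands in here).
 */
static ToyDatum
toy_int64_get_datum(int64_t x)
{
	int64_t    *p = malloc(sizeof(int64_t));

	*p = x;
	return (ToyDatum) p;
}

/* The extraction side expects a pointer, and dereferences it. */
static int64_t
toy_datum_get_int64(ToyDatum d)
{
	return *(int64_t *) d;
}
```

This is why mismatching the width (Int32GetDatum for an int8 column) breaks on such platforms, while mismatching only the signedness does not: the representation and the extraction still agree bit-for-bit.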

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Provide list of subscriptions and publications in psql's completion

2017-02-05 Thread Michael Paquier
On Sun, Feb 5, 2017 at 5:04 AM, Fujii Masao  wrote:
> With this patch, when normal users type TAB after SUBSCRIPTION,
> they got "permission denied" error. I don't think that's acceptable.

Right, I can see that now. Also, pg_stat_get_subscription() does not
offer a way to filter by database... Perhaps we could add one? It is
stored as LogicalRepWorker->dbid, so making it visible is fairly easy.

> In "CREATE SUBSCRIPTION ... PUBLICATION" and "ALTER SUBSCRIPTION ... SET
> PUBLICATION" cases, the publication defined in the publisher side should be
> specified. But, with this patch, the tab-completion for PUBLICATION shows
> the publications defined in local server (ie, subscriber side in those cases).
> This might be confusing.

Which would mean that psql tab completion should try to connect to the
remote server using the connection string defined for the
subscription to get this information, which looks unsafe to me. To be
honest, I'd rather see a list of local objects than nothing.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 0/NULL/NIL assignments in build_*_rel()

2017-02-05 Thread Ashutosh Bapat
Hi,
Both build_simple_rel() and build_join_rel() allocate RelOptInfo using
makeNode(), which returns zeroed-out memory. The functions then
assign values like false, NULL, 0 or NIL, which essentially consist of
zero-valued bytes. This looks like needless work. So, we are spending
some CPU cycles unnecessarily in those assignments. That may not be
much time wasted, but whenever someone adds a field to RelOptInfo,
those functions need to be updated with a possibly zero-valued
assignment. That looks like an unnecessary maintenance burden. Should
we just drop all those zero-value assignments from there? If there is
any reason to keep those assignments, should we move the code that
allocates a RelOptInfo and assigns the zero values to a separate
function and call it from those two functions? fetch_upper_rel() also has
those kinds of assignments, but far fewer than the build_*_rel() functions.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-05 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Claudio Freire
> Sent: Monday, February 06, 2017 1:05 PM
> To: Kaigai Kouhei(海外 浩平) 
> Cc: Amit Kapila ; Robert Haas
> ; pgsql-hackers 
> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
> ExecEndGather)
> 
> On Mon, Feb 6, 2017 at 1:00 AM, Claudio Freire 
> wrote:
> > On Sun, Feb 5, 2017 at 9:19 PM, Kouhei Kaigai  wrote:
> >>> If the use case for this is to gather instrumentation, I'd suggest
> >>> calling the hook in RetrieveInstrumentation, and calling it
> >>> appropriately. It would make the intended use far clearer than it is
> now.
> >>>
> >>> And if it saves some work, all the better.
> >>>
> >>> Until there's a use case for a non-instrumentation hook in that
> >>> place, I wouldn't add it. This level of generality sounds like a
> >>> solution waiting for a problem to solve.
> >>>
> >> The use cases I'd like to add are extension specific but significant
> >> for performance analytics. These statistics are not included in
> Instrumentation.
> >> For example, my problems are GPU execution time, data transfer ratio
> >> over DMA, synchronization time for GPU task completion, and so on.
> >> Only extension can know which attributes are collected during the
> >> execution, and its data format. I don't think Instrumentation fits these
> requirements.
> >> This is a problem I faced on the v9.6 based interface design, so I
> >> could notice it.
> >
> >
> > What RetrieveInstrumentation currently does may not cover the
> > extension's specific instrumentation, but what I'm suggesting is
> > adding a hook, like the one you're proposing for ParallelFinish, that
> > would collect extension-specific instrumentation. Couple that with
> > extension-specific explain output and you'll get your use cases
> > covered, I think.
> 
> 
> In essence, you added a ParallelFinish that is called after
> RetrieveInstrumentation. That's overreaching, for your use case.
> 
> If instrumentation is what you're collecting, it's far more correct, and
> more readable, to have a hook called from inside RetrieveInstrumentation
> that will retrieve extension-specific instrumentation.
> 
> RetrieveInstrumentation itself doesn't need to parse that information, that
> will be loaded into the custom scan nodes, and those are the ones that will
> parse it when generating explain output.
> 
> So, in essence it's almost what you did with ParallelFinish, more clearly
> aimed at collecting runtime statistics.
> 
I had also thought of an idea to add extra space to the Instrumentation
structure; however, that would require making Instrumentation a
flexible-length structure according to the custom format of the CSP/FDW.
That is likely not a good design.
As long as an extension can retrieve its custom statistics from the DSM
area it requested in ExecParallelEstimate(), I have no preference on the
hook location.

One thing we should pay attention to is that some extensions (not mine)
may want to collect worker statistics regardless of Instrumentation (in
other words, even if the plan is not under EXPLAIN ANALYZE).
That is the reason why I didn't put a hook under
ExecParallelRetrieveInstrumentation().

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add pgstathashindex() to get hash index table statistics.

2017-02-05 Thread Ashutosh Sharma
On Sat, Feb 4, 2017 at 1:12 AM, Robert Haas  wrote:
> On Wed, Feb 1, 2017 at 10:13 PM, Michael Paquier
>  wrote:
>> On Tue, Jan 24, 2017 at 8:36 PM, Kuntal Ghosh
>>  wrote:
>>> Nothing else to add from my side. I'm marking this 'Ready for commiter'.
>>
>> Moved to CF 2017-03 with the same status.
>
> OK, I took a look at this.
>
> - The handling of the extension stuff wasn't correct.  You can't go
> back and modify version 1.4; that's already been released.  But
> version 1.5 hasn't been released yet, so we can (and should) add more
> stuff to that version instead of creating a new version.  We don't
> need the pushups that you did with superuser checks, because there's
> no version out there in the wild that has those checks, so users with
> installed binaries can't be relying on them.  I cleaned all that stuff
> up, which made this significantly simpler.
>
> - I removed several of the columns being returned from the metapage.
> The pageinspect code that I committed yesterday can be used to look at
> those values; there doesn't seem to be a need to also return them
> here.  What this is really useful for is getting the "real" values by
> scanning through the whole index and tallying things up.
>
> - I adjusted some of the data types, both in the SQL and in the C
> code, so that they all match and that they're wide enough to return
> the values they might contain without overflowing (but not wider).  I
> also made the free_percent float8 rather than float4, consistent with
> other things in this module.  One thing I cheated on slightly: I left
> the version as int4 rather than int8 even though the underlying field
> is uint32, for consistency with everything else in this module.
> That's not entirely intellectually satisfying, but I guess it's better
> to be consistent for the moment.
>
> - I fixed a number of cosmetic issues, like the fact that the
> documentation for this didn't use \x format, unlike all of the other
> pgstattuple documentation, and the fact that table_len isn't a very
> good name for a variable representing the total amount of space for
> tuples, and the fact that the documentation referred to a hash index
> as a "hash table", and the fact that the wording for the columns
> wasn't consistent with other functions in pgstattuple with similar
> columns.
>
> Committed with those changes.

Thanks for the above corrections and the commit. But there are a couple of
things that we might have to change once the patch for 'WAL in Hash
Indexes' gets checked in.

1) The test-case result needs to be changed, because there won't be any
WARNING message: "WARNING:  hash indexes are not WAL-logged and their
use is discouraged".

2) Once the WAL patch for Hash Indexes is in, we won't have any zero
pages in hash indexes, so I don't think we need the column showing
zero pages (zero_pages). While working on WAL in hash indexes, we found
that the WAL routine XLogReadBufferExtended() does not expect a page to
be completely zeroed; if it is, it returns InvalidBuffer. To fix this, we
started initializing freed overflow pages and newly allocated bucket
pages using _hash_pageinit(), hence I don't think there will be any
zero pages from here onwards.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Rename pg_switch_xlog to pg_switch_wal

2017-02-05 Thread Michael Paquier
On Sat, Feb 4, 2017 at 6:39 AM, Robert Haas  wrote:
> On Fri, Feb 3, 2017 at 5:21 AM, Stephen Frost  wrote:
>> Daniel,
>>
>> * Daniel Verite (dan...@manitou-mail.org) wrote:
>>> What if we look at the change from the pessimistic angle?
>>> An example of confusion that the change would create:
>>> a lot of users currently choose pg_wal for the destination
>>> directory of their archive command. Less-informed users
>>> that set up archiving and/or log shipping in PG10 based on
>>> advice online from previous versions will be fairly
>>> confused about the missing pg_xlog, and the fact that the
>>> pg_wal directory they're supposed to create already exists.
>>
>> One would hope that they would realize that's not going to work
>> when they set up PG10.  If they aren't paying attention sufficient
>> to realize that then it seems entirely likely that they would feel
>> equally safe removing the contents of a directory named 'pg_xlog'.
>
> So... somebody want to tally up the votes here?

Here is what I have, 6 votes clearly stated:
1. Rename nothing: Daniel,
2. Rename directory only: Andres
3. Rename everything: Stephen, Vladimir, David S, Michael P (with
aliases for the functions, though I could live without them at this point...)

> And... was this discussed at the FOSDEM developer meeting?
>
> (Please say yes.)

Looking only at the minutes, the answer is no:
https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2017_Developer_Meeting
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavan Deolasee
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee 
wrote:

>
>>
> I like this approach. I'll run the patch on a build with
> CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok.
>

While it looks certain that the fix will miss this point release, FWIW I
ran this patch with CLOBBER_CACHE_ALWAYS and it seems to be working as
expected. It's hard to run all the tests, but a small subset of the
regression tests and test_decoding seems OK.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavan Deolasee
On Mon, Feb 6, 2017 at 9:41 AM, Amit Kapila  wrote:

>
>
> Hmm.  Consider that the first relcache invalidation occurs while
> computing id_attrs; now the retry logic will compute the correct
> set of attrs (considering two indexes, if we take the example given by
> Alvaro above).


I don't quite get that. Since rd_idattr must already be cached at that
point and we don't expect to process a relcache flush between successive
calls to RelationGetIndexAttrBitmap(), we should return a consistent copy
of rd_idattr. But maybe I am missing something.

Thanks,
Pavan


-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] IF [NOT] EXISTS for replication slots

2017-02-05 Thread Michael Paquier
Hi all,

Lately I have bumped into a case where it would have been useful to
distinguish between a failure because a slot was already
dropped and an internal failure of Postgres. Is there any interest in
supporting IF EXISTS and IF NOT EXISTS for CREATE_REPLICATION_SLOT and
DROP_REPLICATION_SLOT?
My use case involved only the SQL-callable interface, but I would
think that it is useful to make this distinction with the
replication protocol as well. For the functions we could just add a boolean
argument to control the switch, and for the replication commands a
dedicated keyword.

Any thoughts?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Amit Kapila
On Mon, Feb 6, 2017 at 8:35 AM, Pavan Deolasee  wrote:
>
>
> On Mon, Feb 6, 2017 at 8:15 AM, Amit Kapila  wrote:
>>
>> On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee 
>> wrote:
>> >
>> >
>> > On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane  wrote:
>> >>
>> >>
>> >>
>> >> > 2. In the second patch, we tried to recompute attribute lists if a
>> >> > relcache
>> >> > flush happens in between and index list is invalidated. We've seen
>> >> > problems
>> >> > with that, especially it getting into an infinite loop with
>> >> > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently
>> >> > relcache
>> >> > flushes keep happening.
>> >>
>> >> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache
>> >> flush
>> >> happen whenever it possibly could.
>> >
>> >
>> > Ah, ok. That explains why the retry logic as originally proposed was
>> > causing
>> > infinite loop.
>> >
>> >>
>> >> The way to deal with that without
>> >> looping is to test whether the index set *actually* changed, not
>> >> whether
>> >> it just might have changed.
>> >
>> >
>> > I like this approach. I'll run the patch on a build with
>> > CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the
>> > fact that retry logic is now limited to only when index set changes,
>> > which
>> > fits in the situation we're dealing with.
>> >
>>
>> Don't you think it also has the same problem as mentioned by me in
>> Alvaro's patch and you also agreed on it?
>
>
> No, I don't think so. There can't be any more relcache flushes occurring
> between the first RelationGetIndexAttrBitmap() and the subsequent
> RelationGetIndexAttrBitmap() calls. The problem with the earlier proposed
> approach was that we were not caching, yet returning stale information. That
> implied the next call will again recompute a possibly different value. The
> retry logic is trying to close a small race yet important race condition
> where index set invalidation happens, but we cache stale information.
>

Hmm.  Consider that the first relcache invalidation occurs while
computing id_attrs; now the retry logic will compute the correct
set of attrs (considering two indexes, if we take the example given by
Alvaro above).  However, the attrs computed for hot_* and key_* will
be using only one index, so this will lead to a situation where two of
the attrs (hot_attrs and key_attrs) are computed with one index and
id_attrs is computed with two indexes. I think this can lead to an
assertion failure in HeapSatisfiesHOTandKeyUpdate().

>>
>> Do you see any need to fix
>> it locally in
>> RelationGetIndexAttrBitmap(), why can't we clear at transaction end?
>>
>
> We're trying to fix it in RelationGetIndexAttrBitmap() because that's where
> the race exists. I'm not saying there aren't other and better ways of
> handling it, but we don't know (I haven't studied Andres's proposal yet).
>

Okay, please find attached patch which is an extension of Tom's and
Andres's patch and I think it fixes the above problem noted by me.
Basically, invalidate the bitmaps at transaction end rather than in
RelationGetIndexAttrBitmap().  I think it is okay for
RelationGetIndexAttrBitmap() to use stale information until
transaction end because all the updates in the current running
transaction will be consumed by the second phase of CIC.
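The scheme Amit describes — tolerate a possibly stale cached bitmap within a transaction, and reset the cache only at transaction end instead of mid-computation — can be sketched abstractly. This is a toy model with invented names; the real patch works on RelationData, Bitmapsets, and the relcache, and the compute step below is just a placeholder.

```c
#include <stdbool.h>

/* Toy model of a cached attribute bitmap with deferred invalidation. */
typedef struct ToyAttrCache
{
	unsigned long bitmap;		/* stands in for a Bitmapset of attnos */
	bool		valid;
} ToyAttrCache;

static ToyAttrCache cache = {0, false};

/* Placeholder for scanning the relation's index list. */
static unsigned long
compute_bitmap_from_indexes(void)
{
	return 0x5UL;
}

/*
 * Analogue of RelationGetIndexAttrBitmap(): within a transaction, a stale
 * cached value is acceptable (CIC's second phase re-reads anyway), so no
 * retry loop is needed here, and CLOBBER_CACHE_ALWAYS cannot make it spin.
 */
static unsigned long
get_attr_bitmap(void)
{
	if (!cache.valid)
	{
		cache.bitmap = compute_bitmap_from_indexes();
		cache.valid = true;
	}
	return cache.bitmap;
}

/*
 * Analogue of a transaction-end callback: the cached value is dropped
 * here, at a quiescent point, rather than in the middle of a computation
 * where dropping it can race with concurrent invalidations.
 */
static void
at_eoxact_invalidate(void)
{
	cache.valid = false;
}
```

The design point is simply that invalidation happens at a single well-defined moment, so all of hot_attrs, key_attrs, and id_attrs within one transaction are computed against the same index set.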

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


invalidate_indexattr_v6.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-05 Thread Claudio Freire
On Mon, Feb 6, 2017 at 1:00 AM, Claudio Freire  wrote:
> On Sun, Feb 5, 2017 at 9:19 PM, Kouhei Kaigai  wrote:
>>> If the use case for this is to gather instrumentation, I'd suggest calling
>>> the hook in RetrieveInstrumentation, and calling it appropriately. It would
>>> make the intended use far clearer than it is now.
>>>
>>> And if it saves some work, all the better.
>>>
>>> Until there's a use case for a non-instrumentation hook in that place, I
>>> wouldn't add it. This level of generality sounds like a solution waiting
>>> for a problem to solve.
>>>
>> The use cases I'd like to add are extension specific but significant for
>> performance analytics. These statistics are not included in Instrumentation.
>> For example, my problems are GPU execution time, data transfer ratio over
>> DMA, synchronization time for GPU task completion, and so on.
>> Only extension can know which attributes are collected during the execution,
>> and its data format. I don't think Instrumentation fits these requirements.
>> This is a problem I faced on the v9.6 based interface design, so I could
>> notice it.
>
>
> What RetrieveInstrumentation currently does may not cover the
> extension's specific instrumentation, but what I'm suggesting is
> adding a hook, like the one you're proposing for ParallelFinish, that
> would collect extension-specific instrumentation. Couple that with
> extension-specific explain output and you'll get your use cases
> covered, I think.


In essence, you added a ParallelFinish that is called after
RetrieveInstrumentation. That's overreaching, for your use case.

If instrumentation is what you're collecting, it's far more correct,
and more readable, to have a hook called from inside
RetrieveInstrumentation that will retrieve extension-specific
instrumentation.

RetrieveInstrumentation itself doesn't need to parse that information,
that will be loaded into the custom scan nodes, and those are the ones
that will parse it when generating explain output.

So, in essence it's almost what you did with ParallelFinish, more
clearly aimed at collecting runtime statistics.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 3D Z-curve spatial index

2017-02-05 Thread Boris Muratshin
The algorithm for 2D is described in articles (in Russian)
https://habrahabr.ru/post/319810/ and
https://habrahabr.ru/post/319096/ .
Google Translate generates text that is (IMHO) suitable for understanding.

The article on the 3D version is not finished yet.

The data in the figures were obtained in the following way:
1) The test data set is a pseudo-3D (x,0,z) array of 100,000,000 random points.
2) The R-tree for comparison is a GiST 2D (x,z) index.
3) There is a set of experiments varying the average number of points in the
requested random area: 1, 10, 100, 1,000, 10,000, 100,000, 1,000,000.
4) For each area size I requested a set of random extents (from 100,000
times for 1 point to 100 times for 1,000,000 points).
5) Experiments were done on a virtual machine (2 cores, 4 GB) and, to exclude
noise, all times were taken on the second (or later) run to warm the caches;
read counts were taken on a restarted PostgreSQL.
6) For the R-tree, times are very unstable and I used the smallest one in the
series.
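For readers without access to the articles: the core of any Z-curve index key is the standard Morton bit-interleave. The sketch below is my own illustration of that textbook operation (using the well-known magic-number dilation constants), not code taken from the proposed patch, which additionally has to handle range decomposition for queries.

```c
#include <stdint.h>

/*
 * Spread the low 21 bits of v so that two zero bits separate consecutive
 * bits: ...cba -> ...00c00b00a.  Standard magic-number dilation.
 */
static uint64_t
part1by2(uint64_t v)
{
	v &= 0x1fffff;				/* keep 21 bits: 3 * 21 = 63 <= 64 */
	v = (v | (v << 32)) & 0x001f00000000ffffULL;
	v = (v | (v << 16)) & 0x001f0000ff0000ffULL;
	v = (v | (v << 8)) & 0x100f00f00f00f00fULL;
	v = (v | (v << 4)) & 0x10c30c30c30c30c3ULL;
	v = (v | (v << 2)) & 0x1249249249249249ULL;
	return v;
}

/*
 * 3D Z-curve (Morton) key: interleave the bits of x, y and z, so that
 * points close in space tend to be close on the resulting 1D curve.
 */
static uint64_t
zkey3d(uint32_t x, uint32_t y, uint32_t z)
{
	return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2);
}
```

Sorting points by zkey3d() is what lets a B-tree-like structure answer box queries, which is the basis of the comparison against the GiST R-tree above.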

Regards, Boris


On Mon, Feb 6, 2017 at 5:08 AM, David Fetter  wrote:

>
> Please send the actual patch and any documentation you write that
> comes with it.
>
> Best,
> David.
> --
> David Fetter  http://fetter.org/
> Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
> Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com
>
> Remember to vote!
> Consider donating to Postgres: http://www.postgresql.org/about/donate
>


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-05 Thread Claudio Freire
On Sun, Feb 5, 2017 at 9:19 PM, Kouhei Kaigai  wrote:
>> If the use case for this is to gather instrumentation, I'd suggest calling
>> the hook in RetrieveInstrumentation, and calling it appropriately. It would
>> make the intended use far clearer than it is now.
>>
>> And if it saves some work, all the better.
>>
>> Until there's a use case for a non-instrumentation hook in that place, I
>> wouldn't add it. This level of generality sounds like a solution waiting
>> for a problem to solve.
>>
> The use cases I'd like to add are extension specific but significant for
> performance analytics. These statistics are not included in Instrumentation.
> For example, my problems are GPU execution time, data transfer ratio over
> DMA, synchronization time for GPU task completion, and so on.
> Only extension can know which attributes are collected during the execution,
> and its data format. I don't think Instrumentation fits these requirements.
> This is a problem I faced on the v9.6 based interface design, so I could
> notice it.


What RetrieveInstrumentation currently does may not cover the
extension's specific instrumentation, but what I'm suggesting is
adding a hook, like the one you're proposing for ParallelFinish, that
would collect extension-specific instrumentation. Couple that with
extension-specific explain output and you'll get your use cases
covered, I think.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Andres Freund
On 2017-02-05 22:34:34 -0500, Tom Lane wrote:
> Pavan Deolasee  writes:
> The point is that there's a nontrivial chance of a hasty fix introducing
> worse problems than we fix.
> 
> Given the lack of consensus about exactly how to fix this, I'm feeling
> like it's a good idea if whatever we come up with gets some time to age
> awhile in git before we ship it.

Right. And I'm not even convinced that we really know the extent of the
bug; it seems fairly plausible that there are further instances of this.
There's also the issue that the mechanics in the older back branches are
different again, because of SnapshotNow.


>> I'm a bit surprised by this position. What do we tell our users now that
>> we know this bug exists?

That we're scheduling a bugfix for the next point release.  I don't
think we can truthfully claim that there are no known corruption bugs in
any of the releases in the last few years.

Greetings,

Andres Freund




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Tom Lane
Pavan Deolasee  writes:
> On Mon, Feb 6, 2017 at 5:44 AM, Andres Freund  wrote:
>> +1.  I don't think we serve our users by putting out a nontrivial bugfix
>> hastily. Nor do I think we serve them in this instance by delaying the
>> release to cover this fix; there's enough other fixes in the release
>> imo.

> I'm a bit surprised by this position.

The point is that there's a nontrivial chance of a hasty fix introducing
worse problems than we fix.

Given the lack of consensus about exactly how to fix this, I'm feeling
like it's a good idea if whatever we come up with gets some time to age
awhile in git before we ship it.

Obviously, 2ndQ or EDB or any other distributor can choose to ship a patch
in their own builds if they're sufficiently comfortable with the particular
patch.  That doesn't translate to there having to be a fix in the
community's wraps this week.

regards, tom lane




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Andres Freund
On 2017-02-05 21:49:57 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > I've not yet read the full thread, but I'm a bit confused so far. We
> > obviously can get changing information about indexes here, but isn't
> > that something we have to deal with anyway?  If we guarantee that we
> > don't lose knowledge that there's a pending invalidation, why do we
> > have to retry?
> 
> We don't *have to* retry; the core of the fix is to not enter stale data
> into the cache after we've already received an invalidation signal.

Right, the bug is that we do so without remembering that fact.


> The current caller can survive with stale data; or if that's not true,
> C.I.C.  has got more fundamental problems than posited here.

Indeed.


> But it seems like a generally bad idea for relcache to return data
> that it knows (or at least has reason to believe) is stale.

I'm not convinced by this - this kind of staleness can only occur if
changes happen during the execution of the cache building. And the
information returned can be outdated by pretty much the moment you
return anyway.  It's also how a number of the current caches
essentially already work.


> Also, even if you're okay with return-stale-data-but-don't-cache-it,
> I have little faith that invalidate_indexattr_v3.patch successfully
> implements that.

I sent a prototype clarifying what I mean. It's not really like
invalidate_indexattr_v3.patch


Btw, it seems RelationGetIndexList() avoids a similar issue only due to
either luck or a lot of undocumented assumptions.  The only reason that
setting relation->rd_indexlist / rd_indexvalid at the end doesn't cause
a similar issue seems to be that no invalidations appear to be processed
after the systable_beginscan().  And that in turn seems to not process
relcache invals after RegisterSnapshot(GetCatalogSnapshot(relid)).
Phew, that seems a bit fragile.

Greetings,

Andres Freund




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavan Deolasee
On Mon, Feb 6, 2017 at 5:44 AM, Andres Freund  wrote:

> On 2017-02-05 12:51:09 -0500, Tom Lane wrote:
> > Michael Paquier  writes:
> > > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule 
> wrote:
> > >> I agree with Pavan - a release with known important bug is not good
> idea.
> >
> > > This bug has been around for some time, so I would recommend taking
> > > the time necessary to make the best fix possible, even if it means
> > > waiting for the next round of minor releases.
> >
> > I think the way to think about this sort of thing is, if we had found
> > this bug when a release wasn't imminent, would we consider it bad enough
> > to justify an unscheduled release cycle?  I have to think the answer for
> > this one is "probably not".
>
> +1.  I don't think we serve our users by putting out a nontrivial bugfix
> hastily. Nor do I think we serve them in this instance by delaying the
> release to cover this fix; there's enough other fixes in the release
> imo.
>
>
I'm a bit surprised by this position. What do we tell our users now that
we know this bug exists? They can't fully trust CIC, and they have no
mechanism to confirm whether a newly built index is corruption-free.
I'm not suggesting that we should hastily refactor the code, but at the
very least some sort of band-aid fix which helps the situation, yet has
minimal side effects, is warranted. I believe the proposed retry patch does
exactly that.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavan Deolasee
On Mon, Feb 6, 2017 at 8:15 AM, Amit Kapila  wrote:

> On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee 
> wrote:
> >
> >
> > On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane  wrote:
> >>
> >>
> >>
> >> > 2. In the second patch, we tried to recompute attribute lists if a
> >> > relcache
> >> > flush happens in between and index list is invalidated. We've seen
> >> > problems
> >> > with that, especially it getting into an infinite loop with
> >> > CACHE_CLOBBER_ALWAYS. Not clear why it happens, but apparently
> relcache
> >> > flushes keep happening.
> >>
> >> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache
> flush
> >> happen whenever it possibly could.
> >
> >
> > Ah, ok. That explains why the retry logic as originally proposed was
> causing
> > infinite loop.
> >
> >>
> >> The way to deal with that without
> >> looping is to test whether the index set *actually* changed, not whether
> >> it just might have changed.
> >
> >
> > I like this approach. I'll run the patch on a build with
> > CACHE_CLOBBER_ALWAYS, but I'm pretty sure it will be ok. I also like the
> > fact that retry logic is now limited to only when index set changes,
> which
> > fits in the situation we're dealing with.
> >
>
> Don't you think it also has the same problem as mentioned by me in
> Alvaro's patch and you also agreed on it?


No, I don't think so. There can't be any more relcache flushes occurring
between the first RelationGetIndexAttrBitmap() and the subsequent
RelationGetIndexAttrBitmap() calls. The problem with the earlier proposed
approach was that we were not caching, yet returning stale information.
That implied the next call will again recompute a possibly different value.
The retry logic is trying to close a small yet important race condition
where index set invalidation happens, but we cache stale information.


> Do you see any need to fix
> it locally in
> RelationGetIndexAttrBitmap(), why can't we clear at transaction end?
>
>
We're trying to fix it in RelationGetIndexAttrBitmap() because that's where
the race exists. I'm not saying there aren't other and better ways of
handling it, but we don't know yet (I haven't studied Andres's proposal).

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Possible spelling fixes

2017-02-05 Thread Josh Soref
It's now split more or less according to your suggestion:
https://github.com/jsoref/postgres/commits/spelling
diff --git a/configure b/configure
--- a/configure
+++ b/configure
@@ -7088,7 +7088,7 @@ test -z "$INSTALL_SCRIPT" && INSTALL_SCR
 test -z "$INSTALL_DATA" && INSTALL_DATA='${INSTALL} -m 644'
 
 # When Autoconf chooses install-sh as install program it tries to generate
-# a relative path to it in each makefile where it subsitutes it. This clashes
+# a relative path to it in each makefile where it substitutes it. This clashes
 # with our Makefile.global concept. This workaround helps.
 case $INSTALL in
   *install-sh*) install_bin='';;
@@ -7232,7 +7232,7 @@ fi
 $as_echo "$MKDIR_P" >&6; }
 
 # When Autoconf chooses install-sh as mkdir -p program it tries to generate
-# a relative path to it in each makefile where it subsitutes it. This clashes
+# a relative path to it in each makefile where it substitutes it. This clashes
 # with our Makefile.global concept. This workaround helps.
 case $MKDIR_P in
   *install-sh*) MKDIR_P='\${SHELL} \${top_srcdir}/config/install-sh -c -d';;
diff --git a/configure.in b/configure.in
--- a/configure.in
+++ b/configure.in
@@ -887,7 +887,7 @@ fi
 
 AC_PROG_INSTALL
 # When Autoconf chooses install-sh as install program it tries to generate
-# a relative path to it in each makefile where it subsitutes it. This clashes
+# a relative path to it in each makefile where it substitutes it. This clashes
 # with our Makefile.global concept. This workaround helps.
 case $INSTALL in
   *install-sh*) install_bin='';;
@@ -900,7 +900,7 @@ AC_PROG_LN_S
 AC_PROG_AWK
 AC_PROG_MKDIR_P
 # When Autoconf chooses install-sh as mkdir -p program it tries to generate
-# a relative path to it in each makefile where it subsitutes it. This clashes
+# a relative path to it in each makefile where it substitutes it. This clashes
 # with our Makefile.global concept. This workaround helps.
 case $MKDIR_P in
   *install-sh*) MKDIR_P='\${SHELL} \${top_srcdir}/config/install-sh -c -d';;
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -51,7 +51,7 @@ blbulkdelete(IndexVacuumInfo *info, Inde
initBloomState(&state, index);
 
/*
-* Interate over the pages. We don't care about concurrently added pages,
+* Iterate over the pages. We don't care about concurrently added pages,
 * they can't contain tuples to delete.
 */
npages = RelationGetNumberOfBlocks(index);
diff --git a/contrib/cube/sql/cube.sql b/contrib/cube/sql/cube.sql
--- a/contrib/cube/sql/cube.sql
+++ b/contrib/cube/sql/cube.sql
@@ -256,7 +256,7 @@ SELECT cube_dim('(0,0,0)'::cube);
 SELECT cube_dim('(42,42,42),(42,42,42)'::cube);
 SELECT cube_dim('(4,8,15,16,23),(4,8,15,16,23)'::cube);
 
--- Test of cube_ll_coord function (retrieves LL coodinate values)
+-- Test of cube_ll_coord function (retrieves LL coordinate values)
 --
 SELECT cube_ll_coord('(-1,1),(2,-2)'::cube, 1);
 SELECT cube_ll_coord('(-1,1),(2,-2)'::cube, 2);
@@ -268,7 +268,7 @@ SELECT cube_ll_coord('(42,137)'::cube, 1
 SELECT cube_ll_coord('(42,137)'::cube, 2);
 SELECT cube_ll_coord('(42,137)'::cube, 3);
 
--- Test of cube_ur_coord function (retrieves UR coodinate values)
+-- Test of cube_ur_coord function (retrieves UR coordinate values)
 --
 SELECT cube_ur_coord('(-1,1),(2,-2)'::cube, 1);
 SELECT cube_ur_coord('(-1,1),(2,-2)'::cube, 2);
diff --git a/contrib/earthdistance/earthdistance--1.1.sql b/contrib/earthdistance/earthdistance--1.1.sql
--- a/contrib/earthdistance/earthdistance--1.1.sql
+++ b/contrib/earthdistance/earthdistance--1.1.sql
@@ -11,7 +11,7 @@ CREATE FUNCTION earth() RETURNS float8
 LANGUAGE SQL IMMUTABLE PARALLEL SAFE
 AS 'SELECT ''6378168''::float8';
 
--- Astromers may want to change the earth function so that distances will be
+-- Astronomers may want to change the earth function so that distances will be
 -- returned in degrees. To do this comment out the above definition and
 -- uncomment the one below. Note that doing this will break the regression
 -- tests.
diff --git a/contrib/isn/ISSN.h b/contrib/isn/ISSN.h
--- a/contrib/isn/ISSN.h
+++ b/contrib/isn/ISSN.h
@@ -23,7 +23,7 @@
  * Product 9 + 21 + 7 + 3 + 1 + 12 + 4 + 24 + 7 + 15 + 0 + 0 = 103
  * 103 / 10 = 10 remainder 3
  * Check digit 10 - 3 = 7
- * => 977-1144875-00-7 ??  <- suplemental number (number of the week, month, etc.)
+ * => 977-1144875-00-7 ??  <- supplemental number (number of the week, month, etc.)
 *   ^^ 00 for non-daily publications (01=Monday, 02=Tuesday, ...)
  *
  * The hyphenation is always in after the four digits of the ISSN code.
diff --git a/contrib/isn/isn.c b/contrib/isn/isn.c
--- a/contrib/isn/isn.c
+++ b/contrib/isn/isn.c
@@ -160,7 +160,7 @@ dehyphenate(char *bufO, char *bufI)
 *   into bufO using the given hyphenation range TABLE.
  *   

Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Tom Lane
Andres Freund  writes:
> I've not yet read the full thread, but I'm a bit confused so far. We
> obviously can get changing information about indexes here, but isn't
> that something we have to deal with anyway?  If we guarantee that we
> > don't lose knowledge that there's a pending invalidation, why do we
> have to retry?

We don't *have to* retry; the core of the fix is to not enter stale data
into the cache after we've already received an invalidation signal.  The
current caller can survive with stale data; or if that's not true, C.I.C.
has got more fundamental problems than posited here.  But it seems like a
generally bad idea for relcache to return data that it knows (or at least
has reason to believe) is stale.

Also, even if you're okay with return-stale-data-but-don't-cache-it,
I have little faith that invalidate_indexattr_v3.patch successfully
implements that.  Consider the sequence: inval received during
RelationGetIndexAttrBitmap, we clear rd_indexvalid, something asks for
the relation OID list, we recalculate that and set rd_indexvalid, then
we reach the end of RelationGetIndexAttrBitmap and see that rd_indexvalid
is set.  If that happened, we'd cache the bitmaps whether or not they were
really up to date.  Now admittedly, it's a bit hard to think of a reason
why anything would ask for the index list of anything but a system catalog
in that sequence, so as long as you assume that we don't allow C.I.C. (or
more realistically, REINDEX CONCURRENTLY) on system catalogs, this failure
mode is unreachable.  But I much prefer having a positive verification
that the index list is still what it was when we started.

regards, tom lane




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Andres Freund
On 2017-02-06 08:08:01 +0530, Amit Kapila wrote:
> I don't see in your patch that you are setting rd_bitmapsvalid to 0.

IIRC a plain relcache rebuild should do that (note there's also no place
that directly resets rd_indexattrs).


> Also, I think this suffers from the same problem as the patch proposed
> by Alvaro which is that different attrs (hot_attrs, key_attrs and
> id_attrs) will get computed based on different index lists which can
> cause a problem as mentioned in my mail up thread.

I'm not convinced that that's a legitimate concern.  If that's an issue
with CIC, then we have a lot bigger problems, because relcache entries
and such will be inconsistent anyway if you have non-exclusive locks
while changing catalog data.  In the situations where that happens it had
better be harmless (otherwise CIC is broken), and the next access will
recompute them.


> I am not sure but I think your solution can be
> extended to make it somewhat similar to what we do with rd_indexvalid
> (basically, we should set rd_bitmapsvalid to 2 (bitmap is temporarily
> forced) at rel cache invalidation and then clean it up transaction
> end).  I can try something on those lines.

Not sure I understand what you mean with "clean it up at transaction end"?

Greetings,

Andres Freund




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Peter Geoghegan
On Sun, Feb 5, 2017 at 6:42 PM, Pavan Deolasee  wrote:
> I'm not sure that just because the bug wasn't reported by a user, makes it
> any less critical. As Tomas pointed down thread, the nature of the bug is
> such that the users may not discover it very easily, but that doesn't mean
> it couldn't be affecting them all the time. We can now correlate many past
> reports of index corruption to this bug, but we don't have evidence to prove
> that. Lack of any good tool or built-in checks probably makes it even
> harder.

I think that we need to make an automated checker tool a requirement
for very complicated development projects in the future. We're behind
here.


-- 
Peter Geoghegan




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Amit Kapila
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee  wrote:
>
>
> On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane  wrote:
>>
>>
>>
>> > 2. In the second patch, we tried to recompute attribute lists if a
>> > relcache
>> > flush happens in between and index list is invalidated. We've seen
>> > problems
>> > with that, especially it getting into an infinite loop with
>> > CACHE_CLOBBER_ALWAYS. Not clear why it happens, but apparently relcache
>> > flushes keep happening.
>>
>> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
>> happen whenever it possibly could.
>
>
> Ah, ok. That explains why the retry logic as originally proposed was causing
> infinite loop.
>
>>
>> The way to deal with that without
>> looping is to test whether the index set *actually* changed, not whether
>> it just might have changed.
>
>
> I like this approach. I'll run the patch on a build with
> CACHE_CLOBBER_ALWAYS, but I'm pretty sure it will be ok. I also like the
> fact that retry logic is now limited to only when index set changes, which
> fits in the situation we're dealing with.
>

Don't you think it also has the same problem as mentioned by me in
Alvaro's patch, which you also agreed on?  Do you see any need to fix
it locally in RelationGetIndexAttrBitmap()?  Why can't we clear it at
transaction end?



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavan Deolasee
On Mon, Feb 6, 2017 at 5:41 AM, Peter Geoghegan  wrote:

> On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas  wrote:
> > I don't think this kind of black-and-white thinking is very helpful.
> > Obviously, data corruption is bad.  However, this bug has (from what
> > one can tell from this thread) been with us for over a decade; it must
> > necessarily be either low-probability or low-severity, or somebody
> > would've found it and fixed it before now.  Indeed, the discovery of
> > this bug was driven by new feature development, not a user report.  It
> > seems pretty clear that if we try to patch this and get it wrong, the
> > effects of our mistake could easily be a lot more serious than the
> > original bug.
>
> +1. The fact that it wasn't driven by a user report convinces me that
> this is the way to go.
>
>
I'm not sure that the bug's not being reported by a user makes it any
less critical. As Tomas pointed out downthread, the nature of the bug is
such that users may not discover it very easily, but that doesn't mean
it couldn't be affecting them all the time. We can now correlate many past
reports of index corruption to this bug, but we don't have evidence to
prove that. Lack of any good tool or built-in checks probably makes it even
harder.

The reason why I discovered this with WARM is because it now has a built-in
recheck logic, which was discarding some tuples returned by the index scan.
It took me lots of code review and inspection to finally conclude that this
must be an existing bug. Even then, for lack of any utility, I could not
detect this easily on master. That doesn't mean I wasn't able to
reproduce it, but detection was a problem.

If you look at the reproduction setup, one in every 10, if not 5, CREATE
INDEX CONCURRENTLY ends up creating a corrupt index. That's a good 10%
probability. And this is on a low end laptop, with just 5-10 concurrent
clients running. Probability of hitting the problem will be much higher on
a bigger machine, with many users (on a decent AWS machine, I would find
every third CIC to be corrupt). Moreover the setup is not doing anything
extraordinary. Just concurrent updates which change between HOT to non-HOT
and a CIC.

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Amit Kapila
On Mon, Feb 6, 2017 at 6:27 AM, Andres Freund  wrote:
> Hi,
>
> On 2017-02-05 16:37:33 -0800, Andres Freund wrote:
>> >  RelationGetIndexList(Relation relation)
>> > @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat
>> >   * we can include system attributes (e.g., OID) in the bitmap 
>> > representation.
>> >   *
>> >   * Caller had better hold at least RowExclusiveLock on the target relation
>> > - * to ensure that it has a stable set of indexes.  This also makes it safe
>> > - * (deadlock-free) for us to take locks on the relation's indexes.
>> > + * to ensure it is safe (deadlock-free) for us to take locks on the 
>> > relation's
>> > + * indexes.  Note that since the introduction of CREATE INDEX 
>> > CONCURRENTLY,
>> > + * that lock level doesn't guarantee a stable set of indexes, so we have 
>> > to
>> > + * be prepared to retry here in case of a change in the set of indexes.
>>
>> I've not yet read the full thread, but I'm a bit confused so far. We
>> obviously can get changing information about indexes here, but isn't
>> that something we have to deal with anyway?  If we guarantee that we
>> don't lose knowledge that there's a pending invalidation, why do we
>> have to retry?  Pretty much by definition the knowledge in a relcache
>> entry can be outdated as soon as returned unless locking prevents that
>> from being possible - which is not the case here.
>>
>> ISTM it'd be better not to have retry logic here, but to follow the more
>> general pattern of making sure that we know whether the information
>> needs to be recomputed at the next access.  We could either do that by
>> having an additional bit of information about the validity of the
>> bitmaps (so we have invalid, building, valid - and only set valid at the
>> end of computing the bitmaps when still building and not invalid again),
>> or we simply move the bitmap computation into the normal relcache build.
>
> To show what I mean here's an *unpolished* and *barely tested* patch
> implementing the first of my suggestions.
>

+ * Signal that the attribute bitmaps are being built. If there's any
+ * relcache invalidations while building them, rd_bitmapsvalid will be
+ * reset to 0.  In that case return the bitmaps, but don't mark them as
+ * valid.
+ */

I don't see in your patch that you are setting rd_bitmapsvalid to 0.
Also, I think this suffers from the same problem as the patch proposed
by Alvaro which is that different attrs (hot_attrs, key_attrs and
id_attrs) will get computed based on different index lists which can
cause a problem as mentioned in my mail upthread.  However, I think
your general approach, and the idea that we have to set things up for
next time, is spot on. I am not sure, but I think your solution can be
extended to make it somewhat similar to what we do with rd_indexvalid
(basically, we should set rd_bitmapsvalid to 2 (bitmap is temporarily
forced) at rel cache invalidation and then clean it up transaction
end).  I can try something on those lines.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavan Deolasee
On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane  wrote:

>
>
> > 2. In the second patch, we tried to recompute attribute lists if a
> relcache
> > flush happens in between and index list is invalidated. We've seen
> problems
> > with that, especially it getting into an infinite loop with
> > CACHE_CLOBBER_ALWAYS. Not clear why it happens, but apparently relcache
> > flushes keep happening.
>
> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
> happen whenever it possibly could.


Ah, ok. That explains why the retry logic as originally proposed was
causing infinite loop.


> The way to deal with that without
> looping is to test whether the index set *actually* changed, not whether
> it just might have changed.
>

I like this approach. I'll run the patch on a build with
CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the
fact that retry logic is now limited to only when index set changes, which
fits in the situation we're dealing with.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Possible spelling fixes

2017-02-05 Thread Andres Freund
Hi,

On 2017-02-05 21:05:50 -0500, Josh Soref wrote:
> Could someone please review the changes I have [3] and suggest a
> series of commits that this project might like?

I think the current split seems excessive...  I'd suggest splitting
things first into:
- straight up spelling/typo fixes in comments etc.
- straight up spelling/typo fixes in variable names etc.
- straight up spelling/typo fixes that are API relevant
- the same split for stuff more dependent on taste

Then afterwards we can see what's remaining.

> A complete diff would be roughly 130k. I've recently tried to submit a
> similarly sized patch to another project and it was stuck in
> moderation (typically mailing lists limit attachments to around 40k).

IIRC 130k shouldn't get you stuck in filters yet - if you're concerned
you can gzip the individual patches.


> I understand that people
> don't care about diffstats (I certainly don't), but for reference
> collectively this encompasses 175 files and 305 lines.

FWIW, I do ;)

Greetings,

Andres Freund




[HACKERS] Possible spelling fixes

2017-02-05 Thread Josh Soref
Hi.
I'm going through project-by-project offering spelling fixes.
I have read your submission suggestions [1][2] and am choosing to
disregard them.

Could someone please review the changes I have [3] and suggest a
series of commits that this project might like?

It is quite likely that someone will tell me that there are certain
types of changes that you don't want. That's fine; I'm happy to drop
changes. It's possible that I've chosen the wrong spelling correction
or made a mistake in my attempts: while the process of identifying
errant words is somewhat automated, the correction process is a mix of
me and some scripts, and neither of us is perfect.

The patch queue is currently sorted alphabetically, but it contains at least:
* branding corrections -- some of which may be wrong
* regional English -- I understand there's a discussion about this, I
can easily drop them, or you can choose to take them separately
* changelog changes -- I could understand not wanting to change the changelog
* local variable changes
* API changes -- hopefully the changes to functions/enums are within
private space and thus not a problem, but if you consider them to be
public, dropping them isn't a big deal
* changes to tests -- it wouldn't shock me if I messed some of these
up. Again, everything can be split/dropped/fixed if requested (within
limits -- I refuse to submit 100 emails with patches)
* changes to comments, lots of comments

Most changes are just changes to comments, and I'd hope it wouldn't be
terribly hard for them to be accepted.
A complete diff would be roughly 130k. I've recently tried to submit a
similarly sized patch to another project and it was stuck in
moderation (typically mailing lists limit attachments to around 40k).

I do not expect anyone to be able to do anything remotely useful with
such a patch. It is certainly possible for someone else to generate
such a file if they really like reading a diff in that manner.

For reference, I've split my proposal into 172 changes. Each change
should represent a single word/concept which has been misspelled
(possibly in multiple different manners). I understand that people
don't care about diffstats (I certainly don't), but for reference
collectively this encompasses 175 files and 305 lines.

Note that I've intentionally excluded certain files (by removing them
in prior commits), it's possible that changes would be needed to be
made to some of those files in order for tests to pass. If you have an
automated system for running tests (typically Travis or Appveyor), I'm
happy to submit changes to them before offering changes to a list. But
I couldn't find anything of the sort.

[1] https://wiki.postgresql.org/wiki/Submitting_a_Patch
[2] http://petereisentraut.blogspot.ca/2009/09/how-to-submit-patch-by-email.html
[3] https://github.com/jsoref/postgres/compare/ignore...jsoref:spelling?expand=1




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Peter Geoghegan
On Sun, Feb 5, 2017 at 4:57 PM, Tomas Vondra
 wrote:
> OTOH I disagree with the notion that bugs that are not driven by user
> reports are somehow less severe. Some data corruption bugs cause quite
> visible breakage - segfaults, immediate crashes, etc. Those are pretty clear
> bugs, and are reported by users.

I meant that I find the fact that there were no user reports in all
these years to be a good reason to not proceed for now in this
instance.

I wrote amcheck to detect the subtler variety of corruption bug, so
clearly I think that such bugs are very serious.

-- 
Peter Geoghegan




Re: [HACKERS] libpq Alternate Row Processor

2017-02-05 Thread Kyle Gearhart
From: Tom Lane [mailto:t...@sss.pgh.pa.us]:
> Kyle Gearhart  writes:
>> The guts of pqRowProcessor in libpq does a good bit of work to maintain the 
>> internal data structure of a PGresult.  There are a few use cases where the 
>> caller doesn't need the ability to access the result set row by row, column 
>> by column using PQgetvalue.  Think of an ORM that is just going to copy the 
>> data from PGresult for each row into its own structures.

> It seems like you're sort of reinventing "single row mode":
https://www.postgresql.org/docs/devel/static/libpq-single-row-mode.html

> Do we really need yet another way of breaking the unitary-query-result 
> abstraction?


If it's four times faster, then the option should be available in libpq.  I'm 
traveling tomorrow but will try to get a patch and proof with a pgbench 
dataset up by the middle of the week.  

The performance gains are consistent with Jim Nasby's findings with SPI.

Kyle Gearhart




Re: [HACKERS][REVIEW] macaddr 64 bit (EUI-64) datatype support

2017-02-05 Thread Haribabu Kommi
On Wed, Feb 1, 2017 at 6:27 AM, Vitaly Burovoy 
wrote:

> Hello,
>
> I've reviewed the patch[1].
>
> Result of testing:
>
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:tested, passed
>
>
> The patch introduces a new type, macaddr8, for EUI-64 addresses[2]
> (assuming the OUI field is 24 bits wide) with EUI-48 (the existing
> "macaddr" type) interoperability.
> It is mostly a copy-pasted macaddr implementation with the changes
> necessary for the increased range.
> Consensus on this implementation was reached earlier in this thread.
>
> There are two patch files for convenient reviewing: the base macaddr8
> implementation and its support in the btree_gin and btree_gist indexes.
>
> The patch:
> * cleanly applies to the current master
> (6af8b89adba16f97bee0d3b01256861e10d0e4f1);
> * passes tests;
> * looks fine, follows the PostgreSQL style guide;
> * has documentation changes;
> * has tests.
>
> All notes and requirements were taken into account and the patch was
> changed accordingly.
> I have no suggestions for improving it.
>
> The new status of this patch is: Ready for Committer
>

Thanks for the review.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] [WIP]Vertical Clustered Index (columnar store extension)

2017-02-05 Thread Haribabu Kommi
On Fri, Feb 3, 2017 at 8:28 PM, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

> On 30.12.2016 06:55, Haribabu Kommi wrote:
>
>
> Hi All,
>
> Fujitsu was interested in developing a columnar storage extension with
> minimal changes to the server backend.
>
>
> We in PostgresPRO are also very interested in developing vertical storage
> (VS) for Postgres.
> After considering many alternatives, we came to the conclusion that the
> approach of representing a columnar store as an access method (index)
> is the most promising one.
>
> It allows us to:
> 1. Implement VS as an extension without affecting the Postgres core.
> 2. Have both ROS and WOS.
> 3. Create multiple projections (as in Vertica).
> 4. Optimize insert speed by supporting batch inserts and using a flexible
> recovery model for VS.
>
> So it is very similar to your approach, but there are a few differences:
>
> 1. Our intention is to completely eliminate changes in Postgres core.
>
> You wrote:
>
> Yes, it is a mix of both index and table access methods. The current design
> of Vertical clustered index needs both access methods, because of this
> reason
> we used both access methods.
>
> But I still do not completely understand why it is not possible to use VS
> in index-only scans with the standard Postgres executor and no core changes.
> Why is it not possible to rely on the standard rules for applying indexes in
> the Postgres optimizer, based on costs provided by our AM implementation?
>

In our storage design, we used a TID-CRID map to map a record in the heap
to the columnar storage. Because of HOT updates, new data is not inserted
into indexes, but that is a problem for the columnar storage, so we added
a hook to insert index data even when the update is HOT.

We also added another hook for initializing parameters during
execution.

Most of the other added hooks could be replaced with existing hooks plus
some extra code.


> 2. You are accessing VS pages through the Postgres buffer manager. That
> certainly has a lot of advantages. First of all, it significantly
> simplifies the implementation of VS and allows reuse of the Postgres cache
> and lock managers.
> But it also leads to some limitations:
> - For VS it is preferable to have larger pages (in Vertica a page can
> be several megabytes).
> - VS is optimized for sequential access, so caching pages in the buffer
> manager is not needed and can only cause eviction of other useful pages
> from the cache.
> - It makes it impossible to implement an in-memory version of VS.
> - Access to the buffer manager adds extra synchronization overhead, which
> becomes noticeable on MPP systems.
>
> So I wonder if you have considered an approach with a VS-specific
> implementation of the storage layer?
>

Currently, we are just using the existing PostgreSQL buffer manager
and have not evaluated any columnar-storage-specific implementation.

We do have some plans to evaluate dynamic shared memory.


> 3. To take full advantage of the vertical model, we should provide vector
> execution.
> Without it, a columnar store can only reduce the amount of fetched data by
> selectively fetching the accessed columns and compressing them better.
> But that is what the existing cstore_fdw extension for Postgres already does.
>
> We are going to use executor hooks or custom nodes to implement vector
> operations for some nodes (filter, grand aggregate, aggregation with group
> by,...).
> Something similar to
> https://github.com/citusdata/postgres_vectorization_test
>
> What is your vision of optimizing executor to work with VS?
>

Yes, we implemented something similar to the above by copying most of the
aggregate and related code into the extension to provide vector execution
support.

Without this vector execution and parallelism support, there will not be
much performance benefit.

> 4. How do you plan to add parallelism support to VS? Should it be
> handled inside VS implementation? Or should we use standard Postgres
> parallel execution (parallel index-only scan)?
>
>
Currently we have implemented our own parallelism in the columnar storage,
building on some base infrastructure from the open-source core, but we are
planning to change it to integrate with the core parallel-execution
implementation.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Andres Freund
Hi,

On 2017-02-05 16:37:33 -0800, Andres Freund wrote:
> >  RelationGetIndexList(Relation relation)
> > @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat
> >   * we can include system attributes (e.g., OID) in the bitmap 
> > representation.
> >   *
> >   * Caller had better hold at least RowExclusiveLock on the target relation
> > - * to ensure that it has a stable set of indexes.  This also makes it safe
> > - * (deadlock-free) for us to take locks on the relation's indexes.
> > + * to ensure it is safe (deadlock-free) for us to take locks on the 
> > relation's
> > + * indexes.  Note that since the introduction of CREATE INDEX CONCURRENTLY,
> > + * that lock level doesn't guarantee a stable set of indexes, so we have to
> > + * be prepared to retry here in case of a change in the set of indexes.
> 
> I've not yet read the full thread, but I'm a bit confused so far. We
> obviously can get changing information about indexes here, but isn't
> that something we have to deal with anyway?  If we guarantee that we
> > don't lose knowledge that there's a pending invalidation, why do we
> have to retry?  Pretty much by definition the knowledge in a relcache
> entry can be outdated as soon as returned unless locking prevents that
> from being possible - which is not the case here.
> 
> ISTM it'd be better not to have retry logic here, but to follow the more
> general pattern of making sure that we know whether the information
> needs to be recomputed at the next access.  We could either do that by
> having an additional bit of information about the validity of the
> bitmaps (so we have invalid, building, valid - and only set valid at the
> end of computing the bitmaps when still building and not invalid again),
> or we simply move the bitmap computation into the normal relcache build.

To show what I mean here's an *unpolished* and *barely tested* patch
implementing the first of my suggestions.

Alvaro, Pavan, I think this should address the issue as well?

- Andres
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560e46..9e94495e75 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4745,9 +4745,12 @@ RelationGetIndexPredicate(Relation relation)
  * Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that
  * we can include system attributes (e.g., OID) in the bitmap representation.
  *
- * Caller had better hold at least RowExclusiveLock on the target relation
- * to ensure that it has a stable set of indexes.  This also makes it safe
- * (deadlock-free) for us to take locks on the relation's indexes.
+ * Caller had better hold at least RowExclusiveLock on the target relation to
+ * ensure that it has a stable set of indexes.  This also makes it safe
+ * (deadlock-free) for us to take locks on the relation's indexes.  Note that
+ * a concurrent CREATE/DROP INDEX CONCURRENTLY can lead to an outdated list
+ * being returned (will be recomputed at the next access), the CONCURRENTLY
+ * code deals with that.
  *
  * The returned result is palloc'd in the caller's memory context and should
  * be bms_free'd when not needed anymore.
@@ -4766,7 +4769,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	MemoryContext oldcxt;
 
 	/* Quick exit if we already computed the result. */
-	if (relation->rd_indexattr != NULL)
+	if (relation->rd_bitmapsvalid == 2)
 	{
 		switch (attrKind)
 		{
@@ -4788,6 +4791,14 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 		return NULL;
 
 	/*
+	 * Signal that the attribute bitmaps are being built. If there's any
+	 * relcache invalidations while building them, rd_bitmapsvalid will be
+	 * reset to 0.  In that case return the bitmaps, but don't mark them as
+	 * valid.
+	 */
+	relation->rd_bitmapsvalid = 1;
+
+	/*
 	 * Get cached list of index OIDs
 	 */
 	indexoidlist = RelationGetIndexList(relation);
@@ -4892,13 +4903,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	bms_free(relation->rd_idattr);
 	relation->rd_idattr = NULL;
 
-	/*
-	 * Now save copies of the bitmaps in the relcache entry.  We intentionally
-	 * set rd_indexattr last, because that's the one that signals validity of
-	 * the values; if we run out of memory before making that copy, we won't
-	 * leave the relcache entry looking like the other ones are valid but
-	 * empty.
-	 */
+	/* now save copies of the bitmaps in the relcache entry */
 	oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 	relation->rd_keyattr = bms_copy(uindexattrs);
 	relation->rd_pkattr = bms_copy(pkindexattrs);
@@ -4906,6 +4911,18 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	relation->rd_indexattr = bms_copy(indexattrs);
 	MemoryContextSwitchTo(oldcxt);
 
+	/*
+	 * If there've been no invalidations while building the entry, mark the
+	 * stored bitmaps as being valid.  Need to do so after the copies above,

Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Tomas Vondra

On 02/06/2017 01:11 AM, Peter Geoghegan wrote:

On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas  wrote:

I don't think this kind of black-and-white thinking is very
helpful. Obviously, data corruption is bad. However, this bug has
(from what one can tell from this thread) been with us for over a
decade; it must necessarily be either low-probability or
low-severity, or somebody would've found it and fixed it before
now. Indeed, the discovery of this bug was driven by new feature
development, not a user report. It seems pretty clear that if we
try to patch this and get it wrong, the effects of our mistake
could easily be a lot more serious than the original bug.


+1. The fact that it wasn't driven by a user report convinces me
that this is the way to go.



+1 to not rushing fixes into releases. While I think we now finally 
understand the mechanics of this bug, the fact that we came up with 
three different fixes in this thread, only to discover issues with each 
of them, warrants some caution.


OTOH I disagree with the notion that bugs that are not driven by user 
reports are somehow less severe. Some data corruption bugs cause quite 
visible breakage - segfaults, immediate crashes, etc. Those are pretty 
clear bugs, and are reported by users.


Other data corruption bugs are much more subtle - for example this bug 
may lead to incorrect results to some queries, failing to detect values 
violating UNIQUE constraints, issues with foreign keys, etc.


It's damn near impossible to notice incorrect query results that only 
affect a tiny subset of the rows (e.g. rows updated while the CIC was 
running), especially when the issue may go away after a while due to 
additional non-HOT updates.


Regarding the other symptoms - I wonder how many strange 'duplicate 
value' errors were misdiagnosed, wrongly attributed to a recent power 
outage, etc.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Andres Freund
On 2017-02-05 15:14:59 -0500, Tom Lane wrote:
> I do not like any of the other patches proposed in this thread, because
> they fail to guarantee delivering an up-to-date attribute bitmap to the
> caller.  I think we need a retry loop, and I think that it needs to look
> like the attached.

That looks like a reasonable approach, although I'm not sure it's the
best one.


> (Note that the tests whether rd_pkindex and rd_replidindex changed might
> be redundant; but I'm not totally sure of that, and they're cheap.)

I don't think they can, but I'm all for re-checking.


>  RelationGetIndexList(Relation relation)
> @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat
>   * we can include system attributes (e.g., OID) in the bitmap representation.
>   *
>   * Caller had better hold at least RowExclusiveLock on the target relation
> - * to ensure that it has a stable set of indexes.  This also makes it safe
> - * (deadlock-free) for us to take locks on the relation's indexes.
> + * to ensure it is safe (deadlock-free) for us to take locks on the 
> relation's
> + * indexes.  Note that since the introduction of CREATE INDEX CONCURRENTLY,
> + * that lock level doesn't guarantee a stable set of indexes, so we have to
> + * be prepared to retry here in case of a change in the set of indexes.

I've not yet read the full thread, but I'm a bit confused so far. We
obviously can get changing information about indexes here, but isn't
that something we have to deal with anyway?  If we guarantee that we
don't lose knowledge that there's a pending invalidation, why do we
have to retry?  Pretty much by definition the knowledge in a relcache
entry can be outdated as soon as returned unless locking prevents that
from being possible - which is not the case here.

ISTM it'd be better not to have retry logic here, but to follow the more
general pattern of making sure that we know whether the information
needs to be recomputed at the next access.  We could either do that by
having an additional bit of information about the validity of the
bitmaps (so we have invalid, building, valid - and only set valid at the
end of computing the bitmaps when still building and not invalid again),
or we simply move the bitmap computation into the normal relcache build.


BTW, the interactions of RelationSetIndexList with
RelationGetIndexAttrBitmap() look more than a bit fragile to me, even if
documented:
 *
 * We deliberately do not change rd_indexattr here: even when operating
 * with a temporary partial index list, HOT-update decisions must be made
 * correctly with respect to the full index set.  It is up to the caller
 * to ensure that a correct rd_indexattr set has been cached before first
 * calling RelationSetIndexList; else a subsequent inquiry might cause a
 * wrong rd_indexattr set to get computed and cached.  Likewise, we do not
 * touch rd_keyattr, rd_pkattr or rd_idattr.


Greetings,

Andres Freund




Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-05 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Claudio Freire
> Sent: Saturday, February 04, 2017 8:47 AM
> To: Kaigai Kouhei(海外 浩平) 
> Cc: Amit Kapila ; Robert Haas
> ; pgsql-hackers 
> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
> ExecEndGather)
> 
> On Mon, Oct 31, 2016 at 11:33 AM, Kouhei Kaigai 
> wrote:
> > Hello,
> >
> > The attached patch implements the suggestion by Amit before.
> >
> > What I'm motivated is to collect extra run-time statistics specific to
> > a particular ForeignScan/CustomScan, not only the standard
> > Instrumentation; like DMA transfer rate or execution time of GPU
> > kernels in my case.
> >
> > Per-node DSM toc is one of the best way to return run-time statistics
> > to the master backend, because FDW/CSP can assign arbitrary length of
> > the region according to its needs. It is quite easy to require.
> > However, one problem is, the per-node DSM toc is already released when
> > ExecEndNode() is called on the child node of Gather.
> >
> > This patch allows extensions to get control on the master backend's
> > context when all the worker node gets finished but prior to release of
> > the DSM segment. If FDW/CSP has its special statistics on the segment,
> > it can move to the private memory area for EXPLAIN output or something
> > other purpose.
> >
> > One design consideration is whether the hook shall be called from
> > ExecParallelRetrieveInstrumentation() or ExecParallelFinish().
> > The former is a function to retrieve the standard Instrumentation
> > information, thus, it is valid only if EXPLAIN ANALYZE.
> > On the other hand, if we put the entrypoint at ExecParallelFinish(),
> > extension can get control regardless of EXPLAIN ANALYZE, however, it
> > also needs an extra planstate_tree_walker().
> 
> If the use case for this is to gather instrumentation, I'd suggest calling
> the hook in RetrieveInstrumentation, and calling it appropriately. It would
> make the intended use far clearer than it is now.
> 
> And if it saves some work, all the better.
> 
> Until there's a use case for a non-instrumentation hook in that place, I
> wouldn't add it. This level of generality sounds like a solution waiting
> for a problem to solve.
>
The use cases I'd like to add are extension-specific but significant for
performance analytics. These statistics are not included in Instrumentation.
For example, my concerns are GPU execution time, data transfer ratio over
DMA, synchronization time for GPU task completion, and so on.
Only the extension can know which attributes are collected during execution,
and their data format, so I don't think Instrumentation fits these
requirements. This is a problem I faced with the v9.6-based interface
design, which is how I noticed it.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Andres Freund
On 2017-02-05 12:51:09 -0500, Tom Lane wrote:
> Michael Paquier  writes:
> > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule  
> > wrote:
> >> I agree with Pavan - a release with known important bug is not good idea.
> 
> > This bug has been around for some time, so I would recommend taking
> > the time necessary to make the best fix possible, even if it means
> > waiting for the next round of minor releases.
> 
> I think the way to think about this sort of thing is, if we had found
> this bug when a release wasn't imminent, would we consider it bad enough
> to justify an unscheduled release cycle?  I have to think the answer for
> this one is "probably not".

+1.  I don't think we serve our users by putting out a nontrivial bugfix
hastily. Nor do I think we serve them in this instance by delaying the
release to cover this fix; there are enough other fixes in the release,
IMO.

- Andres




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Peter Geoghegan
On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas  wrote:
> I don't think this kind of black-and-white thinking is very helpful.
> Obviously, data corruption is bad.  However, this bug has (from what
> one can tell from this thread) been with us for over a decade; it must
> necessarily be either low-probability or low-severity, or somebody
> would've found it and fixed it before now.  Indeed, the discovery of
> this bug was driven by new feature development, not a user report.  It
> seems pretty clear that if we try to patch this and get it wrong, the
> effects of our mistake could easily be a lot more serious than the
> original bug.

+1. The fact that it wasn't driven by a user report convinces me that
this is the way to go.


-- 
Peter Geoghegan




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Robert Haas
On Sun, Feb 5, 2017 at 1:34 PM, Martín Marqués  wrote:
> El 05/02/17 a las 10:03, Michael Paquier escribió:
>> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule  
>> wrote:
>>> I agree with Pavan - a release with known important bug is not good idea.
>>
>> This bug has been around for some time, so I would recommend taking
>> the time necessary to make the best fix possible, even if it means
>> waiting for the next round of minor releases.
>
> The fact that the bug has been around for a long time doesn't mean we
> shouldn't take it seriously.
>
> IMO any kind of corruption (heap or index) should be prioritized.
> There's also been comments about maybe this being the cause of old
> reports about index corruption.
>
> I ask myself if it's a good idea to make a point release with a know
> corruption bug in it.

I don't think this kind of black-and-white thinking is very helpful.
Obviously, data corruption is bad.  However, this bug has (from what
one can tell from this thread) been with us for over a decade; it must
necessarily be either low-probability or low-severity, or somebody
would've found it and fixed it before now.  Indeed, the discovery of
this bug was driven by new feature development, not a user report.  It
seems pretty clear that if we try to patch this and get it wrong, the
effects of our mistake could easily be a lot more serious than the
original bug.

Now, I do not object to patching this tomorrow before the wrap if the
various senior hackers who have studied this (Tom, Alvaro, Pavan,
Amit) are all in agreement on the way forward.  But if there is any
significant chance that we don't fully understand this issue and that
our fix might cause bigger problems, it would be more prudent to wait.
I'm all in favor of releasing important bug fixes quickly, but
releasing a bug fix before we're sure we've fixed the bug correctly is
taking a good principle to an irrational extreme.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [COMMITTERS] pgsql: pageinspect: Try to fix some bugs in previous commit.

2017-02-05 Thread Robert Haas
On Fri, Feb 3, 2017 at 9:14 PM, Ashutosh Sharma  wrote:
> On Fri, Feb 3, 2017 at 8:29 PM, Robert Haas  wrote:
>> On Fri, Feb 3, 2017 at 7:41 AM, Ashutosh Sharma  
>> wrote:
>>> I think UInt32GetDatum(metad->hashm_procid) looks fine, the reason
>>> being 'hashm_procid' is defined as 32-bit unsigned integer but then we
>>> may have to define procid as int8 in SQL function.
>>
>> No, you're wrong.  The GetDatum macro you choose has to match the SQL
>> type, not the type of the variable that you're passing to it.  For
>> example, if you've got an "int" in the code and the SQL type is
>> "int8", you have to use Int64GetDatum, not Int32GetDatum.  Otherwise,
>> stuff breaks, because on some systems 64-bit integers are passed by
>> reference, not by value, so the representation that Int32GetDatum
>> produces isn't valid when interpreted by DatumGetInt64 later on.  The
>> latter is expecting a pointer, but the former didn't produce one.
>>
>
> Thank you very much for detailed information and explanation. It is
> really very helpful and understandable. But, As per your explanation,
> GetDatum we choose need to match the SQL type, not the type of the
> variable used in code and I do not see any unsigned integer SQL type
> in PostgreSQL then I am just wondering on why do we have
> UInt32GetDatum or UInt64GetDatum macros.

That's a pretty good question.  UInt64GetDatum is used in exactly one
place (exec_stmt_getdiag) and there's really no reason why
Int64GetDatum wouldn't be more appropriate.  UInt32GetDatum is used in
a few more places, and some of those are used for hash values which
are not exposed at the SQL level so they might be legitimate, but
others like the ones in pageinspect look like fuzzy thinking that has
only survived because it happens not to break anything.  I suppose if
we wanted to be really rigorous about this sort of thing, we should
change UInt32GetDatum to do something tangibly different from
Int32GetDatum, like negate all the bits.  Then if somebody picked the
wrong macro it would actually fail.  I'm not sure that's really the
best place to spend our effort, though.  The moral of this episode is
that it's important to at least get the right width.  Currently,
getting the wrong signedness doesn't actually break anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] 3D Z-curve spatial index

2017-02-05 Thread David Fetter
On Sat, Feb 04, 2017 at 03:41:29PM +0700, Boris Muratshin wrote:
> Hi hackers,
> 
> The low-level implementation of a 3D Z-curve index
> (https://github.com/bmuratshin/zcurve/tree/master)
> is getting close to GiST R-tree performance, with a
> significantly smaller number of pages read from disk.
> 
> See attached figures,
> times2 - average request execution time VS average points number in result
> reads2 - average shared reads number VS average points number in result
> 
> Feel free to connect with me if you have any questions.

Please send the actual patch and any documentation you write that
comes with it.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] [PATCH] SortSupport for macaddr type

2017-02-05 Thread Brandur Leach
Hi Julien,

Thank you for taking the time to do this review, and my
apologies for the very delayed response. I lost track of
this work and have only jumped back into it today.

Please find attached a new version of the patch with your
feedback integrated. I've also rebased the patch against
the current master and selected a new OID because my old
one is now in use.

> * you used macaddr_cmp_internal() function name, for uuid
>   the same function is named uuid_internal_cmp().  Using
>   the same naming pattern is probably better.

I was a little split on this one! It's true that UUID uses
`_internal_cmp`, but `_cmp_internal` is also used in a
number of places like `enum`, `timetz`, and `network`. I
don't have a strong feeling about it either way, so I've
changed it to `_internal_cmp` to match UUID as you
suggested.

> * the function comment on macaddr_abbrev_convert()
>   doesn't mention specific little-endian handling

I tried to bake this into the comment text. Here are the
relevant lines of the amended version:

* Packs the bytes of a 6-byte MAC address into a Datum and treats it as an
* unsigned integer for purposes of comparison. On a 64-bit machine, there
* will be two zeroed bytes of padding. The integer is converted to native
* endianness to facilitate easy comparison.

> * "There will be two bytes of zero padding on the least
>   significant end"
>
> "least significant bits" would be better

Also done. Here is the amended version:

* On a 64-bit machine, zero out the 8-byte datum and copy the 6 bytes of
* the MAC address in. There will be two bytes of zero padding on the end
* of the least significant bits.

> * This patch will trigger quite a lot modifications after
>   a pgindent run.  Could you try to run pgindent on mac.c
>   before sending an updated patch?

Good call! I've run the new version through pgindent.

Let me know if you have any further feedback and/or
suggestions. Thanks!

Brandur


On Wed, Sep 14, 2016 at 3:14 AM, Julien Rouhaud 
wrote:

> On 26/08/2016 19:44, Brandur wrote:
> > Hello,
> >
>
> Hello,
>
> > I've attached a patch to add SortSupport for Postgres' macaddr which
> > has the effect of improving the performance of sorting operations for
> > the type.  The strategy that I employ is very similar to that for
> > UUID, which is to create abbreviated keys by packing as many bytes
> > from the MAC address as possible into Datums, and then performing fast
> > unsigned integer comparisons while sorting.
> >
> > I ran some informal local benchmarks, and for cardinality greater than
> > 100k rows, I see a speed up on `CREATE INDEX` of roughly 25% to 65%.
> > (For those interested, I put a few more numbers into a small report
> > here [2].)
> >
>
> That's a nice improvement!
>
> > Admittedly, this is not quite as useful as speeding up sorting on a
> > more common data type like TEXT or UUID, but the change still seems
> > like a useful performance improvement. I largely wrote it as an
> > exercise to familiarize myself with the Postgres codebase.
> >
> > I'll add an entry into the current commitfest as suggested by the
> > Postgres Wiki and follow up here with a link.
> >
> > Thanks, and if anyone has feedback or other thoughts, let me know!
> >
>
> I just reviewed your patch.  It applies and compiles cleanly, and the
> abbrev feature works as intended.  There's not much to say since this is
> heavily inspired by the uuid SortSupport. The only really specific part
> is in the abbrev_converter function, and I don't see any issue with it.
>
> I have a few trivial comments:
>
> * you used the macaddr_cmp_internal() function name; for uuid the same
> function is named uuid_internal_cmp().  Using the same naming pattern is
> probably better.
>
> * the function comment on macaddr_abbrev_convert() doesn't mention
> specific little-endian handling
>
> * "There will be two bytes of zero padding on the least significant end"
>
> "least significant bits" would be better
>
> * This patch will trigger quite a lot of modifications after a pgindent
> run.  Could you try to run pgindent on mac.c before sending an updated
> patch?
>
> Best regards.
>
> --
> Julien Rouhaud
> http://dalibo.com - http://dalibo.org
>


0002-Implement-SortSupport-for-macaddr-data-type.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Tom Lane
[ Having now read the whole thread, I'm prepared to weigh in ... ]

Pavan Deolasee  writes:
> This seems like a real problem to me. I don't know what the consequences
> are, but definitely having various attribute lists hold different views
> of the set of indexes doesn't seem right.

For sure.

> 2. In the second patch, we tried to recompute attribute lists if a relcache
> flush happens in between and index list is invalidated. We've seen problems
> with that, especially it getting into an infinite loop with
> CACHE_CLOBBER_ALWAYS. Not clear why it happens, but apparently relcache
> flushes keep happening.

Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
happen whenever it possibly could.  The way to deal with that without
looping is to test whether the index set *actually* changed, not whether
it just might have changed.
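A toy model of that retry shape (illustrative only; all names here are invented and this is not the attached patch):

```c
/* Toy model of the "retry only if it actually changed" loop: a cached
 * value may be invalidated while we work, and we start over only when a
 * re-read shows a real change.  Invented names; not PostgreSQL code. */
static int cache_version = 0;
static int calls = 0;
static int retries = 0;

static int
read_cache(void)
{
    return cache_version;
}

static void
do_expensive_work(void)
{
    /* Simulate an invalidation arriving mid-computation, exactly once. */
    if (calls++ == 0)
        cache_version++;
}

static int
compute_stable_value(void)
{
    for (;;)
    {
        int before = read_cache();

        do_expensive_work();

        /* Test whether the value *actually* changed, not whether it
         * merely might have; otherwise constant invalidations (as under
         * CLOBBER_CACHE_ALWAYS) could make us loop forever. */
        if (read_cache() == before)
            return before;
        retries++;
    }
}
```

Because the loop compares actual before/after values, a flush that leaves the set unchanged costs nothing and cannot cause an infinite loop.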

I do not like any of the other patches proposed in this thread, because
they fail to guarantee delivering an up-to-date attribute bitmap to the
caller.  I think we need a retry loop, and I think that it needs to look
like the attached.

(Note that the tests whether rd_pkindex and rd_replidindex changed might
be redundant; but I'm not totally sure of that, and they're cheap.)

regards, tom lane

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560..b659994 100644
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
*** RelationGetFKeyList(Relation relation)
*** 4327,4335 
   * it is the pg_class OID of a unique index on OID when the relation has one,
   * and InvalidOid if there is no such index.
   *
!  * In exactly the same way, we update rd_replidindex, which is the pg_class
!  * OID of an index to be used as the relation's replication identity index,
!  * or InvalidOid if there is no such index.
   */
  List *
  RelationGetIndexList(Relation relation)
--- 4327,4336 
   * it is the pg_class OID of a unique index on OID when the relation has one,
   * and InvalidOid if there is no such index.
   *
!  * In exactly the same way, we update rd_pkindex, which is the OID of the
!  * relation's primary key index if any, else InvalidOid; and rd_replidindex,
!  * which is the pg_class OID of an index to be used as the relation's
!  * replication identity index, or InvalidOid if there is no such index.
   */
  List *
  RelationGetIndexList(Relation relation)
*** RelationGetIndexPredicate(Relation relat
*** 4746,4753 
   * we can include system attributes (e.g., OID) in the bitmap representation.
   *
   * Caller had better hold at least RowExclusiveLock on the target relation
!  * to ensure that it has a stable set of indexes.  This also makes it safe
!  * (deadlock-free) for us to take locks on the relation's indexes.
   *
   * The returned result is palloc'd in the caller's memory context and should
   * be bms_free'd when not needed anymore.
--- 4747,4756 
   * we can include system attributes (e.g., OID) in the bitmap representation.
   *
   * Caller had better hold at least RowExclusiveLock on the target relation
!  * to ensure it is safe (deadlock-free) for us to take locks on the relation's
!  * indexes.  Note that since the introduction of CREATE INDEX CONCURRENTLY,
!  * that lock level doesn't guarantee a stable set of indexes, so we have to
!  * be prepared to retry here in case of a change in the set of indexes.
   *
   * The returned result is palloc'd in the caller's memory context and should
   * be bms_free'd when not needed anymore.
*** RelationGetIndexAttrBitmap(Relation rela
*** 4760,4765 
--- 4763,4769 
  	Bitmapset  *pkindexattrs;	/* columns in the primary index */
  	Bitmapset  *idindexattrs;	/* columns in the replica identity */
  	List	   *indexoidlist;
+ 	List	   *newindexoidlist;
  	Oid			relpkindex;
  	Oid			relreplindex;
  	ListCell   *l;
*** RelationGetIndexAttrBitmap(Relation rela
*** 4788,4795 
  		return NULL;
  
  	/*
! 	 * Get cached list of index OIDs
  	 */
  	indexoidlist = RelationGetIndexList(relation);
  
  	/* Fall out if no indexes (but relhasindex was set) */
--- 4792,4800 
  		return NULL;
  
  	/*
! 	 * Get cached list of index OIDs. If we have to start over, we do so here.
  	 */
+ restart:
  	indexoidlist = RelationGetIndexList(relation);
  
  	/* Fall out if no indexes (but relhasindex was set) */
*** RelationGetIndexAttrBitmap(Relation rela
*** 4800,4808 
  	 * Copy the rd_pkindex and rd_replidindex value computed by
  	 * RelationGetIndexList before proceeding.  This is needed because a
  	 * relcache flush could occur inside index_open below, resetting the
! 	 * fields managed by RelationGetIndexList. (The values we're computing
! 	 * will still be valid, assuming that caller has a sufficient lock on
! 	 * the relation.)
  	 */
  	relpkindex = relation->rd_pkindex;
  	relreplindex = 

Re: [HACKERS] Ignore tablespace ACLs when ignoring schema ACLs

2017-02-05 Thread Noah Misch
On Sun, Feb 05, 2017 at 01:48:09PM -0500, Tom Lane wrote:
> Noah Misch  writes:
> > On Sun, Feb 05, 2017 at 12:46:41PM -0500, Tom Lane wrote:
> >> Is there any likely use-case for providing separate control flags for the
> >> two permission checks?  That would require an API change for DefineIndex,
> >> making this considerably more invasive, so I'm not pushing for it ---
> >> just think it's worth asking the question before proceeding.
> 
> > Good question.  I can't think of one.
> 
> Yeah, after some reflection I agree.  Basically what we want for the ALTER
> TABLE scenario is "ignore *all* permissions checks"; if somebody adds some
> other check here in future, it probably also ought to be skipped during
> a rebuild.  So a single bool ought to be fine.

Right.

> Are you intending to back-patch this?  It seems like a bug fix --- but not
> a very high-priority one, so at this point maybe it should wait till after
> the release wraps.

Back-patch after wrap sounds good.




Re: [HACKERS] Ignore tablespace ACLs when ignoring schema ACLs

2017-02-05 Thread Tom Lane
Noah Misch  writes:
> On Sun, Feb 05, 2017 at 12:46:41PM -0500, Tom Lane wrote:
>> Is there any likely use-case for providing separate control flags for the
>> two permission checks?  That would require an API change for DefineIndex,
>> making this considerably more invasive, so I'm not pushing for it ---
>> just think it's worth asking the question before proceeding.

> Good question.  I can't think of one.

Yeah, after some reflection I agree.  Basically what we want for the ALTER
TABLE scenario is "ignore *all* permissions checks"; if somebody adds some
other check here in future, it probably also ought to be skipped during
a rebuild.  So a single bool ought to be fine.

Are you intending to back-patch this?  It seems like a bug fix --- but not
a very high-priority one, so at this point maybe it should wait till after
the release wraps.

regards, tom lane




Re: [HACKERS] Ignore tablespace ACLs when ignoring schema ACLs

2017-02-05 Thread Noah Misch
On Sun, Feb 05, 2017 at 12:46:41PM -0500, Tom Lane wrote:
> Noah Misch  writes:
> > DefineIndex() has a check_rights argument that determines whether to
> > perform a namespace ACL check.  When ALTER TABLE ALTER TYPE rebuilds
> > an index, it sets that flag.  The theory goes that use of DROP INDEX
> > and CREATE INDEX is a mere implementation detail of ALTER TABLE ALTER
> > TYPE; the operation is logically like an alteration of the existing
> > index.  I think the same treatment should extend to the tablespace ACL
> > check, as attached.
> 
> Seems generally reasonable.
> 
> Is there any likely use-case for providing separate control flags for the
> two permission checks?  That would require an API change for DefineIndex,
> making this considerably more invasive, so I'm not pushing for it ---
> just think it's worth asking the question before proceeding.

Good question.  I can't think of one.




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Martín Marqués
El 05/02/17 a las 10:03, Michael Paquier escribió:
> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule  wrote:
>> I agree with Pavan - a release with a known important bug is not a good idea.
> 
> This bug has been around for some time, so I would recommend taking
> the time necessary to make the best fix possible, even if it means
> waiting for the next round of minor releases.

The fact that the bug has been around for a long time doesn't mean we
shouldn't take it seriously.

IMO any kind of corruption (heap or index) should be prioritized.
There have also been comments that this might be the cause of old
reports of index corruption.

I ask myself whether it's a good idea to make a point release with a
known corruption bug in it.

Regards,

-- 
Martín Marqués    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavel Stehule
2017-02-05 18:51 GMT+01:00 Tom Lane :

> Michael Paquier  writes:
> > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule wrote:
> >> I agree with Pavan - a release with a known important bug is not a
> >> good idea.
>
> > This bug has been around for some time, so I would recommend taking
> > the time necessary to make the best fix possible, even if it means
> > waiting for the next round of minor releases.
>
> I think the way to think about this sort of thing is, if we had found
> this bug when a release wasn't imminent, would we consider it bad enough
> to justify an unscheduled release cycle?  I have to think the answer for
> this one is "probably not".
>

There are a few questions:

1. Is it a critical bug?
2. If it is, is that reason enough for a special minor release?
3. How much time is necessary to fix this bug?

These questions may be unimportant if there are security fixes in the next
release, which I cannot know.

Regards

Pavel


>
> regards, tom lane
>


Re: [HACKERS] Draft release notes for next week's releases are up for review

2017-02-05 Thread Tom Lane
Amit Kapila  writes:
> On Sat, Feb 4, 2017 at 11:27 PM, Tom Lane  wrote:
>> Fix possible miss of socket read events while waiting on Windows (Amit Kapial)

> Typo
> Amit Kapial/Amit Kapila

Wups.  Copied-and-pasted that from the commit log without stopping to
question it.  Will fix, thanks for noticing.

regards, tom lane




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Tom Lane
Michael Paquier  writes:
> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule  wrote:
>> I agree with Pavan - a release with a known important bug is not a good idea.

> This bug has been around for some time, so I would recommend taking
> the time necessary to make the best fix possible, even if it means
> waiting for the next round of minor releases.

I think the way to think about this sort of thing is, if we had found
this bug when a release wasn't imminent, would we consider it bad enough
to justify an unscheduled release cycle?  I have to think the answer for
this one is "probably not".

regards, tom lane




Re: [HACKERS] Ignore tablespace ACLs when ignoring schema ACLs

2017-02-05 Thread Tom Lane
Noah Misch  writes:
> DefineIndex() has a check_rights argument that determines whether to
> perform a namespace ACL check.  When ALTER TABLE ALTER TYPE rebuilds an
> index, it sets that flag.  The theory goes that use of DROP INDEX and
> CREATE INDEX is a mere implementation detail of ALTER TABLE ALTER TYPE;
> the operation is logically like an alteration of the existing index.  I
> think the same treatment should extend to the tablespace ACL check, as
> attached.

Seems generally reasonable.

Is there any likely use-case for providing separate control flags for the
two permission checks?  That would require an API change for DefineIndex,
making this considerably more invasive, so I'm not pushing for it ---
just think it's worth asking the question before proceeding.

regards, tom lane




Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-02-05 Thread Andrew Borodin
Hi, Jeff!

2017-02-05 3:45 GMT+05:00 Jeff Davis :
> On Sun, Jan 22, 2017 at 10:32 PM, Jeff Davis  wrote:
>> On Sat, Jan 21, 2017 at 4:25 AM, Andrew Borodin  wrote:
>> One idea I had that might be simpler is to use a two-stage page
>> delete. The first stage would remove the link from the parent and mark
>> the page deleted, but leave the right link intact and prevent
>> recycling. The second stage would follow the chain of right links
>> along each level, removing the right links to deleted pages and
>> freeing the page to be recycled.
>
> Do you think this approach is viable as a simplification?

To consider this approach I need to ask several questions.

1. Are we discussing simplification of the existing GIN vacuum, or the
proposed GIN vacuum? Please note that they do not differ in the way a
page is unlinked; the function ginDeletePage() is almost untouched.

2. What do you mean by "stage"? In terms of the paper "A symmetric
concurrent B-tree algorithm" by Lanin: a stage is an
uninterrupted period of holding a lock on a nonempty page set.
Here's the picture https://www.screencast.com/t/xUpGKgkkU from L
Both papers (L and L) tend to avoid lock coupling, which is
inevitable if you want to do the parent unlink first. To prevent insertion
of records on a page, you have to mark it. If you do this in
the stage when you unlink from the parent, you have to hold both locks.

3. What do you want to simplify? Currently we unlink a page from the
parent and the left sibling in one stage, at the cost of acquiring the
CleanUp lock (the way we acquire it is the difference between the current
and patched versions).
A 2-stage algorithm will not be simpler, but it will be more concurrent.
Please note that in the absence of high fence keys in the GIN B-tree we
actually would have to implement the algorithm from figure 3A
https://www.screencast.com/t/2cfGZtrzaz0z  (it would be incorrect, but
only with high keys present)

Best regards, Andrey Borodin.




Re: [HACKERS] Parallel bitmap heap scan

2017-02-05 Thread Dilip Kumar
On Wed, Feb 1, 2017 at 11:09 PM, Robert Haas  wrote:

Here is the latest version of the patch, which addresses all the
comments given by Robert.

> https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3...@alap3.anarazel.de
> that you should try to share only the iteration arrays, I believe that
> he meant the results of tbm_begin_iterate() -- that is,
> iterator->spageptr, chunkptr, schunkbit, spages, and schunks.  I'm
> perhaps putting words into his mouth, but what he said was that we
> could avoid sharing "the whole underlying hash".  But the patch set
> you've implemented here doesn't behave that way.  Instead, you've got
> the array containing the hash table elements shared (which is good)
> plus there's a sort of dummy hash table in every worker which copies
> some but not all of the members of the original hash table, leading to
> the otherwise-unnecessary if-test that Haribabu is complaining about.
> So the hash table is kinda-shared-kinda-not-shared, which I don't
> *think* is really what Andres had in mind.
>
> In short, I think SH_COPY (implemented here as pagetable_copy) needs
> to go away.  The work of tbm_begin_iterate() should be done before we
> begin the shared scan and the results of that work should be shared.

I have removed SH_COPY, and now the leader performs
tbm_begin_shared_iterate before waking up the workers. Basically, the
leader will create the page and chunk arrays, which are arrays of
"relptr" (suggested offlist by Robert).

> What happens right now (it appears) is that every backend does that
> work based on the same hash table and we just assume they all get the
> same answer.  And we really do assume that, because
> pbms_parallel_iterate() assumes it can shuttle private state back and
> forth between iterator in different backends and nothing will break;
> but those backends aren't actually using the same iteration arrays.
> They are using different iteration arrays that should have the same
> contents because they were all derived from the same semi-shared hash
> table.  That's pretty fragile, and a waste of CPU cycles if the hash
> table is large (every backend does its own sort).
>
> On a related note, I think it's unacceptable to make the internal
> details of TBMIterator public.  You've moved it from tidbitmap.c to
> tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
> the TBMIterator, but that's not OK.  Those details need to remain
> private to tidbitmap.c.
Fixed

> pbms_parallel_iterate() is a nasty kludge; we
> need some better solution.  The knowledge of how a shared iterator
> should iterate needs to be private to tidbitmap.c, not off in the
> executor someplace.  And I think the entries need to actually be
> updated in shared memory directly, not copied back and forth between a
> backend-private iterator and a shared iterator.

I have fixed this; there is now a new function called tbm_shared_iterate
which iterates directly using the shared iterator, so there is no longer
any need to copy members back and forth between the local and shared
iterators.

>
> Also, pbms_parallel_iterate() can't hold a spinlock around a call to
> tbm_iterate().  Note the coding rules for spinlocks mentioned in
> spin.h and src/backend/storage/lmgr/README.  I think the right thing
> to do here is to use an LWLock in a new tranche (add an entry to
> BuiltinTrancheIds).

Done that way.
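To illustrate the shape of that arrangement, here is a generic standalone sketch of several workers draining one shared iterator under a sleepable lock (a pthread mutex standing in for an LWLock; all names invented, not the actual tidbitmap code):

```c
#include <pthread.h>

/* Toy shared iterator: several workers pull "pages" from one shared
 * cursor under a sleepable lock rather than a spinlock, since the
 * critical section is more than a few trivial instructions in the real
 * code.  Invented names; not PostgreSQL code. */
enum { NPAGES = 1000, NWORKERS = 4 };

static pthread_mutex_t iter_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_page = 0;
static int handled[NWORKERS];

/* Return the next page number, or -1 when the scan is exhausted. */
static int
shared_iterate(void)
{
    int page;

    pthread_mutex_lock(&iter_lock);
    page = (next_page < NPAGES) ? next_page++ : -1;
    pthread_mutex_unlock(&iter_lock);
    return page;
}

static void *
worker(void *arg)
{
    int id = *(int *) arg;

    while (shared_iterate() != -1)
        handled[id]++;
    return NULL;
}

/* Run all workers to completion and count the pages processed. */
int
total_handled(void)
{
    pthread_t th[NWORKERS];
    int ids[NWORKERS];
    int total = 0;

    for (int i = 0; i < NWORKERS; i++)
    {
        ids[i] = i;
        pthread_create(&th[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(th[i], NULL);
    for (int i = 0; i < NWORKERS; i++)
        total += handled[i];
    return total;
}
```

Every page is handed out exactly once because the cursor only advances under the lock, which is the property the shared iterator needs.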
>
> In 0002, you have this:
>
> +tb->alloc = palloc(sizeof(SH_ALLOCATOR));
>
> This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
> Otherwise the allocator object is in a different context from
> everything else in the hash table.  But TBH, I don't really see why we
> want this to be a separate object.  Why not just put
> HashAlloc/HashFree/args into SH_TYPE directly?  That avoids some
> pointer chasing and doesn't really seem to cost anything (except that
> SH_CREATE will grow a slightly longer argument sequence).

Done as suggested
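The same-context point can be illustrated with a toy memory context (invented names; PostgreSQL's MemoryContext machinery is much richer): allocations made through a context are released together when the context is reset, which is why the allocator object and the hash table it serves should live in the same context.

```c
#include <stdlib.h>

/* Toy "memory context": allocations are remembered by the context and
 * all released together, loosely mimicking why an object should be
 * allocated in the same context as the structure it serves.  Invented
 * names; not PostgreSQL code. */
#define MAX_CHUNKS 64

typedef struct ToyContext
{
    void *chunks[MAX_CHUNKS];
    int   nchunks;
} ToyContext;

static void *
context_alloc(ToyContext *ctx, size_t size)
{
    void *p;

    if (ctx->nchunks >= MAX_CHUNKS)
        return NULL;            /* toy limit; real contexts grow */
    p = malloc(size);
    ctx->chunks[ctx->nchunks++] = p;
    return p;
}

/* Free everything the context owns in one pass. */
static void
context_reset(ToyContext *ctx)
{
    for (int i = 0; i < ctx->nchunks; i++)
        free(ctx->chunks[i]);
    ctx->nchunks = 0;
}
```

If one piece were allocated from a different context, resetting this one would leave that piece dangling or leaked, which is the hazard the review comment points at.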
>
> I think maybe we should rename the functions to element_allocate,
> element_free, and element_allocator_ctx, or something like that.  The
> current names aren't capitalized consistently with other things in
> this module, and putting the word "element" in there would make it
> more clear what the purpose of this thing is.

Fixed as per the suggestion

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0002-hash-support-alloc-free-v16.patch
Description: Binary data


0003-parallel-bitmap-heap-scan-v16.patch
Description: Binary data



Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Michael Paquier
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule  wrote:
> I agree with Pavan - a release with a known important bug is not a good idea.

This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.
-- 
Michael




Re: [HACKERS] Draft release notes for next week's releases are up for review

2017-02-05 Thread Amit Kapila
On Sat, Feb 4, 2017 at 11:27 PM, Tom Lane  wrote:
> First-draft release notes are available at
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9863017b87f3592ff663d03fc663a4f1f8fdb8b2
> They should appear in a more readable form at
> https://www.postgresql.org/docs/devel/static/release-9-6-2.html
> after guaibasaurus' next buildfarm run, due a couple hours from now.
>
> As usual, I've filled in the frontmost branch's section with all the
> material, even though a couple of items don't apply because they only
> went into further-back branches.  I'll sort that out when I split
> things up.  Right now the question is whether individual items are
> correctly/adequately documented.
>


> Fix possible miss of socket read events while waiting on Windows (Amit Kapial)

Typo
Amit Kapial/Amit Kapila



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [PATCH] Generic type subscription

2017-02-05 Thread Dmitry Dolgov
On 24 January 2017 at 02:36, Tom Lane  wrote:

> It might be possible to get away with having only one pg_type column,
> pointing at the parse-analysis function.  That function would generate
> a SubscriptingRef tree node containing the OID of the appropriate
> execution function, which execQual.c could call without ever knowing
> its name explicitly.

Btw, would it be acceptable for such a generated SubscriptingRef to
contain just a function pointer to the appropriate execution function
instead of an OID (e.g., like `ExprStateEvalFunc`)? That would help
avoid problems in the case of extensions.


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavel Stehule
2017-02-05 7:54 GMT+01:00 Pavan Deolasee :

>
> On Sat, Feb 4, 2017 at 11:54 PM, Tom Lane  wrote:
>
>>
>> Based on Pavan's comments, I think trying to force this into next week's
>> releases would be extremely unwise.  If the bug went undetected this long,
>> it can wait for a fix for another three months.
>
>
> Yes, I think the bug has existed all along and gone unnoticed. One reason could
> be that the race happens only when the new index turns HOT updates into
> non-HOT updates. Another reason could be that we don't have checks in place
> to catch these kinds of corruption. Having said that, since we have
> discovered the bug, at least many 2ndQuadrant customers have expressed
> worry and want to know if the fix will be available in 9.6.2 and other
> minor releases.  Since the bug can lead to data corruption, the worry is
> justified. Until we fix the bug, there will be a constant worry about using
> CIC.
>
> If we can have some kind of band-aid fix to plug in the hole, that might
> be enough as well. I tested my first patch (which will need some polishing)
> and that works well AFAIK. I was worried about prepared queries and all,
> but that seems ok too. RelationGetIndexList() always gets called within
> ExecInitModifyTable. The fix seems quite unlikely to cause any other side
> effects.
>
> Another possible band-aid is to throw another relcache invalidation in
> CIC. Like adding a dummy index_set_state_flags() within yet another
> transaction. Seems ugly, but should fix the problem for now and not cause
> any impact on relcache mechanism whatsoever.
>
> That seems better than
>> risking new breakage when it's barely 48 hours to the release wrap
>> deadline.  We do not have time to recover from any mistakes.
>>
>
> I'm not sure what the project policies are, but can we consider delaying
> the release by a week for issues like these? Or do you think it will be
> hard to come up with a proper fix for the issue and it will need some
> serious refactoring?
>

I agree with Pavan - a release with a known important bug is not a good idea.

Pavel


> Thanks,
> Pavan
>
> --
>  Pavan Deolasee   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>