Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 10:51 PM, Heikki Linnakangas  wrote:
> I still don't get it. When building the initial runs, we don't need buffer
> space for maxTapes yet, because we're only writing to a single tape at a
> time. An unused tape shouldn't take much memory. In inittapes(), when we
> have built all the runs, we know how many tapes we actually needed, and we
> can allocate the buffer memory accordingly.

Right. That's correct. But, we're not concerned about physically
allocated memory, but rather logically allocated memory (i.e., what
goes into USEMEM()). tuplesort.c should be able to fully use the
workMem specified by caller in the event of an external sort, just as
with an internal sort.

> [thinks a bit, looks at logtape.c]. Hmm, I guess that's wrong, because of
> the way this all is implemented. When we're building the initial runs, we're
> only writing to one tape at a time, but logtape.c nevertheless holds onto a
> BLCKSZ'd currentBuffer, plus one buffer for each indirect level, for every
> tape that has been used so far. What if we changed LogicalTapeRewind to free
> those buffers?

There isn't much point in that, because those buffers are never
physically allocated in the first place when there are thousands. They
are, however, entered into the tuplesort.c accounting as if they were,
denying tuplesort.c the full benefit of available workMem. It doesn't
matter if you USEMEM() or FREEMEM() after we first spill to disk, but
before we begin the merge. (We already refund the
unused-but-logically-allocated memory from unusued at the beginning of
the merge (within beginmerge()), so we can't do any better than we
already are from that point on -- that makes the batch memtuples
growth thing slightly more effective.)

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Heikki Linnakangas

On 09/07/2016 12:46 AM, Peter Geoghegan wrote:

On Tue, Sep 6, 2016 at 12:34 AM, Heikki Linnakangas  wrote:

Why do we reserve the buffer space for all the tapes right at the beginning?
Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) callin
inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every time we
start a new run, until we reach maxTapes?


No, because then you have no way to clamp back memory, which is now
almost all used (we hold off from making LACKMEM() continually true,
if at all possible, which is almost always the case). You can't really
continually shrink memtuples to make space for new tapes, which is
what it would take.


I still don't get it. When building the initial runs, we don't need 
buffer space for maxTapes yet, because we're only writing to a single 
tape at a time. An unused tape shouldn't take much memory. In 
inittapes(), when we have built all the runs, we know how many tapes we 
actually needed, and we can allocate the buffer memory accordingly.


[thinks a bit, looks at logtape.c]. Hmm, I guess that's wrong, because 
of the way this all is implemented. When we're building the initial 
runs, we're only writing to one tape at a time, but logtape.c 
nevertheless holds onto a BLCKSZ'd currentBuffer, plus one buffer for 
each indirect level, for every tape that has been used so far. What if 
we changed LogicalTapeRewind to free those buffers? Flush out the 
indirect buffers to disk, remembering just the physical block number of 
the topmost indirect block in memory, and free currentBuffer. That way, 
a tape that has been used, but isn't being read or written to at the 
moment, would take very little memory, and we wouldn't need to reserve 
space for them in the build-runs phase.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: function xmltable

2016-09-06 Thread Pavel Stehule
2016-09-07 5:03 GMT+02:00 Craig Ringer :

> On 7 September 2016 at 04:13, Pavel Stehule 
> wrote:
>
> >> Overall, I think this needs to be revised with appropriate comments.
> >> Whitespace/formatting needs fixing since it's all over the place.
> >> Documentation is insufficient (per notes below).
> >
> >
> > I am not able to write documentation in English language :( - This
> function
> > is pretty complex - so I hope so anybody with better language skills can
> > help with this. It respects standard and it respects little bit different
> > Oracle's behave too (different order of DEFAULT and PATH parts).
>
> OK, no problem. It can't be committed without more comprehensive docs
> though, especially for new and nontrivial functionality.
>
> Is there some reference material you can point to so someone else can
> help with docs? And can you describe what differences there are
> between your implementation and the reference?
>
> Alternately, if you document it in Czech, do you know of anyone who
> could assist in translating to English for the main documentation?
>
> >> Re identifier naming, some of this code uses XmlTable naming patterns,
> >> some uses TableExpr prefixes. Is that intended to indicate a bounary
> >> between things re-usable for other structured data ingesting
> >> functions? Do you expect a "JSONEXPR" or similar in future? That's
> >> alluded to by
> >
> >
> > This structure should be reused by JSON_TABLE function. Now, it is little
> > bit strange, because there is only XMLTABLE implementation - and I have
> to
> > choose between a) using two different names now, b) renaming some part in
> > future.
>
> OK. Are you planning on writing this JSON_TABLE or are you leaving
> room for future growth? Either way is fine, just curious.
>
> > And although XMLTABLE and JSON_TABLE functions are pretty similar - share
> > 90% of data (input value, path, columns definitions), these functions has
> > different syntax - so only middle level code should be shared.
>
> That makes sense.
>
> I think it would be best if you separated out the TableExpr
> infrastructure from the XMLTABLE implementation though, so we can
> review the first level infrastrcture separately and make this a
> 2-patch series. Most importantly, doing it that way will help you find
> places where TableExpr code calls directly into XMLTABLE code. If
> TableExpr is supposed to be reusable for json etc, it probably
> shouldn't be calling XmlTable stuff directly.
>
> That also means somewhat smaller simpler patches, which probably isn't bad.
>
> I don't necessarily think this needs to be fully pluggable with
> callbacks etc. It doesn't sound like you expect this to be used by
> extensions or to have a lot of users, right? So it probably just needs
> clearer separation of the infrastructure layer from the xmltable
> layer. I think splitting the patch will make that easier to see and
> make it easier to find problems.
>
> My biggest complaint at the moment is that execEvalTableExpr calls
> initXmlTableContext(...) directly, is aware of XML namespaces
> directly, calls XmlTableSetRowPath() directly, calls
> XmlTableFetchRow() directly, etc. It is in no way generic/reusable for
> some later JSONTABLE feature. That needs to be fixed by:
>
> * Renaming it so it's clearly only for XMLTABLE; or
> * Abstracting the init context, set row path, fetch row etc operations
> so json ones can be plugged in later
>
>
>
> > Currently the common part is not too big - just the Node related part -
> I am
> > not sure about necessity of two patches.
>
> The problem is that the common part is all mixed in with the
> XMLTABLE-specific part, so it's not at all clear it can be common with
> something else.
>
> > I am agree, there is missing some
> > TableExpBuilder, where can be better isolated the XML part.
>
> Yeah, that's sort of what I'm getting at.
>
>
> >> execEvalTableExpr seems to be defined twice, with a difference in
> >> case. This is probably not going to fly:
> >>
> >>
> >> +static Datum
> >> +execEvalTableExpr(TableExprState *tstate,
> >> +ExprContext *econtext,
> >> +bool *isNull, ExprDoneCond *isDone)
> >> +{
> >>
> >> +static Datum
> >> +ExecEvalTableExpr(TableExprState *tstate,
> >> +ExprContext *econtext,
> >> +bool *isNull, ExprDoneCond *isDone)
> >> +{
> >>
> >>
> >> It looks like you've split the function into a "guts" and "wrapper"
> >> part, with the error handling PG_TRY / PG_CATCH block in the wrapper.
> >> That seems reasonable for readability, but the naming isn't.
> >
> >
> > I invite any idea how these functions should be named.
>
> Definitely not how they are ;) . They really can't differ in a single
> character's case.
>
> I'm not sure if PostgreSQL has any formal convention for this. Some
> places use _impl e.g.  pg_read_barrier_impl() but that's in the
> context of an 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 10:36 PM, Peter Geoghegan  wrote:
> Well, maybe, but the whole idea behind replacement_sort_tuples (by
> which I mean the continued occasional use of replacement selection by
> Postgres) was that we hope to avoid a merge step *entirely*. This new
> merge shift down heap patch could make the merge step so cheap as to
> be next to free anyway (in the even of presorted input)

I mean: Cheaper than just processing the tuples to return to caller
without comparisons/merging (within the TSS_SORTEDONTAPE path). I do
not mean free in an absolute sense, of course.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 10:28 PM, Heikki Linnakangas  wrote:
>> BTW, the way that k-way merging is made more efficient by this
>> approach makes the case for replacement selection even weaker than it
>> was just before we almost killed it.
>
>
> This also makes the replacement selection cheaper, no?

Well, maybe, but the whole idea behind replacement_sort_tuples (by
which I mean the continued occasional use of replacement selection by
Postgres) was that we hope to avoid a merge step *entirely*. This new
merge shift down heap patch could make the merge step so cheap as to
be next to free anyway (in the even of presorted input), so the
argument for replacement_sort_tuples is weakened further. It might
always be cheaper once you factor in that the TSS_SORTEDONTAPE path
for returning tuples to caller happens to not be able to use batch
memory, even with something like collated text. And, as a bonus, you
get something that works just as well with an inverse correlation,
which was traditionally the worst case for replacement selection (it
makes it produce runs no larger than those produced by quicksort).

Anyway, I only mention this because it occurs to me. I have no desire
to go back to talking about replacement selection either. Maybe it's
useful to point this out, because it makes it clearer still that
severely limiting the use of replacement selection in 9.6 was totally
justified.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Heikki Linnakangas

On 09/06/2016 10:42 PM, Peter Geoghegan wrote:

On Tue, Sep 6, 2016 at 12:39 PM, Peter Geoghegan  wrote:

On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas  wrote:

I attach a patch that changes how we maintain the heap invariant
during tuplesort merging.



Nice!


Thanks!


BTW, the way that k-way merging is made more efficient by this
approach makes the case for replacement selection even weaker than it
was just before we almost killed it.


This also makes the replacement selection cheaper, no?


I hate to say it, but I have to
wonder if we shouldn't get rid of the new-to-9.6
replacement_sort_tuples because of this, and completely kill
replacement selection. I'm not going to go on about it, but that seems
sensible to me.


Yeah, perhaps. But that's a different story.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

2016-09-06 Thread Dilip Kumar
On Wed, Sep 7, 2016 at 8:52 AM, Haribabu Kommi  wrote:
> I reviewed and tested the patch. The changes are fine.
> This patch provides better error message compared to earlier.
>
> Marked the patch as "Ready for committer" in commit-fest.


Thanks for the review !

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw

2016-09-06 Thread Ashutosh Bapat
Thanks Fujita-san for working on this.


> * with the patch:
> postgres=# explain verbose delete from ft1 using ft2 where ft1.a = ft2.a;
>  QUERY PLAN
> 
> -
>  Delete on public.ft1  (cost=100.00..102.04 rows=1 width=38)
>->  Foreign Delete  (cost=100.00..102.04 rows=1 width=38)
>  Remote SQL: DELETE FROM public.t1 r1 USING (SELECT ROW(a, b), a
> FROM public.t2) ss1(c1, c2) WHERE ((r1.a = ss1.c2))
> (3 rows)
>
> The WIP patch has been created on top of the join pushdown patch [1]. So,
> for testing, please apply the patch in [1] first.
>
>
The underlying scan on t2 requires ROW(a,b) for locking the row for
update/share. But clearly it's not required if the full query is being
pushed down. Is there a way we can detect that ROW(a,b) is useless column
(not used anywhere in the other parts of the query like RETURNING, DELETE
clause etc.) and eliminate it? Similarly for a, it's part of the targetlist
of the underlying scan so that the WHERE clause can be applied on it. But
it's not needed if we are pushing down the query. If we eliminate the
targetlist of the query, we could construct a remote query without having
subquery in it, making it more readable.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] [PATCH] COPY vs \copy HINT

2016-09-06 Thread Tom Lane
Craig Ringer  writes:
> On 7 September 2016 at 04:19, Christoph Berg  wrote:
>> I like your new version, it's crisp and transports the right message.

> OK, updated with Tom's tweaked version of Christoph's wording per
> discussion. Thanks all.

Pushed with minor stylistic adjustment (narrowing the scope of
save_errno).

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM

2016-09-06 Thread Craig Ringer
On 7 September 2016 at 11:37, Corey Huinker  wrote:
> On Tue, Sep 6, 2016 at 11:24 PM, Craig Ringer 
> wrote:
>>
>> On 7 September 2016 at 11:21, Corey Huinker 
>> wrote:
>> > On Tue, Sep 6, 2016 at 6:53 PM, Craig Ringer
>> > 
>>
>> > And the TAP test would detect the operating system and know to create an
>> > FDW
>> > that has the PROGRAM value 'cat test_data.csv' on Unix, 'type
>> > test_data.csv'
>> > on windows, and 'type test_data.csv;1' on VMS?
>>
>> Right. Or just "perl emit_test_data.pl" that works for all of them,
>> since TAP is perl so you can safely assume you have Perl.
>
>
> Thanks. I was mentally locked in more basic OS commands. Am I right in
> thinking perl is about the *only* OS command you can be sure is on every
> architecture?

Probably, there's a lot of crazy out there.

TAP tests can be conditionally run based on architecture, but
something like this is probably worth testing as widely as possible.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM

2016-09-06 Thread Corey Huinker
On Tue, Sep 6, 2016 at 11:24 PM, Craig Ringer 
wrote:

> On 7 September 2016 at 11:21, Corey Huinker 
> wrote:
> > On Tue, Sep 6, 2016 at 6:53 PM, Craig Ringer <
> craig.rin...@2ndquadrant.com>
>
> > And the TAP test would detect the operating system and know to create an
> FDW
> > that has the PROGRAM value 'cat test_data.csv' on Unix, 'type
> test_data.csv'
> > on windows, and 'type test_data.csv;1' on VMS?
>
> Right. Or just "perl emit_test_data.pl" that works for all of them,
> since TAP is perl so you can safely assume you have Perl.
>

Thanks. I was mentally locked in more basic OS commands. Am I right in
thinking perl is about the *only* OS command you can be sure is on every
architecture?

The platforms page says we support S/390 but no mention of VM/MVS/CMS. Did
we do an OS/400 port yet? ::ducks::


Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM

2016-09-06 Thread Corey Huinker
On Tue, Sep 6, 2016 at 9:46 PM, Amit Langote 
wrote:

> On 2016/09/07 3:12, Corey Huinker wrote:
> > On Fri, Sep 2, 2016 at 5:07 AM, Amit Langote wrote:
> >> I am not familiar with win32 stuff too, so I don't have much to say
> about
> >> that.  Maybe someone else can chime in to help with that.
> >
> > The regressions basically *can't* test this because we'd need a shell
> > command we know works on any architecture.
>
> OK.
>

Well...maybe not, depending on what Craig and other can do to educate me
about the TAP tests.


> >
> > Changing table-level options requires superuser privileges, for security
> > reasons: only a superuser should be able to determine which file is read,
> > or which program is run. In principle non-superusers could be allowed to
> > change the other options, but that's not supported at present.
>
> Hmm, just a little modification would make it better for me:
>
> ... for security reasons.  For example, only superuser should be able to
> determine which file read or which program is run.
>
> Although that could be just me.
>

In this case "determine" is unclear whether a non-superuser can set the
program to be run, or is capable of knowing which program is set to be run
by the fdw.

We may want some more opinions on what is the most clear.


>
> >> @@ -632,11 +671,18 @@ fileBeginForeignScan(ForeignScanState *node, int
> >> eflags)
> >>   * Create CopyState from FDW options.  We always acquire all
> columns,
> >> so
> >>   * as to match the expected ScanTupleSlot signature.
> >>   */
> >> -cstate = BeginCopyFrom(node->ss.ss_currentRelation,
> >> -   filename,
> >> -   false,
> >> -   NIL,
> >> -   options);
> >> +if(filename)
> >> +cstate = BeginCopyFrom(node->ss.ss_currentRelation,
> >> +   filename,
> >> +   false,
> >> +   NIL,
> >> +   options);
> >> +else
> >> +cstate = BeginCopyFrom(node->ss.ss_currentRelation,
> >> +   program,
> >> +   true,
> >> +   NIL,
> >> +   options)
> >>
> >> Like I mentioned above, if there was a is_program Boolean flag instead
> of
> >> separate filename and program, this would be just:
> >>
> >> +cstate = BeginCopyFrom(node->ss.ss_currentRelation,
> >> +   filename,
> >> +   is_program,
> >> +   NIL,
> >> +   options);
> >>
> >> That would get rid of if-else blocks here and a couple of other places.
> >
> > It would, pushing the complexity out to the user. Which could be fine,
> but
> > if we do that then "filename" is a misnomer.
>
> This is an internal state so I'm not sure how this would be pushing
> complexity out to the user.  Am I perhaps misreading what you said?
>

Indeed, it is internal state. Maybe we rename the variable file_command or
something.


>
> What a user sees is that there are two separate foreign table options -
> filename and program.  That we handle them internally using a string to
> identify either file or program and a Boolean flag to show which of the
> two is just an internal implementation detail.
>
> COPY does it that way internally and I just saw that psql's \copy does it
> the same way too.  In the latter's case, following is the options struct
> (src/bin/psql/copy.c):
>
> struct copy_options
> {
> char   *before_tofrom;  /* COPY string before TO/FROM */
> char   *after_tofrom;   /* COPY string after TO/FROM filename */
> char   *file;   /* NULL = stdin/stdout */
> boolprogram;/* is 'file' a program to popen? */
> boolpsql_inout; /* true = use psql stdin/stdout */
> boolfrom;   /* true = FROM, false = TO */
> };
>
> But as you said above, this could be deferred to the committer.
>

Yeah, and that made for zero storage savings: a char pointer which is never
assigned a string takes up as much space as a 4-byte-aligned boolean. And
the result is that "file" really means program, which I found slightly
awkward.


> >> diff --git a/contrib/file_fdw/input/file_fdw.source
> >> b/contrib/file_fdw/input/file_fdw.source
> >> index 685561f..eae34a4 100644
> >> --- a/contrib/file_fdw/input/file_fdw.source
> >> +++ b/contrib/file_fdw/input/file_fdw.source
> >>
> >> You forgot to include expected output diffs.
> >
> > Having regression tests for this is extremely problematic, because the
> > program invoked would need to be an invokable command on any architecture
> > supported by postgres. I'm pretty sure no such command exists.
>
> Craig and Michael elsewhere suggested something about adding TAP tests. If
> that helps the situation, maybe you could.
>

Yeah, I 

Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM

2016-09-06 Thread Craig Ringer
On 7 September 2016 at 11:21, Corey Huinker  wrote:
> On Tue, Sep 6, 2016 at 6:53 PM, Craig Ringer 

> And the TAP test would detect the operating system and know to create an FDW
> that has the PROGRAM value 'cat test_data.csv' on Unix, 'type test_data.csv'
> on windows, and 'type test_data.csv;1' on VMS?

Right. Or just "perl emit_test_data.pl" that works for all of them,
since TAP is perl so you can safely assume you have Perl.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

2016-09-06 Thread Haribabu Kommi
On Fri, Aug 26, 2016 at 7:11 PM, Dilip Kumar  wrote:


> I have changed this in attached patch..
>

I reviewed and tested the patch. The changes are fine.
This patch provides better error message compared to earlier.

Marked the patch as "Ready for committer" in commit-fest.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Suggestions for first contribution?

2016-09-06 Thread Craig Ringer
On 7 September 2016 at 10:54, Noah Misch  wrote:
> On Tue, Sep 06, 2016 at 01:32:10PM -0400, Christian Convey wrote:
>> It sounds like the most useful thing I can do at the moment is perform
>> code reviews.  I assumed I'd need more experience with the PG code
>> base, but I keep on reading that newcomers' reviews are welcome.
>> Unless someone has a better idea, I'll start with that.
>
> That is a good choice.  Thanks for getting involved in this way.

Yeah, code review is hard when you don't know the codebase, but it's
also a good way to learn it.

A reviewer who doesn't already know the part of the code of interest
well is in a good position to say "hey, this makes no sense unless you
know that X, Y, Z and 42, maybe you should have a comment here so the
next reader doesn't have to spend 4 hours studying calls through three
different subsystems to figure it out?" . That's actually quite nice
for maintainability, though obviously we can't describe everything in
detail everywhere, there has to be a balance.

Sometimes you'll also ask "dumb" questions that force someone to
explain assumptions they'd made... and in the process figure out that
actually their assumptions don't fit reality / make sense. I certainly
find myself writing "well, it does X" explanations then figure I
should re-check, go read the code and find out that actually ... no,
not so much.

The main thing that's hard is knowing where to look to start learning
about the basic subsystems you need to know and the ones a particular
patch touches on. fmgr, relcache/syscache (and invalidation, oh fun),
how the whole Datum variant business works, memory contexts, PGPROC,
walsenders and walreceivers, startup process, bgwriter, checkpointer,
bgworkers, latches and LWLocks, shmem and private memory and
copy-on-write postmaster memory, elog/ereport(...) and longjmp() error
unwinding, visibility/xmin/xmax/cmin/cmax, infomask/hintbits/locks,
SLRUs and clog, multixacts and comboxacts, low level heap/index access
routines, the SPI, the parser/rewriter/planner and its subsystems like
inheritance_planner, extensions, timelines, logical decoding, slots,
 etc etc. We really could use a developer's primer. The current
developer docs and scattered READMEs are nice for what they cover, but
somewhat spotty. That said, who's volunteering? Nope nope nope! Though
I am trying to blog about various subsystems occasionally.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM

2016-09-06 Thread Corey Huinker
On Tue, Sep 6, 2016 at 6:53 PM, Craig Ringer 
wrote:

> On 7 Sep. 2016 02:14, "Corey Huinker"  wrote:
> >
>
> > Having regression tests for this is extremely problematic, because the
> program invoked would need to be an invokable command on any architecture
> supported by postgres. I'm pretty sure no such command exists.
>
> Your best bet will be using the TAP framework. There you can use Perl
> logic.
>
> I'm not sure where to put such a test though. It doesn't really make sense
> in src/test/recovery/ .
>

And the TAP test would detect the operating system and know to create an
FDW that has the PROGRAM value 'cat test_data.csv' on Unix, 'type
test_data.csv' on windows, and 'type test_data.csv;1' on VMS?


Re: [HACKERS] Optimization for lazy_scan_heap

2016-09-06 Thread Masahiko Sawada
On Wed, Sep 7, 2016 at 1:47 AM, Robert Haas  wrote:
> On Mon, Sep 5, 2016 at 8:57 PM, Masahiko Sawada  wrote:
>>> What performance difference does this make, in a realistic use case?
>>
>> I have yet to measure performance effect but it would be effect for
>> very large frozen table.
>
> I think if you are proposing this patch because you think that the
> existing code won't perform well, you definitely need to submit some
> performance results showing that your approach is better.  If you
> can't show that your approach is practically better (rather than just
> theoretically better), then I'm not sure we should change anything.
> It's very easy to screw up the code in this area and we do not want to
> risk corrupting data for the sake of an optimization that isn't really
> needed in the first place.
>
> Of course, if you can prove that the change has a significant benefit,
> then it's definitely worth considering.
>

I understood, thank you.

I've measured the performance benefit of this patch by following steps.
1. Create very large table and set all-visible flag to all blocks.
2. Measure the execution time of vacuum that skips to scan all heap pages.

* 1TB Table(visibility map size is 32MB)
HEAD : 11012.274 ms (00:11.012)
Patched : 6742.481 ms (00:06.742)

* 8TB Table(visibility map size is 64MB)
HEAD : 69886.605 ms (01:09.887)
Patched : 35562.131 ms (00:35.562)

* 32TB Table(visibility map size is 258MB)
HEAD: 265901.096 ms (04:25.901)
Patched: 131779.082 ms (02:11.779)

Since current HEAD could scan visibility map twice, the execution time
of Patched is approximately half of HEAD.
But when table is several hundreds gigabyte, performance benefit would
not be large.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: function xmltable

2016-09-06 Thread Craig Ringer
On 7 September 2016 at 04:13, Pavel Stehule  wrote:

>> Overall, I think this needs to be revised with appropriate comments.
>> Whitespace/formatting needs fixing since it's all over the place.
>> Documentation is insufficient (per notes below).
>
>
> I am not able to write documentation in English language :( - This function
> is pretty complex - so I hope so anybody with better language skills can
> help with this. It respects standard and it respects little bit different
> Oracle's behave too (different order of DEFAULT and PATH parts).

OK, no problem. It can't be committed without more comprehensive docs
though, especially for new and nontrivial functionality.

Is there some reference material you can point to so someone else can
help with docs? And can you describe what differences there are
between your implementation and the reference?

Alternately, if you document it in Czech, do you know of anyone who
could assist in translating to English for the main documentation?

>> Re identifier naming, some of this code uses XmlTable naming patterns,
>> some uses TableExpr prefixes. Is that intended to indicate a bounary
>> between things re-usable for other structured data ingesting
>> functions? Do you expect a "JSONEXPR" or similar in future? That's
>> alluded to by
>
>
> This structure should be reused by JSON_TABLE function. Now, it is little
> bit strange, because there is only XMLTABLE implementation - and I have to
> choose between a) using two different names now, b) renaming some part in
> future.

OK. Are you planning on writing this JSON_TABLE or are you leaving
room for future growth? Either way is fine, just curious.

> And although XMLTABLE and JSON_TABLE functions are pretty similar - share
> 90% of data (input value, path, columns definitions), these functions has
> different syntax - so only middle level code should be shared.

That makes sense.

I think it would be best if you separated out the TableExpr
infrastructure from the XMLTABLE implementation though, so we can
review the first level infrastrcture separately and make this a
2-patch series. Most importantly, doing it that way will help you find
places where TableExpr code calls directly into XMLTABLE code. If
TableExpr is supposed to be reusable for json etc, it probably
shouldn't be calling XmlTable stuff directly.

That also means somewhat smaller simpler patches, which probably isn't bad.

I don't necessarily think this needs to be fully pluggable with
callbacks etc. It doesn't sound like you expect this to be used by
extensions or to have a lot of users, right? So it probably just needs
clearer separation of the infrastructure layer from the xmltable
layer. I think splitting the patch will make that easier to see and
make it easier to find problems.

My biggest complaint at the moment is that execEvalTableExpr calls
initXmlTableContext(...) directly, is aware of XML namespaces
directly, calls XmlTableSetRowPath() directly, calls
XmlTableFetchRow() directly, etc. It is in no way generic/reusable for
some later JSONTABLE feature. That needs to be fixed by:

* Renaming it so it's clearly only for XMLTABLE; or
* Abstracting the init context, set row path, fetch row etc operations
so json ones can be plugged in later



> Currently the common part is not too big - just the Node related part - I am
> not sure about necessity of two patches.

The problem is that the common part is all mixed in with the
XMLTABLE-specific part, so it's not at all clear it can be common with
something else.

> I am agree, there is missing some
> TableExpBuilder, where can be better isolated the XML part.

Yeah, that's sort of what I'm getting at.


>> execEvalTableExpr seems to be defined twice, with a difference in
>> case. This is probably not going to fly:
>>
>>
>> +static Datum
>> +execEvalTableExpr(TableExprState *tstate,
>> +ExprContext *econtext,
>> +bool *isNull, ExprDoneCond *isDone)
>> +{
>>
>> +static Datum
>> +ExecEvalTableExpr(TableExprState *tstate,
>> +ExprContext *econtext,
>> +bool *isNull, ExprDoneCond *isDone)
>> +{
>>
>>
>> It looks like you've split the function into a "guts" and "wrapper"
>> part, with the error handling PG_TRY / PG_CATCH block in the wrapper.
>> That seems reasonable for readability, but the naming isn't.
>
>
> I invite any idea how these functions should be named.

Definitely not how they are ;) . They really can't differ in a single
character's case.

I'm not sure if PostgreSQL has any formal convention for this. Some
places use _impl e.g.  pg_read_barrier_impl() but that's in the
context of an interface-vs-implementation separation, which isn't the
case here.

Some places use _internal, like AlterObjectRename_internal(...), but
that's where there's an associated public/external part, which isn't
the case here.

Some places use _guts e.g. pg_logical_slot_get_changes_guts(...),
largely where 

Re: [HACKERS] improved DefElem list processing

2016-09-06 Thread Peter Eisentraut
On 9/6/16 7:23 PM, Tom Lane wrote:
> I'm curious where you are on that?  I find myself needing to whack around
> this processing in CREATE EXTENSION, but I don't want to create a merge
> problem for you if you're close to committing.

I have committed what I have for now.  Thanks.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Suggestions for first contribution?

2016-09-06 Thread Noah Misch
On Tue, Sep 06, 2016 at 01:32:10PM -0400, Christian Convey wrote:
> It sounds like the most useful thing I can do at the moment is perform
> code reviews.  I assumed I'd need more experience with the PG code
> base, but I keep on reading that newcomers' reviews are welcome.
> Unless someone has a better idea, I'll start with that.

That is a good choice.  Thanks for getting involved in this way.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: Exclude additional directories in pg_basebackup

2016-09-06 Thread Michael Paquier
On Wed, Sep 7, 2016 at 12:16 AM, David Steele  wrote:
> On 9/1/16 9:53 AM, Peter Eisentraut wrote:
>>
>> On 8/15/16 3:39 PM, David Steele wrote:
>>>
>>> That patch got me thinking about what else could be excluded and after
>>> some investigation I found the following: pg_notify, pg_serial,
>>> pg_snapshots, pg_subtrans.  These directories are all cleaned, zeroed,
>>> or rebuilt on server start.
>>>
>>> The attached patch adds these directories to the pg_basebackup
>>> exclusions and takes an array-based approach to excluding directories
>>> and files during backup.
>>
>>
>> We do support other backup methods besides using pg_basebackup.  So I
>> think we need to document the required or recommended handling of each
>> of these directories.  And then pg_basebackup should become a consumer
>> of that documentation.
>>
>> The current documentation on this is at
>>
>> ,
>> which covers pg_xlog and pg_replslot.  I think that documentation should
>> be expanded, maybe with a simple list that is easy to copy into an
>> exclude file, following by more detail on each directory.
>
>
> Attached is a new patch that adds sgml documentation.  I can expand on each
> directory individually if you think that's necessary, but thought it was
> better to lump them into a few categories.

+be ommitted from the backup as they will be initialized on postmaster
+startup. If the  is set and is
+under the database cluster directory then the contents of the directory
+specified by  can also
be ommitted.

s/ommitted/omitted/


+#define EXCLUDE_DIR_MAX 8
+#define EXCLUDE_DIR_STAT_TMP 0
+
+const char *excludeDirContents[EXCLUDE_DIR_MAX] =
+{
+/*
+ * Skip temporary statistics files. The first array position will be
+ * filled with the value of pgstat_stat_directory relative to PGDATA.
+ * PG_STAT_TMP_DIR must be skipped even when stats_temp_directory is set
+ * because PGSS_TEXT_FILE is always created there.
+ */
+NULL,
I find that ugly. I'd rather use an array with undefined size for the
fixed elements finishing by NULL, remove EXCLUDE_DIR_MAX and
EXCLUDE_DIR_STAT_TMP and use a small routine to do the work done on
_tarWriteHeader...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] COPY vs \copy HINT

2016-09-06 Thread Craig Ringer
On 7 September 2016 at 04:19, Christoph Berg  wrote:

> I like your new version, it's crisp and transports the right message.

OK, updated with Tom's tweaked version of Christoph's wording per
discussion. Thanks all.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 17bdc19e15f21ce84eee76dee9f4f86e868a8430 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Fri, 12 Aug 2016 15:42:12 +0800
Subject: [PATCH] Emit a HINT when COPY can't find a file

Users often get confused between COPY and \copy and try to use client-side
paths with COPY. The server of course cannot find the file.

Emit a HINT in the most common cases to help users out.

Craig Ringer, Tom Lane and Christoph Berg
---
 src/backend/commands/copy.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 5947e72..e7118df 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -1751,6 +1751,7 @@ BeginCopyTo(Relation rel,
 		{
 			mode_t		oumask; /* Pre-existing umask value */
 			struct stat st;
+			int			save_errno;
 
 			/*
 			 * Prevent write to relative path ... too easy to shoot oneself in
@@ -1759,16 +1760,21 @@ BeginCopyTo(Relation rel,
 			if (!is_absolute_path(filename))
 ereport(ERROR,
 		(errcode(ERRCODE_INVALID_NAME),
-	  errmsg("relative path not allowed for COPY to file")));
+		 errmsg("relative path not allowed for COPY to file")));
 
 			oumask = umask(S_IWGRP | S_IWOTH);
 			cstate->copy_file = AllocateFile(cstate->filename, PG_BINARY_W);
+			save_errno = errno;
 			umask(oumask);
+
 			if (cstate->copy_file == NULL)
 ereport(ERROR,
 		(errcode_for_file_access(),
 		 errmsg("could not open file \"%s\" for writing: %m",
-cstate->filename)));
+cstate->filename),
+		 ((save_errno == ENOENT || save_errno == EACCES) ?
+		  errhint("COPY TO instructs the PostgreSQL server process to write a file. "
+  "You may want a client-side facility such as psql's \\copy.") : 0)));
 
 			if (fstat(fileno(cstate->copy_file), ))
 ereport(ERROR,
@@ -2786,13 +2792,21 @@ BeginCopyFrom(Relation rel,
 		else
 		{
 			struct stat st;
+			int			save_errno;
 
 			cstate->copy_file = AllocateFile(cstate->filename, PG_BINARY_R);
+
+			/* in case something in ereport changes errno: */
+			save_errno = errno;
+
 			if (cstate->copy_file == NULL)
 ereport(ERROR,
 		(errcode_for_file_access(),
 		 errmsg("could not open file \"%s\" for reading: %m",
-cstate->filename)));
+cstate->filename),
+		 ((save_errno == ENOENT || save_errno == EACCES) ?
+		  errhint("COPY FROM instructs the PostgreSQL server process to read a file. "
+  "You may want a client-side facility such as psql's \\copy.") : 0)));
 
 			if (fstat(fileno(cstate->copy_file), ))
 ereport(ERROR,
-- 
2.5.5


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-09-06 Thread vinayak



On 2016/08/26 15:13, Ashutosh Bapat wrote:



On Fri, Aug 26, 2016 at 11:37 AM, Masahiko Sawada 
> wrote:


On Fri, Aug 26, 2016 at 3:03 PM, Ashutosh Bapat
> wrote:
>
>
> On Fri, Aug 26, 2016 at 11:22 AM, Masahiko Sawada
>
> wrote:
>>
>> On Fri, Aug 26, 2016 at 1:32 PM, Vinayak Pokale
>
>> wrote:
>> > Hi All,
>> >
>> > Ashutosh proposed the feature 2PC for FDW for achieving
atomic commits
>> > across multiple foreign servers.
>> > If a transaction make changes to more than two foreign
servers the
>> > current
>> > implementation in postgres_fdw doesn't make sure that either
all of them
>> > commit or all of them rollback their changes.
>> >
>> > We (Masahiko Sawada and me) reopen this thread and trying to
contribute
>> > in
>> > it.
>> >
>> > 2PC for FDW
>> > 
>> > The patch provides support for atomic commit for transactions
involving
>> > foreign servers. when the transaction makes changes to
foreign servers,
>> > either all the changes to all the foreign servers commit or
rollback.
>> >
>> > The new patch 2PC for FDW include the following things:
>> > 1. The patch 0001 introduces a generic feature. All kinds of
FDW that
>> > support 2PC such as oracle_fdw, mysql_fdw, postgres_fdw etc.
can involve
>> > in
>> > the transaction.
>> >
>> > Currently we can push some conditions down to shard nodes,
especially in
>> > 9.6
>> > the directly modify feature has
>> > been introduced. But such a transaction modifying data on
shard node is
>> > not
>> > executed surely.
>> > Using 0002 patch, that modify is executed with 2PC. It means
that we
>> > almost
>> > can provide sharding solution using
>> > multiple PostgreSQL server (one parent node and several
shared node).
>> >
>> > For multi master, we definitely need transaction manager but
transaction
>> > manager probably can use this 2PC for FDW feature to manage
distributed
>> > transaction.
>> >
>> > 2. 0002 patch makes postgres_fdw possible to use 2PC.
>> >
>> > 0002 patch makes postgres_fdw to use below APIs. These APIs
are generic
>> > features which can be used by all kinds of FDWs.
>> >
>> > a. Execute PREAPRE TRANSACTION and COMMIT/ABORT PREAPRED
instead of
>> > COMMIT/ABORT on foreign server which supports 2PC.
>> > b. Manage information of foreign prepared transactions
resolver
>> >
>> > Masahiko Sawada will post the patch.
>> >
>> >
>>
>
> Thanks Vinayak and Sawada-san for taking this forward and basing
your work
> on my patch.
>
>>
>> Still lot of work to do but attached latest patches.
>> These are based on the patch Ashutosh posted before, I revised
it and
>> divided into two patches.
>> Compare with original patch, patch of pg_fdw_xact_resolver and
>> documentation are lacked.
>
>
> I am not able to understand the last statement.

Sorry to confuse you.

> Do you mean to say that your patches do not have
pg_fdw_xact_resolver() and
> documentation that my patches had?

Yes.
I'm confirming them that your patches had.


Thanks for the clarification. I had added pg_fdw_xact_resolver() to 
resolve any transactions which can not be resolved immediately after 
they were prepared. There was a comment from Kevin (IIRC) that leaving 
transactions unresolved on the foreign server keeps the resources 
locked on those servers. That's not a very good situation. And nobody 
but the initiating server can resolve those. That functionality is 
important to make it a complete 2PC solution. So, please consider it 
to be included in your first set of patches.

The attached patch included pg_fdw_xact_resolver.

Regards,
Vinayak Pokale
NTT Open Source Software Center



0003-pg-fdw-xact-resolver.patch
Description: application/download

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM

2016-09-06 Thread Amit Langote
On 2016/09/07 3:12, Corey Huinker wrote:
> On Fri, Sep 2, 2016 at 5:07 AM, Amit Langote wrote:
>> I am not familiar with win32 stuff too, so I don't have much to say about
>> that.  Maybe someone else can chime in to help with that.
> 
> The regressions basically *can't* test this because we'd need a shell
> command we know works on any architecture.

OK.

>> @@ -97,8 +99,9 @@ typedef struct FileFdwPlanState
>>  typedef struct FileFdwExecutionState
>>  {
>>  char   *filename;   /* file to read */
>> -List   *options;/* merged COPY options, excluding
>> filename */
>> -CopyState   cstate; /* state of reading file */
>> +char   *program;/* program to read output from */
>> +List   *options;/* merged COPY options, excluding
>> filename and program */
>> +CopyState   cstate; /* state of reading file or program */
>>
>> Have you considered a Boolean flag is_program instead of char **program
>> similar to how copy.c does it?  (See a related comment further below)
> 
> Considered it yes, but decided against it when I started to write my
> version. When Adam delivered his version of the patch, I noticed he had
> made the same design decision. Only one of them will be initialized, and
> the boolean will byte align to 4 bytes, so it's the same storage allocated
> either way.
> 
> Either we're fine with two variables, or we think file_name is poorly
> named. I have only a slight preference for the two variables, and defer to
> the group for a preference.

OK, let's defer it to the committer.

>> - * This is because the filename is one of those options, and we don't
>> want
>> - * non-superusers to be able to determine which file gets read.
>> + * This is because the filename or program string are two of those
>> + * options, and we don't want non-superusers to be able to determine
>> which
>> + * file gets read or what command is run.
>>
>> I'm no native English speaker, but I wonder if this should read: filename
>> or program string *is* one of those options ... OR filename *and* program
>> are two of those options ... ?  Also, "are two of those options" sounds a
>> bit odd to me because I don't see that used as often as "one of those
>> whatever".  I guess that's quite a bit of nitpicking about a C comment, ;)
> 
> Given that C comments constitute a large portion of our documentation, I
> fully support making them as clear as possible.
> 
> I don't remember if this is Adam's comment or mine. Adam - if you're out
> there, chime in.
> 
> The original paragraph was:
> 
> Changing table-level options requires superuser privileges, for security
> reasons: only a superuser should be able to determine which file is read.
> In principle non-superusers could be allowed to change the other options,
> but that's not supported at present.
> 
> 
> How does this paragraph sound?:
> 
> Changing table-level options requires superuser privileges, for security
> reasons: only a superuser should be able to determine which file is read,
> or which program is run. In principle non-superusers could be allowed to
> change the other options, but that's not supported at present.

Hmm, just a little modification would make it better for me:

... for security reasons.  For example, only superuser should be able to
determine which file read or which program is run.

Although that could be just me.

>> @@ -257,9 +264,26 @@ file_fdw_validator(PG_FUNCTION_ARGS)
>>  ereport(ERROR,
>>  (errcode(ERRCODE_SYNTAX_ERROR),
>>   errmsg("conflicting or redundant options")));
>> +if (program)
>> +ereport(ERROR,
>> +(errcode(ERRCODE_SYNTAX_ERROR),
>> + errmsg("conflicting or redundant options")));
>>  filename = defGetString(def);
>>  }
>>
>> +else if (strcmp(def->defname, "program") == 0)
>> +{
>> +if (filename)
>> +ereport(ERROR,
>> +(errcode(ERRCODE_SYNTAX_ERROR),
>> + errmsg("conflicting or redundant options")));
>> +if (program)
>> +ereport(ERROR,
>> +(errcode(ERRCODE_SYNTAX_ERROR),
>> + errmsg("conflicting or redundant options")));
>> +program = defGetString(def);
>> +}
>>
>> Why does it check for filename here?  Also the other way around above.
> 
> We don't want them to specify both program and filename, nor do we allow 2
> programs, or 2 filenames.

Ah, I forgot about the mutually exclusive options part.

>> @@ -632,11 +671,18 @@ fileBeginForeignScan(ForeignScanState *node, int
>> eflags)
>>   * Create CopyState from FDW options.  We always acquire all columns,
>> so
>>   * as to match the expected ScanTupleSlot signature.
>>   */
>> -cstate = BeginCopyFrom(node->ss.ss_currentRelation,
>> 

Re: [HACKERS] ICU integration

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 10:40 AM, Doug Doole  wrote:
> - Suppose in ICU X.X, AA = Å but in ICU Y.Y AA != Å. Further suppose there
> was an RI constraint where the primary key used AA and the foreign key used
> Å. If ICU was updated, the RI constraint between the rows would break,
> leaving an orphaned foreign key.

This isn't a problem for Postgres, or at least wouldn't be right now,
because we don't have case insensitive collations. So, we use a
strcmp()/memcmp() tie-breaker when strcoll() indicates equality, while
also making the general notion of text equality actually mean binary
equality. In short, we are aware that cases like this exist. IIRC
Unicode Technical Standard #10 independently recommends that this
tie-breaker strategy is one way of dealing with problems like this, in
a pinch, though I think we came up with the idea independently of that
recommendation. This was in response to a bug report over 10 years
ago.

I would like to get case insensitive collations some day, and was
really hoping that ICU would help. That being said, the need for a
strcmp() tie-breaker makes that hard. Oh well.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] add option to pg_dumpall to exclude tables from the dump

2016-09-06 Thread Gerdan Rezende dos Santos
On Fri, Aug 19, 2016 at 12:38 PM, Jim Nasby 
wrote:

> On 8/18/16 5:01 PM, Tom Lane wrote:
>
>> I agree, but I think mandating a database name (which I suppose could be
>>> > *) with the specifiers would solve that issue.
>>>
>> Hmm, something like "-T dbname1:pattern1 -T dbname2:pattern2" ?
>>
>
> Bingo. Hopefully there'd be some way to consolidate the code between the
> two as well...
> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
> 855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


After review, I realized that there is a call to the function:
doShellQuoting (pgdumpopts, OPTARG), which no longer seems to exist ...
After understand the code, I saw that the call is appendShellString (
pgdumpopts, OPTARG).

Follow the patches already with the necessary corrections.

Regards
*Gerdan Rezende dos Santos *
*Po*stgreSQL & EnterpriseDB Specialist, Support, Training & Services
+55 (61) 9645-1525
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
index a7dc41c..979a964 100644
--- a/src/bin/pg_dump/pg_dumpall.c
+++ b/src/bin/pg_dump/pg_dumpall.c
@@ -111,6 +111,7 @@ main(int argc, char *argv[])
 		{"password", no_argument, NULL, 'W'},
 		{"no-privileges", no_argument, NULL, 'x'},
 		{"no-acl", no_argument, NULL, 'x'},
+		{"exclude-table", required_argument, NULL, 'T'},
 
 		/*
 		 * the following options don't have an equivalent short option letter
@@ -195,7 +196,7 @@ main(int argc, char *argv[])
 
 	pgdumpopts = createPQExpBuffer();
 
-	while ((c = getopt_long(argc, argv, "acd:f:gh:l:oOp:rsS:tU:vwWx", long_options, )) != -1)
+	while ((c = getopt_long(argc, argv, "acd:f:gh:l:oOp:rsS:tU:vwWxT:", long_options, )) != -1)
 	{
 		switch (c)
 		{
@@ -283,6 +284,11 @@ main(int argc, char *argv[])
 appendPQExpBufferStr(pgdumpopts, " -x");
 break;
 
+			case 'T':
+appendPQExpBufferStr(pgdumpopts, " -T");
+appendShellString(pgdumpopts, optarg);
+break;
+
 			case 0:
 break;
 
@@ -564,6 +570,7 @@ help(void)
 	printf(_("  -s, --schema-onlydump only the schema, no data\n"));
 	printf(_("  -S, --superuser=NAME superuser user name to use in the dump\n"));
 	printf(_("  -t, --tablespaces-only   dump only tablespaces, no databases or roles\n"));
+	printf(_("  -T, --exclude-table  exclude some tables\n"));
 	printf(_("  -x, --no-privileges  do not dump privileges (grant/revoke)\n"));
 	printf(_("  --binary-upgrade for use by upgrade utilities only\n"));
 	printf(_("  --column-inserts dump data as INSERT commands with column names\n"));
-- 
diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml
index 6c34c25..24408b9 100644
--- a/doc/src/sgml/ref/pg_dumpall.sgml
+++ b/doc/src/sgml/ref/pg_dumpall.sgml
@@ -198,6 +198,20 @@ PostgreSQL documentation
  
 
  
+  -T table
+  --exclude-table=table
+  
+   
+Do not dump any tables matching the table pattern.  The pattern is
+interpreted according to the same rules as for -t.
+-T can be given more than once to exclude tables
+matching any of several patterns.
+   
+  
+ 
+
+ 
   -v
   --verbose
   

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ICU integration

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 10:40 AM, Doug Doole  wrote:
>> The ICU ABI (not API) is also versioned.  The way that this is done is
>> that all functions are actually macros to a versioned symbol.  So
>> ucol_open() is actually a macro that expands to, say, ucol_open_57() in
>> ICU version 57.  (They also got rid of a dot in their versions a while
>> ago.)  It's basically hand-crafted symbol versioning.  That way, you can
>> link with multiple versions of ICU at the same time.  However, the
>> purpose of that, as I understand it, is so that plugins can have a
>> different version of ICU loaded than the main process or another plugin.
>  > In terms of postgres using the right version of ICU, it doesn't buy
>> anything beyond what the soname mechanism does.
>
> You can access the versioned API as well, it's just not documented. (The ICU
> team does support this - we worked very closely with them when doing all
> this.) We exploited the versioned API when we learned that there is no
> guarantee of backwards compatibility in collations. You can't just change a
> collation under a user (at least that was our opinion) since it can cause
> all sorts of problems. Refreshing a collation (especially on the fly) is a
> lot more work than we were prepared to take on. So we exploited the
> versioned APIs.

I originally got some of this information from the ICU Doxygen site
for the C API, which isn't great documentation, but also isn't bad. I
admit that there are certainly gaps in my understanding of how to
bridge our requirements with versioning to what ICU can give us.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-06 Thread Peter Eisentraut
Review of 0002-Add-SUBSCRIPTION-catalog-and-DDL.patch:

(As you had already mentioned, some of the review items in 0001 apply
analogously here.)


Changes needed to compile:

--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -218,7 +218,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt)
CatalogUpdateIndexes(rel, tup);
heap_freetuple(tup);

-   ObjectAddressSet(myself, SubscriptionRelationId, suboid);
+   ObjectAddressSet(myself, SubscriptionRelationId, subid);

heap_close(rel, RowExclusiveLock);

This is fixed later in patch 0005.

--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -6140,8 +6140,7 @@ pg_subscription
Columns
   subpublications
  text[]
 
-  Array of subscribed publication names. For more on
publications
-   see .
+  Array of subscribed publication names.

  
   

I don't see that id defined in any later patch.


Minor problems:

Probably unintentional change in pg_dump.h:

- * The PublicationInfo struct is used to represent publications.
+ * The PublicationInfo struct is used to represent publication.

pg_subscription column "dbid" rename to "subdbid".

I think subpublications ought to be of type name[], not text[].

It says, a subscription can only be dropped by its owner or a
superuser.  But subscriptions don't have owners.  Maybe they should.

On the CREATE SUBSCRIPTION ref page,

| INITIALLY ( ENABLED | DISABLED )

should use {} instead.

We might want to add ALTER commands to rename subscriptions and
publications.

Similar concerns as before about ALTER syntax, e.g., does ALTER
SUBSCRIPTION ... PUBLICATION add to or replace the publication set?

For that matter, why is there no way to add?

Document why publicationListToArray() creates its own memory context.

I think we should allow creating subscriptions initally without
publications.  This could be useful for example to test connections,
or create slots before later on adding publications.  Seeing that
there is support for changing the publications later, this shouldn't
be a problem.

The synopsis of CREATE SUBSCRIPTION indicates that options are
optional, but it actually requires at least one option.

At the end of CreateSubscription(), the CommandCounterIncrement()
doesn't appear to be necessary (yet, see patch 0005?).

Maybe check for duplicates in the publications list.


Larger conceptual issues:

I haven't read the rest of the code yet to understand why
pg_subscription needs to be a shared catalog, but be that as it may, I
would still make it so that subscription names appear local to the
database.  We already have the database OID in the pg_subscription
catalog, so I would make the key (subname, subdatid).  DDL commands
would only operate on subscriptions in the current database (or you
could use "datname"."subname" syntax), and \drs would only show
subscriptions in the current database.  That way the shared catalog is
an implementation detail that can be changed in the future.  I think
it would be very helpful for users if publications and subscriptions
appear to work in a parallel way.  If I have two databases that I want
to replicate between two servers, I might want to have a publication
"mypub" in each database and a subscription "mysub" in each database.
If I learn that the subscriptions can't be named that way, then I have
to go back to rename to the publications, and it'll all be a bit
frustrating.

Some thoughts on pg_dump and such:

Even an INITIALLY DISABLED subscription needs network access to create
the replication slot.  So restoring a dump when the master is not
available will have some difficulties.  And restoring master and slave
at the same time (say disaster recovery) will not necessarily work
well either.  Also, the general idea of doing network access during a
backup restore without obvious prior warning sounds a bit unsafe.

I imagine maybe having three states for subscriptions: DISABLED,
PREPARED, ENABLED (to use existing key words).  DISABLED just exists
in the catalog, PREPARED has the slots set up, ENABLED starts
replicating.  So you can restore a dump with all slots disabled.  And
then it might be good to have a command to "prepare" or "enable" all
subscriptions at once.

That command would also help if you restore a dump not in a
transaction but you want to enable all subscriptions in the same
transaction.

I'd also prefer having subscriptions dumped by default, just to keep
it so that pg_dump by default backs up everything.

Finally, having disabled subscriptions without network access would
also allow writing some DDL command tests.

As I had mentioned privately before, I would perhaps have CREATE
SUBSCRIPTION use the foreign server infrastructure for storing
connection information.

We'll have to keep thinking about ways to handle abandonded
replication slots.  I imagine that people will want to create
subscriber instances in fully automated ways. 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 5:50 PM, Claudio Freire  wrote:
> However, I agree that it's not worth the risk conflating the two
> optimizations. That one can be done later as a separate patch.

I'm rather fond of the assertions about tape number that exist within
root_displace currently. But, yeah, maybe.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Claudio Freire
On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan  wrote:
> On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire  wrote:
>> I noticed, but here n = state->memtupcount
>>
>> +   Assert(memtuples[0].tupindex == newtup->tupindex);
>> +
>> +   CHECK_FOR_INTERRUPTS();
>> +
>> +   n = state->memtupcount; /* n is heap's size,
>> including old root */
>> +   imin = 0;   /*
>> start with caller's "hole" in root */
>> +   i = imin;
>
> I'm fine with using "n" in the later assertion you mentioned, if
> that's clearer to you. memtupcount is broken out as "n" simply because
> that's less verbose, in a place where that makes things far clearer.
>
>> In fact, the assert on the patch would allow writing memtuples outside
>> the heap, as in calling tuplesort_heap_root_displace if
>> memtupcount==0, but I don't think that should be legal (memtuples[0]
>> == memtuples[imin] would be outside the heap).
>
> You have to have a valid heap (i.e. there must be at least one
> element) to call tuplesort_heap_root_displace(), and it doesn't
> directly compact the heap, so it must remain valid on return. The
> assertion exists to make sure that everything is okay with a
> one-element heap, a case which is quite possible.

More than using "n" or "memtupcount" what I'm saying is to assert that
memtuples[imin] is inside the heap, which would catch the same errors
the original assert would, and more.

Assert(imin < state->memtupcount)

If you prefer.

The original asserts allows any value of imin for memtupcount>1, and
that's my main concern. It shouldn't.


On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan  wrote:
>> Sure, that's a weird enough case (that assert up there already reads
>> memtuples[0] which would be equally illegal if memtupcount==0), but it
>> goes on to show that the assert expression just seems odd for its
>> intent.
>>
>> BTW, I know it's not the scope of the patch, but shouldn't
>> root_displace be usable on the TSS_BOUNDED phase?
>
> I don't think it should be, no. With a top-n heap sort, the
> expectation is that after a little while, we can immediately determine
> that most tuples do not belong in the heap (this will require more
> than one comparison per tuple when the tuple that may be entered into
> the heap will in fact go in the heap, which should be fairly rare
> after a time). That's why that general strategy can be so much faster,
> of course.

I wasn't proposing getting rid of that optimization, but just
replacing the siftup+insert step with root_displace...

> Note that that heap is "reversed" -- the sort order is inverted, so
> that we can use a minheap. The top of the heap is the most marginal
> tuple in the top-n heap so far, and so is the next to be removed from
> consideration entirely (not the next to be returned to caller, when
> merging).

...but I didn't pause to consider that point.

It still looks like a valid optimization, instead rearranging the heap
twice (siftup + insert), do it once (replace + relocate).

However, I agree that it's not worth the risk conflating the two
optimizations. That one can be done later as a separate patch.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire  wrote:
> I noticed, but here n = state->memtupcount
>
> +   Assert(memtuples[0].tupindex == newtup->tupindex);
> +
> +   CHECK_FOR_INTERRUPTS();
> +
> +   n = state->memtupcount; /* n is heap's size,
> including old root */
> +   imin = 0;   /*
> start with caller's "hole" in root */
> +   i = imin;

I'm fine with using "n" in the later assertion you mentioned, if
that's clearer to you. memtupcount is broken out as "n" simply because
that's less verbose, in a place where that makes things far clearer.

> In fact, the assert on the patch would allow writing memtuples outside
> the heap, as in calling tuplesort_heap_root_displace if
> memtupcount==0, but I don't think that should be legal (memtuples[0]
> == memtuples[imin] would be outside the heap).

You have to have a valid heap (i.e. there must be at least one
element) to call tuplesort_heap_root_displace(), and it doesn't
directly compact the heap, so it must remain valid on return. The
assertion exists to make sure that everything is okay with a
one-element heap, a case which is quite possible. If you want to see a
merge involving one input tape, apply the entire parallel CREATE INDEX
patch set, set "force_parallal_mode = regress", and note that the
leader merge merges only 1 input tape, making the heap only ever
contain one element. In general, most use of the heap for k-way
merging will eventually end up as a one element heap, at the very end.

Maybe that assertion you mention is overkill, but I like to err on the
side of overkill with assertions. It doesn't seem that important,
though.

> Sure, that's a weird enough case (that assert up there already reads
> memtuples[0] which would be equally illegal if memtupcount==0), but it
> goes on to show that the assert expression just seems odd for its
> intent.
>
> BTW, I know it's not the scope of the patch, but shouldn't
> root_displace be usable on the TSS_BOUNDED phase?

I don't think it should be, no. With a top-n heap sort, the
expectation is that after a little while, we can immediately determine
that most tuples do not belong in the heap (this will require more
than one comparison per tuple when the tuple that may be entered into
the heap will in fact go in the heap, which should be fairly rare
after a time). That's why that general strategy can be so much faster,
of course.

Note that that heap is "reversed" -- the sort order is inverted, so
that we can use a minheap. The top of the heap is the most marginal
tuple in the top-n heap so far, and so is the next to be removed from
consideration entirely (not the next to be returned to caller, when
merging).

Anyway, I just don't think that this is important enough to change --
it couldn't possibly be worth much of any risk. I can see the appeal
of consistency, but I also see the appeal of sticking to how things
work there: continually and explicitly inserting into and compacting
the heap seems like a good enough way of framing what a top-n heap
does, since there are no groupings of tuples (tapes) involved there.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-09-06 Thread Michael Paquier
>> On 06 Sep 2016, at 12:03, Michael Paquier  wrote:
>>
>> On Tue, Sep 6, 2016 at 5:58 PM, Stas Kelvich  
>> wrote:
>>> Oh, I was preparing new version of patch, after fresh look on it. Probably, 
>>> I should
>>> said that in this topic. I’ve found a bug in sub transaction handling and 
>>> now working
>>> on fix.
>>
>> What's the problem actually?
>
> Handling of xids_p array in PrescanPreparedTransactions() is wrong for 
> prepared tx's in memory.
> Also I want to double-check and add comments to RecoveryInProgress() checks 
> in FinishGXact.
>
> I’ll post reworked patch in several days.

Could you use as a base the version I just sent yesterday then? I
noticed many mistakes in the comments, for example many s/it's/its/
and did a couple of adjustments around the code, the goto next_file
was particularly ugly. That will be that much work not do to again
later.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Claudio Freire
On Tue, Sep 6, 2016 at 8:28 PM, Peter Geoghegan  wrote:
> On Tue, Sep 6, 2016 at 12:57 PM, Claudio Freire  
> wrote:
>> Patch lacks any new tests, but the changed code paths seem covered
>> sufficiently by existing tests. A little bit of fuzzing on the patch
>> itself, like reverting some key changes, or flipping some key
>> comparisons, induces test failures as it should, mostly in cluster.
>>
>> The logic in tuplesort_heap_root_displace seems sound, except:
>>
>> +*/
>> +   memtuples[i] = memtuples[imin];
>> +   i = imin;
>> +   }
>> +
>> +   Assert(state->memtupcount > 1 || imin == 0);
>> +   memtuples[imin] = *newtup;
>> +}
>>
>> Why that assert? Wouldn't it make more sense to Assert(imin < n) ?
>
> There might only be one or two elements in the heap. Note that the
> heap size is indicated by state->memtupcount at this point in the
> sort, which is a little confusing (that differs from how memtupcount
> is used elsewhere, where we don't partition memtuples into a heap
> portion and a preread tuples portion, as we do here).

I noticed, but here n = state->memtupcount

+   Assert(memtuples[0].tupindex == newtup->tupindex);
+
+   CHECK_FOR_INTERRUPTS();
+
+   n = state->memtupcount; /* n is heap's size,
including old root */
+   imin = 0;   /*
start with caller's "hole" in root */
+   i = imin;

In fact, the assert on the patch would allow writing memtuples outside
the heap, as in calling tuplesort_heap_root_displace if
memtupcount==0, but I don't think that should be legal (memtuples[0]
== memtuples[imin] would be outside the heap).

Sure, that's a weird enough case (that assert up there already reads
memtuples[0] which would be equally illegal if memtupcount==0), but it
goes on to show that the assert expression just seems odd for its
intent.

BTW, I know it's not the scope of the patch, but shouldn't
root_displace be usable on the TSS_BOUNDED phase?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM

2016-09-06 Thread Michael Paquier
On Wed, Sep 7, 2016 at 7:53 AM, Craig Ringer
 wrote:
> On 7 Sep. 2016 02:14, "Corey Huinker"  wrote:
>>
>
>> Having regression tests for this is extremely problematic, because the
>> program invoked would need to be an invokable command on any architecture
>> supported by postgres. I'm pretty sure no such command exists.
>
> Your best bet will be using the TAP framework. There you can use Perl logic.
>
> I'm not sure where to put such a test though. It doesn't really make sense
> in src/test/recovery/ .

There is nothing preventing the addition of a TAP test where there are
normal regression tests, so if you want a test for file_fdw you should
add it there, then change its Makefile to have the following target
rules:
check: prove-check

prove-check:
$(prove_check)
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 12:57 PM, Claudio Freire  wrote:
> Patch lacks any new tests, but the changed code paths seem covered
> sufficiently by existing tests. A little bit of fuzzing on the patch
> itself, like reverting some key changes, or flipping some key
> comparisons, induces test failures as it should, mostly in cluster.
>
> The logic in tuplesort_heap_root_displace seems sound, except:
>
> +*/
> +   memtuples[i] = memtuples[imin];
> +   i = imin;
> +   }
> +
> +   Assert(state->memtupcount > 1 || imin == 0);
> +   memtuples[imin] = *newtup;
> +}
>
> Why that assert? Wouldn't it make more sense to Assert(imin < n) ?

There might only be one or two elements in the heap. Note that the
heap size is indicated by state->memtupcount at this point in the
sort, which is a little confusing (that differs from how memtupcount
is used elsewhere, where we don't partition memtuples into a heap
portion and a preread tuples portion, as we do here).

> In the meanwhile, I'll go and do some perf testing.
>
> Assuming the speedup is realized during testing, LGTM.

Thanks. I suggest spending at least as much time on unsympathetic
cases (e.g., only 2 or 3 tapes must be merged). At the same time, I
suggest focusing on a type that has relatively expensive comparisons,
such as collated text, to make differences clearer.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] improved DefElem list processing

2016-09-06 Thread Tom Lane
Peter Eisentraut  writes:
> Here are two WIP patches to improve the DefElem list processing that is
> used by many utility commands.

> One factors out the duplicate checks, which are currently taking up a
> lot of space with duplicate code.  I haven't applied this everywhere
> yet, but the patch shows how much boring code can be saved.

I'm curious where you are on that?  I find myself needing to whack around
this processing in CREATE EXTENSION, but I don't want to create a merge
problem for you if you're close to committing.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM

2016-09-06 Thread Craig Ringer
On 7 Sep. 2016 02:14, "Corey Huinker"  wrote:
>

> Having regression tests for this is extremely problematic, because the
program invoked would need to be an invokable command on any architecture
supported by postgres. I'm pretty sure no such command exists.

Your best bet will be using the TAP framework. There you can use Perl logic.

I'm not sure where to put such a test though. It doesn't really make sense
in src/test/recovery/ .


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 2:46 PM, Peter Geoghegan  wrote:
> Feel free to make a counter-proposal for a cap. I'm not attached to
> 500. I'm mostly worried about blatant waste with very large workMem
> sizings. Tens of thousands of tapes is just crazy. The amount of data
> that you need to have as input is very large when workMem is big
> enough for this new cap to be enforced.

If tuplesort callers passed a hint about the number of tuples that
would ultimately be sorted, and (for the sake of argument) it was
magically 100% accurate, then theoretically we could just allocate the
right number of tapes up-front. That discussion is a big can of worms,
though. There are of course obvious disadvantages that come with a
localized cost model, even if you're prepared to add some "slop" to
the allocation size or whatever.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kqueue

2016-09-06 Thread Marko Tiikkaja

On 2016-06-03 01:45, Thomas Munro wrote:

On Fri, Jun 3, 2016 at 4:02 AM, Alvaro Herrera  wrote:

Tom Lane wrote:

Andres Freund  writes:

pg_strtoi?



I think that's what Thomas did upthread. Are you taking this one then?


I'd go with just "strtoint".  We have "strtoint64" elsewhere.


For closure of this subthread: this rename was committed by Tom as
0ab3595e5bb5.


Thanks.  And here is a new version of the kqueue patch.  The previous
version doesn't apply on top of recent commit
a3b30763cc8686f5b4cd121ef0bf510c1533ac22, which sprinkled some
MAXALIGN macros nearby.  I've now done the same thing with the kevent
struct because it's cheap, uniform with the other cases and could
matter on some platforms for the same reason.


I've tested and reviewed this, and it looks good to me, other than this 
part:


+   /*
+* kevent guarantees that the change list has been processed in the 
EINTR

+* case.  Here we are only applying a change list so EINTR counts as
+* success.
+*/

this doesn't seem to be guaranteed on old versions of FreeBSD or any 
other BSD flavors, so I don't think it's a good idea to bake the 
assumption into this code.  Or what do you think?



.m


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 12:34 AM, Heikki Linnakangas  wrote:
> I'm reviewing patches 1-3 in this series, i.e. those patches that are not
> directly related to parallelism, but are independent improvements to
> merging.

That's fantastic! Thanks!

I'm really glad you're picking those ones up. I feel that I'm far too
dependent on Robert's review for this stuff. That shouldn't be taken
as a statement against Robert -- it's intended as quite the opposite
-- but it's just personally difficult to rely on exactly one other
person for something that I've put so much work into. Robert has been
involved with 100% of all sorting patches I've written, generally with
far less input from anyone else, and at this point, that's really
rather a lot of complex patches.

> Let's begin with patch 1:
>
> On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
>>
>> Cap the number of tapes used by external sorts

> Yeah, wasting 8% of the memory budget on this seems like a bad idea. If I
> understand correctly, that makes the runs shorter than necessary, leading to
> more runs.

Right. Quite simply, whatever you could have used the workMem for
prior to the merge step, now you can't. It's not so bad during the
merge step of a final on-the-fly merge (or, with the 0002-* patch, any
final merge), since you can get a "refund" of unused (though logically
allocated by USEMEM()) tapes to grow memtuples with (other overhead
forms the majority of the refund, though). That still isn't much
consolation to the user, because run generation is typically much more
expensive (we really just refund unused tapes because it's easy).

>> A new quasi-arbitrary cap of 501 is applied on the number of tapes that
>> tuplesort will ever use (i.e.  merge order is capped at 500 inclusive).
>> This is a conservative estimate of the number of runs at which doing all
>> merging on-the-fly no longer allows greater overlapping of I/O and
>> computation.
>
>
> Hmm. Surely there are cases, so that with > 501 tapes you could do it with
> one merge pass, but now you need two? And that would hurt performance, no?

In theory, yes, that could be true, and not just for my proposed new
cap of 500 for merge order (501 tapes), but for any such cap. I
noticed that the Greenplum tuplesort.c uses a max of 250, so I guess I
just thought that to double that. Way back in 2006, Tom and Simon
talked about a cap too on several occasions, but I think that that was
in the thousands then.

Hundreds of runs are typically quite rare. It isn't that painful to do
a second pass, because the merge process may be more CPU cache
efficient as a result, which tends to be the dominant cost these days
(over and above the extra I/O that an extra pass requires).

This seems like a very familiar situation to me: I pick a
quasi-arbitrary limit or cap for something, and it's not clear that
it's optimal. Everyone more or less recognizes the need for such a
cap, but is uncomfortable about the exact figure chosen, not because
it's objectively bad, but because it's clearly something pulled from
the air, to some degree. It may not make you feel much better about
it, but I should point out that I've read a paper that claims "Modern
servers of the day have hundreds of GB operating memory and tens of TB
storage capacity. Hence, if the sorted data fit the persistent
storage, the first phase will generate hundreds of runs at most." [1].

Feel free to make a counter-proposal for a cap. I'm not attached to
500. I'm mostly worried about blatant waste with very large workMem
sizings. Tens of thousands of tapes is just crazy. The amount of data
that you need to have as input is very large when workMem is big
enough for this new cap to be enforced.

> Why do we reserve the buffer space for all the tapes right at the beginning?
> Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) callin
> inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every time we
> start a new run, until we reach maxTapes?

No, because then you have no way to clamp back memory, which is now
almost all used (we hold off from making LACKMEM() continually true,
if at all possible, which is almost always the case). You can't really
continually shrink memtuples to make space for new tapes, which is
what it would take.

[1] http://ceur-ws.org/Vol-1343/paper8.pdf
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: function xmltable

2016-09-06 Thread Pavel Stehule
> libxml2 and our XPATH function doesn't support default namespace (
> http://plasmasturm.org/log/259/ ). This is pretty useful feature - so I
> implemented. This is the mayor issue of libxml2 library. Another difference
> between XPATH function and XMLTABLE function is using two phase searching
> and implicit prefix "./" and suffix ("/text()") in XMLTABLE. XMLTABLE using
> two XPATH expressions - for row data cutting and next for column data
> cutting (from row data). The our XPATH functions is pretty simple mapped to
> libxml2 XPATH API. But it is not possible with XMLTABLE function - due
> design of this function in standard (it is more user friendly and doesn't
> require exactly correct xpath expressions).
>

libxm2 doesn't support xpath 2.0  where default namespace was introduced.


Re: [HACKERS] [PATCH] COPY vs \copy HINT

2016-09-06 Thread Christoph Berg
Re: Tom Lane 2016-09-06 <17637.1473192...@sss.pgh.pa.us>
> Christoph's idea isn't bad.  We could tweak it to:
> 
>   COPY TO instructs the PostgreSQL server process to write a file.
> 
>   COPY FROM instructs the PostgreSQL server process to read a file.
> 
> This seems to cover both the wrong-machine and wrong-process-identity
> cases, and as a bonus it might help if you've had a brain fade about
> which COPY direction is write vs. read.
> 
> (I think we're all in agreement that the second sentence of the hint
> is fine, so leaving that out of it.)
> 
> Comments?

I like your new version, it's crisp and transports the right message.

Christoph


signature.asc
Description: PGP signature


Re: [HACKERS] [PATCH] Alter or rename enum value

2016-09-06 Thread Tom Lane
Andrew Dunstan  writes:
> Are we also going to have an exists test for the original thing being 
> renamed? Exists tests on renames do strike me as somewhat cumbersome, to 
> say the least.

I'm less opposed to that part, because it's at least got *some* precedent
in existing RENAME features.  But the fact that RENAME COLUMN hasn't got
it still makes me wonder why this is the first place that's getting it
("it" being an EXISTS test that applies to a sub-object rather than the
whole object being ALTER'd).

Bottom line here is that I'd rather commit ALTER TYPE RENAME VALUE with
no EXISTS features and then see it accrete those features together with
other types of RENAME, when and if there's a will to make that happen.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: function xmltable

2016-09-06 Thread Pavel Stehule
Hi

2016-09-06 6:54 GMT+02:00 Craig Ringer :

> On 4 September 2016 at 16:06, Pavel Stehule 
> wrote:
> > Hi
> >
> > minor update - using DefElem instead own private parser type
>
> I'm really glad that you're doing this and I'll take a look at it for this
> CF.
>
> It's quite a big patch so I expect this will take a few rounds of
> review and updating.
>

Thank you for review


>
>
> Patch applies cleanly and builds cleanly on master both with and
> without --with-xml .
>
> Overall, I think this needs to be revised with appropriate comments.
> Whitespace/formatting needs fixing since it's all over the place.
> Documentation is insufficient (per notes below).
>

I am not able to write documentation in English language :( - This function
is pretty complex - so I hope so anybody with better language skills can
help with this. It respects standard and it respects little bit different
Oracle's behave too (different order of DEFAULT and PATH parts).


>
> Re identifier naming, some of this code uses XmlTable naming patterns,
> some uses TableExpr prefixes. Is that intended to indicate a bounary
> between things re-usable for other structured data ingesting
> functions? Do you expect a "JSONEXPR" or similar in future? That's
> alluded to by
>

This structure should be reused by JSON_TABLE function. Now, it is little
bit strange, because there is only XMLTABLE implementation - and I have to
choose between a) using two different names now, b) renaming some part in
future.

And although XMLTABLE and JSON_TABLE functions are pretty similar - share
90% of data (input value, path, columns definitions), these functions has
different syntax - so only middle level code should be shared.


>
> +/*--
> + * TableExpr - used for XMLTABLE function
> + *
> + * This can be used for json_table, jsonb_table functions in future
> + *--
> + */
> +typedef struct TableExpr
> +{
> ...
>
> If so, should this really be two patches, one to add the table
> expression infrastructure and another to add XMLTABLE that uses it?
> Also, why in that case does so much of the TableExpr code call
> directly into XmlTable code? It doesn't look very generic.
>

Currently the common part is not too big - just the Node related part - I
am not sure about necessity of two patches. I am agree, there is missing
some TableExpBuilder, where can be better isolated the XML part.


>
> Overall I find identifier naming to be a bit inconsisent and think
> it's necessary to make it clear that all the "TableExpr" stuff is for
> XMLTABLE specifically, if that's the case, or make the delineation
> clearer if not.
>
> I'd also like to see tests that exercise the ruleutils get_rule_expr
> parts of the code for the various XMLTABLE variants.
>
> Similarly, since this seems to add a new xpath parser, that needs
> comprehensive tests. Maybe re-purpose an existing xpath test data set?
>
>
sure


>
>
>
> More detailed comments:
> 
>
> Docs comments:
>
>   The xmltable produces [a] table based on
> [the] passed XML value.
>
> The docs are pretty minimal and don't explain the various clauses of
> XMLTABLE. What is "BY REF" ? Is PATH an xpath expression? If so, is
> there a good cross reference link available? The PASSING clause? etc.
>
> How does XMLTABLE decide what to iterate over, and how to iterate over it?
>
> Presumably the FOR ORDINALITY clause makes a column emit a numeric counter.
>
> What standard, if any, does this conform to? Does it resemble
> implementations elsewhere? What limitations or unsupported features
> does it have relative to those standards?
>
>
>
> execEvalTableExpr seems to be defined twice, with a difference in
> case. This is probably not going to fly:
>
>
> +static Datum
> +execEvalTableExpr(TableExprState *tstate,
> +ExprContext *econtext,
> +bool *isNull, ExprDoneCond *isDone)
> +{
>
> +static Datum
> +ExecEvalTableExpr(TableExprState *tstate,
> +ExprContext *econtext,
> +bool *isNull, ExprDoneCond *isDone)
> +{
>
>
> It looks like you've split the function into a "guts" and "wrapper"
> part, with the error handling PG_TRY / PG_CATCH block in the wrapper.
> That seems reasonable for readability, but the naming isn't.
>

I invite any idea how these functions should be named.


>
> A comment is needed to explain what ExecEvalTableExpr is / does. If
> it's XMLTABLE specific (which it looks like based on the code), its
> name should reflect that. This pattern is repeated elsewhere; e.g.
> TableExprState is really the state for an XMLTABLE expression. But
> PostgreSQL actually has TABLE statements, and in future we might want
> to support table-expressions, so I don't think this naming is
> appropriate. This is made worse by the lack of comments on things like
> the definition of TableExprState. Please use something that makes it
> clear it's for XMLTABLE and add appropriate 

Re: [HACKERS] [PATCH] COPY vs \copy HINT

2016-09-06 Thread Tom Lane
Craig Ringer  writes:
> Tom, any preference here?
> I'm probably inclined to go for your original wording and accept that
> it's just too hard to hint at the client/server process split in a
> single short message.

I think my original wording is pretty hopeless for the single-machine
case: "COPY copies to a file on the PostgreSQL server, not on the client".
If the server and client are the same machine, that's content-free.

Christoph's idea isn't bad.  We could tweak it to:

COPY TO instructs the PostgreSQL server process to write a file.

COPY FROM instructs the PostgreSQL server process to read a file.

This seems to cover both the wrong-machine and wrong-process-identity
cases, and as a bonus it might help if you've had a brain fade about
which COPY direction is write vs. read.

(I think we're all in agreement that the second sentence of the hint
is fine, so leaving that out of it.)

Comments?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Alter or rename enum value

2016-09-06 Thread Andrew Dunstan



On 09/06/2016 02:30 PM, Tom Lane wrote:

Robert Haas  writes:

On Mon, Sep 5, 2016 at 11:40 PM, Tom Lane  wrote:

... And again, it's
hard to get excited about having these options for RENAME VALUE when no
one has felt a need for them yet in RENAME COLUMN.  I'm especially dubious
about IF NOT EXISTS against the destination name, considering that there
isn't *any* variant of RENAME that has an equivalent of that.  If it's
really useful, why hasn't that happened?

Because Tom Lane keeps voting against every patch to expand IF [ NOT ]
EXISTS into a new area?  :-)

Well, I'm on record as not liking the squishy semantics of CREATE IF NOT
EXISTS, and you could certainly make the same argument against RENAME IF
NOT EXISTS: you don't really know what state you will have after the
command executes.  But that wasn't the point I was trying to make here.


We do have ALTER TABLE [ IF EXISTS ] .. ADD COLUMN [ IF NOT EXISTS ],
so if somebody wanted the [ IF NOT EXISTS ] clause to also apply to
the RENAME COLUMN case, they'd have a good argument for adding it.

If someone wanted to propose adding IF NOT EXISTS to our rename
commands across-the-board, that would be a sensible feature to discuss.
What I'm objecting to is this one niche-case command getting out in
front of far-more-widely-used commands in terms of having such features.
I think the fact that we don't already have it in other rename commands
is pretty strong evidence that this is a made-up feature rather than
something with actual field demand.  I'm also concerned about adding it
in just one place like this; we might find ourselves boxed in in terms of
hitting syntax conflicts when we try to duplicate the feature elsewhere,
if we haven't done the legwork to add it to all variants of RENAME at
the same time.






Are we also going to have an exists test for the original thing being 
renamed? Exists tests on renames do strike me as somewhat cumbersome, to 
say the least.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Claudio Freire
On Mon, Aug 15, 2016 at 9:33 PM, Peter Geoghegan  wrote:
> The patch is intended to be applied on top of parallel B-Tree patches
> 0001-* and 0002-* [1]. I happened to test it with parallelism, but
> these are all independently useful, and will be entered as a separate
> CF entry (perhaps better to commit the earlier two patches first, to
> avoid merge conflicts). I'm optimistic that we can get those 3 patches
> in the series out of the way early, without blocking on discussing
> parallel sort.

Applied patches 1 and 2, builds fine, regression tests run fine. It
was a prerequisite to reviewing patch 3 (which I'm going to do below),
so I thought I might as well report on that tidbit of info, fwiw.

> The patch makes tuplesort merging shift down and displace the root
> tuple with the tape's next preread tuple, rather than compacting and
> then inserting into the heap anew. This approach to maintaining the
> heap as tuples are returned to caller will always produce fewer
> comparisons overall. The new approach is also simpler. We were already
> shifting down to compact the heap within the misleadingly named [2]
> function tuplesort_heap_siftup() -- why not instead just use the
> caller tuple (the tuple that we currently go on to insert) when
> initially shifting down (not the heap's preexisting last tuple, which
> is guaranteed to go straight to the leaf level again)? That way, we
> don't need to enlarge the heap at all through insertion, shifting up,
> etc. We're done, and are *guaranteed* to have performed less work
> (fewer comparisons and swaps) than with the existing approach (this is
> the reason for my optimism about getting this stuff out of the way
> early).

Patch 3 applies fine to git master as of
25794e841e5b86a0f90fac7f7f851e5d950e51e2 (on top of patches 1 and 2).

Builds fine and without warnings on gcc 4.8.5 AFAICT, regression test
suite runs without issues as well.

Patch lacks any new tests, but the changed code paths seem covered
sufficiently by existing tests. A little bit of fuzzing on the patch
itself, like reverting some key changes, or flipping some key
comparisons, induces test failures as it should, mostly in cluster.

The logic in tuplesort_heap_root_displace seems sound, except:

+*/
+   memtuples[i] = memtuples[imin];
+   i = imin;
+   }
+
+   Assert(state->memtupcount > 1 || imin == 0);
+   memtuples[imin] = *newtup;
+}

Why that assert? Wouldn't it make more sense to Assert(imin < n) ?


In the meanwhile, I'll go and do some perf testing.

Assuming the speedup is realized during testing, LGTM.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in 9.6 tuplesort batch memory growth logic

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 12:51 PM, Tom Lane  wrote:
> I rewrote the comment and pushed it.

Thank you.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in 9.6 tuplesort batch memory growth logic

2016-09-06 Thread Tom Lane
I wrote:
> I see.  The comment could do with a bit of rewriting, perhaps.

I rewrote the comment and pushed it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 12:39 PM, Peter Geoghegan  wrote:
> On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas  wrote:
>>> I attach a patch that changes how we maintain the heap invariant
>>> during tuplesort merging.
>
>> Nice!
>
> Thanks!

BTW, the way that k-way merging is made more efficient by this
approach makes the case for replacement selection even weaker than it
was just before we almost killed it. I hate to say it, but I have to
wonder if we shouldn't get rid of the new-to-9.6
replacement_sort_tuples because of this, and completely kill
replacement selection. I'm not going to go on about it, but that seems
sensible to me.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas  wrote:
>> I attach a patch that changes how we maintain the heap invariant
>> during tuplesort merging.

> Nice!

Thanks!

>> This new approach is more or less the *conventional* way to maintain
>> the heap invariant when returning elements from a heap during k-way
>> merging. Our existing approach is convoluted; merging was presumably
>> only coded that way because the generic functions
>> tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be
>> available. Perhaps the problem was masked by unrelated bottlenecks
>> that existed at the time, too.
>
>
> Yeah, this seems like a very obvious optimization. Is there a standard name
> for this technique in the literature? I'm OK with "displace", or perhaps
> just "replace" or "siftup+insert", but if there's a standard name for this,
> let's use that.

I used the term "displace" specifically because it wasn't a term with
a well-defined meaning in the context of the analysis of algorithms.
Just like "insert" isn't for tuplesort_heap_insert(). I'm not
particularly attached to the name tuplesort_heap_root_displace(), but
I do think that whatever it ends up being called should at least not
be named after an implementation detail. For example,
tuplesort_heap_root_replace() also seems fine.

I think that tuplesort_heap_siftup() should be called something like
tuplesort_heap_compact instead [1], since what it actually does
(shifting down -- the existing name is completely backwards!) is just
an implementation detail involved in compacting the heap (notice that
it decrements memtupcount, which, by now, means the k-way merge heap
gets one element smaller). I can write a patch to do this renaming, if
you're interested. Someone should fix it, because independent of all
this, it's just wrong.

[1] 
https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=cahzrixkjp2eo5q0jg1sofqqexfq647ji...@mail.gmail.com
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-06 Thread Tomas Vondra


On 09/06/2016 04:49 AM, Amit Kapila wrote:
> On Mon, Sep 5, 2016 at 11:34 PM, Tomas Vondra
>  wrote:
>>
>>
>> On 09/05/2016 06:03 AM, Amit Kapila wrote:
>>>  So, in short we have to compare three
>>> approaches here.
>>>
>>> 1) Group mode to reduce CLOGControlLock contention
>>> 2) Use granular locking model
>>> 3) Use atomic operations
>>>
>>> For approach-1, you can use patch [1].  For approach-2, you can use
>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>> with this mail.  For approach-3, you can use
>>> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
>>> with this mail by commenting USE_CONTENT_LOCK.  If the third doesn't
>>> work for you then for now we can compare approach-1 and approach-2.
>>>
>>
>> OK, I can compile all three cases - but onl with gcc 4.7 or newer. Sadly
>> the 4-socket 64-core machine runs Debian Jessie with just gcc 4.6 and my
>> attempts to update to a newer version were unsuccessful so far.
>>
> 
> So which all patches your are able to compile on 4-socket m/c?  I
> think it is better to measure the performance on bigger machine.

Oh, sorry - I forgot to mention that only the last test (with
USE_CONTENT_LOCK commented out) fails to compile, because the functions
for atomics were added in gcc-4.7.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tuplesort merge pre-reading

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan  wrote:
> Offhand, I would think that taken together this is very important. I'd
> certainly want to see cases in the hundreds of megabytes or gigabytes
> of work_mem alongside your 4MB case, even just to be able to talk
> informally about this. As you know, the default work_mem value is very
> conservative.

It looks like your benchmark relies on multiple passes, which can be
misleading. I bet it suffers some amount of problems from palloc()
fragmentation. When very memory constrained, that can get really bad.

Non-final merge passes (merges with more than one run -- real or dummy
-- on any given tape) can have uneven numbers of runs on each tape.
So, tuplesort.c needs to be prepared to doll out memory among tapes
*unevenly* there (same applies to memtuples "slots"). This is why
batch memory support is so hard for those cases (the fact that they're
so rare anyway also puts me off it). As you know, I wrote a patch that
adds batch memory support to cases that require randomAccess (final
output on a materialized tape), for their final merge. These final
merges happen to not be a final on-the-fly merge only due to this
randomAccess requirement from caller. It's possible to support these
cases in the future, with that patch, only because I am safe to assume
that each run/tape is the same size there (well, the assumption is
exactly as safe as it was for the 9.6 final on-the-fly merge, at
least).

My point about non-final merges is that you have to be very careful
that you're comparing apples to apples, memory accounting wise, when
looking into something like this. I'm not saying that you didn't, but
it's worth considering.

FWIW, I did try an increase in the buffer size in LogicalTape at one
time several months back, and so no benefit there (at least, with no
other changes).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] C++ port of Postgres

2016-09-06 Thread Christian Convey
On Tue, Sep 6, 2016 at 3:12 PM, Tom Lane  wrote:
>> (2) It seems like there are still a few big questions about this commit:
>>- Is it wanted at the moment?  It didn't seem like there's a
>>  consensus about whether or not this enhancement should be
>>  merged, even if the patch is pretty minimal.
>>- It seems like there are two competing patch
>>  sets in play for this enhancement: Joy's and
>>  Peter's.  Presumably at most one of them would
>>  be merged.
>
> These are things that reviews should be helping to decide.  It's probably
> a squishier topic than some patches, but if you're interested, feel free
> to read code and weigh in.

Thanks. It sounds like worst-case scenario, I perform an unneeded
review.  I'll give it a shot.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] C++ port of Postgres

2016-09-06 Thread Tom Lane
Christian Convey  writes:
> Could someone help me with a few procedural questions?

> (1) This page: https://wiki.postgresql.org/wiki/Reviewing_a_Patch
> lists the current commitfest's manager as "(vacant)".  But this page:
> https://commitfest.postgresql.org/ seems to indicate that a commitfest
> is currently in progress.  Is there a commitfest manager at the
> moment?

Fabrízio de Royes Mello is it.  I guess he hasn't noticed that that
page ought to be updated.

> (2) It seems like there are still a few big questions about this commit:
>- Is it wanted at the moment?  It didn't seem like there's a
>  consensus about whether or not this enhancement should be
>  merged, even if the patch is pretty minimal.
>- It seems like there are two competing patch
>  sets in play for this enhancement: Joy's and
>  Peter's.  Presumably at most one of them would
>  be merged.

These are things that reviews should be helping to decide.  It's probably
a squishier topic than some patches, but if you're interested, feel free
to read code and weigh in.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tuplesort merge pre-reading

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 5:20 AM, Heikki Linnakangas  wrote:
> I wrote a quick patch to test that, attached. It seems to improve
> performance, at least in this small test case:
>
> create table lotsofints(i integer);
> insert into lotsofints select random() * 10.0 from
> generate_series(1, 1000);
> vacuum freeze;
>
> select count(*) FROM (select * from lotsofints order by i) t;
>
> On my laptop, with default work_mem=4MB, that select takes 7.8 s on
> unpatched master, and 6.2 s with the attached patch.

The benefits have a lot to do with OS read-ahead, and efficient use of
memory bandwidth during the merge, where we want to access the caller
tuples sequentially per tape (i.e. that's what the batch memory stuff
added -- it also made much better use of available memory). Note that
I've been benchmarking the parallel CREATE INDEX patch on a server
with many HDDs, since sequential performance is mostly all that
matters. I think that in 1999, the preloading had a lot more to do
with logtape.c's ability to aggressively recycle blocks during merges,
such that the total storage overhead does not exceed the original size
of the caller tuples (plus what it calls "trivial bookkeeping
overhead" IIRC). That's less important these days, but still matters
some (it's more of an issue when you can't complete the sort in one
pass, which is rare these days).

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-06 Thread Petr Jelinek

On 06/09/16 20:14, Peter Eisentraut wrote:

On 9/3/16 5:14 AM, Petr Jelinek wrote:

What are the BKI_ROWTYPE_OID assignments for?  Are they necessary

here?  (Maybe this was just copied from pg_subscription?)


Yes they are.


Please explain/document why.  It does not match other catalogs, which
either use it for relcache initialization or because they are shared
catalogs.  (I'm not sure of the details, but this one clearly looks
different.)



Erm, I meant yes they are just copied and I will remove them (I see how 
my answer might been confusing given that you asked multiple questions, 
sorry).


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-09-06 Thread Josh Berkus
On 08/29/2016 06:52 AM, Fujii Masao wrote:
> Also I like the following Simon's idea.
> 
> https://www.postgresql.org/message-id/canp8+jlhfbvv_pw6grasnupw+bdk5dctu7gwpnap-+-zwvk...@mail.gmail.com
> ---
> * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
> * any k (n1, n2, n3) – would release waiters as soon as we have the
> responses from k out of N standbys. “any k” would be faster, so is
> desirable for performance and resilience

What are we going to do for backwards compatibility, here?

So, here's the dilemma:

If we want to keep backwards compatibility with 9.6, then:

"k (n1, n2, n3)" == "first k (n1, n2, n3)"

However, "first k" is not what most users will want, most of the time;
users of version 13, years from now, will be getting constantly confused
by "first k" behavior when they wanted quorum.  So the sensible default
would be:

"k (n1, n2, n3)" == "any k (n1, n2, n3)"

... however, that will break backwards compatibility.  Thoughts?

My $0.02 is that we break backwards compat somehow and document the heck
out of it.

-- 
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Tom Lane
Simon Riggs  writes:
> On 6 September 2016 at 19:23, Robert Haas  wrote:
>> On Tue, Sep 6, 2016 at 2:16 PM, Simon Riggs  wrote:
>>> What occurs to me is that we can exactly predict how many tuples we
>>> are going to get when we autovacuum, since we measure that and we know
>>> what the number is when we trigger it.
>>> So there doesn't need to be any guessing going on at all, nor do we
>>> need it to be flexible.

>> No, that's not really true.  A lot can change between the time it's
>> triggered and the time it happens, or even while it's happening.
>> Somebody can run a gigantic bulk delete just after we start the
>> VACUUM.

> Which wouldn't be removed by the VACUUM, so can be ignored.

(1) If the delete commits just before the vacuum starts, it may be
removable.  I think you're nuts to imagine there are no race conditions
here.

(2) Stats from the stats collector never have been, and likely never will
be, anything but approximate.  That goes double for dead-tuple counts,
which are inaccurate even as sent from backends, never mind the multiple
ways that the collector might lose the counts.

The idea of looking to the stats to *guess* about how many tuples are
removable doesn't seem bad at all.  But imagining that that's going to be
exact is folly of the first magnitude.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] C++ port of Postgres

2016-09-06 Thread Heikki Linnakangas

On 08/31/2016 04:41 PM, Peter Eisentraut wrote:

I developed a minimally invasive patch for C++ support a few years ago
shortly after I wrote that blog post.  Since there appears to have been
some interest here now, I have updated that and split it up into logical
chunks.

So here you go.


Looking at this with the POV of what would make sense, even if we don't 
care about C++.



The patches are numbered approximately in increasing order of dubiosity.
 So 0001 is probably a straight bug fix, 0002 and 0003 are arguably
minor bug fixes as well.  The patches through 0012 can probably be
considered for committing in some form.  After that it gets a bit hackish.


0001-0003 look clear to me as well. 0006 - 0009 also seem OK. The rest 
really only make sense if we decided to make the switch to C++.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] C++ port of Postgres

2016-09-06 Thread Christian Convey
On Wed, Aug 31, 2016 at 9:41 AM, Peter Eisentraut
 wrote:
>> Joy, do you have an idea what a *minimally invasive* patch for C++
>> support would look like? That's certainly the first step here.
>
> I developed a minimally invasive patch for C++ support a few years ago
> shortly after I wrote that blog post.  Since there appears to have been
> some interest here now, I have updated that and split it up into logical
> chunks.
>
> So here you go.

Hi folks,

I'm ready for my first-ever PG code review, and I was considering
tackling this one.  I.e., https://commitfest.postgresql.org/10/776/

Could someone help me with a few procedural questions?

(1) This page: https://wiki.postgresql.org/wiki/Reviewing_a_Patch
lists the current commitfest's manager as "(vacant)".  But this page:
https://commitfest.postgresql.org/ seems to indicate that a commitfest
is currently in progress.  Is there a commitfest manager at the
moment?

(2) It seems like there are still a few big questions about this commit:

   - Is it wanted at the moment?  It didn't seem like there's a
 consensus about whether or not this enhancement should be
 merged, even if the patch is pretty minimal.

   - It seems like there are two competing patch
 sets in play for this enhancement: Joy's and
 Peter's.  Presumably at most one of them would
 be merged.

   Are these signs that I should find a different commit to review
   for now, so that this one can be further discussed?

Thanks very much,
Christian


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Robert Haas
On Tue, Sep 6, 2016 at 2:51 PM, Simon Riggs  wrote:
> On 6 September 2016 at 19:23, Robert Haas  wrote:
>> On Tue, Sep 6, 2016 at 2:16 PM, Simon Riggs  wrote:
>>> What occurs to me is that we can exactly predict how many tuples we
>>> are going to get when we autovacuum, since we measure that and we know
>>> what the number is when we trigger it.
>>>
>>> So there doesn't need to be any guessing going on at all, nor do we
>>> need it to be flexible.
>>
>> No, that's not really true.  A lot can change between the time it's
>> triggered and the time it happens, or even while it's happening.
>> Somebody can run a gigantic bulk delete just after we start the
>> VACUUM.
>
> Which wouldn't be removed by the VACUUM, so can be ignored.

OK, true.  But I still think it's very unlikely that we can calculate
an exact count of how many dead tuples we might run into.  I think we
shouldn't rely on the stats collector to be perfectly correct anyway -
for one thing, you can turn it off - and instead cope with the
uncertainty.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Simon Riggs
On 6 September 2016 at 19:23, Robert Haas  wrote:
> On Tue, Sep 6, 2016 at 2:16 PM, Simon Riggs  wrote:
>> What occurs to me is that we can exactly predict how many tuples we
>> are going to get when we autovacuum, since we measure that and we know
>> what the number is when we trigger it.
>>
>> So there doesn't need to be any guessing going on at all, nor do we
>> need it to be flexible.
>
> No, that's not really true.  A lot can change between the time it's
> triggered and the time it happens, or even while it's happening.
> Somebody can run a gigantic bulk delete just after we start the
> VACUUM.

Which wouldn't be removed by the VACUUM, so can be ignored.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] \timing interval

2016-09-06 Thread Tom Lane
Corey Huinker  writes:
> On Sun, Sep 4, 2016 at 7:05 PM, Jim Nasby  wrote:
>> I'd find this useful in the final output of EXPLAIN ANALYZE as well; any
>> objections to adding it?

> It's sorta out of my hands now, but what Tom said earlier is that because
> this is client-side code, it wouldn't use existing interval code.
> EXPLAIN *is* server-side, we couldn't use this code, but we could leverage
> existing interval code there to achieve a similar concept.

If we like this specific output format, I'd be inclined to just
copy-and-paste the code from psql.  I seriously doubt that getting type
interval involved in the discussion would lead to a shorter or
better-performing solution.

> I have another thing I'd like to add to EXPLAIN output : server version
> number output. So maybe we can pick those up in another thread.

Ugh.  There are multiple ways to get that already, and it's not like
space in EXPLAIN's output is not a precious resource.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Showing parallel status in \df+

2016-09-06 Thread Pavel Stehule
Hi

2016-09-06 0:05 GMT+02:00 Tom Lane :

> I wrote:
> > Pavel Stehule  writes:
> >> Using footer for this purpose is little bit strange. What about
> following
> >> design?
> >> 1. move out source code of PL functions from \df+
> >> 2. allow not unique filter in \sf and allow to display multiple
> functions
>
> > Wasn't that proposed and rejected upthread?
>
> So ... why did you put this patch in "Waiting on Author" state?  AFAIK,
> we had dropped the idea of relying on \sf for this, mainly because
> Peter complained about \df+ no longer providing source code.  I follow
> his point: if you're used to using \df+ to see source code, you probably
> can figure it out quickly if that command shows the source in a different
> place than before.  But if it doesn't show it at all, using \sf instead
> might not occur to you right away.
>

I see only one situation, when I want to see more then one source code -
checking overloaded functions. I prefer to see complete source code - in
\sf format. But I don't remember, when I did it last time. So I can live
without it well.

I am thinking, there is strong agreement about reduction \dt+ result. I am
not sure about usability of showing source code in footer. It is not too
much readable - and the fact, so function's body is displayed not as CREATE
statements, does the result less readable.

Now I am thinking so using footer for this purpose is not too great idea -
maybe we can live better without it (without source code of PL in \dt+
result, I would to see only C function source there). If you like using
footer, then the format should be changed to be more consistent, readable?
I am not sure, how it can be enhanced.

Regards

Pavel


>
> regards, tom lane
>


Re: [HACKERS] \timing interval

2016-09-06 Thread Corey Huinker
On Sun, Sep 4, 2016 at 7:05 PM, Jim Nasby  wrote:

> On 9/3/16 2:35 PM, Tom Lane wrote:
>
>> I pushed the patch using this:
>>
>> Time: 176460001.200 ms (2 d 01:01:00.001)
>>
>> and all else as before.
>>
>
> I'd find this useful in the final output of EXPLAIN ANALYZE as well; any
> objections to adding it?


It's sorta out of my hands now, but what Tom said earlier is that because
this is client-side code, it wouldn't use existing interval code.

EXPLAIN *is* server-side, we couldn't use this code, but we could leverage
existing interval code there to achieve a similar concept.

I have another thing I'd like to add to EXPLAIN output : server version
number output. So maybe we can pick those up in another thread.


Re: [HACKERS] [PATCH] Alter or rename enum value

2016-09-06 Thread Tom Lane
Robert Haas  writes:
> On Mon, Sep 5, 2016 at 11:40 PM, Tom Lane  wrote:
>> ... And again, it's
>> hard to get excited about having these options for RENAME VALUE when no
>> one has felt a need for them yet in RENAME COLUMN.  I'm especially dubious
>> about IF NOT EXISTS against the destination name, considering that there
>> isn't *any* variant of RENAME that has an equivalent of that.  If it's
>> really useful, why hasn't that happened?

> Because Tom Lane keeps voting against every patch to expand IF [ NOT ]
> EXISTS into a new area?  :-)

Well, I'm on record as not liking the squishy semantics of CREATE IF NOT
EXISTS, and you could certainly make the same argument against RENAME IF
NOT EXISTS: you don't really know what state you will have after the
command executes.  But that wasn't the point I was trying to make here.

> We do have ALTER TABLE [ IF EXISTS ] .. ADD COLUMN [ IF NOT EXISTS ],
> so if somebody wanted the [ IF NOT EXISTS ] clause to also apply to
> the RENAME COLUMN case, they'd have a good argument for adding it.

If someone wanted to propose adding IF NOT EXISTS to our rename
commands across-the-board, that would be a sensible feature to discuss.
What I'm objecting to is this one niche-case command getting out in
front of far-more-widely-used commands in terms of having such features.
I think the fact that we don't already have it in other rename commands
is pretty strong evidence that this is a made-up feature rather than
something with actual field demand.  I'm also concerned about adding it
in just one place like this; we might find ourselves boxed in in terms of
hitting syntax conflicts when we try to duplicate the feature elsewhere,
if we haven't done the legwork to add it to all variants of RENAME at
the same time.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Claudio Freire
On Tue, Sep 6, 2016 at 3:45 AM, Simon Riggs  wrote:
> On 5 September 2016 at 21:58, Claudio Freire  wrote:
>
> How long does that part ever take? Is there any substantial gain from 
> this?
>
>> Btw, without a further patch to prefetch pages on the backward scan
>> for truncate, however, my patience ran out before it finished
>> truncating. I haven't submitted that patch because there was an
>> identical patch in an older thread that was discussed and more or less
>> rejected since it slightly penalized SSDs.
>
> OK, thats enough context. Sorry for being forgetful on that point.
>
> Please post that new patch also.

Attached.

On Mon, Sep 5, 2016 at 5:58 PM, Claudio Freire  wrote:
> On Mon, Sep 5, 2016 at 5:36 PM, Simon Riggs  wrote:
>> On 5 September 2016 at 15:50, Claudio Freire  wrote:
>>> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs  wrote:
 On 3 September 2016 at 04:25, Claudio Freire  
 wrote:
> The patch also makes vacuum free the dead_tuples before starting
> truncation. It didn't seem necessary to hold onto it beyond that
> point, and it might help give the OS more cache, especially if work
> mem is configured very high to avoid multiple index scans.

 How long does that part ever take? Is there any substantial gain from this?

 Lets discuss that as a potential second patch.
>>>
>>> In the test case I mentioned, it takes longer than the vacuum part itself.
>>
>> Please provide a test case and timings so we can see what's happening.

Robert made a strong point for a change in the approach, so the
information below is applicable only to the old patch (to be
rewritten).

I'm sending this merely to document the testing done, it will be a
while before I can get the proposed design running and tested.

> The referenced test case is the one I mentioned on the OP:
>
> - createdb pgbench
> - pgbench -i -s 4000 pgbench
> - psql pgbench -c 'delete from pgbench_accounts;'
> - vacuumdb -v -t pgbench_accounts pgbench
>
> fsync=off, autovacuum=off, maintainance_work_mem=4GB
>
> From what I remember, it used ~2.7GB of RAM up until the truncate
> phase, where it freed it. It performed a single index scan over the
> PK.
>
> I don't remember timings, and I didn't take them, so I'll have to
> repeat the test to get them. It takes all day and makes my laptop
> unusably slow, so I'll post them later, but they're not very
> interesting. The only interesting bit is that it does a single index
> scan instead of several, which on TB-or-more tables it's kinda nice.


So, the test results below:

During setup, maybe for context, the delete took 52m 50s real time
(measured with time psql pgbench -c 'delete from pgbench_accounts;')

During the delete, my I/O was on average like the following, which
should give an indication of what my I/O subsystem is capable of (not
much, granted):

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda   0.47 5.27   35.53   77.4217.5842.95
1097.51   145.22 1295.23   33.47 1874.36   8.85 100.00

Since it's a 5k RPM laptop drive, it's rather slow on IOPS, and since
I'm using the defaults for shared buffers and checkpoints, write
thoughput isn't stellar either. But that's not the point of the test
anyway, it's just for context.

The hardware is an HP envy laptop with a 1TB 5.4k RPM hard drive, 12GB
RAM, core i7-4722HQ, no weird performance tweaking of any kind (ie:
cpu scaling left intact). The system was not dedicated of course,
being a laptop, but it had little else going on while the test was
running. Given the size of the test, I don't believe there's any
chance concurrent activity could invalidate the results.

The timing for setup was comparable with both versions (patched and
unpatched), so I'm reporting the patched times only.


The vacuum phase:

patched:

$ vacuumdb -v -t pgbench_accounts pgbench
INFO:  vacuuming "public.pgbench_accounts"
INFO:  scanned index "pgbench_accounts_pkey" to remove 4 row versions
DETAIL:  CPU 12.46s/48.76u sec elapsed 566.47 sec.
INFO:  "pgbench_accounts": removed 4 row versions in 6557378 pages
DETAIL:  CPU 56.68s/28.90u sec elapsed 1872.76 sec.
INFO:  index "pgbench_accounts_pkey" now contains 0 row versions in
1096762 pages
DETAIL:  4 index row versions were removed.
1092896 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.47 sec.
INFO:  "pgbench_accounts": found 4 removable, 0 nonremovable
row versions in 6557378 out of 6557378 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 129.24s/127.24u sec elapsed 3877.13 sec.
INFO:  "pgbench_accounts": truncated 6557378 to 0 pages

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Robert Haas
On Tue, Sep 6, 2016 at 2:16 PM, Simon Riggs  wrote:
> What occurs to me is that we can exactly predict how many tuples we
> are going to get when we autovacuum, since we measure that and we know
> what the number is when we trigger it.
>
> So there doesn't need to be any guessing going on at all, nor do we
> need it to be flexible.

No, that's not really true.  A lot can change between the time it's
triggered and the time it happens, or even while it's happening.
Somebody can run a gigantic bulk delete just after we start the
VACUUM.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Simon Riggs
On 6 September 2016 at 19:09, Robert Haas  wrote:
> On Tue, Sep 6, 2016 at 2:06 PM, Simon Riggs  wrote:
>> On 6 September 2016 at 19:00, Tom Lane  wrote:
>>> Robert Haas  writes:
 Yeah, but I've seen actual breakage from exactly this issue on
 customer systems even with the 1GB limit, and when we start allowing
 100GB it's going to get a whole lot worse.
>>>
>>> While it's not necessarily a bad idea to consider these things,
>>> I think people are greatly overestimating the consequences of the
>>> patch-as-proposed.  AFAICS, it does *not* let you tell VACUUM to
>>> eat 100GB of workspace.  Note the line right in front of the one
>>> being changed:
>>>
>>>  maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
>>>  maxtuples = Min(maxtuples, INT_MAX);
>>> -maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
>>> +maxtuples = Min(maxtuples, MaxAllocHugeSize / 
>>> sizeof(ItemPointerData));
>>>
>>> Regardless of what vac_work_mem is, we aren't gonna let you have more
>>> than INT_MAX ItemPointers, hence 12GB at the most.  So the worst-case
>>> increase from the patch as given is 12X.  Maybe that's enough to cause
>>> bad consequences on some systems, but it's not the sort of disaster
>>> Robert posits above.
>>
>> Is there a reason we can't use repalloc here?
>
> There are two possible problems, either of which is necessarily fatal:
>
> 1. I expect repalloc probably works by allocating the new space,
> copying from old to new, and freeing the old.  That could work out
> badly if we are nearly the edge of the system's allocation limit.
>
> 2. It's slower than the approach proposed upthread of allocating the
> array in segments.  With that approach, we never need to memcpy()
> anything.
>
> On the plus side, it's probably less code.

Hmm, OK.

What occurs to me is that we can exactly predict how many tuples we
are going to get when we autovacuum, since we measure that and we know
what the number is when we trigger it.

So there doesn't need to be any guessing going on at all, nor do we
need it to be flexible.

My proposal now is to pass in the number of rows changed since last
vacuum and use that (+10% to be safe) as the size of the array, up to
the defined limit.

Manual VACUUM still needs to guess, so we might need a flexible
solution there, but generally we don't. We could probably estimate it
from the VM.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Claudio Freire
On Tue, Sep 6, 2016 at 3:11 PM, Tom Lane  wrote:
> We could get around (1) by something like Robert's idea of segmented
> allocation, but TBH I've seen nothing on this thread to make me think
> it's necessary or would even result in any performance improvement
> at all.  The bigger we make that array, the worse index-cleaning
> is going to perform, and complicating the data structure will add
> another hit on top of that.

I wouldn't be so sure, I've seen cases where two binary searches were
faster than a single binary search, especially when working with
humongus arrays like this tid array, because touching less (memory)
pages for a search does pay off considerably.

I'd try before giving up on the idea.

The test results (which I'll post in a second) do give credit to your
expectation that making the array bigger/more complex does impact
index scan performance. It's still faster than scanning several times
though.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-06 Thread Peter Eisentraut
On 9/3/16 5:14 AM, Petr Jelinek wrote:
>> What are the BKI_ROWTYPE_OID assignments for?  Are they necessary
>> > here?  (Maybe this was just copied from pg_subscription?)
>> >
> Yes they are.

Please explain/document why.  It does not match other catalogs, which
either use it for relcache initialization or because they are shared
catalogs.  (I'm not sure of the details, but this one clearly looks
different.)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM

2016-09-06 Thread Corey Huinker
On Fri, Sep 2, 2016 at 5:07 AM, Amit Langote 
wrote:

>
> I am not familiar with win32 stuff too, so I don't have much to say about
> that.  Maybe someone else can chime in to help with that.
>

The regressions basically *can't* test this because we'd need a shell
command we know works on any architecture.


>
> About the patch:
>
> * applies cleanly, compiles fine
> * basic functionality seems to work (have not done any extensive tests
> though)
>
>
> - Specifies the file to be read.  Required.  Must be an absolute path
> name.
> + Specifies the file to be read.  Must be an absolute path name.
> + Either filename or program
> must be
> + specified.  They are mutually exclusive.
> +
> +   
> +  
>
> The note about filename and program being mutually exclusive could be
> placed at the end of list of options.  Or perhaps mention this note as
> part of the description of program option, because filename is already
> defined by that point.
>
> +   
> +
> + Specifies the command to executed.
>
> s/to executed/to be executed/g
>

Correct. I will fix that when other issues below are resolved.


>
> + Either program or filename
> must be
> + specified.  They are mutually exclusive.
>  
>
> Oh I see this has already been mentioned in program option description.
> Did you intend to specify the same in both filename and program
> descriptions?
>

It's been a while since I repackaged Adam's code, but generally I'm in
favor of some redundancy if the two mutually exclusive things are
documented far enough apart.



>
> @@ -97,8 +99,9 @@ typedef struct FileFdwPlanState
>  typedef struct FileFdwExecutionState
>  {
>  char   *filename;   /* file to read */
> -List   *options;/* merged COPY options, excluding
> filename */
> -CopyState   cstate; /* state of reading file */
> +char   *program;/* program to read output from */
> +List   *options;/* merged COPY options, excluding
> filename and program */
> +CopyState   cstate; /* state of reading file or program */
>
> Have you considered a Boolean flag is_program instead of char **program
> similar to how copy.c does it?  (See a related comment further below)
>

Considered it yes, but decided against it when I started to write my
version. When Adam delivered his version of the patch, I noticed he had
made the same design decision. Only one of them will be initialized, and
the boolean will byte align to 4 bytes, so it's the same storage allocated
either way.

Either we're fine with two variables, or we think file_name is poorly
named. I have only a slight preference for the two variables, and defer to
the group for a preference.


>
> - * This is because the filename is one of those options, and we don't
> want
> - * non-superusers to be able to determine which file gets read.
> + * This is because the filename or program string are two of those
> + * options, and we don't want non-superusers to be able to determine
> which
> + * file gets read or what command is run.
>
> I'm no native English speaker, but I wonder if this should read: filename
> or program string *is* one of those options ... OR filename *and* program
> are two of those options ... ?  Also, "are two of those options" sounds a
> bit odd to me because I don't see that used as often as "one of those
> whatever".  I guess that's quite a bit of nitpicking about a C comment, ;)
>

Given that C comments constitute a large portion of our documentation, I
fully support making them as clear as possible.

I don't remember if this is Adam's comment or mine. Adam - if you're out
there, chime in.

The original paragraph was:

Changing table-level options requires superuser privileges, for security
reasons: only a superuser should be able to determine which file is read.
In principle non-superusers could be allowed to change the other options,
but that's not supported at present.


How does this paragraph sound?:

Changing table-level options requires superuser privileges, for security
reasons: only a superuser should be able to determine which file is read,
or which program is run. In principle non-superusers could be allowed to
change the other options, but that's not supported at present.




> @@ -257,9 +264,26 @@ file_fdw_validator(PG_FUNCTION_ARGS)
>  ereport(ERROR,
>  (errcode(ERRCODE_SYNTAX_ERROR),
>   errmsg("conflicting or redundant options")));
> +if (program)
> +ereport(ERROR,
> +(errcode(ERRCODE_SYNTAX_ERROR),
> + errmsg("conflicting or redundant options")));
>  filename = defGetString(def);
>  }
>
> +else if (strcmp(def->defname, "program") == 0)
> +{
> +if (filename)
> +ereport(ERROR,
> +

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Tom Lane
Simon Riggs  writes:
> Is there a reason we can't use repalloc here?

(1) repalloc will probably copy the data.

(2) that answer doesn't excuse you from choosing a limit.

We could get around (1) by something like Robert's idea of segmented
allocation, but TBH I've seen nothing on this thread to make me think
it's necessary or would even result in any performance improvement
at all.  The bigger we make that array, the worse index-cleaning
is going to perform, and complicating the data structure will add
another hit on top of that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Robert Haas
On Tue, Sep 6, 2016 at 2:09 PM, Robert Haas  wrote:
> There are two possible problems, either of which is necessarily fatal:

I meant to write "neither of which" not "either of which".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Robert Haas
On Tue, Sep 6, 2016 at 2:06 PM, Simon Riggs  wrote:
> On 6 September 2016 at 19:00, Tom Lane  wrote:
>> Robert Haas  writes:
>>> Yeah, but I've seen actual breakage from exactly this issue on
>>> customer systems even with the 1GB limit, and when we start allowing
>>> 100GB it's going to get a whole lot worse.
>>
>> While it's not necessarily a bad idea to consider these things,
>> I think people are greatly overestimating the consequences of the
>> patch-as-proposed.  AFAICS, it does *not* let you tell VACUUM to
>> eat 100GB of workspace.  Note the line right in front of the one
>> being changed:
>>
>>  maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
>>  maxtuples = Min(maxtuples, INT_MAX);
>> -maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
>> +maxtuples = Min(maxtuples, MaxAllocHugeSize / 
>> sizeof(ItemPointerData));
>>
>> Regardless of what vac_work_mem is, we aren't gonna let you have more
>> than INT_MAX ItemPointers, hence 12GB at the most.  So the worst-case
>> increase from the patch as given is 12X.  Maybe that's enough to cause
>> bad consequences on some systems, but it's not the sort of disaster
>> Robert posits above.
>
> Is there a reason we can't use repalloc here?

There are two possible problems, either of which is necessarily fatal:

1. I expect repalloc probably works by allocating the new space,
copying from old to new, and freeing the old.  That could work out
badly if we are nearly the edge of the system's allocation limit.

2. It's slower than the approach proposed upthread of allocating the
array in segments.  With that approach, we never need to memcpy()
anything.

On the plus side, it's probably less code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Robert Haas
On Tue, Sep 6, 2016 at 2:00 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> Yeah, but I've seen actual breakage from exactly this issue on
>> customer systems even with the 1GB limit, and when we start allowing
>> 100GB it's going to get a whole lot worse.
>
> While it's not necessarily a bad idea to consider these things,
> I think people are greatly overestimating the consequences of the
> patch-as-proposed.  AFAICS, it does *not* let you tell VACUUM to
> eat 100GB of workspace.  Note the line right in front of the one
> being changed:
>
>  maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
>  maxtuples = Min(maxtuples, INT_MAX);
> -maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
> +maxtuples = Min(maxtuples, MaxAllocHugeSize / 
> sizeof(ItemPointerData));
>
> Regardless of what vac_work_mem is, we aren't gonna let you have more
> than INT_MAX ItemPointers, hence 12GB at the most.  So the worst-case
> increase from the patch as given is 12X.  Maybe that's enough to cause
> bad consequences on some systems, but it's not the sort of disaster
> Robert posits above.

Hmm, OK.  Yes, that is a lot less bad.  (I think it's still bad.)

> If we think the expected number of dead pointers is so much less than
> that, why don't we just decrease LAZY_ALLOC_TUPLES, and take a hit in
> extra index vacuum cycles when we're wrong?

Because that's really inefficient.  Growing the array, even with a
stupid approach that copies all of the TIDs every time, is a heck of a
lot faster than incurring an extra index vac cycle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Simon Riggs
On 6 September 2016 at 19:00, Tom Lane  wrote:
> Robert Haas  writes:
>> Yeah, but I've seen actual breakage from exactly this issue on
>> customer systems even with the 1GB limit, and when we start allowing
>> 100GB it's going to get a whole lot worse.
>
> While it's not necessarily a bad idea to consider these things,
> I think people are greatly overestimating the consequences of the
> patch-as-proposed.  AFAICS, it does *not* let you tell VACUUM to
> eat 100GB of workspace.  Note the line right in front of the one
> being changed:
>
>  maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
>  maxtuples = Min(maxtuples, INT_MAX);
> -maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
> +maxtuples = Min(maxtuples, MaxAllocHugeSize / 
> sizeof(ItemPointerData));
>
> Regardless of what vac_work_mem is, we aren't gonna let you have more
> than INT_MAX ItemPointers, hence 12GB at the most.  So the worst-case
> increase from the patch as given is 12X.  Maybe that's enough to cause
> bad consequences on some systems, but it's not the sort of disaster
> Robert posits above.

Is there a reason we can't use repalloc here?

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Tom Lane
Robert Haas  writes:
> Yeah, but I've seen actual breakage from exactly this issue on
> customer systems even with the 1GB limit, and when we start allowing
> 100GB it's going to get a whole lot worse.

While it's not necessarily a bad idea to consider these things,
I think people are greatly overestimating the consequences of the
patch-as-proposed.  AFAICS, it does *not* let you tell VACUUM to
eat 100GB of workspace.  Note the line right in front of the one
being changed:

 maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
 maxtuples = Min(maxtuples, INT_MAX);
-maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
+maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData));

Regardless of what vac_work_mem is, we aren't gonna let you have more
than INT_MAX ItemPointers, hence 12GB at the most.  So the worst-case
increase from the patch as given is 12X.  Maybe that's enough to cause
bad consequences on some systems, but it's not the sort of disaster
Robert posits above.

It's also worth re-reading the lines just after this, which constrain
the allocation a whole lot more for small tables.  Robert comments:

> ...  But VACUUM will very happily allocate
> vastly more memory than the number of dead tuples.  It is thankfully
> smart enough not to allocate more storage than the number of line
> pointers that could theoretically exist in a relation of the given
> size, but that only helps for very small relations.  In a large
> relation that divergence between the amount of storage space that
> could theoretically be needed and the amount that is actually needed
> is likely to be extremely high.  1 TB relation = 2^27 blocks, each of
> which can contain MaxHeapTuplesPerPage dead line pointers.  On my
> system, MaxHeapTuplesPerPage is 291, so that's 291 * 2^27 possible
> dead line pointers, which at 6 bytes each is 291 * 6 * 2^27 = ~218GB,
> but the expected number of dead line pointers is much less than that.

If we think the expected number of dead pointers is so much less than
that, why don't we just decrease LAZY_ALLOC_TUPLES, and take a hit in
extra index vacuum cycles when we're wrong?

(Actually, what I'd be inclined to do is let it have MaxHeapTuplesPerPage
slots per page up till a few meg, and then start tailing off the
space-per-page, figuring that the law of large numbers will probably kick
in.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Robert Haas
On Tue, Sep 6, 2016 at 11:22 PM, Claudio Freire  wrote:
> CREATE INDEX could also allocate 218GB, you just need to index enough
> columns and you'll get that.
>
> Aside from the fact that CREATE INDEX will only allocate what is going
> to be used and VACUUM will overallocate, the potential to fully
> allocate the amount given is still there for both cases.

I agree with that, but I think there's a big difference between
allocating the memory only when it's needed and allocating it whether
it is needed or not.  YMMV, of course, but that's what I think

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Alter or rename enum value

2016-09-06 Thread Robert Haas
On Mon, Sep 5, 2016 at 11:40 PM, Tom Lane  wrote:
> The opportunity cost here is potential user confusion.  The only
> closely parallel rename operation we have is ALTER TABLE RENAME COLUMN,
> and that doesn't have a column-level IF EXISTS option; it has a
> table-level IF EXISTS option.  So I think it would be weird and confusing
> for ALTER TYPE RENAME VALUE to be different from that.  And again, it's
> hard to get excited about having these options for RENAME VALUE when no
> one has felt a need for them yet in RENAME COLUMN.  I'm especially dubious
> about IF NOT EXISTS against the destination name, considering that there
> isn't *any* variant of RENAME that has an equivalent of that.  If it's
> really useful, why hasn't that happened?

Because Tom Lane keeps voting against every patch to expand IF [ NOT ]
EXISTS into a new area?  :-)

We do have ALTER TABLE [ IF EXISTS ] .. ADD COLUMN [ IF NOT EXISTS ],
so if somebody wanted the [ IF NOT EXISTS ] clause to also apply to
the RENAME COLUMN case, they'd have a good argument for adding it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Claudio Freire
On Tue, Sep 6, 2016 at 2:39 PM, Robert Haas  wrote:
>> I could attempt that, but I don't see the difference between
>> vacuum and create index in this case. Both could allocate a huge chunk
>> of the virtual address space if maintainance work mem says so, both
>> proportional to the size of the table. I can't see how that could take
>> any DBA by surprise.
>
> Really?  CREATE INDEX isn't going to allocate more storage space than
> the size of the data actually being sorted, because tuplesort.c is
> smart about that kind of thing.  But VACUUM will very happily allocate
> vastly more memory than the number of dead tuples.  It is thankfully
> smart enough not to allocate more storage than the number of line
> pointers that could theoretically exist in a relation of the given
> size, but that only helps for very small relations.  In a large
> relation that divergence between the amount of storage space that
> could theoretically be needed and the amount that is actually needed
> is likely to be extremely high.  1 TB relation = 2^27 blocks, each of
> which can contain MaxHeapTuplesPerPage dead line pointers.  On my
> system, MaxHeapTuplesPerPage is 291, so that's 291 * 2^27 possible
> dead line pointers, which at 6 bytes each is 291 * 6 * 2^27 = ~218GB,
> but the expected number of dead line pointers is much less than that.
> Even if this is a vacuum triggered by autovacuum_vacuum_scale_factor
> and you're using the default of 0.2 (probably too high for such a
> large table), assuming there are about 60 tuples for page (which is
> what I get with pgbench -i) the table would have about 2^27 * 60 = 7.7
> billion tuples of which 1.5 billion would be dead, meaning we need
> about 9-10GB of space to store all of those dead tuples.  Allocating
> as much as 218GB when we need 9-10GB is going to sting, and I don't
> see how you will get a comparable distortion with CREATE INDEX.  I
> might be missing something, though.

CREATE INDEX could also allocate 218GB, you just need to index enough
columns and you'll get that.

Aside from the fact that CREATE INDEX will only allocate what is going
to be used and VACUUM will overallocate, the potential to fully
allocate the amount given is still there for both cases.

> There's no real issue when there's only one process running on the
> system at a time.  If the user set maintenance_work_mem to an amount
> of memory that he can't afford to pay even once, then that's simple
> misconfiguration and it's not really our problem.  The issue is that
> when there are 3 or potentially more VACUUM processes running plus a
> CREATE INDEX or two at the same time.  If you set maintenance_work_mem
> to a value that is large enough to make the CREATE INDEX run fast, now
> with your patch that is also going to cause each VACUUM process to
> gobble up lots of extra memory that it probably doesn't need, and now
> you may well start to get failures.  I've seen this happen even with
> the current 1GB limit, though you need a pretty small system - e.g.
> 8GB RAM - for it to be a problem.  I think it is really really likely
> to cause big problems for us if we dramatically increase that limit
> without making the allocation algorithm smarter.

Ok, a pity it will invalidate all the testing already done though (I
was almost done with the testing).

I guess I'll send the results anyway.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Forbid use of LF and CR characters in database and role names

2016-09-06 Thread Robert Haas
On Tue, Sep 6, 2016 at 9:43 PM, Peter Eisentraut
 wrote:
> On 8/11/16 9:12 PM, Michael Paquier wrote:
>> Note that pg_dump[all] and pg_upgrade already have safeguards against
>> those things per the same routines putting quotes for execution as
>> commands into psql and shell. So attached is a patch to implement this
>> restriction in the backend, and I am adding that to the next CF for
>> 10.0. Attached is as well a script able to trigger those errors.
>
> After further review, I have my doubts about this approach.
>
> Everything that is using appendShellString() is now going to reject LF
> and CR characters, but there is no systematic way by which this is
> managed, enforced, or documented.  It happens that right now most of the
> affected cases are user and database names, but there are others.  For
> example, you cannot anymore install PostgreSQL into a path containing
> LF/CR, because initdb will fail when it composes the pg_ctl command line
> to print out.  Also, initdb will fail if the data directory name
> contains LF/CR, but it creates the directory nonetheless.  (Apparently,
> it doesn't even clean it up.)  But for example pg_ctl and pg_basebackup
> and postgres itself handle all of that just fine.  This is a slowly
> growing mess.
>
> I think the way forward here, if any, is to work on removing these
> restrictions, not to keep sprinkling them around.

If we were talking about pathnames containing spaces, I would agree,
but I've never heard of a legitimate pathname containing CR or LF.  I
can't see us losing much by refusing to allow such pathnames, except
for security holes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-06 Thread Petr Jelinek

On 01/09/16 08:29, Erik Rijkers wrote:

On 2016-09-01 01:04, Erik Rijkers wrote:

On 2016-08-31 22:51, Petr Jelinek wrote:

Here are some small changes to  logical-replication.sgml


...  and other .sgml files.


Thanks I'll integrate these into next iteration of the patch,

--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Make initdb's suggested "pg_ctl start" command line more reliabl

2016-09-06 Thread Tom Lane
Claudio Freire  writes:
> On Tue, Sep 6, 2016 at 2:08 PM, Tom Lane  wrote:
>> Dash is considered a character that needs quoting.  It might be possible
>> to avoid that if we could be certain that appendShellString's output would
>> never be placed in a spot where it could be taken for a switch, but that
>> seems like a large assumption to me.

> Wouldn't it be taken for a switch even with quoting?
> Quoting "-D" seems to work fine, which would suggest the above is true.

[ thinks about that... ]  Oh, you're right, brain fade on my part.  The
shell doesn't care whether words are switches or not.  So actually the
risk-factor for us is whether we have designed any command-line syntaxes
in a way that would allow a path starting with a dash to cause bad things
to happen.  I have a feeling the answer is "yes", even without considering
the prospect that GNU getopt will arbitrarily rearrange the command words
on us depending on what it thinks is a switch.  (Maybe leading-dash is
another one of the things we'd better make a policy against.)

But meanwhile, yes, the argument for treating it as quotable in
appendShellString seems completely bogus.  I'll go change that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ICU integration

2016-09-06 Thread Doug Doole
> The ICU ABI (not API) is also versioned.  The way that this is done is
> that all functions are actually macros to a versioned symbol.  So
> ucol_open() is actually a macro that expands to, say, ucol_open_57() in
> ICU version 57.  (They also got rid of a dot in their versions a while
> ago.)  It's basically hand-crafted symbol versioning.  That way, you can
> link with multiple versions of ICU at the same time.  However, the
> purpose of that, as I understand it, is so that plugins can have a
> different version of ICU loaded than the main process or another plugin.
 > In terms of postgres using the right version of ICU, it doesn't buy
> anything beyond what the soname mechanism does.

You can access the versioned API as well, it's just not documented. (The
ICU team does support this - we worked very closely with them when doing
all this.) We exploited the versioned API when we learned that there is no
guarantee of backwards compatibility in collations. You can't just change a
collation under a user (at least that was our opinion) since it can cause
all sorts of problems. Refreshing a collation (especially on the fly) is a
lot more work than we were prepared to take on. So we exploited the
versioned APIs.

We carried the ICU version numbers around on our collation and locale IDs
(such as fr_FR%icu36) . The database would load multiple versions of the
ICU library so that something created with ICU 3.6 would always be
processed with ICU 3.6. This avoided the problems of trying to change the
rules on the user. (We'd always intended to provide tooling to allow the
user to move an existing object up to a newer version of ICU, but we never
got around to doing it.) In the code, this meant we were explicitly calling
the versioned API so that we could keep the calls straight. (Of course this
was abstracted in a set of our own locale functions so that the rest of the
engine was ignorant of the ICU library fun that was going on.)

> We can refine the guidance.  But indexes are the most important issue, I
> think, because changing the sorting rules in the background makes data
> silently disappear.

I'd say that collation is the most important issue, but collation impacts a
lot more than indexes.

Unfortunately as part of changing companies I had to leave my "screwy stuff
that has happened in collations" presentation behind so I don't have
concrete examples to point to, but I can cook up illustrative examples:

- Suppose in ICU X.X, AA = Å but in ICU Y.Y AA != Å. Further suppose there
was an RI constraint where the primary key used AA and the foreign key
used Å. If ICU was updated, the RI constraint between the rows would break,
leaving an orphaned foreign key.

- I can't remember the specific language but they had the collation rule
that "CH" was treated as a distinct entity between C and D. This gave the
order C < CG < CI < CZ < CH < D. Then they removed CH as special which gave
C < CG < CH < CI < CZ < D. Suppose there was the constraint CHECK (COL
BETWEEN 'C' AND 'CH'). Originally it would allow (almost) all strings that
started with C. After the change it the constraint would block everything
that started with CI - CZ leaving many rows that no longer qualify in the
database.

It could be argued that these are edge cases and not likely to be hit.
That's likely true for a lot of users. But for a user who hits this, their
database is going to be left in a mess.

--
Doug Doole

On Tue, Sep 6, 2016 at 8:37 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 8/31/16 4:24 PM, Doug Doole wrote:
> > ICU explicitly does not provide stability in their locales and
> collations. We pushed them hard to provide this, but between changes to the
> CLDR data and changes to the ICU code it just wasn’t feasible for them to
> provide version to version stability.
> >
> > What they do offer is a compile option when building ICU to version all
> their APIs. So instead of calling icu_foo() you’d call icu_foo46(). (Or
> something like this - it’s been a few years since I actually worked with
> the ICU code.) This ultimately allows you to load multiple versions of the
> ICU library into a single program and provide stability by calling the
> appropriate version of the library. (Unfortunately, the OS - at least my
> Linux box - only provides the generic version of ICU and not the version
> annotated APIs, which means a separate compile of ICU is needed.)
> >
> > The catch with this is that it means you likely want to expose the
> version information. In another note it was suggested to use something like
> fr_FR%icu. If you want to pin it to a specific version of ICU, you’ll
> likely need something like fr_FR%icu46. (There’s nothing wrong with
> supporting fr_FR%icu to give users an easy way of saying “give me the
> latest and greatest”, but you’d probably want to harden it to a specific
> ICU version internally.)
>
> There are multiple things going on.
>
> Collations in ICU are versioned.  You can find out the version 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Robert Haas
On Tue, Sep 6, 2016 at 10:28 PM, Claudio Freire  wrote:
>> The problem with this is that we allocate the entire amount of
>> maintenance_work_mem even when the number of actual dead tuples turns
>> out to be very small.  That's not so bad if the amount of memory we're
>> potentially wasting is limited to ~1 GB, but it seems pretty dangerous
>> to remove the 1 GB limit, because somebody might have
>> maintenance_work_mem set to tens or hundreds of gigabytes to speed
>> index creation, and allocating that much space for a VACUUM that
>> encounters 1 dead tuple does not seem like a good plan.
>>
>> What I think we need to do is make some provision to initially
>> allocate only a small amount of memory and then grow the allocation
>> later if needed.  For example, instead of having
>> vacrelstats->dead_tuples be declared as ItemPointer, declare it as
>> ItemPointer * and allocate the array progressively in segments.  I'd
>> actually argue that the segment size should be substantially smaller
>> than 1 GB, like say 64MB; there are still some people running systems
>> which are small enough that allocating 1 GB when we may need only 6
>> bytes can drive the system into OOM.
>
> This would however incur the cost of having to copy the whole GB-sized
> chunk every time it's expanded. It woudln't be cheap.

No, I don't want to end up copying the whole array; that's what I
meant by allocating it progressively in segments.  Something like what
you go on to propose.

> I've monitored the vacuum as it runs and the OS doesn't map the whole
> block unless it's touched, which it isn't until dead tuples are found.
> Surely, if overcommit is disabled (as it should), it could exhaust the
> virtual address space if set very high, but it wouldn't really use the
> memory unless it's needed, it would merely reserve it.

Yeah, but I've seen actual breakage from exactly this issue on
customer systems even with the 1GB limit, and when we start allowing
100GB it's going to get a whole lot worse.

> To fix that, rather than repalloc the whole thing, dead_tuples would
> have to be an ItemPointer** of sorted chunks. That'd be a
> significantly more complex patch, but at least it wouldn't incur the
> memcpy.

Right, this is what I had in mind.  I don't think this is actually
very complicated, because the way we use this array is really simple.
We basically just keep appending to the array until we run out of
space, and that's not very hard to implement with an array-of-arrays.
The chunks are, in some sense, sorted, as you say, but you don't need
to do qsort() or anything like that.  You're just replacing a single
flat array with a data structure that can be grown incrementally in
fixed-size chunks.

> I could attempt that, but I don't see the difference between
> vacuum and create index in this case. Both could allocate a huge chunk
> of the virtual address space if maintainance work mem says so, both
> proportional to the size of the table. I can't see how that could take
> any DBA by surprise.

Really?  CREATE INDEX isn't going to allocate more storage space than
the size of the data actually being sorted, because tuplesort.c is
smart about that kind of thing.  But VACUUM will very happily allocate
vastly more memory than the number of dead tuples.  It is thankfully
smart enough not to allocate more storage than the number of line
pointers that could theoretically exist in a relation of the given
size, but that only helps for very small relations.  In a large
relation that divergence between the amount of storage space that
could theoretically be needed and the amount that is actually needed
is likely to be extremely high.  1 TB relation = 2^27 blocks, each of
which can contain MaxHeapTuplesPerPage dead line pointers.  On my
system, MaxHeapTuplesPerPage is 291, so that's 291 * 2^27 possible
dead line pointers, which at 6 bytes each is 291 * 6 * 2^27 = ~218GB,
but the expected number of dead line pointers is much less than that.
Even if this is a vacuum triggered by autovacuum_vacuum_scale_factor
and you're using the default of 0.2 (probably too high for such a
large table), assuming there are about 60 tuples for page (which is
what I get with pgbench -i) the table would have about 2^27 * 60 = 7.7
billion tuples of which 1.5 billion would be dead, meaning we need
about 9-10GB of space to store all of those dead tuples.  Allocating
as much as 218GB when we need 9-10GB is going to sting, and I don't
see how you will get a comparable distortion with CREATE INDEX.  I
might be missing something, though.

There's no real issue when there's only one process running on the
system at a time.  If the user set maintenance_work_mem to an amount
of memory that he can't afford to pay even once, then that's simple
misconfiguration and it's not really our problem.  The issue is that
when there are 3 or potentially more VACUUM processes running plus a
CREATE INDEX or two at the same time.  If you set maintenance_work_mem
to a 

Re: [HACKERS] Suggestions for first contribution?

2016-09-06 Thread Christian Convey
Thanks everyone for the suggestions.

It sounds like the most useful thing I can do at the moment is perform
code reviews.  I assumed I'd need more experience with the PG code
base, but I keep on reading that newcomers' reviews are welcome.
Unless someone has a better idea, I'll start with that.

The projects that Pavel and David mentioned sound very interesting.
Hopefully I can tackle one of them after I've done a few code reviews.

Thanks again,
Christian


On Tue, Sep 6, 2016 at 4:20 AM, Aleksander Alekseev
 wrote:
> Hello, Christian
>
>> Can anyone suggest a project for my first PG contribution?
>
> In my experience good starting points are:
>
> * Doing code review. It's really important for PostgreSQL project.
>   Everyone likes to write code, no one likes to read or test it :)
> * Finding typos, there are a lot of them.
> * Finding bugs: static code analysis, sanitizers, etc. Also subscribe to
>   pgsql-bugs@ mailing list.
> * Fixing bottlenecks. Tricky part here is to find one. Could be
>   difficult unless you are working for a company with many clients or
>   maintain a project that uses PostgreSQL.
>
> Also please note that such activity as answering questions on IRC or in
> pgsql-general@, highlighting project news in your blog, writing articles
> (or maybe even a book) about PostgreSQL internals, organizing local user
> group, etc - all of this are very significant contributions.
>
> Good luck!
>
> --
> Best regards,
> Aleksander Alekseev


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Forbid use of LF and CR characters in database and role names

2016-09-06 Thread Tom Lane
Peter Eisentraut  writes:
> On 8/11/16 9:12 PM, Michael Paquier wrote:
>> Note that pg_dump[all] and pg_upgrade already have safeguards against
>> those things per the same routines putting quotes for execution as
>> commands into psql and shell. So attached is a patch to implement this
>> restriction in the backend, and I am adding that to the next CF for
>> 10.0. Attached is as well a script able to trigger those errors.

> After further review, I have my doubts about this approach.

> Everything that is using appendShellString() is now going to reject LF
> and CR characters, but there is no systematic way by which this is
> managed, enforced, or documented.  It happens that right now most of the
> affected cases are user and database names, but there are others.  For
> example, you cannot anymore install PostgreSQL into a path containing
> LF/CR, because initdb will fail when it composes the pg_ctl command line
> to print out.  Also, initdb will fail if the data directory name
> contains LF/CR, but it creates the directory nonetheless.  (Apparently,
> it doesn't even clean it up.)  But for example pg_ctl and pg_basebackup
> and postgres itself handle all of that just fine.  This is a slowly
> growing mess.

> I think the way forward here, if any, is to work on removing these
> restrictions, not to keep sprinkling them around.

I think probably a better answer is to reject bad paths earlier, eg have
initdb error out before doing anything if the proposed -D path contains
CR/LF.  Given the collection of security bugs we just fixed, and the
strong likelihood that user-written scripts contain other instances of
the same type of problem, I do not think we are doing anybody any favors
if we sweat bullets to support such paths.  And I'm not even convinced
that we *can* support such paths on Windows; no one was able to find a
safe shell quoting solution there.

And yeah, the docs probably need work.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Make initdb's suggested "pg_ctl start" command line more reliabl

2016-09-06 Thread Andres Freund
On 2016-09-06 13:08:51 -0400, Tom Lane wrote:
> Dash is considered a character that needs quoting.  It might be possible
> to avoid that if we could be certain that appendShellString's output would
> never be placed in a spot where it could be taken for a switch, but that
> seems like a large assumption to me.  Or maybe we could treat it as
> forcing quotes only if it starts the string; but I don't personally care
> enough to write the code for that.  Feel free if you want to.

But quotes are interpreted by the shell, not by the called program. So I
don't think putting --whatnot in quotes changes anything here?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Make initdb's suggested "pg_ctl start" command line more reliabl

2016-09-06 Thread Claudio Freire
On Tue, Sep 6, 2016 at 2:08 PM, Tom Lane  wrote:
>> The not-quoting-if-not-needed doesn't appear to do anything useful for me:
>> 'pg-install/bin/pg_ctl' -D 'pg-install/var/data' -l logfile start
>
> Dash is considered a character that needs quoting.  It might be possible
> to avoid that if we could be certain that appendShellString's output would
> never be placed in a spot where it could be taken for a switch, but that
> seems like a large assumption to me.  Or maybe we could treat it as
> forcing quotes only if it starts the string; but I don't personally care
> enough to write the code for that.  Feel free if you want to.


Wouldn't it be taken for a switch even with quoting?

Quoting "-D" seems to work fine, which would suggest the above is true.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SELECT FOR UPDATE regression in 9.5

2016-09-06 Thread Alvaro Herrera
Marko Tiikkaja wrote:
> On 2016-09-06 6:02 PM, Marti Raudsepp wrote:
> >This issue is also reproducible on the current master branch. In an
> >assertions-enabled build, it traps an assertion in HeapTupleHeaderGetCmax
> >called by heap_lock_tuple. The following test case demonstrates the issue...
> 
> I think you found a reproducible test case for my bug in
> 48d3eade-98d3-8b9a-477e-1a8dc32a7...@joh.to.  Thanks.

Ah, many thanks.  I'll have a look.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Make initdb's suggested "pg_ctl start" command line more reliabl

2016-09-06 Thread Tom Lane
Peter Eisentraut  writes:
> On 8/20/16 3:05 PM, Tom Lane wrote:
>> Make initdb's suggested "pg_ctl start" command line more reliable.

> A couple of problems with this:

> The not-quoting-if-not-needed doesn't appear to do anything useful for me:
> 'pg-install/bin/pg_ctl' -D 'pg-install/var/data' -l logfile start

Dash is considered a character that needs quoting.  It might be possible
to avoid that if we could be certain that appendShellString's output would
never be placed in a spot where it could be taken for a switch, but that
seems like a large assumption to me.  Or maybe we could treat it as
forcing quotes only if it starts the string; but I don't personally care
enough to write the code for that.  Feel free if you want to.

> The indentation of that line was changed from 4 to 10.  I don't think
> that was a good change.

Hmm, not intentional ... [ looks closely... ]  Looks like a tab got in
there somehow.  Will fix.

> As just mentioned elsewhere, this accidentally introduces a failure if
> the PostgreSQL installation path contains LF/CR, because of the use of
> appendShellString().

I think that's intentional, not accidental.  What actual use case is
there for allowing such paths?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SELECT FOR UPDATE regression in 9.5

2016-09-06 Thread Marko Tiikkaja

On 2016-09-06 6:02 PM, Marti Raudsepp wrote:

This issue is also reproducible on the current master branch. In an
assertions-enabled build, it traps an assertion in HeapTupleHeaderGetCmax
called by heap_lock_tuple. The following test case demonstrates the issue...


I think you found a reproducible test case for my bug in 
48d3eade-98d3-8b9a-477e-1a8dc32a7...@joh.to.  Thanks.



.m


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] (Comment)Bug in CteScanNext

2016-09-06 Thread Tom Lane
Jim Nasby  writes:
> On 9/3/16 1:30 PM, Tom Lane wrote:
>> Or we could add something like "But first, we must deal with the special
>> case of reversing direction after reaching EOF."

> I'm working on that, but one thing isn't clear to me... why do we only 
> skip past the last tuple if (!node->leader->eof_cte)? Even if we've hit 
> the end of the underlying node, the tuplestore could still return data, 
> and AFAICT we would still need to move past the last item in the 
> tuplestore, no?

If tuplestore_ateof is true, then what presumably happened in the previous
call is that we fetched a row (moving forward) from the underlying plan,
appended it to the tuplestore, and returned it to the caller.  If the
current call specifies moving backwards, we want to return the tuple
before that one, not the same one again.

Alternatively, if we'd already backed up some and gone forwards again (all
fetching from the tuplestore), then ateof means that the last call
returned the last row currently in the tuplestore.  Again, we don't want
to return that row twice.

On the other hand, if eof_cte is true, then what happened on the last
call is that we tried to fetch forwards, reached EOF on the underlying
query, and returned NULL.  In that case, a backwards fetch *should*
produce the last row in the tuplestore.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Claudio Freire
On Sun, Sep 4, 2016 at 8:10 PM, Jim Nasby  wrote:
> On 9/4/16 1:46 AM, Simon Riggs wrote:
>>>
>>> > The patch also makes vacuum free the dead_tuples before starting
>>> > truncation. It didn't seem necessary to hold onto it beyond that
>>> > point, and it might help give the OS more cache, especially if work
>>> > mem is configured very high to avoid multiple index scans.
>>
>> How long does that part ever take? Is there any substantial gain from
>> this?
>
>
> If you're asking about how long the dealloc takes, we're going to have to
> pay that cost anyway when the context gets destroyed/reset, no? Doing that
> sooner rather than later certainly seems like a good idea since we've seen
> that truncation can take quite some time. Might as well give the memory back
> to the OS ASAP.

AFAIK, except on debug builds where it has to memset the whole thing,
the cost is constant (unrelated to the allocated block size), so it
should be rather small in this context.


On Tue, Sep 6, 2016 at 1:42 PM, Robert Haas  wrote:
> On Sat, Sep 3, 2016 at 8:55 AM, Claudio Freire  wrote:
>> The attached patch allows setting maintainance_work_mem or
>> autovacuum_work_mem higher than 1GB (and be effective), by turning the
>> allocation of the dead_tuples into a huge allocation.
>>
>> This results in fewer index scans for heavily bloated tables, and
>> could be a lifesaver in many situations (in particular, the situation
>> I'm living right now in production, where we don't have enough room
>> for a vacuum full, and have just deleted 75% of a table to make room
>> but have to rely on regular lazy vacuum to free the space).
>>
>> The patch also makes vacuum free the dead_tuples before starting
>> truncation. It didn't seem necessary to hold onto it beyond that
>> point, and it might help give the OS more cache, especially if work
>> mem is configured very high to avoid multiple index scans.
>>
>> Tested with pgbench scale 4000 after deleting the whole
>> pgbench_accounts table, seemed to work fine.
>
> The problem with this is that we allocate the entire amount of
> maintenance_work_mem even when the number of actual dead tuples turns
> out to be very small.  That's not so bad if the amount of memory we're
> potentially wasting is limited to ~1 GB, but it seems pretty dangerous
> to remove the 1 GB limit, because somebody might have
> maintenance_work_mem set to tens or hundreds of gigabytes to speed
> index creation, and allocating that much space for a VACUUM that
> encounters 1 dead tuple does not seem like a good plan.
>
> What I think we need to do is make some provision to initially
> allocate only a small amount of memory and then grow the allocation
> later if needed.  For example, instead of having
> vacrelstats->dead_tuples be declared as ItemPointer, declare it as
> ItemPointer * and allocate the array progressively in segments.  I'd
> actually argue that the segment size should be substantially smaller
> than 1 GB, like say 64MB; there are still some people running systems
> which are small enough that allocating 1 GB when we may need only 6
> bytes can drive the system into OOM.

This would however incur the cost of having to copy the whole GB-sized
chunk every time it's expanded. It woudln't be cheap.

I've monitored the vacuum as it runs and the OS doesn't map the whole
block unless it's touched, which it isn't until dead tuples are found.
Surely, if overcommit is disabled (as it should), it could exhaust the
virtual address space if set very high, but it wouldn't really use the
memory unless it's needed, it would merely reserve it.

To fix that, rather than repalloc the whole thing, dead_tuples would
have to be an ItemPointer** of sorted chunks. That'd be a
significantly more complex patch, but at least it wouldn't incur the
memcpy. I could attempt that, but I don't see the difference between
vacuum and create index in this case. Both could allocate a huge chunk
of the virtual address space if maintainance work mem says so, both
proportional to the size of the table. I can't see how that could take
any DBA by surprise.

A sensible compromise could be dividing the maintainance_work_mem by
autovacuum_max_workers when used in autovacuum, as is done for cost
limits, to protect those that set both rather high.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST penalty functions [PoC]

2016-09-06 Thread Heikki Linnakangas

On 08/29/2016 09:04 AM, Andrew Borodin wrote:

In this message I address only first problem. I want to construct a
penalty function that will choose:
1.Entry with a zero volume and smallest margin, that can
accommodate insertion tuple without extension, if one exists;
2.Entry with smallest volume, that can accommodate insertion tuple
without extension, if one exists;
3.Entry with zero volume increase after insert and smallest margin
increase, if one exists;
4.Entry with minimal volume increase after insert.


Looking at the code, by "margin", you mean the sum of all "edges", i.e. 
of each dimension, of the cube. I guess the point of that is to 
differentiate between cubes that have the same volume, but a different 
shape.


So, let's try to come up with a scheme that doesn't require IEEE 754 
floats. Those above cases can be split into two categories, depending on 
whether the new value has zero volume or not. We can use a completely 
different scheme for the two cases, if we want, because when we're 
choosing a target page to insert to, penalty gets called for every 
original tuple, but with the same new tuple.


Here's a scheme I just came up with. There might be better ones, but 
it's something.


Let's have:

oX: Original tuple's edge sum
nX: New tuple's edge sum
dX: Edge increase

oV: Original tuple's volume
nX: New tuple's volume
dX: Volume increase

Case A: New entry has zero volume.
--

Two sub-cases:
A1: if dE > 0, use dE. dE must be in the range [0, nE]
A2: otherwise, use oE.

So how do we cram those two into a single float?

If we offset case A1 by n, we can use the range between [0, nE] for A2. 
Something like this pseudocode:


if (dE > 0)
return nE + dE;   /* A1, offset dE into range [nE, inf] */
else
return nE - (nE/oE);  /* A2, scale oE into range [0, nE] */

Case B: New entry has non-zero volume
--
B1: if dV > 0. use dV. dV must be in the range [0, nV].
B2: if dE > 0, use dE. dE must be in the range [0, nE].
B3: oV, otherwise.

By offsetting cases B1 and B2, we can again divide the range into three 
parts, with pseudocode like:


if (dV > 0)
return nV + nE + dV;   /* B1, offset dV into range [nV+nE, inf] */
else if (dE > 0)
return nV + dE;/* B2, offset dE into range [nV, nV+nE] */
else
return nV - (nV/oV)/* B3, scale oV into range [0, nV]


I’ve tested patch performance with attached test. On my machine patch
slows index construction time from 60 to 76 seconds, reduces size of
the index from 85Mb to 82Mb, reduces time of aggregates computation
from 36 seconds to 29 seconds.


Hmm. That's a pretty large increase in construction time. Have you done 
any profiling on where the time goes?



I do not know: should I continue this work in cube, or it’d be better
to fork cube?


Should definitely continue work within cube, IMHO. I don't think forking 
it to a new datatype would make this any easier.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Covering + unique indexes.

2016-09-06 Thread Anastasia Lubennikova

28.08.2016 09:13, Amit Kapila:

On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova
  wrote:
@@ -590,7 +622,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState
*state, IndexTuple itup)
   if (last_off == P_HIKEY)
   {
   Assert(state->btps_minkey == NULL);
- state->btps_minkey = CopyIndexTuple(itup);
+ /*
+ * Truncate the tuple that we're going to insert
+ * into the parent page as a downlink
+ */

+ if (indnkeyatts != indnatts && P_ISLEAF(pageop))
+ state->btps_minkey = index_truncate_tuple(wstate->index, itup);
+ else
+ state->btps_minkey = CopyIndexTuple(itup);

It seems that above code always ensure that for leaf pages, high key
is a truncated tuple.  What is less clear is if that is true, why you
need to re-ensure it again for the old  page in below code:


Thank you for the question. Investigation took a long time)

As far as I understand, the code above only applies to
the first tuple of each level. While the code you have quoted below
truncates high keys for all other pages.

There is a comment that clarifies situation:
/*
 * If the new item is the first for its page, stash a copy for 
later. Note

 * this will only happen for the first item on a level; on later pages,
 * the first item for a page is copied from the prior page in the code
 * above.
 */


So the patch is correct.
We can go further and remove this index_truncate_tuple() call, because
the first key of any inner (or root) page doesn't need any key at all.
It simply points out to the leftmost page of the level below.
But it's not a bug, because truncation of one tuple per level doesn't
add any considerable overhead. So I want to leave the patch in its 
current state.



@@ -510,6 +513,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState
*state, IndexTuple itup)
{
..
+ if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+ {
+ /*
+ * It's essential to truncate High key here.
+ * The purpose is not just to save more space on this particular page,
+ * but to keep whole b-tree structure consistent. Subsequent insertions
+ * assume that hikey is already truncated, and so they should not
+ * worry about it, when copying the high key into the parent page
+ * as a downlink.
+ * NOTE It is not crutial for reliability in present,
+ * but maybe it will be that in the future.
+ */
+ keytup = index_truncate_tuple(wstate->index, oitup);
+
+ /*  delete "wrong" high key, insert keytup as P_HIKEY. */
+ PageIndexTupleDelete(opage, P_HIKEY);
+
+ if (!_bt_pgaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY))
+ elog(ERROR, "failed to rewrite compressed item in index \"%s\"",
+ RelationGetRelationName(wstate->index));
+ }
+
..
..



--
Anastasia Lubennikova
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Optimization for lazy_scan_heap

2016-09-06 Thread Robert Haas
On Mon, Sep 5, 2016 at 8:57 PM, Masahiko Sawada  wrote:
>> What performance difference does this make, in a realistic use case?
>
> I have yet to measure performance effect but it would be effect for
> very large frozen table.

I think if you are proposing this patch because you think that the
existing code won't perform well, you definitely need to submit some
performance results showing that your approach is better.  If you
can't show that your approach is practically better (rather than just
theoretically better), then I'm not sure we should change anything.
It's very easy to screw up the code in this area and we do not want to
risk corrupting data for the sake of an optimization that isn't really
needed in the first place.

Of course, if you can prove that the change has a significant benefit,
then it's definitely worth considering.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Robert Haas
On Sat, Sep 3, 2016 at 8:55 AM, Claudio Freire  wrote:
> The attached patch allows setting maintainance_work_mem or
> autovacuum_work_mem higher than 1GB (and be effective), by turning the
> allocation of the dead_tuples into a huge allocation.
>
> This results in fewer index scans for heavily bloated tables, and
> could be a lifesaver in many situations (in particular, the situation
> I'm living right now in production, where we don't have enough room
> for a vacuum full, and have just deleted 75% of a table to make room
> but have to rely on regular lazy vacuum to free the space).
>
> The patch also makes vacuum free the dead_tuples before starting
> truncation. It didn't seem necessary to hold onto it beyond that
> point, and it might help give the OS more cache, especially if work
> mem is configured very high to avoid multiple index scans.
>
> Tested with pgbench scale 4000 after deleting the whole
> pgbench_accounts table, seemed to work fine.

The problem with this is that we allocate the entire amount of
maintenance_work_mem even when the number of actual dead tuples turns
out to be very small.  That's not so bad if the amount of memory we're
potentially wasting is limited to ~1 GB, but it seems pretty dangerous
to remove the 1 GB limit, because somebody might have
maintenance_work_mem set to tens or hundreds of gigabytes to speed
index creation, and allocating that much space for a VACUUM that
encounters 1 dead tuple does not seem like a good plan.

What I think we need to do is make some provision to initially
allocate only a small amount of memory and then grow the allocation
later if needed.  For example, instead of having
vacrelstats->dead_tuples be declared as ItemPointer, declare it as
ItemPointer * and allocate the array progressively in segments.  I'd
actually argue that the segment size should be substantially smaller
than 1 GB, like say 64MB; there are still some people running systems
which are small enough that allocating 1 GB when we may need only 6
bytes can drive the system into OOM.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >