[HACKERS] Tee for COPY

2015-12-13 Thread Konstantin Knizhnik

Hi,

I am trying to create a version of the COPY command which can scatter/replicate data
to different nodes based on some distribution method.
There is a master process, which has information about the data distribution and to
which all clients are connected.
This master process should receive the copied data from the client and scatter the
tuples to the nodes.
Maybe somebody can recommend the best way of implementing such a COPY agent?

The obvious plan is the following:

1. Register utility callback
2. Handle T_CopyStmt in this callback
3. Use BeginCopyFrom/NextCopyFrom to receive tuples from client
4. Calculate distribution function for the received tuple
5. Establish a connection with the corresponding node (if not yet established) and
start the same COPY command on that node (if not started yet).
6. Send data to this node using PQputCopyData.

The problem is with step 6: I do not see any way to copy the received data to the
destination node.
NextCopyFrom returns an array of values (Datums) for the tuple columns, but there are
no public methods to send a tuple to the copy stream.
All this logic is implemented in src/backend/commands/copy.c and is not
available outside this module.

It is more or less clear how to do it in text or CSV mode: I can use
NextCopyFromRawFields and then construct a line with a comma-separated list of
values.
But how do I handle binary mode? Also, I suspect that COPY in text mode is
significantly slower than in binary mode, isn't it?
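
For the text path, what I have in mind is roughly the sketch below. This is only a
rough sketch: connection setup, proper NULL handling, escaping and error recovery are
omitted, the helpers node_conns and choose_node are made up, and the signatures follow
the 9.5-era copy.c and libpq headers.

/*
 * Rough sketch of steps 3-6 in text mode: read raw fields with
 * NextCopyFromRawFields(), rebuild a text-format line, and forward it to
 * the chosen node with PQputCopyData().  "node_conns" and "choose_node"
 * are hypothetical; each connection is assumed to have already issued
 * COPY ... FROM STDIN.  Escaping and full NULL handling are omitted.
 */
static void
forward_copy_stream(CopyState cstate, PGconn **node_conns,
                    int (*choose_node) (char **fields, int nfields))
{
    char      **fields;
    int         nfields;
    StringInfoData line;

    initStringInfo(&line);

    while (NextCopyFromRawFields(cstate, &fields, &nfields))
    {
        int         node = choose_node(fields, nfields);
        int         i;

        resetStringInfo(&line);
        for (i = 0; i < nfields; i++)
        {
            if (i > 0)
                appendStringInfoChar(&line, '\t');      /* text-format delimiter */
            appendStringInfoString(&line, fields[i] ? fields[i] : "\\N");
        }
        appendStringInfoChar(&line, '\n');

        if (PQputCopyData(node_conns[node], line.data, line.len) != 1)
            elog(ERROR, "could not forward COPY data: %s",
                 PQerrorMessage(node_conns[node]));
    }
}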

The dirty solution is just to copy-and-paste the copy.c code, but maybe there is a
more elegant way?

Thanks in advance,
Konstantin








Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-13 Thread Simon Riggs
On 11 December 2015 at 16:14, Aleksander Alekseev  wrote:


> I see your point, but I would like to clarify a few things.
>
> 1. Do we consider the described measurement method good enough to conclude
> that sometimes PostgreSQL really spends 3 ms in a spinlock (like an RTT
> between two Internet hosts in the same city)? If not, what method
> should be used to prove or disprove this?
>

Timing things seems fine. You may have located an important issue.


> 2. If we agree that PostgreSQL does sometimes spend 3 ms in a spinlock
> do we consider this a problem?
>

Probably, but then Tom didn't question 1 or 2. What he questioned was your
fix.


> 3. If we consider this a problem, what method is considered appropriate
> to find a real reason of such behaviour so we could fix it?
>

The problem you identify is in only one place, yet your fix changes many
parts of the code. Why is that the best fix out of the many possible ones?
Why would such a change be acceptable?

I personally don't know the answers to those questions. It would be
wonderful if somebody else would structure our lives such that all we had
to do was find simple answers, but that isn't the way life is. You ask for
a chain of logical thought, but it is for you to create one, somehow:
patches are default-reject, not committer-explain-why-reject.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread Simon Riggs
On 13 December 2015 at 11:02, Andres Freund  wrote:

> On December 13, 2015 10:19:07 AM CET, Simon Riggs 
> wrote:
> >I didn't see the patch for this anywhere. Where is the code?
>
> Where is the code for all of pg logical?
>

Thanks for asking, perhaps our plans weren't public enough. pglogical has
already been announced as open source, under the postgres licence and that
it will be a submission to core PostgreSQL, just as BDR was. pglogical is
in development/test right now and will be released when it's ready, which
hopefully will be "soon", aiming for 9.6.

Thanks also for the opportunity to let me ask what your plans are regarding
contributing to core? I couldn't make it to SF recently because of a
funding meeting, but I heard something like your company will release
something as open source sometime in 2016. Could you clarify what that will
be, when it will be, what licence it is under and if it is a contribution
to core also? Is that something you're working on also?

I don't know the status of Konstantin's work, so same question for him also.

It will be useful to help work out what parts need work on once pglogical
is done.

For me, submission to core means both that it is postgres licenced and that
the copyright is novated to PGDG, allowing it to be included within PG core.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread Andres Freund
On 2015-12-13 11:39:32 +, Simon Riggs wrote:
> On 13 December 2015 at 11:02, Andres Freund  wrote:
> > On December 13, 2015 10:19:07 AM CET, Simon Riggs 
> > wrote:
> > >I didn't see the patch for this anywhere. Where is the code?
> >
> > Where is the code for all of pg logical?
> >
> 
> Thanks for asking, perhaps our plans weren't public enough. pglogical has
> already been announced as open source, under the postgres licence and that
> it will be a submission to core PostgreSQL, just as BDR was. pglogical is
> in development/test right now and will be released when it's ready, which
> hopefully will be "soon", aiming for 9.6.

Well, at the moment not making it public is obviously blocking other
people, and not doing open design discussions publicly seems to make
it rather unlikely that it'll get accepted close to as-is anyway. It's
constantly referred to in discussions, and it guided the design of the
submitted output plugin.


> Thanks also for the opportunity to let me ask what your plans are regarding
> contributing to core?

Uh, I am? Stuff like working on upsert, grouping sets et al, was all on
Citus' time. I've been busy with something else for the last 2-3 months.


> I couldn't make it to SF recently because of a funding meeting, but I
> heard something like your company will release something as open
> source sometime in 2016. Could you clarify what that will be, when it
> will be, what licence it is under and if it is a contribution to core
> also? Is that something you're working on also?

I don't see how that belongs to this thread, it's unrelated to
replication.

Either way, the license is yet to be determined, and it'll be Q1
2016. Yes, I've worked on open sourcing it.


Greetings,

Andres Freund




Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread Andres Freund
On December 13, 2015 10:19:07 AM CET, Simon Riggs  wrote:
>I didn't see the patch for this anywhere. Where is the code?


Where is the code for all of pg logical?

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.




Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread Simon Riggs
On 6 December 2015 at 17:39, Konstantin Knizhnik 
wrote:


> I have integrated pglogical_output in multimaster, using bdr_apply from
> BDR as template for implementation of receiver part.
>

I didn't see the patch for this anywhere. Where is the code?

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-13 Thread Craig Ringer
On 13 December 2015 at 06:31, Tom Lane  wrote:


> I'm not particularly wedded to this rule.  In principle we could go so
> far as to import psql's code that parses commands and figures out which
> semicolons are command terminators --- but that is a pretty large chunk
> of code, and I think it'd really be overkill considering that initdb
> deals only with fixed input scripts.


Shouldn't that be a bison/flex job anyway, rather than hand-coded? Or a
simple(ish) state machine?

Dollar-quoted strings are probably the only quite ugly bit due to their
arbitrary delimiters. So I thought I'd sketch out how it'd look as a state
machine. At which point I remembered that we allow $ in identifiers too. So
the state machine would have to bother with unquoted identifiers. Of course
$ marks parameters, so it has to keep track of whether it's reading a parameter.
At which point you have half an SQL parser.

This strikes me as a really good reason for making it re-usable, because
it's horrid to write code that handles statement splitting in the
PostgreSQL dialect.

Optional handling of psql \commands would be required, but that'd make it
easier for PgAdmin to support psql backslash commands, so there'd be a win
there too.

I figured I'd sketch it out for kicks. Comment: yuck.

States would be at least:

SQL_TEXT
SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD
NUMBER
QUOTED_IDENTIFIER
QUOTED_IDENTIFIER_QUOTE
SQL_TEXT_DOLLAR
DOLLAR_STRING_START_DELIM
DOLLAR_STRING
DOLLAR_STRING_DOLLAR
DOLLAR_STRING_END_DELIM
STANDARD_STRING
STANDARD_STRING_QUOTE
SQL_TEXT_E
ESCAPE_STRING
ESCAPE_STRING_ESCAPE

Transitions

SQL_TEXT => { SQL_TEXT, SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD, NUMBER, QUOTED_IDENTIFIER, SQL_TEXT_DOLLAR, STANDARD_STRING, SQL_TEXT_E, ESCAPE_STRING }

SQL_TEXT_E => { SQL_TEXT, ESCAPE_STRING, SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD }

SQL_TEXT_DOLLAR => { SQL_TEXT, NUMBER, DOLLAR_STRING_START_DELIM }

QUOTED_IDENTIFIER => { QUOTED_IDENTIFIER, QUOTED_IDENTIFIER_QUOTE }

QUOTED_IDENTIFIER_QUOTE => { SQL_TEXT, QUOTED_IDENTIFIER }

DOLLAR_STRING_START_DELIM => { DOLLAR_STRING_START_DELIM, DOLLAR_STRING }

DOLLAR_STRING => { DOLLAR_STRING, DOLLAR_STRING_DOLLAR }

DOLLAR_STRING_END_DELIM => { DOLLAR_STRING_END_DELIM, SQL_TEXT, DOLLAR_STRING }

STANDARD_STRING => { STANDARD_STRING, STANDARD_STRING_QUOTE }

STANDARD_STRING_QUOTE => { SQL_TEXT, STANDARD_STRING }

ESCAPE_STRING => { ESCAPE_STRING, ESCAPE_STRING_ESCAPE }

ESCAPE_STRING_ESCAPE => { SQL_TEXT, ESCAPE_STRING }


NUMBER consumes sequential digits and period chars and returns to SQL_TEXT
at any non-digit. (That way it can handle Pg's lazy parser quirks like
SELECT 123"f" being legal, and equivalent to SELECT 123 AS "f").

SQL_UNQUOTED_IDENTIFIER_OR_KEYWORD is needed because a $ within an
identifier is part of the identifier so it can't just be consumed as
SQL_TEXT .

For dollar strings, when a $ is found when reading SQL text (not an
identifier/keyword), enter SQL_TEXT_DOLLAR. What comes next must be a
parameter or the start of a dollar string. If the next char is a digit then
it's a parameter so switch to NUMBER, since dollar-quoted string delims
follow identifier rules and unquoted identifiers can't start with a number.
Otherwise switch to DOLLAR_STRING_START_DELIM and consume until a $ is
found or something illegal in an identifier is found. Or of course it could
be a lone $, which is bogus syntax but as far as we're concerned still just
SQL_TEXT. Really, this is just looking for a dollar-quote start and doesn't
care what it finds if it isn't a valid dollar-quote start.

If a valid dollar-quote delim is found switch to DOLLAR_STRING and read
until we find the matching delim using a similar process, entering
DOLLAR_STRING_DOLLAR, looking for param vs end delim, etc. When a full
delimiter is read compare to the start delimiter and switch back to
SQL_TEXT mode if it matches, otherwise remain in DOLLAR_STRING.

If an invalid dollar string delim was found switch back to SQL_TEXT (since
it wasn't a valid beginning of a dollar string) and continue.

For QUOTED_IDENTIFIER_QUOTE and STANDARD_STRING_QUOTE, it found a " or '
while reading a quoted identifier or standard string and transitioned into
that state. If the next char doubles the quote it'll return to reading the
string; otherwise it'll return to the SQL_TEXT state since the identifier
or literal has ended.

Similarly with ESCAPE_STRING_ESCAPE. Having found an escape, consume the
next char even if it's a quote and return to the string parsing.

All this ignores psql backslash commands.
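
To make the shape concrete, here is a minimal sketch of the splitting loop, covering
only standard strings, quoted identifiers and dollar quoting (no E'' strings,
comments, parameters or backslash commands; the dollar-quote tag rules are simplified
and all names are made up):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

typedef enum
{
    SQL_TEXT,
    STANDARD_STRING,
    QUOTED_IDENTIFIER,
    DOLLAR_STRING
} SplitState;

/* Print each semicolon-terminated statement found in "script". */
static void
split_statements(const char *script)
{
    SplitState  state = SQL_TEXT;
    const char *stmt_start = script;
    char        delim[64];          /* active dollar-quote tag, e.g. "$fn$" */
    size_t      delim_len = 0;
    const char *p;

    for (p = script; *p; p++)
    {
        switch (state)
        {
            case SQL_TEXT:
                if (*p == '\'')
                    state = STANDARD_STRING;
                else if (*p == '"')
                    state = QUOTED_IDENTIFIER;
                else if (*p == '$')
                {
                    /* try to read a $tag$ delimiter, else stay in SQL_TEXT */
                    const char *q = p + 1;

                    while (isalpha((unsigned char) *q) || *q == '_')
                        q++;
                    if (*q == '$' && (size_t) (q - p + 1) < sizeof(delim))
                    {
                        delim_len = q - p + 1;
                        memcpy(delim, p, delim_len);
                        delim[delim_len] = '\0';
                        state = DOLLAR_STRING;
                        p = q;          /* skip the opening delimiter */
                    }
                }
                else if (*p == ';')
                {
                    printf("statement: %.*s\n",
                           (int) (p - stmt_start + 1), stmt_start);
                    stmt_start = p + 1;
                }
                break;

            case STANDARD_STRING:
                if (*p == '\'')
                {
                    if (p[1] == '\'')
                        p++;            /* '' is a doubled quote, stay in string */
                    else
                        state = SQL_TEXT;
                }
                break;

            case QUOTED_IDENTIFIER:
                if (*p == '"')
                {
                    if (p[1] == '"')
                        p++;            /* "" is a doubled quote, stay in identifier */
                    else
                        state = SQL_TEXT;
                }
                break;

            case DOLLAR_STRING:
                if (*p == '$' && strncmp(p, delim, delim_len) == 0)
                {
                    p += delim_len - 1; /* consume the closing delimiter */
                    state = SQL_TEXT;
                }
                break;
        }
    }
}

int
main(void)
{
    split_statements("SELECT 'a;b'; "
                     "CREATE FUNCTION f() RETURNS text AS $x$ ; $x$ LANGUAGE sql;");
    return 0;
}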

Have I missed anything really obvious? Does it seem useful to have more
re-usable statement splitting code? Is there any sane justification for
doing anything but extracting what psql does verbatim while carefully
changing nothing?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-12-13 Thread and...@anarazel.de
On 2015-12-12 21:15:52 -0500, Robert Haas wrote:
> On Sat, Dec 12, 2015 at 1:17 PM, and...@anarazel.de  
> wrote:
> > Here's two patches doing that. The first is an adaption of your
> > constants patch, using an enum and also converting xlog.c's locks. The
> > second is the separation into distinct tranches.
> 
> Personally, I prefer the #define approach to the enum, but I can live
> with doing it this way.

I think not needing to adjust the 'last defined' var is worth it...

> Other than that, I think these patches look
> good, although if it's OK with you I would like to make a pass over
> the comments and the commit messages which seem to me that they could
> benefit from a bit of editing (but not much substantive change).

Sounds good to me. You'll then commit that?


> > One thing to call out is that an "oversized" s_lock can now make
> > BufferDesc exceed 64 bytes, right now that's just the case when it's
> > larger than 4 bytes.  I'm not sure if that's cause for real concern,
> > given that it's not very concurrent or ancient platforms where that's
> > the case.
> > http://archives.postgresql.org/message-id/20150915020625.GI9666%40alap3.anarazel.de
> > would alleviate that concern again, as it collapses flags, usage_count,
> > buf_hdr_lock and refcount into one 32 bit int...
> 
> I don't think that would be worth worrying about even if we didn't
> have a plan in mind that would make it go away again, and even less so
> given that we do have such a plan.

Ok cool. I'm not particularly concerned either, just didn't want to slip
that in without having it called out.


Andres




Re: [HACKERS] Disabling an index temporarily

2015-12-13 Thread Corey Huinker
On Sun, Dec 13, 2015 at 1:33 AM, Oleg Bartunov  wrote:

>
> On Sun, Dec 13, 2015 at 1:16 AM, Jaime Casanova <
> jaime.casan...@2ndquadrant.com> wrote:
>
>> indexrelid = 'indexname'::regclass;
>
>
> This works, but might bloat system catalog.
>
>
+1 for the functionality.
+1 for ALTER INDEX foo SET DISABLED

I mentioned the need for this functionality to PeterG at PgConfUS back in
March when he asked what I missed most about Oracle, where it came into
play when doing partition swaps and similar bulk Data Warehouse
operations. He didn't seem to think it would be too hard to implement.

But the real win would be the ability to disable all indexes on a table
without specifying names. Even Oracle has to do this with an anonymous
pl/sql block querying dba_indexes or all_indexes, a pity for such a common
pattern.

So, I'd propose the following syntax:

ALTER INDEX foo SET DISABLED
-- does the SET indisvalid = false shown earlier.

ALTER TABLE foo DISABLE [NONUNIQUE] INDEXES
-- same, but joining to pg_class and possibly filtering on indisunique

REINDEX [DISABLED [INDEXES ON]] TABLE table_name [PARALLEL [degree]]
or
REINDEX [INVALID [INDEXES ON]] TABLE table_name [PARALLEL [degree]]

In this last case, REINDEX would walk the catalog as it does now, but
potentially filtering the table indexes on indisvalid = false. I'd ask that
we make a parallel spec part of the command even if it is not initially
honored.

This would be another feather in Postgres's cap of letting the user write
clear code and hiding implementation specific complexity.


Re: [HACKERS] Disabling an index temporarily

2015-12-13 Thread Tatsuo Ishii
>> On Sun, Dec 13, 2015 at 1:16 AM, Jaime Casanova <
>> jaime.casan...@2ndquadrant.com> wrote:
>>
>>> indexrelid = 'indexname'::regclass;
>>
>>
>> This works, but might bloat system catalog.
>>
>>
> +1 for the functionality.
> +1 for ALTER INDEX foo SET DISABLED

-1 for the reason I mentioned upthread. Also I dislike this
because this does not work with standby servers.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] Disabling an index temporarily

2015-12-13 Thread Bill Moran
On Sun, 13 Dec 2015 22:15:31 -0500
Corey Huinker  wrote:

> ALTER TABLE foo DISABLE [NONUNIQUE] INDEXES
> -- same, but joining to pg_class and possibly filtering on indisunique

I would think that NONUNIQUE should be the default, and you should have
to specify something special to also disable unique indexes. Arguably,
unique indexes are actually an implementation detail of unique
constraints. Disabling a performance-based index doesn't cause data
corruption, whereas disabling an index created as part of unique
constraint can allow invalid data into the table.

Just my $.02 ...

-- 
Bill Moran




Re: [HACKERS] Using quicksort for every external sort run

2015-12-13 Thread Jeff Janes
On Sun, Dec 13, 2015 at 3:40 PM, Peter Geoghegan  wrote:
> On Sat, Dec 12, 2015 at 4:41 PM, Jeff Janes  wrote:
>
>>>> Also, if I am reading this correctly, when we refill a pool from a
>>>> logical tape we still transform each tuple as it is read from the disk
>>>> format to the memory format.  This inflates the size quite a bit, at
>>>> least for single-datum tuples.  If we instead just read the disk
>>>> format directly into the pool, and converted them into the in-memory
>>>> format when each tuple came due for the merge heap, would that destroy
>>>> the locality of reference you are seeking to gain?
>>>
>>> Are you talking about alignment?
>>
>> Maybe alignment, but also the size of the SortTuple struct itself,
>> which is not present on tape but is present in memory if I understand
>> correctly.
>>
>> When reading 128kb (32 blocks) worth of in-memory pool, it seems like
>> it only gets to read 16 to 18 blocks of tape to fill them up, in the
>> case of building an index on single column 32-byte random md5 digests.
>> I don't exactly know where all of that space goes, I'm taking an
>> experimentalist approach.
>
> I'm confused.
>
> readtup_datum(), just like every other READTUP() variant, has the new
> function tupproperalloc() as a drop-in replacement for the master
> branch palloc() + USEMEM() calls.

Right, I'm not comparing what your patch does to what the existing
code does.  I'm comparing it to what it could be doing.  Only call
READTUP when you need to go from the pool to the heap, not when you
need to go from tape to the pool.  If you store the data in the pool
the same way it is stored on tape, then we no longer need memtuples
at all.  There is already a "mergetupcur" per tape pointing to the
first tuple of the tape, and since the tuples are now stored contiguously
that is all that is needed; once you are done with one tuple the
pointer is left pointing at the next one.

The reason for memtuples is to handle random access.  Since we are no
longer doing random access, we no longer need it.

We could free memtuples, re-allocate just enough to form the binary
heap for the N-way merge, and use all the rest of that space (which
could be a significant fraction of work_mem) as part of the new pool.


Cheers,

Jeff




Re: [HACKERS] Disabling an index temporarily

2015-12-13 Thread Jeff Janes
On Sun, Dec 13, 2015 at 7:27 PM, Tom Lane  wrote:
> Corey Huinker  writes:
>> So, I'd propose the following syntax:
>> ALTER INDEX foo SET DISABLED
>> -- does the SET indisvalid = false shown earlier.
>
> This is exactly *not* what Tatsuo-san was after, though; he was asking
> for a session-local disable, which I would think would be by far the more
> common use-case.  It's hard for me to see much of a reason to disable an
> index globally while still paying all the cost to maintain it.

Not to hijack the thread even further in the wrong direction, but I
think what Corey really wants here is to stop maintaining the index at
retail while preserving the existing definition and existing index
data, and then to do a wholesale fix-up, like what is done in the 2nd
half of a create index concurrently, upon re-enabling it.

Cheers,

Jeff




Re: [HACKERS] pg_stat_replication log positions vs base backups

2015-12-13 Thread Magnus Hagander
On Fri, Nov 27, 2015 at 6:07 AM, Michael Paquier 
wrote:

> On Thu, Nov 26, 2015 at 10:53 PM, Magnus Hagander 
> wrote:
> > On Thu, Nov 26, 2015 at 1:03 PM, Michael Paquier <
> michael.paqu...@gmail.com>
> > wrote:
> >>
> >> On Thu, Nov 26, 2015 at 6:45 PM, Magnus Hagander wrote:
> >> > I'm only talking about the actual value in pg_stat_replication here, not
> >> > what we are using internally. These are two different things of course -
> >> > let's keep them separate for now. In pg_stat_replication, we explicitly
> >> > check for InvalidXLogRecPtr and then explicitly set the resulting value to
> >> > NULL in the SQL return.
> >>
> >> No objections from here. I guess you already have a patch?
> >
> > Well, no, because I haven't figured out which way is the logical one - make
> > them all return NULL or make them all return 0/0...
>
> It seems to me that NULL is the logical one. We want to define a value
> from the user perspective where things are in an undefined state.
> That's my logic flow, other opinions are of course welcome.
>

I've applied these two patches now.

The one that fixes the initialization is backpatched to 9.3, which is the
oldest branch that has it, and the one that changes the actual 0-vs-NULL
output goes to 9.5 only, as it's a behaviour change.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread Konstantin Knizhnik

On 12/13/2015 12:19 PM, Simon Riggs wrote:
> On 6 December 2015 at 17:39, Konstantin Knizhnik wrote:
>> I have integrated pglogical_output in multimaster, using bdr_apply from BDR
>> as template for implementation of receiver part.
>
> I didn't see the patch for this anywhere. Where is the code?


I am sorry, the code is currently in our internal gitlab repository.
We have published pg_dtm and pg_tsdtm as separate repositories at
github.com/postgrespro.
They include the source of the plugin itself and a patch to the PostgreSQL core.
But we find this very inconvenient, because we also have to extend the DTM API,
adding new features such as deadlock detection...
So we are going to publish at github.com/postgrespro our branch of PostgreSQL
where pg_dtm, pg_tsdtm and multimaster will be available as extensions in the
contrib directory.  It will be available on Monday.







Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2015-12-13 Thread Amit Kapila
On Fri, Dec 11, 2015 at 6:34 PM, Andres Freund  wrote:
>
> On 2015-12-11 15:56:46 +0300, Alexander Korotkov wrote:
> >
> > Yes, there is a cycle with retries in LWLockAcquire function. The case of
> > retry is when waiter is waked up, but someone other steal the lock before
> > him. Lock waiter is waked up by lock releaser only when lock becomes free.
> > But in the case of high concurrency for shared lock, it almost never
> > becomes free. So, exclusive locker would be never waked up. I'm pretty sure
> > this happens on big Intel machine while we do the benchmark. So, relying on
> > number of retries wouldn't work in this case.
> > I'll do the tests to verify if retries happens in our case.

That makes sense, and if retries never happen, then I think changing
LWLockRelease() such that it wakes the waiters if there are waiters on a lock
and it has not woken them for some threshold number of times, or something
like that, might work.

>
> I seriously doubt that making lwlocks fairer is the right way to go
> here. In my testing the "unfairness" is essential to performance - the
> number of context switches otherwise increases massively.
>

Agreed; if the change being discussed hurts in any kind of scenario, then
we had better not do it. OTOH the case described by Alexander seems
to be genuine, and I have seen a similar complaint from a customer in the past
for another database I worked with, where the reason for the problem was the same.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-13 Thread Tom Lane
Craig Ringer  writes:
> On 13 December 2015 at 06:31, Tom Lane  wrote:
>> I'm not particularly wedded to this rule.  In principle we could go so
>> far as to import psql's code that parses commands and figures out which
>> semicolons are command terminators --- but that is a pretty large chunk
>> of code, and I think it'd really be overkill considering that initdb
>> deals only with fixed input scripts.

> Shouldn't that be a bison/flex job anyway, rather than hand-coded?

It is, if you're speaking of how psql does it.

I thought about trying to get the backend's existing lexer to do it,
but that code will want to throw an error if it sees unterminated
input (such as an incomplete slash-star comment).  I'm not sure that
it'd be a good thing to try to make that lexer serve two masters.

I'm also getting less and less enthused about trying to share code with
psql.  In the first place, the backend has no interest in recognizing
psql backslash-commands, nor does it need to deal with some of the weird
constraints psql has like having to handle non-backend-safe encodings.
In the second, while it's reasonable for psql to deal with CREATE RULE
syntax by counting parentheses, there's a good argument that that is
not the behavior we want for noninteractive situations such as reading
information_schema.sql.  We won't, for example, have anything
corresponding to psql's changing input prompt to help debug problems.
In the third place, it's just difficult and ugly to write code that
will work in both backend and frontend contexts.  We've done it,
certainly, but not for any problem as involved as this would be.

regards, tom lane




Re: [HACKERS] Tee for COPY

2015-12-13 Thread David Fetter
On Sun, Dec 13, 2015 at 11:29:23AM +0300, Konstantin Knizhnik wrote:
> Hi,
> 
> I am trying to create version of COPY command which can scatter/replicate 
> data to different nodes based on some distribution method.
> There is some master process, having information about data distribution, to 
> which all clients are connected.
> This master process should receive copied data from client and scatters 
> tuples to nodes.
> May be somebody can recommend me the best way of implementing such COPY agent?
> 
> The obvious plan is the following:
> 
> 1. Register utility callback
> 2. Handle T_CopyStmt in this callback
> 3. Use BeginCopyFrom/NextCopyFrom to receive tuples from client
> 4. Calculate distribution function for the received tuple
> 5. Establish connection with correspondent node (if not yet established) and 
> start the same COPY command to this node (if not started yet).
> 6. Send data to this node using PQputCopyData.
> 
> The problem is with step 6: I do not see any way to copy received data to the 
> destination node.
> NextCopyFrom returns array of values (Dutums) of tuple columns. But there are 
> no public methods to send tuple to the copy stream.
> All this logic is implemented in src/backend/commands/copy.c and is not 
> available outside this module.
> 
> It is more or less clear how to do it using text or CSV mode: I can use 
> NextCopyFromRawFields and then construct a line with comma separated list of 
> values.
> But how to handle binary mode? Also, I suspect that copy in text mode is 
> significantly slower than in binary mode, isn't it?
> 
> The dirty solution is just to cut copy.c code. But may be there is some 
> more elegant way?

A slightly cleaner solution is to make public methods to send tuples
to the copy stream and have COPY call those.

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread David Fetter
On Sat, Dec 12, 2015 at 06:48:58PM +0800, Craig Ringer wrote:
> Being able to access pg_dump and pg_restore's dependency resolution logic,
> object dumping routines, etc from regular SQL and from the SPI would be
> wonderful.

As I understand it, pushing these into a library has been proposed but
not rejected.  That it hasn't happened yet is mostly about the lack of
tuits (the round ones) to rewrite the functionality as libraries and
refactor pg_dump/pg_restore to use only calls to same.  As usual, it's
less about writing the code and more about the enormous amount of
testing any such a refactor would entail.

> I believe the main complaints about doing that when it was discussed in the
> past were issues with downgrading. A pg_get_tabledef(...) in 9.6 might emit
> keywords etc that a 9.2 server wouldn't understand, and the way we
> currently solve this is to require that you run 9.2's pg_dump against the
> 9.6 server to get a 9.2-compatible dump. So if we had pg_get_blahdef
> functions we'd still need external versions, so why bother having them?

You have made a persuasive case that major version downgrading is not
a problem we need to solve on the first go.

> The alternative is to have all the get_blahdef functions accept a param for
> server version compatibility, which would work but burden future servers
> with knowledge about older versions' features and corresponding code cruft
> for some extended period of time.
> 
> So it's gone nowhere to date.

I believe that refactoring much of pg_dump's functionality for the
current version of the server into SQL-accessible functions and making
pg_dump use only those functions is achievable with available
resources.

Such a refactor need not be all-or-nothing.  For example, the
dependency resolution stuff is a first step that appears to be worth
doing by itself even if the effort then pauses, possibly for some
time.

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] strange CREATE INDEX tab completion cases

2015-12-13 Thread Michael Paquier
On Sat, Dec 12, 2015 at 11:17 AM, Peter Eisentraut  wrote:
> These two tab completion pieces look strange to me:
>
> /* If we have CREATE|UNIQUE INDEX  CONCURRENTLY, then add "ON" */
>  else if ((pg_strcasecmp(prev3_wd, "INDEX") == 0 ||
>pg_strcasecmp(prev2_wd, "INDEX") == 0) &&
>   pg_strcasecmp(prev_wd, "CONCURRENTLY") == 0)
>  COMPLETE_WITH_CONST("ON");
> /* If we have CREATE|UNIQUE INDEX , then add "ON" or "CONCURRENTLY" */
>  else if ((pg_strcasecmp(prev3_wd, "CREATE") == 0 ||
>pg_strcasecmp(prev3_wd, "UNIQUE") == 0) &&
>   pg_strcasecmp(prev2_wd, "INDEX") == 0)
> {
>  static const char *const list_CREATE_INDEX[] =
>  {"CONCURRENTLY", "ON", NULL};
>
>  COMPLETE_WITH_LIST(list_CREATE_INDEX);
>  }
>
> They appear to support a syntax along the lines of
>
> CREATE INDEX name CONCURRENTLY
>
> which is not the actual syntax.

Yep. That's visibly a bug introduced by this commit:
commit: 37ec19a15ce452ee94f32ebc3d6a9a45868e82fd
author: Itagaki Takahiro 
date: Wed, 17 Feb 2010 04:09:40 +
Support new syntax and improve handling of parentheses in psql tab-completion.

The current implementation is missing a correct completion in a couple
of cases, among them:
-- Should be only ON
=# create index asd
CONCURRENTLY  ON
-- should give a list of tables
=# create index CONCURRENTLY aaa on
-- Should give ON and list of existing indexes
=# create index concurrently

Please see the attached patch, which addresses those things (and others), with
extra fixes for a couple of comments.
Regards,
-- 
Michael


20151213_psql_completion_index.patch
Description: binary/octet-stream



[HACKERS] Proposal: custom compression methods

2015-12-13 Thread Alexander Korotkov
Hackers,

I'd like to propose a new feature: "Custom compression methods".

*Motivation*

Currently, when a datum doesn't fit the page, PostgreSQL tries to compress it
using the PGLZ algorithm. Compression of particular attributes can be turned
on/off by tuning the storage parameter of the column. Also, there is a heuristic
that a datum is not compressible when its first KB is not compressible. I can
see the following reasons for improving this situation.

 * The heuristic used for detection of compressible data could be suboptimal.
We already met this situation with jsonb.
 * For some data distributions there could be more effective compression
methods than PGLZ. For example:
 * For natural languages we could use predefined dictionaries which
would allow us to compress even relatively short strings (which are not
long enough for PGLZ to train its dictionary).
 * For jsonb/hstore we could implement compression methods which have a
dictionary of keys. This could be either a static predefined dictionary or
a dynamically appended dictionary with some storage.
 * For jsonb and other container types we could implement compression
methods which would allow extraction of particular fields without
decompression of the full value.

Therefore, it would be nice to make compression methods pluggable.

*Design*

Compression methods would be stored in a pg_compress system catalog table of
the following structure:

compname        name
comptype        oid
compcompfunc    regproc
compdecompfunc  regproc

Compression methods could be created by "CREATE COMPRESSION METHOD" command
and deleted by "DROP COMPRESSION METHOD" command.

CREATE COMPRESSION METHOD compname [FOR TYPE comptype_name]
WITH COMPRESS FUNCTION compcompfunc_name
 DECOMPRESS FUNCTION compdecompfunc_name;
DROP COMPRESSION METHOD compname;

The signatures of compcompfunc and compdecompfunc would be similar to
pglz_compress and pglz_decompress, except for the compression strategy. There is
only one compression strategy in use for pglz (PGLZ_strategy_default).
Thus, I'm not sure it would be useful to provide multiple strategies for
compression methods.

extern int32 compcompfunc(const char *source, int32 slen, char *dest);
extern int32 compdecompfunc(const char *source, int32 slen, char *dest,
int32 rawsize);
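
For illustration, a trivial (hypothetical) method implementing this interface could
look like the sketch below; it just copies its input, and a real method would wrap an
actual compressor such as LZ4 or a dictionary coder instead:

#include <stdint.h>
#include <string.h>

typedef int32_t int32;          /* stand-in for the backend's int32 */

/*
 * Hypothetical "identity" compression method following the proposed
 * signatures.  Returning -1 from the compress function could mean
 * "not compressible, store raw", by analogy with pglz_compress().
 */
int32
identity_compress(const char *source, int32 slen, char *dest)
{
    memcpy(dest, source, slen);
    return slen;                /* size of the "compressed" data */
}

int32
identity_decompress(const char *source, int32 slen, char *dest, int32 rawsize)
{
    if (slen != rawsize)
        return -1;              /* corrupt input for this trivial method */
    memcpy(dest, source, slen);
    return rawsize;             /* number of bytes restored */
}

Under the proposed syntax such a method would then be registered with CREATE
COMPRESSION METHOD ... WITH COMPRESS FUNCTION identity_compress DECOMPRESS FUNCTION
identity_decompress.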

A compression method could be type-agnostic (comptype = 0) or type-specific
(comptype != 0). The default compression method is PGLZ.

The compression method of a column would be stored in the pg_attribute table.
Dependencies between columns and compression methods would be tracked in
pg_depend, preventing the dropping of a compression method which is currently in use.
The compression method of an attribute could be altered by the ALTER TABLE command.

ALTER TABLE table_name ALTER COLUMN column_name SET COMPRESSION METHOD
compname;

Since mixing different compression methods in the same attribute would be
hard to manage (especially for dependency tracking), altering an attribute's
compression method would require a table rewrite.

*Implementation details*

Catalog changes, new commands, dependency tracking etc are mostly
mechanical stuff with no fundamental problems. The hardest part seems to be
providing seamless integration of custom compression methods into existing
code.

It doesn't seem hard to add an extra parameter with the compression method to
toast_compress_datum. However, PG_DETOAST_DATUM should call the custom
decompress function with only knowledge of the datum. That means that we should
somehow embed knowledge of the compression method in the datum. The solution
could be putting the compression method oid right after the varlena header. Putting
this on disk would cause storage overhead and break backward compatibility.
Thus, we can add this oid right after reading the datum from the page. This
could be the weakest point in the whole proposal, and I'll be very glad to hear
better ideas.

P.S. I'd like to thank Petr Korobeinikov, who
started work on this patch and sent me a draft of the proposal in Russian.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-13 Thread Mark Dilger

> On Dec 12, 2015, at 9:40 PM, Tom Lane  wrote:
> 
> Mark Dilger  writes:
>>> On Dec 12, 2015, at 3:42 PM, Tom Lane  wrote:
>>> ... In general, though, I'd rather not try to
>>> teach InteractiveBackend() such a large amount about SQL syntax.
> 
>> I use CREATE RULE within startup files in the fork that I maintain.  I have
>> lots of them, totaling perhaps 50k lines of rule code.  I don't think any of 
>> that
>> code would have a problem with the double-newline separation you propose,
>> which seems a more elegant solution to me.
> 
> Yeah?  Just for proof-of-concept, could you run your startup files with
> the postgres.c patch as proposed, and see whether you get any failures?

Given all the changes I've made to initdb.c in my fork, that patch
of yours doesn't apply.

mark 



Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread Simon Riggs
On 13 December 2015 at 11:53, Andres Freund  wrote:


> > Thanks for asking, perhaps our plans weren't public enough. pglogical has
> > already been announced as open source, under the postgres licence and that
> > it will be a submission to core PostgreSQL, just as BDR was. pglogical is
> > in development/test right now and will be released when it's ready, which
> > hopefully will be "soon", aiming for 9.6.
>
> Well, at the moment not making it public is obviously blocking other
> people, and


Which other people are being blocked? What contribution of theirs to
PostgreSQL core is being delayed? What is the nature of this delay?


> not doing open design discussions publically seems to make
> it rather unlikely that it'll get accepted close to as-is anyway. It's
> constantly referred to in discussions, and it guided the design of the
> submitted output plugin.


It's a shame you think that, but posting incoherent proposals just wastes
everybody's time.

The focus has been on making the internals more generic, unpicking many of
the parts of BDR that were too specialized. The output plugin has not been
guided by the needs of pglogical at all, its been designed to be way more
open than BDR was.

The UI is a fairly thin layer on top of that and can be recoded without too
much work, but I don't think it's a surprising design. It's been quite
difficult to cater for the many complex and sometimes conflicting
requirements and that has only recently come to together into a coherent
form by my hand. Whether the UI makes sense remains to be seen, but I think
building it is an essential part of the evaluation of whether it is
actually a good UI.

Submission to core implies that changes are possible and discussion is
welcome. I expect that to happen. If there were any truly contentious parts
they would have been discussed ahead of time. I see no reason to believe
that pglogical would not or could not be accepted into 9.6.


> > Thanks also for the opportunity to let me ask what your plans are regarding
> > contributing to core?
>
> Uh, I am? Stuff like working on upsert, grouping sets et al, was all on
> Citus' time. I've been busy with something else for the last 2-3 months.


Good, thanks; you misread that and I wasn't questioning it. pglogical
developers have a day job too.


> > I couldn't make it to SF recently because of a funding meeting, but I
> > heard something like your company will release something as open
> > source sometime in 2016. Could you clarify what that will be, when it
> > will be, what licence it is under and if it is a contribution to core
> > also? Is that something you're working on also?
>
> I don't see how that belongs to this thread, it's unrelated to
> replication.
>

I assumed your interest in pglogical meant there was some connection.


> Either way, the license is yet to be determined, and it'll be Q1
> 2016. Yes, I've worked on open sourcing it.
>

If it's under the Postgres licence and submitted to core, as is BDR, you may
find many people interested in working on it also.

Initial development of major features is IMHO best done by small groups of
dedicated developers. That has got nothing at all to do with what happens
to the code in the longer term.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-12-13 Thread Peter Geoghegan
On Mon, Nov 30, 2015 at 1:04 PM, Peter Geoghegan  wrote:
> Perhaps we can consider more selectively applying prefetching in the
> context of writing out tuples.

It may still be worth selectively applying these techniques to writing
out tuples, per my previous remarks (and the latest "Using quicksort
for every external sort run" revision, which has a specialized version
of this patch). I see real benefits there across a variety of
situations on different hardware, and I'm hesitant to give that up
without further analysis.

However, I think that this patch is over as an independent piece of
work -- there is too much of a mixed picture. I'm going to mark this
patch "returned with feedback".

-- 
Peter Geoghegan




Re: [HACKERS] PATCH: add pg_current_xlog_flush_location function

2015-12-13 Thread Tomas Vondra

Hi,

On 12/13/2015 06:13 AM, Amit Kapila wrote:
>
> ...
>
> Is there a reason why you can't use existing function
> GetFlushRecPtr() in xlog.c?

No, not really. I think I somehow missed that function when writing the
initial version of the patch. Will fix in v2 of the patch.


thanks

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Making tab-complete.c easier to maintain

2015-12-13 Thread Thomas Munro
On Sun, Dec 13, 2015 at 1:08 AM, Michael Paquier
 wrote:
> On Wed, Dec 9, 2015 at 8:17 PM, Thomas Munro wrote:
>> Thanks for looking at this Michael.  It's probably not much fun to
>> review!  Here is a new version with that bit removed.  More responses
>> inline below.
>
> Could this patch be rebased? There are now conflicts with 8b469bd7 and
> it does not apply cleanly.

New version attached.

I've also added (very) primitive negative pattern support and used it in
5 places.  Improvement?  Examples:

/* ALTER TABLE xxx RENAME yyy */
-   else if (TailMatches5("ALTER", "TABLE", MatchAny, "RENAME", MatchAny) &&
-!TailMatches1("CONSTRAINT|TO"))
+   else if (TailMatches5("ALTER", "TABLE", MatchAny, "RENAME",
"!CONSTRAINT|TO"))
COMPLETE_WITH_CONST("TO");

/* If we have CLUSTER , then add "USING" */
-   else if (TailMatches2("CLUSTER", MatchAny) &&
!TailMatches1("VERBOSE|ON"))
+   else if (TailMatches2("CLUSTER", "!VERBOSE|ON"))
COMPLETE_WITH_CONST("USING");
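
For reference, a sketch of the kind of matcher such a negative pattern implies is
below. It is simplified and hypothetical: the helper in the actual patch also has to
handle MatchAny and may differ in details, and psql would use pg_strncasecmp rather
than strncasecmp.

#include <stdbool.h>
#include <string.h>
#include <strings.h>            /* strncasecmp */

/*
 * Return true if "word" matches "pattern", where a pattern is a list of
 * '|'-separated alternatives, optionally prefixed with '!' to negate the
 * whole match (so "!CONSTRAINT|TO" means "anything except CONSTRAINT or TO").
 */
static bool
word_matches(const char *pattern, const char *word)
{
    bool        negate = false;
    size_t      wordlen = strlen(word);

    if (*pattern == '!')
    {
        negate = true;
        pattern++;
    }

    for (;;)
    {
        const char *sep = strchr(pattern, '|');
        size_t      len = sep ? (size_t) (sep - pattern) : strlen(pattern);

        if (len == wordlen && strncasecmp(pattern, word, len) == 0)
            return !negate;     /* matched an alternative */
        if (sep == NULL)
            return negate;      /* no alternative matched */
        pattern = sep + 1;
    }
}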

-- 
Thomas Munro
http://www.enterprisedb.com


tab-complete-macrology-v10.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] pg_stat_replication log positions vs base backups

2015-12-13 Thread Michael Paquier
On Mon, Dec 14, 2015 at 1:01 AM, Magnus Hagander  wrote:
> I've applied these two patches now.
>
> The one that fixes the initialization backpatched to 9.3 which is the oldest
> one that has it, and the one that changes the actual 0-vs-NULL output to 9.5
> only as it's a behaviour change.

Thanks!
-- 
Michael




Re: [HACKERS] Using quicksort for every external sort run

2015-12-13 Thread Peter Geoghegan
On Sat, Dec 12, 2015 at 4:41 PM, Jeff Janes  wrote:
> Those usages make sense to me, as they are locally self-contained and
> it is clear what they are in contradistinction to.   But your usage is
> spread throughout (even in function names, not just comments) and
> seems to contradict the current usage as yours are not separately
> palloced, as the "proper" ones described here are.  I think that
> "proper" only works when the same comment also defines the
> alternative, rather than as some file-global description.  Maybe
> "pooltuple" rather than "tupleproper"

I don't think of it that way. The "tuple proper" is the thing that the
client passes to their tuplesort -- the thing they are actually
interested in having sorted. Like an IndexTuple for CREATE INDEX
callers, for example. SortTuple is just an internal implementation
detail. (That appears all over the file tuplesort.c, just as my new
references to "tuple proper" do. But neither appear elsewhere.)

>>> Also, if I am reading this correctly, when we refill a pool from a
>>> logical tape we still transform each tuple as it is read from the disk
>>> format to the memory format.  This inflates the size quite a bit, at
>>> least for single-datum tuples.  If we instead just read the disk
>>> format directly into the pool, and converted them into the in-memory
>>> format when each tuple came due for the merge heap, would that destroy
>>> the locality of reference you are seeking to gain?
>>
>> Are you talking about alignment?
>
> Maybe alignment, but also the size of the SortTuple struct itself,
> which is not present on tape but is present in memory if I understand
> correctly.
>
> When reading 128kb (32 blocks) worth of in-memory pool, it seems like
> it only gets to read 16 to 18 blocks of tape to fill them up, in the
> case of building an index on single column 32-byte random md5 digests.
> I don't exactly know where all of that space goes, I'm taking an
> experimentalist approach.

I'm confused.

readtup_datum(), just like every other READTUP() variant, has the new
function tupproperalloc() as a drop-in replacement for the master
branch palloc() + USEMEM() calls.

It is true that tupproperalloc() (and a couple of other places
relating to preloading) know *a little* about the usage pattern --
tupproperalloc() accepts a "tape number" argument to know what
partition within the large pool/buffer to use for each logical
allocation. However, from the point of view of correctness,
tupproperalloc() should function as a drop-in replacement for palloc()
+ USEMEM() calls in the context of the various READTUP() routines.

I have done nothing special with any particular READTUP() routine,
including readtup_datum() (all READTUP() routines have received the
same treatment). Nothing else was changed in those routines, including
how tuples are stored on tape. The datum case does kind of store the
SortTuples on tape today in one very limited sense, which is that the
length is stored fairly naively (that's already available from the
IndexTuple in the case of writetup_index(), for example, but length
must be stored explicitly for the datum case).

My guess is your confusion comes from the fact that the memtuples
array (the array of SortTuples) is also factored into memory
accounting, but that grows at geometric intervals, whereas the
existing READTUP() retail palloc() calls (and their USEMEM() memory
accounting calls) occur in drips and drabs. It's probably the case
that how we split memory between the memtuples array and the retail
palloc()/"tuple proper" allocations is kind of arbitrary (why should
the needs be the same when SortTuples are merge step "slots"?), but I
don't think that's the biggest problem in this general area at all.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2015-12-13 Thread Jeff Janes
On Tue, Dec 8, 2015 at 6:39 PM, Greg Stark  wrote:
> On Wed, Dec 9, 2015 at 12:02 AM, Jeff Janes  wrote:
>>
>>
>> Then in the next (final) merge, it has to read in this huge
>> fragmented tape run emulation, generating a lot of random IO to read
>> it.
>
> This seems fairly plausible. Logtape.c is basically implementing a
> small filesystem and doesn't really make any attempt to avoid
> fragmentation. The reason it does this is so that we can reuse blocks
> and avoid needing to store 2x disk space for the temporary space. I
> wonder if we're no longer concerned about keeping the number of tapes
> down if it makes sense to give up on this goal too and just write out
> separate files for each tape letting the filesystem avoid
> fragmentation. I suspect it would also be better for filesystems like
> ZFS and SSDs where rewriting blocks can be expensive.

During my testing I actually ran into space problems, where the index
I was building and the temp files used to do the sort for it could not
coexist, and I was wondering if there wasn't a way to free up some of
those temp files as the index was growing. So I don't think we want to
throw caution to the wind here.

(Also, I think it does make *some* attempt to reduce fragmentation,
but it could probably do more.)

>
>
>> With the patched code, the average length of reads on files in
>> pgsql_tmp between lseeks or changing to a different file descriptor is
>> 8, while in the unpatched code it is 14.
>
> I don't think Peter did anything to the scheduling of the merges so I
> don't see how this would be different. It might just have hit a
> preexisting case by changing the number and size of tapes.

Correct.  (There was a small additional increase with the memory pool,
but it was small enough that I am not worried about it).  But, this
changing number and size of tapes was exactly what Robert was worried
about, so I don't want to just dismiss it without further
investigation.

>
> I also don't think the tapes really ought to be so unbalanced. I've
> noticed some odd things myself -- like what does a 1-way merge mean
> here?

I noticed some of those (although in my case they were always the
first merges which were one-way) and I just attributed it to the fact
that the algorithm doesn't know how many runs there will be up front,
and so can't optimally distribute them among the tapes.

But it does occur to me that we are taking the tape analogy rather too
far in that case.  We could say that we have only 223 tape *drives*,
but that each run is a separate tape which can be remounted amongst
the drives in any combination, as long as only 223 are active at one
time.

I started looking into this at one time, before I got sidetracked on
the fact that the memory usage pattern would often leave a few bytes
less than half of work_mem completely unused.  Once that memory usage
got fixed, I never returned to the original examination.  And it would
be a shame to sink more time into it now, when we are trying to avoid
these polyphase merges altogether.

So, is a sometimes-regression at 64MB really a blocker to substantial
improvement most of the time at 64MB, and even more so at more
realistic modern settings for large index building?


> Fwiw attached are two patches for perusal. One is a trivial patch to
> add the size of the tape to trace_sort output. I guess I'll just apply
> that without discussion.

+1 there.  Having this in place would make evaluating the other things
be easier.

Cheers,

Jeff


[HACKERS] Function and view to retrieve WAL receiver status

2015-12-13 Thread Michael Paquier
Hi all,

Currently there is no equivalent of pg_stat_get_wal_senders for the
WAL receiver on a node, and it seems that it would be useful to have
an SQL representation of what is in shared memory when a WAL
receiver is active, without going through the ps display for example.
So, any opinion about having in core a function called
pg_stat_get_wal_receiver that returns a single tuple translating
the data in WalRcvData?
We could bundle on top of it a system view, say called
pg_stat_wal_receiver, with this layer:
View "public.pg_stat_wal_receiver"
Column|Type|Modifiers
pid|integer|
status|text|
receive_start_lsn|pg_lsn|
receive_start_tli|integer|
received_up_to_lsn|pg_lsn|
received_tli|integer|
latest_chunk_start_lsn|pg_lsn|
last_msg_send_time|timestamp with time zone|
last_msg_receipt_time|timestamp with time zone|
latest_end_lsn|pg_lsn|
latest_end_time|timestamp with time zone|
slot_name|text|

If the node has no WAL receiver active, a tuple with NULL values is
returned instead.
Thoughts?
-- 
Michael




Re: [HACKERS] Parallel Aggregate

2015-12-13 Thread Haribabu Kommi
On Fri, Dec 11, 2015 at 5:42 PM, Haribabu Kommi
 wrote:
> 3. Performance test to observe the effect of parallel aggregate.

Here I have attached the performance test report for parallel aggregate.
A summary of the results:
1. Parallel aggregate gives no improvement, or has very little
overhead, compared to parallel scan in the case of low
selectivity.

2. Parallel aggregate performs well, more than 60% better compared
to parallel scan, because of much lower data transfer overhead, as the
hash aggregate operation reduces the number of tuples that
need to be transferred from the workers to the backend.

The parallel aggregate plan is depends on below parallel seq scan.
In case if parallel seq scan plan is not generated because of more
tuple transfer overhead cost in case of higher selectivity, then
parallel aggregate is also not possible. But with parallel aggregate
the number of records that are required to be transferred from
worker to backend may reduce compared to parallel seq scan. So
the overall cost of parallel aggregate may be better.

To handle this problem, how about the following approach?

Add one more member to RelOptInfo, called cheapest_parallel_path, used
to store the parallel path that is possible at that level. Wherever a
parallel plan is possible, this value is set to that plan. If a
parallel plan is not possible at a parent node, it is set to NULL;
otherwise the parallel plan at that node is calculated again based on
the parallel plan of the node below it.

Once all the paths are finalized, the grouping planner prepares a plan
for both normal aggregate and parallel aggregate, compares the two
costs, and picks the cheaper plan.

I haven't yet evaluated the feasibility of the above solution. Suggestions?

Regards,
Hari Babu
Fujitsu Australia


performance_test_result.xlsx
Description: MS-Excel 2007 spreadsheet

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Disabling an index temporarily

2015-12-13 Thread Tom Lane
Corey Huinker  writes:
> So, I'd propose we following syntax:
> ALTER INDEX foo SET DISABLED
> -- does the SET indisvalid = false shown earlier.

This is exactly *not* what Tatsuo-san was after, though; he was asking
for a session-local disable, which I would think would be by far the more
common use-case.  It's hard for me to see much of a reason to disable an
index globally while still paying all the cost to maintain it.  Seems to
me the typical work flow would be more like "disable index in a test
session, try all your queries and see how well they work, if you conclude
you don't need the index then drop it".  Or perhaps you could imagine that
you want the index selected for use only in certain specific sessions ...
but the above doesn't cater for that use-case either.

Certainly, there's opportunities to improve the flexibility of the
index-disable specifications in the plug-in Oleg and Teodor did.  But
I think that that is the right basic approach: some sort of SET command,
not anything that alters the catalogs.  We already have lots of
infrastructure that could handle desires like having specific values
active in only some sessions.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Disabling an index temporarily

2015-12-13 Thread Tom Lane
Bill Moran  writes:
> I would think that NONUNIQUE should be the default, and you should have
> to specify something special to also disable unique indexes. Arguably,
> unique indexes are actually an implementation detail of unique
> constraints. Disabling a performance-based index doesn't cause data
> corruption, whereas disabling an index created as part of unique
> constraint can allow invalid data into the table.

Maybe I misunderstood, but I thought what was being discussed here is
preventing the planner from selecting an index for use in queries, while
still requiring all table updates to maintain validity of the index.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Disabling an index temporarily

2015-12-13 Thread Michael Paquier
On Mon, Dec 14, 2015 at 12:27 PM, Tom Lane  wrote:
> Certainly, there's opportunities to improve the flexibility of the
> index-disable specifications in the plug-in Oleg and Teodor did.  But
> I think that that is the right basic approach: some sort of SET command,
> not anything that alters the catalogs.  We already have lots of
> infrastructure that could handle desires like having specific values
> active in only some sessions.

ISTM that an intuitive answer is something like enable_indexscan_list
= 'index1, index2', without worrying about any disable switch; that's
more in line with the equivalent planner-level GUC.
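For instance, usage of such a hypothetical GUC (nothing with this name
exists in core today) could look like:

-- hypothetical setting, named as suggested above
SET enable_indexscan_list = 'index1, index2';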
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Disabling an index temporarily

2015-12-13 Thread Tom Lane
Jeff Janes  writes:
> Not to hijack the thread even further in the wrong direction, but I
> think what Corey really wants here is to stop maintaining the index at
> retail while preserving the existing definition and existing index
> data, and then to do a wholesale fix-up, like what is done in the 2nd
> half of a create index concurrently, upon re-enabling it.

Meh.  Why not just drop the index?  I mean, yeah, you might save a few
keystrokes when and if you ever re-enable it, but this sure seems like
a feature in search of a use-case.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making tab-complete.c easier to maintain

2015-12-13 Thread Michael Paquier
On Mon, Dec 14, 2015 at 8:10 AM, Thomas Munro
 wrote:
> I've also add (very) primitive negative pattern support and used it in
> 5 places.  Improvement?  Examples:
>
> /* ALTER TABLE xxx RENAME yyy */
> -   else if (TailMatches5("ALTER", "TABLE", MatchAny, "RENAME", MatchAny) 
> &&
> -!TailMatches1("CONSTRAINT|TO"))
> +   else if (TailMatches5("ALTER", "TABLE", MatchAny, "RENAME",
> "!CONSTRAINT|TO"))
> COMPLETE_WITH_CONST("TO");
>
> /* If we have CLUSTER , then add "USING" */
> -   else if (TailMatches2("CLUSTER", MatchAny) &&
> !TailMatches1("VERBOSE|ON"))
> +   else if (TailMatches2("CLUSTER", "!VERBOSE|ON"))
> COMPLETE_WITH_CONST("USING");

+   /* Handle negated patterns. */
+   if (*pattern == '!')
+   return !word_matches(pattern + 1, word);

Yeah, I guess that's an improvement for those cases, and there is no
immediate need for a per-keyword NOT operator to allow things of the
type (foo1 OR NOT foo2). Still, with this patch !foo1|foo2 actually
means (NOT foo1 AND NOT foo2), which does not seem very intuitive.
Read literally, !foo1|foo2 looks as if it means (NOT foo1 OR foo2)
because of the lack of parentheses. Thoughts?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2015-12-13 Thread Peter Geoghegan
On Sun, Dec 13, 2015 at 7:31 PM, Jeff Janes  wrote:
> The reason for memtuples is to handle random access.  Since we are no
> longer doing random access, we no longer need it.
>
> We could free memtuples, re-allocate just enough to form the binary
> heap for the N-way merge, and use all the rest of that space (which
> could be a significant fraction of work_mem) as part of the new pool.

Oh, you're talking about having the final on-the-fly merge use a
tuplestore-style array of pointers to "tuple proper" memory (this was
how tuplesort.c worked in all cases about 15 years ago, actually).

I thought about that. It's not obvious how we'd do without
SortTuple.tupindex during the merge phase, since it sometimes
represents an offset into memtuples (the SortTuple array). See "free
list" management within mergeonerun().

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: custom compression methods

2015-12-13 Thread Craig Ringer
On 14 December 2015 at 15:27, Chapman Flack  wrote:

> On 12/14/15 01:50, Craig Ringer wrote:
>
> > pg_upgrade means you can't just redefine the current toast bits so the
> > compressed bit means "data is compressed, check first byte of varlena
> data
> > for algorithm" because existing data won't have that, the first byte will
> > be the start of the compressed data stream.
>
> Is there any small sequence of initial bytes you wouldn't ever see in PGLZ
> output?  Either something invalid, or something obviously nonoptimal
> like run(n,'A')||run(n,'A') where PGLZ would have just output run(2n,'A')?
>
>
I don't think we need to worry, since doing it per-column makes this issue
go away. Per-Datum compression would make it easier to switch methods
(requiring no table rewrite) at the cost of more storage for each varlena,
which probably isn't worth it anyway.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Function and view to retrieve WAL receiver status

2015-12-13 Thread Gurjeet Singh
On Dec 13, 2015 9:56 PM, "Michael Paquier" 
wrote:

>
> If the node has no WAL receiver active, a tuple with NULL values is
> returned instead.

IMO, in the absence of a WAL receiver the SRF (and the view) should not
return any rows.


Re: [HACKERS] Function and view to retrieve WAL receiver status

2015-12-13 Thread Michael Paquier
On Mon, Dec 14, 2015 at 3:09 PM, Gurjeet Singh  wrote:
> On Dec 13, 2015 9:56 PM, "Michael Paquier" 
> wrote:
>> If the node has no WAL receiver active, a tuple with NULL values is
>> returned instead.
>
> IMO, in the absence of a WAL receiver the SRF (and the view) should not
> return any rows.

The whole point is to not use a SRF in this case: there is always at
most one WAL receiver.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] W-TinyLfu for cache eviction

2015-12-13 Thread Vladimir Sitnikov
> a global lock would be good enough for a proof of concept that only
evaluates cache hit ratios.

I think an emulator can be used to check hit ratios. That way we can
see how different algorithms affect the hit ratio.

Is there a set of traces of "buffer load events"? (I did some Google
searches like "postgresql buffer cache trace" with no luck)
Is there an option that enables tracing of each requested buffer Id?

Frankly speaking, I've no access to PG instances with lots of data
(i.e. >10GiB).

> Maybe.  Want to code it up?

That would be interesting; however, I'm not fluent in C, and I've never
written multithreaded C code either. I do understand what a cache line
is, though.
Anyway, before hacking a prototype it makes sense to gather some traces.

Vladimir


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [sqlsmith] Failed to generate plan on lateral subqueries

2015-12-13 Thread David Fetter
On Mon, Dec 14, 2015 at 03:06:18PM +0900, Michael Paquier wrote:
> On Sun, Dec 13, 2015 at 10:14 AM, Andreas Seltenreich
>  wrote:
> >> Did you publish the source already? I haven't been following all
> >> along, sorry if these are all answered questions.
> >
> > It's not had a proper release yet, but the code is available via
> > github in all its rapid-prototypesque glory:
> >
> > https://github.com/anse1/sqlsmith
> 
> I am in awe regarding this stuff, which has caught many bugs
> already, it is a bit sad that it is released under the GPL license
> preventing a potential integration into PostgreSQL core to
> strengthen the test infrastructure,

I suspect that a polite request to Andreas that he change to a
PostgreSQL-compatible license like one of (TPL, BSD2, MIT) might
handle this part.

> and it is even sadder to see a direct dependency with
> libpqxx :(

I suspect this part is a SMOP, but I'm not quite sure what S might
constitute in this case.

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Disabling an index temporarily

2015-12-13 Thread Victor Yegorov
2015-12-14 5:34 GMT+02:00 Tom Lane :

> Maybe I misunderstood, but I thought what was being discussed here is
> preventing the planner from selecting an index for use in queries, while
> still requiring all table updates to maintain validity of the index.
>

The O-ther big DBMS has an `ALTER INDEX ... INVISIBLE` feature that does
exactly this.


I was thinking of a function similar to `set_config()`, since it has an
`is_local` parameter, making it possible to adjust either just the
current session or the global behaviour.

`set_index(name, is_visible, is_local)` perhaps?
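Usage of such a hypothetical function, mirroring the set_config()
signature, might then look like (index name made up for illustration):

-- hypothetical; hide the index for the current session only
SELECT set_index('my_index', false, true);
-- hypothetical; hide it for all sessions
SELECT set_index('my_index', false, false);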


-- 
Victor Y. Yegorov


Re: [HACKERS] [sqlsmith] Failed to generate plan on lateral subqueries

2015-12-13 Thread Michael Paquier
On Sun, Dec 13, 2015 at 10:14 AM, Andreas Seltenreich
 wrote:
>> Did you publish the source already? I haven't been following all
>> along, sorry if these are all answered questions.
>
> It's not had a proper release yet, but the code is available via github
> in all its rapid-prototypesque glory:
>
> https://github.com/anse1/sqlsmith

I am in awe of this stuff, which has caught many bugs already. It is a
bit sad that it is released under the GPL license, preventing a
potential integration into PostgreSQL core to strengthen the test
infrastructure, and it is even sadder to see a direct dependency on
libpqxx :(
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: custom compression methods

2015-12-13 Thread Craig Ringer
On 14 December 2015 at 01:28, Alexander Korotkov 
wrote:

> Hackers,
>
> I'd like to propose a new feature: "Custom compression methods".
>

Are you aware of the past work in this area? There's quite a bit of history
and I strongly advise you to read the relevant threads to make sure you
don't run into the same problems.

See:

http://www.postgresql.org/message-id/flat/20130615102028.gk19...@alap2.anarazel.de#20130615102028.gk19...@alap2.anarazel.de

for at least one of the prior attempts.


> *Motivation*
>
> Currently when datum doesn't fit the page PostgreSQL tries to compress it
> using PGLZ algorithm. Compression of particular attributes could be turned
> on/off by tuning storage parameter of column. Also, there is heuristics
> that datum is not compressible when its first KB is not compressible. I can
> see following reasons for improving this situation.
>

Yeah, recent discussion has made it clear that there's room for improving
how and when TOAST compresses things. Per-attribute compression thresholds
made a lot of sense.

> Therefore, it would be nice to make compression methods pluggable.
>

Very important issues to consider here are on-disk format stability, space
overhead, and pg_upgrade-ability. It looks like you have addressed all of
these issues below by making compression methods per-column, not per-Datum,
and forcing a full table rewrite to change them.

The issue with per-Datum is that TOAST claims two bits of the varlena header,
which already limits us to 1 GiB varlena values, something people are
starting to find to be a problem. There's no wiggle room to steal more
bits. If you want pluggable compression you need a way to store, along with
the datum, knowledge of how it is compressed, or else a fast, efficient
way to check.

pg_upgrade means you can't just redefine the current toast bits so the
compressed bit means "data is compressed, check first byte of varlena data
for algorithm" because existing data won't have that, the first byte will
be the start of the compressed data stream.

There's also the issue of what you do when the algorithm used for a datum
is no longer loaded. I don't care so much about that one, I'm happy to say
"you ERROR and tell the user to fix the situation". But I think some people
were concerned about that too, or being stuck with algorithms forever once
they're added.

Looks like you've dealt with all those concerns.


> DROP COMPRESSION METHOD compname;
>
>
When you drop a compression method, what happens to data compressed with
that method?

If you re-create it, can the data be associated with the re-created method?


> Compression method of column would be stored in pg_attribute table.
>

So you can't change it without a full table rewrite, but in exchange you
don't have to poach any TOAST header bits to determine which algorithm is
used. And you can use pg_depend to prevent dropping a compression method
still in use by a table. Makes sense.
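To illustrate the idea (the DDL spellings here are my guesses for
illustration only, not the syntax actually being proposed):

-- hypothetical spellings, for illustration only
CREATE COMPRESSION METHOD lz4 HANDLER lz4_compression_handler;

-- recorded per column in pg_attribute; switching methods forces a rewrite
ALTER TABLE documents ALTER COLUMN body SET COMPRESSION METHOD lz4;

-- a pg_depend entry would make this fail while documents.body still uses lz4
DROP COMPRESSION METHOD lz4;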

Looks promising, but I haven't re-read the old thread in detail to see if
this approach was already considered and rejected.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] pgbench stats per script & other stuff

2015-12-13 Thread Michael Paquier
On Sat, Oct 3, 2015 at 3:11 PM, Fabien COELHO  wrote:
>
>> Here is a v10, which is a rebase because of the "--progress-timestamp"
>> option addition.
>
>
> Here is a v11, which is a rebase after some recent changes committed to
> pgbench.

+The provided scriptname needs only to be a prefix
s/repleacable/replaceable, in short I think that documentation
compilation would fail.

-Do not update pgbench_tellers and
-pgbench_branches.
-This will avoid update contention on these tables, but
-it makes the test case even less like TPC-B.
+Shorthand for -b simple-update@1.
I don't think it is a good idea to remove entirely the description of
what the default scenarios can do. The description would be better at
the bottom in some  with a list of each default test and what to
expect from them.

+/* data structure to hold various statistics.
+ * it is used for interval statistics as well as file statistics.
  */
Nitpick: this is not a comment formatted the Postgres-way.

This is surprisingly broken:
$ pgbench -i
some of the specified options cannot be used in initialization (-i) mode

Any file name or path including "@" will fail strangely:
$ pgbench -f "t...@1.sql"
could not open file "test": No such file or directory
empty commands for test
Perhaps instead of failing we should warn the user and enforce the
weight to be set at 1?

$ pgbench -b foo
no builtin found for "foo"
This is not really helpful for the user, I think that the list of
potential options should be listed as an error hint.

-  "  -N, --skip-some-updates  skip updates of
pgbench_tellers and pgbench_branches\n"
+  "  -N, --skip-some-updates  same as \"-b simple-update@1\"\n"
   "  -P, --progress=NUM   show thread progress
report every NUM seconds\n"
   "  -r, --report-latencies   report average latency
per command\n"
"  -R, --rate=NUM   target rate in
transactions per second\n"
   "  -s, --scale=NUM  report this scale
factor in output\n"
-  "  -S, --select-onlyperform SELECT-only
transactions\n"
+  "  -S, --select-onlysame as \"-b select-only@1\"\n"
It is good to mention that there is an equivalent, but I think that
the description should be kept.

+   /* although a mutex would make sense, the likelyhood of an issue
+* is small and these are only stats which may be slightly false
+*/
+   doSimpleStats(& commands[st->state]->stats,
+ INSTR_TIME_GET_DOUBLE(now) -
+ INSTR_TIME_GET_DOUBLE(st->stmt_begin));
Why would the likelyhood of an issue be small here?

+   /* print NaN if no transactions where executed */
+   double latency = ss->sum / ss->count;
This does not look like a good idea, ss->count can be 0.

It seems also that it would be a good idea to split the patch into two parts:
1) Refactor the code so that the existing test scripts are put under
the same umbrella with addScript, adding the new option -b at the same
time.
2) Add the weight facility and its related statistics.

The patch having some issues, I am marking it as returned with
feedback. It would be nice to see a new version for next CF.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning

2015-12-13 Thread Amit Langote
On 2015/11/24 2:23, Robert Haas wrote:
> To me, it seems like there is a pretty obvious approach here: each
> table can be either a plain table, or a partition root (which can look
> just like an empty inheritance parent).  Then multi-level partitioning
> falls right out of that design without needing to do anything extra.
> If you want a single level of partitioning, you partition the original
> table.  If you want two levels of partitioning, you partition the
> partitions.  If you want three levels of partitioning, you partition
> those.  It's being made out here that limiting ourselves to a single
> of partitioning makes things simpler, but it's not very clear to me
> that this is actually true.
>
> I think it is also worth getting the syntax right from the beginning.
> Even if we decide that patch #1 won't support multi-level
> partitioning, we should have a plan for the syntax that can be
> extended to multi-level partitioning.  If we decide after releasing
> partitioning with one syntax that we really wish we'd used some other
> syntax, that is going to be a really big problem - deprecating the use
> of => or introducing standard_conforming_strings were projects that
> took many years to complete.  We really only get one shot to get that
> right.  That doesn't mean it's all got to be there in version one, but
> there had better be a way to extend it to all the things we want to do
> later or we are going to be deeply, deeply sad.

The previously proposed design would support multi-level partitioning with
some adjustments. One of the reasons not to include it in the recent patch
was a lack of clarity about the proper syntax and catalog organization.
That wasn't really nice. So, here is a revised proposal for the syntax and
some general notes. I will adjust my previous patch to follow whatever
feedback I receive here.


I think we can provide an explicit SUBPARTITION keyword-based syntax for
multi-level partitioning. It's not unreasonable to expect that all
partitions at a given level are partitioned on the same key. IOW, why
require declaring the partition key for each partition separately?
Instead, using SUBPARTITION BY on the master table seems better.

Syntax to create a partitioned table (up to 2 levels of partitioning):

CREATE TABLE foo (
...
)
PARTITION BY R/L ON (key0)
SUBPARTITION BY R/L ON (key1)
[(PARTITION foo_1 FOR VALUES  [] []
[(SUBPARTITION foo_1_1 FOR VALUES  [] [],
...)], ...)];

The above creates two pg_partitioned_rel entries for foo with partlevel 0
and 1, for key0 and key1, respectively. For foo_1 and foo_1_1, this
creates pg_partition entries, with foo and foo_1 as partparent,
respectively.
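To make this concrete, here is a purely hypothetical example of the
two-level syntax above; the FOR VALUES clauses (LESS THAN / IN) and all
table and partition names are my own assumptions for illustration, not
part of this proposal:

-- hypothetical illustration only; the FOR VALUES spelling is assumed
CREATE TABLE measurement (
    logdate  date,
    city_id  int,
    peaktemp int
)
PARTITION BY RANGE ON (logdate)
SUBPARTITION BY LIST ON (city_id)
(PARTITION m2015 FOR VALUES LESS THAN ('2016-01-01')
   (SUBPARTITION m2015_c1 FOR VALUES IN (1),
    SUBPARTITION m2015_c2 FOR VALUES IN (2)),
 PARTITION m2016 FOR VALUES LESS THAN ('2017-01-01')
   (SUBPARTITION m2016_c1 FOR VALUES IN (1),
    SUBPARTITION m2016_c2 FOR VALUES IN (2)));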

Why just 2 levels? It seems commonplace and keeps the syntax intuitive.
I guess it might be possible to generalize the syntax to more levels of
partitioning. Ideas? If we want to support the notion of a sub-partition
template in the future, that would require some thought, and more
importantly a proper catalog organization for it.

To add a partition to table after-the-fact:

ALTER TABLE foo
CREATE PARTITION foo1 FOR VALUES  [] []
[(SUBPARTITION foo11 FOR VALUES  [] [], ...)];

To add a sub-partition to an existing partition:

ALTER TABLE foo
MODIFY PARTITION foo_1
CREATE SUBPARTITION foo_1_ FOR VALUES (val) [] [];

I considered ADD/ALTER instead of CREATE/MODIFY, but there exist grammar
conflicts with ADD/ALTER column.

What about ALTER TABLE? Instead of allowing ALTER TABLE to be applied
directly to partitions on a case-by-case basis (they are tables under the
hood after all), we should restrict AT to the master table. Most of the AT
changes implicitly propagate from the master table to its partitions. Some
of them could be applied directly to partitions and/or sub-partitions, such
as rename and storage manipulations like changing the tablespace or storage
parameters (reloptions), etc.:

ALTER TABLE foo
RENAME PARTITION  TO ;

ALTER TABLE foo
RENAME SUBPARTITION  TO ;

ALTER TABLE foo
SET TABLESPACE ... [DEFAULT] FOR PARTITION ;

ALTER TABLE foo
SET TABLESPACE ... FOR SUBPARTITION ;

ALTER TABLE foo
SET (storage_parameter = value) [DEFAULT] FOR PARTITION ;

ALTER TABLE foo
SET (storage_parameter = value) FOR SUBPARTITION ;

Note that the keyword DEFAULT in some cases above means: do not apply the
new setting to existing sub-partitions of the partition, but rather use
the new setting for future sub-partitions. In the case of SET TABLESPACE,
if the FOR PARTITION clause is not specified, all partitions (actually
"leaf" partitions) of the master table are moved to the new tablespace;
is that necessary, or should we just disallow that and instead output an
error asking to use [DEFAULT] FOR PARTITION/FOR SUBPARTITION to move only
specific partitions/sub-partitions?
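For example (hypothetical commands following the grammar sketched above,
with a made-up tablespace name), the distinction would be:

-- hypothetical; only future sub-partitions of foo_1 use fast_ssd
ALTER TABLE foo SET TABLESPACE fast_ssd DEFAULT FOR PARTITION foo_1;

-- hypothetical; existing sub-partitions of foo_1 are moved to fast_ssd
ALTER TABLE foo SET TABLESPACE fast_ssd FOR PARTITION foo_1;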

By the way, should we also allow changing the logging of
partitions/sub-partitions as follows?

ALTER TABLE foo
  MODIFY PARTITION  SET {LOGGED | UNLOGGED};

ALTER TABLE foo
  MODIFY SUBPARTITION  SET {LOGGED | UNLOGGED};

What about index constraints, ie, PRIMARY 

Re: [HACKERS] Proposal: custom compression methods

2015-12-13 Thread Chapman Flack
On 12/14/15 01:50, Craig Ringer wrote:

> pg_upgrade means you can't just redefine the current toast bits so the
> compressed bit means "data is compressed, check first byte of varlena data
> for algorithm" because existing data won't have that, the first byte will
> be the start of the compressed data stream.

Is there any small sequence of initial bytes you wouldn't ever see in PGLZ
output?  Either something invalid, or something obviously nonoptimal
like run(n,'A')||run(n,'A') where PGLZ would have just output run(2n,'A')?

-Chap


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Fdw cleanup

2015-12-13 Thread Feng Tian
Hi, Hackers,

I need some help to understand foreign table error handling.

For a query on a foreign table, ExecInitForeignScan is called, which in
turn calls the BeginForeignScan hook.   Inside this hook, I allocate some
resources:

node->fdw_state = allocate_resource(...);

If everything goes well, ExecEndForeignScan will call my
EndForeignScan hook, and inside that hook I free the resources:

free_resource(node->fdw_state);

However, if an error happens during execution, it seems to me that
EndForeignScan is not called (traced using gdb).   So my question is:
are Begin/End the right places to allocate/free resources?   If not,
what is the right way to do this?

Thank you very much,
Feng