Re: [HACKERS] multivariate statistics (v19)

2016-08-10 Thread Michael Paquier
On Thu, Aug 11, 2016 at 3:34 AM, Tomas Vondra
 wrote:
> On 08/10/2016 02:23 PM, Michael Paquier wrote:
>>
>> On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
>>  wrote:
>>> The idea is that the syntax should work even for statistics built on
>>> multiple tables, e.g. to provide better statistics for joins. That's why
>>> the
>>> schema may be specified (as each table might be in different schema), and
>>> so
>>> on.
>>
>>
>> So you mean that the same statistics could be shared between tables?
>> But as this is visibly not a concept introduced yet in this set of
>> patches, why not just cut it off for now to simplify the whole? If
>> there is no schema-related field in pg_mv_statistics we could still
>> add it later if it proves to be useful.
>>
>
> Yes, I think creating statistics on multiple tables is one of the possible
> future directions. One of the previous patch versions introduced ALTER TABLE
> ... ADD STATISTICS syntax, but that ran into issues in gram.y, and given the
> multi-table possibilities the CREATE STATISTICS seems like a much better
> idea anyway.
>
> But I guess you're right we may make this a bit more strict now, and relax
> it in the future if needed. For example as we only support single-table
> statistics at this point, we may remove the schema and always create the
> statistics in the schema of the table.

This would simplify the code a bit, so I'd suggest removing
that from the first shot. If there is demand for it later, keeping the
infrastructure open for this extension is what we had better do.

> But I don't think we should make the statistics names unique only within a
> table (instead of within the schema).

They could be made unique using (name, table_oid, column_list).
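
For illustration, the single-table form under discussion would then look
roughly like this (the names are made up and the exact syntax is still
being designed in this thread):

    CREATE STATISTICS s1 ON (a, b) FROM t1;

With per-table uniqueness, (s1, t1, column list) would be enough to
identify the object, without any schema qualification.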

 There is a lot of mumbo-jumbo regarding the way dependencies are
 stored, with mainly serialize_mv_dependencies and
 deserialize_mv_dependencies converting them to and from bytea/dep trees.
 That's not cool and not portable because pg_mv_statistic represents
 that as pure bytea. I would suggest creating a generic data type that
 does those operations, named like pg_dependency_tree, and then use that
 in those new catalogs. pg_node_tree is a precedent of such a thing.
 New features could as well make use of this new data type if we are
 able to design that in a generic enough way, so that would be a base
 patch that the current 0002 applies on top of.
>>>
>>>
>>>
>>> Interesting idea, haven't thought about that. So are you suggesting to
>>> add a
>>> data type for each statistics type (dependencies, MCV, histogram, ...)?
>>
>>
>> Yes, something like that; it would perhaps be better to have one
>> single data type and be able to switch between each model easily,
>> instead of putting byteas in the catalog.
>
> Hmmm, not sure about that. For example, what about combinations of
> statistics - e.g. when we have an MCV list on the most common values and
> a histogram on the rest? Should we store both as a single value, or
> would that be in two separate values, or what?

The same statistics could combine two different things; whether to use
different columns may depend on how readable things get...
Btw, for the format we could take inspiration from pg_node_tree, with a
pg_stat_tree along these lines:
{HISTOGRAM :arg {BUCKET :index 0 :minvals ... }}
{DEPENDENCY :arg {:elt "a => c" ...} ... }
{MCV :arg {:index 0 :values {0,0} ... } ... }
Please consider that as a tentative idea to make things more friendly.
Others may have a different opinion on the matter.
-- 
Michael




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-10 Thread Amit Kapila
On Mon, Aug 8, 2016 at 9:56 PM, Bruce Momjian  wrote:
> On Mon, Aug  8, 2016 at 06:34:46PM +0530, Amit Kapila wrote:
>> I think here expensive part would be recheck for the cases where the
>> index value is changed to a different value (value which doesn't exist
>> in WARM chain).   You anyway have to add the new entry (key,TID) in
>> index, but each time traversing the WARM chain would be an additional
>> effort.  For cases where there are just two index entries and one of
>> them is being updated, it might regress as compared to what we do now.
>
> Yes, I think the all-increment or all-decrement WARM chain is going to
> be the right approach.
>

Probably it will mitigate the cost in some cases, but there will still
be a cost in comparisons, especially when the index is on char/varchar
columns.  Also, I think it will reduce the chance of performing a WARM
update in certain cases, considering we can create a WARM tuple only
when it follows the order.  OTOH, if we store pointers in the index to
intermediate tuples, we won't face such issues.  Yes, there will be
some delay in cleanup of intermediate line pointers; however, I think
we can clear those once we mark the corresponding index tuples as dead
in the next scan on the corresponding index.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-10 Thread Venkata Balaji N
On Tue, Aug 9, 2016 at 1:02 AM, Robert Haas  wrote:

> On Mon, Aug 8, 2016 at 10:10 AM, Rahila Syed 
> wrote:
> > Thank you for inputs everyone.
> >
> > The opinions on this thread can be classified into following
> > 1. Commit
> > 2. Rollback
> > 3. Error
> > 4. Warning
> >
> > As per opinion upthread, issuing an implicit commit immediately after
> > switching autocommit to ON can be unsafe if it was not desired.  While I
> > agree that it's difficult to judge the user's intention here, if we were
> > to base it on some assumption, the closest would be an implicit COMMIT in
> > my opinion.  There is a higher likelihood of a user being happy with
> > issuing a commit when setting autocommit ON than with a transaction being
> > rolled back.  Also, there are quite some interfaces which provide this.
> >
> > As mentioned upthread, issuing a warning on switching back to autocommit
> > will not be effective inside a script.  It won't allow subsequent commands
> > to be committed, as "set autocommit to ON" is not committed.  Scripts will
> > have to be rerun with changes, which will impact user friendliness.
> >
> > While I agree that issuing an ERROR and rolling back the transaction ranks
> > higher in safe behaviour, it is not as common (according to instances
> > stated upthread) as immediately committing any open transaction when
> > switching back to autocommit.
>
> I think I like the option of having psql issue an error.  On the
> server side, the transaction would still be open, but the user would
> receive a psql error message and the autocommit setting would not be
> changed.  So the user could type COMMIT or ROLLBACK manually and then
> retry changing the value of the setting.
>

This makes more sense, as the user doing it would realise that the
transaction has been left open.


> Alternatively, I also think it would be sensible to issue an immediate
> COMMIT when the autocommit setting is changed from off to on.  That
> was my first reaction.
>

Issuing a commit would mean that open transactions get committed, which
is not a good idea in my opinion. If the user issues AUTOCOMMIT = ON, it
means all the transactions initiated after that point must be committed
automatically, whereas this would also commit previously pending
transactions.


> Aborting the server-side transaction - with or without notice -
> doesn't seem very reasonable.
>

Agreed. Traditionally, open transactions in the database must be left
open until the user issues a COMMIT or ROLLBACK. If the session is
changed or killed, the transaction must be rolled back.
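
For reference, the situation being debated looks like this in psql (the
table name is made up):

    \set AUTOCOMMIT off
    INSERT INTO t VALUES (1);
    -- psql has now opened an implicit transaction; the question is what
    -- the next line should do with it: COMMIT, ERROR, or WARNING?
    \set AUTOCOMMIT on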

Regards,
Venkata B N

Fujitsu Australia


Re: [HACKERS] Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

2016-08-10 Thread Kisung Kim
Thank you for your information.
Here is the result:

After insertions:

ycsb=# select * from pgstatindex('usertable_pkey');
 version            | 2
 tree_level         | 3
 index_size         | 5488721920
 root_block_no      | 44337
 internal_pages     | 4464
 leaf_pages         | 665545
 empty_pages        | 0
 deleted_pages      | 0
 avg_leaf_density   | 52
 leaf_fragmentation | 11
(1 row)

After rebuild:


ycsb=# select * from pgstatindex('usertable_pkey');
 version            | 2
 tree_level         | 3
 index_size         | 3154296832
 root_block_no      | 41827
 internal_pages     | 1899
 leaf_pages         | 383146
 empty_pages        | 0
 deleted_pages      | 0
 avg_leaf_density   | 90.08
 leaf_fragmentation | 0


It seems that rebuilding reduces the number of internal and leaf pages
and makes the leaf pages denser.
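
(For reference, one common way to do such a rebuild, using the index name
from the session above:

    REINDEX INDEX usertable_pkey;

Dropping and re-creating the primary key would have the same effect.)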



On Wed, Aug 10, 2016 at 6:47 PM, Lukas Fittl  wrote:

> On Wed, Aug 10, 2016 at 4:24 PM, Kisung Kim  wrote:
>>
>> When I used the index bloating estimation script in
>> https://github.com/ioguix/pgsql-bloat-estimation,
>> the result is as follows:
>>
>
> Regardless of the issue at hand, it might make sense to verify these
> statistics using pgstattuple - those bloat estimation queries can be wildly
> off at times.
>
> See also https://www.postgresql.org/docs/9.5/static/pgstattuple.html as
> well as https://www.keithf4.com/checking-for-postgresql-bloat/
>
> Best,
> Lukas
>
> --
> Lukas Fittl
>
> Skype: lfittl
> Phone: +1 415 321 0630
>



-- 




Bitnine Global Inc., Kisung Kim, Ph.D
https://sites.google.com/site/kisungresearch/
E-mail : ks...@bitnine.net
Office phone : 070-4800-5890, 408-606-8602
US Mobile phone : 408-805-2192


Re: [HACKERS] phrase search TS_phrase_execute code readability patch

2016-08-10 Thread Robert Haas
On Tue, Aug 9, 2016 at 3:35 PM, David G. Johnston
 wrote:
> I don't follow why LposStart is needed so I removed it...

That doesn't seem very reasonable.

> Not compiled or in any way tested...

Please do not bother submitting patches that you aren't prepared to
compile and test.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-10 Thread Robert Haas
On Wed, Aug 10, 2016 at 6:14 PM, Tom Lane  wrote:
> Kevin Grittner  writes:
>> That one seems like it should either be at the AM level or not
>> included at all.  Where it would be interesting to know is if you
>> are a hacker looking for an AM to enhance with support, or (when
>> there is more than just btree supported, so it is not so easy to
>> remember) if you are a DBA investigating a high rate of
>> serialization failures and want to know whether indexes of a
>> certain type have index-relation predicate locking granularity or
>> something more fine-grained.  The latter use case seems plausible
>> once there is more of a mix of support among the AMs.
>
> TBH, that line of thought impresses me not at all, because I do not see
> a reason for SQL queries to need to see internal behaviors of AMs, and
> especially not at levels as crude as boolean properties of entire AMs,
> because that's just about guaranteed to become a lie (or at least not
> enough of the truth) in a year or two.  If you are a DBA wanting to know
> how fine-grained the locking is in a particular index type, you really
> need to read the source code or ask a hacker.
>
> We have been bit by the "representation not good enough to describe actual
> behavior" problem *repeatedly* over the years that pg_am had all this
> detail.  First it was amstrategies and amsupport, which have never
> usefully described the set of valid proc/op strategy numbers for any index
> type more complicated than btree.  Then there was amorderstrategy, which
> we got rid of in favor of amcanorder, and later added amcanbackward to
> that (not to mention amcanorderbyop).  And amconcurrent, which went away
> for reasons I don't recall.  Then we added amstorage, which later had to
> be supplemented with amkeytype, and still isn't a very accurate guide to
> what's actually in an index.  amcanreturn actually was a boolean rather
> than a function for awhile (though it looks like we never shipped a
> release with that definition).  There's still a lot of stuff with
> obviously limited life expectancy, like amoptionalkey, which is at best a
> really crude guide to what are valid index qualifications; someday that
> will likely have to go away in favor of a "check proposed index qual for
> supportability" AM callback.
>
> So I don't think I'm being unreasonable in wanting to minimize, not
> maximize, the amount of info exposed through this interface.  There is
> enough history to make me pretty sure that a lot of things that might be
> simple boolean properties today are going to be less simple tomorrow, and
> then we'll be stuck having to invent arbitrary definitions for what the
> property-test function is going to return for those.  And if there are
> any queries out there that are depending on simplistic interpretations
> of those property flags, they'll be broken in some respect no matter
> what we do.
>
> In short, I do not see a good reason to expose ampredlocks at the SQL
> level, and I think there needs to be a darn good reason to expose any of
> this stuff, not just "maybe some DBA will think he needs to query this".

I don't think you're being unreasonable, but I don't agree with your
approach.  I think that we should expose everything we reasonably can,
and if we have to change it later then it will be a backward
compatibility break.  Making it unqueryable in the hopes that people
won't try to query it is futile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] new pgindent run before branch?

2016-08-10 Thread Robert Haas
On Wed, Aug 10, 2016 at 10:23 PM, Tom Lane  wrote:
>> Great.  Are you handling creating the new branch?
>
> Yeah, I guess.  I've made most of them lately.

I'll do it if you want.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] new pgindent run before branch?

2016-08-10 Thread Bruce Momjian
On Wed, Aug 10, 2016 at 10:35:44PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > It sounds like you are saying that the branch is to happen before the
> > pgindent run.  Am I missing something?
> 
> I read it as pgindent then branch.

OK, good.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] new pgindent run before branch?

2016-08-10 Thread Tom Lane
Bruce Momjian  writes:
> It sounds like you are saying that the branch is to happen before the
> pgindent run.  Am I missing something?

I read it as pgindent then branch.

regards, tom lane




Re: [HACKERS] new pgindent run before branch?

2016-08-10 Thread Bruce Momjian
On Wed, Aug 10, 2016 at 10:31:35PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > I am confused --- are you pgindenting just HEAD and not 9.6?  Why?
> 
> The point is to pgindent while they're still the same.

So I read this:

> >> +1, I was planning to do that myself.
> 
> > Great.  Are you handling creating the new branch?
> 
> Yeah, I guess.  I've made most of them lately.
> 
> > If so, it probably
> > makes sense for you to do this just beforehand.
> 
> OK.

It sounds like you are saying that the branch is to happen before the
pgindent run.  Am I missing something?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] new pgindent run before branch?

2016-08-10 Thread Tom Lane
Bruce Momjian  writes:
> I am confused --- are you pgindenting just HEAD and not 9.6?  Why?

The point is to pgindent while they're still the same.

regards, tom lane




Re: [HACKERS] new pgindent run before branch?

2016-08-10 Thread Bruce Momjian
On Wed, Aug 10, 2016 at 10:23:14PM -0400, Tom Lane wrote:
> Robert Haas  writes:
> > On Wed, Aug 10, 2016 at 5:04 PM, Tom Lane  wrote:
> >> Robert Haas  writes:
> >>> Would anyone mind too much if I refreshed typedefs.list and
> >>> re-indented the whole tree before we branch?
> 
> >> +1, I was planning to do that myself.
> 
> > Great.  Are you handling creating the new branch?
> 
> Yeah, I guess.  I've made most of them lately.
> 
> > If so, it probably
> > makes sense for you to do this just beforehand.
> 
> OK.

I am confused --- are you pgindenting just HEAD and not 9.6?  Why?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] new pgindent run before branch?

2016-08-10 Thread Tom Lane
Robert Haas  writes:
> On Wed, Aug 10, 2016 at 5:04 PM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> Would anyone mind too much if I refreshed typedefs.list and
>>> re-indented the whole tree before we branch?

>> +1, I was planning to do that myself.

> Great.  Are you handling creating the new branch?

Yeah, I guess.  I've made most of them lately.

> If so, it probably
> makes sense for you to do this just beforehand.

OK.

regards, tom lane




Re: [HACKERS] new pgindent run before branch?

2016-08-10 Thread Robert Haas
On Wed, Aug 10, 2016 at 5:04 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> Would anyone mind too much if I refreshed typedefs.list and
>> re-indented the whole tree before we branch?
>
> +1, I was planning to do that myself.

Great.  Are you handling creating the new branch?  If so, it probably
makes sense for you to do this just beforehand.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

2016-08-10 Thread Lukas Fittl
On Wed, Aug 10, 2016 at 4:24 PM, Kisung Kim  wrote:
>
> When I used the index bloating estimation script in
> https://github.com/ioguix/pgsql-bloat-estimation,
> the result is as follows:
>

Regardless of the issue at hand, it might make sense to verify these
statistics using pgstattuple - those bloat estimation queries can be wildly
off at times.

See also https://www.postgresql.org/docs/9.5/static/pgstattuple.html as
well as https://www.keithf4.com/checking-for-postgresql-bloat/
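
A minimal way to run that check, using the index and table names from this
thread (pgstattuple ships as a contrib extension):

    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    SELECT * FROM pgstatindex('usertable_pkey');  -- exact leaf density and fragmentation
    SELECT * FROM pgstattuple('usertable');       -- exact dead-tuple and free-space numbers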

Best,
Lukas

-- 
Lukas Fittl

Skype: lfittl
Phone: +1 415 321 0630


Re: [HACKERS] Set log_line_prefix and application name in test drivers

2016-08-10 Thread Peter Eisentraut
On 8/10/16 5:18 PM, Tom Lane wrote:
> Or in short: I don't want to be seeing one prefix format in some buildfarm
> logs and a different format in others.

Sure.  My patch has

log_line_prefix = '%t [%p]: [%l] %qapp=%a '

which is modeled on the pgfouine recommendation, which I believe is a
widespread convention, and it also vaguely follows syslog customs.

The build farm client has

log_line_prefix = '%m [%c:%l] '

which is very similar, but the lack of the PID makes it unsuitable for
the purposes that I have set out, and there is no obvious place to put
additional information such as %a.

%m vs %t is obviously a minor issue that I will gladly adjust, but
besides that I prefer to stick with my version.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-10 Thread Peter Geoghegan
On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn  wrote:
> They've been used for the FreeBSD ports since 2005, and have served us well. 
> I have of course updated them regularly. In this latest version, I've removed 
> support for other encodings beside UTF-8, mostly since I don't know how to 
> test them, but also, I see little point in supporting anything else using ICU.

Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
is not the release that introduced its use, it did expand it
significantly. I think you need to fix this, even though it isn't
actually used to sort text at present, since presumably FreeBSD builds
of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
reasonably trust the ICU equivalent of strxfrm(), so that's a missed
opportunity. (See the wiki page on the abbreviated keys issue [1] if
you don't know what I'm talking about.)

Shouldn't you really have a strxfrm() wrapper, used across the board,
including for callers outside of varlena.c? convert_string_datum() has
been calling strxfrm() for many releases now. These calls are still
used in FreeBSD builds, I would think, which seems like a bug that is
not dodged by simply not defining TRUST_STRXFRM. Doesn't its assumption
that it matches the ordering used elsewhere fail to hold on FreeBSD
builds?

[1] https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue
-- 
Peter Geoghegan




[HACKERS] Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

2016-08-10 Thread Kisung Kim
Hi,

I've run YCSB (Yahoo! Cloud Serving Benchmark) on PostgreSQL and on MongoDB
with WiredTiger.
And I found some interesting results and some (possible) issues with the
Btree index of PostgreSQL.

Here is my experiments and results.
YCSB is a document-store benchmark, and I built the following schema in PG.

CREATE TABLE usertable (
YCSB_KEY varchar(30) primary key,
FIELDS jsonb);

The benchmark generated JSON documents with an average length of 300 bytes
and loaded 100M rows into PG and Mongo.

First, I found a size difference between PG and Mongo.
I configured Mongo not to use any compression for either storage or indexes.

MongoDB index size: 2.1 GB
PostgreSQL index size: 5.5 GB

When I used the index bloating estimation script in
https://github.com/ioguix/pgsql-bloat-estimation,
the result is as follows:
 current_database | ycsb
 schemaname       | public
 tblname          | usertable
 idxname          | usertable_pkey
 real_size        | 5488852992
 extra_size       | 2673713152
 extra_ratio      | 48.7116917850949
 fillfactor       | 90
 bloat_size       | 2362122240
 bloat_ratio      | 43.0348971532448
 is_na            | f

It says that the bloat_ratio is about 43 for the index.

So, I rebuilt the index and the result changed:

 current_database | ycsb
 schemaname       | public
 tblname          | usertable
 idxname          | usertable_pkey
 real_size        | 3154264064
 extra_size       | 339132416
 extra_ratio      | 10.7515543758863
 fillfactor       | 90
 bloat_size       | 27533312
 bloat_ratio      | 0.872891788428275
 is_na            | f


I am curious about these results:
1) Why was the index bloated even though rows were only inserted, not
removed or updated?
2) And why is the bloat removed when the index is rebuilt?

I guess that the splits of Btree nodes during the inserts caused the
bloating, but I don't know the exact reasons.
And given that MongoDB's index size is much smaller than PG's after they
handled the same workload (insert only), is there any chance to improve
PG's index behavior?

Thank you very much.


-- 




Bitnine Global Inc., Kisung Kim, Ph.D
https://sites.google.com/site/kisungresearch/
E-mail : ks...@bitnine.net
Office phone : 070-4800-5890, 408-606-8602
US Mobile phone : 408-805-2192


Re: [HACKERS] Assertion failure in REL9_5_STABLE

2016-08-10 Thread Marko Tiikkaja

On 2016-08-11 12:09 AM, Alvaro Herrera wrote:

BTW this is not a regression, right?  It's been broken all along.  Or am
I mistaken?


You're probably right.  I just hadn't realized I could run our app 
against 9.5 with --enable-cassert until last week.



.m




Re: [HACKERS] Assertion failure in REL9_5_STABLE

2016-08-10 Thread Marko Tiikkaja

On 2016-08-10 11:01 PM, Alvaro Herrera wrote:

Oh, I see ... so there's an update chain, and you're updating an earlier
tuple.  But the later tuple has a multixact and one of the members is
the current transaction.

I wonder how can you lock a tuple that's not the latest, where that
update chain was produced by your own transaction.  I suppose this is
because of plpgsql use of cursors.


There's a rolled back subtransaction that also did some magic on the 
rows AFAICT.  I can try and spend some time producing a smaller test 
case over the weekend.  No hurry since this missed today's point
release anyway.



.m




Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-10 Thread Tom Lane
Andrew Gierth  writes:
> So these properties (I've changed all the names here, suggestions
> welcome for better ones) I think should be testable on the AM, each with
> an example of why:

>   can_order
>   can_unique
>   can_multi_col
>   can_exclude 

Check, flags that indicate what you can put in CREATE INDEX obviously
have to be testable on the AM.  I wonder though whether this behavior
of can_order should be distinct from the notion of "can produce
ordered scan output"?  Maybe that's silly.  Or maybe can_order needs
to be something you test at the opclass level not the AM level ---
I can sort of imagine an index type in which some opclasses support
ordering and others don't.  Certainly that behavior is possible today
for amcanorderbyop.

> (One possible refinement here could be to invert the sense of all of
> these, making them no_whatever, so that "false" and "null" could be
> treated the same by clients. Or would that be too confusing?)

Hmm?  I think true as the "has capability" case is fine from that
perspective, null would be taken as "doesn't work".

> These could be limited to being testable only on a specified index, and
> not AM-wide:

That would require adding a third function, but maybe we should just do
that.  In a lot of cases you'd rather not have to worry about which AM
underlies a given index, so a simple index_has_property(regclass, text)
function would be nice.  (That means can_order ought to be supported in
the per-index function even if you don't believe that it'd ever be
opclass-specific, or in the per-column function if you do.)
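
For concreteness, a sketch of the call shape being proposed here (the
function and property names are the ones floated in this thread, not a
shipped API):

    SELECT index_has_property('some_index'::regclass, 'can_order');
    SELECT index_has_property('some_index'::regclass, 'clusterable');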

>   can_backward

As above, that could conceivably need to be per-column.

>   clusterable

Call this can_cluster, maybe?  Or just cluster?

>   index_scan
>   bitmap_scan
>   optional_key (? maybe)
>   predicate_locks (? maybe)

As noted in my response to Kevin, I don't like the last two.  They
are internal properties and it's hard to see them being of much
use to applications even if they weren't subject to change.

> And these for individual columns:

>   can_return
>   search_array  (? maybe)
>   search_nulls  (? maybe)
>   operator_orderable  (or distance_orderable? what's a good name?)

distance_orderable seems not bad.

>   orderable
>   asc
>   desc
>   nulls_first
>   nulls_last

OK

> A question: implementing can_return as a per-column property looks like
> it requires actually opening the index rel, rather than just consulting
> the syscache the way that most pg_get_* functions do. Should it always
> open it, or only for properties that need it?

Probably only if needed, on performance grounds, and because opening
the rel adds to chances of failure.

regards, tom lane




Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-10 Thread Tom Lane
Kevin Grittner  writes:
> That one seems like it should either be at the AM level or not
> included at all.  Where it would be interesting to know is if you
> are a hacker looking for an AM to enhance with support, or (when
> there is more than just btree supported, so it is not so easy to
> remember) if you are a DBA investigating a high rate of
> serialization failures and want to know whether indexes of a
> certain type have index-relation predicate locking granularity or
> something more fine-grained.  The latter use case seems plausible
> once there is more of a mix of support among the AMs.

TBH, that line of thought impresses me not at all, because I do not see
a reason for SQL queries to need to see internal behaviors of AMs, and
especially not at levels as crude as boolean properties of entire AMs,
because that's just about guaranteed to become a lie (or at least not
enough of the truth) in a year or two.  If you are a DBA wanting to know
how fine-grained the locking is in a particular index type, you really
need to read the source code or ask a hacker.

We have been bit by the "representation not good enough to describe actual
behavior" problem *repeatedly* over the years that pg_am had all this
detail.  First it was amstrategies and amsupport, which have never
usefully described the set of valid proc/op strategy numbers for any index
type more complicated than btree.  Then there was amorderstrategy, which
we got rid of in favor of amcanorder, and later added amcanbackward to
that (not to mention amcanorderbyop).  And amconcurrent, which went away
for reasons I don't recall.  Then we added amstorage, which later had to
be supplemented with amkeytype, and still isn't a very accurate guide to
what's actually in an index.  amcanreturn actually was a boolean rather
than a function for awhile (though it looks like we never shipped a
release with that definition).  There's still a lot of stuff with
obviously limited life expectancy, like amoptionalkey, which is at best a
really crude guide to what are valid index qualifications; someday that
will likely have to go away in favor of a "check proposed index qual for
supportability" AM callback.

So I don't think I'm being unreasonable in wanting to minimize, not
maximize, the amount of info exposed through this interface.  There is
enough history to make me pretty sure that a lot of things that might be
simple boolean properties today are going to be less simple tomorrow, and
then we'll be stuck having to invent arbitrary definitions for what the
property-test function is going to return for those.  And if there are
any queries out there that are depending on simplistic interpretations
of those property flags, they'll be broken in some respect no matter
what we do.

In short, I do not see a good reason to expose ampredlocks at the SQL
level, and I think there needs to be a darn good reason to expose any of
this stuff, not just "maybe some DBA will think he needs to query this".

regards, tom lane




Re: [HACKERS] Assertion failure in REL9_5_STABLE

2016-08-10 Thread Alvaro Herrera
BTW this is not a regression, right?  It's been broken all along.  Or am
I mistaken?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Parallel tuplesort, partitioning, merging, and the future

2016-08-10 Thread Peter Geoghegan
On Wed, Aug 10, 2016 at 11:59 AM, Robert Haas  wrote:
> I think that last part is a very important property; my intuition is
> that dividing up the work between cooperating processes in a way that
> should come out equal will often fail to do so, either due to the
> operating system scheduler or due to some data being cached and other
> data not being cached or due to the comparator running faster on some
> data than other data or due to NUMA effects that make some processes
> run faster than others or due to any number of other causes.  So I
> think that algorithms that allocate the work dynamically are going to
> greatly outperform those that use a division of labor which is fixed
> at the beginning of a computation phase.

I agree that dynamic sampling has big advantages. Our Quicksort
implementation does dynamic sampling, of course.

You need to be strict about partition boundaries: they may not be
drawn at a point in the key space that is not precisely defined, and
in general there can be no ambiguity about what bucket a tuple can end
up in ahead of time. In other words, you cannot carelessly allow equal
tuples to go on either side of an equal boundary key.

The reason for this restriction is that otherwise, stuff breaks when
you later attempt to "align" boundaries across sort operations that
are performed in parallel. I don't think you can introduce an
artificial B-Tree style tie-breaker condition to avoid the problem,
because that will slow things right down (quicksort does really
well with many equal keys).

When you have one really common value, load balancing for partitioning
just isn't going to do very well. My point is that there will be a
somewhat unpleasant worst case that will need to be accepted. It's not
practical to go to the trouble of preventing it entirely. So, the
comparison with quicksort works on a couple of levels.

-- 
Peter Geoghegan




Re: [HACKERS] per-statement-level INSTEAD OF triggers

2016-08-10 Thread Emre Hasegeli
> It might be more useful after we get the infrastructure that Kevin's been
> working on to allow collecting all the updates into a tuplestore that
> could be passed to a statement-level trigger.  Right now I tend to agree
> that there's little point.

Maybe, this can be used to re-implement FOREIGN KEYs.  Never-ending
bulk DELETEs caused by lack of indexes on foreign key columns are
biting novice users quite often in my experience.




Re: [HACKERS] Set log_line_prefix and application name in test drivers

2016-08-10 Thread Tom Lane
Peter Eisentraut  writes:
> On 8/9/16 12:16 PM, Tom Lane wrote:
>> Peter Eisentraut  writes:
>>> Here is a small patch that sets log_line_prefix and application name in
>>> pg_regress and the TAP tests, to make analyzing the server log output
>>> easier.

>> How would this interact with the buildfarm's existing policies
>> on setting log_line_prefix?

> AFAICT, that only applies if the build farm client runs initdb itself,
> that is, for the installcheck parts.

Well, I guess the subtext of my question was whether we shouldn't try
to align this with the buildfarm's choices, or vice versa.  Andrew made
some different choices than you have done here, and it seems like we
ought to strive for a meeting of the minds on what's appropriate.

Or in short: I don't want to be seeing one prefix format in some buildfarm
logs and a different format in others.

regards, tom lane




Re: [HACKERS] Parallel tuplesort, partitioning, merging, and the future

2016-08-10 Thread Peter Geoghegan
On Wed, Aug 10, 2016 at 12:08 PM, Claudio Freire  wrote:
> I think it's a great design, but for that, per-worker final tapes have
> to always be random-access.

Thanks. I don't think I need to live with the randomAccess
restriction, because I can be clever about reading only the first
tuple on each logtape.c block initially. Much later, when the binary
search gets down to seeking within a single block, everything in the
block can be read at once into memory, and we can take the binary
search to that other representation. This latter part only needs to
happen once or twice per partition boundary per worker.

> I'm not hugely familiar with the code, but IIUC there's some penalty
> to making them random-access right?

Yeah, there is. For one thing, you have to store the length of the
tuple twice, to support incremental seeking in both directions. For
another, you cannot perform the final merge on-the-fly; you must
produce a serialized tape as output, which is used subsequently to
support random seeks. There is no penalty when you manage to do the
sort in memory, though (not that that has anything to do with parallel
sort).

-- 
Peter Geoghegan




Re: [HACKERS] new pgindent run before branch?

2016-08-10 Thread Tom Lane
Robert Haas  writes:
> Would anyone mind too much if I refreshed typedefs.list and
> re-indented the whole tree before we branch?

+1, I was planning to do that myself.

regards, tom lane




Re: [HACKERS] Assertion failure in REL9_5_STABLE

2016-08-10 Thread Alvaro Herrera
Marko Tiikkaja wrote:
> On 2016-08-10 19:28, Alvaro Herrera wrote:
> >Umm.  AFAICS HeapTupleSatisfiesUpdate() only returns SelfUpdated after
> >already calling HeapTupleHeaderGetCmax() (which obviously hasn't caught
> >the same assertion).  Something is odd there ...
> 
> HeapTupleSatisfiesUpdate() returns HeapTupleBeingUpdated in this case.
> HeapTupleSelfUpdated comes from here (line 4749):
> 
> /* if there are updates, follow the update chain */
> if (follow_updates && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
> {
>     HTSU_Result res;
>
>     res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
>                                   GetCurrentTransactionId(),
>                                   mode);
>     if (res != HeapTupleMayBeUpdated)
>     {
>         result = res;
>         /* recovery code expects to have buffer lock held */
>         LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
>         goto failed;
>     }
> }

Oh, I see ... so there's an update chain, and you're updating an earlier
tuple.  But the later tuple has a multixact and one of the members is
the current transaction.

I wonder how can you lock a tuple that's not the latest, where that
update chain was produced by your own transaction.  I suppose this is
because of plpgsql use of cursors.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Parallel tuplesort, partitioning, merging, and the future

2016-08-10 Thread Peter Geoghegan
On Wed, Aug 10, 2016 at 11:59 AM, Robert Haas  wrote:
> My view on this - currently anyway - is that we shouldn't conflate the
> tuplesort with the subsequent index generation, but that we should try
> to use parallelism within the tuplesort itself to the greatest extent
> possible.  If there is a single output stream that the leader uses to
> generate the final index, then none of the above problems arise.  They
> only arise if you've got multiple processes actually writing to the
> index.

I'm not sure if you're agreeing with my contention about parallel
CREATE INDEX not being a good target for partitioning here. Are you?

You can get some idea of how much a separate pass over the
concatenated outputs would hurt by using the test GUC with my patch
applied (the one that artificially forces randomAccess by B-Tree
tuplesort callers).

>> Suggested partitioning algorithm
>> 

>> The basic idea I have in mind is that we create runs in workers in the
>> same way that the parallel CREATE INDEX patch does (one output run per
>> worker). However, rather than merging in the leader, we use a
>> splitting algorithm to determine partition boundaries on-the-fly. The
>> logical tape stuff then does a series of binary searches to find those
>> exact split points within each worker's "final" tape. Each worker
>> reports the boundary points of its original materialized output run in
>> shared memory. Then, the leader instructs workers to "redistribute"
>> slices of their final runs among each other, by changing the tapeset
>> metadata to reflect that each worker has nworker input tapes with
>> redrawn offsets into a unified BufFile. Workers immediately begin
>> their own private on-the-fly merges.
>
> Yeah, this is pretty cool.  You end up with the final merge segmented
> into N submerges producing nonoverlapping ranges.  So you could have
> the leader perform submerge 0 itself, and while it's doing that the
> other workers can perform submerges 1..N.  By the time  the leader
> finishes submerge 0, the remaining submerges will likely be complete
> and after that the leader can just read the outputs of those submerges
> one after another and it has basically no additional work to do.

Again, I'm a little puzzled by your remarks here. Surely the really
great case for parallel sort with partitioning is the case where there
remains minimal further IPC between workers? So, while "the leader can
just read the outputs of those submerges", ideally it will be reading
as little as possible from workers. For example, it's ideal when the
workers were able to determine that their particular range in the
parallel merge join has very few tuples to return, having also
"synchronized" their range within two underlying relations
(importantly, the merge join "synchronization" can begin per worker
when the tuplesort.c on-the-fly merge begins and returns its first
tuple -- that is, it can begin very soon).

In short, partitioning when sorting is as much about avoiding a serial
dependency for the entire query tree as it is about taking advantage
of available CPU cores and disk spindles. That is my understanding, at
any rate.

While all this speculation about choice of algorithm is fun,
realistically I'm not going to write the patch for a rainy day (nor for
parallel CREATE INDEX, at least until we become very comfortable with
all the issues I raise, which could never happen). I'd be happy to
consider helping you improve parallel query by providing
infrastructure like this, but I need someone else to write the client
of the infrastructure (e.g. a parallel merge join patch), or to at
least agree to meet me half way with an interdependent prototype of
their own. It's going to be messy, and we'll have to do a bit of
stumbling to get to a good place. I can sign up to that if I'm not the
only one that has to stumble.

Remember how I said we should work on the merging bottleneck
indirectly? I'm currently experimenting with having merging use
sifting down to replace the root in the heap. This is very loosely
based on the Jeremy Harris patch from 2014, I suppose. Anyway, this
can be far, far faster, with perhaps none of the downsides that we saw
in the context of building an initial replacement selection heap,
because we have more control of the distribution of input (tapes have
sorted tuples), and because this heap is so tiny and cache efficient
to begin with. This does really well in the event of clustering of
values, which is a common case, but also helps with totally random
initially input.

I need to do some more research before posting a patch, but right now
I can see that it makes merging presorted numeric values more than 2x
faster. And that's with 8 tapes, on my very I/O bound laptop. I bet
that the benefits would also be large for text (temporal locality is
improved, and so strcoll() comparison caching is more effective).
Serial merging still needs work, it seems.

-- 
Peter Geoghegan



Re: [HACKERS] Assertion failure in REL9_5_STABLE

2016-08-10 Thread Marko Tiikkaja

On 2016-08-10 19:28, Alvaro Herrera wrote:

Umm.  AFAICS HeapTupleSatisfiesUpdate() only returns SelfUpdated after
already calling HeapTupleHeaderGetCmax() (which obviously hasn't caught
the same assertion).  Something is odd there ...


HeapTupleSatisfiesUpdate() returns HeapTupleBeingUpdated in this case. 
HeapTupleSelfUpdated comes from here (line 4749):


/* if there are updates, follow the update chain */
if (follow_updates && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
{
    HTSU_Result res;

    res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
                                  GetCurrentTransactionId(),
                                  mode);
    if (res != HeapTupleMayBeUpdated)
    {
        result = res;
        /* recovery code expects to have buffer lock held */
        LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
        goto failed;
    }
}


Can you share the test case?


Not at this time, sorry.


.m




Re: [HACKERS] new pgindent run before branch?

2016-08-10 Thread Bruce Momjian
On Wed, Aug 10, 2016 at 04:02:27PM -0400, Robert Haas wrote:
> Hi,
> 
> Would anyone mind too much if I refreshed typedefs.list and
> re-indented the whole tree before we branch?  There's not too much
> churn at the moment, and I feel like it would be nice to start the
> cycle in as clean a state as possible.
> 
> Current results of this attached.

Sure, seems like a good idea.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-10 Thread Palle Girgensohn

> 4 aug. 2016 kl. 02:40 skrev Bruce Momjian :
> 
> On Thu, Aug  4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
> 
> Uh, this email from July says Peter Eisentraut will submit it in
> September  :-)
> 
>   
> https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa228892...@2ndquadrant.com

Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support 
for COLLATE on columns.


https://github.com/girgen/postgres/


in branches icu/XXX where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I 
have of course updated them regularly. In this latest version, I've removed 
support for other encodings beside UTF-8, mostly since I don't know how to test 
them, but also, I see little point in supporting anything else using ICU.



I have one question for someone with knowledge about Turkish (Devrim?). This is 
the diff from regression tests, when running

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8

$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"
*** 
/Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out
  2016-08-10 21:09:03.0 +0200
--- 
/Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out
2016-08-10 21:12:53.0 +0200
***
*** 373,379 
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  ---
!  f
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  ---
!  t
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***
*** 385,391 
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  --
!  t
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.
--- 385,391 
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  --
!  f
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.

==

The Linux locale behaves differently from ICU for the above (corner ?) cases. 
Any ideas if one is more correct than the other? It seems unclear to me. Perhaps
it depends on whether the case-insensitive match is done using lower(both) or 
upper(both)? I haven't investigated this yet. @Devrim, is one more correct than 
the other?


As Thomas points out, using ucol_strcoll() it is quick, since no copying is
needed. I will get some benchmarks soon.

Palle



[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html





Re: [HACKERS] Wait events monitoring future development

2016-08-10 Thread Robert Haas
On Tue, Aug 9, 2016 at 12:07 AM, Tsunakawa, Takayuki
 wrote:
> As another idea, we can stand on the middle ground.  Interestingly, MySQL 
> also enables their event monitoring (Performance Schema) by default, but not 
> all events are collected.  I guess highly encountered events are not 
> collected by default to minimize the overhead.

Yes, I think that's a sensible approach.  I can't see enabling by
default a feature that significantly regresses performance.  We work
too hard to improve performance to throw very much of it away for any
one feature, even a feature that a lot of people like.  What I really
like about what got committed to 9.6 is that it's so cheap we should
be able to use for lots of other things - latch events, network I/O,
disk I/O, etc. without hurting performance at all.

But if we start timing those events, it's going to be really
expensive.  Even just counting them or keeping a history will cost a
lot more than just publishing them while they're active, which is what
we're doing now.

> BTW, I remember EnterpriseDB has a wait event monitoring feature.  Is it 
> disabled by default?  What was the overhead?

Timed events in Advanced Server are disabled by default.  I haven't
actually tested the overhead myself and I don't remember exactly what
the numbers were the last time someone else did, but I think if you
turned edb_timed_statistics on, it's pretty expensive.  If we can
agree on something sensible here, I imagine we'll get rid of that
feature in Advanced Server in favor of whatever the community settles
on.  But if the community agrees to turn on something by default that
costs a measurable percentage in performance, I predict that Advanced
Server 10 will ship with a different default for that feature than
PostgreSQL 10.

Personally, I think too much of this thread (and previous threads) is
devoted to arguing about whether it's OK to make performance worse,
and by how much we'd be willing to make it worse.  What I think we
ought to be talking about is how to design a feature that produces the
most useful data for the least performance cost possible, like by
avoiding measuring wait times for events that are very frequent or
waits that are very short.  Or, maybe we could have a background
process that updates a timestamp in shared memory every millisecond,
and other processes can read that value instead of making a system
call.  I think on Linux systems with fast clocks the operating system
basically does something like that for you, but there might be other
systems where it helps.  Of course, it could also skew the results if
the system is so overloaded that the clock-updater process gets
descheduled for a lengthy period of time.

Anyway, I disagree with the idea that this feature is stalled or
blocked in some way.  I (and quite a few other people, though not
everyone) oppose making performance significantly worse in the default
configuration.  I oppose that regardless of whether it is a
hypothetical patch for this feature that causes the problem or whether
it is a hypothetical patch for some other feature that causes the
problem.  I am not otherwise opposed to more work in this area; in
fact, I'm rather in favor of it.  But you can count on me to argue
against pretty much everything that causes a performance regression,
whatever the reason. Virtually every release, at least one developer
proposes some patch that slows the server down by "only" 1-2%.  If
we'd accepted all of the patches that were shot down because of such
impacts, we'd have lost a very big chunk of performance between the
time I started working on PostgreSQL and now.

As it is, our single-threaded performance seems to have regressed
noticeably since 9.1:

http://bonesmoses.org/2016/01/08/pg-phriday-how-far-weve-come/

I think that's awful.  But if we'd accepted all of those patches that
cost "only" one or two percentage points, it would probably be -15% or
-25% rather than -4.4%.  I think that if we want to really be
successful as a project, we need to make that number go UP, not down.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] new autovacuum criterion for visible pages

2016-08-10 Thread Jeff Janes
I wanted to create a new relopt named something like
autovacuum_vacuum_pagevisible_factor which would cause autovacuum to
vacuum a table once less than a certain fraction of the relation's
pages are marked allvisible.
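
For reference, the fraction in question can already be computed from
pg_class (the table name is made up):

    SELECT relpages,
           relallvisible,
           relallvisible::float8 / GREATEST(relpages, 1) AS visible_fraction
    FROM pg_class
    WHERE relname = 'some_table';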

I wanted some feedback on some things.

1) One issue is that pg_class.relpages and pg_class.relallvisible are
themselves only updated by vacuum/analyze.  In the absence of manual
vacuum or analyze, this means that if the new criterion uses those
fields, it could only kick in after an autoanalyze has already been
done, which means that autovacuum_vacuum_pagevisible_factor could not
meaningfully be set lower than autovacuum_analyze_scale_factor.

Should relallvisible be moved/copied from pg_class to
pg_stat_all_tables, so that it is maintained by the stats collector?
Or should the autovacuum worker just walk the vm of every table with a
defined autovacuum_vacuum_pagevisible_factor each time it is launched
to get an up-to-date count that way?

2) Should there be a guc in addition to the relopt?  I can't think of
a reason why I would want to set this globally, so I'm happy with just
a relopt.  If it were set globally, it would sure increase the cost
for scanning the vm for each table once each naptime.

3) Should there be an autovacuum_vacuum_pagevisible_threshold?  The
other settings have both a factor and a threshold.  I've never
understood what the point of the threshold settings is, but presumably
there is a point to them.  Does that reason also apply to keeping vm
tuned up?
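
(For reference, the existing pair combine as documented: a table is
autovacuumed once

    dead tuples > autovacuum_vacuum_threshold
                  + autovacuum_vacuum_scale_factor * reltuples

so the threshold mostly keeps autovacuum from firing too often on very
small tables; presumably a pagevisible variant would serve the same
purpose.)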

Cheers,

Jeff




Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Vladimir Sitnikov
Shay> it's important to note that query parsing and rewriting isn't an
"inevitable evil".

Ok, I stay corrected. ADO.NET have raw mode in the API. That's interesting.

Let's say "lots of heavily used languages do have their own notion of bind
placeholders".
And for the rest, it is still not that hard to prepare on the go.

Shay> As Tom said, if an application can benefit from preparing, the
developer has the responsibility (and also the best knowledge) to manage
preparation, not the driver. Magical behavior under the hood causes
surprises, hard-to-diagnose bugs etc.

Why do you use C# then?
Aren't you supposed to love machine code as the least magical thing?
Even x86 does not describe "the exact way the code should be executed":
all CPUs shuffle instructions to make execution faster.

Shay>As Tom said, if an application can benefit from preparing, the
developer has the responsibility

Does the developer have the responsibility to choose between an "index scan"
and a "table seq scan"? So your "developer has the responsibility" argument
is built on sand.

Even "SQL execution on PostgreSQL is a magical behavior under the hood".
Does that mean we should drop the optimizer and implement "fully static
hint-based execution mode"? I do not buy that.

My experience shows that people are very bad at predicting where the
performance problem will be.
In 80% (or even 99%) of the cases, they simply do not stop to think whether a
particular statement should be server-prepared or not.
They have absolutely no idea how many resources it would take, and so on.

ORMs have no such notion as "this query must be server-prepared, while that
one must not be".
And so on.

It is somewhat insane to assume people would use naked SQL. Of course they
would use ORMs and the like, so they just would not be able to pass an
additional parameter telling whether a particular query out of thousands
should be server-prepared or not.

Vladimir> "cached plan cannot change result type" -- PostgreSQL just fails
to execute the server-prepared statement if a table was altered.

Shay>How exactly do users cope with this in pgjdbc? Do they have some
special API to flush (i.e. deallocate) prepared statements which they're
supposed to use after a DDL?

First of all, pgjdbc does report those problems to hackers.
Unfortunately, it is still "not implemented".
Then a workaround is made on the pgjdbc side.
Here's relevant pgjdbc fix: https://github.com/pgjdbc/pgjdbc/pull/451

It analyzes the error code, and if it finds "not_implemented
from RevalidateCachedQuery", it realizes it should re-prepare.
Unfortunately, there is no dedicated error code, but at least there's a
routine name.

On top of that, each pgjdbc commit is tested against the PostgreSQL HEAD
revision, so if the routine gets renamed for some reason, we'll know
right away.

There will be some more parsing to cover the "deallocate all" case.

Shay> it has been listed many times - pgbouncer

Let's stop discussing the pgbouncer issue here.
It has absolutely nothing to do with pgsql-hackers.
Thanks.

Vladimir


Re: [HACKERS] regression test for extended query protocol

2016-08-10 Thread Alvaro Herrera
Michael Paquier wrote:
> On Fri, Aug 5, 2016 at 12:21 AM, Alvaro Herrera
>  wrote:
> > If somebody had some spare time to devote to this, I would suggest to
> > implement something in core that can be used to specify a list of
> > commands to run, and a list of files-to-be-saved-in-bf-log emitted by
> > each command.  We could have one such base file in the core repo that
> > specifies some tests to run (such as "make -C src/test/recovery check"),
> > and an additional file can be given by buildfarm animals to run custom
> > tests, without having to write BF modules for each thing.  With that,
> > pgsql committers could simply add a new test to be run by all buildfarm
> > animals by adding it to the list in core.
> 
> Do you think about using a special makefile target to run those
> commands, say in src/test? At the end we are going to need to patch
> the buildfarm client code at least once, at least that would be worth
> it in the long term..

Sure.  Some time ago I proposed something like a JSON file, something
like 

test_foo => {
  "command" : "make -C src/test/blah check",
  "save_output" : [ "src/test/blah/server.log", "src/test/blah/regress*.log" ]
}

as I recall, Andrew said that he didn't like JSON very much but that the
idea made sense to him.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] new pgindent run before branch?

2016-08-10 Thread Robert Haas
Hi,

Would anyone mind too much if I refreshed typedefs.list and
re-indented the whole tree before we branch?  There's not too much
churn at the moment, and I feel like it would be nice to start the
cycle in as clean a state as possible.

Current results of this attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgindent.patch
Description: application/download



Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-10 Thread Claudio Freire
On Wed, Aug 10, 2016 at 4:37 PM, Simon Riggs  wrote:
> On 10 August 2016 at 03:45, Pavan Deolasee  wrote:
>>
>>
>> On Tue, Aug 9, 2016 at 12:06 AM, Claudio Freire 
>> wrote:
>>>
>>> On Mon, Aug 8, 2016 at 2:52 PM, Pavan Deolasee 
>>> wrote:
>>>
>>> > Some heuristics and limits on amount of work done to detect duplicate
>>> > index
>>> > entries will help too.
>>>
>>> I think I prefer a more thorough approach.
>>>
>>> Increment/decrement may get very complicated with custom opclasses,
>>> for instance. A column-change bitmap won't know how to handle
>>> funcional indexes, etc.
>>>
>>> What I intend to try, is modify btree to allow efficient search of a
>>> key-ctid pair, by adding the ctid to the sort order.
>>
>>
>> Yes, that's a good medium term plan. And this is kind of independent from
>> the core idea.
>
> +1 That seems like a good idea. It would allow us to produce a bitmap
> scan in block-sorted order.

Well, not really. Only equal-key runs will be block-sorted, since the
sort order would be (key, block, offset).
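
For illustration, the (key, block, offset) ordering amounts to a tie-break on
the heap TID, roughly like this (a sketch, not the actual patch):

#include "storage/itemptr.h"

/*
 * Sketch: order index entries by (key, block, offset).  The key comparison
 * is whatever the opclass already provides; the heap TID only breaks ties,
 * which is why only equal-key runs come out block-sorted.
 */
static int
compare_key_then_ctid(int keycmp, ItemPointer a, ItemPointer b)
{
    if (keycmp != 0)
        return keycmp;

    if (ItemPointerGetBlockNumber(a) != ItemPointerGetBlockNumber(b))
        return (ItemPointerGetBlockNumber(a) < ItemPointerGetBlockNumber(b)) ? -1 : 1;

    if (ItemPointerGetOffsetNumber(a) != ItemPointerGetOffsetNumber(b))
        return (ItemPointerGetOffsetNumber(a) < ItemPointerGetOffsetNumber(b)) ? -1 : 1;

    return 0;
}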




Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Shay Rojansky
Some comments...

For the record, I do find implicit/transparent driver-level query
preparation interesting and potentially useful, and have opened
https://github.com/npgsql/npgsql/issues/1237 to think about it - mostly
based on arguments on this thread. One big question mark I have is whether
this behavior should be opt-out as in pgjdbc, rather than opt-in. Like
others in this thread (I think), I prefer my libraries and drivers very
simple, following my exact API calls to the letter and doing minimal magic
under the hood. As Tom said, if an application can benefit from preparing,
the developer has the responsibility (and also the best knowledge) to
manage preparation, not the driver. Magical behavior under the hood causes
surprises, hard-to-diagnose bugs etc.

One interesting case where implicit preparation is unbeatable, though, is
when users aren't coding against the driver directly but are using some
layer, e.g. an ORM. For example, the Entity Framework ORM simply does not
prepare statements (see
https://github.com/aspnet/EntityFramework/issues/5459 which I opened). This
is one reason I find it a valuable feature to have, although probably as
opt-in and not opt-out.

Regarding driver-level SQL parsing, Vladimir wrote:

> Unfortunately, client-side SQL parsing is inevitable evil (see below on
'?'), so any sane driver would cache queries any way. The ones that do not
have query cache will perform badly anyway.

First, it's a very different thing to have a query cache in the driver, and
to implicitly prepare statements.

Second, I personally hate the idea that drivers parse SQL and rewrite
queries. It's true that in many languages database APIs have "standardized"
parameter placeholders, and therefore drivers have no choice but to
rewrite. ADO.NET (the .NET database API) actually does not do this -
parameter placeholders are completely database-dependent (see
https://msdn.microsoft.com/en-us/library/yy6y35y8%28v=vs.110%29.aspx?f=255=-2147217396).
Npgsql does rewrite queries to provide named parameters, but the next
version is going to have a "raw query" mode in which no parsing/rewriting
happens whatsoever. It simply takes your string, packs it into a Parse (or
Query), and sends it off. This is not necessarily going to be the main way
users use Npgsql, but it's important to note that query parsing and
rewriting isn't an "inevitable evil".

Vladimir wrote:

> This new message would be slower than proper query cache, so why should
we all spend time on a half-baked solution?

The cases where the prepared statement path doesn't work have already been
listed many times - pgbouncer (and other pools), drivers which don't
support persisting prepared statement across pooled connection open/close,
and people (like many in this conversation) who don't appreciate their
drivers doing magic under the hood. This is why I think both proposals are
great - there's nothing wrong with query caching (or more precisely,
implicit statement preparation by the driver), especially if it's opt-in,
but that doesn't mean we shouldn't optimize things for people who don't,
can't or won't go for that.

> To be fair, implementing a cache is a trivial thing when compared with
hard-coding binary/text formats for all the datatypes in each and every
language.
> Remember, each driver has to implement its own set of procedures to
input/output values in text/binary format, and that is a way harder than
implementing the cache we are talking about.

I agree with that, but it's not a question of how easy it is to implement
implicit preparation. It's a question of whether driver developers choose
to do this, based on other considerations as well - many people here think
it's not the job of the driver to decide for you whether to prepare or not,
for example. It has nothing to do with implementation complexity.

> "cached plan cannot change result type" -- PostgreSQL just fails to
execute the server-prepared statement if a table was altered.

That alone seems to be a good reason to make implicit preparation opt-in at
best. How exactly do users cope with this in pgjdbc? Do they have some
special API to flush (i.e. deallocate) prepared statements which they're
supposed to use after a DDL? Do you look at the command tag in the
CommandComplete to know whether a command was DDL or not?


Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-10 Thread Simon Riggs
On 10 August 2016 at 03:45, Pavan Deolasee  wrote:
>
>
> On Tue, Aug 9, 2016 at 12:06 AM, Claudio Freire 
> wrote:
>>
>> On Mon, Aug 8, 2016 at 2:52 PM, Pavan Deolasee 
>> wrote:
>>
>> > Some heuristics and limits on amount of work done to detect duplicate
>> > index
>> > entries will help too.
>>
>> I think I prefer a more thorough approach.
>>
>> Increment/decrement may get very complicated with custom opclasses,
>> for instance. A column-change bitmap won't know how to handle
>> funcional indexes, etc.
>>
>> What I intend to try, is modify btree to allow efficient search of a
>> key-ctid pair, by adding the ctid to the sort order.
>
>
> Yes, that's a good medium term plan. And this is kind of independent from
> the core idea.

+1 That seems like a good idea. It would allow us to produce a bitmap
scan in block-sorted order.

> So I'll go ahead and write a patch that implements the core
> idea and some basic optimizations.

+1

> We can then try different approaches such
> as tracking changed columns, tracking increment/decrement and teaching btree
> to skip duplicate (key, CTID) entries.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Set log_line_prefix and application name in test drivers

2016-08-10 Thread Peter Eisentraut
On 8/9/16 12:16 PM, Tom Lane wrote:
> Peter Eisentraut  writes:
>> > Here is a small patch that sets log_line_prefix and application name in
>> > pg_regress and the TAP tests, to make analyzing the server log output
>> > easier.
> How would this interact with the buildfarm's existing policies
> on setting log_line_prefix?

AFAICT, that only applies if the build farm client runs initdb itself,
that is, for the installcheck parts.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Parallel tuplesort, partitioning, merging, and the future

2016-08-10 Thread Claudio Freire
On Mon, Aug 8, 2016 at 4:44 PM, Peter Geoghegan  wrote:
> The basic idea I have in mind is that we create runs in workers in the
> same way that the parallel CREATE INDEX patch does (one output run per
> worker). However, rather than merging in the leader, we use a
> splitting algorithm to determine partition boundaries on-the-fly. The
> logical tape stuff then does a series of binary searches to find those
> exact split points within each worker's "final" tape. Each worker
> reports the boundary points of its original materialized output run in
> shared memory. Then, the leader instructs workers to "redistribute"
> slices of their final runs among each other, by changing the tapeset
> metadata to reflect that each worker has nworker input tapes with
> redrawn offsets into a unified BufFile. Workers immediately begin
> their own private on-the-fly merges.

I think it's a great design, but for that, per-worker final tapes have
to always be random-access.

I'm not hugely familiar with the code, but IIUC there's some penalty
to making them random-access, right?




Re: [HACKERS] Parallel tuplesort, partitioning, merging, and the future

2016-08-10 Thread Robert Haas
On Mon, Aug 8, 2016 at 3:44 PM, Peter Geoghegan  wrote:
> I don't think partitioning is urgent for CREATE INDEX, and may be
> inappropriate for CREATE INDEX under any circumstances, because:
>
> * Possible problems with parallel infrastructure and writes.
> * Unbalanced B-Trees (or the risk thereof).
> * What I've come up with is minimally divergent from the existing
> approach to tuplesorting.

My view on this - currently anyway - is that we shouldn't conflate the
tuplesort with the subsequent index generation, but that we should try
to use parallelism within the tuplesort itself to the greatest extent
possible.  If there is a single output stream that the leader uses to
generate the final index, then none of the above problems arise.  They
only arise if you've got multiple processes actually writing to the
index.

> Suggested partitioning algorithm
> 
>
> I think a hybrid partitioning + merging approach would work well for
> us. The paper "Parallel Sorting on a Shared-Nothing Architecture using
> Probabilistic Splitting" [3] has influenced my thinking here (this was
> written by prominent researchers from the influential UW-Madison
> Wisconsin database group). Currently, I have in mind something that is
> closer to what they call exact splitting to what they call
> probabilistic splitting, because I don't think it's going to be
> generally possible to have good statistics on partition boundaries
> immediately available (e.g., through something like their
> probabilistic splitting sampling the relation ahead of time).
>
> The basic idea I have in mind is that we create runs in workers in the
> same way that the parallel CREATE INDEX patch does (one output run per
> worker). However, rather than merging in the leader, we use a
> splitting algorithm to determine partition boundaries on-the-fly. The
> logical tape stuff then does a series of binary searches to find those
> exact split points within each worker's "final" tape. Each worker
> reports the boundary points of its original materialized output run in
> shared memory. Then, the leader instructs workers to "redistribute"
> slices of their final runs among each other, by changing the tapeset
> metadata to reflect that each worker has nworker input tapes with
> redrawn offsets into a unified BufFile. Workers immediately begin
> their own private on-the-fly merges.

Yeah, this is pretty cool.  You end up with the final merge segmented
into N submerges producing nonoverlapping ranges.  So you could have
the leader perform submerge 0 itself, and while it's doing that the
other workers can perform submerges 1..N.  By the time  the leader
finishes submerge 0, the remaining submerges will likely be complete
and after that the leader can just read the outputs of those submerges
one after another and it has basically no additional work to do.

It might be a good idea to divide the work into a number of submerges
substantially greater than the number of workers.  For example,
suppose we expect between 1 and 4 workers, but we partition the work
into 64 submerges.  The leader claims submerge 0, which is only 1/64
of the total.  By the time it finishes consuming those tuples,
submerge 1 will likely be done.  Hopefully, even if there are only 1
or 2 workers, they can keep ahead of the leader so that very little of
the merging happens in the leader.  Also, if some submerges go faster
than others, the distribution of work among workers remains even,
because the ones that go quicker will handle more of the submerges and
the ones that go slower will handle fewer.

I think that last part is a very important property; my intuition is
that dividing up the work between cooperating processes in a way that
should come out equal will often fail to do so, either due to the
operating system scheduler or due to some data being cached and other
data not being cached or due to the comparator running faster on some
data than other data or due to NUMA effects that make some processes
run faster than others or due to any number of other causes.  So I
think that algorithms that allocate the work dynamically are going to
greatly outperform those that use a division of labor which is fixed
at the beginning of a computation phase.
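
A tiny sketch of that dynamic allocation (invented names, nothing from the
patch): every participant, leader included, just claims the next unclaimed
submerge from a shared atomic counter:

#include "port/atomics.h"

typedef struct SharedMergeState
{
    pg_atomic_uint32 next_submerge;   /* next unclaimed range */
    uint32           nsubmerges;      /* e.g. 64 */
} SharedMergeState;

/*
 * Sketch: leader and workers all pull ranges from the same counter, so
 * whoever runs faster simply ends up merging more of the ranges.
 */
static void
run_submerges(SharedMergeState *state)
{
    for (;;)
    {
        uint32 mine = pg_atomic_fetch_add_u32(&state->next_submerge, 1);

        if (mine >= state->nsubmerges)
            break;
        merge_partition_range(mine);   /* hypothetical per-range merge */
    }
}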

> Clearly it's really hard to be sure that this is the right thing at
> this point, but my intuition is that this is the way to go (while
> avoiding anything like this for CREATE INDEX). I'd like to know how
> others feel about it.

The number of others weighing in on these topics is surely less than
either of us would like, but hopefully we can find a way to make
progress anyhow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Alexander Korotkov
On Wed, Aug 10, 2016 at 8:26 PM, Greg Stark  wrote:

> On Wed, Aug 10, 2016 at 5:54 PM, Alexander Korotkov
>  wrote:
> > Oh, I found that I underestimated complexity of async commit...  :)
>
> Indeed. I think Tom's attitude was right even if the specific
> conclusion was wrong. While I don't think removing async commit is
> viable I think it would be a laudable goal if we can remove some of
> the complication in it. I normally describe async commit as "just like
> a normal commit but don't block on the commit" but it's actually a bit
> more complicated.
>
> AIUI the additional complexity is that while async commits are visible
> to everyone immediately other non-async transactions can be committed
> but then be in limbo for a while before they are visible to others. So
> other sessions will see the async commit "jump ahead" of any non-async
> transactions even if those other transactions were committed first.
> Any standbys will see the non-async transaction in the logs before the
> async transaction and in a crash it's possible to lose the async
> transaction even though it was visible but not the sync transaction
> that wasn't.
>
> Complexity like this makes it hard to implement other features such as
> CSNs. IIRC this already bit hot standby as well. I think it would be a
> big improvement if we had a clear, well defined commit order that was
> easy to explain and easy to reason about when new changes are being
> made.
>

Heikki, Ants, Greg, thank you for the explanation.  You restored order in
my personal world.
Now I see that introduction of own sequence of CSN which is not equal to
LSN makes sense.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Kevin Grittner
On Wed, Aug 10, 2016 at 12:26 PM, Greg Stark  wrote:

> Complexity like this makes it hard to implement other features such as
> CSNs. IIRC this already bit hot standby as well. I think it would be a
> big improvement if we had a clear, well defined commit order that was
> easy to explain and easy to reason about when new changes are being
> made.

And here I was getting concerned that there was no mention of
"apparent order of execution" for serializable transactions --
which does not necessarily match either the order of LSNs from
commit records nor CSNs.  The order in which transactions become
visible is clearly a large factor in determining AOoE, but it is
secondary to looking at whether a transaction modified data based
on reading the "before" image of a data set modified by a
concurrent transaction.

I still think that our best bet for avoiding anomalies when using
logical replication in complex environments is for logical
replication to apply transactions in apparent order of execution.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] multivariate statistics (v19)

2016-08-10 Thread Tomas Vondra

On 08/10/2016 02:23 PM, Michael Paquier wrote:

On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
 wrote:

On 08/10/2016 06:41 AM, Michael Paquier wrote:

Patch 0001: there have been comments about that before, and you have
put the checks on RestrictInfo in a couple of variables of
pull_varnos_walker, so nothing to say from here.



I don't follow. Are you suggesting 0001 is a reasonable fix, or that there's
a proposed solution?


I think that's reasonable.



Well, to me 0001 feels more like a temporary workaround than a proper
solution. I just don't know how to deal with it, so I've kept it for now.
I'm pretty sure there will be complaints that adding RestrictInfo to the
expression walkers is not a nice idea.


>> ...


The idea is that the syntax should work even for statistics built on
multiple tables, e.g. to provide better statistics for joins. That's why the
schema may be specified (as each table might be in different schema), and so
on.


So you mean that the same statistics could be shared between tables?
But as this is visibly not a concept introduced yet in this set of
patches, why not just cut it off for now to simplify the whole? If
there is no schema-related field in pg_mv_statistics we could still
add it later if it proves to be useful.



Yes, I think creating statistics on multiple tables is one of the 
possible future directions. One of the previous patch versions 
introduced ALTER TABLE ... ADD STATISTICS syntax, but that ran into 
issues in gram.y, and given the multi-table possibilities the CREATE 
STATISTICS seems like a much better idea anyway.


But I guess you're right we may make this a bit more strict now, and 
relax it in the future if needed. For example as we only support 
single-table statistics at this point, we may remove the schema and 
always create the statistics in the schema of the table.


But I don't think we should make the statistics names unique only within 
a table (instead of within the schema).


The difference between those two cases is that if we allow multi-table 
statistics in the future, we can simply allow specifying the schema and 
everything will work just fine. But it'd break the second case, as it 
might result in conflicts in existing schemas.


I do realize this might be seen as a case of "future proofing" based on 
dubious predictions of how something might work, but OTOH this (schema 
inherited from table, unique within a schema) is pretty consistent with 
how this works for indexes.



+
 /*
Spurious noise in the patch.

+   /* check that at least some statistics were requested */
+   if (!build_dependencies)
+   ereport(ERROR,
+   (errcode(ERRCODE_SYNTAX_ERROR),
+errmsg("no statistics type (dependencies) was
requested")));
So, WITH (dependencies) is mandatory in any case. Why not just
drop it from the first cut then?



Because the follow-up patches extend this to require at least one statistics
type. So in 0004 it becomes

if (!(build_dependencies || build_mcv))

and in 0005 it's

if (!(build_dependencies || build_mcv || build_histogram))

We might drop it from 0002 (and assume build_dependencies=true), and then
add the check in 0004. But it seems a bit pointless.


This is a complicated set of patches. I'd think that we should try to
simplify things as much as possible first, and the WITH clause is not
mandatory to have as of 0002.



OK, I can remove the WITH from the 0002 part. Not a big deal.


The statistics definition reorders the columns by itself depending on their
order. For example:
create table aa (a int, b int);
create statistics aas on aa(b, a) with (dependencies);
\d aa
"public.aas" (dependencies) ON (a, b)
As this defines a correlation between multiple columns, isn't it wrong
to assume that (b, a) and (a, b) are always the same correlation? I
don't recall such properties as being always commutative (old
memories, I suck at stats in general). [...reading README...] So this
is caused by the implementation limitations that only limit the
analysis between interactions of two columns. Still it seems incorrect
to reorder the user-visible portion.


I don't follow. If you talk about Pearson's correlation, that clearly does
not depend on the order of columns - it's perfectly independent of that. If
you talk about about correlation in the wider sense (i.e. arbitrary
dependence between columns), that might depend - but I don't remember a
single piece of the patch where this might be a problem.


Yes, based on what is done in the patch that may not be a problem, but
I am wondering if this is not restricting things too much.



Let's keep the code as it is. If we run into this issue in the future, 
we can easily relax this - there's nothing depending on the ordering of 
attnums, IIRC.



Also, which README states that we can only analyze interactions between two
columns? That's pretty clearly not the case - the patch should handle
dependencies between more columns 

Re: [HACKERS] pg_ctl promote wait

2016-08-10 Thread Peter Eisentraut
On 8/7/16 9:44 PM, Michael Paquier wrote:
>>> This is not a good
>>> >> idea, and the idea of putting a wait argument in get_controlfile does
>>> >> not seem a good interface to me. I'd rather see get_controlfile be
>>> >> extended with a flag saying no_error_on_failure and keep the wait
>>> >> logic within pg_ctl.
>> >
>> > I guess we could write a wrapper function in pg_ctl that encapsulated
>> > the wait logic.
> That's what I would do.

New patches, incorporating your suggestions.

I moved some of the error handling out of get_controlfile() and back
into the callers, because it was getting too weird that that function
knew so much about the callers' intentions.  That way we don't actually
have to change the signature.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 04a724675769b2412b739c408abc136d17d52794 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 26 Jul 2016 11:39:43 -0400
Subject: [PATCH v2 1/5] Make command_like output more compact

Consistently print the test name, not the full command, which can be
quite lengthy and include temporary directory names and other
distracting details.
---
 src/test/perl/TestLib.pm | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 649fd82..c7b3262 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -276,8 +276,8 @@ sub command_like
 	my ($stdout, $stderr);
 	print("# Running: " . join(" ", @{$cmd}) . "\n");
 	my $result = IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
-	ok($result, "@$cmd exit code 0");
-	is($stderr, '', "@$cmd no stderr");
+	ok($result, "$test_name: exit code 0");
+	is($stderr, '', "$test_name: no stderr");
 	like($stdout, $expected_stdout, "$test_name: matches");
 }
 
-- 
2.9.2

From b7c04b3f6cbe7a37359509363da0701c84063113 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 26 Jul 2016 10:48:43 -0400
Subject: [PATCH v2 2/5] pg_ctl: Add tests for promote action

---
 src/bin/pg_ctl/t/003_promote.pl | 39 +++
 src/test/perl/TestLib.pm| 11 +++
 2 files changed, 50 insertions(+)
 create mode 100644 src/bin/pg_ctl/t/003_promote.pl

diff --git a/src/bin/pg_ctl/t/003_promote.pl b/src/bin/pg_ctl/t/003_promote.pl
new file mode 100644
index 000..1461234
--- /dev/null
+++ b/src/bin/pg_ctl/t/003_promote.pl
@@ -0,0 +1,39 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 9;
+
+my $tempdir = TestLib::tempdir;
+
+command_fails_like([ 'pg_ctl', '-D', "$tempdir/nonexistent", 'promote' ],
+   qr/directory .* does not exist/,
+   'pg_ctl promote with nonexistent directory');
+
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+
+command_fails_like([ 'pg_ctl', '-D', $node_primary->data_dir, 'promote' ],
+   qr/PID file .* does not exist/,
+   'pg_ctl promote of not running instance fails');
+
+$node_primary->start;
+
+command_fails_like([ 'pg_ctl', '-D', $node_primary->data_dir, 'promote' ],
+   qr/not in standby mode/,
+   'pg_ctl promote of primary instance fails');
+
+my $node_standby = get_new_node('standby');
+$node_primary->backup('my_backup');
+$node_standby->init_from_backup($node_primary, 'my_backup', has_streaming => 1);
+$node_standby->start;
+
+is($node_standby->safe_psql('postgres', 'SELECT pg_is_in_recovery()'),
+   't', 'standby is in recovery');
+
+command_ok([ 'pg_ctl', '-D', $node_standby->data_dir, 'promote' ],
+		   'pg_ctl promote of standby runs');
+
+ok($node_standby->poll_query_until('postgres', 'SELECT NOT pg_is_in_recovery()'),
+   'promoted standby is not in recovery');
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index c7b3262..51b533e 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -34,6 +34,7 @@ our @EXPORT = qw(
   program_version_ok
   program_options_handling_ok
   command_like
+  command_fails_like
 
   $windows_os
 );
@@ -281,4 +282,14 @@ sub command_like
 	like($stdout, $expected_stdout, "$test_name: matches");
 }
 
+sub command_fails_like
+{
+	my ($cmd, $expected_stderr, $test_name) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+	ok(!$result, "$test_name: exit code not 0");
+	like($stderr, $expected_stderr, "$test_name: matches");
+}
+
 1;
-- 
2.9.2

From cf77e8419bae21a84a2a30d260a70bac2c19498a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 26 Jul 2016 11:23:43 -0400
Subject: [PATCH v2 3/5] pg_ctl: Detect current standby state from pg_control

pg_ctl used to determine whether a server was in standby mode by looking
for a recovery.conf file.  With this change, it instead looks into
pg_control, which is potentially more accurate.  There are also
occasional 

Re: [HACKERS] Declarative partitioning - another take

2016-08-10 Thread Robert Haas
On Wed, Aug 10, 2016 at 7:09 AM, Amit Langote
 wrote:
> Attached is the latest set of patches to implement declarative
> partitioning.

Cool.  I would encourage you to give some thought to what is the least
committable subset of these patches, and think about whether it can be
reorganized to make that smaller.  Based on the descriptions, it
sounds to me like the least committable subset is probably all of 0001
- 0005 as one giant commit, and that's a lot of code.  Maybe there's
no real way to come up with something more compact than that, but it's
worth some thought.

> 0001-Catalog-and-DDL-for-partition-key.patch

+  pg_partitioned

pg_partitioned seems like a slightly strange choice of name.  We have
no other catalog tables whose names end in "ed", so the use of a past
participle here is novel.  More generally, this patch seems like it's
suffering a bit of terminological confusion: while the catalog table
is called pg_partitioned, the relkind is RELKIND_PARTITION_REL.  And
get_relation_by_qualified_name() thinks that RELKIND_PARTITION_REL is
a "table", but getRelationDescription thinks it's a "partitioned
table".  I think we need to get all these bits and pieces on the same
page, or have some kind of consistent way of deciding what to do in
each case.  Maybe pg_partitioned_table, RELKIND_PARTITIONED_TABLE,
etc. for the internal stuff, but just "table" in user-facing messages.

Alternatively, I wonder if we should switch to calling this a
"partition root" rather than a "partitioned table".  It's not the
whole table, just the root.  And doesn't contain data - in fact it
probably shouldn't even have storage - so calling it a "table" might
be confusing.  But if we're using CREATE TABLE to create it in the
first place, then calling it something other than a table later on is
also confusing.  Hmm.

+  partexprs

There's a certain symmetry between this and what we do for indexes,
but I'm wondering whether there's a use case for partitioning a table
by an expression rather than a column value.  I suppose if you've
already done the work, there's no harm in supporting it.

+[ PARTITION BY {RANGE | LIST} ( { column_name | ( expression ) } [ opclass ] [, ...] )

The spacing isn't right here.  Should say { RANGE | LIST } rather than
{RANGE|LIST}.  Similarly elsewhere.

+  thus created is called partitioned table.  Key
+  consists of an ordered list of column names and/or expressions when
+  using the RANGE method, whereas only a single column or
+  expression can be specified when using the LIST method.

Why do we support range partitioning with multiple columns, but list
partitioning only with a single column?

+  The type of a key column or an expresion must have an associated

Typo.

+ 
+  Currently, there are following limitations on definition of partitioned
+  tables: one cannot specify any UNIQUE, PRIMARY KEY, EXCLUDE and/or
+  FOREIGN KEY constraints.
+ 

But I presume you can define these constraints on the partitions;
that's probably worth mentioning.  I'd say something like this
"Partitioned tables do not support UNIQUE, PRIMARY, EXCLUDE, or
FOREIGN KEY constraints; however, you can define these constraints on
individual data partitions."

+if (partexprs)
+recordDependencyOnSingleRelExpr(,
+(Node *) partexprs,
+RelationGetRelid(rel),
+DEPENDENCY_NORMAL,
+DEPENDENCY_IGNORE);

I don't think introducing a new DEPENDENCY_IGNORE type is a good idea
here.  Instead, you could just add an additional Boolean argument to
recordDependencyOnSingleRelExpr.  That seems less likely to create
bugs in unrelated portions of the code.

+/*
+ * Override specified inheritance option, if relation is a partitioned
+ * table and it's a target of INSERT.
+ */
+if (alsoSource)
+rte->inh |= relid_is_partitioned(rte->relid);

This seems extremely unlikely to be the right way to do this.  We're
just doing parse analysis here, so depending on properties of the
table that are (or might be made to be) changeable is not good.  It's
also an extra catalog lookup whether the table is partitioned or not
(and whether rte->inh is already true or not). Furthermore, it seems
to go against the comment explaining what rte->inh is supposed to
mean, which says:

 *inh is TRUE for relation references that should be expanded to include
 *inheritance children, if the rel has any.  This *must* be FALSE for
 *RTEs other than RTE_RELATION entries.

I am not sure what problem you're trying to solve here, but I suspect
this needs to be ripped out altogether or handled later, during
planning.  Similarly for the related change in addRangeTableEntry.

+{PartitionedRelationId,/* PARTEDRELID */
+PartitionedRelidIndexId,
+1,
+{
+  

Re: [HACKERS] multivariate statistics (v19)

2016-08-10 Thread Tomas Vondra

On 08/10/2016 02:24 PM, Michael Paquier wrote:

On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek  wrote:

On 10/08/16 13:33, Tomas Vondra wrote:


On 08/10/2016 06:41 AM, Michael Paquier wrote:


On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra


2) combining multiple statistics


I think the ability to combine multivariate statistics (covering
different
subsets of conditions) is important and useful, but I'm starting to
think
that the current implementation may not be the correct one (which is
why I
haven't written the SGML docs about this part of the patch series yet).

Assume there's a table "t" with 3 columns (a, b, c), and that we're
estimating query:

   SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

but that we only have two statistics (a,b) and (b,c). The current
patch does
about this:

   P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

i.e. it estimates the first two conditions using (a,b), and then
estimates
(c=3) using (b,c) with "b=2" as a condition. Now, this is very
efficient,
but it only works as long as the query contains conditions
"connecting" the
two statistics. So if we remove the "b=2" condition from the query, this
stops working.



This is trying to make the algorithm smarter than the user, which is
something I'd think we could live without. In this case statistics on
(a,c) or (a,b,c) are missing. And what if the user does not want to
make use of stats for (a,c) because he only defined (a,b) and (b,c)?



I don't think so. Obviously, if you have statistics covering all the
conditions - great, we can't really do better than that.

But there's a crucial relation between the number of dimensions of the 
statistics and the accuracy of the statistics. Let's say you have statistics 
on 8 columns, and you split each dimension in two to build a histogram - 
that's 256 buckets right there, and we only get ~50% selectivity in each 
dimension (the actual histogram building algorithm is more complex, but 
you get the idea).


I think it makes sense to pursue this, but I also think we can easily live
with not having it in the first version that gets committed and doing it as
a follow-up patch.


This patch is large and complicated enough. As this is not a mandatory
piece to get a basic support, I'd suggest just to drop that for later.


Which is why combining multiple statistics is in part 0006 and all the 
previous parts simply choose the single "best" statistics ;-)


I'm perfectly fine with committing just the first few parts, and leaving 
0006 for the next major version.


regards


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] multivariate statistics (v19)

2016-08-10 Thread Tomas Vondra

On 08/10/2016 03:29 PM, Ants Aasma wrote:

On Wed, Aug 3, 2016 at 4:58 AM, Tomas Vondra
 wrote:

2) combining multiple statistics

I think the ability to combine multivariate statistics (covering different
subsets of conditions) is important and useful, but I'm starting to think
that the current implementation may not be the correct one (which is why I
haven't written the SGML docs about this part of the patch series yet).


While researching this topic a few years ago I came across a paper on
this exact topic called "Consistently Estimating the Selectivity of
Conjuncts of Predicates" [1]. While effective it seems to be quite
heavy-weight, so would probably need support for tiered optimization.

[1] 
https://courses.cs.washington.edu/courses/cse544/11wi/papers/markl-vldb-2005.pdf



I think I've read that paper some time ago, and IIRC it's solving the 
same problem but in a very different way - instead of combining the 
statistics directly, it relies on the "partial" selectivities and then 
estimates the total selectivity using the maximum-entropy principle.


I think it's a nice idea and it probably works fine in many cases, but 
it kinda throws away part of the information (that we could get by 
matching the statistics against each other directly). But I'll keep that 
paper in mind, and we can revisit this solution later.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Is there a way around function search_path killing SQL function inlining?

2016-08-10 Thread Regina Obe

> Michael Banck  writes:
>> As I've been bitten by this problem recently, I thought I'd take a 
>> look at editing the PostGIS extension SQL file to this end, but 
>> contrary to the above, the @extschema@ feature only applies to 
>> non-relocatable extensions, from src/backend/commands/extension.c:

>>   * If it's not relocatable, substitute the target schema name for
>>  * occurrences of @extschema@.
>>   *
>>   * For a relocatable extension, we needn't do this.  There cannot be
>>   * any need for @extschema@, else it wouldn't be relocatable.

>> I'm not sure that logic is sound - even if setting @extschema@ 
>> explicitly in the SQL functions bodies kills inlining (not sure about
>> that) or wouldn't help for other reasons, ISTM this should be 
>> reconsidered in the light of the use case with materialized views 
>> during restore.

> It's not simply a matter of allowing the substitution to occur while
> reading the extension script.  "Relocatable" means that we support ALTER
> EXTENSION SET SCHEMA, which means moving all the
> extension's objects into some new schema.  There's no good way to run
> around and find places where @extschema@ was replaced in order to change
> them to something else.

> Basically the point of @extschema@ is to support extensions that are
> relocatable at installation time, but not afterwards.

>   regards, tom lane

FWIW, in the upcoming PostGIS 2.3 we have made PostGIS non-relocatable and
now schema-qualify internal calls. I took Tom's suggestion of just using
@extschema@, which did mean we could no longer allow PostGIS to be
relocatable.  A bit of a bummer.

Setting search_path on functions, aside from killing inlining, also killed
performance in other ways, so that was a no-go. I'm not sure whether that is
a known issue; I haven't determined under what circumstances setting
search_path kills performance when index usage does not come into play.
I'll take it as known.
Here is an example of such a case:

https://trac.osgeo.org/postgis/ticket/3611

As for the fact that using @extschema@ requires the extension not to be
relocatable: that was a bummer, and something we would have to deal with
anyway if we ever forced everyone to install PostGIS in a specific schema so
that other extensions that rely on us can just know where PostGIS is
installed.  (Or, as Steve Frost suggested, dependent extensions could get a
way to specify the location of the extensions they depend on, with a code
such as @extschema_postgis@.  We already know of a bunch of extensions
relying on PostGIS: pgrouting, postgis_sfcgal, postgis_topology, and
postgis_tiger_geocoder.)

It would also be nice if the extension model gave extension authors a way to
handle the 'ALTER EXTENSION SET SCHEMA' event, short of monkeying with event
triggers.

Yes, we really need an extension authors list to iron out and hear about
these pain points.  :)

Thanks,
Regina





Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-10 Thread Claudio Freire
On Tue, Aug 9, 2016 at 11:39 PM, Jim Nasby  wrote:
> On 8/9/16 6:44 PM, Claudio Freire wrote:
>>
>> Since we can lookup all occurrences of k1=a index=0 and k2=a index=0,
>> and in fact we probably did so already as part of the update logic
>
>
> That's a change from what currently happens, right?
>
> The reason I think that's important is that dropping the assumption that we
> can't safely re-find index entries from the heap opens up other
> optimizations, ones that should be significantly simpler to implement. The
> most obvious example being getting rid of full index scans in vacuum. While
> that won't help with write amplification, it would reduce the cost of vacuum
> enormously. Orders of magnitude wouldn't surprise me in the least.
>
> If that's indeed a prerequisite to WARM it would be great to get that
> groundwork laid early so others could work on other optimizations it would
> enable.

I can do that. I've been prospecting the code to see what changes it
would entail already.

But it's still specific to btree; I'm not sure the same optimizations
can be applied to GIN (maybe, if the posting list is sorted) or GIST
(probably, since it's like a btree, but I don't know the code well
enough).

Certainly hash indexes won't support it.




Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-10 Thread Kevin Grittner
On Wed, Aug 10, 2016 at 12:31 PM, Andrew Gierth
 wrote:

> These could be limited to being testable only on a specified index, and
> not AM-wide:

>   predicate_locks (? maybe)

That one seems like it should either be at the AM level or not
included at all.  Where it would be interesting to know is if you
are a hacker looking for an AM to enhance with support, or (when
there is more than just btree supported, so it is not so easy to
remember) if you are a DBA investigating a high rate of
serialization failures and want to know whether indexes of a
certain type have index-relation predicate locking granularity or
something more fine-grained.  The latter use case seems plausible
once there is more of a mix of support among the AMs.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-10 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 >> - this still has everything in amapi.c rather than creating any new
 >> files. Also, the regression tests are in create_index.sql for lack
 >> of any obviously better place.

 Tom> This more than doubles the size of amapi.c, so it has a definite
 Tom> feel of tail-wags-dog for me, even if that seemed like an
 Tom> appropriate place otherwise which it doesn't really.  I think a
 Tom> new .c file under adt/ is the way to go, with extern declarations
 Tom> in builtins.h.

Yeah, I'm fine with that.  adt/amutils.c unless I see some better
suggestion.

 Tom> Maybe we need a new regression test case file too.  I don't much
 Tom> like adding this to create_index because (a) it doesn't
 Tom> particularly seem to match that file's purpose of setting up
 Tom> indexes on the standard regression test tables, and (b) that file
 Tom> is a bottleneck in parallel regression runs because it can't run
 Tom> in parallel with much else.

Good point. I looked around to see if anything was testing
pg_get_indexdef, thinking that that would be a good place, but it seems
that pg_get_indexdef is tested only in incidental ways (in collate and
rules, the latter of which tests it only with invalid input).

I'll do the next version with a new file, unless a better idea shows up.

 >> Comments?

 Tom> Why did you go with "capability" rather than "property" in the
 Tom> exposed function names?  The latter seems much closer to le mot
 Tom> juste, especially for things like asc/desc.

The first version (which dealt only with AMs) went with "capability"
because it was dealing with what the AM _could_ do rather than what was
defined on any specific index.

The second version added pg_index_column_has_property because that was
the name that had been used in discussion.

Changing them all to "property" would be more consistent I suppose.

 Tom> I'd personally cut the list of pg_am replacement properties way
 Tom> back, as I believe much of what you've got there is not actually
 Tom> of use to applications, and some of it is outright
 Tom> counterproductive.  An example is exposing amcanreturn as an
 Tom> index-AM boolean.

For AM-wide properties, it may be that they have to be considered
"lossy" when tested against the AM oid rather than on an individual
index or column - at the AM level, "false" might mean "this won't work"
while "true" would mean "this might work sometimes, not guaranteed to
work on every index".  The documentation should probably indicate this.

So these properties (I've changed all the names here, suggestions
welcome for better ones) I think should be testable on the AM, each with
an example of why:

  can_order
- if this is false, an admin tool shouldn't try and put ASC or DESC
  in a CREATE INDEX

  can_unique
- if this is false, an admin tool might, for example, want to not
  offer the user the option of CREATE UNIQUE INDEX with this AM

  can_multi_col
- if this is false, an admin tool might want to allow the user to
  select only one column

  can_exclude 
- a new property that indicates whether the AM can be used for
  exclusion constraints; at present this matches "amgettuple" but
  that implementation detail should of course be hidden

(One possible refinement here could be to invert the sense of all of
these, making them no_whatever, so that "false" and "null" could be
treated the same by clients. Or would that be too confusing?)

These could be limited to being testable only on a specified index, and
not AM-wide:

  can_backward
  clusterable
  index_scan
  bitmap_scan
  optional_key (? maybe)
  predicate_locks (? maybe)

And these for individual columns:

  can_return
  search_array  (? maybe)
  search_nulls  (? maybe)
  operator_orderable  (or distance_orderable? what's a good name?)
  orderable
  asc
  desc
  nulls_first
  nulls_last

A question: implementing can_return as a per-column property looks like
it requires actually opening the index rel, rather than just consulting
the syscache the way that most pg_get_* functions do. Should it always
open it, or only for properties that need it?
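
(For concreteness, a minimal sketch of the per-column case - invented function
name, error handling omitted - showing why the relation has to be opened to
reach the IndexAmRoutine:)

#include "access/amapi.h"
#include "access/genam.h"
#include "utils/rel.h"

/* Sketch: can column attno of this index return data in an index-only scan? */
static bool
index_column_can_return(Oid indexoid, int attno)
{
    Relation indexRel = index_open(indexoid, AccessShareLock);
    bool     result = false;

    if (indexRel->rd_amroutine->amcanreturn != NULL)
        result = indexRel->rd_amroutine->amcanreturn(indexRel, attno);

    index_close(indexRel, AccessShareLock);
    return result;
}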

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Ants Aasma
On Wed, Aug 10, 2016 at 7:54 PM, Alexander Korotkov
 wrote:
> Oh, I found that I underestimated complexity of async commit...  :)
>
> Do I understand right that now async commit right as follows?
> 1) Async transaction confirms commit before flushing WAL.
> 2) Other transactions sees effect of async transaction only when its WAL
> flushed.
> 3) In the session which just committed async transaction, effect of this
> transaction is visible immediately (before WAL flushed). Is it true?

Current code, simplified:

XactLogCommitRecord()
if (synchronous_commit)
 XLogFlush()
ProcArrayEndTransaction() // Become visible

The issue we are discussing is that with CSNs, the "become visible"
portion must occur in CSN order. If CSN == LSN, then async
transactions that have their commit record after a sync record must
wait for the sync record to flush and become visible. Simplest
solution is to not require CSN == LSN and just assign a CSN value
immediately before becoming visible.
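
A rough sketch of that, continuing the pseudocode above (argument lists still
omitted; the shared counter is invented):

XactLogCommitRecord()
if (synchronous_commit)
 XLogFlush()
csn = pg_atomic_fetch_add_u64(&ShmemVariableCache->nextCommitSeqNo, 1) // invented counter
ProcArrayEndTransaction() // Become visible, in CSN order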

Regards,
Ants Aasma




Re: [HACKERS] Assertion failure in REL9_5_STABLE

2016-08-10 Thread Alvaro Herrera
Marko Tiikkaja wrote:
> Hi,
> 
> Running one specific test from our application against REL9_5_STABLE
> (current as of today) gives me this:
> 
> #2  0x7effb59595be in ExceptionalCondition (
> conditionName=conditionName@entry=0x7effb5b27a88 "!(CritSectionCount > 0
> || TransactionIdIsCurrentTransactionId(( (!((tup)->t_infomask & 0x0800) &&
> ((tup)->t_infomask & 0x1000) && !((tup)->t_infomask & 0x0080)) ?
> HeapTupleGetUpdateXid(tup) : ( (tup)-"...,
> errorType=errorType@entry=0x7effb599b74b "FailedAssertion",
> fileName=fileName@entry=0x7effb5b2796c "combocid.c",
> lineNumber=lineNumber@entry=132) at assert.c:54
> #3  0x7effb598e19b in HeapTupleHeaderGetCmax (tup=0x7effa85714c8) at
> combocid.c:131
> #4  0x7effb55fb0c1 in heap_lock_tuple (relation=0x7effb5bf5d90,
> tuple=tuple@entry=0x7fffcee73690, cid=346,
> mode=mode@entry=LockTupleExclusive, wait_policy=LockWaitBlock,
> follow_updates=follow_updates@entry=1 '\001',
> buffer=buffer@entry=0x7fffcee7367c, hufd=hufd@entry=0x7fffcee73680) at
> heapam.c:4813

Umm.  AFAICS HeapTupleSatisfiesUpdate() only returns SelfUpdated after
already calling HeapTupleHeaderGetCmax() (which obviously hasn't caught
the same assertion).  Something is odd there ...

Can you share the test case?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Vladimir Sitnikov
Stephen> While it may have good results in many cases, it's not accurate to
say that using prepared statements will always be faster than not.

There's no silver bullet. <-- that is accurate, but it is useless for
end-user applications
I've never claimed that "server prepared statement" is a silver bullet.

I just claim that "PrepareBindExecuteDeallocate" message does have
justification from performance point of view.

Stephen Frost> Dropping and recreating the prepared statement is how that
particular issue is addressed.

Stephen,

The original problem is: "extended query execution is slow"
I recommend "let's just add a query cache"
You mention: "that does not work since one query in a year might get slower
on 6th execution"
I ask: "what should be done _instead_ of a query cache?"
You say: "the same query cache, but execute 5 times at most"

Does that mean you agree that "query cache is a way to go"?


I do not follow that. Of course, I could add "yet another debugger switch"
to pgjdbc so it executes server-prepared statements 5 times at most;
however, I consider this to be a PostgreSQL issue.
I do not want to hard-code "5" into the driver logic.

I do not like making "5 times maximum" the default mode.

Vladimir


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Greg Stark
On Wed, Aug 10, 2016 at 5:54 PM, Alexander Korotkov
 wrote:
> Oh, I found that I underestimated complexity of async commit...  :)

Indeed. I think Tom's attitude was right even if the specific
conclusion was wrong. While I don't think removing async commit is
viable I think it would be a laudable goal if we can remove some of
the complication in it. I normally describe async commit as "just like
a normal commit but don't block on the commit" but it's actually a bit
more complicated.

AIUI the additional complexity is that while async commits are visible
to everyone immediately, other non-async transactions can be committed
but then sit in limbo for a while before they become visible to others.
So other sessions will see the async commit "jump ahead" of any
non-async transactions even if those other transactions were committed
first. Any standbys will see the non-async transaction in the logs
before the async transaction, and in a crash it's possible to lose the
async transaction even though it was visible, while keeping the sync
transaction that was never visible.
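
To make the comparison concrete, here is a minimal SQL sketch of the two
commit flavours (the audit_log table is made up; synchronous_commit can be
set per transaction):

  BEGIN;
  SET LOCAL synchronous_commit = off;
  INSERT INTO audit_log VALUES (now(), 'async row');  -- hypothetical table
  COMMIT;   -- returns immediately; visible to everyone even before its
            -- WAL has been flushed

  BEGIN;    -- synchronous_commit = on (the default)
  INSERT INTO audit_log VALUES (now(), 'sync row');
  COMMIT;   -- blocks until its commit record is flushed, and only becomes
            -- visible to others at that point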

Complexity like this makes it hard to implement other features such as
CSNs. IIRC this already bit hot standby as well. I think it would be a
big improvement if we had a clear, well defined commit order that was
easy to explain and easy to reason about when new changes are being
made.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Heikki Linnakangas

On 08/10/2016 07:54 PM, Alexander Korotkov wrote:

Do I understand right that async commit now works as follows?
1) An async transaction confirms commit before flushing WAL.


Yes.


2) Other transactions see the effect of an async transaction only when its WAL
is flushed.


No. Other transactions also see the effects of the async transaction 
before it's flushed.



3) In the session which just committed the async transaction, the effect of
this transaction is visible immediately (before the WAL is flushed). Is that
true?


Yes. (The same session is not a special case, per previous point. All 
sessions see the async transaction as committed, even before it's flushed.)


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Stephen Frost
* Vladimir Sitnikov (sitnikov.vladi...@gmail.com) wrote:
> 3) "suddently get slow the 6th time" is a PostgreSQL bug that both fails to
> estimate cardinality properly, and it does not provide administrator a way
> to disable the feature (generic vs specific plan).

Dropping and recreating the prepared statement is how that particular
issue is addressed.
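
For the archives, a rough psql sketch of the behaviour being referred to
(the orders table is made up; the exact plan switch depends on the data and
server version):

  PREPARE q(int) AS SELECT count(*) FROM orders WHERE customer_id = $1;
  EXECUTE q(42);   -- executions 1..5 get custom plans built for the value
  EXECUTE q(42);
  EXECUTE q(42);
  EXECUTE q(42);
  EXECUTE q(42);
  EXECUTE q(42);   -- from the 6th execution on, the server may switch to a
                   -- cached generic plan, which can be far worse for
                   -- skewed parameter values
  DEALLOCATE q;    -- the workaround: drop and re-prepare the statement
  PREPARE q(int) AS SELECT count(*) FROM orders WHERE customer_id = $1;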

> A query cache does have very good results for the overall web page times, and
> problems like the "6th execution" are not that common.

While it may have good results in many cases, it's not accurate to say
that using prepared statements will always be faster than not.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Alexander Korotkov
On Wed, Aug 10, 2016 at 6:09 PM, Heikki Linnakangas  wrote:

> On 08/10/2016 05:51 PM, Tom Lane wrote:
>
>> Heikki Linnakangas  writes:
>>
>>> On 08/10/2016 05:09 PM, Tom Lane wrote:
>>>
 Uh, what?  That's not the semantics we have today, and I don't see why
 it's necessary or a good idea.  Once the commit is in the WAL stream,
 any action taken on the basis of seeing the commit must be later in
 the WAL stream.  So what's the problem?

>>>
>> I was talking about synchronous commits in the above. A synchronous
>>> commit is not made visible to other transactions, until the commit WAL
>>> record is flushed to disk.
>>>
>>
>> [ thinks for a bit... ]  Oh, OK, that's because we don't treat a
>> transaction as committed until its clog bit is set *and* it's not
>> marked as running in the PGPROC array.  And sync transactions will
>> flush WAL in between.
>>
>
> Right.
>
> Still, having to invent CSNs seems like a huge loss for this design.
>> Personally I'd give up async commit first.  If we had only sync commit,
>> the rule could be "xact LSN less than snapshot threshold and less than
>> WAL flush position", and we'd not need CSNs.  I know some people like
>> async commit, but it's there because it was easy and cheap in our old
>> design, not because it's the world's greatest feature and worth giving
>> up performance for.
>>
>
> I don't think that's a very popular opinion (I disagree, for one).
> Asynchronous commits are a huge performance boost for some applications.
> The alternative is fsync=off, and I don't want to see more people doing
> that. SSDs have made the penalty of an fsync much smaller, but it's still
> there.
>
> Hmm. There's one more possible way this could all work. Let's have CSN ==
> LSN, also for asynchronous commits. A snapshot is the current insert
> position, but also make note of the current flush position, when you take a
> snapshot. Now, when you use the snapshot, if you ever see an XID that
> committed between the snapshot's insert position and the flush position,
> wait for the WAL to be flushed up to the snapshot's insert position at that
> point. With that scheme, an asynchronous commit could return to the
> application without waiting for a flush, but if someone actually looks at
> the changes the transaction made, then that transaction would have to wait.
> Furthermore, we could probably skip that waiting too, if the reading
> transaction is also using synchronous_commit=off.
>
> That's slightly different from the current behaviour. A transaction that
> runs with synchronous_commit=on, and reads data that was modified by an
> asynchronous transaction, would take a hit. But I think that would be
> acceptable.


Oh, I found that I underestimated the complexity of async commit...  :)

Do I understand right that async commit now works as follows?
1) An async transaction confirms commit before flushing WAL.
2) Other transactions see the effect of an async transaction only when its WAL
is flushed.
3) In the session which just committed the async transaction, the effect of
this transaction is visible immediately (before the WAL is flushed). Is that
true?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Vladimir Sitnikov
Stephen>I encourage you to look through the archives

The thing is, pl/pgsql suffers from exactly the same problem.
pl/pgsql is not a typical language of choice (e.g. see the TIOBE index and
the like), so the probability of running into "prepared statement issues"
was low.

As more languages use server-prepared statements, the rate of such
issues will naturally increase.

JFYI: I did participate in those conversations, so I do not get which
particular point you are asking me to "look through" there.

Stephen Frost:

> And is the source of frequent complaints on various mailing lists along
> the lines of "why did my query suddently get slow the 6th time it was
> run?!".
>

I claim the following:
1) People run into such problems with pl/pgsql as well. pl/pgsql uses
exactly the same server-prepared logic. So what? pl/pgsql is allowed a query
cache, but other languages are forbidden from having one?
2) Such problematic queries are not that common.
3) "suddenly get slow the 6th time" is a PostgreSQL bug: it both fails to
estimate cardinality properly and does not provide the administrator a way
to disable the feature (generic vs specific plan).

4) Do you have a better solution? Of course, the planner is not perfect. Of
course it will have issues with wrong cardinality estimates. So what?
Should we completely abandon the optimizer?
I do not think so.
A query cache does have very good results for the overall web page times, and
problems like the "6th execution" are not that common.

By the way, other common problems are:
"cached plan must not change result type" -- PostgreSQL just fails to execute
the server-prepared statement if the table was altered.
"prepared statement does not exist" -- the application might issue
"deallocate all" for some unknown reason, so the driver has to keep an eye on
that.
"set search_path" vs "prepared statement" -- the prepared statement binds
by OIDs, so "search_path changes" should be accompanied by "deallocate all"
or the like.
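
A minimal sketch of those three failure modes, in case it is useful (error
wording from memory and may differ across server versions):

  CREATE TABLE demo (id int);
  PREPARE p AS SELECT * FROM demo;
  ALTER TABLE demo ADD COLUMN extra int;
  EXECUTE p;        -- ERROR: cached plan must not change result type

  DEALLOCATE ALL;   -- e.g. issued by a pooler or by the application itself
  EXECUTE p;        -- ERROR: prepared statement "p" does not exist

  CREATE SCHEMA a;  CREATE TABLE a.demo (id int);
  CREATE SCHEMA b;  CREATE TABLE b.demo (id int);
  SET search_path = a;
  PREPARE q AS SELECT * FROM demo;   -- resolved to a.demo by OID
  SET search_path = b;
  EXECUTE q;                         -- still reads a.demo, not b.demo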

Vladimir


Re: [HACKERS] Assertion failure in REL9_5_STABLE

2016-08-10 Thread Tom Lane
Marko Tiikkaja  writes:
> The failure is a number of levels down a call stack of PL/PgSQL 
> functions, but I can reproduce it at will by calling the function.  I 
> unfortunately can't narrow it down to a smaller test case, but attached 
> is an xlogdump of the session.  The query that finally breaks the 
> elephant's back is a SELECT .. FOR UPDATE from relid=54511.

> Any ideas on how to debug this?

If this test case used to pass, it'd be pretty useful to git bisect
to find where it started to fail.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Ants Aasma
On Wed, Aug 10, 2016 at 6:09 PM, Heikki Linnakangas  wrote:
> Hmm. There's one more possible way this could all work. Let's have CSN ==
> LSN, also for asynchronous commits. A snapshot is the current insert
> position, but also make note of the current flush position, when you take a
> snapshot. Now, when you use the snapshot, if you ever see an XID that
> committed between the snapshot's insert position and the flush position,
> wait for the WAL to be flushed up to the snapshot's insert position at that
> point. With that scheme, an asynchronous commit could return to the
> application without waiting for a flush, but if someone actually looks at
> the changes the transaction made, then that transaction would have to wait.
> Furthermore, we could probably skip that waiting too, if the reading
> transaction is also using synchronous_commit=off.
>
> That's slightly different from the current behaviour. A transaction that
> runs with synchronous_commit=on, and reads data that was modified by an
> asynchronous transaction, would take a hit. But I think that would be
> acceptable.

My proposal of vector clocks would allow for pretty much exactly
current behavior.

To simplify, there would be lastSyncCommitSeqNo and
lastAsyncCommitSeqNo variables in ShmemVariableCache. Transaction
commit would choose which one to update based on synchronous_commit
setting and embed the value of the setting into CSN log. Snapshots
would contain both values, when checking for CSN visibility use the
value of the looked up synchronous_commit setting to decide which
value to compare against. Standbys replaying commit records would
just update both values, resulting in transactions becoming visible in
xlog order, as they do today. The scheme would allow for inventing a
new xlog record/replication message communicating visibility ordering.

However I don't see why inventing a separate CSN concept is a large
problem. Quite the opposite, unless there is a good reason that I'm
missing, it seems better to not unnecessarily conflate commit record
durability and transaction visibility ordering. Not having them tied
together allows for an external source to provide CSN values, allowing
for interesting distributed transaction implementations. E.g. using a
timestamp as the CSN, a la Google Spanner and the TrueTime API.

Regards,
Ants Aasma


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Assertion failure in REL9_5_STABLE

2016-08-10 Thread Marko Tiikkaja

Hi,

Running one specific test from our application against REL9_5_STABLE 
(current as of today) gives me this:


#2  0x7effb59595be in ExceptionalCondition (
conditionName=conditionName@entry=0x7effb5b27a88 
"!(CritSectionCount > 0 || TransactionIdIsCurrentTransactionId(( 
(!((tup)->t_infomask & 0x0800) && ((tup)->t_infomask & 0x1000) && 
!((tup)->t_infomask & 0x0080)) ? HeapTupleGetUpdateXid(tup) : ( 
(tup)-"..., errorType=errorType@entry=0x7effb599b74b "FailedAssertion", 
fileName=fileName@entry=0x7effb5b2796c "combocid.c", 
lineNumber=lineNumber@entry=132) at assert.c:54
#3  0x7effb598e19b in HeapTupleHeaderGetCmax (tup=0x7effa85714c8) at 
combocid.c:131
#4  0x7effb55fb0c1 in heap_lock_tuple (relation=0x7effb5bf5d90, 
tuple=tuple@entry=0x7fffcee73690, cid=346, 
mode=mode@entry=LockTupleExclusive, wait_policy=LockWaitBlock, 
follow_updates=follow_updates@entry=1 '\001',
buffer=buffer@entry=0x7fffcee7367c, hufd=hufd@entry=0x7fffcee73680) 
at heapam.c:4813
#5  0x7effb5753e82 in ExecLockRows (node=node@entry=0x7effb6cebb00) 
at nodeLockRows.c:179
#6  0x7effb573cbc8 in ExecProcNode (node=node@entry=0x7effb6cebb00) 
at execProcnode.c:516
#7  0x7effb5739432 in ExecutePlan (dest=0x7effb5dd8160, 
direction=<optimized out>, numberTuples=0, 
sendTuples=1 '\001', operation=CMD_SELECT, planstate=0x7effb6cebb00, 
estate=0x7effb6ceb8f8) at execMain.c:1549
#8  standard_ExecutorRun (queryDesc=0x7effb6ae3b40, direction=<optimized out>, 
count=0) at execMain.c:337
#9  0x7effb57661db in _SPI_pquery (tcount=0, fire_triggers=1 '\001', 
queryDesc=0x7effb6ae3b40) at spi.c:2411


The failure is a number of levels down a call stack of PL/PgSQL 
functions, but I can reproduce it at will by calling the function.  I 
unfortunately can't narrow it down to a smaller test case, but attached 
is an xlogdump of the session.  The query that finally breaks the 
elephant's back is a SELECT .. FOR UPDATE from relid=54511.


Any ideas on how to debug this?


.m
rmgr: Standby len (rec/tot): 24/50, tx:  0, lsn: 
0/2428, prev 0/23000108, desc: RUNNING_XACTS nextXid 8495 
latestCompletedXid 8494 oldestRunningXid 8495
rmgr: Transaction len (rec/tot): 12/38, tx:   8496, lsn: 
0/2460, prev 0/2428, desc: ASSIGNMENT xtop 8495: subxacts: 8496
rmgr: Heaplen (rec/tot):  7/  1042, tx:   8496, lsn: 
0/2488, prev 0/2460, desc: LOCK off 10: xid 8496 LOCK_ONLY EXCL_LOCK 
KEYS_UPDATED , blkref #0: rel 1663/47473/50461 blk 0 FPW
rmgr: Heaplen (rec/tot):  7/  8246, tx:   8496, lsn: 
0/240004A0, prev 0/2488, desc: LOCK off 48: xid 8496 LOCK_ONLY EXCL_LOCK 
KEYS_UPDATED , blkref #0: rel 1663/47473/51018 blk 81 FPW
rmgr: Heaplen (rec/tot):  7/53, tx:   8496, lsn: 
0/240024F0, prev 0/240004A0, desc: LOCK off 49: xid 8496 LOCK_ONLY EXCL_LOCK 
KEYS_UPDATED , blkref #0: rel 1663/47473/51018 blk 81
rmgr: Sequencelen (rec/tot):158/   204, tx:   8496, lsn: 
0/24002528, prev 0/240024F0, desc: LOG rel 1663/47473/49387, blkref #0: rel 
1663/47473/49387 blk 0
rmgr: Sequencelen (rec/tot):158/   204, tx:  0, lsn: 
0/240025F8, prev 0/24002528, desc: LOG rel 1663/47473/49992, blkref #0: rel 
1663/47473/49992 blk 0
rmgr: Heaplen (rec/tot):  3/  4404, tx:   8497, lsn: 
0/240026C8, prev 0/240025F8, desc: INSERT off 74, blkref #0: rel 
1663/47473/49994 blk 18 FPW
rmgr: Btree   len (rec/tot):  2/  6961, tx:   8497, lsn: 
0/24003800, prev 0/240026C8, desc: INSERT_LEAF off 78, blkref #0: rel 
1663/47473/59412 blk 2 FPW
rmgr: Btree   len (rec/tot):  2/  6933, tx:   8497, lsn: 
0/24005350, prev 0/24003800, desc: INSERT_LEAF off 342, blkref #0: rel 
1663/47473/59414 blk 8 FPW
rmgr: Heaplen (rec/tot):  7/  4910, tx:   8497, lsn: 
0/24006E80, prev 0/24005350, desc: LOCK off 3: xid 8497 LOCK_ONLY KEYSHR_LOCK , 
blkref #0: rel 1663/47473/50005 blk 0 FPW
rmgr: Heaplen (rec/tot):  3/87, tx:   8498, lsn: 
0/240081C8, prev 0/24006E80, desc: INSERT off 75, blkref #0: rel 
1663/47473/49994 blk 18
rmgr: Btree   len (rec/tot):  2/  4273, tx:   8498, lsn: 
0/24008220, prev 0/240081C8, desc: INSERT_LEAF off 76, blkref #0: rel 
1663/47473/59412 blk 13 FPW
rmgr: Btree   len (rec/tot):  2/64, tx:   8498, lsn: 
0/240092D8, prev 0/24008220, desc: INSERT_LEAF off 343, blkref #0: rel 
1663/47473/59414 blk 8
rmgr: Heaplen (rec/tot):  7/53, tx:   8498, lsn: 
0/24009318, prev 0/240092D8, desc: LOCK off 29: xid 8498 LOCK_ONLY KEYSHR_LOCK 
, blkref #0: rel 1663/47473/50005 blk 0
rmgr: Heaplen (rec/tot):  3/  2397, tx:   8496, lsn: 
0/24009350, prev 0/24009318, desc: INSERT off 16, blkref #0: rel 
1663/47473/49389 blk 16 FPW
rmgr: Btree   len (rec/tot):  2/  5773, tx:   8496, lsn: 
0/24009CB0, prev 0/24009350, desc: INSERT_LEAF off 6, blkref #0: rel 
1663/47473/49400 blk 1 FPW
rmgr: Btree   len (rec/tot):  

Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Joshua D. Drake

On 08/10/2016 09:04 AM, Stephen Frost wrote:

* Joshua D. Drake (j...@commandprompt.com) wrote:

+1 for Robert here, removing async commit is a non-starter. It is
PostgreSQL performance 101 that you disable synchronous commit
unless you have a specific data/business requirement that needs it.
Specifically because of how much faster Pg is with async commit.


I agree that we don't want to get rid of async commit, but, for the
archive, I wouldn't recommend using it unless you specifically understand
and accept that trade-off, so I wouldn't lump it into a "PostgreSQL
performance 101" group- that's increasing work_mem, shared_buffers, WAL
size, etc.  Accepting that you're going to lose *committed* transactions
on a crash requires careful thought and consideration of what you're
going to do when that happens, not the other way around.


Yes, Stephen, you are correct, which is why I said "unless you have a 
specific data/business requirement that needs it".


Thanks!

jD



Thanks!

Stephen




--
Command Prompt, Inc.  http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Vladimir Sitnikov
Robert>But that makes it the job of every driver to implement some
sort of cache, which IMHO isn't a very reasonable position

Let's wait and see what Shay decides on implementing a query cache in npgsql.
Here's the issue:  https://github.com/npgsql/npgsql/issues/1237

He could change his mind when it comes to the need for a new
"ParseBindExecuteDeallocate" message.


Robert>I think you should consider the possibility that those people
know what they are talking about.

I do appreciate all the input; however, all the performance-related
complaints I've seen in this thread can trivially be solved by adding a
query cache at the driver level + fixing pgbouncer's issue.
Unfortunately, client-side SQL parsing is an inevitable evil (see below on
'?'), so any sane driver would cache queries anyway. The ones that do not
have a query cache will perform badly anyway.

As far as I can understand, the alternative solution is "to add a
ParseBindExecDeallocateInOneGo message to the protocol". This new message
would require changes to all the drivers and all the pgbouncers. It would
also be slower than a proper query cache, so why should we all spend time
on a half-baked solution?

Of course, I might be missing some use cases; that is why I keep asking:
please provide a close-to-production scenario that does require the new
protocol message we are talking about.
Note: examples (by Robert and Shay) like "typical web application that
fetches a single row from the DB and prints it to the browser" were already
presented, and they are easily covered by the query cache.



To be fair, implementing a cache is a trivial thing when compared with
hard-coding binary/text formats for all the datatypes in each and every
language.
Remember, each driver has to implement its own set of procedures to
input/output values in text/binary format, and that is way harder than
implementing the cache we are talking about.

If only there were some language-independent binary transfer
format (protobuf, avro, kryo, whatever)...

Regarding the query cache: each language has its own notion of how bind
variables are represented in SQL text.
For instance, in the Go language (as well as in Java), bind placeholders are
represented as the '?' character.
Of course, PostgreSQL does not support that (it wants $1, $2, etc), so the Go
driver has to parse the SQL text at the driver side in order to convert it
into a PostgreSQL-compatible flavor. This parser has to support comments,
string literals, etc. It is just natural to have a cache there so the
driver does not repeat the same parsing logic again and again (typical
applications are known to use the same query text multiple times).
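
To illustrate (the users table is made up, and the rewrite is shown with
SQL-level PREPARE for readability; a driver would use the extended-protocol
Parse/Bind/Execute messages instead):

  -- what the application hands to the driver (not valid PostgreSQL syntax):
  --   SELECT name FROM users WHERE id = ? AND status = ?
  -- what the driver must produce after its client-side parse/rewrite:
  PREPARE find_user(int, text) AS
    SELECT name FROM users WHERE id = $1 AND status = $2;
  EXECUTE find_user(42, 'active');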

Robert>When
there's a common need that affects users of many different programming
languages

You are right. No questions here.
OK, so what is the need? What problem does the "new message" solve?

From what I see, there is no performance need to introduce a
"ParseBindExecDeallocateInOneGo" message. The thread is
performance-related, so I naturally object to spending everybody's time on
implementing a useless feature.

Vladimir>Do you agree that the major part would be some hot queries, the
rest will be
Vladimir> much less frequently used ones (e.g. one time queries)?
Robert>Sure, but I don't want the application to have to know about that

The application does not need to know that. It is like branch prediction in
the CPU: the application does not need to know there is a branch predictor in
the CPU. The same goes for the query cache. The application should just keep
executing SQL in a sane manner, and the query cache will pick up the
pattern (server-preparing the frequently used queries).

Vladimir


Re: [HACKERS] Is there a way around function search_path killing SQL function inlining?

2016-08-10 Thread Tom Lane
Michael Banck  writes:
> As I've been bitten by this problem recently, I thought I'd take a look
> at editing the PostGIS extension SQL file to this end, but contrary to
> the above, the @extschema@ feature only applies to non-relocatable
> extensions, from src/backend/commands/extension.c:

>   * If it's not relocatable, substitute the target schema name for
>   * occurrences of @extschema@.
>   *
>   * For a relocatable extension, we needn't do this.  There cannot be
>   * any need for @extschema@, else it wouldn't be relocatable.

> I'm not sure that logic is sound - even if setting @extschema@ 
> explicitly in the SQL function bodies kills inlining (not sure about
> that) or wouldn't help for other reasons, ISTM this should be 
> reconsidered in the light of the use case with materialized views during
> restore.

It's not simply a matter of allowing the substitution to occur while
reading the extension script.  "Relocatable" means that we support
ALTER EXTENSION SET SCHEMA, which means moving all the extension's objects
into some new schema.  There's no good way to run around and find places
where @extschema@ was replaced in order to change them to something else.

Basically the point of @extschema@ is to support extensions that are
relocatable at installation time, but not afterwards.
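
A tiny made-up example of what the substitution buys a non-relocatable
extension (script excerpt; the control file would carry relocatable = false,
and the function bodies avoid a per-function SET search_path, which would
defeat inlining):

  -- myext--1.0.sql (hypothetical extension script)
  CREATE FUNCTION myext_helper(x int) RETURNS int
    LANGUAGE sql IMMUTABLE AS $$ SELECT x + 1 $$;

  -- @extschema@ is replaced with the installation schema exactly once,
  -- when CREATE EXTENSION runs this script, so the reference keeps
  -- working no matter what search_path callers use later.
  CREATE FUNCTION myext_twice(x int) RETURNS int
    LANGUAGE sql IMMUTABLE AS
    $$ SELECT @extschema@.myext_helper(@extschema@.myext_helper(x)) $$;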

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Stephen Frost
* Vladimir Sitnikov (sitnikov.vladi...@gmail.com) wrote:
> It works completely transparent to the application, and it does use
> server-prepared statements even though application builds "brand new" sql
> text every time.

And is the source of frequent complaints on various mailing lists along
the lines of "why did my query suddently get slow the 6th time it was
run?!".

I encourage you to look through the archives and read up on how and why
that can happen.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Stephen Frost
* Joshua D. Drake (j...@commandprompt.com) wrote:
> +1 for Robert here, removing async commit is a non-starter. It is
> PostgreSQL performance 101 that you disable synchronous commit
> unless you have a specific data/business requirement that needs it.
> Specifically because of how much faster Pg is with async commit.

I agree that we don't want to get rid of async commit, but, for the
archive, I wouldn't recommend using it unless you specifically understand
and accept that trade-off, so I wouldn't lump it into a "PostgreSQL
performance 101" group- that's increasing work_mem, shared_buffers, WAL
size, etc.  Accepting that you're going to lose *committed* transactions
on a crash requires careful thought and consideration of what you're
going to do when that happens, not the other way around.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Robert Haas
On Wed, Aug 10, 2016 at 11:50 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> Sure, but I don't want the application to have to know about that, and
>> I don't really think the driver should need to know about that either.
>> Your point, as I understand it, is that sufficiently good query
>> caching in the driver can ameliorate the problem, and I agree with
>> that.
>
> I don't, actually.  If a particular application has a query mix that gets
> a good win from caching query plans, it should already be using prepared
> statements.  The fact that that's not a particularly popular thing to do
> isn't simply because people are lazy, it's because they've found out that
> it isn't worth the trouble for them.  Putting query caching logic into
> drivers isn't going to improve performance for such cases, and it could
> very possibly make things worse.  The driver is the place with the
> absolute least leverage for implementing caching; it has no visibility
> into either the application or the server.

I didn't mean to imply, in any way, that it is an ideal solution -
just that people use it successfully.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Is there a way around function search_path killing SQL function inlining?

2016-08-10 Thread Michael Banck
On Thu, Mar 10, 2016 at 11:48:41AM -0500, Tom Lane wrote:
> If you're worried about preserving relocatability of an extension
> containing such functions, the @extschema@ feature might help.

As I've been bitten by this problem recently, I thought I'd take a look
at editing the PostGIS extension SQL file to this end, but contrary to
the above, the @extschema@ feature only applies to non-relocatable
extensions, from src/backend/commands/extension.c:

  * If it's not relocatable, substitute the target schema name for
  * occurrences of @extschema@.
  *
  * For a relocatable extension, we needn't do this.  There cannot be
  * any need for @extschema@, else it wouldn't be relocatable.

I'm not sure that logic is sound - even if setting @extschema@ 
explicitly in the SQL function bodies kills inlining (not sure about
that) or wouldn't help for other reasons, ISTM this should be 
reconsidered in the light of the use case with materialized views during
restore.


Best regards,

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Joshua D. Drake

On 08/10/2016 08:28 AM, Robert Haas wrote:

On Wed, Aug 10, 2016 at 11:09 AM, Heikki Linnakangas  wrote:

Still, having to invent CSNs seems like a huge loss for this design.
Personally I'd give up async commit first.  If we had only sync commit,
the rule could be "xact LSN less than snapshot threshold and less than
WAL flush position", and we'd not need CSNs.  I know some people like
async commit, but it's there because it was easy and cheap in our old
design, not because it's the world's greatest feature and worth giving
up performance for.


I don't think that's a very popular opinion (I disagree, for one).
Asynchronous commits are a huge performance boost for some applications. The
alternative is fsync=off, and I don't want to see more people doing that.
SSDs have made the penalty of an fsync much smaller, but it's still there.


Uh, yeah.  Asynchronous commit can be 100 times faster on some
realistic workloads.  If we remove it, many people will have to decide
between running with fsync=off and abandoning PostgreSQL altogether.
That doesn't strike me as a remotely possible line of attack.


+1 for Robert here, removing async commit is a non-starter. It is 
PostgreSQL performance 101 that you disable synchronous commit unless 
you have a specific data/business requirement that needs it. 
Specifically because of how much faster Pg is with async commit.


Sincerely,

jD







--
Command Prompt, Inc.  http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Tom Lane
Robert Haas  writes:
> Sure, but I don't want the application to have to know about that, and
> I don't really think the driver should need to know about that either.
> Your point, as I understand it, is that sufficiently good query
> caching in the driver can ameliorate the problem, and I agree with
> that.

I don't, actually.  If a particular application has a query mix that gets
a good win from caching query plans, it should already be using prepared
statements.  The fact that that's not a particularly popular thing to do
isn't simply because people are lazy, it's because they've found out that
it isn't worth the trouble for them.  Putting query caching logic into
drivers isn't going to improve performance for such cases, and it could
very possibly make things worse.  The driver is the place with the
absolute least leverage for implementing caching; it has no visibility
into either the application or the server.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-10 Thread Tom Lane
Andrew Gierth  writes:
>   - this still has everything in amapi.c rather than creating any new
> files. Also, the regression tests are in create_index.sql for lack
> of any obviously better place.

This more than doubles the size of amapi.c, so it has a definite feel
of tail-wags-dog for me, even if that otherwise seemed like an appropriate
place, which it doesn't really.  I think a new .c file under adt/ is
the way to go, with extern declarations in builtins.h.

Maybe we need a new regression test case file too.  I don't much like
adding this to create_index because (a) it doesn't particularly seem
to match that file's purpose of setting up indexes on the standard
regression test tables, and (b) that file is a bottleneck in parallel
regression runs because it can't run in parallel with much else.

> The list of column properties is:

>   ordered  - (same as "amcanorder" AM capability)
>   ordered_asc
>   ordered_desc
>   ordered_nulls_first
>   ordered_nulls_last

> If "ordered" is true then exactly one of _asc/_desc and exactly one of
> _nulls_first/_last will be true; if "ordered" is false then all the
> others will be false too.

Check, but do we need the "ordered_" prefixes on the last four?  Seems
like mostly useless typing.  And the name space here is never going
to be so large that it'd be confusing.

> Comments?

Why did you go with "capability" rather than "property" in the exposed
function names?  The latter seems much closer to le mot juste, especially
for things like asc/desc.

I'd expect the property keywords to be recognized case-insensitively,
as for example are the keywords in has_table_privilege() and its ilk.
That being the case, you should be using pg_strcasecmp().  This stuff:
if (namelen == 10 && memcmp(nameptr, "amcanorder", 10) == 0)
is ugly and hard to maintain as well as being unnecessarily non-user-
friendly.  Just convert the input to a C string and be done with it.
It's not like saving a couple of nanoseconds is critical here.

I'd personally cut the list of pg_am replacement properties way back,
as I believe much of what you've got there is not actually of use to
applications, and some of it is outright counterproductive.  An example
is exposing amcanreturn as an index-AM boolean.  That's just wrong these
days.  If you want that, it should be an index column property.  And
because what used to be an index-AM property no longer is in that case,
I think it's a fine example of why we shouldn't be reporting more than
the absolute minimum here.  It would tie our hands with respect to making
other improvements of that kind later.  (This might be a good argument for
moving as much as we can over to the index-column-property function, even
if it can't be set per-column today.  If there's a chance that it could be
per-column in the future, let's not call it an AM property.)

Also, while I see the point of the amgettuple and amgetbitmap properties,
I would call them something like "index_scan" and "bitmap_scan" because
that's the user's-eye view of what is supported.  I do not think we should
slavishly tie the property names to the old, never-intended-as-user-facing
column names.  I'd drop the "am" prefix, for starters.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Robert Haas
On Tue, Aug 9, 2016 at 5:07 PM, Vladimir Sitnikov
 wrote:
> I do not buy that "dynamically generated queries defeat server-prepared
> usage" argument. It is just not true (see below).
>
> Do you mean "in language X, where X != Java it is impossible to implement a
> query cache"?
> That is just ridiculus.

Well, multiple experienced users are telling you that this is a real
problem.  You certainly don't have to agree, but when a bunch of other
people who are smart and experienced think that something is a
problem, I think you should consider the possibility that those people
know what they are talking about.

> Do you agree that the major part would be some hot queries, the rest will be
> much less frequently used ones (e.g. one time queries)?

Sure, but I don't want the application to have to know about that, and
I don't really think the driver should need to know about that either.
Your point, as I understand it, is that sufficiently good query
caching in the driver can ameliorate the problem, and I agree with
that.  But that makes it the job of every driver to implement some
sort of cache, which IMHO isn't a very reasonable position.  When
there's a common need that affects users of many different programming
languages, the server should make it easy to meet that need, not
require every driver to implement query caching.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Robert Haas
On Wed, Aug 10, 2016 at 11:09 AM, Heikki Linnakangas  wrote:
>> Still, having to invent CSNs seems like a huge loss for this design.
>> Personally I'd give up async commit first.  If we had only sync commit,
>> the rule could be "xact LSN less than snapshot threshold and less than
>> WAL flush position", and we'd not need CSNs.  I know some people like
>> async commit, but it's there because it was easy and cheap in our old
>> design, not because it's the world's greatest feature and worth giving
>> up performance for.
>
> I don't think that's a very popular opinion (I disagree, for one).
> Asynchronous commits are a huge performance boost for some applications. The
> alternative is fsync=off, and I don't want to see more people doing that.
> SSDs have made the penalty of an fsync much smaller, but it's still there.

Uh, yeah.  Asynchronous commit can be 100 times faster on some
realistic workloads.  If we remove it, many people will have to decide
between running with fsync=off and abandoning PostgreSQL altogether.
That doesn't strike me as a remotely possible line of attack.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Heikki Linnakangas

On 08/10/2016 05:51 PM, Tom Lane wrote:

Heikki Linnakangas  writes:

On 08/10/2016 05:09 PM, Tom Lane wrote:

Uh, what?  That's not the semantics we have today, and I don't see why
it's necessary or a good idea.  Once the commit is in the WAL stream,
any action taken on the basis of seeing the commit must be later in
the WAL stream.  So what's the problem?



I was talking about synchronous commits in the above. A synchronous
commit is not made visible to other transactions, until the commit WAL
record is flushed to disk.


[ thinks for a bit... ]  Oh, OK, that's because we don't treat a
transaction as committed until its clog bit is set *and* it's not
marked as running in the PGPROC array.  And sync transactions will
flush WAL in between.


Right.


Still, having to invent CSNs seems like a huge loss for this design.
Personally I'd give up async commit first.  If we had only sync commit,
the rule could be "xact LSN less than snapshot threshold and less than
WAL flush position", and we'd not need CSNs.  I know some people like
async commit, but it's there because it was easy and cheap in our old
design, not because it's the world's greatest feature and worth giving
up performance for.


I don't think that's a very popular opinion (I disagree, for one). 
Asynchronous commits are a huge performance boost for some applications. 
The alternative is fsync=off, and I don't want to see more people doing 
that. SSDs have made the penalty of an fsync much smaller, but it's 
still there.


Hmm. There's one more possible way this could all work. Let's have CSN 
== LSN, also for asynchronous commits. A snapshot is the current insert 
position, but also make note of the current flush position, when you 
take a snapshot. Now, when you use the snapshot, if you ever see an XID 
that committed between the snapshot's insert position and the flush 
position, wait for the WAL to be flushed up to the snapshot's insert 
position at that point. With that scheme, an asynchronous commit could 
return to the application without waiting for a flush, but if someone 
actually looks at the changes the transaction made, then that 
transaction would have to wait. Furthermore, we could probably skip that 
waiting too, if the reading transaction is also using 
synchronous_commit=off.


That's slightly different from the current behaviour. A transaction that 
runs with synchronous_commit=on, and reads data that was modified by an 
asynchronous transaction, would take a hit. But I think that would be 
acceptable.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] No longer possible to query catalogs for index capabilities?

2016-08-10 Thread Andrew Gierth
Updated patch. Changes:

  - returns NULL rather than "cache lookup failed"

  - added pg_index_column_has_property (incl. docs)

  - added regression tests

Not changed / need consideration:

  - this still has everything in amapi.c rather than creating any new
files. Also, the regression tests are in create_index.sql for lack
of any obviously better place.

The list of column properties is:

  ordered  - (same as "amcanorder" AM capability)
  ordered_asc
  ordered_desc
  ordered_nulls_first
  ordered_nulls_last

If "ordered" is true then exactly one of _asc/_desc and exactly one of
_nulls_first/_last will be true; if "ordered" is false then all the
others will be false too. The intended usage is something like

  CASE WHEN pg_index_column_has_property(idx, attno, 'ordered_asc')
   THEN 'ASC'
   WHEN pg_index_column_has_property(idx, attno, 'ordered_desc')
   THEN 'DESC'
   ELSE ''  -- or NULL
  END
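
For context, a sketch of a whole catalog query using it, assuming the
(index oid, column number, property name) signature documented in the patch
('some_index' is a placeholder):

  SELECT a.attnum, a.attname,
         CASE WHEN pg_index_column_has_property(i.indexrelid, a.attnum, 'ordered_asc')
              THEN 'ASC'
              WHEN pg_index_column_has_property(i.indexrelid, a.attnum, 'ordered_desc')
              THEN 'DESC'
              ELSE ''
         END AS ordering
    FROM pg_index i
    JOIN pg_attribute a ON a.attrelid = i.indexrelid AND a.attnum > 0
   WHERE i.indexrelid = 'some_index'::regclass
   ORDER BY a.attnum;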

Comments?

-- 
Andrew (irc:RhodiumToad)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index ccb9b97..684f7b3 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -577,6 +577,89 @@

   
 
+  
+Capability information formerly stored in pg_am
+is now available via the functions
+pg_indexam_has_capability and
+pg_index_has_capability
+(see ). The following
+boolean-valued capability names are currently supported:
+  
+
+  
+   Capabilities
+
+   
+
+ 
+  Name
+  Description
+ 
+
+
+ 
+  amcanorder
+  Does the access method support ordered scans sorted by the
+   indexed column's value?
+ 
+ 
+  amcanorderbyop
+  Does the access method support ordered scans sorted by the result
+   of an operator on the indexed column?
+ 
+ 
+  amcanbackward
+  Does the access method support backward scanning?
+ 
+ 
+  amcanunique
+  Does the access method support unique indexes?
+ 
+ 
+  amcanmulticol
+  Does the access method support multicolumn indexes?
+ 
+ 
+  amoptionalkey
+  Does the access method support a scan without any constraint
+   for the first index column?
+ 
+ 
+  amsearcharray
+  Does the access method support ScalarArrayOpExpr searches?
+ 
+ 
+  amsearchnulls
+  Does the access method support IS NULL/NOT NULL searches?
+ 
+ 
+  amstorage
+  Can index storage data type differ from column data type?
+ 
+ 
+  amclusterable
+  Can an index of this type be clustered on?
+ 
+ 
+  ampredlocks
+  Does an index of this type manage fine-grained predicate locks?
+ 
+ 
+  amgettuple
+  Does the access method provide an amgettuple function?
+ 
+ 
+  amgetbitmap
+  Does the access method provide an amgetbitmap function?
+ 
+ 
+  amcanreturn
+  Does the access method support index-only scans?
+ 
+
+   
+  
+
  
 
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7830334..d2fe506 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -16290,6 +16290,18 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);

 

+pg_indexam_has_capability
+   
+
+   
+pg_index_column_has_property
+   
+
+   
+pg_index_has_capability
+   
+
+   
 pg_options_to_table

 
@@ -16477,6 +16489,21 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);
   number of columns, pretty-printing is implied
   
   
+   pg_indexam_has_capability(am_oid, cap_name)
+   boolean
+   Test whether an index access method has a specified capability
+  
+  
+   pg_index_column_has_property(index_oid, column_no, prop_name)
+   boolean
+   Test whether an index column has a specified property
+  
+  
+   pg_index_has_capability(index_oid, cap_name)
+   boolean
+   Test whether the access method for the specified index has a specified capability
+  
+  
pg_options_to_table(reloptions)
setof record
get the set of storage option name/value pairs
@@ -16620,6 +16647,73 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);
   
 
   
+   pg_indexam_has_capability and
+   pg_index_has_capability return whether the specified
+   access method, or the access method for the specified index, advertises the
+   named capability. NULL is returned if the capability
+   name is not known; true if the capability is advertised,
+   false if it is not. Refer
+   to  for capability names and their meanings.
+  
+
+  
+   pg_index_column_has_property returns whether the
+   specified index column possesses the named property.
+   NULL is returned if the property name is not
+   known; true if the property is present,
+   false if it is not. Index column property names and the
+   matching clauses of CREATE INDEX are given in
+   .
+  
+
+  
+   Index Column 

Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Tom Lane
Heikki Linnakangas  writes:
> On 08/10/2016 05:09 PM, Tom Lane wrote:
>> Uh, what?  That's not the semantics we have today, and I don't see why
>> it's necessary or a good idea.  Once the commit is in the WAL stream,
>> any action taken on the basis of seeing the commit must be later in
>> the WAL stream.  So what's the problem?

> I was talking about synchronous commits in the above. A synchronous 
> commit is not made visible to other transactions, until the commit WAL 
> record is flushed to disk.

[ thinks for a bit... ]  Oh, OK, that's because we don't treat a
transaction as committed until its clog bit is set *and* it's not
marked as running in the PGPROC array.  And sync transactions will
flush WAL in between.

Still, having to invent CSNs seems like a huge loss for this design.
Personally I'd give up async commit first.  If we had only sync commit,
the rule could be "xact LSN less than snapshot threshold and less than
WAL flush position", and we'd not need CSNs.  I know some people like
async commit, but it's there because it was easy and cheap in our old
design, not because it's the world's greatest feature and worth giving
up performance for.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wait events monitoring future development

2016-08-10 Thread Bruce Momjian
On Wed, Aug 10, 2016 at 11:37:36PM +0900, Satoshi Nagayasu wrote:
> Agreed.
> 
> If people are facing a difficult situation in terms of performance,
> they may accept some (one-time) overhead to resolve the issue.
> But if they don't have (recognize) any issue, they may not.
> 
> That's one of the realities, in my experience.

Yes.  Many people are arguing for specific defaults based on what _they_
would want, not what the average user would want.  Sophisticated users
will know about this and turn it on when desired.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wait events monitoring future development

2016-08-10 Thread Satoshi Nagayasu
2016/08/10 23:22 "Bruce Momjian" :
>
> On Wed, Aug 10, 2016 at 05:14:52PM +0300, Alexander Korotkov wrote:
> > On Tue, Aug 9, 2016 at 5:37 AM, Bruce Momjian  wrote:
> >
> > On Tue, Aug  9, 2016 at 02:06:40AM +, Tsunakawa, Takayuki wrote:
> > > I hope wait event monitoring will be on by default even if the
overhead
> > is not
> > > almost zero, because the data needs to be readily available for
faster
> > > troubleshooting.  IMO, the benefit would be worth even 10%
overhead.  If
> > you
> > > disable it by default because of overhead, how can we convince
users to
> > enable
> > > it in production systems to solve some performance problem?  I’m
afraid
> > severe
> > > users would say “we can’t change any setting that might cause more
> > trouble, so
> > > investigate the cause with existing information.”
> >
> > If you want to know why people are against enabling this monitoring
by
> > default, above is the reason.  What percentage of people do you
think
> > would be willing to take a 10% performance penalty for monitoring
like
> > this?  I would bet very few, but the argument above doesn't seem to
> > address the fact it is a small percentage.
> >
> >
> > Just two notes from me:
> >
> > 1) 10% overhead from monitoring wait events is just an idea without any
> > proof so soon.
> > 2) We already have functionality which trades insight into the database
> > for far larger overhead.  auto_explain.log_analyze = true can slow down
> > queries several times over.  Do you think we should remove it?
>
> The point is not removing it, the point is whether
> auto_explain.log_analyze = true should be enabled by default, and I
> think no one wants to do that.

Agreed.

If people are facing a difficult situation in terms of performance,
they may accept some (one-time) overhead to resolve the issue.
But if they don't have (recognize) any issue, they may not.

That's one of the realities, in my experience.

Regards,


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Heikki Linnakangas

On 08/10/2016 05:09 PM, Tom Lane wrote:

Heikki Linnakangas  writes:

Imagine that you have a stream of normal, synchronous, commits. They get
assigned LSNs: 1, 2, 3, 4. They become visible to other transactions in
that order.



The way I described this scheme in the first emails on this thread, was
to use the current WAL insertion position as the snapshot. That's not
correct, though: you have to use the current WAL *flush* position as the
snapshot. Otherwise you would see the results of a transaction that
hasn't been flushed to disk yet, i.e. which might still get lost, if you
pull the power plug before the flush happens. So you have to use the
last flush position as the snapshot.


Uh, what?  That's not the semantics we have today, and I don't see why
it's necessary or a good idea.  Once the commit is in the WAL stream,
any action taken on the basis of seeing the commit must be later in
the WAL stream.  So what's the problem?


I was talking about synchronous commits in the above. A synchronous 
commit is not made visible to other transactions, until the commit WAL 
record is flushed to disk.


You could argue that that doesn't need to be so, because indeed any 
action taken on the basis of seeing the commit must be later in the WAL 
stream. But that's what asynchronous commits are for. For synchronous 
commits, we have a higher standard.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wait events monitoring future development

2016-08-10 Thread Bruce Momjian
On Wed, Aug 10, 2016 at 05:14:52PM +0300, Alexander Korotkov wrote:
> On Tue, Aug 9, 2016 at 5:37 AM, Bruce Momjian  wrote:
> 
> On Tue, Aug  9, 2016 at 02:06:40AM +, Tsunakawa, Takayuki wrote:
> > I hope wait event monitoring will be on by default even if the overhead
> is not
> > almost zero, because the data needs to be readily available for faster
> > troubleshooting.  IMO, the benefit would be worth even 10% overhead.  If
> you
> > disable it by default because of overhead, how can we convince users to
> enable
> > it in production systems to solve some performance problem?  I’m afraid
> severe
> > users would say “we can’t change any setting that might cause more
> trouble, so
> > investigate the cause with existing information.”
> 
> If you want to know why people are against enabling this monitoring by
> default, above is the reason.  What percentage of people do you think
> would be willing to take a 10% performance penalty for monitoring like
> this?  I would bet very few, but the argument above doesn't seem to
> address the fact it is a small percentage.
> 
> 
> Just two notes from me:
> 
> 1) 10% overhead from monitoring wait events is just an idea without any proof
> so soon.
> 2) We already have functionality which trades insight into the database for
> far larger overhead.  auto_explain.log_analyze = true can slow down queries
> several times over.  Do you think we should remove it?

The point is not removing it, the point is whether
auto_explain.log_analyze = true should be enabled by default, and I
think no one wants to do that.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wait events monitoring future development

2016-08-10 Thread Alexander Korotkov
On Tue, Aug 9, 2016 at 5:37 AM, Bruce Momjian  wrote:

> On Tue, Aug  9, 2016 at 02:06:40AM +, Tsunakawa, Takayuki wrote:
> > I hope wait event monitoring will be on by default even if the overhead
> is not
> > almost zero, because the data needs to be readily available for faster
> > troubleshooting.  IMO, the benefit would be worth even 10% overhead.  If
> you
> > disable it by default because of overhead, how can we convince users to
> enable
> > it in production systems to solve some performance problem?  I’m afraid
> severe
> > users would say “we can’t change any setting that might cause more
> trouble, so
> > investigate the cause with existing information.”
>
> If you want to know why people are against enabling this monitoring by
> default, above is the reason.  What percentage of people do you think
> would be willing to take a 10% performance penalty for monitoring like
> this?  I would bet very few, but the argument above doesn't seem to
> address the fact it is a small percentage.
>

Just two notes from me:

1) 10% overhead from monitoring wait events is just an idea without any
proof so soon.
2) We already have functionality which trades insight into the database for
far larger overhead.  auto_explain.log_analyze = true can slow down
queries several times over.  Do you think we should remove it?
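
For context, this is the knob in question, as one would enable it for a
single investigative session (both parameters are existing auto_explain
settings):

  LOAD 'auto_explain';
  SET auto_explain.log_min_duration = 0;  -- log a plan for every statement
  SET auto_explain.log_analyze = true;    -- add actual row counts and timings
  SELECT count(*) FROM pg_class;          -- plan and timings go to the server log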

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Tom Lane
Heikki Linnakangas  writes:
> Imagine that you have a stream of normal, synchronous, commits. They get 
> assigned LSNs: 1, 2, 3, 4. They become visible to other transactions in 
> that order.

> The way I described this scheme in the first emails on this thread, was 
> to use the current WAL insertion position as the snapshot. That's not 
> correct, though: you have to use the current WAL *flush* position as the 
> snapshot. Otherwise you would see the results of a transaction that 
> hasn't been flushed to disk yet, i.e. which might still get lost, if you 
> pull the power plug before the flush happens. So you have to use the 
> last flush position as the snapshot.

Uh, what?  That's not the semantics we have today, and I don't see why
it's necessary or a good idea.  Once the commit is in the WAL stream,
any action taken on the basis of seeing the commit must be later in
the WAL stream.  So what's the problem?

> Now, if you do an asynchronous commit, the effects of that should become 
> visible immediately, without waiting for the next flush. You can't do 
> that, if its CSN == LSN.

This distinction is completely arbitrary, and unlike the way it works now.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wait events monitoring future development

2016-08-10 Thread Alexander Korotkov
On Tue, Aug 9, 2016 at 12:47 AM, Ilya Kosmodemiansky <
ilya.kosmodemian...@postgresql-consulting.com> wrote:

> On Mon, Aug 8, 2016 at 7:03 PM, Bruce Momjian  wrote:
> > It seems asking users to run pg_test_timing before deploying to check
> > the overhead would be sufficient.
>
> I'm not sure. Time measurement for waits is slightly more complicated
> than time measurement for EXPLAIN ANALYZE: a good workload plus
> using gettimeofday in a straightforward manner can cause huge
> overhead.


What makes you think so?  Both my thoughts and observations are the opposite:
it's way easier to get huge overhead from EXPLAIN ANALYZE than from
measuring wait events.  The current wait events are themselves quite
heavyweight, involving syscalls, context switches and so on. In contrast,
EXPLAIN ANALYZE calls gettimeofday for very cheap operations like
transferring a tuple from one executor node to another.
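
As a point of reference, this is the kind of cheap sampling people already
run against the 9.6 wait event columns; the question here is what it would
cost to collect timing on top of it:

  SELECT wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
   WHERE wait_event IS NOT NULL
   GROUP BY 1, 2
   ORDER BY count(*) DESC;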


> That's why proper testing is important - if we see a
> significant performance drop with, for example, large
> shared_buffers at the same concurrency, that shows gettimeofday is
> too expensive to use. Am I correct that we do not have such accurate
> tests now?
>

Do you think that large shared buffers are a kind of stress test for wait
events monitoring? If so, why?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Heikki Linnakangas

On 08/10/2016 04:34 PM, Alexander Korotkov wrote:

On Tue, Aug 9, 2016 at 3:16 PM, Heikki Linnakangas  wrote:


I switched to using a separate counter for CSNs. CSN is no longer the same
as the commit WAL record's LSN. While I liked the conceptual simplicity of
CSN == LSN a lot, and the fact that the standby would see the same commit
order as the master, I couldn't figure out how to make async commits work.


I didn't get the async commits problem at first glance.  AFAICS, the difference
between a sync commit and an async one is only that an async commit doesn't
wait for the WAL to be flushed.  But an async commit still receives an LSN.
Could you describe it in more detail?


Imagine that you have a stream of normal, synchronous, commits. They get 
assigned LSNs: 1, 2, 3, 4. They become visible to other transactions in 
that order.


The way I described this scheme in the first emails on this thread, was 
to use the current WAL insertion position as the snapshot. That's not 
correct, though: you have to use the current WAL *flush* position as the 
snapshot. Otherwise you would see the results of a transaction that 
hasn't been flushed to disk yet, i.e. which might still get lost, if you 
pull the power plug before the flush happens. So you have to use the 
last flush position as the snapshot.


Now, if you do an asynchronous commit, the effects of that should become 
visible immediately, without waiting for the next flush. You can't do 
that, if its CSN == LSN.


Perhaps you could make an exception for async commits, so that the CSN 
of an async commit is not the commit record's LSN, but the current WAL 
flush position, so that it becomes visible to others immediately. But 
then, how do you distinguish two async transactions that commit roughly 
at the same time, so that the flush position at time of commit is the 
same for both? And now you've given up on the CSN == LSN property, for 
async commits, anyway.
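
To make the scenario concrete, here is a sketch of the situation in SQL
(illustrative only, not code from the patch; "t" is a stand-in table, and
synchronous_commit = off is the stock way to request an async commit):

    -- Session 1: the async commit returns before its WAL record is flushed.
    BEGIN;
    SET LOCAL synchronous_commit = off;
    INSERT INTO t VALUES (1);
    COMMIT;   -- returns immediately; the commit record may not be on disk yet

    -- Session 2, immediately afterwards: the inserted row must already be
    -- visible, which a "snapshot = last flushed LSN" rule would not allow.
    SELECT count(*) FROM t;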


- Heikki





Re: [HACKERS] Curing plpgsql's memory leaks for statement-lifespan values

2016-08-10 Thread Pavel Stehule
2016-08-10 11:25 GMT+02:00 Pavel Stehule :

> Hi
>
> 2016-07-27 16:49 GMT+02:00 Tom Lane :
>
>> Robert Haas  writes:
>> > On Mon, Jul 25, 2016 at 6:04 PM, Tom Lane  wrote:
>> >> I suppose that a fix based on putting PG_TRY blocks into all the
>> affected
>> >> functions might be simple enough that we'd risk back-patching it, but
>> >> I don't really want to go that way.
>>
>> > try/catch blocks aren't completely free, either, and PL/pgsql is not
>> > suffering from a deplorable excess of execution speed.
>>
>> BTW, just to annotate that a bit: I did some measurements and found out
>> that on my Linux box, creating/deleting a memory context
>> (AllocSetContextCreate + MemoryContextDelete) is somewhere around 10x
>> more expensive than a PG_TRY block.  This means that the PG_TRY approach
>> would actually be faster for cases involving only a small number of
>> statements-needing-local-storage within a single plpgsql function
>> execution.  However, the memory context creation cost is amortized across
>> repeated executions of a statement, whereas of course PG_TRY won't be.
>> We can roughly estimate that PG_TRY would lose any time we loop through
>> the statement in question more than circa ten times.  So I believe the
>> way I want to do it will win speed-wise in cases where it matters, but
>> it's not entirely an open-and-shut conclusion.
>>
>> Anyway, there are enough other reasons not to go the PG_TRY route.
>>
>
> I did some synthetic benchmarks related to plpgsql speed - a bubble sort and
> a loop over exception handling - and I don't see any slowdown.
>
> Handling exceptions is a little bit faster with your patch:
>
>  CREATE OR REPLACE FUNCTION public.loop_test(a integer)
>  RETURNS void
>  LANGUAGE plpgsql
> AS $function$
> declare x int;
> begin
>   for i in 1..a
>   loop
> declare s text;
> begin
>   s := 'AHOJ';
>   x := (random()*1000)::int;
>   raise exception '*';
> exception when others then
>   x := 0; --raise notice 'handled';
> end;
>   end loop;
> end;
> $function$
>
> head - 100K loops 640ms, patched 610ms
>
> Regards
>

Hi

I am sending a review of this patch:

1. There was no problem with patching and compilation.
2. All tests passed without any problem.
3. The advantages and disadvantages were discussed in detail in this thread -
the selected approach is good.

+ the code is a little bit reduced and cleaned up
+ the special context can be used in the future

4. New regression tests or documentation are not necessary for this patch.
5. I didn't find any performance slowdown in targeted tests - the impact on
performance in real code should be insignificant.
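
For reference, the 100K-loop timing quoted earlier in this thread would
correspond to an invocation along these lines (the exact harness was not
shown, so this is only a guess):

    \timing on
    SELECT public.loop_test(100000);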

I'll mark this patch as ready for committer.

Regards

Pavel


>
> Pavel
>
>
>
>> regards, tom lane
>>
>>
>>
>
>


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Alexander Korotkov
On Tue, Aug 9, 2016 at 3:16 PM, Heikki Linnakangas  wrote:

> (Reviving an old thread)
>
> I spent some time dusting off this old patch, to implement CSN snapshots.
> Attached is a new patch, rebased over current master, and with tons of
> comments etc. cleaned up. There's more work to be done here, I'm posting
> this to let people know I'm working on this again. And to have a backup on
> the 'net :-).
>

Great! It's very nice to see that you're working on this patch.
After PGCon I had your patch rebased onto master, but you have already done
much more.


> I switched to using a separate counter for CSNs. CSN is no longer the same
> as the commit WAL record's LSN. While I liked the conceptual simplicity of
> CSN == LSN a lot, and the fact that the standby would see the same commit
> order as the master, I couldn't figure out how to make async commits work.
>

I didn't get the async commits problem at first glance.  AFAICS, the difference
between a sync commit and an async one is only that an async commit doesn't
wait for the WAL to be flushed.  But an async commit still receives an LSN.
Could you describe it in more detail?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Alexander Korotkov
On Wed, Aug 10, 2016 at 2:10 PM, Heikki Linnakangas  wrote:

> Yeah, if the csnlog access turns out to be too expensive, we could do
> something like this. In theory, you can always convert a CSN snapshot into
> an old-style list of XIDs, by scanning the csnlog between the xmin and
> xmax. That could be expensive if the distance between xmin and xmax is
> large, of course. But as you said, you can have various hybrid forms, where
> you use a list of XIDs of some range as a cache, for example.
>
> I'm hopeful that we can simply make the csnlog access fast enough, though.
> Looking up an XID in a sorted array is O(log n), while looking up an XID in
> the csnlog is O(1). That ignores all locking and different constant
> factors, of course, but it's not a given that accessing the csnlog has to
> be slower than a binary search of an XID array.


FYI, I'm still a fan of the idea of rewriting the XID with the CSN in the
tuple, in the same way we write hint bits now.
I'm going to build a prototype of this approach, which should be enough for
performance measurements.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] multivariate statistics (v19)

2016-08-10 Thread Ants Aasma
On Wed, Aug 3, 2016 at 4:58 AM, Tomas Vondra
 wrote:
> 2) combining multiple statistics
>
> I think the ability to combine multivariate statistics (covering different
> subsets of conditions) is important and useful, but I'm starting to think
> that the current implementation may not be the correct one (which is why I
> haven't written the SGML docs about this part of the patch series yet).

While researching this topic a few years ago I came across a paper on
exactly this subject, "Consistently Estimating the Selectivity of
Conjuncts of Predicates" [1]. While effective, it seems to be quite
heavyweight, so it would probably need support for tiered optimization.

[1] 
https://courses.cs.washington.edu/courses/cse544/11wi/papers/markl-vldb-2005.pdf

Regards,
Ants Aasma




Re: [HACKERS] multivariate statistics (v19)

2016-08-10 Thread Michael Paquier
On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek  wrote:
> On 10/08/16 13:33, Tomas Vondra wrote:
>>
>> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>>>
>>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra

 2) combining multiple statistics


 I think the ability to combine multivariate statistics (covering
 different
 subsets of conditions) is important and useful, but I'm starting to
 think
 that the current implementation may not be the correct one (which is
 why I
 haven't written the SGML docs about this part of the patch series yet).

 Assume there's a table "t" with 3 columns (a, b, c), and that we're
 estimating query:

SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

 but that we only have two statistics (a,b) and (b,c). The current
 patch does
 about this:

P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

 i.e. it estimates the first two conditions using (a,b), and then
 estimates
 (c=3) using (b,c) with "b=2" as a condition. Now, this is very
 efficient,
 but it only works as long as the query contains conditions
 "connecting" the
 two statistics. So if we remove the "b=2" condition from the query, this
 stops working.
>>>
>>>
>>> This is trying to make the algorithm smarter than the user, which is
>>> something I'd think we could live without. In this case statistics on
>>> (a,c) or (a,b,c) are missing. And what if the user does not want to
>>> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>>>
>>
>> I don't think so. Obviously, if you have statistics covering all the
>> conditions - great, we can't really do better than that.
>>
>> But there's a crucial relation between the number of dimensions of the
>> statistics and accuracy of the statistics. Let's say you have statistics
>> on 8 columns, and you split each dimension twice to build a histogram -
>> that's 256 buckets right there, and we only get ~50% selectivity in each
>> dimension (the actual histogram building algorithm is more complex, but
>> you get the idea).
>
> I think it makes sense to pursue this, but I also think we can easily live
> with not having it in the first version that gets committed and doing it as
> a follow-up patch.

This patch is large and complicated enough. As this is not a mandatory
piece for basic support, I'd suggest just leaving it for later.
--
Michael




Re: [HACKERS] multivariate statistics (v19)

2016-08-10 Thread Michael Paquier
On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra
 wrote:
> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>> Patch 0001: there have been comments about that before, and you have
>> put the checks on RestrictInfo in a couple of variables of
>> pull_varnos_walker, so nothing to say from here.
>>
>
> I don't follow. Are you suggesting 0001 is a reasonable fix, or that there's
> a proposed solution?

I think that's reasonable.

>> Patch 0002:
>> +  
>> +   CREATE STATISTICS will create a new multivariate
>> +   statistics on the table. The statistics will be created in the in the
>> +   current database. The statistics will be owned by the user issuing
>> +   the command.
>> +  
>> s/in the in the/in the/.
>>
>> +  
>> +   Create table t1 with two functionally dependent columns, i.e.
>> +   knowledge of a value in the first column is sufficient for detemining the
>> +   value in the other column. Then functional dependencies are built on those
>> +   columns:
>> s/detemining/determining/
>>
>> +  
>> +   If a schema name is given (for example, CREATE STATISTICS
>> +   myschema.mystat ...) then the statistics is created in the specified
>> +   schema.  Otherwise it is created in the current schema.  The name of
>> +   the table must be distinct from the name of any other statistics in the
>> +   same schema.
>> +  
>> I would just assume that a statistics is located on the schema of the
>> relation it depends on. So the thing that may be better to do is just:
>> - Register the OID of the table a statistics depends on but not the
>> schema.
>> - Give up on those query extensions related to the schema.
>> - Allow the same statistics name to be used for multiple tables.
>> - Just fail if a statistics name is being reused on the table again.
>> It may be better to complain about that even if the column list is
>> different.
>> - Register the dependency between the statistics and the table.
>
> The idea is that the syntax should work even for statistics built on
> multiple tables, e.g. to provide better statistics for joins. That's why the
> schema may be specified (as each table might be in different schema), and so
> on.

So you mean that the same statistics could be shared between tables?
But as this is visibly not a concept introduced yet in this set of
patches, why not just cut it off for now to simplify the whole? If
there is no schema-related field in pg_mv_statistics we could still
add it later if it proves to be useful.

>> +
>>  /*
>> Spurious noise in the patch.
>>
>> +   /* check that at least some statistics were requested */
>> +   if (!build_dependencies)
>> +   ereport(ERROR,
>> +   (errcode(ERRCODE_SYNTAX_ERROR),
>> +errmsg("no statistics type (dependencies) was requested")));
>> So, WITH (dependencies) is mandatory in any case. Why not just
>> dropping it from the first cut then?
>
>
> Because the follow-up patches extend this to require at least one statistics
> type. So in 0004 it becomes
>
> if (!(build_dependencies || build_mcv))
>
> and in 0005 it's
>
> if (!(build_dependencies || build_mcv || build_histogram))
>
> We might drop it from 0002 (and assume build_dependencies=true), and then
> add the check in 0004. But it seems a bit pointless.

This is a complicated set of patches. I'd think that we should try to
simplify things as much as possible first, and the WITH clause is not
mandatory to have as of 0002.

>> The statistics definition reorders the columns by itself depending on their
>> order. For example:
>> create table aa (a int, b int);
>> create statistics aas on aa(b, a) with (dependencies);
>> \d aa
>> "public.aas" (dependencies) ON (a, b)
>> As this defines a correlation between multiple columns, isn't it wrong
>> to assume that (b, a) and (a, b) are always the same correlation? I
>> don't recall such properties as being always commutative (old
>> memories, I suck at stats in general). [...reading README...] So this
>> is caused by an implementation limitation that restricts the analysis
>> to interactions between two columns. Still, it seems incorrect
>> to reorder the user-visible portion.
>
> I don't follow. If you talk about Pearson's correlation, that clearly does
> not depend on the order of columns - it's perfectly independent of that. If
> you talk about correlation in the wider sense (i.e. arbitrary
> dependence between columns), that might depend - but I don't remember a
> single piece of the patch where this might be a problem.

Yes, based on what is done in the patch that may not be a problem, but
I am wondering if this is not restricting things too much.

> Also, which README states that we can only analyze interactions between two
> columns? That's pretty clearly not the case - the patch should handle
> dependencies between more columns without any problems.

I have noticed that the patch evaluates all the possible permutations of a
column list; it seems to me though that 

Re: [HACKERS] multivariate statistics (v19)

2016-08-10 Thread Petr Jelinek

On 10/08/16 13:33, Tomas Vondra wrote:

On 08/10/2016 06:41 AM, Michael Paquier wrote:

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra

2) combining multiple statistics

I think the ability to combine multivariate statistics (covering
different
subsets of conditions) is important and useful, but I'm starting to
think
that the current implementation may not be the correct one (which is
why I
haven't written the SGML docs about this part of the patch series yet).

Assume there's a table "t" with 3 columns (a, b, c), and that we're
estimating query:

   SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

but that we only have two statistics (a,b) and (b,c). The current
patch does
about this:

   P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

i.e. it estimates the first two conditions using (a,b), and then
estimates
(c=3) using (b,c) with "b=2" as a condition. Now, this is very
efficient,
but it only works as long as the query contains conditions
"connecting" the
two statistics. So if we remove the "b=2" condition from the query, this
stops working.


This is trying to make the algorithm smarter than the user, which is
something I'd think we could live without. In this case statistics on
(a,c) or (a,b,c) are missing. And what if the user does not want to
make use of stats for (a,c) because he only defined (a,b) and (b,c)?



I don't think so. Obviously, if you have statistics covering all the
conditions - great, we can't really do better than that.

But there's a crucial relation between the number of dimensions of the
statistics and accuracy of the statistics. Let's say you have statistics
on 8 columns, and you split each dimension twice to build a histogram -
that's 256 buckets right there, and we only get ~50% selectivity in each
dimension (the actual histogram building algorithm is more complex, but
you get the idea).



I think it makes sense to pursue this, but I also think we can easily
live with not having it in the first version that gets committed and
doing it as a follow-up patch.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] multivariate statistics (v19)

2016-08-10 Thread Tomas Vondra

On 08/10/2016 06:41 AM, Michael Paquier wrote:

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
 wrote:

...

But more importantly, I think we'll need to show some of the data in EXPLAIN
output. With per-column statistics it's fairly straightforward to determine
which statistics are used and how. But with multivariate stats things are
often more complicated - there may be multiple candidate statistics (e.g.
histograms covering different subsets of the conditions), it's possible to
apply them in different orders, etc.

But EXPLAIN can't show the info if it's ephemeral and available only within
clausesel.c (and thrown away after the estimation).


This gives a good reason not to do that in clausesel.c; it would be
really cool to be able to get some information regarding the stats
used with a simple EXPLAIN.



I think there are two separate questions:

(a) Whether the query plan is "enriched" with information about 
statistics, or whether this information is ephemeral and available only 
in clausesel.c.


(b) Where exactly this enrichment happens.

Theoretically we might enrich the query plan (add nodes with info about 
the statistics), so that EXPLAIN gets the info, and it might still 
happen in clausesel.c.



2) combining multiple statistics

I think the ability to combine multivariate statistics (covering different
subsets of conditions) is important and useful, but I'm starting to think
that the current implementation may not be the correct one (which is why I
haven't written the SGML docs about this part of the patch series yet).

Assume there's a table "t" with 3 columns (a, b, c), and that we're
estimating query:

   SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3

but that we only have two statistics (a,b) and (b,c). The current patch does
about this:

   P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)

i.e. it estimates the first two conditions using (a,b), and then estimates
(c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient,
but it only works as long as the query contains conditions "connecting" the
two statistics. So if we remove the "b=2" condition from the query, this
stops working.


This is trying to make the algorithm smarter than the user, which is
something I'd think we could live without. In this case statistics on
(a,c) or (a,b,c) are missing. And what if the user does not want to
make use of stats for (a,c) because he only defined (a,b) and (b,c)?



I don't think so. Obviously, if you have statistics covering all the 
conditions - great, we can't really do better than that.


But there's a crucial relation between the number of dimensions of the 
statistics and accuracy of the statistics. Let's say you have statistics 
on 8 columns, and you split each dimension twice to build a histogram - 
that's 256 buckets right there, and we only get ~50% selectivity in each 
dimension (the actual histogram building algorithm is more complex, but 
you get the idea).


I see this as probably the most interesting part of the patch, and quite 
useful. But we'll definitely get the single-statistics estimate first, 
no doubt about that.
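
For reference, the setup in the quoted example can be written down with the
CREATE STATISTICS syntax from this patch series roughly as follows (the
statistics names are made up, and WITH (dependencies) is simply the option
shown elsewhere in this thread):

    CREATE TABLE t (a int, b int, c int);
    CREATE STATISTICS s_ab ON t (a, b) WITH (dependencies);
    CREATE STATISTICS s_bc ON t (b, c) WITH (dependencies);

    -- The estimate for the three-way condition can then chain the two
    -- statistics through the shared column b:
    --   P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
    -- Drop "b = 2" from the query and the chain between s_ab and s_bc is gone.
    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;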



Patch 0001: there have been comments about that before, and you have
put the checks on RestrictInfo in a couple of variables of
pull_varnos_walker, so nothing to say from here.



I don't follow. Are you suggesting 0001 is a reasonable fix, or that 
there's a proposed solution?



Patch 0002:
+  
+   CREATE STATISTICS will create a new multivariate
+   statistics on the table. The statistics will be created in the in the
+   current database. The statistics will be owned by the user issuing
+   the command.
+  
s/in the in the/in the/.

+  
+   Create table t1 with two functionally dependent columns, i.e.
+   knowledge of a value in the first column is sufficient for detemining the
+   value in the other column. Then functional dependencies are built on those
+   columns:
s/detemining/determining/

+  
+   If a schema name is given (for example, CREATE STATISTICS
+   myschema.mystat ...) then the statistics is created in the specified
+   schema.  Otherwise it is created in the current schema.  The name of
+   the table must be distinct from the name of any other statistics in the
+   same schema.
+  
I would just assume that a statistics is located on the schema of the
relation it depends on. So the thing that may be better to do is just:
- Register the OID of the table a statistics depends on but not the schema.
- Give up on those query extensions related to the schema.
- Allow the same statistics name to be used for multiple tables.
- Just fail if a statistics name is being reused on the table again.
It may be better to complain about that even if the column list is
different.
- Register the dependency between the statistics and the table.


The idea is that the syntax should work even for statistics built on 
multiple tables, e.g. to provide better statistics for joins. That's why 
the schema may be specified (as each table 

Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Heikki Linnakangas

On 08/10/2016 01:03 PM, Ants Aasma wrote:

On Tue, Aug 9, 2016 at 3:16 PM, Heikki Linnakangas  wrote:

(Reviving an old thread)

I spent some time dusting off this old patch, to implement CSN snapshots.
Attached is a new patch, rebased over current master, and with tons of
comments etc. cleaned up. There's more work to be done here, I'm posting
this to let people know I'm working on this again. And to have a backup on
the 'net :-).


Great to hear. I hope to find some time to review this. In the
meanwhile here is a brain dump of my thoughts on the subject.


Thanks!


* Performance testing. Clearly this should have a performance benefit, at
least under some workloads, to be worthwhile. And not regress.


I guess this is the main question. GetSnapshotData is probably going
to be faster, but the second important benefit that I saw was the
possibility of reducing locking around common activities. This of
course needs to be guided by profiling - evils of premature optimization
and all that.


Yep.

For the record, one big reason I'm excited about this is that a snapshot 
can be represented as small fixed-size value. And the reason that's 
exciting is that it makes it more feasible for each backend to publish 
their snapshots, which in turn makes it possible to vacuum old tuples 
more aggressively. In particular, if you have one very long-running 
transaction (think pg_dump), and a stream of transactions, you could 
vacuum away tuples that were inserted after the long-running transaction 
started and deleted before any of the shorter transactions started. You 
could do that without CSN snapshots, but it's a lot simpler with them. 
But that's a separate story.



But that said, my gut tells me that the most important spots are:

* Looking up CSNs for XidVisibleInSnapshot. The current version gets a
shared lock on CSNLogControlLock. That might get nasty, for example
when two concurrent sequential scans running on different cores or
sockets hit a patch of transactions within their xmin-xmax interval.


I'm not too worried about that particular case. Shared LWLocks scale 
pretty well these days, thanks to Andres Freund's work in this area. A 
mix of updates and reads on the csnlog might become a scalability 
problem though. We'll find out once we start testing.



* GetSnapshotData. With tps on read only queries potentially pushing
into the millions, it would be awfully nice if contending on a single
cache line could be avoided. Currently it acquires ProcArrayLock and
CommitSeqNoLock to atomically update MyPgXact->snapshotcsn. The
pattern I had in mind to make this lock free would be to read a CSN,
publish the value, check against the latest OldestGlobalSnapshot value and
loop if necessary; on the other side, calculate an optimistic oldest
snapshot, publish it, then recheck and, if necessary, republish.


Yeah, I haven't spent any serious effort into optimizing this yet. For 
the final version, should definitely do something like that.



* While commits are not as performance critical it may still be
beneficial to defer clog updates and perform them in larger batches to
reduce clog lock traffic.


Hmm. I doubt batching them is going to help much. But it should be 
straightforward to use atomic operations here too, to reduce locking.



I think I have achieved some clarity on my idea of snapshots that
migrate between being CSN and XID based. The essential problem with
old CSN based snapshots is that the CSN log for their xmin-xmax
interval needs to be kept around in a quick-to-access data structure.
And the problem with long-running transactions is twofold: first, they
need a well-defined location where to publish the visibility
info for their xids, and second, they enlarge the xmin-xmax interval
of all concurrent snapshots, needing a potentially huge amount of CSN
log to be kept around. One way to deal with it is to just accept it
and use an SLRU as in this patch.

My new insight is that a snapshot doesn't need to be either CSN based or
XIP (xid in progress) array based, but can also be a hybrid. There
would be a sorted array of in-progress xids and a non-overlapping CSN
xmin-xmax interval where the CSN log needs to be consulted. As snapshots
age they scan the CSN log and incrementally build their XIP array,
basically lazily constructing the same data structure used in snapshots
today. Old in-progress xids need a slot for themselves to be kept
around, but all new snapshots being taken would immediately classify
them as old, store them in XIP and not include them in their CSN xmin.
Under this scheme the amount of CSN log strictly needed is a
reasonably sized ring buffer for recent xids (probably sized based on
max conn), a sparse map for long transactions (bounded by max conn)
and some slack for snapshots still being converted. A SLRU or similar
is still needed because there is no way to ensure timely conversion of
all snapshots, but by properly timing the notification the probability
of actually using it should be 

Re: [HACKERS] Declarative partitioning

2016-08-10 Thread Amit Langote
On 2016/08/10 19:18, Ashutosh Bapat wrote:
> The FOR VALUES clause of a partition does not allow a constant expression like
> (1/5 - 1). It gives a syntax error:
> regression=# create table pt1_p1 partition of pt1 for values start (0) end
> ((1/5) - 1);
> ERROR:  syntax error at or near "("
> LINE 1: ...pt1_p1 partition of pt1 for values start (0) end ((1/5) ...
> 
> Shouldn't we allow constant expressions here?
> 
> If this has been already discussed, please forgive me and point out the
> relevant mail chain.

Yes, the only allowed values there are string literals, numeric values and
null.  The decision to change to that followed from this comment:

"Also, I don't think allowing an a_expr as a bound is remotely sensible
- I think you should allow only Sconst, NumericOnly, NULL, and
UNBOUNDED."

https://www.postgresql.org/message-id/CA%2BTgmoZ1ZMCyGR3b9yvGDq79xYLMnJQwhwn5GVs_GsvPiySDxw%40mail.gmail.com
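
So under that rule a bound has to be spelled as a literal; an expression such
as (1/5) - 1 would have to be evaluated by the user beforehand. A sketch using
the START/END syntax of this patch version (assuming pt1 already exists as a
partitioned table; the end value is illustrative):

    CREATE TABLE pt1_p1 PARTITION OF pt1 FOR VALUES START (0) END (100);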

Thanks,
Amit






Re: [HACKERS] Declarative partitioning

2016-08-10 Thread Ashutosh Bapat
The FOR VALUES clause of a partition does not allow a constant expression like
(1/5 - 1). It gives a syntax error:
regression=# create table pt1_p1 partition of pt1 for values start (0) end
((1/5) - 1);
ERROR:  syntax error at or near "("
LINE 1: ...pt1_p1 partition of pt1 for values start (0) end ((1/5) ...

Shouldn't we allow constant expressions here?

If this has been already discussed, please forgive me and point out the
relevant mail chain.

On Tue, Aug 9, 2016 at 12:48 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> What strikes me as odd about these patches is that RelOptInfo has remained
> unmodified. For a base partitioned table, I would expect it to be marked as
> partitioned, maybe indicating the partitioning scheme. Instead of that, I
> see that the code directly deals with PartitionDesc and PartitionKey, which are
> part of Relation. It uses other structures like PartitionKeyExecInfo. I
> don't have any specific observation as to why we need such information in
> RelOptInfo, but the lack of it makes me uncomfortable. It could be because
> the inheritance code does not require any mark in RelOptInfo and your patch
> re-uses the inheritance code. I am not sure.
>
> On Tue, Aug 9, 2016 at 9:17 AM, Amit Langote <
> langote_amit...@lab.ntt.co.jp> wrote:
>
>> On 2016/08/09 6:02, Robert Haas wrote:
>> > On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote
>> >  wrote:
>> >>> +1, if we could do it. It will need a change in the way Amit's patch
>> >>> stores partitioning scheme in PartitionDesc.
>> >>
>> >> Okay, I will try to implement this in the next version of the patch.
>> >>
>> >> One thing that comes to mind is: what if a user wants to apply hash
>> >> operator class equality to a list-partitioned key by specifying a hash
>> >> operator class for the corresponding column?  In that case, we would not
>> >> have the ordering procedure with a hash operator class, hence any
>> >> ordering-based optimization becomes impossible to implement.  The current
>> >> patch rejects a column for the partition key if its type does not have a
>> >> btree operator class for both range and list methods, so this issue
>> >> doesn't exist; however, it could be seen as a limitation.
>> >
>> > Yes, I think you should expect careful scrutiny of that issue.  It
>> > seems clear to me that range partitioning requires a btree opclass,
>> > that hash partitioning requires a hash opclass, and that list
>> > partitioning requires at least one of those things.  It would probably
>> > be reasonable to pick one or the other and insist that list
>> > partitioning always requires exactly that, but I can't see how it's
>> > reasonable to insist that you must have both types of opclass for any
>> > type of partitioning.
>>
>> So because we intend to implement optimizations for list partition
>> metadata that presuppose the existence of a corresponding btree operator
>> class, we should just always require the user to specify one (or error out
>> if the user doesn't specify one and a default one doesn't exist).  That way,
>> we explicitly do not support specifying a hash equality operator for list
>> partitioning.
>>
>> >> So, we have 3 choices for the internal representation of list
>> >> partitions:
>> >>
>> >> Choice 1 (the current approach):  Load them in the same order as they
>> >> are found in the partition catalog:
>> >>
>> >> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
>> >> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}
>> >>
>> >> In this case, mismatch on the first list would make the two tables
>> >> incompatibly partitioned, whereas they really aren't incompatible.
>> >
>> > Such a limitation seems clearly unacceptable.  We absolutely must be
>> > able to match up compatible partitioning schemes without getting
>> > confused by little details like the order of the partitions.
>>
>> Agreed.  Will change my patch to adopt the below method.
>>
>> >> Choice 2: Representation with 2 arrays:
>> >>
>> >> Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1]
>> >> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]
>> >>
>> >> It still doesn't help the case of pairwise joins because it's hard to
>> >> tell which value belongs to which partition (the 2nd array carries the
>> >> original partition numbers).  Although it might still work for
>> >> tuple-routing.
>> >
>> > It's very good for tuple routing.  It can also be used to match up
>> > partitions for pairwise joins.  Compare the first arrays.  If they are
>> > unequal, stop.  Else, compare the second arrays, incrementally
>> > building a mapping between them and returning false if the mapping
>> > turns out to be non-bijective.  For example, in this case, we look at
>> > index 0 and decide that 3 -> 2.  We look at index 1 and decide 1 -> 3.
>> > We look at index 2 and decide 2 -> 1.  We look at index 4 and find
>> > that we already have a mapping for 2, but it's compatible because we
>> > need 2 -> 1 and that's what is already there.  Similarly 
