subject:"\[HACKERS\] Slowness of extended protocol"

>
> I'm not going to respond to the part about dealing with prepared
>> statements errors, since I think we've already covered that and there's
>> nothing new being said. I don't find automatic savepointing acceptable, and
>> a significant change of the PostgreSQL protocol to support this doesn't
>> seem reasonable (but you can try proposing).
>>
>
> As mentioned before. JDBC is not the only postgres driver to do this the
> ODBC driver does this as well. This is a requested feature by users. We
> didn't just decide to do it on our own.
>
> One thing to keep in mind is that both JDBC and ODBC are not exclusively
> PostgreSQL drivers and as such we sometimes have to jump through hoops to
> provide the semantics requested by the API.
>

I don't have anything in particular against automatic savepointing when
requested directly by users (especially if it's opt-in). My problem is more
with automatic savepointing as a solution to a problem created by doing
automatic prepared statements... Way too much automatic stuff going on
there for my taste...

Re: [HACKERS] Slowness of extended protocol

2016-08-15 Thread Dave Cramer

>
>
> I'm not going to respond to the part about dealing with prepared
> statements errors, since I think we've already covered that and there's
> nothing new being said. I don't find automatic savepointing acceptable, and
> a significant change of the PostgreSQL protocol to support this doesn't
> seem reasonable (but you can try proposing).
>

As mentioned before. JDBC is not the only postgres driver to do this the
ODBC driver does this as well. This is a requested feature by users. We
didn't just decide to do it on our own.

One thing to keep in mind is that both JDBC and ODBC are not exclusively
PostgreSQL drivers and as such we sometimes have to jump through hoops to
provide the semantics requested by the API.



Dave Cramer

da...@postgresintl.com
www.postgresintl.com

Re: [HACKERS] Slowness of extended protocol

On Mon, Aug 15, 2016 at 3:16 PM, Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:

> Vladimir>> Yes, that is what happens.
> Vladimir>> The idea is not to mess with gucs.
>
> Shay:> Wow... That is... insane...
>
> Someone might say that "programming languages that enable side-effects
> are insane".

Lots of connection pools work by sharing the connections and it is up
> to developer
> if he can behave well (see "It is not" below)
>

The insane part is that there's really almost no reason to allow these
side-effects... It's possible to efficiently reset connection state with a
truly negligible impact on

> Shay> it's entirely reasonable for
> Shay> more than one app to use use the same pool
>
> Sharing connections between different applications might be not that good
> idea.
>

Not sure why... It seems like a bad idea mostly if your pool is leaking
state.

> However, I would not agree that "having out-of-process connection
> pool" is the only sane
> way to go.
> I do admit there might be valid cases for out of process pooling,
> however I do not agree
> it is the only way to go. Not only "inprocess" is one of the ways,
> "in-process" way is wildly used
> in at least one of the top enterprise languages.
>
> If you agree with that, you might agree that "in-process connection
> pool that serves
> exactly one application" might work in a reasonable fashion even
> without DISCARD ALL.
>

I never said out-of-process pooling is the way to go - in the .NET world
in-process pooling is very valid. But as I carefully said in my last email,
the same problems you have with multiple applications can occur with
multiple components within the same application. The process boundary isn't
very important here - in some scenarios programmers choose process
separation, in others they choose threads. The same basic problems that
occur in one model can occur in the other, including usage strategies which
make LRU caching very bad.

> Shay> Even with in-process pools it's standard practice (and a good idea)
> to
> Shay> reset state.
>
> It is not. Can you quote where did you get that "standard practice is
> to reset state" from?
>

I guess I take that back, I haven't actually made a thorough comparison
here.

Shay> If some part of a big complex
> Shay> application modified state, and then some other part happens to get
> that
> Shay> physical connection, it's extremely hard to make sense of things.
>
> Let's not go into "functional programming" vs "imperative programming"
> discussion?
> Of course you might argue that "programming in Haskell or OCaml or F#"
> makes
> "extremely easy to make sense of things", but that's completely
> another discussion.
>

I'm not sure what this is referring to... If you're talking about my
comment that "isolation/separation of layers is a good thing in
programming", then I don't think it's the function vs. imperative kind of
argument.

Shay> One note - in Npgsql's implementation of persistent prepared
> statements,
> Shay> instead of sending DISCARD ALL Npgsql sends the commands listed in
> Shay> https://www.postgresql.org/docs/current/static/sql-discard.html,
> except for
> Shay> DEALLOCATE ALL. This cleans up state changes but leaves prepared
> statements
> Shay> in place.
>
> Ok. At least you consider that "always discarding all the state" might be
> bad.
>

Yes I do. I actually implemented persistent prepared statements before this
conversation started - I think it's a great performance booster. I'm still
not sure if it should be opt-in or default, although I admit I'm leaning
towards default. But that feature has very little to do with *implicit*
preparation.

Shay> This is somewhat similar to the CPU reordering you
> Shay> keep coming back to - it's totally invisible
>
> I would disagree. CPU reordering is easily visible if you are dealing
> with multithreaded case.
> It can easily result in application bugs if application misses some
> synchronization.
>
> CPU reordering is very visible to regular programmers, and it is a
> compromise:
> 1) Developers enable compiler and CPU do certain "reorderings"
> 2) Developers agree to follow the rules like "missing synchronization
> might screw things up"
> 3) In the result, the code gets executed faster.
>

The point is that AFAIK the same bugs that can result from reordering can
also result from other basic conditions as well. If you're writing
multithreaded code then you must handle synchronization - this is not a
reordering-specific problem. Therefore if your program is multithreaded but
doesn't do proper synchronization you have a bug - regardless of whether
its manifestation is triggered by CPU reordering or not. I admit I'm not an
expert on this and may be wrong (it would be interesting to know).

> Vladimir> Just in case: PostgreSQL does not execute "discard all" on its
> own.
>
> Shay> Of course it doesn't - it doesn't know anything about connection
> pooling,
> Shay> it only knows about physical connections. When would it execute

Re: [HACKERS] Slowness of extended protocol

2016-08-15 Thread Vladimir Sitnikov

Vladimir>> Yes, that is what happens.
Vladimir>> The idea is not to mess with gucs.

Shay:> Wow... That is... insane...

Someone might say that "programming languages that enable side-effects
are insane".
Lots of connection pools work by sharing the connections and it is up
to developer
if he can behave well (see "It is not" below)

Shay> it's entirely reasonable for
Shay> more than one app to use use the same pool

Sharing connections between different applications might be not that good idea.
However, I would not agree that "having out-of-process connection
pool" is the only sane
way to go.
I do admit there might be valid cases for out of process pooling,
however I do not agree
it is the only way to go. Not only "inprocess" is one of the ways,
"in-process" way is wildly used
in at least one of the top enterprise languages.

If you agree with that, you might agree that "in-process connection
pool that serves
exactly one application" might work in a reasonable fashion even
without DISCARD ALL.

Shay> Even with in-process pools it's standard practice (and a good idea) to
Shay> reset state.

It is not. Can you quote where did you get that "standard practice is
to reset state" from?

Oracle Weblogic application server does not reset connections.
JBoss WildFly application server does not reset connections.
HikariCP connection pool does not reset connections.

I can easily continue the list.
The above are heavily used java servers (Hikari is a pool to be exact).

Shay> If some part of a big complex
Shay> application modified state, and then some other part happens to get that
Shay> physical connection, it's extremely hard to make sense of things.

Let's not go into "functional programming" vs "imperative programming"
discussion?
Of course you might argue that "programming in Haskell or OCaml or F#" makes
"extremely easy to make sense of things", but that's completely
another discussion.

Shay> One note - in Npgsql's implementation of persistent prepared statements,
Shay> instead of sending DISCARD ALL Npgsql sends the commands listed in
Shay> https://www.postgresql.org/docs/current/static/sql-discard.html,
except for
Shay> DEALLOCATE ALL. This cleans up state changes but leaves prepared
statements
Shay> in place.

Ok. At least you consider that "always discarding all the state" might be bad.


Shay> This is somewhat similar to the CPU reordering you
Shay> keep coming back to - it's totally invisible

I would disagree. CPU reordering is easily visible if you are dealing
with multithreaded case.
It can easily result in application bugs if application misses some
synchronization.

CPU reordering is very visible to regular programmers, and it is a compromise:
1) Developers enable compiler and CPU do certain "reorderings"
2) Developers agree to follow the rules like "missing synchronization
might screw things up"
3) In the result, the code gets executed faster.


Vladimir> Just in case: PostgreSQL does not execute "discard all" on its own.

Shay> Of course it doesn't - it doesn't know anything about connection pooling,
Shay> it only knows about physical connections. When would it execute "discard
Shay> all" on its own?

That my point was for "pgpool aiming to look like a regular postgresql
connection".
The point was: "postgresql does not discard on its own, so pgpool
should not discard".


Shay> To enforce isolation, which is maybe the most important way for
programs to
Shay> be reliable - but this is a general theme which you don't seem to agree
Shay> with.

If you want to isolate something, you might better have a
per-application connection pool.
That way, if a particular application consumes all the connections, it
would not impact
other applications. If all the applications use the same
out-of-process pool, there might
be trouble of resource hogging.

Shay> Regardless, resetting state doesn't have to have a necessary effect
Shay> on response times/throughput.

Even if you do not reset prepared statements, "reset query" takes time.
For instance: there's a common problem to "validate connections before use".
That is the pooler should ensure the connection is working before handling it
to the application.
Both Weblogic server, and HikariCP have those connection validation built in
and the validation is enabled by default.

However, it turns out that "connection validation" takes too much time,
it is visible in the application response times, etc, so they both implement a
grace period. That is "if the connection was recently used, it is
assumed to be fine".
Weblogic trusts 15 seconds by default, so if you borrow connections
each 10 seconds, then
they will not be tested.
Well, there's additional background validation, but my main point is
"even select 1"
is visible on the application response times.



Shay> This is something new - maybe it's part of the misunderstanding here. To
Shay> me, the term "prepared statements" always means "server-prepared
Shay> statements"; this seems to be supported by the documentation

Re: [HACKERS] Slowness of extended protocol

On Sat, Aug 13, 2016 at 11:20 PM, Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:

> Tatsuo>Interesting. What would happen if a user changes some of GUC
> parameters? Subsequent session accidentally inherits the changed GUC
> parameter?
>
> Yes, that is what happens.
> The idea is not to mess with gucs.
>

Wow... That is... insane...

> Tatsuo>There's nothing wrong with DICARD ALL
> Tatsuo>"DISCARD ALL" is perfect for this goal.
>
> It looks like you mean: "let's reset the connection state just in case".
> I see where it might help: e.g. when providing database access to random
> people who might screw a connection in every possible way.
>

It's not about random people screwing up connections, it's about
maintaining isolation between different connections... When using an
out-of-process pool (e.g. pgbouncer/pgpool), it's entirely reasonable for
more than one app to use use the same pool. It's entirely possible for one
app to change a GUC and then the other app goes boom. It's especially hard
to debug too, because it depends on who got the physical connection before
you.

Even with in-process pools it's standard practice (and a good idea) to
reset state. The idea is that pooling, much like a cache (see below!), is
supposed to speed up your application without having any other visible
effects. In other words, running your app with or without a pool should be
identical except for performance aspects. If some part of a big complex
application modified state, and then some other part happens to get that
physical connection, it's extremely hard to make sense of things.

One note - in Npgsql's implementation of persistent prepared statements,
instead of sending DISCARD ALL Npgsql sends the commands listed in
https://www.postgresql.org/docs/current/static/sql-discard.html, except for
DEALLOCATE ALL. This cleans up state changes but leaves prepared statements
in place.

Just in case: do you think one should reset CPU caches, OS filesystem
> caches, DNS caches, bounce the application, and bounce the OS in addition
> to "discard all"?
> Why reset only "connection properties" when we can reset everything to its
> pristine state?
>

Um, because all the caches you mention are, well, caches - by definition
they're not supposed to have any visible impact on any application, except
for making it faster. This is somewhat similar to the CPU reordering you
keep coming back to - it's totally invisible. GUCs are the exact opposite -
you use them to modify how PostgreSQL behaves. So it makes perfect sense to
reset them.

Just in case: PostgreSQL does not execute "discard all" on its own.
>

Of course it doesn't - it doesn't know anything about connection pooling,
it only knows about physical connections. When would it execute "discard
all" on its own?

> If you mean "pgpool is exactly like reconnect to postgresql but faster
> since connection is already established", then OK, that might work in some
> cases (e.g. when response times/throughput are not important), however why
> forcing "you must always start from scratch" execution model?
>

To enforce isolation, which is maybe the most important way for programs to
be reliable - but this is a general theme which you don't seem to agree
with. Regardless, resetting state doesn't have to have a necessary effect
on response times/throughput.

> Developers are not that dumb, and they can understand that "changing gucs
> at random is bad".
>

This has nothing to do with random, developers may have a legitimate reason
to modify a GUC. In fact, GUCs are there precisely so that developers can
change them...

Also, session state is not only about GUCs. DISCARD ALL also releases
locks, resets authorization, resets sequence state (currval/nextval), etc.
etc. All these things should not leak across connections.

> Please read again. PreparedStatement is the only way to execute statements
> in JDBC API. There's no API that allows user to specify "use
> server-prepared here".
> Well, there's non-prepared API in JDBC, however it misses "bind
> variables" support,
> so if bind variables required, developer would use PreparedStatement.

> Java's PreparedStatement does not have an option to distinguish which
statements
> should be server-prepared and which should not.

This is something new - maybe it's part of the misunderstanding here. To
me, the term "prepared statements" always means "server-prepared
statements"; this seems to be supported by the documentation you quote: "If
the driver supports precompilation, the method prepareStatement will send
the statement to the database for precompilation". I don't understand the
concept of a prepared statements which isn't server-prepared - do you have
some sort of driver-only prepared statements, which

Regardless of any optimizations you may be doing, in every database driver
I've ever seen in my life, "prepared" simply means "server-prepared". And
in every driver I've ever seen, there's an explicit API for that. Therefore
server-prepare

Re: [HACKERS] Slowness of extended protocol

2016-08-15 Thread Vladimir Sitnikov

Shay> What? I really didn't understand your point here. All the doc is saying is
Shay> that if the driver doesn't support prepared statements, then using them

Please read again. PreparedStatement is the only way to execute statements
in JDBC API. There's no API that allows user to specify "use
server-prepared here".
Well, there's non-prepared API in JDBC, however it misses "bind
variables" support,
so if bind variables required, developer would use PreparedStatement.

Java's PreparedStatement does not have an option to distinguish which statements
should be server-prepared and which should not.

Vladimir>> My experience shows, that people are very bad at predicting where the
Vladimir>> performance problem would be.
Vladimir>> For 80% (or even 99%) of the cases, they just do not care
thinking if a
Vladimir>> particular statement should be server-prepared or not.
Vladimir>> They have absolutely no idea how much resources it would
take and so on.
Shay> Maybe you're even right saying these things, I don't know. But
that doesn't
Shay> mean I as a driver should solve their problems for them. And I also get
Shay> that you have an additional argument here besides programmer
Shay> laziness/stupidity - the ORM argument - which makes more sense.

Suppose backend can handle 20 server-prepared statements at most (if using more
it would run out of memory).
Suppose an application has 100 statements with ".prepare()" call.
I think it is reasonable for the DB driver to figure out which
statements are most important
and server-prepare just "20 most important ones", and leave the rest
80 as regular
non-prepared statements.

Do you think the DB driver should just follow developer's advice and
server-prepare
all the 100 statements causing backend crash?
Do you think application developer should have a single list of all
the statements ever
used in the application and make sure there's no more than 20 queries in it?

My main point is not "developers are stupid", but "people often have
wrong guess when
it comes to performance". There are too many moving parts, so it is
hard to predict
performance implications.

Often it is much easier to execute a series of benchmarks that
validate certain hypothesis.
For instance, as Tatsuo says, savepoint overhead for DML is higher
than savepoint
overhead for SELECT, so I plan to have that benchmark as well.


Shay> First, there's nothing stopping an ORM from optimizing multiple inserts
Shay> into a single multivalue insert. I do admit I'm not aware of any who do
Shay> this, but it's a good idea for an optimization - I happen to maintain the
Shay> Entity Framework Core provider for Npgsql, I might take a look at this
Shay> optimization (so again thanks for the idea).

Nothings stops, but M framework times N database drivers results in M*N effort
for each feature.
As you say: application should just use batch API, and it's driver's
job to convert
that into suitable for the database sequence of bytes.

Same for Npgsql: if you implement rewrite at Npgsql level, that would
automatically
improve all the framework/applications running on top of Npgsql.


Shay> I'm going to repeat what I said
Shay> before and it would be good to get some reaction to this. Every software
Shay> component in the stack has a role, and maintaining those separations is
Shay> what keeps things simple and sane

You might be missing my comments on CPU, x86, etc.
My reaction is: almost every existing component is extremely hard to
reason about.

For instance: CPU has certain number of registers, it has certain
amount of L1/L2/...
caches and so on.
Do you mean each and every developer should explicitly specify which
program variable should use register and which one should go into L2 cache?

This is a counter-example to your "sane" "separation". CPU is free to reorder
instruction stream as long as the net result complies to the specification.
In the same way, CPU is free to use L1/L2 caches in whatever way it
thinks is the best.
Note: typical DB driver developer does not try to maintain a set of
"optimal assembly
instructions".
Driver developer relies on the compiler and the CPU so they would optimize
driver's code into the best machine code.

Of course driver might have inline assembly, but that is not how
mainstream drivers are written.

Another example: TCP stack. When DB driver sends some data, kernel is
free to reorder
packets, it is free to interleave, delay them, or even send even use
multiple network cards
to send a single TCP stream.
Windows 10 includes several performance improvements to the TCP stack,
and it is nowhere near
to "kernel is doing exactly what application/driver coded".
Once again: application/driver developer does not optimize for a
specific hardware (e.g. network card).
Developers just use common API and it is kernel's job to use best
optimizations for the particular HW.

The same goes to ORM-DB combo. ORM uses DB driver's API, and it's
drivers job to use
optimal command sequence for the specific

Re: [HACKERS] Slowness of extended protocol

Apologies, I accidentally replied off-list, here's the response I sent.
Vladimir, I suggest you reply to this message with your own response...

On Sat, Aug 13, 2016 at 6:32 PM, Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:

> Shay>To be honest, the mere idea of having an SQL parser inside my driver
> makes me shiver.
>
> Same for me.
> However I cannot wait for PostgreSQL 18 that does not need client-side
> parsing.
>

I'm glad we agree on something. For me the problem has nothing to do with
PostgreSQL or performance - it has to do with database APIs that impose
uniform parameter placeholder formats, and therefore force drivers to
rewrite user SQL queries. AFAIK rewriting a user's SQL for optimization is
totally out of the scope of the driver's work.

Shay>There's nothing your driver is doing that the application developer
> can't do themselves -
> Shay>so your driver isn't faster than other drivers. It's faster only when
> used by lazy programmers.
>
> I'm afraid you do not get the point.
> ORMs like Hibernate, EclipseLink, etc send regular "insert ... values" via
> batch API.
> For the developer the only way to make use of "multivalues" is to
> implement either "ORM fix" or "the driver fix" or "postgresql fix".
>
So the feature has very little to do with laziness of the programmers.
> Application developer just does not have full control of each SQL when
> working though ORM.
> Do you suggest "stop using ORMs"? Do you suggest fixing all the ORMs so it
> uses optimal for each DB insert statement?
> Do you suggest fixing postgresql?
>
> Once again "multivalues rewrite at pgjdbc level" enables the feature
> transparently for all the users. If PostgreSQL 10/11 would improve
> bind/exec performance, we could even drop that rewrite at pgjdbc level and
> revert to the regular flow. That would again be transparent to the
> application.
>
>
I do get the point, and in fact I myself mentioned the ORM case above as an
advantage of implicit query preparation.

First, there's nothing stopping an ORM from optimizing multiple inserts
into a single multivalue insert. I do admit I'm not aware of any who do
this, but it's a good idea for an optimization - I happen to maintain the
Entity Framework Core provider for Npgsql, I might take a look at this
optimization (so again thanks for the idea).

Second, it's well-known that using an ORM almost always implies a
performance sacrifice - it's a tradeoff that's chosen when going with an
ORM. It's great that you optimize multiple inserts, but there are a myriad
of other cases where an ORM generates less efficient SQL that what would be
possible - but I don't think it makes sense for the driver to actually
contain an SQL optimizer. Slightly worse performance isn't in itself a
reason to drop ORMs: it's frequent practice to drop down to raw SQL for
performance-critical operations, etc.

But all that isn't really important - I'm going to repeat what I said
before and it would be good to get some reaction to this. Every software
component in the stack has a role, and maintaining those separations is
what keeps things simple and sane. Just for fun, we could imagine a kernel
network-stack feature which analyzes outgoing messages and optimizes them;
we could even implement your multiple insert -> multivalue insert
optimization there. This would have the advantage of working out-of-the-box
for every driver and every language out there (just like your driver does
provides it transparently for all ORMs) But nobody in their right mind
would think of doing something like this, and for good reason.

The programmer's task is to write SQL, the driver's task is to communicate
that SQL via the database-specific protocol, the kernel's networking
stack's task is to transmit that protocol via TCP, etc. etc. If an ORM is
used, the programmer effectively outsources the task of writing SQL to
another component, which is supposed to do a good job about it. Once you go
about blurring all the lines here, everything becomes more complicated,
brittle and hard to debug.

For what it's worth, I can definitely imagine your kind of optimizations
occurring at some additional layer which the user would choose to use - an
intermediate SQL optimizer between application code (or ORM) and the
driver. This "SQL optimizer" layer would keep the driver itself lean and
simple (for users who *don't* want all the magic), while allowing for
transparent optimizations for ORMs. Or if the magic is implemented at the
driver leve, it should be opt-in, or at least easy to disable entirely.

>

> Shay>are you sure there aren't "hidden" costs on the PostgreSQL side for
> generating so many implicit savepoints?
>
> Technically speaking I use the same savepoint name through bind/exec
> message.
>

Out of curiosity, I whipped up a quick benchmark (voltmeter) of the impact
of adding a savepoint before every command. Each iteration creates a
transaction, sends one Bind/Execute for "SELECT 1" (which was prepared
during setup),

Re: [HACKERS] Slowness of extended protocol

2016-08-13 Thread Tatsuo Ishii

> Tatsuo>Interesting. What would happen if a user changes some of GUC
> parameters? Subsequent session accidentally inherits the changed GUC
> parameter?
> 
> Yes, that is what happens.

Ouch.

> The idea is not to mess with gucs.
> 
> Tatsuo>There's nothing wrong with DICARD ALL
> Tatsuo>"DISCARD ALL" is perfect for this goal.
> 
> It looks like you mean: "let's reset the connection state just in case".
> I see where it might help: e.g. when providing database access to random
> people who might screw a connection in every possible way.

Yes, that's what clients of pgpool is expecting. Clients do not want
to change their applications which were working with PostgreSQL
without pgpool.

> Just in case: do you think one should reset CPU caches, OS filesystem
> caches, DNS caches, bounce the application, and bounce the OS in addition
> to "discard all"?

Aparently no, because that is different from what PostgreSQL is doing
when backend exits.

> Why reset only "connection properties" when we can reset everything to its
> pristine state?

You can propose that kind of variants of DISCARD command. PostgreSQL
is an open source project.

> Just in case: PostgreSQL does not execute "discard all" on its own.
> If you mean "pgpool is exactly like reconnect to postgresql but faster
> since connection is already established", then OK, that might work in some
> cases (e.g. when response times/throughput are not important), however why
> forcing "you must always start from scratch" execution model?

I'm not doing that. I do not ask all the people to use pgpool. People
can choose whatever tools he/she thinks suitable for their purpose.

> Developers are not that dumb, and they can understand that "changing gucs
> at random is bad".
> 
> When a connection pool is dedicated to a particular application (or a set
> of applications), then it makes sense to reuse statement cache for
> performance reasons.
> Of course, it requires some discipline like "don't mess with gucs", however
> that is acceptable and it is easily understandable by developers.

I'm not against such a developer's way. If you like it, you can do
it. Nobody disturbs you. I just want to say that not all the client
want that way.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Slowness of extended protocol

2016-08-13 Thread Vladimir Sitnikov

Tatsuo>Interesting. What would happen if a user changes some of GUC
parameters? Subsequent session accidentally inherits the changed GUC
parameter?

Yes, that is what happens.
The idea is not to mess with gucs.

Tatsuo>There's nothing wrong with DICARD ALL
Tatsuo>"DISCARD ALL" is perfect for this goal.

It looks like you mean: "let's reset the connection state just in case".
I see where it might help: e.g. when providing database access to random
people who might screw a connection in every possible way.

Just in case: do you think one should reset CPU caches, OS filesystem
caches, DNS caches, bounce the application, and bounce the OS in addition
to "discard all"?
Why reset only "connection properties" when we can reset everything to its
pristine state?

Just in case: PostgreSQL does not execute "discard all" on its own.
If you mean "pgpool is exactly like reconnect to postgresql but faster
since connection is already established", then OK, that might work in some
cases (e.g. when response times/throughput are not important), however why
forcing "you must always start from scratch" execution model?

Developers are not that dumb, and they can understand that "changing gucs
at random is bad".

When a connection pool is dedicated to a particular application (or a set
of applications), then it makes sense to reuse statement cache for
performance reasons.
Of course, it requires some discipline like "don't mess with gucs", however
that is acceptable and it is easily understandable by developers.

My application cannot afford re-parsing hot SQL statements as hurts
performance. It is very visible in the end-to-end performance tests, so
"discard all" is not used, and developers know not to mess with gucs.

Vladimir

Re: [HACKERS] Slowness of extended protocol

2016-08-13 Thread Tatsuo Ishii

> Shay>I don't know much about the Java world, but both pgbouncer and pgpool
> (the major pools?)
> 
> In Java world, https://github.com/brettwooldridge/HikariCP is a very good
> connection pool.
> Neither pgbouncer nor pgpool is required.
> The architecture is:  application <=> HikariCP <=> pgjdbc <=> PostgreSQL
> 
> The idea is HikariCP pools pgjdbc connections, and server-prepared
> statements are per-pgjdbc connection, so there is no reason to "discard
> all" or "deallocate all" or whatever.

Interesting. What would happen if a user changes some of GUC
parameters? Subsequent session accidentally inherits the changed GUC
parameter?

> Shay>send DISCARD ALL by default. That is a fact, and it has nothing to do
> with any bugs or issues pgbouncer may have.

That is correct for pgpool as well.

> That is configurable. If pgbouncer/pgpool defaults to "wrong
> configuration", why should we (driver developers, backend developers) try
> to accommodate that?

There's nothing wrong with DICARD ALL. It's just not suitable for your
program (and HikariCP?).

I don't know about pgbouncer but pgpool has been created for a general
purpose connection pooler, which means it must behave as much as
similarly to PostgreSQL backend from frontend's point of
view. "DISCARD ALL" is needed to simulate the behavior of backend: it
discards all properties set by a frontend including prepared
statements when a session terminates. "DISCARD ALL" is perfect for
this goal.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Slowness of extended protocol

2016-08-13 Thread Vladimir Sitnikov

Shay>To be honest, the mere idea of having an SQL parser inside my driver
makes me shiver.

Same for me.
However I cannot wait for PostgreSQL 18 that does not need client-side
parsing.

Shay>We did, you just dismissed or ignored them

Please prove me wrong, but I did provide a justified answer to both
yours:
https://www.postgresql.org/message-id/CAB%3DJe-FHSwrbJiTcTDeT4J3y_%2BWvN1d%2BS%2B26aesr85swocb7EA%40mail.gmail.com
(starting
with "Why the connection is reset")
and Robert's examples:
https://www.postgresql.org/message-id/CAB%3DJe-GSAs_340dqdrJoTtP6KO6xxN067CtB6Y0ea5c8LRHC9Q%40mail.gmail.com

Shay>There's nothing your driver is doing that the application developer
can't do themselves -
Shay>so your driver isn't faster than other drivers. It's faster only when
used by lazy programmers.

I'm afraid you do not get the point.
ORMs like Hibernate, EclipseLink, etc send regular "insert ... values" via
batch API.
For the developer the only way to make use of "multivalues" is to implement
either "ORM fix" or "the driver fix" or "postgresql fix".

So the feature has very little to do with laziness of the programmers.
Application developer just does not have full control of each SQL when
working though ORM.
Do you suggest "stop using ORMs"? Do you suggest fixing all the ORMs so it
uses optimal for each DB insert statement?
Do you suggest fixing postgresql?

Once again "multivalues rewrite at pgjdbc level" enables the feature
transparently for all the users. If PostgreSQL 10/11 would improve
bind/exec performance, we could even drop that rewrite at pgjdbc level and
revert to the regular flow. That would again be transparent to the
application.

Shay>are you sure there aren't "hidden" costs on the PostgreSQL side for
generating so many implicit savepoints?

Technically speaking I use the same savepoint name through bind/exec
message.

Shay>What you're doing is optimizing developer code, with the assumption
that developers can't be trusted to code efficiently - they're going to
write bad SQL and forget to prepare their statements

Please, be careful. "you are completely wrong here" he-he. Well, you list
the wrong assumption. Why do you think my main assumption is "developers
can't be trusted"?

The proper assumption is: I follow Java database API specification, and I
optimize pgjdbc for the common use case (e.g. ORM or ORM-like).

For instance, if Java application wants to use bind values (e.g. to prevent
security issues), then the only way is to go through
java.sql.PreparedStatement.

Here's the documentation:
https://docs.oracle.com/javase/8/docs/api/java/sql/Connection.html#prepareStatement-java.lang.String-

Here's a quote:
Javadoc> *Note:* This method is optimized for handling parametric SQL
statements that benefit from precompilation. If the driver supports
precompilation, the methodprepareStatement will send the statement to the
database for precompilation. Some drivers may not support precompilation.
In this case, the statement may not be sent to the database until the
PreparedStatement object is executed. This has no direct effect on users;
however, it does affect which methods throw certainSQLException objects.

The most important part is "if the driver supports precompilation..."
There's no API to enable/disable precompilation at all.
So, when using Java, there is no such thing as
"statement.enableServerPrepare=true".

It is expected, that "driver" would "optimize" the handling somehow in the
best possible way.

It is Java API specification that enables me (as a driver developer) to be
flexible, and leverage database features so end user gets best experience.

Vladimir

Re: [HACKERS] Slowness of extended protocol

2016-08-13 Thread Shay Rojansky

Vladimir wrote:

Shay>I don't know much about the Java world, but both pgbouncer and pgpool
> (the major pools?)
>
> In Java world, https://github.com/brettwooldridge/HikariCP is a very good
> connection pool.
> Neither pgbouncer nor pgpool is required.
> The architecture is:  application <=> HikariCP <=> pgjdbc <=> PostgreSQL
>
> The idea is HikariCP pools pgjdbc connections, and server-prepared
> statements are per-pgjdbc connection, so there is no reason to "discard
> all" or "deallocate all" or whatever.
>
> Shay>send DISCARD ALL by default. That is a fact, and it has nothing to do
> with any bugs or issues pgbouncer may have.
>
> That is configurable. If pgbouncer/pgpool defaults to "wrong
> configuration", why should we (driver developers, backend developers) try
> to accommodate that?
>

In a nutshell, that sentence represents much of the problem I have with
this conversation: you simply consider that anything that's different from
your approach is simply "wrong".

Shay>f you do a savepoint on every single command, that surely would impact
> performance even without extra roundtrips...?
>
> My voltmeter says me that the overhead is just 3us for a simple "SELECT"
> statement (see https://github.com/pgjdbc/pgjdbc/pull/477#
> issuecomment-239579547 ).
> I think it is acceptable to have it enabled by default, however it would
> be nice if the database did not require a savepoint dance to heal its "not
> implemented" "cache query cannot change result type" error.
>

That's interesting (I don't mean that ironically). Aside from simple
client-observed speed which you're measuring, are you sure there aren't
"hidden" costs on the PostgreSQL side for generating so many implicit
savepoints? I don't know anything about PostgreSQL transaction/savepoint
internals, but adding savepoints to each and every statement in a
transaction seems like it would have some impact. It would be great to have
the confirmation of someone who really knows the internals.

> Shay>I'm not aware of other drivers that implicitly prepare statements,
> Shay >and definitely of no drivers that internally create savepoints and
> Shay> roll the back without explicit user APIs
>
> Glad to hear that.
> Are you aware of other drivers that translate "insert into table(a,b,c)
> values(?,?,?)" =into=> "insert into table(a,b,c) values(?,?,?),
> (?,?,?),...,(?,?,?)" statement on the fly?
>
> That gives 2-3x performance boost (that includes client result processing
> as well!) for batched inserts since "bind/exec" message processing is not
> that fast. That is why I say that investing into "bind/exec performance"
> makes more sense than investing into "improved execution of non-prepared
> statements".
>

I'm not aware of any, but I also don't consider that part of the driver's
job. There's nothing your driver is doing that the application developer
can't do themselves - so your driver isn't faster than other drivers. It's
faster only when used by lazy programmers.

What you're doing is optimizing developer code, with the assumption that
developers can't be trusted to code efficiently - they're going to write
bad SQL and forget to prepare their statements. That's your approach, and
that's fine. The other approach, which is also fine, is that each software
component has its own role, and that clearly-defined boundaries of
responsibility should exist - that writing SQL is the developer's job, and
that delivering it to the database is the driver's job. To me this
separation of layers/roles is key part of software engineering: it results
in simpler, leaner components which interact in predictable ways, and when
there's a problem it's easier to know exactly where it originated. To be
honest, the mere idea of having an SQL parser inside my driver makes me
shiver.

The HikariCP page you sent (thanks, didn't know it) contains a great
Dijkstra quote that summarizes what I think about this - "Simplicity is
prerequisite for reliability.". IMHO you're going the opposite way by
implicitly rewriting SQL, preparing statements and creating savepoints.

But at the end of the day it's two different philosophies. All I ask is
that you respect the one that isn't yours, which means accepting the
optimization of non-prepared statements. There's no mutual exclusion.

> 2) I don't say "it is the only true way". I would change my mind if
> someone would suggest better solution. Everybody makes mistakes, and I have
> no problem with admitting that "I made a mistake" and moving forward.
> They say: "Don't cling to a mistake just because you spent a lot of time
> making it"
>

:)

So your way isn't the only true way, but others are just clinging to their
mistakes... :)

> However, no-one did suggest a case when application issues lots of unique
> SQL statements.
>

We did, you just dismissed or ignored them.

3) For "performance related" issues, a business case and a voltmeter is
> required to prove there's an issue.
>
>
> Shay>But the second query, which should be

Re: [HACKERS] Slowness of extended protocol

2016-08-13 Thread Dave Cramer

On 11 August 2016 at 10:18, Shay Rojansky  wrote:

>
>
> On Thu, Aug 11, 2016 at 1:22 PM, Vladimir Sitnikov <
> sitnikov.vladi...@gmail.com> wrote:
>
> 2) The driver can use safepoints and autorollback to the good "right
>> before failure" state in case of a known failure. Here's the
>> implementation: https://github.com/pgjdbc/pgjdbc/pull/477
>>
> As far as I can remember, performance overheads are close to zero (no
>> extra roundtrips to create a safepoint)
>>
>
> What? Do you mean you do implicit savepoints and autorollback too? How
> does the driver decide when to do a savepoint? Is it on every single
> command? If not, commands can get lost when an error is raised and you
> automatically roll back? If you do a savepoint on every single command,
> that surely would impact performance even without extra roundtrips...?
>
> You seem to have a very "unique" idea of what a database driver should do
> under-the-hood for its users. At the very least I can say that your concept
> is very far from almost any database driver I've seen up to now (PostgreSQL
> JDBC, psycopg, Npgsql, libpq...). I'm not aware of other drivers that
> implicitly prepare statements, and definitely of no drivers that internally
> create savepoints and roll the back without explicit user APIs. At the very
> least you should be aware (and also clearly admit!) that you're doing
> something very different - not necessarily wrong - and not attempt to
> impose your ideas on everyone as if it's the only true way to write a db
> driver.
>


A number of other drivers default to this behaviour, including at least
MS-SQL and Oracle. psqlODBC also supports this behaviour with statement
rollback mode. And obviously PostgreSQL JDBC which Vladimir is referring to.

Dave Cramer

da...@postgresintl.com
www.postgresintl.com

Re: [HACKERS] Slowness of extended protocol

2016-08-13 Thread Vladimir Sitnikov

Shay>I don't know much about the Java world, but both pgbouncer and pgpool
(the major pools?)

In Java world, https://github.com/brettwooldridge/HikariCP is a very good
connection pool.
Neither pgbouncer nor pgpool is required.
The architecture is:  application <=> HikariCP <=> pgjdbc <=> PostgreSQL

The idea is HikariCP pools pgjdbc connections, and server-prepared
statements are per-pgjdbc connection, so there is no reason to "discard
all" or "deallocate all" or whatever.

Shay>send DISCARD ALL by default. That is a fact, and it has nothing to do
with any bugs or issues pgbouncer may have.

That is configurable. If pgbouncer/pgpool defaults to "wrong
configuration", why should we (driver developers, backend developers) try
to accommodate that?

Shay> What? Do you mean you do implicit savepoints and autorollback too?

I mean that.
On top of that it enables opt-in psql-like ON_ERROR_ROLLBACK functionality
so it could automatically roll back the latest statement if it failed.
For instance, that might simplify migration from other DBs that have the
same "rollback just one statement on error" semantics by default.

Shay>How does the driver decide when to do a savepoint?

By default it would set a savepoint in a case when there's open
transaction, and the query to be executed has been previously described.

In other words, the default mode is to protect user from "cached plan
cannot change result type" error.

Shay>Is it on every single command?

Exactly (modulo the above). If user manually sets "autosave=always", every
command would get prepended with a savepoint and rolled back.

Shay>f you do a savepoint on every single command, that surely would impact
performance even without extra roundtrips...?

My voltmeter says me that the overhead is just 3us for a simple "SELECT"
statement (see
https://github.com/pgjdbc/pgjdbc/pull/477#issuecomment-239579547 ).
I think it is acceptable to have it enabled by default, however it would be
nice if the database did not require a savepoint dance to heal its "not
implemented" "cache query cannot change result type" error.


Shay>I'm not aware of other drivers that implicitly prepare statements,
Shay >and definitely of no drivers that internally create savepoints and
Shay> roll the back without explicit user APIs

Glad to hear that.
Are you aware of other drivers that translate "insert into table(a,b,c)
values(?,?,?)" =into=> "insert into table(a,b,c)
values(?,?,?),(?,?,?),...,(?,?,?)"
statement on the fly?

That gives 2-3x performance boost (that includes client result processing
as well!) for batched inserts since "bind/exec" message processing is not
that fast. That is why I say that investing into "bind/exec performance"
makes more sense than investing into "improved execution of non-prepared
statements".

Shay>you're doing something very different - not necessarily wrong - and not
Shay>attempt to impose your ideas on everyone as if it's the only true way
Shay>to write a db driver.

1) Feel free to pick ideas you like

2) I don't say "it is the only true way". I would change my mind if someone
would suggest better solution. Everybody makes mistakes, and I have no
problem with admitting that "I made a mistake" and moving forward.
They say: "Don't cling to a mistake just because you spent a lot of time
making it"

However, no-one did suggest a case when application issues lots of unique
SQL statements.
The suggested alternative "a new message for non-prepared extended query"
might shave 5-10us per query for those who are lazy to implement a query
cache. However, that is just 5-10ms per 1000 queries. Would that be
noticeable by the end-users? I don't think so.

Having a query cache can easily shave 5-10+ms for each query, that is why I
suggest driver developers to implement a query cache first, and only then
ask for new performance-related messages.

3) For "performance related" issues, a business case and a voltmeter is
required to prove there's an issue.


Shay>But the second query, which should be invalidated, has already been
Shay>sent to the server (because of batching), and boom

-- Doctor, it hurts me when I do that
-- Don't do that

When doing batched SQL, some of the queries might fail with "duplicate
key", or "constraint violation". There's already API that covers those kind
of cases and reports which statements did succeed (if any).
In the case as you described (one query in a batch somehow invalidates the
subsequent one) it would just resort to generic error handling.


Shay>When would you send this ValidatePreparedStatement?
Shay>Before each execution as a separate roundtrip?
Shay>That would kill performance.

Why do you think the performance would degrade? Once again: the current
problem is the client thinks it knows "which columns will be returned by a
particular server-prepared statement" but the table might get change behind
the scenes (e.g. online schema upgrade), so the error occurs. That "return
type" is already validated by the database (the time is

Re: [HACKERS] Slowness of extended protocol

2016-08-11 Thread Shay Rojansky

On Thu, Aug 11, 2016 at 1:22 PM, Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:

2) The driver can use safepoints and autorollback to the good "right before
> failure" state in case of a known failure. Here's the implementation:
> https://github.com/pgjdbc/pgjdbc/pull/477
>
As far as I can remember, performance overheads are close to zero (no extra
> roundtrips to create a safepoint)
>

What? Do you mean you do implicit savepoints and autorollback too? How does
the driver decide when to do a savepoint? Is it on every single command? If
not, commands can get lost when an error is raised and you automatically
roll back? If you do a savepoint on every single command, that surely would
impact performance even without extra roundtrips...?

You seem to have a very "unique" idea of what a database driver should do
under-the-hood for its users. At the very least I can say that your concept
is very far from almost any database driver I've seen up to now (PostgreSQL
JDBC, psycopg, Npgsql, libpq...). I'm not aware of other drivers that
implicitly prepare statements, and definitely of no drivers that internally
create savepoints and roll the back without explicit user APIs. At the very
least you should be aware (and also clearly admit!) that you're doing
something very different - not necessarily wrong - and not attempt to
impose your ideas on everyone as if it's the only true way to write a db
driver.

3) Backend could somehow invalidate prepared statements, and notify clients
> accordingly. Note: the problem is hard in a generic case, however it might
> be not that hard if we fix well-known often-used cases like "a column is
> added". That however, would add memory overheads to store additional maps
> like "table_oid -> statement_names[]"
>

Assuming your driver supports batching/pipelining (and I hope it does),
that doesn't make sense. Say I send a 2-statement batch, with the first one
a DDL and with the second one some prepared query invalidated by the first.
When your DDL is executed by PostgreSQL your hypothetical notification is
sent. But the second query, which should be invalidated, has already been
sent to the server (because of batching), and boom.

4) Other. For instance, new message flow so frontend and backend could
> re-negotiate "binary vs text formats for the new resulting type". Or
> "ValidatePreparedStatement" message that would just validate the statement
> and avoid killing the transaction if the statement is invalid. Or whatever
> else there can be invented.
>

When would you send this ValidatePreparedStatement? Before each execution
as a separate roundtrip? That would kill performance. Would you send it in
the same packet before Execute? In that case you still get the error when
Execute is evaluated...

There really is no solution to this problem within the current PostgreSQL
way of doing things - although of course we could reinvent the entire
PostgreSQL protocol here to accommodate for your special driver...

The basic truth is this... In every db driver I'm familiar with programmers
are expected to manage query preparation on their own. They're supposed to
do it based on knowledge only they have, weighing pros and cons. They have
responsibility over their own code and they don't outsource major decisions
like this to their driver. When they get an error from PostgreSQL, it's
triggered by something they did in a very explicit and clear way - and they
therefore have a good chance to understand what's going on. They generally
don't get errors triggered by some under-the-hood magic their driver
decided to do for them, and which are hard to diagnose and understand.
Their drivers are simple, predictable and lean, and they don't have to
include complex healing logic to deal with errors they themselves triggered
with under-the-hood logic (e.g. implicit savepoints).

> Shay>So the general point is that the existence of pools is problematic
> for the argument "always prepare for recurring statements".
>
> So what?
> Don't use pools that issue "discard all" or configure them accordingly.
> That's it.
> In Java world, no wildly used pool defaults to "discard everything"
> strategy.
>

I don't know much about the Java world, but both pgbouncer and pgpool (the
major pools?) send DISCARD ALL by default. That is a fact, and it has
nothing to do with any bugs or issues pgbouncer may have. I'm tempted to go
look at other pools in other languages but frankly I don't think that would
have any effect in this conversation...

Re: [HACKERS] Slowness of extended protocol

2016-08-11 Thread Vladimir Sitnikov

Shay>As I said, an error is going to kill the ongoing transaction, how can
this be solved without application logic?

1) At least, some well-defined error code should be created for that kind
of matter.

2) The driver can use safepoints and autorollback to the good "right before
failure" state in case of a known failure. Here's the implementation:
https://github.com/pgjdbc/pgjdbc/pull/477
As far as I can remember, performance overheads are close to zero (no extra
roundtrips to create a safepoint)

3) Backend could somehow invalidate prepared statements, and notify clients
accordingly. Note: the problem is hard in a generic case, however it might
be not that hard if we fix well-known often-used cases like "a column is
added". That however, would add memory overheads to store additional maps
like "table_oid -> statement_names[]"

4) Other. For instance, new message flow so frontend and backend could
re-negotiate "binary vs text formats for the new resulting type". Or
"ValidatePreparedStatement" message that would just validate the statement
and avoid killing the transaction if the statement is invalid. Or whatever
else there can be invented.


Shay>So the general point is that the existence of pools is problematic for
the argument "always prepare for recurring statements".

So what?
Don't use pools that issue "discard all" or configure them accordingly.
That's it.
In Java world, no wildly used pool defaults to "discard everything"
strategy.

Please stop saying "pgbouncer" as its issue is confirmed, and pgbouncer
developers did acknowledge they would prefer to solve "prepared statement
issue" right inside pgbouncer without any cooperation from driver
developers.

Do you mean in C# world major connection pools default to "discard all"
setup? That sounds strange to me.

Vladimir

>

Re: [HACKERS] Slowness of extended protocol

2016-08-11 Thread Shay Rojansky

On Thu, Aug 11, 2016 at 8:39 AM, Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:

>  Shay:
>
>> Prepared statements can have very visible effects apart from the speedup
>> they provide (e.g. failure because of schema changes) It's not that
>> these effects can't be worked around - they can be - but programmers can be
>> surprised by these effects, which can cause difficult-to-diagnose issues.
>>
>
> The specific effect of "cached plan cannot change return type" can be
> solved by cooperation of backend and frontend (~driver) developers.
>

I really don't see how. As I said, an error is going to kill the ongoing
transaction, how can this be solved without application logic? Unless you
propose keeping track of all statements executed in the query and resend
them?

I'm asking this out of genuine interest (I might be missing something),
since I'm seriously considering implementing implicit query preparation in
Npgsql.

I find that solving that kind of issues is more important than investing
> into "ParseBindExecDeallocateInOneGo" message.
>

I don't think it's solvable. But regardless, it's totally to invest in two
things, I think it's totally legitimate to invest in implicit query
preparation.

> I hope you would forgive me if I just stop the discussion here.
>

Of course, no problem.

> I find I'd better spent that time on just fixing pgbouncer issue rather
> than discussing if it is pgbouncer's or postgresql's issue.
>

The point is that it's not just a pgbouncer "bug". Connection pools like
pgbouncer (there are others) usually send DISCARD ALL on connection close,
which makes prepared statements useless for short-lived connection
scenarios (since you have to reprepare). I don't know if that's changeable
across pool implementations in general, and even if so, requires
significant user understanding for tweaking the pool. So the general point
is that the existence of pools is problematic for the argument "always
prepare for recurring statements".

I'm sorry for being impolite if I was ever.
>

I don't think you were impolite, I personally learned a few things from the
conversation and will probably implement opt-in implicit query preparation
thanks to it.

Re: [HACKERS] Slowness of extended protocol

2016-08-11 Thread Vladimir Sitnikov

 Shay:

> Prepared statements can have very visible effects apart from the speedup
> they provide (e.g. failure because of schema changes) It's not that these
> effects can't be worked around - they can be - but programmers can be
> surprised by these effects, which can cause difficult-to-diagnose issues.
>

The specific effect of "cached plan cannot change return type" can be
solved by cooperation of backend and frontend (~driver) developers.
I find that solving that kind of issues is more important than investing
into "ParseBindExecDeallocateInOneGo" message.


I hope you would forgive me if I just stop the discussion here.
I find I'd better spent that time on just fixing pgbouncer issue rather
than discussing if it is pgbouncer's or postgresql's issue.

I'm sorry for being impolite if I was ever.

Vladimir

>

Re: [HACKERS] Slowness of extended protocol

2016-08-11 Thread Shay Rojansky

Vladimir wrote:

Shay> As Tom said, if an application can benefit from preparing, the
> developer has the responsibility (and also the best knowledge) to manage
> preparation, not the driver. Magical behavior under the hood causes
> surprises, hard-to-diagnose bugs etc.
>
> Why do you do C# then?
> Aren't you supposed to love machine codes as the least magical thing?
> Even x86 does not describe "the exact way the code should be executed".
> All the CPUs shuffle the instructions to make it faster.
>

These are valid questions. I have several answers to that.

First, it depends on the extent to which the "under the hood" activity may
have visible side-effects. CPU instruction reordering is 100% invisible,
having no effect whatsoever on the application apart from speeding it up.
Prepared statements can have very visible effects apart from the speedup
they provide (e.g. failure because of schema changes). It's not that these
effects can't be worked around - they can be - but programmers can be
surprised by these effects, which can cause difficult-to-diagnose issues.
Nobody needs to be really aware of CPU instruction reordering because it
will never cause an application issue (barring a CPU bug).

Another important visible effect of preparing a statement is server-side
cost (i.e. memory). I'm not sure what the overhead is, but it is there and
an irresponsible "prepare everything everywhere" can create a significant
resource drain on your server. I know pgjdbc has knobs for controlling how
many statements are prepared, but in a large-scale performance-sensitive
app a programmer may want to manually decide, on a per-query basis, whether
they want to prepare or not.

But the most important question is that this is a choice that should be
left to the developer, rather than imposed. In some cases developers *do*
drop down to assembly, or prefer to call system calls directly to be in
control of exactly what's happening. They should be able to do this. Now,
of course you can provide an option which disables implicit preparing, but
if you refuse to optimize non-prepared paths you're effectively forcing the
programmer to go down the prepared statement path.

Shay>As Tom said, if an application can benefit from preparing, the
> developer has the responsibility
>
> Does developer have the responsibility to choose between "index scan" and
> "table seq scan"? So your "developer has the responsibility" is like
> building on sand.
>

I'm very glad you raised that point. The programmers indeed outsources that
decision to the database, because the database has the most *knowledge* on
optimal execution (e.g. which indices are defined). Implicit preparing, on
the other hand, has nothing to do with knowledge: as Tom said, the driver
knows absolutely nothing about the application, and doesn't have any
advantage over the programmer in making that decision.

But note that the DBA *does* have to decide which indices exist and which
don't. Would you accept a database that automatically creates indices based
on use, i.e. "x queries caused full table scans, so I'll silently create an
index here"? I wouldn't, it would be utter nonsense and create lots of
unpredictability in terms of performance and resource drain.

My experience shows, that people are very bad at predicting where the
> performance problem would be.
> For 80% (or even 99%) of the cases, they just do not care thinking if a
> particular statement should be server-prepared or not.
> They have absolutely no idea how much resources it would take and so on.
>

And for those people it's great to have an implicit preparation feature,
most probably opt-in. But that shouldn't mean we shouldn't optimize things
for the rest.

> ORMs have no that notion of "this query must be server-prepared, while
> that one must not be".
> And so on.
>
> It is somewhat insane to assume people would use naked SQL. Of course they
> would use ORMs and alike, so they just would not be able to pass that
> additional parameter telling if a particular query out of thousands should
> be server-prepared or not.
>

Uh, I really wouldn't make statements like that if you want to be taken
seriously. Tons of applications use naked SQL and avoid ORMs for many
reasons (e.g. performance benefits of hand-crafted SQL), such general
statements on what applications do in the world are, well, silly. Again,
ORMs are an argument for why implicit preparation should exist (I made that
argument myself above), but they're not an argument for why optimizing
other paths should be excluded.

> Vladimir> "cached plan cannot change result type" -- PostgreSQL just fails
> to execute the server-prepared statement if a table was altered.
>
> Shay>How exactly do users cope with this in pgjdbc? Do they have some
> special API to flush (i.e. deallocate) prepared statements which they're
> supposed to use after a DDL?
>
> First of all, pgjdbc does report those problems to hackers.
> Unfortunately, it is still "not implemented".
> Then,

Re: [HACKERS] Slowness of extended protocol

Shay> it's important to note that query parsing and rewriting isn't an
"inevitable evil".

Ok, I stay corrected. ADO.NET have raw mode in the API. That's interesting.

Let's say "lots of heavily used languages do have their own notion of bind
placeholders".
And for the reset, it is still not that hard to prepare on the go.

Shay> As Tom said, if an application can benefit from preparing, the
developer has the responsibility (and also the best knowledge) to manage
preparation, not the driver. Magical behavior under the hood causes
surprises, hard-to-diagnose bugs etc.

Why do you do C# then?
Aren't you supposed to love machine codes as the least magical thing?
Even x86 does not describe "the exact way the code should be executed".
All the CPUs shuffle the instructions to make it faster.

Shay>As Tom said, if an application can benefit from preparing, the
developer has the responsibility

Does developer have the responsibility to choose between "index scan" and
"table seq scan"? So your "developer has the responsibility" is like
building on sand.

Even "SQL execution on PostgreSQL is a magical behavior under the hood".
Does that mean we should drop the optimizer and implement "fully static
hint-based execution mode"? I do not buy that.

My experience shows, that people are very bad at predicting where the
performance problem would be.
For 80% (or even 99%) of the cases, they just do not care thinking if a
particular statement should be server-prepared or not.
They have absolutely no idea how much resources it would take and so on.

ORMs have no that notion of "this query must be server-prepared, while that
one must not be".
And so on.

It is somewhat insane to assume people would use naked SQL. Of course they
would use ORMs and alike, so they just would not be able to pass that
additional parameter telling if a particular query out of thousands should
be server-prepared or not.

Vladimir> "cached plan cannot change result type" -- PostgreSQL just fails
to execute the server-prepared statement if a table was altered.

Shay>How exactly do users cope with this in pgjdbc? Do they have some
special API to flush (i.e. deallocate) prepared statements which they're
supposed to use after a DDL?

First of all, pgjdbc does report those problems to hackers.
Unfortunately, it is still "not implemented".
Then, a workaround at pgjdbc side is made.
Here's relevant pgjdbc fix: https://github.com/pgjdbc/pgjdbc/pull/451

It analyzes error code, and if it finds "not_implemented
from RevalidateCachedQuery", then it realizes it should re-prepare.
Unfortunately, there is no dedicated error code, but at least there's a
routine name.

On top of that, each pgjdbc commit is tested against HEAD PostgreSQL
revision, so if the routine will get renamed for some reason, we'll know
that right away.

There will be some more parsing to cover "deallocate all" case.

Shay>it have been listed many times - pgbouncer

Let's stop discussing pgbouncer issue here?
It has absolutely nothing to do with pgsql-hackers.
Thanks.

Vladimir

Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Shay Rojansky

Some comments...

For the record, I do find implicit/transparent driver-level query
preparation interesting and potentially useful, and have opened
https://github.com/npgsql/npgsql/issues/1237 to think about it - mostly
based on arguments on this thread. One big question mark I have is whether
this behavior should be opt-out as in pgjdbc, rather than opt-in. Like
others in this thread (I think), I prefer my libraries and drivers very
simple, following my exact API calls to the letter and doing minimal magic
under the hood. As Tom said, if an application can benefit from preparing,
the developer has the responsibility (and also the best knowledge) to
manage preparation, not the driver. Magical behavior under the hood causes
surprises, hard-to-diagnose bugs etc.

One interesting case where implicit preparation is unbeatable, though, is
when users aren't coding against the driver directly but are using some
layer, e.g. an ORM. For example, the Entity Framework ORM simply does not
prepare statements (see
https://github.com/aspnet/EntityFramework/issues/5459 which I opened). This
is one reason I find it a valuable feature to have, although probably as
opt-in and not opt-out.

Regarding driver-level SQL parsing, Vladimir wrote:

> Unfortunately, client-side SQL parsing is inevitable evil (see below on
'?'), so any sane driver would cache queries any way. The ones that do not
have query cache will perform badly anyway.

First, it's a very different thing to have a query cache in the driver, and
to implicitly prepare statements.

Second, I personally hate the idea that drivers parse SQL and rewrite
queries. It's true that in many languages database APIs have "standardized"
parameter placeholders, and therefore drivers have no choice but to
rewrite. ADO.NET (the .NET database API) actually does not do this -
parameter placeholders are completely database-dependent (see
https://msdn.microsoft.com/en-us/library/yy6y35y8%28v=vs.110%29.aspx?f=255=-2147217396).
Npgsql does rewrite queries to provide named parameters, but the next
version is going to have a "raw query" mode in which no parsing/rewriting
happens whatsoever. It simply takes your string, packs it into a Parse (or
Query), and sends it off. This is not necessarily going to be the main way
users use Npgsql, but it's important to note that query parsing and
rewriting isn't an "inevitable evil".

Vladimir wrote:

> This new message would be slower than proper query cache, so why should
we all spend time on a half-baked solution?

The cases where the prepared statement path doesn't work have already been
listed many times - pgbouncer (and other pools), drivers which don't
support persisting prepared statement across pooled connection open/close,
and people (like many in this conversation) who don't appreciate their
drivers doing magic under the hood. This is why I thing both proposals are
great - there's nothing wrong with query caching (or more precisely,
implicit statement preparation by the driver), especially if it's opt-in,
but that doesn't mean we shouldn't optimize things for people who don't,
can't or won't go for that.

> To be fair, implementing a cache is a trivial thing when compared with
hard-coding binary/text formats for all the datatypes in each and every
language.
> Remember, each driver has to implement its own set of procedures to
input/output values in text/binary format, and that is a way harder than
implementing the cache we are talking about.

I agree with that, but it's not a question of how easy it is to implement
implicit preparation. It's a question of whether driver developers choose
to do this, based on other considerations as well - many people here think
it's not the job of the driver to decide for you whether to prepare or not,
for example. It has nothing to do with implementation complexity.

> "cached plan cannot change result type" -- PostgreSQL just fails to
execute the server-prepared statement if a table was altered.

That alone seems to be a good reason to make implicit preparation opt-in at
best. How exactly do users cope with this in pgjdbc? Do they have some
special API to flush (i.e. deallocate) prepared statements which they're
supposed to use after a DDL? Do you look at the command tag in the
CommandComplete to know whether a command was DDL or not?

Re: [HACKERS] Slowness of extended protocol

Stephen> While it may have good results in many cases, it's not accurate to
say that using prepared statements will always be faster than not.

There's no silver bullet. <-- that is accurate, but it is useless for
end-user applications
I've never claimed that "server prepared statement" is a silver bullet.

I just claim that "PrepareBindExecuteDeallocate" message does have
justification from performance point of view.

Stephen Frost> Dropping and recreating the prepared statement is how that
particular issue is addressed.

Stephen,

The original problem is: "extended query execution is slow"
I recommend "let's just add a query cache"
You mention: "that does not work since one query in a year might get slower
on 6th execution"
I ask: "what should be done _instead_ of a query cache?"
You say: "the same query cache, but execute 5 times at most"

Does that mean you agree that "query cache is a way to go"?


I do not follow that. Of course, I could add "yet another debugger switch"
to pgjdbc so it executes server-prepared statements 5 times maximum,
however I do consider it to be a PostgreSQL's issue.
I do not like to hard-code "5" into the driver logic.

I do not like making "5 times maximum" a default mode.

Vladimir

Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Stephen Frost

* Vladimir Sitnikov (sitnikov.vladi...@gmail.com) wrote:
> 3) "suddently get slow the 6th time" is a PostgreSQL bug that both fails to
> estimate cardinality properly, and it does not provide administrator a way
> to disable the feature (generic vs specific plan).

Dropping and recreating the prepared statement is how that particular
issue is addressed.

> Query cache does have very good results for the overall web page times, and
> problems like "6th execution" are not that often.

While it may have good results in many cases, it's not accurate to say
that using prepared statements will always be faster than not.

Thanks!

Stephen

signature.asc
Description: Digital signature

Re: [HACKERS] Slowness of extended protocol

Stephen>I encourage you to look through the archives

The thing is pl/pgsql suffers from exactly the same problem.
pl/pgsql is not a typical language of choice (e.g. see Tiobe index and
alike), so the probability of running into "prepared statement issues" was
low.

As more languages would use server-prepared statements, the rate of the
issues would naturally increase.

JFYI: I did participate in those conversations, so I do not get which
particular point are you asking for me to "look through" there.

Stephen Frost:

> And is the source of frequent complaints on various mailing lists along
> the lines of "why did my query suddently get slow the 6th time it was
> run?!".
>

I claim the following:
1) People run into such problems with pl/pgsql as well. pl/pgsql does
exactly the same server-prepared logic. So what? Pl/pgsql does have a query
cache, but other languages are forbidden from having one?
2) Those problematic queries are not that often
3) "suddently get slow the 6th time" is a PostgreSQL bug that both fails to
estimate cardinality properly, and it does not provide administrator a way
to disable the feature (generic vs specific plan).

4) Do you have better solution? Of course, the planner is not perfect. Of
course it will have issues with wrong cardinality estimations. So what?
Should we completely abandon the optimizer?
I do not think so.
Query cache does have very good results for the overall web page times, and
problems like "6th execution" are not that often.

By the way, other common problems are:
"cached plan cannot change result type" -- PostgreSQL just fails to execute
the server-prepared statement if a table was altered.
"prepared statement does not exist" -- the applications might use
"deallocate all" for some unknown reason, so the driver has to keep eye on
that.
"set search_path" vs "prepared statement" -- the prepared statement binds
by oids, so "search_path changes" should be accompanied by "deallocate all"
or alike.

Vladimir

Re: [HACKERS] Slowness of extended protocol

Robert>But that makes it the job of every driver to implement some
sort of cache, which IMHO isn't a very reasonable position

Let's wait what Shay decides on implementing query cache in npgsql ?
Here's the issue:  https://github.com/npgsql/npgsql/issues/1237

He could change his mind when it comes to the need of "new
ParseBindExecuteDeallocate message".


Robert>I think you should consider the possibility that those people
know what they are talking about.

I do appreciate all the inputs, however, all the performance-related
complaints in this thread I've seen can trivially be solved by adding a
query cache at the driver level + fixing pgbouncer's issue.
Unfortunately, client-side SQL parsing is inevitable evil (see below on
'?'), so any sane driver would cache queries any way. The ones that do not
have query cache will perform badly anyway.

As far as I can understand, the alternative solution is "to add
ParseBindExecDeallocateInOneGo
message to the protocol". This new message would require changes from all
the drivers, and all the pgbouncers. This new message would be slower than
proper query cache, so why should we all spend time on a half-baked
solution?

Of course, I might miss some use cases, that is why keep asking: please
provide close to production scenario that does require the new protocol
message we are talking about.
Note: examples (by Robert and Shay) like "typical web application that
fetches a single row from DB and prints it to the browser" were already
presented, and they are easily covered by the query cache.



To be fair, implementing a cache is a trivial thing when compared with
hard-coding binary/text formats for all the datatypes in each and every
language.
Remember, each driver has to implement its own set of procedures to
input/output values in text/binary format, and that is a way harder than
implementing the cache we are talking about.

If only there was some ability to have language-independent binary transfer
format (protobuf, avro, kryo, whatever)

Regarding query cache: each language has its own notion how bind variables
are represented in SQL text.
For instance, in Go language (as well as in Java), bind placeholders are
represented as '?' character.
Of course, PostgreSQL does not support that (it wants $1, $2, etc), so Go
driver has to parse SQL text at the driver side in order to convert it into
PostgreSQL-compatible flavor. This parser has to support comments, string
literals, etc, etc. It is just natural thing to have a cache there so the
driver does not repeat the same parsing logic again and again (typical
applications are known to use the same query text multiple times).

Robert>When
there's a common need that affects users of many different programming
languages

You are right. No questions here.
Ok, what is a need? What problem does "new message" solve?

>From what I see there is no performance need to introduce
"ParseBindExecDeallocateInOneGo" message. The thread is
performance-related, so I naturally object spending everybody's time on
implementing a useless feature.

Vladimir>Do you agree that the major part would be some hot queries, the
rest will be
Vladimir> much less frequently used ones (e.g. one time queries)?
Robert>Sure, but I don't want the application to have to know about that

Application does not need to know that. It is like "branch prediction in
the CPU". Application does not need to know there is a branch predictor in
the CPU. The same goes for query cache. Application should just continue to
execute SQLs in a sane manner, and the query cache would pick up the
pattern (server-prepare hot-used queries).

Vladimir

Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Stephen Frost

* Vladimir Sitnikov (sitnikov.vladi...@gmail.com) wrote:
> It works completely transparent to the application, and it does use
> server-prepared statements even though application builds "brand new" sql
> text every time.

And is the source of frequent complaints on various mailing lists along
the lines of "why did my query suddently get slow the 6th time it was
run?!".

I encourage you to look through the archives and read up on how and why
that can happen.

Thanks!

Stephen

signature.asc
Description: Digital signature

Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Robert Haas

On Wed, Aug 10, 2016 at 11:50 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> Sure, but I don't want the application to have to know about that, and
>> I don't really think the driver should need to know about that either.
>> Your point, as I understand it, is that sufficiently good query
>> caching in the driver can ameliorate the problem, and I agree with
>> that.
>
> I don't, actually.  If a particular application has a query mix that gets
> a good win from caching query plans, it should already be using prepared
> statements.  The fact that that's not a particularly popular thing to do
> isn't simply because people are lazy, it's because they've found out that
> it isn't worth the trouble for them.  Putting query caching logic into
> drivers isn't going to improve performance for such cases, and it could
> very possibly make things worse.  The driver is the place with the
> absolute least leverage for implementing caching; it has no visibility
> into either the application or the server.

I didn't mean to imply, in any way, that it is an ideal solution -
just that people use it successfully.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Tom Lane

Robert Haas  writes:
> Sure, but I don't want the application to have to know about that, and
> I don't really think the driver should need to know about that either.
> Your point, as I understand it, is that sufficiently good query
> caching in the driver can ameliorate the problem, and I agree with
> that.

I don't, actually.  If a particular application has a query mix that gets
a good win from caching query plans, it should already be using prepared
statements.  The fact that that's not a particularly popular thing to do
isn't simply because people are lazy, it's because they've found out that
it isn't worth the trouble for them.  Putting query caching logic into
drivers isn't going to improve performance for such cases, and it could
very possibly make things worse.  The driver is the place with the
absolute least leverage for implementing caching; it has no visibility
into either the application or the server.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Slowness of extended protocol

2016-08-10 Thread Robert Haas

On Tue, Aug 9, 2016 at 5:07 PM, Vladimir Sitnikov
 wrote:
> I do not buy that "dynamically generated queries defeat server-prepared
> usage" argument. It is just not true (see below).
>
> Do you mean "in language X, where X != Java it is impossible to implement a
> query cache"?
> That is just ridiculus.

Well, multiple experienced users are telling you that this is a real
problem.  You certainly don't have to agree, but when a bunch of other
people who are smart and experienced think that something is a
problem, I think you should consider the possibility that those people
know what they are talking about.

> Do you agree that the major part would be some hot queries, the rest will be
> much less frequently used ones (e.g. one time queries)?

Sure, but I don't want the application to have to know about that, and
I don't really think the driver should need to know about that either.
Your point, as I understand it, is that sufficiently good query
caching in the driver can ameliorate the problem, and I agree with
that.  But that makes it the job of every driver to implement some
sort of cache, which IMHO isn't a very reasonable position.  When
there's a common need that affects users of many different programming
languages, the server should make it easy to meet that need, not
require every driver to implement query caching.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Slowness of extended protocol

2016-08-09 Thread Vladimir Sitnikov

Robert Haas:

> but for some reason you can't use prepared statements, for example because
> the queries are dynamically generated and .  That case is analogous to -M
> extended, not -M prepared.  And -M extended is well-known to be SLOWER
>

I do not buy that "dynamically generated queries defeat server-prepared
usage" argument. It is just not true (see below).

Do you mean "in language X, where X != Java it is impossible to implement a
query cache"?
That is just ridiculus.

At the end of the day, there will be a finite number of hot queries that
are important.
Here's relevant pgjdbc commit: https://github.com/pgjdbc/pgjdbc/pull/319
It works completely transparent to the application, and it does use
server-prepared statements even though application builds "brand new" sql
text every time.

It is not something theoretical, but it is something that is already
implemented and battle-tested. The application does build SQL texts based
on the names of tables and columns that are shown in the browser, and
pgjdbc uses query cache (to map user SQL to backend statement name), thus
it benefits from server-prepared statements automatically.

Not a single line change was required at the application side.

Am I missing something?
I cannot imagine a real life case when an application throws 10'000+ UNIQUE
SQL texts per second at the database.
Cases like "where id=1", "where id=2", "where id=3" do not count as they
should be written with bind variables, thus it represents a single SQL text
like "where id=$1".

Robert>you have to keep sending a different query text every time

Do you agree that the major part would be some hot queries, the rest will
be much less frequently used ones (e.g. one time queries)?

In OLTP applications the number of queries is high, and almost all the
queries are reused.
server-prepared to rescue here.
"protocol optimization" would not be noticeable.

In DWH applications the queries might be unique, however the number of
queries is much less, thus the "protocol optimization" would be invisible
as the query plan/process time would be much higher than the gain from
"protocol
optimization".

Vladimir

Re: [HACKERS] Slowness of extended protocol

2016-08-09 Thread Robert Haas

On Tue, Aug 9, 2016 at 4:50 AM, Vladimir Sitnikov
 wrote:
> I've tried pgbench -M prepared, and it is way faster than pgbench -M simple.

That's true, but it's also testing something completely different from
what Shay is concerned about.  -M prepared creates a prepared
statement once, and then executes it many times.  That is, as you say,
faster.  But what Shay is concerned about is the case where you are
using the extended query protocol to send out-of-line parameters but
for some reason you can't use prepared statements, for example because
the queries are dynamically generated and you have to keep sending a
different query text every time.  That case is analogous to -M
extended, not -M prepared.  And -M extended is well-known to be SLOWER
than -M simple.  Here's a quick test on my laptop demonstrating this:

[rhaas ~]$ pgbench -M simple -S -T 10
starting vacuum...end.
transaction type: 
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 119244
latency average: 0.084 ms
tps = 11919.440321 (including connections establishing)
tps = 11923.229139 (excluding connections establishing)
[rhaas ~]$ pgbench -M prepared -S -T 10
starting vacuum...end.
transaction type: 
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 192100
latency average: 0.052 ms
tps = 19210.341944 (including connections establishing)
tps = 19214.820999 (excluding connections establishing)
[rhaas ~]$ pgbench -M extended -S -T 10
starting vacuum...end.
transaction type: 
scaling factor: 1
query mode: extended
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 104275
latency average: 0.096 ms
tps = 10425.510813 (including connections establishing)
tps = 10427.725239 (excluding connections establishing)

Shay is not worried about the people who are getting 19.2k TPS instead
of 11.9k TPS.  Those people are already happy.  He is worried about
the people who are getting 10.4k TPS instead of 11.9k TPS.  People who
can't use prepared statements because their query text varies - and I
have personally written multiple web applications that have exactly
that profile - suffer a big penalty if they choose to use the extended
query protocol to pass parameters.  Here, it's about a 13% performance
loss.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Slowness of extended protocol

2016-08-09 Thread Shay Rojansky

On Tue, Aug 9, 2016 at 3:42 PM, Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:

> Shay>But here's the more important general point. We're driver
> developers, not application developers. I don't really know what
> performance is "just fine" for each of my users, and what is not worth
> optimizing further. Users may follow best practices, or they may not for
> various reasons.
>
> Of course we cannot benchmark all the existing applications, however we
> should at lest try to use "close to production" benchmarks.
>
> Let me recap: even "select 1" shows clear advantage of reusing
> server-prepared statements.
> My machine shows the following results for "select 1 pgbench":
> simple: 38K ops/sec (~26us/op)
> extended: 32K ops/sec (~31us/op)
> prepared: 47K ops/sec (~21us/op)
>
> Note: reusing server-prepared statements shaves 10us (out of 31us), while
> "brand new ParseBindExecDeallocate" message would not able to perform
> better than 26us/op (that is 5 us worse than the prepared one). So it makes
> much more sense investing time in "server-prepared statement reuse" at the
> client side and "improving Bind/Exec performance" at the backend side.
>
> For more complex queries the gap (prepared vs simple) would be much bigger
> since parse/validate/plan for a complex query is much harder operation than
> the one for "select 1"
>

You seem to be misunderstanding the fundamental point here. Nobody is
saying that prepared statements aren't a huge performance booster - they
are. I recommend them to anyone who asks. But you're saying "let's not
optimize anything else", whereas there are many programs out there *not*
using prepared statements for various reasons (e.g. pgbouncer, or simply an
existing codebase). If your opinion is that nothing should be done for
these users, fine - nobody's forcing you to do anything. I simply don't see
why *not* optimize the very widely-used single-statement execution path.

> Note: I do not mean "look, prepared always win". I mean: "if your
> environment does not allow reuse of prepared statements for some reason,
> you lose huge amount of time on re-parsing the queries, and it is worth
> fixing that obvious issue first".
>

Maybe it's worth fixing it, maybe not - that's going to depend on the
application. Some applications may be huge/legacy and hard to change,
others may depend on something like pgbouncer which doesn't allow it. Other
drivers out there probably don't persist prepared statements across
close/open, making prepared statements useless for short-lived scenarios.
Does the Python driver persist prepared statements? Does the Go driver do
so? If not the single-execution flow is very relevant for optimization.

> Shay>I don't see how reusing SQL text affects security in any way.
>
> Reusing SQL text makes application more secure as "build SQL on the fly"
> is prone to SQL injection security issues.
> So reusing SQL text makes application more secure and it enables
> server-prepared statements that improve performance considerably. It is a
> win-win.
>

We've all understood that server-prepared statements are good for
performance. But they aren't more or less vulnerable to SQL injection -
developers can just as well concatenate user-provided strings into a
prepared SQL text.

[deleting more comments trying to convince that prepared statements are
great for performance, which they are]

Shay>Again, in a world where prepared statements aren't persisted across
> connections (e.g. pgbouncer)
>
> pgbouncer does not properly support named statements, and that is
> pbgouncer's issue.
>
> Here's the issue for pgbouncer project: https://github.com/
> pgbouncer/pgbouncer/issues/126#issuecomment-200900171
> The response from pgbouncer team is "all the protocol bits are there, it
> is just implementation from pgbouncer that is missing".
>
> By the way: I do not use pgbouncer, thus there's no much interest for me
> to invest time in fixing pgbouncer's issues.
>

Um, OK... So you're not at all bothered by the fact that the (probably)
most popular PostgreSQL connection pool is incompatible with your argument?
I'm trying to think about actual users and the actual software they use, so
pgbouncer is very relevant.

Shay>Any scenario where you open a relatively short-lived connection and
> execute something once is problematic - imagine a simple web service which
> needs to insert a single record into a table.
>
> I would assume the application does not use random string for a table name
> (and columns/aliases), thus it would result in typical SQL text reuse, thus
> it should trigger "server-side statement prepare" logic. In other way, that
> kind of application does not need the "new ParseBindExecDeallocate message
> we are talking about".
>

I wouldn't assume anything. Maybe the application does want to change
column names (which can't be parameterized). Maybe it needs the extended
protocol simply because it wants to do binary encoding, which isn't
possible with the simple protocol

Re: [HACKERS] Slowness of extended protocol

2016-08-09 Thread Vladimir Sitnikov

Shay>But here's the more important general point. We're driver developers,
not application developers. I don't really know what performance is "just
fine" for each of my users, and what is not worth optimizing further. Users
may follow best practices, or they may not for various reasons.

Of course we cannot benchmark all the existing applications, however we
should at lest try to use "close to production" benchmarks.

Let me recap: even "select 1" shows clear advantage of reusing
server-prepared statements.
My machine shows the following results for "select 1 pgbench":
simple: 38K ops/sec (~26us/op)
extended: 32K ops/sec (~31us/op)
prepared: 47K ops/sec (~21us/op)

Note: reusing server-prepared statements shaves 10us (out of 31us), while
"brand new ParseBindExecDeallocate" message would not able to perform
better than 26us/op (that is 5 us worse than the prepared one). So it makes
much more sense investing time in "server-prepared statement reuse" at the
client side and "improving Bind/Exec performance" at the backend side.

For more complex queries the gap (prepared vs simple) would be much bigger
since parse/validate/plan for a complex query is much harder operation than
the one for "select 1"

Note: I do not mean "look, prepared always win". I mean: "if your
environment does not allow reuse of prepared statements for some reason,
you lose huge amount of time on re-parsing the queries, and it is worth
fixing that obvious issue first".

Shay>I don't see how reusing SQL text affects security in any way.

Reusing SQL text makes application more secure as "build SQL on the fly" is
prone to SQL injection security issues.
So reusing SQL text makes application more secure and it enables
server-prepared statements that improve performance considerably. It is a
win-win.

Shay>a new feature in the Npgsql dev branch which allows prepared
statements to be persisted across open/close on pooled connections

Do you have some benchmark results for "reusing server-prepared statements
across open/close on pooled"? I would expect that feature to be a great win.

Once again, I'd like to focus on real-life cases, not artificial ones.

For example, the major part of my performance fixes to pgjdbc were made as
a part of improving my java application that was suffering from performance
issues when talking to PostgreSQL.
For instance, there were queries that took 20ms to plan and 0.2ms to
execute (the query is like where id=? but the SQL text was more involved).
As transparent server-side statement was implemented at pgjdbc side, it
shaved those 20ms by eliminating Parse messages on the hot path.

In other words, it was not just "lets optimize pgjdbc". It was driven by
the need to optimize the client application, and the profiling results were
pointing to pgjdbc issues.

Shay>Again, in a world where prepared statements aren't persisted across
connections (e.g. pgbouncer)

pgbouncer does not properly support named statements, and that is
pbgouncer's issue.

Here's the issue for pgbouncer project:
https://github.com/pgbouncer/pgbouncer/issues/126#issuecomment-200900171
The response from pgbouncer team is "all the protocol bits are there, it is
just implementation from pgbouncer that is missing".

By the way: I do not use pgbouncer, thus there's no much interest for me to
invest time in fixing pgbouncer's issues.


Shay>Any scenario where you open a relatively short-lived connection and
execute something once is problematic - imagine a simple web service which
needs to insert a single record into a table.

I would assume the application does not use random string for a table name
(and columns/aliases), thus it would result in typical SQL text reuse, thus
it should trigger "server-side statement prepare" logic. In other way, that
kind of application does not need the "new ParseBindExecDeallocate message
we are talking about".

In other words, if an application is using "select name from objects where
id=$1" kind of queries, the driver should be using extended protocol
(Bind/Exec) behind the scenes if it does aim to get high performance.

Vladimir

Re: [HACKERS] Slowness of extended protocol

2016-08-09 Thread Shay Rojansky

On Tue, Aug 9, 2016 at 8:50 AM, Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:

> Shay>There are many scenarios where connections are very short-lived
> (think about webapps where a pooled connection is allocated per-request and
> reset in between)
>
> Why the connection is reset in between in the first place?
> In pgjdbc we do not reset per-connection statement cache, thus we easily
> reuse named statements for pooled connections.
>

A DISCARD ALL is sent when the connection is returned to the pool, the
prevent state leakage etc. You make a valid comment though - there's
already a new feature in the Npgsql dev branch which allows prepared
statements to be persisted across open/close on pooled connections, it's
interesting to learn that it is standard behavior in pgjdbc. At least up to
now, the logic was not to implcitly keep holding server-side resources
across a pooled connection close, because these may become a drain on the
server etc.

More important, unless I'm mistaken pgbouncer also sends DISCARD ALL to
clean the connection state, as will other pooling solutions. That
unfortunately means that you can't always depend on prepared statements to
persist after closing the connection.

> Shay>There are also many scenarios where you're not necessarily going to
> send the same query multiple times in a single connection lifespan, so
> preparing is again out of the question.
>

> Can you list at least one scenario of that kind, so we can code it into
> pgbench (or alike) and validate "simple vs prepared" performance?
>

Again, in a world where prepared statements aren't persisted across
connections (e.g. pgbouncer), this scenario is extremely common. Any
scenario where you open a relatively short-lived connection and execute
something once is problematic - imagine a simple web service which needs to
insert a single record into a table.

Shay>And more generally, there's no reason for a basic, non-prepared
> execution to be slower than it can be.
>
> That's too generic. If the performance for "end-to-end cases" is just
> fine, then it is not worth optimizing further. Typical application best
> practice is to reuse SQL text (for both security and performance point of
> views), so in typical applications I've seen, query text was reused, thus
> it naturally was handled by server-prepared logic.
>

I don't see how reusing SQL text affects security in any way.

But here's the more important general point. We're driver developers, not
application developers. I don't really know what performance is "just fine"
for each of my users, and what is not worth optimizing further. Users may
follow best practices, or they may not for various reasons. They may be
porting code over from SqlServer, where prepare is almost never used
(because SqlServer caches plans implicitly), or they may simply not be very
good programmers. The API for executing non-prepared statements is there,
we support it and PostgreSQL supports it - it just happens to not be very
fast. Of course we can say "screw everyone not preparing statements", but
that doesn't seem like a very good way to treat users. Especially since the
fix isn't that hard.

Let me highlight another direction: current execution of server-prepared
> statement requires some copying of "parse tree" (or whatever). I bet it
> would be much better investing in removal of that copying rather than
> investing into "make one-time queries faster" thing. If we could make
> "Exec" processing faster, it would immediately improve tons of applications.
>

I don't understand what exactly you're proposing here, but if you have some
unrelated optimization that would speed up prepared statements, by all
means that's great. It's just unrelated to this thread.

> Shay>Of course we can choose a different query to benchmark instead of
> SELECT 1 - feel free to propose one (or several).
>
> I've tried pgbench -M prepared, and it is way faster than pgbench -M
> simple.
>
> Once again: all cases I have in mind would benefit from reusing
> server-prepared statements. In other words, after some warmup the
> appication would use just Bind-Execute-Sync kind of messages, and it would
> completely avoid Parse/Describe ones.
>

Of course that's the ideal scenario. It's just not the *only* scenario for
all users - they may either not have prepared statements persisting across
open/close as detailed above, or their code may simply not be preparing
statements at the moment. Why not help them out for free?

Shay>FYI in Npgsql specifically describe isn't used to get any knowledge
> about parameters - users must populate the correct parameters or query
> execution fails.
>
> I think the main reason to describe for pgjdbc is to get result oids.
> pgjdbc is not "full binary", thus it has to be careful which fields it
> requests in binary format.
> That indeed slows down "unknown queries", but as the query gets reused,
> pgjdbc switches to server-prepared execution, and eliminates parse-describe
> overheads

Re: [HACKERS] Slowness of extended protocol

2016-08-09 Thread Vladimir Sitnikov

Shay>There are many scenarios where connections are very short-lived (think
about webapps where a pooled connection is allocated per-request and reset
in between)

Why the connection is reset in between in the first place?
In pgjdbc we do not reset per-connection statement cache, thus we easily
reuse named statements for pooled connections.

Shay>and the extra roundtrip that preparing entails is too much.

When server-prepared statement gets reused, neither parse neither describe
are used.

Shay>There are also many scenarios where you're not necessarily going to
send the same query multiple times in a single connection lifespan, so
preparing is again out of the question.

Can you list at least one scenario of that kind, so we can code it into
pgbench (or alike) and validate "simple vs prepared" performance?


Shay>And more generally, there's no reason for a basic, non-prepared
execution to be slower than it can be.

That's too generic. If the performance for "end-to-end cases" is just fine,
then it is not worth optimizing further. Typical application best practice
is to reuse SQL text (for both security and performance point of views), so
in typical applications I've seen, query text was reused, thus it naturally
was handled by server-prepared logic.

Let me highlight another direction: current execution of server-prepared
statement requires some copying of "parse tree" (or whatever). I bet it
would be much better investing in removal of that copying rather than
investing into "make one-time queries faster" thing. If we could make
"Exec" processing faster, it would immediately improve tons of applications.


Shay>Of course we can choose a different query to benchmark instead of
SELECT 1 - feel free to propose one (or several).

I've tried pgbench -M prepared, and it is way faster than pgbench -M simple.

Once again: all cases I have in mind would benefit from reusing
server-prepared statements. In other words, after some warmup the
appication would use just Bind-Execute-Sync kind of messages, and it would
completely avoid Parse/Describe ones.

If a statement is indeed "one-time" statement, then I do not care much how
long it would take to execute.

Shay>FYI in Npgsql specifically describe isn't used to get any knowledge
about parameters - users must populate the correct parameters or query
execution fails.

I think the main reason to describe for pgjdbc is to get result oids.
pgjdbc is not "full binary", thus it has to be careful which fields it
requests in binary format.
That indeed slows down "unknown queries", but as the query gets reused,
pgjdbc switches to server-prepared execution, and eliminates parse-describe
overheads completely.

Vladimir

Re: [HACKERS] Slowness of extended protocol

2016-08-09 Thread Shay Rojansky

On Mon, Aug 8, 2016 at 6:44 PM, Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:
>
> The problem with "empty statement name" is statements with empty name can
>> be reused (for instance, for batch insert executions), so the server side
>> has to do a defensive copy (it cannot predict how many times this unnamed
>> statement will be used).
>>
>
That seems right.

Also, part of the point here is to reduce the number of protocol messages
>> needed in order to send a parameterized query - not to have to do
>> Parse/Describe/Bind/Execute/Sync - since part of the slowdown comes from
>> that (although I'm not sure how much). Your proposal keeps the 5 messages.
>>
>
> As my benchmarks show, notable overhead is due to "defensive copying of
> the execution plan". So I would measure first, and only then would claim
> where the overhead is.
>

> Some more profiling is required to tell which part is a main time consumer.
>

Tom also pointed to the caching as the main culprit, although there seems
to be some message-related overhead as well. It seems that things like
process title changes may be fixable easily - do we really need a process
title change on every message (as opposed to, say, execute messages only).
This profiling and optimization effort can happen in parallel to the
discussion on what to do with the execution plan caching.

> Technically speaking, I would prefer to have a more real-life looking
> example instead of SELECT 1.
> Do you have something in mind?
> For instance, for more complex queries, "Parse/Plan" could cost much more
> than we shave by adding "a special non-cached statement name" or by
> reducing "5 messages into 1".
>
> There's a side problem: describe message requires full roundtrip since
> there are cases when client needs to know how to encode parameters.
> Additional roundtrip hurts much worse than just an additional message that
> is pipelined (e.g. sent in the same TCP packet).
>

This is true, but there doesn't seem to be anything we can do about it - if
your usage relies on describe to get information on parameters (as opposed
to results), you're stuck with an extra roundtrip no matter what. So it
seems you have to use the extended protocol anyway.

FYI in Npgsql specifically describe isn't used to get any knowledge about
parameters - users must populate the correct parameters or query execution
fails.

> Note: it is quite easy to invent a name that is not yet used in the wild,
>>> so it is safe.
>>>
>>
>> That's problematic, how do you know what's being used in the wild and
>> what isn't? The protocol has a specification, it's very problematic to get
>> up one day and to change it retroactively. But again, the empty statement
>> seems to already be there for that.
>>
>
> Empty statement has different semantics, and it is wildly used.
> For instance, pgjdbc uses unnamed statements a lot.
> On the other hand, statement name of "!pq@#!@#42" is rather safe to use
> as a special case.
> Note: statement names are not typically created by humans (statement name
> is not a SQL), and very little PG clients do support named statements.
>

IMHO this simply isn't the kind of thing one does in a serious protocol of
a widely-used product, others seem to agree on this.

> Sir, any new SQL keyword is what you call a "retroactively defining
special semantics".
> It's obvious that very little current clients do use named
server-prepared statements.
> Statement names are not something that is provided by the end-user in a
web page, so it is not a rocket science to come up with a
> statement name that is both short and "never ever used in the wild".

The difference is that before the new SQL keyword is added, trying to use
it results in an error. What you're proposing is taking something that
already works in one way and changing its behavior.

> Shay, can you come up with a real-life use case when those "I claim the
statement will be used only once" is would indeed improve performance?
> Or, to put it in another way: "do you have a real-life case when simple
protocol is faster than extended protocol with statement reuse"?

> I do have a couple of java applications and it turns out there's a huge
win of reusing server-prepared statements.
> There's a problem that "generic plan after 5th execution might be much
worse than a specific one", however those statements are not often
> and I just put hints to the SQL (limit 0, +0, CTE, those kind of things).

I maintain Npgsql, which is a general-purpose database driver and not a
specific application. The general .NET database API (ADO.NET), like most
(all?) DB APIs, allows users to send a simple statement to the database
(ExecuteReader, ExecuteScalar, ExecuteNonQuery). Every time a user uses
these APIs without preparing, they pay a performance penalty because the
extended protocol has more overhead than the simple one.

Obviously smart use of prepared statements is a great idea, but it doesn't
work everywhere. There are many scenarios where

Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Tatsuo Ishii

>> On the other hand, usage of some well-defined statement name to trigger
>> the special case would be fine: all pgbouncer versions would pass those
>> parse/bind/exec message as if it were regular messages.
> 
> I do not accept this idea that retroactively defining special semantics
> for certain statement names is not a protocol break.  If it causes any
> change in what the server's response would be, then it is a protocol
> break.

+1. It definitely is a protocol break.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Slowness of extended protocol

I'm sorry, we are discussing technical details with no real-life use case
to cover that.
I do not want to suck time for no reason. Please accept my sincere
apologies for not asking the real-life case earlier.

Shay, can you come up with a real-life use case when those "I claim the
statement will be used only once" is would indeed improve performance?
Or, to put it in another way: "do you have a real-life case when simple
protocol is faster than extended protocol with statement reuse"?

I do have a couple of java applications and it turns out there's a huge win
of reusing server-prepared statements.
There's a problem that "generic plan after 5th execution might be much
worse than a specific one", however those statements are not often and I
just put hints to the SQL (limit 0, +0, CTE, those kind of things).

Tom Lane :

> I do not accept this idea that retroactively defining special semantics
> for certain statement names is not a protocol break.


Sir, any new SQL keyword is what you call a "retroactively defining special
semantics".
It's obvious that very little current clients do use named server-prepared
statements.
Statement names are not something that is provided by the end-user in a web
page, so it is not a rocket science to come up with a statement name that
is both short and "never ever used in the wild".

Tom Lane :

> If it causes any
> change in what the server's response would be, then it is a protocol
> break.
>

I see no changes except "backend would report a protocol violation for the
case when special statement is used and message sequence is wrong".


> > Note: it is quite easy to invent a name that is not yet used in the wild,
> > so it is safe.
>
> Sir, that is utter nonsense.



Tom Lane :

> And even if it were true, why is it that
> this way would safely pass through existing releases of pgbouncer when
> other ways would not?  Either pgbouncer needs to understand what it's
> passing through, or it doesn't.
>

Once again: exiting pgbouncer versions know how to parse
Parse/Bind/Exec/Deallocate messages, so if we bless some well-defined
statement name with a semantics that "it is forbidden to reuse that name
for multiple executions in a row", then that is completely transparent for
pgbouncer.  Pgbouncer would just think that "the application is dumb since
it reparses the same statement again and againt", but it would work just
fine.

On contrary, if a new statement name is added, then pgbouncer would fail to
understand the new message.

Vladimir

Re: [HACKERS] Slowness of extended protocol

Shay Rojansky :

> Ah, I understand the proposal better now - you're not proposing encoding a
> new message type in an old one, but rather a magic statement name in Parse
> which triggers an optimized processing path in PostgreSQL, that wouldn't go
> through the query cache etc.
>

Exactly.


> If so, isn't that what the empty statement is already supposed to do? I
> know there's already some optimizations in place around the scenario of
> empty statement name (and empty portal).
>

The problem with "empty statement name" is statements with empty name can
be reused (for instance, for batch insert executions), so the server side
has to do a defensive copy (it cannot predict how many times this unnamed
statement will be used).

Shay Rojansky :

> Also, part of the point here is to reduce the number of protocol messages
> needed in order to send a parameterized query - not to have to do
> Parse/Describe/Bind/Execute/Sync - since part of the slowdown comes from
> that (although I'm not sure how much). Your proposal keeps the 5 messages.
>

As my benchmarks show, notable overhead is due to "defensive copying of the
execution plan". So I would measure first, and only then would claim where
the overhead is.

Some more profiling is required to tell which part is a main time consumer.
Technically speaking, I would prefer to have a more real-life looking
example instead of SELECT 1.
Do you have something in mind?
For instance, for more complex queries, "Parse/Plan" could cost much more
than we shave by adding "a special non-cached statement name" or by
reducing "5 messages into 1".

There's a side problem: describe message requires full roundtrip since
there are cases when client needs to know how to encode parameters.
Additional roundtrip hurts much worse than just an additional message that
is pipelined (e.g. sent in the same TCP packet).

Shay Rojansky :

> But people seem to be suggesting that a significant part of the overhead
> comes from the fact that there are 5 messages, meaning there's no way to
> optimize this without a new message type.
>

Of course 5 messages are slower than 1 message.
However, that does not mean "there's no way to optimize without a new
message type".
Profiling can easily reveal time consumer parts, then we can decide if
there's a solution.
Note: if we improve "SELECT 1" by 10%, it does not mean we improved
statement execution by 10%. Real-life statements matter for proper
profiling/discussion.

 Shay Rojansky :

> Note: it is quite easy to invent a name that is not yet used in the wild,
>> so it is safe.
>>
>
> That's problematic, how do you know what's being used in the wild and what
> isn't? The protocol has a specification, it's very problematic to get up
> one day and to change it retroactively. But again, the empty statement
> seems to already be there for that.
>

Empty statement has different semantics, and it is wildly used.
For instance, pgjdbc uses unnamed statements a lot.
On the other hand, statement name of "!pq@#!@#42" is rather safe to use as
a special case.
Note: statement names are not typically created by humans (statement name
is not a SQL), and very little PG clients do support named statements.

Vladimir

Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Tom Lane

Shay Rojansky  writes:
> Ah, I understand the proposal better now - you're not proposing encoding a
> new message type in an old one, but rather a magic statement name in Parse
> which triggers an optimized processing path in PostgreSQL, that wouldn't go
> through the query cache etc.

> If so, isn't that what the empty statement is already supposed to do? I
> know there's already some optimizations in place around the scenario of
> empty statement name (and empty portal).

Right, but the problem is that per the current protocol spec, those
optimizations are not allowed to change any protocol-visible behavior.
We might be able to do a bit more optimization than we're doing now, but
we still have to be able to cache the statement in case it's executed more
than once, we still have to do planning in response to a separate 'bind'
message, etc.  AFAICT most of the added overhead comes from having to
allow for the possibility of re-execution: that forces at least one extra
copy of the parse tree to be made, as well as adding manipulations of the
plan cache.  We can't get out from under that without a protocol change.

> Again, if it's possible to make "Parse/Describe/Bind/Execute/Sync" perform
> close to Query, e.g. when specifying empty statement/portal, that's
> obviously the best thing here. But people seem to be suggesting that a
> significant part of the overhead comes from the fact that there are 5
> messages, meaning there's no way to optimize this without a new message
> type.

The rather crude measurements I've done put most of the blame on the
cacheing part, although the number of messages might matter more on
platforms with slow process-title-update support (I'm looking at you,
Windows).

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Shay Rojansky

Vladimir wrote:

> On the other hand, usage of some well-defined statement name to trigger
>>> the special case would be fine: all pgbouncer versions would pass those
>>> parse/bind/exec message as if it were regular messages.
>>>
>>
>> Can you elaborate on what that means exactly? Are you proposing to
>> somehow piggyback on an existing message (e.g. Parse) but use some special
>> statement name that would make PostgreSQL interpret it as a different
>> message type?
>>
>
> Exactly.
> For instance: if client sends Parse(statementName=I_swear_
> the_statement_will_be_used_only_once), then the subsequent message must
> be BindMessage, and the subsequent must be ExecMessage for exactly the same
> statement id.
>

Ah, I understand the proposal better now - you're not proposing encoding a
new message type in an old one, but rather a magic statement name in Parse
which triggers an optimized processing path in PostgreSQL, that wouldn't go
through the query cache etc.

If so, isn't that what the empty statement is already supposed to do? I
know there's already some optimizations in place around the scenario of
empty statement name (and empty portal).

Also, part of the point here is to reduce the number of protocol messages
needed in order to send a parameterized query - not to have to do
Parse/Describe/Bind/Execute/Sync - since part of the slowdown comes from
that (although I'm not sure how much). Your proposal keeps the 5 messages.

Again, if it's possible to make "Parse/Describe/Bind/Execute/Sync" perform
close to Query, e.g. when specifying empty statement/portal, that's
obviously the best thing here. But people seem to be suggesting that a
significant part of the overhead comes from the fact that there are 5
messages, meaning there's no way to optimize this without a new message
type.

Note: it is quite easy to invent a name that is not yet used in the wild,
> so it is safe.
>

That's problematic, how do you know what's being used in the wild and what
isn't? The protocol has a specification, it's very problematic to get up
one day and to change it retroactively. But again, the empty statement
seems to already be there for that.

Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Tom Lane

Vladimir Sitnikov  writes:
> The point is "adding a message to current v3 protocol is not a backward
> compatible change".
> The problem with adding new message types is not only "client support", but
> deployment issues as well: new message would require simultaneous upgrade
> of both backend, client, and pgbouncer.

Right ...

> It could make sense to add some kind of transparent extensibility to the
> protocol, so clients can just ignore unknown message types.

I'm not really sure what use that particular behavior would be.  We
certainly want to try to have some incremental extensibility in there
when we do v4, but I think it would more likely take the form of a way
for client and server to agree on which extensions they support.

> On the other hand, usage of some well-defined statement name to trigger
> the special case would be fine: all pgbouncer versions would pass those
> parse/bind/exec message as if it were regular messages.

I do not accept this idea that retroactively defining special semantics
for certain statement names is not a protocol break.  If it causes any
change in what the server's response would be, then it is a protocol
break.

> Note: it is quite easy to invent a name that is not yet used in the wild,
> so it is safe.

Sir, that is utter nonsense.  And even if it were true, why is it that
this way would safely pass through existing releases of pgbouncer when
other ways would not?  Either pgbouncer needs to understand what it's
passing through, or it doesn't.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Slowness of extended protocol

Shay Rojansky :

> That's definitely a valid point. But do you think it's a strong enough
> argument to avoid ever adding new messages?
>

The point is "adding a message to current v3 protocol is not a backward
compatible change".
The problem with adding new message types is not only "client support", but
deployment issues as well: new message would require simultaneous upgrade
of both backend, client, and pgbouncer.

It could make sense to add some kind of transparent extensibility to the
protocol, so clients can just ignore unknown message types.

For instance, consider a new message "ExtensibleMessage" is added: '#',
int16 (message id), int32 (message length), contents.
For the pgbouncer case above, pgbouncer could just proxy "unknown requests"
as is and it would cover many of the cases out of the box.
That message has a well-defined message length, so it is easy to skip.

If new "messages" required, a new message_id can be allocated and assigned
to the new message type. Old clients would be able to skip the message as
they know its length.

Of course, adding that '#' message does require support from pgbouncer,
etc, etc, however I'm sure it would simplify protocol evolution in the
future.

Of course there might appear a need for a message that cannot be ignored by
the client, but I think "even a bit of flexibility" is better than no
flexibility at all.

Technically speaking, there's a "NotificationResponse" message, however it
is not that good since it cannot have 0 bytes in the contents.

Shay Rojansky :

>
>> On the other hand, usage of some well-defined statement name to trigger
>> the special case would be fine: all pgbouncer versions would pass those
>> parse/bind/exec message as if it were regular messages.
>>
>
> Can you elaborate on what that means exactly? Are you proposing to somehow
> piggyback on an existing message (e.g. Parse) but use some special
> statement name that would make PostgreSQL interpret it as a different
> message type?
>

Exactly.
For instance: if client sends
Parse(statementName=I_swear_the_statement_will_be_used_only_once), then the
subsequent message must be BindMessage, and the subsequent must be
ExecMessage for exactly the same statement id.

Then backend could recognize the pattern and perform the optimization.
Note: it is quite easy to invent a name that is not yet used in the wild,
so it is safe.
>From security point of view (e.g. if a client would want to exploit
use-after-free kind of issues), backend could detect deviations from this
parse-bind-exec sequence and just drop the mic off.

Shay Rojansky :

> Apart from being a pretty horrible hack,
>

I would not call it a horrible hack. That is just a clever use of existing
bits, and it does not break neither backward nor forward compatibility.
Backward compatibility: new clients would be compatible with old PG
versions (even 8.4).
Forward compatibility: even if the support of that special statement name
would get dropped for some reason, there will be no application issues (it
would just result in a slight performance degradation).

Shay Rojansky :

> it would still break pgbouncer, which has to actually inspect and
> understand SQL being sent to the database (e.g. to know when transactions
> start and stop).
>

Note: I do not suggest to change message formats. The message itself is
just fine and existing pgbouncer versions can inspect the SQL. The
difference is a special statementName, and I see no reasons for that kind
of change to break pgbouncer.

Vladimir

Re: [HACKERS] Slowness of extended protocol

2016-08-08 Thread Shay Rojansky

>
> We could call this "protocol 3.1" since it doesn't break backwards
>> compatibility (no incompatible server-initiated message changes, but it
>> does include a feature that won't be supported by servers which only
>> support 3.0. This could be a sort of "semantic versioning" for the protocol
>> - optional new client-initiated features are a minor version bump, others
>> are a major version bump...
>>
>
> Adding a new message is not backward compatible since it will fail in
> pgbouncer kind of deployments.
> Suppose there's "new backend", "old pgbouncer", "new client" deployment.
> If the client tries to send the new message, it will fail since pgbouncer
> would have no idea what to do with that new message.
>

That's definitely a valid point. But do you think it's a strong enough
argument to avoid ever adding new messages?


> On the other hand, usage of some well-defined statement name to trigger
> the special case would be fine: all pgbouncer versions would pass those
> parse/bind/exec message as if it were regular messages.
>

Can you elaborate on what that means exactly? Are you proposing to somehow
piggyback on an existing message (e.g. Parse) but use some special
statement name that would make PostgreSQL interpret it as a different
message type? Apart from being a pretty horrible hack, it would still break
pgbouncer, which has to actually inspect and understand SQL being sent to
the database (e.g. to know when transactions start and stop).

Re: [HACKERS] Slowness of extended protocol