Re: [PERFORM] Correct use of cursors for very large result sets in Postgres

2017-02-23 Thread John Gorman
Yes, of course; that's all verified and taken into account during code
initialization.


From: Vitalii Tymchyshyn [mailto:v...@tym.im]
Sent: Wednesday, February 22, 2017 8:14 PM
To: John Gorman; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Correct use of cursors for very large result sets in 
Postgres

For JDBC there are certain prerequisites for setFetchSize to work, e.g. using
forward-only result sets and running inside a transaction (autocommit off).




Re: [PERFORM] Correct use of cursors for very large result sets in Postgres

2017-02-22 Thread Vitalii Tymchyshyn
For JDBC there are certain prerequisites for setFetchSize to work, e.g. using
forward-only result sets and running inside a transaction (autocommit off).



Re: [PERFORM] Correct use of cursors for very large result sets in Postgres

2017-02-21 Thread John Gorman
My experience with cursors in PostgreSQL with Java has been to stay away from 
them. We support 2 databases with our product, PostgreSQL (default) and SQL 
Server. While re-encrypting data in a database, the application used cursors
with a fetch size of 1000.

It worked perfectly on SQL Server, and on PostgreSQL until we got to a
PostgreSQL table with more than 11 million rows. After spending weeks trying to
figure out what was happening, I realized that once a table has more than 10
million rows, for some reason the cursor functionality just silently stops
working and the entire table is read. I asked another very senior architect to
look at it and he came to the same conclusion. Because of limited time, I ended
up working around it using limit/offset.
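
For illustration, a minimal sketch of the limit/offset batching pattern
described above (the table name, column names and batch size are hypothetical,
not taken from the application in question):


-- Fetch one batch at a time; a stable ORDER BY is needed so that successive
-- pages neither overlap nor skip rows.
SELECT id, payload
FROM some_table
ORDER BY id
LIMIT 1000 OFFSET 0;  -- next batch: OFFSET 1000, then OFFSET 2000, and so on

-- Note that OFFSET still produces and discards the skipped rows, so each
-- successive batch becomes a little more expensive on very large tables.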

Again we are using Java, so the problem could just be in the PostgreSQL JDBC 
driver. Also we were on 9.1 at the time.

Regards
John




Re: [PERFORM] Correct use of cursors for very large result sets in Postgres

2017-02-21 Thread Mike Beaton
Thanks, Tom.

Wouldn't this mean that cursors are noticeably non-optimal even for normal
data sizes, since the entire data to be streamed from the table is always
duplicated into another buffer and then streamed?

> if you want the whole query result at once, why are you bothering with a
> cursor?

The PostgreSQL docs
(https://www.postgresql.org/docs/9.6/static/plpgsql-cursors.html#AEN66551)
clearly recommend cursors as a way to return a reference to a large result set
from a function (as I understood, this is recommended precisely as a way to
avoid tuple-based buffering of the data).

So following that advice, it's not unreasonable that I would actually have
a cursor to a large dataset.

Then, I would ideally want to be able to fetch the data from that cursor
without the entire data getting duplicated (even if only a bit at a time
instead of all at once, which seems to be the best case behaviour) as I go.

Additionally, I thought that if I had a streaming use-case (which I do),
and a streaming data-access layer (which I do), then since `SELECT * FROM
large` is absolutely fine, end-to-end, in that situation, then by symmetry
and the principle of least astonishment `FETCH ALL FROM cursor` might be
fine too.


Re: [PERFORM] Correct use of cursors for very large result sets in Postgres

2017-02-21 Thread Tom Lane
Mike Beaton  writes:
> New TL;DR (I'm afraid): PostgreSQL is always generating a huge buffer file
> on `FETCH ALL FROM CursorToHuge`.

I poked into this and determined that it's happening because pquery.c
executes FETCH statements the same as it does with any other
tuple-returning utility statement, ie "run it to completion and put
the results in a tuplestore, then send the tuplestore contents to the
client".  I think the main reason nobody worried about that being
non-optimal was that we weren't expecting people to FETCH very large
amounts of data in one go --- if you want the whole query result at
once, why are you bothering with a cursor?

This could probably be improved, but it would (I think) require inventing
an additional PortalStrategy specifically for FETCH, and writing
associated code paths in pquery.c.  Don't know when/if someone might get
excited enough about it to do that.

regards, tom lane




Re: [PERFORM] Correct use of cursors for very large result sets in Postgres

2017-02-21 Thread Mike Beaton
The generated buffer is 140MB, not 14MB. At 14 bytes per row, that makes
sense (10,000,000 rows at 14 bytes per row is roughly 140MB).

I have done another test.

If I execute `FETCH ALL FROM cursor` I get a 140MB disk buffer file, on the
PostgreSQL server, reported in its log.

If I execute `FETCH 5000000 FROM cursor` (exactly half the rows), I see a
70MB disk buffer file.

This is regardless of how many rows I actually stream from the connection
before closing the cursor.


Re: [PERFORM] Correct use of cursors for very large result sets in Postgres

2017-02-18 Thread Mike Beaton
> Seems odd.  Is your cursor just on "SELECT * FROM table", or is there
> some processing in there you're not mentioning?  Maybe it's a cursor
> WITH HOLD and you're exiting the source transaction?

Hi Tom,

I've deleted my own Stack Overflow answer in favour of Laurenz Albe's one.

New TL;DR (I'm afraid): PostgreSQL is always generating a huge buffer file
on `FETCH ALL FROM CursorToHuge`.

The test data is created by:

`SELECT * INTO large FROM generate_series(1, 10000000) id;`

The test function to generate the cursor is:


CREATE OR REPLACE FUNCTION lump() RETURNS refcursor
   LANGUAGE plpgsql AS
$$DECLARE
   c CURSOR FOR SELECT id FROM large;
BEGIN
   c := 'c';
   OPEN c;
   RETURN c;
END;$$;


The two tests are:

`SELECT * FROM large;`

Result: no buffer file.

And:


BEGIN;
SELECT lump();
FETCH ALL FROM c;
COMMIT;


Result: 14MB buffer, every time.

The buffer file appears in `base\pgsql_tmp` while the data is streaming but
only appears in the Postgres log file at the point when it is released
(which makes sense, as its final size is part of the log row).

This has the additionally confusing result that the buffer file is reported
in the Postgres logs just before the user sees the first row of data on
`psql` (and on anything using `libpq`), but just after the user sees the
last row of data, on any client program which is streaming the data via a
streaming data access layer (such as `Npgsql`, or `JDBC` with the right
configuration).


Re: [PERFORM] Correct use of cursors for very large result sets in Postgres

2017-02-18 Thread Tom Lane
Mike Beaton  writes:
> One outstanding question I have. Based on a lot of helpful responses given
> to the SO question I can now test and see what disk buffers are generated
> (by setting `log_temp_files` to `0` and then `tail -f log`), as well as how
> long it takes for results to start arriving.

> With a large (10,000,000 row) test table, if I do `SELECT * FROM table` on
> psql it starts to return results immediately with no disk buffer. If I do
> `FETCH ALL FROM cursortotable` on psql it takes about 7.5 seconds to start
> returning results, and generates a 14MB buffer. If I do `SELECT * FROM
> table` on a correctly coded streaming client, it also starts to return
> results immediately with no disk buffer. But if I do `FETCH ALL FROM
> cursortotable` from my streaming client, it takes about 1.5 seconds for
> results to start coming... but again with no disk buffer, as hoped

Seems odd.  Is your cursor just on "SELECT * FROM table", or is there
some processing in there you're not mentioning?  Maybe it's a cursor
WITH HOLD and you're exiting the source transaction?
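
For context on the WITH HOLD question: a held cursor has its whole result set
materialised into a temporary store when the creating transaction commits, so
that the cursor can outlive the transaction, and that alone would produce a
large buffer file. A minimal sketch (hypothetical, not taken from the setup
under discussion):


BEGIN;
DECLARE held_cursor CURSOR WITH HOLD FOR SELECT * FROM large;
COMMIT;                     -- the full result set is saved here so the cursor
                            -- survives the end of its transaction
FETCH 10 FROM held_cursor;  -- reads from the saved copy, not a live query
CLOSE held_cursor;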

regards, tom lane




Re: [PERFORM] Correct use of cursors for very large result sets in Postgres

2017-02-18 Thread Mike Beaton
I meant to say: "the `FETCH 1 FROM cursor` until exhausted pattern will
always be safe". Nasty typo, sorry!


Re: [PERFORM] Correct use of cursors for very large result sets in Postgres

2017-02-17 Thread Mike Beaton
Dear Tom,

This is very helpful, thank you.

You make a very useful point that the limitation is basically on PL/pgSQL
and other PL languages. And someone on SO already pointed out that an
inline SQL function with an enormous TABLE return value also doesn't
have any buffering problems. So that's a very convenient option, whenever
SQL alone is powerful enough.
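
A minimal sketch of that inline SQL function option, reusing the `large` test
table from elsewhere in this thread (the function name is made up): a
single-query LANGUAGE sql function can be inlined into the calling statement,
so its result need not be materialised the way a PL/pgSQL SETOF or TABLE
return is.


CREATE OR REPLACE FUNCTION stream_large()
RETURNS TABLE (id integer)
LANGUAGE sql STABLE AS
$$ SELECT id FROM large $$;

-- When called in FROM, the planner can inline the function body, so this
-- behaves essentially like running SELECT id FROM large directly:
SELECT * FROM stream_large();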

You make the further very helpful point that any library which is written
using `libpq` won't work as desired on `FETCH ALL FROM HugeCursor`. But I
don't know whether that's 'most' libraries. I think that depends on your
programming milieu! I'm working in the world of ADO.NET (but the same seems
to apply to JDBC) where most low level drivers are not written using
`libpq` but rather directly with sockets against the database - which makes
sense because a streaming data reader is part of the contract which those
drivers have to implement.

It's definitely worth noting that the `FETCH 10 FROM cursor` until
exhausted pattern will *always* be safe. But most fundamentally I did, very
specifically, want to know if the `FETCH ALL FROM
CursorToAstronomicallyLargeData` pattern can *ever* work sensibly. It seems
it clearly can and does if certain assumptions are met. Assumptions which I
actually know *are* met, in the case in which I potentially wanted to use
it!
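
A minimal sketch of that fetch-in-batches pattern, reusing the `lump()`
function and cursor name `c` shown elsewhere in this thread (the batch size is
arbitrary):


BEGIN;
SELECT lump();      -- opens the cursor and returns its name, 'c'
FETCH 1000 FROM c;  -- repeat until a FETCH returns fewer than 1000 rows
FETCH 1000 FROM c;
-- ...
CLOSE c;
COMMIT;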

One outstanding question I have. Based on a lot of helpful responses given
to the SO question I can now test and see what disk buffers are generated
(by setting `log_temp_files` to `0` and then `tail -f log`), as well as how
long it takes for results to start arriving.
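
One way to set that up, assuming superuser access (a value of 0 makes
PostgreSQL log every temporary file, whatever its size):


-- For the current session only:
SET log_temp_files = 0;

-- Or persistently, followed by a configuration reload:
ALTER SYSTEM SET log_temp_files = 0;
SELECT pg_reload_conf();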

With a large (10,000,000 row) test table, if I do `SELECT * FROM table` on
psql it starts to return results immediately with no disk buffer. If I do
`FETCH ALL FROM cursortotable` on psql it takes about 7.5 seconds to start
returning results, and generates a 14MB buffer. If I do `SELECT * FROM
table` on a correctly coded streaming client, it also starts to return
results immediately with no disk buffer. But if I do `FETCH ALL FROM
cursortotable` from my streaming client, it takes about 1.5 seconds for
results to start coming... but again with no disk buffer, as hoped.

I was kind of hoping that the 'it creates a buffer' and the 'it takes a
while to start' issues would be pretty much directly aligned, but it's
clearly not as simple as that! I don't know if you can offer any more
helpful insight on this last aspect?

Many thanks,

Mike


Re: [PERFORM] Correct use of cursors for very large result sets in Postgres

2017-02-17 Thread Tom Lane
Mike Beaton  writes:
> [ generally accurate information ]

> **WARNINGS**

> It would *never* make sense to do `FETCH ALL FROM cursor` for
> astronomically large data, if your client side code (including your data
> access layer) has any bottleneck at all which means that all the data
> from a command is fetched before any processing can be done. Many data
> access layers (and especially data access wrappers) are like this. So
> beware. But it is also true that not all client side code is made this way.

It would probably be good to point out that most client-side libraries
will do it that way, including libpq, because then they can make success
or failure of the query look atomic to the application.  If you use an
API that lets you see rows as they come off the wire, it's up to you
to recover properly from a query failure that occurs after some/many rows
have already been returned.

> Returning huge data using a `TABLE` or `SETOF` return type from within a
> PostgreSQL function will *always* be broken (i.e. will create a huge buffer
> and take a very long time to start). This will be so whether the function
> is called from SQL to SQL or called over the wire.

I believe this is false in general.  I think it's probably true for all
the standard PL languages, because they don't want to bother with
suspending/resuming execution, so they make "RETURN NEXT" add the row to
a tuplestore not return it immediately.  But it's definitely possible to
write a C function that returns a row at a time, and depending on what the
calling SQL statement looks like, that could get streamed back to the
client live rather than being buffered first.

As a trivial example, if you do
select generate_series(1,10000000);
in psql and watch what's happening with "top", you'll see psql's memory
usage going through the roof (because libpq tries to buffer the result)
but the connected backend's memory usage is steady as a rock --- nor
does it dump the data into a temporary file.  On the other hand,
select * from generate_series(1,10000000);
does dump the data into a temp file, something we ought to work on
improving.

regards, tom lane




Re: [PERFORM] Correct use of cursors for very large result sets in Postgres

2017-02-17 Thread Mike Beaton
I asked the same question at the same time on Stack Overflow (sincere
apologies if this is a breach of etiquette - I really needed an answer, and
I thought the two communities might not overlap).

Stack Overflow now has an answer, by me:
http://stackoverflow.com/q/42292341/#42297234 - which is based on
accumulating the most consistent, coherent information from the answers and
comments given there so far.

I think this is right, and I happily repeat it below, for anyone finding my
question on this list. But I would still *love* to find official PostgreSQL
documentation of all this. And of course to be told - quickly! - if anyone
knows it is still wrong.

***The answer is:***

**Q1:** For `SELECT * FROM AstronomicallyHugeTable` sent over the wire,
then PostgreSQL will *not* generate a huge buffer, and will stream the data
efficiently, starting quickly, to the client.

**Q2:** For `FETCH ALL FROM CursorToAstronomicallyHugeTable` sent over the
wire, then PostgreSQL will also *not* generate a huge buffer, and also will
stream the data efficiently, starting quickly, to the client.

**Implications of this for `FETCH ALL FROM cursor`**

IF (and this is a big if) you have client software which is NOT going to
store all the fetched data anywhere, but is just trying to do something
with it row by row (and this presupposes that your data access layer
supports this, which Npgsql does), then there is nothing wrong with `FETCH
ALL FROM cursor`. No huge buffers anywhere. No long setup time. Processing
huge data this way will certainly run for a very long time - or at least
until the user or some other condition aborts the process, and the cursor
can be closed. But it will start to run quickly, and its usage of resources
will be efficient.

**WARNINGS**

It would *never* make sense to do `FETCH ALL FROM cursor` for
astronomically large data, if your client side code (including your data
access layer) has any bottleneck at all which means that all the data
from a command is fetched before any processing can be done. Many data
access layers (and especially data access wrappers) are like this. So
beware. But it is also true that not all client side code is made this way.

Returning huge data using a `TABLE` or `SETOF` return type from within a
PostgreSQL function will *always* be broken (i.e. will create a huge buffer
and take a very long time to start). This will be so whether the function
is called from SQL to SQL or called over the wire. The bottleneck is before
the function returns. For efficient returns of very large data sets you
must use a cursor return from a function (or else do `SELECT *` directly
over the wire), in every case.