Re: [HACKERS] libpq Alternate Row Processor

2017-02-14 Thread Jim Nasby

On 2/13/17 8:46 AM, Kyle Gearhart wrote:

> profile_filler.txt
> 61,410,901  ???:_int_malloc [/usr/lib64/libc-2.17.so]
> 38,321,887  ???:_int_free [/usr/lib64/libc-2.17.so]
> 31,400,139  ???:pqResultAlloc [/usr/local/pgsql/lib/libpq.so.5.10]
> 22,839,505  ???:pqParseInput3 [/usr/local/pgsql/lib/libpq.so.5.10]
> 17,600,004  ???:pqRowProcessor [/usr/local/pgsql/lib/libpq.so.5.10]
> 16,002,817  ???:malloc [/usr/lib64/libc-2.17.so]
> 14,716,359  ???:pqGetInt [/usr/local/pgsql/lib/libpq.so.5.10]
> 14,400,000  ???:check_tuple_field_number [/usr/local/pgsql/lib/libpq.so.5.10]
> 13,800,324  main.c:main [/usr/local/src/postgresql-perf/test]
>
> profile_filler_callback.txt
> 16,842,303  ???:pqParseInput3 [/usr/local/pgsql/lib/libpq.so.5.10]
> 14,810,783  ???:_int_malloc [/usr/lib64/libc-2.17.so]
> 12,616,338  ???:pqGetInt [/usr/local/pgsql/lib/libpq.so.5.10]
> 10,000,000  ???:pqSkipnchar [/usr/local/pgsql/lib/libpq.so.5.10]
>  9,200,004  main.c:process_callback [/usr/local/src/postgresql-perf/test]


Wow, that's a heck of a difference.

There's a ton of places where the backend copies data for no other 
purpose than to put it into a different memory context. I'm wondering if 
there's improvement to be had there as well, or whether palloc is so 
much faster than malloc that it's not an issue. I suspect that some of 
the effects are being masked by other things since presumably palloc and 
memcpy are pretty cheap on small volumes of data...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq Alternate Row Processor

2017-02-13 Thread Kyle Gearhart
On Mon, Feb 13, 2017 Merlin Moncure wrote:
> A barebones callback mode ISTM is a complete departure from the classic 
> PGresult interface.  This code is pretty unpleasant IMO:
> acct->abalance = *((int*)PQgetvalue(res, 0, i));
> acct->abalance = __bswap_32(acct->abalance);

> Your code is faster but foists a lot of the work on the user, so it's kind of 
> cheating in a way (although very carefully written applications might be able 
> to benefit).

The bit you call out above is for single row mode.  Binary mode is a slippery 
slope, with or without the proposed callback.

Let's remember that one of the biggest, often overlooked, gains when using an 
ORM is that it abstracts all this mess away.  The goal here is to prevent all 
the ORM/framework folks from having to implement the protocol.  Otherwise they get 
to wait on libpq to copy from the socket to the PGconn buffer to the PGresult 
structure to their buffers.  The callback keeps the slowest guy on the 
team...on the bench. 


Kyle Gearhart




Re: [HACKERS] libpq Alternate Row Processor

2017-02-13 Thread Merlin Moncure
On Mon, Feb 13, 2017 at 8:46 AM, Kyle Gearhart wrote:
> On 2/9/17 7:15 PM, Jim Nasby wrote:
>> Can you run a trace to see where all the time is going in the single row 
>> case? I don't see an obvious time-suck with a quick look through the code. 
>> It'd be interesting to see how things change if you eliminate the filler 
>> column from the SELECT.
>
> Traces are attached, these are with callgrind.
>
> profile_nofiller.txt: single row without filler column
> profile_filler.txt: single row with filler column
> profile_filler_callback.txt: callback with filler column
>
> pqResultAlloc looks to hit malloc pretty hard.  The callback reduces all of 
> that to a single malloc for each row.

Couldn't that be optimized, say, by preserving malloc'd memory when in
single row mode and recycling it?  (IIRC during the single row mode
discussion this optimization was voted down).
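
A minimal sketch of the recycling idea, outside of libpq: keep one growable buffer alive across rows and only reallocate when a row is larger than anything seen so far. RowBuf, rowbuf_store and rowbuf_free are hypothetical names for illustration; nothing here is existing libpq code, and the real PGresult bookkeeping is more involved than a single flat buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reusable row buffer; not a libpq structure. */
typedef struct RowBuf
{
    char          *data;   /* storage reused from row to row */
    size_t         cap;    /* bytes currently allocated */
    size_t         len;    /* bytes used by the current row */
    unsigned long  grows;  /* how many times the buffer had to grow */
} RowBuf;

/*
 * Copy one row's bytes into the buffer, growing it only when needed.
 * Returns 0 on success, -1 if memory could not be obtained.
 */
int
rowbuf_store(RowBuf *rb, const char *src, size_t n)
{
    assert(rb != NULL);

    if (n > rb->cap)
    {
        size_t  newcap = rb->cap ? rb->cap : 64;
        char   *p;

        /* Double the capacity until the row fits, guarding overflow. */
        while (newcap < n)
        {
            if (newcap > SIZE_MAX / 2)
            {
                newcap = n;
                break;
            }
            newcap *= 2;
        }

        p = realloc(rb->data, newcap);
        if (p == NULL)
            return -1;
        rb->data = p;
        rb->cap = newcap;
        rb->grows++;
    }

    if (n > 0)
        memcpy(rb->data, src, n);
    rb->len = n;
    return 0;
}

/* Release the storage once the whole result set has been consumed. */
void
rowbuf_free(RowBuf *rb)
{
    assert(rb != NULL);
    free(rb->data);
    rb->data = NULL;
    rb->cap = 0;
    rb->len = 0;
}
```

With rows of similar size, the allocator is touched once for the first row and not again, which is the effect the recycling suggestion is after.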

A barebones callback mode ISTM is a complete departure from the
classic PGresult interface.  This code is pretty unpleasant IMO:
acct->abalance = *((int*)PQgetvalue(res, 0, i));
acct->abalance = __bswap_32(acct->abalance);

Your code is faster but foists a lot of the work on the user, so it's
kind of cheating in a way (although very carefully written
applications might be able to benefit).
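
For what it's worth, the decoding step quoted above can be written without the pointer cast or the glibc-specific __bswap_32. The sketch below assumes a binary-format int4 column, which arrives as four bytes in network byte order; decode_int4 is a hypothetical helper name, not a libpq function.

```c
#include <arpa/inet.h>   /* ntohl (POSIX) */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Decode a binary-format int4 value from a byte buffer.
 * Returns 0 on success, -1 if the buffer is missing or not 4 bytes long.
 */
int
decode_int4(const char *buf, int len, int32_t *out)
{
    uint32_t net;
    uint32_t host;

    assert(out != NULL);

    if (buf == NULL || len != (int) sizeof(net))
        return -1;

    /* memcpy avoids any assumption about the alignment of buf. */
    memcpy(&net, buf, sizeof(net));
    host = ntohl(net);

    /* Reinterpret the bits as signed without a lossy conversion. */
    memcpy(out, &host, sizeof(*out));
    return 0;
}
```

In the quoted snippet this would be called as decode_int4(PQgetvalue(res, 0, i), PQgetlength(res, 0, i), &acct->abalance), so the length check also catches a column that is not really an int4. It does not remove the burden on the caller, it only makes it a little harder to get wrong.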

merlin




Re: [HACKERS] libpq Alternate Row Processor

2017-02-13 Thread Kyle Gearhart
On 2/9/17 7:15 PM, Jim Nasby wrote:
> Can you run a trace to see where all the time is going in the single row 
> case? I don't see an obvious time-suck with a quick look through the code. 
> It'd be interesting to see how things change if you eliminate the filler 
> column from the SELECT.

Traces are attached, these are with callgrind.  

profile_nofiller.txt: single row without filler column
profile_filler.txt: single row with filler column
profile_filler_callback.txt: callback with filler column

pqResultAlloc looks to hit malloc pretty hard.  The callback reduces all of 
that to a single malloc for each row.
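
To make the "single malloc for each row" point concrete, here is a rough sketch of how a callback could pack every column of a row into one allocation. The PGdataValue definition is a local stand-in (two fields, len and value, as I understand the struct in libpq-int.h) so the fragment compiles on its own; PackedRow and pack_row are hypothetical names, and this is not the code from the test program.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-in for the libpq-internal struct (assumption). */
typedef struct pgDataValue
{
    int         len;    /* data length in bytes, or <0 if NULL */
    const char *value;  /* data value, not zero-terminated */
} PGdataValue;

/* Hypothetical row container: header, lengths and bytes in one block. */
typedef struct PackedRow
{
    int   ncols;
    int  *lens;   /* per-column lengths, inside the same allocation */
    char *data;   /* column bytes stored back to back */
} PackedRow;

/* Copy a row with exactly one malloc; the caller frees it with free(). */
PackedRow *
pack_row(const PGdataValue *cols, int ncols)
{
    size_t     payload = 0;
    size_t     header;
    PackedRow *row;
    char      *dst;
    int        i;

    assert(cols != NULL || ncols == 0);
    if (ncols < 0)
        return NULL;

    /* First pass: total size of all non-NULL column values. */
    for (i = 0; i < ncols; i++)
        if (cols[i].len > 0)
            payload += (size_t) cols[i].len;

    header = sizeof(PackedRow) + (size_t) ncols * sizeof(int);
    row = malloc(header + payload);
    if (row == NULL)
        return NULL;

    row->ncols = ncols;
    row->lens = (int *) (row + 1);
    row->data = (char *) (row->lens + ncols);

    /* Second pass: record lengths and copy the bytes. */
    dst = row->data;
    for (i = 0; i < ncols; i++)
    {
        row->lens[i] = cols[i].len;
        if (cols[i].len > 0)
        {
            memcpy(dst, cols[i].value, (size_t) cols[i].len);
            dst += cols[i].len;
        }
    }
    return row;
}
```

The two passes over the columns trade a little extra looping for not calling the allocator once per field.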

Without the filler, here is the average over 11 runs:
            Real    User    Sys
Callback    .133    .033    .035
Single Row  .170    .112    .029

For the callback case, user time is slightly higher than in the prior results 
with the filler column (.033 vs. .030).

Profile data file 'callgrind.out.14930' (creator: callgrind-3.11.0)

I1 cache: 
D1 cache: 
LL cache: 
Timerange: Basic block 0 - 74120972
Trigger: Program termination
Profiled target:  ./test -m row (PID 14930, part 1)
Events recorded:  Ir
Events shown: Ir
Event sort order: Ir
Thresholds:   99
Include dirs: 
User annotated:   
Auto-annotation:  off


 Ir 

313,455,690  PROGRAM TOTALS


Ir  file:function

61,410,828  ???:_int_malloc [/usr/lib64/libc-2.17.so]
38,321,887  ???:_int_free [/usr/lib64/libc-2.17.so]
25,800,115  ???:pqResultAlloc [/usr/local/pgsql/lib/libpq.so.5.10]
20,611,330  ???:pqParseInput3 [/usr/local/pgsql/lib/libpq.so.5.10]
16,002,817  ???:malloc [/usr/lib64/libc-2.17.so]
14,800,004  ???:pqRowProcessor [/usr/local/pgsql/lib/libpq.so.5.10]
12,604,893  ???:pqGetInt [/usr/local/pgsql/lib/libpq.so.5.10]
10,400,004  ???:PQsetResultAttrs [/usr/local/pgsql/lib/libpq.so.5.10]
10,200,316  main.c:main [/usr/local/src/postgresql-perf/test]
 9,600,000  ???:check_tuple_field_number [/usr/local/pgsql/lib/libpq.so.5.10]
 8,300,631  ???:__strcpy_sse2_unaligned [/usr/lib64/libc-2.17.so]
 7,500,075  ???:pqResultStrdup [/usr/local/pgsql/lib/libpq.so.5.10]
 7,500,000  ???:pqSkipnchar [/usr/local/pgsql/lib/libpq.so.5.10]
 7,017,368  ???:__memcpy_ssse3_back [/usr/lib64/libc-2.17.so]
 6,900,000  ???:PQgetisnull [/usr/local/pgsql/lib/libpq.so.5.10]
 6,401,100  ???:free [/usr/lib64/libc-2.17.so]
 6,200,004  ???:PQcopyResult [/usr/local/pgsql/lib/libpq.so.5.10]
 6,100,959  ???:__strlen_sse2_pminub [/usr/lib64/libc-2.17.so]
 5,700,000  ???:PQgetvalue [/usr/local/pgsql/lib/libpq.so.5.10]
 4,700,045  ???:PQclear [/usr/local/pgsql/lib/libpq.so.5.10]
 4,200,057  ???:PQmakeEmptyPGresult [/usr/local/pgsql/lib/libpq.so.5.10]
 4,103,903  ???:PQgetResult [/usr/local/pgsql/lib/libpq.so.5.10]
 3,400,000  ???:pqAddTuple [/usr/local/pgsql/lib/libpq.so.5.10]
 3,203,437  ???:pqGetc [/usr/local/pgsql/lib/libpq.so.5.10]
 2,600,034  ???:pqPrepareAsyncResult [/usr/local/pgsql/lib/libpq.so.5.10]
 2,500,679  ???:appendBinaryPQExpBuffer [/usr/local/pgsql/lib/libpq.so.5.10]
 2,300,621  ???:enlargePQExpBuffer [/usr/local/pgsql/lib/libpq.so.5.10]
 1,600,016  ???:appendPQExpBufferStr [/usr/local/pgsql/lib/libpq.so.5.10]
   900,270  ???:resetPQExpBuffer [/usr/local/pgsql/lib/libpq.so.5.10]


Profile data file 'callgrind.out.15062' (creator: callgrind-3.11.0)

I1 cache: 
D1 cache: 
LL cache: 
Timerange: Basic block 0 - 84068364
Trigger: Program termination
Profiled target:  ./test -m row (PID 15062, part 1)
Events recorded:  Ir
Events shown: Ir
Event sort order: Ir
Thresholds:   99
Include dirs: 
User annotated:   
Auto-annotation:  off


 Ir 

358,525,458  PROGRAM TOTALS


Ir  file:function

61,410,901  ???:_int_malloc [/usr/lib64/libc-2.17.so]
38,321,887  ???:_int_free [/usr/lib64/libc-2.17.so]
31,400,139  ???:pqResultAlloc [/usr/local/pgsql/lib/libpq.so.5.10]
22,839,505  ???:pqParseInput3 [/usr/local/pgsql/lib/libpq.so.5.10]
17,600,004  ???:pqRowProcessor [/usr/local/pgsql/lib/libpq.so.5.10]
16,002,817  ???:malloc [/usr/lib64/libc-2.17.so]
14,716,359  ???:pqGetInt 

Re: [HACKERS] libpq Alternate Row Processor

2017-02-09 Thread Jim Nasby

On 2/8/17 5:11 PM, Kyle Gearhart wrote:

Overall, wall clock improves 24%.  User time elapsed is a 430% improvement.  
About half the time is spent waiting on the IO with the callback.  With the 
regular pqRowProcessor only about 16% of the time is spent waiting on IO.


To wit...

            real    user    sys
single row  0.214   0.131   0.048
callback    0.161   0.030   0.051

Those are averaged over 11 runs.
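
The percentages can be reproduced from the three timing columns. The sketch below assumes that "waiting on IO" means wall-clock time not accounted for by user plus sys CPU time, which also lumps in scheduling delays; Timing, wait_fraction and reduction are hypothetical names used only for this arithmetic.

```c
#include <assert.h>

/* One row of the timing table, in seconds. */
typedef struct Timing
{
    double real;
    double user;
    double sys;
} Timing;

/* Share of wall-clock time not spent on the CPU (assumed to be waiting). */
double
wait_fraction(Timing t)
{
    assert(t.real > 0.0);
    return (t.real - t.user - t.sys) / t.real;
}

/* Relative reduction from 'before' to 'after'; 0.25 means 25% less. */
double
reduction(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before;
}
```

Plugging in the numbers: reduction(0.214, 0.161) is about 0.248, the quoted 24% wall-clock improvement; wait_fraction gives about 0.50 for the callback and about 0.16 for single row mode; and the user-time ratio 0.131 / 0.030 is about 4.4, which looks like the source of the 430% figure.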

Can you run a trace to see where all the time is going in the single row 
case? I don't see an obvious time-suck with a quick look through the 
code. It'd be interesting to see how things change if you eliminate the 
filler column from the SELECT.


Also, the backend should be buffering ~8kb of data before handing that 
to the socket. If that's more than the kernel can buffer I'd expect a 
serious performance hit.
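
One way to check the kernel side of that is to ask the socket for its receive buffer size. A small sketch using the standard getsockopt call; socket_rcvbuf_bytes is a hypothetical helper, and with libpq the descriptor would come from PQsocket(conn).

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Return the kernel receive buffer size for fd in bytes, or -1 on error. */
int
socket_rcvbuf_bytes(int fd)
{
    int       size = 0;
    socklen_t optlen = sizeof(size);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &optlen) != 0)
        return -1;

    /* SO_RCVBUF is reported as a plain int. */
    assert(optlen == sizeof(size));
    return size;
}
```

If the reported size is comfortably above 8kB, the backend's buffering should not be what stalls the client; how the platform rounds or scales the value varies, so treat it as a rough indication only.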





Re: [HACKERS] libpq Alternate Row Processor

2017-02-05 Thread Kyle Gearhart
From: Tom Lane [mailto:t...@sss.pgh.pa.us]:
> Kyle Gearhart writes:
>> The guts of pqRowProcessor in libpq does a good bit of work to maintain the 
>> internal data structure of a PGresult.  There are a few use cases where the 
>> caller doesn't need the ability to access the result set row by row, column 
>> by column using PQgetvalue.  Think of an ORM that is just going to copy the 
>> data from PGresult for each row into its own structures.

> It seems like you're sort of reinventing "single row mode":
> https://www.postgresql.org/docs/devel/static/libpq-single-row-mode.html

> Do we really need yet another way of breaking the unitary-query-result 
> abstraction?


If it's four times faster...then the option should be available in libpq.  I'm 
traveling tomorrow but will try to get a patch and proof with pgbench dataset 
up by the middle of the week.  

The performance gains are consistent with Jim Nasby's findings with SPI.

Kyle Gearhart




Re: [HACKERS] libpq Alternate Row Processor

2017-02-03 Thread Tom Lane
Kyle Gearhart writes:
> The guts of pqRowProcessor in libpq does a good bit of work to maintain the 
> internal data structure of a PGresult.  There are a few use cases where the 
> caller doesn't need the ability to access the result set row by row, column 
> by column using PQgetvalue.  Think of an ORM that is just going to copy the 
> data from PGresult for each row into its own structures.

It seems like you're sort of reinventing "single row mode":
https://www.postgresql.org/docs/devel/static/libpq-single-row-mode.html

Do we really need yet another way of breaking the unitary-query-result
abstraction?

regards, tom lane




Re: [HACKERS] libpq Alternate Row Processor

2017-02-03 Thread Jim Nasby

On 2/3/17 3:53 PM, Kyle Gearhart wrote:

> The guts of pqRowProcessor in libpq does a good bit of work to maintain the 
> internal data structure of a PGresult.  There are a few use cases where the 
> caller doesn't need the ability to access the result set row by row, column by 
> column using PQgetvalue.  Think of an ORM that is just going to copy the data 
> from PGresult for each row into its own structures.
>
> I've got a working proof of concept that allows the caller to attach a callback 
> that pqRowProcessor will call instead of going thru its own routine.  This 
> eliminates all the copying of data from the PGconn buffer to a PGresult buffer 
> and then ultimately a series of PQgetvalue calls by the client.  The callback 
> allows the caller to receive each row's data directly from the PGconn buffer.
>
> It would require exposing struct pgDataValue in libpq-fe.h.  The prototype for 
> the callback pointer would be:
> int (*PQrowProcessorCB)(PGresult*, const PGdataValue*, int col_count, void *user_data);
>
> My initial testing shows a significant performance improvement.  I'd like some 
> opinions on this before wiring up a performance proof and updating the 
> documentation for a formal patch submission.


I just did essentially the same thing for SPI (use a callback to allow 
the caller to handle the tuple instead of shoving it into a tuplestore). 
A simple test in plpython showed a 460% improvement.





[HACKERS] libpq Alternate Row Processor

2017-02-03 Thread Kyle Gearhart
The guts of pqRowProcessor in libpq does a good bit of work to maintain the 
internal data structure of a PGresult.  There are a few use cases where the 
caller doesn't need the ability to access the result set row by row, column by 
column using PQgetvalue.  Think of an ORM that is just going to copy the data 
from PGresult for each row into its own structures.

I've got a working proof of concept that allows the caller to attach a callback 
that pqRowProcessor will call instead of going thru its own routine.  This 
eliminates all the copying of data from the PGconn buffer to a PGresult buffer 
and then ultimately a series of PQgetvalue calls by the client.  The callback 
allows the caller to receive each row's data directly from the PGconn buffer.

It would require exposing struct pgDataValue in libpq-fe.h.  The prototype for 
the callback pointer would be:
int (*PQrowProcessorCB)(PGresult*, const PGdataValue*, int col_count, void *user_data);
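
For illustration, a callback matching that prototype might look like the sketch below. It assumes binary-format results with two int4 columns. PGresult and PGdataValue are declared locally as stand-ins so the fragment compiles without libpq headers (the two PGdataValue fields follow my reading of libpq-int.h). Account, read_be32, account_row_cb and the 1/0 return convention are hypothetical choices for the example, not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins for the libpq types (assumption, see above). */
typedef struct pg_result PGresult;   /* opaque to the caller */
typedef struct pgDataValue
{
    int         len;    /* data length in bytes, or <0 if NULL */
    const char *value;  /* data value, not zero-terminated */
} PGdataValue;

/* Hypothetical application struct that receives one row. */
typedef struct Account
{
    int32_t aid;
    int32_t abalance;
    int     abalance_is_null;
} Account;

/* Decode a 4-byte big-endian (network order) integer. */
static int32_t
read_be32(const char *p)
{
    const unsigned char *u = (const unsigned char *) p;
    uint32_t bits = ((uint32_t) u[0] << 24) | ((uint32_t) u[1] << 16) |
                    ((uint32_t) u[2] << 8) | (uint32_t) u[3];
    int32_t  out;

    memcpy(&out, &bits, sizeof(out));
    return out;
}

/*
 * Row callback: copy the columns straight into the caller's struct.
 * Returns 1 if the row was accepted, 0 if it did not have the expected
 * shape (assumed convention for this sketch).
 */
int
account_row_cb(PGresult *res, const PGdataValue *cols, int col_count,
               void *user_data)
{
    Account *acct = user_data;

    (void) res;                 /* not needed in this example */
    assert(cols != NULL);

    if (acct == NULL || col_count != 2 || cols[0].len != 4)
        return 0;
    acct->aid = read_be32(cols[0].value);

    if (cols[1].len < 0)
    {
        acct->abalance_is_null = 1;
        acct->abalance = 0;
    }
    else if (cols[1].len == 4)
    {
        acct->abalance_is_null = 0;
        acct->abalance = read_be32(cols[1].value);
    }
    else
        return 0;

    return 1;
}
```

The point of the design is visible in the body: the only copy is from the connection's buffer into the application's own structure, with no intermediate PGresult storage.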

My initial testing shows a significant performance improvement.  I'd like some 
opinions on this before wiring up a performance proof and updating the 
documentation for a formal patch submission.


Kyle Gearhart


