Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-12-02 Thread Bruce Momjian
On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
 Likely for most aggregates, like count, sum, max, min, bit_and and bit_or the
 merge function would be the same as the transition function, as the state type
 is just the same as the input type. It would only be aggregates like avg(),
 stddev*(), bool_and() and bool_or() that would need a new merge function
 made... These would be no more complex than the transition functions... Which
 are just a few lines of code anyway.
 
 We'd simply just not run parallel query if any aggregates used in the query
 didn't have a merge function.
 
 When I mentioned this, I didn't mean to appear to be placing a road block. I
 was just bringing to the table the information that COUNT(*) + COUNT(*) works
 ok for merging COUNT(*)'s sub totals, but AVG(n) + AVG(n) does not.

Sorry, late reply, but, FYI, I don't think our percentile functions
can be parallelized in the same way:

test=> \daS *percent*
                                                  List of aggregate functions
   Schema   |      Name       |  Result data type  |             Argument data types              |             Description
------------+-----------------+--------------------+----------------------------------------------+-------------------------------------
 pg_catalog | percent_rank    | double precision   | VARIADIC "any" ORDER BY VARIADIC "any"       | fractional rank of hypothetical row
 pg_catalog | percentile_cont | double precision   | double precision ORDER BY double precision   | continuous distribution percentile
 pg_catalog | percentile_cont | double precision[] | double precision[] ORDER BY double precision | multiple continuous percentiles
 pg_catalog | percentile_cont | interval           | double precision ORDER BY interval           | continuous distribution percentile
 pg_catalog | percentile_cont | interval[]         | double precision[] ORDER BY interval         | multiple continuous percentiles
 pg_catalog | percentile_disc | anyelement         | double precision ORDER BY anyelement         | discrete percentile
 pg_catalog | percentile_disc | anyarray           | double precision[] ORDER BY anyelement       | multiple discrete percentiles
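
To see why: an ordered-set aggregate's transition state has to keep every
input value, so per-worker results cannot be combined afterwards. A minimal
illustration:

    SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY x)
    FROM (VALUES (1), (2), (3), (4)) v(x);
    -- returns 2; the same aggregate over the halves {1,2} and {3,4}
    -- returns 1 and 3, and no merge of (1, 3) can reconstruct 2 without
    -- the underlying values.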

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-12-02 Thread Kouhei Kaigai
 On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
  Likely for most aggregates, like count, sum, max, min, bit_and and
  bit_or the merge function would be the same as the transition
  function, as the state type is just the same as the input type. It
  would only be aggregates like avg(), stddev*(), bool_and() and
  bool_or() that would need a new merge function made... These would be
  no more complex than the transition functions... Which are just a few
 lines of code anyway.
 
  We'd simply just not run parallel query if any aggregates used in the
  query didn't have a merge function.
 
  When I mentioned this, I didn't mean to appear to be placing a road
  block. I was just bringing to the table the information that COUNT(*) +
  COUNT(*) works ok for merging COUNT(*)'s sub totals, but AVG(n) + AVG(n)
  does not.
 
 Sorry, late reply, but, FYI, I don't think our percentile functions can
 be parallelized in the same way:
 
   test=> \daS *percent*
   [... \daS output listing percent_rank, percentile_cont and
    percentile_disc trimmed; see the table upthread ...]
 
Yep, that is the kind of aggregate function with no obvious way to split
the work across multiple partitions.
I think it is valuable even if we can only push down the subset of
aggregate functions that is well known to the core planner.
For example, we know count(*) = sum(nrows), and we also know avg(X) can
be rewritten as an enhanced avg() that takes both nrows and the partial
sum of X. We can use this knowledge to break down aggregate functions.
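
As a sketch (the table t and its blkno partitioning column are hypothetical
stand-ins), the avg() rewrite would look like:

    SELECT sum(nrows) AS count_all,
           sum(sum_x)::float8 / sum(nrows) AS avg_x
    FROM (SELECT count(*) AS nrows, sum(x) AS sum_x
            FROM t WHERE blkno BETWEEN 0 AND 1999
          UNION ALL
          SELECT count(*) AS nrows, sum(x) AS sum_x
            FROM t WHERE blkno BETWEEN 2000 AND 3999) AS partials;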

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com





Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-12-02 Thread Michael Paquier
On Wed, Dec 3, 2014 at 3:23 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
  Likely for most aggregates, like count, sum, max, min, bit_and and
  bit_or the merge function would be the same as the transition
  function, as the state type is just the same as the input type. It
  would only be aggregates like avg(), stddev*(), bool_and() and
  bool_or() that would need a new merge function made... These would be
  no more complex than the transition functions... Which are just a few
 lines of code anyway.
 
  We'd simply just not run parallel query if any aggregates used in the
  query didn't have a merge function.
 
  When I mentioned this, I didn't mean to appear to be placing a road
  block. I was just bringing to the table the information that COUNT(*) +
  COUNT(*) works ok for merging COUNT(*)'s sub totals, but AVG(n) + AVG(n)
  does not.

 Sorry, late reply, but, FYI, I don't think our percentile functions can
 be parallelized in the same way:

   test=> \daS *percent*
   [... \daS output listing percent_rank, percentile_cont and
    percentile_disc trimmed; see the table upthread ...]

 Yep, that is the kind of aggregate function with no obvious way to split
 the work across multiple partitions.
 I think it is valuable even if we can only push down the subset of
 aggregate functions that is well known to the core planner.
 For example, we know count(*) = sum(nrows), and we also know avg(X) can
 be rewritten as an enhanced avg() that takes both nrows and the partial
 sum of X. We can use this knowledge to break down aggregate functions.
Postgres-XC (and Postgres-XL) implemented such parallel aggregate
logic some time ago, using a set of sub-functions and a finalization
function to do the work.
My 2c.
-- 
Michael




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-17 Thread Robert Haas
On Thu, Nov 13, 2014 at 7:27 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 12 November 2014 00:54, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs si...@2ndquadrant.com wrote:
 * only functions marked as CONTAINS NO SQL
 We don't really know what proisparallel is, but we do know what
 CONTAINS NO SQL means and can easily check for it.
 Plus I already have a patch for this, slightly bitrotted.

 Interestingly, I have a fairly solid idea of what proisparallel is,
 but I have no clear idea what CONTAINS NO SQL is or why it's relevant.
 I would imagine that srandom() contains no SQL under any reasonable
 definition of what that means, but it ain't parallel-safe.

 What is wrong in generating random numbers in parallel?

If they're random, nothing, provided that you use a different seed in
each backend.  But if the seed has been set using srandom(), then we
must generate the particular sequence of random numbers associated
with that seed.  If we don't, the behavior is incompatible with what
happens outside of parallel mode.
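
In SQL terms, a small sketch of the hazard:

    SELECT setseed(0.5);
    SELECT random(), random();
    -- must yield the one sequence determined by the seed; workers each
    -- drawing from their own generator state could produce a stream that
    -- cannot occur in non-parallel execution.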

 But I'm sure many volatile functions would be annoying to support, so
 CONTAINS NO SQL and STABLE/IMMUTABLE seems OK for the first thing.

It might be OK to assume that immutable functions are fine, but not
all stable functions look like they'll be safe, or not without
additional effort: e.g. inet_client_addr(), pg_backend_pid(),
pg_cursor(), pg_event_trigger_dropped_objects().  Or even, more
mundanely, generate_series(), unless we ensure all calls are in the
same backend.  On the other hand, there are quite a few stable
functions that it seems like we would want to allow in parallel
queries, like to_char(), concat(), and textanycat().

I still don't know what CONTAINS NO SQL means.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-17 Thread Simon Riggs
On 14 November 2014 11:02, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

 I'd like to throw a question to the community folks.
 Has there been any discussion in the past of pushing aggregates down across
 relation joins? It potentially reduces the number of rows to be joined.
 If there has, I'd like to check up on that discussion.

Yes, I was looking at aggregate pushdown. I think it needs the same
changes to aggregates discussed upthread.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-17 Thread Simon Riggs
On 17 November 2014 16:01, Robert Haas robertmh...@gmail.com wrote:

 I still don't know what CONTAINS NO SQL means.

It's a function marking that would indicate we aren't allowed to take
snapshots or run SQL.

I think you should publish the scheme for marking functions as safe
for parallelism, so we can judge that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-17 Thread David Rowley
On 18 November 2014 05:19, Simon Riggs si...@2ndquadrant.com wrote:

 On 14 November 2014 11:02, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

  I'd like to throw a question to the community folks.
  Has there been any discussion in the past of pushing aggregates down
  across relation joins? It potentially reduces the number of rows to be
  joined.
  If there has, I'd like to check up on that discussion.

 Yes, I was looking at aggregate pushdown. I think it needs the same
 changes to aggregates discussed upthread.


I have something that I've been working on locally. It's far from ready,
but it does work in very simple cases, and shows a nice performance boost.
I'll start another thread soon and copy you both in. Perhaps we can share
some ideas.

Regards

David Rowley


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-17 Thread Kouhei Kaigai
 On 18 November 2014 05:19, Simon Riggs si...@2ndquadrant.com wrote:
 
 
   On 14 November 2014 11:02, Kouhei Kaigai kai...@ak.jp.nec.com
 wrote:
 
    I'd like to throw a question to the community folks.
    Has there been any discussion in the past of pushing aggregates down
    across relation joins? It potentially reduces the number of rows to
    be joined.
    If there has, I'd like to check up on that discussion.
 
   Yes, I was looking at aggregate pushdown. I think it needs the same
   changes to aggregates discussed upthread.
 
 
 
 
 I have something that I've been working on locally. It's far from ready,
 but it does work in very simple cases, and shows a nice performance boost.
 I'll start another thread soon and copy you both in. Perhaps we can share
 some ideas.
 
Great, that's exactly the kind of functionality that would be valuable to
have in core.

--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com



Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-14 Thread David Rowley
On 14 November 2014 20:37, Jim Nasby jim.na...@bluetreble.com wrote:

 On 11/12/14, 1:54 AM, David Rowley wrote:


 We'd also need to add some infrastructure to merge aggregate states
 together for this to work properly. This means that could also work for
 avg() and stddev etc. For max() and min() the merge functions would likely
 just be the same as the transition functions.


 Sanity check: what % of a large aggregate query fed by a seqscan is
 actually spent in the aggregate functions? Even if you look strictly at
 CPU cost, isn't there more code involved in getting data to the aggregate
 function than in the aggregation itself, except maybe for numeric?


You might be right, but that sounds like it would need all the parallel
workers to send each matching tuple to a queue to be processed by some
aggregate node. I guess this would be more of an advantage for wider tables,
tables with many dead tuples, or queries with quite a selective WHERE
clause, as less data would make it onto that queue.

Perhaps I've taken one step too far forward here. I had been thinking that
each worker would perform the partial seqscan and, in the worker context,
pass the tuple down to the aggregate node. Then later, once each worker had
completed, some other (perhaps new) node type (MergeAggregateStates) would
merge all those intermediate agg states into the final agg state (which
would then be ready for the final function to be called).

Are there any plans for what will be in charge of deciding how many workers
would be allocated to a parallel query? Will this be something that's done
at planning time? Or should the planner just create a parallel-friendly
plan, if the plan is costly enough, and then allow the executor to decide
how many workers to throw at the job based on how busy the system is with
other tasks at execution time?

Regards

David Rowley


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-14 Thread Simon Riggs
On 14 November 2014 07:37, Jim Nasby jim.na...@bluetreble.com wrote:
 On 11/12/14, 1:54 AM, David Rowley wrote:

 On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com wrote:


 This plan type is widely used in reporting queries, so will hit the
 mainline of BI applications and many Mat View creations.
 This will allow SELECT count(*) FROM foo to go faster also.

 We'd also need to add some infrastructure to merge aggregate states
 together for this to work properly. This means that could also work for
 avg() and stddev etc. For max() and min() the merge functions would likely
 just be the same as the transition functions.


 Sanity check: what % of a large aggregate query fed by a seqscan is
 actually spent in the aggregate functions? Even if you look strictly at
 CPU cost, isn't there more code involved in getting data to the aggregate
 function than in the aggregation itself, except maybe for numeric?

Yes, which is why I suggested pre-aggregating before collecting the
streams together.

The point is not that the aggregation is expensive; it's that the
aggregation eats data, so the required bandwidth for later steps is
reduced and hence does not then become a bottleneck that renders the
parallel Seq Scan ineffective.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-14 Thread Simon Riggs
On 14 November 2014 01:51, David Rowley dgrowle...@gmail.com wrote:

 When I mentioned this, I didn't mean to appear to be placing a road block. I
 was just bringing to the table the information that COUNT(*) + COUNT(*)
 works ok for merging COUNT(*)'s sub totals, but AVG(n) + AVG(n) does not.

  Merge functions should be a simple patch to write. If I thought there was
 going to be a use for them in this release, I'd put my hand up to put a
 patch together.

The hard part is describing and then agreeing the semantics.

If you raise a separate post on this, copy me in and I will help.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-14 Thread Kouhei Kaigai
 On 14 November 2014 07:37, Jim Nasby jim.na...@bluetreble.com wrote:
  On 11/12/14, 1:54 AM, David Rowley wrote:
 
  On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com wrote:
 
 
  This plan type is widely used in reporting queries, so will hit the
  mainline of BI applications and many Mat View creations.
  This will allow SELECT count(*) FROM foo to go faster also.
 
  We'd also need to add some infrastructure to merge aggregate states
  together for this to work properly. This means that could also work
  for
  avg() and stddev etc. For max() and min() the merge functions would
  likely just be the same as the transition functions.
 
 
  Sanity check: what % of a large aggregate query fed by a seqscan
  is actually spent in the aggregate functions? Even if you look strictly
  at CPU cost, isn't there more code involved to get data to the
  aggregate function than in the aggregation itself, except maybe for
 numeric?
 
 Yes, which is why I suggested pre-aggregating before collecting the streams
 together.
 
 The point is not that the aggregation is expensive, its that the aggregation
 eats data and the required bandwidth for later steps is reduced and hence
 does not then become a bottleneck that renders the parallel Seq Scan
 ineffective.
 
I'd like to throw a question to the community folks.
Has there been any discussion in the past of pushing aggregates down across
relation joins? It potentially reduces the number of rows to be joined.
If there has, I'd like to check up on that discussion.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com



Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-14 Thread David Rowley
On Fri, Nov 14, 2014 at 1:27 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 12 November 2014 00:54, Robert Haas robertmh...@gmail.com wrote:
  Interestingly, I have a fairly solid idea of what proisparallel is,
  but I have no clear idea what CONTAINS NO SQL is or why it's relevant.
  I would imagine that srandom() contains no SQL under any reasonable
  definition of what that means, but it ain't parallel-safe.

 What is wrong in generating random numbers in parallel?


I was just watching Robert's talk on parallel query on YouTube... I think
the answer is at 41:09; the link below should take you there:
https://www.youtube.com/watch?v=3klfarKEtMQ#t=2469

Regards

David Rowley


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-13 Thread Simon Riggs
On 12 November 2014 00:54, Robert Haas robertmh...@gmail.com wrote:

 I'm going to aim for the simpler:

 Hash Aggregate
 -> Parallel Seq Scan
      Workers: 4

 Yeah, I know that won't perform as well as what you're proposing, but
 I'm fairly sure it's simpler.

Simple is best, so +1.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-13 Thread Simon Riggs
On 12 November 2014 07:54, David Rowley dgrowle...@gmail.com wrote:
 On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com wrote:


 This plan type is widely used in reporting queries, so will hit the
 mainline of BI applications and many Mat View creations.
 This will allow SELECT count(*) FROM foo to go faster also.


 We'd also need to add some infrastructure to merge aggregate states together
 for this to work properly. This means that could also work for avg() and
 stddev etc. For max() and min() the merge functions would likely just be the
 same as the transition functions.

Do you mean something like a subtotal or intermediate combination function?

I guess we'd need the same thing to make intermediate aggregates work
during a sort?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-13 Thread Simon Riggs
On 12 November 2014 00:54, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs si...@2ndquadrant.com wrote:
 * only functions marked as CONTAINS NO SQL
 We don't really know what proisparallel is, but we do know what
 CONTAINS NO SQL means and can easily check for it.
 Plus I already have a patch for this, slightly bitrotted.

 Interestingly, I have a fairly solid idea of what proisparallel is,
 but I have no clear idea what CONTAINS NO SQL is or why it's relevant.
 I would imagine that srandom() contains no SQL under any reasonable
 definition of what that means, but it ain't parallel-safe.

What is wrong in generating random numbers in parallel?

But I'm sure many volatile functions would be annoying to support, so
CONTAINS NO SQL and STABLE/IMMUTABLE seems OK for the first thing.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-13 Thread David Rowley
On Fri, Nov 14, 2014 at 1:19 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 12 November 2014 07:54, David Rowley dgrowle...@gmail.com wrote:
  On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com
 wrote:
 
 
  This plan type is widely used in reporting queries, so will hit the
  mainline of BI applications and many Mat View creations.
  This will allow SELECT count(*) FROM foo to go faster also.
 
 
  We'd also need to add some infrastructure to merge aggregate states
 together
  for this to work properly. This means that could also work for avg() and
  stddev etc. For max() and min() the merge functions would likely just be
 the
  same as the transition functions.

 Do you mean something like a subtotal or intermediate combination
 functon?


If you had 4 parallel workers performing a seqscan, say the relation was
4000 pages in total, you could say that worker 1 would scan blocks 0-999,
worker 2 blocks 1000-1999, etc. After each worker had finished, there would
then be 4 sets of records that needed to be merged into 1 set.

Take int4_avg_accum() for example; it does:

transdata->count++;
transdata->sum += newval;

The merge function would need to perform something like:

transdata->count += transdata2merge.count;
transdata->sum += transdata2merge.sum;

Then the final function could be called on the merged aggregate state.

The same can be applied when the query contains a GROUP BY clause; we'd
just need to pay attention to which groups we merge together for that to
work. Any HAVING clause would have to be applied after the groups have been
merged.
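
As a sketch in SQL (worker_results is a hypothetical relation holding each
worker's per-group partial states):

    SELECT cat,
           sum(part_count) AS count_cat,
           sum(part_sum)   AS sum_cat
      FROM worker_results
     GROUP BY cat
    HAVING sum(part_count) > 10;   -- HAVING evaluated only after the merge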

This whole topic is pretty exciting for data warehouse type workloads.

Regards

David Rowley


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-13 Thread Kouhei Kaigai
   On 12 November 2014 07:54, David Rowley dgrowle...@gmail.com
 wrote:
On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs
 si...@2ndquadrant.com wrote:
   
   
This plan type is widely used in reporting queries, so will hit
 the
mainline of BI applications and many Mat View creations.
This will allow SELECT count(*) FROM foo to go faster also.
   
   
We'd also need to add some infrastructure to merge aggregate
 states together
for this to work properly. This means that could also work for
 avg() and
stddev etc. For max() and min() the merge functions would likely
 just be the
same as the transition functions.
 
 
   Do you mean something like a subtotal or intermediate
 combination functon?
 
 
 
 
 If you had 4 parallel workers performing a seqscan, say the relation was
 4000 pages in total, you could say that worker 1 would scan blocks 0-999,
 worker 2, 1000-1999 etc. After each worker had finished, there would then
 be 4 sets of records then needed to be merged into 1 set.
 
 Take int4_avg_accum() for example; it does:
 
 transdata->count++;
 transdata->sum += newval;
 
 The merge function would need to perform something like:
 
 transdata->count += transdata2merge.count;
 transdata->sum += transdata2merge.sum;
 
 Then the final function could be called on the merged aggregate state.
 
 The same can be applied when the query contains a GROUP BY clause, just
 we'd need pay attention to which groups we merge together for that to work
 Any HAVING clause would have to be applied after the groups have been merged.
 
 This whole topic is pretty exciting for data warehouse type workloads.
 
More simply, we could define a parallel-aware aggregate function.
Assume an AVG(X) variant that takes nrows and the sum of X. Its transition
function performs the merge as you described above, and its final function
works as usual.

The jobs the parallel seq scan needs to do are:
  1. replace AVG(X) by AVG(nrows, sum(X))
  2. generate count(*) on the partial relation being grouped.
  3. generate sum(X) on the partial relation being grouped.

It looks like the following query:
  SELECT AVG(nrows, sum_X) FROM (
    SELECT count(*) nrows, sum(X) sum_X FROM tbl
      WHERE blkno BETWEEN 0 AND 999 GROUP BY cat
    UNION ALL   -- plain UNION would discard duplicate partial rows
    SELECT count(*) nrows, sum(X) sum_X FROM tbl
      WHERE blkno BETWEEN 1000 AND 1999 GROUP BY cat
    UNION ALL
    SELECT count(*) nrows, sum(X) sum_X FROM tbl
      WHERE blkno BETWEEN 2000 AND 2999 GROUP BY cat
    UNION ALL
    SELECT count(*) nrows, sum(X) sum_X FROM tbl
      WHERE blkno BETWEEN 3000 AND 3999 GROUP BY cat
  ) partials;

Note that the stuff inside the sub-queries is the job of the parallel scan.

Do we need to invent new infrastructure here?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-13 Thread Kouhei Kaigai
 On Fri, Nov 14, 2014 at 2:12 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 
 
  On 12 November 2014 07:54, David Rowley
 dgrowle...@gmail.com
    Take int4_avg_accum() for example; it does:

    transdata->count++;
    transdata->sum += newval;

    The merge function would need to perform something like:

    transdata->count += transdata2merge.count;
    transdata->sum += transdata2merge.sum;

    Then the final function could be called on the merged aggregate
    state.
   
   More simplify, we can describe parallel aware aggregate function.
   Please assume AVG(X) function that takes nrows and sum of X. Its
 transition
   function performs as like you described above, then final function
 works as
   usual.
 
    The jobs the parallel seq scan needs to do are:
      1. replace AVG(X) by AVG(nrows, sum(X))
      2. generate count(*) on the partial relation being grouped.
      3. generate sum(X) on the partial relation being grouped.
 
    It looks like the following query:
      [example query trimmed; see the full version upthread]
 
 
 
 
 Well, this would require giving the planner some kind of knowledge of what
 AVG() is. It currently knows nothing about that. It currently calls the
 transition function for each row, and the final function at the end and
 does not care or need to care about what either of those functions actually
 does. The transformation above looks like it would need to care and the
 logic to add that would be way more complex than aggregate merge functions.
 
It may make sense to have an extra catalog to indicate how a usual aggregate
function can be broken down (or that it cannot be).
Like: AVG(X) = {COUNT(X) + SUM(X)} x N-partitions
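
A very rough sketch of what such a catalog entry could carry (the table
name, columns and contents are all hypothetical):

    CREATE TABLE pg_aggregate_breakdown (
        aggfnoid regprocedure,  -- the user-visible aggregate
        partials text[],        -- per-partition aggregates to compute
        combine  text           -- how to combine the partial columns,
                                -- here named partial_count / partial_sum
    );
    INSERT INTO pg_aggregate_breakdown
    VALUES ('avg(int4)',
            ARRAY['count(X)', 'sum(X)'],
            'sum(partial_sum)::float8 / sum(partial_count)');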

 Likely for most aggregates, like count, sum, max, min, bit_and and bit_or
 the merge function would be the same as the transition function, as the
 state type is just the same as the input type. It would only be aggregates
 like avg(), stddev*(), bool_and() and bool_or() that would need a new merge
 function made... These would be no more complex than the transition
 functions... Which are just a few lines of code anyway.
 
 We'd simply just not run parallel query if any aggregates used in the query
 didn't have a merge function.
 
 When I mentioned this, I didn't mean to appear to be placing a road block.
 I was just bringing to the table the information that COUNT(*) + COUNT(*)
 works ok for merging COUNT(*)'s sub totals, but AVG(n) + AVG(n) does not.
 
  Merge functions should be a simple patch to write. If I thought there was
 going to be a use for them in this release, I'd put my hand up to put a
 patch together.
 
What I'm uncertain about is how the caller of the aggregate function
distinguishes the context, to call either the usual transition function or
the new merge function.

Let's back an example of:
  SELECT cat, COUNT(*), AVG(X) FROM t GROUP BY cat;

Its plan tree is probably as follows:
  HashAggregate
    Group Key: cat
    -> Custom Scan (Parallel Scan)
         # of workers: 4

The caller of the transition/merge/final functions is the HashAggregate
node. It has to know whether the underlying plan returns every record of
the underlying table, or sub-totals from the parallel scan.
Please correct me if my assumption is wrong.

Once HashAggregate knows that the sub-plan returns sub-totals of the
relation, it can choose the merge function instead of the transition
function. However, what I want to clarify is how to inform the
HashAggregate node that its sub-plan intends to return sub-totals instead
of individual rows.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com



Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-13 Thread Jim Nasby

On 11/12/14, 1:54 AM, David Rowley wrote:

On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com wrote:


This plan type is widely used in reporting queries, so will hit the
mainline of BI applications and many Mat View creations.
This will allow SELECT count(*) FROM foo to go faster also.

We'd also need to add some infrastructure to merge aggregate states together 
for this to work properly. This means that could also work for avg() and stddev 
etc. For max() and min() the merge functions would likely just be the same as 
the transition functions.


Sanity check: what % of a large aggregate query fed by a seqscan is actually
spent in the aggregate functions? Even if you look strictly at CPU cost, isn't
there more code involved in getting data to the aggregate function than in the
aggregation itself, except maybe for numeric?

In other words, I suspect that just having a dirt-simple parallel SeqScan could 
be a win for CPU. It should certainly be a win IO-wise; in my experience we're 
not very good at maxing out IO systems.

(I was curious and came up with the list below for just the page-level stuff 
(ignoring IO). I don't see much code involved in per-tuple work, but I also 
never came across detoasting code, so I suspect I'm missing something...)

ExecScanFetch, heapgettup_pagemode, ReadBuffer, BufferAlloc, 
heap_page_prune_opt, LWLockAcquire... then you can finally do per-tuple work. 
HeapTupleSatisfiesVisibility.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-12 Thread Atri Sharma
On Wed, Nov 12, 2014 at 1:24 PM, David Rowley dgrowle...@gmail.com wrote:


 On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com
 wrote:


 This plan type is widely used in reporting queries, so will hit the
 mainline of BI applications and many Mat View creations.
 This will allow SELECT count(*) FROM foo to go faster also.


 We'd also need to add some infrastructure to merge aggregate states
 together for this to work properly. This means that could also work for
 avg() and stddev etc. For max() and min() the merge functions would likely
 just be the same as the transition functions.


It might make sense to make a new planner operator which can be responsible
for pulling from each of the individual parallel Agg nodes and then
aggregating over the results.


A couple of things that might be worth considering are whether we want to
enforce parallel aggregation or let the planner decide whether it wants to
do a parallel aggregate or go with a single plan. For example, the average
estimated size of groups might be one thing the planner considers while
deciding between a parallel and a single execution plan.

I don't see merging states as an easy problem; it should perhaps be tackled
apart from this thread.

Also, do we want to allow parallelism only with GroupAggs?

Regards,

Atri


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-12 Thread Robert Haas
On Tue, Nov 11, 2014 at 7:48 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient?

There are certainly things that are parallel-safe that are not
immutable.  It might be the case that everything immutable is
parallel-safe.
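
The gap is visible in the catalog itself; for example (a quick sketch,
checking volatility markings only):

    SELECT DISTINCT proname, provolatile
    FROM pg_proc
    WHERE proname IN ('to_char', 'inet_client_addr', 'random');
    -- to_char and inet_client_addr are both marked 's' (stable), yet only
    -- the former looks parallel-safe; random() is 'v' (volatile).  So
    -- provolatile alone cannot draw the line.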

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-12 Thread Michael Paquier
On Wed, Nov 12, 2014 at 9:49 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 11, 2014 at 7:48 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient?

 There are certainly things that are parallel-safe that are not
 immutable.  It might be the case that everything immutable is
 parallel-safe.
FWIW, when working on the concept of expression and clause
shippability for Postgres-XC (aka the possibility to pass an expression
safely to another backend, in that case one on another PG node), we
discussed similar things, and if I recall correctly we even discussed
adding a flag to pg_proc to define whether a function was shippable or
not. What we finished with was not adding a new flag, and using as a
rule that all immutable functions can be safely shipped and others
cannot, even though some stable functions *could* be safe. Maybe
Ashutosh has more comments on that; my memory may be failing.
In the end, I think that you would finish with something similar.
My 2c.
-- 
Michael




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-11 Thread Simon Riggs
On 10 November 2014 15:57, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Something usable, with severe restrictions, is actually better than we
 have now. I understand the journey this work represents, so don't be
 embarrassed by submitting things with heuristics and good-enoughs in
 it. Our mentor, Mr.Lane, achieved much by spreading work over many
 releases, leaving others to join in the task.

 It occurs to me that, now that the custom-scan stuff is committed, it
 wouldn't be that hard to use that, plus the other infrastructure we
 already have, to write a prototype of parallel sequential scan.

+1

 Given
 where we are with the infrastructure, there would be a number of
 unhandled problems, such as deadlock detection (needs group locking or
 similar), assessment of quals as to parallel-safety (needs
 proisparallel or similar), general waterproofing to make sure that
 pushing down a qual we shouldn't doesn't do anything really dastardly
 like crash the server (another written but yet-to-be-published patch
 adds a bunch of relevant guards), and snapshot sharing (likewise).
 But if you don't do anything weird, it should basically work.

If we build something fairly restricted but production-ready, we can
still do something useful for users.

* only functions marked as CONTAINS NO SQL
* parallel_workers = 2 or fixed
* only one Plan type, designed to maximise benefit and minimize optimizer change

Why?

* The above is all that is needed to test any infrastructure changes
we accept. We need both infrastructure and tests. We could do this as
a custom scan plugin, but why not make it work in core - that doesn't
prevent a plugin version also if you desire it.

* only functions marked as CONTAINS NO SQL
We don't really know what proisparallel is, but we do know what
CONTAINS NO SQL means and can easily check for it.
Plus I already have a patch for this, slightly bitrotted.

* parallel_workers = 2 (or at least not make it user settable)
By fixing the number of workers at 2 we avoid any problems caused by
having N variable, such as how to vary N fairly amongst users and
other such considerations. We get the main benefit of parallelism,
without causing other issues across the server.

* Fixed Plan: aggregate-scan
To make everything simpler, allow only plans of a single type.
 SELECT something, list of aggregates
 FROM foo
 WHERE filters
 GROUP BY something
because we know that passing large amounts of data from worker to
master process will be slow, focusing only on seq scan is not
sensible; we should focus on plans that significantly reduce the
number of rows passed upwards. We could just do this for very
selective WHERE clauses, but that is not an important class of query.
As soon as we include aggregates, we reduce data passing significantly
AND we hit a very important subset of queries:

This plan type is widely used in reporting queries, so will hit the
mainline of BI applications and many Mat View creations.
This will allow SELECT count(*) FROM foo to go faster also.

The execution plan for that query type looks like this...
 Hash Aggregate
   -> Gather From Workers
        {Worker Nodes, workers = 2
           -> HashAggregate
                -> PartialScan}

Which is simple enough that we include a mechanism for the Gather
operation, plus it is simple enough to not need extensive optimizer
changes - just changes to set_plain_rel_pathlist() to consider
parallel plans.

Plan costs are easy to calculate for the above...
 cpu_worker_startup_cost = 100 -- startup cost is easy to calculate by
observation, and a reasonably large default will be OK
 cpu_ipc_tuple_cost = 0.1 -- assume x10 the normal cpu_tuple_cost
Partial scan costs are just the same as SeqScan, only with fewer blocks.
All other costs are the same.

We can submit the main patch by Dec 15, fix all the problems by Feb 15.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-11 Thread Kouhei Kaigai
  Given
  where we are with the infrastructure, there would be a number of
  unhandled problems, such as deadlock detection (needs group locking or
  similar), assessment of quals as to parallel-safety (needs
  proisparallel or similar), general waterproofing to make sure that
  pushing down a qual we shouldn't doesn't do anything really dastardly
  like crash the server (another written but yet-to-be-published patch
  adds a bunch of relevant guards), and snapshot sharing (likewise).
  But if you don't do anything weird, it should basically work.
 
 If we build something fairly restricted but production-ready, we can still
 do something useful for users.
 
 * only functions marked as CONTAINS NO SQL
 * parallel_workers = 2 or fixed
 * only one Plan type, designed to maximise benefit and minimize optimizer
 change
 
 Why?
 
 * The above is all that is needed to test any infrastructure changes we
 accept. We need both infrastructure and tests. We could do this as a custom
 scan plugin, but why not make it work in core - that doesn't prevent a plugin
 version also if you desire it.
 
 * only functions marked as CONTAINS NO SQL
 We don't really know what proisparallel is, but we do know what CONTAINS
 NO SQL means and can easily check for it.
 Plus I already have a patch for this, slightly bitrotted.
 
Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient?

 * parallel_workers = 2 (or at least not make it user settable) By fixing
 the number of workers at 2 we avoid any problems caused by having N variable,
 such as how to vary N fairly amongst users and other such considerations.
 We get the main benefit of parallelism, without causing other issues across
 the server.
 
 * Fixed Plan: aggregate-scan
 To make everything simpler, allow only plans of a single type.
  SELECT something, list of aggregates
  FROM foo
  WHERE filters
  GROUP BY something
 because we know that passing large amounts of data from worker to master
 process will be slow, so focusing only on seq scan is not sensible; we should
 focus on plans that significantly reduce the number of rows passed upwards.
 We could just do this for very selective WHERE clauses, but that is not
 an important class of query.
 As soon as include aggregates, we reduce data passing significantly AND
 we hit a very important subset of queries:
 
 This plan type is widely used in reporting queries, so will hit the mainline
 of BI applications and many Mat View creations.
 This will allow SELECT count(*) FROM foo to go faster also.
 
I agree with you. The key to performance is how many rows can be reduced
by the parallel processors; however, one thing I suspect is that the
condition to kick off parallel execution may be too restrictive.
I'm not sure how many workloads run aggregate functions against flat
tables rather than multiple joined relations.

Sorry, this might be a topic to discuss in a separate thread.
Couldn't we have functionality to push down aggregate functions across
table joins? Even if we could do this only in some cases, not all, it
would make sense for running typical BI workloads in parallel.

Let me assume the following query:

  SELECT SUM(t1.X), SUM(t2.Y) FROM t1 JOIN t2 ON t1.pk_id = t2.fk_id
  GROUP BY t1.A, t2.B;

It is a usual query that joins two relations on a PK and an FK.

It can be broken down like this (note the join condition now references
the sub-query):

  SELECT SUM(t1.X), SUM(sq2.Y)
  FROM t1 JOIN (SELECT t2.fk_id, t2.B, SUM(t2.Y) Y
                  FROM t2 GROUP BY t2.fk_id, t2.B) sq2
         ON t1.pk_id = sq2.fk_id
  GROUP BY t1.A, sq2.B;

If the FK side has 100 tuples per PK on average, an aggregate that runs
prior to the join massively reduces the number of tuples to be joined, and
we can leverage parallel processors for the preliminary aggregation on the
flat table.

I'm now reviewing earlier academic work to identify the conditions under
which a join allows aggregate functions to be pushed down:
  http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.7751

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-11 Thread Robert Haas
On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs si...@2ndquadrant.com wrote:
 * only functions marked as CONTAINS NO SQL
 We don't really know what proisparallel is, but we do know what
 CONTAINS NO SQL means and can easily check for it.
 Plus I already have a patch for this, slightly bitrotted.

Interestingly, I have a fairly solid idea of what proisparallel is,
but I have no clear idea what CONTAINS NO SQL is or why it's relevant.
I would imagine that srandom() contains no SQL under any reasonable
definition of what that means, but it ain't parallel-safe.

 * parallel_workers = 2 (or at least not make it user settable)
 By fixing the number of workers at 2 we avoid any problems caused by
 having N variable, such as how to vary N fairly amongst users and
 other such considerations. We get the main benefit of parallelism,
 without causing other issues across the server.

I think this is a fairly pointless restriction.  The code
simplification we'll get out of it appears to me to be quite minor,
and we'll just end up putting the stuff back in anyway.

 * Fixed Plan: aggregate-scan
 To make everything simpler, allow only plans of a single type.
  SELECT something, list of aggregates
  FROM foo
  WHERE filters
  GROUP BY something
 because we know that passing large amounts of data from worker to
 master process will be slow, so focusing only on seq scan is not
 sensible; we should focus on plans that significantly reduce the
 number of rows passed upwards. We could just do this for very
 selective WHERE clauses, but that is not an important class of query.
 As soon as include aggregates, we reduce data passing significantly
 AND we hit a very important subset of queries:

This is moving the goalposts in a way that I'm not at all comfortable
with.  Parallel sequential-scan is pretty simple and may well be a win
if there's a restrictive filter condition involved.  Parallel
aggregation requires introducing new infrastructure into the aggregate
machinery to allow intermediate state values to be combined, and that
would be a great project for someone to do at some time, but it seems
like a distraction for me to do that right now.

 This plan type is widely used in reporting queries, so will hit the
 mainline of BI applications and many Mat View creations.
 This will allow SELECT count(*) FROM foo to go faster also.

 The execution plan for that query type looks like this...
  Hash Aggregate
    -> Gather From Workers
         {Worker Nodes, workers = 2
            -> HashAggregate
                 -> PartialScan}

I'm going to aim for the simpler:

Hash Aggregate
-> Parallel Seq Scan
     Workers: 4

Yeah, I know that won't perform as well as what you're proposing, but
I'm fairly sure it's simpler.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-11 Thread David Rowley
On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com wrote:


 This plan type is widely used in reporting queries, so will hit the
 mainline of BI applications and many Mat View creations.
 This will allow SELECT count(*) FROM foo to go faster also.


We'd also need to add some infrastructure to merge aggregate states
together for this to work properly. This means that could also work for
avg() and stddev etc. For max() and min() the merge functions would likely
just be the same as the transition functions.

Regards

David Rowley


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-10 Thread Andres Freund
Hi Robert, All,

On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
 On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs si...@2ndquadrant.com wrote:
  Something usable, with severe restrictions, is actually better than we
  have now. I understand the journey this work represents, so don't be
  embarrassed by submitting things with heuristics and good-enoughs in
  it. Our mentor, Mr.Lane, achieved much by spreading work over many
  releases, leaving others to join in the task.
 
 It occurs to me that, now that the custom-scan stuff is committed, it
 wouldn't be that hard to use that, plus the other infrastructure we
 already have, to write a prototype of parallel sequential scan.  Given
 where we are with the infrastructure, there would be a number of
 unhandled problems, such as deadlock detection (needs group locking or
 similar), assessment of quals as to parallel-safety (needs
 proisparallel or similar), general waterproofing to make sure that
  pushing down a qual we shouldn't doesn't do anything really dastardly
 like crash the server (another written but yet-to-be-published patch
 adds a bunch of relevant guards), and snapshot sharing (likewise).
 But if you don't do anything weird, it should basically work.
 
 I think this would be useful for a couple of reasons.  First, it would
 be a demonstrable show of progress, illustrating how close we are to
 actually having something you can really deploy.  Second, we could use
 it to demonstrate how the remaining infrastructure patches close up
 gaps in the initial prototype.  Third, it would let us start doing
 real performance testing.

I think it might be a useful experiment - as long as it's clear that
it's that. Which is, I think, what you're thinking about?

 It seems pretty clear that a parallel sequential scan of data that's
 in memory (whether the page cache or the OS cache) can be accelerated
 by having multiple processes scan it in parallel.

Right.

 But it's much less clear what will happen when the data is being read
 in from disk.

I think that *very* heavily depends on the IO subsystem.

 Does parallelism help at all?

I'm pretty damn sure. We can't even make mildly powerful storage
fully busy right now. Heck, I can't make my workstation's storage with a
raid 10 out of four spinning disks fully busy.

I think some of that benefit also could be reaped by being better at
hinting the OS...

 What degree of parallelism helps?

That's quite a hard question. Generally the question about how much
parallelism for what is beneficial will be one of the most complicated
areas once the plumbing is in.

 Do we break OS readahead so badly that performance actually regresses?

I don't think it's likely that we break OS readahead - that works on a
per-task basis, at least on Linux afaik. But it's nonetheless very easy
to have too many streams causing too many random reads.

 These are things that are likely to
 need a fair amount of tuning before this is ready for prime time, so
 being able to start experimenting with them in advance of all of the
 infrastructure being completely ready seems like it might help.

I'm not actually entirely sure how much that's going to help. I think
you could very well have a WIP patch ready reasonably quickly that doesn't
solve the issues you mention above by patching it in. For the kind of
testing we're talking about that seems likely sufficient - a git branch
somewhere probably is actually easier to compile for people than some
contrib module that needs to be loaded...
And I *do* think that you'll very quickly hit the limits of the custom
scan API. And I'd much rather see you work on improving parallelism than
the custom scan stuff, just so you can prototype further ahead.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-10 Thread Haribabu Kommi
On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
 Does parallelism help at all?

 I'm pretty damn sure. We can't even make mildly powerful storage
 fully busy right now. Heck, I can't make my workstation's storage with a
 raid 10 out of four spinning disks fully busy.

 I think some of that benefit also could be reaped by being better at
 hinting the OS...

Yes, it definitely helps, and it's not limited to IO-bound operations;
it gives a good gain for queries having CPU-intensive WHERE conditions.

One more point we may need to consider: is there any overhead in passing
data rows from the workers to the backend? I feel this may play a major
role when the selectivity is high.

Regards,
Hari Babu
Fujitsu Australia




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-10 Thread Amit Kapila
On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi kommi.harib...@gmail.com
wrote:

 On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com
wrote:
  On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
  Does parallelism help at all?
 
  I'm pretty damn sure. We can't even make mildly powerful storage
  fully busy right now. Heck, I can't make my workstation's storage with a
  raid 10 out of four spinning disks fully busy.
 
  I think some of that benefit also could be reaped by being better at
  hinting the OS...

 Yes, it definitely helps, and it's not limited to IO-bound operations;
 it gives a good gain for queries having CPU-intensive WHERE conditions.

 One more point we may need to consider, is there any overhead in passing
 the data row from workers to backend?

I am not sure that overhead will be very visible if we improve the
use of the I/O subsystem by making parallel tasks work on it. However,
another idea here could be that instead of passing tuple data, we just
pass the tuple id; but in that case we have to retain the pin on the
buffer that contains the tuple until the master backend reads from it,
which might have its own kind of problems.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-10 Thread Haribabu Kommi
On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi kommi.harib...@gmail.com
 wrote:

 On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com
 wrote:
  On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
  Does parallelism help at all?
 
  I'm pretty damn sure. We can't even make mildly powerful storage
  fully busy right now. Heck, I can't make my workstation's storage with a
  raid 10 out of four spinning disks fully busy.
 
  I think some of that benefit also could be reaped by being better at
  hinting the OS...

 Yes, it definitely helps, and it's not limited to IO-bound operations;
 it gives a good gain for queries having CPU-intensive WHERE conditions.

 One more point we may need to consider, is there any overhead in passing
 the data row from workers to backend?

 I am not sure that overhead will be very visible if we improve the
 use of the I/O subsystem by making parallel tasks work on it.

I feel there may be an overhead because the workers need to put the result
data in shared memory and the backend has to read it from there to process
it further. If the cost of transferring data from worker to backend is more
than fetching a tuple from the scan, the overhead becomes visible when the
selectivity is high.

 However
 another idea here could be that instead of passing tuple data, we just
 pass tuple id, but in that case we have to retain the pin on the buffer
 that contains the tuple until the master backend reads from it, which
 might have its own kind of problems.

Transferring the tuple id doesn't help in scenarios where the node needs
any projection.

Regards,
Hari Babu
Fujitsu Australia




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-10 Thread Amit Kapila
On Tue, Nov 11, 2014 at 9:42 AM, Haribabu Kommi kommi.harib...@gmail.com
wrote:

 On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
  On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi 
kommi.harib...@gmail.com
  wrote:
 
  On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com

  wrote:
   On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
   Does parallelism help at all?
  
   I'm pretty damn sure. We can't even make mildly powerful storage
   fully busy right now. Heck, I can't make my workstation's storage
   with a raid 10 out of four spinning disks fully busy.
  
   I think some of that benefit also could be reaped by being better at
   hinting the OS...
 
  Yes, it definitely helps, and it's not limited to IO-bound operations;
  it gives a good gain for queries having CPU-intensive WHERE conditions.
 
  One more point we may need to consider, is there any overhead in
passing
  the data row from workers to backend?
 
  I am not sure that overhead will be very visible if we improve the
  use of the I/O subsystem by making parallel tasks work on it.

 I feel there may be an overhead because the workers need to put the result
 data in shared memory and the backend has to read it from there to process
 it further. If the cost of transferring data from worker to backend is more
 than fetching a tuple from the scan, the overhead becomes visible when the
 selectivity is high.

  However
  another idea here could be that instead of passing tuple data, we just
  pass tuple id, but in that case we have to retain the pin on the buffer
  that contains the tuple until the master backend reads from it, which
  might have its own kind of problems.

 Transfering tuple id doesn't solve the scenarios if the node needs any
 projection.

Hmm, that's why I said we need to retain the buffer pin, so that we can
get the tuple data.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-10 Thread Kouhei Kaigai



 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org
 [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Haribabu Kommi
 Sent: Tuesday, November 11, 2014 1:13 PM
 To: Amit Kapila
 Cc: Andres Freund; Robert Haas; Simon Riggs; Tom Lane;
 pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] using custom scan nodes to prototype parallel
 sequential scan
 
 On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila amit.kapil...@gmail.com
 wrote:
  On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi
  kommi.harib...@gmail.com
  wrote:
 
  On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund
  and...@2ndquadrant.com
  wrote:
   On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
   Does parallelism help at all?
  
   I'm pretty damn sure. We can't even make mildly powerful storage
   fully busy right now. Heck, I can't make my workstation's storage
   with a raid 10 out of four spinning disks fully busy.
  
   I think some of that benefit also could be reaped by being better
   at hinting the OS...
 
  Yes, it definitely helps, and it's not limited to IO-bound operations;
  it gives a good gain for queries having CPU-intensive WHERE conditions.
 
  One more point we may need to consider, is there any overhead in
  passing the data row from workers to backend?
 
  I am not sure if that overhead will be too much visible if we improve
  the use of I/O subsystem by making parallel tasks working on it.
 
 I feel there may be an overhead because the workers need to put the result
 data in shared memory and the backend has to read it from there to process
 it further. If the cost of transferring data from worker to backend is more
 than fetching a tuple from the scan, the overhead becomes visible when the
 selectivity is high.
 
In my experience, data copy and transformation to fit the TupleTableSlot is
the biggest overhead, rather than the scan or join itself...
Probably, a straightforward way is to construct the arrays of values/isnull
on a shared memory segment; the backend process then just switches its
tts_values/tts_isnull pointers, with no data copy. It gave us a performance
gain.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com
