Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:

  Likely for most aggregates, like count, sum, max, min, bit_and and bit_or, the merge function would be the same as the transition function, as the state type is just the same as the input type. It would only be aggregates like avg(), stddev*(), bool_and() and bool_or() that would need a new merge function made... These would be no more complex than the transition functions... which are just a few lines of code anyway. We'd simply not run a parallel query if any aggregates used in the query didn't have a merge function.

  When I mentioned this, I didn't mean to appear to be placing a road block. I was just bringing to the table the information that COUNT(*) + COUNT(*) works OK for merging COUNT(*)'s sub-totals, but AVG(n) + AVG(n) does not.

Sorry, late reply, but, FYI, I don't think our percentile functions can be parallelized in the same way:

    test=> \daS *percent*
                                            List of aggregate functions
       Schema   |      Name       |  Result data type  |             Argument data types              |             Description
    ------------+-----------------+--------------------+----------------------------------------------+-------------------------------------
     pg_catalog | percent_rank    | double precision   | VARIADIC any ORDER BY VARIADIC any           | fractional rank of hypothetical row
     pg_catalog | percentile_cont | double precision   | double precision ORDER BY double precision   | continuous distribution percentile
     pg_catalog | percentile_cont | double precision[] | double precision[] ORDER BY double precision | multiple continuous percentiles
     pg_catalog | percentile_cont | interval           | double precision ORDER BY interval           | continuous distribution percentile
     pg_catalog | percentile_cont | interval[]         | double precision[] ORDER BY interval         | multiple continuous percentiles
     pg_catalog | percentile_disc | anyelement         | double precision ORDER BY anyelement         | discrete percentile
     pg_catalog | percentile_disc | anyarray           | double precision[] ORDER BY anyelement       | multiple discrete percentiles

--
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB                           http://enterprisedb.com

  + Everyone has their own god. +
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
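The distinction Bruce draws can be sketched concretely. The snippet below (illustrative Python, not PostgreSQL code; the data is made up) shows that COUNT sub-totals combine directly, that finished AVG results do not combine but (count, sum) states do, and that an ordered-set aggregate such as a median/percentile has no comparably small mergeable state:

```python
# Why some aggregates merge cleanly from per-worker sub-results and others do not.
data = [1, 2, 3, 4, 100]
part1, part2 = data[:2], data[2:]          # two "workers" scan disjoint halves

# COUNT: sub-totals simply add up to the global result.
assert len(part1) + len(part2) == len(data)

# AVG: combining the finished sub-averages gives the wrong answer...
avg = lambda xs: sum(xs) / len(xs)
naive = (avg(part1) + avg(part2)) / 2
assert naive != avg(data)

# ...but merging (count, sum) intermediate states works exactly.
s1, c1 = sum(part1), len(part1)
s2, c2 = sum(part2), len(part2)
assert (s1 + s2) / (c1 + c2) == avg(data)

# Ordered-set aggregates like percentile_disc are different again:
# the median of the sub-medians is not the global median, and no
# fixed-size merged state can recover it.
median = lambda xs: sorted(xs)[len(xs) // 2]
assert median([median(part1), median(part2)]) != median(data)
```

This is why a merge-function scheme covers count/sum/avg-style aggregates but not the percentile family, which needs the whole input (or at least a sort) in one place.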
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:

  [quoted discussion of merge functions and the \daS *percent* listing trimmed; see upthread]

Yep, it seems to me that is the type of aggregate function that is not obvious to split into multiple partitions.
I think it is valuable even if we can only push down the part of the aggregate functions that is well known by the core planner. For example, we know count(*) = sum(nrows), and we also know avg(X) can be rewritten to an enhanced avg() that takes both nrows and the partial sum of X. We can utilize this knowledge to break down aggregate functions.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Wed, Dec 3, 2014 at 3:23 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

  [quoted discussion of merge functions, the \daS *percent* listing, and aggregate break-down trimmed; see upthread]

Postgres-XC (Postgres-XL) implemented such parallel aggregate logic some time ago, using a set of sub-functions and a finalization function to do the work. My 2c.
--
Michael
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Thu, Nov 13, 2014 at 7:27 PM, Simon Riggs si...@2ndquadrant.com wrote:

  On 12 November 2014 00:54, Robert Haas robertmh...@gmail.com wrote:

    On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs si...@2ndquadrant.com wrote: * only functions marked as CONTAINS NO SQL. We don't really know what proisparallel is, but we do know what CONTAINS NO SQL means and can easily check for it. Plus I already have a patch for this, slightly bitrotted.

    Interestingly, I have a fairly solid idea of what proisparallel is, but I have no clear idea what CONTAINS NO SQL is or why it's relevant. I would imagine that srandom() contains no SQL under any reasonable definition of what that means, but it ain't parallel-safe.

  What is wrong in generating random numbers in parallel?

If they're random, nothing, provided that you use a different seed in each backend. But if the seed has been set using srandom(), then we must generate the particular sequence of random numbers associated with that seed. If we don't, the behavior is incompatible with what happens apart from parallel mode.

  But I'm sure many volatile functions would be annoying to support, so CONTAINS NO SQL and STABLE/IMMUTABLE seems OK for the first thing.

It might be OK to assume that immutable functions are fine, but not all stable functions look like they'll be safe, or not without additional effort: e.g. inet_client_addr(), pg_backend_pid(), pg_cursor(), pg_event_trigger_dropped_objects(). Or even, more mundanely, generate_series(), unless we ensure all calls are in the same backend. On the other hand, there are quite a few stable functions that it seems like we would want to allow in parallel queries, like to_char(), concat(), and textanycat().

I still don't know what CONTAINS NO SQL means.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
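Robert's srandom() point can be illustrated outside PostgreSQL. In the sketch below, Python's random module stands in for srandom()/random(): a fixed seed pins down exactly one sequence, so independently-seeded parallel workers could not jointly reproduce the stream a single backend would have emitted.

```python
import random

# Two generators with the same seed produce the same stream: once a user
# has done srandom()-style seeding, there is exactly one sequence the
# session is expected to emit from that point on.
g1, g2 = random.Random(42), random.Random(42)
assert [g1.random() for _ in range(3)] == [g2.random() for _ in range(3)]

# A differently-seeded generator diverges immediately, which is why
# workers each seeding themselves cannot mimic the single-backend stream.
g3 = random.Random(7)
assert random.Random(42).random() != g3.random()
```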
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 14 November 2014 11:02, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

  I'd like to throw a question to community folks. Has anyone discussed the challenge of aggregate push-down across relation joins in the past? It potentially reduces the number of rows to be joined. If we have already, I'd like to check up on the discussion from that time.

Yes, I was looking at aggregate pushdown. I think it needs the same changes to aggregates discussed upthread.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 17 November 2014 16:01, Robert Haas robertmh...@gmail.com wrote:

  I still don't know what CONTAINS NO SQL means.

It's a function marking that would indicate we aren't allowed to take snapshots or run SQL.

I think you should publish the scheme for marking functions as safe for parallelism, so we can judge that.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 18 November 2014 05:19, Simon Riggs si...@2ndquadrant.com wrote:

  On 14 November 2014 11:02, Kouhei Kaigai kai...@ak.jp.nec.com wrote: I'd like to throw a question to community folks. Has anyone discussed the challenge of aggregate push-down across relation joins in the past? It potentially reduces the number of rows to be joined. If we have already, I'd like to check up on the discussion from that time.

  Yes, I was looking at aggregate pushdown. I think it needs the same changes to aggregates discussed upthread.

I have something that I've been working on locally. It's far from ready, but it does work in very simple cases and shows a nice performance boost. I'll start another thread soon and copy you both in. Perhaps we can share some ideas.

Regards

David Rowley
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 18 November 2014 05:19, Simon Riggs si...@2ndquadrant.com wrote:

  On 14 November 2014 11:02, Kouhei Kaigai kai...@ak.jp.nec.com wrote: I'd like to throw a question to community folks. Has anyone discussed the challenge of aggregate push-down across relation joins in the past? It potentially reduces the number of rows to be joined. If we have already, I'd like to check up on the discussion from that time.

  Yes, I was looking at aggregate pushdown. I think it needs the same changes to aggregates discussed upthread.

  I have something that I've been working on locally. It's far from ready, but it does work in very simple cases and shows a nice performance boost. I'll start another thread soon and copy you both in. Perhaps we can share some ideas.

Great, that's exactly the kind of functionality that would be valuable to have in core.

--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 14 November 2014 20:37, Jim Nasby jim.na...@bluetreble.com wrote:

  On 11/12/14, 1:54 AM, David Rowley wrote: We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means that could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions.

  Sanity check: what % of a large aggregate query fed by a seqscan is actually spent in the aggregate functions? Even if you look strictly at CPU cost, isn't there more code involved in getting data to the aggregate function than in the aggregation itself, except maybe for numeric?

You might be right, but that sounds like it would need all the parallel workers to send each matching tuple to a queue to be processed by some aggregate node. I guess this would have more of an advantage for wider tables, or tables with many dead tuples, or if the query has quite a selective WHERE clause, as less data would make it onto that queue.

Perhaps I've taken one step too far forward here. I had been thinking that each worker would perform the partial seqscan and, in the worker context, pass the tuple down to the aggregate node. Then later, once each worker had completed, some other, perhaps new, node type (MergeAggregateStates) would merge all those intermediate agg states into the final agg state (which would then be ready for the final function to be called).

Are there any plans for what will be in charge of deciding how many workers are allocated to a parallel query? Will this be something that's done at planning time? Or should the planner just create a parallel-friendly plan if the plan is costly enough, and then let the executor decide how many workers to throw at the job, based on how busy the system is with other tasks at execution time?

Regards

David Rowley
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 14 November 2014 07:37, Jim Nasby jim.na...@bluetreble.com wrote:

  On 11/12/14, 1:54 AM, David Rowley wrote: On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com wrote: This plan type is widely used in reporting queries, so will hit the mainline of BI applications and many Mat View creations. This will allow SELECT count(*) FROM foo to go faster also. We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means that could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions.

  Sanity check: what % of a large aggregate query fed by a seqscan is actually spent in the aggregate functions? Even if you look strictly at CPU cost, isn't there more code involved in getting data to the aggregate function than in the aggregation itself, except maybe for numeric?

Yes, which is why I suggested pre-aggregating before collecting the streams together. The point is not that the aggregation is expensive; it's that the aggregation eats data, so the required bandwidth for later steps is reduced and hence does not then become a bottleneck that renders the parallel Seq Scan ineffective.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 14 November 2014 01:51, David Rowley dgrowle...@gmail.com wrote:

  When I mentioned this, I didn't mean to appear to be placing a road block. I was just bringing to the table the information that COUNT(*) + COUNT(*) works OK for merging COUNT(*)'s sub-totals, but AVG(n) + AVG(n) does not.

Merge functions should be a simple patch to write. If I thought there was going to be a use for them in this release, I'd put my hand up to put a patch together. The hard part is describing and then agreeing on the semantics. If you raise a separate post on this, copy me in and I will help.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 14 November 2014 07:37, Jim Nasby jim.na...@bluetreble.com wrote:

  On 11/12/14, 1:54 AM, David Rowley wrote: On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com wrote: This plan type is widely used in reporting queries, so will hit the mainline of BI applications and many Mat View creations. This will allow SELECT count(*) FROM foo to go faster also. We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means that could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions.

  Sanity check: what % of a large aggregate query fed by a seqscan is actually spent in the aggregate functions? Even if you look strictly at CPU cost, isn't there more code involved in getting data to the aggregate function than in the aggregation itself, except maybe for numeric?

  Yes, which is why I suggested pre-aggregating before collecting the streams together. The point is not that the aggregation is expensive; it's that the aggregation eats data, so the required bandwidth for later steps is reduced and hence does not then become a bottleneck that renders the parallel Seq Scan ineffective.

I'd like to throw a question to community folks. Has anyone discussed the challenge of aggregate push-down across relation joins in the past? It potentially reduces the number of rows to be joined. If we have already, I'd like to check up on the discussion from that time.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Fri, Nov 14, 2014 at 1:27 PM, Simon Riggs si...@2ndquadrant.com wrote: On 12 November 2014 00:54, Robert Haas robertmh...@gmail.com wrote: Interestingly, I have a fairly solid idea of what proisparallel is, but I have no clear idea what CONTAINS NO SQL is or why it's relevant. I would imagine that srandom() contains no SQL under any reasonable definition of what that means, but it ain't parallel-safe. What is wrong in generating random numbers in parallel? I was just watching Robert's talk on Parallel query on youtube... I think the answer is at 41:09, the link below should take you there: https://www.youtube.com/watch?v=3klfarKEtMQ#t=2469 Regards David Rowley
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 12 November 2014 00:54, Robert Haas robertmh...@gmail.com wrote:

  I'm going to aim for the simpler:

    Hash Aggregate
    ->  Parallel Seq Scan
          Workers: 4

  Yeah, I know that won't perform as well as what you're proposing, but I'm fairly sure it's simpler.

Simple is best, so +1.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 12 November 2014 07:54, David Rowley dgrowle...@gmail.com wrote:

  On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com wrote: This plan type is widely used in reporting queries, so will hit the mainline of BI applications and many Mat View creations. This will allow SELECT count(*) FROM foo to go faster also.

  We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means that could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions.

Do you mean something like a subtotal or intermediate combination function?

I guess we'd need the same thing to make intermediate aggregates work during a sort?

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 12 November 2014 00:54, Robert Haas robertmh...@gmail.com wrote:

  On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs si...@2ndquadrant.com wrote: * only functions marked as CONTAINS NO SQL. We don't really know what proisparallel is, but we do know what CONTAINS NO SQL means and can easily check for it. Plus I already have a patch for this, slightly bitrotted.

  Interestingly, I have a fairly solid idea of what proisparallel is, but I have no clear idea what CONTAINS NO SQL is or why it's relevant. I would imagine that srandom() contains no SQL under any reasonable definition of what that means, but it ain't parallel-safe.

What is wrong in generating random numbers in parallel?

But I'm sure many volatile functions would be annoying to support, so CONTAINS NO SQL and STABLE/IMMUTABLE seems OK for the first thing.

--
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Fri, Nov 14, 2014 at 1:19 PM, Simon Riggs si...@2ndquadrant.com wrote:

  On 12 November 2014 07:54, David Rowley dgrowle...@gmail.com wrote: On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com wrote: This plan type is widely used in reporting queries, so will hit the mainline of BI applications and many Mat View creations. This will allow SELECT count(*) FROM foo to go faster also. We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means that could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions.

  Do you mean something like a subtotal or intermediate combination function?

If you had 4 parallel workers performing a seqscan, say the relation was 4000 pages in total, you could say that worker 1 would scan blocks 0-999, worker 2 blocks 1000-1999, etc. After each worker had finished, there would then be 4 sets of records that needed to be merged into 1 set.

Take int4_avg_accum() for example; it does:

    transdata->count++;
    transdata->sum += newval;

The merge function would need to perform something like:

    transdata->count += transdata2merge.count;
    transdata->sum += transdata2merge.sum;

Then the final function could be called on the merged aggregate state.

The same can be applied when the query contains a GROUP BY clause; we'd just need to pay attention to which groups we merge together for that to work. Any HAVING clause would have to be applied after the groups have been merged.

This whole topic is pretty exciting for data warehouse type workloads.

Regards

David Rowley
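David's block-range scheme can be sketched end to end. The Python below is illustrative only (the real transition functions are C, and the block ranges stand in for actual worker scheduling); it mirrors int4_avg_accum()'s (count, sum) state, the proposed merge step, and the final function:

```python
def transfn(state, newval):
    # per-row transition, run inside each worker (cf. int4_avg_accum)
    count, total = state
    return (count + 1, total + newval)

def mergefn(state, other):
    # combine two workers' intermediate states
    return (state[0] + other[0], state[1] + other[1])

def finalfn(state):
    # run once, on the fully merged state (cf. int8_avg / avg final func)
    count, total = state
    return total / count

# Four workers each "scan" a disjoint block range and build a partial state.
chunks = [range(0, 1000), range(1000, 2000), range(2000, 3000), range(3000, 4000)]
states = []
for chunk in chunks:
    st = (0, 0)
    for v in chunk:
        st = transfn(st, v)
    states.append(st)

# A MergeAggregateStates-style step folds the partials into one state.
merged = (0, 0)
for st in states:
    merged = mergefn(merged, st)

assert finalfn(merged) == sum(range(4000)) / 4000
```

With a GROUP BY, the same merge would be applied per group key, and HAVING evaluated only after merging, exactly as described above.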
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 12 November 2014 07:54, David Rowley dgrowle...@gmail.com wrote:

  [quoted discussion of the plan type and merge infrastructure trimmed; see upthread]

  Take int4_avg_accum() for example; it does:

      transdata->count++;
      transdata->sum += newval;

  The merge function would need to perform something like:

      transdata->count += transdata2merge.count;
      transdata->sum += transdata2merge.sum;

  Then the final function could be called on the merged aggregate state. The same can be applied when the query contains a GROUP BY clause; we'd just need to pay attention to which groups we merge together for that to work. Any HAVING clause would have to be applied after the groups have been merged. This whole topic is pretty exciting for data warehouse type workloads.

To simplify further, we can describe a parallel-aware aggregate function. Please assume an AVG(X) function that takes nrows and the sum of X. Its transition function performs as you described above, and then the final function works as usual. The jobs the parallel seq scan needs to do are:

  1. replace AVG(X) by AVG(nrows, sum(X))
  2. generate count(*) on the partial relation being grouped.
  3. generate sum(X) on the partial relation being grouped.

It looks like the following query:

    SELECT cat, AVG(nrows, sum_X) FROM (
      SELECT cat, count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno BETWEEN 0 AND 999 GROUP BY cat
      UNION ALL
      SELECT cat, count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno BETWEEN 1000 AND 1999 GROUP BY cat
      UNION ALL
      SELECT cat, count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno BETWEEN 2000 AND 2999 GROUP BY cat
      UNION ALL
      SELECT cat, count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno BETWEEN 3000 AND 3999 GROUP BY cat
    ) sub GROUP BY cat;

Note that the stuff inside the sub-query is the job of the parallel scan. Do we need to invent new infrastructure here?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com
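KaiGai's UNION-of-block-ranges rewrite can be simulated outside SQL. The Python below is a sketch only (the table contents and the two-partition split are made up): each "worker" computes count(*)/sum(X) per group over its block range, and a wrap-up pass merges the per-group sub-totals and finishes AVG.

```python
# A toy table of (cat, X) rows standing in for tbl.
tbl = [("a", 10), ("b", 20), ("a", 30), ("b", 40), ("a", 50), ("b", 60)]

def partial_agg(rows):
    # The inner sub-queries: per-group (nrows, sum_X) over one block range.
    groups = {}
    for cat, x in rows:
        nrows, sum_x = groups.get(cat, (0, 0))
        groups[cat] = (nrows + 1, sum_x + x)
    return groups

# Two partitions standing in for the blkno BETWEEN ... WHERE clauses.
parts = [partial_agg(tbl[:3]), partial_agg(tbl[3:])]

# The outer query: merge sub-totals per group, then finish AVG.
merged = {}
for part in parts:
    for cat, (nrows, sum_x) in part.items():
        n0, s0 = merged.get(cat, (0, 0))
        merged[cat] = (n0 + nrows, s0 + sum_x)

result = {cat: s / n for cat, (n, s) in merged.items()}
assert result == {"a": 30.0, "b": 40.0}
```

The merge step here is the part the executor would have to learn; the per-partition scans are exactly the work a parallel seq scan already does.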
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Fri, Nov 14, 2014 at 2:12 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

  On 12 November 2014 07:54, David Rowley dgrowle...@gmail.com wrote: Take int4_avg_accum() for example; it does:

      transdata->count++;
      transdata->sum += newval;

  The merge function would need to perform something like:

      transdata->count += transdata2merge.count;
      transdata->sum += transdata2merge.sum;

  Then the final function could be called on the merged aggregate state.

  To simplify further, we can describe a parallel-aware aggregate function. Please assume an AVG(X) function that takes nrows and the sum of X. Its transition function performs as you described above, and then the final function works as usual. The jobs the parallel seq scan needs to do are: 1. replace AVG(X) by AVG(nrows, sum(X)); 2. generate count(*) on the partial relation being grouped; 3. generate sum(X) on the partial relation being grouped.

  [the UNION query over block ranges, quoted upthread, trimmed]

Well, this would require giving the planner some kind of knowledge of what AVG() is. It currently knows nothing about that. It currently calls the transition function for each row and the final function at the end, and does not care or need to care about what either of those functions actually does. The transformation above looks like it would need to care, and the logic to add that would be way more complex than aggregate merge functions.

It may make sense to have an extra catalog to indicate how a usual aggregate function can be broken down (or that this is unavailable). Like:

    AVG(X) = {COUNT(X) + SUM(X)} x N-partitions

  Likely for most aggregates, like count, sum, max, min, bit_and and bit_or, the merge function would be the same as the transition function, as the state type is just the same as the input type. It would only be aggregates like avg(), stddev*(), bool_and() and bool_or() that would need a new merge function made... These would be no more complex than the transition functions... which are just a few lines of code anyway. We'd simply not run a parallel query if any aggregates used in the query didn't have a merge function.

  When I mentioned this, I didn't mean to appear to be placing a road block. I was just bringing to the table the information that COUNT(*) + COUNT(*) works OK for merging COUNT(*)'s sub-totals, but AVG(n) + AVG(n) does not. Merge functions should be a simple patch to write. If I thought there was going to be a use for them in this release, I'd put my hand up to put a patch together.

The thing I'm uncertain about is how the caller of the aggregate function distinguishes the context in which to call the usual transition function versus the new merge function. Let's go back to an example:

    SELECT cat, COUNT(*), AVG(X) FROM t GROUP BY cat;

Its plan tree is probably as follows:

    HashAggregate
      Group Key: cat
      ->  Custom Scan (Parallel Scan)
            # of workers: 4

The caller of the transition/merge/final functions is the HashAggregate node. On the other hand, it has to know whether the underlying plan returns every record of the underlying table, or sub-totals from a parallel scan. (Please correct me if my assumption is wrong.) Once HashAggregate knows that the sub-plan returns sub-totals of the relation, it can choose the merge function instead of the transition function. However, what I want to clarify is how to inform the HashAggregate node that its sub-plan intends to return sub-totals instead of individual rows.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com
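KaiGai's question is essentially whether the Agg node carries a planner-set flag saying "my input is partial states, not raw rows". A minimal sketch of that idea (illustrative Python; `inputs_are_partial` and the function names are made up, not PostgreSQL APIs):

```python
def trans_avg(state, newval):
    # usual per-row transition function
    return (state[0] + 1, state[1] + newval)

def merge_avg(state, partial):
    # merge function: fold another (count, sum) state into ours
    return (state[0] + partial[0], state[1] + partial[1])

def final_avg(state):
    return state[1] / state[0]

def agg_node(inputs, inputs_are_partial):
    # The node picks its step function from a flag the planner would set
    # when the sub-plan is known to emit sub-totals rather than rows.
    state = (0, 0)
    step = merge_avg if inputs_are_partial else trans_avg
    for item in inputs:
        state = step(state, item)
    return final_avg(state)

rows = [1, 2, 3, 4]
# Serial plan: the node sees raw rows.
serial = agg_node(rows, inputs_are_partial=False)
# Parallel plan: two workers return partial (count, sum) states.
partials = [(2, 3), (2, 7)]   # states for [1, 2] and [3, 4]
parallel = agg_node(partials, inputs_are_partial=True)
assert serial == parallel == 2.5
```

Either way the final function runs once, on the finished state; only the per-input step differs between the two plan shapes.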
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 11/12/14, 1:54 AM, David Rowley wrote:

  On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com wrote: This plan type is widely used in reporting queries, so will hit the mainline of BI applications and many Mat View creations. This will allow SELECT count(*) FROM foo to go faster also.

  We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means that could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions.

Sanity check: what % of a large aggregate query fed by a seqscan is actually spent in the aggregate functions? Even if you look strictly at CPU cost, isn't there more code involved in getting data to the aggregate function than in the aggregation itself, except maybe for numeric?

In other words, I suspect that just having a dirt-simple parallel SeqScan could be a win for CPU. It should certainly be a win IO-wise; in my experience we're not very good at maxing out IO systems.

(I was curious and came up with the list below for just the page-level stuff (ignoring IO). I don't see much code involved in per-tuple work, but I also never came across detoasting code, so I suspect I'm missing something...)

ExecScanFetch, heapgettup_pagemode, ReadBuffer, BufferAlloc, heap_page_prune_opt, LWLockAcquire... then you can finally do per-tuple work: HeapTupleSatisfiesVisibility.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Wed, Nov 12, 2014 at 1:24 PM, David Rowley dgrowle...@gmail.com wrote: On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com wrote: This plan type is widely used in reporting queries, so will hit the mainline of BI applications and many Mat View creations. This will allow SELECT count(*) FROM foo to go faster also. We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means it could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions. It might make sense to make a new planner operator which can be responsible for pulling from each of the individual parallel Agg nodes and then aggregating over the results. A couple of things that might be worth considering are whether we want to enforce parallel aggregation or let the planner decide whether to do a parallel aggregate or go with a single plan. For example, the average estimated size of groups might be one thing the planner could consider when deciding between a parallel and a single execution plan. I don't see merging states as an easy problem, and it should perhaps be tackled apart from this thread. Also, do we want to allow parallelism only with GroupAggs? Regards, Atri
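The merge-state idea being discussed can be sketched outside the executor. The snippet below is a hypothetical illustration only (plain Python, not PostgreSQL code, and the function names are made up): each worker runs the normal transition function over its share of the rows, and a separate merge function combines the partial states. It also shows why avg() needs a richer state, a (sum, count) pair, rather than merging finalized averages.

```python
# Hypothetical sketch of merging per-worker aggregate states (not PostgreSQL code).
# count/sum/max can merge with the same operation as their transition step;
# avg needs a (sum, count) state merged pairwise, as David Rowley notes.

def avg_transition(state, value):
    s, n = state
    return (s + value, n + 1)

def avg_merge(a, b):
    # Combine two partial states produced by different workers.
    return (a[0] + b[0], a[1] + b[1])

def avg_final(state):
    s, n = state
    return s / n if n else None

rows = [1, 2, 3, 10]          # pretend worker 0 scans rows[:1], worker 1 scans rows[1:]
w0 = w1 = (0, 0)
for v in rows[:1]:
    w0 = avg_transition(w0, v)
for v in rows[1:]:
    w1 = avg_transition(w1, v)

merged = avg_final(avg_merge(w0, w1))          # 4.0, same as avg over all rows
naive = (avg_final(w0) + avg_final(w1)) / 2    # 3.0 -- averaging per-worker averages is wrong
```

This is exactly Bruce's AVG(n) + AVG(n) point later in the thread: the finalized values cannot be merged, but the transition states can.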
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Tue, Nov 11, 2014 at 7:48 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient? There are certainly things that are parallel-safe that are not immutable. It might be the case that everything immutable is parallel-safe. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Wed, Nov 12, 2014 at 9:49 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Nov 11, 2014 at 7:48 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient? There are certainly things that are parallel-safe that are not immutable. It might be the case that everything immutable is parallel-safe. FWIW, when working on the concept of expression and clause shippability for Postgres-XC (aka the possibility to pass it safely to another backend, but in another PG node in this case), we discussed similar things, and if I recall correctly we even discussed adding a flag to pg_proc to define whether a function was shippable or not. Finally what we settled on was not adding a new flag and using as a rule that all immutable functions can be safely shipped, and others not, even though some stable functions *could* be safe. Maybe Ashutosh has more comments on that; my memory may be failing. In the end, I think you would finish with something similar. My 2c. -- Michael
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On 10 November 2014 15:57, Robert Haas robertmh...@gmail.com wrote: On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs si...@2ndquadrant.com wrote: Something usable, with severe restrictions, is actually better than what we have now. I understand the journey this work represents, so don't be embarrassed by submitting things with heuristics and good-enoughs in it. Our mentor, Mr. Lane, achieved much by spreading work over many releases, leaving others to join in the task. It occurs to me that, now that the custom-scan stuff is committed, it wouldn't be that hard to use that, plus the other infrastructure we already have, to write a prototype of parallel sequential scan. +1 Given where we are with the infrastructure, there would be a number of unhandled problems, such as deadlock detection (needs group locking or similar), assessment of quals as to parallel-safety (needs proisparallel or similar), general waterproofing to make sure that pushing down a qual we shouldn't doesn't do anything really dastardly like crash the server (another written but yet-to-be-published patch adds a bunch of relevant guards), and snapshot sharing (likewise). But if you don't do anything weird, it should basically work. If we build something fairly restricted, but production ready, we can still do something useful for users:
* only functions marked as CONTAINS NO SQL
* parallel_workers = 2 or fixed
* only one Plan type, designed to maximise benefit and minimize optimizer change
Why? The above is all that is needed to test any infrastructure changes we accept. We need both infrastructure and tests. We could do this as a custom scan plugin, but why not make it work in core - that doesn't prevent a plugin version also if you desire it.
* only functions marked as CONTAINS NO SQL
We don't really know what proisparallel is, but we do know what CONTAINS NO SQL means and can easily check for it. Plus I already have a patch for this, slightly bitrotted.
* parallel_workers = 2 (or at least not user settable)
By fixing the number of workers at 2 we avoid any problems caused by having N variable, such as how to vary N fairly amongst users and other such considerations. We get the main benefit of parallelism without causing other issues across the server.
* Fixed Plan: aggregate-scan
To make everything simpler, allow only plans of a single type: SELECT something, list of aggregates FROM foo WHERE filters GROUP BY something. We know that passing large amounts of data from worker to master process will be slow, so focusing only on seq scan is not sensible; we should focus on plans that significantly reduce the number of rows passed upwards. We could just do this for very selective WHERE clauses, but that is not an important class of query. As soon as we include aggregates, we reduce data passing significantly AND we hit a very important subset of queries: this plan type is widely used in reporting queries, so it will hit the mainline of BI applications and many Mat View creations. This will allow SELECT count(*) FROM foo to go faster also. The execution plan for that query type looks like this...
Hash Aggregate
  Gather From Workers
    {Worker Nodes, workers = 2
      HashAggregate
        PartialScan}
Which is simple enough that we include a mechanism for the Gather operation, plus it is simple enough to not need extensive optimizer changes - just changes to set_plain_rel_pathlist() to consider parallel plans. Plan costs are easy to calculate for the above...
cpu_worker_startup_cost = 100 -- startup cost is easy to calculate by observation, but a reasonably large default will be OK
cpu_ipc_tuple_cost = 0.1 -- assume x10 normal cost of cpu_tuple_cost
Partial scan costs are just the same as SeqScan, just with fewer blocks. All other costs are the same. We can submit the main patch by Dec 15 and fix all the problems by Feb 15.
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
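Simon's back-of-envelope plan costs above can be checked with simple arithmetic. The sketch below is purely illustrative: the two `cpu_worker_startup_cost` and `cpu_ipc_tuple_cost` parameters are his proposals (not real planner GUCs), the cost model is a deliberate simplification, and the helper functions are hypothetical. It compares a serial seqscan-plus-aggregate against the two-worker plan, charging each worker a startup cost and each partial-group row shipped to the master an IPC cost.

```python
# Illustrative cost arithmetic for the proposed parallel aggregate plan.
# cpu_worker_startup_cost and cpu_ipc_tuple_cost follow Simon's email;
# they are proposals, not actual PostgreSQL planner parameters.
seq_page_cost = 1.0
cpu_tuple_cost = 0.01
cpu_worker_startup_cost = 100     # proposed: per-worker startup charge
cpu_ipc_tuple_cost = 0.1          # proposed: ~10x cpu_tuple_cost per row shipped upward

def serial_cost(pages, tuples):
    return pages * seq_page_cost + tuples * cpu_tuple_cost

def parallel_cost(pages, tuples, groups, workers=2):
    # Each worker scans its share of the blocks (partial scan = SeqScan with
    # fewer blocks), then ships only one partial state per group to the master.
    scan = serial_cost(pages / workers, tuples / workers)
    startup = cpu_worker_startup_cost * workers
    ipc = groups * workers * cpu_ipc_tuple_cost
    return startup + scan + ipc

# 100k pages, 10M tuples: few groups -> parallel wins; one group per row -> it loses.
few = parallel_cost(100_000, 10_000_000, groups=10)
many = parallel_cost(100_000, 10_000_000, groups=10_000_000)
serial = serial_cost(100_000, 10_000_000)
```

This makes Simon's point concrete: the win comes from the aggregate collapsing rows before they cross the worker/master boundary, so highly selective grouping dominates the IPC term.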
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
Given where we are with the infrastructure, there would be a number of unhandled problems, such as deadlock detection (needs group locking or similar), assessment of quals as to parallel-safety (needs proisparallel or similar), general waterproofing to make sure that pushing down a qual we shouldn't doesn't do anything really dastardly like crash the server (another written but yet-to-be-published patch adds a bunch of relevant guards), and snapshot sharing (likewise). But if you don't do anything weird, it should basically work. If we build something fairly restricted, but production ready, we can still do something useful for users:
* only functions marked as CONTAINS NO SQL
* parallel_workers = 2 or fixed
* only one Plan type, designed to maximise benefit and minimize optimizer change
Why? The above is all that is needed to test any infrastructure changes we accept. We need both infrastructure and tests. We could do this as a custom scan plugin, but why not make it work in core - that doesn't prevent a plugin version also if you desire it.
* only functions marked as CONTAINS NO SQL
We don't really know what proisparallel is, but we do know what CONTAINS NO SQL means and can easily check for it. Plus I already have a patch for this, slightly bitrotted. Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient?
* parallel_workers = 2 (or at least not user settable)
By fixing the number of workers at 2 we avoid any problems caused by having N variable, such as how to vary N fairly amongst users and other such considerations. We get the main benefit of parallelism without causing other issues across the server.
* Fixed Plan: aggregate-scan
To make everything simpler, allow only plans of a single type.
SELECT something, list of aggregates FROM foo WHERE filters GROUP BY something. We know that passing large amounts of data from worker to master process will be slow, so focusing only on seq scan is not sensible; we should focus on plans that significantly reduce the number of rows passed upwards. We could just do this for very selective WHERE clauses, but that is not an important class of query. As soon as we include aggregates, we reduce data passing significantly AND we hit a very important subset of queries: this plan type is widely used in reporting queries, so it will hit the mainline of BI applications and many Mat View creations. This will allow SELECT count(*) FROM foo to go faster also. I agree with you. The key to performance is how many rows can be reduced by the parallel processors; however, one concern is that the condition for kicking off parallel execution may be too restrictive. I'm not sure how many workloads run aggregate functions against flat tables rather than multiple joined relations. Sorry, it might be a topic to discuss in a separate thread. Can't we have a facility to push down aggregate functions across table joins? If we could do this in some cases, even if not all cases, it would make sense to run typical BI workloads in parallel. Let me assume the following query:
SELECT SUM(t1.X), SUM(t2.Y) FROM t1 JOIN t2 ON t1.pk_id = t2.fk_id GROUP BY t1.A, t2.B;
It is a usual query that joins two relations on a PK and FK. It can be broken down like:
SELECT SUM(t1.X), SUM(sq2.Y) FROM t1 JOIN (SELECT t2.fk_id, t2.B, SUM(t2.Y) Y FROM t2 GROUP BY t2.fk_id, t2.B) sq2 ON t1.pk_id = sq2.fk_id GROUP BY t1.A, sq2.B;
If the FK side has 100 tuples per PK on average, the aggregate that runs prior to the join will massively reduce the number of tuples to be joined, and we can leverage parallel processors to do the preliminary aggregation on a flat table. I'm now checking prior academic work to identify the conditions under which a join allows aggregate functions to be pushed down.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.7751 Thanks, -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
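KaiGai's rewrite can be checked on toy data. The sketch below is a hypothetical illustration with made-up rows: it computes SUM(t2.Y) grouped by (t1.A, t2.B) both directly over the join and via pre-aggregation of t2 by (fk_id, B), relying on t1.pk_id being unique so each pre-aggregated row joins exactly one t1 row. (SUM(t1.X) is the subtler case: after pre-aggregation each t1 row appears once per (fk_id, B) group rather than once per matching t2 row, so t1-side sums need extra care; this sketch only demonstrates the t2-side sum.)

```python
from collections import defaultdict

# Toy tables: t1(pk_id, A, X) with unique pk_id; t2(fk_id, B, Y) referencing it.
t1 = [(1, 'a', 10), (2, 'a', 20)]
t2 = [(1, 'p', 1), (1, 'p', 2), (1, 'q', 3), (2, 'p', 4)]

# Direct plan: join first, then aggregate SUM(Y) by (A, B).
direct = defaultdict(int)
for pk, a, x in t1:
    for fk, b, y in t2:
        if pk == fk:
            direct[(a, b)] += y

# Pushed-down plan: pre-aggregate t2 by (fk_id, B), then join and re-aggregate.
# t2's 4 rows shrink to 3 partial rows before the join ever runs.
sq2 = defaultdict(int)
for fk, b, y in t2:
    sq2[(fk, b)] += y
pushed = defaultdict(int)
for pk, a, x in t1:
    for (fk, b), ysum in sq2.items():
        if pk == fk:
            pushed[(a, b)] += ysum
```

Because pk_id is unique and B is part of both grouping keys, the two plans produce identical SUM(Y) results per group, which is the equivalence the cited paper formalizes.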
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs si...@2ndquadrant.com wrote: * only functions marked as CONTAINS NO SQL We don't really know what proisparallel is, but we do know what CONTAINS NO SQL means and can easily check for it. Plus I already have a patch for this, slightly bitrotted. Interestingly, I have a fairly solid idea of what proisparallel is, but I have no clear idea what CONTAINS NO SQL is or why it's relevant. I would imagine that srandom() contains no SQL under any reasonable definition of what that means, but it ain't parallel-safe. * parallel_workers = 2 (or at least not make it user settable) By fixing the number of workers at 2 we avoid any problems caused by having N variable, such as how to vary N fairly amongst users and other such considerations. We get the main benefit of parallelism, without causing other issues across the server. I think this is a fairly pointless restriction. The code simplification we'll get out of it appears to me to be quite minor, and we'll just end up putting the stuff back in anyway. * Fixed Plan: aggregate-scan To make everything simpler, allow only plans of a single type. SELECT something, list of aggregates FROM foo WHERE filters GROUP BY something because we know that passing large amounts of data from worker to master process will be slow, so focusing only on seq scan is not sensible; we should focus on plans that significantly reduce the number of rows passed upwards. We could just do this for very selective WHERE clauses, but that is not an important class of query. As soon as we include aggregates, we reduce data passing significantly AND we hit a very important subset of queries: This is moving the goalposts in a way that I'm not at all comfortable with. Parallel sequential-scan is pretty simple and may well be a win if there's a restrictive filter condition involved.
Parallel aggregation requires introducing new infrastructure into the aggregate machinery to allow intermediate state values to be combined, and that would be a great project for someone to do at some time, but it seems like a distraction for me to do that right now. This plan type is widely used in reporting queries, so will hit the mainline of BI applications and many Mat View creations. This will allow SELECT count(*) FROM foo to go faster also. The execution plan for that query type looks like this...
Hash Aggregate
  Gather From Workers
    {Worker Nodes, workers = 2
      HashAggregate
        PartialScan}
I'm going to aim for the simpler:
Hash Aggregate
  -> Parallel Seq Scan
       Workers: 4
Yeah, I know that won't perform as well as what you're proposing, but I'm fairly sure it's simpler. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs si...@2ndquadrant.com wrote: This plan type is widely used in reporting queries, so will hit the mainline of BI applications and many Mat View creations. This will allow SELECT count(*) FROM foo to go faster also. We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means it could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions. Regards David Rowley
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
Hi Robert, All, On 2014-11-10 10:57:16 -0500, Robert Haas wrote: On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs si...@2ndquadrant.com wrote: Something usable, with severe restrictions, is actually better than what we have now. I understand the journey this work represents, so don't be embarrassed by submitting things with heuristics and good-enoughs in it. Our mentor, Mr. Lane, achieved much by spreading work over many releases, leaving others to join in the task. It occurs to me that, now that the custom-scan stuff is committed, it wouldn't be that hard to use that, plus the other infrastructure we already have, to write a prototype of parallel sequential scan. Given where we are with the infrastructure, there would be a number of unhandled problems, such as deadlock detection (needs group locking or similar), assessment of quals as to parallel-safety (needs proisparallel or similar), general waterproofing to make sure that pushing down a qual we shouldn't doesn't do anything really dastardly like crash the server (another written but yet-to-be-published patch adds a bunch of relevant guards), and snapshot sharing (likewise). But if you don't do anything weird, it should basically work. I think this would be useful for a couple of reasons. First, it would be a demonstrable show of progress, illustrating how close we are to actually having something you can really deploy. Second, we could use it to demonstrate how the remaining infrastructure patches close up gaps in the initial prototype. Third, it would let us start doing real performance testing. I think it might be a useful experiment - as long as it's clear that it's that. Which is, I think, what you're thinking about? It seems pretty clear that a parallel sequential scan of data that's in memory (whether the page cache or the OS cache) can be accelerated by having multiple processes scan it in parallel. Right. But it's much less clear what will happen when the data is being read in from disk.
I think that *very* heavily depends on the IO subsystem. Does parallelism help at all? I'm pretty damn sure. We can't even make mildly powerful storage fully busy right now. Heck, I can't make my workstation's storage with a raid 10 out of four spinning disks fully busy. I think some of that benefit also could be reaped by being better at hinting the OS... What degree of parallelism helps? That's quite a hard question. Generally, the question of how much parallelism is beneficial for what will be one of the most complicated areas once the plumbing is in. Do we break OS readahead so badly that performance actually regresses? I don't think it's likely that we break OS readahead - that works on a per-task basis, at least on linux, afaik. But it's nonetheless very easy to have too many streams causing too many random reads. These are things that are likely to need a fair amount of tuning before this is ready for prime time, so being able to start experimenting with them in advance of all of the infrastructure being completely ready seems like it might help. I'm not actually entirely sure how much that's going to help. I think you could very well have a WIP patch ready reasonably quickly that doesn't solve the issues you mention above by patching it in. For the kind of testing we're talking about that seems likely sufficient - a git branch somewhere is probably actually easier to compile for people than some contrib module that needs to be loaded... And I *do* think that you'll very quickly hit the limits of the custom scan API. And I'd much rather see you work on improving parallelism than the custom scan stuff, just so you can prototype further ahead. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-11-10 10:57:16 -0500, Robert Haas wrote: Does parallelism help at all? I'm pretty damn sure. We can't even make mildly powerful storage fully busy right now. Heck, I can't make my workstation's storage with a raid 10 out of four spinning disks fully busy. I think some of that benefit also could be reaped by being better at hinting the OS... Yes, it definitely helps, and is not limited to IO-bound operations. It gives a good gain for queries with CPU-intensive WHERE conditions. One more point we may need to consider: is there any overhead in passing the data rows from workers to the backend? I feel this may play a major role when the selectivity is high. Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi kommi.harib...@gmail.com wrote: On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-11-10 10:57:16 -0500, Robert Haas wrote: Does parallelism help at all? I'm pretty damn sure. We can't even make mildly powerful storage fully busy right now. Heck, I can't make my workstation's storage with a raid 10 out of four spinning disks fully busy. I think some of that benefit also could be reaped by being better at hinting the OS... Yes, it definitely helps, and is not limited to IO-bound operations. It gives a good gain for queries with CPU-intensive WHERE conditions. One more point we may need to consider: is there any overhead in passing the data rows from workers to the backend? I am not sure that overhead will be very visible if we improve the use of the I/O subsystem by making parallel tasks work on it. However, another idea here could be that instead of passing tuple data, we just pass the tuple id, but in that case we have to retain the pin on the buffer that contains the tuple until the master backend reads from it, which might have its own kinds of problems. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi kommi.harib...@gmail.com wrote: On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-11-10 10:57:16 -0500, Robert Haas wrote: Does parallelism help at all? I'm pretty damn sure. We can't even make mildly powerful storage fully busy right now. Heck, I can't make my workstation's storage with a raid 10 out of four spinning disks fully busy. I think some of that benefit also could be reaped by being better at hinting the OS... Yes, it definitely helps, and is not limited to IO-bound operations. It gives a good gain for queries with CPU-intensive WHERE conditions. One more point we may need to consider: is there any overhead in passing the data rows from workers to the backend? I am not sure that overhead will be very visible if we improve the use of the I/O subsystem by making parallel tasks work on it. I feel there may be an overhead because the workers need to put the result data in shared memory and the backend has to read it from there to process it further. If the cost of transferring data from worker to backend is more than fetching a tuple from the scan, then the overhead is visible when the selectivity is high. However, another idea here could be that instead of passing tuple data, we just pass the tuple id, but in that case we have to retain the pin on the buffer that contains the tuple until the master backend reads from it, which might have its own kinds of problems. Transferring just the tuple id doesn't cover scenarios where the node needs any projection. Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Tue, Nov 11, 2014 at 9:42 AM, Haribabu Kommi kommi.harib...@gmail.com wrote: On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi kommi.harib...@gmail.com wrote: On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-11-10 10:57:16 -0500, Robert Haas wrote: Does parallelism help at all? I'm pretty damn sure. We can't even make mildly powerful storage fully busy right now. Heck, I can't make my workstation's storage with a raid 10 out of four spinning disks fully busy. I think some of that benefit also could be reaped by being better at hinting the OS... Yes, it definitely helps, and is not limited to IO-bound operations. It gives a good gain for queries with CPU-intensive WHERE conditions. One more point we may need to consider: is there any overhead in passing the data rows from workers to the backend? I am not sure that overhead will be very visible if we improve the use of the I/O subsystem by making parallel tasks work on it. I feel there may be an overhead because the workers need to put the result data in shared memory and the backend has to read it from there to process it further. If the cost of transferring data from worker to backend is more than fetching a tuple from the scan, then the overhead is visible when the selectivity is high. However, another idea here could be that instead of passing tuple data, we just pass the tuple id, but in that case we have to retain the pin on the buffer that contains the tuple until the master backend reads from it, which might have its own kinds of problems. Transferring just the tuple id doesn't cover scenarios where the node needs any projection. Hmm, that's why I said we need to retain the buffer pin, so that we can get the tuple data. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Tue, Nov 11, 2014 at 1:13 PM, Haribabu Kommi kommi.harib...@gmail.com wrote: On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi kommi.harib...@gmail.com wrote: On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-11-10 10:57:16 -0500, Robert Haas wrote: Does parallelism help at all? I'm pretty damn sure. We can't even make mildly powerful storage fully busy right now. Heck, I can't make my workstation's storage with a raid 10 out of four spinning disks fully busy. I think some of that benefit also could be reaped by being better at hinting the OS... Yes, it definitely helps, and is not limited to IO-bound operations. It gives a good gain for queries with CPU-intensive WHERE conditions. One more point we may need to consider: is there any overhead in passing the data rows from workers to the backend? I am not sure that overhead will be very visible if we improve the use of the I/O subsystem by making parallel tasks work on it. I feel there may be an overhead because the workers need to put the result data in shared memory and the backend has to read it from there to process it further. If the cost of transferring data from worker to backend is more than fetching a tuple from the scan, then the overhead is visible when the selectivity is high. In my experience, data copying and transformation to fit TupleTableSlot is the biggest overhead, rather than the scan or join itself... Probably, a straightforward way is to construct an array of values/isnull on a shared memory segment; then the backend process just switches the tts_values/tts_isnull pointers, with no data copy.
It gave us a performance gain. Thanks, -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
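KaiGai's zero-copy idea can be illustrated at a much higher level with Python's shared memory. This is an analogy only: the real implementation would build Datum/isnull arrays in a dynamic shared memory segment in C, and the sizes and values here are made up. A "worker" packs its output values into a shared buffer, and the "backend" attaches to the same segment and reads in place, switching a pointer rather than copying the bytes.

```python
import struct
from multiprocessing import shared_memory

# Analogy for the values/isnull-in-shared-memory idea: the worker writes into
# a shared segment; the backend maps the same segment by name and unpacks
# directly from that buffer instead of copying the row first.
shm = shared_memory.SharedMemory(create=True, size=3 * 8)

# "Worker" side: write three 64-bit values into the segment.
struct.pack_into('<3q', shm.buf, 0, 10, 20, 30)

# "Backend" side: attach to the same segment and read without an extra copy.
reader = shared_memory.SharedMemory(name=shm.name)
values = struct.unpack_from('<3q', reader.buf, 0)

reader.close()
shm.close()
shm.unlink()
```

The design point carries over: as long as the segment stays alive (in PostgreSQL terms, the worker's state is not recycled until the master has consumed it), the consumer can work on the producer's memory directly.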