Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-11-12 Thread Dilip Kumar
On Sat, Nov 11, 2017 at 3:25 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Nov 9, 2017 at 3:55 AM, amul sul <sula...@gmail.com> wrote:
>> It took me a little while to understand this calculation.  You have moved 
>> this
>> code from tbm_create(), but I think you should move the following
>> comment as well:
>
> I made an adjustment that I hope will address your concern here, made
> a few other adjustments, and committed this.
>
Thanks, Robert.
-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] path toward faster partition pruning

2017-10-30 Thread Dilip Kumar
On Mon, Oct 30, 2017 at 12:20 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/10/30 14:55, Amit Langote wrote:
>> Fixed in the attached updated patch, along with a new test in 0001 to
>> cover this case.  Also, made a few tweaks to 0003 and 0005 (moved some
>> code from the former to the latter) around the handling of 
>> ScalarArrayOpExprs.
>
> Sorry, I'd forgotten to include some changes.
>
> In the previous versions, RT index of the table needed to be passed to
> partition.c, which I realized is no longer needed, so I removed that
> requirement from the interface.  As a result, patches 0002 and 0003 have
> changed in this version.

Some minor comments:

+ * get_rel_partitions
+ * Return the list of partitions of rel that pass the clauses mentioned
+ * rel->baserestrictinfo
+ *
+ * Returned list contains the AppendRelInfos of chosen partitions.
+ */
+static List *
+get_append_rel_partitions(PlannerInfo *root,

The function name in the function header is not correct.
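Presumably it is meant to read something like this (only the function
name and the missing "in" changed):

/*
 * get_append_rel_partitions
 *		Return the list of partitions of rel that pass the clauses
 *		mentioned in rel->baserestrictinfo
 *
 * Returned list contains the AppendRelInfos of chosen partitions.
 */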

+ !DatumGetBool(((Const *) clause)->constvalue))
+ {
+ *constfalse = true;
+ continue;

DatumGetBool will return true if the first byte of constvalue is
nonzero, and false otherwise.  IIUC, this is not the intention here.
Or am I missing something?

+ * clauses in this function ourselves, for example, having both a > 1 and
+ * a = 0 the list

a = 0 the list -> a = 0 in the list

+static bool
+partkey_datum_from_expr(const Expr *expr, Datum *value)
+{
+ /*
+ * Add more expression types here as needed to support higher-level
+ * code.
+ */
+ switch (nodeTag(expr))
+ {
+ case T_Const:
+ *value = ((Const *) expr)->constvalue;
+ return true;

I think that for evaluating other expression types (e.g. T_Param) we
will have to pass an ExprContext to this function.  Or we can do
something cleaner, because if we want to access the ExprContext inside
partkey_datum_from_expr then we may need to pass it to
"get_partitions_from_clauses", which is a common function for the
optimizer and the executor.
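For illustration only, handling an external Param might look roughly
like this if we did pass the ExprContext down (a sketch under that
assumption, not actual patch code):

		case T_Param:
			{
				Param	   *param = (Param *) expr;
				ParamListInfo paraminfo = econtext->ecxt_param_list_info;

				/* only external parameters have their values available here */
				if (param->paramkind == PARAM_EXTERN &&
					paraminfo != NULL &&
					param->paramid > 0 &&
					param->paramid <= paraminfo->numParams)
				{
					ParamExternData *prm = &paraminfo->params[param->paramid - 1];

					if (OidIsValid(prm->ptype) && !prm->isnull)
					{
						*value = prm->value;
						return true;
					}
				}
				return false;
			}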

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Partition-wise aggregation/grouping

2017-10-17 Thread Dilip Kumar
On Tue, Oct 17, 2017 at 10:44 PM, Jeevan Chalke
<jeevan.cha...@enterprisedb.com> wrote:
>

> I didn't get what you mean by regression here. Can you please explain?
>
> I see that PWA plan is selected over regular plan when enabled on the basis
> of costing.
> Regular planning need a Result node due to which costing increases where as
> PWA don't need that and thus wins.

Sorry for not explaining clearly.  I meant that with the normal plan
the execution time is 263.678 ms, whereas with PWA it is 339.929 ms.

I only set enable_partition_wise_agg=on, and it switched to PWA and the
execution time increased by 30%.
I understand that this is the worst case for PWA, where
FinalizeAggregate is getting all the tuples.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Partition-wise aggregation/grouping

2017-10-17 Thread Dilip Kumar
On Fri, Oct 13, 2017 at 12:06 PM, Jeevan Chalke
<jeevan.cha...@enterprisedb.com> wrote:
>
While playing around with the patch I have noticed one regression with
partial partition-wise aggregate.

I am consistently able to reproduce this on my local machine.

Scenario: GROUP BY on a non-key column, with only one tuple per group.

Complete Test:

create table t(a int,b int) partition by range(a);
create table t1 partition of t for values from (1) to (10);
create table t2 partition of t for values from (10) to (20);

insert into t values (generate_series(1,19),generate_series(1, 19));
postgres=# explain analyze select sum(a) from t group by b;
  QUERY
PLAN
--
 Finalize GroupAggregate  (cost=20379.55..28379.51 rows=19
width=12) (actual time=102.311..322.969 rows=19 loops=1)
   Group Key: t1.b
   ->  Merge Append  (cost=20379.55..25379.53 rows=19 width=12)
(actual time=102.303..232.310 rows=19 loops=1)
 Sort Key: t1.b
 ->  Partial GroupAggregate  (cost=10189.72..11939.70
rows=9 width=12) (actual time=52.164..108.967 rows=9 loops=1)
   Group Key: t1.b
   ->  Sort  (cost=10189.72..10439.72 rows=9 width=8)
(actual time=52.158..66.236 rows=9 loops=1)
 Sort Key: t1.b
 Sort Method: external merge  Disk: 1768kB
 ->  Seq Scan on t1  (cost=0.00..1884.99
rows=9 width=8) (actual time=0.860..20.388 rows=9 loops=1)
 ->  Partial GroupAggregate  (cost=10189.82..11939.82
rows=10 width=12) (actual time=50.134..102.976 rows=10
loops=1)
   Group Key: t2.b
   ->  Sort  (cost=10189.82..10439.82 rows=10 width=8)
(actual time=50.128..63.362 rows=10 loops=1)
 Sort Key: t2.b
 Sort Method: external merge  Disk: 1768kB
 ->  Seq Scan on t2  (cost=0.00..1885.00
rows=10 width=8) (actual time=0.498..20.977 rows=10 loops=1)
 Planning time: 0.190 ms
 Execution time: 339.929 ms
(18 rows)

postgres=# set enable_partition_wise_agg=off;
SET
postgres=# explain analyze select sum(a) from t group by b;
QUERY PLAN
--
 GroupAggregate  (cost=26116.53..29616.51 rows=19 width=12)
(actual time=139.413..250.751 rows=19 loops=1)
   Group Key: t1.b
   ->  Sort  (cost=26116.53..26616.52 rows=19 width=8) (actual
time=139.406..168.775 rows=19 loops=1)
 Sort Key: t1.b
 Sort Method: external merge  Disk: 3544kB
 ->  Result  (cost=0.00..5769.98 rows=19 width=8) (actual
time=0.674..76.392 rows=19 loops=1)
   ->  Append  (cost=0.00..3769.99 rows=19 width=8)
(actual time=0.672..40.291 rows=19 loops=1)
 ->  Seq Scan on t1  (cost=0.00..1884.99
rows=9 width=8) (actual time=0.672..12.408 rows=9 loops=1)
 ->  Seq Scan on t2  (cost=0.00..1885.00
rows=10 width=8) (actual time=1.407..11.689 rows=10 loops=1)
 Planning time: 0.146 ms
 Execution time: 263.678 ms
(11 rows)


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Bitmap Heap Scans segfaults due to (tbm->dsa==NULL) on PostgreSQL 10

2017-10-12 Thread Dilip Kumar
On Thu, Oct 12, 2017 at 6:37 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
>
>
> On 10/12/2017 02:40 PM, Dilip Kumar wrote:
>> On Thu, Oct 12, 2017 at 4:31 PM, Tomas Vondra
>> <tomas.von...@2ndquadrant.com> wrote:
>>> Hi,
>>>
>>> It seems that Q19 from TPC-H is consistently failing with segfaults due
>>> to calling tbm_prepare_shared_iterate() with (tbm->dsa==NULL).
>>>
>>> I'm not very familiar with how the dsa is initialized and passed around,
>>> but I only see the failures when the bitmap is constructed by a mix of
>>> BitmapAnd and BitmapOr operations.
>>>
>> I think I have got the issue, bitmap_subplan_mark_shared is not
>> properly pushing the isshared flag to lower level bitmap index node,
>> and because of that tbm_create is passing NULL dsa while creating the
>> tidbitmap.  So this problem will come in very specific combination of
>> BitmapOr and BitmapAnd when BitmapAnd is the first subplan for the
>> BitmapOr.  If BitmapIndex is the first subplan under BitmapOr then
>> there is no problem because BitmapOr node will create the tbm by
>> itself and isshared is set for BitmapOr.
>>
>> Attached patch fixing the issue for me.  I will thoroughly test this
>> patch with other scenario as well.  Thanks for reporting.
>>
>
> Yep, this fixes the failures for me.
>
Thanks for confirming.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Bitmap Heap Scans segfaults due to (tbm->dsa==NULL) on PostgreSQL 10

2017-10-12 Thread Dilip Kumar
On Thu, Oct 12, 2017 at 4:31 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> Hi,
>
> It seems that Q19 from TPC-H is consistently failing with segfaults due
> to calling tbm_prepare_shared_iterate() with (tbm->dsa==NULL).
>
> I'm not very familiar with how the dsa is initialized and passed around,
> but I only see the failures when the bitmap is constructed by a mix of
> BitmapAnd and BitmapOr operations.
>
I think I have found the issue: bitmap_subplan_mark_shared is not
properly pushing the isshared flag down to the lower-level bitmap
index nodes, and because of that tbm_create is passed a NULL dsa while
creating the tidbitmap.  So this problem occurs only in a very specific
combination of BitmapOr and BitmapAnd, when a BitmapAnd is the first
subplan of the BitmapOr.  If a BitmapIndexScan is the first subplan
under the BitmapOr then there is no problem, because the BitmapOr node
will create the tbm by itself and isshared is set for the BitmapOr.

The attached patch fixes the issue for me.  I will thoroughly test this
patch with other scenarios as well.  Thanks for reporting.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 5c934f2..cc7590b 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4922,7 +4922,11 @@ bitmap_subplan_mark_shared(Plan *plan)
 		bitmap_subplan_mark_shared(
    linitial(((BitmapAnd *) plan)->bitmapplans));
 	else if (IsA(plan, BitmapOr))
+	{
 		((BitmapOr *) plan)->isshared = true;
+		bitmap_subplan_mark_shared(
+  linitial(((BitmapOr *) plan)->bitmapplans));
+	}
 	else if (IsA(plan, BitmapIndexScan))
 		((BitmapIndexScan *) plan)->isshared = true;
 	else



Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-10-11 Thread Dilip Kumar
On Fri, Oct 6, 2017 at 9:21 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Fri, Oct 6, 2017 at 7:24 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>> On Fri, Oct 6, 2017 at 6:08 PM, Alexander Kuzmenkov
>> <a.kuzmen...@postgrespro.ru> wrote:
>>>
>>>> Analysis: The estimated value of the lossy_pages is way higher than
>>>> its actual value and reason is that the total_pages calculated by the
>>>> "Mackert and Lohman formula" is not correct.
>>>
>>>
>>> I think the problem might be that the total_pages includes cache effects and
>>> rescans. For bitmap entries we should use something like relation pages *
>>> selectivity.
>>
>> I have noticed that for the TPCH case if I use "pages * selectivity"
>> it give me better results, but IMHO directly multiplying the pages
>> with selectivity may not be the correct way to calculate the number of
>> heap pages it can only give the correct result when all the TID being
>> fetched are clustered.  But on the other hand "Mackert and Lohman
>> formula" formulae consider that all the TID's are evenly distributed
>> across the heap pages which can also give the wrong estimation like we
>> are seeing in our TPCH case.
>
> I agree with the point that the total_pages included the cache effects
> and rescan when loop_count > 1, that can be avoided if we always
> calculate heap_pages as it is calculated in the else part
> (loop_count=0).  Fortunately, in all the TPCH query plan what I posted
> up thread bitmap scan was never at the inner side of the NLJ so
> loop_count was always 0.  I will fix this.

I have fixed the issue.  Now, when calculating the lossy pages it does
not consider rescans.  As mentioned above, this does not affect the
TPCH plans, so I haven't measured the performance again.
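For reference, the calculation now has roughly the following shape (an
illustrative sketch, not the literal patch code):

	/*
	 * Estimate heap pages fetched by the bitmap using only the
	 * single-scan form of the approximation, i.e. ignore rescan and
	 * cache effects even when the scan is on the inner side of a
	 * nested loop.
	 */
	double		T = (baserel->pages > 1) ? (double) baserel->pages : 1.0;
	double		pages_fetched;

	pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
	if (pages_fetched > T)
		pages_fetched = T;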

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


improve_bitmap_cost_v6.patch
Description: Binary data



Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-10-06 Thread Dilip Kumar
On Fri, Oct 6, 2017 at 7:24 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Fri, Oct 6, 2017 at 6:08 PM, Alexander Kuzmenkov
> <a.kuzmen...@postgrespro.ru> wrote:
>>
>>> Analysis: The estimated value of the lossy_pages is way higher than
>>> its actual value and reason is that the total_pages calculated by the
>>> "Mackert and Lohman formula" is not correct.
>>
>>
>> I think the problem might be that the total_pages includes cache effects and
>> rescans. For bitmap entries we should use something like relation pages *
>> selectivity.
>
> I have noticed that for the TPCH case if I use "pages * selectivity"
> it give me better results, but IMHO directly multiplying the pages
> with selectivity may not be the correct way to calculate the number of
> heap pages it can only give the correct result when all the TID being
> fetched are clustered.  But on the other hand "Mackert and Lohman
> formula" formulae consider that all the TID's are evenly distributed
> across the heap pages which can also give the wrong estimation like we
> are seeing in our TPCH case.

I agree with the point that total_pages includes the cache effects and
rescans when loop_count > 1; that can be avoided if we always calculate
heap_pages as it is calculated in the else part (loop_count = 0).
Fortunately, in all the TPCH query plans that I posted upthread, the
bitmap scan was never on the inner side of an NLJ, so loop_count was
always 0.  I will fix this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-10-06 Thread Dilip Kumar
On Fri, Oct 6, 2017 at 7:04 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Fri, Oct 6, 2017 at 6:36 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Fri, Oct 6, 2017 at 2:12 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>>>> The performance results look good, but that's a slightly different
>>>> thing from whether the estimate is accurate.
>>>>
>>>> +nbuckets = tbm_calculate_entires(maxbytes);
>>>>
>>>> entires?
>>>
>>> changed to
>>> + tbm->maxentries = (int) tbm_calculate_entires(maxbytes);
>>
>> My point was not that you should add a cast, but that you wrote
>> "entires" rather than "entries".
>
> My bad, I thought you were suggesting the variable name as "entries"
> instead of "nbuckets" so I removed the variable "nbuckets".  I will
> fix the typo in the function name and post the patch.

Fixed in the attached version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


improve_bitmap_cost_v5.patch
Description: Binary data



Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-10-06 Thread Dilip Kumar
On Fri, Oct 6, 2017 at 6:08 PM, Alexander Kuzmenkov
<a.kuzmen...@postgrespro.ru> wrote:
>
>> Analysis: The estimated value of the lossy_pages is way higher than
>> its actual value and reason is that the total_pages calculated by the
>> "Mackert and Lohman formula" is not correct.
>
>
> I think the problem might be that the total_pages includes cache effects and
> rescans. For bitmap entries we should use something like relation pages *
> selectivity.

I have noticed that for the TPCH case, if I use "pages * selectivity"
it gives me better results, but IMHO directly multiplying the pages by
the selectivity may not be the correct way to calculate the number of
heap pages: it only gives the correct result when all the TIDs being
fetched are clustered.  On the other hand, the "Mackert and Lohman
formula" assumes that all the TIDs are evenly distributed across the
heap pages, which can also give a wrong estimate, as we are seeing in
our TPCH case.

>
> Meanwhile, I ran TPC-H briefly with the v3 patch: scale factor 25, 2
> workers, SSD storage.
> It shows significant improvement on 4MB work_mem and no change on 2GB.
>
> Here are the results (execution time in seconds, average of 5 runs):
> work_mem     4MB                 2GB
> Query        master    patch     master    patch
> 4            179       174       168       167
> 5            190       168       155       156
> 6            280       87        227       229
> 8            197       114       179       172
> 10           269       227       190       192
> 14           110       108       106       105
>

Thanks for testing; the numbers look good to me.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-10-06 Thread Dilip Kumar
On Fri, Oct 6, 2017 at 6:36 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Oct 6, 2017 at 2:12 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>>> The performance results look good, but that's a slightly different
>>> thing from whether the estimate is accurate.
>>>
>>> +nbuckets = tbm_calculate_entires(maxbytes);
>>>
>>> entires?
>>
>> changed to
>> + tbm->maxentries = (int) tbm_calculate_entires(maxbytes);
>
> My point was not that you should add a cast, but that you wrote
> "entires" rather than "entries".

My bad, I thought you were suggesting the variable name "entries"
instead of "nbuckets", so I removed the variable "nbuckets".  I will
fix the typo in the function name and post the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-10-06 Thread Dilip Kumar
On Thu, Oct 5, 2017 at 8:15 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sun, Sep 17, 2017 at 7:04 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>> I used  lossy_pages = max(0, total_pages - maxentries / 2). as
>> suggesed by Alexander.
>
> Does that formula accurately estimate the number of lossy pages?

I have printed the total_pages, exact_pages and lossy_pages during
planning time, and for testing purposes I tweaked the code a bit so
that it doesn't consider lossy_pages in the cost calculation (same as
the base code).

I have tested TPCH at scale factor 20 at different work_mem settings
(4MB, 20MB, 64MB) and noted down the estimated pages vs. the actual
pages.

Analysis: The estimated value of lossy_pages is way higher than its
actual value, and the reason is that the total_pages calculated by the
"Mackert and Lohman formula" is not correct.

work_mem=4 MB

query:4
estimated: total_pages=552472.00 exact_pages=32768.00
lossy_pages=519704.00
actual:exact=18548 lossy=146141

query:6
estimated: total_pages=1541449.00 exact_pages=32768.00
lossy_pages=1508681.00
actual:exact=13417 lossy=430385

query:8
estimated:  total_pages=552472.00 exact_pages=32768.00
lossy_pages=519704.00
actual: exact=56869 lossy=495603

query:14
estimated:  total_pages=1149603.00 exact_pages=32768.00
lossy_pages=1116835.00
actual: exact=17115 lossy=280949

work_mem: 20 MB
query:4
estimated:  total_pages=552472.00 exact_pages=163840.00
lossy_pages=388632.00
actual: exact=109856 lossy=57761

query:6
estimated:   total_pages=1541449.00 exact_pages=163840.00
lossy_pages=1377609.00
actual:  exact=59771 lossy=397956

query:8
estimated:  total_pages=552472.00 exact_pages=163840.00
lossy_pages=388632.00
actual: Heap Blocks: exact=221777 lossy=330695

query:14
estimated:  total_pages=1149603.00 exact_pages=163840.00
lossy_pages=985763.00
actual: exact=63381 lossy=235513

work_mem:64 MB
query:4
estimated:  total_pages=552472.00 exact_pages=552472.00
lossy_pages=0.00
actual: exact=166005 lossy=0

query:6
estimated:  total_pages=1541449.00 exact_pages=524288.00
lossy_pages=1017161.00
actual: exact=277717 lossy=185919

query:8
estimated: total_pages=552472.00 exact_pages=552472.00
lossy_pages=0.00
actual:exact=552472 lossy=0

query:14
estimated:  total_pages=1149603.00 exact_pages=524288.00
lossy_pages=625315.00
actual: exact=309091 lossy=0


>
> The performance results look good, but that's a slightly different
> thing from whether the estimate is accurate.
>
> +nbuckets = tbm_calculate_entires(maxbytes);
>
> entires?

changed to
+ tbm->maxentries = (int) tbm_calculate_entires(maxbytes);


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


improve_bitmap_cost_v4.patch
Description: Binary data



Re: [HACKERS] path toward faster partition pruning

2017-09-28 Thread Dilip Kumar
On Thu, Sep 28, 2017 at 1:44 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/28 13:58, Dilip Kumar wrote:

>> I think the above logic is common between this patch and the runtime
>> pruning.  I think we can make
>> a reusable function.  Here we are preparing minkey and maxkey of Datum
>> because we can directly fetch rightop->constvalue whereas for runtime
>> pruning we are making minkeys and maxkeys of Expr because during
>> planning time we don't have the values for the Param.  I think we can
>> always make these minkey, maxkey array of Expr and later those can be
>> processed in whatever way we want it.  So this path will fetch the
>> constval out of Expr and runtime pruning will Eval that expression at
>> runtime.
>
> I think that makes sense.  In fact we could even move the minkey/maxkey
> collection code to match_clauses_to_partkey() itself.  No need for a
> different function and worrying about defining a separate interface for
> the same.  We match clauses exactly because we want to extract the
> constant or param values out of them.  No need to do the two activities
> independently and in different places.
>

+1


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] path toward faster partition pruning

2017-09-27 Thread Dilip Kumar
On Wed, Sep 27, 2017 at 6:52 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:

I was looking into the latest patch set; it seems we can reuse some
more code between this patch and runtime pruning [1].

+ foreach(lc1, matchedclauses[i])
+ {
+ Expr   *clause = lfirst(lc1);
+ Const  *rightop = (Const *) get_rightop(clause);
+ Oid opno = ((OpExpr *) clause)->opno,
+ opfamily = rel->part_scheme->partopfamily[i];
+ StrategyNumber strategy;
+
+ strategy = get_op_opfamily_strategy(opno, opfamily);
+ switch (strategy)
+ {
+ case BTLessStrategyNumber:
+ case BTLessEqualStrategyNumber:
+ if (need_next_max)
+ {
+ maxkeys[i] = rightop->constvalue;
+ if (!maxkey_set[i])
+ n_maxkeys++;
+ maxkey_set[i] = true;
+ max_incl = (strategy == BTLessEqualStrategyNumber);
+ }
+ if (strategy == BTLessStrategyNumber)
+ need_next_max = false;

I think the above logic is common between this patch and the runtime
pruning, so I think we can make it a reusable function.  Here we are
preparing minkey and maxkey as Datums because we can directly fetch
rightop->constvalue, whereas for runtime pruning we build minkeys and
maxkeys as Exprs, because during planning time we don't have the values
for the Params.  I think we can always make these minkey/maxkey arrays
of Expr, and later they can be processed in whatever way we want.  So
this patch would fetch the constval out of the Expr, and runtime
pruning would evaluate that expression at runtime.

Does this make sense, or will it cause one level of extra processing
for this patch, i.e. converting the Expr array to a Const array?

[1] 
https://www.postgresql.org/message-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] path toward faster partition pruning

2017-09-26 Thread Dilip Kumar
On Tue, Sep 26, 2017 at 2:45 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/25 20:21, Dilip Kumar wrote:

> I see.  So, in the run-time pruning case, only the work of extracting
> bounding values is deferred to execution time.  Matching clauses with the
> partition key still occurs during planning time.  Only that the clauses
> that require run-time pruning are not those in rel->baserestrictinfo.

Right.
>
>> After separating out the matching clause we will do somewhat similar
>> processing what "get_rel_partitions" is doing. But, at optimizer time
>> for PARAM we will not have Datum values for rightop, so we will keep
>> track of the PARAM itself.
>
> I guess information about which PARAMs map to which partition keys will be
> kept in the plan somehow.

Yes.
>
>> And, finally at runtime when we get the PARAM value we can prepare
>> minkey and maxkey and call get_partitions_for_keys function.
>
> Note that get_partitions_for_keys() is not planner code, nor is it bound
> with any other planning code.  It's callable from executor without much
> change.  Maybe you already know that though.

Yes, right.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] path toward faster partition pruning

2017-09-25 Thread Dilip Kumar
On Mon, Sep 25, 2017 at 3:34 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:

> Thanks for looking at the patches and the comments.

> It's not clear to me whether get_rel_partitions() itself, as it is, is
> callable from outside the planner, because its signature contains
> RelOptInfo.  We have the RelOptInfo in the signature, because we want to
> mark certain fields in it so that latter planning steps can use them.  So,
> get_rel_partitions()'s job is not just to match clauses and find
> partitions, but also to perform certain planner-specific tasks of
> generating information that the later planning steps will want to use.
> That may turn out to be unnecessary, but until we know that, let's not try
> to export get_rel_partitions() itself out of the planner.
>
> OTOH, the function that your refactoring patch separates out to match
> clauses to partition keys and extract bounding values seems reusable
> outside the planner and we should export it in such a way that it can be
> used in the executor.  Then, the hypothetical executor function that does
> the pruning will first call the planner's clause-matching function,
> followed by calling get_partitions_for_keys() in partition.c to get the
> selected partitions.

Thanks for your reply.

Actually, we are still planning to call get_matching_clause only at
optimizer time.  We cannot use the get_rel_partitions function directly
for runtime pruning because it does all the work (finding the matching
clauses, creating minkey and maxkey, and calling
get_partitions_for_keys) during planning time itself.

For runtime pruning, we are planning to first call the
get_matching_clause function at optimizer time to identify the clauses
that match the partition keys; but for the PARAM_EXEC case we cannot
depend on baserestrictinfo and instead will get them from the join
clauses.  That's the reason I have separated out get_matching_clause,
but it will still be used during planning time.

After separating out the matching clauses, we will do processing
somewhat similar to what "get_rel_partitions" is doing.  But at
optimizer time, for a PARAM we will not have the Datum value for the
rightop, so we will keep track of the PARAM itself.

And finally, at runtime, when we get the PARAM value we can prepare
minkey and maxkey and call the get_partitions_for_keys function.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Effect of changing the value for PARALLEL_TUPLE_QUEUE_SIZE

2017-09-21 Thread Dilip Kumar
On Thu, Sep 21, 2017 at 4:50 PM, Rafia Sabih
<rafia.sa...@enterprisedb.com> wrote:
> On Sun, Sep 17, 2017 at 9:10 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>> On Wed, Sep 6, 2017 at 4:14 PM, Rafia Sabih
>> <rafia.sa...@enterprisedb.com> wrote:
>>
>
> Please find the attached file for the revised version.

Thanks for the updated patch, I have some more comments.

+static shm_mq_handle *local_mq_attach(shm_mq_handle *mqh);
+static uint64 space_in_local(local_mq * lq, Size tuple_size);
+static bool read_local_queue(local_mq * lq, bool shm_mq_full);

local_mq * lq  -> local_mq *lq
The same applies in other places as well.

---

+static uint64 space_in_shm(shm_mq *mq);
+
+static uint64 space_in_local(local_mq * lq, Size tuple_size);

We had better use Size here instead of uint64.

---

+ available = ringsize - used;
+
+ ringsize = lq->mq_ring_size;
+ writer_offset = lq->mq_bytes_written % ringsize;
+ reader_offset = lq->mq_bytes_read % ringsize;
+
+ if (writer_offset + tuple_size < ringsize && reader_offset < writer_offset)
+ available = (ringsize - writer_offset);

Even though there is space in the queue, if the tuple needs rotation
(wrapping around the end of the ring) then we are not allowing it to be
written into the local queue.  If there is some strong reason behind
that, please add comments to explain it.
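Otherwise the write could simply be split at the wrap point, along
these lines (a hypothetical sketch, assuming the caller has already
checked that nbytes does not exceed the total free space):

static void
local_ring_write(char *ring, Size ringsize, uint64 *bytes_written,
				 const char *data, Size nbytes)
{
	Size		offset = (Size) (*bytes_written % ringsize);
	Size		chunk = Min(nbytes, ringsize - offset);

	memcpy(ring + offset, data, chunk);
	if (chunk < nbytes)			/* wrap around to the start of the ring */
		memcpy(ring, data + chunk, nbytes - chunk);

	*bytes_written += nbytes;
}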
---

+ if (shm_mq_full || (written - read) >= .05 * lq->mq_ring_size)
+ return true;
+
+ else
+ return true;
+}

Seems like you want to return 'false' in the else case.



+ read_offset = lq->mq_bytes_read % lq->mq_ring_size;
+ available = space_in_shm(mqh->mqh_queue);
+
+ /* always read data in the aligned form */
+ to_read = MAXALIGN_DOWN(Min(used, available));
+
+ /*
+ * if the amount of data to be send from local queue involves wrapping of
+ * local queue, then send only the data till the end of queue right now
+ * and rest later.
+ */
+ if (lq->mq_bytes_read % lq->mq_ring_size + to_read > lq->mq_ring_size)

You can directly use "read_offset" instead of recalculating
lq->mq_bytes_read % lq->mq_ring_size.


+ do
+ {
+ if (shm_mq_is_detached(mqh->mqh_queue))
+ return SHM_MQ_DETACHED;
+
+ shm_space = space_in_shm(mqh->mqh_queue);
+
+ /*
+ * cannot send data to shared queue, unless there is required
+ * space, so wait till we get some space, since we cannot
+ * write anymore in local queue as of now
+ */
+ WaitLatch(MyLatch, WL_LATCH_SET, 0, WAIT_EVENT_MQ_SEND);
+
+ /* Reset the latch so we don't spin. */
+ ResetLatch(MyLatch);
+
+ /* An interrupt may have occurred while we were waiting. */
+ CHECK_FOR_INTERRUPTS();
+
+ shm_space = space_in_shm(mqh->mqh_queue);
+
+ if (read_local_queue(lq, true) && shm_space > 0)
+ copy_local_to_shared(lq, mqh, false);
+
+ local_space = space_in_local(lq, tuple_size);
+
+ } while (local_space <= tuple_size);
+

1. Just after getting the shm_space, you are calling WaitLatch without
even checking whether that space is sufficient to send the tuple.
2. Every time the latch is set (maybe some space was freed in the
shared queue), you call copy_local_to_shared to send as much data as
possible from the local queue to the shared queue, but that doesn't
even guarantee that we will have sufficient space in the local queue to
accommodate the current tuple.

Instead of calling copy_local_to_shared multiple times (which will
internally acquire the mutex), after the latch is set you can check the
shared queue space and not attempt copy_local_to_shared unless
shm_space >= tuple_size.
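In other words, something along these lines (only a sketch of the
control flow, reusing the names from the patch):

	do
	{
		if (shm_mq_is_detached(mqh->mqh_queue))
			return SHM_MQ_DETACHED;

		shm_space = space_in_shm(mqh->mqh_queue);

		/* only try to drain the local queue once the tuple would fit */
		if (shm_space >= tuple_size && read_local_queue(lq, true))
			copy_local_to_shared(lq, mqh, false);

		local_space = space_in_local(lq, tuple_size);
		if (local_space > tuple_size)
			break;

		/* wait until the reader frees some space, then re-check */
		WaitLatch(MyLatch, WL_LATCH_SET, 0, WAIT_EVENT_MQ_SEND);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	} while (true);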

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] path toward faster partition pruning

2017-09-19 Thread Dilip Kumar
I have done some refactoring of the code: I have moved the code for
getting the matching clauses into a separate function, so that it can
fetch the matching clauses from any given restriction list.

It can be applied on top of 0002-WIP:
planner-side-changes-for-partition-pruning.patch

On Sat, Sep 16, 2017 at 3:13 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Fri, Sep 15, 2017 at 2:20 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> On 2017/09/15 11:16, Amit Langote wrote:
>
> Thanks for the updated patch.  I was going through the logic of
> get_rel_partitions in 0002 as almost similar functionality will be
> required by runtime partition pruning on which Beena is working.  The
> only difference is that here we are processing the
> "rel->baserestrictinfo" and in case of runtime pruning, we also need
> to process join clauses which are pushed down to appendrel.
>
> So can we make some generic logic which can be used for both the patches.
>
> So basically, we need to do two changes
>
> 1. In get_rel_partitions instead of processing the
> "rel->baserestrictinfo" we can take clause list as input that way we
> can pass any clause list to this function.
>
> 2. Don't call "get_partitions_for_keys" routine from the
> "get_rel_partitions", instead, get_rel_partitions can just prepare
> minkey, maxkey and the caller of the get_rel_partitions can call
> get_partitions_for_keys, because for runtime pruning we need to call
> get_partitions_for_keys at runtime.
>
> After these changes also there will be one problem that the
> get_partitions_for_keys is directly fetching the "rightop->constvalue"
> whereas, for runtime pruning, we need to store rightop itself and
> calculate the value at runtime by param evaluation,  I haven't yet
> thought how can we make this last part generic.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0002-refactor_get_rel_partition.patch
Description: Binary data



Re: [HACKERS] UPDATE of partition key

2017-09-19 Thread Dilip Kumar
On Tue, Sep 19, 2017 at 1:15 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 18 September 2017 at 20:45, Dilip Kumar <dilipbal...@gmail.com> wrote:
>> Please find few more comments.
>>
>> + * in which they appear in the PartitionDesc. Also, extract the
>> + * partition key columns of the root partitioned table. Those of the
>> + * child partitions would be collected during recursive expansion.
>> */
>> + pull_child_partition_columns(_part_cols, oldrelation, oldrelation);
>> expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
>>   lockmode, >append_rel_list,
>> +   _part_cols,
>>
>> pcinfo->all_part_cols is only used in case of update, I think we can
>> call pull_child_partition_columns
>> only if rte has updateCols?
>>
>> @@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo
>>
>> Index parent_relid;
>> List   *child_rels;
>> + Bitmapset  *all_part_cols;
>> } PartitionedChildRelInfo;
>>
>> I might be missing something, but do we really need to store
>> all_part_cols inside the
>> PartitionedChildRelInfo,  can't we call pull_child_partition_columns
>> directly inside
>> inheritance_planner whenever we realize that RTE has some updateCols
>> and we want to
>> check the overlap?
>
> One thing  we will have to do extra is : Open and close the
> partitioned rels again. The idea was that we collect the bitmap
> *while* we are already expanding through the tree and the rel is open.
> Will check if this is feasible.

Oh, I see.
>
>>
>> +extern void partition_walker_init(PartitionWalker *walker, Relation rel);
>> +extern Relation partition_walker_next(PartitionWalker *walker,
>> +  Relation *parent);
>> +
>>
>> I don't see these functions are used anywhere?
>>
>> +typedef struct PartitionWalker
>> +{
>> + List   *rels_list;
>> + ListCell   *cur_cell;
>> +} PartitionWalker;
>> +
>>
>> Same as above
>
> Yes, this was left out from the earlier implementation. Will have this
> removed in the next updated patch.
OK. I will continue my review.  Thanks.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-09-18 Thread Dilip Kumar
On Mon, Sep 18, 2017 at 11:29 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Fri, Sep 15, 2017 at 4:55 PM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> On 12 September 2017 at 12:39, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>>> On 12 September 2017 at 11:57, Dilip Kumar <dilipbal...@gmail.com> wrote:
>>>> On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan...@gmail.com> 
>>>> wrote:
>>>>

>> In the attached patch, we first call ExecARUpdateTriggers(), and while
>> doing that, we first save the info that a NEW row is already captured
>> (mtstate->mt_transition_capture->tcs_update_old_table == true). If it
>> captured, we pass NULL transition_capture pointer to
>> ExecARDeleteTriggers() (and ExecARInsertTriggers) so that it does not
>> again capture an extra row.
>>
>> Modified a testcase in update.sql by including DELETE statement
>> trigger that uses transition tables.
>
> Ok, this fix looks correct to me, I will review the latest patch.

Please find a few more comments.

+ * in which they appear in the PartitionDesc. Also, extract the
+ * partition key columns of the root partitioned table. Those of the
+ * child partitions would be collected during recursive expansion.
*/
+ pull_child_partition_columns(_part_cols, oldrelation, oldrelation);
expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
  lockmode, >append_rel_list,
+   _part_cols,

pcinfo->all_part_cols is only used in the case of UPDATE; I think we
can call pull_child_partition_columns only if the RTE has updateCols?

@@ -2029,6 +2034,7 @@ typedef struct PartitionedChildRelInfo

Index parent_relid;
List   *child_rels;
+ Bitmapset  *all_part_cols;
} PartitionedChildRelInfo;

I might be missing something, but do we really need to store
all_part_cols inside the PartitionedChildRelInfo?  Can't we call
pull_child_partition_columns directly inside inheritance_planner
whenever we realize that the RTE has some updateCols and we want to
check the overlap?

+extern void partition_walker_init(PartitionWalker *walker, Relation rel);
+extern Relation partition_walker_next(PartitionWalker *walker,
+  Relation *parent);
+

I don't see these functions used anywhere?

+typedef struct PartitionWalker
+{
+ List   *rels_list;
+ ListCell   *cur_cell;
+} PartitionWalker;
+

Same as above



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-09-17 Thread Dilip Kumar
On Fri, Sep 15, 2017 at 4:55 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 12 September 2017 at 12:39, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>> On 12 September 2017 at 11:57, Dilip Kumar <dilipbal...@gmail.com> wrote:
>>> On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan...@gmail.com> 
>>> wrote:
>>>
> I found out that, in case when there is a DELETE statement trigger
> using transition tables, it's not only an issue of redundancy; it's a
> correctness issue. Since for transition tables both DELETE and UPDATE
> use the same old row tuplestore for capturing OLD table, that table
> gets duplicate rows: one from ExecARDeleteTriggers() and another from
> ExecARUpdateTriggers(). In presence of INSERT statement trigger using
> transition tables, both INSERT and UPDATE events have separate
> tuplestore, so duplicate rows don't show up in the UPDATE NEW table.
> But, nevertheless, we need to prevent NEW rows to be collected in the
> INSERT event tuplestore, and capture the NEW rows only in the UPDATE
> event tuplestore.
>
> In the attached patch, we first call ExecARUpdateTriggers(), and while
> doing that, we first save the info that a NEW row is already captured
> (mtstate->mt_transition_capture->tcs_update_old_table == true). If it
> captured, we pass NULL transition_capture pointer to
> ExecARDeleteTriggers() (and ExecARInsertTriggers) so that it does not
> again capture an extra row.
>
> Modified a testcase in update.sql by including DELETE statement
> trigger that uses transition tables.

Ok, this fix looks correct to me, I will review the latest patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-09-17 Thread Dilip Kumar
On Sun, Sep 17, 2017 at 4:34 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>>
>> I have repeated one of the tests after fixing the problems pointed by
>> you but this time results are not that impressive.  Seems like below
>> check was the problem in the previous patch
>>
>>if (tbm->nentries > tbm->maxentries / 2)
>> tbm->maxentries = Min(tbm->nentries, (INT_MAX - 1) / 2) * 2;
>>
>> Because we were lossifying only till tbm->nentries becomes 90% of
>> tbm->maxentries but later we had this check which will always be true
>> and tbm->maxentries will be doubled and that was the main reason of
>> huge reduction of lossy pages, basically, we started using more
>> work_mem in all the cases.
>>
>> I have taken one reading just to see the impact after fixing the
>> problem with the patch.
>>
>>  Work_mem: 40 MB
>> (Lossy Pages count)
>>
> Query    head       patch
>> 6   995223   733087
>> 14 337894   206824
>> 15 995417   798817
>> 20   1654016 1588498
>>
>> Still, we see a good reduction in lossy pages count.  I will perform
>> the test at different work_mem and for different values of
>> TBM_FILFACTOR and share the number soon.
>
> I haven't yet completely measured the performance with executor
> lossification change, meanwhile, I have worked on some of the comments
> on optimiser change and taken the performance again, I still see good
> improvement in the performance (almost 2x for some of the queries) and
> with new method of lossy pages calculation I don't see regression in
> Q14 (now Q14 is not changing its plan).
>
> I used  lossy_pages = max(0, total_pages - maxentries / 2). as
> suggesed by Alexander.
>
>
> Performance Results:
>
> Machine: Intel 56-core machine (2 NUMA nodes)
> work_mem: varies.
> TPCH S.F: 20
> Median of 3 runs.
>
> work_mem = 4MB
>
> Query   Patch(ms)    Head(ms)     Change in plan
> 4       4686.186     5039.295     PBHS -> PSS
> 5       26772.192    27500.800    BHS -> SS
> 6       6615.916     7760.005     PBHS -> PSS
> 8       6370.611     12407.731    PBHS -> PSS
> 15      17493.564    24242.256    BHS -> SS
>
> work_mem = 20MB
>
> Query   Patch(ms)    Head(ms)     Change in plan
> 6       6656.467     7469.961     PBHS -> PSS
> 8       6116.526     12300.784    PBHS -> PSS
> 15      17873.726    22913.421    BHS -> PSS
>
> work_mem = 64MB
>
> Query   Patch(ms)    Head(ms)     Change in plan
> 15      14900.881    27460.093    BHS -> PBHS
>


There was a problem with the previous patch: even if the bitmap was big
enough to hold all the heap pages, I was still calculating lossy pages.
I have fixed that in the attached patch.  I have also verified the
performance; it's the same as reported in the previous email.
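The guard is essentially this (an illustrative sketch of the idea, not
the literal patch):

	/*
	 * Predict lossy pages only if the bitmap cannot hold an entry for
	 * every fetched heap page.
	 */
	if (maxentries < heap_pages)
		lossy_pages = Max(0.0, heap_pages - maxentries / 2.0);
	else
		lossy_pages = 0.0;
	exact_pages = heap_pages - lossy_pages;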



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


improve_bitmap_cost_v3.patch
Description: Binary data



Re: [HACKERS] Effect of changing the value for PARALLEL_TUPLE_QUEUE_SIZE

2017-09-17 Thread Dilip Kumar
On Wed, Sep 6, 2017 at 4:14 PM, Rafia Sabih
<rafia.sa...@enterprisedb.com> wrote:

> I worked on this idea of using local queue as a temporary buffer to
> write the tuples when master is busy and shared queue is full, and it
> gives quite some improvement in the query performance.
>

I have done some initial review of this patch and I have some comments.

/* this is actual size for this tuple which will be written in queue */
+ tuple_size = MAXALIGN(sizeof(Size)) + MAXALIGN(iov.len);
+
+ /* create and attach a local queue, if it is not yet created */
+ if (mqh->mqh_local_queue == NULL)
+ mqh = local_mq_attach(mqh);

I think we can create the local queue the first time we need it.  So
basically, you can move this code inside the else part, where we first
identify that there is no space in the shared queue.

--
+ /* write in local queue if there is enough space*/
+ if (local_space > tuple_size)

I think the condition should be if (local_space >= tuple_size)

--
+ while(shm_space <= 0)
+ {
+ if (shm_mq_is_detached(mqh->mqh_queue))
+ return SHM_MQ_DETACHED;
+
+ shm_space = space_in_shm(mqh->mqh_queue);
+ }

Instead of waiting in a CPU-intensive while loop, we should wait on
some latch.  Why can't we use the wait latch of the shared queue?
Whenever some space is freed, the latch will be set; then we can
recheck the space and, if it's enough, we can write to the shared
queue, otherwise wait on the latch again.

Check other similar occurrences.
-

+ if (read_local_queue(lq, true) && shm_space > 0)
+ copy_local_to_shared(lq, mqh, false);
+

Instead of adding shm_space > 0 to the check, it can be an Assert or
nothing, because the above loop will only break when shm_space > 0.


+ /*
+ * create a local queue, the size of this queue should be way higher
+ * than PARALLEL_TUPLE_QUEUE_SIZE
+ */
+ char *mq;
+ Size len;
+
+ len = 6553600;

Create some macro which is a multiple of PARALLEL_TUPLE_QUEUE_SIZE.
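For example (the macro name here is just a placeholder; 6553600 is
100 * PARALLEL_TUPLE_QUEUE_SIZE):

#define LOCAL_TUPLE_QUEUE_SIZE	(100 * PARALLEL_TUPLE_QUEUE_SIZE)

	len = LOCAL_TUPLE_QUEUE_SIZE;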

---

+ /* this local queue is not required anymore, hence free the space. */
+ pfree(mqh->mqh_local_queue);
+ return;
+}

I think you can remove the return at the end of the void function.
-

In the empty_queue(shm_mq_handle *mqh) function, I see that only the
last exit path frees the local queue, but not all of them do, even
though the local queue was created.


Other cosmetic issues.
-
+/* check the space availability in local queue */
+static uint64
+space_in_local(local_mq *lq, Size tuple_size)
+{
+ uint64 read, written, used, available, ringsize, writer_offset, reader_offset;
+
+ ringsize = lq->mq_ring_size;
+ read = lq->mq_bytes_read;
+ written = lq->mq_bytes_written;
+ used = written - read;
+ available = ringsize - used;
+
Alignment is not correct, I think you can run pgindent once.


+ /* check is there is required space in shared queue */
This comment needs rewording: "check if there is required space in shared queue"?

-

+ /* write in local queue if there is enough space*/
A space is missing before the end of the comment.

-

+
+/* Routines required for local queue */
+
+/*
+ * Initialize a new local message queue, this is kept quite similar
to shm_mq_create.
+ */

The header comment formatting is not in sync with the other functions.

-

+}
+/* routine to create and attach local_mq to the shm_mq_handle */

There should be one blank line between the two functions.

-

+ bool detached;
+
+ detached = false;

A better way is: bool detached = false;

-

Compilation warning

shm_mq.c: In function ‘write_in_local_queue’:
shm_mq.c:1489:32: warning: variable ‘tuple_size’ set but not used
[-Wunused-but-set-variable]
  uint64 bytes_written, nbytes, tuple_size;

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-09-17 Thread Dilip Kumar
On Mon, Sep 4, 2017 at 11:18 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Thu, Aug 31, 2017 at 11:27 PM, Robert Haas <robertmh...@gmail.com> wrote:
>
> I have repeated one of the tests after fixing the problems pointed by
> you but this time results are not that impressive.  Seems like below
> check was the problem in the previous patch
>
>if (tbm->nentries > tbm->maxentries / 2)
> tbm->maxentries = Min(tbm->nentries, (INT_MAX - 1) / 2) * 2;
>
> Because we were lossifying only till tbm->nentries becomes 90% of
> tbm->maxentries but later we had this check which will always be true
> and tbm->maxentries will be doubled and that was the main reason of
> huge reduction of lossy pages, basically, we started using more
> work_mem in all the cases.
>
> I have taken one reading just to see the impact after fixing the
> problem with the patch.
>
>  Work_mem: 40 MB
> (Lossy Pages count)
>
> Query    head       patch
> 6   995223   733087
> 14 337894   206824
> 15 995417   798817
> 20   1654016 1588498
>
> Still, we see a good reduction in lossy pages count.  I will perform
> the test at different work_mem and for different values of
> TBM_FILFACTOR and share the number soon.

I haven't yet completely measured the performance with the executor
lossification change.  Meanwhile, I have worked on some of the comments
on the optimizer change and measured the performance again.  I still
see a good improvement in performance (almost 2x for some of the
queries), and with the new method of lossy-page calculation I don't see
the regression in Q14 (now Q14 is not changing its plan).

I used lossy_pages = max(0, total_pages - maxentries / 2), as suggested
by Alexander.


Performance Results:

Machine: Intel 56-core machine (2 NUMA nodes)
work_mem: varies.
TPCH S.F: 20
Median of 3 runs.

work_mem = 4MB

Query   Patch(ms)    Head(ms)     Change in plan
4       4686.186     5039.295     PBHS -> PSS
5       26772.192    27500.800    BHS -> SS
6       6615.916     7760.005     PBHS -> PSS
8       6370.611     12407.731    PBHS -> PSS
15      17493.564    24242.256    BHS -> SS

work_mem = 20MB

Query   Patch(ms)    Head(ms)     Change in plan
6       6656.467     7469.961     PBHS -> PSS
8       6116.526     12300.784    PBHS -> PSS
15      17873.726    22913.421    BHS -> PSS

work_mem = 64MB

Query   Patch(ms)    Head(ms)     Change in plan
15      14900.881    27460.093    BHS -> PBHS


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


improve_bitmap_cost_v2.patch
Description: Binary data



Re: [HACKERS] path toward faster partition pruning

2017-09-16 Thread Dilip Kumar
On Fri, Sep 15, 2017 at 2:20 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/15 11:16, Amit Langote wrote:

Thanks for the updated patch.  I was going through the logic of
get_rel_partitions in 0002, as almost the same functionality will be
required by runtime partition pruning, on which Beena is working.  The
only difference is that here we are processing rel->baserestrictinfo,
whereas for runtime pruning we also need to process join clauses which
are pushed down to the appendrel.

So can we make some generic logic which can be used by both patches?

So basically, we need to make two changes:

1. In get_rel_partitions, instead of processing rel->baserestrictinfo,
we can take a clause list as input; that way we can pass any clause
list to this function.

2. Don't call the "get_partitions_for_keys" routine from
"get_rel_partitions"; instead, get_rel_partitions can just prepare
minkey and maxkey, and the caller of get_rel_partitions can call
get_partitions_for_keys, because for runtime pruning we need to call
get_partitions_for_keys at runtime.

Even after these changes there will be one problem:
get_partitions_for_keys directly fetches "rightop->constvalue", whereas
for runtime pruning we need to store the rightop itself and calculate
the value at runtime by param evaluation.  I haven't yet thought about
how we can make this last part generic.
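Just to illustrate the split I have in mind, something like the
following (a hypothetical shape only; the name and argument list are
illustrative, not actual patch code):

/*
 * Planner and executor both turn a clause list into bounding keys;
 * each caller then invokes get_partitions_for_keys() itself, either at
 * plan time or, for run-time pruning, once the Param values are known.
 */
extern void get_partition_keys_from_clauses(RelOptInfo *rel, List *clauses,
											Datum *minkeys, int *n_minkeys,
											Datum *maxkeys, int *n_maxkeys);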

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] path toward faster partition pruning

2017-09-14 Thread Dilip Kumar
On Wed, Sep 6, 2017 at 4:08 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/09/04 10:10, Amit Langote wrote:
>> On 2017/09/02 2:52, Robert Haas wrote:

>
> [PATCH 2/5] WIP: planner-side changes for partition-pruning
>
> This patch adds a stub get_partitions_for_keys in partition.c with a
> suitable interface for the caller to pass bounding keys extracted from the
> query and other related information.
>
> Importantly, it implements the logic within the planner to match query's
> scan keys to a parent table's partition key and form the bounding keys
> that will be passed to partition.c to compute the list of partitions that
> satisfy those bounds.
>

+ Node   *leftop = get_leftop(clause);
+
+ if (IsA(leftop, RelabelType))
+ leftop = (Node *) ((RelabelType *) leftop)->arg;
+
+ if (equal(leftop, partkey))

It appears that this patch always assumes that the clause will be of
the form "var op const", but it can also be "const op var".

That's the reason why, in the example below, even though the condition
is the same in both queries, it can only prune in the first case but
not in the second.

postgres=# explain select * from t where t.a < 2;
   QUERY PLAN

 Append  (cost=0.00..2.24 rows=1 width=8)
   ->  Seq Scan on t1  (cost=0.00..2.24 rows=1 width=8)
 Filter: (a < 2)
(3 rows)

postgres=# explain select * from t where 2>t.a;
   QUERY PLAN

 Append  (cost=0.00..4.49 rows=2 width=8)
   ->  Seq Scan on t1  (cost=0.00..2.24 rows=1 width=8)
 Filter: (2 > a)
   ->  Seq Scan on t2  (cost=0.00..2.25 rows=1 width=8)
 Filter: (2 > a)
(5 rows)
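To handle the second form, the clause could be commuted before
matching, roughly like this (a sketch only, assuming it runs inside the
existing loop over clauses; error handling omitted):

	Node	   *leftop = get_leftop(clause);
	Node	   *rightop = get_rightop(clause);
	Oid			opno = ((OpExpr *) clause)->opno;

	if (IsA(leftop, RelabelType))
		leftop = (Node *) ((RelabelType *) leftop)->arg;
	if (IsA(rightop, RelabelType))
		rightop = (Node *) ((RelabelType *) rightop)->arg;

	if (!equal(leftop, partkey) && equal(rightop, partkey))
	{
		/* "const op var": flip it, so that 2 > t.a is treated as t.a < 2 */
		Node	   *tmp = leftop;

		opno = get_commutator(opno);
		if (!OidIsValid(opno))
			continue;			/* no commutator, cannot use it for pruning */
		leftop = rightop;
		rightop = tmp;
	}

	if (equal(leftop, partkey))
	{
		/* proceed as the patch already does, using opno and rightop */
	}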

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-09-12 Thread Dilip Kumar
On Tue, Sep 12, 2017 at 11:15 AM, Amit Khandekar <amitdkhan...@gmail.com> wrote:

> But the statement level trigger function can refer to OLD TABLE and
> NEW TABLE, which will contain all the OLD rows and NEW rows
> respectively. So the updated rows of the partitions (including the
> moved ones) need to be captured. So for OLD TABLE, we need to capture
> the deleted row, and for NEW TABLE, we need to capture the inserted
> row.

Yes, I agree.  So in ExecDelete, for the OLD TABLE we only need to call
ExecARUpdateTriggers, which will make the entry in the OLD TABLE only
if a transition table is there, and otherwise do nothing; I guess this
part already exists in your patch.  And we are also calling
ExecARDeleteTriggers, which I guess is to fire the ROW-level delete
trigger, and that is also fine.  What I don't understand is that if
there is no ROW-level delete trigger and there is only a
statement-level delete trigger with a transition table, we still make
the entry in the transition table of the delete trigger, and that will
never be used.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-09-11 Thread Dilip Kumar
On Thu, Sep 7, 2017 at 11:41 AM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 6 September 2017 at 21:47, Dilip Kumar <dilipbal...@gmail.com> wrote:

> Actually, since transition tables came in, the functions like
> ExecARUpdateTriggers() or ExecARInsertTriggers() have this additional
> purpose of capturing transition table rows, so that the images of the
> tables are visible when statement triggers are fired that refer to
> these transition tables. So in the above code, these functions only
> capture rows, they do not add any event for firing any ROW triggers.
> AfterTriggerSaveEvent() returns without adding any event if it's
> called only for transition capture. So even if UPDATE row triggers are
> defined, they won't get fired in case of row movement, although the
> updated rows would be captured if transition tables are referenced in
> these triggers or in the statement triggers.
>

Ok, then I have one more question.

With transition tables, we can only support statement-level triggers, and
for an UPDATE statement we are only going to execute the UPDATE
statement-level trigger.  So is there any point in making transition
table entries for the DELETE/INSERT triggers, since those transition
tables will never be used?  Or am I missing something?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-09-06 Thread Dilip Kumar
On Mon, Sep 4, 2017 at 10:52 AM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 4 September 2017 at 07:43, Amit Kapila <amit.kapil...@gmail.com> wrote:
> Oops sorry. Now attached.

I have done some basic testing and initial review of the patch.  I
have some comments/doubts.  I will continue the review.

+ if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+ ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,

For passing an invalid ItemPointer we are using InvalidOid; this seems a
bit odd to me.  Are we using a similar convention some other place?  I
think it would be better to just pass 0.

--

- if ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
- (event == TRIGGER_EVENT_UPDATE && update_old_table))
+ if (oldtup != NULL &&
+ ((event == TRIGGER_EVENT_DELETE && delete_old_table) ||
+ (event == TRIGGER_EVENT_UPDATE && update_old_table)))
  {
  Tuplestorestate *old_tuplestore;

- Assert(oldtup != NULL);

Only for TRIGGER_EVENT_UPDATE is it possible for oldtup to be NULL, so we
have added an extra check for oldtup and removed the Assert, but for
TRIGGER_EVENT_DELETE we never expect it to be NULL.

Is it better to put the Assert outside the condition check
(Assert(oldtup != NULL || event == TRIGGER_EVENT_UPDATE))?
The same applies to newtup.

I think we should also explain in the comments why oldtup or newtup
can be NULL in the TRIGGER_EVENT_UPDATE case.

---

+triggers affect the row being moved. As far as AFTER ROW
+triggers are concerned, AFTER DELETE and
+AFTER INSERT triggers are applied; but
+AFTER UPDATE triggers are not applied
+because the UPDATE has been converted to a
+DELETE and INSERT.

The comment above says that the AR update trigger is not fired, but the
code below calls ExecARUpdateTriggers:

+ if (mtstate->operation == CMD_UPDATE && mtstate->mt_transition_capture)
+ ExecARUpdateTriggers(estate, resultRelInfo, InvalidOid,
+ NULL,
+ tuple,
+ NULL,
+ mtstate->mt_transition_capture);


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-09-03 Thread Dilip Kumar
On Thu, Aug 31, 2017 at 11:27 PM, Robert Haas <robertmh...@gmail.com> wrote:

I have repeated one of the tests after fixing the problems pointed out by
you, but this time the results are not that impressive.  Seems like the
check below was the problem in the previous patch:

   if (tbm->nentries > tbm->maxentries / 2)
tbm->maxentries = Min(tbm->nentries, (INT_MAX - 1) / 2) * 2;

We were lossifying only until tbm->nentries became 90% of
tbm->maxentries, but later we had this check, which will always be true,
so tbm->maxentries would be doubled.  That was the main reason for the
huge reduction of lossy pages: basically, we started using more work_mem
in all the cases.

I have taken one reading just to see the impact after fixing the
problem with the patch.

 Work_mem: 40 MB
(Lossy Pages count)

Query    head       patch
6        995223     733087
14       337894     206824
15       995417     798817
20       1654016    1588498

Still, we see a good reduction in the lossy pages count.  I will perform
the test at different work_mem settings and for different values of
TBM_FILLFACTOR and share the numbers soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2017-09-01 Thread Dilip Kumar
46156.245)

 53235  Client  | ClientRead
  29470  IPC | ProcArrayGroupUpdate
   4302  LWLock  | wal_insert
   3717  Activity| LogicalLauncherMain
   3715  Activity| AutoVacuumMain
   3463  LWLock  | ProcArrayLock
   3140  Lock| extend
   2934  Activity| BgWriterMain
   1434  Activity| WalWriterMain
   1198  Activity| CheckpointerMain
   1073  LWLock  | XidGenLock
869  IPC     | ClogGroupUpdate


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-08-31 Thread Dilip Kumar
On Fri, Aug 11, 2017 at 10:44 AM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 4 August 2017 at 22:28, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>>>

I am planning to review and test this patch.  Seems like this patch
needs to be rebased.

[dilip@localhost postgresql]$ patch -p1 <
../patches/update-partition-key_v15.patch
patching file doc/src/sgml/ddl.sgml
patching file doc/src/sgml/ref/update.sgml
patching file doc/src/sgml/trigger.sgml
patching file src/backend/catalog/partition.c
Hunk #3 succeeded at 910 (offset -1 lines).
Hunk #4 succeeded at 924 (offset -1 lines).
Hunk #5 succeeded at 934 (offset -1 lines).
Hunk #6 succeeded at 994 (offset -1 lines).
Hunk #7 succeeded at 1009 with fuzz 1 (offset 3 lines).
Hunk #8 FAILED at 1023.
Hunk #9 succeeded at 1059 with fuzz 2 (offset 10 lines).
Hunk #10 succeeded at 2069 (offset 2 lines).
Hunk #11 succeeded at 2406 (offset 2 lines).
1 out of 11 hunks FAILED -- saving rejects to file
src/backend/catalog/partition.c.rej
patching file src/backend/commands/copy.c
Hunk #2 FAILED at 1426.
Hunk #3 FAILED at 1462.
Hunk #4 succeeded at 2616 (offset 7 lines).
Hunk #5 succeeded at 2726 (offset 8 lines).
Hunk #6 succeeded at 2846 (offset 8 lines).
2 out of 6 hunks FAILED -- saving rejects to file
src/backend/commands/copy.c.rej
patching file src/backend/commands/trigger.c
Hunk #4 succeeded at 5261 with fuzz 2.
patching file src/backend/executor/execMain.c
Hunk #1 succeeded at 65 (offset 1 line).
Hunk #2 succeeded at 103 (offset 1 line).
Hunk #3 succeeded at 1829 (offset 20 lines).
Hunk #4 succeeded at 1860 (offset 20 lines).
Hunk #5 succeeded at 1927 (offset 20 lines).
Hunk #6 succeeded at 2044 (offset 21 lines).
Hunk #7 FAILED at 3210.
Hunk #8 FAILED at 3244.
Hunk #9 succeeded at 3289 (offset 26 lines).
Hunk #10 FAILED at 3340.
Hunk #11 succeeded at 3387 (offset 29 lines).
Hunk #12 succeeded at 3424 (offset 29 lines).
3 out of 12 hunks FAILED -- saving rejects to file
src/backend/executor/execMain.c.rej
patching file src/backend/executor/execReplication.c
patching file src/backend/executor/nodeModifyTable.c

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-08-31 Thread Dilip Kumar
On Wed, Aug 30, 2017 at 2:00 AM, Robert Haas <robertmh...@gmail.com> wrote:

>
> I suggest defining a TBM_FILLFACTOR constant instead of repeating the
> value 0.9 in a bunch of places.  I think it would also be good to try
> to find the sweet spot for that constant.  Making it bigger reduces
> the number of lossy entries  created, but making it smaller reduces
> the number of times we have to walk the bitmap.  So if, for example,
> 0.75 is sufficient to produce almost all of the gain, then I think we
> would want to prefer 0.75 to 0.9.  But if 0.9 is better, then we can
> stick with that.
>
> Note that a value higher than 0.9375 wouldn't be sane without some
> additional safety precautions because maxentries could be as low as
> 16.

Thanks for the feedback.  I will work on it.

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-08-28 Thread Dilip Kumar
On Wed, Aug 23, 2017 at 9:45 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>>

>> ...
>> if (tbm->nentries <= tbm->maxentries / 2)
>> {
>> /*
>>  * We have made enough room.
>> ...
>> I think we could try higher fill factor, say, 0.9. tbm_lossify basically
>> just continues iterating over the hashtable with not so much overhead per a
>> call, so calling it more frequently should not be a problem. On the other
>> hand, it would have to process less pages, and the bitmap would be less
>> lossy.
>>
>> I didn't benchmark the index scan per se with the 0.9 fill factor, but the
>> reduction of lossy pages was significant.
>
> I will try this and produce some performance number.
>

I have done some performance testing as suggested by Alexander (patch attached).

Performance results:  I can see a significant reduction in the
lossy_pages count in all the queries and also a noticeable reduction in
the execution time in some of the queries.  I have tested with 2
different work_mem settings.  Below are the test results (lossy pages
count and execution time).


TPCH benchmark: 20 scale factor
Machine: Power 4 socket
Tested with max_parallel_worker_per_gather=0

Work_mem: 20 MB

(Lossy Pages count)
Query    head       patch
4        166551     35478
5        330679     35765
6        1160339    211357
14       666897     103275
15       1160518    211544
20       1982981    405903

(Time in ms)
Query    head       patch
4        14849      14093
5        76790      74486
6        25816      14327
14       16011      11093
15       51381      35326
20       25         195501

Work_mem: 40 MB

(Lossy Pages count)
Query    head       patch
6        995223     195681
14       337894     75744
15       995417     195873
20       1654016    199113

(Time in ms)
Query    head       patch
6        23819      14571
14       13514      11183
15       49980      32400
20       204441     188978

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


lossify_slow.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-08-22 Thread Dilip Kumar
On Tue, Aug 22, 2017 at 8:40 PM, Alexander Kumenkov
<a.kuzmen...@postgrespro.ru> wrote:
> Hi Dilip,
>
> Recently I was thinking about this too, when working on the index-only
> count(*) patch for indexes supporting amgetbitmap [1]. That patch teaches
> bitmap heap scan node to skip heap fetches under certain conditions. Exact
> tidbitmap pages are a prerequisite for this, so the lossines of the bitmap
> heavily influences the cost of a scan.
>
> I used a very simple estimation: lossy_pages = max(0, total_pages -
> maxentries / 2). The rationale is that after the initial lossification, the
> number of lossy pages grows slower. It is good enough to reflect this
> initial sharp increase in the lossy page number.

Makes sense to me.
>
> The thing that seems more important to me here is that the tidbitmap is very
> aggressive in lossifying the pages: it tries to keep the number of entries
> under maxentries / 2, see tbm_lossify():
> ...
> if (tbm->nentries <= tbm->maxentries / 2)
> {
> /*
>  * We have made enough room.
> ...
> I think we could try higher fill factor, say, 0.9. tbm_lossify basically
> just continues iterating over the hashtable with not so much overhead per a
> call, so calling it more frequently should not be a problem. On the other
> hand, it would have to process less pages, and the bitmap would be less
> lossy.
>
> I didn't benchmark the index scan per se with the 0.9 fill factor, but the
> reduction of lossy pages was significant.

I will try this and produce some performance numbers.

Thanks for the input.

>
> [1]
> https://www.postgresql.org/message-id/flat/251401bb-6f53-b957-4128-578ac22e8acf%40postgrespro.ru#251401bb-6f53-b957-4128-578ac22e8...@postgrespro.ru
>


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why not parallel seq scan for slow functions

2017-08-18 Thread Dilip Kumar
On Fri, 18 Aug 2017 at 4:48 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:

> On Thu, Aug 17, 2017 at 2:45 PM, Dilip Kumar <dilipbal...@gmail.com>
> wrote:
> > On Thu, Aug 17, 2017 at 2:09 PM, Dilip Kumar <dilipbal...@gmail.com>
> wrote:
> >>
> >> Either we can pass "num_gene" to merge_clump or we can store num_gene
> >> in the root. And inside merge_clump we can check. Do you see some more
> >> complexity?
> >>
>
> I think something like that should work.

Ok

>
>
> > After putting some more thought I see one more problem but not sure
> > whether we can solve it easily. Now, if we skip generating the gather
> > path at top level node then our cost comparison while adding the
> > element to pool will not be correct as we are skipping some of the
> > paths (gather path).  And, it's very much possible that the path1 is
> > cheaper than path2 without adding gather on top of it but with gather,
> > path2 can be cheaper.
> >
>
> I think that should not matter because the costing of gather is mainly
> based on a number of rows and that should be same for both path1 and
> path2 in this case.


Yeah, I think you are right.

>
>
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] why not parallel seq scan for slow functions

2017-08-17 Thread Dilip Kumar
On Thu, Aug 17, 2017 at 2:09 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>
> Either we can pass "num_gene" to merge_clump or we can store num_gene
> in the root. And inside merge_clump we can check. Do you see some more
> complexity?
>
After putting some more thought into it, I see one more problem, but I am
not sure whether we can solve it easily.  Now, if we skip generating the
gather path at the top-level node, then our cost comparison while adding
the element to the pool will not be correct, as we are skipping some of
the paths (gather paths).  And it's very much possible that path1 is
cheaper than path2 without adding a gather on top of it, but with a
gather, path2 can be cheaper.  But there is no easy way to handle it,
because without the targetlist we cannot calculate the cost of the
gather (which is the actual problem).

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why not parallel seq scan for slow functions

2017-08-17 Thread Dilip Kumar
On Sat, Aug 12, 2017 at 6:48 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Aug 10, 2017 at 1:07 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Tue, Aug 8, 2017 at 3:50 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> Right.
>>>
>
> I think skipping a generation of gather paths for scan node or top
> level join node generated via standard_join_search seems straight
> forward, but skipping for paths generated via geqo seems to be tricky
> (See use of generate_gather_paths in merge_clump).

Either we can pass "num_gene" to merge_clump, or we can store num_gene
in the root and check it inside merge_clump.  Do you see some more
complexity?

if (joinrel)
{
    /* Create GatherPaths for any useful partial paths for rel */
    if (old_clump->size + new_clump->size < num_gene)
        generate_gather_paths(root, joinrel);
}

> Assuming, we find
> some way to skip it for top level scan/join node, I don't think that
> will be sufficient, we have some special way to push target list below
> Gather node in apply_projection_to_path, we need to move that part as
> well in generate_gather_paths.
>

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-08-16 Thread Dilip Kumar
On Wed, Jul 26, 2017 at 10:35 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jun 8, 2017 at 10:44 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>> On Thu, May 18, 2017 at 8:07 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>

Thanks for the feedback.  I haven't yet worked on the optimizer-side
feedback, but I have done some work on improving the executor part;
please find my comments inline.

> There are two main considerations here.  One, it's better to lossify
> pages with many bits set than with few bits set, because the
> additional work we thereby incur is less.  Two, it's better to lossify
> pages that are in the same chunk as other pages which we are also
> going to lossify, because that's how we actually save memory.  The
> current code will cheerfully lossify a chunk that contains no other
> pages, or will lossify one page from a chunk but not the others,
> saving no memory but hurting performance.
>
> Maybe the simplest change here would be to make it so that when we
> decide to lossify a chunk, we lossify all pages in the chunk, but only
> if there's more than one.  In other words, given a proposed page P to
> lossify, probe the hash table for all keys in the same chunk as P and
> build up a words[] array for the proposed chunk.  If that words[]
> array will end up with only 1 bit set, then forget the whole thing;
> otherwise, delete all of the entries for the individual pages and
> insert the new chunk instead.

I have attempted a very simple POC with this approach, just to see how
many lossy pages we can save if we lossify all the pages falling in the
same chunk first, before moving to the next page.  I have taken some data
on TPCH scale 20 with different work_mem settings (I only counted lossy
pages and did not test performance).  I did not see a significant
reduction in lossy pages.  (The POC patch is attached to the mail in case
someone is interested in testing or experimenting further.)

64MB
TPCH Query   Head Lossy_pages   Patch Lossy_pages   lossy_page_reduce
Q6           534984             529745              5239
Q15          535072             529785              5287
Q20          1586933            1584731             2202

40MB
TPCH Query   Head Lossy_pages   Patch Lossy_pages   lossy_page_reduce
Q6           995223             993490              1733
Q14          337894             332890              5004
Q15          995417             993511              1906
Q20          1654016            1652982             1034

20MB
TPCH Query   Head Lossy_pages   Patch Lossy_pages   lossy_page_reduce
Q4           166551             165280              1271
Q5           330679             330028              651
Q6           1160339            1159937             402
Q14          666897             666032              865
Q15          1160518            1160017             501
Q20          1982981            1982828             153


> As far as the patch itself is concerned, tbm_calculate_exact_pages()
> is computing the number of "exact pages" which will fit into the
> TIDBitmap, but I think that instead of tbm_calculate_exact_pages() you
> should have something like tbm_calculate_entries() that just returns
> nbuckets, and then let the caller work out how many entries are going
> to be exact and how many are going to be inexact.  An advantage of
> that approach is that the new function could be used by tbm_create()
> instead of duplicating the logic.

Ok
> I'm not sure that the way you are
> doing the rest of the calculation is wrong, but I've got no confidence
> that it's right, either: the way WORDS_PER_CHUNK is used looks pretty
> random, and the comments aren't enough for me to figure it out.

+ * Eq1: nbuckets = exact_bucket + lossy_buckets
+ * Eq2: total_pages = exact_bucket + (lossy_buckets * WORDS_PER_CHUNK)

I have derived my formulae based on these two equations.  But this
assumes that every lossy bucket (chunk) will hold WORDS_PER_CHUNK pages,
which seems very optimistic.
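
Spelling out the algebra behind that (just a sketch of the derivation;
the variable names are illustrative, and it inherits the same optimistic
assumption):

	/* Eq2 - Eq1: total_pages - nbuckets = lossy_buckets * (WORDS_PER_CHUNK - 1) */
	lossy_buckets = Max(0, (total_pages - nbuckets) / (WORDS_PER_CHUNK - 1));
	exact_buckets = nbuckets - lossy_buckets;

	/* each exact bucket is one exact page; each lossy bucket stands for a chunk */
	exact_pages = exact_buckets;
	lossy_pages = total_pages - exact_pages;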

>
> It's unclear what assumptions we should make while trying to estimate
> the number of lossy pages.  The effectiveness of lossification depends
> on the average number of pages that get folded into a chunk; but how
> many will that be?  If we made some of the improvements proposed
> above, it would probably be higher than it is now, but either way it's
> not clear what number to use.  One possible assumption is that t

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-07-25 Thread Dilip Kumar
On Tue, Jul 25, 2017 at 8:59 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Jul 25, 2017 at 1:31 AM, Rafia Sabih
> <rafia.sa...@enterprisedb.com> wrote:
>> - other queries show a good 20-30% improvement in performance. Performance
>> numbers are as follows,
>>
>> Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) |
>> 3 | 76 |127 | 88 |
>> 4 |17 | 244 | 41 |
>> 5 | 52 | 123 | 84 |
>> 7 | 73 | 134 | 103 |
>> 10 | 67 | 111 | 89 |
>> 12 | 53 | 114 | 99 |
>> 18 | 447 | 709 | 551 |
>
> Hmm.  This certainly shows that benefit of the patch, although it's
> rather sad that we're still slower than if we hadn't partitioned the
> data in the first place.  Why does partitioning hurt performance so
> much?

I was analysing some of the plans (without partitions and with
partitions).  It seems like one of the reasons for performing worse with
the partitioned table is that we cannot use an index on the partitioned
table.

Q4 is taking 17s without partition whereas it's taking 244s with partition.

Now if we analyze the plan:

Without partitions, it can use a parameterized index scan on the lineitem
table, which is really big in size.  But with partitions, it has to scan
this table completely.

  ->  Nested Loop Semi Join
 ->  Parallel Bitmap Heap Scan on orders
   ->  Bitmap Index Scan on
idx_orders_orderdate  (cost=0.00..24378.88 r
   ->  Index Scan using idx_lineitem_orderkey on
lineitem  (cost=0.57..29.29 rows=105 width=8) (actual
time=0.031..0.031 rows=1 loops=1122364)
   Index Cond: (l_orderkey =
orders.o_orderkey)
   Filter: (l_commitdate < l_receiptdate)
   Rows Removed by Filter: 1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why not parallel seq scan for slow functions

2017-07-12 Thread Dilip Kumar
On Wed, Jul 12, 2017 at 10:55 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Wed, Jul 12, 2017 at 1:50 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>> On Mon, Jul 10, 2017 at 9:51 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>>>
>>> So because of this high projection cost the seqpath and parallel path
>>> both have fuzzily same cost but seqpath is winning because it's
>>> parallel safe.
>>
>>
>> I think you are correct.  However, unless parallel_tuple_cost is set very
>> low, apply_projection_to_path never gets called with the Gather path as an
>> argument.  It gets ruled out at some earlier stage, presumably because it
>> assumes the projection step cannot make it win if it is already behind by
>> enough.
>>
>
> I think that is genuine because tuple communication cost is very high.
> If your table is reasonable large then you might want to try by
> increasing parallel workers (Alter Table ... Set (parallel_workers =
> ..))
>
>> So the attached patch improves things, but doesn't go far enough.
>>
>
> It seems to that we need to adjust the cost based on if the below node
> is projection capable.  See attached.

Patch looks good to me.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why not parallel seq scan for slow functions

2017-07-10 Thread Dilip Kumar
On Tue, Jul 11, 2017 at 9:02 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> If I have a slow function which is evaluated in a simple seq scan, I do not
> get parallel execution, even though it would be massively useful.  Unless
> force_parallel_mode=ON, then I get a dummy parallel plan with one worker.
>
> explain select aid,slow(abalance) from pgbench_accounts;

After analysing this, I see multiple reasons why this is not getting selected:

1. The query is selecting all the tuples, and the benefit we get from
parallelism comes from dividing cpu_tuple_cost, which is 0.01, whereas
for each tuple sent from worker to gather there is parallel_tuple_cost,
which is 0.1 per tuple (this overhead matters much less in the aggregate
case).  Maybe you can try selecting with some condition.

like below:
postgres=# explain select slow(abalance) from pgbench_accounts where
abalance > 1;
QUERY PLAN
---
 Gather  (cost=0.00..46602.33 rows=1 width=4)
   Workers Planned: 2
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..46602.33
rows=1 width=4)
 Filter: (abalance > 1)
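
To put rough numbers on point 1 (using the default cost settings named
above; the row count is only illustrative):

	/* defaults: cpu_tuple_cost = 0.01, parallel_tuple_cost = 0.1 */
	double		rows = 1000000.0;	/* every row is returned through the Gather */
	double		cpu_cost_saved = 0.01 * rows;	/* at most, the per-tuple CPU cost workers can share */
	double		gather_cost_added = 0.1 * rows; /* per-tuple cost of shipping rows to the leader */

	/* 100000 added vs at most 10000 saved, so the parallel plan loses */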

2. The second problem I am seeing (maybe a problem in the code) is that
the "slow" function is very costly (1000), and in
apply_projection_to_path we account for this cost.  But I have noticed
that for the gather node we also add this cost for all the rows, whereas
actually, if the lower node is already doing the projection, the gather
node just needs to send out the tuples instead of actually applying the
projection.

In the function below, we always multiply target->cost.per_tuple by
path->rows, but in the case of a gather it should be multiplied by
subpath->rows:

apply_projection_to_path()

    path->startup_cost += target->cost.startup - oldcost.startup;
    path->total_cost += target->cost.startup - oldcost.startup +
        (target->cost.per_tuple - oldcost.per_tuple) * path->rows;


So because of this high projection cost, the seq path and the parallel
path both have fuzzily the same cost, but the seq path wins because it's
parallel safe.
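
Just to illustrate the suggested adjustment (a rough sketch, not a tested
patch; whether subpath->rows is exactly the right number is part of what
needs discussion):

	/* inside apply_projection_to_path(), sketch only */
	if (IsA(path, GatherPath))
	{
		Path	   *subpath = ((GatherPath *) path)->subpath;

		/* projection happens below the Gather, so charge it per subpath row */
		path->total_cost += target->cost.startup - oldcost.startup +
			(target->cost.per_tuple - oldcost.per_tuple) * subpath->rows;
	}
	else
	{
		path->total_cost += target->cost.startup - oldcost.startup +
			(target->cost.per_tuple - oldcost.per_tuple) * path->rows;
	}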

>
> CREATE OR REPLACE FUNCTION slow(integer)
>  RETURNS integer
>  LANGUAGE plperl
>  IMMUTABLE PARALLEL SAFE STRICT COST 1000
> AS $function$
>   my $thing=$_[0];
>   foreach (1..1_000_000) {
> $thing = sqrt($thing);
> $thing *= $thing;
>   };
>   return ($thing+0);
> $function$;
>
> The partial path is getting added to the list of paths, it is just not
> getting chosen, even if parallel_*_cost are set to zero.  Why not?
>
> If I do an aggregate, then it does use parallel workers:
>
> explain select sum(slow(abalance)) from pgbench_accounts;
>
> It doesn't use as many as I would like, because there is a limit based on
> the logarithm of the table size (I'm using -s 10 and get 3 parallel
> processes) , but at least I know how to start looking into that.
>
> Also, how do you debug stuff like this?  Are there some gdb tricks to make
> this easier to introspect into the plans?
>
> Cheers,
>
> Jeff


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error while copying a large file in pg_rewind

2017-07-06 Thread Dilip Kumar
On Thu, Jul 6, 2017 at 4:18 PM, Kuntal Ghosh <kuntalghosh.2...@gmail.com> wrote:
> On Wed, Jul 5, 2017 at 9:35 AM, Michael Paquier
> <michael.paqu...@gmail.com> wrote:
>> On Tue, Jul 4, 2017 at 4:41 PM, Kuntal Ghosh <kuntalghosh.2...@gmail.com> 
>> wrote:
>>> I've not yet started the patch and it may take some time for me to
>>> understand and write
>>> the patch in a correct way. Since, you've almost written the patch,
>>> IMHO, please go ahead
>>> and submit the patch. I'll happily review and test it. :-)
>>>
>>> Thanks for the notes.
>>
>> OK, thanks. Here you go.
>>
> Thanks for the patch. It looks good and it solves the existing issues.
>
> But, I'm little concerned/doubt regarding the following part of the code.
> +/*
> + * Converts an int64 from network byte order to native format.
> + */
> +static int64
> +pg_recvint64(int64 value)
> +{
> +   union
> +   {
> +   int64   i64;
> +   uint32  i32[2];
> +   } swap;
> +   int64   result;
> +
> +   swap.i64 = value;
> +
> +   result = (uint32) ntohl(swap.i32[0]);
> +   result <<= 32;
> +   result |= (uint32) ntohl(swap.i32[1]);
> +
> +   return result;
> +}
> Does this always work correctly irrespective of the endianess of the
> underlying system?

I think this will have a problem; we may need to do something like

 union
 {
     int64   i64;
     uint8   i8[8];
 } swap;

and reverse the complete array if the byte order differs, or we can use
something like "be64toh".
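
For example, a byte-wise version along these lines (just a sketch, not a
proposed final fix) would be independent of host endianness:

/*
 * Sketch: assemble the int64 from its 8 network-order (big-endian) bytes,
 * which works the same way on little- and big-endian hosts.
 */
static int64
pg_recvint64(const char *buf)
{
	uint64		result = 0;
	int			i;

	for (i = 0; i < 8; i++)
		result = (result << 8) | (uint8) buf[i];

	return (int64) result;
}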

>
>
> --
> Thanks & Regards,
> Kuntal Ghosh
> EnterpriseDB: http://www.enterprisedb.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-07-05 Thread Dilip Kumar
On Mon, Jul 3, 2017 at 4:39 PM, amul sul <sula...@gmail.com> wrote:
> Thanks to catching this, fixed in the attached version.

A few comments on the latest version.

0001 looks fine, for 0002 I have some comments.

1.
+ hbounds = (PartitionHashBound * *) palloc(nparts *
+  sizeof(PartitionHashBound *));

/s/(PartitionHashBound * *)/(PartitionHashBound **)/g

2.
RelationBuildPartitionDesc
{
 


* catalog scan that retrieved them, whereas that in the latter is
* defined by canonicalized representation of the list values or the
* range bounds.
*/
for (i = 0; i < nparts; i++)
result->oids[mapping[i]] = oids[i];

Should this comment mention hash as well?

3.

if (b1->datums[b1->ndatums - 1][0] != b2->datums[b2->ndatums - 1][0])
return false;

if (b1->ndatums != b2->ndatums)
return false;

If ndatums itself is different, then there is no need to access the
datum memory, so it is better to check ndatums first.

4.
+ * next larger modulus.  For example, if you have a bunch
+ * of partitions that all have modulus 5, you can add a
+ * new new partition with modulus 10 or a new partition

Typo, "new new partition"  -> "new partition"


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Default Partition for Range

2017-07-03 Thread Dilip Kumar
On Fri, Jun 30, 2017 at 11:56 AM, Beena Emerson <memissemer...@gmail.com> wrote:
> Hello Dilip,
>
> On Wed, Jun 21, 2017 at 6:27 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>> On Tue, Jun 20, 2017 at 6:57 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>>> This is basically crashing in RelationBuildPartitionDesc, so I think

>
> Thank you for your review and analysis.
>
> I have updated the patch.
> - bound->content is set to RANGE_DATUM_DEFAULT for each of the keys
> and not just the first one.
> - Improve the way of handling DEFAULT bounds in RelationBuildPartitionDesc,
>
> There is a test for multiple column range in alter_table.sql

Thanks for the updated patch.  After this fix, the segmentation fault is solved.

I have some more comments.

1.

lower = make_one_range_bound(key, -1, spec->lowerdatums, true);
  upper = make_one_range_bound(key, -1, spec->upperdatums, false);

+ /* Default case is not handled here */
+ if (spec->is_default)
+ break;
+

I think we can move the default check just above where the range bounds
are made; it will avoid unnecessary processing.


2.
RelationBuildPartitionDesc
{


   /*
* If either of them has infinite element, we can't equate
* them.  Even when both are infinite, they'd have
* opposite signs, because only one of cur and prev is a
* lower bound).
*/
if (cur->content[j] != RANGE_DATUM_FINITE ||
  prev->content[j] != RANGE_DATUM_FINITE)
{
is_distinct = true;
break;
}
After the default range partitioning patch, the above comment doesn't
sound very accurate and can lead to confusion: now we are not only
considering infinite bounds here, but there is also a possibility that
the prev bound can be DEFAULT, so the comment should clearly mention
that.

3.

+ bound->datums = datums ? (Datum *) palloc0(key->partnatts * sizeof(Datum))
+ : NULL;
  bound->content = (RangeDatumContent *) palloc0(key->partnatts *
sizeof(RangeDatumContent));
  bound->lower = lower;

  i = 0;
+
+ /* datums are NULL for default */
+ if (datums == NULL)
+ for (i = 0; i < key->partnatts; i++)
+ bound->content[i] = RANGE_DATUM_DEFAULT;

To me, it would look cleaner if we keep bound->content = NULL for the
default partition, instead of allocating memory and initializing each
element.  But it's just a suggestion, and you can decide whichever way
seems better to you.  Then other places, e.g.
+ if (cur->content[i] == RANGE_DATUM_DEFAULT)
+ {
+ /*

will change like

+ if (cur->content == NULL)
+ {
+ /*

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optional message to user when terminating/cancelling backend

2017-06-22 Thread Dilip Kumar
On Thu, Jun 22, 2017 at 7:18 PM, Daniel Gustafsson <dan...@yesql.se> wrote:
> Good point.  I’ve attached a new version which issues a NOTICE on truncation
> and also addresses all other comments so far in this thread.  Thanks a lot for
> the early patch reviews!
>
> cheers ./daniel

I have done an initial review of the patch. I have some comments/suggestions.


+int
+GetCancelMessage(char **buffer, size_t buf_len)
+{
+ volatile BackendCancelShmemStruct *slot = MyCancelSlot;
+ int msg_length = 0;
+

The returned value of this function is never used; it would be better to
change it to just void.

---

+bool
+HasCancelMessage(void)
+{
+ volatile BackendCancelShmemStruct *slot = MyCancelSlot;


+/*
+ * Return the configured cancellation message and its length. If the returned
+ * length is greater than the size of the passed buffer, truncation has been
+ * performed. The message is cleared on reading.
+ */
+int
+GetCancelMessage(char **buffer, size_t buf_len)

I think it would be good to merge these two functions: GetCancelMessage
would first check whether there is a message; if there is, it would
allocate a buffer of size slot->len and copy the slot message into the
allocated buffer, otherwise it would just return NULL.

So it would look like "char *GetCancelMessage()".

-

+ SpinLockAcquire(&slot->mutex);
+ strlcpy(*buffer, (const char *) slot->message, buf_len);

strlcpy(*buffer, (const char *) slot->message, slot->len);

Isn't it better to copy only up to slot->len?  It seems like it will
always be <= buf_len, and if not, then we can do min(buf_len, slot->len).



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Default Partition for Range

2017-06-21 Thread Dilip Kumar
On Wed, Jun 21, 2017 at 8:08 PM, Robert Haas <robertmh...@gmail.com> wrote:
> I think somebody should do some testing of the existing code with
> valgrind.  And then apply the list-partitioning patch and this patch,
> and do some more testing with valgrind.  It seems to be really easy to
> miss these uninitialized access problems during code review.

I think it will help, but it may not help in all the scenarios, because
in most places we are allocating memory with palloc0 (initialized with
0), but it is never initialized with RANGE_DATUM_DEFAULT except for the
first element (in the case of a DEFAULT partition).  And later those
elements may be considered as RANGE_DATUM_FINITE (because their value
is 0).

One solution can be that, if the bound is DEFAULT, we initialize the
complete content array for that partition bound with
RANGE_DATUM_DEFAULT instead of just the first element, as sketched
below.  Otherwise, we need to be careful to exit early wherever we loop
over the content array of the DEFAULT bound.
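
A minimal sketch of that first option (assuming the spec, key and bound
variables that are visible in the surrounding partition-bound code):

	/* Sketch: mark every key column of a DEFAULT bound, not just the first one */
	if (spec->is_default)
	{
		for (i = 0; i < key->partnatts; i++)
			bound->content[i] = RANGE_DATUM_DEFAULT;
	}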

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Autovacuum launcher occurs error when cancelled by SIGINT

2017-06-21 Thread Dilip Kumar
On Wed, Jun 21, 2017 at 7:47 PM, Kuntal Ghosh
<kuntalghosh.2...@gmail.com> wrote:
>> IMHO, It's not a good idea to use DSM call to verify the DSA handle.
>>
> Okay. Is there any particular scenario you've in mind where this may fail?

It's not about failure, but about the abstraction.  When we are using
the DSA we should not directly access the DSM which is under DSA.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Autovacuum launcher occurs error when cancelled by SIGINT

2017-06-21 Thread Dilip Kumar
On Wed, Jun 21, 2017 at 6:50 PM, Kuntal Ghosh
<kuntalghosh.2...@gmail.com> wrote:
> I think we can just check dsm_find_mapping() to check whether the dsm
> handle is already attached. Something like,
>
> }
> -   else
> +   else if(!dsm_find_mapping(AutoVacuumShmem->av_dsa_handle))
> {
> AutoVacuumDSA = dsa_attach(AutoVacuumShmem->av_dsa_handle);
> dsa_pin_mapping(AutoVacuumDSA);
>
> Thoughts?

IMHO, it's not a good idea to use a DSM call to verify the DSA handle.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Default Partition for Range

2017-06-21 Thread Dilip Kumar
On Tue, Jun 20, 2017 at 6:57 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> This is basically crashing in RelationBuildPartitionDesc, so I think
> we don't have any test case for testing DEFAULT range partition where
> partition key has more than one attribute.  So I suggest we can add
> such test case.

Some more comments.


- bound->datums = (Datum *) palloc0(key->partnatts * sizeof(Datum));
- bound->content = (RangeDatumContent *) palloc0(key->partnatts *
-   sizeof(RangeDatumContent));
+ bound->datums = datums ? (Datum *) palloc0(key->partnatts * sizeof(Datum))
+ : NULL;
+ bound->content = (RangeDatumContent *) palloc0(
+ (datums ? key->partnatts : 1) * sizeof(RangeDatumContent));
  bound->lower = lower;

  i = 0;
+
+ /* If default, datums are NULL */
+ if (datums == NULL)
+ bound->content[i] = RANGE_DATUM_DEFAULT;


For the default partition we are only setting bound->content[0] to
default, but the content for the other key attributes is not
initialized.  Later in the code, if the content of the first key is
RANGE_DATUM_DEFAULT then it should not access the next content entries,
but I see there are some exceptions, which can access uninitialized
values.

For example, see the code below:


for (i = 0; i < key->partnatts; i++)
  {
+ if (rb_content[i] == RANGE_DATUM_DEFAULT)   --> why does it go on to the
content of the next partition attribute? We never initialized that.
+ continue;
+
  if (rb_content[i] != RANGE_DATUM_FINITE)
  return rb_content[i] == RANGE_DATUM_NEG_INF ? -1 : 1;


Also
In RelationBuildPartitionDesc


for (j = 0; j < key->partnatts; j++)
{
-- Suppose the first is RANGE_DATUM_DEFAULT; then we should not check the
   next, because that is never initialized.  I think this is also the
   cause of the crash that I reported above.

if (rbounds[i]->content[j] == RANGE_DATUM_FINITE)
boundinfo->datums[i][j] =
datumCopy(rbounds[i]->datums[j],
 key->parttypbyval[j],
 key->parttyplen[j]);
/* Remember, we are storing the tri-state value. */
boundinfo->content[i][j] = rbounds[i]->content[j];


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Regarding Postgres Dynamic Shared Memory (DSA)

2017-06-20 Thread Dilip Kumar
On Tue, Jun 20, 2017 at 6:48 PM, Mahendranath Gurram
<mahendran...@zohocorp.com> wrote:
> The steps you followed are right. May i know in which OS you tried?
> Mac/Linux.
>
> Because,  In Mac, it is working fine. just as expected. But the
> problem(segmentation fault) I'm facing is with linux systems.

Mine is Linux.

CentOS Linux release 7.2.1511


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Default Partition for Range

2017-06-20 Thread Dilip Kumar
On Thu, Jun 15, 2017 at 11:20 AM, Beena Emerson <memissemer...@gmail.com> wrote:
> Hello,
>
> PFA the updated patch.
> This is rebased over v21 patches of list partition.
> (http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg316818.html)

While testing, I noticed a segmentation fault for a simple case.

create table test(a int, b int) partition by range(a,b);
create table test1 partition of test DEFAULT;
postgres=# drop table test1;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

This is basically crashing in RelationBuildPartitionDesc, so I think
we don't have any test case for a DEFAULT range partition where the
partition key has more than one attribute.  So I suggest we add such a
test case.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Regarding Postgres Dynamic Shared Memory (DSA)

2017-06-20 Thread Dilip Kumar
ess
------
    0


I don't see any segmentation fault.  Is there some other step that I
have missed?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Postgresql bug report - unexpected behavior of suppress_redundant_updates_trigger

2017-06-19 Thread Dilip Kumar
On Mon, Jun 19, 2017 at 5:20 PM, Artus de benque
<artusdeben...@gmail.com> wrote:
> postgres=# UPDATE test_table SET field = 'hi' WHERE id = 1;
> UPDATE 0
> test_db=# UPDATE test_table SET field = rpad('', 2001, 'a') WHERE id = 1;
> UPDATE 1
> test_db=# UPDATE test_table SET field = rpad('', 2001, 'a') WHERE id = 1;
> UPDATE 1 <--- BUG: expected 0, as we ran the same update twice

Seems like in "suppress_redundant_updates_trigger"  we are comparing
toasted tuple with the new tuple and that is the cause of the bug.
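
One possible direction for a fix (an untested sketch; it assumes the
usual trigger-data fields and the tuptoaster API) would be to flatten any
externally-stored attributes of the stored tuple before the byte-wise
comparison:

	HeapTuple	oldtuple = trigdata->tg_trigtuple;

	/* Sketch: de-toast the stored tuple so it compares byte-wise with the new one */
	if (HeapTupleHasExternal(oldtuple))
		oldtuple = toast_flatten_tuple(oldtuple, trigdata->tg_relation->rd_att);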


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql's \d and \dt are sending their complaints to different output files

2017-06-19 Thread Dilip Kumar
On Mon, Jun 19, 2017 at 6:30 PM, Oleksandr Shulgin
<oleksandr.shul...@zalando.de> wrote:
> I think can be helpful, though rarely, to be able to send the output of \d*
> commands to a file.  At the same time it would be nice to see the message on
> stderr instead of appending to the output file, in case the relation was not
> found, because of less head-scratching needed to realize the problem.  So
> I'd vote for changing \dt (and alike) behavior to use stderr.

+1

And we can also change the error messages to be more consistent:

\d says "Did not find any relation named "xxx"", whereas \dt says "No
matching relations found".

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-06-08 Thread Dilip Kumar
On Thu, May 18, 2017 at 8:07 PM, Robert Haas <robertmh...@gmail.com> wrote:

Thanks for the feedback and sorry for the delayed response.

> You might need to adjust effective_cache_size.

You are right.  But effective_cache_size has an impact on the number of
pages_fetched when it's used for a parameterized path (i.e. the inner
side of a nested loop), whereas in our case, where we see the wrong
number of pages being estimated (Q10), it was for a non-parameterized
path.  However, I have also tested with a high effective_cache_size and
did not observe any change.

> The Mackert and Lohman
> formula isn't exactly counting unique pages fetched.

Right.

>It will count
> the same page twice if it thinks the page will be evicted from the
> cache after the first fetch and before the second one.

That too only when loop_count > 1.  If loop_count = 1 then it seems it
doesn't consider effective_cache_size.  But, actually, multiple tuples
can fall on the same page.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-06-06 Thread Dilip Kumar
On Tue, Jun 6, 2017 at 2:41 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> Consider an example using the partition hierarchy:
>
> root (a int, b char, c int) partition by range (a)
>
>  -> level1 from (1) to (10) partition by list (b)
>
>  -> level2 in ('a') parition by range (c)
>
>  -> leaf from (1) to (10)
>
> Inserting (1, 'b', 1) into level1 will fail, because tuple can't be routed
> at level1 (no partition defined for b = 'b').
>
> Inserting (1, 'a', 10) into level1 will fail, because tuple can't be
> routed at level2 (no partition defined for c >= 10).
>
> Inserting (10, 'a', 1) into level1 will fail, because, although it was
> able to get through level1 and level2 into leaf, a = 10 falls out of
> level1's defined range.  We don't check that 1 <= a < 10 before starting
> the tuple-routing.
>
> I wonder if we should...  Since we don't allow BR triggers on partitioned
> tables, there should not be any harm in doing it just before calling
> ExecFindPartition().  Perhaps, topic for a new thread.

Yeah, correct.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-06-06 Thread Dilip Kumar
On Tue, Jun 6, 2017 at 1:03 PM, amul sul <sula...@gmail.com> wrote:
> May I ask you, how you sure about 8 is an unfit value for t1 relation?
> And what if the value other than 8, for e.g. 7?

Well, first I created t1 as a leaf relation as below, and I tested
inserting the value 8 into t1; it violated the partition constraint of
t1, whereas 7 was fine.

create table t (a int) partition by hash(a);
create table t1 partition of t for values with (modulus 2, remainder 1);

Later I dropped this t1 and created a two-level partition hierarchy with the leaf as a range partition.

drop table t1;
create table t1 partition of t for values with (modulus 2, remainder
1) partition by range(a);
create table t1_1 partition of t1 for values from (8) to (10);

So now I am sure that t1_1 can accept the value 8 while its parent t1 can't.

So I think it can only happen with hash partitioning that a value is
legal for the child but illegal for the parent?  Isn't it a good idea
that if a user is inserting into the top-level relation, he should know
exactly for which partition the constraint got violated?

>
> Updated patch attached.

Thanks.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Default Partition for Range

2017-06-05 Thread Dilip Kumar
On Mon, Jun 5, 2017 at 1:43 AM, Beena Emerson <memissemer...@gmail.com> wrote:
> The new patch is rebased over default_partition_v18.patch
> [http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg315831.html]

I have done an initial review of the patch; I have a few comments.


+ if ((lower->content[0] == RANGE_DATUM_DEFAULT))
+ {
+ if (partition_bound_has_default(partdesc->boundinfo))
+ {
+ overlap = true;
+ with = partdesc->boundinfo->default_index;
+ }

I think we can change the if ((lower->content[0] == RANGE_DATUM_DEFAULT))
check to if (spec->is_default); that way it will be consistent with the
check in the PARTITION_STRATEGY_LIST case.  Or we can move this complete
check outside of the switch.

-

- PartitionRangeDatum *datum = castNode(PartitionRangeDatum, lfirst(lc));
+ Node   *node = lfirst(lc);
+ PartitionRangeDatum *datum;
+
+ datum = castNode(PartitionRangeDatum, node);

Why do you need to change this?

--

+ if (!datums)
+ bound->content[i] = RANGE_DATUM_DEFAULT;
+

Better to check if (datums != NULL) for non-boolean types.

-
+ if (content1[i] == RANGE_DATUM_DEFAULT ||
+ content2[i] == RANGE_DATUM_DEFAULT)
+ {
+ if (content1[i] == content2[i])
+ return 0;
+ else if (content1[i] == RANGE_DATUM_DEFAULT)
+ return -1;

I don't see any comment on why the default partition should be considered
the smallest in the index comparison.  For negative infinity, it's pretty
obvious from the enum name itself.

-----

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-06-03 Thread Dilip Kumar
On Thu, May 25, 2017 at 9:59 AM, amul sul <sula...@gmail.com> wrote:
> On Mon, May 22, 2017 at 2:23 PM, amul sul <sula...@gmail.com> wrote:
>>
>> Updated patch attached. Thanks a lot for review.
>>
> Minor fix in the document, PFA.

The patch needs a rebase.

---
The function header is not consistent with other neighbouring functions
(some functions contain the function name in the header but others don't):
+/*
+ * Compute the hash value for given not null partition key values.
+ */

--
postgres=# create table t1 partition of t for values with (modulus 2,
remainder 1) partition by range(a);
CREATE TABLE
postgres=# create table t1_1 partition of t1 for values from (8) to (10);
CREATE TABLE
postgres=# insert into t1 values(8);
2017-06-03 18:41:46.067 IST [5433] ERROR:  new row for relation "t1_1"
violates partition constraint
2017-06-03 18:41:46.067 IST [5433] DETAIL:  Failing row contains (8).
2017-06-03 18:41:46.067 IST [5433] STATEMENT:  insert into t1 values(8);
ERROR:  new row for relation "t1_1" violates partition constraint
DETAIL:  Failing row contains (8).

The value 8 violates the partition constraint of t1, and we are trying to
insert the value into t1; still, the error is coming from the leaf-level
table t1_1.  That may be fine, but from the error it appears that it's
violating the constraint of t1_1, whereas it's actually violating the
constraint of t1.

From the implementation, it appears that the leaf partition is being
identified based on the key, and it's only failing during ExecInsert
while checking the partition constraint.

Other than that, patch looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] <> join selectivity estimate question

2017-06-01 Thread Dilip Kumar
On Thu, Jun 1, 2017 at 8:24 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, May 31, 2017 at 1:18 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>> +   if (jointype = JOIN_SEMI)
>> +   {
>> +   sjinfo->jointype = JOIN_INNER;
>> +   }
>
> That is pretty obviously half-baked and completely untested.

Actually, I was not proposing this patch; instead I wanted to discuss
the approach.  I was claiming that for non-equal JOIN_SEMI selectivity
estimation, instead of calculating the selectivity in the existing way,
i.e. 1 - (selectivity of equal JOIN_SEMI), the better way would be
1 - (selectivity of equal, computed as an inner join).  I have only
tested a standalone scenario where it solves the problem, not the TPCH
cases.  But I was more interested in discussing whether the way I am
thinking it should calculate the non-equal SEMI join selectivity makes
any sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error while creating subscription when server is running in single user mode

2017-06-01 Thread Dilip Kumar
On Thu, Jun 1, 2017 at 1:02 PM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
> Thanks, this looks correct to me at quick glance.
>
> +if (!IsUnderPostmaster)
> +ereport(FATAL,
> +(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +errmsg("subscription commands are not supported by
> single-user servers")));
> The messages could be more detailed, like directly the operation of
> CREATE/ALTER/DROP SUBCRIPTION in each error message. But that's a nit.

Thanks for looking into it.  Yeah, I think it's better to give a
specific message instead of a generic one, because we still support some
of the subscription commands even in single-user mode, i.e. ALTER
SUBSCRIPTION ... OWNER.  My patch doesn't block this command, and there
is no harm in supporting it in single-user mode, but does this make any
sense?  We could construct a use case like creating a subscription in
normal mode and then running ALTER ... OWNER in single-user mode, but it
makes little sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


subscription_error_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error while creating subscription when server is running in single user mode

2017-05-31 Thread Dilip Kumar
On Wed, May 31, 2017 at 2:20 PM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
> Yeah, see 0e0f43d6 for example. A simple fix is to look at
> IsUnderPostmaster when creating, altering or dropping a subscription
> in subscriptioncmds.c.

Yeah, the patch below fixes that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


subscription_error.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY (query) TO ... doesn't allow parallelism

2017-05-31 Thread Dilip Kumar
On Wed, May 31, 2017 at 7:19 PM, Andres Freund <and...@anarazel.de> wrote:
> At the moment $subject doesn't allow parallelism, because copy.c's
> pg_plan_query() invocation doesn't set the CURSOR_OPT_PARALLEL_OK
> flag.
>
> To me that appears to be an oversight rather than intentional.  A
> somewhat annoying one at that, because it's not uncommong to use COPY to
> execute reports to a CSV file and such.
>
> Robert, am I missing a complication?
>
> I personally think we should fix this in 9.6 and 10, but I've to admit
> I'm not entirely impartial, because Citus hit this...

IMHO, for copy_to I don't see any problem in parallelizing it.  I just
tested by changing the cursorOption and it's working in parallel.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


copy_to_parallel.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POC: Sharing record typmods between backends

2017-05-31 Thread Dilip Kumar
On Wed, May 31, 2017 at 12:53 PM, Robert Haas <robertmh...@gmail.com> wrote:
> Well, SH_TYPE's members SH_ELEMENT_TYPE *data and void *private_data
> are not going to work in DSM, because they are pointers.  You can
> doubtless come up with a way around that problem, but I guess the
> question is whether that's actually any better than just using DHT.

Probably I misunderstood the question. I assumed that we need to bring
in DHT only for achieving this goal. But, if the question is simply
the comparison of DHT vs simplehash for this particular case then I
agree that DHT is a more appropriate choice.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] <> join selectivity estimate question

2017-05-31 Thread Dilip Kumar
On Fri, Mar 17, 2017 at 6:49 PM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> Right.  If I temporarily hack neqjoinsel() thus:
>
> result = 1.0 - result;
> +
> +   if (jointype == JOIN_SEMI)
> +   result = 1.0;
> +
> PG_RETURN_FLOAT8(result);
>  }

I was looking into this problem.  IMHO, the correct solution is that
for JOIN_SEMI, neqjoinsel should not estimate the equijoin selectivity
using eqjoinsel_semi; instead, it should calculate the equijoin
selectivity as for an inner join and derive the selectivity of <> as
(1 - equijoin selectivity).  For an inner join we can claim that
"selectivity of '=' + selectivity of '<>' = 1", but the same is not
true for semi-join selectivity: for a semi-join it is possible that the
selectivities of '=' and '<>' are both 1.

something like below


@@ -2659,7 +2659,13 @@ neqjoinsel(PG_FUNCTION_ARGS)
SpecialJoinInfo *sjinfo = (SpecialJoinInfo *) PG_GETARG_POINTER(4);
Oid eqop;
float8  result;

+   if (jointype == JOIN_SEMI)
+   {
+   sjinfo->jointype = JOIN_INNER;
+   }
/*
 * We want 1 - eqjoinsel() where the equality operator is the one
 * associated with this != operator, that is, its negator.

We may need something similar for anti-join as well.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POC: Sharing record typmods between backends

2017-05-31 Thread Dilip Kumar
On Wed, May 31, 2017 at 10:57 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> Simplehash provides an option to provide your own allocator function
>> to it. So in the allocator function, you can allocate memory from DSA.
>> After it reaches some threshold it expands the size (double) and it
>> will again call the allocator function to allocate the bigger memory.
>> You can refer pagetable_allocate in tidbitmap.c.
>
> That only allows the pagetable to be shared, not the hash table itself.

I agree with you.  But if I understand the use case correctly, we need
to store the TupleDesc for the RECORD in a shared hash table so that it
can be shared across multiple processes.  I think this can be achieved
with simplehash as well.

To get this done, we need some fixed shared memory for holding the
static members of SH_TYPE, and the process that creates the simplehash
will be responsible for copying those static members to the shared
location so that other processes can access the SH_TYPE.  The dynamic
part (the actual hash entries) can then be allocated from DSA by
registering an SH_ALLOCATE function.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error while creating subscription when server is running in single user mode

2017-05-31 Thread Dilip Kumar
On Wed, May 31, 2017 at 7:54 AM, tushar <tushar.ah...@enterprisedb.com> wrote:
> centos@centos-cpula bin]$ ./postgres --single postgres -D m1data
> PostgreSQL stand-alone backend 10beta1
> backend> create subscription sub connection 'dbname=postgres port=5433
> user=centos' publication p with (create_slot=0,enabled=off);
> 2017-05-31 12:53:09.318 BST [10469] LOG:  statement: create subscription sub
> connection 'dbname=postgres port=5433 user=centos' publication p with
> (create_slot=0,enabled=off);
>
> 2017-05-31 12:53:09.326 BST [10469] ERROR:  epoll_ctl() failed: Bad file
> descriptor

IMHO, single-user mode cannot support replication (it cannot have a
background WALReceiver task).  However, if that is correct, I believe
the above statement should fail with a proper error instead of this one.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POC: Sharing record typmods between backends

2017-05-30 Thread Dilip Kumar
On Tue, May 30, 2017 at 1:09 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
>> * Perhaps simplehash + an LWLock would be better than dht, but I
>> haven't looked into that.  Can it be convinced to work in DSA memory
>> and to grow on demand?

Simplehash provides an option to provide your own allocator function
to it. So in the allocator function, you can allocate memory from DSA.
After it reaches some threshold it expands the size (double) and it
will again call the allocator function to allocate the bigger memory.
You can refer pagetable_allocate in tidbitmap.c.
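
To make that a bit more concrete, below is a rough sketch of such an
allocator callback, loosely modeled on pagetable_allocate in tidbitmap.c.
The myhash_* names and the MySharedHashState struct are hypothetical
(they would come from the SH_PREFIX setup and the caller's own state);
this is only an illustration, not a working implementation.

/* Hypothetical owner state passed as simplehash's private_data. */
typedef struct MySharedHashState
{
	dsa_area   *dsa;		/* shared area to allocate from */
	dsa_pointer dsaptr;		/* last allocation, so it can be shared/freed */
} MySharedHashState;

/* Allocator callback invoked when the hash table (re)sizes its element array. */
static inline void *
myhash_allocate(myhash_hash *hashtab, Size size)
{
	MySharedHashState *state = (MySharedHashState *) hashtab->private_data;

	/* Allocate (possibly >1GB) zeroed space in DSA, return a local pointer. */
	state->dsaptr = dsa_allocate_extended(state->dsa, size,
										  DSA_ALLOC_HUGE | DSA_ALLOC_ZERO);
	return dsa_get_address(state->dsa, state->dsaptr);
}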

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HINT message for "ALTER SUBSCRIPTION.. WITH" need to change with SET keyword

2017-05-19 Thread Dilip Kumar
On Fri, May 19, 2017 at 3:41 PM, tushar <tushar.ah...@enterprisedb.com> wrote:
> postgres=# drop subscription sub;
> ERROR:  could not connect to publisher when attempting to drop the
> replication slot "pub"
> DETAIL:  The error was: could not connect to server: No such file or
> directory
> Is the server running locally and accepting
> connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
> HINT:  Use ALTER SUBSCRIPTION ... WITH (slot_name = NONE) to disassociate
> the subscription from the slot.
>
> expected = "HINT:  Use ALTER SUBSCRIPTION ... SET (slot_name = NONE) to
> disassociate the subscription from the slot."

Seems like the syntax got changed in commit
b807f59828fbc02fea612e1cbc0066c6dfa3be9b, but the hint was not updated
to match.  The attached patch fixes that.
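
For reference, the change is essentially one word in the errhint() text
(a sketch; the call site in subscriptioncmds.c is from memory, not copied
from the attached patch):

- errhint("Use ALTER SUBSCRIPTION ... WITH (slot_name = NONE) to disassociate the subscription from the slot.")
+ errhint("Use ALTER SUBSCRIPTION ... SET (slot_name = NONE) to disassociate the subscription from the slot.")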

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


fix_subscription_hint.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-05-18 Thread Dilip Kumar
On Wed, May 17, 2017 at 2:07 PM, amul sul <sula...@gmail.com> wrote:
>> I would suggest "non-zero positive", since that's what we are using in
>> the documentation.
>>
>
> Understood, Fixed in the attached version.

Why non-zero positive?  We do support zero for the remainder right?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-05-18 Thread Dilip Kumar
I would like to propose a patch that makes bitmap heap scan costing
sensitive to work_mem.  Currently, bitmap scan costing does not consider
work_mem at all, so in cases where there will be a lot of lossy pages a
bitmap scan still gets selected, which eventually leads to degraded
performance.

While evaluating parallel bitmap heap scan on TPCH we noticed that in
many queries bitmap heap scan gives good performance at high work_mem,
but at low work_mem it slows the query down compared to a sequential
scan or index scan.  This shows that bitmap heap scan is a better
alternative when most of the heap pages fit into work_mem.

Attached POC patch fixes the problem by considering work_mem for bitmap costing.

Performance numbers with this patch on different values of work_mem
are as follows,
workload: TPCH scale factor 20
machine: POWER 8

work_mem = 4MB
Query   Head(ms)    Patch(ms)   Improvement   Change in plan
  4     13759.632   14464.491   0.95x         PBHS -> PSS
  5     47581.558   41888.853   1.14x         BHS -> SS
  6     14051.553   13853.449   1.01x         PBHS -> PSS
  8     21529.98    11289.25    1.91x         PBHS -> PSS
 10     37844.51    34460.669   1.10x         BHS -> SS
 14     10131.49    15281.49    0.66x         BHS -> SS
 15     43579.833   34971.051   1.25x         BHS -> SS

work_mem = 20MB
Query   Head(ms)    Patch(ms)   Improvement   Change in plan
  6     14592       13521.06    1.08x         PBHS -> PSS
  8     20223.106   10716.062   1.89x         PBHS -> PSS
 15     40486.957   33687.706   1.20x         BHS -> PSS

work_mem = 64MB
Query   Head(ms)    Patch(ms)   Improvement   Change in plan
 15     40904.572   25750.873   1.59x         BHS -> PBHS

work_mem = 1GB
No plan got changed

Most of the queries show a decent improvement; however, Q14 shows a
regression at work_mem = 4MB.  On analysing this case, I found that the
number of pages_fetched calculated by the "Mackert and Lohman formula" is
very high (1112817) compared to the actual number of unique heap pages
fetched (293314).  Therefore, while costing the bitmap scan using 1112817
pages and 4MB of work_mem, we predicted that even after lossifying all
the pages the bitmap cannot fit into work_mem, hence the bitmap scan was
not selected.
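
To give an idea of the heuristic in the POC patch, the work_mem-aware
part is roughly of the following shape.  This is a simplified sketch, not
the patch itself; the function name and the assumption that lossification
keeps about half of the entries as exact pages are mine.

/*
 * Given the number of heap pages the Mackert and Lohman formula predicts
 * we will fetch, and the number of exact page entries that fit into
 * work_mem, estimate how many pages will have to be represented lossily.
 */
static double
estimate_lossy_pages(double pages_fetched, double maxentries)
{
	double		lossy_pages = 0;

	if (pages_fetched > maxentries)
	{
		/* assume lossification keeps roughly half the entries exact */
		lossy_pages = Max(0, pages_fetched - maxentries / 2);
	}

	return lossy_pages;
}

The bitmap heap scan cost can then charge extra per-tuple CPU for the
lossy pages, since every tuple on a lossy page has to be rechecked.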

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


improve_bitmap_cost_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-05-17 Thread Dilip Kumar
On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> Earlier I thought that option1 is better but later I think that this
>> can complicate the situation as we are firing first BR update then BR
>> delete and can change the row multiple time and defining such
>> behaviour can be complicated.
>>
>
> If we have to go by this theory, then the option you have preferred
> will still execute BR triggers for both delete and insert, so input
> row can still be changed twice.

Yeah, right; as per my theory above, option 3 has the same problem.

But after putting some more thought into it, I realised that the row can
be changed only by a "before update" or "before insert" trigger.  Correct
me if I am assuming something wrong.

So option 3 again makes more sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPDATE of partition key

2017-05-17 Thread Dilip Kumar
On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> Option 3
> 
>
> BR, AR delete triggers on source partition
> BR, AR insert triggers on destination partition.
>
> Rationale :
> Since the update is converted to delete+insert, just skip the update
> triggers completely.

+1 to option 3.
Generally, BR triggers are used for updating the ROW value and AR
triggers to VALIDATE the row or to modify other tables.  So it seems we
can fire the triggers that match the operation actually happening at the
partition level.

For the source partition it's only a delete operation (no update
happened), so we fire delete triggers, and for the destination it's only
an insert operation, so we fire only insert triggers.  That will keep
things simple, and it will also be in sync with the actual
partition-level delete/insert operations.

One may argue that the user might have declared only update triggers
and, since an update was executed, may expect those triggers to fire.
But I think this behaviour can be documented with the rationale that if
the user is updating the partition key then they must be ready with the
delete/insert triggers as well; they cannot rely only on update-level
triggers.

Earlier I thought that option 1 was better, but now I think it can
complicate the situation: we would fire first the BR update and then the
BR delete triggers, which can change the row multiple times, and
defining such behaviour can be complicated.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-05-16 Thread Dilip Kumar
On Tue, May 16, 2017 at 4:22 PM, amul sul <sula...@gmail.com> wrote:
> v6 patch has bug in partition oid mapping and indexing, fixed in the
> attached version.
>
> Now partition oids will be arranged in the ascending order of hash
> partition bound  (i.e. modulus and remainder sorting order)

Thanks for the update patch. I have some more comments.


+ if (spec->remainder < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+  errmsg("hash partition remainder must be less than modulus")));

I think this error message is not correct; you might want to change it
to "hash partition remainder must be non-negative integer".

---

+ The table is partitioned by specifying remainder and modulus for each
+ partition. Each partition holds rows for which the hash value of

Wouldn't it be better to say "modulus and remainder" instead of
"remainder and modulus"?  Then it would be consistent.

---
+   An UPDATE that causes a row to move from one partition to
+   another fails, because

fails, because -> fails because

---

Wouldn't it be a good idea to document how to increase the number of
hash partitions?  I think we can document it somewhere with an example,
something like what Robert explained upthread:

create table foo (a integer, b text) partition by hash (a);
create table foo1 partition of foo with (modulus 2, remainder 0);
create table foo2 partition of foo with (modulus 2, remainder 1);

You can detach foo1, create two new partitions with modulus 4 and
remainders 0 and 2, and move the data over from the old partition

I think it would be good information for a user to have, or is it
already documented and I missed it?



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-05-15 Thread Dilip Kumar
On Mon, May 15, 2017 at 9:06 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> Test2:
> postgres=# insert into x2 values(100);   -- it should violates
> partition constraint
> INSERT 0 1
>
> Seems like a bug or am I missing something completely?

Sorry, my bad. It's modulus on the hashvalue, not the column.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-05-15 Thread Dilip Kumar
On Mon, May 15, 2017 at 4:27 PM, amul sul <sula...@gmail.com> wrote:
> Updated patches attached.

While testing the latest patch I found a strange behaviour.

test1:
postgres=# create table x (a int) partition by hash(a);
CREATE TABLE
postgres=# create table x1 partition of x for values with (modulus 4,
remainder 0);
CREATE TABLE
postgres=# create table x2 partition of x for values with (modulus 4,
remainder 1);
CREATE TABLE
postgres=# insert into x values(1);
2017-05-15 20:55:20.446 IST [28045] ERROR:  no partition of relation
"x" found for row
2017-05-15 20:55:20.446 IST [28045] DETAIL:  Partition key of the
failing row contains (a) = (1).
2017-05-15 20:55:20.446 IST [28045] STATEMENT:  insert into x values(1);
ERROR:  no partition of relation "x" found for row
DETAIL:  Partition key of the failing row contains (a) = (1).

Test2:
postgres=# insert into x2 values(100);   -- it should violates
partition constraint
INSERT 0 1

Seems like a bug or am I missing something completely?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Increasing parallel workers at runtime

2017-05-15 Thread Dilip Kumar
On Mon, May 15, 2017 at 7:36 PM, Haribabu Kommi
<kommi.harib...@gmail.com> wrote:
>
> The wait of the workers to send tuples is may be
> because of less number of workers. So increasing
> the parallel workers may improve the performance.

Yes, for the cases where a significant amount of work is still
pending; but it might hurt performance in cases where the work is
almost, but not yet, complete.

@@ -346,6 +346,38 @@ gather_readnext(GatherState *gatherstate)
  if (nvisited >= gatherstate->nreaders)
  {
  /*
+ * As we are going to wait for the workers to send tuples, this may be
+ * possible because of not sufficient workers that are planned?
+ * Does the gather have all the required parallel workers? If not try to get
+ * some more workers (only when all the previously allocated workers are still
+ * doing the job) before we wait, this will further increase the performance
+ * of the query as planned.
+ */
You might want to do similar changes for gather_merge_readnext.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-05-15 Thread Dilip Kumar
On Mon, May 15, 2017 at 5:43 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> Yes. postgresAddForeignUpdateTargets() which is called by
> rewriteTargetListUD injects "ctid". "wholerow" is always there. Not
> for postgres_fdw but for other wrappers it might be a bad news. ctid,
> whole row obtained from the remote postgres server will fit the tuple
> descriptor of parent, but for other FDWs the column injected by
> rewriteTargetListUD() may make the child tuple look different from
> that of the parent, so we may not pass that column down to the child.

But can't we call rewriteTargetListUD for all the inheritors when the
inheritor is a foreign table, which will in turn call
postgresAddForeignUpdateTargets?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-05-15 Thread Dilip Kumar
On Mon, May 15, 2017 at 2:46 PM, Ildus Kurbangaliev
<i.kurbangal...@postgrespro.ru> wrote:
> Hi,
> planSlot contains already projected tuple, you can't use it as oldtuple.
> I think problem is that `rewriteTargetListUD` called only for parent
> relation, so there is no `wholerow` attribute for foreign tables.

Oh, right!

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-05-14 Thread Dilip Kumar
On Sun, May 14, 2017 at 9:54 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> After your fix, now tupleid is invalid which is expected, but seems
> like we need to do something more. As per the comments seems like it
> is expected to get the oldtuple from planSlot.  But I don't see any
> code for handling that part.

Maybe we should do something like attached patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


fix_fdw_trigger.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-05-14 Thread Dilip Kumar
On Sun, May 14, 2017 at 5:35 PM, Ildus Kurbangaliev
<i.kurbangal...@postgrespro.ru> wrote:
> TRAP: FailedAssertion("!(((const void*)(fdw_trigtuple) != ((void *)0))
> ^ ((bool) (((const void*)(tupleid) != ((void *)0)) &&
> ((tupleid)->ip_posid != 0", File: "trigger.c", Line: 2428)
>
> I'm not sure how it should be fixed, because as I see `oldtuple` will
> be set only for AFTER ROW triggers by `wholerow` junk attribute.

There is comment on the ExecUpdate function

 *  When updating a foreign table,
 * tupleid is invalid; the FDW has to figure out which row to
 * update using data from the planSlot.  oldtuple is passed to
 * foreign table triggers; it is NULL when the foreign table has
 * no relevant triggers.

After your fix, tupleid is now invalid, which is expected, but it seems
we need to do something more.  As per the comments, the oldtuple is
expected to be obtained from the planSlot, but I don't see any code
handling that part.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-05-13 Thread Dilip Kumar
On Fri, May 12, 2017 at 6:08 PM, amul sul <sula...@gmail.com> wrote:
> Hi,
>
> Please find the following updated patches attached:

I have done some testing with the new patch; most of the cases worked
as expected, except the one below.

I expected the planner to select only "Seq Scan on t1", whereas it's
scanning both partitions.

create table t (a int, b varchar) partition by hash(a);
create table t1 partition of t for values with (modulus 8, remainder 0);
create table t2 partition of t for values with (modulus 8, remainder 1);

postgres=# explain select * from t where a=8;
QUERY PLAN
--
 Append  (cost=0.00..51.75 rows=12 width=36)
   ->  Seq Scan on t1  (cost=0.00..25.88 rows=6 width=36)
 Filter: (a = 8)
   ->  Seq Scan on t2  (cost=0.00..25.88 rows=6 width=36)
 Filter: (a = 8)
(5 rows)


Some cosmetic comments.
---
+ RangeVar   *rv = makeRangeVarFromNameList(castNode(List, nameEl->arg));
+

Useless Hunk.

 /*
- * Build a CREATE SEQUENCE command to create the sequence object, and
- * add it to the list of things to be done before this CREATE/ALTER
- * TABLE.
+ * Build a CREATE SEQUENCE command to create the sequence object, and add
+ * it to the list of things to be done before this CREATE/ALTER TABLE.
  */

Seems like, in src/backend/parser/parse_utilcmd.c, you have changed
the existing code with
pgindent.  I think it's not a good idea to mix pgindent changes with your patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-05-11 Thread Dilip Kumar
On Wed, May 3, 2017 at 6:39 PM, amul sul <sula...@gmail.com> wrote:
> On Thu, Apr 27, 2017 at 1:42 AM, Robert Haas <robertmh...@gmail.com> wrote:
>
>>I spent some time today looking at these patches.  It seems like there
>>is some more work still needed here to produce something committable
>>regardless of which way we go, but I am inclined to think that Amul's
>>patch is a better basis for work going forward than Nagata-san's
>>patch. Here are some general comments on the two patches:
>
> Thanks for your time.
>
> [...]
>
>> - Neither patch contains any documentation updates, which is bad.
>
> Fixed in the attached version.

I have done an initial review of the patch and I have some comments.  I
will continue the review and testing and report the results soon.

-
Patch need to be rebased



if (key->strategy == PARTITION_STRATEGY_RANGE)
{
/* Disallow nulls in the range partition key of the tuple */
for (i = 0; i < key->partnatts; i++)
if (isnull[i])
ereport(ERROR,
(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
errmsg("range partition key of row contains null")));
}

We need to add PARTITION_STRATEGY_HASH as well; we don't support NULL
for hash either, right?


RangeDatumContent **content;/* what's contained in each range bound datum?
  * (see the above enum); NULL for list
  * partitioned tables */

This will be NULL for hash as well, so we need to change the comment.
-

  bool has_null; /* Is there a null-accepting partition? false
  * for range partitioned tables */
  int null_index; /* Index of the null-accepting partition; -1

Comments need to be changed for these two members as well.


+/* One bound of a hash partition */
+typedef struct PartitionHashBound
+{
+ int modulus;
+ int remainder;
+ int index;
+} PartitionHashBound;

It would be good to add some comments to explain the structure members.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication ApplyContext bloat

2017-04-20 Thread Dilip Kumar
On Thu, Apr 20, 2017 at 7:04 PM, Stas Kelvich <s.kelv...@postgrespro.ru> wrote:
> Thanks for noting.
>
> Added short description of ApplyContext and ApplyMessageContext to README.


+ApplyContext --- permanent during whole lifetime of apply worker. It is
+possible to use TopMemoryContext here as well, but for simplicity of
memory usage
+analysys we spin up different context.


Typo

/analysys/analysis


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] AGG_HASHED cost estimate

2017-04-20 Thread Dilip Kumar
On Thu, Apr 20, 2017 at 4:16 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> but cost this without numGroups.
>
> /*
>  * The transCost.per_tuple component of aggcosts should be charged once
>  * per input tuple, corresponding to the costs of evaluating the aggregate
>  * transfns and their input expressions (with any startup cost of course
>  * charged but once).  The finalCost component is charged once per output
>  * tuple, corresponding to the costs of evaluating the finalfns.
>  *
>  * If we are grouping, we charge an additional cpu_operator_cost per
>  * grouping column per input tuple for grouping comparisons.
>  *
>
> The reason may be that hashing isn't as costly as a comparison. I
> don't how true is that.

Earlier, in the GatherMerge thread[1], Rushabh mentioned that
HashAggregate was getting picked where a grouping aggregate with
GatherMerge was actually faster at execution time, and he suspected
problems with the costing of HashAggregate.  Maybe this is one of those
cases?

[1]https://www.postgresql.org/message-id/CAGPqQf2164iV6k-_M75qEZWiCfRarA_SKSmHjc0Uh1rEf5RJrA%40mail.gmail.com


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parallel bitmapscan isn't exercised in regression tests

2017-04-05 Thread Dilip Kumar
On Tue, Apr 4, 2017 at 5:51 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> Sure I can do that, In attached patch, I only fixed the problem of not
> executing the bitmap test.  Now, I will add few cases to cover other
> parts especially rescan and prefetching logic.

I have added two test cases to cover rescan, prefetch and lossy pages
logic for parallel bitmap.  I have removed the existing case because
these two new cases will be enough to cover that part as well.

Now, nodeBitmapHeapScan.c has 95.5% of line coverage.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


parallel_bitmap_test_v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Parallel Bitmap Heap Scan - Prefetch pages are not updated properly

2017-04-04 Thread Dilip Kumar
While analyzing the coverage for the prefetching part, I found an
issue: prefetch_pages was not updated properly while executing in
parallel mode.

Attached patch fixes the same.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


parallel_bitmap_prefetch_fix.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parallel bitmapscan isn't exercised in regression tests

2017-04-03 Thread Dilip Kumar
On Mon, Apr 3, 2017 at 11:22 PM, Andres Freund <and...@anarazel.de> wrote:
> That's better than before, but I'd appreciate working on a bit more
> coverage. E.g. rescans probably aren't exercised in that test, right?
>
> If you have time & energy, it'd also be good to expand the tests to
> cover the prefetching logic - it's quite bad that it's currently not
> tested at all :(

Sure I can do that, In attached patch, I only fixed the problem of not
executing the bitmap test.  Now, I will add few cases to cover other
parts especially rescan and prefetching logic.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parallel bitmapscan isn't exercised in regression tests

2017-04-01 Thread Dilip Kumar
On Sat, Apr 1, 2017 at 12:16 AM, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> The parallel code-path isn't actually exercised in the tests added in
> [1], as evidenced by [2] (they just explain).  That imo needs to be
> fixed.

Thanks for reporting. Attached patch fixes that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


parallel_bitmap_test.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-31 Thread Dilip Kumar
On Thu, Mar 30, 2017 at 5:27 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> I am not sure if we can consider it as completely synthetic because we
> might see some similar cases for json datatypes.  Can we once try to
> see the impact when the same test runs from multiple clients?  For
> your information, I am also trying to setup some tests along with one
> of my colleague and we will report the results once the tests are
> complete.
We have done some testing; the test details and results are below.

Test:
I have derived this test from the test given by Pavan[1] above, with the
following differences.

- I have reduced the fillfactor to 40 to ensure that there is scope in
the page to store multiple WARM chains.
- WARM updated all the tuples.
- Executed a large select to force a lot of tuple rechecks within a single query.
- Smaller tuple size (the aid field is around ~100 bytes) just to ensure
tuples have sufficient space on a page to get WARM updated.

Results:
---
* I can see more than 15% regression in this case.  This regression
is repeatable.
* If I increase the fillfactor to 90 then the regression reduces to 7%;
maybe fewer tuples are getting WARM updated and others are not,
because there is no space left on the page after a few WARM updates.

Test Setup:

Machine Information:

Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
RAM: 64GB

Config Change:
  synchronous_commit=off

——Setup.sql—

DROP TABLE IF EXISTS pgbench_accounts;
CREATE TABLE pgbench_accounts (
aid text,
bid bigint,
abalance bigint,
filler1 text DEFAULT md5(random()::text),
filler2 text DEFAULT md5(random()::text),
filler3 text DEFAULT md5(random()::text),
filler4 text DEFAULT md5(random()::text),
filler5 text DEFAULT md5(random()::text),
filler6 text DEFAULT md5(random()::text),
filler7 text DEFAULT md5(random()::text),
filler8 text DEFAULT md5(random()::text),
filler9 text DEFAULT md5(random()::text),
filler10 text DEFAULT md5(random()::text),
filler11 text DEFAULT md5(random()::text),
filler12 text DEFAULT md5(random()::text)
) WITH (fillfactor=40);

\set scale 10
\set end 0
\set start (:end + 1)
\set end (:start + (:scale * 100))

INSERT INTO pgbench_accounts SELECT generate_series(:start, :end
)::text || repeat('a', 100), (random()::bigint) % :scale, 0;

CREATE UNIQUE INDEX pgb_a_aid ON pgbench_accounts(aid);
CREATE INDEX pgb_a_filler1 ON pgbench_accounts(filler1);
CREATE INDEX pgb_a_filler2 ON pgbench_accounts(filler2);
CREATE INDEX pgb_a_filler3 ON pgbench_accounts(filler3);
CREATE INDEX pgb_a_filler4 ON pgbench_accounts(filler4);

UPDATE pgbench_accounts SET filler1 = 'X';--WARM update all the tuples

—Test.sql—
set enable_seqscan=off;
set enable_bitmapscan=off;
explain analyze select * FROM pgbench_accounts WHERE aid < '400' ||
repeat('a', 100) ORDER BY aid

—Script.sh—
./psql -d postgres -f setup.sql
./pgbench -c1 -j1 -T300 -M prepared -f test.sql postgres

Patch:
tps = 3554.345313 (including connections establishing)
tps = 3554.880776 (excluding connections establishing)

Head:
tps = 4208.876829 (including connections establishing)
tps = 4209.440321 (excluding connections establishing)

*** After changing fill factor to 90 —

Patch:
tps = 3794.414770 (including connections establishing)
tps = 3794.919592 (excluding connections establishing)

Head:
tps = 4206.445608 (including connections establishing)
tps = 4207.033559 (excluding connections establishing)

[1]
https://www.postgresql.org/message-id/CABOikdMduu9wOhfvNzqVuNW4YdBgbgwv-A%3DHNFCL7R5Tmbx7JA%40mail.gmail.com


I have done some profiling for the patch and I have noticed that the
time spent in the heap_check_warm_chain function has increased.

Top 10 functions in perf results (with patch):
+8.98% 1.04%  postgres  postgres[.] varstr_cmp
+7.24% 0.00%  postgres  [unknown]   [.] 
+6.34% 0.36%  postgres  libc-2.17.so[.] clock_gettime
+6.34% 0.00%  postgres  [unknown]   [.] 0x0003
+6.18% 6.15%  postgres  [vdso]  [.] __vdso_clock_gettime
+5.72% 0.02%  postgres  [kernel.kallsyms]   [k] system_call_fastpath
+4.08% 4.06%  postgres  libc-2.17.so[.] __memcpy_ssse3_back
+4.08% 4.06%  postgres  libc-2.17.so[.] get_next_seq
+3.92% 0.00%  postgres  [unknown]   [.] 0x6161616161616161
+3.07% 3.05%  postgres  postgres[.] heap_check_warm_chain


Thanks to Amit for helping in discussing the test ideas.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-30 Thread Dilip Kumar
On Wed, Mar 29, 2017 at 11:51 AM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
> Thanks. I think your patch of tracking interesting attributes seems ok too
> after the performance issue was addressed. Even though we can still improve
> that further, at least Mithun confirmed that there is no significant
> regression anymore and in fact for one artificial case, patch does better
> than even master.

I was trying to compile these patches on the latest
head (f90d23d0c51895e0d7db7910538e85d3d38691f0) for some testing, but I
was not able to compile them.

make[3]: *** [postgres.bki] Error 1
make[3]: Leaving directory
`/home/dilip/work/pg_codes/pbms_final/postgresql/src/backend/catalog'
make[2]: *** [submake-schemapg] Error 2
make[2]: Leaving directory
`/home/dilip/work/pg_codes/pbms_final/postgresql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/dilip/work/pg_codes/pbms_final/postgresql/src'
make: *** [all-src-recurse] Error 2

I tried doing maintainer-clean and deleting postgres.bki, but I still get the same error.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-03-27 Thread Dilip Kumar
On Mon, Mar 27, 2017 at 5:58 PM, Robert Haas <robertmh...@gmail.com> wrote:
> Failing to pass DSA_ALLOC_HUGE is an obvious oversight, but I think
> the ptbase == NULL check shouldn't be needed, because we're not
> passing DSA_ALLOC_NO_OOM.  And that's good, because this is going to
> be called from SH_CREATE, which isn't prepared for a NULL return
> anyway.
>
> Am I all wet?

Yes, you are right that we are not passing DSA_ALLOC_NO_OOM, so
dsa_allocate_extended will raise an error if the allocation fails.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-03-27 Thread Dilip Kumar
On Mon, Mar 27, 2017 at 12:53 PM, Rafia Sabih
<rafia.sa...@enterprisedb.com> wrote:
> Recently, on testing TPC-H 300 scale factor I stumbled on to a error
> for Q6, the test environment is as follows,
> work_mem = 1GB,
> shared_buffers = 10 GB,
> Effective_cache_size = 10GB
> random_page_cost = seq+page_cost =0.1
>
> The error message is, ERROR:  invalid DSA memory alloc request size 1610612740
> In case required, stack trace is as follows,

Thanks for reporting.  In pagetable_allocate, the DSA_ALLOC_HUGE flag
was missing while allocating the memory from the DSA.  I have also
handled a NULL pointer return from dsa_get_address.

Please verify with the attached patch.
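
For reference, the relevant change is of this shape (a sketch from
memory, not the exact hunk in the attached patch):

	/* allow page table allocations larger than 1GB when backed by DSA */
	tbm->dsapagetable = dsa_allocate_extended(tbm->dsa,
											  sizeof(PTEntryArray) + size,
											  DSA_ALLOC_HUGE | DSA_ALLOC_ZERO);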


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


bitmap_hugepage_fix.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Problem in Parallel Bitmap Heap Scan?

2017-03-26 Thread Dilip Kumar
On Sat, Mar 25, 2017 at 2:25 PM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
>>> I think in this area we need more testing, reason these are not tested
>>> properly because these are not the natural case for parallel bitmap.
>>> I think in next few days I will test more such cases by forcing the
>>> parallel bitmap.
>>>
>>
>> Okay, is your testing complete?

Yes, I have done more testing around this area with more cases, like
one page with BitmapOr etc.
Now it looks fine to me.
>>
>>> Here is the patch to fix the issue in hand.  I have also run the
>>> regress suit with force_parallel_mode=regress and all the test are
>>> passing.
>>>
>>
>> Thomas, did you get chance to verify Dilip's latest patch?
>>
>> I have added this issue in PostgreSQL 10 Open Items list so that we
>> don't loose track of this issue.
>
> The result is correct with this patch.  I ran make installcheck then
> the same steps as above and the query result was correct after
> creating the index.

Thanks for confirming.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling parallelism for queries coming from SQL or other PL functions

2017-03-22 Thread Dilip Kumar
On Wed, Mar 22, 2017 at 10:33 PM, Robert Haas <robertmh...@gmail.com> wrote:
> So couldn't we actually make this test !fcache->returnsSet || !es->lazyEval?
> That would let us allow parallel execution for all non-set-returning
> functions, and also for set-returning functions that end up with
> es->lazyEval set to false.

Yes, this is the right thing to do, although we may not enable
parallelism for any more queries by adding "|| !es->lazyEval", because
SELECTs are always marked with es->lazyEval=true (and so far we have
parallelism only for SELECT).  But since the parameter we are passing to
ExecutorRun is execute_once, !fcache->returnsSet || !es->lazyEval
is the correct condition and is future proof.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Problem in Parallel Bitmap Heap Scan?

2017-03-21 Thread Dilip Kumar
On Wed, Mar 22, 2017 at 5:38 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> Isn't that one row short?  What happened to this one?
>
>  10.0.0.0/8 | 10::/8

Actually, in my last test I did not connect to the regression database;
I simply took the table and a few rows from inet.sql, so there were only
16 rows even with a seqscan.

Here are the updated results when I connect to the regression database and re-test.

regression=# SELECT * FROM inet_tbl WHERE i <> '192.168.1.0/24'::cidr
ORDER BY i;
 c  |i
+--
 10.0.0.0/8 | 9.1.2.3/8
 10.0.0.0/8 | 10.1.2.3/8
 10.0.0.0/32| 10.1.2.3/8
 10.0.0.0/8 | 10.1.2.3/8
 10.1.0.0/16| 10.1.2.3/16
 10.1.2.0/24| 10.1.2.3/24
 10.1.2.3/32| 10.1.2.3
 10.0.0.0/8 | 11.1.2.3/8
 192.168.1.0/24 | 192.168.1.226/24
 192.168.1.0/24 | 192.168.1.255/24
 192.168.1.0/24 | 192.168.1.0/25
 192.168.1.0/24 | 192.168.1.255/25
 192.168.1.0/26 | 192.168.1.226
 10.0.0.0/8 | 10::/8
 :::1.2.3.4/128 | ::4.3.2.1/24
 10:23::f1/128  | 10:23::f1/64
 10:23::8000/113| 10:23::
(17 rows)

regression=# explain analyze SELECT * FROM inet_tbl WHERE i <>
'192.168.1.0/24'::cidr
ORDER BY i;
QUERY PLAN
---
 Gather Merge  (cost=16.57..16.67 rows=10 width=64) (actual
time=4.972..4.983 rows=17 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   ->  Sort  (cost=16.56..16.58 rows=10 width=64) (actual
time=0.107..0.110 rows=8 loops=2)
 Sort Key: i
 Sort Method: quicksort  Memory: 26kB
 ->  Parallel Bitmap Heap Scan on inet_tbl  (cost=12.26..16.39
rows=10 width=64) (actual time=0.051..0.053 rows=8 loops=2)
   Recheck Cond: (i <> '192.168.1.0/24'::inet)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on inet_idx3  (cost=0.00..12.26
rows=17 width=0) (actual time=0.016..0.016 rows=17 loops=1)
 Index Cond: (i <> '192.168.1.0/24'::inet)
 Planning time: 0.113 ms
 Execution time: 5.691 ms
(13 rows)


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Problem in Parallel Bitmap Heap Scan?

2017-03-21 Thread Dilip Kumar
On Tue, Mar 21, 2017 at 4:47 PM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> I noticed a failure in the inet.sql test while running the regression
> tests with parallelism cranked up, and can reproduce it interactively
> as follows.  After an spgist index is created and the plan changes to
> the one shown below, the query returns no rows.

Thanks for reporting.  Seems like we are getting issues related to
TBM_ONE_PAGE and TBM_EMPTY.

I think this area needs more testing; the reason these paths are not
tested properly is that they are not the natural case for parallel
bitmap.  In the next few days I will test more such cases by forcing the
parallel bitmap.

Here is the patch to fix the issue in hand.  I have also run the
regress suit with force_parallel_mode=regress and all the test are
passing.

Results after fix.

postgres=# explain analyze SELECT * FROM inet_tbl WHERE i <>
'192.168.1.0/24'::cidr
ORDER BY i;
QUERY PLAN
--
 Gather Merge  (cost=16.53..16.62 rows=9 width=64) (actual
time=4.467..4.478 rows=16 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   ->  Sort  (cost=16.52..16.54 rows=9 width=64) (actual
time=0.090..0.093 rows=8 loops=2)
 Sort Key: i
 Sort Method: quicksort  Memory: 25kB
 ->  Parallel Bitmap Heap Scan on inet_tbl  (cost=12.26..16.37
rows=9 width=64) (actual time=0.048..0.050 rows=8 loops=2)
   Recheck Cond: (i <> '192.168.1.0/24'::inet)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on inet_idx3  (cost=0.00..12.25
rows=16 width=0) (actual time=0.016..0.016 rows=16 loops=1)
 Index Cond: (i <> '192.168.1.0/24'::inet)
 Planning time: 0.149 ms
 Execution time: 5.143 ms
(13 rows)

postgres=# SELECT * FROM inet_tbl WHERE i <> '192.168.1.0/24'::cidr
ORDER BY i;
 c  |i
+--
 10.0.0.0/8 | 9.1.2.3/8
 10.0.0.0/8 | 10.1.2.3/8
 10.0.0.0/32| 10.1.2.3/8
 10.0.0.0/8 | 10.1.2.3/8
 10.1.0.0/16| 10.1.2.3/16
 10.1.2.0/24| 10.1.2.3/24
 10.1.2.3/32| 10.1.2.3
 10.0.0.0/8 | 11.1.2.3/8
 192.168.1.0/24 | 192.168.1.226/24
 192.168.1.0/24 | 192.168.1.255/24
 192.168.1.0/24 | 192.168.1.0/25
 192.168.1.0/24 | 192.168.1.255/25
 192.168.1.0/26 | 192.168.1.226
 :::1.2.3.4/128 | ::4.3.2.1/24
 10:23::f1/128  | 10:23::f1/64
 10:23::8000/113    | 10:23::
(16 rows)


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


fix_parallel_bitmap_handling_onepage.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling parallelism for queries coming from SQL or other PL functions

2017-03-21 Thread Dilip Kumar
On Tue, Mar 21, 2017 at 3:36 PM, Rafia Sabih
<rafia.sa...@enterprisedb.com> wrote:
> On Wed, Mar 15, 2017 at 8:55 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> Note this:
>>
>> if (completed || !fcache->returnsSet)
>> postquel_end(es);
>>
>> When the SQL function doesn't return a set, then we can allow
>> parallelism even when lazyEval is set, because we'll only call
>> ExecutorStart() once.  But my impression is that something like this:

How about taking the decision of execute_once based on
fcache->returnsSet instead of based on lazyEval?

change
+ ExecutorRun(es->qd, ForwardScanDirection, count, !es->lazyEval);
to
+ ExecutorRun(es->qd, ForwardScanDirection, count, !fcache->returnsSet);

IMHO, this is what Robert had in mind?

>SELECT * FROM blah() LIMIT 3
>
>...will trigger three separate calls to ExecutorRun(), which is a
>problem if the plan is a parallel plan.

And you also need to test the case that Robert mentioned upthread.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Bitmap scans a bit broken

2017-03-16 Thread Dilip Kumar
On Thu, Mar 16, 2017 at 8:26 PM, Emre Hasegeli <e...@hasegeli.com> wrote:
>> Hopefully, this time I got it correct.  Since I am unable to reproduce
>> the issue so I will again need your help in verifying the fix.
>
> It is not crashing with the new patch.  Thank you.

Thanks for verifying.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

