Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-09-13 Thread Tom Lane
Kuntal Ghosh  writes:
> On Tue, Sep 12, 2017 at 9:47 PM, Tom Lane  wrote:
>> Aleksander Alekseev  writes:
>>> The following review has been posted through the commitfest application:
>>> make installcheck-world:  tested, passed
>>> Implements feature:   tested, passed
>>> Spec compliant:   tested, passed
>>> Documentation:            tested, passed
>>> Also I didn't manage to find any typos or obvious mistakes in the code.

> I've performed the regression tests for the same. It passed all the
> test cases. Also, I've verified the feature implementation using the
> queries posted by you earlier and some of my own test cases. It is
> working as expected.

Pushed, thanks for reviewing!

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-09-13 Thread Kuntal Ghosh
On Tue, Sep 12, 2017 at 9:47 PM, Tom Lane  wrote:
> Aleksander Alekseev  writes:
>> The following review has been posted through the commitfest application:
>> make installcheck-world:  tested, passed
>> Implements feature:   tested, passed
>> Spec compliant:   tested, passed
>> Documentation:            tested, passed
>
>> Also I didn't manage to find any typos or obvious mistakes in the code.
>
> Thanks for reviewing!  I assume this is against the prior version of the
> patch, though, not the one I just posted with updates for contrib.
> Do you want to look at those?
>
I've performed the regression tests for the same. It passed all the
test cases. Also, I've verified the feature implementation using the
queries posted by you earlier and some of my own test cases. It is
working as expected.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-09-13 Thread Aleksander Alekseev
Hi Tom,

> Thanks for reviewing!  I assume this is against the prior version of the
> patch, though, not the one I just posted with updates for contrib.
> Do you want to look at those?
> 
>   regards, tom lane

No, I reviewed the latest v4 patch right after you submitted it.

-- 
Best regards,
Aleksander Alekseev




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-09-12 Thread Tom Lane
Aleksander Alekseev  writes:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:            tested, passed

> Also I didn't manage to find any typos or obvious mistakes in the code.

Thanks for reviewing!  I assume this is against the prior version of the
patch, though, not the one I just posted with updates for contrib.
Do you want to look at those?

regards, tom lane




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-09-12 Thread Aleksander Alekseev
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:            tested, passed

Also I didn't manage to find any typos or obvious mistakes in the code.

The new status of this patch is: Ready for Committer



Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-09-12 Thread Tom Lane
Here's an updated version of this patch that is rebased against HEAD
(no changes there except line numbers) and adds the necessary updates
for contrib.

I'd like to get this committed fairly soon before it bit-rots.
Anyone want to review it?

regards, tom lane

diff --git a/contrib/citext/Makefile b/contrib/citext/Makefile
index 563cd22..e32a7de 100644
--- a/contrib/citext/Makefile
+++ b/contrib/citext/Makefile
@@ -3,7 +3,8 @@
 MODULES = citext
 
 EXTENSION = citext
-DATA = citext--1.4.sql citext--1.3--1.4.sql \
+DATA = citext--1.4.sql citext--1.4--1.5.sql \
+	citext--1.3--1.4.sql \
 	citext--1.2--1.3.sql citext--1.1--1.2.sql \
 	citext--1.0--1.1.sql citext--unpackaged--1.0.sql
 PGFILEDESC = "citext - case-insensitive character string data type"
diff --git a/contrib/citext/citext--1.4--1.5.sql b/contrib/citext/citext--1.4--1.5.sql
new file mode 100644
index 000..97942cb
--- /dev/null
+++ b/contrib/citext/citext--1.4--1.5.sql
@@ -0,0 +1,14 @@
+/* contrib/citext/citext--1.4--1.5.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION citext UPDATE TO '1.5'" to load this file. \quit
+
+ALTER OPERATOR <= (citext, citext) SET (
+RESTRICT   = scalarlesel,
+JOIN   = scalarlejoinsel
+);
+
+ALTER OPERATOR >= (citext, citext) SET (
+RESTRICT   = scalargesel,
+JOIN   = scalargejoinsel
+);
diff --git a/contrib/citext/citext.control b/contrib/citext/citext.control
index 17fce4e..4cd6e09 100644
--- a/contrib/citext/citext.control
+++ b/contrib/citext/citext.control
@@ -1,5 +1,5 @@
 # citext extension
 comment = 'data type for case-insensitive character strings'
-default_version = '1.4'
+default_version = '1.5'
 module_pathname = '$libdir/citext'
 relocatable = true
diff --git a/contrib/cube/Makefile b/contrib/cube/Makefile
index be7a1bc..244c1d9 100644
--- a/contrib/cube/Makefile
+++ b/contrib/cube/Makefile
@@ -4,7 +4,8 @@ MODULE_big = cube
 OBJS= cube.o cubeparse.o $(WIN32RES)
 
 EXTENSION = cube
-DATA = cube--1.2.sql cube--1.1--1.2.sql cube--1.0--1.1.sql \
+DATA = cube--1.2.sql cube--1.2--1.3.sql \
+	cube--1.1--1.2.sql cube--1.0--1.1.sql \
 	cube--unpackaged--1.0.sql
 PGFILEDESC = "cube - multidimensional cube data type"
 
diff --git a/contrib/cube/cube--1.2--1.3.sql b/contrib/cube/cube--1.2--1.3.sql
new file mode 100644
index 000..a688f19
--- /dev/null
+++ b/contrib/cube/cube--1.2--1.3.sql
@@ -0,0 +1,12 @@
+/* contrib/cube/cube--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION cube UPDATE TO '1.3'" to load this file. \quit
+
+ALTER OPERATOR <= (cube, cube) SET (
+	RESTRICT = scalarlesel, JOIN = scalarlejoinsel
+);
+
+ALTER OPERATOR >= (cube, cube) SET (
+	RESTRICT = scalargesel, JOIN = scalargejoinsel
+);
diff --git a/contrib/cube/cube.control b/contrib/cube/cube.control
index b03cfa0..af062d4 100644
--- a/contrib/cube/cube.control
+++ b/contrib/cube/cube.control
@@ -1,5 +1,5 @@
 # cube extension
 comment = 'data type for multidimensional cubes'
-default_version = '1.2'
+default_version = '1.3'
 module_pathname = '$libdir/cube'
 relocatable = true
diff --git a/contrib/hstore/Makefile b/contrib/hstore/Makefile
index 311cc09..ab7fef3 100644
--- a/contrib/hstore/Makefile
+++ b/contrib/hstore/Makefile
@@ -5,7 +5,8 @@ OBJS = hstore_io.o hstore_op.o hstore_gist.o hstore_gin.o hstore_compat.o \
 	$(WIN32RES)
 
 EXTENSION = hstore
-DATA = hstore--1.4.sql hstore--1.3--1.4.sql hstore--1.2--1.3.sql \
+DATA = hstore--1.4.sql hstore--1.4--1.5.sql \
+	hstore--1.3--1.4.sql hstore--1.2--1.3.sql \
 	hstore--1.1--1.2.sql hstore--1.0--1.1.sql \
 	hstore--unpackaged--1.0.sql
 PGFILEDESC = "hstore - key/value pair data type"
diff --git a/contrib/hstore/hstore--1.4--1.5.sql b/contrib/hstore/hstore--1.4--1.5.sql
new file mode 100644
index 000..92c1832
--- /dev/null
+++ b/contrib/hstore/hstore--1.4--1.5.sql
@@ -0,0 +1,14 @@
+/* contrib/hstore/hstore--1.4--1.5.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION hstore UPDATE TO '1.5'" to load this file. \quit
+
+ALTER OPERATOR #<=# (hstore, hstore) SET (
+   RESTRICT = scalarlesel,
+   JOIN = scalarlejoinsel
+);
+
+ALTER OPERATOR #>=# (hstore, hstore) SET (
+   RESTRICT = scalargesel,
+   JOIN = scalargejoinsel
+);
diff --git a/contrib/hstore/hstore.control b/contrib/hstore/hstore.control
index f99a937..8a71947 100644
--- a/contrib/hstore/hstore.control
+++ b/contrib/hstore/hstore.control
@@ -1,5 +1,5 @@
 # hstore extension
 comment = 'data type for storing sets of (key, value) pairs'
-default_version = '1.4'
+default_version = '1.5'
 module_pathname = '$libdir/hstore'
 relocatable = true
diff --git a/contrib/isn/Makefile b/contrib/isn/Makefile
index 9543a4b..ab6b175 100644
--- a/contrib/isn/Makefile
+++ b/contrib/isn/Makefile
@@ -3,7 +3,8 @@
 MODULES = isn
 
 EXTENSION = isn
-DATA = isn--1.1.sql isn--1.0--1.1.sql isn--unpackaged--1

Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-09 Thread Kuntal Ghosh
On Fri, Jul 7, 2017 at 2:53 AM, Tom Lane  wrote:
> Kuntal Ghosh  writes:
Wow. Thank you for the wonderful explanation.

>> On Thu, Jul 6, 2017 at 3:45 AM, Tom Lane  wrote:
>
>> if histogram_bounds are assigned as,
>> {0,9,19,29,39,49,59,69,79,89,99,109,119,129,139,...}
>> ...
>> So, if val=low, then histfrac = (bucket_current - 1)/num_of_buckets
>> which means it assumes val is included in the previous bucket.
>
> I looked at that again and realized that one of the answers I was missing
> lies in the behavior of analyze.c's compute_scalar_stats().  When it has
> collected "nvals" values and wants to make a histogram containing
> "num_hist" entries, it does this:
>
>  * The object of this loop is to copy the first and last values[]
>  * entries along with evenly-spaced values in between.  So the
>  * i'th value is values[(i * (nvals - 1)) / (num_hist - 1)].
>
> (where i runs from 0 to num_hist - 1).  Because the "/" denotes integer
> division, this effectively means that for all but the first entry,
> we are taking the last value out of the corresponding bucket of samples.
> That's obviously true for the last histogram entry, which will be the very
> last sample value, and it's also true for earlier ones --- except for the
> *first* histogram entry, which is necessarily the first one in its bucket.
> So the first histogram bucket is actually a tad narrower than the others,
> which is visible in the typical data you showed above: 0 to 9 is only 9
> counts wide, but all the remaining buckets are 10 counts wide.  That
> explains why we need a scale adjustment just in the first bucket.
>
Agreed. But I have some doubts regarding the amount of adjustment needed.
+  if (i == 1)
+  histfrac += eq_selec * (1.0 - binfrac);
For the predicate x<=p, when p lies in the first bucket, the following is
the amount by which binfrac is off from the actual value.
(Assume the same histogram boundaries, i.e., 0,9,19,...)

For p=0, (1/10)-(0/9) = 1/10 * (1 - 0)
For p=1, (2/10)-(1/9) = 1/10 * (1 - 1/9)
For p=2, (3/10)-(2/9) = 1/10 * (1 - 2/9)
For p=3, (4/10)-(3/9) = 1/10 * (1 - 3/9)

In general, the error is 1/(high + 1 - low) * (1.0 - binfrac), and the
difference in histfrac is
(1.0 - binfrac) / (high + 1 - low) / num_of_hist_buckets.

So, the above adjustment for the first bucket looks correct when
otherdistinct ~ (high + 1 - low) * num_of_hist_buckets.

Is that always true, or am I missing something?
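The identity behind the p=0..3 list above can be checked mechanically. Here is a quick sketch (hypothetical helper code, just replaying the quoted numbers for the 0,9,19,... boundaries with exact rational arithmetic):

```python
from fractions import Fraction as F

# For x <= p with p in the first bucket (low=0, high=9) of the
# {0,9,19,...} histogram, the claimed error in binfrac is
# (p+1)/10 - p/9 == 1/10 * (1 - p/9), exactly as listed for p=0..3.
low, high = 0, 9
for p in range(low, high + 1):
    error = F(p + 1, 10) - F(p, 9)       # actual fraction minus interpolated
    binfrac = F(p - low, high - low)     # linear interpolation within bucket
    assert error == F(1, high + 1 - low) * (1 - binfrac)
```

Every p in the first bucket satisfies the identity, which is where the 1/(high + 1 - low) factor in the claimed adjustment comes from.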

> I think I'm also beginning to grasp why scalarineqsel's basic estimation
> process produces an estimate for "x <= p" rather than some other condition
> such as "x < p".  For clarity, imagine that we're given p equal to the
> last histogram entry.  If the test operator is in fact "<=", then it will
> say "true" for every histogram entry, and we'll fall off the end of the
> histogram and return 1.0, which is correct.  If the test operator is "<",
> it will return "true" for all but the last entry, so that we end up with
> "i" equal to sslot.nvalues - 1 --- but we will compute binfrac = 1.0,
> if convert_to_scalar() produces sane answers, again resulting in
> histfrac = 1.0.  Similar reasoning applies for ">" and ">=", so that in
> all cases we arrive at histfrac = 1.0 if p equals the last histogram
> entry.  More generally, if p is equal to some interior histogram entry,
> say the k'th one out of n total, we end up with histfrac = (k-1)/(n-1)
> no matter which operator we probe with, assuming that convert_to_scalar()
> correctly gives us a binfrac of 0.0 or 1.0 depending on which of the
> adjacent bins we end up examining.  So, remembering that the histogram
> entries occupy the right ends of their buckets, the proper interpretation
> of that is that it's the probability of "x <= p".  (This is wrong for the
> first histogram entry, but that's why we need a correction there.)
>
True. In general, histfrac = (k-1)/(n-1) + binfrac, where binfrac
depends on the linear interpolation.

For p equal to some histogram boundary, it'll always be the probability
of "x<=p". When p lies inside a bucket and we have a flat distribution,
it is likewise the probability of "x<=p". But when we have high variance
inside a bucket and p lies inside it, linear interpolation fails to
capture the actual probability. As you've mentioned earlier, though, I
guess that's not a new problem.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-07 Thread Tom Lane
I wrote:
> I looked at that again and realized that one of the answers I was missing
> lies in the behavior of analyze.c's compute_scalar_stats().

I updated the patch's comments based on this.  I'll just attach the
selfuncs.c part of the patch, since nothing else changed.

regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index e103f5e..3ec629b 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -163,7 +163,7 @@ static double var_eq_non_const(VariableStatData *vardata, Oid operator,
  bool varonleft, bool negate);
 static double ineq_histogram_selectivity(PlannerInfo *root,
 		   VariableStatData *vardata,
-		   FmgrInfo *opproc, bool isgt,
+		   FmgrInfo *opproc, bool isgt, bool iseq,
 		   Datum constval, Oid consttype);
 static double eqjoinsel_inner(Oid operator,
 VariableStatData *vardata1, VariableStatData *vardata2);
@@ -544,18 +544,21 @@ neqsel(PG_FUNCTION_ARGS)
 /*
  *	scalarineqsel		- Selectivity of "<", "<=", ">", ">=" for scalars.
  *
- * This is the guts of both scalarltsel and scalargtsel.  The caller has
- * commuted the clause, if necessary, so that we can treat the variable as
- * being on the left.  The caller must also make sure that the other side
- * of the clause is a non-null Const, and dissect same into a value and
- * datatype.
+ * This is the guts of scalarltsel/scalarlesel/scalargtsel/scalargesel.
+ * The isgt and iseq flags distinguish which of the four cases apply.
+ *
+ * The caller has commuted the clause, if necessary, so that we can treat
+ * the variable as being on the left.  The caller must also make sure that
+ * the other side of the clause is a non-null Const, and dissect that into
+ * a value and datatype.  (This definition simplifies some callers that
+ * want to estimate against a constructed value instead of a Const node.)
  *
  * This routine works for any datatype (or pair of datatypes) known to
  * convert_to_scalar().  If it is applied to some other datatype,
  * it will return a default estimate.
  */
 static double
-scalarineqsel(PlannerInfo *root, Oid operator, bool isgt,
+scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
 			  VariableStatData *vardata, Datum constval, Oid consttype)
 {
 	Form_pg_statistic stats;
@@ -587,7 +590,8 @@ scalarineqsel(PlannerInfo *root, Oid operator, bool isgt,
 	 * If there is a histogram, determine which bin the constant falls in, and
 	 * compute the resulting contribution to selectivity.
 	 */
-	hist_selec = ineq_histogram_selectivity(root, vardata, &opproc, isgt,
+	hist_selec = ineq_histogram_selectivity(root, vardata,
+			&opproc, isgt, iseq,
 			constval, consttype);
 
 	/*
@@ -757,7 +761,8 @@ histogram_selectivity(VariableStatData *vardata, FmgrInfo *opproc,
  *	ineq_histogram_selectivity	- Examine the histogram for scalarineqsel
  *
  * Determine the fraction of the variable's histogram population that
- * satisfies the inequality condition, ie, VAR < CONST or VAR > CONST.
+ * satisfies the inequality condition, ie, VAR < (or <=, >, >=) CONST.
+ * The isgt and iseq flags distinguish which of the four cases apply.
  *
  * Returns -1 if there is no histogram (valid results will always be >= 0).
  *
@@ -768,7 +773,7 @@ histogram_selectivity(VariableStatData *vardata, FmgrInfo *opproc,
 static double
 ineq_histogram_selectivity(PlannerInfo *root,
 		   VariableStatData *vardata,
-		   FmgrInfo *opproc, bool isgt,
+		   FmgrInfo *opproc, bool isgt, bool iseq,
 		   Datum constval, Oid consttype)
 {
 	double		hist_selec;
@@ -795,11 +800,17 @@ ineq_histogram_selectivity(PlannerInfo *root,
 		if (sslot.nvalues > 1)
 		{
 			/*
-			 * Use binary search to find proper location, ie, the first slot
-			 * at which the comparison fails.  (If the given operator isn't
-			 * actually sort-compatible with the histogram, you'll get garbage
-			 * results ... but probably not any more garbage-y than you would
-			 * from the old linear search.)
+			 * Use binary search to find the desired location, namely the
+			 * right end of the histogram bin containing the comparison value,
+			 * which is the leftmost entry for which the comparison operator
+			 * succeeds (if isgt) or fails (if !isgt).  (If the given operator
+			 * isn't actually sort-compatible with the histogram, you'll get
+			 * garbage results ... but probably not any more garbage-y than
+			 * you would have from the old linear search.)
+			 *
+			 * In this loop, we pay no attention to whether the operator iseq
+			 * or not; that detail will be mopped up below.  (We cannot tell,
+			 * anyway, whether the operator thinks the values are equal.)
 			 *
 			 * If the binary search accesses the first or last histogram
 			 * entry, we try to replace that endpoint with the true column min
@@ -864,25 +875,74 @@ ineq_histogram_selectivity(PlannerInfo *root,
 
 			if (lobound <= 0)
 			{
-/* 

Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-06 Thread Tom Lane
Kuntal Ghosh  writes:
> On Thu, Jul 6, 2017 at 3:45 AM, Tom Lane  wrote:
> + * In the first bin (i==1), add a fudge factor that ensures
> + * that histfrac is at least eq_selec.  We do this because we
> + * know that the first histogram entry does satisfy the
> + * inequality (if !isgt) or not satisfy it (if isgt), so our
> + * estimate here should certainly not be zero even if binfrac
> + * is zero.  (XXX experimentally this is the correct way to do
> + * it, but why isn't it a linear adjustment across the whole
> + * histogram rather than just the first bin?)

> Given that the values are distinct (I have some doubts for the real-number case)

Well, really, we are *always* dealing with discrete distributions here.

> if histogram_bounds are assigned as,
> {0,9,19,29,39,49,59,69,79,89,99,109,119,129,139,...}
> ...
> So, if val=low, then histfrac = (bucket_current - 1)/num_of_buckets
> which means it assumes val is included in the previous bucket.

I looked at that again and realized that one of the answers I was missing
lies in the behavior of analyze.c's compute_scalar_stats().  When it has
collected "nvals" values and wants to make a histogram containing
"num_hist" entries, it does this:

 * The object of this loop is to copy the first and last values[]
 * entries along with evenly-spaced values in between.  So the
 * i'th value is values[(i * (nvals - 1)) / (num_hist - 1)].

(where i runs from 0 to num_hist - 1).  Because the "/" denotes integer
division, this effectively means that for all but the first entry,
we are taking the last value out of the corresponding bucket of samples.
That's obviously true for the last histogram entry, which will be the very
last sample value, and it's also true for earlier ones --- except for the
*first* histogram entry, which is necessarily the first one in its bucket.
So the first histogram bucket is actually a tad narrower than the others,
which is visible in the typical data you showed above: 0 to 9 is only 9
counts wide, but all the remaining buckets are 10 counts wide.  That
explains why we need a scale adjustment just in the first bucket.

I think I'm also beginning to grasp why scalarineqsel's basic estimation
process produces an estimate for "x <= p" rather than some other condition
such as "x < p".  For clarity, imagine that we're given p equal to the
last histogram entry.  If the test operator is in fact "<=", then it will
say "true" for every histogram entry, and we'll fall off the end of the
histogram and return 1.0, which is correct.  If the test operator is "<",
it will return "true" for all but the last entry, so that we end up with
"i" equal to sslot.nvalues - 1 --- but we will compute binfrac = 1.0,
if convert_to_scalar() produces sane answers, again resulting in
histfrac = 1.0.  Similar reasoning applies for ">" and ">=", so that in
all cases we arrive at histfrac = 1.0 if p equals the last histogram
entry.  More generally, if p is equal to some interior histogram entry,
say the k'th one out of n total, we end up with histfrac = (k-1)/(n-1)
no matter which operator we probe with, assuming that convert_to_scalar()
correctly gives us a binfrac of 0.0 or 1.0 depending on which of the
adjacent bins we end up examining.  So, remembering that the histogram
entries occupy the right ends of their buckets, the proper interpretation
of that is that it's the probability of "x <= p".  (This is wrong for the
first histogram entry, but that's why we need a correction there.)
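The operator-independence argued above can be sanity-checked with a toy model of the binary search (a hedged sketch, not the selfuncs.c implementation; the histfrac helper is hypothetical):

```python
import operator

# Histogram entries sit at the right ends of their buckets.
bounds = [0, 9, 19, 29, 39, 49, 59, 69, 79, 89, 99]
n = len(bounds)

def histfrac(op, p):
    """Fraction of the histogram satisfying 'x op p', mimicking the
    search for the leftmost entry where op(entry, p) fails."""
    fails = [i for i, b in enumerate(bounds) if not op(b, p)]
    if not fails:
        return 1.0                   # fell off the end: histfrac = 1.0
    i = max(fails[0], 1)
    low, high = bounds[i - 1], bounds[i]
    binfrac = min(max((p - low) / (high - low), 0.0), 1.0)
    return ((i - 1) + binfrac) / (n - 1)

# For p equal to the k'th entry, "<" and "<=" give the same histfrac,
# namely k / (n - 1): the probability of "x <= p".
for k, p in enumerate(bounds):
    assert histfrac(operator.lt, p) == histfrac(operator.le, p) == k / (n - 1)
```

Whichever operator probes, a boundary value lands on the same histfrac, exactly as described.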

regards, tom lane




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-06 Thread Kuntal Ghosh
On Thu, Jul 6, 2017 at 3:45 AM, Tom Lane  wrote:
> I wrote:
>> (Pokes at it some more...) Oh, interesting: it behaves that way except
>> when p is exactly the lowest histogram entry.
>
+ /*
+ * In the first bin (i==1), add a fudge factor that ensures
+ * that histfrac is at least eq_selec.  We do this because we
+ * know that the first histogram entry does satisfy the
+ * inequality (if !isgt) or not satisfy it (if isgt), so our
+ * estimate here should certainly not be zero even if binfrac
+ * is zero.  (XXX experimentally this is the correct way to do
+ * it, but why isn't it a linear adjustment across the whole
+ * histogram rather than just the first bin?)
+ */
Given that the values are distinct (I have some doubts for the real-number case):

if histogram_bounds are assigned as,
{0,9,19,29,39,49,59,69,79,89,99,109,119,129,139,...}

I think the buckets are defined as,
0 < bucket1 <= 9
9 < bucket2 <=19
19 < bucket3 <= 29 and so on.

Because histfrac is calculated as follows:

histfrac = (double) (bucket_current - 1) + (val - low) / (high - low);
(where bucket_current is obtained by doing a binary search on
histogram_bounds.)
histfrac /= (double) (nvalues - 1);

So, if val=low, then histfrac = (bucket_current - 1)/num_of_buckets,
which means it assumes val is included in the previous bucket.

This means that it always fails to calculate the selectivity for the
lowest histogram boundary. Hence, we need the adjustment only for the
first bucket.
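A toy version of the calculation just described (using Python's bisect as a stand-in for the C binary search; the histfrac helper is hypothetical, the bounds list is the example's) shows the zero estimate at the lowest boundary:

```python
import bisect

# Histogram boundaries as in the example; entries are right bucket ends.
bounds = [0, 9, 19, 29, 39, 49, 59, 69, 79, 89, 99]

def histfrac(val):
    # bucket_current found by binary search, then linear interpolation
    cur = max(bisect.bisect_left(bounds, val), 1)
    low, high = bounds[cur - 1], bounds[cur]
    return ((cur - 1) + (val - low) / (high - low)) / (len(bounds) - 1)

# val equal to a bucket's low end is counted with the previous bucket:
print(histfrac(9))    # 0.1 -- right end of the first bucket
# and the lowest boundary gets selectivity zero, even though "x <= 0"
# plainly matches some rows -- hence the first-bucket adjustment:
print(histfrac(0))    # 0.0
```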

Do you think my reasoning justifies your concern?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-05 Thread Tom Lane
I wrote:
> (Pokes at it some more...) Oh, interesting: it behaves that way except
> when p is exactly the lowest histogram entry.

Here's a revised version that addresses that point and cleans up some
other minor details about treatment of cases near the histogram endpoints.
I'm still pretty unclear on exactly why it works (see XXX comments in
patch), but testing shows that it does work very well indeed given a flat
data distribution.  On less-flat distributions, you get some estimation
errors, but that's neither new nor surprising.

regards, tom lane

diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 333a36c..e9ff110 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -792,8 +792,7 @@ CREATE OPERATOR < (
It is important to specify the correct commutator and negator operators,
as well as suitable restriction and join selectivity
functions, otherwise the optimizer will be unable to make effective
-   use of the index.  Note that the less-than, equal, and
-   greater-than cases should use different selectivity functions.
+   use of the index.
   
 
   
diff --git a/doc/src/sgml/xoper.sgml b/doc/src/sgml/xoper.sgml
index 8568e21..d484d80 100644
--- a/doc/src/sgml/xoper.sgml
+++ b/doc/src/sgml/xoper.sgml
@@ -242,20 +242,11 @@ column OP constant
 
  eqsel for =
  neqsel for <>
- scalarltsel for < or <=
- scalargtsel for > or >=
-   
-It might seem a little odd that these are the categories, but they
-make sense if you think about it.  = will typically accept only
-a small fraction of the rows in a table; <> will typically reject
-only a small fraction.  < will accept a fraction that depends on
-where the given constant falls in the range of values for that table
-column (which, it just so happens, is information collected by
-ANALYZE and made available to the selectivity estimator).
-<= will accept a slightly larger fraction than < for the same
-comparison constant, but they're close enough to not be worth
-distinguishing, especially since we're not likely to do better than a
-rough guess anyhow.  Similar remarks apply to > and >=.
+ scalarltsel for <
+ scalarlesel for <=
+ scalargtsel for >
+ scalargesel for >=
+

 

@@ -267,10 +258,12 @@ column OP constant

 

-You can use scalarltsel and scalargtsel for comparisons on data types that
-have some sensible means of being converted into numeric scalars for
-range comparisons.  If possible, add the data type to those understood
-by the function convert_to_scalar() in src/backend/utils/adt/selfuncs.c.
+You can use scalarltsel, scalarlesel,
+scalargtsel and scalargesel for comparisons on
+data types that have some sensible means of being converted into numeric
+scalars for range comparisons.  If possible, add the data type to those
+understood by the function convert_to_scalar() in
+src/backend/utils/adt/selfuncs.c.
 (Eventually, this function should be replaced by per-data-type functions
 identified through a column of the pg_type system catalog; but that hasn't happened
 yet.)  If you do not do this, things will still work, but the optimizer's
@@ -310,8 +303,10 @@ table1.column1 OP table2.column2
  
   eqjoinsel for =
   neqjoinsel for <>
-  scalarltjoinsel for < or <=
-  scalargtjoinsel for > or >=
+  scalarltjoinsel for <
+  scalarlejoinsel for <=
+  scalargtjoinsel for >
+  scalargejoinsel for >=
   areajoinsel for 2D area-based comparisons
   positionjoinsel for 2D position-based comparisons
   contjoinsel for 2D containment-based comparisons
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 9d34025..b4cbc34 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -71,7 +71,7 @@ static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
  *
  * We also recognize "range queries", such as "x > 34 AND x < 42".  Clauses
  * are recognized as possible range query components if they are restriction
- * opclauses whose operators have scalarltsel() or scalargtsel() as their
+ * opclauses whose operators have scalarltsel or a related function as their
  * restriction selectivity estimator.  We pair up clauses of this form that
  * refer to the same variable.  An unpairable clause of this kind is simply
  * multiplied into the selectivity product in the normal way.  But when we
@@ -92,8 +92,8 @@ static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
  * A free side-effect is that we can recognize redundant inequalities such
  * as "x < 4 AND x < 5"; only the tighter constraint will be counted.
  *
- * Of course this is all very dependent on the behavior of
- * scalarltsel/scalargtsel; perhaps some day we can generalize the approach.
+ * Of course this is all very dependent on the behavior 

Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-04 Thread Kuntal Ghosh
On Tue, Jul 4, 2017 at 10:56 PM, Kuntal Ghosh
 wrote:
> On Tue, Jul 4, 2017 at 9:20 PM, Tom Lane  wrote:
>> Kuntal Ghosh  writes:
>>> On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane  wrote:
>>>> ... I have to admit that I've failed to wrap my brain around exactly
>>>> why it's correct.  The arguments that I've constructed so far seem to
>>>> point in the direction of applying the opposite correction, which is
>>>> demonstrably wrong.  Perhaps someone whose college statistics class
>>>> wasn't quite so long ago can explain this satisfactorily?
>>
>>> I guess that you're referring to the last case, i.e.
>>> explain analyze select * from tenk1 where thousand between 10 and 10;
>>
>> No, the thing that is bothering me is why it seems to be correct to
>> apply a positive correction for ">=", a negative correction for "<",
>> and no correction for "<=" or ">".  That seems weird and I can't
>> construct a plausible explanation for it.  I think it might be a
>> result of the fact that, given a discrete distribution rather than
>> a continuous one, the histogram boundary values should be understood
>> as having some "width" rather than being zero-width points on the
>> distribution axis.  But the arguments I tried to fashion on that
>> basis led to other rules that didn't actually work.
>>
>> It's also possible that this logic is in fact wrong and it just happens
>> to give the right answer anyway for uniformly-distributed cases.
>>
> So, here are two points I think:
> 1. When should we apply(add/subtract) the correction?
> 2. What should be the correction?
>
> The first point:
> there can be further two cases,
> a) histfrac - actual_selectivity(p<=0) = 0.
Sorry for the typo. I meant (p<=10) for all the cases.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-04 Thread Kuntal Ghosh
On Tue, Jul 4, 2017 at 9:20 PM, Tom Lane  wrote:
> Kuntal Ghosh  writes:
>> On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane  wrote:
>>> ... I have to admit that I've failed to wrap my brain around exactly
>>> why it's correct.  The arguments that I've constructed so far seem to
>>> point in the direction of applying the opposite correction, which is
>>> demonstrably wrong.  Perhaps someone whose college statistics class
>>> wasn't quite so long ago can explain this satisfactorily?
>
>> I guess that you're referring to the last case, i.e.
>> explain analyze select * from tenk1 where thousand between 10 and 10;
>
> No, the thing that is bothering me is why it seems to be correct to
> apply a positive correction for ">=", a negative correction for "<",
> and no correction for "<=" or ">".  That seems weird and I can't
> construct a plausible explanation for it.  I think it might be a
> result of the fact that, given a discrete distribution rather than
> a continuous one, the histogram boundary values should be understood
> as having some "width" rather than being zero-width points on the
> distribution axis.  But the arguments I tried to fashion on that
> basis led to other rules that didn't actually work.
>
> It's also possible that this logic is in fact wrong and it just happens
> to give the right answer anyway for uniformly-distributed cases.
>
So, here are two points, I think:
1. When should we apply (add/subtract) the correction?
2. What should be the correction?

On the first point:
there can be two further cases,
a) histfrac - actual_selectivity(p<=0) = 0.
For this case, I have an explanation of why applying the correction in
the above way works correctly. (This is the case with tenk1.)
Since histfrac correctly calculates the selectivity for (p<=0),
hist_selec is already right for "<=" and ">" (isgt is true); hence
there is no need to apply a correction for those. A negative
correction is needed for the less-than operator (sel(<=10) -
sel(=10)). Similarly, a positive correction is needed for the
greater-than-or-equal operator (sel(>10) + sel(=10)).

b) histfrac - actual_selectivity(p<=0) != 0.
This is possible when we have high variance in the histogram buckets.
I suspect things may go wrong here. Please consider the following
example:
UPDATE tenk1 SET thousand=11 where thousand=10;
VACUUM ANALYZE tenk1;
explain analyze select * from tenk1 where thousand between 10 and 10;
 Bitmap Heap Scan on tenk1  (cost=4.39..39.52 rows=10 width=244)
(actual time=0.018..0.018 rows=0 loops=1)

On the second point:
in this case, the correction is correctly calculated as the
selectivity of (p=0) because of the uniform distribution; hence it
works. When the distribution is not uniform, the current calculation
of the correction may be an over-estimate most of the time.
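Under the uniform-distribution assumption, the correction arithmetic
described above can be sketched as follows (a toy illustration of the
relationships between the four selectivities, not PostgreSQL code):

```python
def selectivities(ndistinct, sel_le):
    """Derive sel(<), sel(>), sel(>=) from an exact sel(<=), assuming a
    uniform discrete distribution so that sel(=) = 1/ndistinct.
    A toy sketch of the correction arithmetic, not PostgreSQL code."""
    sel_eq = 1.0 / ndistinct       # every distinct value equally likely
    sel_lt = sel_le - sel_eq       # negative correction for "<"
    sel_gt = 1.0 - sel_le          # no correction needed for ">"
    sel_ge = sel_gt + sel_eq       # positive correction for ">="
    return sel_lt, sel_gt, sel_ge

# tenk1.thousand: 1000 distinct values, sel(x <= 10) = 11/1000
lt, gt, ge = selectivities(1000, 11 / 1000)
# lt ~ 0.010 (100 rows), gt ~ 0.989 (9890 rows), ge ~ 0.990 (9900 rows)
```

These match the row counts in Tom's EXPLAIN ANALYZE examples for
thousand < 10, > 10, and >= 10 respectively.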

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-04 Thread Tom Lane
I wrote:
> No, the thing that is bothering me is why it seems to be correct to
> apply a positive correction for ">=", a negative correction for "<",
> and no correction for "<=" or ">".  That seems weird and I can't
> construct a plausible explanation for it.

After further thought, I can put a little more clarity to this, but
it's still not really resolved.  It's easily shown by experiment that
the existing code correctly computes the probability that "x <= p"
where p is the given probe value.  It uses that value as-is for the <
and <= cases, and 1 minus that value for > and >=.  From this statement,
it's clear why the above is the right way to correct matters.  What I
find remarkable is that this is what the loop computes regardless of
which of the four operators is used to probe, and regardless of whether
the probe value p is exactly equal to some histogram boundary value.
That doesn't seem intuitive at all --- when p does match a histogram
entry, you'd think it would matter which operator you probe with.

(Pokes at it some more...) Oh, interesting: it behaves that way except
when p is exactly the lowest histogram entry.  The code is probably
blowing off that edge case without enough thought.
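The observation that the loop computes P(x <= p) regardless of the
probe operator, except at the lowest histogram entry, can be
illustrated with a toy linear-interpolation sketch (a hypothetical
helper, not the actual ineq_histogram_selectivity code):

```python
from bisect import bisect_right

def hist_frac_le(bounds, p):
    """Estimate P(x <= p) from sorted equi-depth histogram bounds,
    interpolating linearly inside the containing bucket.  A toy sketch
    of what the histogram search loop effectively computes; not the
    PostgreSQL implementation."""
    nbuckets = len(bounds) - 1
    if p < bounds[0]:
        return 0.0
    if p >= bounds[-1]:
        return 1.0
    i = bisect_right(bounds, p) - 1          # bucket containing p
    frac = (p - bounds[i]) / (bounds[i + 1] - bounds[i])
    return (i + frac) / nbuckets

bounds = list(range(0, 101, 10))   # 10 equal-width buckets over 0..100
hist_frac_le(bounds, 5)            # 0.05: interpolated mid-bucket
hist_frac_le(bounds, 10)           # 0.10: same answer whichever of
                                   # <, <=, >, >= probes with p = 10
hist_frac_le(bounds, 0)            # 0.0: the lowest-entry edge case,
                                   # where the true P(x <= 0) is nonzero
```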

regards, tom lane




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-04 Thread Tom Lane
Kuntal Ghosh  writes:
> On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane  wrote:
>> ... I have to admit that I've failed to wrap my brain around exactly
>> why it's correct.  The arguments that I've constructed so far seem to
>> point in the direction of applying the opposite correction, which is
>> demonstrably wrong.  Perhaps someone whose college statistics class
>> wasn't quite so long ago can explain this satisfactorily?

> I guess that you're referring to the last case, i.e.
> explain analyze select * from tenk1 where thousand between 10 and 10;

No, the thing that is bothering me is why it seems to be correct to
apply a positive correction for ">=", a negative correction for "<",
and no correction for "<=" or ">".  That seems weird and I can't
construct a plausible explanation for it.  I think it might be a
result of the fact that, given a discrete distribution rather than
a continuous one, the histogram boundary values should be understood
as having some "width" rather than being zero-width points on the
distribution axis.  But the arguments I tried to fashion on that
basis led to other rules that didn't actually work.

It's also possible that this logic is in fact wrong and it just happens
to give the right answer anyway for uniformly-distributed cases.

regards, tom lane




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-04 Thread Kuntal Ghosh
On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane  wrote:

> Aside from the mind-bendingly-tedious changes in pg_operator.h, the meat
> of the patch is in selfuncs.c's ineq_histogram_selectivity(), which now
> applies a correction for equal values in the cases where we were getting
> it wrong before.  While this logic seems experimentally correct (see
> above), I have to admit that I've failed to wrap my brain around exactly
> why it's correct.  The arguments that I've constructed so far seem to
> point in the direction of applying the opposite correction, which is
> demonstrably wrong.  Perhaps someone whose college statistics class
> wasn't quite so long ago can explain this satisfactorily?
>
I guess that you're referring to the last case, i.e.
explain analyze select * from tenk1 where thousand between 10 and 10;

IMHO, the following is what I've understood.
The predicate is internally translated to predicate A (p >= 10) and
predicate B (p <= 10).
In ineq_histogram_selectivity:
for predicate A, hist_selec = p
for predicate B, hist_selec = 1 - p
In clauselist_selectivity:
we calculate the selectivity as ((p) + (1 - p)) - 1 = 0, clamped to 1.0e-10.

After your changes:
in ineq_histogram_selectivity,
for predicate A, hist_selec = p + correction (since isgt = iseq)
for predicate B, hist_selec = 1 - p
In clauselist_selectivity,
we calculate the selectivity as ((p + correction) + (1 - p)) - 1 = correction.

The correction is calculated as 1 / num_distinct_values = 0.001.

Since the thousand column of tenk1 is uniformly distributed, this
turns out to be the exact selectivity (rows = 0.001 * 10000 = 10).
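The arithmetic above can be recreated numerically (a toy calculation;
p stands in for the histogram fraction at the probe value, and the row
count of tenk1 is 10000):

```python
# Toy recreation of the selectivity arithmetic above; p is a stand-in
# histogram fraction for sel(x <= 10) on tenk1.thousand, not a value
# taken from the actual planner.
p = 0.011
correction = 0.001   # 1 / num_distinct_values

# Before the patch the two clause estimates cancel exactly:
before = (p + (1 - p)) - 1                   # 0, clamped to 1.0e-10
rows_before = max(before, 1.0e-10) * 10000   # effectively 0 rows

# With the patch, predicate A carries the +correction:
after = ((p + correction) + (1 - p)) - 1     # = correction
rows_after = after * 10000                   # ~10 rows, the true count
```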

Thoughts?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-04 Thread Tom Lane
Ashutosh Bapat  writes:
> On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane  wrote:
>> regression=# explain analyze select * from tenk1 where thousand < 10;
>> before:
>> Bitmap Heap Scan on tenk1  (cost=5.14..241.38 rows=110 width=244) (actual 
>> time=0.121..0.623 rows=100 loops=1)
>> with patch:
>> Bitmap Heap Scan on tenk1  (cost=5.06..227.42 rows=100 width=244) (actual 
>> time=0.054..0.300 rows=100 loops=1)

> It's expected that the estimates will change with this patch. But I am
> wondering why the actual times should vary so much. Maybe that's just
> accidental. But the actual timings are consistently lower with the
> patch, except the last one.

I didn't intend those examples to illustrate anything except row counts,
so I didn't worry about making the timings comparable.  Some of the
examples were probably the first query in a session and so would've faced
cache-populating costs the others didn't.  Also, the "before" examples
were all done on 9.6 not HEAD.

regards, tom lane




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-04 Thread Ashutosh Bapat
On Tue, Jul 4, 2017 at 9:23 AM, Tom Lane  wrote:
>
> regression=# explain analyze select * from tenk1 where thousand < 10;
> before:
>  Bitmap Heap Scan on tenk1  (cost=5.14..241.38 rows=110 width=244) (actual 
> time=0.121..0.623 rows=100 loops=1)
> with patch:
>  Bitmap Heap Scan on tenk1  (cost=5.06..227.42 rows=100 width=244) (actual 
> time=0.054..0.300 rows=100 loops=1)

It's expected that the estimates will change with this patch. But I am
wondering why the actual times should vary so much. Maybe that's just
accidental. But the actual timings are consistently lower with the
patch, except the last one.

>
> regression=# explain analyze select * from tenk1 where thousand between 10 
> and 10;
> before:
>  Index Scan using tenk1_thous_tenthous on tenk1  (cost=0.29..8.30 rows=1 
> width=244) (actual time=0.041..0.112 rows=10 loops=1)
> with patch:
>  Bitmap Heap Scan on tenk1  (cost=4.39..39.52 rows=10 width=244) (actual 
> time=0.074..0.142 rows=10 loops=1)

The actual time has increased even though estimation is correct.

Those differences may just vanish if we take the average of multiple
runs; they are probably not important to ponder over.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




[HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-03 Thread Tom Lane
I was reminded by bug #14729 that we are not very good on corner cases
involving inequality operators, ie < <= > >=.  Historically, the
selectivity functions have simply not distinguished < from <= or >
from >=, arguing that the fraction of the population that satisfies
the "=" aspect can be considered to be vanishingly small, if the
comparison value isn't any of the most-common-values for the variable.
(If it is, the code path that executes the operator against each MCV
will take care of things properly.)  But that isn't really true unless
we're dealing with a continuum of variable values, and in practice
we seldom are.  If "x = const" would estimate a nonzero number of
rows for a given const value, then it follows that we ought to estimate
different numbers of rows for "x < const" and "x <= const", whether or
not the const is one of the MCVs.
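A quick empirical check of that claim, simulating tenk1.thousand
(10000 rows with 1000 distinct values; an assumed stand-in for the
regression table, not real planner code):

```python
# Simulated tenk1.thousand: each value 0..999 appears exactly 10 times.
rows = [i % 1000 for i in range(10000)]
n = len(rows)

sel_lt = sum(v < 10 for v in rows) / n    # 0.010 -> 100 rows
sel_le = sum(v <= 10 for v in rows) / n   # 0.011 -> 110 rows
sel_eq = sum(v == 10 for v in rows) / n   # 0.001 -> 10 rows

# "x < c" and "x <= c" differ by exactly sel(x = c):
assert abs((sel_le - sel_lt) - sel_eq) < 1e-12
```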

Hence, the attached patch, which splits the scalar inequality selectivity
functions into four sets instead of two.  This demonstrably produces
better estimates, for example using the "tenk1" table in the regression
database:

regression=# explain analyze select * from tenk1 where thousand < 10;
before:
 Bitmap Heap Scan on tenk1  (cost=5.14..241.38 rows=110 width=244) (actual 
time=0.121..0.623 rows=100 loops=1)
with patch:
 Bitmap Heap Scan on tenk1  (cost=5.06..227.42 rows=100 width=244) (actual 
time=0.054..0.300 rows=100 loops=1)

regression=# explain analyze select * from tenk1 where thousand <= 10;
before:
 Bitmap Heap Scan on tenk1  (cost=5.14..241.38 rows=110 width=244) (actual 
time=0.120..0.383 rows=110 loops=1)
with patch:
 Bitmap Heap Scan on tenk1  (cost=5.14..241.38 rows=110 width=244) (actual 
time=0.062..0.288 rows=110 loops=1)

regression=# explain analyze select * from tenk1 where thousand > 10;
before:
 Seq Scan on tenk1  (cost=0.00..483.00 rows=9890 width=244) (actual 
time=0.030..6.276 rows=9890 loops=1)
with patch:
 Seq Scan on tenk1  (cost=0.00..483.00 rows=9890 width=244) (actual 
time=0.019..4.881 rows=9890 loops=1)

regression=# explain analyze select * from tenk1 where thousand >= 10;
before:
 Seq Scan on tenk1  (cost=0.00..483.00 rows=9890 width=244) (actual 
time=0.022..5.371 rows=9900 loops=1)
with patch:
 Seq Scan on tenk1  (cost=0.00..483.00 rows=9900 width=244) (actual 
time=0.014..3.783 rows=9900 loops=1)

regression=# explain analyze select * from tenk1 where thousand between 10 and 
11;
before:
 Bitmap Heap Scan on tenk1  (cost=4.39..39.52 rows=10 width=244) (actual 
time=0.082..0.215 rows=20 loops=1)
with patch:
 Bitmap Heap Scan on tenk1  (cost=4.49..70.61 rows=20 width=244) (actual 
time=0.080..0.207 rows=20 loops=1)
   Recheck Cond: ((thousand >= 10) AND (thousand <= 11))

regression=# explain analyze select * from tenk1 where thousand between 10 and 
10;
before:
 Index Scan using tenk1_thous_tenthous on tenk1  (cost=0.29..8.30 rows=1 
width=244) (actual time=0.041..0.112 rows=10 loops=1)
with patch:
 Bitmap Heap Scan on tenk1  (cost=4.39..39.52 rows=10 width=244) (actual 
time=0.074..0.142 rows=10 loops=1)

As these examples show, it's cases with very tight range constraints where
this really makes enough difference to be worth doing.  I believe the
complaint in bug #14729 basically corresponds to the last example, where
the erroneous rowcount estimate is driving a bad choice of join plan.

Aside from the mind-bendingly-tedious changes in pg_operator.h, the meat
of the patch is in selfuncs.c's ineq_histogram_selectivity(), which now
applies a correction for equal values in the cases where we were getting
it wrong before.  While this logic seems experimentally correct (see
above), I have to admit that I've failed to wrap my brain around exactly
why it's correct.  The arguments that I've constructed so far seem to
point in the direction of applying the opposite correction, which is
demonstrably wrong.  Perhaps someone whose college statistics class
wasn't quite so long ago can explain this satisfactorily?

Aside from the missing/inadequate comment about why to apply this
correction, there remains work to update several contrib modules that
reference scalarltsel/scalargtsel.  That could be done separately though.
An extension that doesn't change its <= or >= operators to reference
scalarlesel/scalargesel isn't worse off than before, it's just failing
to benefit from this improvement.

Obviously this is too late for v10; I'll stick it into the next
commitfest.

regards, tom lane

diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 333a36c..e9ff110 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -792,8 +792,7 @@ CREATE OPERATOR < (
It is important to specify the correct commutator and negator operators,
as well as suitable restriction and join selectivity
functions, otherwise the optimizer will be unable to make effective
-   use of the index.  Note that the less-than, equal, and
-   greater-than cases should use different selectivity functions.
+   us