Re: [HACKERS] Selectivity estimation for inet operators

2015-04-01 Thread Tom Lane
Emre Hasegeli e...@hasegeli.com writes:
 [ inet-selfuncs-v14.patch ]

After further reflection I concluded that the best way to deal with the
O(N^2) runtime problem for the join selectivity function was to set a
limit on the number of statistics values we'd consider, as was discussed
awhile back IIRC.  We can easily consider just the first N values of the
MCV arrays, since those are sorted by decreasing frequency anyway.  For
the histogram arrays, taking every K'th value seems like the thing to do.

I made the limit be 1024 elements as that seemed to give an acceptable
maximum runtime (a couple hundred msec on my machine).  We could probably
reduce that if anyone feels the max runtime needs to be less.

I had to drop the idea of breaking out of the histogram loop early as that
didn't play nicely with the decimation logic, unfortunately.

Anyway, pushed.  Thanks for your perseverance on this!

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2015-02-15 Thread Emre Hasegeli
New version of the patch attached with the optimization to break the
loop before looking at all of the histogram values.  I can reduce
join selectivity estimation runtime by reducing the values of the
left hand side or both of the sides, if there is interest.

  Even if the above aspects of the code are really completely right, the
  comments fail to explain why.  I spent a lot of time on the comments,
  but so far as these points are concerned they still only explain what
  is being done and not why it's a useful calculation to make.

 I couldn't write better comments because I don't have strong arguments
 about it.  We can say that we don't try to make use of the both of
 the endpoints, because we don't know how to combine them.  We only use
 the one with matching family and masklen, and when both of them match
 we use the distant one to be on the safer side.

I added two more sentences to explain the calculation.
diff --git a/src/backend/utils/adt/network_selfuncs.c b/src/backend/utils/adt/network_selfuncs.c
index 73fc1ca..51a33c2 100644
--- a/src/backend/utils/adt/network_selfuncs.c
+++ b/src/backend/utils/adt/network_selfuncs.c
@@ -1,32 +1,972 @@
 /*-
  *
  * network_selfuncs.c
  *	  Functions for selectivity estimation of inet/cidr operators
  *
- * Currently these are just stubs, but we hope to do better soon.
+ * This module provides estimators for the subnet inclusion and overlap
+ * operators.  Estimates are based on null fraction, most common values,
+ * and histogram of inet/cidr columns.
  *
  * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
  * IDENTIFICATION
  *	  src/backend/utils/adt/network_selfuncs.c
  *
  *-
  */
 #include postgres.h
 
+#include math.h
+
+#include access/htup_details.h
+#include catalog/pg_operator.h
+#include catalog/pg_statistic.h
 #include utils/inet.h
+#include utils/lsyscache.h
+#include utils/selfuncs.h
 
 
+/* Default selectivity for the inet overlap operator */
+#define DEFAULT_OVERLAP_SEL 0.01
+
+/* Default selectivity for the various inclusion operators */
+#define DEFAULT_INCLUSION_SEL 0.005
+
+/* Default selectivity for specified operator */
+#define DEFAULT_SEL(operator) \
+	((operator) == OID_INET_OVERLAP_OP ? \
+	 DEFAULT_OVERLAP_SEL : DEFAULT_INCLUSION_SEL)
+
+static Selectivity networkjoinsel_inner(Oid operator,
+	 VariableStatData *vardata1, VariableStatData *vardata2);
+static Selectivity networkjoinsel_semi(Oid operator,
+	VariableStatData *vardata1, VariableStatData *vardata2);
+static Selectivity mcv_population(float4 *mcv_numbers, int mcv_nvalues);
+static Selectivity inet_hist_value_sel(Datum *values, int nvalues,
+	Datum constvalue, int opr_codenum);
+static Selectivity inet_mcv_join_sel(Datum *mcv1_values,
+  float4 *mcv1_numbers, int mcv1_nvalues, Datum *mcv2_values,
+  float4 *mcv2_numbers, int mcv2_nvalues, Oid operator);
+static Selectivity inet_mcv_hist_sel(Datum *mcv_values, float4 *mcv_numbers,
+  int mcv_nvalues, Datum *hist_values, int hist_nvalues,
+  int opr_codenum);
+static Selectivity inet_hist_inclusion_join_sel(Datum *hist1_values,
+			 int hist1_nvalues,
+			 Datum *hist2_values, int hist2_nvalues,
+			 int opr_codenum);
+static Selectivity inet_semi_join_sel(Datum lhs_value,
+   bool mcv_exists, Datum *mcv_values, int mcv_nvalues,
+   bool hist_exists, Datum *hist_values, int hist_nvalues,
+   double hist_weight,
+   FmgrInfo *proc, int opr_codenum);
+static int	inet_opr_codenum(Oid operator);
+static int	inet_inclusion_cmp(inet *left, inet *right, int opr_codenum);
+static int inet_masklen_inclusion_cmp(inet *left, inet *right,
+		   int opr_codenum);
+static int inet_hist_match_divider(inet *boundary, inet *query,
+		int opr_codenum);
+
+/*
+ * Selectivity estimation for the subnet inclusion/overlap operators
+ */
 Datum
 networksel(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_FLOAT8(0.001);
+	PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+	Oid			operator = PG_GETARG_OID(1);
+	List	   *args = (List *) PG_GETARG_POINTER(2);
+	int			varRelid = PG_GETARG_INT32(3);
+	VariableStatData vardata;
+	Node	   *other;
+	bool		varonleft;
+	Selectivity selec,
+mcv_selec,
+non_mcv_selec;
+	Datum		constvalue,
+			   *hist_values;
+	int			hist_nvalues;
+	Form_pg_statistic stats;
+	double		sumcommon,
+nullfrac;
+	FmgrInfo	proc;
+
+	/*
+	 * If expression is not (variable op something) or (something op
+	 * variable), then punt and return a default estimate.
+	 */
+	if (!get_restriction_variable(root, args, varRelid,
+  vardata, other, varonleft))
+		PG_RETURN_FLOAT8(DEFAULT_SEL(operator));
+
+	/*
+	 * Can't do anything useful if the something is not a constant, either.
+	 */
+	if (!IsA(other, Const))
+	{
+		

Re: [HACKERS] Selectivity estimation for inet operators

2014-12-07 Thread Michael Paquier
On Wed, Dec 3, 2014 at 6:14 AM, Emre Hasegeli e...@hasegeli.com wrote:
 Actually, there's a second large problem with this patch: blindly
 iterating through all combinations of MCV and histogram entries makes the
 runtime O(N^2) in the statistics target.  I made up some test data (by
 scanning my mail logs) and observed the following planning times, as
 reported by EXPLAIN ANALYZE:

 explain analyze select * from relays r1, relays r2 where r1.ip = r2.ip;
 explain analyze select * from relays r1, relays r2 where r1.ip  r2.ip;

 stats targeteqjoinsel   networkjoinsel

 100 0.27 ms 1.85 ms
 10002.56 ms 167.2 ms
 1   56.6 ms 13987.1 ms

 I don't think it's necessary for network selectivity to be quite as
 fast as eqjoinsel, but I doubt we can tolerate 14 seconds planning
 time for a query that might need just milliseconds to execute :-(

 It seemed to me that it might be possible to reduce the runtime by
 exploiting knowledge about the ordering of the histograms, but
 I don't have time to pursue that right now.

 I make it break the loop when we passed the last possible match. Patch
 attached.  It reduces the runtime almost 50% with large histograms.

 We can also make it use only some elements of the MCV and histogram
 for join estimation.  I have experimented with reducing the right and
 the left hand side of the lists on the previous versions.  I remember
 it was better to reduce only the left hand side.  I think it would be
 enough to use log(n) elements of the right hand side MCV and histogram.
 I can make the change, if you think selectivity estimation function
 is the right place for this optimization.
Marking as Returned with feedback as more work needs to be done.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-12-02 Thread Emre Hasegeli
 I spent a fair chunk of the weekend hacking on this patch to make
 it more understandable and fix up a lot of what seemed to me pretty
 clear arithmetic errors in the upper layers of the patch.  However,
 I couldn't quite convince myself to commit it, because the business
 around estimation for partial histogram-bucket matches still doesn't
 make any sense to me.  Specifically this:

 /* Partial bucket match. */
 left_divider = inet_hist_match_divider(left, query, opr_codenum);
 right_divider = inet_hist_match_divider(right, query, 
 opr_codenum);

 if (left_divider = 0 || right_divider = 0)
 match += 1.0 / pow(2.0, Max(left_divider, right_divider));

 Now unless I'm missing something pretty basic about the divider
 function, it returns larger numbers for inputs that are further away
 from each other (ie, have more not-in-common significant address bits).
 So the above calculation seems exactly backwards to me: if one endpoint
 of a bucket is close to the query, or even an exact match, and the
 other endpoint is further away, we completely ignore the close/exact
 match and assign a bucket match fraction based only on the further-away
 endpoint.  Isn't that exactly backwards?

You are right that partial bucket match calculation isn't fair on
some circumstances.

 I experimented with logic like this:

 if (left_divider = 0  right_divider = 0)
 match += 1.0 / pow(2.0, Min(left_divider, right_divider));
 else if (left_divider = 0 || right_divider = 0)
 match += 1.0 / pow(2.0, Max(left_divider, right_divider));

 ie, consider the closer endpoint if both are valid.  But that didn't seem
 to work a whole lot better.  I think really we need to consider both
 endpoints not just one to the exclusion of the other.

I have tried many combinations like this.  Including arithmetic,
geometric, logarithmic mean functions.  I could not get good results
with any of them, so I left it in a basic form.

Max() works pretty well most of the time, because if the query matches
the bucket generally it is close to both of the endpoints.  By using
Max(), we are actually crediting the ones which are close to the both
of the endpoints.  

 I'm also not exactly convinced by the divider function itself,
 specifically about the decision to fail and return -1 if the masklen
 comparison comes out wrong.  This effectively causes the masklen to be
 the most significant part of the value (after the IP family), which seems
 totally wrong.  ISTM we ought to consider the number of leading bits in
 common as the primary indicator of how far apart a query and a
 histogram endpoint are.

The partial match calculation with Max() is especially unfair on
the buckets where more significant bits change.  For example 63/8 and
64/8.  Returning -1 instead of a high divider, forces it to use
the divider for the other endpoint.  We consider the number of leading
bits in common as the primary indicator, just for the other endpoint.

I have also experimented with the count of the common bits of
the endpoints of the bucket for better partial match calculation.
I could not find out a meaningful equation with it.

 Even if the above aspects of the code are really completely right, the
 comments fail to explain why.  I spent a lot of time on the comments,
 but so far as these points are concerned they still only explain what
 is being done and not why it's a useful calculation to make.

I couldn't write better comments because I don't have strong arguments
about it.  We can say that we don't try to make use of the both of
the endpoints, because we don't know how to combine them.  We only use
the one with matching family and masklen, and when both of them match
we use the distant one to be on the safer side.

Thank you for looking at it.  Comments look much better now.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-12-02 Thread Emre Hasegeli
 Actually, there's a second large problem with this patch: blindly
 iterating through all combinations of MCV and histogram entries makes the
 runtime O(N^2) in the statistics target.  I made up some test data (by
 scanning my mail logs) and observed the following planning times, as
 reported by EXPLAIN ANALYZE:

 explain analyze select * from relays r1, relays r2 where r1.ip = r2.ip;
 explain analyze select * from relays r1, relays r2 where r1.ip  r2.ip;

 stats targeteqjoinsel   networkjoinsel

 100 0.27 ms 1.85 ms
 10002.56 ms 167.2 ms
 1   56.6 ms 13987.1 ms

 I don't think it's necessary for network selectivity to be quite as
 fast as eqjoinsel, but I doubt we can tolerate 14 seconds planning
 time for a query that might need just milliseconds to execute :-(

 It seemed to me that it might be possible to reduce the runtime by
 exploiting knowledge about the ordering of the histograms, but
 I don't have time to pursue that right now.

I make it break the loop when we passed the last possible match. Patch
attached.  It reduces the runtime almost 50% with large histograms.

We can also make it use only some elements of the MCV and histogram
for join estimation.  I have experimented with reducing the right and
the left hand side of the lists on the previous versions.  I remember
it was better to reduce only the left hand side.  I think it would be
enough to use log(n) elements of the right hand side MCV and histogram.
I can make the change, if you think selectivity estimation function
is the right place for this optimization.
diff --git a/src/backend/utils/adt/network_selfuncs.c b/src/backend/utils/adt/network_selfuncs.c
index f854847..16f39db 100644
--- a/src/backend/utils/adt/network_selfuncs.c
+++ b/src/backend/utils/adt/network_selfuncs.c
@@ -612,20 +612,23 @@ inet_hist_value_sel(Datum *values, int nvalues, Datum constvalue,
 		return 0.0;
 
 	query = DatumGetInetPP(constvalue);
 
 	/* left is the left boundary value of the current bucket ... */
 	left = DatumGetInetPP(values[0]);
 	left_order = inet_inclusion_cmp(left, query, opr_codenum);
 
 	for (i = 1; i  nvalues; i++)
 	{
+		if (left_order == 256)
+			break;
+
 		/* ... and right is the right boundary value */
 		right = DatumGetInetPP(values[i]);
 		right_order = inet_inclusion_cmp(right, query, opr_codenum);
 
 		if (left_order == 0  right_order == 0)
 		{
 			/* The whole bucket matches, since both endpoints do. */
 			match += 1.0;
 		}
 		else if ((left_order = 0  right_order = 0) ||
@@ -854,20 +857,23 @@ inet_opr_codenum(Oid operator)
 static int
 inet_inclusion_cmp(inet *left, inet *right, int opr_codenum)
 {
 	if (ip_family(left) == ip_family(right))
 	{
 		int			order;
 
 		order = bitncmp(ip_addr(left), ip_addr(right),
 		Min(ip_bits(left), ip_bits(right)));
 
+		if (order  0)
+			return 256;
+
 		if (order != 0)
 			return order;
 
 		return inet_masklen_inclusion_cmp(left, right, opr_codenum);
 	}
 
 	return ip_family(left) - ip_family(right);
 }
 
 /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-12-01 Thread Tom Lane
I wrote:
 I spent a fair chunk of the weekend hacking on this patch to make
 it more understandable and fix up a lot of what seemed to me pretty
 clear arithmetic errors in the upper layers of the patch.  However,
 I couldn't quite convince myself to commit it, because the business
 around estimation for partial histogram-bucket matches still doesn't
 make any sense to me.

Actually, there's a second large problem with this patch: blindly
iterating through all combinations of MCV and histogram entries makes the
runtime O(N^2) in the statistics target.  I made up some test data (by
scanning my mail logs) and observed the following planning times, as
reported by EXPLAIN ANALYZE:

explain analyze select * from relays r1, relays r2 where r1.ip = r2.ip;
explain analyze select * from relays r1, relays r2 where r1.ip  r2.ip;

stats targeteqjoinsel   networkjoinsel

100 0.27 ms 1.85 ms
10002.56 ms 167.2 ms
1   56.6 ms 13987.1 ms

I don't think it's necessary for network selectivity to be quite as
fast as eqjoinsel, but I doubt we can tolerate 14 seconds planning
time for a query that might need just milliseconds to execute :-(

It seemed to me that it might be possible to reduce the runtime by
exploiting knowledge about the ordering of the histograms, but
I don't have time to pursue that right now.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-11-30 Thread Tom Lane
Emre Hasegeli e...@hasegeli.com writes:
 Thanks. Overall, my impression of this patch is that it works very
 well. But damned if I understood *how* it works :-). There's a lot
 of statistics involved, and it's not easy to see why something is
 multiplied by something else. I'm adding comments as I read through
 it.

 Thank you for looking at it.  I tried to add more comments to
 the multiplications.  New version attached.  It also fixes a bug
 caused by wrong operator order used on histogram to histogram
 selectivity estimation for inner join.

I spent a fair chunk of the weekend hacking on this patch to make
it more understandable and fix up a lot of what seemed to me pretty
clear arithmetic errors in the upper layers of the patch.  However,
I couldn't quite convince myself to commit it, because the business
around estimation for partial histogram-bucket matches still doesn't
make any sense to me.  Specifically this:

/* Partial bucket match. */
left_divider = inet_hist_match_divider(left, query, opr_codenum);
right_divider = inet_hist_match_divider(right, query, opr_codenum);

if (left_divider = 0 || right_divider = 0)
match += 1.0 / pow(2.0, Max(left_divider, right_divider));

Now unless I'm missing something pretty basic about the divider
function, it returns larger numbers for inputs that are further away
from each other (ie, have more not-in-common significant address bits).
So the above calculation seems exactly backwards to me: if one endpoint
of a bucket is close to the query, or even an exact match, and the
other endpoint is further away, we completely ignore the close/exact
match and assign a bucket match fraction based only on the further-away
endpoint.  Isn't that exactly backwards?

I experimented with logic like this:

if (left_divider = 0  right_divider = 0)
match += 1.0 / pow(2.0, Min(left_divider, right_divider));
else if (left_divider = 0 || right_divider = 0)
match += 1.0 / pow(2.0, Max(left_divider, right_divider));

ie, consider the closer endpoint if both are valid.  But that didn't seem
to work a whole lot better.  I think really we need to consider both
endpoints not just one to the exclusion of the other.

I'm also not exactly convinced by the divider function itself,
specifically about the decision to fail and return -1 if the masklen
comparison comes out wrong.  This effectively causes the masklen to be
the most significant part of the value (after the IP family), which seems
totally wrong.  ISTM we ought to consider the number of leading bits in
common as the primary indicator of how far apart a query and a
histogram endpoint are.

Even if the above aspects of the code are really completely right, the
comments fail to explain why.  I spent a lot of time on the comments,
but so far as these points are concerned they still only explain what
is being done and not why it's a useful calculation to make.

Anyway, attached is my updated version of the patch.  (I did commit the
added #define in pg_operator.h, so that the patch can be independent of
that file in future.)  I've marked this waiting on author in the CF app.

regards, tom lane

diff --git a/src/backend/utils/adt/network_selfuncs.c b/src/backend/utils/adt/network_selfuncs.c
index d0d806f..f854847 100644
*** a/src/backend/utils/adt/network_selfuncs.c
--- b/src/backend/utils/adt/network_selfuncs.c
***
*** 3,9 
   * network_selfuncs.c
   *	  Functions for selectivity estimation of inet/cidr operators
   *
!  * Currently these are just stubs, but we hope to do better soon.
   *
   * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
   * Portions Copyright (c) 1994, Regents of the University of California
--- 3,11 
   * network_selfuncs.c
   *	  Functions for selectivity estimation of inet/cidr operators
   *
!  * This module provides estimators for the subnet inclusion and overlap
!  * operators.  Estimates are based on null fraction, most common values,
!  * and histogram of inet/cidr columns.
   *
   * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
   * Portions Copyright (c) 1994, Regents of the University of California
***
*** 16,32 
   */
  #include postgres.h
  
  #include utils/inet.h
  
  
  Datum
  networksel(PG_FUNCTION_ARGS)
  {
! 	PG_RETURN_FLOAT8(0.001);
  }
  
  Datum
  networkjoinsel(PG_FUNCTION_ARGS)
  {
! 	PG_RETURN_FLOAT8(0.001);
  }
--- 18,949 
   */
  #include postgres.h
  
+ #include math.h
+ 
+ #include access/htup_details.h
+ #include catalog/pg_operator.h
+ #include catalog/pg_statistic.h
  #include utils/inet.h
+ #include utils/lsyscache.h
+ #include utils/selfuncs.h
+ 
+ 
+ /* Default selectivity for the inet overlap operator */
+ #define DEFAULT_OVERLAP_SEL 0.01
  
+ /* Default selectivity for the various inclusion operators */
+ #define DEFAULT_INCLUSION_SEL 0.005
+ 
+ /* 

Re: [HACKERS] Selectivity estimation for inet operators

2014-09-27 Thread Emre Hasegeli
 Thanks. Overall, my impression of this patch is that it works very
 well. But damned if I understood *how* it works :-). There's a lot
 of statistics involved, and it's not easy to see why something is
 multiplied by something else. I'm adding comments as I read through
 it.

Thank you for looking at it.  I tried to add more comments to
the multiplications.  New version attached.  It also fixes a bug
caused by wrong operator order used on histogram to histogram
selectivity estimation for inner join.

 I've gotten to the inet_semi_join_selec function:
 
  [function]
 
 This desperately needs comment at the top of the function explaining
 what it does. Let me try to explain what I think it does:
 
 [explanation]

I used your explanation on the new version.

 Now, I think that last step is wrong. Firstly, the Do not bother if
 histogram weight is smaller than 0.1 rule seems bogus. The
 his2_weight is the total number of rows represented by the
 histogram, so surely it can't be less than 1. It can't really be
 less than the statistics target. Unless maybe if the histogram was
 collected when the table was large, but it has since shrunk to
 contain only a few rows, but that seems like a very bizarre corner
 case. At least it needs more comments explaining what the test is
 all about, but I think we should just always use the histogram (if
 it's available).

It was an unnecessary check.  I put an assert instead of it.

 Secondly, if we estimate that there is on average 1.0 matching row
 in the table, it does not follow that the probability that at least
 one row matches is 1.0. Assuming a gaussian distribution with mean
 1.0, the probability that at least one row matches is 0.5. Assuming
 a gaussian distribution here isn't quite right - I guess a Poisson
 distribution would be more accurate - but it sure doesn't seem right
 as it is.

 The error isn't very big, and perhaps you don't run into that very
 often, so I'm not sure what the best way to fix that would be. My
 statistics skills are a bit rusty, but I think the appropriate way
 would be to apply the Poisson distribution, with the estimated
 number of matched rows as the mean. The probability of at least one
 match would be the cumulative distribution function at k=1. It
 sounds like overkill, if this is case occurs only rarely. But then
 again, perhaps it's not all that rare.

A function of his_weight and his_selec could be a better option
than just multiplying them.  I am not sure about the function or
it worths the trouble.  Join selectivity estimation function for
equality doesn't even bother to look at the histograms.  Others
only return constant values.

 That said, I can't immediately find a test case where that error
 would matter. I tried this:
 
 create table inettbl1 (a inet);
 insert into inettbl1 select '10.0.0.' || (g % 255) from
 generate_series(1, 10) g;
 analyze inettbl1;
 explain analyze select count(*) from inettbl1 where a = ANY
 (SELECT a from inettbl1);
 
 The estimate for that is pretty accurate, 833 rows estimated vs 1000
 actual, with the current patch. I'm afraid if we fixed
 inet_semi_join_selec the way I suggest, the estimate would be
 smaller, i.e. more wrong. Is there something else in the estimates
 that accidentally compensates for this currently?

The partial bucket match on inet_his_inclusion_selec() causes low
estimates.  Which also effects non join estimation but not as much as
it effects join estimations.  If that works more correctly, semi
join estimation can be higher than it should be.

network_selfuncs.c:602:
   /* Partial bucket match. */

   left_divider = inet_his_match_divider(left, query, 
 opr_order);
   right_divider = inet_his_match_divider(right, query, 
 opr_order);

   if (left_divider = 0 || right_divider = 0)
   match += 1.0 / pow(2, Max(left_divider, 
 right_divider));

I think this calculation can benefit from a statistical function
more than the semi join.  Using the different bit count as power
of two is the best I could find.  It works quite well on most of
the cases.
diff --git a/src/backend/utils/adt/network_selfuncs.c b/src/backend/utils/adt/network_selfuncs.c
index d0d806f..e9f9696 100644
--- a/src/backend/utils/adt/network_selfuncs.c
+++ b/src/backend/utils/adt/network_selfuncs.c
@@ -3,7 +3,8 @@
  * network_selfuncs.c
  *	  Functions for selectivity estimation of inet/cidr operators
  *
- * Currently these are just stubs, but we hope to do better soon.
+ * Estimates are based on null fraction, most common values, and
+ * histogram of inet/cidr datatypes.
  *
  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -16,17 +17,864 @@
  */
 #include postgres.h
 
+#include math.h
+
+#include access/htup_details.h
+#include catalog/pg_collation.h
+#include catalog/pg_operator.h
+#include 

Re: [HACKERS] Selectivity estimation for inet operators

2014-09-25 Thread Heikki Linnakangas

On 09/07/2014 07:09 PM, Emre Hasegeli wrote:

I updated the patch to cover semi and anti joins with eqjoinsel_semi().
I think it is better than returning a constant.


What you did there is utterly unacceptable from a modularity standpoint;
and considering that the values will be nowhere near right, the argument
that it's better than returning a constant seems pretty weak.  I think
you should just take that out again.


I will try to come up with a better, data type specific implementation
in a week.


New version with semi join estimation function attached.


Thanks. Overall, my impression of this patch is that it works very well. 
But damned if I understood *how* it works :-). There's a lot of 
statistics involved, and it's not easy to see why something is 
multiplied by something else. I'm adding comments as I read through it.


I've gotten to the inet_semi_join_selec function:


/*
 * Inet semi join selectivity estimation.
 */
static Selectivity
inet_semi_join_selec(bool mcv2_exists, Datum *mcv2_values, int mcv2_nvalues,
 bool his2_exists, Datum *his2_values, 
int his2_nvalues,
 double his2_weight, Datum *constvalue,
 FmgrInfo *proc, short opr_order)
{
if (mcv2_exists)
{
int i;

for (i = 0; i  mcv2_nvalues; i++)
{
if (DatumGetBool(FunctionCall2Coll(proc, 
DEFAULT_COLLATION_OID,

   *constvalue, mcv2_values[i])))
return 1.0;
}
}

/* Do not bother if histogram weight is smaller than 0.1. */
if (his2_exists  his2_weight  0.1)
{
Selectivity his_selec;

his_selec = inet_his_inclusion_selec(his2_values, his2_nvalues,

 constvalue, opr_order);

if (his_selec  0)
return Min(1.0, his2_weight * his_selec);
}

return 0.0;
}


This desperately needs comment at the top of the function explaining 
what it does. Let me try to explain what I think it does:


This function calculates the probability that there is at least one row 
in table B, which satisfies the constant op column qual. The constant 
is passed as argument, and for table B, the MCV list and histogram is 
provided. his2_weight is the total number of rows in B that are covered 
by the histogram. For example, if the table has 1000 rows, and 10% of 
the rows in the table are in the MCV, and another 10% are NULLs, 
his_weight would be 800.


First, we check if the constant matches any of the most common values. 
If it does, return 1.0, because then there is surely a match.


Next, we use the histogram to estimate the number of rows in the table 
that matches the qual. If it amounts to more than 1 row, we return 1.0. 
If it's between 0.0 and 1.0 rows, we return that number as the probability.



Now, I think that last step is wrong. Firstly, the Do not bother if 
histogram weight is smaller than 0.1 rule seems bogus. The his2_weight 
is the total number of rows represented by the histogram, so surely it 
can't be less than 1. It can't really be less than the statistics 
target. Unless maybe if the histogram was collected when the table was 
large, but it has since shrunk to contain only a few rows, but that 
seems like a very bizarre corner case. At least it needs more comments 
explaining what the test is all about, but I think we should just always 
use the histogram (if it's available).


Secondly, if we estimate that there is on average 1.0 matching row in 
the table, it does not follow that the probability that at least one row 
matches is 1.0. Assuming a gaussian distribution with mean 1.0, the 
probability that at least one row matches is 0.5. Assuming a gaussian 
distribution here isn't quite right - I guess a Poisson distribution 
would be more accurate - but it sure doesn't seem right as it is.


The error isn't very big, and perhaps you don't run into that very 
often, so I'm not sure what the best way to fix that would be. My 
statistics skills are a bit rusty, but I think the appropriate way would 
be to apply the Poisson distribution, with the estimated number of 
matched rows as the mean. The probability of at least one match would be 
the cumulative distribution function at k=1. It sounds like overkill, if 
this is case occurs only rarely. But then again, perhaps it's not all 
that rare.


That said, I can't immediately find a test case where that error would 
matter. I tried this:


create table inettbl1 (a inet);
insert into inettbl1 select '10.0.0.' || (g % 255) from 
generate_series(1, 10) g;

analyze inettbl1;
explain analyze select count(*) from inettbl1 where a = ANY 

Re: [HACKERS] Selectivity estimation for inet operators

2014-09-16 Thread Brightwell, Adam

 New version with semi join estimation function attached.


I have performed the following initial review:

- Patch format. -- submitted as unified, but not sure it makes it any
easier to read than context format.
- Apply to current master (77e65bf).  -- success (though, I do get
Stripping trailing CR's from patch; notification)
- check-world -- success
- Whitespace - were the whitespace changes in pg_operator.h necessary?

As for implementation, I'll leave that to those with a better understanding
of the purpose/expectations of the modified functions.

-Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


Re: [HACKERS] Selectivity estimation for inet operators

2014-09-07 Thread Emre Hasegeli
   I updated the patch to cover semi and anti joins with eqjoinsel_semi().
   I think it is better than returning a constant.
 
  What you did there is utterly unacceptable from a modularity standpoint;
  and considering that the values will be nowhere near right, the argument
  that it's better than returning a constant seems pretty weak.  I think
  you should just take that out again.
 
 I will try to come up with a better, data type specific implementation
 in a week.

New version with semi join estimation function attached.
diff --git a/src/backend/utils/adt/network_selfuncs.c b/src/backend/utils/adt/network_selfuncs.c
index d0d806f..f6cb2f6 100644
--- a/src/backend/utils/adt/network_selfuncs.c
+++ b/src/backend/utils/adt/network_selfuncs.c
@@ -1,32 +1,819 @@
 /*-
  *
  * network_selfuncs.c
  *	  Functions for selectivity estimation of inet/cidr operators
  *
- * Currently these are just stubs, but we hope to do better soon.
+ * Estimates are based on null fraction, distinct value count, most common
+ * values, and histogram of inet/cidr datatypes.
  *
  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
  * IDENTIFICATION
  *	  src/backend/utils/adt/network_selfuncs.c
  *
  *-
  */
 #include postgres.h
 
+#include math.h
+
+#include access/htup_details.h
+#include catalog/pg_collation.h
+#include catalog/pg_operator.h
+#include catalog/pg_statistic.h
+#include utils/lsyscache.h
 #include utils/inet.h
+#include utils/selfuncs.h
+
+
+/* Default selectivity constant for the inet overlap operator */
+#define DEFAULT_OVERLAP_SEL 0.01
+
+/* Default selectivity constant for the other operators */
+#define DEFAULT_INCLUSION_SEL 0.005
+
+/* Default selectivity for given operator */
+#define DEFAULT_SEL(operator) \
+	((operator) == OID_INET_OVERLAP_OP ? \
+			DEFAULT_OVERLAP_SEL : DEFAULT_INCLUSION_SEL)
 
+static Selectivity networkjoinsel_inner(Oid operator,
+	VariableStatData *vardata1, VariableStatData *vardata2);
+static Selectivity networkjoinsel_semi(Oid operator,
+	VariableStatData *vardata1, VariableStatData *vardata2);
+static short int inet_opr_order(Oid operator);
+static Selectivity inet_his_inclusion_selec(Datum *values, int nvalues,
+	Datum *constvalue, short int opr_order);
+static Selectivity inet_mcv_join_selec(Datum *mcv1_values,
+	float4 *mcv1_numbers, int mcv1_nvalues, Datum *mcv2_values,
+	float4 *mcv2_numbers, int mcv2_nvalues, Oid operator,
+	Selectivity *max_selec_p);
+static Selectivity inet_mcv_his_selec(Datum *mcv_values, float4 *mcv_numbers,
+	int mcv_nvalues, Datum *his_values, int his_nvalues,
+	short int opr_order, Selectivity *max_selec_p);
+static Selectivity inet_his_inclusion_join_selec(Datum *his1_values,
+	int his1_nvalues, Datum *his2_values, int his2_nvalues,
+	short int opr_order);
+static Selectivity inet_semi_join_selec(bool mcv2_exists, Datum *mcv2_values,
+	int mcv2_nvalues, bool his2_exists, Datum *his2_values,
+	int his2_nvalues, double his2_weight, Datum *constvalue,
+	FmgrInfo *proc, short int opr_order);
+static short int inet_inclusion_cmp(inet *left, inet *right,
+	short int opr_order);
+static short int inet_masklen_inclusion_cmp(inet *left, inet *right,
+	short int opr_order);
+static short int inet_his_match_divider(inet *boundary, inet *query,
+	short int opr_order);
 
+/*
+ * Selectivity estimation for the subnet inclusion operators
+ */
 Datum
 networksel(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_FLOAT8(0.001);
+	PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+	Oid			operator = PG_GETARG_OID(1);
+	List	   *args = (List *) PG_GETARG_POINTER(2);
+	int			varRelid = PG_GETARG_INT32(3),
+his_nvalues;
+	VariableStatData vardata;
+	Node	   *other;
+	bool		varonleft;
+	Selectivity selec,
+max_mcv_selec;
+	Datum		constvalue,
+			   *his_values;
+	Form_pg_statistic stats;
+	double		nullfrac;
+	FmgrInfo	proc;
+
+	/*
+	 * If expression is not (variable op something) or (something op
+	 * variable), then punt and return a default estimate.
+	 */
+	if (!get_restriction_variable(root, args, varRelid,
+  vardata, other, varonleft))
+		PG_RETURN_FLOAT8(DEFAULT_SEL(operator));
+
+	/*
+	 * Can't do anything useful if the something is not a constant, either.
+	 */
+	if (!IsA(other, Const))
+	{
+		ReleaseVariableStats(vardata);
+		PG_RETURN_FLOAT8(DEFAULT_SEL(operator));
+	}
+
+	/* All of the subnet inclusion operators are strict. */
+	if (((Const *) other)-constisnull)
+	{
+		ReleaseVariableStats(vardata);
+		PG_RETURN_FLOAT8(0.0);
+	}
+
+	if (!HeapTupleIsValid(vardata.statsTuple))
+	{
+		ReleaseVariableStats(vardata);
+		PG_RETURN_FLOAT8(DEFAULT_SEL(operator));
+	}
+
+	constvalue = ((Const *) other)-constvalue;
+	stats = (Form_pg_statistic) 

Re: [HACKERS] Selectivity estimation for inet operators

2014-08-31 Thread Emre Hasegeli
 * Isn't X  Y equivalent to network_scan_first(X)  Y AND 
 network_scan_last(X)  Y? Or at least close enough for selectivity 
 estimation purposes? Pardon my ignorance - I'm not too familiar with the 
 inet datatype - but how about just calling scalarineqsel for both bounds?

Actually, X  Y is equivalent to

network_scan_first(X) = network_host(Y) AND
network_scan_last(X) = network_host(Y) AND
network_masklen(X)  network_masklen(X)

but we do not have statistics for neither network_scan_last(X)
nor network_masklen(X).  I tried to find a solution based on
the implementation of the operators.

 * inet_mcv_join_selec() is O(n^2) where n is the number of entries in the 
 MCV lists. With the max statistics target of 1, a worst case query on 
 my laptop took about 15 seconds to plan. Maybe that's acceptable, but you 
 went through some trouble to make planning of MCV vs histogram faster, by 
 the log2 method to compare only some values, so I wonder why you didn't do 
 the same for the MCV vs MCV case?

It was like that in the previous versions.  It was causing worse
estimation, but I was trying to reduce both sides of the lists.  It
works slightly better when only the left hand side of the list is
reduced.  Attached version works like that.

 * A few typos: lenght - length.

Fixed.

Thank you for looking at it.
diff --git a/src/backend/utils/adt/network_selfuncs.c 
b/src/backend/utils/adt/network_selfuncs.c
index d0d806f..a00706c 100644
--- a/src/backend/utils/adt/network_selfuncs.c
+++ b/src/backend/utils/adt/network_selfuncs.c
@@ -1,32 +1,671 @@
 /*-
  *
  * network_selfuncs.c
  *   Functions for selectivity estimation of inet/cidr operators
  *
- * Currently these are just stubs, but we hope to do better soon.
+ * Estimates are based on null fraction, distinct value count, most common
+ * values, and histogram of inet/cidr datatypes.
  *
  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
  * IDENTIFICATION
  *   src/backend/utils/adt/network_selfuncs.c
  *
  *-
  */
 #include postgres.h
 
+#include math.h
+
+#include access/htup_details.h
+#include catalog/pg_collation.h
+#include catalog/pg_operator.h
+#include catalog/pg_statistic.h
+#include utils/lsyscache.h
 #include utils/inet.h
+#include utils/selfuncs.h
 
 
+/* Default selectivity constant for the inet overlap operator */
+#define DEFAULT_OVERLAP_SEL 0.01
+
+/* Default selectivity constant for the other operators */
+#define DEFAULT_INCLUSION_SEL 0.005
+
+/* Default selectivity for given operator */
+#define DEFAULT_SEL(operator) \
+   ((operator) == OID_INET_OVERLAP_OP ? \
+   DEFAULT_OVERLAP_SEL : DEFAULT_INCLUSION_SEL)
+
+static Selectivity networkjoinsel_inner(Oid operator,
+VariableStatData *vardata1, 
VariableStatData *vardata2);
+extern double eqjoinsel_semi(Oid operator, VariableStatData *vardata1,
+  VariableStatData *vardata2, RelOptInfo *inner_rel);
+extern RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
+static short int inet_opr_order(Oid operator);
+static Selectivity inet_his_inclusion_selec(Datum *values, int nvalues,
+Datum *constvalue, short int 
opr_order);
+static Selectivity inet_mcv_join_selec(Datum *values1, float4 *numbers1,
+   int nvalues1, Datum *values2, float4 
*numbers2,
+   int nvalues2, int red_nvalues, Oid 
operator);
+static Selectivity inet_mcv_his_selec(Datum *mcv_values, float4 *mcv_numbers,
+  int mcv_nvalues, Datum *his_values, int 
his_nvalues,
+  int red_nvalues, short int opr_order,
+  Selectivity *max_selec_pointer);
+static Selectivity inet_his_inclusion_join_selec(Datum *his1_values,
+ int his1_nvalues, Datum *his2_values, 
int his2_nvalues,
+ int red_nvalues, 
short int opr_order);
+static short int inet_inclusion_cmp(inet *left, inet *right,
+  short int opr_order);
+static short int inet_masklen_inclusion_cmp(inet *left, inet *right,
+  short int opr_order);
+static short int inet_his_match_divider(inet *boundary, inet *query,
+  short int opr_order);
+
+/*
+ * Selectivity estimation for the subnet inclusion operators
+ */
 Datum
 networksel(PG_FUNCTION_ARGS)
 {
-   PG_RETURN_FLOAT8(0.001);
+   PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+   Oid 

Re: [HACKERS] Selectivity estimation for inet operators

2014-08-31 Thread Emre Hasegeli
 Heikki Linnakangas hlinnakan...@vmware.com writes:
  * inet_mcv_join_selec() is O(n^2) where n is the number of entries in 
  the MCV lists. With the max statistics target of 1, a worst case 
  query on my laptop took about 15 seconds to plan. Maybe that's 
  acceptable, but you went through some trouble to make planning of MCV vs 
  histogram faster, by the log2 method to compare only some values, so I 
  wonder why you didn't do the same for the MCV vs MCV case?
 
 Actually, what I think needs to be asked is the opposite question: why is
 the other code ignoring some of the statistical data?  If the user asked
 us to collect a lot of stats detail it seems reasonable that he's
 expecting us to use it to get more accurate estimates.  It's for sure
 not obvious why these estimators should take shortcuts that are not being
 taken in the much-longer-established code for scalar comparison estimates.

It will still use more statistical data, when statistics_target is
higher.  It was not sure that the user wants to spent O(n^2) amount
of time based on statistics_target.  Attached version is without
this optimization.  Estimates are better without it, but planning
takes more time.

 I'm not exactly convinced that the math adds up in this logic, either.
 The way in which it combines results from looking at the MCV lists and
 at the histograms seems pretty arbitrary.

I taught the product of the join will be

(left_mcv + left_histogram) * (right_mcv + right_histogram) * 
selectivity

and tried to calculate it as in the following:

(left_mcv * right_mcv * selectivity) +
(right_mcv * left_histogram * selectivity) +
(left_mcv * right_histogram * selectivity) +
(left_histogram * right_histogram * selectivity)

where left_histogram is

1.0 - left_nullfrac - left_mcv

I fixed calculation for the MCV vs histogram part.  The estimates of
inner join are very close to the actual rows with statistics_target = 1000.
I think the calculation should be right.
diff --git a/src/backend/utils/adt/network_selfuncs.c 
b/src/backend/utils/adt/network_selfuncs.c
index d0d806f..4367f0e 100644
--- a/src/backend/utils/adt/network_selfuncs.c
+++ b/src/backend/utils/adt/network_selfuncs.c
@@ -1,32 +1,655 @@
 /*-
  *
  * network_selfuncs.c
  *   Functions for selectivity estimation of inet/cidr operators
  *
- * Currently these are just stubs, but we hope to do better soon.
+ * Estimates are based on null fraction, distinct value count, most common
+ * values, and histogram of inet/cidr datatypes.
  *
  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
  * IDENTIFICATION
  *   src/backend/utils/adt/network_selfuncs.c
  *
  *-
  */
 #include postgres.h
 
+#include math.h
+
+#include access/htup_details.h
+#include catalog/pg_collation.h
+#include catalog/pg_operator.h
+#include catalog/pg_statistic.h
+#include utils/lsyscache.h
 #include utils/inet.h
+#include utils/selfuncs.h
 
 
+/* Default selectivity constant for the inet overlap operator */
+#define DEFAULT_OVERLAP_SEL 0.01
+
+/* Default selectivity constant for the other operators */
+#define DEFAULT_INCLUSION_SEL 0.005
+
+/* Default selectivity for given operator */
+#define DEFAULT_SEL(operator) \
+   ((operator) == OID_INET_OVERLAP_OP ? \
+   DEFAULT_OVERLAP_SEL : DEFAULT_INCLUSION_SEL)
+
+static Selectivity networkjoinsel_inner(Oid operator,
+VariableStatData *vardata1, 
VariableStatData *vardata2);
+extern double eqjoinsel_semi(Oid operator, VariableStatData *vardata1,
+  VariableStatData *vardata2, RelOptInfo *inner_rel);
+extern RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
+static short int inet_opr_order(Oid operator);
+static Selectivity inet_his_inclusion_selec(Datum *values, int nvalues,
+Datum *constvalue, short int 
opr_order);
+static Selectivity inet_mcv_join_selec(Datum *values1, float4 *numbers1,
+   int nvalues1, Datum *values2, float4 
*numbers2,
+   int nvalues2, Oid operator, Selectivity 
*max_selec_p);
+static Selectivity inet_mcv_his_selec(Datum *mcv_values, float4 *mcv_numbers,
+  int mcv_nvalues, Datum *his_values, int 
his_nvalues,
+  short int opr_order, Selectivity 
*max_selec_p);
+static Selectivity inet_his_inclusion_join_selec(Datum *his1_values,
+ int his1_nvalues, Datum *his2_values, 
int his2_nvalues,
+ short int opr_order);
+static short int inet_inclusion_cmp(inet *left, inet 

Re: [HACKERS] Selectivity estimation for inet operators

2014-08-31 Thread Emre Hasegeli
 What you did there is utterly unacceptable from a modularity standpoint;
 and considering that the values will be nowhere near right, the argument
 that it's better than returning a constant seems pretty weak.  I think
 you should just take that out again.

I will try to come up with a better, data type specific implementation
in a week.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-08-30 Thread Tom Lane
Emre Hasegeli e...@hasegeli.com writes:
 I updated the patch to cover semi and anti joins with eqjoinsel_semi().
 I think it is better than returning a constant.

What you did there is utterly unacceptable from a modularity standpoint;
and considering that the values will be nowhere near right, the argument
that it's better than returning a constant seems pretty weak.  I think
you should just take that out again.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-08-30 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 * inet_mcv_join_selec() is O(n^2) where n is the number of entries in 
 the MCV lists. With the max statistics target of 1, a worst case 
 query on my laptop took about 15 seconds to plan. Maybe that's 
 acceptable, but you went through some trouble to make planning of MCV vs 
 histogram faster, by the log2 method to compare only some values, so I 
 wonder why you didn't do the same for the MCV vs MCV case?

Actually, what I think needs to be asked is the opposite question: why is
the other code ignoring some of the statistical data?  If the user asked
us to collect a lot of stats detail it seems reasonable that he's
expecting us to use it to get more accurate estimates.  It's for sure
not obvious why these estimators should take shortcuts that are not being
taken in the much-longer-established code for scalar comparison estimates.

I'm not exactly convinced that the math adds up in this logic, either.
The way in which it combines results from looking at the MCV lists and
at the histograms seems pretty arbitrary.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-08-28 Thread Heikki Linnakangas

On 08/26/2014 12:44 PM, Emre Hasegeli wrote:

I agree with you that we can support other join type and anti join later,
If others don’t have any objection in doing other parts later I will mark as Ready 
For Committer.


I updated the patch to cover semi and anti joins with eqjoinsel_semi().
I think it is better than returning a constant.  The new version
attached with the new version of the test script.  Can you please
look at it again and mark it as ready for committer if it seems okay
to you?


I took a quick look at this. Some questions:

* Isn't X  Y equivalent to network_scan_first(X)  Y AND 
network_scan_last(X)  Y? Or at least close enough for selectivity 
estimation purposes? Pardon my ignorance - I'm not too familiar with the 
inet datatype - but how about just calling scalarineqsel for both bounds?


* inet_mcv_join_selec() is O(n^2) where n is the number of entries in 
the MCV lists. With the max statistics target of 1, a worst case 
query on my laptop took about 15 seconds to plan. Maybe that's 
acceptable, but you went through some trouble to make planning of MCV vs 
histogram faster, by the log2 method to compare only some values, so I 
wonder why you didn't do the same for the MCV vs MCV case?


* A few typos: lenght - length.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-08-26 Thread Emre Hasegeli
 I agree with you that we can support other join type and anti join later,
 If others don’t have any objection in doing other parts later I will mark as 
 Ready For Committer.

I updated the patch to cover semi and anti joins with eqjoinsel_semi().
I think it is better than returning a constant.  The new version
attached with the new version of the test script.  Can you please
look at it again and mark it as ready for committer if it seems okay
to you?
diff --git a/src/backend/utils/adt/network_selfuncs.c 
b/src/backend/utils/adt/network_selfuncs.c
index d0d806f..eca9e7c 100644
--- a/src/backend/utils/adt/network_selfuncs.c
+++ b/src/backend/utils/adt/network_selfuncs.c
@@ -1,32 +1,669 @@
 /*-
  *
  * network_selfuncs.c
  *   Functions for selectivity estimation of inet/cidr operators
  *
- * Currently these are just stubs, but we hope to do better soon.
+ * Estimates are based on null fraction, distinct value count, most common
+ * values, and histogram of inet/cidr datatypes.
  *
  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
  * IDENTIFICATION
  *   src/backend/utils/adt/network_selfuncs.c
  *
  *-
  */
 #include postgres.h
 
+#include math.h
+
+#include access/htup_details.h
+#include catalog/pg_collation.h
+#include catalog/pg_operator.h
+#include catalog/pg_statistic.h
+#include utils/lsyscache.h
 #include utils/inet.h
+#include utils/selfuncs.h
 
 
+/* Default selectivity constant for the inet overlap operator */
+#define DEFAULT_OVERLAP_SEL 0.01
+
+/* Default selectivity constant for the other operators */
+#define DEFAULT_INCLUSION_SEL 0.005
+
+/* Default selectivity for given operator */
+#define DEFAULT_SEL(operator) \
+   ((operator) == OID_INET_OVERLAP_OP ? \
+   DEFAULT_OVERLAP_SEL : DEFAULT_INCLUSION_SEL)
+
+static Selectivity networkjoinsel_inner(Oid operator,
+VariableStatData *vardata1, 
VariableStatData *vardata2);
+extern double eqjoinsel_semi(Oid operator, VariableStatData *vardata1,
+  VariableStatData *vardata2, RelOptInfo *inner_rel);
+extern RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
+static short int inet_opr_order(Oid operator);
+static Selectivity inet_his_inclusion_selec(Datum *values, int nvalues,
+Datum *constvalue, short int 
opr_order);
+static Selectivity inet_mcv_join_selec(Datum *values1, float4 *numbers1,
+   int nvalues1, Datum *values2, float4 
*numbers2,
+   int nvalues2, Oid operator);
+static Selectivity inet_mcv_his_selec(Datum *mcv_values, float4 *mcv_numbers,
+  int mcv_nvalues, Datum *his_values, int 
his_nvalues,
+  int red_nvalues, short int opr_order,
+  Selectivity *max_selec_pointer);
+static Selectivity inet_his_inclusion_join_selec(Datum *his1_values,
+ int his1_nvalues, Datum *his2_values, 
int his2_nvalues,
+ int red_nvalues, 
short int opr_order);
+static short int inet_inclusion_cmp(inet *left, inet *right,
+  short int opr_order);
+static short int inet_masklen_inclusion_cmp(inet *left, inet *right,
+  short int opr_order);
+static short int inet_his_match_divider(inet *boundary, inet *query,
+  short int opr_order);
+
+/*
+ * Selectivity estimation for the subnet inclusion operators
+ */
 Datum
 networksel(PG_FUNCTION_ARGS)
 {
-   PG_RETURN_FLOAT8(0.001);
+   PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+   Oid operator = PG_GETARG_OID(1);
+   List   *args = (List *) PG_GETARG_POINTER(2);
+   int varRelid = PG_GETARG_INT32(3),
+   his_nvalues;
+   VariableStatData vardata;
+   Node   *other;
+   boolvaronleft;
+   Selectivity selec,
+   max_mcv_selec;
+   Datum   constvalue,
+  *his_values;
+   Form_pg_statistic stats;
+   double  nullfrac;
+   FmgrInfoproc;
+
+   /*
+* If expression is not (variable op something) or (something op
+* variable), then punt and return a default estimate.
+*/
+   if (!get_restriction_variable(root, args, varRelid,
+ vardata, 
other, varonleft))
+   PG_RETURN_FLOAT8(DEFAULT_SEL(operator));
+
+   

Re: [HACKERS] Selectivity estimation for inet operators

2014-07-14 Thread Dilip kumar
On 12 July 2014 23:25, Emre Hasegeli Wrote,

  I have one last comment, after clarifying this I can move it to
 ready for committer.
  1. In networkjoinsel, For avoiding the case of huge statistics, only
 some of the values from mcv and histograms are used (calculated using
 SQRT).
  -- But in my opinion, if histograms and mcv both are exist then its
 fine, but if only mcv's are there in that case, we can match complete
 MCV, it will give better accuracy.
 In other function like eqjoinsel also its matching complete MCV.
 
 I was not sure of reducing statistics, at all.  I could not find any
 other selectivity estimation function which does this.  After testing
 it some more, I reached the conclusion that it would be better to only
 reduce the values of the outer loop on histogram match.  Now it matches
 complete MCV lists to each other.  I also switched back to
 log2() from sqrt() to make the outer list smaller.

OK

 
 I rethink your previous advice to threat histogram bucket partially
 matched when the constant matches the last boundary, and changed it
 that way.  It is better than using the selectivity for only one value.
 Removing this part also make the function more simple.  The new version
 of the patch attached.

 This seems good to me.

 
 While looking at it I find some other small problems and fixed them.
 I also realized that I forgot to support other join types than inner
 join.  Currently, the default estimation is used for anti joins.
 I think the patch will need more than trivial amount of change to
 support anti joins.  I can work on it later.  While doing it, outer
 join selectivity estimation can also be improved.  I think the patch is
 better than nothing in its current state.

I agree with you that we can support other join type and anti join later,
If others don’t have any objection in doing other parts later I will mark as 
Ready For Committer.

Regards,
Dilip




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-07-12 Thread Emre Hasegeli
 I have one last comment, after clarifying this I can move it to ready for 
 committer.
 1. In networkjoinsel, For avoiding the case of huge statistics, only some of 
 the values from mcv and histograms are used (calculated using SQRT).
 -- But in my opinion, if histograms and mcv both are exist then its fine, but 
 if only mcv's are there in that case, we can match complete MCV, it will give 
 better accuracy.
In other function like eqjoinsel also its matching complete MCV.

I was not sure of reducing statistics, at all.  I could not find any
other selectivity estimation function which does this.  After testing
it some more, I reached the conclusion that it would be better to
only reduce the values of the outer loop on histogram match.  Now it
matches complete MCV lists to each other.  I also switched back to
log2() from sqrt() to make the outer list smaller.

I rethink your previous advice to threat histogram bucket partially
matched when the constant matches the last boundary, and changed it
that way.  It is better than using the selectivity for only one value.
Removing this part also make the function more simple.  The new
version of the patch attached.

While looking at it I find some other small problems and fixed them.
I also realized that I forgot to support other join types than inner
join.  Currently, the default estimation is used for anti joins.
I think the patch will need more than trivial amount of change to
support anti joins.  I can work on it later.  While doing it, outer
join selectivity estimation can also be improved.  I think the patch
is better than nothing in its current state.
diff --git a/src/backend/utils/adt/network_selfuncs.c 
b/src/backend/utils/adt/network_selfuncs.c
index d0d806f..08ec945 100644
--- a/src/backend/utils/adt/network_selfuncs.c
+++ b/src/backend/utils/adt/network_selfuncs.c
@@ -1,32 +1,626 @@
 /*-
  *
  * network_selfuncs.c
  *   Functions for selectivity estimation of inet/cidr operators
  *
- * Currently these are just stubs, but we hope to do better soon.
+ * Estimates are based on null fraction, distinct value count, most common
+ * values, and histogram of inet/cidr datatypes.
  *
  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
  * IDENTIFICATION
  *   src/backend/utils/adt/network_selfuncs.c
  *
  *-
  */
 #include postgres.h
 
+#include math.h
+
+#include access/htup_details.h
+#include catalog/pg_collation.h
+#include catalog/pg_operator.h
+#include catalog/pg_statistic.h
+#include utils/lsyscache.h
 #include utils/inet.h
+#include utils/selfuncs.h
 
 
+/* Default selectivity constant for the inet overlap operator */
+#define DEFAULT_OVERLAP_SEL 0.01
+
+/* Default selectivity constant for the other operators */
+#define DEFAULT_INCLUSION_SEL 0.005
+
+/* Default selectivity for given operator */
+#define DEFAULT_SEL(operator) \
+   ((operator) == OID_INET_OVERLAP_OP ? \
+   DEFAULT_OVERLAP_SEL : DEFAULT_INCLUSION_SEL)
+
+static short int inet_opr_order(Oid operator, bool reversed);
+static Selectivity inet_his_inclusion_selec(Datum *values, int nvalues,
+   Datum *constvalue, short int opr_order);
+static Selectivity inet_mcv_join_selec(Datum *values1, float4 *numbers1,
+   int nvalues1, Datum *values2, float4 
*numbers2,
+   int nvalues2, Oid operator, bool 
reversed);
+static Selectivity inet_mcv_his_selec(Datum *mcv_values, float4 *mcv_numbers,
+   int mcv_nvalues, Datum *his_values, int 
his_nvalues,
+   int red_nvalues, short int opr_order,
+   Selectivity *max_selec_pointer);
+static Selectivity inet_his_inclusion_join_selec(Datum *his1_values,
+   int his1_nvalues, Datum *his2_values, 
int his2_nvalues,
+   int red_nvalues, short int opr_order);
+static short int inet_inclusion_cmp(inet *left, inet *right,
+   short 
int opr_order);
+static short int inet_masklen_inclusion_cmp(inet *left, inet *right,
+   
short int opr_order);
+static short int inet_his_match_divider(inet *boundary, inet *query,
+   
short int opr_order);
+
+/*
+ * Selectivity estimation for the subnet inclusion operators
+ */
 Datum
 networksel(PG_FUNCTION_ARGS)
 {
-   PG_RETURN_FLOAT8(0.001);
+   PlannerInfo*root = (PlannerInfo *) PG_GETARG_POINTER(0);
+   Oid

Re: [HACKERS] Selectivity estimation for inet operators

2014-07-07 Thread Dilip kumar
On 06 July 2014 20:33, Emre Hasegeli Wrote,
 
  Apart from that there is one spell check you can correct
  -- in inet_his_inclusion_selec comments histogram boundies  -
  histogram boundaries :)
 
 I fixed it.  New version attached.  The debug log statements are also
 removed.

I have done with the review, patch seems fine to me

I have one last comment, after clarifying this I can move it to ready for 
committer.
1. In networkjoinsel, For avoiding the case of huge statistics, only some of 
the values from mcv and histograms are used (calculated using SQRT).
-- But in my opinion, if histograms and mcv both are exist then its fine, but 
if only mcv's are there in that case, we can match complete MCV, it will give 
better accuracy.
   In other function like eqjoinsel also its matching complete MCV.



Thanks  Regards,
Dilip Kumar

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-07-06 Thread Emre Hasegeli
Thank you for looking at it.

 In inet_his_inclusion_selec function, 
 When the constant matches only the right side of the bucket, and if it’s a 
 last bucket then it's never considered as partial match candidate.
 In my opinion, if it's not a last bucket then for next bucket it will become 
 left boundary and this will be treated as partial match so no problem, but 
 in-case of last bucket it can give wrong selectivity.
 
 Can't we consider it as partial bucket match if it is last bucket ?

Actually, in that case, the ratio for one value in the column is used.
I clarified the comment about it.  I do not think it is common enough
case to make the function more complicated.

 Apart from that there is one spell check you can correct
 -- in inet_his_inclusion_selec comments
 histogram boundies  - histogram boundaries :)

I fixed it.  New version attached.  The debug log statements are also
removed.
diff --git a/src/backend/utils/adt/network_selfuncs.c 
b/src/backend/utils/adt/network_selfuncs.c
index d0d806f..d8aeae9 100644
--- a/src/backend/utils/adt/network_selfuncs.c
+++ b/src/backend/utils/adt/network_selfuncs.c
@@ -1,32 +1,677 @@
 /*-
  *
  * network_selfuncs.c
  *   Functions for selectivity estimation of inet/cidr operators
  *
- * Currently these are just stubs, but we hope to do better soon.
+ * Estimates are based on null fraction, distinct value count, most common
+ * values, and histogram of inet/cidr datatypes.
  *
  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
  * IDENTIFICATION
  *   src/backend/utils/adt/network_selfuncs.c
  *
  *-
  */
 #include postgres.h
 
+#include math.h
+
+#include access/htup_details.h
+#include catalog/pg_collation.h
+#include catalog/pg_operator.h
+#include catalog/pg_statistic.h
+#include utils/lsyscache.h
 #include utils/inet.h
+#include utils/selfuncs.h
 
 
+/* Default selectivity constant for the inet overlap operator */
+#define DEFAULT_OVERLAP_SEL 0.01
+
+/* Default selectivity constant for the other operators */
+#define DEFAULT_INCLUSION_SEL 0.005
+
+/* Default selectivity for given operator */
+#define DEFAULT_SEL(operator) \
+   ((operator) == OID_INET_OVERLAP_OP ? \
+   DEFAULT_OVERLAP_SEL : DEFAULT_INCLUSION_SEL)
+
+static short int inet_opr_order(Oid operator, bool reversed);
+static Selectivity inet_his_inclusion_selec(Datum *values, int nvalues,
+   int red_nvalues, Datum *constvalue, 
double ndistinct,
+   short int opr_order);
+static Selectivity inet_mcv_join_selec(Datum *values1, float4 *numbers1,
+   int nvalues1, Datum *values2, float4 
*numbers2,
+   int nvalues2, Oid operator, bool 
reversed);
+static Selectivity inet_mcv_his_selec(Datum *mcv_values, float4 *mcv_numbers,
+   int mcv_nvalues, Datum *his_values, int 
his_nvalues,
+   int red_nvalues, short int opr_order);
+static Selectivity inet_his_inclusion_join_selec(Datum *his1_values,
+   int his1_nvalues, int red1_nvalues, 
Datum *his2_values,
+   int his2_nvalues, int red2_nvalues, 
short int opr_order);
+static short int inet_inclusion_cmp(inet *left, inet *right,
+   short 
int opr_order);
+static short int inet_masklen_inclusion_cmp(inet *left, inet *right,
+   
short int opr_order);
+static short int inet_his_match_divider(inet *boundary, inet *query,
+   
short int opr_order);
+
+/*
+ * Selectivity estimation for the subnet inclusion operators
+ */
 Datum
 networksel(PG_FUNCTION_ARGS)
 {
-   PG_RETURN_FLOAT8(0.001);
+   PlannerInfo*root = (PlannerInfo *) PG_GETARG_POINTER(0);
+   Oid operator = PG_GETARG_OID(1);
+   List   *args = (List *) PG_GETARG_POINTER(2);
+   int varRelid = PG_GETARG_INT32(3),
+   his_nvalues;
+   VariableStatData vardata;
+   Node   *other;
+   boolvaronleft;
+   Selectivity selec,
+   max_mcv_selec,
+   max_his_selec;
+   Datum   constvalue,
+  *his_values;
+   Form_pg_statistic stats;
+   FmgrInfoproc;
+
+   /*
+* If expression 

Re: [HACKERS] Selectivity estimation for inet operators

2014-07-02 Thread Dilip kumar
On, 15 May 2014 14:04 Emre Hasegeli Wrote, 

 
 * matching first MCV to second MCV
 * searching first MCV in the second histogram
 * searching second MCV in the first histogram
 * searching boundaries of the first histogram in the second histogram
 
 Comparing the lists with each other slows down the function when
 statistics set to higher values. To avoid this problem I only use
 log(n) values of the lists. It is the first log(n) value for MCV,
 evenly separated values for histograms. In my tests, this optimization
 does not affect the planning time when statistics = 100, but does
 affect accuracy of the estimation. I can send the version without this
 optimization, if slow down with larger statistics is not a problem
 which should be solved on the selectivity estimation function.


I have started reviewing this patch, so far I have done basic reviews and some 
testing/debugging.

1. Patch applied to git head.
2. Basic testing works fine.

I have one query,

In inet_his_inclusion_selec function, 
When the constant matches only the right side of the bucket, and if it’s a last 
bucket then it's never considered as partial match candidate.
In my opinion, if it's not a last bucket then for next bucket it will become 
left boundary and this will be treated as partial match so no problem, but 
in-case of last bucket it can give wrong selectivity.

Can't we consider it as partial bucket match if it is last bucket ?

Apart from that there is one spell check you can correct
-- in inet_his_inclusion_selec comments
histogram boundies  - histogram boundaries :)

Thanks  Regards,
Dilip Kumar





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-06-18 Thread Emre Hasegeli
I wanted to check the patch last time and found a bug effecting
MVC vs MVC part of the join selectivity. Fixed version attached.

Emre Hasegeli e...@hasegeli.com:
 Comparing the lists with each other slows down the function when
 statistics set to higher values. To avoid this problem I only use
 log(n) values of the lists. It is the first log(n) value for MCV,
 evenly separated values for histograms. In my tests, this optimization
 does not affect the planning time when statistics = 100, but does
 affect accuracy of the estimation. I can send the version without
 this optimization, if slow down with larger statistics is not a problem
 which should be solved on the selectivity estimation function.

Also, I changed this from log(n) to sqrt(n). It seems much better
now.

I try to explain the reason to processes some of the values with more
comments. I hope it is understandable.


inet-selfuncs-v5.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Selectivity estimation for inet operators

2014-05-15 Thread Emre Hasegeli
New version of the selectivity estimation patch attached. I am adding
it to CommitFest 2014-06. Previous version of it reviewed by
Andreas Karlson on the previous CommitFest with the GiST support patch.
The new version includes join selectivity estimation.

Join selectivity is calculated in 4 steps:

* matching first MCV to second MCV
* searching first MCV in the second histogram
* searching second MCV in the first histogram
* searching boundaries of the first histogram in the second histogram

Comparing the lists with each other slows down the function when
statistics set to higher values. To avoid this problem I only use
log(n) values of the lists. It is the first log(n) value for MCV,
evenly separated values for histograms. In my tests, this optimization
does not affect the planning time when statistics = 100, but does
affect accuracy of the estimation. I can send the version without
this optimization, if slow down with larger statistics is not a problem
which should be solved on the selectivity estimation function.

I also attach the script I was using for testing and I left log statements
in the networkjoinsel() function to make testing easier. These statements
should be removed before commit.


inet-selfuncs-v4.patch
Description: Binary data


inet-selfuncs-test.sql
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers