Re: Implement missing join selectivity estimation for range types

2024-01-05 Thread Schoemans Maxime
On 05/01/2024 11:37, vignesh C wrote:
 > One of the tests was aborted at [1], kindly post an updated patch for 
the same:

Thank you for notifying us.
I believe I fixed the issue but it is hard to be certain as the issue 
did not arise when running the regression tests locally.

Regards,
Maximediff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c b/src/backend/utils/adt/multirangetypes_selfuncs.c
index 981c1fd298..6abc43f149 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -1335,3 +1335,542 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
 
 	return sum_frac;
 }
+
+/*
+ * This is a utility function used to estimate the join selectivity of
+ * range attributes using rangebound histogram statistics as described
+ * in this paper:
+ *
+ * Diogo Repas, Zhicheng Luo, Maxime Schoemans and Mahmoud Sakr, 2022
+ * Selectivity Estimation of Inequality Joins In Databases
+ * https://doi.org/10.48550/arXiv.2206.07396
+ *
+ * The attributes being joined will be treated as random variables
+ * that follow a distribution modeled by a Probability Density Function (PDF).
+ * Let the two attributes be denoted X, Y.
+ * This function finds the probability P(X < Y).
+ * Note that the PDFs of the two variables can easily be obtained
+ * from their bounds histogram, respectively hist1 and hist2 .
+ *
+ * Let the PDF of X, Y be denoted as f_X, f_Y.
+ * The probability P(X < Y) can be formalized as follows:
+ * P(X < Y)= integral_-inf^inf( integral_-inf^y ( f_X(x) * f_Y(y) dx dy ) )
+ *= integral_-inf^inf( F_X(y) * f_Y(y) dy )
+ * where F_X(y) denote the Cumulative Distribution Function of X at y.
+ * Note that F_X is the selectivity estimation (non-join),
+ * which is implemented using the function calc_hist_selectivity_scalar.
+ *
+ * Now given the histograms of the two attributes X, Y, we note the following:
+ * - The PDF of Y is a step function
+ * (constant piece-wise, where each piece is defined in a bin of Y's histogram)
+ * - The CDF of X is linear piece-wise
+ *   (each piece is defined in a bin of X's histogram)
+ * This leads to the conclusion that their product
+ * (used to calculate the equation above) is also linear piece-wise.
+ * A new piece starts whenever either the bin of X or the bin of Y changes.
+ * By parallel scanning the two rangebound histograms of X and Y,
+ * we evaluate one piece of the result between every two consecutive rangebounds
+ * in the union of the two histograms.
+ *
+ * Given that the product F_X * f_y is linear in the interval
+ * between every two consecutive rangebounds, let them be denoted prev, cur,
+ * it can be shown that the above formula can be discretized into the following:
+ * P(X < Y) =
+ *   0.5 * sum_0^{n+m-1} ( ( F_X(prev) + F_X(cur) ) * ( F_Y(cur) - F_Y(prev) ) )
+ * where n, m are the lengths of the two histograms.
+ *
+ * As such, it is possible to fully compute the join selectivity
+ * as a summation of CDFs, iterating over the bounds of the two histograms.
+ * This maximizes the code reuse, since the CDF is computed using
+ * the calc_hist_selectivity_scalar function, which is the function used
+ * for selectivity estimation (non-joins).
+ *
+ */
+static double
+calc_hist_join_selectivity(TypeCacheEntry *typcache,
+		   const RangeBound *hist1, int nhist1,
+		   const RangeBound *hist2, int nhist2)
+{
+	int			i,
+j;
+	double		selectivity = 0.0,	/* initialization */
+prev_sel1 = -1.0,	/* to skip the first iteration */
+prev_sel2 = 0.0;	/* initialization */
+
+	/*
+	 * Histograms will never be empty. In fact, a histogram will never have
+	 * less than 2 values (1 bin)
+	 */
+	Assert(nhist1 > 1);
+	Assert(nhist2 > 1);
+
+	/* Fast-forwards i and j to start of iteration */
+	for (i = 0; range_cmp_bound_values(typcache, [i], [0]) < 0; i++);
+	for (j = 0; range_cmp_bound_values(typcache, [j], [0]) < 0; j++);
+
+	/* Do the estimation on overlapping regions */
+	while (i < nhist1 && j < nhist2)
+	{
+		double		cur_sel1,
+	cur_sel2;
+		RangeBound	cur_sync;
+
+		if (range_cmp_bound_values(typcache, [i], [j]) < 0)
+			cur_sync = hist1[i++];
+		else if (range_cmp_bound_values(typcache, [i], [j]) > 0)
+			cur_sync = hist2[j++];
+		else
+		{
+			/* If equal, skip one */
+			cur_sync = hist1[i];
+			i++;
+			j++;
+		}
+		cur_sel1 = calc_hist_selectivity_scalar(typcache, _sync,
+hist1, nhist1, false);
+		cur_sel2 = calc_hist_selectivity_scalar(typcache, _sync,
+hist2, nhist2, false);
+
+		/* Skip the first iteration */
+		if (prev_sel1 >= 0)
+			selectivity += (prev_sel1 + cur_sel1) * (cur_sel2 - prev_sel2);
+
+		/* Prepare for the next iteration */
+		prev_sel1 = cur_sel1;
+		prev_sel2 = cur_sel2;
+	}
+
+	/* P(X < Y) = 0.5 * Sum(...) */
+	selectivity /= 2;
+
+	/* Include remainder of hist2 if any */
+	if (j < nhist2)
+		selectivity += 1 - prev_sel2;
+
+	return selectivity;
+}
+
+/*
+ * multirangejoinsel -- join cardinality for 

Re: Implement missing join selectivity estimation for range types

2023-11-20 Thread Schoemans Maxime
On 14/11/2023 20:46, Tom Lane wrote:
> I took a brief look through this very interesting work.  I concur
> with Tomas that it feels a little odd that range join selectivity
> would become smarter than scalar inequality join selectivity, and
> that we really ought to prioritize applying these methods to that
> case.  Still, that's a poor reason to not take the patch.

Indeed, we started with ranges as this was the simpler case (no MCV) and 
was the topic of a course project.
The idea is to later write a second patch that applies these ideas to 
scalar inequality while also handling MCV's correctly.

> I also agree with the upthread criticism that having two identical
> functions in different source files will be a maintenance nightmare.
> Don't do it.  When and if there's a reason for the behavior to
> diverge between the range and multirange cases, it'd likely be
> better to handle that by passing in a flag to say what to do.

The duplication is indeed not ideal. However, there are already 8 other 
duplicate functions between the two files.
I would thus suggest to leave the duplication in this patch and create a 
second one that removes all duplication from the two files, instead of 
just removing the duplication for our new function.
What are your thoughts on this? If we do this, should the function 
definitions go in rangetypes.h or should we create a new 
rangetypes_selfuncs.h header?

> But my real unhappiness with the patch as-submitted is the test cases,
> which require rowcount estimates to be reproduced exactly.

> We need a more forgiving test method. Usually the
> approach is to set up a test case where the improved accuracy of
> the estimate changes the planner's choice of plan compared to what
> you got before, since that will normally not be too prone to change
> from variations of a percent or two in the estimates.

I have changed the test method to produce query plans for a 3-way range 
join.
The plans for the different operators differ due to the computed 
selectivity estimation, which was not the case before this patch.

Regards,
Maxime Schoemansdiff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c b/src/backend/utils/adt/multirangetypes_selfuncs.c
index cefc4710fd..c670d225a0 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -1335,3 +1335,558 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
 
 	return sum_frac;
 }
+
+/*
+ * This is a utility function used to estimate the join selectivity of
+ * range attributes using rangebound histogram statistics as described
+ * in this paper:
+ *
+ * Diogo Repas, Zhicheng Luo, Maxime Schoemans and Mahmoud Sakr, 2022
+ * Selectivity Estimation of Inequality Joins In Databases
+ * https://doi.org/10.48550/arXiv.2206.07396
+ *
+ * The attributes being joined will be treated as random variables
+ * that follow a distribution modeled by a Probability Density Function (PDF).
+ * Let the two attributes be denoted X, Y.
+ * This function finds the probability P(X < Y).
+ * Note that the PDFs of the two variables can easily be obtained
+ * from their bounds histogram, respectively hist1 and hist2 .
+ *
+ * Let the PDF of X, Y be denoted as f_X, f_Y.
+ * The probability P(X < Y) can be formalized as follows:
+ * P(X < Y)= integral_-inf^inf( integral_-inf^y ( f_X(x) * f_Y(y) dx dy ) )
+ *= integral_-inf^inf( F_X(y) * f_Y(y) dy )
+ * where F_X(y) denote the Cumulative Distribution Function of X at y.
+ * Note that F_X is the selectivity estimation (non-join),
+ * which is implemented using the function calc_hist_selectivity_scalar.
+ *
+ * Now given the histograms of the two attributes X, Y, we note the following:
+ * - The PDF of Y is a step function
+ * (constant piece-wise, where each piece is defined in a bin of Y's histogram)
+ * - The CDF of X is linear piece-wise
+ *   (each piece is defined in a bin of X's histogram)
+ * This leads to the conclusion that their product
+ * (used to calculate the equation above) is also linear piece-wise.
+ * A new piece starts whenever either the bin of X or the bin of Y changes.
+ * By parallel scanning the two rangebound histograms of X and Y,
+ * we evaluate one piece of the result between every two consecutive rangebounds
+ * in the union of the two histograms.
+ *
+ * Given that the product F_X * f_y is linear in the interval
+ * between every two consecutive rangebounds, let them be denoted prev, cur,
+ * it can be shown that the above formula can be discretized into the following:
+ * P(X < Y) =
+ *   0.5 * sum_0^{n+m-1} ( ( F_X(prev) + F_X(cur) ) * ( F_Y(cur) - F_Y(prev) ) )
+ * where n, m are the lengths of the two histograms.
+ *
+ * As such, it is possible to fully compute the join selectivity
+ * as a summation of CDFs, iterating over the bounds of the two histograms.
+ * This maximizes the code reuse, since the CDF is computed using
+ * the calc_hist_selectivity_scalar function, which is the function used

Re: Implement missing join selectivity estimation for range types

2023-07-07 Thread Schoemans Maxime
Hi,

Thank you for picking up this patch.

 > The patch doesn't apply to the current postgres version. Could you 
please update it?
Indeed, the code was initially written on pg15.
You can find attached a new version of the patch that can be applied on 
the current master branch of postgres.

Please let us know if there is anything else we can do.

Best regards,
Maxime Schoemansdiff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c b/src/backend/utils/adt/multirangetypes_selfuncs.c
index cefc4710fd..c670d225a0 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -1335,3 +1335,558 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
 
 	return sum_frac;
 }
+
+/*
+ * This is a utility function used to estimate the join selectivity of
+ * range attributes using rangebound histogram statistics as described
+ * in this paper:
+ *
+ * Diogo Repas, Zhicheng Luo, Maxime Schoemans and Mahmoud Sakr, 2022
+ * Selectivity Estimation of Inequality Joins In Databases
+ * https://doi.org/10.48550/arXiv.2206.07396
+ *
+ * The attributes being joined will be treated as random variables
+ * that follow a distribution modeled by a Probability Density Function (PDF).
+ * Let the two attributes be denoted X, Y.
+ * This function finds the probability P(X < Y).
+ * Note that the PDFs of the two variables can easily be obtained
+ * from their bounds histogram, respectively hist1 and hist2 .
+ *
+ * Let the PDF of X, Y be denoted as f_X, f_Y.
+ * The probability P(X < Y) can be formalized as follows:
+ * P(X < Y)= integral_-inf^inf( integral_-inf^y ( f_X(x) * f_Y(y) dx dy ) )
+ *= integral_-inf^inf( F_X(y) * f_Y(y) dy )
+ * where F_X(y) denote the Cumulative Distribution Function of X at y.
+ * Note that F_X is the selectivity estimation (non-join),
+ * which is implemented using the function calc_hist_selectivity_scalar.
+ *
+ * Now given the histograms of the two attributes X, Y, we note the following:
+ * - The PDF of Y is a step function
+ * (constant piece-wise, where each piece is defined in a bin of Y's histogram)
+ * - The CDF of X is linear piece-wise
+ *   (each piece is defined in a bin of X's histogram)
+ * This leads to the conclusion that their product
+ * (used to calculate the equation above) is also linear piece-wise.
+ * A new piece starts whenever either the bin of X or the bin of Y changes.
+ * By parallel scanning the two rangebound histograms of X and Y,
+ * we evaluate one piece of the result between every two consecutive rangebounds
+ * in the union of the two histograms.
+ *
+ * Given that the product F_X * f_y is linear in the interval
+ * between every two consecutive rangebounds, let them be denoted prev, cur,
+ * it can be shown that the above formula can be discretized into the following:
+ * P(X < Y) =
+ *   0.5 * sum_0^{n+m-1} ( ( F_X(prev) + F_X(cur) ) * ( F_Y(cur) - F_Y(prev) ) )
+ * where n, m are the lengths of the two histograms.
+ *
+ * As such, it is possible to fully compute the join selectivity
+ * as a summation of CDFs, iterating over the bounds of the two histograms.
+ * This maximizes the code reuse, since the CDF is computed using
+ * the calc_hist_selectivity_scalar function, which is the function used
+ * for selectivity estimation (non-joins).
+ *
+ */
+static double
+calc_hist_join_selectivity(TypeCacheEntry *typcache,
+		   const RangeBound *hist1, int nhist1,
+		   const RangeBound *hist2, int nhist2)
+{
+	int			i,
+j;
+	double		selectivity,
+cur_sel1,
+cur_sel2,
+prev_sel1,
+prev_sel2;
+	RangeBound	cur_sync;
+
+	/*
+	 * Histograms will never be empty. In fact, a histogram will never have
+	 * less than 2 values (1 bin)
+	 */
+	Assert(nhist1 > 1);
+	Assert(nhist2 > 1);
+
+	/* Fast-forwards i and j to start of iteration */
+	for (i = 0; range_cmp_bound_values(typcache, [i], [0]) < 0; i++);
+	for (j = 0; range_cmp_bound_values(typcache, [j], [0]) < 0; j++);
+
+	if (range_cmp_bound_values(typcache, [i], [j]) < 0)
+		cur_sync = hist1[i++];
+	else if (range_cmp_bound_values(typcache, [i], [j]) > 0)
+		cur_sync = hist2[j++];
+	else
+	{
+		/* If equal, skip one */
+		cur_sync = hist1[i];
+		i++;
+		j++;
+	}
+	prev_sel1 = calc_hist_selectivity_scalar(typcache, _sync,
+			 hist1, nhist1, false);
+	prev_sel2 = calc_hist_selectivity_scalar(typcache, _sync,
+			 hist2, nhist2, false);
+
+	/*
+	 * Do the estimation on overlapping region
+	 */
+	selectivity = 0.0;
+	while (i < nhist1 && j < nhist2)
+	{
+		if (range_cmp_bound_values(typcache, [i], [j]) < 0)
+			cur_sync = hist1[i++];
+		else if (range_cmp_bound_values(typcache, [i], [j]) > 0)
+			cur_sync = hist2[j++];
+		else
+		{
+			/* If equal, skip one */
+			cur_sync = hist1[i];
+			i++;
+			j++;
+		}
+		cur_sel1 = calc_hist_selectivity_scalar(typcache, _sync,
+hist1, nhist1, false);
+		cur_sel2 = calc_hist_selectivity_scalar(typcache, _sync,
+hist2, nhist2, false);
+
+		

Re: Implement missing join selectivity estimation for range types

2023-06-19 Thread Schoemans Maxime
This is a quick correction as the last patch contained a missing semicolon.

Regards,
Maxime SchoemansFrom ebd62356210eff2f38772a9c46a0a8792c0e9ce3 Mon Sep 17 00:00:00 2001
From: Maxime Schoemans 
Date: Mon, 20 Mar 2023 11:48:05 -0400
Subject: [PATCH v2] Apply division before adding remainder

---
 src/backend/utils/adt/multirangetypes_selfuncs.c | 5 -
 src/backend/utils/adt/rangetypes_selfuncs.c  | 5 -
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c b/src/backend/utils/adt/multirangetypes_selfuncs.c
index 7ba4aa8b04..ad14b789f4 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -1412,11 +1412,14 @@ calc_hist_join_selectivity(TypeCacheEntry *typcache,
 		prev_sel2 = cur_sel2;
 	}
 
+	/* P(X < Y) = 0.5 * Sum(...) */
+	selectivity /= 2;
+
 	/* Include remainder of hist2 if any */
 	if (j < nhist2)
 		selectivity += 1 - prev_sel2;
 
-	return selectivity / 2;
+	return selectivity;
 }
 
 /*
diff --git a/src/backend/utils/adt/rangetypes_selfuncs.c b/src/backend/utils/adt/rangetypes_selfuncs.c
index 007e14bcf6..90970943b3 100644
--- a/src/backend/utils/adt/rangetypes_selfuncs.c
+++ b/src/backend/utils/adt/rangetypes_selfuncs.c
@@ -1342,11 +1342,14 @@ calc_hist_join_selectivity(TypeCacheEntry *typcache,
 		prev_sel2 = cur_sel2;
 	}
 
+	/* P(X < Y) = 0.5 * Sum(...) */
+	selectivity /= 2;
+
 	/* Include remainder of hist2 if any */
 	if (j < nhist2)
 		selectivity += 1 - prev_sel2;
 
-	return selectivity / 2;
+	return selectivity;
 }
 
 /*
-- 
2.17.1



Re: Implement missing join selectivity estimation for range types

2023-06-19 Thread Schoemans Maxime
Hi,

In the selectivity algorithm, the division was applied after adding the 
remaining histogram buckets of histogram2 that don't overlap with histogram1.
This could lead to reducing selectivity by half, e.g., in the case that 
histogram2 is completely right of histogram1.
The correct calculation is to divide by two before adding the remainder.
This patch-fix does the needed.

Best regards,
Maxime Schoemans

On 20/03/2023 16:34, maxime wrote:
Hi Tomas,

As a quick update, the paper related to this work has finally been published in 
Mathematics (https://www.mdpi.com/2227-7390/11/6/1383).
During revision we also added a figure showing a comparison of our algorithm vs 
the existing algorithms in Oracle, SQL Server, MySQL and PostgreSQL, which can 
be found in the experiments section of the paper.
As can be seen, our algorithm outperforms even Oracle and SQL Server.

During this revision we also found a small bug, so we are working on a revision 
of the patch, which fixes this.


Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
a proper comment, not just "this is a copy from rangetypes".


Right, the comment should elaborate more that the collected statistics are
currently that same as rangetypes but may potentially deviate.



However, it seems the two functions are exactly the same. Would the
functions diverge in the future? If not, maybe there should be just a
single shared function?


Indeed, it is possible that the two functions will deviate if that statistics
of multirange types will be refined.



Right, but are there any such plans? Also, what's the likelihood we'll
add new statistics to only one of the places (e.g. for multiranges but
not plain ranges)?

I'd keep a single function until we actually need two. That's also
easier for maintenance - with two it's easy to fix a bug in one place
and forget about the other, etc.

Regarding our previous discussion about the duplication of 
calc_hist_join_selectivity in rangetypes_selfuncs.c and 
multirangetypes_selfuncs.c, we can also remove this duplication in the revision 
if needed.
Note that currently, there are no external functions shared between 
rangetypes_selfuncs.c and multirangetypes_selfuncs.c.
Any function that was used in both files was duplicated as a static function.
The functions calc_hist_selectivity_scalar, calc_length_hist_frac, 
calc_hist_selectivity_contained and calc_hist_selectivity_contains are examples 
of this, where the function is identical but has been declared static in both 
files.
That said, we can remove the duplication of calc_hist_join_selectivity if it 
still needed.
We would, however, require some guidance as to where to put the external 
definition of this function, as there does not appear to be a 
rangetypes_selfuncs.h header.
Should it simply go into utils/selfuncs.h or should we create a new header file?

Best regards,
Maxime Schoemans

From 53291919f536f6e7b04ca87f408e5b95e730ddb0 Mon Sep 17 00:00:00 2001
From: Maxime Schoemans 
Date: Mon, 20 Mar 2023 11:48:05 -0400
Subject: [PATCH] Apply division before adding remainder

---
 src/backend/utils/adt/multirangetypes_selfuncs.c | 5 -
 src/backend/utils/adt/rangetypes_selfuncs.c  | 5 -
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c b/src/backend/utils/adt/multirangetypes_selfuncs.c
index 7ba4aa8b04..ad14b789f4 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -1412,11 +1412,14 @@ calc_hist_join_selectivity(TypeCacheEntry *typcache,
 		prev_sel2 = cur_sel2;
 	}
 
+	/* P(X < Y) = 0.5 * Sum(...) */
+	selectivity /= 2;
+
 	/* Include remainder of hist2 if any */
 	if (j < nhist2)
 		selectivity += 1 - prev_sel2;
 
-	return selectivity / 2;
+	return selectivity;
 }
 
 /*
diff --git a/src/backend/utils/adt/rangetypes_selfuncs.c b/src/backend/utils/adt/rangetypes_selfuncs.c
index 007e14bcf6..129ef9648f 100644
--- a/src/backend/utils/adt/rangetypes_selfuncs.c
+++ b/src/backend/utils/adt/rangetypes_selfuncs.c
@@ -1342,11 +1342,14 @@ calc_hist_join_selectivity(TypeCacheEntry *typcache,
 		prev_sel2 = cur_sel2;
 	}
 
+	/* P(X < Y) = 0.5 * Sum(...) */
+	selectivity /= 2
+
 	/* Include remainder of hist2 if any */
 	if (j < nhist2)
 		selectivity += 1 - prev_sel2;
 
-	return selectivity / 2;
+	return selectivity;
 }
 
 /*
-- 
2.17.1



Re: Implement missing join selectivity estimation for range types

2023-03-20 Thread Schoemans Maxime
Hi Tomas,

As a quick update, the paper related to this work has finally been published in 
Mathematics (https://www.mdpi.com/2227-7390/11/6/1383).
During revision we also added a figure showing a comparison of our algorithm vs 
the existing algorithms in Oracle, SQL Server, MySQL and PostgreSQL, which can 
be found in the experiments section of the paper.
As can be seen, our algorithm outperforms even Oracle and SQL Server.

During this revision we also found a small bug, so we are working on a revision 
of the patch, which fixes this.


Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
a proper comment, not just "this is a copy from rangetypes".


Right, the comment should elaborate more that the collected statistics are
currently that same as rangetypes but may potentially deviate.



However, it seems the two functions are exactly the same. Would the
functions diverge in the future? If not, maybe there should be just a
single shared function?


Indeed, it is possible that the two functions will deviate if that statistics
of multirange types will be refined.



Right, but are there any such plans? Also, what's the likelihood we'll
add new statistics to only one of the places (e.g. for multiranges but
not plain ranges)?

I'd keep a single function until we actually need two. That's also
easier for maintenance - with two it's easy to fix a bug in one place
and forget about the other, etc.

Regarding our previous discussion about the duplication of 
calc_hist_join_selectivity in rangetypes_selfuncs.c and 
multirangetypes_selfuncs.c, we can also remove this duplication in the revision 
if needed.
Note that currently, there are no external functions shared between 
rangetypes_selfuncs.c and multirangetypes_selfuncs.c.
Any function that was used in both files was duplicated as a static function.
The functions calc_hist_selectivity_scalar, calc_length_hist_frac, 
calc_hist_selectivity_contained and calc_hist_selectivity_contains are examples 
of this, where the function is identical but has been declared static in both 
files.
That said, we can remove the duplication of calc_hist_join_selectivity if it 
still needed.
We would, however, require some guidance as to where to put the external 
definition of this function, as there does not appear to be a 
rangetypes_selfuncs.h header.
Should it simply go into utils/selfuncs.h or should we create a new header file?

Best regards,
Maxime Schoemans