Re: PATCH: Using BRIN indexes for sorted output

2024-02-01 Thread vignesh C
On Sun, 21 Jan 2024 at 07:32, vignesh C  wrote:
>
> On Wed, 2 Aug 2023 at 21:34, Tomas Vondra  
> wrote:
> >
> >
> >
> > On 8/2/23 17:25, Sergey Dudoladov wrote:
> > > Hello,
> > >
> > >> Parallel version is not supported, but I think it should be possible.
> > >
> > > @Tomas, are you working on this? If not, I would like to give it a try.
> > >
> >
> > Feel free to try. Just keep it in a separate part/patch, to make it
> > easier to combine the work later.
> >
> > >> static void
> > >> AssertCheckRanges(BrinSortState *node)
> > >> {
> > >> #ifdef USE_ASSERT_CHECKING
> > >>
> > >> #endif
> > >> }
> > >
> > > I guess it should not be empty at the ongoing development stage.
> > >
> > > Attached a small modification of the patch with a draft of the docs.
> > >
> >
> > Thanks. FWIW it's generally better to always post the whole patch
> > series, otherwise the cfbot gets confused as it's unable to combine
> > stuff from different messages.
>
> Are we planning to take this patch forward? It has been nearly 5
> months since the last discussion. If interest has waned and there
> are no plans to pursue this, I'm thinking of returning this
> commitfest entry; it can be reopened when there is more interest.

Since neither the author nor anyone else has shown interest in taking
it forward, and the patch has had no activity for more than 5 months,
I have changed the status to RWF. Feel free to add a new CF entry when
someone plans to resume work, starting with a rebased version.

Regards,
Vignesh




Re: PATCH: Using BRIN indexes for sorted output

2024-01-20 Thread vignesh C
On Wed, 2 Aug 2023 at 21:34, Tomas Vondra  wrote:
>
>
>
> On 8/2/23 17:25, Sergey Dudoladov wrote:
> > Hello,
> >
> >> Parallel version is not supported, but I think it should be possible.
> >
> > @Tomas, are you working on this? If not, I would like to give it a try.
> >
>
> Feel free to try. Just keep it in a separate part/patch, to make it
> easier to combine the work later.
>
> >> static void
> >> AssertCheckRanges(BrinSortState *node)
> >> {
> >> #ifdef USE_ASSERT_CHECKING
> >>
> >> #endif
> >> }
> >
> > I guess it should not be empty at the ongoing development stage.
> >
> > Attached a small modification of the patch with a draft of the docs.
> >
>
> Thanks. FWIW it's generally better to always post the whole patch
> series, otherwise the cfbot gets confused as it's unable to combine
> stuff from different messages.

Are we planning to take this patch forward? It has been nearly 5
months since the last discussion. If interest has waned and there are
no plans to pursue this, I'm thinking of returning this commitfest
entry; it can be reopened when there is more interest.

Regards,
Vignesh




Re: PATCH: Using BRIN indexes for sorted output

2023-08-02 Thread Tomas Vondra



On 8/2/23 17:25, Sergey Dudoladov wrote:
> Hello,
> 
>> Parallel version is not supported, but I think it should be possible.
> 
> @Tomas, are you working on this? If not, I would like to give it a try.
> 

Feel free to try. Just keep it in a separate part/patch, to make it
easier to combine the work later.

>> static void
>> AssertCheckRanges(BrinSortState *node)
>> {
>> #ifdef USE_ASSERT_CHECKING
>>
>> #endif
>> }
> 
> I guess it should not be empty at the ongoing development stage.
> 
> Attached a small modification of the patch with a draft of the docs.
> 

Thanks. FWIW it's generally better to always post the whole patch
series, otherwise the cfbot gets confused as it's unable to combine
stuff from different messages.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2023-08-02 Thread Sergey Dudoladov
Hello,

> Parallel version is not supported, but I think it should be possible.

@Tomas, are you working on this? If not, I would like to give it a try.

> static void
> AssertCheckRanges(BrinSortState *node)
> {
> #ifdef USE_ASSERT_CHECKING
>
> #endif
> }

I guess it should not be empty at the ongoing development stage.

Attached a small modification of the patch with a draft of the docs.

Regards,
Sergey Dudoladov
From d4050be4bfd0a518eba0ff0a7b561f0420be9861 Mon Sep 17 00:00:00 2001
From: Sergey Dudoladov 
Date: Wed, 2 Aug 2023 16:47:35 +0200
Subject: [PATCH] BRIN sort: add docs / fix typos

---
 doc/src/sgml/brin.sgml  | 48 +
 src/backend/executor/nodeBrinSort.c |  9 +++---
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml
index 9c5ffcddf8..a76f26e032 100644
--- a/doc/src/sgml/brin.sgml
+++ b/doc/src/sgml/brin.sgml
@@ -810,6 +810,54 @@ LOG:  request for BRIN range summarization for index "brin_wi_idx" page 128 was
 
 
 
+
+ BRIN Sort
+
+ 
+  This section provides an overview of BRIN sort that may be
+  useful to advanced users. See src/backend/executor/nodeBrinSort.c
+  in the source distribution for the implementation.
+ 
+
+ 
+  Overview
+  
+   The information about ranges in a BRIN index can be used
+   to split a table into parts and sort each part separately. A BRIN
+   sort is an index scan that uses this idea to sort the data incrementally. On
+   large tables, a BRIN sort bridges the performance gap
+   between having a large B-tree index that supports ordered retrieval and having
+   no index at all.
+  
+ 
+
+ 
+  Implementation
+  
+
+  BRIN sort fetches a list of page ranges from the BRIN
+  index and sorts them by their summary info depending on the operator class and
+  sort direction. It then determines the watermark: the next summary value from
+  the list. Tuples with values less than the watermark are sorted and
+  output directly; tuples with values greater than the watermark are spilled
+  into the next iteration. The process continues until either all ranges are
+  processed or the desired number of tuples has been retrieved.
+  
+ 
+
+ 
+  Limitations
+  
+   The performance of the BRIN sort depends on the degree of
+   correlation between the values of the indexed column and their physical location
+   on disk: in the case of high correlation, there will be few overlapping ranges,
+   and performance will be best. Another limitation is that BRIN
+   sort only works for indexes built on a single column.
+  
+ 
+
+
+
 
  Extensibility
 
diff --git a/src/backend/executor/nodeBrinSort.c b/src/backend/executor/nodeBrinSort.c
index d1b6b4e1ed..c640aea2d5 100644
--- a/src/backend/executor/nodeBrinSort.c
+++ b/src/backend/executor/nodeBrinSort.c
@@ -731,8 +731,9 @@ brinsort_load_spill_tuples(BrinSortState *node, bool check_watermark)
 static bool
 brinsort_next_range(BrinSortState *node, bool asc)
 {
-	/* FIXME free the current bs_range, if any */
-	node->bs_range = NULL;
+	/* free the current bs_range, if any */
+	if (node->bs_range != NULL)
+		pfree(node->bs_range);
 
 	/*
 	 * Mark the position, so that we can restore it in case we reach the
@@ -1227,7 +1228,7 @@ IndexNext(BrinSortState *node)
 
 			case BRINSORT_PROCESS_RANGE:
 
-elog(DEBUG1, "phase BRINSORT_PROCESS_RANGE");
+elog(DEBUG1, "phase = BRINSORT_PROCESS_RANGE");
 
 slot = node->ss.ps.ps_ResultTupleSlot;
 
@@ -1263,7 +1264,7 @@ IndexNext(BrinSortState *node)
 }
 else
 {
-	/* updte the watermark and try reading more ranges */
+	/* update the watermark and try reading more ranges */
 	node->bs_phase = BRINSORT_LOAD_RANGE;
 	brinsort_update_watermark(node, false, asc);
 }
-- 
2.34.1



Re: PATCH: Using BRIN indexes for sorted output

2023-07-14 Thread Tomas Vondra



On 7/14/23 16:42, Matthias van de Meent wrote:
> On Fri, 14 Jul 2023 at 16:21, Tomas Vondra
>  wrote:
>>
>>
>>
>>
>> On 7/11/23 13:20, Matthias van de Meent wrote:
>>> On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
>>>  wrote:
 Approximation in what sense? My understanding was we'd get a range of
 distances that we know covers all rows in that range. So the results
 should be accurate, no?
>>>
>>> The distance is going to be accurate only to the degree that the
>>> summary can produce accurate distances for the datapoints it
>>> represents. That can be quite imprecise due to the nature of the
>>> contained datapoints: a summary of the points (-1, -1) and (1, 1) will
>>> have a minimum distance of 0 to the origin, where the summary (-1, 0)
>>> and (-1, 0.5) would have a much more accurate distance of 1.
>>
>> Ummm, I'm probably missing something, or maybe my mental model of this
>> is just wrong, but why would the distance for the second summary be more
>> accurate? Or what does "more accurate" mean?
>>
>> Is that about the range of distances for the summary? For the first
>> range the summary is a bounding box [(-1,1), (1,1)] so all we know the
>> points may have distance in range [0, sqrt(2)]. While for the second
>> summary it's [1, sqrt(1.25)].
> 
> Yes; I was trying to refer to the difference between what results you
> get from the summary vs what results you get from the actual
> datapoints: In this case, for finding points which are closest to the
> origin, the first bounding box has a less accurate estimate than the
> second.
> 

OK. I think regular minmax indexes have a similar issue with
non-distance ordering, because we don't know if the min/max values are
still in the page range (or deleted/updated).

>>> The point I was making is that the summary can only approximate the
>>> distance, and that approximation is fine w.r.t. the BRIN sort
>>> algorithm.
>>>
>>
>> I think as long as the approximation (whatever it means) does not cause
>> differences in results (compared to not using an index), it's OK.
> 

I haven't written any code yet, but I think if we don't try to find the
exact min/max distances for the summary (e.g. by calculating the closest
point exactly) but rather "estimates" that are guaranteed to bound the
actual min/max, that's good enough for the sorting.

For the max, this probably is not an issue, as we can just calculate
distance for the corners and use a maximum of that. At least with
reasonable Euclidean distance ... in 2D I'm imagining the bounding box
summary as a rectangle, with the "max distance" being a minimum radius
of a circle containing it (the rectangle).

For min we're looking for the largest radius not intersecting with the
box, which seems harder to calculate, I think.

However, now that I'm thinking about it - don't (SP-)GiST indexes
already do pretty much exactly this?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2023-07-14 Thread Matthias van de Meent
On Fri, 14 Jul 2023 at 16:21, Tomas Vondra
 wrote:
>
>
>
>
> On 7/11/23 13:20, Matthias van de Meent wrote:
>> On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
>>  wrote:
>>> Approximation in what sense? My understanding was we'd get a range of
>>> distances that we know covers all rows in that range. So the results
>>> should be accurate, no?
>>
>> The distance is going to be accurate only to the degree that the
>> summary can produce accurate distances for the datapoints it
>> represents. That can be quite imprecise due to the nature of the
>> contained datapoints: a summary of the points (-1, -1) and (1, 1) will
>> have a minimum distance of 0 to the origin, where the summary (-1, 0)
>> and (-1, 0.5) would have a much more accurate distance of 1.
>
> Ummm, I'm probably missing something, or maybe my mental model of this
> is just wrong, but why would the distance for the second summary be more
> accurate? Or what does "more accurate" mean?
>
> Is that about the range of distances for the summary? For the first
> range the summary is a bounding box [(-1,1), (1,1)] so all we know the
> points may have distance in range [0, sqrt(2)]. While for the second
> summary it's [1, sqrt(1.25)].

Yes; I was trying to refer to the difference between what results you
get from the summary vs what results you get from the actual
datapoints: In this case, for finding points which are closest to the
origin, the first bounding box has a less accurate estimate than the
second.

> > The point I was making is that the summary can only approximate the
> > distance, and that approximation is fine w.r.t. the BRIN sort
> > algorithm.
> >
>
> I think as long as the approximation (whatever it means) does not cause
> differences in results (compared to not using an index), it's OK.

Agreed.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: PATCH: Using BRIN indexes for sorted output

2023-07-14 Thread Tomas Vondra




On 7/11/23 13:20, Matthias van de Meent wrote:
> On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
>  wrote:
>> On 7/10/23 18:18, Matthias van de Meent wrote:
>>> On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
>>>  wrote:
 On 7/10/23 14:38, Matthias van de Meent wrote:
>> I haven't really thought about geometric types, just about minmax and
>> minmax-multi. It's not clear to me what the benefit for these types be.
>> I mean, we can probably sort points lexicographically, but is anyone
>> doing that in queries? It seems useless for order by distance.
>
> Yes, that's why you would sort them by distance, where the distance is
> generated by the opclass as min/max distance between the summary and
> the distance's origin, and then inserted into the tuplesort.
>

 OK, so the query says "order by distance from point X" and we calculate
 the min/max distance of values in a given page range.
>>>
>>> Yes, and because it's BRIN that's an approximation, which should
>>> generally be fine.
>>>
>>
>> Approximation in what sense? My understanding was we'd get a range of
>> distances that we know covers all rows in that range. So the results
>> should be accurate, no?
> 
> The distance is going to be accurate only to the degree that the
> summary can produce accurate distances for the datapoints it
> represents. That can be quite imprecise due to the nature of the
> contained datapoints: a summary of the points (-1, -1) and (1, 1) will
> have a minimum distance of 0 to the origin, where the summary (-1, 0)
> and (-1, 0.5) would have a much more accurate distance of 1.

Ummm, I'm probably missing something, or maybe my mental model of this
is just wrong, but why would the distance for the second summary be more
accurate? Or what does "more accurate" mean?

Is that about the range of distances for the summary? For the first
range the summary is a bounding box [(-1,1), (1,1)] so all we know the
points may have distance in range [0, sqrt(2)]. While for the second
summary it's [1, sqrt(1.25)].

> The point I was making is that the summary can only approximate the
> distance, and that approximation is fine w.r.t. the BRIN sort
> algorithm.
> 

I think as long as the approximation (whatever it means) does not cause
differences in results (compared to not using an index), it's OK.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2023-07-11 Thread Matthias van de Meent
On Mon, 10 Jul 2023 at 22:04, Tomas Vondra
 wrote:
> On 7/10/23 18:18, Matthias van de Meent wrote:
>> On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
>>  wrote:
>>> On 7/10/23 14:38, Matthias van de Meent wrote:
> I haven't really thought about geometric types, just about minmax and
> minmax-multi. It's not clear to me what the benefit for these types be.
> I mean, we can probably sort points lexicographically, but is anyone
> doing that in queries? It seems useless for order by distance.

 Yes, that's why you would sort them by distance, where the distance is
 generated by the opclass as min/max distance between the summary and
 the distance's origin, and then inserted into the tuplesort.

>>>
>>> OK, so the query says "order by distance from point X" and we calculate
>>> the min/max distance of values in a given page range.
>>
>> Yes, and because it's BRIN that's an approximation, which should
>> generally be fine.
>>
>
> Approximation in what sense? My understanding was we'd get a range of
> distances that we know covers all rows in that range. So the results
> should be accurate, no?

The distance is going to be accurate only to the degree that the
summary can produce accurate distances for the datapoints it
represents. That can be quite imprecise due to the nature of the
contained datapoints: a summary of the points (-1, -1) and (1, 1) will
have a minimum distance of 0 to the origin, where the summary (-1, 0)
and (-1, 0.5) would have a much more accurate distance of 1. The point
I was making is that the summary can only approximate the distance,
and that approximation is fine w.r.t. the BRIN sort algorithm.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)




Re: PATCH: Using BRIN indexes for sorted output

2023-07-10 Thread Tomas Vondra



On 7/10/23 18:18, Matthias van de Meent wrote:
> On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
>  wrote:
>> On 7/10/23 14:38, Matthias van de Meent wrote:
>>> Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
>>> only need to extract the normal min/max values for ASC/DESC sorts,
>>> which are readily available in the summary. But for multi-dimensional
>>> and distance searches (nearest neighbour) we need to calculate the
>>> distance between the indexed value(s) and the origin value to compare
>>> the summary against, and the order would thus be asc/desc on distance
>>> - a distance which may not be precisely represented by float types -
>>> thus 'relative order' with its own order operation.
>>>
>>
>> Can you give some examples of such data / queries, and how would it
>> leverage the BRIN sort stuff?
> 
> Order by distance would be `ORDER BY box <-> '(1, 2)'::point ASC`, and
> the opclass would then decide that `<->(box, point) ASC` means it has
> to return the closest distance from the point to the summary, for some
> measure of 'distance' (in this case L2; <#> for other types, etc.). For DESC,
> that would return the distance from `'(1,2)'::point` to the furthest
> edge of the summary away from that point. Etc.
> 

Thanks.

>> For distance searches, I imagine this as data indexed by BRIN inclusion
>> opclass, which creates a bounding box. We could return closest/furthest
>> point on the bounding box (from the point used in the query). Which
>> seems a bit like a R-tree ...
> 
> Kind of; it would allow us to utilize such orderings without the
> expensive 1 tuple = 1 index entry and without scanning the full table
> before getting results. No tree involved, just a sequential scan on
> the index to allow some sketch-based pre-sort on the data. Again, this
> would work similar to how GiST's internal pages work: each downlink in
> GiST contains a summary of the entries on the downlinked page, and
> distance searches use a priority queue where the priority is the
> distance of the opclass-provided distance operator - lower distance
> means higher priority.

Yes, that's roughly how I understood this too - a tradeoff that won't
give the same performance as GiST, but is much smaller and cheaper to maintain.

> For BRIN, we'd have to build a priority queue
> for the whole table at once, but presorting table sections is part of
> the design of BRIN sort, right?

Yes, that's kinda the whole point of BRIN sort.

> 
>> But I have no idea what would this do for multi-dimensional searches, or
>> what would those searches do? How would you sort such data other than
>> lexicographically? Which I think is covered by the current BRIN Sort,
>> because the data is either stored as multiple columns, in which case we
>> use the BRIN on the first column. Or it's indexed using BRIN minmax as a
>> tuple of values, but then it's sorted lexicographically.
> 
> Yes, just any BRIN summary that allows distance operators and the like
> should be enough. MINMAX is easy to understand, and box inclusion is
> IMO also fairly easy to understand.
> 

True. If minmax is interpreted as inclusion with simple 1D points, it
kinda does the same thing. (Of course, minmax works with data types that
don't have distances, but there's a similarity.)

 I haven't really thought about geometric types, just about minmax and
 minmax-multi. It's not clear to me what the benefit for these types be.
 I mean, we can probably sort points lexicographically, but is anyone
 doing that in queries? It seems useless for order by distance.
>>>
>>> Yes, that's why you would sort them by distance, where the distance is
>>> generated by the opclass as min/max distance between the summary and
>>> the distance's origin, and then inserted into the tuplesort.
>>>
>>
>> OK, so the query says "order by distance from point X" and we calculate
>> the min/max distance of values in a given page range.
> 
> Yes, and because it's BRIN that's an approximation, which should
> generally be fine.
> 

Approximation in what sense? My understanding was we'd get a range of
distances that we know covers all rows in that range. So the results
should be accurate, no?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2023-07-10 Thread Matthias van de Meent
On Mon, 10 Jul 2023 at 17:09, Tomas Vondra
 wrote:
> On 7/10/23 14:38, Matthias van de Meent wrote:
>> Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
>> only need to extract the normal min/max values for ASC/DESC sorts,
>> which are readily available in the summary. But for multi-dimensional
>> and distance searches (nearest neighbour) we need to calculate the
>> distance between the indexed value(s) and the origin value to compare
>> the summary against, and the order would thus be asc/desc on distance
>> - a distance which may not be precisely represented by float types -
>> thus 'relative order' with its own order operation.
>>
>
> Can you give some examples of such data / queries, and how would it
> leverage the BRIN sort stuff?

Order by distance would be `ORDER BY box <-> '(1, 2)'::point ASC`, and
the opclass would then decide that `<->(box, point) ASC` means it has
to return the closest distance from the point to the summary, for some
measure of 'distance' (in this case L2; <#> for other types, etc.). For DESC,
that would return the distance from `'(1,2)'::point` to the furthest
edge of the summary away from that point. Etc.

> For distance searches, I imagine this as data indexed by BRIN inclusion
> opclass, which creates a bounding box. We could return closest/furthest
> point on the bounding box (from the point used in the query). Which
> seems a bit like a R-tree ...

Kind of; it would allow us to utilize such orderings without the
expensive 1 tuple = 1 index entry and without scanning the full table
before getting results. No tree involved, just a sequential scan on
the index to allow some sketch-based pre-sort on the data. Again, this
would work similar to how GiST's internal pages work: each downlink in
GiST contains a summary of the entries on the downlinked page, and
distance searches use a priority queue where the priority is the
distance of the opclass-provided distance operator - lower distance
means higher priority. For BRIN, we'd have to build a priority queue
for the whole table at once, but presorting table sections is part of
the design of BRIN sort, right?

> But I have no idea what would this do for multi-dimensional searches, or
> what would those searches do? How would you sort such data other than
> lexicographically? Which I think is covered by the current BRIN Sort,
> because the data is either stored as multiple columns, in which case we
> use the BRIN on the first column. Or it's indexed using BRIN minmax as a
> tuple of values, but then it's sorted lexicographically.

Yes, just any BRIN summary that allows distance operators and the like
should be enough. MINMAX is easy to understand, and box inclusion is
IMO also fairly easy to understand.

>>> I haven't really thought about geometric types, just about minmax and
>>> minmax-multi. It's not clear to me what the benefit for these types be.
>>> I mean, we can probably sort points lexicographically, but is anyone
>>> doing that in queries? It seems useless for order by distance.
>>
>> Yes, that's why you would sort them by distance, where the distance is
>> generated by the opclass as min/max distance between the summary and
>> the distance's origin, and then inserted into the tuplesort.
>>
>
> OK, so the query says "order by distance from point X" and we calculate
> the min/max distance of values in a given page range.

Yes, and because it's BRIN that's an approximation, which should
generally be fine.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)




Re: PATCH: Using BRIN indexes for sorted output

2023-07-10 Thread Matthias van de Meent
On Mon, 10 Jul 2023 at 13:43, Tomas Vondra
 wrote:
> On 7/10/23 12:22, Matthias van de Meent wrote:
>> On Fri, 7 Jul 2023 at 20:26, Tomas Vondra  
>> wrote:
>>> However, it's not quite clear to me is what you mean by the weight- and
>>> compare-operators? That is, what are
>>>
>>>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>>>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>>>  - brin_minmax_compare(order, order) -> int
>>>
>>> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?
>>
>> _minorder/_maxorder is for extracting the minimum/maximum relative
>> order of each range, used for ASC/DESC sorting of operator results
>> (e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
>> PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
>> though I did forget to describe the second operator argument for the
>> distance function. We might also want to use only one such "order
>> extraction function" with DESC/ASC indicated by an argument.
>>
>
> I'm still not sure I understand what "minimum/maximum relative order"
> is. Isn't it the same as returning the min/max value that can appear in the
> range? Although, that wouldn't work for points/boxes.

Kind of. For single-dimensional opclasses (minmax, minmax_multi) we
only need to extract the normal min/max values for ASC/DESC sorts,
which are readily available in the summary. But for multi-dimensional
and distance searches (nearest neighbour) we need to calculate the
distance between the indexed value(s) and the origin value to compare
the summary against, and the order would thus be asc/desc on distance
- a distance which may not be precisely represented by float types -
thus 'relative order' with its own order operation.

>>> In principle we just need a procedure that tells us min/max for a given
>>> page range - I guess that's what the minorder/maxorder functions do? But
>>> why would we need the compare one? We're comparing by the known data
>>> type, so we can just delegate the comparison to that, no?
>>
>> Is there a comparison function for any custom orderable type that we
>> can just use? GIST distance ordering uses floats, and I don't quite
>> like that from a user perspective, as it makes ordering operations
>> imprecise. I'd rather allow (but discourage) any type with its own
>> compare function.
>>
>
> I haven't really thought about geometric types, just about minmax and
> minmax-multi. It's not clear to me what the benefit for these types be.
> I mean, we can probably sort points lexicographically, but is anyone
> doing that in queries? It seems useless for order by distance.

Yes, that's why you would sort them by distance, where the distance is
generated by the opclass as min/max distance between the summary and
the distance's origin, and then inserted into the tuplesort.

(previously)
>>> I finally had time to look at this patch again. There's a bit of bitrot,
>>> so here's a rebased version (no other changes).

It seems like you forgot to attach the rebased patch, so unless you're
actively working on updating the patchset right now, could you send
the rebase to make CFBot happy?


Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)




Re: PATCH: Using BRIN indexes for sorted output

2023-07-10 Thread Tomas Vondra



On 7/10/23 12:22, Matthias van de Meent wrote:
> On Fri, 7 Jul 2023 at 20:26, Tomas Vondra  
> wrote:
>>
>> Hi,
>>
>> I finally had time to look at this patch again. There's a bit of bitrot,
>> so here's a rebased version (no other changes).
> 
> Thanks!
> 
>> On 2/27/23 16:40, Matthias van de Meent wrote:
>>> Some initial comments on 0004:
>>>
 +/*
 + * brin_minmax_ranges
 + *Load the BRIN ranges and sort them.
 + */
 +Datum
 +brin_minmax_ranges(PG_FUNCTION_ARGS)
 +{
>>>
>>> Like in 0001, this seems to focus on only single columns. Can't we put
>>> the scan and sort infrastructure in brin, and put the weight- and
>>> compare-operators in the opclasses? I.e.
>>> brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
>>> brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
>>> brin_minmax_compare(order, order) -> int? I'm thinking of something
>>> similar to GIST's distance operators, which would make implementing
>>> ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
>>> brin infrastructure.
>>>
>>
>> However, it's not quite clear to me is what you mean by the weight- and
>> compare-operators? That is, what are
>>
>>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>>  - brin_minmax_compare(order, order) -> int
>>
>> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?
> 
> _minorder/_maxorder is for extracting the minimum/maximum relative
> order of each range, used for ASC/DESC sorting of operator results
> (e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
> PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
> though I did forget to describe the second operator argument for the
> distance function. We might also want to use only one such "order
> extraction function" with DESC/ASC indicated by an argument.
> 

I'm still not sure I understand what "minimum/maximum relative order"
is. Isn't it the same as returning the min/max value that can appear in the
range? Although, that wouldn't work for points/boxes.

>> In principle we just need a procedure that tells us min/max for a given
>> page range - I guess that's what the minorder/maxorder functions do? But
>> why would we need the compare one? We're comparing by the known data
>> type, so we can just delegate the comparison to that, no?
> 
> Is there a comparison function for any custom orderable type that we
> can just use? GIST distance ordering uses floats, and I don't quite
> like that from a user perspective, as it makes ordering operations
> imprecise. I'd rather allow (but discourage) any type with its own
> compare function.
> 

I haven't really thought about geometric types, just about minmax and
minmax-multi. It's not clear to me what the benefit for these types be.
I mean, we can probably sort points lexicographically, but is anyone
doing that in queries? It seems useless for order by distance.

>> Also, the existence of these opclass procedures should be enough to
>> identify the opclasses which can support this.
> 
> Agreed.
> 
 +/*
 + * XXX Does it make sense (is it possible) to have a sort by more than one
 + * column, using a BRIN index?
 + */
>>>
>>> Yes, even if only one prefix column is included in the BRIN index
>>> (e.g. `company` in `ORDER BY company, date`, the tuplesort with table
>>> tuples can add additional sorting without first reading the whole
>>> table, potentially (likely) reducing the total resource usage of the
>>> query. That utilizes the same idea as incremental sorts, but with the
>>> caveat that the input sort order is approximately likely but not at
>>> all guaranteed. So, even if the range sort is on a single index
>>> column, we can still do the table's tuplesort on all ORDER BY
>>> attributes, as long as a prefix of ORDER BY columns are included in
>>> the BRIN index.
>>>
>>
>> That's not quite what I meant, though. When I mentioned sorting by more
>> than one column, I meant using a multi-column BRIN index on those
>> columns. Something like this:
>>
>>    CREATE TABLE t (a int, b int);
>>    INSERT INTO t ...
>>    CREATE INDEX ON t USING brin (a,b);
>>
>>    SELECT * FROM t ORDER BY a, b;
>>
>> Now, what I think you described is using BRIN to sort by "a", and then
>> do incremental sort for "b". What I had in mind is whether it's possible
>> to use BRIN to sort by "b" too.
>>
>> I was suspecting it might be made to work, but now that I think about it
>> again it probably can't - BRIN pretty much sorts the columns separately,
>> it's not like having an ordering by (a,b) - first by "a", then "b".
> 
> I imagine it would indeed be limited to an extremely small subset of
> cases, and probably not worth the effort to implement in an initial
> version.
> 

OK, let's stick to order by a single column.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: PATCH: Using BRIN indexes for sorted output

2023-07-10 Thread Matthias van de Meent
On Fri, 7 Jul 2023 at 20:26, Tomas Vondra  wrote:
>
> Hi,
>
> I finally had time to look at this patch again. There's a bit of bitrot,
> so here's a rebased version (no other changes).

Thanks!

> On 2/27/23 16:40, Matthias van de Meent wrote:
> > Some initial comments on 0004:
> >
> >> +/*
> >> + * brin_minmax_ranges
> >> + *    Load the BRIN ranges and sort them.
> >> + */
> >> +Datum
> >> +brin_minmax_ranges(PG_FUNCTION_ARGS)
> >> +{
> >
> > Like in 0001, this seems to focus on only single columns. Can't we put
> > the scan and sort infrastructure in brin, and put the weight- and
> > compare-operators in the opclasses? I.e.
> > brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
> > brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
> > brin_minmax_compare(order, order) -> int? I'm thinking of something
> > similar to GIST's distance operators, which would make implementing
> > ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
> > brin infrastructure.
> >
>
> However, it's not quite clear to me what you mean by the weight- and
> compare-operators? That is, what are
>
>  - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
>  - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
>  - brin_minmax_compare(order, order) -> int
>
> supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?

_minorder/_maxorder is for extracting the minimum/maximum relative
order of each range, used for ASC/DESC sorting of operator results
(e.g. to support ORDER BY <->(box_column, '(1,2)'::point) DESC).
PG_FUNCTION_ARGS is mentioned because of PG calling conventions;
though I did forget to describe the second operator argument for the
distance function. We might also want to use only one such "order
extraction function" with DESC/ASC indicated by an argument.

> In principle we just need a procedure that tells us min/max for a given
> page range - I guess that's what the minorder/maxorder functions do? But
> why would we need the compare one? We're comparing by the known data
> type, so we can just delegate the comparison to that, no?

Is there a comparison function for any custom orderable type that we
can just use? GIST distance ordering uses floats, and I don't quite
like that from a user perspective, as it makes ordering operations
imprecise. I'd rather allow (but discourage) any type with its own
compare function.

> Also, the existence of these opclass procedures should be enough to
> identify the opclasses which can support this.

Agreed.

> >> +/*
> >> + * XXX Does it make sense (is it possible) to have a sort by more than one
> >> + * column, using a BRIN index?
> >> + */
> >
> > Yes, even if only one prefix column is included in the BRIN index
> > (e.g. `company` in `ORDER BY company, date`), the tuplesort with table
> > tuples can add additional sorting without first reading the whole
> > table, potentially (likely) reducing the total resource usage of the
> > query. That utilizes the same idea as incremental sorts, but with the
> > caveat that the input sort order is approximately likely but not at
> > all guaranteed. So, even if the range sort is on a single index
> > column, we can still do the table's tuplesort on all ORDER BY
> > attributes, as long as a prefix of ORDER BY columns are included in
> > the BRIN index.
> >
>
> That's not quite what I meant, though. When I mentioned sorting by more
> than one column, I meant using a multi-column BRIN index on those
> columns. Something like this:
>
>    CREATE TABLE t (a int, b int);
>    INSERT INTO t ...
>    CREATE INDEX ON t USING brin (a,b);
>
>    SELECT * FROM t ORDER BY a, b;
>
> Now, what I think you described is using BRIN to sort by "a", and then
> do incremental sort for "b". What I had in mind is whether it's possible
> to use BRIN to sort by "b" too.
>
> I was suspecting it might be made to work, but now that I think about it
> again it probably can't - BRIN pretty much sorts the columns separately,
> it's not like having an ordering by (a,b) - first by "a", then "b".

I imagine it would indeed be limited to an extremely small subset of
cases, and probably not worth the effort to implement in an initial
version.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: PATCH: Using BRIN indexes for sorted output

2023-07-07 Thread Tomas Vondra
Hi,

I finally had time to look at this patch again. There's a bit of bitrot,
so here's a rebased version (no other changes).

[more comments inline]

On 2/27/23 16:40, Matthias van de Meent wrote:
> Hi,
> 
> On Thu, 23 Feb 2023 at 17:44, Matthias van de Meent
>  wrote:
>>
>> I'll see to further reviewing 0004 and 0005 when I have additional time.
> 
> Some initial comments on 0004:
> 
>> +/*
>> + * brin_minmax_ranges
>> + *    Load the BRIN ranges and sort them.
>> + */
>> +Datum
>> +brin_minmax_ranges(PG_FUNCTION_ARGS)
>> +{
> 
> Like in 0001, this seems to focus on only single columns. Can't we put
> the scan and sort infrastructure in brin, and put the weight- and
> compare-operators in the opclasses? I.e.
> brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
> brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
> brin_minmax_compare(order, order) -> int? I'm thinking of something
> similar to GIST's distance operators, which would make implementing
> ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
> brin infrastructure.
> 
> Note: One big reason I don't really like the current
> brin_minmax_ranges (and the analyze code in 0001) is because it breaks
> the operatorclass-vs-index abstraction layer. Operator classes don't
> (shouldn't) know or care about which attribute number they have, nor
> what the index does with the data.
> Scanning the index is not something that I expect the operator class
> to do, I expect that the index code organizes the scan, and forwards
> the data to the relevant operator classes.
> Scanning the index N times for N attributes can be excused if there
> are good reasons, but I'd rather have that decision made in the
> index's core code rather than at the design level.
> 

I think you're right. It'd be more appropriate to have most of the core
scanning logic in brin.c, and then delegate only some minor decisions to
the opclass. Like, comparisons, extraction of min/max from ranges etc.

However, it's not quite clear to me what you mean by the weight- and
compare-operators? That is, what are

 - brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min
 - brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max
 - brin_minmax_compare(order, order) -> int

supposed to do? Or what does "PG_FUNCTION_ARGS=brintuple" mean?

In principle we just need a procedure that tells us min/max for a given
page range - I guess that's what the minorder/maxorder functions do? But
why would we need the compare one? We're comparing by the known data
type, so we can just delegate the comparison to that, no?

Also, the existence of these opclass procedures should be enough to
identify the opclasses which can support this.

>> +/*
>> + * XXX Does it make sense (is it possible) to have a sort by more than one
>> + * column, using a BRIN index?
>> + */
> 
> Yes, even if only one prefix column is included in the BRIN index
> (e.g. `company` in `ORDER BY company, date`), the tuplesort with table
> tuples can add additional sorting without first reading the whole
> table, potentially (likely) reducing the total resource usage of the
> query. That utilizes the same idea as incremental sorts, but with the
> caveat that the input sort order is approximately likely but not at
> all guaranteed. So, even if the range sort is on a single index
> column, we can still do the table's tuplesort on all ORDER BY
> attributes, as long as a prefix of ORDER BY columns are included in
> the BRIN index.
> 

That's not quite what I meant, though. When I mentioned sorting by more
than one column, I meant using a multi-column BRIN index on those
columns. Something like this:

   CREATE TABLE t (a int, b int);
   INSERT INTO t ...
   CREATE INDEX ON t USING brin (a,b);

   SELECT * FROM t ORDER BY a, b;

Now, what I think you described is using BRIN to sort by "a", and then
do incremental sort for "b". What I had in mind is whether it's possible
to use BRIN to sort by "b" too.

I was suspecting it might be made to work, but now that I think about it
again it probably can't - BRIN pretty much sorts the columns separately,
it's not like having an ordering by (a,b) - first by "a", then "b".

>> +/*
>> + * XXX We can be a bit smarter for LIMIT queries - once we
>> + * know we have more rows in the tuplesort than we need to
>> + * output, we can stop spilling - those rows are not going
>> + * to be needed. We can discard the tuplesort (no need to
>> + * respill) and stop spilling.
>> + */
> 
> Shouldn't that be "discard the tuplestore"?
> 

Yeah, definitely.

>> +#define BRIN_PROCNUM_RANGES 12    /* optional */
> 
> It would be useful to add documentation for this in this patch.
> 

Right, this should be documented in doc/src/sgml/brin.sgml.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2023-03-01 Thread Alvaro Herrera
On 2023-Feb-24, Tomas Vondra wrote:

> On 2/24/23 16:14, Alvaro Herrera wrote:

> > I think a formulation of this kind has the benefit that it works after
> > BlockNumber is enlarged to 64 bits, and doesn't have to be changed ever
> > again (assuming it is correct).
> 
> Did anyone even propose doing that? I suspect this is unlikely to be the
> only place that might be broken by that.

True about other places also needing fixes, and no I haven't see anyone;
but while 32 TB does seem very far away to us now, it might be not
*that* far away.  So I think doing it the other way is better.

> > ... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
> > this will neglect the last range in the table.
> 
> Why would it? Let's say BlockNumber is uint8, i.e. 255 max. And there
> are 10 pages per range. That's 25 "full" ranges, and the last range
> being just 5 pages. So we get into
> 
>    prevHeapBlk = 240
>    heapBlk = 250
> 
> and we read the last 5 pages. And then we update
> 
>    prevHeapBlk = 250
>    heapBlk = (250 + 10) % 255 = 5
> 
> and we don't do that loop. Or did I get this wrong, somehow?

I stand corrected.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: PATCH: Using BRIN indexes for sorted output

2023-02-27 Thread Matthias van de Meent
Hi,

On Thu, 23 Feb 2023 at 17:44, Matthias van de Meent
 wrote:
>
> I'll see to further reviewing 0004 and 0005 when I have additional time.

Some initial comments on 0004:

> +/*
> + * brin_minmax_ranges
> + *    Load the BRIN ranges and sort them.
> + */
> +Datum
> +brin_minmax_ranges(PG_FUNCTION_ARGS)
> +{

Like in 0001, this seems to focus on only single columns. Can't we put
the scan and sort infrastructure in brin, and put the weight- and
compare-operators in the opclasses? I.e.
brin_minmax_minorder(PG_FUNCTION_ARGS=brintuple) -> range.min and
brin_minmax_maxorder(PG_FUNCTION_ARGS=brintuple) -> range.max, and a
brin_minmax_compare(order, order) -> int? I'm thinking of something
similar to GIST's distance operators, which would make implementing
ordering by e.g. `pointcolumn <-> (1, 2)::point` implementable in the
brin infrastructure.

Note: One big reason I don't really like the current
brin_minmax_ranges (and the analyze code in 0001) is because it breaks
the operatorclass-vs-index abstraction layer. Operator classes don't
(shouldn't) know or care about which attribute number they have, nor
what the index does with the data.
Scanning the index is not something that I expect the operator class
to do, I expect that the index code organizes the scan, and forwards
the data to the relevant operator classes.
Scanning the index N times for N attributes can be excused if there
are good reasons, but I'd rather have that decision made in the
index's core code rather than at the design level.

> +/*
> + * XXX Does it make sense (is it possible) to have a sort by more than one
> + * column, using a BRIN index?
> + */

Yes, even if only one prefix column is included in the BRIN index
(e.g. `company` in `ORDER BY company, date`), the tuplesort with table
tuples can add additional sorting without first reading the whole
table, potentially (likely) reducing the total resource usage of the
query. That utilizes the same idea as incremental sorts, but with the
caveat that the input sort order is approximately likely but not at
all guaranteed. So, even if the range sort is on a single index
column, we can still do the table's tuplesort on all ORDER BY
attributes, as long as a prefix of ORDER BY columns are included in
the BRIN index.

> +/*
> + * XXX We can be a bit smarter for LIMIT queries - once we
> + * know we have more rows in the tuplesort than we need to
> + * output, we can stop spilling - those rows are not going
> + * to be needed. We can discard the tuplesort (no need to
> + * respill) and stop spilling.
> + */

Shouldn't that be "discard the tuplestore"?

> +#define BRIN_PROCNUM_RANGES 12    /* optional */

It would be useful to add documentation for this in this patch.


Kind regards,

Matthias van de Meent.




Re: PATCH: Using BRIN indexes for sorted output

2023-02-27 Thread Matthias van de Meent
On Fri, 24 Feb 2023, 20:14 Tomas Vondra,  wrote:
>
> On 2/24/23 19:03, Matthias van de Meent wrote:
> > On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
> >> Yeah, that sounds like a bug. Also a sign the tests should have some
> >> by-ref data types (presumably there are none, as that would certainly
> >> trip some asserts etc.).
> >
> > I'm not sure we currently trip asserts, as the data we store in the
> > memory context is very limited, making it unlikely we actually release
> > the memory region back to the OS.
> > I did get assertion failures by adding the attached patch, but I don't
> > think that's something we can do in the final release.
> >
>
> But we should randomize the memory if we ever do pfree(), and it's
> strange valgrind didn't complain when I ran tests with it.

Well, we don't clean up the decoding tuple immediately after our last
iteration, so the memory context (and the last tuple's attributes) are
still valid memory addresses. And, assuming that min/max values for
all brin ranges all have the same max-aligned length, the attribute
pointers are likely to point to the same offset within the decoding
tuple's memory context's memory segment, which would mean the dangling
pointers still point to valid memory - just not memory with the
contents they expected to be pointing to.

> >> Considering how tiny BRIN indexes are, this is likely orders of
> >> magnitude less I/O than we expend on sampling rows from the table. I
>> mean, with the default statistics target we read ~30000 pages (~240MB)
> >> or more just to sample the rows. Randomly, while the BRIN index is
> >> likely scanned mostly sequentially.
> >
> > Mostly agreed; except I think it's not too difficult to imagine a BRIN
> > index that is on that scale; with as an example the bloom filters that
> > easily take up several hundreds of bytes.
> >
> > With the default configuration of 128 pages_per_range,
> > n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
> > bloom filter size is 4.36 KiB - each indexed item on its own page. It
> > is still only 1% of the original table's size, but there are enough
> > tables that are larger than 24GB that this could be a significant
> > issue.
> >
>
> Right, it's certainly true BRIN indexes may be made fairly large (either
> by using something like bloom or by making the ranges much smaller). But
> those are exactly the indexes where building statistics for all columns
> at once is going to cause issues with memory usage etc.
>
> Note: Obviously, that depends on how much data per range we need to keep
> in memory. For bloom I doubt we'd actually want to keep all the filters,
> we'd probably calculate just some statistics (e.g. number of bits set),
> so maybe the memory consumption is not that bad.

Yes, I was thinking something along the same lines for bloom as well.
Something like 'average number of bits set' (or a histogram of the
number of bits set), and/or for each bit a count (or %) of how many
times it is set, etc.

> >> Maybe there are cases where this would be an issue, but I haven't seen
> >> one when working on this patch (and I did a lot of experiments). I'd
> >> like to see one before we start optimizing it ...
> >
> > I'm not only worried about optimizing it, I'm also worried that we're
> > putting this abstraction at the wrong level in a way that is difficult
> > to modify.
> >
>
> Yeah, that's certainly a valid concern. I admit I only did the minimum
> amount of work on this part, as I was focused on the sorting part.
>
> >> This also reminds me that the issues I actually saw (e.g. memory
> >> consumption) would be made worse by processing all columns at once,
> >> because then you need to keep more columns in memory.
> >
> > Yes, I think that can be a valid concern, but don't we already do similar
> > things in the current table statistics gathering?
> >
>
> Not really, I think. We sample a bunch of rows once, but then we build
> statistics on this sample for each attribute / expression independently.
> We could of course read the whole index into memory and then run the
> analysis, but I think tuples tend to be much smaller (thanks to TOAST
> etc.) and we only really scan a limited amount of them (30k).

Just to note, with default settings, sampling 30k index entries from
BRIN would cover ~29 GiB of a table. This isn't a lot, but it also
isn't exactly a small table. I think that it would be difficult to get
accurate avg_overlap statistics for some shapes of BRIN data...

> But if we're concerned about the I/O, the BRIN is likely fairly large,
> so maybe reading it into memory at once is not a great idea.

Agreed, we can't always expect that the interesting parts of the BRIN
index always fit in the available memory.

Kind regards,

Matthias van de Meent




Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Tomas Vondra



On 2/24/23 19:03, Matthias van de Meent wrote:
> On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
>  wrote:
>>
>> On 2/23/23 17:44, Matthias van de Meent wrote:
>>> On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
>>>  wrote:

 On 2/23/23 15:19, Matthias van de Meent wrote:
> Comments on 0001, mostly comments and patch design:
>>>
>>> One more comment:
>>>
>> + range->min_value = bval->bv_values[0];
>> + range->max_value = bval->bv_values[1];
>>>
>>> This produces dangling pointers for minmax indexes on by-ref types
>>> such as text and bytea, due to the memory context of the decoding
>>> tuple being reset every iteration. :-(
>>>
>>
>> Yeah, that sounds like a bug. Also a sign the tests should have some
>> by-ref data types (presumably there are none, as that would certainly
>> trip some asserts etc.).
> 
> I'm not sure we currently trip asserts, as the data we store in the
> memory context is very limited, making it unlikely we actually release
> the memory region back to the OS.
> I did get assertion failures by adding the attached patch, but I don't
> think that's something we can do in the final release.
> 

But we should randomize the memory if we ever do pfree(), and it's
strange valgrind didn't complain when I ran tests with it. But yeah,
I'll take a look and see if we can add some tests covering this.

>>> With the proposed patch, we do O(ncols_statsenabled) scans over the
>>> BRIN index. Each scan reads all ncol columns of all block ranges from
>>> disk, so in effect the data scan does on the order of
>>> O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
>>> all columns have statistics enabled.
>>>
>>
>> I don't think that's the number of I/O operations we'll do, because we
>> always read the whole BRIN tuple at once. So I believe it should rather
>> be something like
>>
>>   O(ncols_statsenabled * nranges)
>>
>> assuming nranges is the number of page ranges. But even that's likely a
>> significant overestimate because most of the tuples will be served from
>> shared buffers.
> 
> We store some data per index attribute, which makes the IO required
> for a single indexed range proportional to the number of index
> attributes.
> If we then scan the index a number of times proportional to the number
> of attributes, the cumulative IO load of statistics gathering for that
> index is quadratic on the number of attributes, not linear, right?
> 

Ah, OK. I was focusing on number of I/O operations while you're talking
about amount of I/O performed. You're right the amount of I/O is
quadratic, but I think what's more interesting is the comparison of the
alternative ANALYZE approaches. The current simple approach does a
multiple of the I/O amount, linear in the number of attributes.

Which is not great, ofc.


>> Considering how tiny BRIN indexes are, this is likely orders of
>> magnitude less I/O than we expend on sampling rows from the table. I
> mean, with the default statistics target we read ~30000 pages (~240MB)
>> or more just to sample the rows. Randomly, while the BRIN index is
>> likely scanned mostly sequentially.
> 
> Mostly agreed; except I think it's not too difficult to imagine a BRIN
> index that is on that scale; with as an example the bloom filters that
> easily take up several hundreds of bytes.
> 
> With the default configuration of 128 pages_per_range,
> n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
> bloom filter size is 4.36 KiB - each indexed item on its own page. It
> is still only 1% of the original table's size, but there are enough
> tables that are larger than 24GB that this could be a significant
> issue.
> 

Right, it's certainly true BRIN indexes may be made fairly large (either
by using something like bloom or by making the ranges much smaller). But
those are exactly the indexes where building statistics for all columns
at once is going to cause issues with memory usage etc.

Note: Obviously, that depends on how much data per range we need to keep
in memory. For bloom I doubt we'd actually want to keep all the filters,
we'd probably calculate just some statistics (e.g. number of bits set),
so maybe the memory consumption is not that bad.

In fact, for such indexes the memory consumption may already be an issue
even when analyzing the index column by column. My feeling is we'll need
to do something about that, e.g. by reading only a sample of the ranges,
or something like that. That might help with the I/O too, I guess.

> Note that most of my concerns are related to our current
> documentation's statement that there are no demerits to multi-column
> BRIN indexes:
> 
> """
> 11.3. Multicolumn Indexes
> 
> [...] The only reason to have multiple BRIN indexes instead of one
> multicolumn BRIN index on a single table is to have a different
> pages_per_range storage parameter.
> """
> 

True, we may need to clarify this in the documentation.

> Wide BRIN indexes add IO overhead for single-attribute scans when
> compared to single-attribute indexes, so having N single-attribute
> index scans to build the statistics on an N-attribute index is not great.

Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Matthias van de Meent
On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
 wrote:
>
> On 2/23/23 17:44, Matthias van de Meent wrote:
> > On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
> >  wrote:
> >>
> >> On 2/23/23 15:19, Matthias van de Meent wrote:
> >>> Comments on 0001, mostly comments and patch design:
> >
> > One more comment:
> >
>  + range->min_value = bval->bv_values[0];
>  + range->max_value = bval->bv_values[1];
> >
> > This produces dangling pointers for minmax indexes on by-ref types
> > such as text and bytea, due to the memory context of the decoding
> > tuple being reset every iteration. :-(
> >
>
> Yeah, that sounds like a bug. Also a sign the tests should have some
> by-ref data types (presumably there are none, as that would certainly
> trip some asserts etc.).

I'm not sure we currently trip asserts, as the data we store in the
memory context is very limited, making it unlikely we actually release
the memory region back to the OS.
I did get assertion failures by adding the attached patch, but I don't
think that's something we can do in the final release.

> > With the proposed patch, we do O(ncols_statsenabled) scans over the
> > BRIN index. Each scan reads all ncol columns of all block ranges from
> > disk, so in effect the data scan does on the order of
> > O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
> > all columns have statistics enabled.
> >
>
> I don't think that's the number of I/O operations we'll do, because we
> always read the whole BRIN tuple at once. So I believe it should rather
> be something like
>
>   O(ncols_statsenabled * nranges)
>
> assuming nranges is the number of page ranges. But even that's likely a
> significant overestimate because most of the tuples will be served from
> shared buffers.

We store some data per index attribute, which makes the IO required
for a single indexed range proportional to the number of index
attributes.
If we then scan the index a number of times proportional to the number
of attributes, the cumulative IO load of statistics gathering for that
index is quadratic on the number of attributes, not linear, right?

> Considering how tiny BRIN indexes are, this is likely orders of
> magnitude less I/O than we expend on sampling rows from the table. I
> mean, with the default statistics target we read ~30000 pages (~240MB)
> or more just to sample the rows. Randomly, while the BRIN index is
> likely scanned mostly sequentially.

Mostly agreed; except I think it's not too difficult to imagine a BRIN
index that is on that scale; with as an example the bloom filters that
easily take up several hundreds of bytes.

With the default configuration of 128 pages_per_range,
n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
bloom filter size is 4.36 KiB - each indexed item on its own page. It
is still only 1% of the original table's size, but there are enough
tables that are larger than 24GB that this could be a significant
issue.

Note that most of my concerns are related to our current
documentation's statement that there are no demerits to multi-column
BRIN indexes:

"""
11.3. Multicolumn Indexes

[...] The only reason to have multiple BRIN indexes instead of one
multicolumn BRIN index on a single table is to have a different
pages_per_range storage parameter.
"""

Wide BRIN indexes add IO overhead for single-attribute scans when
compared to single-attribute indexes, so having N single-attribute
index scans to build the statistics on an N-attribute index
is not great.

> Maybe there are cases where this would be an issue, but I haven't seen
> one when working on this patch (and I did a lot of experiments). I'd
> like to see one before we start optimizing it ...

I'm not only worried about optimizing it, I'm also worried that we're
putting this abstraction at the wrong level in a way that is difficult
to modify.

> This also reminds me that the issues I actually saw (e.g. memory
> consumption) would be made worse by processing all columns at once,
> because then you need to keep more columns in memory.

Yes, I think that can be a valid concern, but don't we already do similar
things in the current table statistics gathering?

Kind regards,

Matthias van de Meent




Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Matthias van de Meent
On Fri, 24 Feb 2023 at 17:04, Tomas Vondra
 wrote:
>
> On 2/24/23 16:14, Alvaro Herrera wrote:
> > ... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
> > this will neglect the last range in the table.
> >
>
> Why would it? Let's say BlockNumber is uint8, i.e. 255 max. And there
> are 10 pages per range. That's 25 "full" ranges, and the last range
> being just 5 pages. So we get into
>
>    prevHeapBlk = 240
>    heapBlk = 250
>
> and we read the last 5 pages. And then we update
>
>    prevHeapBlk = 250
>    heapBlk = (250 + 10) % 255 = 5
>
> and we don't do that loop. Or did I get this wrong, somehow?

The result is off-by-one due to (u)int8 overflows being mod-256, but
apart from that your result is accurate.

The condition only stops the loop when we wrap around or when we go
past the last block, but no earlier than that.


Kind regards,

Matthias van de Meent




Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Tomas Vondra



On 2/24/23 16:14, Alvaro Herrera wrote:
> On 2023-Feb-24, Tomas Vondra wrote:
> 
>> I guess the easiest fix would be to do the arithmetic in 64 bits. That'd
>> eliminate the overflow.
> 
> Yeah, that might be easy to set up.  We then don't have to worry about
> it until BlockNumber is enlarged to 64 bits ... but by that time surely
> we can just grow it again to a 128 bit loop variable.
> 
>> Alternatively, we could do something like
>>
>>   prevHeapBlk = 0;
>>   for (heapBlk = 0; (heapBlk < nblocks) && (prevHeapBlk <= heapBlk);
>        heapBlk += pagesPerRange)
>>   {
>> ...
>> prevHeapBlk = heapBlk;
>>   }
> 
> I think a formulation of this kind has the benefit that it works after
> BlockNumber is enlarged to 64 bits, and doesn't have to be changed ever
> again (assuming it is correct).
> 

Did anyone even propose doing that? I suspect this is unlikely to be the
only place that might be broken by that.

> ... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
> this will neglect the last range in the table.
> 

Why would it? Let's say BlockNumber is uint8, i.e. 255 max. And there
are 10 pages per range. That's 25 "full" ranges, and the last range
being just 5 pages. So we get into

   prevHeapBlk = 240
   heapBlk = 250

and we read the last 5 pages. And then we update

   prevHeapBlk = 250
   heapBlk = (250 + 10) % 255 = 5

and we don't do that loop. Or did I get this wrong, somehow?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Alvaro Herrera
On 2023-Feb-24, Tomas Vondra wrote:

> I guess the easiest fix would be to do the arithmetic in 64 bits. That'd
> eliminate the overflow.

Yeah, that might be easy to set up.  We then don't have to worry about
it until BlockNumber is enlarged to 64 bits ... but by that time surely
we can just grow it again to a 128 bit loop variable.

> Alternatively, we could do something like
> 
>   prevHeapBlk = 0;
>   for (heapBlk = 0; (heapBlk < nblocks) && (prevHeapBlk <= heapBlk);
>        heapBlk += pagesPerRange)
>   {
> ...
> prevHeapBlk = heapBlk;
>   }

I think a formulation of this kind has the benefit that it works after
BlockNumber is enlarged to 64 bits, and doesn't have to be changed ever
again (assuming it is correct).

... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
this will neglect the last range in the table.

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Tomas Vondra



On 2/24/23 09:39, Alvaro Herrera wrote:
> On 2023-Feb-23, Matthias van de Meent wrote:
> 
>>> + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
>>
>> I am not familiar with the frequency of max-sized relations, but this
>> would go into an infinite loop for pagesPerRange values >1 for
>> max-sized relations due to BlockNumber wraparound. I think there
>> should be some additional overflow checks here.
> 
> They are definitely not very common -- BlockNumber wraps around at 32 TB
> IIUC.  But yeah, I guess it is a possibility, and perhaps we should find
> a way to write these loops in a more robust manner.
> 

I guess the easiest fix would be to do the arithmetic in 64 bits. That'd
eliminate the overflow.

Alternatively, we could do something like

  prevHeapBlk = 0;
  for (heapBlk = 0; (heapBlk < nblocks) && (prevHeapBlk <= heapBlk);
   heapBlk += pagesPerRange)
  {
...
prevHeapBlk = heapBlk;
  }


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Alvaro Herrera
On 2023-Feb-23, Matthias van de Meent wrote:

> > + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
> 
> I am not familiar with the frequency of max-sized relations, but this
> would go into an infinite loop for pagesPerRange values >1 for
> max-sized relations due to BlockNumber wraparound. I think there
> should be some additional overflow checks here.

They are definitely not very common -- BlockNumber wraps around at 32 TB
IIUC.  But yeah, I guess it is a possibility, and perhaps we should find
a way to write these loops in a more robust manner.

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"(Andrew Morton)




Re: PATCH: Using BRIN indexes for sorted output

2023-02-23 Thread Tomas Vondra
On 2/23/23 17:44, Matthias van de Meent wrote:
> On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
>  wrote:
>>
>> On 2/23/23 15:19, Matthias van de Meent wrote:
>>> Comments on 0001, mostly comments and patch design:
> 
> One more comment:
> 
 + range->min_value = bval->bv_values[0];
 + range->max_value = bval->bv_values[1];
> 
> This produces dangling pointers for minmax indexes on by-ref types
> such as text and bytea, due to the memory context of the decoding
> tuple being reset every iteration. :-(
> 

Yeah, that sounds like a bug. Also a sign the tests should have some
by-ref data types (presumably there are none, as that would certainly
trip some asserts etc.).

 +range_values_cmp(const void *a, const void *b, void *arg)
>>>
>>> Can the arguments of these functions be modified into the types they
>>> are expected to receive? If not, could you add a comment on why that's
>>> not possible?
>>
>> The reason is that that's what qsort() expects. If you change that to
>> actual data types, you'll get compile-time warnings. I agree this may
>> need better comments, though.
> 
> Thanks in advance.
> 
 + * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
>>>
>>> Could you please expand on this? We do have GIST support for ranges, too.
>>>
>>
>> Expand in what way? This is meant to be AM-specific, so if GiST wants to
>> collect some additional stats, it's free to do so - perhaps some of the
>> ideas from the stats collected for BRIN would be applicable, but it's
>> also bound to the index structure.
> 
> I don't quite understand the flow of the comment, as I don't clearly
> see what the "BRIN for ranges" tries to refer to. In my view, that
> makes it a bad example which needs further explanation or rewriting,
> aka "expanding on".
> 

Ah, right. Yeah, the "BRIN for ranges" wording is a bit misleading. It
should really say only BRIN, but I was focused on the minmax use case,
so I mentioned the ranges.

 + * brin_minmax_stats
 + *Calculate custom statistics for a BRIN minmax index.
 + *
 + * At the moment this calculates:
 + *
 + *  - number of summarized/not-summarized and all/has nulls ranges
>>>
>>> I think statistics gathering of an index should be done at the AM
>>> level, not attribute level. The docs currently suggest that the user
>>> builds one BRIN index with 16 columns instead of 16 BRIN indexes with
>>> one column each, which would make the statistics gathering use 16x
>>> more IO if the scanned data cannot be reused.
>>>
>>
>> Why? The row sample is collected only once and used for building all the
>> index AM stats - it doesn't really matter if we analyze 16 single-column
>> indexes or 1 index with 16 columns. Yes, we'll need to scan more
>> indexes, but with 16 columns the summaries will be larger, so the
>> total amount of I/O will be almost the same I think.
>>
>> Or maybe I don't understand what I/O you're talking about?
> 
> With the proposed patch, we do O(ncols_statsenabled) scans over the
> BRIN index. Each scan reads all ncol columns of all block ranges from
> disk, so in effect the data scan does on the order of
> O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
> all columns have statistics enabled.
> 

I don't think that's the number of I/O operations we'll do, because we
always read the whole BRIN tuple at once. So I believe it should rather
be something like

  O(ncols_statsenabled * nranges)

assuming nranges is the number of page ranges. But even that's likely a
significant overestimate because most of the tuples will be served from
shared buffers.

Considering how tiny BRIN indexes are, this is likely orders of
magnitude less I/O than we expend on sampling rows from the table. I
mean, with the default statistics target we read ~30000 pages (~240MB)
or more just to sample the rows - randomly, while the BRIN index is
likely scanned mostly sequentially.

Maybe there are cases where this would be an issue, but I haven't seen
one when working on this patch (and I did a lot of experiments). I'd
like to see one before we start optimizing it ...

This also reminds me that the issues I actually saw (e.g. memory
consumption) would be made worse by processing all columns at once,
because then you need to keep more columns in memory.


>>> It is possible to build BRIN indexes on more than one column with more
>>> than one opclass family like `USING brin (id int8_minmax_ops, id
>>> int8_bloom_ops)`. This would mean various duplicate statistics fields,
>>> no?
>>> It seems to me that it's more useful to do the null- and n_summarized
>>> on the index level instead of duplicating that inside the opclass.
>>
>> I don't think it's worth it. The amount of data this would save is tiny,
>> and it'd only apply to cases where the index includes the same attribute
>> multiple times, and that's pretty rare I think. I don't think it's worth
>> the extra complexity.
> 
> Not necessarily, it was just an example of where we'd 

Re: PATCH: Using BRIN indexes for sorted output

2023-02-23 Thread Matthias van de Meent
On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
 wrote:
>
> On 2/23/23 15:19, Matthias van de Meent wrote:
>> Comments on 0001, mostly comments and patch design:

One more comment:

>>> + range->min_value = bval->bv_values[0];
>>> + range->max_value = bval->bv_values[1];

This produces dangling pointers for minmax indexes on by-ref types
such as text and bytea, due to the memory context of the decoding
tuple being reset every iteration. :-(

> >> +range_values_cmp(const void *a, const void *b, void *arg)
> >
> > Can the arguments of these functions be modified into the types they
> > are expected to receive? If not, could you add a comment on why that's
> > not possible?
>
> The reason is that that's what qsort() expects. If you change that to
> actual data types, you'll get compile-time warnings. I agree this may
> need better comments, though.

Thanks in advance.

> >> + * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
> >
> > Could you please expand on this? We do have GIST support for ranges, too.
> >
>
> Expand in what way? This is meant to be AM-specific, so if GiST wants to
> collect some additional stats, it's free to do so - perhaps some of the
> ideas from the stats collected for BRIN would be applicable, but it's
> also bound to the index structure.

I don't quite understand the flow of the comment, as I don't clearly
see what the "BRIN for ranges" tries to refer to. In my view, that
makes it a bad example which needs further explanation or rewriting,
aka "expanding on".

> >> + * brin_minmax_stats
> >> + *Calculate custom statistics for a BRIN minmax index.
> >> + *
> >> + * At the moment this calculates:
> >> + *
> >> + *  - number of summarized/not-summarized and all/has nulls ranges
> >
> > I think statistics gathering of an index should be done at the AM
> > level, not attribute level. The docs currently suggest that the user
> > builds one BRIN index with 16 columns instead of 16 BRIN indexes with
> > one column each, which would make the statistics gathering use 16x
> > more IO if the scanned data cannot be reused.
> >
>
> Why? The row sample is collected only once and used for building all the
> index AM stats - it doesn't really matter if we analyze 16 single-column
> indexes or 1 index with 16 columns. Yes, we'll need to scan more
> indexes, but with 16 columns the summaries will be larger, so the
> total amount of I/O will be almost the same I think.
>
> Or maybe I don't understand what I/O you're talking about?

With the proposed patch, we do O(ncols_statsenabled) scans over the
BRIN index. Each scan reads all ncol columns of all block ranges from
disk, so in effect the data scan does on the order of
O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
all columns have statistics enabled.

> > It is possible to build BRIN indexes on more than one column with more
> > than one opclass family like `USING brin (id int8_minmax_ops, id
> > int8_bloom_ops)`. This would mean various duplicate statistics fields,
> > no?
> > It seems to me that it's more useful to do the null- and n_summarized
> > on the index level instead of duplicating that inside the opclass.
>
> I don't think it's worth it. The amount of data this would save is tiny,
> and it'd only apply to cases where the index includes the same attribute
> multiple times, and that's pretty rare I think. I don't think it's worth
> the extra complexity.

Not necessarily, it was just an example of where we'd save IO.
Note that the current gathering method already retrieves all tuple
attribute data, so from a basic processing perspective we'd save some
time decoding as well.

> >
> > I'm planning on reviewing the other patches, and noticed that a lot of
> > the patches are marked WIP. Could you share a status on those, because
> > currently that status is unknown: Are these patches you don't plan on
> > including, or are these patches only (or mostly) included for
> > debugging?
> >
>
> I think the WIP label is a bit misleading, I used it mostly to mark
> patches that are not meant to be committed on their own. A quick overview:
>
> [...]

Thanks for the explanation, that's quite helpful. I'll see to further
reviewing 0004 and 0005 when I have additional time.

Kind regards,

Matthias van de Meent.




Re: PATCH: Using BRIN indexes for sorted output

2023-02-23 Thread Tomas Vondra
On 2/23/23 15:19, Matthias van de Meent wrote:
> On Sat, 18 Feb 2023 at 16:55, Tomas Vondra
>  wrote:
>>
>> cfbot complained there's one more place triggering a compiler warning on
>> 32-bit systems, so here's a version fixing that.
>>
>> I've also added a copy of the regression tests but using the indexam
>> stats added in 0001. This is just a copy of the already existing
>> regression tests, just with enable_indexam_stats=true - this should
>> catch some of the issues that went mostly undetected in the earlier
>> patch versions.
> 
> 
> Comments on 0001, mostly comments and patch design:
> 
>> +range_minval_cmp(const void *a, const void *b, void *arg)
>> [...]
>> +range_maxval_cmp(const void *a, const void *b, void *arg)
>> [...]
>> +range_values_cmp(const void *a, const void *b, void *arg)
> 
> Can the arguments of these functions be modified into the types they
> are expected to receive? If not, could you add a comment on why that's
> not possible?
> I don't think it's good practise to "just" use void* for arguments to
> cast them to more concrete types in the next lines without an
> immediate explanation.
> 

The reason is that that's what qsort() expects. If you change that to
actual data types, you'll get compile-time warnings. I agree this may
need better comments, though.

>> + * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
> 
> Could you please expand on this? We do have GIST support for ranges, too.
> 

Expand in what way? This is meant to be AM-specific, so if GiST wants to
collect some additional stats, it's free to do so - perhaps some of the
ideas from the stats collected for BRIN would be applicable, but it's
also bound to the index structure.

>> + * brin_minmax_stats
>> + *Calculate custom statistics for a BRIN minmax index.
>> + *
>> + * At the moment this calculates:
>> + *
>> + *  - number of summarized/not-summarized and all/has nulls ranges
> 
> I think statistics gathering of an index should be done at the AM
> level, not attribute level. The docs currently suggest that the user
> builds one BRIN index with 16 columns instead of 16 BRIN indexes with
> one column each, which would make the statistics gathering use 16x
> more IO if the scanned data cannot be reused.
> 

Why? The row sample is collected only once and used for building all the
index AM stats - it doesn't really matter if we analyze 16 single-column
indexes or 1 index with 16 columns. Yes, we'll need to scan more
indexes, but with 16 columns the summaries will be larger, so the
total amount of I/O will be almost the same I think.

Or maybe I don't understand what I/O you're talking about?

> It is possible to build BRIN indexes on more than one column with more
> than one opclass family like `USING brin (id int8_minmax_ops, id
> int8_bloom_ops)`. This would mean various duplicate statistics fields,
> no?
> It seems to me that it's more useful to do the null- and n_summarized
> on the index level instead of duplicating that inside the opclass.

I don't think it's worth it. The amount of data this would save is tiny,
and it'd only apply to cases where the index includes the same attribute
multiple times, and that's pretty rare I think. I don't think it's worth
the extra complexity.

> 
>> + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
> 
> I am not familiar with the frequency of max-sized relations, but this
> would go into an infinite loop for pagesPerRange values >1 for
> max-sized relations due to BlockNumber wraparound. I think there
> should be some additional overflow checks here.
> 

Good point, but that's a pre-existing issue. We do this same loop in a
number of places.

>> +/*
>> + * get_attstaindexam
>> + *
>> + *  Given the table and attribute number of a column, get the index AM
>> + *  statistics.  Return NULL if no data available.
>> + *
> 
> Shouldn't this be "given the index and attribute number" instead of
> "the table and attribute number"?
> I think we need to be compatible with indexes on expressions here, so
> that we don't fail to create (or use) statistics for an index `USING
> brin ( (int8range(my_min_column, my_max_column, '[]'))
> range_inclusion_ops)` when we implement stats for range_inclusion_ops.
> 
>> + * Alternative to brin_minmax_match_tuples_to_ranges2, leveraging ordering
> 
> Does this function still exist?
> 

Yes, but only in 0003 - it's a "brute-force" algorithm used to
cross-check the result of the optimized algorithm in 0001. You're right
it should not be referenced in the comment.

> 
> I'm planning on reviewing the other patches, and noticed that a lot of
> the patches are marked WIP. Could you share a status on those, because
> currently that status is unknown: Are these patches you don't plan on
> including, or are these patches only (or mostly) included for
> debugging?
> 

I think the WIP label is a bit misleading, I used it mostly to mark
patches that are not meant to be committed on their own. A quick overview:


Re: PATCH: Using BRIN indexes for sorted output

2023-02-23 Thread Matthias van de Meent
On Sat, 18 Feb 2023 at 16:55, Tomas Vondra
 wrote:
>
> cfbot complained there's one more place triggering a compiler warning on
> 32-bit systems, so here's a version fixing that.
>
> I've also added a copy of the regression tests but using the indexam
> stats added in 0001. This is just a copy of the already existing
> regression tests, just with enable_indexam_stats=true - this should
> catch some of the issues that went mostly undetected in the earlier
> patch versions.


Comments on 0001, mostly comments and patch design:

> +range_minval_cmp(const void *a, const void *b, void *arg)
> [...]
> +range_maxval_cmp(const void *a, const void *b, void *arg)
> [...]
> +range_values_cmp(const void *a, const void *b, void *arg)

Can the arguments of these functions be modified into the types they
are expected to receive? If not, could you add a comment on why that's
not possible?
I don't think it's good practise to "just" use void* for arguments to
cast them to more concrete types in the next lines without an
immediate explanation.

> + * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).

Could you please expand on this? We do have GIST support for ranges, too.

> + * brin_minmax_stats
> + *Calculate custom statistics for a BRIN minmax index.
> + *
> + * At the moment this calculates:
> + *
> + *  - number of summarized/not-summarized and all/has nulls ranges

I think statistics gathering of an index should be done at the AM
level, not attribute level. The docs currently suggest that the user
builds one BRIN index with 16 columns instead of 16 BRIN indexes with
one column each, which would make the statistics gathering use 16x
more IO if the scanned data cannot be reused.

It is possible to build BRIN indexes on more than one column with more
than one opclass family like `USING brin (id int8_minmax_ops, id
int8_bloom_ops)`. This would mean various duplicate statistics fields,
no?
It seems to me that it's more useful to do the null- and n_summarized
on the index level instead of duplicating that inside the opclass.

> + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)

I am not familiar with the frequency of max-sized relations, but this
would go into an infinite loop for pagesPerRange values >1 for
max-sized relations due to BlockNumber wraparound. I think there
should be some additional overflow checks here.

> +/*
> + * get_attstaindexam
> + *
> + *  Given the table and attribute number of a column, get the index AM
> + *  statistics.  Return NULL if no data available.
> + *

Shouldn't this be "given the index and attribute number" instead of
"the table and attribute number"?
I think we need to be compatible with indexes on expressions here, so
that we don't fail to create (or use) statistics for an index `USING
brin ( (int8range(my_min_column, my_max_column, '[]'))
range_inclusion_ops)` when we implement stats for range_inclusion_ops.

> + * Alternative to brin_minmax_match_tuples_to_ranges2, leveraging ordering

Does this function still exist?


I'm planning on reviewing the other patches, and noticed that a lot of
the patches are marked WIP. Could you share a status on those, because
currently that status is unknown: Are these patches you don't plan on
including, or are these patches only (or mostly) included for
debugging?


Kind regards,

Matthias van de Meent




Re: PATCH: Using BRIN indexes for sorted output

2023-02-18 Thread Tomas Vondra
On 2/18/23 19:51, Justin Pryzby wrote:
> Are (any of) these patches targeting v16 ?
> 

Probably not. Maybe if there's more feedback / scrutiny, but I'm not
sure one commitfest is enough to polish the patch (especially
considering I haven't done much on the costing yet).

> typos:
> ar we - we are?
> morestly - mostly
> interstect - intersect
> 
>> + * XXX We don't sort the bins, so just do binary sort. For large number of 
>> values
>> + * this might be an issue, for small number of values a linear search is 
>> fine.
> 
> "binary sort" is wrong?
> 
>> + * only half of there ranges, thus 1/2. This can be extended to randomly
> 
> half of *these* ranges ?
> 

Thanks, I'll fix those.

>> From 7b3307c27b35ece119feab4891f03749250e454b Mon Sep 17 00:00:00 2001
>> From: Tomas Vondra 
>> Date: Mon, 17 Oct 2022 18:39:28 +0200
>> Subject: [PATCH 01/11] Allow index AMs to build and use custom statistics
> 
> I think the idea can also apply to btree - currently, correlation is
> considered to be a property of a column, but not an index.  But that
> fails to distinguish between a freshly built index, and an index with
> out of order heap references, which can cause an index scan to be a lot
> more expensive.
> 
> I implemented per-index correlation stats way back when:
> https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com
> 
> See also:
> https://www.postgresql.org/message-id/14438.1512499...@sss.pgh.pa.us
> 
> With my old test case:
> 
> Index scan is 3x slower than bitmap scan, but index scan is costed as
> being cheaper:
> 
> postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
>  Index Scan using t_i_idx on t  (cost=0.43..21153.74 rows=130912 width=8) 
> (actual time=0.107..222.737 rows=128914 loops=1)
> 
> postgres=# SET enable_indexscan =no;
> postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
>  Bitmap Heap Scan on t  (cost=2834.28..26895.96 rows=130912 width=8) (actual 
> time=16.830..69.860 rows=128914 loops=1)
> 
> If it's clustered, then the index scan is almost twice as fast, and the
> costs are more consistent with the associated time.  The planner assumes
> that the indexes are freshly built...
> 
> postgres=# CLUSTER t USING t_i_idx ;
> postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
>  Index Scan using t_i_idx on t  (cost=0.43..20121.74 rows=130912 width=8) 
> (actual time=0.084..117.549 rows=128914 loops=1)
> 

Yeah, the concept of indexam statistics certainly applies to other index
types, and for btree we might collect information about correlation etc.
I haven't looked at the 2017 patch, but it seems reasonable.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2023-02-18 Thread Justin Pryzby
Are (any of) these patches targeting v16 ?

typos:
ar we - we are?
morestly - mostly
interstect - intersect

> + * XXX We don't sort the bins, so just do binary sort. For large number of 
> values
> + * this might be an issue, for small number of values a linear search is 
> fine.

"binary sort" is wrong?

> + * only half of there ranges, thus 1/2. This can be extended to randomly

half of *these* ranges ?

> From 7b3307c27b35ece119feab4891f03749250e454b Mon Sep 17 00:00:00 2001
> From: Tomas Vondra 
> Date: Mon, 17 Oct 2022 18:39:28 +0200
> Subject: [PATCH 01/11] Allow index AMs to build and use custom statistics

I think the idea can also apply to btree - currently, correlation is
considered to be a property of a column, but not an index.  But that
fails to distinguish between a freshly built index, and an index with
out of order heap references, which can cause an index scan to be a lot
more expensive.

I implemented per-index correlation stats way back when:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com

See also:
https://www.postgresql.org/message-id/14438.1512499...@sss.pgh.pa.us

With my old test case:

Index scan is 3x slower than bitmap scan, but index scan is costed as
being cheaper:

postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
 Index Scan using t_i_idx on t  (cost=0.43..21153.74 rows=130912 width=8) 
(actual time=0.107..222.737 rows=128914 loops=1)

postgres=# SET enable_indexscan =no;
postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
 Bitmap Heap Scan on t  (cost=2834.28..26895.96 rows=130912 width=8) (actual 
time=16.830..69.860 rows=128914 loops=1)

If it's clustered, then the index scan is almost twice as fast, and the
costs are more consistent with the associated time.  The planner assumes
that the indexes are freshly built...

postgres=# CLUSTER t USING t_i_idx ;
postgres=# explain analyze SELECT * FROM t WHERE i>11 AND i<55;
 Index Scan using t_i_idx on t  (cost=0.43..20121.74 rows=130912 width=8) 
(actual time=0.084..117.549 rows=128914 loops=1)

-- 
Justin




Re: PATCH: Using BRIN indexes for sorted output

2023-02-16 Thread Justin Pryzby
On Thu, Feb 16, 2023 at 03:07:59PM +0100, Tomas Vondra wrote:
> Rebased version of the patches, fixing only minor conflicts.

Per cfbot, the patch fails on 32 bit builds.
+ERROR:  count mismatch: 0 != 1000

And causes warnings in mingw cross-compile.

On Sun, Oct 23, 2022 at 11:32:37PM -0500, Justin Pryzby wrote:
> I think new GUCs should be enabled during patch development.
> Maybe in a separate 0002 patch "for CI only not for commit".
> That way "make check" at least has a chance to hit that new code
> paths.
> 
> Also, note that indxpath.c had the var initialized to true.

In your patch, the amstats guc is still being set to false during
startup by the guc machinery.  And the tests crash everywhere if it's
set to on:

TRAP: failed Assert("(nmatches_unique >= 1) && (nmatches_unique <= 
unique[nvalues-1])"), File: "../src/backend/access/brin/brin_minmax.c", Line: 
644, PID: 25519

>  . Some typos in your other patches: "heuristics heuristics".  ste.
> lest (least).

These are still present.

-- 
Justin




Re: PATCH: Using BRIN indexes for sorted output

2022-12-06 Thread Andres Freund
Hi,

On 2022-11-17 00:52:35 +0100, Tomas Vondra wrote:
> Well, yeah. That's pretty much exactly what the last version of this
> patch (from October 23) does.

That version unfortunately doesn't build successfully:
https://cirrus-ci.com/task/5108789846736896

[03:02:48.641] Duplicate OIDs detected:
[03:02:48.641] 9979
[03:02:48.641] 9980
[03:02:48.641] found 2 duplicate OID(s) in catalog data

Greetings,

Andres Freund




Re: PATCH: Using BRIN indexes for sorted output

2022-11-16 Thread Tomas Vondra
On 11/16/22 22:52, Greg Stark wrote:
> Fwiw tuplesort does do something like what you want for the top-k
> case. At least it used to last I looked -- not sure if it went out
> with the tapesort ...
>
> For top-k it inserts new tuples into the heap data structure and then
> pops the top element out of the heap. That keeps a fixed number of
> elements in the heap. It's always inserting and removing at the same
> time. I don't think it would be very hard to add a tuplesort interface
> to access that behaviour.
> 


Bounded sorts are still there, implemented using a heap (which is what
you're talking about, I think). I actually looked at it some time ago,
and it didn't look like a particularly good match for the general case
(without an explicit LIMIT). Bounded sorts require specifying the number
of tuples up front, and then discard the remaining tuples. But you don't
know how many tuples you'll actually find until the next minval - you
have to keep them all.

Maybe we could feed the tuples into a (sorted) heap incrementally, and
consume tuples until the next minval value. I'm not against exploring
that idea, but it certainly requires more work than just slapping some
interface to existing code.

> For something like BRIN you would sort the ranges by minvalue then
> insert all the tuples for each range. Before inserting tuples for a
> new range you would first pop out all the tuples that are < the
> minvalue for the new range.
> 

Well, yeah. That's pretty much exactly what the last version of this
patch (from October 23) does.

> I'm not sure how you handle degenerate BRIN indexes that behave
> terribly. Like, if many BRIN ranges covered the entire key range.
> Perhaps there would be a clever way to spill the overflow and switch
> to quicksort for the spilled tuples without wasting lots of work
> already done and without being too inefficient.

In two ways:

1) Don't have such a BRIN index - if it has many degraded ranges, it's
bound to perform poorly even for WHERE conditions. We've lived with this
until now, I don't think this makes the issue any worse.

2) Improving statistics for BRIN indexes - until now the BRIN costing is
very crude, we have almost no information about how wide the ranges are,
how much they overlap, etc. The 0001 part (discussed in a thread [1])
aims to provide much better statistics. Yes, the costing still doesn't
use that information very much.


regards


[1] https://commitfest.postgresql.org/40/3952/

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2022-11-16 Thread Greg Stark
Fwiw tuplesort does do something like what you want for the top-k
case. At least it used to last I looked -- not sure if it went out
with the tapesort ...

For top-k it inserts new tuples into the heap data structure and then
pops the top element out of the heap. That keeps a fixed number of
elements in the heap. It's always inserting and removing at the same
time. I don't think it would be very hard to add a tuplesort interface
to access that behaviour.

For something like BRIN you would sort the ranges by minvalue then
insert all the tuples for each range. Before inserting tuples for a
new range you would first pop out all the tuples that are < the
minvalue for the new range.

I'm not sure how you handle degenerate BRIN indexes that behave
terribly. Like, if many BRIN ranges covered the entire key range.
Perhaps there would be a clever way to spill the overflow and switch
to quicksort for the spilled tuples without wasting lots of work
already done and without being too inefficient.




Re: PATCH: Using BRIN indexes for sorted output

2022-10-24 Thread Tomas Vondra



On 10/24/22 06:32, Justin Pryzby wrote:
> On Sat, Oct 15, 2022 at 02:33:50PM +0200, Tomas Vondra wrote:
>> Of course, if there are e.g. BTREE indexes this is going to be slower,
>> but people are unlikely to have both index types on the same column.
> 
> On Sun, Oct 16, 2022 at 05:48:31PM +0200, Tomas Vondra wrote:
>> I don't think it's all that unfair. How likely is it to have both a BRIN
>> and btree index on the same column? And even if you do have such indexes
> 
> Note that we (at my work) use unique, btree indexes on multiple columns
> for INSERT ON CONFLICT into the most-recent tables: UNIQUE(a,b,c,...),
> plus a separate set of indexes on all tables, used for searching:
> BRIN(a) and BTREE(b).  I'd hope that the costing is accurate enough to
> prefer the btree index for searching the most-recent table, if that's
> what's faster (for example, if columns b and c are specified).
> 

Well, the costing is very crude at the moment - it's pretty much just a
copy of the existing BRIN costing. And the cost is
likely going to increase, because brinsort needs to do regular BRIN
bitmap scan (more or less) and then also a sort (which is an extra cost,
of course). So if it works now, I don't see why brinsort would break it.
Moreover, if you don't have ORDER BY in the query, I don't see why we
would create a brinsort at all.

But if you could test this once the costing gets improved, that'd be
very valuable.

>> +/* There must not be any TID scan in progress yet. */
>> +Assert(node->ss.ss_currentScanDesc == NULL);
>> +
>> +/* Initialize the TID range scan, for the provided block range. */
>> +if (node->ss.ss_currentScanDesc == NULL)
>> +{
> 
> Why is this conditional on the condition that was just Assert()ed ?
> 

Yeah, that's a mistake, due to how the code evolved.

>>  
>> +void
>> +cost_brinsort(BrinSortPath *path, PlannerInfo *root, double loop_count,
>> +   bool partial_path)
> 
> It'd be nice to refactor existing code to avoid this part being so
> duplicative.
> 
>> + * In some situations (particularly with OR'd index conditions) we may
>> + * have scan_clauses that are not equal to, but are logically implied 
>> by,
>> + * the index quals; so we also try a predicate_implied_by() check to see
> 
> Isn't that somewhat expensive ?
> 
> If that's known, then it'd be good to say that in the documentation.
> 

Some of this is probably a residue from create_indexscan_path and may
not be needed for this new node.

>> +{
>> +{"enable_brinsort", PGC_USERSET, QUERY_TUNING_METHOD,
>> +gettext_noop("Enables the planner's use of BRIN sort 
>> plans."),
>> +NULL,
>> +GUC_EXPLAIN
>> +},
>> +&enable_brinsort,
>> +false,
> 
> I think new GUCs should be enabled during patch development.
> Maybe in a separate 0002 patch "for CI only not for commit".
> That way "make check" at least has a chance to hit that new code paths.
> 
> Also, note that indxpath.c had the var initialized to true.
> 

Good point.

>> +attno = (i + 1);
>> +   nranges = (nblocks / pagesPerRange);
>> +   node->bs_phase = (nullsFirst) ? 
>> BRINSORT_LOAD_NULLS : BRINSORT_LOAD_RANGE;
> 
> I'm curious why you have parenthesis these places ?
> 

Not sure, it seemed more readable when writing the code I guess.

>> +#ifndef NODEBrinSort_H
>> +#define NODEBrinSort_H
> 
> NODEBRIN_SORT would be more consistent with NODEINCREMENTALSORT.
> But I'd prefer NODE_* - otherwise it looks like NO DEBRIN.
> 

Yeah, stupid search/replace on the indescan code, which was used as a
starting point.

> This needed a bunch of work needed to pass any of the regression tests -
> even with the feature set to off.
> 
>  . meson.build needs the same change as the corresponding ./Makefile.
>  . guc missing from postgresql.conf.sample
>  . brin_validate.c is missing support for the opr function.
>I gather you're planning on changing this part (?) but this allows to
>pass tests for now.
>  . mingw is warning about OidIsValid(pointer) in nodeBrinSort.c.
>https://cirrus-ci.com/task/5771227447951360?logs=mingw_cross_warning#L969
>  . Uninitialized catalog attribute.
>  . Some typos in your other patches: "heuristics heuristics".  ste.
>lest (least).
> 

Thanks, I'll get this fixed. I've posted the patch as a PoC to showcase
it and gather some feedback, I should have mentioned it's incomplete in
these ways.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2022-10-18 Thread Tomas Vondra
On 10/17/22 16:00, Matthias van de Meent wrote:
> On Mon, 17 Oct 2022 at 05:43, Tomas Vondra
>  wrote:
>>
>> On 10/16/22 22:17, Matthias van de Meent wrote:
>>> On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
>>>  wrote:
 Try to formulate the whole algorithm. Maybe I'm missing something.

 The current algorithm is something like this:

 1. request info about ranges from the BRIN opclass
 2. sort them by maxval and minval
>>>
>>> Why sort on maxval and minval? That seems wasteful for effectively all
>>> sorts, where range sort on minval should suffice: If you find a range
>>> that starts at 100 in a list of ranges sorted at minval, you've
>>> processed all values <100. You can't make a similar comparison when
>>> that range is sorted on maxvals.
>>
>> Because that allows us to identify overlapping ranges quickly.
>>
>> Imagine you have the ranges sorted by maxval, which allows you to add
>> tuples in small increments. But how do you know there's not a range
>> (possibly with arbitrarily high maxval), that however overlaps with the
>> range we're currently processing?
> 
> Why do we need to identify overlapping ranges specifically? If you
> sort by minval, it becomes obvious that any subsequent range cannot
> contain values < the minval of the next range in the list, allowing
> you to emit any values less than the next, unprocessed, minmax range's
> minval.
> 

D'oh! I think you're right, it should be possible to do this with only
sort by minval. And it might actually be better way to do that.

I think I chose the "maxval" ordering because it seemed reasonable.
Looking at the current range and using the maxval as the threshold
seemed reasonable. But it leads to a bunch of complexity with the
intersecting ranges, and I never reconsidered this choice. Silly me.

 3. NULLS FIRST: read all ranges that might have NULLs => output
 4. read the next range (by maxval) into tuplesort
(if no more ranges, go to (9))
 5. load all tuples from "spill" tuplestore, compare to maxval
>>>
>>> Instead of this, shouldn't an update to tuplesort that allows for
>>> restarting the sort be better than this? Moving tuples that we've
>>> accepted into BRINsort state but not yet returned around seems like a
>>> waste of cycles, and I can't think of a reason why it can't work.
>>>
>>
>> I don't understand what you mean by "update to tuplesort". Can you
>> elaborate?
> 
> Tuplesort currently only allows the following workflow: you load
> tuples, then call finalize, then extract tuples. There is currently no
> way to add tuples once you've started extracting them.
> 
> For my design to work efficiently or without hacking into the
> internals of tuplesort, we'd need a way to restart or 'un-finalize'
> the tuplesort so that it returns to the 'load tuples' phase. Because
> all data of the previous iteration is already sorted, adding more data
> shouldn't be too expensive.
> 

Not sure. I still think it's better to limit the amount of data we have
in the tuplesort. Even if the tuplesort can efficiently skip the already
sorted part, it'll still occupy disk space, possibly even force the data
to disk etc. (We'll still have to write that into a tuplestore, but that
should be relatively small and short-lived/recycled).

FWIW I wonder if the assumption that tuplesort can quickly skip already
sorted data holds e.g. for tuplesorts much larger than work_mem, but I
haven't checked that.

I'd also like to include some more info in the explain, like how many
times we did a sort, and what was the largest amount of data we sorted.
Although, maybe that could be tracked by tracking the tuplesort size of
the last sort.

Considering the tuplesort does not currently support this, I'll probably
stick to the existing approach with separate tuplestore. There's enough
complexity in the patch already, I think. The only thing we'll need with
the minval ordering is the ability to "peek ahead" to the next minval
(which is going to be the threshold used to route values either to
tuplesort or tuplestore).

>>  The point of spilling them into a tuplestore is to make the sort cheaper
>> by not sorting tuples that can't possibly be produced, because the value
>> exceeds the current maxval. Consider ranges sorted by maxval
>> [...]
>>
>> Or maybe I just don't understand what you mean.
> 
> If we sort the ranges by minval like this:
> 
> 1. [0,1000]
> 2. [0,999]
> 3. [50,998]
> 4. [100,997]
> 5. [100,996]
> 6. [150,995]
> 
> Then we can load and sort the values for range 1 and 2, and emit all
> values up to (not including) 50 - the minval of the next,
> not-yet-loaded range in the ordered list of ranges. Then add the
> values from range 3 to the set of tuples we have yet to output; sort;
> and then emit values up to 100 (range 4's minval), etc. This reduces
> the amount of tuples in the tuplesort to the minimum amount needed to
> output any specific value.
> 
> If the ranges are sorted and loaded by maxval, like your algorithm expects:

Re: PATCH: Using BRIN indexes for sorted output

2022-10-17 Thread Matthias van de Meent
On Mon, 17 Oct 2022 at 05:43, Tomas Vondra
 wrote:
>
> On 10/16/22 22:17, Matthias van de Meent wrote:
> > On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
> >  wrote:
> >> Try to formulate the whole algorithm. Maybe I'm missing something.
> >>
> >> The current algorithm is something like this:
> >>
> >> 1. request info about ranges from the BRIN opclass
> >> 2. sort them by maxval and minval
> >
> > Why sort on maxval and minval? That seems wasteful for effectively all
> > sorts, where range sort on minval should suffice: If you find a range
> > that starts at 100 in a list of ranges sorted at minval, you've
> > processed all values <100. You can't make a similar comparison when
> > that range is sorted on maxvals.
>
> Because that allows us to identify overlapping ranges quickly.
>
> Imagine you have the ranges sorted by maxval, which allows you to add
> tuples in small increments. But how do you know there's not a range
> (possibly with arbitrarily high maxval), that however overlaps with the
> range we're currently processing?

Why do we need to identify overlapping ranges specifically? If you
sort by minval, it becomes obvious that any subsequent range cannot
contain values < the minval of the next range in the list, allowing
you to emit any values less than the next, unprocessed, minmax range's
minval.

> >> 3. NULLS FIRST: read all ranges that might have NULLs => output
> >> 4. read the next range (by maxval) into tuplesort
> >>(if no more ranges, go to (9))
> >> 5. load all tuples from "spill" tuplestore, compare to maxval
> >
> > Instead of this, shouldn't an update to tuplesort that allows for
> > restarting the sort be better than this? Moving tuples that we've
> > accepted into BRINsort state but not yet returned around seems like a
> > waste of cycles, and I can't think of a reason why it can't work.
> >
>
> I don't understand what you mean by "update to tuplesort". Can you
> elaborate?

Tuplesort currently only allows the following workflow: you load
tuples, then call finalize, then extract tuples. There is currently no
way to add tuples once you've started extracting them.

For my design to work efficiently or without hacking into the
internals of tuplesort, we'd need a way to restart or 'un-finalize'
the tuplesort so that it returns to the 'load tuples' phase. Because
all data of the previous iteration is already sorted, adding more data
shouldn't be too expensive.

>  The point of spilling them into a tuplestore is to make the sort cheaper
> by not sorting tuples that can't possibly be produced, because the value
> exceeds the current maxval. Consider ranges sorted by maxval
> [...]
>
> Or maybe I just don't understand what you mean.

If we sort the ranges by minval like this:

1. [0,1000]
2. [0,999]
3. [50,998]
4. [100,997]
5. [100,996]
6. [150,995]

Then we can load and sort the values for range 1 and 2, and emit all
values up to (not including) 50 - the minval of the next,
not-yet-loaded range in the ordered list of ranges. Then add the
values from range 3 to the set of tuples we have yet to output; sort;
and then emit values up to 100 (range 4's minval), etc. This reduces
the amount of tuples in the tuplesort to the minimum amount needed to
output any specific value.
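
The minval-only scheme described above can be sketched briefly. This is a
toy Python model, not the patch's C code: ranges are (minval, maxval)
pairs, `tuples_in_range` is an invented stand-in for fetching a range's
heap tuples, and a plain list plays the role of the tuplesort.

```python
def brin_sort_minval(ranges, tuples_in_range):
    """Emit all values in sorted order using only a minval ordering:
    after loading range i, anything below range i+1's minval is final."""
    ranges = sorted(ranges)              # order ranges by minval
    loaded = []                          # stand-in for the tuplesort
    out = []
    for i, rng in enumerate(ranges):
        loaded.extend(tuples_in_range[rng])
        # Values below the next (unloaded) range's minval cannot change.
        threshold = ranges[i + 1][0] if i + 1 < len(ranges) else float("inf")
        loaded.sort()
        while loaded and loaded[0] < threshold:
            out.append(loaded.pop(0))
    out.extend(loaded)                   # flush any remainder
    return out
```

With the six example ranges above, each sort round would hold only the
tuples of ranges whose minval has been passed but whose values might
still interleave with later ranges.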

If the ranges are sorted and loaded by maxval, like your algorithm expects:

1. [150,995]
2. [100,996]
3. [100,997]
4. [50,998]
5. [0,999]
6. [0,1000]

We need to load all ranges into the sort before it could start
emitting any tuples, as all ranges overlap with the first range.

> > [algo]
>
> I don't think this works, because we may get a range (Rs') with very
> high maxval (thus read very late from Rs), but with very low minval.
> AFAICS max_sorted must never go back, and this breaks it.

max_sorted cannot go back, because it is the min value of the next
range in the list of ranges sorted by min value; see also above.

There is a small issue in my algorithm where I use <= for yielding
values where it should be <, where initialization of max_value to NULL
is then incorrect, but apart from that I don't think there are any
issues with the base algorithm.

> > The maximum cost of this tuplesort would be the cost of sorting a
> > seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
> > extra compare per tuple and range that are needed to determine whether
> > the range or tuple should be extracted from the tuplesort. The minimum
> > cost would be the cost of sorting all BRIN ranges, plus sorting all
> > tuples in one of the index's ranges.
> >
>
> I'm not a tuplesort expert, but my assumption is that it's better to sort
> smaller amounts of rows - which is why the patch sorts only the rows it
> knows it can actually output.

I see that the two main differences between our designs are in
answering these questions:

- How do we select table ranges for processing?
- How do we handle tuples that we know we can't output yet?

For the first, I think the differences are explained above. The main
drawback of your selection 

Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Tomas Vondra
On 10/16/22 22:17, Matthias van de Meent wrote:
> First of all, it's really great to see that this is being worked on.
> 
> On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
>  wrote:
>> Try to formulate the whole algorithm. Maybe I'm missing something.
>>
>> The current algorithm is something like this:
>>
>> 1. request info about ranges from the BRIN opclass
>> 2. sort them by maxval and minval
> 
> Why sort on maxval and minval? That seems wasteful for effectively all
> sorts, where range sort on minval should suffice: If you find a range
> that starts at 100 in a list of ranges sorted at minval, you've
> processed all values <100. You can't make a similar comparison when
> that range is sorted on maxvals.
> 

Because that allows us to identify overlapping ranges quickly.

Imagine you have the ranges sorted by maxval, which allows you to add
tuples in small increments. But how do you know there's not a range
(possibly with arbitrarily high maxval), that however overlaps with the
range we're currently processing?

Consider these ranges sorted by maxval

  range #1  [0,100]
  range #2  [101,200]
  range #3  [150,250]
  ...
  range #100 [190,10]

processing the range #1 is simple, because there are no overlapping
ranges. When processing range #2, that's not the case - the following
range #3 is overlapping too, so we need to load the tuples too. But
there may be other ranges (in arbitrary distance) also overlapping.

So we either have to cross-check everything with everything - that's
O(N^2) so not great, or we can invent a way to eliminate ranges that
can't overlap.

The patch does that by having two arrays - one sorted by maxval, one
sorted by minval. After proceeding to the next range by maxval (using
the first array), the minval-sorted array is used to detect overlaps.
This can be done quickly, because we only care for new matches since the
previous range, so we can remember the index to the array and start from
it. And we can stop once the minval exceeds the maxval for the range in
the first step. Because we'll only sort tuples up to that point.
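
The two sorted arrays can be modeled as follows. This is an illustrative
Python sketch, not the patch's data structures; `overlapping_ranges` is an
invented name and ranges are (minval, maxval) pairs.

```python
def overlapping_ranges(ranges):
    """For each range taken in maxval order, report the ranges that
    intersect it. The minval-sorted array is walked with a remembered
    position, so each range is examined once rather than O(N^2) times."""
    by_maxval = sorted(ranges, key=lambda r: r[1])
    by_minval = sorted(ranges, key=lambda r: r[0])
    next_min = 0                         # remembered index into by_minval
    candidates = []                      # ranges whose minval we passed
    for rng in by_maxval:
        # Pull in only the newly-qualifying ranges since the last step.
        while next_min < len(by_minval) and by_minval[next_min][0] <= rng[1]:
            candidates.append(by_minval[next_min])
            next_min += 1
        # A candidate truly intersects only if it also reaches up to rng.
        yield rng, [c for c in candidates if c is not rng and c[1] >= rng[0]]
```

For simplicity the candidate list never shrinks here; the real code would
also drop ranges that have been fully processed.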

>> 3. NULLS FIRST: read all ranges that might have NULLs => output
>> 4. read the next range (by maxval) into tuplesort
>>(if no more ranges, go to (9))
>> 5. load all tuples from "spill" tuplestore, compare to maxval
> 
> Instead of this, shouldn't an update to tuplesort that allows for
> restarting the sort be better than this? Moving tuples that we've
> accepted into BRINsort state but not yet returned around seems like a
> waste of cycles, and I can't think of a reason why it can't work.
> 

I don't understand what you mean by "update to tuplesort". Can you
elaborate?

The point of spilling them into a tuplestore is to make the sort cheaper
by not sorting tuples that can't possibly be produced, because the value
exceeds the current maxval. Consider ranges sorted by maxval

   [0,1000]
   [500,1500]
   [1001,2000]
   ...

We load tuples from [0,1000] and use 1000 as "threshold" up to which we
can sort. But we have to load tuples from the overlapping range(s) too,
e.g. from [500,1500] except that all tuples with values > 1000 can't be
produced (because there might be yet more ranges intersecting with that
part).

So why sort these tuples at all? Imagine imperfectly correlated table
where each range overlaps with ~10 other ranges. If we feed all of that
into the tuplestore, we're now sorting 11x the amount of data.

Or maybe I just don't understand what you mean.
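
To make the spill rationale concrete, here is a toy maxval-driven round
loop. Python with invented names; the real node routes tuples into a
tuplesort or a tuplestore rather than Python lists.

```python
def brin_sort_maxval(ranges, tuples_in_range):
    """Process ranges in maxval order; per round, sort only the values
    <= that round's maxval. Larger values go to the spill store and are
    reconsidered later, so they are never sorted prematurely."""
    by_maxval = sorted(ranges, key=lambda r: r[1])
    by_minval = sorted(ranges, key=lambda r: r[0])
    next_min = 0
    spill, out = [], []
    for rng in by_maxval:
        threshold = rng[1]
        pending = list(spill)
        # Load every not-yet-seen range that can intersect this round.
        while next_min < len(by_minval) and by_minval[next_min][0] <= threshold:
            pending += tuples_in_range[by_minval[next_min]]
            next_min += 1
        spill = [v for v in pending if v > threshold]     # spill store
        out += sorted(v for v in pending if v <= threshold)
    out += sorted(spill)
    return out
```

With the [0,1000], [500,1500], ... example above, the 1400 from the
second range is spilled in the first round and only sorted once the
threshold has moved past it.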


>> 6. load all tuples from no-summarized ranges (first range only)
>>(into tuplesort/tuplestore, depending on maxval comparison)
>> 7. load all intersecting ranges (with minval < current maxval)
>>(into tuplesort/tuplestore, depending on maxval comparison)
>> 8. sort the tuplesort, output all tuples, then back to (4)
>> 9. NULLS LAST: read all ranges that might have NULLs => output
>> 10. done
>>
>> For "DESC" ordering the process is almost the same, except that we swap
>> minval/maxval in most places.
> 
> When I was thinking about this feature at the PgCon unconference, I
> was thinking about it more along the lines of the following system
> (for ORDER BY col ASC NULLS FIRST):
> 
> 1. prepare tuplesort Rs (for Rangesort) for BRIN tuples, ordered by
> [has_nulls, min ASC]
> 2. scan info about ranges from BRIN, store them in Rs.
> 3. Finalize the sorting of Rs.
> 4. prepare tuplesort Ts (for Tuplesort) for sorting on the specified
> column ordering.
> 5. load all tuples from no-summarized ranges into Ts'
> 6. while Rs has a block range Rs' with has_nulls:
>- Remove Rs' from Rs
>- store the tuples of Rs' range in Ts.
> We now have all tuples with NULL in our sorted set; max_sorted = (NULL)
> 7. Finalize the Ts sorted set.
> 8. While the next tuple Ts' in the Ts tuplesort <= max_sorted
>   - Remove Ts' from Ts
>   - Yield Ts'
> Now, all tuples up to and including max_sorted are yielded.
> 9. If there are no more ranges in Rs:
>   - Yield all remaining tuples from Ts, then 

Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Matthias van de Meent
First of all, it's really great to see that this is being worked on.

On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
 wrote:
> Try to formulate the whole algorithm. Maybe I'm missing something.
>
> The current algorithm is something like this:
>
> 1. request info about ranges from the BRIN opclass
> 2. sort them by maxval and minval

Why sort on maxval and minval? That seems wasteful for effectively all
sorts, where range sort on minval should suffice: If you find a range
that starts at 100 in a list of ranges sorted at minval, you've
processed all values <100. You can't make a similar comparison when
that range is sorted on maxvals.

> 3. NULLS FIRST: read all ranges that might have NULLs => output
> 4. read the next range (by maxval) into tuplesort
>(if no more ranges, go to (9))
> 5. load all tuples from "spill" tuplestore, compare to maxval

Instead of this, shouldn't an update to tuplesort that allows for
restarting the sort be better than this? Moving tuples that we've
accepted into BRINsort state but not yet returned around seems like a
waste of cycles, and I can't think of a reason why it can't work.

> 6. load all tuples from no-summarized ranges (first range only)
>(into tuplesort/tuplestore, depending on maxval comparison)
> 7. load all intersecting ranges (with minval < current maxval)
>(into tuplesort/tuplestore, depending on maxval comparison)
> 8. sort the tuplesort, output all tuples, then back to (4)
> 9. NULLS LAST: read all ranges that might have NULLs => output
> 10. done
>
> For "DESC" ordering the process is almost the same, except that we swap
> minval/maxval in most places.

When I was thinking about this feature at the PgCon unconference, I
was thinking about it more along the lines of the following system
(for ORDER BY col ASC NULLS FIRST):

1. prepare tuplesort Rs (for Rangesort) for BRIN tuples, ordered by
[has_nulls, min ASC]
2. scan info about ranges from BRIN, store them in Rs.
3. Finalize the sorting of Rs.
4. prepare tuplesort Ts (for Tuplesort) for sorting on the specified
column ordering.
5. load all tuples from no-summarized ranges into Ts'
6. while Rs has a block range Rs' with has_nulls:
   - Remove Rs' from Rs
   - store the tuples of Rs' range in Ts.
We now have all tuples with NULL in our sorted set; max_sorted = (NULL)
7. Finalize the Ts sorted set.
8. While the next tuple Ts' in the Ts tuplesort <= max_sorted
  - Remove Ts' from Ts
  - Yield Ts'
Now, all tuples up to and including max_sorted are yielded.
9. If there are no more ranges in Rs:
  - Yield all remaining tuples from Ts, then return.
10. "un-finalize" Ts, so that we can start adding tuples to that tuplesort.
 This is different from Tomas' implementation, as he loads the
tuples into a new tuplestore.
11. get the next item from Rs: Rs'
   - remove Rs' from Rs
   - assign Rs' min value to max_sorted
   - store the tuples of Rs' range in Ts
12. while the next item Rs' from Rs has a min value of max_sorted:
   - remove Rs' from Rs
   - store the tuples of Rs' range in Ts
13. The 'new' value from the next item from Rs is stored in
max_sorted. If no such item exists, max_sorted is assigned a sentinel
value (+INF)
14. Go to Step 7

This set of operations requires a restarting tuplesort for Ts, but I
don't think that would result in many API changes for tuplesort. It
reduces the overhead of large overlapping ranges, as it doesn't need
to copy all tuples that have been read from disk but have not yet been
returned.
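
The 'un-finalize' idea could be modeled with a toy class like the one
below. Python with invented method names; this is not tuplesort's actual
API. It keeps the already-sorted remainder and merges newly added tuples
into it on the next finalize.

```python
import heapq

class RestartableSort:
    """Toy model of a tuplesort that can return to its load phase."""
    def __init__(self):
        self._sorted = []        # already-sorted, not-yet-emitted tuples
        self._pending = []       # tuples added since the last finalize
        self._finalized = False

    def add(self, value):
        assert not self._finalized, "un-finalize before adding tuples"
        self._pending.append(value)

    def finalize(self):
        # Merging into the sorted remainder is cheap relative to a full
        # re-sort, which is the efficiency argument made above.
        self._sorted = list(heapq.merge(self._sorted, sorted(self._pending)))
        self._pending = []
        self._finalized = True

    def unfinalize(self):
        self._finalized = False

    def pop_below(self, max_sorted):
        """Emit all tuples strictly below max_sorted (the next range's
        minval); strict '<', since values equal to that minval may still
        get company from the next range."""
        out = []
        while self._sorted and self._sorted[0] < max_sorted:
            out.append(self._sorted.pop(0))
        return out
```

Steps 7-14 above then become: finalize, pop everything below the next
minval, un-finalize, load the next range(s), and repeat.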

The maximum cost of this tuplesort would be the cost of sorting a
seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
extra compare per tuple and range that are needed to determine whether
the range or tuple should be extracted from the tuplesort. The minimum
cost would be the cost of sorting all BRIN ranges, plus sorting all
tuples in one of the index's ranges.

Kind regards,

Matthias van de Meent

PS. Are you still planning on giving the HOT optimization for BRIN a
second try? I'm fairly confident that my patch at [0] would fix the
issue that led to the revert of that feature, but it introduced ABI
changes after the feature freeze and thus it didn't get in. The patch
might need some polishing, but I think it shouldn't take too much
extra effort to get into PG16.

[0] 
https://www.postgresql.org/message-id/CAEze2Wi9%3DBay_%3DrTf8Z6WPgZ5V0tDOayszQJJO%3DR_9aaHvr%2BTg%40mail.gmail.com




Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Matthias van de Meent
On Sun, 16 Oct 2022 at 16:42, Tom Lane  wrote:
>
> It also seems kind of unfair to decide
> that the relevant comparison point is a seqscan rather than a
> btree indexscan.

I think the comparison against full table scan seems appropriate, as
the benefit of BRIN is less space usage when compared to other
indexes, and better IO selectivity than full table scans.

A btree easily requires 10x the space of a normal BRIN index, and may
require a lot of random IO whilst scanning. This BRIN-sorted scan
would have a much lower random IO cost during its scan, and would help
bridge the performance gap between having index that supports ordered
retrieval, and no index at all, which is especially steep in large
tables.

I think that BRIN would be an alternative to btree as a provider of
sorted data, even when the table is not 100% clustered. This
BRIN-assisted table sort can help reduce the amount of data that is
accessed in top-N sorts significantly, both at the index and at the
relation level, without having the space overhead of "all sortable
columns get a btree index".

If BRIN gets its HOT optimization back, the benefits would be even
larger, as we would then have an index that can speed up top-N sorts
without bloating other indexes, and at very low disk footprint.
Columns that are only occasionally accessed in a sorted manner could
then get BRIN minmax indexes to support this sort, at minimal overhead
to the rest of the application.

Kind regards,

Matthias van de Meent




Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Tomas Vondra
On 10/16/22 16:42, Zhihong Yu wrote:
> ...
> 
> I don't have good answer w.r.t. splitting the page range [0,127] now.
> Let me think more about it.
> 

Sure, no problem.

> The 10-step flow (subject to changes down the road) should be either
> given in the description of the patch or written as a comment inside
> the code.
> This would help people grasp the concept much faster.

True. I'll add it to the next version of the patch.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Tomas Vondra



On 10/16/22 16:41, Tom Lane wrote:
> Tomas Vondra  writes:
>> I forgot to mention one important issue in my list yesterday, and that's
>> memory consumption.
> 
> TBH, this is all looking like vastly more complexity than benefit.
> It's going to be impossible to produce a reliable cost estimate
> given all the uncertainty, and I fear that will end in picking
> BRIN-based sorting when it's not actually a good choice.
> 

Maybe. If it turns out the estimates we have are insufficient to make
good planning decisions, that's life.

As I wrote in my message, I know the BRIN costing is a bit shaky in
general (not just for this new operation), and I intend to propose some
improvement in a separate patch.

I think the main issue with BRIN costing is that we have no stats about
the ranges, and we can't estimate how many ranges we'll really end up
accessing. If you have 100 rows, will that be 1 range or 100 ranges? Or
for the BRIN Sort, how many overlapping ranges will there be?

I intend to allow index AMs to collect custom statistics, and the BRIN
minmax opfamily would collect e.g. this:

1) number of non-summarized ranges
2) number of all-nulls ranges
3) number of has-nulls ranges
4) average number of overlaps (given a random range, how many other
   ranges intersect with it)
5) how likely is it for a row to hit multiple ranges (cross-check
   sample rows vs. ranges)

I believe this will allow much better / more reliable BRIN costing (the
number of overlaps is particularly useful for this patch).
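
Purely for illustration, a cost shape built on the proposed overlap
statistic might look like the sketch below. Python, not the patch's
costing code; `io_range_cost` is an invented knob, and `cpu_operator_cost`
only borrows the name of the real planner GUC.

```python
import math

def brinsort_cost_estimate(nranges, avg_overlaps, tuples_per_range,
                           io_range_cost=1.0, cpu_operator_cost=0.0025):
    """Each of the nranges sort rounds handles its own tuples plus the
    tuples of the ranges overlapping it, so the per-round sort size is
    roughly tuples_per_range * (1 + avg_overlaps)."""
    per_sort = tuples_per_range * (1 + avg_overlaps)
    sort_cost = nranges * per_sort * math.log2(max(per_sort, 2))
    return io_range_cost * nranges + cpu_operator_cost * sort_cost
```

The point is only that avg_overlaps enters the estimate directly: with
well-correlated data (avg_overlaps near 0) the cost approaches a stream
of small sorts, and as overlaps grow it degrades toward one big sort.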

> The examples you showed initially are cherry-picked to demonstrate
> the best possible case, which I doubt has much to do with typical
> real-world tables.  It would be good to see what happens with
> not-perfectly-sequential data before even deciding this is worth
> spending more effort on.

Yes, the example was a trivial "happy case" example. Obviously, the
performance degrades as the data becomes more random (with wider ranges),
forcing the BRIN Sort to read / sort more tuples.

But let's see an example with less correlated data, say, like this:

create table t (a int) with (fillfactor = 10);

insert into t select i + 1 * random()
from generate_series(1,1000) s(i);

With the fillfactor=10, there are ~2500 values per 1MB range, so this
means each range overlaps with ~4 more. The results then look like this:

1) select * from t order by a;

  seqscan+sort: 4437 ms
  brinsort: 4233 ms

2) select * from t order by a limit 10;

  seqscan+sort: 1859 ms
  brinsort: 4 ms

If you increase the random factor from 1 to 10 (so, 40 ranges),
the seqscan timings remain about the same, while brinsort gets to 5200
and 20 ms. And with 1M, it's ~6000 and 300 ms.

Only at 500, where we pretty much read 1/2 the table because the
ranges intersect, we get the same timing as the seqscan (for the LIMIT
query). The "full sort" query is more like 5000 vs. 6600 ms, so slower
but not by a huge amount.

Yes, this is a very simple example. I can do more tests with other
datasets (larger/smaller, different distribution, ...).

> It also seems kind of unfair to decide
> that the relevant comparison point is a seqscan rather than a
> btree indexscan.
> 

I don't think it's all that unfair. How likely is it to have both a BRIN
and btree index on the same column? And even if you do have such indexes
(say, on different sets of keys), we kinda already have this costing
issue with index and bitmap index scans.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Zhihong Yu
On Sun, Oct 16, 2022 at 7:33 AM Tomas Vondra 
wrote:

>
>
> On 10/16/22 16:01, Zhihong Yu wrote:
> >
> >
> > On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra
> > mailto:tomas.von...@enterprisedb.com>>
> > wrote:
> >
> > On 10/15/22 14:33, Tomas Vondra wrote:
> > > Hi,
> > >
> > > ...
> > >
> > > There's a bunch of issues with this initial version of the patch,
> > > usually described in XXX comments in the relevant places.
> > >
> > > ...
> >
> > I forgot to mention one important issue in my list yesterday, and
> that's
> > memory consumption. The way the patch is coded now, the new BRIN
> support
> > function (brin_minmax_ranges) produces information about *all*
> ranges in
> > one go, which may be an issue. The worst case is 32TB table, with
> 1-page
> > BRIN ranges, which means ~4 billion ranges. The info is an array of
> ~32B
> > structs, so this would require ~128GB of RAM. With the default
> 128-page
> > ranges, it'd still be ~1GB, which is quite a lot.
> >
> > We could have a discussion about what's the reasonable size of BRIN
> > ranges on such large tables (e.g. building a bitmap on 4 billion
> ranges
> > is going to be "not cheap" so this is likely pretty rare). But we
> should
> > not introduce new nodes that ignore work_mem, so we need a way to
> deal
> > with such cases somehow.
> >
> > The easiest solution likely is to check this while planning - we can
> > check the table size, calculate the number of BRIN ranges, and check
> > that the range info fits into work_mem, and just not create the path
> > when it gets too large. That's what we did for HashAgg, although that
Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Tom Lane
Tomas Vondra  writes:
> I forgot to mention one important issue in my list yesterday, and that's
> memory consumption.

TBH, this is all looking like vastly more complexity than benefit.
It's going to be impossible to produce a reliable cost estimate
given all the uncertainty, and I fear that will end in picking
BRIN-based sorting when it's not actually a good choice.

The examples you showed initially are cherry-picked to demonstrate
the best possible case, which I doubt has much to do with typical
real-world tables.  It would be good to see what happens with
not-perfectly-sequential data before even deciding this is worth
spending more effort on.  It also seems kind of unfair to decide
that the relevant comparison point is a seqscan rather than a
btree indexscan.

regards, tom lane




Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Tomas Vondra



On 10/16/22 16:01, Zhihong Yu wrote:
> 
> 
> On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra
> <tomas.von...@enterprisedb.com> wrote:
> 
> On 10/15/22 14:33, Tomas Vondra wrote:
> > Hi,
> >
> > ...
> >
> > There's a bunch of issues with this initial version of the patch,
> > usually described in XXX comments in the relevant places.
> >
> > ...
> 
> I forgot to mention one important issue in my list yesterday, and that's
> memory consumption. The way the patch is coded now, the new BRIN support
> function (brin_minmax_ranges) produces information about *all* ranges in
> one go, which may be an issue. The worst case is 32TB table, with 1-page
> BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
> structs, so this would require ~128GB of RAM. With the default 128-page
> ranges, it's still ~1GB, which is quite a lot.
> 
> We could have a discussion about what's the reasonable size of BRIN
> ranges on such large tables (e.g. building a bitmap on 4 billion ranges
> is going to be "not cheap" so this is likely pretty rare). But we should
> not introduce new nodes that ignore work_mem, so we need a way to deal
> with such cases somehow.
> 
> The easiest solution likely is to check this while planning - we can
> check the table size, calculate the number of BRIN ranges, and check
> that the range info fits into work_mem, and just not create the path
> when it gets too large. That's what we did for HashAgg, although that
> decision was unreliable because estimating GROUP BY cardinality is hard.
> 
> The wrinkle here is that counting just the range info (BrinRange struct)
> does not include the values for by-reference types. We could use average
> width - that's just an estimate, though.
> 
> A more comprehensive solution seems to be to allow requesting chunks of
> the BRIN ranges. So that we'd get "slices" of ranges and we'd process
> those. So for example if you have 1000 ranges, and you can only handle
> 100 at a time, we'd do 10 loops, each requesting 100 ranges.
> 
> This has another problem - we do care about "overlaps", and we can't
> really know if the overlapping ranges will be in the same "slice"
> easily. The chunks would be sorted (for example) by maxval. But there
> can be a range with much higher maxval (thus in some future slice), but
> very low minval (thus intersecting with ranges in the current slice).
> 
> Imagine ranges with these minval/maxval values, sorted by maxval:
> 
> [101,200]
> [201,300]
> [301,400]
> [150,500]
> 
> and let's say we can only process 2-range slices. So we'll get the first
> two, but both of them intersect with the very last range.
> 
> We could always include all the intersecting ranges into the slice, but
> what if there are too many very "wide" ranges?
> 
> So I think this will need to switch to an iterative communication with
> the BRIN index - instead of asking "give me info about all the ranges",
> we'll need a way to
> 
>   - request the next range (sorted by maxval)
>   - request the intersecting ranges one by one (sorted by minval)
> 
> Of course, the BRIN side will have some of the same challenges with
> tracking the info without breaking the work_mem limit, but I suppose it
> can store the info into a tuplestore/tuplesort, and use that instead of
> plain in-memory array. Alternatively, it could just return those, and
> BrinSort would use that. OTOH it seems cleaner to have some sort of API,
> especially if we want to support e.g. minmax-multi opclasses, that have
> a more complicated concept of "intersection".
> 
> 
> regards
> 
> -- 
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com 
> The Enterprise PostgreSQL Company
> 
> Hi,
> In your example involving [150,500], can this range be broken down into
> 4 ranges, ending in 200, 300, 400 and 500, respectively ?
> That way, there is no intersection among the ranges.
> 

Not really, I think. These "value ranges" map to "page ranges" and how
would you split those? I mean, you know values [150,500] map to blocks
[0,127]. You split the values into [150,200], [201,300], [301,400]. How
do you split the page range [0,127]?

Also, splitting a range into more ranges is likely making the issue
worse, because it increases the number of ranges, right? And I mean,
much worse, because imagine a "wide" range that overlaps with every
other range - the number of ranges would explode.

It's not clear to me at which point you'd make the split. At the
beginning, right after loading the ranges from BRIN index? A lot of that
may be unnecessary, in case the range is loaded as a "non-intersecting"
range.

Try to formulate the whole algorithm. Maybe I'm missing something.

The current algorithm is something like 

Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Tomas Vondra



On 10/16/22 03:36, Zhihong Yu wrote:
> 
> 
> On Sat, Oct 15, 2022 at 8:23 AM Tomas Vondra
> <tomas.von...@enterprisedb.com> wrote:
> 
> On 10/15/22 15:46, Zhihong Yu wrote:
> >...
> >     8) Parallel version is not supported, but I think it shouldn't be
> >     possible. Just make the leader build the range info, and then
> let the
> >     workers to acquire/sort ranges and merge them by Gather Merge.
> > ...
> > Hi,
> > I am still going over the patch.
> >
> > Minor: for #8, I guess you meant `it should be possible` .
> >
> 
> Yes, I meant to say it should be possible. Sorry for the confusion.
> 
> 
> 
> regards
> 
> -- 
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com 
> The Enterprise PostgreSQL Company
> 
> Hi,
> 
> For brin_minmax_ranges, looking at the assignment to gottuple and
> reading gottuple, it seems variable gottuple can be omitted - we can
> check tup directly.
> 
> +   /* Maybe mark the range as processed. */
> +   range->processed |= mark_processed;
> 
> `Maybe` can be dropped.
> 

No, because the "mark_processed" may be false. So we may not mark it as
processed in some cases.

> For brinsort_load_tuples(), do we need to check for interrupts inside
> the loop ?
> Similar question for subsequent methods involving loops, such
> as brinsort_load_unsummarized_ranges.
> 

We could/should, although most of the loops should be very short.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Zhihong Yu
On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra 
wrote:

> On 10/15/22 14:33, Tomas Vondra wrote:
> > Hi,
> >
> > ...
> >
> > There's a bunch of issues with this initial version of the patch,
> > usually described in XXX comments in the relevant places.
> >
> > ...
>
> I forgot to mention one important issue in my list yesterday, and that's
> memory consumption. The way the patch is coded now, the new BRIN support
> function (brin_minmax_ranges) produces information about *all* ranges in
> one go, which may be an issue. The worst case is 32TB table, with 1-page
> BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
> structs, so this would require ~128GB of RAM. With the default 128-page
> ranges, it's still ~1GB, which is quite a lot.
>
> We could have a discussion about what's the reasonable size of BRIN
> ranges on such large tables (e.g. building a bitmap on 4 billion ranges
> is going to be "not cheap" so this is likely pretty rare). But we should
> not introduce new nodes that ignore work_mem, so we need a way to deal
> with such cases somehow.
>
> The easiest solution likely is to check this while planning - we can
> check the table size, calculate the number of BRIN ranges, and check
> that the range info fits into work_mem, and just not create the path
> when it gets too large. That's what we did for HashAgg, although that
> decision was unreliable because estimating GROUP BY cardinality is hard.
>
> The wrinkle here is that counting just the range info (BrinRange struct)
> does not include the values for by-reference types. We could use average
> width - that's just an estimate, though.
>
> A more comprehensive solution seems to be to allow requesting chunks of
> the BRIN ranges. So that we'd get "slices" of ranges and we'd process
> those. So for example if you have 1000 ranges, and you can only handle
> 100 at a time, we'd do 10 loops, each requesting 100 ranges.
>
> This has another problem - we do care about "overlaps", and we can't
> really know if the overlapping ranges will be in the same "slice"
> easily. The chunks would be sorted (for example) by maxval. But there
> can be a range with much higher maxval (thus in some future slice), but
> very low minval (thus intersecting with ranges in the current slice).
>
> Imagine ranges with these minval/maxval values, sorted by maxval:
>
> [101,200]
> [201,300]
> [301,400]
> [150,500]
>
> and let's say we can only process 2-range slices. So we'll get the first
> two, but both of them intersect with the very last range.
>
> We could always include all the intersecting ranges into the slice, but
> what if there are too many very "wide" ranges?
>
> So I think this will need to switch to an iterative communication with
> the BRIN index - instead of asking "give me info about all the ranges",
> we'll need a way to
>
>   - request the next range (sorted by maxval)
>   - request the intersecting ranges one by one (sorted by minval)
>
> Of course, the BRIN side will have some of the same challenges with
> tracking the info without breaking the work_mem limit, but I suppose it
> can store the info into a tuplestore/tuplesort, and use that instead of
> plain in-memory array. Alternatively, it could just return those, and
> BrinSort would use that. OTOH it seems cleaner to have some sort of API,
> especially if we want to support e.g. minmax-multi opclasses, that have
> a more complicated concept of "intersection".
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
> Hi,
In your example involving [150,500], can this range be broken down into 4
ranges, ending in 200, 300, 400 and 500, respectively ?
That way, there is no intersection among the ranges.

bq. can store the info into a tuplestore/tuplesort

Wouldn't this involve disk accesses which may reduce the effectiveness of
BRIN sort ?

Cheers


Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Tomas Vondra
On 10/15/22 14:33, Tomas Vondra wrote:
> Hi,
> 
> ...
> 
> There's a bunch of issues with this initial version of the patch,
> usually described in XXX comments in the relevant places.
> 
> ...

I forgot to mention one important issue in my list yesterday, and that's
memory consumption. The way the patch is coded now, the new BRIN support
function (brin_minmax_ranges) produces information about *all* ranges in
one go, which may be an issue. The worst case is 32TB table, with 1-page
BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
structs, so this would require ~128GB of RAM. With the default 128-page
ranges, it's still ~1GB, which is quite a lot.

We could have a discussion about what's the reasonable size of BRIN
ranges on such large tables (e.g. building a bitmap on 4 billion ranges
is going to be "not cheap" so this is likely pretty rare). But we should
not introduce new nodes that ignore work_mem, so we need a way to deal
with such cases somehow.

The easiest solution likely is to check this while planning - we can
check the table size, calculate the number of BRIN ranges, and check
that the range info fits into work_mem, and just not create the path
when it gets too large. That's what we did for HashAgg, although that
decision was unreliable because estimating GROUP BY cardinality is hard.

The wrinkle here is that counting just the range info (BrinRange struct)
does not include the values for by-reference types. We could use average
width - that's just an estimate, though.

A more comprehensive solution seems to be to allow requesting chunks of
the BRIN ranges. So that we'd get "slices" of ranges and we'd process
those. So for example if you have 1000 ranges, and you can only handle
100 at a time, we'd do 10 loops, each requesting 100 ranges.

This has another problem - we do care about "overlaps", and we can't
really know if the overlapping ranges will be in the same "slice"
easily. The chunks would be sorted (for example) by maxval. But there
can be a range with much higher maxval (thus in some future slice), but
very low minval (thus intersecting with ranges in the current slice).

Imagine ranges with these minval/maxval values, sorted by maxval:

[101,200]
[201,300]
[301,400]
[150,500]

and let's say we can only process 2-range slices. So we'll get the first
two, but both of them intersect with the very last range.

We could always include all the intersecting ranges into the slice, but
what if there are too many very "wide" ranges?

So I think this will need to switch to an iterative communication with
the BRIN index - instead of asking "give me info about all the ranges",
we'll need a way to

  - request the next range (sorted by maxval)
  - request the intersecting ranges one by one (sorted by minval)

Of course, the BRIN side will have some of the same challenges with
tracking the info without breaking the work_mem limit, but I suppose it
can store the info into a tuplestore/tuplesort, and use that instead of
plain in-memory array. Alternatively, it could just return those, and
BrinSort would use that. OTOH it seems cleaner to have some sort of API,
especially if we want to support e.g. minmax-multi opclasses, that have
a more complicated concept of "intersection".


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2022-10-15 Thread Zhihong Yu
On Sat, Oct 15, 2022 at 8:23 AM Tomas Vondra 
wrote:

> On 10/15/22 15:46, Zhihong Yu wrote:
> >...
> > 8) Parallel version is not supported, but I think it shouldn't be
> > possible. Just make the leader build the range info, and then let the
> > workers to acquire/sort ranges and merge them by Gather Merge.
> > ...
> > Hi,
> > I am still going over the patch.
> >
> > Minor: for #8, I guess you meant `it should be possible` .
> >
>
> Yes, I meant to say it should be possible. Sorry for the confusion.
>
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
Hi,

For brin_minmax_ranges, looking at the assignment to gottuple and
reading gottuple, it seems variable gottuple can be omitted - we can check
tup directly.

+   /* Maybe mark the range as processed. */
+   range->processed |= mark_processed;

`Maybe` can be dropped.

For brinsort_load_tuples(), do we need to check for interrupts inside the
loop ?
Similar question for subsequent methods involving loops, such
as brinsort_load_unsummarized_ranges.

Cheers


Re: PATCH: Using BRIN indexes for sorted output

2022-10-15 Thread Tomas Vondra
On 10/15/22 15:46, Zhihong Yu wrote:
>...
> 8) Parallel version is not supported, but I think it shouldn't be
> possible. Just make the leader build the range info, and then let the
> workers to acquire/sort ranges and merge them by Gather Merge.
> ...
> Hi,
> I am still going over the patch.
> 
> Minor: for #8, I guess you meant `it should be possible` .
> 

Yes, I meant to say it should be possible. Sorry for the confusion.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PATCH: Using BRIN indexes for sorted output

2022-10-15 Thread Zhihong Yu
On Sat, Oct 15, 2022 at 5:34 AM Tomas Vondra 
wrote:

> Hi,
>
> There have been a couple discussions about using BRIN indexes for
> sorting - in fact this was mentioned even in the "Improving Indexing
> Performance" unconference session this year (don't remember by whom).
> But I haven't seen any patches, so here's one.
>
> The idea is that we can use information about ranges to split the table
> into smaller parts that can be sorted in smaller chunks. For example if
> you have a tiny 2MB table with two ranges, with values in [0,100] and
> [101,200] intervals, then it's clear we can sort the first range, output
> tuples, and then sort/output the second range.
>
> The attached patch builds "BRIN Sort" paths/plans, closely resembling
> index scans, only for BRIN indexes. And this special type of index scan
> does what was mentioned above - incrementally sorts the data. It's a bit
> more complicated because of overlapping ranges, ASC/DESC, NULL etc.
>
> This is disabled by default, using a GUC enable_brinsort (you may need
> to tweak other GUCs to disable parallel plans etc.).
>
> A trivial example, demonstrating the benefits:
>
>   create table t (a int) with (fillfactor = 10);
>   insert into t select i from generate_series(1,1000) s(i);
>
>
> First, a simple LIMIT query:
>
> explain (analyze, costs off) select * from t order by a limit 10;
>
>   QUERY PLAN
> ------------------------------------------------------------------
>  Limit (actual time=1879.768..1879.770 rows=10 loops=1)
>->  Sort (actual time=1879.767..1879.768 rows=10 loops=1)
>  Sort Key: a
>  Sort Method: top-N heapsort  Memory: 25kB
>  ->  Seq Scan on t
>  (actual time=0.007..1353.110 rows=1000 loops=1)
>  Planning Time: 0.083 ms
>  Execution Time: 1879.786 ms
> (7 rows)
>
>   QUERY PLAN
> ------------------------------------------------------------------
>  Limit (actual time=1.217..1.219 rows=10 loops=1)
>->  BRIN Sort using t_a_idx on t
>(actual time=1.216..1.217 rows=10 loops=1)
>  Sort Key: a
>  Planning Time: 0.084 ms
>  Execution Time: 1.234 ms
> (5 rows)
>
> That's a pretty nice improvement - of course, this is thanks to having a
> perfectly sequential table, and the difference can be almost arbitrary by
> making the table smaller/larger. Similarly, if the table gets less
> sequential (making ranges to overlap), the BRIN plan will be more
> expensive. Feel free to experiment with other data sets.
>
> However, not only the LIMIT queries can improve - consider a sort of the
> whole table:
>
> test=# explain (analyze, costs off) select * from t order by a;
>
>QUERY PLAN
> -------------------------------------------------------------------
>  Sort (actual time=2806.468..3487.213 rows=1000 loops=1)
>Sort Key: a
>Sort Method: external merge  Disk: 117528kB
>->  Seq Scan on t (actual time=0.018..1498.754 rows=1000 loops=1)
>  Planning Time: 0.110 ms
>  Execution Time: 3766.825 ms
> (6 rows)
>
> test=# explain (analyze, costs off) select * from t order by a;
> QUERY PLAN
> ------------------------------------------------------------------
>  BRIN Sort using t_a_idx on t (actual time=1.210..2670.875 rows=1000
> loops=1)
>Sort Key: a
>  Planning Time: 0.073 ms
>  Execution Time: 2939.324 ms
> (4 rows)
>
> Right - not a huge difference, but still a nice 25% speedup, mostly due
> to not having to spill data to disk and sorting smaller amounts of data.
>
> There's a bunch of issues with this initial version of the patch,
> usually described in XXX comments in the relevant places.
>
> 1) The paths are created in build_index_paths() because that's what
> creates index scans (which the new path resembles). But that is expected
> to produce IndexPath, not BrinSortPath, so it's not quite correct.
> Should be somewhere "higher" I guess.
>
> 2) BRIN indexes don't have internal ordering, i.e. ASC/DESC and NULLS
> FIRST/LAST does not really matter for them. The patch just generates
> paths for all 4 combinations (or tries to). Maybe there's a better way.
>
> 3) I'm not quite sure the separation of responsibilities between
> opfamily and opclass is optimal. I added a new amproc, but maybe this
> should be split differently. At the moment only minmax indexes have
> this, but adding this to minmax-multi should be trivial.
>
> 4) The state changes in nodeBrinSort is a bit confusing. Works, but may
> need cleanup and refactoring. Ideas welcome.
>
> 5) The costing is essentially just plain cost_index. I have some ideas
> about BRIN costing in general, which I'll post in a separate thread (as
> it's not specific to this patch).
>
> 6) At the moment this only picks one of the index keys, specified in the
> ORDER BY clause. I think we can generalize this to multiple keys, but
> thinking about multi-key ranges was