Re: [HACKERS] Re: Poor cost estimate with interaction between table correlation and partial indexes
Hmm... It seems the command I used for generating the patch, which I got from https://wiki.postgresql.org/wiki/Working_with_Git, truncated part of the patch. I've attached the file generated with

    git diff --patience master improve-partial-index-correlation-calculation --no-color > improve-correlated-partial-index-cost-v2.patch

to this email. What is the correct command for generating a context diff?

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 051a854..58224e6 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -163,7 +163,10 @@ static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
 static double relation_byte_size(double tuples, int width);
 static double page_size(double tuples, int width);
 static double get_parallel_divisor(Path *path);
-
+static double ordered_page_read_cost(double pages_fetched,
+					int baserel_pages,
+					double spc_seq_page_cost,
+					double spc_random_page_cost);
 
 /*
  * clamp_row_est
@@ -652,9 +655,26 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
 
 	if (pages_fetched > 0)
 	{
-		min_IO_cost = spc_random_page_cost;
-		if (pages_fetched > 1)
-			min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
+		if (index->indpred == NIL)
+		{
+			min_IO_cost = spc_random_page_cost;
+			if (pages_fetched > 1)
+				min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
+		}
+		else
+		{
+			/*
+			 * For a partial index perfectly correlated with the table
+			 * ordering, consecutively fetched pages are not guaranteed
+			 * to be adjacent in the table.  Instead use
+			 * ordered_page_read_cost to estimate the cost of reading
+			 * pages from the heap.
+			 */
+			min_IO_cost = ordered_page_read_cost(pages_fetched,
+								baserel->pages,
+								spc_seq_page_cost,
+								spc_random_page_cost);
+		}
 	}
 	else
 		min_IO_cost = 0;
@@ -934,13 +954,11 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 	Cost		indexTotalCost;
 	QualCost	qpqual_cost;
 	Cost		cpu_per_tuple;
-	Cost		cost_per_page;
 	Cost		cpu_run_cost;
 	double		tuples_fetched;
 	double		pages_fetched;
 	double		spc_seq_page_cost,
 				spc_random_page_cost;
-	double		T;
 
 	/* Should only be applied to base relations */
 	Assert(IsA(baserel, RelOptInfo));
@@ -961,28 +979,16 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 										 &pages_fetched);
 
 	startup_cost += indexTotalCost;
-	T = (baserel->pages > 1) ? (double) baserel->pages : 1.0;
 
 	/* Fetch estimated page costs for tablespace containing table. */
 	get_tablespace_page_costs(baserel->reltablespace,
 							  &spc_random_page_cost,
 							  &spc_seq_page_cost);
 
-	/*
-	 * For small numbers of pages we should charge spc_random_page_cost
-	 * apiece, while if nearly all the table's pages are being read, it's more
-	 * appropriate to charge spc_seq_page_cost apiece.  The effect is
-	 * nonlinear, too.  For lack of a better idea, interpolate like this to
-	 * determine the cost per page.
-	 */
-	if (pages_fetched >= 2.0)
-		cost_per_page = spc_random_page_cost -
-			(spc_random_page_cost - spc_seq_page_cost)
-			* sqrt(pages_fetched / T);
-	else
-		cost_per_page = spc_random_page_cost;
-
-	run_cost += pages_fetched * cost_per_page;
+	run_cost += ordered_page_read_cost(pages_fetched,
+									   baserel->pages,
+									   spc_seq_page_cost,
+									   spc_random_page_cost);
 
 	/*
 	 * Estimate CPU costs per tuple.
@@ -5183,3 +5189,34 @@ compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel, Path *bitmapqual,
 
 	return pages_fetched;
 }
+
+
+/*
+ * Estimate the total cost of reading possibly nonconsecutive pages from the
+ * heap in order.
+ */
+static double
+ordered_page_read_cost(double pages_fetched, int baserel_pages,
+					   double spc_seq_page_cost, double spc_random_page_cost)
+{
+	double		cost_per_page;
+	double		T;
+
+	T = (baserel_pages > 1) ? (double) baserel_pages : 1.0;
+
+	/*
+	 * For small numbers of pages we should charge spc_random_page_cost
+	 * apiece, while if nearly all the table's pages are being read, it's more
+	 * appropriate to charge spc_seq_page_cost apiece.  The effect is
+	 * nonlinear, too.  For lack of a better idea, interpolate like this to
+	 * determine the cost per page.
+	 */
+	if (pages_fetched >= 2.0)
+		cost_per_page = spc_random_page_cost -
+			(spc_random_page_cost - spc_seq_page_cost)
+			* sqrt(pages_fetched / T);
+	else
+		cost_per_page = spc_random_page_cost;
+
+	return pages_fetched * cost_per_page;
+}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Poor cost estimate with interaction between table correlation and partial indexes
(Sorry David, I initially replied only to you.)

Ok, I've attached a proof-of-concept patch. I have a few questions about tests. What is the typical workflow for adding tests for changes to the planner? Also, I ran make check and it appears one of the existing tests is failing. What is a typical way of discovering why the query plan for a specific query changed? And how should I go about changing the old test: should I replace the old test output with the new output, or modify the old test slightly so that it produces the same plan as before?

Thanks,
Michael

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 051a854..58224e6 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
***************
*** 163,169 ****
  static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
  static double relation_byte_size(double tuples, int width);
  static double page_size(double tuples, int width);
  static double get_parallel_divisor(Path *path);
! 
  /*
   * clamp_row_est
--- 163,172 ----
  static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
  static double relation_byte_size(double tuples, int width);
  static double page_size(double tuples, int width);
  static double get_parallel_divisor(Path *path);
! static double ordered_page_read_cost(double pages_fetched,
! 					int baserel_pages,
! 					double spc_seq_page_cost,
! 					double spc_random_page_cost);
  /*
   * clamp_row_est
***************
*** 652,660 ****
  
  	if (pages_fetched > 0)
  	{
! 		min_IO_cost = spc_random_page_cost;
! 		if (pages_fetched > 1)
! 			min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
  	}
  	else
  		min_IO_cost = 0;
--- 655,680 ----
  
  	if (pages_fetched > 0)
  	{
! 		if (index->indpred == NIL)
! 		{
! 			min_IO_cost = spc_random_page_cost;
! 			if (pages_fetched > 1)
! 				min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
! 		}
! 		else
! 		{
! 			/*
! 			 * For a partial index perfectly correlated with the table
! 			 * ordering, consecutively fetched pages are not guaranteed
! 			 * to be adjacent in the table.  Instead use
! 			 * ordered_page_read_cost to estimate the cost of reading
! 			 * pages from the heap.
! 			 */
! 			min_IO_cost = ordered_page_read_cost(pages_fetched,
! 								baserel->pages,
! 								spc_seq_page_cost,
! 								spc_random_page_cost);
! 		}
  	}
  	else
  		min_IO_cost = 0;
***************
*** 934,946 ****
  	Cost		indexTotalCost;
  	QualCost	qpqual_cost;
  	Cost		cpu_per_tuple;
- 	Cost		cost_per_page;
  	Cost		cpu_run_cost;
  	double		tuples_fetched;
  	double		pages_fetched;
  	double		spc_seq_page_cost,
  				spc_random_page_cost;
- 	double		T;
  
  	/* Should only be applied to base relations */
  	Assert(IsA(baserel, RelOptInfo));
--- 954,964 ----
[HACKERS] Re: Poor cost estimate with interaction between table correlation and partial indexes
Do you think this is a reasonable approach? Should I start working on a patch based on the solution I described, or is there some other approach I should look into?
[HACKERS] Poor cost estimate with interaction between table correlation and partial indexes
Hi. I'm looking to get started contributing code to Postgres. A small issue I'm aware of that I think would make a good first contribution is a poor cost estimate caused by an interaction between table correlation and partial indexes.

Currently the planner assumes that when an index is perfectly correlated with a table and a range scan is performed on the index, all of the table page reads performed by the index scan except for the first one will be sequential reads. While this assumption is correct for regular indexes, it does not hold for partial indexes. It holds for regular indexes because the rows corresponding to two adjacent entries in a regular index that is perfectly correlated with the table are guaranteed to be next to each other in the table. With a partial index that is perfectly correlated with the table, on the other hand, there may be rows in the table in between the rows corresponding to two adjacent index entries that are not included in the index because they do not satisfy the partial index predicate.

To make the cost calculation for this case more accurate, I want to apply the same estimate as the one currently used for a bitmap heap scan. The bitmap heap scan estimate applies here because both cases involve reading pages from disk in order of their location in the table, where the pages may not be consecutive. For the relevant functions, see cost_index and cost_bitmap_heap_scan in costsize.c.

Thanks,
Michael
Re: [HACKERS] Highly Variable Planning Times
Thanks for the help Andres.
Re: [HACKERS] Highly Variable Planning Times
> The profile seems to confirm that this is largely due to cache misses.

Can you elaborate? Cache misses of what? Why is the planning time so variable?
Re: [HACKERS] Highly Variable Planning Times
> TBH, I see no convincing explanation in that article of why 1300 partial
> indexes are a good idea.

I don't like it much either. I've been trying to eliminate most of the need for the partial indexes, but this is the current state of things.

> *At best*, you're doing substantial work in the
> planner to avoid the first tree descent step or two in a single
> non-partial index.

While the specific example I gave in the post could be replaced with a non-partial index, most of the partial indexes contain predicates that are not straightforward to index with non-partial indexes. About 85% of the partial indexes contain a regular expression for CSS selector matching. I've tried using a trigram GIN index, but it wound up not working well because the data is highly redundant (almost every CSS hierarchy contains 'div'). Additionally, most of the partial index predicates are extremely specific, making the partial indexes very small. The total size of all of the partial indexes is still much smaller than if we were to index every necessary field with regular indexes.

Thanks,
Michael
[HACKERS] Highly Variable Planning Times
Hi, I've been encountering highly variable planning times on PG 9.5.4. Running `EXPLAIN SELECT * FROM events_1171738` will take anywhere from 200ms to 4s. This likely has to do with the table having 1300 partial indexes on it (for reasons elaborated on in https://blog.heapanalytics.com/running-10-million-postgresql-indexes-in-production/). I'm trying to figure out if there is something I can do to prevent the slow planning.

When I used `strace -c` on the backend process, I found that the number of `semop` system calls is much higher for the slower queries than for the faster ones. After digging in a bit more, I found that when the query was slow, a single semaphore would make up most of the `semop` calls. Here is a snippet of the output when I ran `strace -r` on a query with high planning time:

     0.000225 open("base/16384/13236647", O_RDWR) = 219
     0.000110 lseek(219, 0, SEEK_END) = 16384
     0.000122 semop(1736711, [{9, -1, 0}], 1) = 0
     0.004110 semop(1769480, [{10, 1, 0}], 1) = 0
     0.000197 semop(1736711, [{9, -1, 0}], 1) = 0
     0.010096 semop(1540097, [{1, 1, 0}], 1) = 0
     0.000176 semop(1736711, [{9, -1, 0}], 1) = 0
     0.004069 semop(1638404, [{15, 1, 0}], 1) = 0
     0.000244 open("base/16384/13236648", O_RDWR) = 220
     0.000131 lseek(220, 0, SEEK_END) = 16384
     0.000212 semop(1736711, [{9, -1, 0}], 1) = 0
     0.007851 semop(1736711, [{8, 1, 0}], 1) = 0
     0.000130 semop(1736711, [{9, -1, 0}], 1) = 0
     0.000101 semop(1769480, [{5, 1, 0}], 1) = 0
     0.000450 open("base/16384/13236653", O_RDWR) = 221
     0.000151 lseek(221, 0, SEEK_END) = 180224
     0.000141 semop(1736711, [{9, -1, 0}], 1) = 0
     0.000239 semop(1736711, [{9, -1, 0}], 1) = 0
     0.000419 semop(1736711, [{10, 1, 0}], 1) = 0
     0.000127 semop(1736711, [{9, -1, 0}], 1) = 0
     0.001493 semop(1703942, [{3, 1, 0}], 1) = 0

You can see that most of the semop calls are waits on semaphore #9 in semaphore set 1736711. I looked up the relations for the files being opened and they all seem to be indexes on the table.
Based on some flame graphs I generated, I believe most of the time is spent inside the function get_relation_info. I think the open/lseek calls are coming from the estimate_rel_size call in get_relation_info, but I haven't been able to figure out where the semop calls are coming from. The semop calls appear to be at least a symptom of the high planning time. I'm looking for help finding the root cause of the high planning time and figuring out a way to eliminate it.

Thanks,
Michael
Re: [HACKERS] Merge Join with an Index Optimization
I discovered that the kind of join I proposed is called the leapfrog triejoin: https://arxiv.org/pdf/1210.0481v5.pdf
Re: [HACKERS] Merge Join with an Index Optimization
On Sun, Sep 11, 2016 at 9:20 AM Tom Lane <t...@sss.pgh.pa.us> wrote:

> Michael Malis <michaelmal...@gmail.com> writes:
>> As I understand it, a merge join will currently read all tuples from both
>> subqueries (besides early termination). I believe it should be possible to
>> take advantage of the indexes on one or both of the tables being read from
>> to skip a large number of tuples that would currently be read.
>
> IIUC, what you're proposing is to replace "read past some tuples in the
> index" with "re-descend the index btree". Yes, that would win if it

Roughly yes, although is it necessary to rescan the btree from the top? Is it not possible to "resume" the scan from the previous location?

> You might want to troll the archives for past discussions of index skip
> scan, which is more or less the same idea (could possibly be exactly the
> same idea, depending on how it's implemented).

After searching a bit, all I was able to find was this thread: https://www.postgresql.org/message-id/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com. It looks like someone had a rough patch applying a similar idea to DISTINCT. I can't tell what happened to it, but in the patch they mention that it should be possible to do a "local traversal ie from the current page".

A different but similar optimization would be to first check the join condition against the data in the index, and only fetch the full row if the data in the index passes the join condition. I imagine this would be easier to implement, although less efficient than what I proposed above because it still has to scan the entire index.

Thanks,
Michael
[HACKERS] Merge Join with an Index Optimization
Hi,

As I understand it, a merge join will currently read all tuples from both subqueries (besides early termination). I believe it should be possible to take advantage of the indexes on one or both of the tables being read from to skip a large number of tuples that would currently be read.

As an example, let's say we are doing an equi merge join between two tables with the following data:

(3, 4, 5, 9, 10, 11)
(0, 1, 2, 6, 7, 8)

Currently, even though no tuples would be returned from the merge join, all of the data will be read from both tables. If there are indexes on both tables, pg should be able to execute as follows. Get the tuple with value 3 from the first table, then look up the first value greater than or equal to 3 in the second table using the index. In this case, that value would be 6. Since 3 != 6, pg would then look up the first value greater than or equal to 6 in the first table. This process repeats, and tuples are output whenever the values are equal. This can be thought of as an alternating nested loop join where the positions in the indexes are maintained.

In the case where only one of the tables has an index, this can be thought of as a nested loop join where the inner relation is the table with the index, and the position in that index is maintained while iterating over the outer relation (the outer relation would need to be sorted for this to work).

I haven't demonstrated this in any benchmarks, but I imagine it would dramatically speed up merge joins between two tables where the number of tuples satisfying the join condition is small. I don't know the Postgres internals well enough to know if there is anything obvious that would prevent this optimization.

Thanks,
Michael