Re: [HACKERS] Merge Join with an Index Optimization

2016-09-11 Thread Michael Malis
On Sun, Sep 11, 2016 at 9:20 AM Tom Lane <t...@sss.pgh.pa.us> wrote:

> Michael Malis <michaelmal...@gmail.com> writes:
>> As I understand it, a merge join will currently read all tuples from both
>> subqueries (besides early termination). I believe it should be possible to
>> take advantage of the indexes on one or both of the tables being read from
>> to skip a large number of tuples that would currently be read.
>
> IIUC, what you're proposing is to replace "read past some tuples in the
> index" with "re-descend the index btree".  Yes, that would win if it

Roughly yes, although is it necessary to rescan the btree from the top? Is
it not possible to "resume" the scan from the previous location?


> You might want to troll the archives for past discussions of index skip
> scan, which is more or less the same idea (could possibly be exactly the
> same idea, depending on how it's implemented).

After searching a bit, all I was able to find was this thread:
<https://www.postgresql.org/message-id/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com>.
It looks like someone had a rough patch applying a similar idea to
DISTINCT. I can't tell what happened to it, but in the patch they mention
that it should be possible to do a "local traversal ie from the current
page".

A different but related optimization would be to first check the join
condition against the data stored in the index, and only fetch the full row
if the index data passes the join condition. I imagine that this would be
easier to implement, although less efficient than what I proposed above
because it still has to scan the entire index.
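
To illustrate this second idea, here is a toy sketch in plain C (not
PostgreSQL internals; the structures and names are made up for the
example): model an index entry as the join key plus the row's location, and
only pay for the "heap fetch" when the key already satisfies the join
condition.

#include <stdio.h>

/* Toy model of an index entry: the join key plus the row's heap location. */
typedef struct
{
	int		key;
	int		heap_pos;
} IdxEntry;

int
main(void)
{
	const char *heap[] = {"row 0", "row 1", "row 2", "row 3"};	/* full rows */
	IdxEntry	idx[] = {{3, 0}, {6, 1}, {6, 2}, {9, 3}};	/* sorted by key */
	int			outer_key = 6;
	int			heap_fetches = 0;

	for (int i = 0; i < 4; i++)
	{
		/* Check the join condition against the data stored in the index... */
		if (idx[i].key != outer_key)
			continue;
		/* ...and fetch the full row only for entries that pass. */
		printf("joined with %s\n", heap[idx[i].heap_pos]);
		heap_fetches++;
	}
	printf("heap fetches: %d of %d index entries\n", heap_fetches, 4);
	return 0;
}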

Thanks,
Michael


[HACKERS] Merge Join with an Index Optimization

2016-09-10 Thread Michael Malis
Hi,

As I understand it, a merge join will currently read all tuples from both
subqueries (besides early termination). I believe it should be possible to
take advantage of the indexes on one or both of the tables being read from
to skip a large number of tuples that would currently be read. As an
example, let's say we are doing an equality merge join between two tables
with the following join-key values:

  Table 1: (3, 4, 5, 9, 10, 11)
  Table 2: (0, 1, 2, 6, 7, 8)

Currently, even though no tuples would be returned from the merge join, all
of the data will be read from both tables. If there are indexes on both
tables, pg should be able to execute the join as follows. Get the tuple
with value 3 from the first table and then look up the first value greater
than or equal to 3 in the second table using the index. In this case, that
value would be 6. Since 3 != 6, pg would then look up the lowest value
greater than or equal to 6 in the first table, which is 9. This process
repeats, and tuples are output whenever the values are equal. This can be
thought of as an alternating nested loop join in which the positions in the
indexes are maintained. In the case where only one of the tables has an
index, this can be thought of as a nested loop join where the inner
relation is the table with the index, and the position in that index is
maintained while iterating over the outer relation (the outer relation
would need to be sorted for this to work).
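
To make the skipping concrete, here is a toy sketch in C of the alternating
behavior, using sorted arrays of join keys in place of btree indexes and
assuming distinct keys (so none of the mark/restore handling a real merge
join needs for duplicates). The seek_ge() helper is made up for the example
and stands in for "find the first index entry >= key, resuming from the
current position":

#include <stdio.h>

/*
 * Find the first entry >= key, resuming from the current position.  A real
 * btree index could answer this by skipping whole pages or re-descending
 * instead of walking entry by entry.
 */
static int
seek_ge(const int *a, int n, int pos, int key)
{
	while (pos < n && a[pos] < key)
		pos++;
	return pos;
}

int
main(void)
{
	int		outer[] = {3, 4, 5, 9, 10, 11};
	int		inner[] = {0, 1, 2, 6, 7, 8};
	int		i = 0;
	int		j = 0;

	while (i < 6 && j < 6)
	{
		if (outer[i] == inner[j])
		{
			printf("match: %d\n", outer[i]);
			i++;
			j++;
		}
		else if (outer[i] < inner[j])
			i = seek_ge(outer, 6, i, inner[j]);	/* skip ahead in table 1 */
		else
			j = seek_ge(inner, 6, j, outer[i]);	/* skip ahead in table 2 */
	}
	return 0;
}

With the data above, three seeks are enough to prove there are no matches,
without walking through most of either table.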

I haven't demonstrated this with benchmarks, but I imagine it would
dramatically speed up merge joins between two tables where the number of
tuples that satisfy the join condition is small. I don't know the Postgres
internals well enough to know if there is anything obvious that would
prevent this optimization.

Thanks,
Michael


Re: [HACKERS] Merge Join with an Index Optimization

2016-10-10 Thread Michael Malis
I discovered that the kind of join I proposed is called the leapfrog
triejoin: https://arxiv.org/pdf/1210.0481v5.pdf


[HACKERS] Highly Variable Planning Times

2017-04-19 Thread Michael Malis
Hi,

I've been encountering highly variable planning time on PG 9.5.4.
Running `EXPLAIN SELECT * FROM events_1171738` will take anywhere from
200ms to 4s. This likely has to do with the table having 1300 partial
indexes on it (for reasons elaborated on in
https://blog.heapanalytics.com/running-10-million-postgresql-indexes-in-production/).
I'm trying to figure out whether there is something I can do to eliminate
the slow planning.

When I used `strace -c` on the backend process, I found that the
number of `semop` system calls is much higher for the slower queries
than the faster ones. After digging in a bit more I found that when
the query was slow, there would be a single semaphore which would make
up most of the `semop` calls. Here is a snippet of the output when I
ran `strace -r` on a query with high planning time:

0.000225 open("base/16384/13236647", O_RDWR) = 219
0.000110 lseek(219, 0, SEEK_END)   = 16384
0.000122 semop(1736711, [{9, -1, 0}], 1) = 0
0.004110 semop(1769480, [{10, 1, 0}], 1) = 0
0.000197 semop(1736711, [{9, -1, 0}], 1) = 0
0.010096 semop(1540097, [{1, 1, 0}], 1) = 0
0.000176 semop(1736711, [{9, -1, 0}], 1) = 0
0.004069 semop(1638404, [{15, 1, 0}], 1) = 0
0.000244 open("base/16384/13236648", O_RDWR) = 220
0.000131 lseek(220, 0, SEEK_END)   = 16384
0.000212 semop(1736711, [{9, -1, 0}], 1) = 0
0.007851 semop(1736711, [{8, 1, 0}], 1) = 0
0.000130 semop(1736711, [{9, -1, 0}], 1) = 0
0.000101 semop(1769480, [{5, 1, 0}], 1) = 0
0.000450 open("base/16384/13236653", O_RDWR) = 221
0.000151 lseek(221, 0, SEEK_END)   = 180224
0.000141 semop(1736711, [{9, -1, 0}], 1) = 0
0.000239 semop(1736711, [{9, -1, 0}], 1) = 0
0.000419 semop(1736711, [{10, 1, 0}], 1) = 0
0.000127 semop(1736711, [{9, -1, 0}], 1) = 0
0.001493 semop(1703942, [{3, 1, 0}], 1) = 0

You can see that most of the semop calls are waits on semaphore #9 in
semaphore set 1736711. I looked up the relation for the files being
opened and they all seem to be indexes on the table. Based on some
flame graphs I generated, I believe most of the time is spent inside
of the function get_relation_info. I think the open/lseek calls are
coming from the estimate_rel_size call in get_relation_info, but I
haven't been able to figure out where the semop calls are coming from.

It seems that the semop calls are at least a symptom of the high
planning time. I'm looking for help in finding the root cause of the
high planning time and in figuring out a way to eliminate it.

Thanks,
Michael


Re: [HACKERS] Highly Variable Planning Times

2017-04-19 Thread Michael Malis
> TBH, I see no convincing explanation in that article of why 1300 partial
> indexes are a good idea.

I don't like it much either. I've been trying to eliminate most of the
need for the partial indexes, but this is the current state of things.

> *At best*, you're doing substantial work in the
> planner to avoid the first tree descent step or two in a single
> non-partial index.

While the specific example I gave in the post could be replaced with a
non-partial index, most of the partial indexes contain predicates that
are not straightforward to express with non-partial indexes. About 85%
of the partial indexes contain a regular expression used for CSS
selector matching. I've tried using a trigram GIN index, but it wound
up not working well because the data is highly redundant (almost every
CSS hierarchy contains 'div' in it). Additionally, most of the partial
index predicates are extremely specific, making the partial indexes
very small. The total size of all of the partial indexes is still much
smaller than it would be if we indexed every necessary field with
regular indexes.

Thanks,
Michael


Re: [HACKERS] Highly Variable Planning Times

2017-04-19 Thread Michael Malis
> The profile seems to confirm that this is largely due to cache misses.

Can you elaborate? Cache misses of what? Why is the planning time so variable?


Re: [HACKERS] Highly Variable Planning Times

2017-04-19 Thread Michael Malis
Thanks for the help, Andres.


[HACKERS] Poor cost estimate with interaction between table correlation and partial indexes

2017-08-26 Thread Michael Malis
Hi.

I'm looking to get started contributing code to Postgres. A small
issue I'm aware of that I think would make a good first contribution
is a poor cost estimate caused by an interaction between table
correlation and partial indexes. Currently the planner assumes that
when an index is perfectly correlated with a table and a range scan is
performed on the index, all of the table page reads performed by the
index scan except for the first one will be sequential reads. While
this assumption is correct for regular indexes, it is not true for
partial indexes.

The assumption holds for regular indexes because the rows corresponding
to two adjacent entries in a regular index that is perfectly correlated
with the table are guaranteed to be next to each other in the table.
With a partial index that is perfectly correlated with the table, on
the other hand, there may be rows in the table in between the rows
corresponding to two adjacent index entries: rows that are not included
in the index because they do not satisfy the partial index predicate.

To make the cost calculation for this case more accurate, I want to
apply the same estimate as the one currently used to estimate the cost
of a bitmap heap scan. The bitmap heap scan cost estimate applies in
this case because both cases involve reading pages from disk ordered
by the location in the table, but where the pages may not be
consecutive.

For the relevant functions, see cost_index and cost_bitmap_heap_scan in
costsize.c.
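
For reference, the interpolation in cost_bitmap_heap_scan that I'm
proposing to reuse looks roughly like the following. This is a standalone
sketch of that logic, not the actual source; the function and variable
names here are made up for the example:

#include <math.h>
#include <stdio.h>

/*
 * Sketch of the per-page interpolation cost_bitmap_heap_scan uses: charge
 * random_page_cost apiece for small reads, sliding toward seq_page_cost as
 * the fraction of the table's pages being read approaches 1.
 */
static double
ordered_read_cost(double pages_fetched, double rel_pages,
				  double seq_page_cost, double random_page_cost)
{
	double		T = (rel_pages > 1.0) ? rel_pages : 1.0;
	double		cost_per_page;

	if (pages_fetched >= 2.0)
		cost_per_page = random_page_cost -
			(random_page_cost - seq_page_cost) * sqrt(pages_fetched / T);
	else
		cost_per_page = random_page_cost;

	return pages_fetched * cost_per_page;
}

int
main(void)
{
	/* e.g. reading 100 of 10000 pages with the default 1.0/4.0 page costs */
	printf("%.1f\n", ordered_read_cost(100.0, 10000.0, 1.0, 4.0));
	return 0;
}

With the default page costs, reading 100 of 10,000 pages is charged 3.7
per page (370.0 total) instead of 4.0 per page for fully random reads.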

Thanks,
Michael


[HACKERS] Re: Poor cost estimate with interaction between table correlation and partial indexes

2017-08-26 Thread Michael Malis
Do you think this is a reasonable approach? Should I start working on
a patch based on the solution I described or is there some other
approach I should look into?


Re: [HACKERS] Re: Poor cost estimate with interaction between table correlation and partial indexes

2017-08-27 Thread Michael Malis
(Sorry David. I initially replied only to you)

Ok. I've attached a proof-of-concept patch. I have a few questions
about tests.

What is the typical workflow for adding tests for planner changes? Also,
I ran `make check` and it appears one of the existing tests is failing.
What is the typical way to investigate why the query plan for a specific
query changed? And how should I update the old test: should I replace the
old expected output with the new output, or modify the test slightly so
that it still exercises the same case as before?

Thanks,
Michael
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 051a854..58224e6 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
***
*** 163,169  static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
  static double relation_byte_size(double tuples, int width);
  static double page_size(double tuples, int width);
  static double get_parallel_divisor(Path *path);
! 
  
  /*
   * clamp_row_est
--- 163,172 
  static double relation_byte_size(double tuples, int width);
  static double page_size(double tuples, int width);
  static double get_parallel_divisor(Path *path);
! static double ordered_page_read_cost(double pages_fetched,
! 	   int baserel_pages,
! 	   double spc_seq_page_cost,
! 	   double spc_random_page_cost);
  
  /*
   * clamp_row_est
***
*** 652,660  cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
  
  		if (pages_fetched > 0)
  		{
! 			min_IO_cost = spc_random_page_cost;
! 			if (pages_fetched > 1)
! min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
  		}
  		else
  			min_IO_cost = 0;
--- 655,680 
  
  		if (pages_fetched > 0)
  		{
! 			if (index->indpred == NIL)
! 			{
! min_IO_cost = spc_random_page_cost;
! if (pages_fetched > 1)
! 	min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
! 			}
! 			else
! 			{
! /*
!  * For a partial index perfectly correlated with the table
!  * ordering, consecutive pages fetched are not guarenteed to
!  * be adjacent in the table. Instead use
!  * ordered_page_read_cost to estimate the cost of reading
!  * pages from the heap.
!  */
! min_IO_cost = ordered_page_read_cost(pages_fetched,
! 	 baserel->pages,
! 	 spc_seq_page_cost,
! 	 spc_seq_page_cost);
! 			}
  		}
  		else
  			min_IO_cost = 0;
***
*** 934,946  cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
  	Cost		indexTotalCost;
  	QualCost	qpqual_cost;
  	Cost		cpu_per_tuple;
- 	Cost		cost_per_page;
  	Cost		cpu_run_cost;
  	double		tuples_fetched;
  	double		pages_fetched;
  	double		spc_seq_page_cost,
  spc_random_page_cost;
- 	double		T;
  
  	/* Should only be applied to base relations */
  	Assert(IsA(baserel, RelOptInfo));
--- 954,964 


Re: [HACKERS] Re: Poor cost estimate with interaction between table correlation and partial indexes

2017-08-27 Thread Michael Malis
Hmm... It seems the command I used to generate the patch, which I got from
https://wiki.postgresql.org/wiki/Working_with_Git, truncated part of the
patch. I've attached the file generated by `git diff --patience master
improve-partial-index-correlation-calculation --no-color >
improve-correlated-partial-index-cost-v2.patch` to this email. What is the
correct command for generating a context diff?
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 051a854..58224e6 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -163,7 +163,10 @@ static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
 static double relation_byte_size(double tuples, int width);
 static double page_size(double tuples, int width);
 static double get_parallel_divisor(Path *path);
-
+static double ordered_page_read_cost(double pages_fetched,
+									 int baserel_pages,
+									 double spc_seq_page_cost,
+									 double spc_random_page_cost);
 
 /*
  * clamp_row_est
@@ -652,9 +655,26 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
 
 		if (pages_fetched > 0)
 		{
-			min_IO_cost = spc_random_page_cost;
-			if (pages_fetched > 1)
-				min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
+			if (index->indpred == NIL)
+			{
+				min_IO_cost = spc_random_page_cost;
+				if (pages_fetched > 1)
+					min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
+			}
+			else
+			{
+				/*
+				 * For a partial index perfectly correlated with the table
+				 * ordering, consecutive pages fetched are not guaranteed to
+				 * be adjacent in the table.  Instead use
+				 * ordered_page_read_cost to estimate the cost of reading
+				 * pages from the heap.
+				 */
+				min_IO_cost = ordered_page_read_cost(pages_fetched,
+													 baserel->pages,
+													 spc_seq_page_cost,
+													 spc_random_page_cost);
+			}
 		}
 		else
 			min_IO_cost = 0;
@@ -934,13 +954,11 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 	Cost		indexTotalCost;
 	QualCost	qpqual_cost;
 	Cost		cpu_per_tuple;
-	Cost		cost_per_page;
 	Cost		cpu_run_cost;
 	double		tuples_fetched;
 	double		pages_fetched;
 	double		spc_seq_page_cost,
 spc_random_page_cost;
-	double		T;
 
 	/* Should only be applied to base relations */
 	Assert(IsA(baserel, RelOptInfo));
@@ -961,28 +979,16 @@ cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 										 &tuples_fetched);
 
 	startup_cost += indexTotalCost;
-	T = (baserel->pages > 1) ? (double) baserel->pages : 1.0;
 
 	/* Fetch estimated page costs for tablespace containing table. */
 	get_tablespace_page_costs(baserel->reltablespace,
 							  &spc_random_page_cost,
 							  &spc_seq_page_cost);
 
-	/*
-	 * For small numbers of pages we should charge spc_random_page_cost
-	 * apiece, while if nearly all the table's pages are being read, it's more
-	 * appropriate to charge spc_seq_page_cost apiece.  The effect is
-	 * nonlinear, too. For lack of a better idea, interpolate like this to
-	 * determine the cost per page.
-	 */
-	if (pages_fetched >= 2.0)
-		cost_per_page = spc_random_page_cost -
-			(spc_random_page_cost - spc_seq_page_cost)
-			* sqrt(pages_fetched / T);
-	else
-		cost_per_page = spc_random_page_cost;
-
-	run_cost += pages_fetched * cost_per_page;
+	run_cost += ordered_page_read_cost(pages_fetched,
+	   baserel->pages,
+	   spc_seq_page_cost,
+	   spc_random_page_cost);
 
 	/*
 	 * Estimate CPU costs per tuple.
@@ -5183,3 +5189,34 @@ compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel, Path *bitmapqual,
 
 	return pages_fetched;
 }
+
+
+/*
+ * Estimate the total cost of reading possibly nonconsecutive pages from the
+ * heap in order.
+ */
+static double
+ordered_page_read_cost(double pages_fetched, int baserel_pages,
+	   double spc_seq_page_cost, double spc_random_page_cost)
+{
+	double cost_per_page;
+	double T;
+
+	T = (baserel_pages > 1) ? (double) baserel_pages : 1.0;
+
+	/*
+	 * For small numbers of pages we should charge spc_random_page_cost
+	 * apiece, while if nearly all the table's pages are being read, it's more
+	 * appropriate to charge spc_seq_page_cost apiece.  The effect is
+	 * nonlinear, too. For lack of a better idea, interpolate like this to
+	 * determine the cost per page.
+	 */
+	if (pages_fetched >= 2.0)
+		cost_per_page = spc_random_page_cost -
+			(spc_random_page_cost - spc_seq_page_cost)
+			* sqrt(pages_fetched / T);
+	else
+		cost_per_page = spc_random_page_cost;
+
+	return pages_fetched * cost_per_page;
+}
