Re: [PERFORM] Millions of tables

2016-09-30 Thread Jim Nasby

On 9/29/16 6:11 AM, Alex Ignatov (postgrespro) wrote:

With millions of tables you have to set autovacuum_max_workers
sky-high =). We have seen situations where, with thousands of tables,
autovacuum can't vacuum all the tables that need it. It simply vacuums some
of the most-modified tables and never reaches the others. Only a manual
vacuum can help in this situation. Combined with the wraparound issue it can
be a nightmare.


Specifically, autovac isn't going to start worrying about anti-wraparound 
vacuums until tables start hitting autovacuum_freeze_max_age (or 
autovacuum_multixact_freeze_max_age). Any tables that hit that threshold 
go to the front of the line for being vacuumed. (But keep in mind that 
there is no universal line, just what each worker computes on its own 
when it starts.)
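
For example, a quick way to see which tables (including toast tables and 
materialized views) are closest to that threshold is a catalog query like 
the following (just a sketch; compare the ages against the value shown by 
SHOW autovacuum_freeze_max_age):

SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 't', 'm')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;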


Where things will completely fall apart for you is if a lot of tables 
all have roughly the same relfrozenxid (or relminmxid), like they would 
immediately after a large load. In that scenario you'll suddenly have 
loads of work for autovac to do, all at the same time. That will make 
the database, DBAs and you Very Unhappy (tm).


Somehow, some way, you *must* do a vacuum of the entire database. 
Luckily the freeze map in 9.6 means you'd only have to do that one time 
(assuming the data really is static). In any older version, (auto)vacuum 
will need to eventually *read everything in every table* at least once 
every ~2B transactions.
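
For that one-time pass, a database-wide freeze can be kicked off with plain 
SQL (a minimal sketch; on 9.5+ the vacuumdb utility's --freeze and --jobs 
options can run the same work against several tables in parallel):

-- With no table name, VACUUM processes every table in the current
-- database that the current user is allowed to vacuum.
VACUUM (FREEZE, VERBOSE);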


The only impact the number of tables is going to have on this is 
granularity. If you have a small number of large tables, you'll have 
(auto)vacuum processes that will need to run *uninterrupted* for a long 
time to move the freeze threshold on each table. If you have tons of 
small tables, you'll need tons of separate (auto)vacuums, but each one 
will run for a shorter interval, and if one gets interrupted it won't be 
as big a deal.
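
With the many-small-tables case, one way to work through the backlog in 
manageable batches is to generate the commands from the catalog, oldest 
relfrozenxid first (a sketch; the batch size of 1000 is arbitrary, and in 
9.6's psql the generated statements can be run directly with \gexec):

SELECT format('VACUUM (FREEZE) %s', c.oid::regclass) AS cmd
FROM pg_class c
WHERE c.relkind = 'r'          -- ordinary tables only
ORDER BY age(c.relfrozenxid) DESC
LIMIT 1000;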


There is one potentially significant difference between autovac and 
manual vacuums here: autovac treats toast tables as just another table, 
with their own stats and their own freeze needs. If you're generating a 
lot of toast records, that might make a difference.
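
If you want to keep an eye on the toast side separately from the main 
tables, something like this works (again just a sketch):

SELECT c.oid::regclass AS main_table,
       t.oid::regclass AS toast_table,
       age(t.relfrozenxid) AS toast_xid_age
FROM pg_class c
JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.relkind = 'r'
ORDER BY age(t.relfrozenxid) DESC
LIMIT 20;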


When it comes to vacuum, you might find 
https://www.pgcon.org/2015/schedule/events/829.en.html useful.


On a different topic... I didn't see anything in the thread about what 
you're storing, but with the row counts you're talking about I'm 
guessing it's something that's time-series. 
https://github.com/ElephantStack/ElephantStack is a project exploring 
the idea of using Postgres array types as a far more efficient way to 
store that kind of data: instead of an overhead of 24 bytes per row 
(plus indexes), arrays give you essentially zero overhead per row. 
There's no code yet, but a few of us have done testing on some real-world 
data (see the Google group referenced from the README).
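
As a rough illustration of the array idea (a hypothetical sensor-data 
schema, not anything ElephantStack has settled on):

-- One row per sensor per hour instead of one row per reading; the
-- 24-byte tuple header is paid once per bucket rather than once per sample.
CREATE TABLE readings_hourly (
    sensor_id   int           NOT NULL,
    bucket_ts   timestamptz   NOT NULL,  -- start of the hour
    sample_ts   timestamptz[] NOT NULL,  -- timestamps within the bucket
    sample_val  float8[]      NOT NULL,  -- values, parallel to sample_ts
    PRIMARY KEY (sensor_id, bucket_ts)
);

-- Expanding back to one row per sample at query time (multi-argument
-- unnest in FROM is available in 9.4+):
SELECT r.sensor_id, u.ts, u.val
FROM readings_hourly r,
     unnest(r.sample_ts, r.sample_val) AS u(ts, val)
WHERE r.sensor_id = 42
  AND r.bucket_ts >= '2016-09-01'
  AND r.bucket_ts <  '2016-10-01';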

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [PERFORM] Unexpected expensive index scan

2016-09-30 Thread Jim Nasby

On 9/28/16 1:11 PM, Jake Nielsen wrote:

Beautiful! After changing the random_page_cost to 1.0 the original query
went from ~3.5s to ~35ms. This is exactly the kind of insight I was
fishing for in the original post. I'll keep in mind that the query
planner is very tunable and has these sorts of hardware-related
trade-offs in the future. I can't thank you enough!


Be careful with setting random_page_cost to exactly 1... that tells the 
planner that a random page fetch costs the same as a sequential one, 
which is essentially never the case, even with the database entirely in 
memory. 1.1 or maybe even 1.01 is probably a safer bet.


Also note that you can set those parameters within a single session, as 
well as within a single transaction. So if you need to force a different 
setting for just one query, you could always do:

BEGIN;
SET LOCAL random_page_cost = 1;
SELECT ...
COMMIT; -- or ROLLBACK; either way the SET LOCAL ends with the transaction
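
If a lower value turns out to be right for the whole workload, it can also 
be persisted per database or per role rather than per session; a sketch 
(the database and role names here are placeholders):

ALTER DATABASE mydb SET random_page_cost = 1.1;
ALTER ROLE reporting_user SET random_page_cost = 1.1;
-- new sessions pick it up; verify with:
SHOW random_page_cost;
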
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [PERFORM] MYSQL Stats

2016-09-30 Thread Joe Proietti
My apologies, I was in the wrong email/forum; please disregard my email!


[PERFORM] MYSQL Stats

2016-09-30 Thread Joe Proietti
Hi,
I am relatively new to MySQL and not really sure I am in the right forum for 
this.

I have a situation which I am not understanding.  I am performing a simple 
query :

Select * from tableA
Where date >= ‘2016-06-01’
And date < ‘2016-07-01’

Index is on date
Query returns 6271 rows

When doing EXPLAIN on the same query, the rows column shows 11462, nearly 
twice the actual amount (this result is consistent across most tables).

When selecting a count from the table, it returns 2668664.

When selecting from information_schema.tables, the table_rows column shows 2459114.

While this looks indicative of outdated statistics, I have run ANALYZE TABLE 
but nothing changes.

Thanks,
Joe

From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Jake Nielsen
Sent: Wednesday, September 28, 2016 2:11 PM
To: Tom Lane 
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Unexpected expensive index scan



On Wed, Sep 28, 2016 at 6:04 AM, Tom Lane wrote:
[ Please don't re-quote the entire damn thread in each followup. Have
some respect for your readers' time, and assume that they have already
seen the previous traffic, or could go look it up if they haven't.
The point of quoting at all is just to quickly remind people where we
are in the discussion. ]

Sorry, understood.


If you say "well yeah, but it seems to perform fine when I force
it to use that index anyway", the answer may be that you need to
adjust random_page_cost.  The default value is OK for tables that
are mostly sitting on spinning rust, but if your database is
RAM-resident or SSD-resident you probably want a value closer to 1.

Ahhh, this could absolutely be the key right here. I could totally see why it 
would make sense for the planner to do what it's doing given that it's 
weighting sequential access more favorably than random access.

Beautiful! After changing the random_page_cost to 1.0 the original query went 
from ~3.5s to ~35ms. This is exactly the kind of insight I was fishing for in 
the original post. I'll keep in mind that the query planner is very tunable and 
has these sorts of hardware-related trade-offs in the future. I can't thank you 
enough!

Cheers!



Re: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

2016-09-30 Thread Sven R. Kunze
Now I found time to investigate all proposed queries side by side. Here 
are the results (warmup + multiple executions). TL;DR - Jeff's proposed 
answer performs significantly faster with our data than any other 
solution (both planning and execution time).



I have no real idea how PostgreSQL does query rewriting but I guess the 
following steps (and reverse ones) are necessary:


1) detect "DISTINCT+LEFT OUTER JOIN" and rewrite to "SUBQUERY"

2) detect "MUTUAL JOIN ON KEY + OR" and rewrite to "UNION"

3) detect "MUTUAL IN KEY+ OR" and rewrite to "UNION"

4) detect "UNION + MUTUAL JOIN ON KEY" and rewrite to "SUBQUERY + UNION"


Doing (1+2) or (3+4) would result in the optimal query. Of these, (1+2) 
seems easier to do, although a "common SELECT lift-up"/"UNION push-down" 
(if that's even the correct name) would also be great to have (that's 4). 
Is this roughly correct?



Regarding cost estimation: it seems like PostgreSQL is clever enough 
here. So, I tend to agree with Jeff that this is not an issue with cost 
estimation.



 DISTINCT + LEFT OUTER JOIN


explain analyze
SELECT distinct 
FROM "big_table"
LEFT OUTER JOIN "table_a" ON ("big_table"."id" = 
"table_a"."big_table_id")
LEFT OUTER JOIN "table_b" ON ("big_table"."id" = 
"table_b"."big_table_id")

WHERE
("table_a"."item_id" IN ()
OR
"table_b"."item_id" IN ());



 HashAggregate  (cost=206268.67..206269.46 rows=79 width=185) (actual 
time=904.859..904.860 rows=5 loops=1)

   Group Key: 
   ->  Merge Left Join  (cost=1.26..206265.11 rows=79 width=185) 
(actual time=904.835..904.846 rows=6 loops=1)

 Merge Cond: (big_table.id = table_a.big_table_id)
 Filter: (((table_a.item_id)::text = ANY ('<items>'::text[])) OR ((table_b.item_id)::text = ANY ('<items>'::text[])))

 Rows Removed by Filter: 901355
 ->  Merge Left Join  (cost=0.85..196703.22 rows=858293 
width=243) (actual time=0.009..745.736 rows=858690 loops=1)

   Merge Cond: (big_table.id = table_b.big_table_id)
   ->  Index Scan using big_table_pkey on big_table  
(cost=0.42..180776.64 rows=858293 width=185) (actual time=0.005..399.102 
rows=858690 loops=1)
   ->  Index Scan using table_b_pkey on table_b  
(cost=0.42..10343.86 rows=274959 width=62) (actual time=0.003..60.961 
rows=274958 loops=1)
 ->  Index Scan using table_a_big_table_id on table_a  
(cost=0.42..4445.35 rows=118836 width=57) (actual time=0.003..25.456 
rows=118833 loops=1)

 Planning time: 0.934 ms
 Execution time: 904.936 ms




 SUBQUERY

explain analyze
SELECT 
FROM "big_table"
WHERE
"big_table"."id" in (SELECT "table_a"."big_table_id" FROM "table_a" 
WHERE "table_a"."item_id" in ())

OR
"big_table"."id" in (SELECT "table_b"."big_table_id" FROM "table_b" 
WHERE "table_b"."item_id" IN ());




 Seq Scan on big_table  (cost=100.41..115110.80 rows=643720 width=185) 
(actual time=229.819..229.825 rows=5 loops=1)

   Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
   Rows Removed by Filter: 858685
   SubPlan 1
 ->  Index Scan using table_a_item_id_211f18d89c25bc21_uniq on 
table_a (cost=0.42..58.22 rows=9 width=4) (actual time=0.026..0.043 
rows=5 loops=1)
   Index Cond: ((item_id)::text = ANY ('<items>'::text[]))

   SubPlan 2
 ->  Index Scan using table_b_item_id_611f9f519d835e89_uniq on 
table_b (cost=0.42..42.15 rows=5 width=4) (actual time=0.007..0.040 
rows=5 loops=1)
   Index Cond: ((item_id)::text = ANY ('<items>'::text[]))

 Planning time: 0.261 ms
 Execution time: 229.901 ms



 UNION

explain analyze
SELECT 
FROM "big_table"
INNER JOIN "table_a" ON ("big_table"."id" = "table_a"."big_table_id")
WHERE
"table_a"."item_id" IN ()
UNION
SELECT 
FROM "big_table"
INNER JOIN "table_b" ON ("big_table"."id" = "table_b"."big_table_id")
WHERE
"table_b"."item_id" IN ();

 HashAggregate  (cost=216.84..216.98 rows=14 width=185) (actual 
time=0.092..0.093 rows=5 loops=1)

   Group Key: 
   ->  Append  (cost=22.59..216.21 rows=14 width=185) (actual 
time=0.035..0.080 rows=10 loops=1)
 ->  Nested Loop  (cost=22.59..132.17 rows=9 width=185) (actual 
time=0.034..0.044 rows=5 loops=1)
   ->  Bitmap Heap Scan on table_a (cost=22.16..56.10 
rows=9 width=4) (actual time=0.029..0.029 rows=5 loops=1)
 Recheck Cond: ((item_id)::text = ANY ('<items>'::text[]))

 Heap Blocks: exact=1
 ->  Bitmap Index Scan on 
table_a_item_id_211f18d89c25bc21_uniq  (cost=0.00..22.16 rows=9 width=0) 
(actual time=0.027..0.027 rows=5 loops=1)
   Index Cond: ((item_id)::text = ANY 
('<items>'::text[]))
   ->  Index Scan using big_table_pkey on big_table  
(cost=0.42..8.44 rows=1 width=185) (actual time=0.002..0.002 rows=1 loops=5)

 Index Cond: (id = table_a.big_table_id)
 ->  Nested Loop  (cost=22.58..83.90 rows=5 width=185) (actual 
time=0.029..0.035 rows=5 loops=1)
  

Re: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

2016-09-30 Thread Sven R. Kunze

On 29.09.2016 22:26, Jeff Janes wrote:
Well, I don't recall seeing this issue on this list before (or on a few 
other forums I read), while I see several other issues over and over 
again.  So that is why I think it is a niche issue.  Perhaps I have 
seen it before and just forgotten, or have not recognized it as being 
the same issue each time.


Understood.



This multitude of solutions also shows that application developers
might be overwhelmed when choosing the most appropriate AND most
long-lasting one. Because what I take from the discussion is that
a UNION might be appropriate right now, but that could change in
the future even for the very same use case at hand.


I'm not sure what would cause it to change.  Do you mean if you 
suddenly start selecting a much larger portion of the table?  I don't 
know that the union would be particularly bad in that case, either.


Not suddenly but gradually. Data can change and we don't know for sure 
how people will use our systems in the future. Hence, another plan might 
become more optimal, or even a seq scan on big_table might be faster.


In the case at hand, I doubt it but you never know.

I'm not saying it wouldn't be nice to fix it.  I just don't think it 
is particularly likely to happen soon.  I could be wrong (especially 
if you can write the code to make it happen).


I have been thinking about this. It would be an interesting exercise, as 
I haven't written much C in the last decade, but sometimes one needs 
to get out of the comfort zone to get things going.



Sven