Re: [PERFORM] Odd behavior with indices

2016-03-04 Thread Tom Lane
Merlin Moncure  writes:
> On Mon, Feb 29, 2016 at 12:47 PM, Tom Lane  wrote:
>> FWIW, PG >= 9.5 will ignore a LIMIT 1 inside an EXISTS, so that you get
>> the same plan with or without it.  But that does act as an optimization
>> fence in earlier releases.

> Does 'offset 0' still work as it did?

Yes.

regards, tom lane


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Odd behavior with indices

2016-03-04 Thread Merlin Moncure
On Mon, Feb 29, 2016 at 12:47 PM, Tom Lane  wrote:
> FWIW, PG >= 9.5 will ignore a LIMIT 1 inside an EXISTS, so that you get
> the same plan with or without it.  But that does act as an optimization
> fence in earlier releases.

Does 'offset 0' still work as it did?

merlin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Odd behavior with indices

2016-02-29 Thread Tom Lane
Matheus de Oliveira  writes:
> Em 26 de fev de 2016 4:44 PM, "joe meiring" 
> escreveu:
>> The same query for parameters is rather slow and does NOT use the index:
>> 
>> EXPLAIN ANALYZE
>> select *
>> from parameter
>> where exists (
>> select 1 from datavalue
>> where datavalue.parameter_id = parameter.id limit 1
>> );

> Please, could you execute both queries without the LIMIT 1 and show us the
> plans?

> LIMIT in the inner query is like a fence and it caps some optimizations
> available for EXISTS, you'd better avoid it and see if you get a proper
> semi-join plan then.

FWIW, PG >= 9.5 will ignore a LIMIT 1 inside an EXISTS, so that you get
the same plan with or without it.  But that does act as an optimization
fence in earlier releases.

regards, tom lane


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Odd behavior with indices

2016-02-28 Thread Matheus de Oliveira
Em 26 de fev de 2016 4:44 PM, "joe meiring" 
escreveu:
>
> The same query for parameters is rather slow and does NOT use the index:
>
> EXPLAIN ANALYZE
> select *
> from parameter
> where exists (
>   select 1 from datavalue
>   where datavalue.parameter_id = parameter.id limit 1
> );
>

Please, could you execute both queries without the LIMIT 1 and show us the
plans?

LIMIT in the inner query is like a fence and it caps some optimizations
available for EXISTS, you'd better avoid it and see if you get a proper
semi-join plan then.

Regards.


Re: [PERFORM] Odd behavior with indices

2016-02-26 Thread David G. Johnston
On Fri, Feb 26, 2016 at 1:38 PM, joe meiring 
wrote:

> Here's the distribution of parameter_id's
>
> select count(parameter_id), parameter_id from datavalue group by parameter_id
> 88169   142889171 815805   178570124257262 213947049 151225902 24091090 
> 3103877  10633764  11994442  1849232   2014935   4563638  132955919 7
>
>
​Ok...again its beyond my present experience ​but its what the planner
thinks about the distribution, and not what actually is present, that
matters.

David J.


Re: [PERFORM] Odd behavior with indices

2016-02-26 Thread joe meiring
Here's the distribution of parameter_id's

select count(parameter_id), parameter_id from datavalue group by parameter_id
88169   142889171 815805   178570124257262 213947049 151225902
24091090 3103877  10633764  11994442  1849232   2014935   4563638
132955919 7


On Fri, Feb 26, 2016 at 2:02 PM, David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Fri, Feb 26, 2016 at 12:43 PM, joe meiring 
> wrote:
>
>> Also available on S.O.:
>>
>>
>> http://stackoverflow.com/questions/35658238/postgres-odd-behavior-with-indices
>>
>> I've got a datavalue table with ~200M rows or so, with indices on both
>> site_id and parameter_id. I need to execute queries like "return all
>> sites with data" and "return all parameters with data". The site table
>> has only 200 rows or so, and the parameter table has only 100 or so rows.
>>
>> The site query is fast and uses the index:
>>
>> EXPLAIN ANALYZEselect *from sitewhere exists (
>>   select 1 from datavalue
>>   where datavalue.site_id = site.id limit 1);
>>
>> Seq Scan on site  (cost=0.00..64.47 rows=64 width=113) (actual 
>> time=0.046..1.106 rows=89 loops=1)
>>   Filter: (SubPlan 1)
>>   Rows Removed by Filter: 39
>>   SubPlan 1
>> ->  Limit  (cost=0.44..0.47 rows=1 width=0) (actual time=0.008..0.008 
>> rows=1 loops=128)
>>   ->  Index Only Scan using ix_datavalue_site_id on datavalue  
>> (cost=0.44..8142.71 rows=248930 width=0) (actual time=0.008..0.008 rows=1 
>> loops=128)
>> Index Cond: (site_id = site.id)
>> Heap Fetches: 0
>> Planning time: 0.361 ms
>> Execution time: 1.149 ms
>>
>> The same query for parameters is rather slow and does NOT use the index:
>>
>> EXPLAIN ANALYZEselect *from parameterwhere exists (
>>   select 1 from datavalue
>>   where datavalue.parameter_id = parameter.id limit 1);
>>
>> Seq Scan on parameter  (cost=0.00..20.50 rows=15 width=2648) (actual 
>> time=2895.972..21331.701 rows=15 loops=1)
>>   Filter: (SubPlan 1)
>>   Rows Removed by Filter: 6
>>   SubPlan 1
>> ->  Limit  (cost=0.00..0.34 rows=1 width=0) (actual 
>> time=1015.790..1015.790 rows=1 loops=21)
>>   ->  Seq Scan on datavalue  (cost=0.00..502127.10 rows=1476987 
>> width=0) (actual time=1015.786..1015.786 rows=1 loops=21)
>> Filter: (parameter_id = parameter.id)
>> Rows Removed by Filter: 7739355
>> Planning time: 0.123 ms
>> Execution time: 21331.736 ms
>>
>> What the deuce is going on here? Alternatively, whats a good way to do
>> this?
>>
>> Any help/guidance appreciated!
>>
>>
>>
>> Some of the table description:
>>
>> \d datavalue
>>
>> id BIGINT DEFAULT nextval('datavalue_id_seq'::regclass) NOT NULL,
>> value DOUBLE PRECISION NOT NULL,
>> site_id INTEGER NOT NULL,
>> parameter_id INTEGER NOT NULL,
>> deployment_id INTEGER,
>> instrument_id INTEGER,
>> invalid BOOLEAN,
>> Indexes:
>> "datavalue_pkey" PRIMARY KEY, btree (id)
>> "datavalue_datetime_utc_site_id_parameter_id_instrument_id_key" UNIQUE 
>> CONSTRAINT, btree (datetime_utc, site_id, parameter_id, instrument_id)
>> "ix_datavalue_instrument_id" btree (instrument_id)
>> "ix_datavalue_parameter_id" btree (parameter_id)
>> "ix_datavalue_site_id" btree (site_id)
>> "tmp_idx" btree (site_id, datetime_utc)
>> Foreign-key constraints:
>> "datavalue_instrument_id_fkey" FOREIGN KEY (instrument_id) REFERENCES 
>> instrument(id) ON UPDATE CASCADE ON DELETE CASCADE
>> "datavalue_parameter_id_fkey" FOREIGN KEY (parameter_id) REFERENCES 
>> parameter(id) ON UPDATE CASCADE ON DELETE CASCADE
>> "datavalue_site_id_fkey" FOREIGN KEY (site_id) REFERENCES 
>> coastal.site(id) ON UPDATE CASCADE ON DELETE CASCADE
>> "datavalue_statistic_type_id_fkey"
>>
>>
>> ​I'm not great with the details but the short answer - aside from the
> fact that you should consider increasing the statistics on these columns -
> is that at a certain point querying the index and then subsequently
> checking the table for visibility is more expensive than simply scanning
> and then discarding ​the extra rows.
>
> The fact that you could perform an INDEX ONLY scan in the first query
> makes that cost go away since no subsequent heap check is required.  In the
> parameters query the planner thinks it needs 1.5 million of the rows and
> will have to check each of them for visibility.  It decided that scanning
> the entire table was more efficient.
>
> The LIMIT 1 in both queries should not be necessary.  The planner is smart
> enough to stop once it finds what it is looking for.  In fact the LIMIT's
> presence may be a contributing factor...but I cannot say for sure.
>
> A better query seems like it would be:
>
> WITH active_sites AS (
> SELECT DISTINCT site_id FROM datavalues;
> )
> SELECT *
> FROM sites
> JOIN active_sites USING (site_id);
>
> David J.
>


Re: [PERFORM] Odd behavior with indices

2016-02-26 Thread David G. Johnston
On Fri, Feb 26, 2016 at 12:43 PM, joe meiring 
wrote:

> Also available on S.O.:
>
>
> http://stackoverflow.com/questions/35658238/postgres-odd-behavior-with-indices
>
> I've got a datavalue table with ~200M rows or so, with indices on both
> site_id and parameter_id. I need to execute queries like "return all
> sites with data" and "return all parameters with data". The site table
> has only 200 rows or so, and the parameter table has only 100 or so rows.
>
> The site query is fast and uses the index:
>
> EXPLAIN ANALYZEselect *from sitewhere exists (
>   select 1 from datavalue
>   where datavalue.site_id = site.id limit 1);
>
> Seq Scan on site  (cost=0.00..64.47 rows=64 width=113) (actual 
> time=0.046..1.106 rows=89 loops=1)
>   Filter: (SubPlan 1)
>   Rows Removed by Filter: 39
>   SubPlan 1
> ->  Limit  (cost=0.44..0.47 rows=1 width=0) (actual time=0.008..0.008 
> rows=1 loops=128)
>   ->  Index Only Scan using ix_datavalue_site_id on datavalue  
> (cost=0.44..8142.71 rows=248930 width=0) (actual time=0.008..0.008 rows=1 
> loops=128)
> Index Cond: (site_id = site.id)
> Heap Fetches: 0
> Planning time: 0.361 ms
> Execution time: 1.149 ms
>
> The same query for parameters is rather slow and does NOT use the index:
>
> EXPLAIN ANALYZEselect *from parameterwhere exists (
>   select 1 from datavalue
>   where datavalue.parameter_id = parameter.id limit 1);
>
> Seq Scan on parameter  (cost=0.00..20.50 rows=15 width=2648) (actual 
> time=2895.972..21331.701 rows=15 loops=1)
>   Filter: (SubPlan 1)
>   Rows Removed by Filter: 6
>   SubPlan 1
> ->  Limit  (cost=0.00..0.34 rows=1 width=0) (actual 
> time=1015.790..1015.790 rows=1 loops=21)
>   ->  Seq Scan on datavalue  (cost=0.00..502127.10 rows=1476987 
> width=0) (actual time=1015.786..1015.786 rows=1 loops=21)
> Filter: (parameter_id = parameter.id)
> Rows Removed by Filter: 7739355
> Planning time: 0.123 ms
> Execution time: 21331.736 ms
>
> What the deuce is going on here? Alternatively, whats a good way to do
> this?
>
> Any help/guidance appreciated!
>
>
>
> Some of the table description:
>
> \d datavalue
>
> id BIGINT DEFAULT nextval('datavalue_id_seq'::regclass) NOT NULL,
> value DOUBLE PRECISION NOT NULL,
> site_id INTEGER NOT NULL,
> parameter_id INTEGER NOT NULL,
> deployment_id INTEGER,
> instrument_id INTEGER,
> invalid BOOLEAN,
> Indexes:
> "datavalue_pkey" PRIMARY KEY, btree (id)
> "datavalue_datetime_utc_site_id_parameter_id_instrument_id_key" UNIQUE 
> CONSTRAINT, btree (datetime_utc, site_id, parameter_id, instrument_id)
> "ix_datavalue_instrument_id" btree (instrument_id)
> "ix_datavalue_parameter_id" btree (parameter_id)
> "ix_datavalue_site_id" btree (site_id)
> "tmp_idx" btree (site_id, datetime_utc)
> Foreign-key constraints:
> "datavalue_instrument_id_fkey" FOREIGN KEY (instrument_id) REFERENCES 
> instrument(id) ON UPDATE CASCADE ON DELETE CASCADE
> "datavalue_parameter_id_fkey" FOREIGN KEY (parameter_id) REFERENCES 
> parameter(id) ON UPDATE CASCADE ON DELETE CASCADE
> "datavalue_site_id_fkey" FOREIGN KEY (site_id) REFERENCES 
> coastal.site(id) ON UPDATE CASCADE ON DELETE CASCADE
> "datavalue_statistic_type_id_fkey"
>
>
> ​I'm not great with the details but the short answer - aside from the fact
that you should consider increasing the statistics on these columns - is
that at a certain point querying the index and then subsequently checking
the table for visibility is more expensive than simply scanning and then
discarding ​the extra rows.

The fact that you could perform an INDEX ONLY scan in the first query makes
that cost go away since no subsequent heap check is required.  In the
parameters query the planner thinks it needs 1.5 million of the rows and
will have to check each of them for visibility.  It decided that scanning
the entire table was more efficient.

The LIMIT 1 in both queries should not be necessary.  The planner is smart
enough to stop once it finds what it is looking for.  In fact the LIMIT's
presence may be a contributing factor...but I cannot say for sure.

A better query seems like it would be:

WITH active_sites AS (
SELECT DISTINCT site_id FROM datavalues;
)
SELECT *
FROM sites
JOIN active_sites USING (site_id);

David J.