Re: Faster distinct query?

2021-09-24 Thread Israel Brewster
> On Sep 23, 2021, at 8:55 PM, Michael Lewis  wrote:
> 
> It’s only when I add in the AND data.channels=channels.channel that the query 
> time blows up to 5+ minutes. I personally don’t understand why there would be 
> such a large difference between the two queries - something wrong with my 
> indexes?
> 
> Did you do the manual vacuum as suggested by Tom?

I ran a VACUUM ANALYZE, yes.

> Maybe at least check pg_stat_user_tables for the last vacuum or autovacuum 
> timestamp, and how many records have changed since then.

volcano_seismology=# SELECT 
seq_scan,seq_tup_read,idx_scan,idx_tup_fetch,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze,n_ins_since_vacuum,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze,vacuum_count,autovacuum_count,analyze_count,autoanalyze_count
 FROM pg_stat_user_tables WHERE relname='data';
-[ RECORD 1 ]-------+------------------------------
seq_scan            | 95
seq_tup_read        | 25899340540
idx_scan            | 728372
idx_tup_fetch       | 51600217033
n_tup_ins           | 840283699
n_tup_upd           | 66120702
n_tup_del           | 2375651
n_tup_hot_upd       | 0
n_live_tup          | 839266956
n_dead_tup          | 66585751
n_mod_since_analyze | 58896052
n_ins_since_vacuum  | 24890460
last_vacuum         | 2021-09-22 21:32:11.367855+00
last_autovacuum     | 2021-09-14 07:13:23.745862+00
last_analyze        | 2021-09-22 21:32:21.071092+00
last_autoanalyze    | 2021-09-21 11:54:36.924762+00
vacuum_count        | 1
autovacuum_count    | 1
analyze_count       | 1
autoanalyze_count   | 2

Note that the update count was due to a (hopefully) one-time process where I 
had to change the value of a bunch of records. Generally this *should be* an 
insert-once-read-many database.


> The system is deciding on an index scan because that should be very fast, but 
> if it needs to look at the actual table to determine if the process 
> executing that query should still be allowed to see that tuple (row version) 
> then the index only scan slows down a lot I believe. The huge number of heap 
> fetches that Tom pointed out mean that the system doesn't know that ALL 
> processes should be able to see those tuples. As someone else suggested 
> lowering the max freeze age, you might just do a manual "vacuum freeze" as 
> well to allow future auto vacuum processes to quickly skip over those pages 
> as well.

Ok, ran a VACUUM (ANALYZE, FREEZE), and am now showing this from 
pg_stat_user_tables:

volcano_seismology=# SELECT 
seq_scan,seq_tup_read,idx_scan,idx_tup_fetch,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze,n_ins_since_vacuum,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze,vacuum_count,autovacuum_count,analyze_count,autoanalyze_count
 FROM pg_stat_user_tables WHERE relname='data';
-[ RECORD 1 ]-------+------------------------------
seq_scan            | 96
seq_tup_read        | 26737263238
idx_scan            | 732396
idx_tup_fetch       | 52571927369
n_tup_ins           | 841017819
n_tup_upd           | 66120702
n_tup_del           | 2388723
n_tup_hot_upd       | 0
n_live_tup          | 840198830
n_dead_tup          | 10173
n_mod_since_analyze | 0
n_ins_since_vacuum  | 0
last_vacuum         | 2021-09-24 17:18:18.34282+00
last_autovacuum     | 2021-09-14 07:13:23.745862+00
last_analyze        | 2021-09-24 17:18:31.576238+00
last_autoanalyze    | 2021-09-21 11:54:36.924762+00
vacuum_count        | 2
autovacuum_count    | 1
analyze_count       | 2
autoanalyze_count   | 2

However, adding the AND data.channels=channels.channel to the query still makes 
it take around 5 minutes (https://explain.depesz.com/s/7hb1). So, again, running 
VACUUM didn’t appear to help any.

Also perhaps interestingly, if I again modify the query to only match on 
channel, not station, it is again fast (though not quite as fast): 
https://explain.depesz.com/s/HLb8 

So, basically, I can quickly get a list of all channels for which I have data, 
or all stations for which I have data, but getting a list of all channels for 
each station is slow.

> I've heard of the loose indexscan before mentioned on this thread, but I'm 
> not seeing how to implement that for multiple columns. Anyone have an example 
> or pseudo code perhaps?

This is my stumbling block to implementing this option as well. That said, with 
the lookup table in place, perhaps all the rest of this is a moot point? 
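
[Editor's note] For what it's worth, here is one untested sketch of a multi-column
loose index scan, stepping through distinct (station, channel) pairs with a
row-valued comparison so the existing two-column index on data(station, channel)
can satisfy each probe:

```sql
-- Each recursive step fetches the next (station, channel) pair after the
-- current one; the btree on (station, channel) turns every step into a
-- single index probe instead of a scan over all ~800M rows.
WITH RECURSIVE pairs AS (
    (SELECT station, channel
     FROM data
     ORDER BY station, channel
     LIMIT 1)
    UNION ALL
    SELECT nxt.station, nxt.channel
    FROM pairs p
    CROSS JOIN LATERAL (
        SELECT d.station, d.channel
        FROM data d
        WHERE (d.station, d.channel) > (p.station, p.channel)
        ORDER BY d.station, d.channel
        LIMIT 1
    ) AS nxt
)
SELECT station, array_agg(channel) AS channels
FROM pairs
GROUP BY station;
```

The recursion stops on its own once the LATERAL probe finds no pair greater
than the current one.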

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145




Re: Faster distinct query?

2021-09-23 Thread Michael Lewis
>
> It’s only when I add in the AND data.channels=channels.channel that the
> query time blows up to 5+ minutes. I personally don’t understand why there
> would be such a large difference between the two queries - something wrong
> with my indexes?
>

Did you do the manual vacuum as suggested by Tom? Maybe at least check
pg_stat_user_tables for the last vacuum or autovacuum timestamp, and how
many records have changed since then. The system is deciding on an index
scan because that should be very fast, but if it needs to look at the
actual table to determine if the process executing that query should
still be allowed to see that tuple (row version) then the index only scan
slows down a lot I believe. The huge number of heap fetches that Tom
pointed out mean that the system doesn't know that ALL processes should be
able to see those tuples. As someone else suggested lowering the max freeze
age, you might just do a manual "vacuum freeze" as well to allow future
auto vacuum processes to quickly skip over those pages as well.
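
[Editor's note] A minimal sketch of that suggestion, assuming the table from
this thread is simply named "data" (the storage-parameter tweak is one optional
way to get new pages frozen sooner, not something prescribed above):

```sql
-- One-off: freeze existing tuples and refresh statistics.
VACUUM (FREEZE, ANALYZE) data;

-- Optional: let future autovacuums freeze newly inserted tuples immediately.
ALTER TABLE data SET (autovacuum_freeze_min_age = 0);
```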


I've heard of the loose indexscan before mentioned on this thread, but I'm
not seeing how to implement that for multiple columns. Anyone have an
example or pseudo code perhaps?


Re: Faster distinct query?

2021-09-23 Thread Israel Brewster
> On Sep 23, 2021, at 10:36 AM, Geoff Winkless  wrote:
> 
> On Wed, 22 Sept 2021 at 21:05, Israel Brewster wrote:
> I was wondering if there was any way to improve the performance of this query:
> 
> SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY 
> station;
> 
> If you have tables of possible stations and channels (and if not, why not?), 
> then an EXISTS query, something like
> 
> SELECT stations.name, ARRAY_AGG(channels.name)
>   FROM stations, channels
>   WHERE EXISTS
>    (SELECT FROM data WHERE data.channels=channels.name
>     AND data.station=stations.name)
> GROUP BY stations.name
> 
> will usually be much faster, because it can stop scanning after the first 
> match in the index.

So that one ran in about 5 minutes as well - apparently the index scan itself is 
what takes the time, given the similarity of run times for the different 
queries: https://explain.depesz.com/s/w46h

It’s making me think, though, because this similar (though incomplete, and 
therefore incorrect) query runs in only 19 ms 
(https://explain.depesz.com/s/iZnN):

SELECT
stations.name,
array_agg(channels.channel)
FROM stations,channels
WHERE EXISTS (SELECT
  FROM data
  WHERE data.station=stations.id)
GROUP BY stations.name

It’s only when I add in the AND data.channels=channels.channel that the query 
time blows up to 5+ minutes. I personally don’t understand why there would be 
such a large difference between the two queries - something wrong with my 
indexes?

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 
> Geoff



Re: Faster distinct query?

2021-09-23 Thread Geoff Winkless
On Wed, 22 Sept 2021 at 21:05, Israel Brewster 
wrote:

> I was wondering if there was any way to improve the performance of this
> query:
>
>
> SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY 
> station;
>
> If you have tables of possible stations and channels (and if not, why
not?), then an EXISTS query, something like

SELECT stations.name, ARRAY_AGG(channels.name)
  FROM stations, channels
  WHERE EXISTS
   (SELECT FROM data WHERE data.channels=channels.name AND data.station=
stations.name)
GROUP BY stations.name

will usually be much faster, because it can stop scanning after the first match
in the index.

Geoff


Re: Faster distinct query?

2021-09-23 Thread Rob Sargent
I would look into pre-loading the lookup table (and pre-emptive 
maintenance).  Add the foreign key, but not the trigger.


That makes sense. Thanks!

Yeah, then I got to wondering: Do you care?  Are these stations likely 
to be spoofed?  You have the station id and type in your data table and 
essentially the same in your lookup table.  If you're not replacing the 
id+type in your data table with a lookup id you really don't need to 
even have a foreign key.  Maybe sync them regularly but I'm not seeing 
the value in the runtime overhead.  Now presumably the station table is 
entirely pinned in memory and foreign key check might not be much 
overhead but it won't be zero.





Re: Faster distinct query?

2021-09-23 Thread Israel Brewster
> On Sep 23, 2021, at 8:33 AM, Rob Sargent  wrote:
> 
> On 9/23/21 10:16 AM, Israel Brewster wrote:
>>> On Sep 23, 2021, at 4:34 AM, Ryan Booz wrote:
>>> 
>>> Heh, I honestly forgot about the recursive CTE. Certainly worth a try and 
>>> wouldn't require installing other extensions.
>>> 
>>> This is what depesz is referring to: 
>>> https://wiki.postgresql.org/wiki/Loose_indexscan 
>>> 
>> Thanks for the pointer. Will definitely have to spend some time wrapping my 
>> brain around that one - I’ve done some CTE’s before, but not recursive that 
>> I can recall. Should be fun!
>> 
>> If it helps matters any, my structure is currently the following:
>> 
>> table “stations” listing station details (name, latitude, longitude, etc) 
>> with a smallint primary key “id”
>> table “data” with many (many!) data columns (mostly doubles), a station 
>> column that is a smallint referencing the stations table, and a channel 
>> column which is a varchar containing the *name* of the channel the data came 
>> in on.
>> 
>> I will readily accept that this may not be the best structure for the DB. 
>> For example, perhaps the channel column should be normalized out as has been 
>> mentioned a couple of times as an option. This would make sense, and would 
>> certainly simplify this portion of the project.
>> 
>> If I do go with a lookup table updated by a trigger, what would be the best 
>> option for the query the trigger runs - an upsert (ON CONFLICT DO NOTHING)? 
>> Or a query followed by an insert if needed? The normal case would be that 
>> the entry already exists (millions of hits vs only the occasional insert 
>> needed).
>> 
>> 
> I would look into pre-loading the lookup table (and pre-emptive maintenance). 
>  Add the foreign key, but not the trigger.

That makes sense. Thanks!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

Re: Faster distinct query?

2021-09-23 Thread Rob Sargent

On 9/23/21 10:16 AM, Israel Brewster wrote:
On Sep 23, 2021, at 4:34 AM, Ryan Booz wrote:


Heh, I honestly forgot about the recursive CTE. Certainly worth a try 
and wouldn't require installing other extensions.


This is what depesz is referring to: 
https://wiki.postgresql.org/wiki/Loose_indexscan 



Thanks for the pointer. Will definitely have to spend some time 
wrapping my brain around that one - I’ve done some CTE’s before, but 
not recursive that I can recall. Should be fun!


If it helps matters any, my structure is currently the following:

table “stations” listing station details (name, latitude, longitude, 
etc) with a smallint primary key “id”
table “data” with many (many!) data columns (mostly doubles), a 
station column that is a smallint referencing the stations table, and 
a channel column which is a varchar containing the *name* of the 
channel the data came in on.


I will readily accept that this may not be the best structure for the 
DB. For example, perhaps the channel column should be normalized out 
as has been mentioned a couple of times as an option. This would make 
sense, and would certainly simplify this portion of the project.


If I do go with a lookup table updated by a trigger, what would be the 
best option for the query the trigger runs - an upsert (ON CONFLICT DO 
NOTHING)? Or a query followed by an insert if needed? The normal case 
would be that the entry already exists (millions of hits vs only the 
occasional insert needed).



I would look into pre-loading the lookup table (and pre-emptive 
maintenance).  Add the foreign key, but not the trigger.


Re: Faster distinct query?

2021-09-23 Thread Israel Brewster
> On Sep 23, 2021, at 4:34 AM, Ryan Booz  wrote:
> 
> Heh, I honestly forgot about the recursive CTE. Certainly worth a try and 
> wouldn't require installing other extensions.
> 
> This is what depesz is referring to: 
> https://wiki.postgresql.org/wiki/Loose_indexscan 
> 
Thanks for the pointer. Will definitely have to spend some time wrapping my 
brain around that one - I’ve done some CTE’s before, but not recursive that I 
can recall. Should be fun!

If it helps matters any, my structure is currently the following:

table “stations” listing station details (name, latitude, longitude, etc) with 
a smallint primary key “id”
table “data” with many (many!) data columns (mostly doubles), a station column 
that is a smallint referencing the stations table, and a channel column which 
is a varchar containing the *name* of the channel the data came in on.

I will readily accept that this may not be the best structure for the DB. For 
example, perhaps the channel column should be normalized out as has been 
mentioned a couple of times as an option. This would make sense, and would 
certainly simplify this portion of the project.

If I do go with a lookup table updated by a trigger, what would be the best 
option for the query the trigger runs - an upsert (ON CONFLICT DO NOTHING)? Or a 
query followed by an insert if needed? The normal case would be that the entry 
already exists (millions of hits vs only the occasional insert needed).

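[Editor's note] A hedged sketch of the upsert-in-a-trigger variant being asked
about. The table and function names here are invented, and the column layout
(station smallint, channel varchar) is assumed from the description above:

```sql
-- Hypothetical lookup table of observed (station, channel) pairs.
CREATE TABLE station_channels (
    station smallint NOT NULL,
    channel varchar  NOT NULL,
    PRIMARY KEY (station, channel)
);

-- In the common case the pair already exists, so ON CONFLICT DO NOTHING
-- reduces the trigger body to an index probe plus a no-op.
CREATE FUNCTION record_station_channel() RETURNS trigger AS $$
BEGIN
    INSERT INTO station_channels (station, channel)
    VALUES (NEW.station, NEW.channel)
    ON CONFLICT DO NOTHING;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_station_channel_trg
    AFTER INSERT ON data
    FOR EACH ROW EXECUTE FUNCTION record_station_channel();
```
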
Thanks again for all the suggestions!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 
> On Thu, Sep 23, 2021 at 3:04 AM hubert depesz lubaczewski wrote:
> On Wed, Sep 22, 2021 at 12:05:22PM -0800, Israel Brewster wrote:
> > I was wondering if there was any way to improve the performance of this 
> > query:
> > 
> > SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY 
> > station;
> > 
> > The explain execution plan can be found here:
> > https://explain.depesz.com/s/mtxB#html
> > 
> > and it looks pretty straight forward. It does an index_only scan, followed 
> > by an aggregate, to produce a result that is a list of stations along with 
> > a list of channels associated with each (there can be anywhere from 1 to 3 
> > channels associated with each station). This query takes around 5 minutes 
> > to run.
> > 
> > To work around the issue, I created a materialized view that I can update 
> > periodically, and of course I can query said view in no time flat. However, 
> > I’m concerned that as the dataset grows, the time it takes to refresh the 
> > view will also grow (correct me if I am wrong there).
> > 
> > This is running PostgreSQL 13, and the index referenced is a two-column 
> > index on data(station, channel)
> 
> It looks like there are ~ 170 stations, and ~ 800 million rows in the
> table.
> 
> can you tell us how many rows has this:
> 
> select distinct station, channel from data;
> 
> If this is not huge, then you can make the query run much faster using
> skip scan - recursive cte.
> 
> Best regards,
> 
> depesz
> 
> 
> 



Re: Faster distinct query?

2021-09-23 Thread Israel Brewster
> On Sep 22, 2021, at 11:04 PM, hubert depesz lubaczewski  
> wrote:
> 
> On Wed, Sep 22, 2021 at 12:05:22PM -0800, Israel Brewster wrote:
>> I was wondering if there was any way to improve the performance of this 
>> query:
>> 
>> SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY 
>> station;
>> 
>> The explain execution plan can be found here:
>> https://explain.depesz.com/s/mtxB#html 
>> 
>> 
>> and it looks pretty straight forward. It does an index_only scan, followed 
>> by an aggregate, to produce a result that is a list of stations along with a 
>> list of channels associated with each (there can be anywhere from 1 to 3 
>> channels associated with each station). This query takes around 5 minutes to 
>> run.
>> 
>> To work around the issue, I created a materialized view that I can update 
>> periodically, and of course I can query said view in no time flat. However, 
>> I’m concerned that as the dataset grows, the time it takes to refresh the 
>> view will also grow (correct me if I am wrong there).
>> 
>> This is running PostgreSQL 13, and the index referenced is a two-column 
>> index on data(station, channel)
> 
> It looks like there are ~ 170 stations, and ~ 800 million rows in the
> table.
> 
> can you tell us how many rows has this:
> 
> select distinct station, channel from data;

At the moment, about 170, but I would expect it to stabilize at around 510 or 
less once I am pulling in all the channels. Getting this query (or the 
stored/cached results thereof, as it shouldn’t change too often) working fast 
enough to be used in the live system is simply the first step to pulling in 
three times as much data (that’ll be fun!)  

> 
> If this is not huge, then you can make the query run much faster using
> skip scan - recursive cte.

Sounds like something to look into. Of course, if I go with a lookup table, 
updated by an on insert trigger, it becomes a moot point. I’ll have to spend 
some time wrapping my head around the concept, and figuring out how to write it 
so that I get distinct per station rather than just a straight up distinct, but 
theoretically at least it makes sense.

Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 
> Best regards,
> 
> depesz
> 





Re: Faster distinct query?

2021-09-23 Thread Ryan Booz
Heh, I honestly forgot about the recursive CTE. Certainly worth a try and
wouldn't require installing other extensions.

This is what depesz is referring to:
https://wiki.postgresql.org/wiki/Loose_indexscan

On Thu, Sep 23, 2021 at 3:04 AM hubert depesz lubaczewski 
wrote:

> On Wed, Sep 22, 2021 at 12:05:22PM -0800, Israel Brewster wrote:
> > I was wondering if there was any way to improve the performance of this
> query:
> >
> > SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP
> BY station;
> >
> > The explain execution plan can be found here:
> > https://explain.depesz.com/s/mtxB#html
> >
> > and it looks pretty straight forward. It does an index_only scan,
> followed by an aggregate, to produce a result that is a list of stations
> along with a list of channels associated with each (there can be anywhere
> from 1 to 3 channels associated with each station). This query takes around
> 5 minutes to run.
> >
> > To work around the issue, I created a materialized view that I can
> update periodically, and of course I can query said view in no time flat.
> However, I’m concerned that as the dataset grows, the time it takes to
> refresh the view will also grow (correct me if I am wrong there).
> >
> > This is running PostgreSQL 13, and the index referenced is a two-column
> index on data(station, channel)
>
> It looks like there are ~ 170 stations, and ~ 800 million rows in the
> table.
>
> can you tell us how many rows has this:
>
> select distinct station, channel from data;
>
> If this is not huge, then you can make the query run much faster using
> skip scan - recursive cte.
>
> Best regards,
>
> depesz
>
>
>
>


Re: Faster distinct query?

2021-09-23 Thread hubert depesz lubaczewski
On Wed, Sep 22, 2021 at 12:05:22PM -0800, Israel Brewster wrote:
> I was wondering if there was any way to improve the performance of this query:
> 
> SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY 
> station;
> 
> The explain execution plan can be found here:
> https://explain.depesz.com/s/mtxB#html 
> 
> 
> and it looks pretty straight forward. It does an index_only scan, followed by 
> an aggregate, to produce a result that is a list of stations along with a 
> list of channels associated with each (there can be anywhere from 1 to 3 
> channels associated with each station). This query takes around 5 minutes to 
> run.
> 
> To work around the issue, I created a materialized view that I can update 
> periodically, and of course I can query said view in no time flat. However, 
> I’m concerned that as the dataset grows, the time it takes to refresh the 
> view will also grow (correct me if I am wrong there).
> 
> This is running PostgreSQL 13, and the index referenced is a two-column index 
> on data(station, channel)

It looks like there are ~ 170 stations, and ~ 800 million rows in the
table.

can you tell us how many rows has this:

select distinct station, channel from data;

If this is not huge, then you can make the query run much faster using
skip scan - recursive cte.

Best regards,

depesz





Re: Faster distinct query?

2021-09-22 Thread Mladen Gogala



On 9/22/21 16:20, David G. Johnston wrote:
I'd probably turn that index into a foreign key that just ensures that 
every (station,channel) that appears in the data table also appears on 
the lookup table.  Grouping and array-ifying the lookup table would be 
trivial.  Either modify the application code or add a trigger to 
populate the lookup table as needed.



I fully agree with this. Adding a trigger to populate a lookup table is 
a standard design in situations like this. Using "DISTINCT" almost 
always spells trouble for the performance.
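
[Editor's note] A hedged sketch of that arrangement. It assumes a pre-loaded
lookup table, here called "station_channels" (an invented name), keyed on
(station, channel); whether the data table's column is "channel" or "channels"
is ambiguous in this thread, so treat the column names as assumptions:

```sql
-- Composite FK: every (station, channel) in data must already exist in
-- the lookup table, so the lookup table alone can answer the distinct query.
ALTER TABLE data
    ADD CONSTRAINT data_station_channel_fkey
    FOREIGN KEY (station, channel)
    REFERENCES station_channels (station, channel);
```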



--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com





Re: Faster distinct query?

2021-09-22 Thread David Rowley
On Thu, 23 Sept 2021 at 13:21, Israel Brewster  wrote:
> Ah, yes indeed. That version runs in about 30 seconds rather than 5 minutes! 
> See the explain analyze output here: https://explain.depesz.com/s/L5Bf It 
> looks more complicated, but being able to run parallel definitely makes a 
> difference, and there may be some other improvements in there that I’m not 
> aware of as well!

That's good.  You should also look into the VACUUM thing mentioned by
Tom.  If this table is just receiving INSERTs and not UPDATE/DELETEs
then you might want to consider tweaking the auto-vacuum settings for
it.

The default autovacuum_vacuum_insert_scale_factor will mean that
auto-vacuum will only kick off a worker to vacuum this table when 20%
of the total rows have been inserted since the last vacuum.  It's
possible that might account for your large number of heap fetches.

If the table is insert-only, then you could drop the
autovacuum_vacuum_insert_scale_factor down a bit. In the command
below, I set it to 2%.  Also dropping the autovacuum_freeze_min_age is
a pretty good thing to do for tables that are never or almost never
are UPDATEd or DELETEd from.

alter table data set (autovacuum_vacuum_insert_scale_factor=0.02,
autovacuum_freeze_min_age=0);

Vacuuming an insert-only table more often is not a great deal of extra
work, and it's possible even less work if you were to vacuum before
recently inserted pages got evicted from shared_buffers or the
kernel's buffers.  The already vacuumed and frozen portion of the
table will be skipped using the visibility and freeze map, which is
very fast to do.

David




Re: Faster distinct query?

2021-09-22 Thread David Rowley
On Thu, 23 Sept 2021 at 13:28, David G. Johnston
 wrote:
> Agreed, though if the query author needs to do that here we’ve violated the 
> spirit of the declarative SQL language.  At first blush nothing about the 
> original query seems like it should be preventing parallelism.  Each worker 
> builds its own distinct array then the final concatenation is made distinct.

We don't parallelise DISTINCT / ORDER BY aggregates.

David




Re: Faster distinct query?

2021-09-22 Thread David G. Johnston
On Wednesday, September 22, 2021, David Rowley  wrote:

> I think writing the query in such a way
> that allows it to be parallelised is likely going to result in some
> quite good performance improvements. i.e:
>

Agreed, though if the query author needs to do that here we’ve violated the
spirit of the declarative SQL language.  At first blush nothing about the
original query seems like it should be preventing parallelism.  Each worker
builds its own distinct array then the final concatenation is made distinct.

David J.


Re: Faster distinct query?

2021-09-22 Thread David Rowley
On Thu, 23 Sept 2021 at 08:21, Michael Lewis  wrote:
> select station, array_agg(distinct(channel)) as channels
> FROM(
> SELECT station,channel FROM data GROUP BY station,channel
> ) AS sub
> group by station;

Since the subquery is grouping by station, channel, then there's no
need for the DISTINCT in the aggregate function. Removing that should
remove some tuplestore overhead from the aggregate node.
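
[Editor's note] Putting the two messages above together, the sketch becomes:

```sql
-- The inner GROUP BY already deduplicates (station, channel), so the
-- outer array_agg needs no DISTINCT and the plan can use parallel workers.
SELECT station, array_agg(channel) AS channels
FROM (
    SELECT station, channel FROM data GROUP BY station, channel
) AS sub
GROUP BY station;
```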

David




Re: Faster distinct query?

2021-09-22 Thread Tom Lane
Michael Lewis  writes:
> On Wed, Sep 22, 2021 at 2:48 PM Tom Lane  wrote:
>> The "index-only" scan is reported to do 86m heap fetches along the
>> way to returning 812m rows, so the data is apparently pretty dirty.

> Do you say that because you would expect many more than 10 tuples per page?

No, I say that because if the table were entirely all-visible, there
would have been *zero* heap fetches.  As it stands, it's reasonable
to suspect that a pretty sizable fraction of the index-only scan's
runtime went into random-access heap fetches made to verify
visibility of individual rows.

(You will, of course, never get to exactly zero heap fetches in an
IOS unless the table data is quite static.  But one dirty page
out of every ten seems like there were a lot of recent changes.
A VACUUM to clean that up might be well worthwhile.)

regards, tom lane
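
[Editor's note] One hedged way to quantify that, using standard catalog columns
(the table name is the one from this thread):

```sql
-- Pages the visibility map marks all-visible never need heap fetches in an
-- index-only scan; a low percentage here predicts many heap fetches.
SELECT relpages,
       relallvisible,
       round(100.0 * relallvisible / NULLIF(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'data';
```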




Re: Faster distinct query?

2021-09-22 Thread Michael Lewis
On Wed, Sep 22, 2021 at 2:48 PM Tom Lane  wrote:

> "David G. Johnston"  writes:
> > There is no where clause so I'm doubtful there is much to be gained going
> > down this path.  The Index-Only scan seems like an optimal way to obtain
> > this data and the existing query already does that.
>
> The "index-only" scan is reported to do 86m heap fetches along the
> way to returning 812m rows, so the data is apparently pretty dirty.
>

Do you say that because you would expect many more than 10 tuples per page?


Re: Faster distinct query?

2021-09-22 Thread Israel Brewster
> On Sep 22, 2021, at 2:05 PM, Ryan Booz  wrote:
> 
> Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as 
> one of the execution nodes. I also realize I was making a few assumptions 
> about your data, are channels shared among stations, or are all channels 
> unique (like an ID) per station? That would impact the index and approach.

Ok, that may be a good point: “channel” is currently a varchar column, 
containing something like ‘BHZ’, ‘EHZ’, ‘BHE’ etc. There are only a handful of 
possible channels that I am currently aware of, which are shared among stations 
- most stations have a ‘BHZ’ channel, for example. That would be fairly simple 
to normalize out if that would help.
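
[Editor's note] If that normalization were done, a minimal sketch might look
like the following. "channel_types" is an invented name, chosen because this
thread already uses a relation called "channels" in some queries:

```sql
-- Small lookup of channel codes; data.channel would become a smallint
-- reference to it, shrinking the wide (station, channel) index.
CREATE TABLE channel_types (
    id   smallint PRIMARY KEY,
    name varchar NOT NULL UNIQUE   -- e.g. 'BHZ', 'EHZ', 'BHE'
);
```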

> 
> Something like:
> 
> station | channel
> --------|--------
>       1 |       1
>       1 |       2
>       2 |       3
>       2 |       4
> 
> or:
> 
> station | channel
> --------|--------
>       1 |       1
>       1 |       2
>       2 |       1
>       2 |       2
> 
> 
> 
> 
> On Wed, Sep 22, 2021 at 5:53 PM Israel Brewster wrote:
>> On Sep 22, 2021, at 1:50 PM, Ryan Booz wrote:
>> 
>> Cool. I'd be interested to see the explain on it if you ever try it again. 
>> On that cardinality, I'd expect it to be really fast, so I'm interested to 
>> see if the (SkipScan) nodes were actually used.
> 
> With timescaledb extension installed, the explain is what I posted in the 
> original message (https://explain.depesz.com/s/mtxB#html). Without 
> timescaledb installed, the 
> explain looks the same, except it takes twice as long to run.
> 
> Unless I missed something in your message, i.e. some sort of tweak to the 
> query to get it to use the timescaledb features?
> 
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory 
> Geophysical Institute - UAF 
> 2156 Koyukuk Drive 
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
> 
>> 
>> On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster wrote:
>> 
>>> On Sep 22, 2021, at 12:49 PM, Ryan Booz wrote:
>>> 
>>> [Timescale Dev Advocate here]
>>> I realize this might not be the most accepted answer (could be interpreted 
>>> as trying to "sell" something), but feels like an opportunity to talk about 
>>> DISTINCT queries and opportunities. Because you have that index, Timescale 
>>> 2.3 added a "Skip Scan" query planner node that works on regular BTree 
>>> indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at 
>>> all). In this case, your distinct query would likely run in a few 
>>> milliseconds based on the counts you mention (170 stations, 3 channels per 
>>> station), and then the outer aggregation would do the GROUP BY. So, you 
>>> **could** add the TimescaleDB extension to your database (or a copy of) and 
>>> give it a try. You don't actually need to use any TimescaleDB features 
>>> otherwise.
>> 
>> I had actually already done that, as I was considering, in spite of past 
>> negative experiences with timescaledb, experimenting with it on this DB to 
>> see if it worked any better with this data. Out of curiosity, I tried 
>> removing the timescaledb extension, whereupon the query in question took 
>> roughly twice as long. So you are right that installing timescaledb speeds 
>> things up, even when not using any timescaledb specific functions. So that 
>> was a good call. Thanks!
>> 
>> ---
>> Israel Brewster
>> Software Engineer
>> Alaska Volcano Observatory 
>> Geophysical Institute - UAF 
>> 2156 Koyukuk Drive 
>> Fairbanks AK 99775-7320
>> Work: 907-474-5172
>> cell:  907-328-9145
>> 
>>> 
>>> A writeup of why this kind of DISTINCT query is slow in PostgreSQL (for 
>>> now) and what we did to overcome it: 
>>> https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
>>>  
>>> 
>>> Plans for a similar feature in PostgreSQL proper that we'd totally support 
>>> but hasn't made forward progress yet: 
>>> https://commitfest.postgresql.org/19/1741/ 
>>> 
>>> Anyway, it might be worth a shot. HTH
>>> 
>>> Ryan B
>>> 
>>> On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston wrote:
>>> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis wrote:
>>> In the future, please share the plan returned by explain analyze, and some 
>>> data about how many rows in the involved tables,
>>> 
>>> I believe we consider it acceptable to link to an explain viewer, which is 
>>> what the OP did.  Reading explain output in email has its own challenges, 
>>> and I'd rather have the website than a text attachment.
>>> 
>>> 
>>> How does the below work? It should do a very simple index scan only, then 
>>> aggregate the relative few rows 

Re: Faster distinct query?

2021-09-22 Thread Ryan Booz
Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as
one of the execution nodes. I also realize I was making a few assumptions
about your data, are channels shared among stations, or are all channels
unique (like an ID) per station? That would impact the index and approach.

Something like:

station | channel
--|---
11
12
23
24

or:
station | channel
--------|--------
      1 | 1
      1 | 2
      2 | 1
      2 | 2




On Wed, Sep 22, 2021 at 5:53 PM Israel Brewster 
wrote:

> On Sep 22, 2021, at 1:50 PM, Ryan Booz  wrote:
>
> Cool. I'd be interested to see the explain on it if you ever try it again.
> On that cardinality, I'd expect it to be really fast, so I'm interested to
> see if the (SkipScan) nodes were actually used.
>
>
> With timescaledb extension installed, the explain is what I posted in the
> original message (https://explain.depesz.com/s/mtxB#html). Without
> timescaledb installed, the explain looks the same, except it takes twice as
> long to run.
>
> Unless I missed something in your message, i.e. some sort of tweak to the
> query to get it to use the timescaledb features?
>
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
>
>
> On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster 
> wrote:
>
>>
>> On Sep 22, 2021, at 12:49 PM, Ryan Booz  wrote:
>>
>> [Timescale Dev Advocate here]
>> I realize this might not be the most accepted answer (could be
>> interpreted as trying to "sell" something), but feels like an opportunity
>> to talk about DISTINCT queries and opportunities. Because you have that
>> index, Timescale 2.3 added a "Skip Scan" query planner node that works on
>> regular BTree indexes (it doesn't have to be time-series/TimescaleDB
>> Hypertable data at all). In this case, your distinct query would likely run
>> in a few milliseconds based on the counts you mention (170 stations, 3
>> channels per station), and then the outer aggregation would do the GROUP
>> BY. So, you **could** add the TimescaleDB extension to your database (or a
>> copy of) and give it a try. You don't actually need to use any TimescaleDB
>> features otherwise.
>>
>>
>> I had actually already done that, as I was considering, in spite of past
>> negative experiences with timescaledb, experimenting with it on this DB to
>> see if it worked any better with this data. Out of curiosity, I tried
>> removing the timescaledb extension, whereupon the query in question took
>> roughly twice as long. So you are right that installing timescaledb speeds
>> things up, even when not using any timescaledb specific functions. So that
>> was a good call. Thanks!
>>
>> ---
>> Israel Brewster
>> Software Engineer
>> Alaska Volcano Observatory
>> Geophysical Institute - UAF
>> 2156 Koyukuk Drive
>> Fairbanks AK 99775-7320
>> Work: 907-474-5172
>> cell:  907-328-9145
>>
>>
>>
>>- A writeup of why this kind of DISTINCT query is slow in PostgreSQL
>>(for now) and what we did to overcome it:
>>
>> https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
>>- Plans for a similar feature in PostgreSQL proper that we'd totally
>>support but hasn't made forward progress yet:
>>https://commitfest.postgresql.org/19/1741/
>>
>> Anyway, it might be worth a shot. HTH
>>
>> Ryan B
>>
>> On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <
>> david.g.johns...@gmail.com> wrote:
>>
>>> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis 
>>> wrote:
>>>
 In the future, please share the plan returned by explain analyze, and
 some data about how many rows in the involved tables,

>>>
>>> I believe we consider it acceptable to link to an explain viewer, which
>>> is what the OP did.  Reading explain output in email has its own
>>> challenges, and I'd rather have the website than a text attachment.
>>>
>>>
 How does the below work? It should do a very simple index scan only,
 then aggregate the relative few rows after the fact.

 select station, array_agg(distinct(channel)) as channels
 FROM(
 SELECT station,channel FROM data GROUP BY station,channel
 ) AS sub
 group by station;

>>>
>>> Yeah, am pondering this too, though seems like the queries should be
>>> identical so the plan/execution should be the same either way.
>>>
>>>
 If there is correlation between station & channel, then you might look
 at creating a multivariate statistics object and analyzing the table so the
 planner can make better choices

>>>
>>> There is no where clause so I'm doubtful there is much to be gained
>>> going down this path.  The Index-Only scan seems like an optimal way to
>>> obtain this data and the existing query already does that.  The aggregation
>>> path might vary though it seems like that shouldn't be the 

Re: Faster distinct query?

2021-09-22 Thread Israel Brewster
> On Sep 22, 2021, at 1:50 PM, Ryan Booz  wrote:
> 
> Cool. I'd be interested to see the explain on it if you ever try it again. On 
> that cardinality, I'd expect it to be really fast, so I'm interested to see 
> if the (SkipScan) nodes were actually used.

With timescaledb extension installed, the explain is what I posted in the 
original message (https://explain.depesz.com/s/mtxB#html). Without
timescaledb installed, the explain looks the same, except it takes twice as
long to run.

Unless I missed something in your message, i.e. some sort of tweak to the query 
to get it to use the timescaledb features?

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 
> On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster  wrote:
> 
>> On Sep 22, 2021, at 12:49 PM, Ryan Booz  wrote:
>> 
>> [Timescale Dev Advocate here]
>> I realize this might not be the most accepted answer (could be interpreted 
>> as trying to "sell" something), but feels like an opportunity to talk about 
>> DISTINCT queries and opportunities. Because you have that index, Timescale 
>> 2.3 added a "Skip Scan" query planner node that works on regular BTree 
>> indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at 
>> all). In this case, your distinct query would likely run in a few 
>> milliseconds based on the counts you mention (170 stations, 3 channels per 
>> station), and then the outer aggregation would do the GROUP BY. So, you 
>> **could** add the TimescaleDB extension to your database (or a copy of) and 
>> give it a try. You don't actually need to use any TimescaleDB features 
>> otherwise.
> 
> I had actually already done that, as I was considering, in spite of past 
> negative experiences with timescaledb, experimenting with it on this DB to 
> see if it worked any better with this data. Out of curiosity, I tried 
> removing the timescaledb extension, whereupon the query in question took 
> roughly twice as long. So you are right that installing timescaledb speeds 
> things up, even when not using any timescaledb specific functions. So that 
> was a good call. Thanks!
> 
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory 
> Geophysical Institute - UAF 
> 2156 Koyukuk Drive 
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
> 
>> 
>> A writeup of why this kind of DISTINCT query is slow in PostgreSQL (for now) 
>> and what we did to overcome it: 
>> https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
>>  
>> 
>> Plans for a similar feature in PostgreSQL proper that we'd totally support 
>> but hasn't made forward progress yet: 
>> https://commitfest.postgresql.org/19/1741/ 
>> 
>> Anyway, it might be worth a shot. HTH
>> 
>> Ryan B
>> 
>> On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston
>>  wrote:
>> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis  wrote:
>> In the future, please share the plan returned by explain analyze, and some 
>> data about how many rows in the involved tables,
>> 
>> I believe we consider it acceptable to link to an explain viewer, which is 
>> what the OP did.  Reading explain output in email has its own challenges, 
>> and I'd rather have the website than a text attachment.
>> 
>> 
>> How does the below work? It should do a very simple index scan only, then 
>> aggregate the relative few rows after the fact.
>> 
>> select station, array_agg(distinct(channel)) as channels
>> FROM(
>> SELECT station,channel FROM data GROUP BY station,channel
>> ) AS sub
>> group by station;
>> 
>> Yeah, am pondering this too, though seems like the queries should be 
>> identical so the plan/execution should be the same either way.
>> 
>> 
>> If there is correlation between station & channel, then you might look at 
>> creating a multivariate statistics object and analyzing the table so the 
>> planner can make better choices
>> 
>> There is no where clause so I'm doubtful there is much to be gained going 
>> down this path.  The Index-Only scan seems like an optimal way to obtain 
>> this data and the existing query already does that.  The aggregation path 
>> might vary though it seems like that shouldn't be the case here.
>> 
>> David J.
> 



Re: Faster distinct query?

2021-09-22 Thread Ryan Booz
Cool. I'd be interested to see the explain on it if you ever try it again.
On that cardinality, I'd expect it to be really fast, so I'm interested to
see if the (SkipScan) nodes were actually used.

On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster 
wrote:

>
> On Sep 22, 2021, at 12:49 PM, Ryan Booz  wrote:
>
> [Timescale Dev Advocate here]
> I realize this might not be the most accepted answer (could be interpreted
> as trying to "sell" something), but feels like an opportunity to talk about
> DISTINCT queries and opportunities. Because you have that index,
> Timescale 2.3 added a "Skip Scan" query planner node that works on regular
> BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable
> data at all). In this case, your distinct query would likely run in a few
> milliseconds based on the counts you mention (170 stations, 3 channels per
> station), and then the outer aggregation would do the GROUP BY. So, you
> **could** add the TimescaleDB extension to your database (or a copy of) and
> give it a try. You don't actually need to use any TimescaleDB features
> otherwise.
>
>
> I had actually already done that, as I was considering, in spite of past
> negative experiences with timescaledb, experimenting with it on this DB to
> see if it worked any better with this data. Out of curiosity, I tried
> removing the timescaledb extension, whereupon the query in question took
> roughly twice as long. So you are right that installing timescaledb speeds
> things up, even when not using any timescaledb specific functions. So that
> was a good call. Thanks!
>
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
>
>
>
>- A writeup of why this kind of DISTINCT query is slow in PostgreSQL
>(for now) and what we did to overcome it:
>
> https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
>- Plans for a similar feature in PostgreSQL proper that we'd totally
>support but hasn't made forward progress yet:
>https://commitfest.postgresql.org/19/1741/
>
> Anyway, it might be worth a shot. HTH
>
> Ryan B
>
> On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <
> david.g.johns...@gmail.com> wrote:
>
>> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis  wrote:
>>
>>> In the future, please share the plan returned by explain analyze, and
>>> some data about how many rows in the involved tables,
>>>
>>
>> I believe we consider it acceptable to link to an explain viewer, which
>> is what the OP did.  Reading explain output in email has its own
>> challenges, and I'd rather have the website than a text attachment.
>>
>>
>>> How does the below work? It should do a very simple index scan only,
>>> then aggregate the relative few rows after the fact.
>>>
>>> select station, array_agg(distinct(channel)) as channels
>>> FROM(
>>> SELECT station,channel FROM data GROUP BY station,channel
>>> ) AS sub
>>> group by station;
>>>
>>
>> Yeah, am pondering this too, though seems like the queries should be
>> identical so the plan/execution should be the same either way.
>>
>>
>>> If there is correlation between station & channel, then you might look
>>> at creating a multivariate statistics object and analyzing the table so the
>>> planner can make better choices
>>>
>>
>> There is no where clause so I'm doubtful there is much to be gained going
>> down this path.  The Index-Only scan seems like an optimal way to obtain
>> this data and the existing query already does that.  The aggregation path
>> might vary though it seems like that shouldn't be the case here.
>>
>> David J.
>>
>
>


Re: Faster distinct query?

2021-09-22 Thread Israel Brewster

> On Sep 22, 2021, at 12:49 PM, Ryan Booz  wrote:
> 
> [Timescale Dev Advocate here]
> I realize this might not be the most accepted answer (could be interpreted as 
> trying to "sell" something), but feels like an opportunity to talk about 
> DISTINCT queries and opportunities. Because you have that index, Timescale 
> 2.3 added a "Skip Scan" query planner node that works on regular BTree 
> indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at 
> all). In this case, your distinct query would likely run in a few 
> milliseconds based on the counts you mention (170 stations, 3 channels per 
> station), and then the outer aggregation would do the GROUP BY. So, you 
> **could** add the TimescaleDB extension to your database (or a copy of) and 
> give it a try. You don't actually need to use any TimescaleDB features 
> otherwise.

I had actually already done that, as I was considering, in spite of past 
negative experiences with timescaledb, experimenting with it on this DB to see 
if it worked any better with this data. Out of curiosity, I tried removing the 
timescaledb extension, whereupon the query in question took roughly twice as 
long. So you are right that installing timescaledb speeds things up, even when 
not using any timescaledb specific functions. So that was a good call. Thanks!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 
> A writeup of why this kind of DISTINCT query is slow in PostgreSQL (for now) 
> and what we did to overcome it: 
> https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
>  
> 
> Plans for a similar feature in PostgreSQL proper that we'd totally support 
> but hasn't made forward progress yet: 
> https://commitfest.postgresql.org/19/1741/ 
> 
> Anyway, it might be worth a shot. HTH
> 
> Ryan B
> 
> On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston  wrote:
> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis  wrote:
> In the future, please share the plan returned by explain analyze, and some 
> data about how many rows in the involved tables,
> 
> I believe we consider it acceptable to link to an explain viewer, which is 
> what the OP did.  Reading explain output in email has its own challenges, and 
> I'd rather have the website than a text attachment.
> 
> 
> How does the below work? It should do a very simple index scan only, then 
> aggregate the relative few rows after the fact.
> 
> select station, array_agg(distinct(channel)) as channels
> FROM(
> SELECT station,channel FROM data GROUP BY station,channel
> ) AS sub
> group by station;
> 
> Yeah, am pondering this too, though seems like the queries should be 
> identical so the plan/execution should be the same either way.
> 
> 
> If there is correlation between station & channel, then you might look at 
> creating a multivariate statistics object and analyzing the table so the 
> planner can make better choices
> 
> There is no where clause so I'm doubtful there is much to be gained going 
> down this path.  The Index-Only scan seems like an optimal way to obtain this 
> data and the existing query already does that.  The aggregation path might 
> vary though it seems like that shouldn't be the case here.
> 
> David J.



Re: Faster distinct query?

2021-09-22 Thread Israel Brewster
> On Sep 22, 2021, at 12:48 PM, Tom Lane  wrote:
> 
> "David G. Johnston"  writes:
>> There is no where clause so I'm doubtful there is much to be gained going
>> down this path.  The Index-Only scan seems like an optimal way to obtain
>> this data and the existing query already does that.
> 
> The "index-only" scan is reported to do 86m heap fetches along the
> way to returning 812m rows, so the data is apparently pretty dirty.
> It's possible that a preliminary VACUUM to get page-all-visible hint
> bits set would be a net win.

I do have autovacuum turned on, but perhaps I need to run a manual one? The
initial population of the database was accomplished via logical replication
from a different database cluster (needed to move this database to more
dedicated hardware), so perhaps that left the database in a state that
autovacuum doesn't address? Or perhaps my autovacuum settings aren't kosher -
I haven't adjusted that portion of the config at all.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 
>   regards, tom lane





Re: Faster distinct query?

2021-09-22 Thread Ryan Booz
[Timescale Dev Advocate here]
I realize this might not be the most accepted answer (could be interpreted
as trying to "sell" something), but feels like an opportunity to talk about
DISTINCT queries and opportunities. Because you have that index,
Timescale 2.3 added a "Skip Scan" query planner node that works on regular
BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable
data at all). In this case, your distinct query would likely run in a few
milliseconds based on the counts you mention (170 stations, 3 channels per
station), and then the outer aggregation would do the GROUP BY. So, you
**could** add the TimescaleDB extension to your database (or a copy of) and
give it a try. You don't actually need to use any TimescaleDB features
otherwise.


   - A writeup of why this kind of DISTINCT query is slow in PostgreSQL
   (for now) and what we did to overcome it:
   
https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
   - Plans for a similar feature in PostgreSQL proper that we'd totally
   support but hasn't made forward progress yet:
   https://commitfest.postgresql.org/19/1741/

Anyway, it might be worth a shot. HTH

Ryan B
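
For reference, stock PostgreSQL can emulate a skip scan with a recursive CTE
(the "loose index scan" pattern from the PostgreSQL wiki). A sketch, assuming
a btree index on (station, channel) and the table/column names used in this
thread; each recursive step jumps to the next distinct pair rather than
scanning every row:

```sql
-- Loose index scan sketch (names assumed from the thread, not verified).
-- Each iteration fetches only the next distinct (station, channel) pair,
-- so the index is probed once per pair instead of once per row.
WITH RECURSIVE pairs AS (
    (SELECT station, channel FROM data ORDER BY station, channel LIMIT 1)
    UNION ALL
    SELECT n.station, n.channel
    FROM pairs p
    CROSS JOIN LATERAL (
        SELECT station, channel
        FROM data
        WHERE (station, channel) > (p.station, p.channel)
        ORDER BY station, channel
        LIMIT 1
    ) n
)
SELECT station, array_agg(channel) AS channels
FROM pairs
GROUP BY station;
```

With ~170 stations and a few channels each, this does on the order of
hundreds of index probes instead of one pass over 800M+ rows, so it should
run in milliseconds, though the actual plan is worth verifying with EXPLAIN
ANALYZE.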

On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis  wrote:
>
>> In the future, please share the plan returned by explain analyze, and
>> some data about how many rows in the involved tables,
>>
>
> I believe we consider it acceptable to link to an explain viewer, which is
> what the OP did.  Reading explain output in email has its own challenges,
> and I'd rather have the website than a text attachment.
>
>
>> How does the below work? It should do a very simple index scan only, then
>> aggregate the relative few rows after the fact.
>>
>> select station, array_agg(distinct(channel)) as channels
>> FROM(
>> SELECT station,channel FROM data GROUP BY station,channel
>> ) AS sub
>> group by station;
>>
>
> Yeah, am pondering this too, though seems like the queries should be
> identical so the plan/execution should be the same either way.
>
>
>> If there is correlation between station & channel, then you might look at
>> creating a multivariate statistics object and analyzing the table so the
>> planner can make better choices
>>
>
> There is no where clause so I'm doubtful there is much to be gained going
> down this path.  The Index-Only scan seems like an optimal way to obtain
> this data and the existing query already does that.  The aggregation path
> might vary though it seems like that shouldn't be the case here.
>
> David J.
>


Re: Faster distinct query?

2021-09-22 Thread Tom Lane
"David G. Johnston"  writes:
> There is no where clause so I'm doubtful there is much to be gained going
> down this path.  The Index-Only scan seems like an optimal way to obtain
> this data and the existing query already does that.

The "index-only" scan is reported to do 86m heap fetches along the
way to returning 812m rows, so the data is apparently pretty dirty.
It's possible that a preliminary VACUUM to get page-all-visible hint
bits set would be a net win.

regards, tom lane
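
Concretely, the suggestion amounts to something like the following sketch
(table name taken from the thread; relallvisible is only a rough,
analyze-time estimate of visibility-map coverage):

```sql
-- Manual vacuum so pages get marked all-visible and the index-only scan
-- can skip heap fetches.
VACUUM (ANALYZE, VERBOSE) data;

-- Rough check: fraction of pages marked all-visible.
SELECT relpages, relallvisible,
       round(100.0 * relallvisible / GREATEST(relpages, 1), 1) AS pct_visible
FROM pg_class
WHERE relname = 'data';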




Re: Faster distinct query?

2021-09-22 Thread Israel Brewster
> On Sep 22, 2021, at 12:20 PM, David G. Johnston  wrote:
> 
> On Wed, Sep 22, 2021 at 1:05 PM Israel Brewster  wrote:
> To work around the issue, I created a materialized view that I can update 
> periodically, and of course I can query said view in no time flat. However, 
> I’m concerned that as the dataset grows, the time it takes to refresh the 
> view will also grow (correct me if I am wrong there).
> 
> I'd probably turn that index into a foreign key that just ensures that every 
> (station,channel) that appears in the data table also appears on the lookup 
> table.  Grouping and array-ifying the lookup table would be trivial.  Either 
> modify the application code or add a trigger to populate the lookup table as 
> needed.

Makes sense. I was actually considering this approach (albeit without the 
foreign key - that’s a nice additional safety measure), but was concerned about 
the overhead that adding said trigger would have on inserts - thus my thought 
to try the materialized view. As a reference, this database is receiving 1Hz 
data from around 170 stations, with up to three channels of data per station. 
So something like 350-500 inserts per second, although the data is “grouped” 
into 10 minute batches. I’ll give it another look.

> The parentheses around channel in "array_agg(distinct(channel))" are 
> unnecessary - you are invoking composite-type syntax, which is ignored in the 
> single column case unless you write the optional ROW keyword, i.e., distinct 
> ROW(channel)

Good information, thanks!

> David J.


---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

Re: Faster distinct query?

2021-09-22 Thread David G. Johnston
On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis  wrote:

> In the future, please share the plan returned by explain analyze, and some
> data about how many rows in the involved tables,
>

I believe we consider it acceptable to link to an explain viewer, which is
what the OP did.  Reading explain output in email has its own challenges,
and I'd rather have the website than a text attachment.


> How does the below work? It should do a very simple index scan only, then
> aggregate the relative few rows after the fact.
>
> select station, array_agg(distinct(channel)) as channels
> FROM(
> SELECT station,channel FROM data GROUP BY station,channel
> ) AS sub
> group by station;
>

Yeah, am pondering this too, though seems like the queries should be
identical so the plan/execution should be the same either way.


> If there is correlation between station & channel, then you might look at
> creating a multivariate statistics object and analyzing the table so the
> planner can make better choices
>

There is no where clause so I'm doubtful there is much to be gained going
down this path.  The Index-Only scan seems like an optimal way to obtain
this data and the existing query already does that.  The aggregation path
might vary though it seems like that shouldn't be the case here.

David J.


Re: Faster distinct query?

2021-09-22 Thread Michael Lewis
In the future, please share the plan returned by explain analyze, and some
data about how many rows in the involved tables, what type of system you
are running it on, any changes from default configs, etc.

How does the below work? It should do a very simple index scan only, then
aggregate the relative few rows after the fact.

select station, array_agg(distinct(channel)) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

If there is correlation between station & channel, then you might look at
creating a multivariate statistics object and analyzing the table so the
planner can make better choices, knowing that channel is dependent on
station perhaps. I wouldn't necessarily think that it would help this
query, but perhaps others. Also, you might try creating only dependencies,
only ndistinct type, or some combination other than all 3 types.

ref- https://www.postgresql.org/docs/current/sql-createstatistics.html
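
A minimal sketch of that suggestion (the statistics object name is made up;
syntax per the CREATE STATISTICS docs, PostgreSQL 10+):

```sql
-- Extended statistics telling the planner that channel is (partially)
-- dependent on station. Object name is illustrative.
CREATE STATISTICS data_station_channel_stats (dependencies, ndistinct)
    ON station, channel FROM data;

ANALYZE data;  -- the statistics are only computed at analyze time
```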


Re: Faster distinct query?

2021-09-22 Thread David G. Johnston
On Wed, Sep 22, 2021 at 1:05 PM Israel Brewster 
wrote:

> To work around the issue, I created a materialized view that I can update
> periodically, and of course I can query said view in no time flat. However,
> I’m concerned that as the dataset grows, the time it takes to refresh the
> view will also grow (correct me if I am wrong there).
>

I'd probably turn that index into a foreign key that just ensures that
every (station,channel) that appears in the data table also appears on the
lookup table.  Grouping and array-ifying the lookup table would be
trivial.  Either modify the application code or add a trigger to populate
the lookup table as needed.
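
A sketch of that approach, with all object names (and the text column types)
assumed for illustration; the BEFORE INSERT trigger populates the lookup
table first, so the foreign key check on the data row always passes:

```sql
-- Lookup table holding each (station, channel) pair exactly once.
CREATE TABLE station_channels (
    station text NOT NULL,
    channel text NOT NULL,
    PRIMARY KEY (station, channel)
);

-- FK ensures every pair in data also appears in the lookup table.
ALTER TABLE data
    ADD CONSTRAINT data_station_channel_fkey
    FOREIGN KEY (station, channel)
    REFERENCES station_channels (station, channel);

-- Trigger alternative to changing application code: upsert the pair,
-- ignoring duplicates.
CREATE OR REPLACE FUNCTION record_station_channel() RETURNS trigger AS $$
BEGIN
    INSERT INTO station_channels (station, channel)
    VALUES (NEW.station, NEW.channel)
    ON CONFLICT DO NOTHING;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_station_channel_trg
    BEFORE INSERT ON data
    FOR EACH ROW EXECUTE FUNCTION record_station_channel();
    -- (PostgreSQL 11+; use EXECUTE PROCEDURE on older versions)
```

The original question then becomes a trivial scan of a ~500-row table:
SELECT station, array_agg(channel) FROM station_channels GROUP BY station;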

The parentheses around channel in "array_agg(distinct(channel))" are
unnecessary - you are invoking composite-type syntax, which is ignored in
the single column case unless you write the optional ROW keyword, i.e.,
distinct ROW(channel).

David J.