Re: Queries containing ORDER BY and LIMIT started to work slowly

2023-08-29 Thread Rick Otten
On Tue, Aug 29, 2023 at 3:57 PM Rondat Flyag  wrote:

> I took the dump just to store it on another storage (external HDD). I
> didn't do anything with it.
>
> 29.08.2023, 21:42, "Jeff Janes" :
>
>
>
> On Tue, Aug 29, 2023 at 1:47 PM Rondat Flyag 
> wrote:
>
> I have a legacy system that uses `Posgresql 9.6` and `Ubuntu 16.04`.
> Everything was fine several days ago even with standard Postgresql
> settings. I dumped a database with the compression option (maximum
> compression level -Z 9) in order to have a smaller size (`pg_dump
> --compress=9 database_name > database_name.sql`). After that I got a lot of
> problems.
>
>
> You describe taking a dump of the database, but don't describe doing
> anything with it.  Did you replace your system with one restored from that
> dump?  If so, did vacuum and analyze afterwards?
>
> Cheers,
>
> Jeff
>
>
Since this is a very old system and backups are fairly I/O intensive, is it
possible you have a disk going bad?  Sometimes doing a bunch of I/O on an old
disk will accelerate its decline.  You could be about to lose it altogether.


Re: Queries containing ORDER BY and LIMIT started to work slowly

2023-08-29 Thread Jeff Janes
On Tue, Aug 29, 2023 at 2:55 PM Rondat Flyag  wrote:

> I took the dump just to store it on another storage (external HDD). I
> didn't do anything with it.
>

I don't see how that could cause the problem; it is probably just a
coincidence.  Maybe taking the dump held a long-lived snapshot open, which
caused some bloat.  But if that was enough to push your system over the
edge, it was probably too close to the edge to start with.

Do you have a plan for the query from when it was fast?  If not, maybe you
can force it back to the old plan by setting enable_seqscan=off, or perhaps
enable_sort=off, to capture the old plan for comparison.
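
Something along these lines might do it (just a sketch; I've assumed the
join is on isbns.value and that the table/column names below match your
schema -- adjust as needed):

BEGIN;
SET LOCAL enable_seqscan = off;   -- discourage seq scans in this transaction only
EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM isbns
    JOIN books ON books.isbn = isbns.value
    JOIN isbns_statistics ON isbns_statistics.isbn_id = isbns.id
    ORDER BY isbns.value LIMIT 100;
ROLLBACK;                         -- nothing persists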

The estimate for the seq scan of isbns_statistics is off by almost a factor
of 2.  A seq scan with no filters and which cannot stop early should not be
hard to estimate accurately, so this suggests autovacuum is not keeping up.
VACUUM ANALYZE all of the involved tables and see if that fixes things.
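
That is, roughly (assuming those are the three tables in the query):

VACUUM ANALYZE isbns;
VACUUM ANALYZE books;
VACUUM ANALYZE isbns_statistics;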

Cheers,

Jeff


Re: Index bloat and REINDEX/VACUUM optimization for partial index

2023-08-29 Thread Jeff Janes
On Mon, Aug 28, 2023 at 8:33 PM jayaprabhakar k 
wrote:

> Hi,
>
> TL;DR:
> Observations:
>
> 1. REINDEX requires a full table scan
>    - Roughly create a new index, rename index, drop old index.
>    - REINDEX is not incremental. running reindex frequently does not
>      reduce the future reindex time.
> 2. REINDEX does not use the index itself
> 3. VACUUM does not clean up the indices. (relpages >> reltuples) I
>    understand, vacuum is supposed to remove pages only if there are no live
>    tuples in the page, but somehow, even immediately after vacuum, I see
>    relpages significantly greater than reltuples. I would have assumed,
>    relpages <= reltuples
> 4. Query Planner does not consider index bloat, so uses highly bloated
>    partial index that is terribly slow over other index
>
Your points 3 and 4 are not correct.  Empty index pages are put on a
freelist for future reuse; they are not physically removed from the
underlying index files.  Maybe they are not actually getting put on the
freelist, or not being reused from the freelist for some reason, but that
would be a different issue.  Use the pgstattuple extension to see what its
pgstatindex function says about the index.
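
Roughly (pgstattuple is a contrib extension that ships with PostgreSQL; the
index name is taken from your message below):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- reports leaf density, fragmentation and free space for a btree index
SELECT * FROM pgstatindex('tasks_pending_status_created_type_idx');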

The planner does take index bloat into consideration, but its effect size
is low.  Which it should be, as empty or irrelevant pages should be
efficiently skipped during the course of most index operations.  To figure
out what is going on with your queries, you should do an EXPLAIN (ANALYZE,
BUFFERS) of them, once while it is slow and once while it is fast.
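
For example, for the slow case (a sketch only; "id" just stands in for the
real column list):

EXPLAIN (ANALYZE, BUFFERS)
SELECT id   -- stand-in for the real column list
  FROM tasks
 WHERE status NOT IN (3,4,5)
   AND created > NOW() - INTERVAL '30 days'
   AND updated < NOW() - INTERVAL '30 minutes';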


> Question: Is there a way to optimize postgres vacuum/reindex when using
> partial indexes?
>

Without knowing what is actually going wrong, I can only offer
generalities.  Make sure you don't have long-lived transactions which
prevent efficient cleanup.  Increase the frequency with which vacuum runs on
the table.  It can't reduce the size of an already bloated index, but by
keeping the freelist stocked it should be able to prevent it from getting
bloated in the first place.  Also, it can remove empty pages from being
linked into the index tree structure, which means they won't need to be
scanned even though they are still in the file.  It can also free up space
inside non-empty pages for future reuse within that same page, so that
index tuples don't need to be chased down in the table only to be found to
be not visible.
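
To vacuum that one table more aggressively, per-table storage parameters are
one option (the numbers below are only a starting point, not a
recommendation):

ALTER TABLE tasks SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- trigger vacuum after ~1% of rows are dead
    autovacuum_vacuum_cost_delay   = 0      -- let autovacuum run at full speed on this table
);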


> ```
> SELECT [columns list]
>   FROM tasks
>   WHERE status NOT IN (3,4,5) AND created > NOW() - INTERVAL '30 days' AND
> updated < NOW() - interval '30 minutes'
> ```
>
> Since we are only interested in the pending tasks, I created a partial
> index
>  `*"tasks_pending_status_created_type_idx" btree (status, created,
> task_type) WHERE status <> ALL (ARRAY[3, 4, 5])*`.
>

This looks like a poorly designed index.  Since the status condition
exactly matches the index WHERE clause, there is no residual point in
having "status" be the first column in the index; it can only get in the
way (for this particular query).  Move it to the end, or remove it
altogether.
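
The "remove it altogether" variant would look roughly like this (a sketch;
the new index name is just a placeholder, and CONCURRENTLY is only there to
avoid blocking writes during the build):

CREATE INDEX CONCURRENTLY tasks_pending_created_type_idx
    ON tasks (created, task_type)
    WHERE status <> ALL (ARRAY[3, 4, 5]);
-- once the new index is valid, drop the old one
DROP INDEX CONCURRENTLY tasks_pending_status_created_type_idx;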

Within the tuples which pass the status check, which inequality is more
selective, the "created" one or "updated" one?

Cheers,

Jeff


Re: Queries containing ORDER BY and LIMIT started to work slowly

2023-08-29 Thread Rondat Flyag
I took the dump just to store it on another storage (external HDD). I didn't
do anything with it.

29.08.2023, 21:42, "Jeff Janes" :

> On Tue, Aug 29, 2023 at 1:47 PM Rondat Flyag  wrote:
>
> I have a legacy system that uses `Posgresql 9.6` and `Ubuntu 16.04`.
> Everything was fine several days ago even with standard Postgresql
> settings. I dumped a database with the compression option (maximum
> compression level -Z 9) in order to have a smaller size (`pg_dump
> --compress=9 database_name > database_name.sql`). After that I got a lot of
> problems.
>
> You describe taking a dump of the database, but don't describe doing
> anything with it.  Did you replace your system with one restored from that
> dump?  If so, did vacuum and analyze afterwards?
>
> Cheers,
>
> Jeff

Re: Queries containing ORDER BY and LIMIT started to work slowly

2023-08-29 Thread Jeff Janes
On Tue, Aug 29, 2023 at 1:47 PM Rondat Flyag  wrote:

> I have a legacy system that uses `Posgresql 9.6` and `Ubuntu 16.04`.
> Everything was fine several days ago even with standard Postgresql
> settings. I dumped a database with the compression option (maximum
> compression level -Z 9) in order to have a smaller size (`pg_dump
> --compress=9 database_name > database_name.sql`). After that I got a lot of
> problems.
>

You describe taking a dump of the database, but don't describe doing
anything with it.  Did you replace your system with one restored from that
dump?  If so, did vacuum and analyze afterwards?

Cheers,

Jeff


Queries containing ORDER BY and LIMIT started to work slowly

2023-08-29 Thread Rondat Flyag
I have a legacy system that uses `Posgresql 9.6` and `Ubuntu 16.04`.
Everything was fine several days ago even with standard Postgresql settings.
I dumped a database with the compression option (maximum compression level
-Z 9) in order to have a smaller size (`pg_dump --compress=9 database_name >
database_name.sql`). After that I got a lot of problems. Some queries for
certain tables started to be executed very slowly. Queries for other tables
work fine.

Here are the tables that I have issues with:

    asins:
    id (integer)
    value (string), index b-tree
    type (string)

    books:
    id (integer)
    asin (string), index b-tree
    ...
    (total 32 columns)

    asins_statistics:
    id (integer)
    average_price (float)
    average_rating (integer)
    asin_id (foreign key)
    ...
    (total 17 columns)

These tables contain 1 400 000 rows each. Detailed info in attachments.

Basically I used the following query and it worked well:

    (1) SELECT * FROM ISBNS JOIN BOOKS ON BOOKS.ISBN = ISBN.VALUE
        JOIN ISBNS_STATISTICS ON ISBNS_STATISTICS.ISBN_ID = ISBNS.ID
        ORDER BY ISBNS.VALUE LIMIT 100;

But after I made the dump it started to be executed extremely slowly. I'm not
sure whether it's because of the dump, but before the dump everything worked
well. This query also works well:

    SELECT * FROM ISBNS JOIN BOOKS ON BOOKS.ISBN = ISBN.VALUE
    JOIN ISBNS_STATISTICS ON ISBNS_STATISTICS.ISBN_ID = ISBNS.ID LIMIT 100;

This query is executed quickly too:

    SELECT * FROM ISBNS JOIN BOOKS ON BOOKS.ISBN = ISBN.VALUE
    ORDER BY ISBNS.VALUE LIMIT 100;

I changed performance settings (for instance, increased `shared_buffers`),
but it didn't increase speed too much. I've read that queries containing
LIMIT and ORDER BY work very slowly, but if I make such queries to other
tables it works fine. The query plan for query (1) is in attachment.

So, the questions are:
1. Why did everything work well before and then start to work slowly?
2. Why are similar queries to other tables still executed quickly?

Thank you in advance.

Cheers,
Serg


                                     Table "public.asins_statistics"
        Column         |            Type             |                          Modifiers
-----------------------+-----------------------------+----------------------------------------------------------------
 id                    | integer                     | not null default nextval('asins_statistics_id_seq'::regclass)
 average_cost_amazon   | double precision            |
 average_price_new     | double precision            |
 quantity_sold_new     | double precision            |
 quantity_in_transit   | double precision            |
 quantity_present_new  | double precision            |
 ranks_thirty          | integer                     |
 ranks_ninety          | integer                     |
 average_profit_new    | double precision            |
 average_roi_new       | double precision            |
 average_selling_time  | double precision            |
 asin_id               | integer                     |
 average_cost_aob      | double precision            |
 last_sold             | timestamp without time zone |
 average_price_used    | double precision            |
 quantity_sold_used    | integer                     |
 quantity_present_used | integer                     |
 average_profit_used   | double precision            |
Indexes:
    "asins_statistics_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "asins_statistics_asin_id_fkey" FOREIGN KEY (asin_id) REFERENCES asins(id) ON DELETE CASCADE

 Table "public.books"
   Column   |Type | 
Modifiers  
+-+
 id | integer | not 
null default nextval('books_id_seq'::regclass)
 link   | character varying(300)  | 
 asin   | character varying(60)   | 
 title  | character varying(400)  | 
 isbn   | character varying(50)   | 
 newer_edition_available| boolean | 
 newer_edition_link | character varying(150)  | 
 cover_type | character varying(100)  | 
 block_until| timestamp without time zone | 
 latest_trade_in_available  | boolean | 
 latest_trade_in_price  | double precision| 
 latest_rank| bigint  | 
 latest_profit_like_new | double precision| 
 

Re: Index bloat and REINDEX/VACUUM optimization for partial index

2023-08-29 Thread jayaprabhakar k
Thanks Peter. It is *14.4*, but on an AWS RDS Aurora instance. I am trying
to read the links you shared - B-Tree deletion and deduplication, etc. - but
I still don't fully understand what I need to do. In the B-Tree
documentation:

> The average and worst-case number of versions per logical row can be kept
> low purely through targeted incremental deletion passes. It's quite
> possible that the on-disk size of certain indexes will never increase by
> even one single page/block despite *constant* version churn from UPDATEs.


In our case, almost all the tuples stop being covered by the index as they
fail the predicate, and only a few thousand rows pass the index predicate
at any point in time. But we still see the index size continue to increase,
index lookups become slow over time, and VACUUM (non-FULL) doesn't reduce
the index size much.

Do we need to do anything specific to better utilize the targeted
incremental deletion passes?


SELECT VERSION();
                                             version
--------------------------------------------------------------------------------------------------
 PostgreSQL 14.4 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (GCC) 7.4.0, 64-bit
(1 row)






On Mon, 28 Aug 2023 at 18:49, Peter Geoghegan  wrote:

> On Mon, Aug 28, 2023 at 5:33 PM jayaprabhakar k 
> wrote:
> > REINDEX requires a full table scan
> >
> > Roughly create a new index, rename index, drop old index.
> > REINDEX is not incremental. running reindex frequently does not reduce
> the future reindex time.
>
> You didn't say which Postgres version you're on. Note that Postgres 14
> can deal with index bloat a lot better than earlier versions could.
> This is known to work well with partial indexes. See:
>
>
> https://www.postgresql.org/message-id/flat/CAL9smLAjt9mZC2%3DqBeJwuNPq7KMAYGTWWQw_hvA-Lfo0b3ycow%40mail.gmail.com
>
> --
> Peter Geoghegan
>


Re: Range partitioning query performance with date_trunc (vs timescaledb)

2023-08-29 Thread David Rowley
On Tue, 29 Aug 2023 at 19:40, Philippe Pepiot  wrote:
> I'm trying to implement some range partitioning on timeseries data. But it
> looks like some queries involving date_trunc() don't make use of
> partitioning.
>
> BEGIN;
> CREATE TABLE test (
> time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
> value FLOAT NOT NULL
> ) PARTITION BY RANGE (time);
> CREATE INDEX test_time_idx ON test(time DESC);
> CREATE TABLE test_y2010 PARTITION OF test FOR VALUES FROM ('2020-01-01') TO 
> ('2021-01-01');
> CREATE TABLE test_y2011 PARTITION OF test FOR VALUES FROM ('2021-01-01') TO 
> ('2022-01-01');
> CREATE VIEW vtest AS SELECT DATE_TRUNC('year', time) AS time, SUM(value) AS 
> value FROM test GROUP BY 1;
> EXPLAIN (COSTS OFF) SELECT * FROM vtest WHERE time >= TIMESTAMP '2021-01-01';
> ROLLBACK;
>
> The plan query all partitions:

> I wonder if there is a way with a reasonable amount of SQL code to achieve 
> this
> with vanilla postgres ?

The only options I see for you are

1) partition by LIST(date_trunc('year', time)), or
2) use a set-returning function instead of a view and pass the date
range you want to select from the underlying table via parameters.

I imagine you won't want to do #1. However, it would at least also
allow the aggregation to be performed before the Append if you SET
enable_partitionwise_aggregate=1.
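
For reference, #1 would look roughly like this (untested sketch; the list
bound values must match the output of date_trunc exactly, and pruning only
happens when the query repeats the partition key expression):

CREATE TABLE test (
    time  TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    value FLOAT NOT NULL
) PARTITION BY LIST (date_trunc('year', time));

CREATE TABLE test_y2020 PARTITION OF test FOR VALUES IN ('2020-01-01 00:00:00');
CREATE TABLE test_y2021 PARTITION OF test FOR VALUES IN ('2021-01-01 00:00:00');

SET enable_partitionwise_aggregate = on;
EXPLAIN (COSTS OFF)
SELECT date_trunc('year', time) AS time, SUM(value) AS value
FROM test
WHERE date_trunc('year', time) >= TIMESTAMP '2021-01-01'
GROUP BY 1;  -- should scan only test_y2021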

#2 isn't as flexible as a view as you'd have to create another
function or expand the parameters of the existing one if you want to
add items to the WHERE clause.
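
A minimal sketch of #2 (the function and output column names here are made
up; the qual is on the raw "time" column, so the planner can still prune
based on the parameter values):

CREATE FUNCTION vtest_between(from_ts timestamp, to_ts timestamp)
RETURNS TABLE (bucket timestamp, total double precision)
LANGUAGE sql STABLE
AS $$
    SELECT date_trunc('year', time), sum(value)
    FROM test
    WHERE time >= from_ts AND time < to_ts
    GROUP BY 1
$$;

SELECT * FROM vtest_between(TIMESTAMP '2021-01-01', TIMESTAMP '2022-01-01');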

Unfortunately, date_trunc is just a black box to partition pruning, so
it's not able to determine that DATE_TRUNC('year', time) >=
'2021-01-01'  is the same as time >= '2021-01-01'.  It would be
possible to make PostgreSQL do that, but that's a core code change,
not something that you can do from SQL.
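
The practical upshot is that pruning does happen when the qual is written
against the raw column, e.g. querying the table directly instead of through
the view:

EXPLAIN (COSTS OFF)
SELECT date_trunc('year', time) AS time, SUM(value) AS value
FROM test
WHERE time >= TIMESTAMP '2021-01-01'
GROUP BY 1;  -- only test_y2011 should be scanned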

David




Range partitioning query performance with date_trunc (vs timescaledb)

2023-08-29 Thread Philippe Pepiot
Hi,

I'm trying to implement some range partitioning on timeseries data. But it
looks like some queries involving date_trunc() don't make use of
partitioning.

BEGIN;
CREATE TABLE test (
time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
value FLOAT NOT NULL
) PARTITION BY RANGE (time);
CREATE INDEX test_time_idx ON test(time DESC);
CREATE TABLE test_y2010 PARTITION OF test FOR VALUES FROM ('2020-01-01') TO 
('2021-01-01');
CREATE TABLE test_y2011 PARTITION OF test FOR VALUES FROM ('2021-01-01') TO 
('2022-01-01');
CREATE VIEW vtest AS SELECT DATE_TRUNC('year', time) AS time, SUM(value) AS 
value FROM test GROUP BY 1;
EXPLAIN (COSTS OFF) SELECT * FROM vtest WHERE time >= TIMESTAMP '2021-01-01';
ROLLBACK;

The plan queries all partitions:

HashAggregate
  Group Key: (date_trunc('year'::text, test."time"))
  ->  Append
->  Seq Scan on test_y2010 test_1
  Filter: (date_trunc('year'::text, "time") >= '2021-01-01 
00:00:00'::timestamp without time zone)
->  Seq Scan on test_y2011 test_2
  Filter: (date_trunc('year'::text, "time") >= '2021-01-01 
00:00:00'::timestamp without time zone)


The view is there to show the use case, but we get an almost identical plan with
SELECT * FROM test WHERE DATE_TRUNC('year', time) >= TIMESTAMP '2021-01-01';


I tested a variation with timescaledb which seem using trigger based
partitioning:

BEGIN;
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE TABLE test (
time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
value FLOAT NOT NULL
);
SELECT create_hypertable('test', 'time', chunk_time_interval => INTERVAL '1 
year');
CREATE VIEW vtest AS SELECT time_bucket('1 year', time) AS time, SUM(value) AS 
value FROM test GROUP BY 1;
-- insert some data as partitions are created on the fly
INSERT INTO test VALUES (TIMESTAMP '2020-01-15', 1.0), (TIMESTAMP '2021-12-15', 
2.0);
\d+ test
EXPLAIN (COSTS OFF) SELECT * FROM vtest WHERE time >= TIMESTAMP '2021-01-01';
ROLLBACK;


The plan queries a single partition:

GroupAggregate
  Group Key: (time_bucket('1 year'::interval, _hyper_1_2_chunk."time"))
  ->  Result
->  Index Scan Backward using _hyper_1_2_chunk_test_time_idx on 
_hyper_1_2_chunk
  Index Cond: ("time" >= '2021-01-01 00:00:00'::timestamp without 
time zone)
  Filter: (time_bucket('1 year'::interval, "time") >= '2021-01-01 
00:00:00'::timestamp without time zone)


Note that the single-partition plan only works with time_bucket(), not with
date_trunc(); I guess there is some magic regarding this in the
time_bucket() implementation.


I wonder if there is a way, with a reasonable amount of SQL code, to achieve
this with vanilla postgres?

Maybe by taking the assumption that DATE_TRUNC(..., time) <= time?

Thanks!