Re: Gradual migration from integer to bigint?

2023-10-05 Thread Nick Cleaton
On Sat, 30 Sept 2023, 23:37 Tom Lane wrote:

>
> I think what you're asking for is a scheme whereby some rows in a
> table have datatype X in a particular column while other rows in
> the very same physical table have datatype Y in the same column.
>

An alternative for NOT NULL columns would be to use a new attnum for the
bigint version of the id, but add a column to pg_attribute that links the
new id column to the dropped old id column, to avoid the table rewrite.

A global change to the read code would be needed: on finding a NULL in a
NOT NULL column, check for a link to a dropped old column and use that
value instead if one exists. The check could be almost free in the normal
case if there's already a check for an unexpected NULL or a too-short tuple.

Then a metadata-only operation can create the new id column, drop, rename,
and link the old one, and fix up foreign keys etc. for the attnum change.

Indexes are an issue. Maybe require the in-advance creation of indexes
like btree(id::bigint) mirroring every index involving id? Those could
then be swapped in as part of the same metadata-only operation.
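
For illustration, one such mirror index could be created in advance with
something like this (a sketch only; the table and index names here are
hypothetical):

-- hypothetical names; the extra parentheses make this an expression index
create index concurrently mytable_id_bigint_idx
  on mytable using btree ((id::bigint));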


Re: AW: [Extern] Re: consistent postgresql snapshot

2022-05-12 Thread Nick Cleaton
On Thu, 12 May 2022 at 14:48, Tom Lane  wrote:

> "Zwettler Markus (OIZ)"  writes:
> > I don't want to use the normal backup algorithm where pg_start_backup
> > + pg_stop_backup will fix any fractured block, and where I am therefore
> > required to have all archived logfiles.
> > I want to produce an atomic consistent disk snapshot.
>
> [ shrug... ]  You can't have that.  [snip]
>
> The only way you could get a consistent on-disk image is to shut
> the server down (being sure to do a clean not "immediate" shutdown)
> and then take the snapshot.
>

I think you could work around that by taking a dirty snapshot, making a
writable filesystem from it, waiting until you've archived enough WAL to
get that to a consistent state, and then firing up a temporary postmaster
on that filesystem to go through recovery and shut down cleanly.


Re: Handling glibc v2.28 breaking changes

2022-04-25 Thread Nick Cleaton
On Mon, 25 Apr 2022 at 12:45, Laurenz Albe  wrote:

>
> You could consider upgrading in several steps:
>
> - pg_upgrade to v14 on the current operating system
> - use replication, then switch over to move to a current operating
>   system on a different machine
> - REINDEX CONCURRENTLY all indexes on string expressions
>
> You could get data corruption and bad query results between the second
> and the third steps, so keep that interval short.
>

We did something like this, with the addition of a step where we used a
new-OS replica to run amcheck's bt_index_check() over all of the btree
indexes, to find those actually corrupted in practice by the libc upgrade
with our data. It was a small fraction of them, and we were able to fit
an offline reindex of those btrees and all texty non-btree indexes into
an acceptable downtime window, with REINDEX CONCURRENTLY of everything
else as a lower priority after the upgrade.
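
For what it's worth, a minimal sketch of the checking step (not our exact
script), assuming the amcheck extension is already installed:

do $$
declare
  idx regclass;
begin
  -- run amcheck's bt_index_check() over every btree index; the indexes
  -- that raise errors are the ones actually corrupted by the change
  for idx in
    select c.oid
    from pg_class c
    join pg_am am on am.oid = c.relam
    where c.relkind = 'i' and am.amname = 'btree'
  loop
    begin
      perform bt_index_check(idx);
    exception when others then
      raise warning 'index % failed bt_index_check: %', idx, sqlerrm;
    end;
  end loop;
end;
$$;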


Re: Is this a reasonable use for advisory locks?

2022-04-14 Thread Nick Cleaton
On Thu, 14 Apr 2022 at 10:47, Steve Baldwin  wrote:

> Ok, so you want to allow _other_ updates to a customer while this process
> is happening? In that case, advisory locks will probably work. The only
> consideration is that the 'id' is a bigint. If your customer id maps to
> that, great. If not (for example we use UUID's), you will need some way to
> convert that id to a bigint.
>

Alternatively, create a new table that records the start timestamp of the
most recent run of your code block for each customer, and update that as
the first action in your transaction. Then row locks on that table will
protect you from concurrent runs.
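
A minimal sketch of that, with made-up table, column and value names
(assuming a uuid customer id); the upsert also covers a customer's first
run:

create table customer_runs (
  customer_id  uuid primary key,
  last_started timestamptz not null
);

begin;

-- first action in the transaction: the row lock taken here blocks any
-- concurrent run for the same customer until this transaction ends
insert into customer_runs (customer_id, last_started)
values ('00000000-0000-0000-0000-000000000001', now())
on conflict (customer_id) do update set last_started = now();

-- ... the rest of the code block for this customer ...

commit;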


Re: Advice on using materialized views

2021-12-06 Thread Nick Cleaton
On Mon, 6 Dec 2021 at 18:48, Phil Endecott wrote:
>
> - I have a raw data table, indexed on a timestamp column, to which
>   new data is regularly being added.
>
> - I have a set of views that produce hourly/daily/monthly summaries
>   of this data. Querying these views is slow, so I'd like to make
>   them materialized views.
>
> - But I'd like the most recent data to be included in the results
>   of my queries. So I think I need a combined view that is the
>   union of the materialized view and a non-materialised view for
>   the recent values.

Assuming your table is insert-only:

How about, instead of using a materialized view at all, you define a
table of hourly summaries which your script updates, and a view which
merges that with an on-the-fly summary of the main-table rows newer
than the most recent summarised hour?
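
A rough sketch of that shape, with made-up names (raw_data(ts, value)
standing in for your raw data table, summarised by the hour):

create table hourly_summary (
  hour        timestamptz primary key,
  row_count   bigint not null,
  value_total numeric not null
);

-- the view unions the pre-computed hours with an on-the-fly summary of
-- rows newer than the most recent summarised hour
create view combined_summary as
select hour, row_count, value_total
from hourly_summary
union all
select date_trunc('hour', ts), count(*), sum(value)
from raw_data
where ts >= coalesce(
  (select max(hour) + interval '1 hour' from hourly_summary),
  '-infinity')
group by 1;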




Re: concurrent re-partitioning of declarative partitioned tables

2020-12-02 Thread Nick Cleaton
On Wed, 2 Dec 2020 at 15:59, Michael Lewis  wrote:

> You can not have overlapping partitions that are both attached.
>

Not directly, no. That's why I'm considering the _partition_channel hack.

> Why do you want to merge partitions that you are "done with" instead of
> just leaving them partitioned by day?
>

I have some random access index lookups on columns not in the partition
key, where values are unique over the entire table so at most one partition
is going to return a row. A lookup that touches 4 or 5 pages in each of 100
partition indexes is more expensive than one that touches 6 or 7 pages in
each of 10 larger partition indexes.

> Why are you partitioning at all? Are you confident that you need partitions
> for performance & that the trade-offs are worth the cost, or are you
> needing to detach/drop old data quickly to adhere to a retention policy?
>

I do want cheap drops of old data, but many queries also have indexable
conditions on non-key columns and only want records from the most recent
N days, so partition pruning with small partitions for recent records is
useful there too.


Re: concurrent re-partitioning of declarative partitioned tables

2020-12-02 Thread Nick Cleaton
On Wed, 2 Dec 2020 at 16:07, David G. Johnston wrote:

> On Mon, Nov 30, 2020 at 8:36 AM Nick Cleaton  wrote:
>
>> I want to set up a large table on postgresql 12.4, using declarative
>> partitioning to partition by record creation date. I'd like to have recent
>> records in small partitions but old records in a few larger partitions, so
>> I want merges. The merges should be concurrent, in the sense that they lock
>> out readers or writers only for very short intervals if at all.
>>
>
> Once a date has passed is the table for that date effectively read-only?
>

No, old records get updated from time to time, although updates are much
less common than for recent records.


concurrent re-partitioning of declarative partitioned tables

2020-11-30 Thread Nick Cleaton
I want to set up a large table on postgresql 12.4, using declarative
partitioning to partition by record creation date. I'd like to have recent
records in small partitions but old records in a few larger partitions, so
I want merges. The merges should be concurrent, in the sense that they lock
out readers or writers only for very short intervals if at all.

I'm looking at adding an extra boolean column and partitioning on that at
the top level, with two parallel date-based partition trees underneath, so
that I can effectively have overlapping date partitions:


create table mytable (
  record_date timestamp with time zone not null,
  _partition_channel boolean,
  ...
)
partition by list (_partition_channel);

create table mytable_chan_null
  partition of mytable for values in (null)
  partition by range (record_date);

create table mytable_chan_true
  partition of mytable for values in (true)
  partition by range (record_date);

create table mytable_day_20200101
  partition of mytable_chan_null
  for values from ('2020-01-01') to ('2020-01-02');

...

create table mytable_day_20200107
  partition of mytable_chan_null
  for values from ('2020-01-07') to ('2020-01-08');


Then to merge several day-partitions into a week-partition:

create table mytable_week_20200101
  partition of mytable_chan_true
  for values from ('2020-01-01') to ('2020-01-08');

... and migrate rows in batches by updating _partition_channel to true,
then finally drop the empty day partitions.
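
For example, one migration batch might look something like this (assuming
an id primary key, which isn't shown in the sketch above); the
cross-partition update moves each affected row into the new week partition:

update mytable
set _partition_channel = true
where id in (
  select id from mytable_chan_null
  where record_date >= '2020-01-01' and record_date < '2020-01-08'
  limit 1000
);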

Since record_date is an insertion timestamp, I don't mind that after this
merge updating the record_date of a merged row could fail due to a missing
partition. Likewise there's no need for new rows to be inserted with
record_date values in previously merged ranges.

Questions:

Are there any hidden pitfalls with this approach ?

Have I missed a simpler way ?

Is there a project out there that will manage this for me ?


Re: A limit clause can cause a poor index choice

2020-05-20 Thread Nick Cleaton
On Tue, 19 May 2020 at 22:15, Michael Lewis  wrote:

> Increase default_statistics_target, at least on that column, and see if
> you get a much much better plan. I don't know where I got this query from
> online, but here ya go. I'd be curious how frac_MCV in this changes when
> default_statistics_target is more like 250 or 500 and the table is analyzed
> again to reflect that change.

It chooses the fast plan for a limit of 10 if the stats target is
approaching the number of distinct customer_id values, which is 6000 for
this test table:

 stats | frac_mcv   | n_distinct | n_mcv | n_hist | correlation | l10 | l100 | l1000
-------+------------+------------+-------+--------+-------------+-----+------+-------
    -1 | 0.01566    |       5728 |    34 |    101 |  0.98172975 | f   | f    | t
   150 | 0.01505    |       5821 |    38 |    151 |   0.9817175 | f   | f    | t
   250 | 0.04347998 |       5867 |   134 |    251 |  0.98155195 | f   | t    | t
   500 | 0.12606017 |       5932 |   483 |    501 |  0.98155344 | f   | t    | t
   750 | 0.18231618 |       5949 |   750 |    751 |  0.98166454 | f   | t    | t
  1000 | 0.2329197  |       5971 |  1000 |   1001 |   0.9816691 | f   | t    | t
  1500 | 0.3312785  |       5982 |  1500 |   1501 |    0.981609 | f   | t    | t
  3000 | 0.6179379  |       5989 |  3000 |   2989 |    0.981612 | f   | t    | t
  4000 | 0.8033856  |       5994 |  4000 |   1994 |   0.9816348 | f   | t    | t
  4500 | 0.8881603  |       5994 |  4500 |   1494 |  0.98160636 | f   | t    | t
  4800 | 0.9281193  |       5993 |  4800 |   1193 |   0.9816273 | f   | t    | t
  4900 | 0.9396781  |       5994 |  4900 |   1094 |   0.9816546 | f   | t    | t
  5000 | 0.9500147  |       5993 |  5000 |    993 |   0.9816481 | t   | t    | t
  6000 | 0.999714   |       5996 |  5923 |     73 |  0.98162216 | t   | t    | t
     1 | 0.5905     |       5998 |  5970 |     28 |  0.98164326 | t   | t    | t
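
For reference, raising the per-column stats target for each test looks
something like this (a sketch; 500 stands in for whichever value is under
test):

alter table test_orders alter column customer_id set statistics 500;
analyze test_orders;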


Re: A limit clause can cause a poor index choice

2020-05-20 Thread Nick Cleaton
On Tue, 19 May 2020 at 21:56, Mohamed Wael Khobalatte wrote:

> I believe a second ordering, by id desc, will get your query to use the right 
> index, and shouldn't be functionally different from what you would expect.

Thanks, that works nicely on our production table, even with much
larger sets of customer_id values.
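
Applied to the test query from the original post, the workaround would
look something like this; with id as a trailing sort key, a backward scan
of the o_date-only index can no longer supply the requested ordering:

select * from test_orders
where customer_id = any(array[9993,9997,9912,9954,9100,9101,9102,9234,9500,9512])
order by o_date desc, id desc
limit 10;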

> What happens when you raise the limit? Say to a 1000?

A limit of 1000 makes it choose the fast plan. A limit of 100 causes
it to choose the fast plan if I raise the stats target on that column
to 250 or above, otherwise not.




A limit clause can cause a poor index choice

2020-05-19 Thread Nick Cleaton
The attached script builds a 10G test table which demonstrates a
problem that we have in production with postgresql 12.3-1.pgdg18.04+1
on ubuntu linux. Indexes:

test_orders_o_date_idx btree(o_date)
test_orders_customer_id_o_date_idx btree(customer_id, o_date)

We query for the most recent orders for sets of customers, and
sometimes none of those customers have any orders and the results are
empty:

explain analyze select * from test_orders where
customer_id=any(ARRAY[9993,9997,9912,9954,9100,9101,9102,9234,9500,9512])
order by o_date desc;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=24848.96..24870.67 rows=8686 width=1839) (actual time=1.101..1.102 rows=0 loops=1)
   Sort Key: o_date DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Index Scan using test_orders_customer_id_o_date_idx on test_orders  (cost=0.43..17361.20 rows=8686 width=1839) (actual time=1.047..1.047 rows=0 loops=1)
         Index Cond: (customer_id = ANY ('{9993,9997,9912,9954,9100,9101,9102,9234,9500,9512}'::integer[]))
 Planning Time: 3.821 ms
 Execution Time: 1.174 ms
(7 rows)

So far so good. But if we add a limit clause to the query then the
plan goes very wrong:

explain analyze select * from test_orders where
customer_id=any(ARRAY[9993,9997,9912,9954,9100,9101,9102,9234,9500,9512])
order by o_date desc limit 10;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.43..1660.98 rows=10 width=1839) (actual time=4990.424..4990.424 rows=0 loops=1)
   ->  Index Scan Backward using test_orders_o_date_idx on test_orders  (cost=0.43..1442355.43 rows=8686 width=1839) (actual time=4990.423..4990.423 rows=0 loops=1)
         Filter: (customer_id = ANY ('{9993,9997,9912,9954,9100,9101,9102,9234,9500,9512}'::integer[]))
         Rows Removed by Filter: 500
 Planning Time: 0.063 ms
 Execution Time: 4990.435 ms


Is there something we can adjust to get it to prefer
test_orders_customer_id_o_date_idx even when there's a limit clause ?
#!/usr/bin/python3

import random
import datetime

secs_in_day = 24*60*60

longstr = """iufdpoaiusoto3u5034534i5j345k345lku09s80s9dfjwer.,newrwwerwerwerlwerjlwejrlkewjr""" * 10

print("""

drop table if exists test_orders;
drop sequence if exists test_orders_id_seq;

CREATE SEQUENCE test_orders_id_seq
START WITH 1
INCREMENT BY 1
NO MINVALUE
NO MAXVALUE
CACHE 1;

CREATE TABLE test_orders (
id integer DEFAULT nextval('test_orders_id_seq'::regclass) NOT NULL,
o_date timestamp with time zone NOT NULL,
customer_id integer,
str1 text,
num1 integer,
long1 text,
long2 text,
long3 text,
long4 text
);

COPY test_orders(o_date, customer_id, str1, num1, long1, long2, long3, long4) FROM stdin;""")

for day in range(5000):
    orders = [(secs_in_day * day + random.randrange(secs_in_day), customer) for customer in range(day, day+1000)]
    for o_date, customer_id in sorted(orders):
        print(datetime.datetime.fromtimestamp(1234234234 + o_date).isoformat(),
              customer_id,
              "blah",
              random.randrange(100),
              longstr,
              longstr,
              longstr,
              longstr,
              sep="\t")

print("""\\.

create index test_orders_o_date_idx on test_orders using btree(o_date);
create index test_orders_customer_id_o_date_idx on test_orders using btree(customer_id, o_date);

analyze test_orders;
""")


Re: how to securely delete the storage freed when a table is dropped?

2018-04-13 Thread Nick Cleaton
On 13 April 2018 at 18:48, Jonathan Morgan wrote:
> For a system with information stored in a PostgreSQL 9.5 database, in which
> data stored in a table that is deleted must be securely deleted (like shred
> does to files), and where the system is persistent even though any
> particular table likely won't be (so can't just shred the disks at
> "completion"), I'm trying to figure out my options for securely deleting the
> underlying data files when a table is dropped.
>
> As background, I'm not a DBA, but I am an experienced implementor in many
> languages, contexts, and databases. I've looked online and haven't been able
> to find a way to ask PostgreSQL to do the equivalent of shredding its
> underlying files before releasing them to the OS when a table is DROPped. Is
> there a built-in way to ask PostgreSQL to do this? (I might just not have
> searched for the right thing - my apologies if I missed something)
>
> A partial answer we're looking at is shredding the underlying data files for
> a given relation and its indexes manually before dropping the tables, but
> this isn't so elegant, and I'm not sure it is getting all the information
> from the tables that we need to delete.
>
> We also are looking at strategies for shredding free space on our data disk
> - either running a utility to do that, or periodically replicating the data
> volume, swapping in the results of the copy, then shredding the entire
> volume that was the source so its "free" space is securely overwritten in
> the process.
>
> Are we missing something? Are there other options we haven't found? If we
> have to clean up manually, are there other places we need to go to shred
> data than the relation files for a given table, and all its related indexes,
> in the database's folder? Any help or advice will be greatly appreciated.

Can you encrypt the data in the application, above the DB level ? That
would be cleaner if you can.

If not, you'll have to worry about both the DB's data files themselves
and the WAL files in pg_xlog/ which hold copies of the recently
written data. Even if you securely scrub the deleted parts of the
filesystems after dropping the table, there could still be copies of
secret table data in WAL files that haven't yet been overwritten.

One way to scrub deleted files would be to use ZFS and have an extra
disk. When it's time to scrub, "zpool attach" the extra disk to your
zpool, which will cause ZFS to copy over only the files that haven't
been deleted, in the background. When that's finished you can detach
the original disk from the zpool and then do a low-level overwrite of
that entire disk. For extra security points use encrypted block
devices underneath ZFS, and instead of scrubbing the disk just destroy
the encryption key that you were using for it.