Re: Improving pg_dump performance when handling large numbers of LOBs

2024-02-05 Thread Shaheed Haque
Might it be worth spending a modest amount of time on some basic profiling to
see where the time is going? A week is a looonnngg time, even for 150e6
operations. For example, if there is an unexpectedly high IO load, some
temporary M.2 storage might help?
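If it helps, one low-effort check (just a sketch; it assumes the dump
connection reports itself under application_name 'pg_dump', which it normally
does) is to sample pg_stat_activity on the source server while the dump runs
and see what it is waiting on:

select pid, state, wait_event_type, wait_event, query
from pg_stat_activity
where application_name = 'pg_dump';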

On Tue, 6 Feb 2024, 01:36 Ron Johnson,  wrote:

> On Mon, Feb 5, 2024 at 2:01 PM Wyatt Tellis 
> wrote:
>
>> Hi,
>>
>> We've inherited a series of legacy PG 12 clusters that each contain a
>> database that we need to migrate to a PG 15 cluster. Each database contains
>> about 150 million large objects totaling about 250GB.
>>
>
> 250*10^9 / (150*10^6) = 1667 bytes.  That's *tiny*.
>
> Am I misunderstanding you?
>
>>


Re: How to do faster DML

2024-02-05 Thread veem v
On Tue, 6 Feb 2024 at 10:45, Lok P  wrote:

> Thank you Greg.
>
> *"and keeping all your active stuff in cache. Since you have 127 columns,
> only pull back the columns you absolutely need for each query."*
>
> Understood the point. As postgres is a "row store" database, so keeping
> the size of the row lesser by making the number of columns to minimum,
> should definitely help as more number of rows can be packed then into one
> page or block (which I believe is 8KB default size per block). And that
> mean more number of blocks can be cached while reading, so better cache hit
> ratio.
>
> As you rightly pointed out, Another thing I noticed the shared_buffer
> parameters set as 2029684 in this instance, which comes to ~21MB and that
> seems very small for a database operating in large scale. And I see we have
> RAM in the instance showing as ~256GB. So thinking of bumping it to
> something as ~40-50GB. Hope that will help to some extent. Not sure if
> there is methods to manually,  cache some objects(tables/indexes) which
> were getting used frequently by the read queries.
>
>
> On Tue, Feb 6, 2024 at 10:27 AM Greg Sabino Mullane 
> wrote:
>
>>
>> So when you said *"In that case, and based on the numbers you provided,
>>> daily partitioning seems a decent solution."*
>>> , does it mean that size of the partition (which will be ~100-150GB per
>>> partition here) is not an issue for the postgres to serve read latency
>>> sensitive queries?
>>>
>>
>> Yes, that should be fine. At the end of the day, what's going to be more
>> important is making sure you are using good indexing, and keeping all your
>> active stuff in cache. Since you have 127 columns, only pull back the
>> columns you absolutely need for each query.
>>
>> Cheers,
>> Greg
>>
>>
>
Good discussion!! I don't want to divert the thread, and apologies for
comparing with another RDBMS, but I'm curious to know the difference in
behaviour.

With respect to the storage part: in Oracle we were supposed to put the
frequently accessed columns first and the less frequently accessed columns
towards the end of the table when writing the CREATE TABLE DDL. Because it's
a row store database, this used to help query performance when accessing
those columns. Do any such performance implications exist in Postgres? Also,
there the data types like varchar only occupy the space that the real
data/value needs.

But here in Postgres it seems a bit different, as there appears to be
significant storage overhead: some types seem to behave like fixed-length
data types and occupy their full space regardless of what the stored value
holds. So I'm curious: will there be any performance impact when accessing a
frequently accessed column that has been placed towards the end of the table
in order to save storage space?
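For what it's worth, the alignment padding in question can be observed
directly with pg_column_size(); this is only an illustrative sketch (table
names invented), not something from the thread:

create table pad_demo_bad  (a boolean, b bigint, c boolean, d bigint);
create table pad_demo_good (b bigint, d bigint, a boolean, c boolean);

insert into pad_demo_bad  values (true, 1, true, 1);
insert into pad_demo_good values (1, 1, true, true);

-- the interleaved layout is physically wider, because each bigint must start
-- on an 8-byte boundary and the preceding boolean forces padding bytes
select (select pg_column_size(t.*) from pad_demo_bad  t) as bad_row_bytes,
       (select pg_column_size(t.*) from pad_demo_good t) as good_row_bytes;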


Re: How to do faster DML

2024-02-05 Thread Lok P
Thank you Greg.

*"and keeping all your active stuff in cache. Since you have 127 columns,
only pull back the columns you absolutely need for each query."*

Understood the point. As Postgres is a "row store" database, keeping the row
size small by keeping the number of columns to a minimum should definitely
help, as more rows can then be packed into one page or block (which I believe
is 8KB by default). That also means fewer blocks need to be read for the same
data, so a better cache hit ratio.
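As a rough way to see how the cache is doing today (a sketch; it mixes all
workloads since the last stats reset, so treat it as an approximation):

select datname,
       blks_hit,
       blks_read,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) as cache_hit_pct
from pg_stat_database
where datname = current_database();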

As you rightly pointed out, another thing I noticed is that the
shared_buffers parameter is set to 2029684 in this instance, which comes to
~21MB, and that seems very small for a database operating at this scale. The
instance shows ~256GB of RAM, so I am thinking of bumping it up to something
like ~40-50GB; hopefully that will help to some extent. I am also not sure
whether there is a method to manually cache some objects (tables/indexes)
that are frequently used by the read queries.
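On that last point: the pg_prewarm extension can load a relation into shared
buffers ahead of time, and the effective value of shared_buffers (with its
unit) can be double-checked from pg_settings. A sketch, with a hypothetical
table name:

select name, setting, unit from pg_settings where name = 'shared_buffers';

create extension if not exists pg_prewarm;
select pg_prewarm('my_hot_table');        -- hypothetical relation name
select pg_prewarm('my_hot_table_pkey');   -- indexes can be prewarmed too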


On Tue, Feb 6, 2024 at 10:27 AM Greg Sabino Mullane 
wrote:

>
> So when you said *"In that case, and based on the numbers you provided,
>> daily partitioning seems a decent solution."*
>> , does it mean that size of the partition (which will be ~100-150GB per
>> partition here) is not an issue for the postgres to serve read latency
>> sensitive queries?
>>
>
> Yes, that should be fine. At the end of the day, what's going to be more
> important is making sure you are using good indexing, and keeping all your
> active stuff in cache. Since you have 127 columns, only pull back the
> columns you absolutely need for each query.
>
> Cheers,
> Greg
>
>


Re: How to do faster DML

2024-02-05 Thread Greg Sabino Mullane
> So when you said *"In that case, and based on the numbers you provided,
> daily partitioning seems a decent solution."*
> , does it mean that size of the partition (which will be ~100-150GB per
> partition here) is not an issue for the postgres to serve read latency
> sensitive queries?
>

Yes, that should be fine. At the end of the day, what's going to be more
important is making sure you are using good indexing, and keeping all your
active stuff in cache. Since you have 127 columns, only pull back the
columns you absolutely need for each query.

Cheers,
Greg


Re: How to do faster DML

2024-02-05 Thread Lok P
On Tue, Feb 6, 2024 at 8:34 AM Greg Sabino Mullane 
wrote:

> On Sun, Feb 4, 2024 at 3:52 PM Lok P  wrote:
>
>> This table will always be queried on the transaction_date column as one
>> of the filters criteria. But the querying/search criteria can span from a
>> day to a month worth of transaction date data.
>>
>
> In that case, and based on the numbers you provided, daily partitioning
> seems a decent solution.
>
> What I mean was, we will definitely need the data for querying purposes by
>> the users, but just to keep the storage space incontrol (and to help the
>> read queries), we were thinking of having the historical partitions
>> compressed. And for that , if any compression strategy should we follow on
>> postgres?
>>
>
> There is no compression strategy, per se. There are ways Postgres
> internally compresses the data (see "TOAST"), but in general, the table is
> either there or not. If space is a concern you should examine if you truly
> need 127 columns, make sure they are arranged correctly (search for
> 'postgres column tetris'), and move unused and older partitions elsewhere -
> like a separate cheaper Postgres server, or something offline.
>
> With regards to loading data to the table faster, wondering why you said '
>> *NO*' to load the data first and enabling/Creating the Primary key and
>> Foreign key constraint later approach.
>>
>
> For an initial load, this is fine, if you do things very carefully. I'd
> build the child table indexes post load but still feed things into the main
> table as an initial tradeoff, but YMMV.
>
> Just looked back and saw this is actually Aurora, not Postgres. Most of
> the advice on this thread should still apply, but be aware that things are
> not the same as Postgres, especially at the storage layer. For all the
> money you are giving them, don't forget to bug them as well.
>
> Cheers,
> Greg
>
>
Thank you so much Greg. That helps.

So when you said *"In that case, and based on the numbers you provided,
daily partitioning seems a decent solution."*, does it mean that the size of
the partition (which will be ~100-150GB per partition here) is not an issue
for Postgres to serve read-latency-sensitive queries?

And yes, I think you are spot on with regards to the storage. We actually
loaded CSV files whose sizes summed to ~200GB, and I was wondering how the
table size becomes ~1TB once that data is loaded into the database. I was not
aware of "column tetris" and how Postgres adds alignment padding to column
data depending on the types of the following columns; I'm still trying to
digest the concept. We do have many columns with NULL values in them. I will
try to reorder the columns in the table and hopefully we will be able to
really reduce the storage space through that. Will raise a case too.
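For reference, the per-column size and alignment that the "column tetris"
reordering is based on can be read straight from the catalogs; a sketch, with
a hypothetical table name:

-- typlen = -1 means variable length (varchar, numeric, ...);
-- typalign 'd' means 8-byte alignment, 'i' 4-byte, 's' 2-byte, 'c' 1-byte
select a.attname, t.typname, t.typlen, t.typalign
from pg_attribute a
join pg_type t on t.oid = a.atttypid
where a.attrelid = 'my_big_table'::regclass   -- hypothetical table name
  and a.attnum > 0
  and not a.attisdropped
order by a.attnum;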

Regards
Lok


Re: How to do faster DML

2024-02-05 Thread Greg Sabino Mullane
On Sun, Feb 4, 2024 at 3:52 PM Lok P  wrote:

> This table will always be queried on the transaction_date column as one of
> the filters criteria. But the querying/search criteria can span from a day
> to a month worth of transaction date data.
>

In that case, and based on the numbers you provided, daily partitioning
seems a decent solution.
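As a sketch of what daily range partitioning could look like (table and
column names here are hypothetical, not from the thread):

create table transactions (
    transaction_id   bigint        not null,
    transaction_date timestamptz   not null,
    amount           numeric(12,2)
) partition by range (transaction_date);

-- one partition per day, typically pre-created by a scheduled job
create table transactions_2024_02_05
    partition of transactions
    for values from ('2024-02-05') to ('2024-02-06');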

> What I mean was, we will definitely need the data for querying purposes by
> the users, but just to keep the storage space incontrol (and to help the
> read queries), we were thinking of having the historical partitions
> compressed. And for that , if any compression strategy should we follow on
> postgres?
>

There is no compression strategy, per se. There are ways Postgres
internally compresses the data (see "TOAST"), but in general, the table is
either there or not. If space is a concern you should examine if you truly
need 127 columns, make sure they are arranged correctly (search for
'postgres column tetris'), and move unused and older partitions elsewhere -
like a separate cheaper Postgres server, or something offline.
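A sketch of the "move older partitions elsewhere" idea, reusing the
hypothetical names above: detach the partition (it becomes a plain table),
dump it, restore it on the cheaper server, then drop it locally.

alter table transactions detach partition transactions_2023_11_05;
-- pg_dump -t transactions_2023_11_05 ..., restore on the archive server, then:
drop table transactions_2023_11_05;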

> With regards to loading data to the table faster, wondering why you said '
> *NO*' to load the data first and enabling/Creating the Primary key and
> Foreign key constraint later approach.
>

For an initial load, this is fine, if you do things very carefully. I'd
build the child table indexes post load but still feed things into the main
table as an initial tradeoff, but YMMV.

Just looked back and saw this is actually Aurora, not Postgres. Most of the
advice on this thread should still apply, but be aware that things are not
the same as Postgres, especially at the storage layer. For all the money
you are giving them, don't forget to bug them as well.

Cheers,
Greg


Re: Unused indexes

2024-02-05 Thread Greg Sabino Mullane
>
> The pg_stat_*_tables tables idx_* columns has accumulated usage since the
> last time you started the postmaster.
>

Actually, those persist across restarts - you can use

 select datname, stats_reset from pg_stat_database;

to see when/if they were reset. You can look for zero/low entries in
pg_stat_user_indexes.idx_scan to find unused indexes. Note that replicas
maintain their own stats, so checking only the primary may cause a false
positive.
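A sketch of the sort of query that surfaces candidates (remember to run it on
the primary and on each replica):

select schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_stat_user_indexes
where idx_scan = 0
order by pg_relation_size(indexrelid) desc;
-- note: indexes backing primary key / unique constraints may legitimately
-- show zero scans and still be needed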


>> In SQL Server we have this option to disable an index, and we need to
>> rebuild it to re-enable it
>>
>
> Sadly, PG does not have ALTER INDEX ... DISABLE;.
>

Not really sure what the purpose of that is in SQL Server, but Ron is
correct: we have nothing equivalent. General usage in Postgres is to drop the
index if it is unused. If you need to create it again, that is easy enough
with CREATE INDEX CONCURRENTLY. Keeping your schema changes in a VCS (e.g.
git) is a good way to document when and why the index was dropped. I suppose
in a pinch you could keep the old index definition around by sticking it in a
table comment.
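A sketch of that workflow, with a hypothetical index name:

-- record the definition somewhere (VCS, a comment, a notes table) ...
select pg_get_indexdef('orders_status_idx'::regclass);

-- ... then drop it without blocking writers
drop index concurrently orders_status_idx;

-- and if it turns out to be needed after all, rebuild it the same way
create index concurrently orders_status_idx on orders (status);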

Cheers,
Greg


Re: select from composite type

2024-02-05 Thread Adrian Klaver

On 2/5/24 16:35, Lorusso Domenico wrote:
ah ehm.. I solved, it was very easy but I believed it should use the 
from clause...


execute 'select ($1).* ' using _attribute into _r;


Beat me to it

For the reason why it works:

https://www.postgresql.org/docs/current/sql-expressions.html#SQL-SYNTAX-ROW-CONSTRUCTORS



Il giorno mar 6 feb 2024 alle ore 01:01 Lorusso Domenico 
mailto:domenico@gmail.com>> ha scritto:


here an example (the actual case in more complex, but the point it's
the same)

do $$
declare
_attribute_list temp1.my_type[];
_attribute temp1.my_type;

_r record;
begin
_attribute_list=array[row(1,'Hello') , row(2,'Goodbye')];

_attribute= row(1,'Doh');

raise notice '%', _attribute_list;

for _r in execute 'select * from unnest($1) where foo=1' using
_attribute_list loop
raise notice '%', _r;
end loop;

--Error
execute 'select * from $1' using _attribute into _r;

raise notice '%', _r;
end;
$$;

So I able to manage an array of complex type (why I use an array,
because in a previous answer the community suggest to me to use
an array to pass a list of information instead of temporary table),
but I can't do the same thing with just an element.

Of course I can set an element as part of an array with just that
element but. it's sad...


Il giorno lun 5 feb 2024 alle ore 01:48 David G. Johnston
mailto:david.g.johns...@gmail.com>> ha
scritto:

On Sun, Feb 4, 2024 at 5:39 PM Adrian Klaver
mailto:adrian.kla...@aklaver.com>>
wrote:


 >
 > attribute_list is  an array of composite type (with 20
fields).

I am trying to wrap my head around "array of composite
type". Please
provide an example.


ARRAY[ (1,2)::point, (3,4)::point ]::point[]

The main problem is the concept of writing "from($1)" in any
query makes no sense, you cannot parameterize a from clause
directly like that.  You have to put the value somewhere an
expression is directly allowed.

David J.



-- 
Domenico L.


per stupire mezz'ora basta un libro di storia,
io cercai di imparare la Treccani a memoria... [F.d.A.]



--
Domenico L.

per stupire mezz'ora basta un libro di storia,
io cercai di imparare la Treccani a memoria... [F.d.A.]


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: select from composite type

2024-02-05 Thread Lorusso Domenico
Ah, ehm... I solved it; it was very easy, but I believed it would have to use
the FROM clause...

execute 'select ($1).* ' using _attribute into _r;

Il giorno mar 6 feb 2024 alle ore 01:01 Lorusso Domenico <
domenico@gmail.com> ha scritto:

> here an example (the actual case in more complex, but the point it's the
> same)
>
> do $$
> declare
> _attribute_list temp1.my_type[];
> _attribute temp1.my_type;
>
> _r record;
> begin
> _attribute_list=array[row(1,'Hello') , row(2,'Goodbye')];
>
> _attribute= row(1,'Doh');
>
> raise notice '%', _attribute_list;
>
> for _r in execute 'select * from unnest($1) where foo=1' using
> _attribute_list loop
> raise notice '%', _r;
> end loop;
>
> --Error
> execute 'select * from $1' using _attribute into _r;
>
> raise notice '%', _r;
> end;
> $$;
>
> So I able to manage an array of complex type (why I use an array, because
> in a previous answer the community suggest to me to use an array to pass a
> list of information instead of temporary table), but I can't do the same
> thing with just an element.
>
> Of course I can set an element as part of an array with just that element
> but. it's sad...
>
>
> Il giorno lun 5 feb 2024 alle ore 01:48 David G. Johnston <
> david.g.johns...@gmail.com> ha scritto:
>
>> On Sun, Feb 4, 2024 at 5:39 PM Adrian Klaver 
>> wrote:
>>
>>>
>>> >
>>> > attribute_list is  an array of composite type (with 20 fields).
>>>
>>> I am trying to wrap my head around "array of composite type". Please
>>> provide an example.
>>>
>>
>> ARRAY[ (1,2)::point, (3,4)::point ]::point[]
>>
>> The main problem is the concept of writing "from($1)" in any query makes
>> no sense, you cannot parameterize a from clause directly like that.  You
>> have to put the value somewhere an expression is directly allowed.
>>
>> David J.
>>
>
>
> --
> Domenico L.
>
> per stupire mezz'ora basta un libro di storia,
> io cercai di imparare la Treccani a memoria... [F.d.A.]
>


-- 
Domenico L.

per stupire mezz'ora basta un libro di storia,
io cercai di imparare la Treccani a memoria... [F.d.A.]


Re: select from composite type

2024-02-05 Thread Lorusso Domenico
Here is an example (the actual case is more complex, but the point is the
same):

do $$
declare
_attribute_list temp1.my_type[];
_attribute temp1.my_type;

_r record;
begin
_attribute_list=array[row(1,'Hello') , row(2,'Goodbye')];

_attribute= row(1,'Doh');

raise notice '%', _attribute_list;

for _r in execute 'select * from unnest($1) where foo=1' using
_attribute_list loop
raise notice '%', _r;
end loop;

--Error
execute 'select * from $1' using _attribute into _r;

raise notice '%', _r;
end;
$$;

So I am able to manage an array of a complex type (why do I use an array?
Because in a previous answer the community suggested that I use an array to
pass a list of information instead of a temporary table), but I can't do the
same thing with just a single element.

Of course I could wrap the single element in an array containing just that
element, but... it's sad...


Il giorno lun 5 feb 2024 alle ore 01:48 David G. Johnston <
david.g.johns...@gmail.com> ha scritto:

> On Sun, Feb 4, 2024 at 5:39 PM Adrian Klaver 
> wrote:
>
>>
>> >
>> > attribute_list is  an array of composite type (with 20 fields).
>>
>> I am trying to wrap my head around "array of composite type". Please
>> provide an example.
>>
>
> ARRAY[ (1,2)::point, (3,4)::point ]::point[]
>
> The main problem is the concept of writing "from($1)" in any query makes
> no sense, you cannot parameterize a from clause directly like that.  You
> have to put the value somewhere an expression is directly allowed.
>
> David J.
>


-- 
Domenico L.

per stupire mezz'ora basta un libro di storia,
io cercai di imparare la Treccani a memoria... [F.d.A.]


Re: Deleting duplicate rows using ctid ?

2024-02-05 Thread David G. Johnston
On Mon, Feb 5, 2024 at 4:09 PM David Gauthier  wrote:

>
> I want the result to be just 2 recs, one for each dog.
>

My present goto link for this question:


https://harshitjain.home.blog/2019/06/17/postgresql-how-to-delete-duplicate-rows-in-postgresql/

David J.


Deleting duplicate rows using ctid ?

2024-02-05 Thread David Gauthier
I have a table with 4 columns, none of them unique.  I want to delete all
but one of the duplicate records.  I think there's a way to do this with
ctid.  Examples that pop up in google searches always rely on something
like a unique (id) field, such as a primary key (no good in my case).

create table dogs (variety varchar, name varchar, color varchar, age int);
insert into dogs values
  ('lab','moby','white',12),
  ('lab','moby','white',12),
  ('spaniel','max','black',13),
  ('spaniel','max','black',13),
  ('lab','moby','white',12);

I want the result to be just 2 recs, one for each dog.
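One common ctid-based pattern (a sketch against the table above; IS NOT
DISTINCT FROM is used so that rows containing NULLs are still treated as
duplicates) is a self-join that keeps only the row with the lowest ctid in
each duplicate group:

delete from dogs a
  using dogs b
  where a.ctid > b.ctid
    and a.variety is not distinct from b.variety
    and a.name    is not distinct from b.name
    and a.color   is not distinct from b.color
    and a.age     is not distinct from b.age;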


Re: Improving pg_dump performance when handling large numbers of LOBs

2024-02-05 Thread Adrian Klaver

On 2/5/24 12:32, Ken Marshall wrote:

On Mon, Feb 05, 2024 at 09:17:53PM +0100, Andreas Joseph Krogh wrote:


På mandag 05. februar 2024 kl. 21:10:53, skrev Wyatt Tellis <
wyatt.tel...@gmail.com >:

No, we don't have the ability to make schema changes and the schema in the
PG15 copy needs to match what's in the PG 12 versions

Well then, I guess it boils down to how badly the ones in charge wants this
migration…

--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com 
www.visena.com 
  


Hi,

Can you use logical replication? Then you aren't under the gun to make
a fast copy.


https://www.postgresql.org/docs/current/logical-replication-restrictions.html

"Large objects (see Chapter 35) are not replicated. There is no 
workaround for that, other than storing data in normal tables."





Regards,
Ken




--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Improving pg_dump performance when handling large numbers of LOBs

2024-02-05 Thread Tom Lane
Wyatt Tellis  writes:
> Yes, the LOBs themselves are tiny, but there are a lot of them (~150
> million) which seem to be slowing down pg_dump.  Note, we did not
> design/build this system and agree that use of LOBs for this purpose was
> not necessary.

I don't know of anything much you can do at the user level.  But there
is a patchset under development to improve pg_dump's behavior with
tons of blobs:

https://www.postgresql.org/message-id/842242.1706287...@sss.pgh.pa.us

That's intended for v17, and I'm not sure how tough it might be to
adapt to v15 pg_dump, but maybe you could make that happen.  In any
case, more eyeballs reviewing that patchset would be welcome.

regards, tom lane




Re: Improving pg_dump performance when handling large numbers of LOBs

2024-02-05 Thread Ken Marshall
On Mon, Feb 05, 2024 at 09:17:53PM +0100, Andreas Joseph Krogh wrote:
> 
> På mandag 05. februar 2024 kl. 21:10:53, skrev Wyatt Tellis <
> wyatt.tel...@gmail.com >:
> 
> No, we don't have the ability to make schema changes and the schema in the 
> PG15 copy needs to match what's in the PG 12 versions
> 
> Well then, I guess it boils down to how badly the ones in charge wants this 
> migration…
> 
> --
> Andreas Joseph Krogh
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com 
> www.visena.com 
>  

Hi,

Can you use logical replication? Then you aren't under the gun to make
a fast copy.

Regards,
Ken




Re: Improving pg_dump performance when handling large numbers of LOBs

2024-02-05 Thread Andreas Joseph Krogh


På mandag 05. februar 2024 kl. 21:10:53, skrev Wyatt Tellis <
wyatt.tel...@gmail.com >:

No, we don't have the ability to make schema changes and the schema in the 
PG15 copy needs to match what's in the PG 12 versions

Well then, I guess it boils down to how badly the ones in charge want this
migration…






--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com 
www.visena.com 
 


Re: Improving pg_dump performance when handling large numbers of LOBs

2024-02-05 Thread Ron Johnson
Just checking... and I sympathize with your situation.

On Mon, Feb 5, 2024 at 2:56 PM Wyatt Tellis  wrote:

> Yes, the LOBs themselves are tiny, but there are a lot of them (~150
> million) which seem to be slowing down pg_dump.  Note, we did not
> design/build this system and agree that use of LOBs for this purpose was
> not necessary.
>
> Wyatt
>
> On Mon, Feb 5, 2024 at 11:36 AM Ron Johnson 
> wrote:
>
>> On Mon, Feb 5, 2024 at 2:01 PM Wyatt Tellis 
>> wrote:
>>
>>> Hi,
>>>
>>> We've inherited a series of legacy PG 12 clusters that each contain a
>>> database that we need to migrate to a PG 15 cluster. Each database contains
>>> about 150 million large objects totaling about 250GB.
>>>
>>
>> 250*10^9 / (150*10^6) = 1667 bytes.  That's *tiny*.
>>
>> Am I misunderstanding you?
>>
>>>


Re: Improving pg_dump performance when handling large numbers of LOBs

2024-02-05 Thread Wyatt Tellis
No, we don't have the ability to make schema changes and the schema in the
PG15 copy needs to match what's in the PG 12 versions

Wyatt

On Mon, Feb 5, 2024 at 12:05 PM Andreas Joseph Krogh 
wrote:

> På mandag 05. februar 2024 kl. 20:55:43, skrev Wyatt Tellis <
> wyatt.tel...@gmail.com>:
>
> Yes, the LOBs themselves are tiny, but there are a lot of them (~150
> million) which seem to be slowing down pg_dump.  Note, we did not
> design/build this system and agree that use of LOBs for this purpose was
> not necessary.
>
> Well, the data is there nonetheless, is it an option to convert it to
> bytea before migration?
>
>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> 
>
>


Re: Improving pg_dump performance when handling large numbers of LOBs

2024-02-05 Thread Andreas Joseph Krogh


På mandag 05. februar 2024 kl. 20:55:43, skrev Wyatt Tellis <
wyatt.tel...@gmail.com >:

Yes, the LOBs themselves are tiny, but there are a lot of them (~150 million) 
which seem to be slowing down pg_dump. Note, we did not design/build this 
system and agree that use of LOBs for this purpose was not necessary.

Well, the data is there nonetheless, is it an option to convert it to bytea 
before migration?






--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com 
www.visena.com 
 


Re: Improving pg_dump performance when handling large numbers of LOBs

2024-02-05 Thread Wyatt Tellis
Yes, the LOBs themselves are tiny, but there are a lot of them (~150
million) which seem to be slowing down pg_dump.  Note, we did not
design/build this system and agree that use of LOBs for this purpose was
not necessary.

Wyatt

On Mon, Feb 5, 2024 at 11:36 AM Ron Johnson  wrote:

> On Mon, Feb 5, 2024 at 2:01 PM Wyatt Tellis 
> wrote:
>
>> Hi,
>>
>> We've inherited a series of legacy PG 12 clusters that each contain a
>> database that we need to migrate to a PG 15 cluster. Each database contains
>> about 150 million large objects totaling about 250GB.
>>
>
> 250*10^9 / (150*10^6) = 1667 bytes.  That's *tiny*.
>
> Am I misunderstanding you?
>
>>


Re: Improving pg_dump performance when handling large numbers of LOBs

2024-02-05 Thread Adrian Klaver


On 2/5/24 11:35 AM, Ron Johnson wrote:
On Mon, Feb 5, 2024 at 2:01 PM Wyatt Tellis  
wrote:


Hi,

We've inherited a series of legacy PG 12 clusters that each
contain a database that we need to migrate to a PG 15 cluster.
Each database contains about 150 million large objects totaling
about 250GB.


250*10^9 / (150*10^6) = 1667 bytes.  That's *tiny*.

Am I misunderstanding you?



I think it is less about the size of the individual objects than the number
(150 million) of them.


AFAIK that can't be handled by COPY, therefore they have to be 
transferred individually.


--
Adrian Klaver
adrian.kla...@aklaver.com


Re: Improving pg_dump performance when handling large numbers of LOBs

2024-02-05 Thread Ron Johnson
On Mon, Feb 5, 2024 at 2:01 PM Wyatt Tellis  wrote:

> Hi,
>
> We've inherited a series of legacy PG 12 clusters that each contain a
> database that we need to migrate to a PG 15 cluster. Each database contains
> about 150 million large objects totaling about 250GB.
>

250*10^9 / (150*10^6) = 1667 bytes.  That's *tiny*.

Am I misunderstanding you?

>


Re: How to do faster DML

2024-02-05 Thread veem v
I don't have much experience with Postgres at this scale, though; others may
provide better suggestions. However, at this scale you will have a single
table of ~30TB+. Someone in this thread also pointed to ~180 partitions/table
as a good-to-have limit, and if that is true it would be ~170GB per partition.
That looks bulky for a system where read response time is expected to be in
milliseconds.

On Mon, 5 Feb 2024 at 16:51, Lok P  wrote:

>
>
> On Mon, 5 Feb, 2024, 2:21 am Lok P,  wrote:
>
>
> On Sun, Feb 4, 2024 at 9:18 PM Greg Sabino Mullane 
> wrote:
>
> Partitioning will depend a lot on how you access the data. Is it always
> using that transaction_date column in the where clause? Can you share the
> table definition via \d? We've talked about this table quite a bit, but not
> actually seen what it looks like. We are flying blind a little bit. You
> mentioned your queries are slow, but can you share an explain analyze on
> one of these slow queries?
>
> 45 billion rows is quite manageable. How many rows are in each day? You
> may want to do a larger bucket than 24 hours per partition.
>
> 1)Even after partitioning the target table , to speed up the data load on
> this table , Is there an option to disable the primary and foreign keys and
> re-enable them post data load finishes. Will that be a good idea
>
>
> No.
>
> 3)As the size of the table or each partition is going to be very large and
> this will be a read intensive application, compressing the historical
> partition will help us save the storage space and will also help the read
> queries performance
>
>
> I am not sure what this means. If you are not going to need the data
> anymore, dump the data to deep storage and drop the partition.
>
> Cheers,
> Greg
>
>
>
> Thank you.
>
> The table has ~127 columns of different data types , combinations of
> Numeric, varchar, date etc. And is having current size ~1TB holding
> ~3billion rows currently and the row size is ~300bytes.
>
> Currently it has lesser volume , but in future the daily transaction per
> day which will be inserted into this table will be Max ~500million
> rows/day. And the plan is to persist at least ~3months of transaction data
> which will be around 45billion rows in total. And if all works well , we
> may need to persist ~6 months worth of data in this database in future and
> that will be ~90 billion.
>
> This table will always be queried on the transaction_date column as one of
> the filters criteria. But the querying/search criteria can span from a day
> to a month worth of transaction date data.
>
> When you said "*You may want to do a larger bucket than 24 hours per
> partition.*", do you mean to say partition by weekly or so? Currently as
> per math i.e. 1TB of storage for ~3billion rows. So the daily range
> partition size( to hold ~500million transactions/day) will be around
> ~100-150GB. Won't that be too much data for a single partition to operate
> on, and increasing the granularity further(say weekly) will make the
> partition more bulkier?
>
> What I mean was, we will definitely need the data for querying purposes by
> the users, but just to keep the storage space incontrol (and to help the
> read queries), we were thinking of having the historical partitions
> compressed. And for that , if any compression strategy should we follow on
> postgres?
>
> With regards to loading data to the table faster, wondering why you said '
> *NO*' to load the data first and enabling/Creating the Primary key and
> Foreign key constraint later approach. Because this table is a child table
> and the parent is already having data in it, loading data to this table in
> presence of PK and FK makes it too slow as it tries to validate those for
> each set of rows. So we were thinking if doing it at a later stage at
> oneshot will be a faster approach. Please suggest.
>
> I will try to collect some SELECT query and post the explain analyze.
> Currently we are trying to get rid of the duplicates.
>
> Regards
> Lok
>
>
> Any thoughts, based on above usage pattern?
>
> While I did the maths based on the current stats with limited data sets.
> The partitions size coming as 100 to 150gb as I explained above, if we keep
> it daily range partition. Should we have to make it hourly then?
>
>  So want some experts view if this going to work fine for a read latency
> intensive applications or we should have some different strategy?
>


Improving pg_dump performance when handling large numbers of LOBs

2024-02-05 Thread Wyatt Tellis
Hi,

We've inherited a series of legacy PG 12 clusters that each contain a
database that we need to migrate to a PG 15 cluster. Each database contains
about 150 million large objects totaling about 250GB. When using pg_dump
we've found that it takes a couple of weeks to dump out this much data.
We've tried using the jobs option with the directory format but that seems
to save each LOB separately which makes moving the resulting dump to
another location unwieldy.  Has anyone else had to deal with dumping a
database with these many LOBs?  Are there any suggestions for how to
improve performance?

Thanks,

Wyatt


Re: Question on partitioning

2024-02-05 Thread veem v
On Mon, 5 Feb 2024 at 17:52, Laurenz Albe  wrote:

> On Mon, 2024-02-05 at 03:09 +0530, veem v wrote:
> > In postgresql, Is it possible to partition an existing nonpartitioned
> table having data
> > already residing in it and indexes and constraints defined in it,
> without the need of
> > manually moving the data around, to make it faster? Similarly merging
> multiple partitions
> > to one partition or splitting a single partition into multiple
> partitions?
>
> There is no way to do that.
>
> Yours,
> Laurenz Albe


Thank you very much, Laurenz.

In other databases (for example Oracle) there is SQL syntax to split one
partition into multiple partitions and to merge multiple partitions back into
one, so I was hoping there might be some way to do it in Postgres. Anyway,
thanks for clarifying my doubt on this.

Which means that for any such operation we need to create a new partitioned
table with the desired structure and load it with the data from the existing
partitioned table. Likewise, to convert a non-partitioned table into a
partitioned one, we first have to create the empty partitioned table structure
as needed and then pump the data from the non-partitioned table into the newly
created partitioned table, which means we need some downtime to switch from
the non-partitioned table to the partitioned one. Please correct me if I am
wrong.

Additionally, I see a lot of other restrictions, like:
1) When creating indexes on the partitioned table, the CONCURRENTLY keyword is
not allowed (see the sketch after this list for the usual workaround).
2) While creating a foreign key, a NOT VALID clause is not allowed if the
table is partitioned.
3) While creating indexes on this table or running any ALTER command, SELECT
queries from other processes run longer. Does DDL on the base table take a
lock in Postgres?
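Regarding (1), the usual workaround sketch (object names hypothetical) is to
create the parent index with ON ONLY, build each partition's index
CONCURRENTLY, and then attach them; the parent index becomes valid once every
partition's index has been attached:

create index txn_date_idx on only transactions (transaction_date);

create index concurrently txn_date_idx_2024_02_05
    on transactions_2024_02_05 (transaction_date);

alter index txn_date_idx
    attach partition txn_date_idx_2024_02_05;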




Re: Yum Update nothing provides libarmadillo.so.12()(64bit) needed by gdal36-libs-3.6.4-6PGDG.rhel9.x86_64 from pgdg-common

2024-02-05 Thread Devrim Gündüz
Hi,

On Mon, 2024-02-05 at 08:16 -0800, Marcelo Marques wrote:
> Yes, the EPEL repo is enabled.

(Please keep the list CC'ed)

Armadillo 12 packages *are* in the EPEL repo:

https://dl.fedoraproject.org/pub/epel/9/Everything/x86_64/Packages/a/

If you have a local mirror, make sure that it is enabled. Also make sure
that armadillo is not in the excludes list in either dnf.conf or in the
repo config files.

-HTH

Regards,
-- 
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
Twitter: @DevrimGunduz , @DevrimGunduzTR




Re: Unused indexes

2024-02-05 Thread Ron Johnson
On Mon, Feb 5, 2024 at 9:02 AM arun chirappurath 
wrote:

> Hi All,
>
> Do we have a script to get unused indexes for 30 days and once identified
> do we have an option to disable  and enable when required?
>

The pg_stat_*_tables tables' idx_* columns have accumulated usage since the
last time you started the postmaster.


> In SQL Server we have this option to disable an index, and we need to
> rebuild it to re-enable it
>

Sadly, PG does not have ALTER INDEX ... DISABLE;.


Re: Why doesn't Postgres apply limit on groups when retrieving N results per group using WHERE IN + ORDER BY

2024-02-05 Thread David G. Johnston
On Mon, Feb 5, 2024 at 8:55 AM Ron Johnson  wrote:

>
> Who knows which users are going to be in that list???
>
>
It doesn't matter.  Worst case scenario, there is only one user in the
result and so all 50 rows are their earliest 50 rows.  The system will thus
never need more than the earliest 50 rows per user to answer this question.

David J.


Re: Why doesn't Postgres apply limit on groups when retrieving N results per group using WHERE IN + ORDER BY

2024-02-05 Thread Ron Johnson
On Mon, Feb 5, 2024 at 7:23 AM Sean v  wrote:

> This is related to a question I asked on dbs.stackexchange.com:
> https://dba.stackexchange.com/questions/335501/why-doesnt-postgres-apply-limit-on-groups-when-retrieving-n-results-per-group
>
> But to reiterate - I have a query like this:
>
> SELECT "orders".*
>
> FROM "orders"
>
> WHERE (user_id IN ?, ?, ?)
>
> ORDER BY "orders"."created_at" LIMIT 50
>
[snip]

> So my question is twofold:
> - why doesn't Postgres use the composite index, and then retrieve only the
> minimum necessary amount of rows (50 per user) using the query I posted
> above?
>
>
But your query *does not* list the first 50 rows *per user*.  It only
returns the first 50 rows of:

SELECT "orders".*

FROM "orders"

WHERE (user_id IN ?, ?, ?)

ORDER BY "orders"."created_at"

Who knows which users are going to be in that list???


Unused indexes

2024-02-05 Thread arun chirappurath
Hi All,

Do we have a script to find indexes that have been unused for 30 days, and
once identified, do we have an option to disable them and re-enable them when
required?

In SQL Server we have this option to disable an index, and we need to rebuild
it to re-enable it.

Thanks,
Arun


Why doesn't Postgres apply limit on groups when retrieving N results per group using WHERE IN + ORDER BY

2024-02-05 Thread Sean v
This is related to a question I asked on dbs.stackexchange.com:
https://dba.stackexchange.com/questions/335501/why-doesnt-postgres-apply-limit-on-groups-when-retrieving-n-results-per-group

But to reiterate - I have a query like this:

SELECT "orders".*

FROM "orders"

WHERE (user_id IN ?, ?, ?)

ORDER BY "orders"."created_at" LIMIT 50

I have two indexes - `(user_id)` and `(user_id, created_at)`. Only the
first index is ever used with this query.

If I query for a specific user_id, as expected it uses the composite index.
If I use an inner select and trick the query planner into doing a nested
loop for each user, it no longer uses the composite index.

I can understand that the query planner would not know ahead of time which
users would have those 50 newest orders.

I imagined that it would be clever enough to determine that only 50 results
are needed, and that it could use the `(user_id, created_at)` index to get
50 orders for each user. Then sort and filter those few hundred results in
memory.

Instead what I'm seeing is that it gets all orders for each user using the
`user_id` index and then sorts/filters them all in memory.

Here is an example query plan:

Limit  (cost=45271.94..45272.06 rows=50 width=57) (actual time=13.221..13.234 rows=50 loops=1)
  Buffers: shared hit=12321
  ->  Sort  (cost=45271.94..45302.75 rows=12326 width=57) (actual time=13.220..13.226 rows=50 loops=1)
        Sort Key: orders.created_at
        Sort Method: top-N heapsort  Memory: 36kB
        Buffers: shared hit=12321
        ->  Bitmap Heap Scan on orders orders  (cost=180.85..44862.48 rows=12326 width=57) (actual time=3.268..11.485 rows=12300 loops=1)
              Recheck Cond: (orders.user_id = ANY ('{11,1000,3000}'::bigint[]))
              Heap Blocks: exact=12300
              Buffers: shared hit=12321
              ->  Bitmap Index Scan on index_orders_on_user_id  (cost=0.00..177.77 rows=12326 width=0) (actual time=1.257..1.258 rows=12300 loops=1)
                    Index Cond: (orders.user_id = ANY ('{11,1000,3000}'::bigint[]))
                    Buffers: shared hit=21
Planning:
  Buffers: shared hit=6
Execution time: 13.263 ms

The table I'm querying has roughly 50,000,000 orders, with an even
distribution of ~4000 orders per user.

I have found that I can speed this up significantly using CROSS JOIN
LATERAL and it will use the composite index, but I'm struggling to
understand WHY the CROSS JOIN LATERAL is needed here for it to use the
index.
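For reference, the CROSS JOIN LATERAL shape being described is presumably
along these lines (keeping the ? placeholders from the original query); each
lateral subquery can walk the (user_id, created_at) index and stop after 50
rows:

SELECT o.*
FROM unnest(ARRAY[?, ?, ?]::bigint[]) AS u(user_id)
CROSS JOIN LATERAL (
    SELECT *
    FROM orders
    WHERE orders.user_id = u.user_id
    ORDER BY created_at
    LIMIT 50
) AS o
ORDER BY o.created_at
LIMIT 50;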

I've tried tweaking costs, disabling bitmap scans, etc, so it seems like
this is a functional limitation rather than something to do with
cost/statistics.

So my question is twofold:
- why doesn't Postgres use the composite index, and then retrieve only the
minimum necessary amount of rows (50 per user) using the query I posted
above?

 - If it is a functional limitation, is it lack of implementation, or is
there a deeper incompatibility with how the query planner works that would
prevent it from being able to do this?

Thanks!


Re: Question on partitioning

2024-02-05 Thread Laurenz Albe
On Mon, 2024-02-05 at 03:09 +0530, veem v wrote:
> In postgresql, Is it possible to partition an existing nonpartitioned table 
> having data
> already residing in it and indexes and constraints defined in it, without the 
> need of
> manually moving the data around, to make it faster? Similarly merging 
> multiple partitions
> to one partition or splitting a single partition into multiple partitions?

There is no way to do that.

Yours,
Laurenz Albe




Re: How to do faster DML

2024-02-05 Thread Lok P
On Mon, 5 Feb, 2024, 2:21 am Lok P,  wrote:


On Sun, Feb 4, 2024 at 9:18 PM Greg Sabino Mullane 
wrote:

Partitioning will depend a lot on how you access the data. Is it always
using that transaction_date column in the where clause? Can you share the
table definition via \d? We've talked about this table quite a bit, but not
actually seen what it looks like. We are flying blind a little bit. You
mentioned your queries are slow, but can you share an explain analyze on
one of these slow queries?

45 billion rows is quite manageable. How many rows are in each day? You may
want to do a larger bucket than 24 hours per partition.

1)Even after partitioning the target table , to speed up the data load on
this table , Is there an option to disable the primary and foreign keys and
re-enable them post data load finishes. Will that be a good idea


No.

3)As the size of the table or each partition is going to be very large and
this will be a read intensive application, compressing the historical
partition will help us save the storage space and will also help the read
queries performance


I am not sure what this means. If you are not going to need the data
anymore, dump the data to deep storage and drop the partition.

Cheers,
Greg



Thank you.

The table has ~127 columns of different data types (combinations of numeric,
varchar, date, etc.). Its current size is ~1TB, holding ~3 billion rows, and
the row size is ~300 bytes.

Currently the volume is lower, but in future the daily transactions inserted
into this table will be at most ~500 million rows/day. The plan is to persist
at least ~3 months of transaction data, which will be around 45 billion rows
in total. If all works well, we may need to persist ~6 months worth of data in
this database in future, and that will be ~90 billion rows.

This table will always be queried on the transaction_date column as one of the
filter criteria, but the querying/search criteria can span from a day to a
month worth of transaction date data.

When you said "*You may want to do a larger bucket than 24 hours per
partition.*", do you mean partitioning by week or so? Per the current math,
i.e. 1TB of storage for ~3 billion rows, the daily range partition size (to
hold ~500 million transactions/day) will be around ~100-150GB. Won't that be
too much data for a single partition to operate on, and won't making the
bucket larger (say, weekly) make the partitions even bulkier?

What I meant was: we will definitely need the data for querying purposes by
the users, but just to keep the storage space in control (and to help the read
queries), we were thinking of having the historical partitions compressed. And
for that, is there any compression strategy we should follow on Postgres?

With regards to loading data into the table faster, I'm wondering why you said
'*NO*' to the approach of loading the data first and enabling/creating the
primary key and foreign key constraints later. Because this table is a child
table and the parent already has data in it, loading data into this table with
the PK and FK in place makes it too slow, as it tries to validate them for
each set of rows. So we were thinking that doing it at a later stage in one
shot would be a faster approach. Please suggest.

I will try to collect some SELECT queries and post the explain analyze output.
Currently we are trying to get rid of the duplicates.

Regards
Lok


Any thoughts, based on the above usage pattern?

I did the maths based on the current stats with a limited data set: the
partition size comes out to 100 to 150GB, as explained above, if we keep daily
range partitions. Should we make them hourly instead?

So I would like some expert views on whether this is going to work fine for a
read-latency-intensive application, or whether we should take a different
strategy.


Re: Yum Update nothing provides libarmadillo.so.12()(64bit) needed by gdal36-libs-3.6.4-6PGDG.rhel9.x86_64 from pgdg-common

2024-02-05 Thread Devrim Gündüz
Hi,

On Fri, 2024-02-02 at 11:05 -0800, Marcelo Marques wrote:
> *yum update nothing provides libarmadillo.so.12()(64bit) needed by
> gdal36-libs-3.6.4-6PGDG.rhel9.x86_64 from pgdg-common  *

Is EPEL repo enabled? If not, you can refer to our docs about installing
EPEL on RHEL/Rocky:

https://yum.postgresql.org/howto/

Regards,

-- 
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
Twitter: @DevrimGunduz , @DevrimGunduzTR