Re: How to do faster DML

2024-02-03 Thread Lok P
On Sun, Feb 4, 2024 at 12:50 AM Francisco Olarte 
wrote:

> ...



Thank you.

Yes, I think the "on conflict do nothing;" option looks promising, as it will
skip the duplicates on a PK violation but keep the load continuing for the
subsequent rows.

However, we have ~5 billion rows in the base table, and out of those we were
expecting almost half, i.e. ~2 billion, to be duplicates. And as you said,
doing the inserts using the "order by CTID offset" approach would cause one
full sequential scan of the whole table for loading each chunk/10M of rows,
and that would take a long time I believe.

I am still trying to understand the other approach which you suggested. I am
not able to understand "you can select where index_col > last order by
index_col limit 10M".
However, to get the max ID value of the last 10M loaded rows in the target,
do you mean that having a PK index created on that target table column (ID)
will help, and we can save the max(ID) value subsequently in another table
to fetch and keep loading from the source table (as ID > Max_ID stored in
the temp table)?
OR
Would it be better to do it in one shot only, but by setting a higher value
of some parameters like "maintenance_work_mem" or "max_parallel_workers"?


Re: How to do faster DML

2024-02-03 Thread Peter J. Holzer
On 2024-02-03 19:25:12 +0530, Lok P wrote:
> Apology. One correction, the query is like below, i.e. the filter will be on
> ctid, which I believe is the equivalent of rowid in Oracle, and we will not
> need the index on the Id column then.
> 
>  But, it still runs long, so thinking of any other way to make the duplicate
> removal faster?
> 
> Also wondering, could the index creation which took ~2.5hrs+ have been made
> any faster by allowing more db resources through some session-level db
> parameter setting?
> 
> create table TAB1_New
> as
> SELECT  * from TAB1 A
> where CTID in
>       (select min(CTID) from TAB1
>       group by ID having count(ID)>=1 );

That »having count(ID)>=1« seems redundant to me. Surely every id which
occurs in the table occurs at least once?

Since you want ID to be unique I assume that it is already almost
unique - so only a small fraction of the ids will be duplicates. So I
would start with creating a list of duplicates:

create table tab1_dups as
select id, count(*) from tab1 group by id having count(*) > 1;

This will still take some time because it needs to build a temporary
structure large enough to hold a count for each individual id. But at
least then you'll have a much smaller table to use for further cleanup.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"
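
A sketch of one possible follow-up cleanup using that duplicates list,
assuming an index on tab1(id) and, per the premise above, that only a small
fraction of the ids are duplicated (untested at this scale):

-- for every duplicated id, keep the physically-first row, delete the rest
delete from tab1 a
using tab1 b
where a.id = b.id
  and a.ctid > b.ctid
  and a.id in (select id from tab1_dups);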




Re: How to do faster DML

2024-02-03 Thread Francisco Olarte
On Sat, 3 Feb 2024 at 19:29, Greg Sabino Mullane  wrote:
...
> Given the size of your table, you probably want to divide that up.
> As long as nothing is changing the original table, you could do:
>
> insert into mytable2 select * from mytable1 order by ctid limit 10000000
> offset 0;
> insert into mytable2 select * from mytable1 order by ctid limit 10000000
> offset 10000000;
> insert into mytable2 select * from mytable1 order by ctid limit 10000000
> offset 20000000;

Is it me or does this have the faint smell of quadratic behaviour? I
mean, you need to read and discard the first 10M to do offset 10M (
also I believe ctid order is implied in sequential scan, but not so
sure, if not it would need a full table sort on each pass ).

When doing things like this, I normally have some kind of unique index
and do it by selecting with limit above the last read value (stored
when doing it via code, as I normally load first, index later, so I
cannot select max() fast on the target). Or, with some kind of
"sparse" index (meaning, groups much less than the batch size) and a
little code you can select where index_col > last order by index_col
limit 10M, remember the last received index_col and reselect, discarding
missing (or just reselect it when doing your on conflict do nothing,
which also has a fast select max(id) due to the PK; it will work if it
has an index on the id column on the original even if not unique) to
avoid that.

Also, I'm not sure if ctid is ordered and you can select where
ctid > last, ordered; if that works it probably is faster for immutable
origins.

Francisco Olarte.
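
For what it's worth, PostgreSQL 14 and later can execute a range condition on
ctid as a TID Range Scan, so for an immutable origin the batches could be
driven by heap block ranges instead of OFFSET. A sketch (the block boundaries
are illustrative, not derived from the real table):

-- each pass reads only the addressed range of heap blocks
insert into mytable2
select * from mytable1
where ctid >= '(0,0)'::tid and ctid < '(131072,0)'::tid
on conflict do nothing;

insert into mytable2
select * from mytable1
where ctid >= '(131072,0)'::tid and ctid < '(262144,0)'::tid
on conflict do nothing;

-- ...and so on, up to roughly
-- pg_relation_size('mytable1') / current_setting('block_size')::int blocks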




Re: How to do faster DML

2024-02-03 Thread Greg Sabino Mullane
As a general rule, avoid heavy subselects like that. You don't need to
build a full list of duplicates before starting. Another approach:

create table mytable2 (like mytable1);

alter table mytable2 add primary key (id);

insert into mytable2 select * from mytable1 on conflict do nothing;

Given the size of your table, you probably want to divide that up.
As long as nothing is changing the original table, you could do:

insert into mytable2 select * from mytable1 order by ctid limit 10000000
offset 0;
insert into mytable2 select * from mytable1 order by ctid limit 10000000
offset 10000000;
insert into mytable2 select * from mytable1 order by ctid limit 10000000
offset 20000000;
etc.

Cheers,
Greg


Re: How to do faster DML

2024-02-03 Thread Lok P
On Sat, Feb 3, 2024 at 7:37 PM Ron Johnson  wrote:

> On Sat, Feb 3, 2024 at 8:55 AM Lok P  wrote:
>
>> ...
>>
>  Aurora is not Postgresql, so configurations might not work.  Having said
> that...
> https://www.postgresql.org/docs/15/how-parallel-query-works.html
>
> And have you analyzed the table lately?  Also, what's your work_mem
> and maintenance_work_mem?
>

Thank you.

Below are the values of the default parameters in this instance

show max_worker_processes;              -- 128
show max_parallel_workers_per_gather;   -- 4
show max_parallel_workers;              -- 32
show max_parallel_maintenance_workers;  -- 2
show maintenance_work_mem;              -- 4155MB
show work_mem;                          -- 8MB
show shared_buffers;                    -- 22029684 (8kB pages, i.e. ~168GB)

When I ran the CTAS queries and the index creation process, I did not have a
very clear idea of how these parameters relate to each other or how each one
helps those operations, but I set a few of them as below before triggering
those in the same session.

set max_parallel_workers_per_gather=16;
SET max_parallel_maintenance_workers TO 16;
SET maintenance_work_mem TO '16 GB';

The instance has a total of ~256 GB memory, so how should I adjust/bump these
values when running heavy SELECT queries doing a large sequential scan, OR a
large index creation process, OR any SELECT query with heavy sorting/"order
by" operations, OR heavy JOINs?

I have not analyzed the table manually though, but seeing the autovacuum and
autoanalyze columns getting populated in pg_stat_user_tables, I thought it
must be doing that automatically.

By the way, if we run "analyze tab1" on this ~1.5TB table, will that run
long, and will any of the above parameters help to expedite that ANALYZE
operation too, if I run the ANALYZE manually?
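
For context, ANALYZE reads only a bounded sample (roughly 300 *
default_statistics_target rows), not the whole table, so a manual run is far
cheaper than the full-scan operations above. A minimal sketch:

show default_statistics_target;   -- 100 by default; larger means a bigger sample
analyze verbose tab1;             -- verbose reports the pages and rows sampled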

Regards
Lok



Re: How to do faster DML

2024-02-03 Thread Ron Johnson
On Sat, Feb 3, 2024 at 8:55 AM Lok P  wrote:

> ...
>
 Aurora is not Postgresql, so configurations might not work.  Having said
that...
https://www.postgresql.org/docs/15/how-parallel-query-works.html

And have you analyzed the table lately?  Also, what's your work_mem
and maintenance_work_mem?


Re: How to do faster DML

2024-02-03 Thread Todd Lewis
I copy/pasted your question into ChatGPT, and it gave me 10 specific
suggestions.  Have you tried those?

On Sat, Feb 3, 2024 at 10:55 AM Lok P  wrote:

> ...

-- 
Todd Lewis
tle...@brickabode.com


Re: How to do faster DML

2024-02-03 Thread Lok P
Apology. One correction, the query is like below, i.e. the filter will be on
ctid, which I believe is the equivalent of rowid in Oracle, and we will not
need the index on the Id column then.

 But, it still runs long, so thinking of any other way to make the duplicate
removal faster?

Also wondering, could the index creation which took ~2.5hrs+ have been made
any faster by allowing more db resources through some session-level db
parameter setting?

create table TAB1_New
as
SELECT  * from TAB1 A
where CTID in
  (select min(CTID) from TAB1
  group by ID having count(ID)>=1 );
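
An equivalent formulation that avoids the huge IN list over ctids is to keep
the first physical row per id with a window function (a sketch; untested at
this scale, and still one full scan plus a large sort):

create table tab1_new as
select *
from (select t.*, row_number() over (partition by id order by ctid) as rn
      from tab1 t) s
where rn = 1;

alter table tab1_new drop column rn;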


On Sat, Feb 3, 2024 at 5:50 PM Lok P  wrote:

> ...


How to do faster DML

2024-02-03 Thread Lok P
Hello All,
A non-partitioned table has ~4.8 billion rows in it, a data size of ~1.4TB,
a row size of ~313 bytes, and ~127 columns. It has got approx ~1 billion+
duplicate rows inserted into it, and we want to get the duplicate data
removed from this table and create a PK/unique constraint back so as to not
have the duplicate values in future. We are struggling to do the same.

Teammates suggested doing this using the CTAS method, i.e. create a new table
with the unique record set and then drop the main table. Something as below:

create table TAB1_New
as
SELECT  * from TAB1 A
where ID in
  (select min(ID) from TAB1
  group by ID having count(ID)>=1 );

But for the above to work faster, they mentioned having an index created on
the column using which the duplicate check will be performed, i.e. the ID
column. So, creating the index itself took ~2hrs+ and the index size now
shows as ~116GB.


Create index idx1 on TAB1(ID)

And now we are running the SELECT subquery part of the CTAS statement to see
if it gives the correct unique records count. It ran for 2.5 hrs and then we
killed it. Below is the plan for the same.

explain
select min(ID) from TAB1 A group by ID having count(ID)>=1

GroupAggregate  (cost=0.71..6025790113.87 rows=29432861 width=46)
  Group Key: ID
  Filter: (count(ID) >= 1)
  ->  Index Only Scan using idx1 on TAB1 a  (cost=0.71..5988060903.17
rows=4883397120 width=14)

I want to understand if this can be done faster in any way. Also, I am
worried that creating the PK constraint/index back after deleting the
duplicates is also going to run forever. Is there any way we can make these
heavy operations faster on Postgres by facilitating more database resources
through some parameter setup, like a parallel hint etc.? We have the
pg_hint_plan extension added, but we are not seeing the parallel hint
enforced when adding it to the query.

In Oracle we have parallel hints, direct path read/write for faster
read/write operations, parallel index scans etc. Is anything similar
available in Aurora Postgres to facilitate more horsepower and speed up the
batch operations? And how can we monitor the progress of any running query?
Just to note: it's a db.r7g.8xlarge Aurora Postgres instance, 32 vCPU,
256GB RAM. PG version 15.4.
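
On the progress-monitoring question: core PostgreSQL 15 ships progress views
for several maintenance commands, plus pg_stat_activity for everything else.
A sketch (availability of these views on Aurora is assumed, not verified):

-- CREATE INDEX progress: phase, blocks/tuples done vs. total
select * from pg_stat_progress_create_index;

-- what is running right now and what it is waiting on
select pid, now() - query_start as runtime, state, wait_event_type, query
from pg_stat_activity
where state = 'active';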

Regards
Lok