Re: How to do faster DML
On Sun, Feb 4, 2024 at 12:50 AM Francisco Olarte wrote:
> [...]

Thank you. Yes, I think the "on conflict do nothing" option looks promising, as it will discard the duplicate in case of a PK violation but keep the load going for the subsequent rows.
However, as we have ~5 billion rows in the base table, we were expecting almost half, i.e. ~2 billion, to be duplicates. And as you said, doing the inserts using the "order by ctid ... offset" approach would cause one full sequential scan of the whole table for each 10M-row chunk loaded, which I believe would take a very long time.

I am still trying to understand the other approach you suggested. I was not able to follow "you can select where index_col > last order by index_col limit 10M". However, to get the max ID value of the last 10M loaded rows in the target, do you mean that having a PK index created on that target table column (ID) will help, and that we can save the max(ID) value in another table so as to keep loading from the source table (as ID > the Max_ID stored in the temp table)?

Or would it be better to do it in one shot, but by setting a higher value for some parameters like "maintenance_work_mem" or "max_parallel_workers"?
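For what it's worth, the keyset idea Francisco describes could look roughly like the sketch below (table and column names are illustrative: tab1 is the source, tab1_new the deduplicated target with a PK on id; :last_id stands for a value your driving code remembers between batches):

```sql
-- Keyset-style batched loading: each pass starts from the last id
-- already loaded, so nothing is read and discarded the way OFFSET does.
-- Assumes an index on tab1(id) (need not be unique) and a PK on
-- tab1_new(id).

-- Cheap thanks to the PK index on the target:
SELECT max(id) FROM tab1_new;   -- remember this as :last_id

INSERT INTO tab1_new
SELECT * FROM tab1
WHERE id > :last_id
ORDER BY id
LIMIT 10000000
ON CONFLICT (id) DO NOTHING;
-- repeat, refreshing :last_id, until a pass inserts 0 rows
```

Duplicates inside a batch are absorbed by the ON CONFLICT clause, and remaining copies of an already-loaded id are skipped by the strict `id > :last_id` predicate.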
Re: How to do faster DML
On 2024-02-03 19:25:12 +0530, Lok P wrote:
> Apology. One correction, the query is like below. I.e. the filter will be on
> ctid, which I believe is equivalent of rowid in oracle, and we will not need
> the index on Id column then.
>
> But, it still runs long, so thinking any other way to make the duplicate
> removal faster?
>
> Also wondering, the index creation which took ~2.5hrs+, would that have been
> made faster any possible way by allowing more db resource through some
> session level db parameter setting?
>
> create table TAB1_New
> as
> SELECT * from TAB1 A
> where CTID in
> (select min(CTID) from TAB1
>  group by ID having count(ID)>=1 );

That »having count(ID)>=1« seems redundant to me. Surely every id which occurs in the table occurs at least once?

Since you want ID to be unique, I assume that it is already almost unique - so only a small fraction of the ids will be duplicates. So I would start with creating a list of duplicates:

create table tab1_dups as
select id, count(*) from tab1
group by id
having count(*) > 1;

This will still take some time because it needs to build a temporary structure large enough to hold a count for each individual id. But at least then you'll have a much smaller table to use for further cleanup.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        | -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |    challenge!"
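One hedged way to use that small duplicate list for the cleanup itself (a sketch, not tested at this scale; it relies on the tid comparison operators, which are available from PostgreSQL 14 on, so they should work on the 15.4 instance discussed here):

```sql
-- For each duplicated id, delete every copy except the one with the
-- lowest ctid. Only ids present in tab1_dups are touched, so the bulk
-- of the table is never visited by the DELETE's inner check.
DELETE FROM tab1 a
USING tab1_dups d
WHERE a.id = d.id
  AND EXISTS (SELECT 1
              FROM tab1 b
              WHERE b.id = a.id
                AND b.ctid < a.ctid);  -- a is not the first physical copy
```

After the DELETE, the unique index / PK creation only has to cope with a table that is already duplicate-free.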
Re: How to do faster DML
On Sat, 3 Feb 2024 at 19:29, Greg Sabino Mullane wrote:
...
> Given the size of your table, you probably want to divide that up.
> As long as nothing is changing the original table, you could do:
>
> insert into mytable2 select * from mytable1 order by ctid limit 10000000 offset 0;
> insert into mytable2 select * from mytable1 order by ctid limit 10000000 offset 10000000;
> insert into mytable2 select * from mytable1 order by ctid limit 10000000 offset 20000000;

Is it me or does this have the faint smell of quadratic behaviour? I mean, you need to read and discard the first 10M to do offset 10M (also I believe ctid order is implied in a sequential scan, but I'm not so sure; if not it would need a full table sort on each pass).

When doing things like this, I normally have some kind of unique index and do it by selecting with limit above the last read value (stored when doing it via code, as I normally load first, index later, so I cannot select max() fast on the target). Or, with some kind of "sparse" index (meaning groups much smaller than the batch size) and a little code, you can select where index_col > last order by index_col limit 10M, remember the last received index_col, and reselect discarding missing rows (or just reselect it when doing your on conflict do nothing, which also has a fast select max(id) due to the PK; it will work if there is an index on the id column on the original even if not unique) to avoid that.

Also, I'm not sure whether ctid is ordered and you can select where ctid > last; if that works it is probably faster for immutable origins.

Francisco Olarte.
Re: How to do faster DML
As a general rule, avoid heavy subselects like that. You don't need to build a full list of duplicates before starting. Another approach:

create table mytable2 (like mytable1);
alter table mytable2 add primary key (id);
insert into mytable2 select * from mytable1 on conflict do nothing;

Given the size of your table, you probably want to divide that up. As long as nothing is changing the original table, you could do:

insert into mytable2 select * from mytable1 order by ctid limit 10000000 offset 0;
insert into mytable2 select * from mytable1 order by ctid limit 10000000 offset 10000000;
insert into mytable2 select * from mytable1 order by ctid limit 10000000 offset 20000000;

etc.

Cheers,
Greg
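If ctid-based batching is attractive, a variant worth noting (a sketch; the block-number boundaries are illustrative): from PostgreSQL 14 on, a range filter on ctid can be executed as a TID Range Scan, so each batch reads only its own block range rather than re-reading and discarding rows the way growing OFFSETs do:

```sql
-- Batch by physical block ranges. '(1000000,0)' means "block 1000000,
-- item 0"; step the range forward in later batches until past the
-- table's last block (roughly pg_relation_size('mytable1') divided by
-- the 8kB block size).
INSERT INTO mytable2
SELECT * FROM mytable1
WHERE ctid >= '(0,0)'::tid
  AND ctid <  '(1000000,0)'::tid
ON CONFLICT DO NOTHING;
```

Each batch's cost is then proportional to its own size, not to how far into the table it starts.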
Re: How to do faster DML
On Sat, Feb 3, 2024 at 7:37 PM Ron Johnson wrote:
> [...]
>
> Aurora is not Postgresql, so configurations might not work. Having said
> that...
> https://www.postgresql.org/docs/15/how-parallel-query-works.html
>
> And have you analyzed the table lately? Also, what's your work_mem
> and maintenance_work_mem?

Thank you.
Below are the values of the default parameters in this instance:

show max_worker_processes;              -- 128
show max_parallel_workers_per_gather;   -- 4
show max_parallel_workers;              -- 32
show max_parallel_maintenance_workers;  -- 2
show maintenance_work_mem;              -- 4155MB
show work_mem;                          -- 8MB
show shared_buffers;                    -- 22029684

When I ran the CTAS queries and the index creation process, I did not have a very clear idea of how these relate to each other and help each of the operations, but I set a few of them as below before triggering those in the same session:

set max_parallel_workers_per_gather = 16;
set max_parallel_maintenance_workers = 16;
set maintenance_work_mem = '16 GB';

The instance has ~256 GB of memory in total, so how should I adjust/bump these values when running heavy SELECT queries doing a large sequential scan, or a large index creation process, or any SELECT query with heavy sorting/"order by" operations, or heavy JOINs?

I have not analyzed the table manually, but seeing the auto_vacuum and auto_analyze columns getting populated in pg_stat_user_tables, I assumed it was being done automatically. By the way, if we run "analyze tab1" on this 1.5TB table, will that run long, and will any of the above parameters help to expedite that ANALYZE operation if I run it manually?

Regards
Lok
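For reference, one plausible starting point for a single heavy maintenance session on an instance of this size (the numbers are illustrative guesses, not tuned recommendations; work_mem in particular is allocated per sort/hash node and per parallel worker, so it multiplies quickly):

```sql
-- Session-level settings for one big CREATE INDEX / CTAS, assuming the
-- session has the machine largely to itself:
SET maintenance_work_mem = '8GB';          -- memory for index builds / VACUUM
SET max_parallel_maintenance_workers = 8;  -- parallel btree index builds
SET max_parallel_workers_per_gather = 8;   -- parallel seq scans / aggregates
SET work_mem = '256MB';                    -- per sort/hash, per worker: keep modest
```

Note that max_parallel_workers and max_worker_processes cap how many workers the per-gather and maintenance settings can actually obtain.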
Re: How to do faster DML
On Sat, Feb 3, 2024 at 8:55 AM Lok P wrote:
> [...]

Aurora is not Postgresql, so configurations might not work. Having said that...

https://www.postgresql.org/docs/15/how-parallel-query-works.html

And have you analyzed the table lately? Also, what's your work_mem and maintenance_work_mem?
Re: How to do faster DML
I copy/pasted your question into ChatGPT, and it gave me 10 specific suggestions. Have you tried those?

On Sat, Feb 3, 2024 at 10:55 AM Lok P wrote:
> [...]

--
Todd Lewis
tle...@brickabode.com
Re: How to do faster DML
Apology. One correction, the query is like below, i.e. the filter will be on ctid, which I believe is the equivalent of rowid in Oracle, so we will not need the index on the ID column then.

But it still runs long, so thinking of any other way to make the duplicate removal faster?

Also wondering, the index creation took ~2.5hrs+; could that have been made faster in any possible way by allowing more db resources through some session-level db parameter setting?

create table TAB1_New
as
SELECT * from TAB1 A
where CTID in
(select min(CTID) from TAB1
 group by ID having count(ID)>=1 );

On Sat, Feb 3, 2024 at 5:50 PM Lok P wrote:
> [...]
How to do faster DML
Hello All,
A non-partitioned table has ~4.8 billion rows in it, with data size ~1.4TB, row size ~313 bytes, and ~127 columns. It has got approx ~1 billion+ duplicate rows inserted in it, and we want to get the duplicate data removed from this table and create a PK/unique constraint back so as to not have duplicate values in future. We are struggling to do the same.

Teammates suggested doing this using the CTAS method, i.e. create a new table with the unique record set and then drop the main table. Something as below:

create table TAB1_New
as
SELECT * from TAB1 A
where ID in
(select min(ID) from TAB1
 group by ID having count(ID)>=1 );

But for the above to work faster, they mentioned having an index created on the column using which the duplicate check will be performed, i.e. the ID column. So, creating the index itself took ~2hrs+ and the index size now shows as ~116GB.

Create index idx1 on TAB1(ID);

And now, running the SELECT subquery part of the CTAS statement to see if it gives the correct unique record count, it ran for 2.5 hrs and then we killed it. Below is the plan for the same.

explain
select min(ID) from TAB1 A group by ID having count(ID)>=1;

GroupAggregate (cost=0.71..6025790113.87 rows=29432861 width=46)
  Group Key: ID
  Filter: (count(ID) >= 1)
  ->  Index Only Scan using idx1 on TAB1 a (cost=0.71..5988060903.17 rows=4883397120 width=14)

I want to understand if in any way this can be done faster. Also, I am worried that creating the PK constraint/index back after deleting the duplicates is also going to run forever. Is there any way we can make these heavy operations faster on Postgres by facilitating more database resources through some parameter setup, like a parallel hint etc.? We have the pg_hint_plan extension added, but are not seeing the parallel hint enforced when adding it to the query.

In Oracle we have parallel hints, direct path read/write for faster read/write operations, parallel index scans etc. available; is anything similar available in Aurora Postgres to facilitate more horsepower and speed up the batch operations? And how can we monitor the progress of any running query?

Just to note - it's a db.r7g.8xlarge Aurora Postgres instance, 32 vCPU, 256GB RAM, PG version 15.4.

Regards
Lok
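On the "how can we monitor progress" part of the question above, recent PostgreSQL versions expose progress views that can be polled from a second session (a sketch; which view applies depends on the operation):

```sql
-- What is running right now, and for how long:
SELECT pid, now() - query_start AS runtime, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle';

-- Progress of a running CREATE INDEX (PostgreSQL 12+):
SELECT phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;
```

There is no direct equivalent of Oracle's V$SESSION_LONGOPS for arbitrary queries, but for index builds, VACUUM, ANALYZE, and COPY the corresponding pg_stat_progress_* views give block- or tuple-level progress.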