Re: UPDATE many records

2020-01-10 Thread Israel Brewster
> On Jan 8, 2020, at 7:52 AM, stan  wrote:
> 
> On Tue, Jan 07, 2020 at 12:20:12PM -0900, Israel Brewster wrote:
>>> On Jan 7, 2020, at 12:15 PM, Alan Hodgson  wrote:
>>> 
>>> On Tue, 2020-01-07 at 11:58 -0900, Israel Brewster wrote:
> 
Really? Why? With the update I am only changing data - I'm not adding
any additional data, so the total size should stay the same, right?
I'm obviously missing something… :-)
 
>>> 
>>> PostgreSQL keeps the old row until it gets vacuumed, as it needs to be
>>> visible to other transactions. Not only that, but every index record
>>> gets updated to point to the location of the new data row too (excluding
>>> HOT), and those old index blocks also need to get vacuumed. And none of
>>> those rows can get removed until your update finishes.
>>> 
>>> I know this isn't universally true with HOT and fillfactor etc. but with
>>> an update this big I think it's safe to say most of the space will get
>>> doubled.
>>> 
>>> Plus you'll get a ton of write-ahead logs.
>> 
>> Gotcha. Batches with VACUUM it is! Thanks for the info.
>> 
> I'd love to see you report on how this went. 

So after determining that I did, in fact, have enough disk space to duplicate 
the data, I moved forward with the CREATE TABLE … AS SELECT … method. Running 
the CREATE TABLE command took around 12 minutes for my almost 64 million rows. I 
then created indexes, set constraints, set defaults, etc. on the new table until 
it exactly matched the old one (other than the changed data, of course). This 
probably took another 5-10 minutes. Two quick ALTER TABLE … RENAME TO … commands 
later, and the new data was live. The only issue I ran into was initially 
forgetting to grant the application user permissions on the new table, but of 
course that was easily remedied.
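
For the record, the core of it looked roughly like the following (table, column, and 
role names here are just placeholders; this is a sketch, not the exact commands I ran):

-- build the replacement table with the corrected values computed in the SELECT
CREATE TABLE table_new AS
SELECT t.id,
       t.other_field,
       new_value() AS changed_field  -- or, better, the calculation inlined in the SELECT
FROM table_old t;

-- recreate indexes, constraints, defaults, and grants to match the old table
CREATE INDEX table_new_id_idx ON table_new (id);
ALTER TABLE table_new ALTER COLUMN changed_field SET NOT NULL;
GRANT SELECT, INSERT, UPDATE ON table_new TO app_user;  -- the step I initially forgot

-- two quick renames and the new data is live
ALTER TABLE table_old RENAME TO table_old_backup;
ALTER TABLE table_new RENAME TO table_old;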

In the end, it really couldn't have gone much smoother or quicker. Thanks, all, for 
the assistance and advice!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145
> 
> -- 
> "They that would give up essential liberty for temporary safety deserve
> neither liberty nor safety."
>   -- Benjamin Franklin



Re: UPDATE many records

2020-01-07 Thread Israel Brewster
> 
> On Jan 7, 2020, at 12:57 PM, Adrian Klaver  wrote:
> 
> On 1/7/20 1:43 PM, Israel Brewster wrote:
>>> On Jan 7, 2020, at 12:21 PM, Adrian Klaver >> > wrote:
>>> 
>>> On 1/7/20 1:10 PM, Israel Brewster wrote:
> On Jan 7, 2020, at 12:01 PM, Adrian Klaver  > wrote:
> 
> On 1/7/20 12:47 PM, Israel Brewster wrote:
>> One potential issue I just thought of with this approach: disk space. 
>> Will I be doubling the amount of space used while both tables exist? If 
>> so, that would prevent this from working - I don’t have that much space 
>> available at the moment.
> 
> It will definitely increase the disk space by at least the data in the 
> new table. How much relative to the old table is going to depend on how 
> aggressive the AUTOVACUUM/VACUUM is.
> 
> A suggestion for an alternative approach:
> 
> 1) Create a table:
> 
> create table change_table(id int, changed_fld some_type)
> 
> where id is the PK from the existing table.
> 
> 2) Run your conversion function against existing table with change to 
> have it put new field value in change_table keyed to id/PK. Probably do 
> this in batches.
> 
> 3) Once all the values have been updated, do an UPDATE set changed_field 
> = changed_fld from change_table where existing_table.pk = change_table.id 
> ;
 Makes sense. Use the fast SELECT to create/populate the other table, then 
 the update can just be setting a value, not having to call any functions. 
 From what you are saying about updates though, I may still need to batch 
 the UPDATE section, with occasional VACUUMs to maintain disk space. Unless 
 I am not understanding the concept of “tuples that are obsoleted by an 
 update”, which is possible.
>>> 
>>> You are not. For a more thorough explanation see:
>>> 
>>> https://www.postgresql.org/docs/12/routine-vacuuming.html#VACUUM-BASICS
>>> 
>>> How much space do you have to work with?
>>> 
>>> To get an idea of the disk space currently used by table see;
>>> 
>>> https://www.postgresql.org/docs/12/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT
>> Oh, ok, I guess I was being overly paranoid on this front. Those functions 
>> would indicate that the table is only 7.5 GB, with another 8.7GB of indexes, 
>> for a total of around 16GB. So not a problem after all - I have around 100GB 
>> available.
>> Of course, that now leaves me with the mystery of where my other 500GB of 
>> disk space is going, since it is apparently NOT going to my DB as I had 
>> assumed, but solving that can wait.
> 
> Assuming you are on some form of Linux:
> 
> sudo du -h -d 1 /
> 
> Then you can drill down into the output of above.

Yep. Done it many times to discover a runaway log file or the like. Just mildly 
amusing that solving one problem leads to another I need to take care of as 
well… But at least the select into a new table should work nicely. Thanks!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145
> 
>> Thanks again for all the good information and suggestions!
>> ---
>> Israel Brewster
>> Software Engineer
>> Alaska Volcano Observatory
>> Geophysical Institute - UAF
>> 2156 Koyukuk Drive
>> Fairbanks AK 99775-7320
>> Work: 907-474-5172
>> cell:  907-328-9145
>>> 
> 
>> ---
>> Israel Brewster
>> Software Engineer
>> Alaska Volcano Observatory
>> Geophysical Institute - UAF
>> 2156 Koyukuk Drive
>> Fairbanks AK 99775-7320
>> Work: 907-474-5172
>> cell:  907-328-9145
> 
> 
> --
> Adrian Klaver
> adrian.kla...@aklaver.com 
>>> 
>>> 
>>> --
>>> Adrian Klaver
>>> adrian.kla...@aklaver.com 
> 
> 
> -- 
> Adrian Klaver
> adrian.kla...@aklaver.com





Re: UPDATE many records

2020-01-07 Thread Adrian Klaver

On 1/7/20 1:43 PM, Israel Brewster wrote:
On Jan 7, 2020, at 12:21 PM, Adrian Klaver > wrote:


On 1/7/20 1:10 PM, Israel Brewster wrote:
On Jan 7, 2020, at 12:01 PM, Adrian Klaver wrote:


On 1/7/20 12:47 PM, Israel Brewster wrote:
One potential issue I just thought of with this approach: disk 
space. Will I be doubling the amount of space used while both 
tables exist? If so, that would prevent this from working - I don’t 
have that much space available at the moment.


It will definitely increase the disk space by at least the data in 
the new table. How much relative to the old table is going to depend 
on how aggressive the AUTOVACUUM/VACUUM is.


A suggestion for an alternative approach:

1) Create a table:

create table change_table(id int, changed_fld some_type)

where id is the PK from the existing table.

2) Run your conversion function against existing table with change 
to have it put new field value in change_table keyed to id/PK. 
Probably do this in batches.


3) Once all the values have been updated, do an UPDATE set 
changed_field = changed_fld from change_table where 
existing_table.pk = change_table.id ;
Makes sense. Use the fast SELECT to create/populate the other table, 
then the update can just be setting a value, not having to call any 
functions. From what you are saying about updates though, I may still 
need to batch the UPDATE section, with occasional VACUUMs to maintain 
disk space. Unless I am not understanding the concept of “tuples that 
are obsoleted by an update”, which is possible.


You are not. For a more thorough explanation see:

https://www.postgresql.org/docs/12/routine-vacuuming.html#VACUUM-BASICS

How much space do you have to work with?

To get an idea of the disk space currently used by table see;

https://www.postgresql.org/docs/12/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT


Oh, ok, I guess I was being overly paranoid on this front. Those 
functions would indicate that the table is only 7.5 GB, with another 
8.7GB of indexes, for a total of around 16GB. So not a problem after all 
- I have around 100GB available.


Of course, that now leaves me with the mystery of where my other 500GB 
of disk space is going, since it is apparently NOT going to my DB as I 
had assumed, but solving that can wait.


Assuming you are on some form of Linux:

sudo du -h -d 1 /

Then you can drill down into the output of above.



Thanks again for all the good information and suggestions!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145





---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



--
Adrian Klaver
adrian.kla...@aklaver.com 



--
Adrian Klaver
adrian.kla...@aklaver.com 





--
Adrian Klaver
adrian.kla...@aklaver.com




Re: UPDATE many records

2020-01-07 Thread Israel Brewster
> On Jan 7, 2020, at 12:21 PM, Adrian Klaver  wrote:
> 
> On 1/7/20 1:10 PM, Israel Brewster wrote:
>>> On Jan 7, 2020, at 12:01 PM, Adrian Klaver  
>>> wrote:
>>> 
>>> On 1/7/20 12:47 PM, Israel Brewster wrote:
 One potential issue I just thought of with this approach: disk space. Will 
 I be doubling the amount of space used while both tables exist? If so, 
 that would prevent this from working - I don’t have that much space 
 available at the moment.
>>> 
>>> It will definitely increase the disk space by at least the data in the new 
>>> table. How much relative to the old table is going to depend on how 
>>> aggressive the AUTOVACUUM/VACUUM is.
>>> 
>>> A suggestion for an alternative approach:
>>> 
>>> 1) Create a table:
>>> 
>>> create table change_table(id int, changed_fld some_type)
>>> 
>>> where id is the PK from the existing table.
>>> 
>>> 2) Run your conversion function against existing table with change to have 
>>> it put new field value in change_table keyed to id/PK. Probably do this in 
>>> batches.
>>> 
>>> 3) Once all the values have been updated, do an UPDATE set changed_field = 
>>> changed_fld from change_table where existing_table.pk = change_table.id;
>> Makes sense. Use the fast SELECT to create/populate the other table, then 
>> the update can just be setting a value, not having to call any functions. 
>> From what you are saying about updates though, I may still need to batch the 
>> UPDATE section, with occasional VACUUMs to maintain disk space. Unless I am 
>> not understanding the concept of “tuples that are obsoleted by an update”, 
>> which is possible.
> 
> You are not. For a more thorough explanation see:
> 
> https://www.postgresql.org/docs/12/routine-vacuuming.html#VACUUM-BASICS 
> 
> 
> How much space do you have to work with?
> 
> To get an idea of the disk space currently used by table see;
> 
> https://www.postgresql.org/docs/12/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT
>  
> 

Oh, ok, I guess I was being overly paranoid on this front. Those functions 
would indicate that the table is only 7.5 GB, with another 8.7GB of indexes, 
for a total of around 16GB. So not a problem after all - I have around 100GB 
available.

Of course, that now leaves me with the mystery of where my other 500GB of disk 
space is going, since it is apparently NOT going to my DB as I had assumed, but 
solving that can wait.

Thanks again for all the good information and suggestions!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145
> 
>>> 
 ---
 Israel Brewster
 Software Engineer
 Alaska Volcano Observatory
 Geophysical Institute - UAF
 2156 Koyukuk Drive
 Fairbanks AK 99775-7320
 Work: 907-474-5172
 cell:  907-328-9145
>>> 
>>> 
>>> -- 
>>> Adrian Klaver
>>> adrian.kla...@aklaver.com
> 
> 
> -- 
> Adrian Klaver
> adrian.kla...@aklaver.com 


Re: UPDATE many records

2020-01-07 Thread Adrian Klaver

On 1/7/20 1:10 PM, Israel Brewster wrote:



On Jan 7, 2020, at 12:01 PM, Adrian Klaver  wrote:

On 1/7/20 12:47 PM, Israel Brewster wrote:

One potential issue I just thought of with this approach: disk space. Will I be 
doubling the amount of space used while both tables exist? If so, that would 
prevent this from working - I don’t have that much space available at the 
moment.


It will definitely increase the disk space by at least the data in the new 
table. How much relative to the old table is going to depend on how aggressive 
the AUTOVACUUM/VACUUM is.

A suggestion for an alternative approach:

1) Create a table:

create table change_table(id int, changed_fld some_type)

where id is the PK from the existing table.

2) Run your conversion function against existing table with change to have it 
put new field value in change_table keyed to id/PK. Probably do this in batches.

3) Once all the values have been updated, do an UPDATE set changed_field = 
changed_fld from change_table where existing_table.pk = change_table.id;


Makes sense. Use the fast SELECT to create/populate the other table, then the 
update can just be setting a value, not having to call any functions. From what 
you are saying about updates though, I may still need to batch the UPDATE 
section, with occasional VACUUMs to maintain disk space. Unless I am not 
understanding the concept of “tuples that are obsoleted by an update”, which is 
possible.


You are not. For a more thorough explanation see:

https://www.postgresql.org/docs/12/routine-vacuuming.html#VACUUM-BASICS

How much space do you have to work with?

To get an idea of the disk space currently used by table see;

https://www.postgresql.org/docs/12/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT
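
For example, something along these lines (substitute your table name):

SELECT pg_size_pretty(pg_table_size('big_table'))          AS table_size,
       pg_size_pretty(pg_indexes_size('big_table'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('big_table')) AS total_size;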






---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



--
Adrian Klaver
adrian.kla...@aklaver.com





--
Adrian Klaver
adrian.kla...@aklaver.com




Re: UPDATE many records

2020-01-07 Thread Israel Brewster
> On Jan 7, 2020, at 12:15 PM, Alan Hodgson  wrote:
> 
> On Tue, 2020-01-07 at 11:58 -0900, Israel Brewster wrote:
>>> 
>> Really? Why? With the update I am only changing data - I’m not adding
>> any additional data, so the total size should stay the same, right?
>> I’m obviously missing something… :-)
>> 
> 
> PostgreSQL keeps the old row until it gets vacuumed, as it needs to be
> visible to other transactions. Not only that, but every index record
> gets updated to point to the location of the new data row too (excluding
> HOT), and those old index blocks also need to get vacuumed. And none of
> those rows can get removed until your update finishes.
> 
> I know this isn't universally true with HOT and fillfactor etc. but with
> an update this big I think it's safe to say most of the space will get
> doubled.
> 
> Plus you'll get a ton of write-ahead logs.

Gotcha. Batches with VACUUM it is! Thanks for the info.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 
> 
> 





Re: UPDATE many records

2020-01-07 Thread Alan Hodgson
On Tue, 2020-01-07 at 11:58 -0900, Israel Brewster wrote:
> > 
> Really? Why? With the update I am only changing data - I’m not adding
> any additional data, so the total size should stay the same, right?
> I’m obviously missing something… :-)
> 

PostgreSQL keeps the old row until it gets vacuumed, as it needs to be
visible to other transactions. Not only that, but every index record
gets updated to point to the location of the new data row too (excluding
HOT), and those old index blocks also need to get vacuumed. And none of
those rows can get removed until your update finishes.

I know this isn't universally true with HOT and fillfactor etc. but with
an update this big I think it's safe to say most of the space will get
doubled.

Plus you'll get a ton of write-ahead logs.
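
If you want to watch the effect, the per-table statistics show the dead-row count 
accumulating; running something like this (the table name is illustrative) before 
and after a VACUUM is instructive:

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'big_table';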





Re: UPDATE many records

2020-01-07 Thread Israel Brewster


> On Jan 7, 2020, at 12:01 PM, Adrian Klaver  wrote:
> 
> On 1/7/20 12:47 PM, Israel Brewster wrote:
>> One potential issue I just thought of with this approach: disk space. Will I 
>> be doubling the amount of space used while both tables exist? If so, that 
>> would prevent this from working - I don’t have that much space available at 
>> the moment.
> 
> It will definitely increase the disk space by at least the data in the new 
> table. How much relative to the old table is going to depend on how 
> aggressive the AUTOVACUUM/VACUUM is.
> 
> A suggestion for an alternative approach:
> 
> 1) Create a table:
> 
> create table change_table(id int, changed_fld some_type)
> 
> where id is the PK from the existing table.
> 
> 2) Run your conversion function against existing table with change to have it 
> put new field value in change_table keyed to id/PK. Probably do this in 
> batches.
> 
> 3) Once all the values have been updated, do an UPDATE set changed_field = 
> changed_fld from change_table where existing_table.pk = change_table.id;

Makes sense. Use the fast SELECT to create/populate the other table, then the 
update can just be setting a value, not having to call any functions. From what 
you are saying about updates though, I may still need to batch the UPDATE 
section, with occasional VACUUMs to maintain disk space. Unless I am not 
understanding the concept of “tuples that are obsoleted by an update”, which is 
possible.

> 
>> ---
>> Israel Brewster
>> Software Engineer
>> Alaska Volcano Observatory
>> Geophysical Institute - UAF
>> 2156 Koyukuk Drive
>> Fairbanks AK 99775-7320
>> Work: 907-474-5172
>> cell:  907-328-9145
> 
> 
> -- 
> Adrian Klaver
> adrian.kla...@aklaver.com





Re: UPDATE many records

2020-01-07 Thread Adrian Klaver

On 1/7/20 12:58 PM, Israel Brewster wrote:

On Jan 7, 2020, at 11:56 AM, Alan Hodgson  wrote:

On Tue, 2020-01-07 at 11:47 -0900, Israel Brewster wrote:

One potential issue I just thought of with this approach: disk space.
Will I be doubling the amount of space used while both tables exist?
If so, that would prevent this from working - I don’t have that much
space available at the moment.


The original update you planned would do that, too.

You probably need to just do the update in batches and vacuum the table
between batches.




Really? Why? With the update I am only changing data - I’m not adding any 
additional data, so the total size should stay the same, right? I’m obviously 
missing something… :-)


https://www.postgresql.org/docs/12/sql-vacuum.html

"VACUUM reclaims storage occupied by dead tuples. In normal PostgreSQL 
operation, tuples that are deleted or obsoleted by an update are not 
physically removed from their table; they remain present until a VACUUM 
is done. Therefore it's necessary to do VACUUM periodically, especially 
on frequently-updated tables."




---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145










--
Adrian Klaver
adrian.kla...@aklaver.com




Re: UPDATE many records

2020-01-07 Thread Adrian Klaver

On 1/7/20 12:47 PM, Israel Brewster wrote:
One potential issue I just thought of with this approach: disk space. 
Will I be doubling the amount of space used while both tables exist? If 
so, that would prevent this from working - I don’t have that much space 
available at the moment.


It will definitely increase the disk space by at least the data in the 
new table. How much relative to the old table is going to depend on how 
aggressive the AUTOVACUUM/VACUUM is.


A suggestion for an alternative approach:

1) Create a table:

create table change_table(id int, changed_fld some_type)

where id is the PK from the existing table.

2) Run your conversion function against existing table with change to 
have it put new field value in change_table keyed to id/PK. Probably do 
this in batches.


3) Once all the values have been updated, do an UPDATE set changed_field 
= changed_fld from change_table where existing_table.pk = change_table.id;
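
Spelled out a little more (some_type, the ranges, and the column names are 
placeholders you would adjust):

-- 1) side table keyed by the existing table's PK
create table change_table (id int primary key, changed_fld some_type);

-- 2) populate it in batches with the conversion function
insert into change_table (id, changed_fld)
select pk, new_value()
from existing_table
where pk between 1 and 100000;   -- repeat per range

-- 3) apply the precomputed values with a plain UPDATE ... FROM
update existing_table e
set changed_field = c.changed_fld
from change_table c
where e.pk = c.id;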



---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



--
Adrian Klaver
adrian.kla...@aklaver.com




Re: UPDATE many records

2020-01-07 Thread Alan Hodgson
On Tue, 2020-01-07 at 11:47 -0900, Israel Brewster wrote:
> One potential issue I just thought of with this approach: disk space.
> Will I be doubling the amount of space used while both tables exist?
> If so, that would prevent this from working - I don’t have that much
> space available at the moment.

The original update you planned would do that, too.

You probably need to just do the update in batches and vacuum the table
between batches.
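
In psql that can be as simple as repeating a pair of statements per key range 
(the ranges below are purely illustrative; each statement runs as its own 
transaction, so VACUUM is allowed in between):

UPDATE table_name SET changed_field = new_value() WHERE id >= 0 AND id < 1000000;
VACUUM table_name;
UPDATE table_name SET changed_field = new_value() WHERE id >= 1000000 AND id < 2000000;
VACUUM table_name;
-- ... and so on through the rest of the key space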





Re: UPDATE many records

2020-01-07 Thread Israel Brewster
> On Jan 7, 2020, at 11:56 AM, Alan Hodgson  wrote:
> 
> On Tue, 2020-01-07 at 11:47 -0900, Israel Brewster wrote:
>> One potential issue I just thought of with this approach: disk space.
>> Will I be doubling the amount of space used while both tables exist?
>> If so, that would prevent this from working - I don’t have that much
>> space available at the moment.
> 
> The original update you planned would do that, too.
> 
> You probably need to just do the update in batches and vacuum the table
> between batches.
> 
> 

Really? Why? With the update I am only changing data - I’m not adding any 
additional data, so the total size should stay the same, right? I’m obviously 
missing something… :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 





Re: UPDATE many records

2020-01-07 Thread Israel Brewster
One potential issue I just thought of with this approach: disk space. Will I be 
doubling the amount of space used while both tables exist? If so, that would 
prevent this from working - I don’t have that much space available at the 
moment.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> On Jan 7, 2020, at 10:09 AM, Mark Zellers  wrote:
> 
> You don’t tell us if other users will be concurrently changing any of the 
> records involved.  If you could guarantee that the table won’t be changed, 
> you might be better off doing a CREATE TABLE table_new as SELECT … FROM 
> table_old, dropping table_old, and finally renaming table_new.   Given the 
> way Postgres handles updates, I would think that might perform significantly 
> better.  Even if you did the work in batches (create a new table, 
> insert/select from the old table, drop, rename), that could well be better.  
> Especially if you re-create the indexes after all the data is moved.
>  
>  
>  
> From: Israel Brewster  
> Sent: Monday, January 6, 2020 10:36 AM
> To: pgsql-general@lists.postgresql.org
> Subject: UPDATE many records
>  
> Thanks to a change in historical data, I have a need to update a large number 
> of records (around 50 million). The update itself is straight forward, as I 
> can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, 
> new_value is the result of a stored procedure, if that makes a difference) 
> command via psql, and it should work. However, due to the large number of 
> records this command will obviously take a while, and if anything goes wrong 
> during the update (one bad value in row 45 million, lost connection, etc), 
> all the work that has been done already will be lost due to the transactional 
> nature of such commands (unless I am missing something).
>  
> Given that each row update is completely independent of any other row, I have 
> the following questions:
>  
> 1) Is there any way to set the command such that each row change is committed 
> as it is calculated?
> 2) Is there some way to run this command in parallel in order to better 
> utilize multiple processor cores, other than manually breaking the data into 
> chunks and running a separate psql/update process for each chunk? Honestly, 
> manual parallelizing wouldn’t be too bad (there are a number of logical 
> segregations I can apply), I’m just wondering if there is a more automatic 
> option.
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory 
> Geophysical Institute - UAF 
> 2156 Koyukuk Drive 
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145



Re: UPDATE many records

2020-01-07 Thread Israel Brewster
Nice approach! Using that method, I can do the select to generate the new table 
in only 6 minutes! I’m sure it helps that through a slightly creative use of 
joins, I can re-write my function to actually be part of the select, just using 
built-in trig functions, rather than having to call a stored function that I 
wrote for each row. And at only 6 minutes, I don't need to worry about things 
like committing in batches or parallelizing. 
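
Roughly, instead of calling the stored function per row, the SELECT now looks 
something like the following (the station/calibration names are made up purely for 
illustration; the point is that it's a join plus built-in trig rather than a 
per-row function call):

CREATE TABLE table_new AS
SELECT d.id,
       d.station_id,
       d.raw_value * cos(radians(c.correction_angle)) AS changed_field  -- inline trig
FROM table_old d
JOIN station_calibration c ON c.station_id = d.station_id;  -- the small lookup table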

I haven’t tried selecting to a new table yet, presumably writing the data back 
could take some time, but the concept seems like it should work well. Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> On Jan 7, 2020, at 10:09 AM, Mark Zellers  wrote:
> 
> You don’t tell us if other users will be concurrently changing any of the 
> records involved.  If you could guarantee that the table won’t be changed, 
> you might be better off doing a CREATE TABLE table_new as SELECT … FROM 
> table_old, dropping table_old, and finally renaming table_new.   Given the 
> way Postgres handles updates, I would think that might perform significantly 
> better.  Even if you did the work in batches (create a new table, 
> insert/select from the old table, drop, rename), that could well be better.  
> Especially if you re-create the indexes after all the data is moved.
>  
>  
>  
> From: Israel Brewster  
> Sent: Monday, January 6, 2020 10:36 AM
> To: pgsql-general@lists.postgresql.org
> Subject: UPDATE many records
>  
> Thanks to a change in historical data, I have a need to update a large number 
> of records (around 50 million). The update itself is straight forward, as I 
> can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, 
> new_value is the result of a stored procedure, if that makes a difference) 
> command via psql, and it should work. However, due to the large number of 
> records this command will obviously take a while, and if anything goes wrong 
> during the update (one bad value in row 45 million, lost connection, etc), 
> all the work that has been done already will be lost due to the transactional 
> nature of such commands (unless I am missing something).
>  
> Given that each row update is completely independent of any other row, I have 
> the following questions:
>  
> 1) Is there any way to set the command such that each row change is committed 
> as it is calculated?
> 2) Is there some way to run this command in parallel in order to better 
> utilize multiple processor cores, other than manually breaking the data into 
> chunks and running a separate psql/update process for each chunk? Honestly, 
> manual parallelizing wouldn’t be too bad (there are a number of logical 
> segregations I can apply), I’m just wondering if there is a more automatic 
> option.
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory 
> Geophysical Institute - UAF 
> 2156 Koyukuk Drive 
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145



RE: UPDATE many records

2020-01-07 Thread Mark Zellers
You don’t tell us if other users will be concurrently changing any of the 
records involved.  If you could guarantee that the table won’t be changed, you 
might be better off doing a CREATE TABLE table_new as SELECT … FROM table_old, 
dropping table_old, and finally renaming table_new.   Given the way Postgres 
handles updates, I would think that might perform significantly better.  Even 
if you did the work in batches (create a new table, insert/select from the old 
table, drop, rename), that could well be better.  Especially if you re-create 
the indexes after all the data is moved.



From: Israel Brewster 
Sent: Monday, January 6, 2020 10:36 AM
To: pgsql-general@lists.postgresql.org
Subject: UPDATE many records

Thanks to a change in historical data, I have a need to update a large number 
of records (around 50 million). The update itself is straight forward, as I can 
just issue an "UPDATE table_name SET changed_field=new_value();" (yes, 
new_value is the result of a stored procedure, if that makes a difference) 
command via psql, and it should work. However, due to the large number of 
records this command will obviously take a while, and if anything goes wrong 
during the update (one bad value in row 45 million, lost connection, etc), all 
the work that has been done already will be lost due to the transactional 
nature of such commands (unless I am missing something).

Given that each row update is completely independent of any other row, I have 
the following questions:

1) Is there any way to set the command such that each row change is committed 
as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize 
multiple processor cores, other than manually breaking the data into chunks and 
running a separate psql/update process for each chunk? Honestly, manual 
parallelizing wouldn’t be too bad (there are a number of logical segregations I 
can apply), I’m just wondering if there is a more automatic option.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



Re: UPDATE many records

2020-01-06 Thread Christopher Browne
On Mon, 6 Jan 2020 at 17:38, Israel Brewster  wrote:

>
> Sure. But I feel we are getting a bit off track. Optimizing the runtime of
> the update is great, but this is a one-off (hopefully) event. I want to
> accomplish it as quickly as possible, of course, but at the same time it
> doesn’t make sense to spend a lot of time optimizing every component of the
> query. The main purpose of the question was honestly for my sanity, to
> reduce the likelihood of having it run for several hours only to error out
> due to bad data or whatever and have to start over from the top. Running in
> parallel simply seemed to be a no-brainer option to make it go quicker,
> assuming CPU bound updating. Optimizations that are going to take work are
> probably not worth it. We can wait for the data to be updated.
>

It sounds like you're in a decent place on this, and that you have done a
pretty apropos amount of exploration of the matter.

I was pleased to hear that you have the idempotency of the updates well in
hand, and that the application can cope with the degree of out-of-sync that
things will temporarily be.

The estimate of 10h to update the data doesn't surprise me; that's long
enough that it sure seems tempting to do the work in pieces so that you
don't have your whole set of application data locked for 10h.

I'd be inclined to call this "enough attention" for a one-off event.

I'll poke at the trigger aspect a wee bit; if the trigger function does a
one-tuple-at-a-time handling of things, so that it fires 50M times, you
might get a substantial speedup by replacing that with an equivalent set
operation that processes a few thousand tuples at a time.  That said, if
you're happy with the process running 10h, it's not worth unpeeling the
extra testing needed to ensure identical end states.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: UPDATE many records

2020-01-06 Thread Israel Brewster

> On Jan 6, 2020, at 12:49 PM, Justin  wrote:
> 
> What was the HD wait time? What tool is being used to monitor the server 
> resources?

No idea on the HD wait time - how would I best monitor that? That said, this 
machine does have NVMe drives, so the speed should be fairly high/wait time 
fairly low. It’s also running as a VM, which could affect things, but is a bit 
of a moot point as far as this update goes. As far as monitoring server 
resources, I was just using top.

> It appears, based on this information, that there is a lot more going on than a 
> simple UPDATE command.

Depending on your definition of “simple update” of course, very true. As I 
stated in the original message, the actual update value is the result of a 
function. The psql command is a simple update, but the function does a bit of 
stuff (primarily trigonometry). According to the EXPLAIN ANALYZE, about .7 ms 
of stuff per record, which of course is most of the runtime. It is entirely 
possible that the function could be optimized to run more quickly.

> 
> Moving code out of the trigger is probably not going to improve performance, 
> unless there is a lot of code that does not need to be processed for this 
> update, or code touching other tables.

One SELECT query on another table to get some values I need to use for the 
calculation. No code that is not needed for the update. Given the nature of 
this bulk update, I *could* make a separate function that simply takes those 
values as parameters, since the same value will be applied to a lot of records. 
I’d just have to be careful about how I applied the update, so rows get 
processed with the correct values. I’m not convinced it would be worth it 
though - might shave a few hours off the total execution time (assuming that 
SELECT is expensive - EXPLAIN ANLYZE shows an index scan, on a table with only 
12,761 rows, which seems to be about as simple as it gets), but I doubt it 
would be enough for me to feel comfortable simply running the update as one 
monolithic unit.

> Study the trigger, identify what has to run, pull that code out, then disable 
> the trigger. Move the necessary code to a new function for updating.

Sure. But I feel we are getting a bit off track. Optimizing the runtime of the 
update is great, but this is a one-off (hopefully) event. I want to accomplish 
it as quickly as possible, of course, but at the same time it doesn’t make 
sense to spend a lot of time optimizing every component of the query. The main 
purpose of the question was honestly for my sanity, to reduce the likelihood of 
having it run for several hours only to error out due to bad data or whatever 
and have to start over from the top. Running in parallel simply seemed to be a 
no-brainer option to make it go quicker, assuming CPU bound updating. 
Optimizations that are going to take work are probably not worth it. We can 
wait for the data to be updated.

Thanks again!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


> 
> On Mon, Jan 6, 2020 at 4:24 PM Israel Brewster  > wrote:
> 
>> On Jan 6, 2020, at 11:54 AM, Michael Lewis > > wrote:
>> 
>> I’m thinking it might be worth it to do a “quick” test on 1,000 or so 
>> records (or whatever number can run in a minute or so), watching the 
>> processor utilization as it runs. That should give me a better feel for 
>> where the bottlenecks may be, and how long the entire update process would 
>> take. I’m assuming, of course, that the total time would scale more or less 
>> linearly with the number of records.
>> 
>> I think that depends on how you identify and limit the update to those 1000 
>> records. If it is using a primary key with specific keys in an array, 
>> probably close to linear increase because the where clause isn't impactful 
>> to the overall execution time. If you write a sub-query that is slow, then 
>> you would need to exclude that from the time. You can always run explain 
>> analyze on the update and rollback rather than commit.
> So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN 
> ANALYZE output at https://explain.depesz.com/s/lIYn 
>  if it matters), during which time I did 
> see a postmaster process consuming 100% CPU. Upping the test to 20,819 
> records took about 16.5 seconds, so that looks relatively linear to me. Also, 
> CPU bound. So by my calculations, doing all 50M records would take around 10 
> hours.
> 
> One potentially significant note: most of the execution time is spent in a 
> trigger. This trigger is actually what’s doing the REAL update that I need to 
> happen. If it would make a difference, I could easily pull the trigger code 
> out to a separate function that I just call directly (with triggers 
> temporarily disabled). My thinking is that calling a 

Re: UPDATE many records

2020-01-06 Thread Justin
What was the HD wait time? What tool is being used to monitor the server
resources?

It appears, based on this information, that there is a lot more going on than a
simple UPDATE command.

Moving code out of the trigger is probably not going to improve performance,
unless there is a lot of code that does not need to be processed for this
update, or code touching other tables.

Study the trigger, identify what has to run, pull that code out, then
disable the trigger. Move the necessary code to a new function for
updating.

On Mon, Jan 6, 2020 at 4:24 PM Israel Brewster 
wrote:

>
> On Jan 6, 2020, at 11:54 AM, Michael Lewis  wrote:
>
> I’m thinking it might be worth it to do a “quick” test on 1,000 or so
>> records (or whatever number can run in a minute or so), watching the
>> processor utilization as it runs. That should give me a better feel for
>> where the bottlenecks may be, and how long the entire update process would
>> take. I’m assuming, of course, that the total time would scale more or less
>> linearly with the number of records.
>>
>
> I think that depends on how you identify and limit the update to those
> 1000 records. If it is using a primary key with specific keys in an array,
> probably close to linear increase because the where clause isn't impactful
> to the overall execution time. If you write a sub-query that is slow, then
> you would need to exclude that from the time. You can always run explain
> analyze on the update and rollback rather than commit.
>
> So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN
> ANALYZE output at https://explain.depesz.com/s/lIYn if it matters),
> during which time I did see a postmaster process consuming 100% CPU. Upping
> the test to 20,819 records took about 16.5 seconds, so that looks
> relatively linear to me. Also, CPU bound. So by my calculations, doing all
> 50M records would take around 10 hours.
>
> One potentially significant note: most of the execution time is spent in a
> trigger. This trigger is actually what’s doing the REAL update that I need
> to happen. If it would make a difference, I could easily pull the trigger
> code out to a separate function that I just call directly (with triggers
> temporarily disabled). My thinking is that calling a function is calling a
> function, and the fact that it is currently called via a trigger rather
> than direct is of little consequence, but I’m willing to be corrected on
> that :-)
>
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
>
>


Re: UPDATE many records

2020-01-06 Thread Michael Lewis
On Mon, Jan 6, 2020 at 2:34 PM Mark Zellers 
wrote:

> Just out of curiosity, what kind of trigger are you using, a row level
> trigger or a statement level trigger?  If you are using a row level
> trigger, see if you can achieve your requirements using a statement level
> trigger instead.  I’m relatively new to Postgres, so there could be some
> limit that I’m not aware of, but my understanding is that you have access
> to the old and new values of the updated rows in the after statement
> trigger.  It would likely be much more performant to do your operation once
> after the statement is done rather than firing a trigger on every changed
> row.
>

My experience/understanding is that statement level triggers can be a big
performance boost, but only for changing *other* tables and not the table
that the trigger is on since it is *AFTER* only and can't modify NEW record
directly.


RE: UPDATE many records

2020-01-06 Thread Mark Zellers
Just out of curiosity, what kind of trigger are you using, a row level trigger 
or a statement level trigger?  If you are using a row level trigger, see if you 
can achieve your requirements using a statement level trigger instead.  I’m 
relatively new to Postgres, so there could be some limit that I’m not aware of, 
but my understanding is that you have access to the old and new values of the 
updated rows in the after statement trigger.  It would likely be much more 
performant to do your operation once after the statement is done rather than 
firing a trigger on every changed row.
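
Roughly, a statement-level AFTER UPDATE trigger with a transition table looks like 
this (all names here are made up purely for illustration):

CREATE FUNCTION log_changed_rows() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  -- new_rows is the transition table holding every row changed by the statement
  INSERT INTO change_log (id, changed_field)
  SELECT id, changed_field FROM new_rows;
  RETURN NULL;
END;
$$;

CREATE TRIGGER table_name_after_update
AFTER UPDATE ON table_name
REFERENCING NEW TABLE AS new_rows
FOR EACH STATEMENT
EXECUTE FUNCTION log_changed_rows();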

Regards,

Mark Z.


From: Israel Brewster 
Sent: Monday, January 6, 2020 1:24 PM
To: Michael Lewis 
Cc: Rob Sargent ; Alban Hertroys ; 
Christopher Browne ; pgsql-generallists.postgresql.org 

Subject: Re: UPDATE many records


On Jan 6, 2020, at 11:54 AM, Michael Lewis wrote:

I’m thinking it might be worth it to do a “quick” test on 1,000 or so records 
(or whatever number can run in a minute or so), watching the processor 
utilization as it runs. That should give me a better feel for where the 
bottlenecks may be, and how long the entire update process would take. I’m 
assuming, of course, that the total time would scale more or less linearly with 
the number of records.

I think that depends on how you identify and limit the update to those 1000 
records. If it is using a primary key with specific keys in an array, probably 
close to linear increase because the where clause isn't impactful to the 
overall execution time. If you write a sub-query that is slow, then you would 
need to exclude that from the time. You can always run explain analyze on the 
update and rollback rather than commit.
So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN 
ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during 
which time I did see a postmaster process consuming 100% CPU. Upping the test 
to 20,819 records took about 16.5 seconds, so that looks relatively linear to 
me. Also, CPU bound. So by my calculations, doing all 50M records would take 
around 10 hours.

One potentially significant note: most of the execution time is spent in a 
trigger. This trigger is actually what’s doing the REAL update that I need to 
happen. If it would make a difference, I could easily pull the trigger code out 
to a separate function that I just call directly (with triggers temporarily 
disabled). My thinking is that calling a function is calling a function, and 
the fact that it is currently called via a trigger rather than direct is of 
little consequence, but I’m willing to be corrected on that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



Re: UPDATE many records

2020-01-06 Thread Israel Brewster

> On Jan 6, 2020, at 11:54 AM, Michael Lewis  wrote:
> 
> I’m thinking it might be worth it to do a “quick” test on 1,000 or so records 
> (or whatever number can run in a minute or so), watching the processor 
> utilization as it runs. That should give me a better feel for where the 
> bottlenecks may be, and how long the entire update process would take. I’m 
> assuming, of course, that the total time would scale more or less linearly 
> with the number of records.
> 
> I think that depends on how you identify and limit the update to those 1000 
> records. If it is using a primary key with specific keys in an array, 
> probably close to linear increase because the where clause isn't impactful to 
> the overall execution time. If you write a sub-query that is slow, then you 
> would need to exclude that from the time. You can always run explain analyze 
> on the update and rollback rather than commit.
So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN 
ANALYZE output at https://explain.depesz.com/s/lIYn 
 if it matters), during which time I did see 
a postmaster process consuming 100% CPU. Upping the test to 20,819 records took 
about 16.5 seconds, so that looks relatively linear to me. Also, CPU bound. So 
by my calculations, doing all 50M records would take around 10 hours.

One potentially significant note: most of the execution time is spent in a 
trigger. This trigger is actually what’s doing the REAL update that I need to 
happen. If it would make a difference, I could easily pull the trigger code out 
to a separate function that I just call directly (with triggers temporarily 
disabled). My thinking is that calling a function is calling a function, and 
the fact that it is currently called via a trigger rather than direct is of 
little consequence, but I’m willing to be corrected on that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



Re: UPDATE many records

2020-01-06 Thread Michael Lewis
>
> I’m thinking it might be worth it to do a “quick” test on 1,000 or so
> records (or whatever number can run in a minute or so), watching the
> processor utilization as it runs. That should give me a better feel for
> where the bottlenecks may be, and how long the entire update process would
> take. I’m assuming, of course, that the total time would scale more or less
> linearly with the number of records.
>

I think that depends on how you identify and limit the update to those
1000 records. If it is using a primary key with specific keys in an array,
probably close to linear increase because the where clause isn't impactful
to the overall execution time. If you write a sub-query that is slow, then
you would need to exclude that from the time. You can always run explain
analyze on the update and rollback rather than commit.
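
For example (the WHERE clause is just a stand-in for however you limit the test 
batch):

BEGIN;
EXPLAIN ANALYZE
UPDATE table_name SET changed_field = new_value() WHERE id BETWEEN 1 AND 10000;
ROLLBACK;  -- nothing is committed, but you still get real timings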


Re: UPDATE many records

2020-01-06 Thread Israel Brewster


> On Jan 6, 2020, at 11:40 AM, Rob Sargent  wrote:
> 
> 
> 
>> On Jan 6, 2020, at 1:29 PM, Alban Hertroys > > wrote:
>> 
>> I think you’re overcomplicating the matter.
>> 
>> I’d just do it as a single update in one transaction. It’s only 50M rows. It 
>> may take half an hour or so on decent hardware, depending on how 
>> resource-intensive your function is.
>> 
> I must emphasize: This estimate is HIGHLY dependent on hardware and the 
> complexity of the table (number of indices, etc).  (I suspect there’s a 
> correlation between table size (business value) and number of indices)

I’m thinking it might be worth it to do a “quick” test on 1,000 or so records 
(or whatever number can run in a minute or so), watching the processor 
utilization as it runs. That should give me a better feel for where the 
bottlenecks may be, and how long the entire update process would take. I’m 
assuming, of course, that the total time would scale more or less linearly with 
the number of records.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 
>> If that fails[1], only then would I start looking into batching things. But 
>> then you still need to figure out why it fails and what to do about that; if 
>> it fails it will probably fail fast, and if not, then you’re looking at a 
>> one-off situation that won’t require more than a few workarounds - after 
>> which you can just run the update again.
>> 
>> Ad 1). No harm has been done, it’s a single transaction that rolled back.
>> 
>> Alban Hertroys
>> --
>> If you can't see the forest for the trees,
>> cut the trees and you'll find there is no forest.
> 



Re: UPDATE many records

2020-01-06 Thread Israel Brewster
> On Jan 6, 2020, at 11:38 AM, Christopher Browne  wrote:
> 
> 
> 
> On Mon, Jan 6, 2020, 3:15 PM Israel Brewster  > wrote:
>> On Jan 6, 2020, at 10:08 AM, Christopher Browne > > wrote:
>> 
>> On Mon, 6 Jan 2020 at 13:36, Israel Brewster > > wrote:
>> Thanks to a change in historical data, I have a need to update a large 
>> number of records (around 50 million). The update itself is straight 
>> forward, as I can just issue an "UPDATE table_name SET 
>> changed_field=new_value();" (yes, new_value is the result of a stored 
>> procedure, if that makes a difference) command via psql, and it should work. 
>> However, due to the large number of records this command will obviously take 
>> a while, and if anything goes wrong during the update (one bad value in row 
>> 45 million, lost connection, etc), all the work that has been done already 
>> will be lost due to the transactional nature of such commands (unless I am 
>> missing something).
>> 
>> Given that each row update is completely independent of any other row, I 
>> have the following questions:
>> 
>> 1) Is there any way to set the command such that each row change is 
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better 
>> utilize multiple processor cores, other than manually breaking the data into 
>> chunks and running a separate psql/update process for each chunk? Honestly, 
>> manual parallelizing wouldn’t be too bad (there are a number of logical 
>> segregations I can apply), I’m just wondering if there is a more automatic 
>> option.
>> 
>> Yeah, I'd be inclined to do this in batches.
>> 
>> If, for instance, the table has a nice primary key, then I'd capture the 
>> primary keys into a side table, and grab tuples from the side table to 
>> process in more bite-sized batches, say, of a few thousand tuples per batch.
>> 
>> create table just_keys as select pk_column from big_historical_table;
>> alter table just_keys add column processed boolean;
>> create index jkpk on just_keys(pk_column) where (processed is null);
>> then loop repeatedly along the lines...
>> 
>> create temp table iteration as select pk_column from just_keys where 
>> processed is null limit 1000;
>> [do update on big_historical_table where pk_column in (select pk_column from 
>> iteration)]
>> update iteration set processed='true' where pk_column in (select pk_column 
>> from iteration);
>> drop table iteration;
>> 
>> Parallelization is absolutely an interesting idea; if you want to use 8 
>> processes, then use a cycling sequence on the side table to spread tuples 
>> across the 8 processes, so that they can grab their own tuples and not block 
>> one another.
>> 
>> In that case, more like...
>> create temp sequence seq_procs start with 1 maxval 8 cycle;
>> create temp table just_keys as select pk_column, false::boolean as 
>> processed, nextval('seq_procs') as batch_id from big_historical_table;
>> 
>> The individual iterations then look for values in just_keys corresponding to 
>> their assigned batch number.
> 
> Sounds like a reasonable approach. As Justin pointed out, it is actually 
> likely that the process will be IO bound rather than CPU bound, so my 
> parallel idea may not have much merit after all, but the batching procedure 
> makes sense. I assume you meant update just_keys in your sample rather than 
> update iteration on that line just before drop table iteration. Thanks for 
> the info
> 
> As for parallelism, if you have really powerful disk, lots of disks on disk 
> array, it may help.  Or not, as commented.
> 
> I didn't test my wee bit of code, so yep, I meant to update just_keys :-).
> 
> You won't find something terribly much more automatic.
> 
> Oh, yah, there's a possible further complication; does the application need 
> to get stopped to do this update?  Is the newest version of the app still 
> generating data that needs the rewriting?  Sure hope not…

Yeah, a valid concern, but I should be ok on that front. Once I fix the 
calculation function, any new records will have the correct value. Plus, the 
actual update calculation is idempotent, so if a handful of new records end up 
getting re-calculated, that’s not an issue. Granted, the data will look weird 
while the re-calculation is in process (part new, part old), but we can live 
with that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



Re: UPDATE many records

2020-01-06 Thread Christopher Browne
On Mon, Jan 6, 2020, 3:15 PM Israel Brewster  wrote:

> On Jan 6, 2020, at 10:08 AM, Christopher Browne 
> wrote:
>
> On Mon, 6 Jan 2020 at 13:36, Israel Brewster 
> wrote:
>
>> Thanks to a change in historical data, I have a need to update a large
>> number of records (around 50 million). The update itself is straight
>> forward, as I can just issue an "UPDATE table_name SET
>> changed_field=new_value();" (yes, new_value is the result of a stored
>> procedure, if that makes a difference) command via psql, and it should
>> work. However, due to the large number of records this command will
>> obviously take a while, and if anything goes wrong during the update (one
>> bad value in row 45 million, lost connection, etc), all the work that has
>> been done already will be lost due to the transactional nature of such
>> commands (unless I am missing something).
>>
>> Given that each row update is completely independent of any other row, I
>> have the following questions:
>>
>> 1) Is there any way to set the command such that each row change is
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better
>> utilize multiple processor cores, other than manually breaking the data
>> into chunks and running a separate psql/update process for each chunk?
>> Honestly, manual parallelizing wouldn’t be too bad (there are a number of
>> logical segregations I can apply), I’m just wondering if there is a more
>> automatic option.
>>
>
> Yeah, I'd be inclined to do this in batches.
>
> If, for instance, the table has a nice primary key, then I'd capture the
> primary keys into a side table, and grab tuples from the side table to
> process in more bite-sized batches, say, of a few thousand tuples per batch.
>
> create table just_keys as select pk_column from big_historical_table;
> alter table just_keys add column processed boolean;
> create index jkpk on just_keys(pk_column) where (processed is null);
> then loop repeatedly along the lines...
>
> create temp table iteration as select pk_column from just_keys where
> processed is null limit 1000;
> [do update on big_historical_table where pk_column in (select pk_column
> from iteration)]
> update iteration set processed='true' where pk_column in (select pk_column
> from iteration);
> drop table iteration;
>
>
> Parallelization is absolutely an interesting idea; if you want to use 8
> processes, then use a cycling sequence on the side table to spread tuples
> across the 8 processes, so that they can grab their own tuples and not
> block one another.
>
> In that case, more like...
> create temp sequence seq_procs start with 1 maxval 8 cycle;
> create temp table just_keys as select pk_column, false::boolean as
> processed, nextval('seq_procs') as batch_id from big_historical_table;
>
> The individual iterations then look for values in just_keys corresponding
> to their assigned batch number.
>
>
> Sounds like a reasonable approach. As Justin pointed out, it is actually
> likely that the process will be IO bound rather than CPU bound, so my
> parallel idea may not have much merit after all, but the batching procedure
> makes sense. I assume you meant update just_keys in your sample rather than
> update iteration on that line just before drop table iteration. Thanks for
> the info
>

As for parallelism, if you have really powerful disk, lots of disks on disk
array, it may help.  Or not, as commented.

I didn't test my wee bit of code, so yep, I meant to update just_keys :-).

You won't find something terribly much more automatic.

Oh, yah, there's a possible further complication; does the application need
to get stopped to do this update?  Is the newest version of the app still
generating data that needs the rewriting?  Sure hope not...


Re: UPDATE many records

2020-01-06 Thread Justin
As you have access to procedures, you can create a loop that issues an
Update and Commit per batch,
so something like this should work in plpgsql:

declare
  icount int := 0;
  new_count int := 0;
begin
  -- assumes ids are roughly contiguous, starting near 0
  select count(*) into icount from mytable;

  loop
    -- each pass updates one 10,000-id slice; the COMMIT ends that
    -- transaction and a new one starts automatically
    update mytable set myvalue = newvalue()
      where id between new_count and new_count + 10000;
    commit;

    new_count := new_count + 10000;
    exit when new_count > icount;
  end loop;
end;
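
(Note that on PostgreSQL 11 a body like this needs to live inside a procedure
(CREATE PROCEDURE ... LANGUAGE plpgsql) or a DO block, and has to be invoked
from outside an explicit transaction block; otherwise the intermediate COMMITs
are rejected.)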


I am going to put a caveat on this: if the newvalue() function is complex and
takes a lot of CPU cycles to do its thing, then parallelism would help.
However, if this function looks at the table being updated, it can really
complicate things, as the parallel workers would be looking at stale
records, which could be bad...






On Mon, Jan 6, 2020 at 3:07 PM Israel Brewster 
wrote:

> Good information. I did forget to mention that I am using PostgreSQL 11.5.
> I also was not aware of the distinction between PROCEDURE and FUNCTION, so
> I guess I used the wrong terminology there when stating that new_value is
> the result of a stored procedure. It’s actually a function.
>
> So would your suggestion then be to create a procedure that loops through
> the records, calculating and committing each one (or, as in your older
> Postgres example, batches of 10k to 20k)?
>
> Good point on the HD I/O bound vs processor bound, but wouldn’t that
> depend on how complicated the actual update is? Still, there is a good
> chance you are correct in that statement, so that aspect is probably not
> worth spending too much time on.
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
>
> On Jan 6, 2020, at 10:05 AM, Justin  wrote:
>
> There are several ways to actually do this
>
> If you have PostgreSQL 11 or higher, we now have CREATE PROCEDURE, which
> allows committing transactions; one drawback is that it cannot parallelize
> from inside the procedure
> https://www.postgresql.org/docs/11/sql-createprocedure.html
>
> https://severalnines.com/database-blog/overview-new-stored-procedures-postgresql-11
>
> If it's an older version, then a Python script (or other scripting language)
> that iterates over the data in batches of, say, 10 to 20K will do what you want
>
> for i in list of IDs
> begin ;
> "UPDATE table_name SET changed_field=new_value()  where ID @> int4range(i,
> i+1);
> commit;
>
>
> To create parallel processes, a simple Python script (or other scripting
> language) can be used to open many connections working the data in
> parallel, but given the simple update it will NOT help performance: this
> will be hard-disk IO bound, not process bound, which is where
> parallelization helps
>
>
>
> On Mon, Jan 6, 2020 at 1:36 PM Israel Brewster 
> wrote:
>
>> Thanks to a change in historical data, I have a need to update a large
>> number of records (around 50 million). The update itself is straight
>> forward, as I can just issue an "UPDATE table_name SET
>> changed_field=new_value();" (yes, new_value is the result of a stored
>> procedure, if that makes a difference) command via psql, and it should
>> work. However, due to the large number of records this command will
>> obviously take a while, and if anything goes wrong during the update (one
>> bad value in row 45 million, lost connection, etc), all the work that has
>> been done already will be lost due to the transactional nature of such
>> commands (unless I am missing something).
>>
>> Given that each row update is completely independent of any other row, I
>> have the following questions:
>>
>> 1) Is there any way to set the command such that each row change is
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better
>> utilize multiple processor cores, other than manually breaking the data
>> into chunks and running a separate psql/update process for each chunk?
>> Honestly, manual parallelizing wouldn’t be too bad (there are a number of
>> logical segregations I can apply), I’m just wondering if there is a more
>> automatic option.
>> ---
>> Israel Brewster
>> Software Engineer
>> Alaska Volcano Observatory
>> Geophysical Institute - UAF
>> 2156 Koyukuk Drive
>> Fairbanks AK 99775-7320
>> Work: 907-474-5172
>> cell:  907-328-9145
>>
>>
>


Re: UPDATE many records

2020-01-06 Thread Alban Hertroys


> On 6 Jan 2020, at 21:15, Israel Brewster  wrote:
> 
>> On Jan 6, 2020, at 10:08 AM, Christopher Browne  wrote:
>> 
>> On Mon, 6 Jan 2020 at 13:36, Israel Brewster  wrote:
>> Thanks to a change in historical data, I have a need to update a large 
>> number of records (around 50 million). The update itself is straight 
>> forward, as I can just issue an "UPDATE table_name SET 
>> changed_field=new_value();" (yes, new_value is the result of a stored 
>> procedure, if that makes a difference) command via psql, and it should work. 
>> However, due to the large number of records this command will obviously take 
>> a while, and if anything goes wrong during the update (one bad value in row 
>> 45 million, lost connection, etc), all the work that has been done already 
>> will be lost due to the transactional nature of such commands (unless I am 
>> missing something).
>> 
>> Given that each row update is completely independent of any other row, I 
>> have the following questions:
>> 
>> 1) Is there any way to set the command such that each row change is 
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better 
>> utilize multiple processor cores, other than manually breaking the data into 
>> chunks and running a separate psql/update process for each chunk? Honestly, 
>> manual parallelizing wouldn’t be too bad (there are a number of logical 
>> segregations I can apply), I’m just wondering if there is a more automatic 
>> option.
>> 
>> Yeah, I'd be inclined to do this in batches.

I think you’re overcomplicating the matter.

I’d just do it as a single update in one transaction. It’s only 50M rows. It 
may take half an hour or so on decent hardware, depending on how 
resource-intensive your function is.

If that fails[1], only then would I start looking into batching things. But 
then you still need to figure out why it fails and what to do about that; if it 
fails it will probably fail fast, and if not, then you’re looking at a one-off 
situation that won’t require more than a few workarounds - after which you can 
just run the update again.

Ad 1). No harm has been done, it’s a single transaction that rolled back.
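
Concretely, that is nothing more than the statement already quoted in this
thread, run in psql with timing turned on so you have a number for planning
any re-run:

\timing on
UPDATE table_name SET changed_field = new_value();

If the connection drops or the function errors out partway through, the whole
statement rolls back and no rows are changed.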

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.





Re: UPDATE many records

2020-01-06 Thread Israel Brewster
> On Jan 6, 2020, at 10:08 AM, Christopher Browne  wrote:
> 
> On Mon, 6 Jan 2020 at 13:36, Israel Brewster wrote:
> Thanks to a change in historical data, I have a need to update a large number 
> of records (around 50 million). The update itself is straight forward, as I 
> can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, 
> new_value is the result of a stored procedure, if that makes a difference) 
> command via psql, and it should work. However, due to the large number of 
> records this command will obviously take a while, and if anything goes wrong 
> during the update (one bad value in row 45 million, lost connection, etc), 
> all the work that has been done already will be lost due to the transactional 
> nature of such commands (unless I am missing something).
> 
> Given that each row update is completely independent of any other row, I have 
> the following questions:
> 
> 1) Is there any way to set the command such that each row change is committed 
> as it is calculated?
> 2) Is there some way to run this command in parallel in order to better 
> utilize multiple processor cores, other than manually breaking the data into 
> chunks and running a separate psql/update process for each chunk? Honestly, 
> manual parallelizing wouldn’t be too bad (there are a number of logical 
> segregations I can apply), I’m just wondering if there is a more automatic 
> option.
> 
> Yeah, I'd be inclined to do this in batches.
> 
> If, for instance, the table has a nice primary key, then I'd capture the 
> primary keys into a side table, and grab tuples from the side table to 
> process in more bite-sized batches, say, of a few thousand tuples per batch.
> 
> create table just_keys as select pk_column from big_historical_table;
> alter table just_keys add column processed boolean;
> create index jkpk on just_keys(pk_column) where (processed is null);
> then loop repeatedly along the lines...
> 
> create temp table iteration as select pk_column from just_keys where 
> processed is null limit 1000;
> [do update on big_historical_table where pk_column in (select pk_column from 
> iteration)]
> update iteration set processed='true' where pk_column in (select pk_column 
> from iteration);
> drop table iteration;
> 
> Parallelization is absolutely an interesting idea; if you want to use 8 
> processes, then use a cycling sequence on the side table to spread tuples 
> across the 8 processes, so that they can grab their own tuples and not block 
> one another.
> 
> In that case, more like...
> create temp sequence seq_procs start with 1 maxval 8 cycle;
> create temp table just_keys as select pk_column, false::boolean as processed, 
> nextval('seq_procs') as batch_id from big_historical_table;
> 
> The individual iterations then look for values in just_keys corresponding to 
> their assigned batch number.

Sounds like a reasonable approach. As Justin pointed out, it is actually likely 
that the process will be IO bound rather than CPU bound, so my parallel idea 
may not have much merit after all, but the batching procedure makes sense. I 
assume you meant update just_keys in your sample rather than update iteration 
on that line just before drop table iteration. Thanks for the info!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 
> -- 
> When confronted by a difficult problem, solve it by reducing it to the
> question, "How would the Lone Ranger handle this?"



Re: UPDATE many records

2020-01-06 Thread Israel Brewster
Good information. I did forget to mention that I am using PostgreSQL 11.5. I 
also was not aware of the distinction between PROCEDURE and FUNCTION, so I 
guess I used the wrong terminology there when stating that new_value is the 
result of a stored procedure. It’s actually a function.

So would your suggestion then be to create a procedure that loops through the 
records, calculating and committing each one (or, as in your older Postgres 
example, batches of 10k to 20k)?

Good point on the HD I/O bound vs processor bound, but wouldn’t that depend on 
how complicated the actual update is? Still, there is a good chance you are 
correct in that statement, so that aspect is probably not worth spending too 
much time on.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> On Jan 6, 2020, at 10:05 AM, Justin  wrote:
> 
> There are several ways to actually do this
> 
> If you have PostgreSQL 11 or higher, we now have CREATE PROCEDURE, which allows 
> committing transactions; one drawback is that it cannot parallelize from inside 
> the procedure
> https://www.postgresql.org/docs/11/sql-createprocedure.html 
> 
> https://severalnines.com/database-blog/overview-new-stored-procedures-postgresql-11
>  
> 
> 
> If it's an older version, then a Python script (or other scripting language) 
> that iterates over the data in batches of, say, 10 to 20K will do what you want
> 
> for i in list of IDs 
> begin ;
> "UPDATE table_name SET changed_field=new_value()  where ID @> int4range(i, 
> i+1);
> commit;
> 
> 
> To create parallel processes, a simple Python script (or other scripting language) 
> can be used to open many connections working the data in parallel, but given the 
> simple update it will NOT help performance: this will be hard-disk IO bound, not 
> process bound, which is where parallelization helps 
> 
> 
> 
> On Mon, Jan 6, 2020 at 1:36 PM Israel Brewster wrote:
> Thanks to a change in historical data, I have a need to update a large number 
> of records (around 50 million). The update itself is straight forward, as I 
> can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, 
> new_value is the result of a stored procedure, if that makes a difference) 
> command via psql, and it should work. However, due to the large number of 
> records this command will obviously take a while, and if anything goes wrong 
> during the update (one bad value in row 45 million, lost connection, etc), 
> all the work that has been done already will be lost due to the transactional 
> nature of such commands (unless I am missing something).
> 
> Given that each row update is completely independent of any other row, I have 
> the following questions:
> 
> 1) Is there any way to set the command such that each row change is committed 
> as it is calculated?
> 2) Is there some way to run this command in parallel in order to better 
> utilize multiple processor cores, other than manually breaking the data into 
> chunks and running a separate psql/update process for each chunk? Honestly, 
> manual parallelizing wouldn’t be too bad (there are a number of logical 
> segregations I can apply), I’m just wondering if there is a more automatic 
> option.
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory 
> Geophysical Institute - UAF 
> 2156 Koyukuk Drive 
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
> 



Re: UPDATE many records

2020-01-06 Thread Justin
There are several ways to actually do this

If you have PostgreSQL 11 or higher, we now have CREATE PROCEDURE, which
allows committing transactions; one drawback is that it cannot parallelize
from inside the procedure
https://www.postgresql.org/docs/11/sql-createprocedure.html
https://severalnines.com/database-blog/overview-new-stored-procedures-postgresql-11

If it's an older version, then a Python script (or other scripting language)
that iterates over the data in batches of, say, 10 to 20K will do what you want

for i in list of IDs
begin ;
"UPDATE table_name SET changed_field=new_value()  where ID @> int4range(i,
i+1);
commit;


To create parallel processes, a simple Python script (or other scripting
language) can be used to open many connections working the data in parallel,
but given the simple update it will NOT help performance: this will be
hard-disk IO bound, not process bound, which is where parallelization helps
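
To make the iteration concrete, here is a rough sketch of what each scripted
round trip could send, using the generic table_name / changed_field /
new_value() names from this thread and an assumed batch width of 10,000 ids
(note the range goes on the left of @>, since @> tests range-contains-element):

BEGIN;
UPDATE table_name
   SET changed_field = new_value()
 WHERE int4range(1, 10001) @> id;  -- next round: int4range(10001, 20001), and so on
COMMIT;

The script only needs to advance the range bounds and check the command status
before moving on to the next slice.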



On Mon, Jan 6, 2020 at 1:36 PM Israel Brewster 
wrote:

> Thanks to a change in historical data, I have a need to update a large
> number of records (around 50 million). The update itself is straight
> forward, as I can just issue an "UPDATE table_name SET
> changed_field=new_value();" (yes, new_value is the result of a stored
> procedure, if that makes a difference) command via psql, and it should
> work. However, due to the large number of records this command will
> obviously take a while, and if anything goes wrong during the update (one
> bad value in row 45 million, lost connection, etc), all the work that has
> been done already will be lost due to the transactional nature of such
> commands (unless I am missing something).
>
> Given that each row update is completely independent of any other row, I
> have the following questions:
>
> 1) Is there any way to set the command such that each row change is
> committed as it is calculated?
> 2) Is there some way to run this command in parallel in order to better
> utilize multiple processor cores, other than manually breaking the data
> into chunks and running a separate psql/update process for each chunk?
> Honestly, manual parallelizing wouldn’t be too bad (there are a number of
> logical segregations I can apply), I’m just wondering if there is a more
> automatic option.
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
>
>


Re: UPDATE many records

2020-01-06 Thread Christopher Browne
On Mon, 6 Jan 2020 at 13:36, Israel Brewster  wrote:

> Thanks to a change in historical data, I have a need to update a large
> number of records (around 50 million). The update itself is straight
> forward, as I can just issue an "UPDATE table_name SET
> changed_field=new_value();" (yes, new_value is the result of a stored
> procedure, if that makes a difference) command via psql, and it should
> work. However, due to the large number of records this command will
> obviously take a while, and if anything goes wrong during the update (one
> bad value in row 45 million, lost connection, etc), all the work that has
> been done already will be lost due to the transactional nature of such
> commands (unless I am missing something).
>
> Given that each row update is completely independent of any other row, I
> have the following questions:
>
> 1) Is there any way to set the command such that each row change is
> committed as it is calculated?
> 2) Is there some way to run this command in parallel in order to better
> utilize multiple processor cores, other than manually breaking the data
> into chunks and running a separate psql/update process for each chunk?
> Honestly, manual parallelizing wouldn’t be too bad (there are a number of
> logical segregations I can apply), I’m just wondering if there is a more
> automatic option.
>

Yeah, I'd be inclined to do this in batches.

If, for instance, the table has a nice primary key, then I'd capture the
primary keys into a side table, and grab tuples from the side table to
process in more bite-sized batches, say, of a few thousand tuples per batch.

create table just_keys as select pk_column from big_historical_table;
alter table just_keys add column processed boolean;
create index jkpk on just_keys(pk_column) where (processed is null);
then loop repeatedly along the lines...

create temp table iteration as select pk_column from just_keys where
processed is null limit 1000;
[do update on big_historical_table where pk_column in (select pk_column
from iteration)]
update iteration set processed='true' where pk_column in (select pk_column
from iteration);
drop table iteration;

Parallelization is absolutely an interesting idea; if you want to use 8
processes, then use a cycling sequence on the side table to spread tuples
across the 8 processes, so that they can grab their own tuples and not
block one another.

In that case, more like...
create temp sequence seq_procs start with 1 maxval 8 cycle;
create temp table just_keys as select pk_column, false::boolean as
processed, nextval('seq_procs') as batch_id from big_historical_table;

The individual iterations then look for values in just_keys corresponding
to their assigned batch number.
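
For reference, here is an untested consolidation of the above into a
per-worker procedure for PostgreSQL 11 or later. It assumes just_keys is a
regular (non-temp) table so that every worker's connection can see it, that it
was built with the batch_id column as above, and it reuses the generic
pk_column / changed_field / new_value() names from this thread (with the
correction, acknowledged elsewhere in the thread, of marking processed rows in
just_keys rather than in iteration):

create procedure process_batch(worker int)
language plpgsql
as $$
begin
  loop
    -- grab this worker's next slice of unprocessed keys
    create temp table iteration as
      select pk_column
        from just_keys
       where processed is not true
         and batch_id = worker
       limit 1000;

    exit when not exists (select 1 from iteration);

    update big_historical_table
       set changed_field = new_value()
     where pk_column in (select pk_column from iteration);

    update just_keys
       set processed = true
     where pk_column in (select pk_column from iteration);

    drop table iteration;
    commit;                    -- one transaction per batch
  end loop;

  drop table iteration;        -- clean up the final, empty slice
end;
$$;

Each of the 8 workers then opens its own connection and runs
CALL process_batch(n); for its own n, from outside an explicit transaction
block so the per-batch COMMITs are allowed.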

-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: UPDATE many records

2020-01-06 Thread Adrian Klaver

On 1/6/20 10:36 AM, Israel Brewster wrote:
> Thanks to a change in historical data, I have a need to update a large
> number of records (around 50 million). The update itself is straight
> forward, as I can just issue an "UPDATE table_name SET
> changed_field=new_value();" (yes, new_value is the result of a stored
> procedure, if that makes a difference) command via psql, and it should
> work. However, due to the large number of records this command will
> obviously take a while, and if anything goes wrong during the update
> (one bad value in row 45 million, lost connection, etc), all the work
> that has been done already will be lost due to the transactional nature
> of such commands (unless I am missing something).
>
> Given that each row update is completely independent of any other row, I
> have the following questions:
>
> 1) Is there any way to set the command such that each row change is
> committed as it is calculated?


Pretty sure:
UPDATE table_name SET changed_field=new_value();
is seen as a single statement and is all or none.

If you want to go row by row you will need to have the statement run on 
a row by row basis or maybe in batches.


> 2) Is there some way to run this command in parallel in order to better
> utilize multiple processor cores, other than manually breaking the data
> into chunks and running a separate psql/update process for each chunk?
> Honestly, manual parallelizing wouldn’t be too bad (there are a number
> of logical segregations I can apply), I’m just wondering if there is a
> more automatic option.


This is a good time to ask: what Postgres version?

I am still working out the recent parallel query system additions. Not
sure if they apply to UPDATE or not.



> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145




--
Adrian Klaver
adrian.kla...@aklaver.com




UPDATE many records

2020-01-06 Thread Israel Brewster
Thanks to a change in historical data, I have a need to update a large number 
of records (around 50 million). The update itself is straight forward, as I can 
just issue an "UPDATE table_name SET changed_field=new_value();" (yes, 
new_value is the result of a stored procedure, if that makes a difference) 
command via psql, and it should work. However, due to the large number of 
records this command will obviously take a while, and if anything goes wrong 
during the update (one bad value in row 45 million, lost connection, etc), all 
the work that has been done already will be lost due to the transactional 
nature of such commands (unless I am missing something).

Given that each row update is completely independent of any other row, I have 
the following questions:

1) Is there any way to set the command such that each row change is committed 
as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize 
multiple processor cores, other than manually breaking the data into chunks and 
running a separate psql/update process for each chunk? Honestly, manual 
parallelizing wouldn’t be too bad (there are a number of logical segregations I 
can apply), I’m just wondering if there is a more automatic option.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145