Re: pg_dumpall - restoration problem

2024-04-06 Thread Tom Lane
Tony Bazeley  writes:
> I've a problem restoring a cluster created with pg_dumpall from 14.8
> (pg_dumpall > pgall.out and then psql -f pgall.out postgres).
> ...
> Attempting to restore to postgresql-16 results in errors

> 2024-04-05 22:17:15.418 ACDT [6565] postgres@tonbaz ERROR:  collation 
> "pg_catalog.C.UTF-8" for encoding "UTF8" does not exist at character 366 

> I don't understand the 'class text COLLATE pg_catalog."C.UTF-8"' syntax, but
> select * from pg_collation shows a C.UTF8 collation and no C.UTF-8.

I take it you are trying to restore onto a different OS platform with
different locale naming conventions.  The easiest way to deal with it
probably is to edit the dump file and change "C.UTF-8" to "C.UTF8"
everywhere.  (Manually editing an 8G dump file might be no fun, but
"sed" should make short work of it.)

regards, tom lane




pg_dumpall - restoration problem

2024-04-06 Thread Tony Bazeley
I've a problem restoring a cluster created with pg_dumpall from 14.8
(pg_dumpall > pgall.out and then psql -f pgall.out postgres).


pgall.out was recovered after a hardware failure on the hosting machine.


Attempting to restore to postgresql-16 results in errors:
...
psql:/tmp/pgall.out:5172242: error: invalid command \N 
psql:/tmp/pgall.out:5172243: error: invalid command \N 
psql:/tmp/pgall.out:5172244: error: invalid command \N 
psql:/tmp/pgall.out:5172245: error: invalid command \N 
psql:/tmp/pgall.out:5172246: error: invalid command \N 
psql:/tmp/pgall.out:5172247: error: invalid command \N 
psql:/tmp/pgall.out:5172248: error: invalid command \N 
psql:/tmp/pgall.out:5172249: error: invalid command \N 
psql:/tmp/pgall.out:5660594: error: out of memory
Which strikes me as a bit strange, as it's a lightly loaded 32GB machine and
an 8.5GB dump file.

On examination, all tables up to the table bug_line below appear to be read
in, and the error below appears in the log file:


2024-04-05 22:17:15.418 ACDT [6565] postgres@tonbaz ERROR:  collation 
"pg_catalog.C.UTF-8" for encoding "UTF8" does not exist at character 366 

2024-04-05 22:17:15.418 ACDT [6565] postgres@tonbaz STATEMENT:  CREATE TABLE 
public.bug_line ( 
   id integer NOT NULL, 
   routenm character varying(254), 
   type character varying(254), 
   status character varying(254), 
   the_geom public.geometry(LineString,28354), 
   category text, 
   code text, 
   src text, 
   name text, 
   timing text, 
   refplan2015 integer, 
   comments text, 
   descrip text, 
   class text COLLATE pg_catalog."C.UTF-8" 
   );
Thinking it might be something specific to PostgreSQL 16, I installed a copy
of PostgreSQL 14 (14.11), which failed in exactly the same way.

I don't understand the 'class text COLLATE pg_catalog."C.UTF-8"' syntax, but
select * from pg_collation shows a C.UTF8 collation and no C.UTF-8.

Any help on how to proceed would be most appreciated

TIA 
Tony Bazeley


Re: Moving delta data faster

2024-04-06 Thread Adrian Klaver

On 4/6/24 13:04, yudhi s wrote:


On Sat, Apr 6, 2024 at 10:25 PM Adrian Klaver wrote:



Your original problem description was:

"Then subsequently these rows will be inserted/updated based on the
delta number of rows that got inserted/updated in the source database.
In some cases these changed data can flow multiple times per day to the
downstream i.e. postgres database and in other cases once daily."

If the above is not a hard rule, then yes, up to some point just
replacing the data en masse would be the simplest/fastest method. You
could cut a step out by doing something like TRUNCATE target_tab and
then COPY target_tab FROM 'source.csv', bypassing the INSERT INTO
source_tab.

Yes, actually I didn't realize that TRUNCATE TABLE is transactional/online
here in Postgres. In other databases like Oracle it means downtime for read
queries on the target table, as the data vanishes from the target table
after the truncate (until the data load happens), and those statements
auto-commit. Thanks Veem for sharing that option.


I also think that truncate will be faster if the changes/delta are large,
but if it's a handful of rows, say <5% of the rows in the table, then
upsert/merge will perform better. The downside of the truncate option is
that it requires exporting all the data from the source to the S3 file,
which may take longer than bringing just the delta records. Correct me if
I'm wrong.


Since you still have not specified how the data is stored in S3 and how
you propose to move it into Postgres, I can't really answer.




However, I am still not able to understand why the upsert is less
performant than merge; could you throw some light on this, please?




I have no idea how this works in the code, but my suspicion is it is due
to the following:

https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT

"The optional ON CONFLICT clause specifies an alternative action to 
raising a unique violation or exclusion constraint violation error. For 
each individual row proposed for insertion, either the insertion 
proceeds, or, if an arbiter constraint or index specified by 
conflict_target is violated, the alternative conflict_action is taken. 
ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative 
action. ON CONFLICT DO UPDATE updates the existing row that conflicts 
with the row proposed for insertion as its alternative action."


vs this:

"First, the MERGE command performs a join from data_source to 
target_table_name producing zero or more candidate change rows. For each 
candidate change row, the status of MATCHED or NOT MATCHED is set just 
once, after which WHEN clauses are evaluated in the order specified. For 
each candidate change row, the first clause to evaluate as true is 
executed. No more than one WHEN clause is executed for any candidate 
change row."


ON CONFLICT attempts the INSERT and then, on failure, performs the UPDATE
in the ON CONFLICT DO UPDATE case. MERGE, on the other hand, evaluates the
join condition (ON tbl1.fld = tbl2.fld) and then, based on MATCHED/NOT
MATCHED, takes the appropriate action for the first WHEN clause that
matches. In other words, it goes directly to the appropriate action.
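
Side by side, a minimal sketch of the two statements being compared (using
the source_tab/target_tab layout from the test routine further down this
digest; MERGE needs PostgreSQL 15 or later):

    -- Upsert: every source row is proposed for insertion first; only a
    -- conflict on the primary key redirects it to the UPDATE action.
    INSERT INTO target_tab (id, column1, column2)
    SELECT id, column1, column2 FROM source_tab
    ON CONFLICT (id) DO UPDATE
       SET column1 = EXCLUDED.column1,
           column2 = EXCLUDED.column2;

    -- Merge: the join decides MATCHED / NOT MATCHED up front, then exactly
    -- one WHEN branch runs for each candidate row.
    MERGE INTO target_tab t
    USING source_tab s ON t.id = s.id
    WHEN MATCHED THEN
       UPDATE SET column1 = s.column1, column2 = s.column2
    WHEN NOT MATCHED THEN
       INSERT (id, column1, column2) VALUES (s.id, s.column1, s.column2);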


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Moving delta data faster

2024-04-06 Thread yudhi s
On Sat, Apr 6, 2024 at 10:25 PM Adrian Klaver wrote:

>
> Your original problem description was:
>
> "Then subsequently these rows will be inserted/updated based on the
> delta number of rows that got inserted/updated in the source database.
> In some cases these changed data can flow multiple times per day to the
> downstream i.e. postgres database and in other cases once daily."
>
> If the above is not a hard rule, then yes, up to some point just
> replacing the data en masse would be the simplest/fastest method. You
> could cut a step out by doing something like TRUNCATE target_tab and
> then COPY target_tab FROM 'source.csv', bypassing the INSERT INTO
> source_tab.
>
Yes, actually I didn't realize that TRUNCATE TABLE is transactional/online
here in Postgres. In other databases like Oracle it means downtime for read
queries on the target table, as the data vanishes from the target table
after the truncate (until the data load happens), and those statements
auto-commit. Thanks Veem for sharing that option.

I also think that truncate will be faster if the changes/delta are large,
but if it's a handful of rows, say <5% of the rows in the table, then
upsert/merge will perform better. The downside of the truncate option is
that it requires exporting all the data from the source to the S3 file,
which may take longer than bringing just the delta records. Correct me if
I'm wrong.

However, I am still not able to understand why the upsert is less
performant than merge; could you throw some light on this, please?


Re: About postgres pg_upgrade

2024-04-06 Thread Bruce Momjian
On Fri, Apr  5, 2024 at 09:41:05AM +0200, Daniel Gustafsson wrote:
> > On 5 Apr 2024, at 06:47, Rama Krishnan  wrote:
> 
> > Could you please explain to me how pg_upgrade works? One of my friends says it
> > works based on pg_restore; I am a bit confused.
> 
> The documentation does a fairly good job explaining how it works, and there are
> multiple blog posts and presentations on the subject to be found.
> 
>   https://www.postgresql.org/docs/devel/pgupgrade.html
> 
> To summarize in a few words, pg_upgrade does a dump/restore of the schema from
> the old cluster into the new, but copies the table data.  Your friend thus isn't
> wrong: it does use pg_restore, but only for the schema, not the data.

This presentation shows how it works internally:

https://momjian.us/main/writings/pgsql/pg_upgrade.pdf

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Moving delta data faster

2024-04-06 Thread Adrian Klaver

On 4/6/24 08:47, yudhi s wrote:

Thank you Adrian, Greg and Veem.

I tried writing a small routine to see how the performance differs across
these four approaches, i.e. upsert vs. traditional update+insert vs. merge
vs. truncate+load.


Initially I was thinking upsert would perform the same as merge, as the
logic looks similar, but it seems to be the worst performing of all; I'm
not sure why yet. Truncate+load seems to be the best performing of all. I
hope I am doing it correctly; please correct me if I'm wrong.


Your original problem description was:

"Then subsequently these rows will be inserted/updated based on the 
delta number of rows that got inserted/updated in the source database. 
In some cases these changed data can flow multiple times per day to the 
downstream i.e. postgres database and in other cases once daily."


If the above is not a hard rule, then yes, up to some point just
replacing the data en masse would be the simplest/fastest method. You
could cut a step out by doing something like TRUNCATE target_tab and
then COPY target_tab FROM 'source.csv', bypassing the INSERT INTO
source_tab.
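
A minimal sketch of that shortened path, assuming the extract has already
been pulled from S3 to a server-readable CSV (placeholder path below, header
row assumed) whose columns line up with target_tab; for a client-side file,
psql's \copy takes the same form:

    TRUNCATE target_tab;
    COPY target_tab (id, column1, column2)
        FROM '/path/to/source.csv' WITH (FORMAT csv, HEADER true);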




--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Moving delta data faster

2024-04-06 Thread yudhi s
Thank you Adrian, Greg and Veem.

I tried writing a small routine to see how the performance differs across
these four approaches, i.e. upsert vs. traditional update+insert vs. merge
vs. truncate+load.

Initially I was thinking upsert would perform the same as merge, as the
logic looks similar, but it seems to be the worst performing of all; I'm
not sure why yet. Truncate+load seems to be the best performing of all. I
hope I am doing it correctly; please correct me if I'm wrong.

UPSERT approach execution time: 00:00:20.921343
UPSERT approach rows: 100

insert/update approach execution time: 00:00:15.53612
insert/update approach update rows: 50
insert/update approach Insert rows: 50

MERGE approach execution time: 00:00:14.884623
MERGE approach rows: 100

truncate load approach execution time: 00:00:07.428826
truncate load rows: 100

*** Routine ***

*** UPSERT Testcase ***
drop table source_tab;
drop table target_tab;

CREATE TABLE source_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 100) AS i;

INSERT INTO target_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 50) AS i;

DO $$
DECLARE
start_time timestamp;
end_time timestamp;
rows_inserted integer:=0;
rows_updated integer:=0;
rows_upserted integer:=0;
rows_merged integer:=0;
BEGIN
-- Measure performance of UPSERT
start_time := clock_timestamp();
INSERT INTO target_tab (id, column1, column2)
SELECT id, column1, column2
FROM source_tab
ON CONFLICT (id) DO UPDATE
SET
column1 = EXCLUDED.column1,
column2 = EXCLUDED.column2;
get diagnostics rows_upserted=row_count;
end_time := clock_timestamp();
RAISE NOTICE 'UPSERT approach execution time: %', end_time - start_time;
RAISE NOTICE 'UPSERT approach rows: %', rows_upserted;

rollback;
END $$;


*** Traditional Insert+update Testcase ***
drop table source_tab;
drop table target_tab;

CREATE TABLE source_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 100) AS i;

INSERT INTO target_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 50) AS i;

DO $$
DECLARE
start_time timestamp;
end_time timestamp;
rows_inserted integer:=0;
rows_updated integer:=0;
rows_upserted integer:=0;
rows_merged integer:=0;
BEGIN
-- Measure performance of insert/update approach
start_time := clock_timestamp();
-- Update existing records
UPDATE target_tab AS t
SET
column1 = s.column1,
column2 = s.column2
FROM source_tab AS s
WHERE t.id = s.id;
get diagnostics rows_updated=row_count;

-- Insert new records
INSERT INTO target_tab (id, column1, column2)
SELECT s.id, s.column1, s.column2
FROM source_tab AS s
LEFT JOIN target_tab AS t ON s.id = t.id
WHERE t.id IS NULL;
get diagnostics rows_inserted=row_count;

end_time := clock_timestamp();
RAISE NOTICE 'insert/update approach execution time: %', end_time -
start_time;
RAISE NOTICE 'insert/update approach update rows : %', rows_updated;
RAISE NOTICE 'insert/update approach Insert rows: %', rows_inserted;

rollback;
END $$;


*** MERGE Testcase ***
drop table source_tab;
drop table target_tab;

CREATE TABLE source_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 100) AS i;

INSERT INTO target_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 50) AS i;

DO $$
DECLARE
start_time timestamp;
end_time timestamp;
rows_inserted integer:=0;
rows_updated integer:=0;
rows_upserted integer:=0;
rows_merged integer:=0;
begin

start_time := clock_timestamp();

MERGE INTO target_tab t
USING source_tab s ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET
        column1 = s.column1,
        column2 = s.column2
WHEN NOT MATCHED THEN
    INSERT VALUES (id, column1, column2);
get diagnostics rows_merged=row_count;

end_time := clock_timestamp();
RAISE NOTICE 'MERGE approach execution time: %', end_time - start_time;
RAISE NOTICE 'MERGE approach rows: %', rows_merged;

rollback;
END $$;


*** Truncate+load Testcase ***
drop table source_tab;
drop table target_tab;

CREATE TABLE source_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
id 

Re: Moving delta data faster

2024-04-06 Thread veem v
On Fri, 5 Apr 2024 at 06:10, Adrian Klaver wrote:

>
> > S3 is not a database. You will need to be more specific about '...
> then
> > from the S3 it will be picked and gets merged to the target postgres
> > database.'
> >
> >
> > The data from S3 will be dumped into the stage table and then the
> > upsert/merge from that table to the actual table.
>
> The S3 --> staging table would be helped by having the data as CSV and
> then using COPY. The staging --> final table step could be done as
> either ON CONFLICT or MERGE, you would need to test in your situation to
> verify which works better.
>

Just a thought: in case the delta changes are really high (say >30-40% of
the total number of rows in the table), could the OP also evaluate the
"truncate target table + load target table" strategy here, considering that
DDL/TRUNCATE is transactional in Postgres and so can be done online without
impacting ongoing read queries, and that performance-wise it would be
faster than the traditional update/insert/upsert/merge?
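
A minimal sketch of that strategy, reusing the source_tab/target_tab names
from the test routine earlier in the digest as the staging and target
tables:

    BEGIN;
    TRUNCATE target_tab;
    INSERT INTO target_tab (id, column1, column2)
    SELECT id, column1, column2 FROM source_tab;  -- full reload, not just the delta
    COMMIT;  -- until this point the whole reload can still be rolled back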


Parallel GIN index?

2024-04-06 Thread Andreas Joseph Krogh


Any plans for $subject?

--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com 
www.visena.com