Re: Moving delta data faster

2024-04-09 Thread yudhi s
On Sun, Apr 7, 2024 at 2:25 AM Adrian Klaver 
wrote:

>
> I have no idea how this works in the code, but my suspicion is it is due
> to the following:
>
> https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT
>
> "The optional ON CONFLICT clause specifies an alternative action to
> raising a unique violation or exclusion constraint violation error. For
> each individual row proposed for insertion, either the insertion
> proceeds, or, if an arbiter constraint or index specified by
> conflict_target is violated, the alternative conflict_action is taken.
> ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative
> action. ON CONFLICT DO UPDATE updates the existing row that conflicts
> with the row proposed for insertion as its alternative action."
>
> vs this:
>
> "First, the MERGE command performs a join from data_source to
> target_table_name producing zero or more candidate change rows. For each
> candidate change row, the status of MATCHED or NOT MATCHED is set just
> once, after which WHEN clauses are evaluated in the order specified. For
> each candidate change row, the first clause to evaluate as true is
> executed. No more than one WHEN clause is executed for any candidate
> change row."
>
> Where ON CONFLICT attempts the INSERT then on failure does the UPDATE
> for the ON CONFLICT DO UPDATE case. MERGE on the hand evaluates based on
> the join condition(ON tbl1.fld =tbl2.fld) and then based on MATCH/NOT
> MATCHED takes the appropriate action for the first WHEN match. In other
> words it goes directly to the appropriate action.
>
>
Thank you Adrian. I think you are spot on about the cause of the upsert being
slower than MERGE. Below are the explain plans I captured for both operations.
The planning time is actually smaller for the UPSERT, because it only has to
rely on the arbiter constraint, but the execution takes longer because the
UPDATE has to wait for each row that fails the INSERT on the primary key
constraint. MERGE, on the other hand, is evaluated on the join, so it can
directly identify the set of rows that have to be updated rather than waiting
for the INSERT to fail on the PK constraint.
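
For reference, here is a minimal sketch of the two statements behind the plans
below, assuming the same source_tab/target_tab layout as in the test routine
posted earlier in this thread:

-- UPSERT: every source row is first attempted as an INSERT and only
-- the conflicting rows fall back to the UPDATE action
INSERT INTO target_tab (id, column1, column2)
SELECT id, column1, column2
FROM source_tab
ON CONFLICT (id) DO UPDATE
SET column1 = EXCLUDED.column1,
column2 = EXCLUDED.column2;

-- MERGE: the source is joined to the target once and each candidate
-- row goes straight to the first WHEN clause that applies
MERGE INTO target_tab t
USING source_tab s
ON t.id = s.id
WHEN MATCHED THEN
UPDATE SET column1 = s.column1, column2 = s.column2
WHEN NOT MATCHED THEN
INSERT (id, column1, column2) VALUES (s.id, s.column1, s.column2);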

***** Explain plan for UPSERT *****

Insert on public.target_tab  (cost=0.00..17353.00 rows=0 width=0) (actual time=19957.569..19957.570 rows=0 loops=1)
  Conflict Resolution: UPDATE
  Conflict Arbiter Indexes: target_tab_pkey
  Tuples Inserted: 50
  Conflicting Tuples: 50
  Buffers: shared hit=8545735 written=10094
  ->  Seq Scan on public.source_tab  (cost=0.00..17353.00 rows=100 width=29) (actual time=0.006..208.306 rows=100 loops=1)
        Output: source_tab.id, source_tab.column1, source_tab.column2
        Buffers: shared hit=7353
Settings: effective_cache_size = '10475192kB', maintenance_io_concurrency = '1', max_parallel_workers = '32', max_parallel_workers_per_gather = '4', search_path = 'public, public, "$user"', temp_buffers = '16MB', work_mem = '8MB'
Query Identifier: -1356019529835809419
Planning:
  Buffers: shared hit=41
Planning Time: 0.199 ms
Execution Time: 19959.261 ms

***** Explain plan for Merge *****

Merge on public.target_tab t  (cost=17368.00..53460.01 rows=0 width=0) (actual time=14209.966..14209.968 rows=0 loops=1)
  Tuples: inserted=50 updated=50
  Buffers: shared hit=5040097 written=10460, temp read=4143 written=4143
  I/O Timings: temp read=26.746 write=68.596
  ->  Hash Left Join  (cost=17368.00..53460.01 rows=100 width=35) (actual time=179.233..1332.264 rows=100 loops=1)
        Output: t.ctid, s.column1, s.column2, s.id
        Inner Unique: true
        Hash Cond: (s.id = t.id)
        Buffers: shared hit=11029, temp read=4143 written=4143
        I/O Timings: temp read=26.746 write=68.596
        ->  Seq Scan on public.source_tab s  (cost=0.00..17353.00 rows=100 width=29) (actual time=0.008..268.506 rows=100 loops=1)
              Output: s.column1, s.column2, s.id
              Buffers: shared hit=7353
        ->  Hash  (cost=8676.00..8676.00 rows=50 width=10) (actual time=178.101..178.102 rows=50 loops=1)
              Output: t.ctid, t.id
              Buckets: 524288  Batches: 2  Memory Usage: 14824kB
              Buffers: shared hit=3676, temp written=977
              I/O Timings: temp write=5.904
              ->  Seq Scan on public.target_tab t  (cost=0.00..8676.00 rows=50 width=10) (actual time=0.007..66.441 rows=50 loops=1)
                    Output: t.ctid, t.id
                    Buffers: shared hit=3676
Settings: effective_cache_size = '10475192kB', maintenance_io_concurrency = '1', max_parallel_workers = '32', max_parallel_workers_per_gather = '4', search_path = 'public, public, "$user"', temp_buffers = '16MB', work_mem = '8MB'
Query Identifier: -2297080081674771467
Planning:
  Buffers: shared hit=85
Planning Time: 0.466 ms
Execution Time: 14212.061 ms


Re: Moving delta data faster

2024-04-06 Thread Adrian Klaver

On 4/6/24 13:04, yudhi s wrote:


On Sat, Apr 6, 2024 at 10:25 PM Adrian Klaver wrote:



Your original problem description was:

"Then subsequently these rows will be inserted/updated based on the
delta number of rows that got inserted/updated in the source database.
In some cases these changed data can flow multiple times per day to the
downstream i.e. postgres database and in other cases once daily."

If the above is not a hard rule, then yes up to some point just
replacing the data in mass would be the simplest/fastest method. You
could cut a step out by doing something like TRUNCATE target_tab and
then COPY target_tab FROM 'source.csv' bypassing the INSERT INTO
source_tab.

Yes, actually I didn't realize that TRUNCATE is transactional/online here in
Postgres. In other databases like Oracle it means downtime for read queries on
the target table, since the data vanishes from the table after the truncate
(until the data load happens) and those statements auto-commit. Thanks Veem
for sharing that option.

I also think that truncate+load will be faster if the changes/delta is large,
but if it is only a handful of rows, say <5% of the rows in the table, then
Upsert/Merge will perform better. The other downside of the truncate option is
that it requires exporting all the data from the source to the S3 file, which
may take longer than bringing just the delta records. Correct me if I'm wrong.


Since you still have not specified how the data is stored in S3 and how 
you propose to move them into Postgres I can't really answer.




However, I am still not able to understand why the upsert is less performant
than merge. Could you throw some light on this please?




I have no idea how this works in the code, but my suspicion is it is due
to the following:

https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT

"The optional ON CONFLICT clause specifies an alternative action to 
raising a unique violation or exclusion constraint violation error. For 
each individual row proposed for insertion, either the insertion 
proceeds, or, if an arbiter constraint or index specified by 
conflict_target is violated, the alternative conflict_action is taken. 
ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative 
action. ON CONFLICT DO UPDATE updates the existing row that conflicts 
with the row proposed for insertion as its alternative action."


vs this:

"First, the MERGE command performs a join from data_source to 
target_table_name producing zero or more candidate change rows. For each 
candidate change row, the status of MATCHED or NOT MATCHED is set just 
once, after which WHEN clauses are evaluated in the order specified. For 
each candidate change row, the first clause to evaluate as true is 
executed. No more than one WHEN clause is executed for any candidate 
change row."


Where ON CONFLICT attempts the INSERT then on failure does the UPDATE 
for the ON CONFLICT DO UPDATE case. MERGE on the hand evaluates based on 
the join condition(ON tbl1.fld =tbl2.fld) and then based on MATCH/NOT 
MATCHED takes the appropriate action for the first WHEN match. In other 
words it goes directly to the appropriate action.


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Moving delta data faster

2024-04-06 Thread yudhi s
On Sat, Apr 6, 2024 at 10:25 PM Adrian Klaver 
wrote:

>
> Your original problem description was:
>
> "Then subsequently these rows will be inserted/updated based on the
> delta number of rows that got inserted/updated in the source database.
> In some cases these changed data can flow multiple times per day to the
> downstream i.e. postgres database and in other cases once daily."
>
> If the above is not a hard rule, then yes up to some point just
> replacing the data in mass would be the simplest/fastest method. You
> could cut a step out by doing something like TRUNCATE target_tab and
> then COPY target_tab FROM 'source.csv' bypassing the INSERT INTO
> source_tab.
>
Yes, actually I didn't realize that TRUNCATE is transactional/online here in
Postgres. In other databases like Oracle it means downtime for read queries on
the target table, since the data vanishes from the table after the truncate
(until the data load happens) and those statements auto-commit. Thanks Veem
for sharing that option.

I also think that truncate+load will be faster if the changes/delta is large,
but if it is only a handful of rows, say <5% of the rows in the table, then
Upsert/Merge will perform better. The other downside of the truncate option is
that it requires exporting all the data from the source to the S3 file, which
may take longer than bringing just the delta records. Correct me if I'm wrong.

However, I am still not able to understand why the upsert is less performant
than merge. Could you throw some light on this please?


Re: Moving delta data faster

2024-04-06 Thread Adrian Klaver

On 4/6/24 08:47, yudhi s wrote:

Thank you Adrian, Greg and Veem.

I tried writing a small routine to see how the performance differs across
these four approaches, i.e. Upsert vs traditional update+insert vs Merge vs
Truncate+load.

Initially I was thinking Upsert would perform the same as Merge, since the
logic looks similar, but it turned out to be the worst performing of all and I
don't yet know the reason why. Truncate+load seems to be the best performing
of all. I hope I am doing it correctly; please correct me if I'm wrong.


Your original problem description was:

"Then subsequently these rows will be inserted/updated based on the 
delta number of rows that got inserted/updated in the source database. 
In some cases these changed data can flow multiple times per day to the 
downstream i.e. postgres database and in other cases once daily."


If the above is not a hard rule, then yes up to some point just 
replacing the data in mass would be the simplest/fastest method. You 
could cut a step out by doing something like TRUNCATE target_tab and 
then COPY target_tab FROM 'source.csv' bypassing the INSERT INTO source_tab.
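
A minimal sketch of that path, with the file name and CSV options as
placeholders (COPY ... FROM 'file' reads a file on the database server and
needs the matching privileges; from a client, psql's
\copy target_tab FROM 'source.csv' WITH (FORMAT csv) does the same thing):

TRUNCATE target_tab;
COPY target_tab FROM 'source.csv' WITH (FORMAT csv);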




--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Moving delta data faster

2024-04-06 Thread yudhi s
Thank you Adrian, Greg and Veem.

I tried writing a small routine to see how the performance differs across
these four approaches, i.e. Upsert vs traditional update+insert vs Merge vs
Truncate+load.

Initially I was thinking Upsert would perform the same as Merge, since the
logic looks similar, but it turned out to be the worst performing of all and I
don't yet know the reason why. Truncate+load seems to be the best performing
of all. I hope I am doing it correctly; please correct me if I'm wrong.

UPSERT approach execution time: 00:00:20.921343
UPSERT approach rows: 100

insert/update approach execution time: 00:00:15.53612
insert/update approach update rows: 50
insert/update approach Insert rows: 50

MERGE approach execution time: 00:00:14.884623
MERGE approach rows: 100

truncate load approach execution time: 00:00:07.428826
truncate load rows: 100

***** Routine *****

***** UPSERT Testcase *****
drop table source_tab;
drop table target_tab;

CREATE TABLE source_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 100) AS i;

INSERT INTO target_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 50) AS i;

DO $$
DECLARE
start_time timestamp;
end_time timestamp;
rows_inserted integer:=0;
rows_updated integer:=0;
rows_upserted integer:=0;
rows_merged integer:=0;
BEGIN
-- Measure performance of UPSERT
start_time := clock_timestamp();
INSERT INTO target_tab (id, column1, column2)
SELECT id, column1, column2
FROM source_tab
ON CONFLICT (id) DO UPDATE
SET
column1 = EXCLUDED.column1,
column2 = EXCLUDED.column2;
get diagnostics rows_upserted=row_count;
end_time := clock_timestamp();
RAISE NOTICE 'UPSERT approach execution time: %', end_time - start_time;
RAISE NOTICE 'UPSERT approach rows: %', rows_upserted;

rollback;
END $$;


***** Traditional Insert+update Testcase *****
drop table source_tab;
drop table target_tab;

CREATE TABLE source_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 100) AS i;

INSERT INTO target_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 50) AS i;

DO $$
DECLARE
start_time timestamp;
end_time timestamp;
rows_inserted integer:=0;
rows_updated integer:=0;
rows_upserted integer:=0;
rows_merged integer:=0;
BEGIN
-- Measure performance of insert/update approach
start_time := clock_timestamp();
-- Update existing records
UPDATE target_tab AS t
SET
column1 = s.column1,
column2 = s.column2
FROM source_tab AS s
WHERE t.id = s.id;
get diagnostics rows_updated=row_count;

-- Insert new records
INSERT INTO target_tab (id, column1, column2)
SELECT s.id, s.column1, s.column2
FROM source_tab AS s
LEFT JOIN target_tab AS t ON s.id = t.id
WHERE t.id IS NULL;
get diagnostics rows_inserted=row_count;

end_time := clock_timestamp();
RAISE NOTICE 'insert/update approach execution time: %', end_time -
start_time;
RAISE NOTICE 'insert/update approach update rows : %', rows_updated;
RAISE NOTICE 'insert/update approach Insert rows: %', rows_inserted;

rollback;
END $$;


***** MERGE Testcase *****
drop table source_tab;
drop table target_tab;

CREATE TABLE source_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 100) AS i;

INSERT INTO target_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 50) AS i;

DO $$
DECLARE
start_time timestamp;
end_time timestamp;
rows_inserted integer:=0;
rows_updated integer:=0;
rows_upserted integer:=0;
rows_merged integer:=0;
begin

start_time := clock_timestamp();

MERGE INTO target_tab t
USING source_tab s
ON t.id = s.id
WHEN MATCHED THEN
UPDATE SET
column1 = s.column1,
column2 = s.column2
WHEN NOT MATCHED THEN
INSERT (id, column1, column2)
VALUES (s.id, s.column1, s.column2);
get diagnostics rows_merged=row_count;

end_time := clock_timestamp();
RAISE NOTICE 'MERGE approach execution time: %', end_time - start_time;
RAISE NOTICE 'MERGE approach rows: %', rows_merged;

rollback;
END $$;


***** Truncate+load Testcase *****
drop table source_tab;
drop table target_tab;

CREATE TABLE source_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
id 

Re: Moving delta data faster

2024-04-06 Thread veem v
On Fri, 5 Apr 2024 at 06:10, Adrian Klaver 
wrote:

>
> > S3 is not a database. You will need to be more specific about '...
> then
> > from the S3 it will be picked and gets merged to the target postgres
> > database.'
> >
> >
> > The data from S3 will be dumped into the stage table and then the
> > upsert/merge from that table to the actual table.
>
> The S3 --> staging table would be helped by having the data as CSV and
> then using COPY. The staging --> final table step could be done as
> either ON CONFLICT or MERGE, you would need to test in your situation to
> verify which works better.
>

Just a thought: in case the delta changes are really large (say >30-40% of the
total number of rows in the table), could the OP also evaluate the "truncate
target table + load target table" strategy here? DDL/TRUNCATE is transactional
in Postgres, so it can be done online without impacting the ongoing read
queries, and performance-wise it should be faster than the traditional
Update/Insert/Upsert/Merge.
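
A minimal sketch of that strategy, with the file name as a placeholder;
because TRUNCATE is transactional in Postgres, the emptied table is never
visible to other sessions, and the truncate and the reload either commit
together or roll back as a whole:

BEGIN;
TRUNCATE target_tab;
COPY target_tab FROM 'source.csv' WITH (FORMAT csv);
COMMIT;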


Re: Moving delta data faster

2024-04-04 Thread Adrian Klaver

On 4/4/24 13:42, yudhi s wrote:


On Thu, Apr 4, 2024 at 9:04 PM Adrian Klaver wrote:


On 4/3/24 22:24, yudhi s wrote:
 >
 > On Thu, Apr 4, 2024 at 10:16 AM Adrian Klaver wrote:

S3 is not a database. You will need to be more specific about '... then
from the S3 it will be picked and gets merged to the target postgres
database.'


The data from S3 will be dumped into the stage table and then the 
upsert/merge from that table to the actual table.


The S3 --> staging table would be helped by having the data as CSV and 
then using COPY. The staging --> final table step could be done as 
either ON CONFLICT or MERGE, you would need to test in your situation to 
verify which works better.
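
A minimal sketch of that two-step flow, with the staging table name, file name
and key column as placeholders (the ON CONFLICT variant is shown for the
second step; MERGE would simply take its place):

-- 1) S3 --> staging table: pull the CSV down and load it with COPY/\copy
\copy stage_tab FROM 'delta.csv' WITH (FORMAT csv)

-- 2) staging --> final table
INSERT INTO target_tab (id, column1, column2)
SELECT id, column1, column2 FROM stage_tab
ON CONFLICT (id) DO UPDATE
SET column1 = EXCLUDED.column1,
column2 = EXCLUDED.column2;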


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Moving delta data faster

2024-04-04 Thread Greg Sabino Mullane
So you want advice from this open source community about moving things from
a proprietary database to another proprietary database, going through a
proprietary storage system along the way? :)

To answer the original question, use MERGE. That's its job, and your table
size is very tiny, so I doubt the performance impact will be worth worrying
about.

Cheers,
Greg


Re: Moving delta data faster

2024-04-04 Thread yudhi s
On Thu, Apr 4, 2024 at 9:04 PM Adrian Klaver 
wrote:

> On 4/3/24 22:24, yudhi s wrote:
> >
> > On Thu, Apr 4, 2024 at 10:16 AM Adrian Klaver wrote:
>
> S3 is not a database. You will need to be more specific about '... then
> from the S3 it will be picked and gets merged to the target postgres
> database.'
>
>
The data from S3 will be dumped into the stage table and then the
upsert/merge from that table to the actual table.


Re: Moving delta data faster

2024-04-04 Thread Adrian Klaver

On 4/3/24 22:24, yudhi s wrote:


On Thu, Apr 4, 2024 at 10:16 AM Adrian Klaver wrote:


On 4/3/24 20:54, yudhi s wrote:
 > On Thu, Apr 4, 2024 at 2:41 AM Adrian Klaver wrote:
 >
 >    > Thank you Adrian.
 >
 > And one thing i forgot to mention this target postgresql database
would
 > be on AWS RDS whereas the source Oracle databases is on premise.
I think
 > we don't have the FDW extension currently in place but we can get
that.
 > I am just not able to understand clearly  though, but do you mean
export
 > the data from source using CSV and do truncate and import on
target. And
 > as these data will be moved through the network won't that cause
slowness?
 >
 > The source database here is Oracle database. Correct me if wrong, it
 > looks like foreign data wrapper is like a DB link. Or do you mean
 > writing a query on the target database (which can be UPSERT or
MERGE)
 > but will be joining the table from the source database through the
 > DBlink/DDW? But my question was whether we should use UPSERT or
MERGE
 > for comparing and loading the delta records to the target postgresql
 > database. Want to understand which is more performant , as I see
in the
 > past Merge having performance issues in the past, but not very sure
 > about that.

My motivation was to get some basic information about your setup and
what you are trying to achieve.

If I understand correctly you have:

1) An Oracle database with tables that you want to copy the complete
data from to a Postgres database. For this sort of thing
COPY(https://www.postgresql.org/docs/current/sql-copy.html
) on the
Postgres end using CSV data generated from the source is probably the
quickest bulk load method.

2) After the initial load you want to do follow up INSERT/UPDATEs based
on a delta of the source tables relative to the initial load. This is
still a bit of mystery to me. How are determining the delta: a) On the
source end entirely or b) Target relative to source? Also what is the
anticipated size of the delta per transfer?

Additional information needed:

1) Network distance between source and target?

2) Network capacity?

3) Expected load on both source and target servers from other
operations?


Thank you. Actually I was trying to understand how to cater to the delta load
after the one-time load is done. The delta changes are planned to be
identified based on the primary keys of the tables: if the key is found the
record will be updated, and if the key is not found the row will be inserted.

Basically, the select query on the source database will fetch the data for a
certain time interval (based on the latest update timestamp or create
timestamp if they are available, else a full dump), put it on S3, and from S3
it will be picked up and merged into the target Postgres database. As upsert
and merge both look similar, I was wondering which we should use here for
loading the delta records.


S3 is not a database. You will need to be more specific about '... then 
from the S3 it will be picked and gets merged to the target postgres 
database.'


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Moving delta data faster

2024-04-03 Thread yudhi s
On Thu, Apr 4, 2024 at 10:16 AM Adrian Klaver 
wrote:

> On 4/3/24 20:54, yudhi s wrote:
> > On Thu, Apr 4, 2024 at 2:41 AM Adrian Klaver wrote:
> >
> >> Thank you Adrian.
> >
> > And one thing i forgot to mention this target postgresql database would
> > be on AWS RDS whereas the source Oracle databases is on premise. I think
> > we don't have the FDW extension currently in place but we can get that.
> > I am just not able to understand clearly  though, but do you mean export
> > the data from source using CSV and do truncate and import on target. And
> > as these data will be moved through the network won't that cause
> slowness?
> >
> > The source database here is Oracle database. Correct me if wrong, it
> > looks like foreign data wrapper is like a DB link. Or do you mean
> > writing a query on the target database (which can be UPSERT or MERGE)
> > but will be joining the table from the source database through the
> > DBlink/DDW? But my question was whether we should use UPSERT or MERGE
> > for comparing and loading the delta records to the target postgresql
> > database. Want to understand which is more performant , as I see in the
> > past Merge having performance issues in the past, but not very sure
> > about that.
>
> My motivation was to get some basic information about your setup and
> what you are trying to achieve.
>
> If I understand correctly you have:
>
> 1) An Oracle database with tables that you want to copy the complete
> data from to a Postgres database. For this sort of thing
> COPY(https://www.postgresql.org/docs/current/sql-copy.html) on the
> Postgres end using CSV data generated from the source is probably the
> quickest bulk load method.
>
> 2) After the initial load you want to do follow up INSERT/UPDATEs based
> on a delta of the source tables relative to the initial load. This is
> still a bit of mystery to me. How are determining the delta: a) On the
> source end entirely or b) Target relative to source? Also what is the
> anticipated size of the delta per transfer?
>
> Additional information needed:
>
> 1) Network distance between source and target?
>
> 2) Network capacity?
>
> 3) Expected load on both source and target servers from other operations?
>
>
Thank you. Actually I was trying to understand how to cater to the delta load
after the one-time load is done. The delta changes are planned to be
identified based on the primary keys of the tables: if the key is found the
record will be updated, and if the key is not found the row will be inserted.

Basically, the select query on the source database will fetch the data for a
certain time interval (based on the latest update timestamp or create
timestamp if they are available, else a full dump), put it on S3, and from S3
it will be picked up and merged into the target Postgres database. As upsert
and merge both look similar, I was wondering which we should use here for
loading the delta records.
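
Just to illustrate the extraction side, a rough sketch of the kind of delta
query I have in mind on the source, where the table name, the
last_update_ts/create_ts columns and the :last_run_time bind are hypothetical
placeholders for whatever audit columns and bookmark we end up having:

SELECT id, column1, column2
FROM src_table
WHERE last_update_ts > :last_run_time
OR create_ts > :last_run_time;

The output of that query would be written to S3 as CSV and then applied on the
Postgres side with the upsert/merge.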


Re: Moving delta data faster

2024-04-03 Thread Adrian Klaver

On 4/3/24 20:54, yudhi s wrote:
On Thu, Apr 4, 2024 at 2:41 AM Adrian Klaver wrote:


On 4/3/24 13:38, yudhi s wrote:
 >   Hi All,
 >   It's postgresql database version 15.4. We have a requirement in
which
 > we will be initially moving full table data for 3-4 tables, from
source
 > database to target(i.e. postgres) . Maximum number of rows will be
 > ~10million rows in those tables. Then subsequently these rows
will be
 > inserted/updated based on the delta number of rows that got
 > inserted/updated in the source database. In some cases these changed
 > data can flow multiple times per day to the downstream i.e. postgres
 > database and in other cases once daily.

What is the source database?

Can it be reached with a FDW?:

https://wiki.postgresql.org/wiki/Foreign_data_wrappers


Can the delta on the source be output as CSV?



Thank you Adrian.

And one thing i forgot to mention this target postgresql database would 
be on AWS RDS whereas the source Oracle databases is on premise. I think 
we don't have the FDW extension currently in place but we can get that. 
I am just not able to understand clearly  though, but do you mean export 
the data from source using CSV and do truncate and import on target. And 
as these data will be moved through the network won't that cause slowness?


The source database here is Oracle database. Correct me if wrong, it 
looks like foreign data wrapper is like a DB link. Or do you mean 
writing a query on the target database (which can be UPSERT or MERGE) 
but will be joining the table from the source database through the 
DBlink/DDW? But my question was whether we should use UPSERT or MERGE 
for comparing and loading the delta records to the target postgresql 
database. Want to understand which is more performant , as I see in the 
past Merge having performance issues in the past, but not very sure 
about that.


My motivation was to get some basic information about your setup and 
what you are trying to achieve.


If I understand correctly you have:

1) An Oracle database with tables that you want to copy the complete 
data from to a Postgres database. For this sort of thing 
COPY(https://www.postgresql.org/docs/current/sql-copy.html) on the 
Postgres end using CSV data generated from the source is probably the 
quickest bulk load method.


2) After the initial load you want to do follow up INSERT/UPDATEs based 
on a delta of the source tables relative to the initial load. This is 
still a bit of mystery to me. How are determining the delta: a) On the 
source end entirely or b) Target relative to source? Also what is the 
anticipated size of the delta per transfer?


Additional information needed:

1) Network distance between source and target?

2) Network capacity?

3) Expected load on both source and target servers from other operations?

--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Moving delta data faster

2024-04-03 Thread yudhi s
On Thu, Apr 4, 2024 at 2:41 AM Adrian Klaver 
wrote:

> On 4/3/24 13:38, yudhi s wrote:
> >   Hi All,
> >   It's postgresql database version 15.4. We have a requirement in which
> > we will be initially moving full table data for 3-4 tables, from source
> > database to target(i.e. postgres) . Maximum number of rows will be
> > ~10million rows in those tables. Then subsequently these rows will be
> > inserted/updated based on the delta number of rows that got
> > inserted/updated in the source database. In some cases these changed
> > data can flow multiple times per day to the downstream i.e. postgres
> > database and in other cases once daily.
>
> What is the source database?
>
> Can it be reached with a FDW?:
>
> https://wiki.postgresql.org/wiki/Foreign_data_wrappers
>
> Can the delta on the source be output as CSV?
>
>

Thank you Adrian.

And one thing i forgot to mention this target postgresql database would be
on AWS RDS whereas the source Oracle databases is on premise. I think we
don't have the FDW extension currently in place but we can get that. I am
just not able to understand clearly  though, but do you mean export the
data from source using CSV and do truncate and import on target. And as
these data will be moved through the network won't that cause slowness?

The source database here is Oracle database. Correct me if wrong, it looks
like foreign data wrapper is like a DB link. Or do you mean writing a query
on the target database (which can be UPSERT or MERGE) but will be joining
the table from the source database through the DBlink/DDW? But my question
was whether we should use UPSERT or MERGE for comparing and loading the
delta records to the target postgresql database. Want to understand which
is more performant , as I see in the past Merge having performance issues
in the past, but not very sure about that.


Re: Moving delta data faster

2024-04-03 Thread Adrian Klaver

On 4/3/24 13:38, yudhi s wrote:

  Hi All,
  It's postgresql database version 15.4. We have a requirement in which 
we will be initially moving full table data for 3-4 tables, from source 
database to target(i.e. postgres) . Maximum number of rows will be 
~10million rows in those tables. Then subsequently these rows will be 
inserted/updated based on the delta number of rows that got 
inserted/updated in the source database. In some cases these changed 
data can flow multiple times per day to the downstream i.e. postgres 
database and in other cases once daily.


What is the source database?

Can it be reached with a FDW?:

https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Can the delta on the source be output as CSV?



  Want to understand , if we should use upsert(insert on conflict) or 
merge statements or anything else in such a scenario so as to persist 
those delta records faster in the target database, while making the 
system online to the users?


Regards
Yudhi


--
Adrian Klaver
adrian.kla...@aklaver.com