[PERFORM] synchronization between PostgreSQL and Oracle
Hi,

I would like your advice and recommendations on the following infrastructure problem: what is the best way to optimize synchronization between a PostgreSQL instance on a Windows 7 workstation and an Oracle 11gR2 database on Linux RHEL?

Here is a more detailed explanation. In our company, people collect data in a PostgreSQL 9.6 instance on their workstation while disconnected from the internet. In the evening, they connect to the internet and synchronize the collected data to a remote Oracle 11gR2 database.

What is the most performant way to do this (oracle_fdw? flat files? ...)?

Thanks in advance.

Best Regards,

Didier ROS
DSP/CSP IT-DMA/Solutions Groupe EDF/Expertise Applicative
Expertise SGBD
Mail : didier@edf.fr
Re: [PERFORM] synchronization between PostgreSQL and Oracle
On Thu, Oct 12, 2017 at 5:13 AM, ROS Didier wrote:
> Hi,
>
> I would like your advice and recommendations on the following
> infrastructure problem: what is the best way to optimize synchronization
> between a PostgreSQL instance on a Windows 7 workstation and an Oracle
> 11gR2 database on Linux RHEL?
>
> In our company, people collect data in a PostgreSQL 9.6 instance on their
> workstation while disconnected from the internet. In the evening, they
> connect to the internet and synchronize the collected data to a remote
> Oracle 11gR2 database.
>
> What is the most performant way to do this (oracle_fdw? flat files? ...)?

There are several ways to go about this, but for your use case I'd recommend SymmetricDS -- http://symmetricds.org (or, for the commercial version, http://jumpmind.com).

SymmetricDS was originally designed to collect data from cash registers in a vastly distributed set of small databases and aggregate those results back into both regional and national data warehouses. It also pushed data the other way: when pricing was updated at corporate headquarters, the data was pushed back out to the cash registers. It works with a wide variety of database technologies, scales well, and has many synchronization options. It is also being used by some organizations these days to synchronize small databases on iOS and Android devices with their parent databases back at HQ.

I first used it to implement an Oracle-to-PostgreSQL data migration that had to be done without downtime. I've used it successfully for real-time data pushes from MySQL and PG OLTP systems into an Oracle data mart. I also used to use it for PostgreSQL bidirectional replication before other tools became easier to use.

Because of its great flexibility, SymmetricDS has a ton of knobs, buttons, and configuration options, and it may take a while to get working optimally. If you are short on time to implement a solution, I'd suggest going with the commercial version.
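To give a feel for what the setup involves, here is a rough configuration sketch in the spirit of the SymmetricDS quick-start: a central "corp" node group for the Oracle side, a "workstation" group for the collectors, and a trigger/router pair that pushes one collected table upstream when a workstation comes online. All identifiers below are invented for illustration, and the sym_* tables and columns should be checked against the documentation of whatever SymmetricDS version you deploy.

-- Hypothetical SymmetricDS configuration rows (run against the central/registration node).
-- Node groups: the central Oracle node and the disconnected PostgreSQL workstations.
insert into sym_node_group (node_group_id) values ('corp');
insert into sym_node_group (node_group_id) values ('workstation');

-- Workstations push ('P') their captured changes to corp when they connect.
insert into sym_node_group_link (source_node_group_id, target_node_group_id, data_event_action)
values ('workstation', 'corp', 'P');

-- A channel for the collected data, a trigger to capture changes on the table,
-- and a router to send them workstation -> corp.
insert into sym_channel (channel_id, processing_order, max_batch_size, enabled, description)
values ('collected', 1, 10000, 1, 'Field-collected data');

insert into sym_trigger (trigger_id, source_table_name, channel_id, last_update_time, create_time)
values ('collected_data', 'collected_data', 'collected', current_timestamp, current_timestamp);

insert into sym_router (router_id, source_node_group_id, target_node_group_id, router_type,
                        create_time, last_update_time)
values ('workstation_2_corp', 'workstation', 'corp', 'default', current_timestamp, current_timestamp);

insert into sym_trigger_router (trigger_id, router_id, initial_load_order, last_update_time, create_time)
values ('collected_data', 'workstation_2_corp', 1, current_timestamp, current_timestamp);

Each workstation then runs a SymmetricDS node that captures local changes while offline and pushes the backlog in batches once it can reach the central node.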
Re: [PERFORM] synchronization between PostgreSQL and Oracle
ROS Didier wrote:
> I would like your advice and recommendations on the following
> infrastructure problem: what is the best way to optimize synchronization
> between a PostgreSQL instance on a Windows 7 workstation and an Oracle
> 11gR2 database on Linux RHEL?
>
> In our company, people collect data in a PostgreSQL 9.6 instance on their
> workstation while disconnected from the internet. In the evening, they
> connect to the internet and synchronize the collected data to a remote
> Oracle 11gR2 database.
>
> What is the most performant way to do this (oracle_fdw? flat files? ...)?

If the synchronization is triggered from the workstation with PostgreSQL on it, you can use either oracle_fdw or pg_dump/SQL*Loader to transfer the data.

Using oracle_fdw is probably simpler, but it is not very performant for bulk update operations. If performance is the main objective, use export/import.

Yours,
Laurenz Albe
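To make the two options concrete, here is a minimal sketch of the oracle_fdw route as run from the workstation's PostgreSQL instance. The connection string, credentials, and table/column names are invented for illustration:

-- On the workstation's PostgreSQL 9.6 instance (hypothetical names throughout).
CREATE EXTENSION oracle_fdw;

CREATE SERVER central_oracle FOREIGN DATA WRAPPER oracle_fdw
    OPTIONS (dbserver '//oracle-host.example.com:1521/ORCL');

CREATE USER MAPPING FOR CURRENT_USER SERVER central_oracle
    OPTIONS (user 'sync_user', password 'secret');

-- Foreign table pointing at the Oracle target table.
CREATE FOREIGN TABLE oracle_measurements (
    id           bigint,
    collected_at timestamp,
    value        numeric
) SERVER central_oracle OPTIONS (schema 'SYNC', table 'MEASUREMENTS');

-- Push only the rows collected since the last successful run.
INSERT INTO oracle_measurements
SELECT id, collected_at, value
FROM   local_measurements
WHERE  collected_at > (SELECT last_synced_at FROM sync_state);

For the flat-file route, the same SELECT can be exported with psql's \copy, e.g. \copy (select id, collected_at, value from local_measurements) to 'measurements.csv' with (format csv), and then bulk-loaded on the Oracle side with SQL*Loader or an external table. Loading the whole day's rows in one batch is typically what makes export/import faster than row-by-row FDW writes.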
[PERFORM] Wrong plan for limit after group by
Hi,

I've run into a problem where a query without LIMIT works much faster than the same query with one. Steps to reproduce:

create extension pg_trgm;
create table t (id serial, val text, constraint t_pk primary key (id));
insert into t (val) select (random() * 100)::text from generate_series(1, 100);
create index t_val_idx on t using gin (val gin_trgm_ops);

quota_patient> explain (analyze, buffers)
               select id from t where val like (select '6'::text) group by id;

 HashAggregate  (cost=6401.14..6451.14 rows=5000 width=4) (actual time=22.861..22.861 rows=0 loops=1)
   Group Key: id
   Buffers: shared hit=5158
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=1)
   ->  Bitmap Heap Scan on t  (cost=50.75..6388.63 rows=5000 width=4) (actual time=22.835..22.835 rows=0 loops=1)
         Recheck Cond: (val ~~ $0)
         Rows Removed by Index Recheck: 10112
         Heap Blocks: exact=5097
         Buffers: shared hit=5158
         ->  Bitmap Index Scan on t_val_idx  (cost=0.00..49.50 rows=5000 width=0) (actual time=8.762..8.762 rows=10112 loops=1)
               Index Cond: (val ~~ $0)
               Buffers: shared hit=61
 Planning time: 0.166 ms
 Execution time: 22.970 ms

Time: 0.026s

quota_patient> explain (analyze, buffers)
               select id from t where val like (select '6'::text) group by id limit 1;

 Limit  (cost=0.43..7.41 rows=1 width=4) (actual time=439.561..439.561 rows=0 loops=1)
   Buffers: shared hit=9105
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=1)
   ->  Group  (cost=0.42..34865.93 rows=5000 width=4) (actual time=439.560..439.560 rows=0 loops=1)
         Group Key: id
         Buffers: shared hit=9105
         ->  Index Scan using t_pk on t  (cost=0.42..34853.43 rows=5000 width=4) (actual time=439.557..439.557 rows=0 loops=1)
               Filter: (val ~~ $0)
               Rows Removed by Filter: 100
               Buffers: shared hit=9105
 Planning time: 0.205 ms
 Execution time: 439.610 ms

Time: 0.443s

I can't understand why adding LIMIT after GROUP BY makes the planner fall back to a non-optimal plan. I tried giving it more work_mem (up to 100MB), but it had no effect. Is it a planner bug?

BTW, if I don't use a subquery after LIKE, everything is OK:

quota_patient> explain (analyze, buffers)
               select id from t where val like '6'::text group by id limit 1;

 Limit  (cost=24.03..24.04 rows=1 width=4) (actual time=23.048..23.048 rows=0 loops=1)
   Buffers: shared hit=5158
   ->  Group  (cost=24.03..24.04 rows=1 width=4) (actual time=23.046..23.046 rows=0 loops=1)
         Group Key: id
         Buffers: shared hit=5158
         ->  Sort
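One possible workaround, offered only as a sketch under the assumption that the planner's 5000-row estimate for the LIKE is what lures it onto the t_pk index scan: on PostgreSQL 9.6 and 10 a CTE is an optimization fence, so hiding the GROUP BY behind one lets the inner query keep the bitmap plan and applies the LIMIT only afterwards (on PostgreSQL 12 and later the CTE would need AS MATERIALIZED for the same effect):

-- Sketch of a workaround: the CTE is planned without knowledge of the outer LIMIT,
-- so the fast HashAggregate + bitmap scan plan is kept.
with matches as (
    select id
    from t
    where val like (select '6'::text)
    group by id
)
select id
from matches
limit 1;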
Re: [PERFORM] Rowcount estimation changes based on from clause order
Ants Aasma writes:
> I stumbled upon a severe row count underestimation that confusingly
> went away when two inner joins in the from clause were reordered.

Hm, looks more like an overestimate in this example, but anyway ...

> Does anybody have any idea what is going on here?

set_joinrel_size_estimates says

 * Since there is more than one way to make a joinrel for more than two
 * base relations, the results we get here could depend on which component
 * rel pair is provided.  In theory we should get the same answers no matter
 * which pair is provided; in practice, since the selectivity estimation
 * routines don't handle all cases equally well, we might not.  But there's
 * not much to be done about it.

In this example I think the core of the issue is actually not so much bad selectivity estimates as rowcount roundoff error.

If we first consider joining "small" with "big", we get an estimate of 2000 rows (which is dead on for what would happen if we just joined those).  Then we estimate the final result size as the join of that to "lookup".  The selectivity number for that step is somewhat hogwash but happens to yield a result that's not awful (8 rows).

In the other case we first estimate the size of the join of "small" with the "lookup" subquery, and we get a rounded-off estimate of one row, whereas without the roundoff it would have been probably about 0.01.  When that's joined to "big", we are computing one row times 1 million rows times a selectivity estimate that's about right for the "small.id = big.small_id" clause; but because the roundoff already inflated the first join's size so much, you end up with an inflated final result.

This suggests that there might be some value in considering the sub-relations from largest to smallest, so that roundoff error in the earlier estimates is less likely to contaminate the final answer.  Not sure how expensive it would be to do that or what sort of instability it might introduce into plan choices.

Whether that's got anything directly to do with your original problem is hard to say.  Joins to subqueries, which we normally lack any stats for, tend to produce pretty bogus selectivity numbers in themselves; so the original problem might've been more of that nature.

			regards, tom lane
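Putting illustrative numbers on that roundoff (not figures from the thread, just the ones quoted above carried through): if the un-rounded estimate for the small/lookup join is about 0.01 rows and s is the selectivity estimated for the small.id = big.small_id clause, then keeping the fraction would give roughly 0.01 * 1,000,000 * s rows for the three-way join, while clamping the intermediate estimate up to one row gives 1 * 1,000,000 * s, about a hundred times more, with all of the inflation coming from the clamp rather than from any selectivity estimate.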
Re: [PERFORM] Rowcount estimation changes based on from clause order
On Thu, Oct 12, 2017 at 11:50 PM, Tom Lane wrote:
> Ants Aasma writes:
>> I stumbled upon a severe row count underestimation that confusingly
>> went away when two inner joins in the from clause were reordered.
>
> Hm, looks more like an overestimate in this example, but anyway ...
>
>> Does anybody have any idea what is going on here?
>
> set_joinrel_size_estimates says
>
>  * Since there is more than one way to make a joinrel for more than two
>  * base relations, the results we get here could depend on which component
>  * rel pair is provided.  In theory we should get the same answers no matter
>  * which pair is provided; in practice, since the selectivity estimation
>  * routines don't handle all cases equally well, we might not.  But there's
>  * not much to be done about it.
>
> In this example I think the core of the issue is actually not so much
> bad selectivity estimates as rowcount roundoff error.
>
> If we first consider joining "small" with "big", we get an estimate of
> 2000 rows (which is dead on for what would happen if we just joined
> those).  Then we estimate the final result size as the join of that to
> "lookup".  The selectivity number for that step is somewhat hogwash but
> happens to yield a result that's not awful (8 rows).
>
> In the other case we first estimate the size of the join of "small" with
> the "lookup" subquery, and we get a rounded-off estimate of one row,
> whereas without the roundoff it would have been probably about 0.01.
> When that's joined to "big", we are computing one row times 1 million rows
> times a selectivity estimate that's about right for the "small.id =
> big.small_id" clause; but because the roundoff already inflated the first
> join's size so much, you end up with an inflated final result.
>
> This suggests that there might be some value in considering the
> sub-relations from largest to smallest, so that roundoff error
> in the earlier estimates is less likely to contaminate the final
> answer.  Not sure how expensive it would be to do that or what
> sort of instability it might introduce into plan choices.
>
> Whether that's got anything directly to do with your original problem is
> hard to say.  Joins to subqueries, which we normally lack any stats for,
> tend to produce pretty bogus selectivity numbers in themselves; so the
> original problem might've been more of that nature.

Thanks for pointing me in the right direction. The original issue was that the values from lookup joined to ref_id and the subset filter on the small table were almost perfectly correlated, which caused the underestimate. In the second case this was hidden by the intermediate clamping to 1, accidentally resulting in a more correct estimate.

I actually think it might be better to consider relations from smallest to largest. The reasoning being that a join cannot produce a fraction of a row: it will produce either zero rows or at least one, and we should probably plan for the case where it does return something.

Going even further (and I haven't looked at how feasible this is), I have run into several cases lately where cardinality underestimates clamping to 1 result in catastrophically bad plans. Bad as in a stack of nested loops with unparameterized GroupAggregates and HashAggregates as inner sides. It seems to me that row estimates should clamp to something slightly larger than 1 unless it's provably going to be 1.
Regards,
Ants Aasma