[PERFORM] synchronization between PostgreSQL and Oracle
Hi,

I would like your advice and recommendations on the following infrastructure problem: what is the best way to optimize synchronization between a PostgreSQL instance on a Windows 7 workstation and an Oracle 11gR2 database on Linux RHEL?

Here is a more detailed explanation. In our company, people collect data in a PostgreSQL 9.6 instance on their workstation while disconnected from the internet. In the evening, they connect to the internet and synchronize the collected data to a remote Oracle 11gR2 database.

What is the most performant way to do this (oracle_fdw? flat files? ...)?

Thanks in advance.

Best Regards,

Didier ROS
DSP/CSP IT-DMA/Solutions Groupe EDF/Expertise Applicative
Expertise SGBD
Mail : didier@edf.fr
Re: [PERFORM] synchronization between PostgreSQL and Oracle
On Thu, Oct 12, 2017 at 5:13 AM, ROS Didier wrote:
> Hi,
>
> I would like your advice and recommendations on the following
> infrastructure problem: what is the best way to optimize synchronization
> between a PostgreSQL instance on a Windows 7 workstation and an Oracle
> 11gR2 database on Linux RHEL?
>
> In our company, people collect data in a PostgreSQL 9.6 instance on their
> workstation while disconnected from the internet. In the evening, they
> connect to the internet and synchronize the collected data to a remote
> Oracle 11gR2 database.
>
> What is the most performant way to do this (oracle_fdw? flat files? ...)?

There are several ways to go about this, but for your use case I'd recommend SymmetricDS -- http://symmetricds.org (or, for the commercial version, http://jumpmind.com).

SymmetricDS was originally designed to collect data from cash registers in a vastly distributed set of small databases and aggregate those results back into both regional and national data warehouses. It also pushed data the other way: when pricing was updated at corporate headquarters, the data was pushed back out to the cash registers. It works with a wide variety of database technologies, scales well, and has many synchronization options. It is also being used by some organizations these days to synchronize small databases on iOS and Android devices with their parent databases back at HQ.

I first used it to implement an Oracle-to-PostgreSQL data migration that had to be done without downtime. I've used it successfully for real-time data pushes from MySQL and PG OLTP systems into an Oracle data mart. I also used to use it for PostgreSQL bidirectional replication before other tools became easier to use.

Because of its great flexibility, SymmetricDS has a ton of knobs, buttons, and configuration options, and it may take a while to get working optimally. If you are short on time to implement a solution, I'd suggest going with the commercial version.
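To give a feel for what the setup involves, here is a rough configuration sketch in the spirit of the SymmetricDS quick-start: a central "corp" node group for the Oracle side, a "workstation" group for the collectors, and a trigger/router pair that pushes one collected table upstream when a workstation comes online. All identifiers below are invented for illustration, and the sym_* tables and columns should be checked against the documentation of whatever SymmetricDS version you deploy.

-- Hypothetical SymmetricDS configuration rows (run against the central/registration node).
-- Node groups: the central Oracle node and the disconnected PostgreSQL workstations.
insert into sym_node_group (node_group_id) values ('corp');
insert into sym_node_group (node_group_id) values ('workstation');

-- Workstations push ('P') their captured changes to corp when they connect.
insert into sym_node_group_link (source_node_group_id, target_node_group_id, data_event_action)
values ('workstation', 'corp', 'P');

-- A channel for the collected data, a trigger to capture changes on the table,
-- and a router to send them workstation -> corp.
insert into sym_channel (channel_id, processing_order, max_batch_size, enabled, description)
values ('collected', 1, 10000, 1, 'Field-collected data');

insert into sym_trigger (trigger_id, source_table_name, channel_id, last_update_time, create_time)
values ('collected_data', 'collected_data', 'collected', current_timestamp, current_timestamp);

insert into sym_router (router_id, source_node_group_id, target_node_group_id, router_type,
                        create_time, last_update_time)
values ('workstation_2_corp', 'workstation', 'corp', 'default', current_timestamp, current_timestamp);

insert into sym_trigger_router (trigger_id, router_id, initial_load_order, last_update_time, create_time)
values ('collected_data', 'workstation_2_corp', 1, current_timestamp, current_timestamp);

Each workstation then runs a SymmetricDS node that captures local changes while offline and pushes the backlog in batches once it can reach the central node.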
Re: [PERFORM] synchronization between PostgreSQL and Oracle
ROS Didier wrote:
> I would like your advice and recommendations on the following
> infrastructure problem: what is the best way to optimize synchronization
> between a PostgreSQL instance on a Windows 7 workstation and an Oracle
> 11gR2 database on Linux RHEL?
>
> In our company, people collect data in a PostgreSQL 9.6 instance on their
> workstation while disconnected from the internet. In the evening, they
> connect to the internet and synchronize the collected data to a remote
> Oracle 11gR2 database.
>
> What is the most performant way to do this (oracle_fdw? flat files? ...)?

If the synchronization is triggered from the workstation with PostgreSQL on it, you can use either oracle_fdw or pg_dump/SQL*Loader to transfer the data.

Using oracle_fdw is probably simpler, but it is not very performant for bulk update operations. If performance is the main objective, use export/import.

Yours,
Laurenz Albe
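To make the two options concrete, here is a minimal sketch of the oracle_fdw route as run from the workstation's PostgreSQL instance. The connection string, credentials, and table/column names are invented for illustration:

-- On the workstation's PostgreSQL 9.6 instance (hypothetical names throughout).
CREATE EXTENSION oracle_fdw;

CREATE SERVER central_oracle FOREIGN DATA WRAPPER oracle_fdw
    OPTIONS (dbserver '//oracle-host.example.com:1521/ORCL');

CREATE USER MAPPING FOR CURRENT_USER SERVER central_oracle
    OPTIONS (user 'sync_user', password 'secret');

-- Foreign table pointing at the Oracle target table.
CREATE FOREIGN TABLE oracle_measurements (
    id           bigint,
    collected_at timestamp,
    value        numeric
) SERVER central_oracle OPTIONS (schema 'SYNC', table 'MEASUREMENTS');

-- Push only the rows collected since the last successful run.
INSERT INTO oracle_measurements
SELECT id, collected_at, value
FROM   local_measurements
WHERE  collected_at > (SELECT last_synced_at FROM sync_state);

For the flat-file route, the same SELECT can be exported with psql's \copy, e.g. \copy (select id, collected_at, value from local_measurements) to 'measurements.csv' with (format csv), and then bulk-loaded on the Oracle side with SQL*Loader or an external table. Loading the whole day's rows in one batch is typically what makes export/import faster than row-by-row FDW writes.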
[PERFORM] Wrong plan for limit after group by
Hi,

I've run into a problem where a query without LIMIT works much faster than the same query with one. Steps to reproduce:

create extension pg_trgm;
create table t (id serial, val text, constraint t_pk primary key (id));
insert into t (val) select (random() * 100)::text from generate_series(1, 100);
create index t_val_idx on t using gin (val gin_trgm_ops);

quota_patient> explain (analyze, buffers)
               select id from t where val like (select '6'::text) group by id;

 HashAggregate  (cost=6401.14..6451.14 rows=5000 width=4) (actual time=22.861..22.861 rows=0 loops=1)
   Group Key: id
   Buffers: shared hit=5158
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=1)
   ->  Bitmap Heap Scan on t  (cost=50.75..6388.63 rows=5000 width=4) (actual time=22.835..22.835 rows=0 loops=1)
         Recheck Cond: (val ~~ $0)
         Rows Removed by Index Recheck: 10112
         Heap Blocks: exact=5097
         Buffers: shared hit=5158
         ->  Bitmap Index Scan on t_val_idx  (cost=0.00..49.50 rows=5000 width=0) (actual time=8.762..8.762 rows=10112 loops=1)
               Index Cond: (val ~~ $0)
               Buffers: shared hit=61
 Planning time: 0.166 ms
 Execution time: 22.970 ms

Time: 0.026s

quota_patient> explain (analyze, buffers)
               select id from t where val like (select '6'::text) group by id limit 1;

 Limit  (cost=0.43..7.41 rows=1 width=4) (actual time=439.561..439.561 rows=0 loops=1)
   Buffers: shared hit=9105
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=1)
   ->  Group  (cost=0.42..34865.93 rows=5000 width=4) (actual time=439.560..439.560 rows=0 loops=1)
         Group Key: id
         Buffers: shared hit=9105
         ->  Index Scan using t_pk on t  (cost=0.42..34853.43 rows=5000 width=4) (actual time=439.557..439.557 rows=0 loops=1)
               Filter: (val ~~ $0)
               Rows Removed by Filter: 100
               Buffers: shared hit=9105
 Planning time: 0.205 ms
 Execution time: 439.610 ms

Time: 0.443s

I can't understand why adding LIMIT after GROUP BY makes the planner fall back to a non-optimal plan. I tried giving it more work_mem (up to 100MB), but it had no effect. Is it a planner bug?

BTW, if I don't use a subquery after LIKE, everything is OK:

quota_patient> explain (analyze, buffers)
               select id from t where val like '6'::text group by id limit 1;

 Limit  (cost=24.03..24.04 rows=1 width=4) (actual time=23.048..23.048 rows=0 loops=1)
   Buffers: shared hit=5158
   ->  Group  (cost=24.03..24.04 rows=1 width=4) (actual time=23.046..23.046 rows=0 loops=1)
         Group Key: id
         Buffers: shared hit=5158
         ->  Sort
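One possible workaround, offered only as a sketch under the assumption that the planner's 5000-row estimate for the LIKE is what lures it onto the t_pk index scan: on PostgreSQL 9.6 and 10 a CTE is an optimization fence, so hiding the GROUP BY behind one lets the inner query keep the bitmap plan and applies the LIMIT only afterwards (on PostgreSQL 12 and later the CTE would need AS MATERIALIZED for the same effect):

-- Sketch of a workaround: the CTE is planned without knowledge of the outer LIMIT,
-- so the fast HashAggregate + bitmap scan plan is kept.
with matches as (
    select id
    from t
    where val like (select '6'::text)
    group by id
)
select id
from matches
limit 1;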
Re: [PERFORM] Rowcount estimation changes based on from clause order
Ants Aasma writes:
> I stumbled upon a severe row count underestimation that confusingly
> went away when two inner joins in the from clause were reordered.

Hm, looks more like an overestimate in this example, but anyway ...

> Does anybody have any idea what is going on here?

set_joinrel_size_estimates says

 * Since there is more than one way to make a joinrel for more than two
 * base relations, the results we get here could depend on which component
 * rel pair is provided.  In theory we should get the same answers no matter
 * which pair is provided; in practice, since the selectivity estimation
 * routines don't handle all cases equally well, we might not.  But there's
 * not much to be done about it.

In this example I think the core of the issue is actually not so much bad selectivity estimates as rowcount roundoff error.

If we first consider joining "small" with "big", we get an estimate of 2000 rows (which is dead on for what would happen if we just joined those).  Then we estimate the final result size as the join of that to "lookup".  The selectivity number for that step is somewhat hogwash but happens to yield a result that's not awful (8 rows).

In the other case we first estimate the size of the join of "small" with the "lookup" subquery, and we get a rounded-off estimate of one row, whereas without the roundoff it would have been probably about 0.01.  When that's joined to "big", we are computing one row times 1 million rows times a selectivity estimate that's about right for the "small.id = big.small_id" clause; but because the roundoff already inflated the first join's size so much, you end up with an inflated final result.

This suggests that there might be some value in considering the sub-relations from largest to smallest, so that roundoff error in the earlier estimates is less likely to contaminate the final answer.  Not sure how expensive it would be to do that or what sort of instability it might introduce into plan choices.

Whether that's got anything directly to do with your original problem is hard to say.  Joins to subqueries, which we normally lack any stats for, tend to produce pretty bogus selectivity numbers in themselves; so the original problem might've been more of that nature.

			regards, tom lane
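Putting illustrative numbers on that roundoff (not figures from the thread, just the ones quoted above carried through): if the un-rounded estimate for the small/lookup join is about 0.01 rows and s is the selectivity estimated for the small.id = big.small_id clause, then keeping the fraction would give roughly 0.01 * 1,000,000 * s rows for the three-way join, while clamping the intermediate estimate up to one row gives 1 * 1,000,000 * s, about a hundred times more, with all of the inflation coming from the clamp rather than from any selectivity estimate.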
Re: [PERFORM] Rowcount estimation changes based on from clause order
On Thu, Oct 12, 2017 at 11:50 PM, Tom Lane wrote:
> Ants Aasma writes:
>> I stumbled upon a severe row count underestimation that confusingly
>> went away when two inner joins in the from clause were reordered.
>
> Hm, looks more like an overestimate in this example, but anyway ...
>
>> Does anybody have any idea what is going on here?
>
> set_joinrel_size_estimates says
>
>  * Since there is more than one way to make a joinrel for more than two
>  * base relations, the results we get here could depend on which component
>  * rel pair is provided.  In theory we should get the same answers no matter
>  * which pair is provided; in practice, since the selectivity estimation
>  * routines don't handle all cases equally well, we might not.  But there's
>  * not much to be done about it.
>
> In this example I think the core of the issue is actually not so much
> bad selectivity estimates as rowcount roundoff error.
>
> If we first consider joining "small" with "big", we get an estimate of
> 2000 rows (which is dead on for what would happen if we just joined
> those).  Then we estimate the final result size as the join of that to
> "lookup".  The selectivity number for that step is somewhat hogwash but
> happens to yield a result that's not awful (8 rows).
>
> In the other case we first estimate the size of the join of "small" with
> the "lookup" subquery, and we get a rounded-off estimate of one row,
> whereas without the roundoff it would have been probably about 0.01.
> When that's joined to "big", we are computing one row times 1 million rows
> times a selectivity estimate that's about right for the "small.id =
> big.small_id" clause; but because the roundoff already inflated the first
> join's size so much, you end up with an inflated final result.
>
> This suggests that there might be some value in considering the
> sub-relations from largest to smallest, so that roundoff error
> in the earlier estimates is less likely to contaminate the final
> answer.  Not sure how expensive it would be to do that or what
> sort of instability it might introduce into plan choices.
>
> Whether that's got anything directly to do with your original problem is
> hard to say.  Joins to subqueries, which we normally lack any stats for,
> tend to produce pretty bogus selectivity numbers in themselves; so the
> original problem might've been more of that nature.

Thanks for pointing me in the right direction. The original issue was that the values from lookup joined to ref_id and the subset filter on the small table were almost perfectly correlated, which caused the underestimate. In the second case this was hidden by the intermediate clamping to 1, accidentally resulting in a more correct estimate.

I actually think it might be better to consider relations from smallest to largest. The reasoning being that a join cannot produce a fraction of a row: it will produce either zero rows or at least one, and we should probably plan for the case where it does return something.

Going even further (and I haven't looked at how feasible this is), I have run into several cases lately where cardinality underestimates clamping to 1 result in catastrophically bad plans. Bad as in a stack of nested loops with unparameterized GroupAggregates and HashAggregates as inner sides. It seems to me that row estimates should clamp to something slightly larger than 1 unless it's provably going to be 1.
Regards,
Ants Aasma