RE: Why the index is not used ?

2018-10-08 Thread ROS Didier
Hi Virendra 

You think that outside encryption of the database is the best solution  
 ?
   How do you manage the encryption key ?
Can you give me some examples of this kind of solution.

Best Regards
Didier ROS

-Message d'origine-
De : virendra.ku...@guycarp.com [mailto:virendra.ku...@guycarp.com] 
Envoyé : dimanche 7 octobre 2018 20:41
À : ROS Didier ; fola...@peoplecall.com
Cc : pavel.steh...@gmail.com; pgsql-...@lists.postgresql.org; 
pgsql-performance@lists.postgresql.org; pgsql-gene...@lists.postgresql.org
Objet : RE: Why the index is not used ?

You can consider outside DB encryption which is less of worry for performance 
and data at rest will be encrypted.

Regards,
Virendra
-Original Message-
From: ROS Didier [mailto:didier@edf.fr]
Sent: Sunday, October 07, 2018 2:33 PM
To: fola...@peoplecall.com
Cc: pavel.steh...@gmail.com; pgsql-...@lists.postgresql.org; 
pgsql-performance@lists.postgresql.org; pgsql-gene...@lists.postgresql.org
Subject: RE: Why the index is not used ?

Hi Francisco

Thank you for your remark.
You're right, but it's the only procedure I found to make search on encrypted 
fields with good response times (using index) !

Regarding access to the file system, our servers are in protected network 
areas. few people can connect to it.

it's not the best solution, but we have data encryption needs and good 
performance needs too. I do not know how to do it except the specified 
procedure..
if anyone has any proposals to put this in place, I'm interested.

Thanks in advance

Best Regards
Didier ROS

-Message d'origine-
De : fola...@peoplecall.com [mailto:fola...@peoplecall.com] Envoyé : dimanche 7 
octobre 2018 17:58 À : ROS Didier  Cc : 
pavel.steh...@gmail.com; pgsql-...@lists.postgresql.org; 
pgsql-performance@lists.postgresql.org; pgsql-gene...@lists.postgresql.org
Objet : Re: Why the index is not used ?

ROS:

On Sun, Oct 7, 2018 at 3:13 PM, ROS Didier  wrote:

> -INSERT INTO cartedecredit(username,cc)  SELECT 'individu ' || x.id, 
> pgp_sym_encrypt('test value ' || x.id, 'motdepasse','compress-algo=2, 
> cipher-algo=aes256') FROM generate_series(1,10) AS x(id);
> -CREATE INDEX idx_cartedecredit_cc02 ON 
> cartedecredit(pgp_sym_decrypt(cc, 'motdepasse','compress-algo=2, 
> cipher-algo=aes256'));

If my french is not too rusty you are encrypting a credit-card, and then 
storing an UNENCRYPTED copy in the index. So, getting it from the server is 
trivial for anyone with filesystem access.

Francisco Olarte.



Ce message et toutes les pièces jointes (ci-après le 'Message') sont établis à 
l'intention exclusive des destinataires et les informations qui y figurent sont 
strictement confidentielles. Toute utilisation de ce Message non conforme à sa 
destination, toute diffusion ou toute publication totale ou partielle, est 
interdite sauf autorisation expresse.

Si vous n'êtes pas le destinataire de ce Message, il vous est interdit de le 
copier, de le faire suivre, de le divulguer ou d'en utiliser tout ou partie. Si 
vous avez reçu ce Message par erreur, merci de le supprimer de votre système, 
ainsi que toutes ses copies, et de n'en garder aucune trace sur quelque support 
que ce soit. Nous vous remercions également d'en avertir immédiatement 
l'expéditeur par retour du message.

Il est impossible de garantir que les communications par messagerie 
électronique arrivent en temps utile, sont sécurisées ou dénuées de toute 
erreur ou virus.


This message and any attachments (the 'Message') are intended solely for the 
addressees. The information contained in this Message is confidential. Any use 
of information contained in this Message not in accord with its purpose, any 
dissemination or disclosure, either whole or partial, is prohibited except 
formal approval.

If you are not the addressee, you may not copy, forward, disclose or use any 
part of it. If you have received this message in error, please delete it and 
all copies from your system and notify the sender immediately by return message.

E-mail communication cannot be guaranteed to be timely secure, error or 
virus-free.



This message is intended only for the use of the addressee and may contain 
information that is PRIVILEGED AND CONFIDENTIAL.

If you are not the intended recipient, you are hereby notified that any 
dissemination of this communication is strictly prohibited. If you have 
received this communication in error, please erase all copies of the message 
and its attachments and notify the sender immediately. Thank you.



Ce message et toutes les pièces jointes (ci-après le 'Message') sont établis à 
l'intention exclusive des destinataires et les informations qui y figurent sont 
strictement confidentielles. Toute utilisation de ce Message non conforme à sa 
destination, toute diffusion ou toute publication totale ou partielle, est 
interdite sauf a

RE: Why the index is not used ?

2018-10-08 Thread ROS Didier
Hi Vlad
Sorry for this delay, but apparently the subject is of interest to many people 
in the community. I received a lot of comments and answers.
I wrote my answers in the body of your message below

Best Regards
Didier

De : greatvo...@gmail.com [mailto:greatvo...@gmail.com]
Envoyé : samedi 6 octobre 2018 18:51
À : ROS Didier 
Cc : pgsql-...@lists.postgresql.org; pgsql-performance@lists.postgresql.org; 
pgsql-gene...@lists.postgresql.org
Objet : Re: Why the index is not used ?

Hello Didier,

>>
(3), (5) to find the match, you decrypt the whole table, apparently this take 
quite a long time.
Index cannot help here because indexes work on exact match of type and value, 
but you compare mapped value, not indexed. Functional index should help, but 
like it was said, it against the idea of encrypted storage.
<<
I tested the solution of the functional index. It works very well, but the data 
is no longer encrypted. This is not the right solution
>>
(6) I never used pgp_sym_encrypt() but I see that in INSERT INTO you supplied 
additional parameter 'compress-algo=2, cipher-algo=aes256' while in (6) you did 
not. Probably this is the reason.

In general matching indexed bytea column should use index, you can ensure in 
this populating the column unencrypted and using 'test value 32'::bytea for 
match.
In you case I believe pgp_sym_encrypt() is not marked as STABLE or IMMUTABLE 
that's why it will be evaluated for each row (very inefficient) and cannot use 
index. From documentation:

"Since an index scan will evaluate the comparison value only once, not once at 
each row, it is not valid to use a VOLATILE function in an index scan 
condition."
https://www.postgresql.org/docs/10/static/xfunc-volatility.html

If you cannot add STABLE/IMMUTABLE to pgp_sym_encrypt() (which apparently 
should be there), you can encrypt searched value as a separate operation and 
then search in the table using basic value match.
>>
you're right about the missing parameter  'compress-algo=2, 
cipher-algo=aes256'. I agree with you.
(1) I have tested your proposal :
DROP TABLE cartedecredit;
CREATE TABLE cartedecredit(card_id SERIAL PRIMARY KEY, username VARCHAR(100), 
cc bytea);
INSERT INTO cartedecredit(username,cc)  SELECT 'individu ' || x.id, 
decode('test value ' || x.id,'escape') FROM generate_series(1,10) AS x(id);

è I inserted unencrypted data into the bytea column
postgres=# select * from cartedecredit limit 5 ;
card_id |  username   |  cc
-+-+--
   1 | individu 1  | \x746573742076616c75652031
   2 | individu 2  | \x746573742076616c75652032
   3 | individu 3  | \x746573742076616c75652033
   4 | individu 4  | \x746573742076616c75652034
   5 | individu 5  | \x746573742076616c75652035
CREATE INDEX idx_cartedecredit_cc02 ON cartedecredit(cc);
SELECT encode(cc,'escape') FROM cartedecredit WHERE cc=decode('test value 
32','escape');
 QUERY PLAN

Index Only Scan using idx_cartedecredit_cc02 on cartedecredit  (cost=0.42..8.44 
rows=1 width=32) (actual time=0.033..0.034 rows=1 loops=1)
   Index Cond: (cc = '\x746573742076616c7565203332'::bytea)
   Heap Fetches: 1
Planning time: 0.130 ms
Execution time: 0.059 ms
(5 rows)

è It works but the data is not encrypted. everyone can have access to the data
(2) 2nd test :
DROP TABLE cartedecredit;
CREATE TABLE cartedecredit(card_id SERIAL PRIMARY KEY, username VARCHAR(100), 
cc bytea);
INSERT INTO cartedecredit(username,cc)  SELECT 'individu ' || x.id, 
pgp_sym_encrypt('test value ' || x.id, 'motdepasse','compress-algo=2, 
cipher-algo=aes256') FROM generate_series(1,10) AS x(id);
postgres=# select * from cartedecredit limit 5 ;
>>
card_id |  username   | 
  cc
-+-+-
---
   1 | individu 1  | 
\xc30d0409030296304d007bf50ed768d2480153cd4a4e2d240249f94b31ec168391515ea80947f97970f7a4e058bff648f752df194498dd480c3b8a5c0d2942f90c6dde21a6b9bf4e9fd7986c6f986e3783
647e7a6205b48c03
   2 | individu 2  | 
\xc30d0409030257b50bc0e6bcd8d270d248010984b60126af01ba922da27e2e78c33110f223f0210cf34da77243277305254cba374708d447fc7d653dd9e00ff9a96803a2c47ee95269534f2c24fab1c9dc
31f7909ca7adeaf0
   3 | individu 3  | 
\xc30d040903023c5f8cb688c7945275d24801a518d70c6cc2d4a31f99f3738e736c5312f78bb9c3cc187a65d0cf7f893dbc9448825d39b79df5d0460508fc93336c2bec7794893bb08a290afd649ae15fe2
2b0433eff89222f7
   4 | individu 4  | 
\xc30d04090302dcc3bb49a41b297578d2480167f17b09004e7dacc0891fc0cc7276dd551273eec72644520f8d0543abe8e795af7c1b84fc8e5b4adc

RE: Why the index is not used ?

2018-10-08 Thread ROS Didier
Hi Vlad
   OK, I take into account your remark about the need to do 
research on encrypted data.
My answers to your remarks :
>>
you can use a deterministic algorithm for it (without salt)
<<
Can you give me on of these deterministic algorithms(without salt) ?

Best Regards

Didier
De : greatvo...@gmail.com [mailto:greatvo...@gmail.com]
Envoyé : dimanche 7 octobre 2018 21:33
À : ROS Didier 
Cc : fola...@peoplecall.com; pavel.steh...@gmail.com; 
pgsql-...@lists.postgresql.org; pgsql-performance@lists.postgresql.org; 
pgsql-gene...@lists.postgresql.org
Objet : Re: Why the index is not used ?

Additionally it is not clear why you want to search in table on encrypted data. 
Usually you match user with it's unpersonalized data (such as login, user ID) 
and then decrypt personalized data. If you need to store user identifying data 
encrypted as well (e.g. bank account number) you can use a deterministic 
algorithm for it (without salt) because it is guaranteed to be unique and you 
don't need to have different encrypted data for two same input strings.

Vlad



Ce message et toutes les pièces jointes (ci-après le 'Message') sont établis à 
l'intention exclusive des destinataires et les informations qui y figurent sont 
strictement confidentielles. Toute utilisation de ce Message non conforme à sa 
destination, toute diffusion ou toute publication totale ou partielle, est 
interdite sauf autorisation expresse.

Si vous n'êtes pas le destinataire de ce Message, il vous est interdit de le 
copier, de le faire suivre, de le divulguer ou d'en utiliser tout ou partie. Si 
vous avez reçu ce Message par erreur, merci de le supprimer de votre système, 
ainsi que toutes ses copies, et de n'en garder aucune trace sur quelque support 
que ce soit. Nous vous remercions également d'en avertir immédiatement 
l'expéditeur par retour du message.

Il est impossible de garantir que les communications par messagerie 
électronique arrivent en temps utile, sont sécurisées ou dénuées de toute 
erreur ou virus.


This message and any attachments (the 'Message') are intended solely for the 
addressees. The information contained in this Message is confidential. Any use 
of information contained in this Message not in accord with its purpose, any 
dissemination or disclosure, either whole or partial, is prohibited except 
formal approval.

If you are not the addressee, you may not copy, forward, disclose or use any 
part of it. If you have received this message in error, please delete it and 
all copies from your system and notify the sender immediately by return message.

E-mail communication cannot be guaranteed to be timely secure, error or 
virus-free.


RE: Why the index is not used ?

2018-10-08 Thread ROS Didier
Hi Tomas

Thank you for your answer and recommendation which is very interesting. 
I'm going to study the PCI DSS document right now.
-   Here are my answer to your question :
>>
What is your threat model?
<<
we want to prevent access to sensitive data for everyone except those who have 
the encryption key.
in case of files theft, backups theft, dumps theft, we do not want anyone to 
access sensitive data.

-   I have tested the solution you proposed, it works great.

Best Regards

Didier ROS
-Message d'origine-
De : tomas.von...@2ndquadrant.com [mailto:tomas.von...@2ndquadrant.com]
Envoyé : dimanche 7 octobre 2018 22:08
À : ROS Didier ; fola...@peoplecall.com
Cc : pavel.steh...@gmail.com; pgsql-...@lists.postgresql.org; 
pgsql-performance@lists.postgresql.org; pgsql-gene...@lists.postgresql.org
Objet : Re: Why the index is not used ?

Hi,

On 10/07/2018 08:32 PM, ROS Didier wrote:
> Hi Francisco
>
>   Thank you for your remark.
> You're right, but it's the only procedure I found to make search on
> encrypted fields with good response times (using index) !
>

Unfortunately, that kinda invalidates the whole purpose of in-database 
encryption - you'll have encrypted on-disk data in one place, and then 
plaintext right next to it. If you're dealing with credit card numbers, then 
you presumably care about PCI DSS, and this is likely a direct violation of 
that.

> Regarding access to the file system, our servers are in protected
network areas. few people can connect to it.
>

Then why do you need encryption at all? If you assume access to the filesystem 
/ storage is protected, why do you bother with encryption?
What is your threat model?

> it's not the best solution, but we have data encryption needs and good
> performance needs too. I do not know how to do it except the specified
> procedure..
>
> if anyone has any proposals to put this in place, I'm interested.
>

One thing you could do is hashing the value and then searching by the hash. So 
aside from having the encrypted column you'll also have a short hash, and you 
may use it in the query *together* with the original condition. It does not 
need to be unique (in fact it should not be to make it impossible to reverse 
the hash), but it needs to have enough distinct values to make the index 
efficient. Say, 10k values should be enough, because that means 0.01% 
selectivity.

So the function might look like this, for example:

  CREATE FUNCTION cchash(text) RETURNS int AS $$
SELECT abs(hashtext($1)) % 1;
  $$ LANGUAGE sql;

and then be used like this:

  CREATE INDEX idx_cartedecredit_cc02 ON cartedecredit(cchash(cc));

and in the query

  SELECT pgp_sym_decrypt(cc, 'motdepasse') FROM cartedecredit
   WHERE pgp_sym_decrypt(cc, 'motdepasse')='test value 32'
 AND cchash(cc) = cchash('test value 32');

Obviously, this does not really solve the issues with having to pass the 
password to the query, making it visible in pg_stat_activity, various logs etc.

Which is why people generally use FDE for the whole disk, which is transparent 
and provides the same level of protection.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Ce message et toutes les pièces jointes (ci-après le 'Message') sont établis à 
l'intention exclusive des destinataires et les informations qui y figurent sont 
strictement confidentielles. Toute utilisation de ce Message non conforme à sa 
destination, toute diffusion ou toute publication totale ou partielle, est 
interdite sauf autorisation expresse.

Si vous n'êtes pas le destinataire de ce Message, il vous est interdit de le 
copier, de le faire suivre, de le divulguer ou d'en utiliser tout ou partie. Si 
vous avez reçu ce Message par erreur, merci de le supprimer de votre système, 
ainsi que toutes ses copies, et de n'en garder aucune trace sur quelque support 
que ce soit. Nous vous remercions également d'en avertir immédiatement 
l'expéditeur par retour du message.

Il est impossible de garantir que les communications par messagerie 
électronique arrivent en temps utile, sont sécurisées ou dénuées de toute 
erreur ou virus.


This message and any attachments (the 'Message') are intended solely for the 
addressees. The information contained in this Message is confidential. Any use 
of information contained in this Message not in accord with its purpose, any 
dissemination or disclosure, either whole or partial, is prohibited except 
formal approval.

If you are not the addressee, you may not copy, forward, disclose or use any 
part of it. If you have received this message in error, please delete it and 
all copies from your system and notify the sender immediately by return message.

E-mail communication cannot be guaranteed to be timely secure, error or 
virus-free.


RE: Why the index is not used ?

2018-10-08 Thread ROS Didier
Hi Paul

   Thank you very much for your feedback which is very informative.
   I understand that concerning the encryption of credit card 
numbers, it is imperative to respect the PCI DSS document. I am going to study 
it.
   However, I would like to say that I chose my example badly by 
using a table storing credit card numbers. In fact, my problem is more generic.
I want to implement a solution that encrypts “sensitive” data and can retrieve 
data with good performance (by using an index).
I find that the solution you propose is very interesting and I am going to test 
it.

Best Regards
Didier ROS

De : p...@paulmcgarry.com [mailto:p...@paulmcgarry.com]
Envoyé : lundi 8 octobre 2018 00:11
À : ROS Didier 
Cc : fola...@peoplecall.com; pavel.steh...@gmail.com; 
pgsql-...@lists.postgresql.org; pgsql-performance@lists.postgresql.org; 
pgsql-gene...@lists.postgresql.org
Objet : Re: Why the index is not used ?

Hi Didier,

I’m sorry to tell you that you are probably doing something (ie 
handling/storing credit cards) which would mean you have to comply with PCI DSS 
requirements.

As such you should probably have a QSA (auditor) who you can run any proposed 
solution by (so you know they will be comfortable with it when they do their 
audit).

I think your current solution would be frowned upon because:
- cards are effectively stored in plaintext in the index.
- your encryption/decryption is being done in database, rather than by 
something with that as its sole role.

People have already mentioned the former so I won’t go into it further

But for the second part if someone can do a

Select pgp_sym_decrypt(cc)

then you are one sql injection away from having your card data stolen. You do 
have encryption, but in practice everything is available unencrypted so in 
practice the encryption is more of a tick in a box than an actual defence 
against bad things happening. In a properly segmented system even your DBA 
should not be able to access decrypted card data.

You probably should look into doing something like:

- store the first 6 and last 4 digits of the card unencrypted.
- store the remaining card digits encrypted
- have the encryption/decryption done by a seperate service called by your 
application code outside the db.

You haven’t gone into what your requirements re search are (or I missed them) 
but while the above won’t give you a fast exact cc lookup in practice being 
able to search using the first 6 and last 4 can get you a small enough subset 
than can then be filtered after decrypting the middle.

We are straying a little off PostgreSQL topic here but if you and/or your 
management aren’t already looking at PCI DSS compliance I’d strongly recommend 
you do so. It can seem like a pain but it is much better to take that pain up 
front rather than having to reengineer everything later. There are important 
security aspects it helps make sure you cover but maybe some business aspects 
(ie possible partners who won’t be able to deal with you without your 
compliance sign off documentation).


The alternative, if storing cc data isn’t a core requirement, is to not store 
the credit card data at all. That is generally the best solution if it meets 
your needs, ie if you just want to accept payments then use a third party who 
is PCI compliant to handle the cc part.

I hope that helps a little.

Paul



Sent from my iPhone

On 8 Oct 2018, at 05:32, ROS Didier 
mailto:didier@edf.fr>> wrote:
Hi Francisco

   Thank you for your remark.
   You're right, but it's the only procedure I found to make search on 
encrypted fields with good response times (using index) !

   Regarding access to the file system, our servers are in protected network 
areas. few people can connect to it.

   it's not the best solution, but we have data encryption needs and good 
performance needs too. I do not know how to do it except the specified 
procedure..
   if anyone has any proposals to put this in place, I'm interested.

   Thanks in advance

Best Regards
Didier ROS

-Message d'origine-
De : fola...@peoplecall.com 
[mailto:fola...@peoplecall.com]
Envoyé : dimanche 7 octobre 2018 17:58
À : ROS Didier mailto:didier@edf.fr>>
Cc : pavel.steh...@gmail.com; 
pgsql-...@lists.postgresql.org; 
pgsql-performance@lists.postgresql.org;
 pgsql-gene...@lists.postgresql.org
Objet : Re: Why the index is not used ?

ROS:

On Sun, Oct 7, 2018 at 3:13 PM, ROS Didier 
mailto:didier@edf.fr>> wrote:


-INSERT INTO cartedecredit(username,cc)  SELECT 'individu ' || 
x.id, pgp_sym_encrypt('test value ' || x.id, 
'motdepasse','compress-algo=2, cipher-algo=aes256') FROM 
generate_series(1,10) AS x(id);
-CREATE INDEX idx_cartedecredit_cc02 ON 
cartedecredit(pgp

Re: Why the index is not used ?

2018-10-08 Thread Tomas Vondra
Hi,

On 10/08/2018 04:10 PM, ROS Didier wrote:
> Hi Tomas
>    
>     Thank you for your answer and recommendation which is very
> interesting. I'm going to study the PCI DSS document right now.
> 
>   * Here are my answer to your question :
> 
> />>/
> /What is your threat model?/
> /< we want to prevent access to sensitive data for everyone except those
> who have the encryption key.
> in case of files theft, backups theft, dumps theft, we do not want
> anyone to access sensitive data.
>    

The thing is - encryption is not panacea. The interesting question is
whether this improves security compared to simply using FDE and regular
access rights (which are grantable at the column level).

Using those two pieces properly may very well be a better defense than
not well designed encryption scheme - and based on this discussion, it
does not seem very polished / resilient.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Why the index is not used ?

2018-10-08 Thread Paul McGarry
Hi Didier,

Yes, credit cards are a very specific space that probably gets people who are 
familiar with it going a bit. By the time you factor in general security 
practices, specific PCI requirements, your threat model and likely business 
requirements (needing relatively free access to parts of the card number) the 
acceptable solution space narrows considerably.

More generally though I’d recommend reading:

https://paragonie.com/blog/2017/05/building-searchable-encrypted-databases-with-php-and-sql

as (even if you aren’t using PHP) it discusses several strategies and what 
makes them good/bad for different use cases and how to implement them well.

I don’t think I’d consider the main solution discussed there particularly 
applicable to credit card data (mostly because the low entropy of card data 
makes it difficult to handle safely without additional per-row randomness 
added, though as always, consult your QSA) but it is generally interesting.

Paul

Sent from my iPhone

> On 9 Oct 2018, at 01:29, ROS Didier  wrote:
> 
> Hi Paul
>  
>Thank you very much for your feedback which is very 
> informative.
>I understand that concerning the encryption of credit card 
> numbers, it is imperative to respect the PCI DSS document. I am going to 
> study it.
>However, I would like to say that I chose my example badly by 
> using a table storing credit card numbers. In fact, my problem is more 
> generic.
> I want to implement a solution that encrypts “sensitive” data and can 
> retrieve data with good performance (by using an index).
> I find that the solution you propose is very interesting and I am going to 
> test it.
>  
> Best Regards
> Didier ROS
>  
> De : p...@paulmcgarry.com [mailto:p...@paulmcgarry.com] 
> Envoyé : lundi 8 octobre 2018 00:11
> À : ROS Didier 
> Cc : fola...@peoplecall.com; pavel.steh...@gmail.com; 
> pgsql-...@lists.postgresql.org; pgsql-performance@lists.postgresql.org; 
> pgsql-gene...@lists.postgresql.org
> Objet : Re: Why the index is not used ?
>  
> Hi Didier,
>  
> I’m sorry to tell you that you are probably doing something (ie 
> handling/storing credit cards) which would mean you have to comply with PCI 
> DSS requirements.
>  
> As such you should probably have a QSA (auditor) who you can run any proposed 
> solution by (so you know they will be comfortable with it when they do their 
> audit).
>  
> I think your current solution would be frowned upon because:
> - cards are effectively stored in plaintext in the index.
> - your encryption/decryption is being done in database, rather than by 
> something with that as its sole role.
>  
> People have already mentioned the former so I won’t go into it further
>  
> But for the second part if someone can do a 
>  
> Select pgp_sym_decrypt(cc)
>  
> then you are one sql injection away from having your card data stolen. You do 
> have encryption, but in practice everything is available unencrypted so in 
> practice the encryption is more of a tick in a box than an actual defence 
> against bad things happening. In a properly segmented system even your DBA 
> should not be able to access decrypted card data.
>  
> You probably should look into doing something like:
>  
> - store the first 6 and last 4 digits of the card unencrypted.
> - store the remaining card digits encrypted
> - have the encryption/decryption done by a seperate service called by your 
> application code outside the db.
>  
> You haven’t gone into what your requirements re search are (or I missed them) 
> but while the above won’t give you a fast exact cc lookup in practice being 
> able to search using the first 6 and last 4 can get you a small enough subset 
> than can then be filtered after decrypting the middle. 
>  
> We are straying a little off PostgreSQL topic here but if you and/or your 
> management aren’t already looking at PCI DSS compliance I’d strongly 
> recommend you do so. It can seem like a pain but it is much better to take 
> that pain up front rather than having to reengineer everything later. There 
> are important security aspects it helps make sure you cover but maybe some 
> business aspects (ie possible partners who won’t be able to deal with you 
> without your compliance sign off documentation).
>  
>  
> The alternative, if storing cc data isn’t a core requirement, is to not store 
> the credit card data at all. That is generally the best solution if it meets 
> your needs, ie if you just want to accept payments then use a third party who 
> is PCI compliant to handle the cc part.
>  
> I hope that helps a little.
>  
> Paul
>  
>  
>  
> 
> Sent from my iPhone
> 
> On 8 Oct 2018, at 05:32, ROS Didier  wrote:
> 
> Hi Francisco
> 
>Thank you for your remark. 
>You're right, but it's the only procedure I found to make search on 
> encrypted fields with good response times (using index) !
> 
>Regarding access to the file system, our servers are in protected network 
> areas. few

Re: Partial index plan/cardinality costing

2018-10-08 Thread James Coleman
Bump, and curious if anyone on hackers has any ideas here: of particular
interest is why the (pk, created_at) index can possibly be more valuable
than the (created_at, pk) variant since the former effectively implies
having to scan the entire index.
On Fri, Sep 7, 2018 at 12:17 PM James Coleman  wrote:

> I have the following tables:
> - m(pk bigserial primary key, status text): with a single row
> - s(pk bigserial primary key, status text, action_at date, m_fk bigint):
>   * 80% of the data has action_at between the current date and 1 year ago
>  and status of E or C
>   * 20% of the data has action_at between 5 days ago and 25 days into the
>  future and status of P, PD, or A
>
> I have two partial indexes:
> - s_pk_action_at on s(pk, action_at) where status in ('P', 'PD', 'A')
> - s_action_at_pk on s(action_at, pk) where status in ('P', 'PD', 'A')
>
> With the query:
> SELECT s.pk FROM s
> INNER JOIN m ON m.pk = s.m_fk
> WHERE
>   s.status IN ('A', 'PD', 'P')
>   AND (action_at <= '2018-09-06')
>   AND s.status IN ('A', 'P')
>   AND m.status = 'A';
>
> I generally expect the index s_action_at_pk to always be preferred over
> s_pk_action_at. And on stock Postgres it does in fact use that index (with
> a bitmap index scan).
>
> We like to set random_page_cost = 2 since we use fast SSDs only. With that
> change Postgres strongly prefers the index s_pk_action_at unless I both
> disable the other index and turn off bitmap heap scans.
>
> I'm attaching the following plans:
> - base_plan.txt: default costs; both indexes available
> - base_plan_rpc2.txt: random_page_cost = 2; both indexes available
> - inddisabled_plan_rpc2.txt: random_page_cost = 2; only s_action_at_pk
> available
> - inddisabled_bhsoff_plan_rpc2.txt: random_page_cost = 2;
> enable_bitmapscan = false;  only s_action_at_pk available
>
> A couple of questions:
> - How is s_pk_action_at ever efficient to scan? Given that the highest
> cardinality (primary key) column is first, wouldn't an index scan
> effectively have to scan the entire index?
> - Why does index scan on s_action_at_pk reads over 2x as many blocks as
> the bitmap heap scan with the same index?
> - Would you expect Postgres to generally always prefer using the
> s_action_at_pk index over the s_pk_action_at index for this query? I
> realize changing the random page cost is part of what's driving this, but I
> still can't imagine reading the full s_pk_action_at index (assuming that's
> what it is doing) could ever be more valuable.
>
> As a side note, the planner is very bad at understanding a query that
> happens (I realize you wouldn't write this by hand, but ORMs) when you have
> a where clause like:
> s.status IN ('A', 'PD', 'P') AND s.status IN ('A', 'P')
> the row estimates are significantly different from a where clause with
> only:
> s.status IN ('A', 'P')
> even though semantically those are identical.
>
>
>


Re: Partial index plan/cardinality costing

2018-10-08 Thread Justin Pryzby
Please don't cross-post to lists.

>insert into s(status, action_at, m_fk)
>select
>  ( CASE WHEN series.n % 100 < 80 THEN
>  (ARRAY['E', 'C'])[(series.n % 2) + 1]
>ELSE
>  (ARRAY['P', 'PD', 'A'])[((random() * 3)::integer % 3) + 1]
>END
>  ),
>  (
>CASE WHEN series.n % 100 < 80 THEN
>  '2018-09-07'::date + ((series.n % 365 - 365)::text || ' day')::interval
>ELSE
>  '2018-09-07'::date + (((random() * 30)::integer % 30 - 4)::text || ' 
> day')::interval
>END
>  ),
>  (select m.pk from m limit 1)
>from generate_series(1, 50) series(n);

> I have two partial indexes:
> - s_pk_action_at on s(pk, action_at) where status in ('P', 'PD', 'A')
> - s_action_at_pk on s(action_at, pk) where status in ('P', 'PD', 'A')

> - How is s_pk_action_at ever efficient to scan? Given that the highest
> cardinality (primary key) column is first, wouldn't an index scan
> effectively have to scan the entire index?

The index probably IS inefficient to scan (you could see that if you force an
bitmap index scan on s_pk_action_at)...but because of leading pkey column, the
HEAP is read sequentially, and the planner knows that the heap will be read in
order of its leading column.  Reading the entire index is less expensive than
reading most of the table (maybe nonsequentially).  This is the 2nd effect Jeff
Janes likes to point out: high correlation means 1) sequential reads; *and*, 2)
a smaller fraction of the table needs to be accessed to read a given number of
tuples.

> - Why does index scan on s_action_at_pk reads over 2x as many blocks as the
> bitmap heap scan with the same index?

Maybe because of heap pages accessed multiple times (not sequentially), since
correlation is small on this table loaded with "modulus"-style insertions.

pryzbyj=# SELECT attname, correlation FROM pg_stats WHERE tablename='s' ;
  attname  | correlation 
---+-
 pk|   1
 status|0.340651
 action_at |  0.00224239
 m_fk  |   1

..so each index tuple is accessing a separate heap page.

If you create non-partial index and CLUSTER on action_at_idx, then:

pryzbyj=# SELECT attname, correlation FROM pg_stats WHERE tablename='s' ;
  attname  | correlation
---+-
 pk|  0.00354867
 status|0.420806

 action_at |   1
 m_fk  |   1

 Nested Loop  (cost=1907.03..6780.65 rows=11038 width=8) (actual 
time=2.241..17.839 rows=8922 loops=1)
   Join Filter: (s.m_fk = m.pk)
   Buffers: shared hit=115 read=53
   ->  Seq Scan on m  (cost=0.00..1.01 rows=1 width=8) (actual 
time=0.009..0.011 rows=1 loops=1)
 Filter: (status = 'A'::text)
 Buffers: shared hit=1
   ->  Bitmap Heap Scan on s  (cost=1907.03..6641.66 rows=11038 width=16) 
(actual time=2.222..9.032 rows=8922 loops=1)
 Recheck Cond: ((action_at <= '2018-09-06'::date) AND (status = ANY 
('{P,PD,A}'::text[])))
 Filter: (status = ANY ('{A,P}'::text[]))
 Rows Removed by Filter: 4313
 Heap Blocks: exact=114
 Buffers: shared hit=114 read=53
 ->  Bitmap Index Scan on s_action_at_pk  (cost=0.00..1904.27 
rows=82647 width=0) (actual time=2.185..2.186 rows=13235 loops=1)
   Index Cond: (action_at <= '2018-09-06'::date)
   Buffers: shared read=53

Also, I don't think it matters here, but action_at and status are correlated.
Planner would think that they're independent.

I don't think it's related to other issues, but also note the rowcount estimate 
is off:
 ->  Bitmap Index Scan on s_action_at_pk  (cost=0.00..1258.02 
rows=82347 width=0) (actual time=1.026..1.026 rows=13402 loops=1)   
  
   Index Cond: (action_at <= '2018-09-06'::date)



Justin