subject:"Re\: Transparent column encryption"

Re: Transparent column encryption

2024-04-18 Thread Robert Haas

On Thu, Apr 18, 2024 at 1:49 PM Jelte Fennema-Nio  wrote:
> I think this is an interesting idea. I can indeed see use cases for
> e.g. inserting a new row based on another row (where the secret is the
> same).
>
> IMHO that means that we should also bump the protocol version for this
> change, because it's changing the wire protocol by adding a new
> parameter format code. And it does so in a way that does not depend on
> the new protocol extension.

I think we're more or less covering the same ground we did on the
other thread here -- in theory I don't love the fact that we never
bump the protocol version when we change stuff, but in practice if we
start bumping it every time we do anything I think it's going to just
break a bunch of stuff without any real benefit.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Transparent column encryption

2024-04-18 Thread Jelte Fennema-Nio

On Thu, 18 Apr 2024 at 18:46, Robert Haas  wrote:
> With regard to the Bind message, I suggest that we regard the protocol
> change as reserving a currently-unused bit in the message to indicate
> whether the value is pre-encrypted, without reference to the protocol
> extension. It could be legal for a client that can't understand
> encryption message from the server to supply an encrypted value to be
> inserted into a column. And I don't think we would ever want the bit
> that's being reserved here to be used by some other extension for some
> other purpose, even when this extension isn't used. So I don't see a
> need for this to be tied into the protocol extension.

I think this is an interesting idea. I can indeed see use cases for
e.g. inserting a new row based on another row (where the secret is the
same).

IMHO that means that we should also bump the protocol version for this
change, because it's changing the wire protocol by adding a new
parameter format code. And it does so in a way that does not depend on
the new protocol extension.

Re: Transparent column encryption

2024-04-18 Thread Robert Haas

On Wed, Apr 10, 2024 at 6:13 AM Peter Eisentraut  wrote:
> Obviously, it's early days, so there will be plenty of time to have
> discussions on various other aspects of this patch.  I'm keeping a keen
> eye on the discussion of protocol extensions, for example.

I think the way that you handled that is clever, and along the lines
of what I had in mind when I invented the _pq_ stuff.

More specifically, the way that the ColumnEncryptionKey and
ColumnMasterKey messages are handled is exactly the way that I was
imagining things would work. The client uses _pq_.column_encryption to
signal that it can understand those messages, and the server responds
by including them. I assume that if the client doesn't signal
understanding, then the server simply omits sending those messages. (I
have not checked the code.)

I'm less certain about the changes to the ParameterDescription and
RowDescription messages. I see a couple of potential problems. One is
that, if you say you can understand column encryption messages, the
extra fields are included even for unencrypted columns. The client
must choose at connection startup whether it ever wishes to read any
encrypted data; if so, it pays a portion of that overhead all the
time. Another potential problem is with the scalability of this
design. Suppose that we could not only encrypt columns, but also
compress, fold, mutilate, and spindle them. Then there might end up
being a dizzying array of variation in the format of what is supposed
to be the same message. Perhaps it's not so bad: as long as the
documentation is clear about in which order the additional fields will
appear in the relevant messages when more than one relevant feature is
used, it's probably not too difficult for clients to cope. And it is
probably also true that the precise size of, say, a RowDescription
message will rarely be performance-critical. But another thought is
that we might try to redesign this so that we simply add more message
types rather than mutating message types i.e. after sending the
RowDescription message, if any columns are encrypted, we additionally
send a RowEncryptionDescription message. Then this treatment becomes
symmetric with the handling of ColumnEncryptionKey and ColumnMasterKey
messages, and there's no overhead when the feature is unused.

With regard to the Bind message, I suggest that we regard the protocol
change as reserving a currently-unused bit in the message to indicate
whether the value is pre-encrypted, without reference to the protocol
extension. It could be legal for a client that can't understand
encryption message from the server to supply an encrypted value to be
inserted into a column. And I don't think we would ever want the bit
that's being reserved here to be used by some other extension for some
other purpose, even when this extension isn't used. So I don't see a
need for this to be tied into the protocol extension.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Transparent column encryption

2024-04-18 Thread Jelte Fennema-Nio

On Thu, 18 Apr 2024 at 13:25, Peter Eisentraut  wrote:
> Hopefully, the reason for key rotation is mainly that policies require
> key rotation, not that keys get compromised all the time.

These key rotation policies are generally in place to reduce the
impact of a key compromise by limiting the time a compromised key is
valid.

> This
> seems pretty standard to me.  For example, I can change the password on
> my laptop's file system encryption, which somehow wraps a lower-level
> key, but I can't reencrypt the actual file system in place.

I think the threat model for this proposal and a laptop's file system
encryption are different enough that the same choices/tradeoffs don't
automatically translate. Specifically in this proposal the unencrypted
CEK is present on all servers that need to read/write those encrypted
values. And a successful attacker would then be able to read the
encrypted values forever with this key, because it effectively cannot
be rotated. That is a much bigger attack surface and risk than a
laptop's disk encryption. So, I feel quite strongly that shipping the
proposed feature without being able to re-encrypt columns in an online
fashion would be a mistake.

> That's the
> reason for having this two-tier key system in the first place.

If we allow for online-rotation of the actual encryption key, then
maybe we don't even need this two-tier system ;)

Not having this two tier system would have a few benefits in my opinion:
1. We wouldn't need to be sending encrypted key material from the
server to every client. Which seems nice from a security, bandwidth
and client implementation perspective.
2. Asymmetric encryption of columns is suddenly an option. Allowing
certain clients to enter encrypted data into the database but not read
it.

> Two problems here: One, for deterministic encryption, everyone needs to
> agree on the representation, otherwise equality comparisons won't work.
>   Two, if you give clients the option of storing text or binary, then
> clients also get back a mix of text or binary, and it will be a mess.
> Just giving the option of storing the payload in binary wouldn't be that
> hard, but it's not clear what you can sensibly do with that in the end.

How about defining at column creation time if the underlying value
should be binary or not? Something like:

CREATE TABLE t(
mytime timestamp ENCRYPTED WITH (column_encryption_key = cek1, binary=true)
);

> Even if the identifiers
> somehow were global (but OIDs can also change and are not guaranteed
> unique forever),

OIDs of existing rows can't just change while a connection is active,
right? (all I know is upgrades can change them but that seems fine)
Also they are unique within a catalog table, right?

> the state of which keys have already been sent is
> session state.

I agree that this is the case. But it's state that can be tracked
fairly easily by a transaction pooler. Similar to how prepared
statements can be tracked. And this is easier to do when at the IDs of
the same keys are the same across each session to the server, because
if they differ then you need to do re-mapping of IDs.

> This is kind of like SASL or TLS can add new methods dynamically without
> requiring a new version.  I mean, as we are learning, making new
> protocol versions is kind of hard, so the point was to avoid it.

Fair enough

> I guess you could do that, but wouldn't that making the decoding of
> these messages much more complicated?  You would first have to read the
> "short" variant, decode the format, and then decide to read the rest.
> Seems weird.

I see your point. But with the current approach even for queries that
don't return any encrypted columns, these useless fields would be part
of the RowDescryption. It seems quite annoying to add extra network
and parsing overhead all of your queries even if only a small
percentage use the encryption feature. Maybe we should add a new
message type instead like EncryptedRowDescription, or add some flag
field at the start of RowDescription that can be used to indicate that
there is encryption info for some of the columns.

> Yes, that's what would happen, and that's the intention, so that for
> example you can use pg_dump to back up encrypted columns without having
> to decrypt them.

Okay, makes sense. But I think it would be good to document that.

> > A related question to this is that currently libpq throws an error if
> > e.g. a master key realm is not defined but another one is. Is that
> > really what we want? Is not having one of the realms really that
> > different from not providing any realms at all?
>
> Can you provide a more concrete example of what scenario you have a
> concern about?

A server has table A and B. A is encrypted with a master key realm X
and B is encrypted with master key realm Y. If libpq is only given a
key for realm X, and it then tries to read table B, an error is
thrown. While if you don't provide any realm at all, you can read from
table B just fine,

Re: Transparent column encryption

2024-04-18 Thread Peter Eisentraut


On 10.04.24 16:14, Jelte Fennema-Nio wrote:

(The CEK can't be rotated easily, since
that would require reading out all the data from a table/column and
reencrypting it.  We could/should add some custom tooling for that,
but it wouldn't be a routine operation.)


This seems like something that requires some more thought because CEK
rotation seems just as important as CMK rotation (often both would be
compromised at the same time).


Hopefully, the reason for key rotation is mainly that policies require 
key rotation, not that keys get compromised all the time.  That's the 
reason for having this two-tier key system in the first place.  This 
seems pretty standard to me.  For example, I can change the password on 
my laptop's file system encryption, which somehow wraps a lower-level 
key, but I can't reencrypt the actual file system in place.



+The plaintext inside
+the ciphertext is always in text format, but this is invisible to the
+protocol.

It seems like it would be useful to have a way of storing the
plaintext in binary form too. I'm not saying this should be part of
the initial version, but it would be good to keep that in mind with
the design.


Two problems here: One, for deterministic encryption, everyone needs to 
agree on the representation, otherwise equality comparisons won't work. 
 Two, if you give clients the option of storing text or binary, then 
clients also get back a mix of text or binary, and it will be a mess.
Just giving the option of storing the payload in binary wouldn't be that 
hard, but it's not clear what you can sensibly do with that in the end.



+ The session-specific identifier of the key.

Is it necessary for this identifier to be session-specific? Why not
use a global identifier like an oid? Anything session specific makes
the job of transaction poolers quite a bit harder. If this identifier
would be global, then the message can be forwarded as is to the client
instead of re-mapping identifiers between clients and servers (like is
needed for prepared statements).


The point was just to avoid saying specifically that the OID will be 
sent, because then that would tie the catalog representation to the 
protocol, which seems unnecessary.  Maybe we can reword that somehow.


In terms of connection pooling, this feature as it is conceived right 
now would only work in session pooling anyway.  Even if the identifiers 
somehow were global (but OIDs can also change and are not guaranteed 
unique forever), the state of which keys have already been sent is 
session state.



+   Additional algorithms may be added to this protocol specification without a
+   change in the protocol version number.

What's the reason for not requiring a version bump for this?


This is kind of like SASL or TLS can add new methods dynamically without 
requiring a new version.  I mean, as we are learning, making new 
protocol versions is kind of hard, so the point was to avoid it.



+  If the protocol extension _pq_.column_encryption is
+  enabled (see ), then
+  there is also the following for each parameter:

It seems a little bit wasteful to include these for all columns, even
the ones that don't require encryption. How about only adding these
fields when format code is 0x11


I guess you could do that, but wouldn't that making the decoding of 
these messages much more complicated?  You would first have to read the 
"short" variant, decode the format, and then decide to read the rest. 
Seems weird.



Finally, I'm trying to figure out if this really needs to be a
protocol extension or if a protocol version bump would work as well
without introducing a lot of work for clients/poolers that don't care
about it (possibly with some modifications to the proposed protocol
changes).


That's not something this patch cares about, but the philosophical 
discussions in the other thread on protocol versioning etc. appear to 
lean toward protocol extension.



What makes this a bit difficult for me is that there's not
much written in the documentation on what is supposed to happen for
encrypted columns when the protocol extension is not enabled. Is the
content just returned/written like it would be with a bytea?


Yes, that's what would happen, and that's the intention, so that for 
example you can use pg_dump to back up encrypted columns without having 
to decrypt them.



A related question to this is that currently libpq throws an error if
e.g. a master key realm is not defined but another one is. Is that
really what we want? Is not having one of the realms really that
different from not providing any realms at all?


Can you provide a more concrete example of what scenario you have a 
concern about?

Re: Transparent column encryption

2024-04-10 Thread Jelte Fennema-Nio

On Wed, 10 Apr 2024 at 12:13, Peter Eisentraut  wrote:
>
> To kick some things off for PG18, here is an updated version of the
> patch for automatic client-side column-level encryption.

I only read the docs and none of the code, but here is my feedback on
the current design:

> (The CEK can't be rotated easily, since
> that would require reading out all the data from a table/column and
> reencrypting it.  We could/should add some custom tooling for that,
> but it wouldn't be a routine operation.)

This seems like something that requires some more thought because CEK
rotation seems just as important as CMK rotation (often both would be
compromised at the same time). As far as I can tell the only way to
rotate a CEK is by re-encrypting the column for all rows in a single
go at the client side, thus taking a write-lock on all rows of the
table. That seems quite problematic, because that makes key rotation
an operation that requires application downtime. Allowing online
rotation is important, otherwise almost no-one will do it preventative
at a regular interval.

One way to allow online CEK rotation is by allowing a column to be
encrypted by one of several keys and/or allow a key to have multiple
versions. And then for each row we would store which key/version it
was encrypted with. That way for new insertions/updates clients would
use the newest version. But clients would still be able to decrypt
both old rows with the old key and new rows encrypted with the new
key, because the server would give them both keys and tell which row
was encrypted with which. Then the old rows can be rewritten by a
client in small batches, so that writes to the table can keep working
while this operation takes place.

This could even be used to allow encrypting previously unencrypted
columns using something like "ALTER COLUMN mycol ENCRYPTION KEY cek1".
Then unencrypted rows could be indicated by e.g. returning something
like NULL for the CEK.

+The plaintext inside
+the ciphertext is always in text format, but this is invisible to the
+protocol.

It seems like it would be useful to have a way of storing the
plaintext in binary form too. I'm not saying this should be part of
the initial version, but it would be good to keep that in mind with
the design.

+ The session-specific identifier of the key.

Is it necessary for this identifier to be session-specific? Why not
use a global identifier like an oid? Anything session specific makes
the job of transaction poolers quite a bit harder. If this identifier
would be global, then the message can be forwarded as is to the client
instead of re-mapping identifiers between clients and servers (like is
needed for prepared statements).

+   Additional algorithms may be added to this protocol specification without a
+   change in the protocol version number.

What's the reason for not requiring a version bump for this?

+  If the protocol extension _pq_.column_encryption is
+  enabled (see ), then
+  there is also the following for each parameter:

It seems a little bit wasteful to include these for all columns, even
the ones that don't require encryption. How about only adding these
fields when format code is 0x11

Finally, I'm trying to figure out if this really needs to be a
protocol extension or if a protocol version bump would work as well
without introducing a lot of work for clients/poolers that don't care
about it (possibly with some modifications to the proposed protocol
changes). What makes this a bit difficult for me is that there's not
much written in the documentation on what is supposed to happen for
encrypted columns when the protocol extension is not enabled. Is the
content just returned/written like it would be with a bytea? Or is
writing disallowed because the format code would never be set to 0x11.

A related question to this is that currently libpq throws an error if
e.g. a master key realm is not defined but another one is. Is that
really what we want? Is not having one of the realms really that
different from not providing any realms at all?

But no-matter these behavioural details, I think it would be fairly
easy to add minimal "non-support" for this feature while supporting
the new protocol messages. All they would need to do is understand
what the new protocol messages/fields mean and either ignore them or
throw a clear error. For poolers it's a different story however. For
transaction pooling there's quite a bit of work to be done. I already
mentioned the session-specific ID being a problem, but even assuming
we change that to a global ID there's still difficulties. Key
information is only sent by the server if it wasn't sent before in the
session[1], so a pooler would need to keep it's own cache and send
keys to clients that haven't received them yet.

So yeah, I think it would make sense to put this behind a protocol
extension feature flag, since it's fairly niche and would require
significant work at the pooler side to support.

Re: Transparent column encryption

2023-03-30 Thread Peter Eisentraut


On 30.03.23 20:35, Stephen Frost wrote:

I do feel that column encryption is a useful capability and there's
large parts of this approach that I agree with, but I dislike the idea
of having our clients be able to depend on what gets returned for
non-encrypted columns while not being able to trust what encrypted
column results are and then trying to say it's 'transparent'.  To that
end, it seems like just saying they get back a bytea and making it clear
that they have to provide the validation would be clear, while keeping
much of the rest.


[Note that the word "transparent" has been removed from the feature 
name.  I just didn't change the email thread name.]


These thoughts are reasonable, but I think there is a tradeoff to be 
made between having featureful data validation and enhanced security. 
If you want your database system to validate your data, you have to send 
it in plain text.  If you want to have your database system not see the 
plain text, then it cannot validate it.  But there is still utility in it.


You can't really depend on what gets returned even in the non-encrypted 
case, unless you cryptographically sign the schema against modification 
or something like that.  So realistically, a client that cares strongly 
about the data it receives has to do some kind of client-side validation 
anyway.


Note also that higher-level client libraries like JDBC effectively do 
client-side data validation, for example when you call 
ResultSet.getInt() etc.


This is also one of the reasons why the user facing type declaration 
exists.  You could just make all encrypted columns of type "opaque" or 
something and not make any promises about what's inside.  But client 
APIs sort or rely on the application being able to ask the result set 
for what's inside a column value.  If we just say, we don't know, then 
applications (or driver APIs) will have to be changed to accommodate 
that, but the intention was to not require that.  So instead we say, 
it's supposed to be int, and then if it's sometimes actually not int, 
then your application throws an exception you can deal with.  This is 
arguably a better developer experience, even if it concerns the data 
type purist.


But do you have a different idea about how it should work?


Expanding out from that I'd imagine, pie-in-the-sky
and in some far off land, having our data type in/out validation
functions moved to the common library and then adding client-side
validation of the data going in/out of the encrypted columns would allow
application developers to be able to trust what we're returning (as long
as they're using libpq- and we'd have to document that independent
implementations of the protocol have to provide this or just continue to
return bytea's).


As mentioned, some client libraries effectively already do that.  But 
even if we could make this much more comprehensive, I don't see how this 
can ever actually satisfy your point.  It would require that all 
participating clients apply validation all the time, and all other 
clients can rely on that happening, which is impossible.

Re: Transparent column encryption

2023-03-30 Thread Peter Eisentraut


On 30.03.23 17:55, Andres Freund wrote:

I find it very hard to belief that details of the catalog representation like
this will matter to users. How would would it conceivably affect users that we
store (key, encryption method) in pg_attribute vs storing an oid that's
effectively a foreign key reference to (key, encryption method)?


The change you are alluding to would also affect how the DDL commands 
work and interoperate, so it affects the user.


But also, let's not drive this design decision bottom up.  Let's go from 
how we want the data model and the DDL to work and then figure out 
suitable ways to record that.  I don't really know if you are just 
worried about the catalog size, or you find an actual fault with the 
data model, or you just find it subjectively odd.



The feature is arguably useful without typmod support, e.g., for text. We
could ship it like that, then do some work to reorganize pg_attribute and
tuple descriptors to relieve some pressure on each byte, and then add the
typmod support back in in a future release.  I think that is a workable
compromise.


I doubt that shipping a version of column encryption that breaks our type
system is a good idea.


I don't follow how you get from that to claiming that it breaks the type 
system.  Please provide more details.

Re: Transparent column encryption

2023-03-30 Thread Stephen Frost

Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2023-03-30 16:01:46 +0200, Peter Eisentraut wrote:
> > On 30.03.23 03:29, Andres Freund wrote:
> > > > One might think that, but the precedent in other equivalent systems is 
> > > > that
> > > > you reference the key and the algorithm separately.  There is some
> > > > (admittedly not very conclusive) discussion about this near [0].
> > > > 
> > > > [0]: 
> > > > https://www.postgresql.org/message-id/flat/00b0c4f3-0d9f-dcfd-2ba0-eee5109b4963%40enterprisedb.com#147ad6faafe8cdd2c0d2fd56ec972a40
> > > 
> > > I'm very much not convinced by that. Either way, there at least there 
> > > should
> > > be a comment mentioning that we intentionally try to allow that.
> > > 
> > > Even if this feature is something we want (why?), ISTM that this should 
> > > not be
> > > implemented by having multiple fields in pg_attribute, but instead by a 
> > > table
> > > referenced by by pg_attribute.attcek.
> > 
> > I don't know if it is clear to everyone here, but the key data model and the
> > surrounding DDL are exact copies of the equivalent MS SQL Server feature.
> > When I was first studying it, I had the exact same doubts about this.  But
> > as I was learning more about it, it does make sense, because this matches a
> > common pattern in key management systems, which is relevant because these
> > keys ultimately map into KMS-managed keys in a deployment.  Moreover, 1) it
> > is plausible that those people knew what they were doing, and 2) it would be
> > preferable to maintain alignment and not create something that looks the
> > same but is different in some small but important details.

I was wondering about this- is it really exactly the same, down to the
point that there's zero checking of what the data returned actually is
after it's decrypted and given to the application, and if it actually
matches the claimed data type?

> I find it very hard to belief that details of the catalog representation like
> this will matter to users. How would would it conceivably affect users that we
> store (key, encryption method) in pg_attribute vs storing an oid that's
> effectively a foreign key reference to (key, encryption method)?

I do agree with this.

> > > > With the proposed removal of usertypmod, it's only two fields: the link 
> > > > to
> > > > the key and the user-facing type.
> > > 
> > > That feels far less clean. I think loosing the ability to set the 
> > > precision of
> > > a numeric, or the SRID for postgis datums won't be received very well?
> > 
> > In my mind, and I probably wasn't explicit about this, I'm thinking about
> > what can be done now versus later.
> > 
> > The feature is arguably useful without typmod support, e.g., for text. We
> > could ship it like that, then do some work to reorganize pg_attribute and
> > tuple descriptors to relieve some pressure on each byte, and then add the
> > typmod support back in in a future release.  I think that is a workable
> > compromise.
> 
> I doubt that shipping a version of column encryption that breaks our type
> system is a good idea.

And this.

I do feel that column encryption is a useful capability and there's
large parts of this approach that I agree with, but I dislike the idea
of having our clients be able to depend on what gets returned for
non-encrypted columns while not being able to trust what encrypted
column results are and then trying to say it's 'transparent'.  To that
end, it seems like just saying they get back a bytea and making it clear
that they have to provide the validation would be clear, while keeping
much of the rest.  Expanding out from that I'd imagine, pie-in-the-sky
and in some far off land, having our data type in/out validation
functions moved to the common library and then adding client-side
validation of the data going in/out of the encrypted columns would allow
application developers to be able to trust what we're returning (as long
as they're using libpq- and we'd have to document that independent
implementations of the protocol have to provide this or just continue to
return bytea's).

Not sure how we'd manage to provide support for extensions though.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Transparent column encryption

2023-03-30 Thread Andres Freund

Hi,

On 2023-03-30 16:01:46 +0200, Peter Eisentraut wrote:
> On 30.03.23 03:29, Andres Freund wrote:
> > > One might think that, but the precedent in other equivalent systems is 
> > > that
> > > you reference the key and the algorithm separately.  There is some
> > > (admittedly not very conclusive) discussion about this near [0].
> > > 
> > > [0]: 
> > > https://www.postgresql.org/message-id/flat/00b0c4f3-0d9f-dcfd-2ba0-eee5109b4963%40enterprisedb.com#147ad6faafe8cdd2c0d2fd56ec972a40
> > 
> > I'm very much not convinced by that. Either way, there at least there should
> > be a comment mentioning that we intentionally try to allow that.
> > 
> > Even if this feature is something we want (why?), ISTM that this should not 
> > be
> > implemented by having multiple fields in pg_attribute, but instead by a 
> > table
> > referenced by by pg_attribute.attcek.
> 
> I don't know if it is clear to everyone here, but the key data model and the
> surrounding DDL are exact copies of the equivalent MS SQL Server feature.
> When I was first studying it, I had the exact same doubts about this.  But
> as I was learning more about it, it does make sense, because this matches a
> common pattern in key management systems, which is relevant because these
> keys ultimately map into KMS-managed keys in a deployment.  Moreover, 1) it
> is plausible that those people knew what they were doing, and 2) it would be
> preferable to maintain alignment and not create something that looks the
> same but is different in some small but important details.

I find it very hard to belief that details of the catalog representation like
this will matter to users. How would would it conceivably affect users that we
store (key, encryption method) in pg_attribute vs storing an oid that's
effectively a foreign key reference to (key, encryption method)?


> > > With the proposed removal of usertypmod, it's only two fields: the link to
> > > the key and the user-facing type.
> > 
> > That feels far less clean. I think loosing the ability to set the precision 
> > of
> > a numeric, or the SRID for postgis datums won't be received very well?
> 
> In my mind, and I probably wasn't explicit about this, I'm thinking about
> what can be done now versus later.
> 
> The feature is arguably useful without typmod support, e.g., for text. We
> could ship it like that, then do some work to reorganize pg_attribute and
> tuple descriptors to relieve some pressure on each byte, and then add the
> typmod support back in in a future release.  I think that is a workable
> compromise.

I doubt that shipping a version of column encryption that breaks our type
system is a good idea.

Greetings,

Andres Freund

Re: Transparent column encryption

2023-03-30 Thread Peter Eisentraut

On 30.03.23 03:29, Andres Freund wrote:

One might think that, but the precedent in other equivalent systems is that
you reference the key and the algorithm separately. There is some
(admittedly not very conclusive) discussion about this near [0].

[0]:
https://www.postgresql.org/message-id/flat/00b0c4f3-0d9f-dcfd-2ba0-eee5109b4963%40enterprisedb.com#147ad6faafe8cdd2c0d2fd56ec972a40

I'm very much not convinced by that. Either way, there at least there should
be a comment mentioning that we intentionally try to allow that.

Even if this feature is something we want (why?), ISTM that this should not be
implemented by having multiple fields in pg_attribute, but instead by a table
referenced by by pg_attribute.attcek.

I don't know if it is clear to everyone here, but the key data model and
the surrounding DDL are exact copies of the equivalent MS SQL Server
feature. When I was first studying it, I had the exact same doubts
about this. But as I was learning more about it, it does make sense,
because this matches a common pattern in key management systems, which
is relevant because these keys ultimately map into KMS-managed keys in a
deployment. Moreover, 1) it is plausible that those people knew what
they were doing, and 2) it would be preferable to maintain alignment and
not create something that looks the same but is different in some small
but important details.

With the proposed removal of usertypmod, it's only two fields: the link to
the key and the user-facing type.

That feels far less clean. I think loosing the ability to set the precision of
a numeric, or the SRID for postgis datums won't be received very well?

In my mind, and I probably wasn't explicit about this, I'm thinking
about what can be done now versus later.

The feature is arguably useful without typmod support, e.g., for text.
We could ship it like that, then do some work to reorganize pg_attribute
and tuple descriptors to relieve some pressure on each byte, and then
add the typmod support back in in a future release. I think that is a
workable compromise.

Re: Transparent column encryption

2023-03-29 Thread Andres Freund

Hi,

On 2023-03-29 19:08:25 +0200, Peter Eisentraut wrote:
> On 29.03.23 18:24, Andres Freund wrote:
> > On 2023-03-29 18:08:29 +0200, Peter Eisentraut wrote:
> > > On 24.03.23 19:12, Andres Freund wrote:
> > > > > I thought about this some more.  I think we could get rid of 
> > > > > attusertypmod
> > > > > and just hardcode it as -1.  The idea would be that if you ask for an
> > > > > encrypted column of type, say, varchar(500), the server isn't able to
> > > > > enforce that anyway, so we could just prohibit specifying a nondefault
> > > > > typmod for encrypted columns.
> > > > 
> > > > Why not just use typmod for the underlying typmod? It doesn't seem like
> > > > encrypted datums will need that? Or are you using it for something 
> > > > important there?
> > > 
> > > Yes, the typmod of encrypted types stores the encryption algorithm.
> > 
> > Why isn't that an attribute of pg_colenckey, given that attcek has been 
> > added
> > to pg_attribute?
> 
> One might think that, but the precedent in other equivalent systems is that
> you reference the key and the algorithm separately.  There is some
> (admittedly not very conclusive) discussion about this near [0].
> 
> [0]: 
> https://www.postgresql.org/message-id/flat/00b0c4f3-0d9f-dcfd-2ba0-eee5109b4963%40enterprisedb.com#147ad6faafe8cdd2c0d2fd56ec972a40

I'm very much not convinced by that. Either way, there at least there should
be a comment mentioning that we intentionally try to allow that.

Even if this feature is something we want (why?), ISTM that this should not be
implemented by having multiple fields in pg_attribute, but instead by a table
referenced by by pg_attribute.attcek.


> > > (Also, mixing a type with the typmod of another type is weird in a variety
> > > of ways, so this is a quite clean solution.)
> > 
> > It's not an unrelated type though. It's the actual typmod of the column 
> > we're
> > talking about.
> 
> What I mean is that various parts of the system think that typid+typmod make
> sense together.  If the typmod actually refers to usertypid, well, the code
> doesn't know that, so who knows what happens.

You control what the typmod for encrypted columns does. So I don't see what
problems that could be.

I seems quite likely that having a separate typmod for the encrypted type will
cause problems down the line, because you'll end up having to copy around
typid+typmod for the encrypted datum and then also separately for the
unencrypted one.


> Also, with the current proposal, the system is internally consistent:
> pg_encrypted_* can actually look at their own typmod and verify their own
> input values that way, which is what a typmod is for.  This just works out
> of the box.
> 
> > I find it a lot less clean to make all non-CEK uses of
> > pg_attribute pay the price of storing three new fields.
> 
> With the proposed removal of usertypmod, it's only two fields: the link to
> the key and the user-facing type.

That feels far less clean. I think loosing the ability to set the precision of
a numeric, or the SRID for postgis datums won't be received very well?

Greetings,

Andres Freund

Re: Transparent column encryption

2023-03-29 Thread Peter Eisentraut

On 29.03.23 18:24, Andres Freund wrote:

Hi,

On 2023-03-29 18:08:29 +0200, Peter Eisentraut wrote:

On 24.03.23 19:12, Andres Freund wrote:

I thought about this some more. I think we could get rid of attusertypmod
and just hardcode it as -1. The idea would be that if you ask for an
encrypted column of type, say, varchar(500), the server isn't able to
enforce that anyway, so we could just prohibit specifying a nondefault
typmod for encrypted columns.

Why not just use typmod for the underlying typmod? It doesn't seem like
encrypted datums will need that? Or are you using it for something important
there?

Yes, the typmod of encrypted types stores the encryption algorithm.

Why isn't that an attribute of pg_colenckey, given that attcek has been added
to pg_attribute?

One might think that, but the precedent in other equivalent systems is
that you reference the key and the algorithm separately. There is some
(admittedly not very conclusive) discussion about this near [0].

[0]:
https://www.postgresql.org/message-id/flat/00b0c4f3-0d9f-dcfd-2ba0-eee5109b4963%40enterprisedb.com#147ad6faafe8cdd2c0d2fd56ec972a40

(Also, mixing a type with the typmod of another type is weird in a variety
of ways, so this is a quite clean solution.)

It's not an unrelated type though. It's the actual typmod of the column we're
talking about.

What I mean is that various parts of the system think that typid+typmod
make sense together. If the typmod actually refers to usertypid, well,
the code doesn't know that, so who knows what happens.

Also, with the current proposal, the system is internally consistent:
pg_encrypted_* can actually look at their own typmod and verify their
own input values that way, which is what a typmod is for. This just
works out of the box.

I find it a lot less clean to make all non-CEK uses of
pg_attribute pay the price of storing three new fields.

With the proposed removal of usertypmod, it's only two fields: the link
to the key and the user-facing type.

Re: Transparent column encryption

2023-03-29 Thread Andres Freund

Hi,

On 2023-03-29 18:08:29 +0200, Peter Eisentraut wrote:
> On 24.03.23 19:12, Andres Freund wrote:
> > > I thought about this some more.  I think we could get rid of attusertypmod
> > > and just hardcode it as -1.  The idea would be that if you ask for an
> > > encrypted column of type, say, varchar(500), the server isn't able to
> > > enforce that anyway, so we could just prohibit specifying a nondefault
> > > typmod for encrypted columns.
> >
> > Why not just use typmod for the underlying typmod? It doesn't seem like
> > encrypted datums will need that? Or are you using it for something 
> > important there?
>
> Yes, the typmod of encrypted types stores the encryption algorithm.

Why isn't that an attribute of pg_colenckey, given that attcek has been added
to pg_attribute?


> (Also, mixing a type with the typmod of another type is weird in a variety
> of ways, so this is a quite clean solution.)

It's not an unrelated type though. It's the actual typmod of the column we're
talking about. I find it a lot less clean to make all non-CEK uses of
pg_attribute pay the price of storing three new fields.

Greetings,

Andres Freund

Re: Transparent column encryption

2023-03-29 Thread Peter Eisentraut


On 24.03.23 19:12, Andres Freund wrote:

I thought about this some more.  I think we could get rid of attusertypmod
and just hardcode it as -1.  The idea would be that if you ask for an
encrypted column of type, say, varchar(500), the server isn't able to
enforce that anyway, so we could just prohibit specifying a nondefault
typmod for encrypted columns.


Why not just use typmod for the underlying typmod? It doesn't seem like
encrypted datums will need that? Or are you using it for something important 
there?


Yes, the typmod of encrypted types stores the encryption algorithm.

(Also, mixing a type with the typmod of another type is weird in a 
variety of ways, so this is a quite clean solution.)

Re: Transparent column encryption

2023-03-24 Thread Andres Freund

Hi,

On 2023-03-23 14:54:48 +0100, Peter Eisentraut wrote:
> On 22.03.23 10:00, Peter Eisentraut wrote:
> > > I get that for the type, but why do we need the typmod duplicated as
> > > well?
> > 
> > Earlier patch versions didn't do that, but that got really confusing
> > about which type the typmod really belonged to, since code currently
> > assumes that typid+typmod makes sense.  Earlier patch versions had three
> > fields (usertypid, keyid, encalg), and then I changed it to (usertypid,
> > usertypmod, keyid) and instead placed the encalg into the real typmod,
> > which made everything much cleaner.
> 
> I thought about this some more.  I think we could get rid of attusertypmod
> and just hardcode it as -1.  The idea would be that if you ask for an
> encrypted column of type, say, varchar(500), the server isn't able to
> enforce that anyway, so we could just prohibit specifying a nondefault
> typmod for encrypted columns.

Why not just use typmod for the underlying typmod? It doesn't seem like
encrypted datums will need that? Or are you using it for something important 
there?

Greetings,

Andres Freund

Re: Transparent column encryption

2023-03-24 Thread Peter Eisentraut


On 23.03.23 16:55, Robert Haas wrote:

On Thu, Mar 23, 2023 at 9:55 AM Peter Eisentraut
 wrote:

I thought about this some more.  I think we could get rid of
attusertypmod and just hardcode it as -1.  The idea would be that if you
ask for an encrypted column of type, say, varchar(500), the server isn't
able to enforce that anyway, so we could just prohibit specifying a
nondefault typmod for encrypted columns.

I'm not sure if there are weird types that use typmods in some way where
this wouldn't work.  But so far I could not think of anything.

I'll look into this some more.


I thought we often treated atttypid, atttypmod, and attcollation as a
trio, these days. It seems a bit surprising that you'd end up adding
columns for two out of the three.


Internally, we use all three.  But for reporting to the client 
(RowDescription message), we only have slots for type and typmod.  We 
could in theory extend the protocol to report the collation as well, but 
it's probably not too interesting.

Re: Transparent column encryption

2023-03-23 Thread Robert Haas

On Thu, Mar 23, 2023 at 9:55 AM Peter Eisentraut
 wrote:
> I thought about this some more.  I think we could get rid of
> attusertypmod and just hardcode it as -1.  The idea would be that if you
> ask for an encrypted column of type, say, varchar(500), the server isn't
> able to enforce that anyway, so we could just prohibit specifying a
> nondefault typmod for encrypted columns.
>
> I'm not sure if there are weird types that use typmods in some way where
> this wouldn't work.  But so far I could not think of anything.
>
> I'll look into this some more.

I thought we often treated atttypid, atttypmod, and attcollation as a
trio, these days. It seems a bit surprising that you'd end up adding
columns for two out of the three.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Transparent column encryption

2023-03-23 Thread Peter Eisentraut


On 22.03.23 10:00, Peter Eisentraut wrote:
I get that for the type, but why do we need the typmod duplicated as 
well?


Earlier patch versions didn't do that, but that got really confusing 
about which type the typmod really belonged to, since code currently 
assumes that typid+typmod makes sense.  Earlier patch versions had three 
fields (usertypid, keyid, encalg), and then I changed it to (usertypid, 
usertypmod, keyid) and instead placed the encalg into the real typmod, 
which made everything much cleaner.


I thought about this some more.  I think we could get rid of 
attusertypmod and just hardcode it as -1.  The idea would be that if you 
ask for an encrypted column of type, say, varchar(500), the server isn't 
able to enforce that anyway, so we could just prohibit specifying a 
nondefault typmod for encrypted columns.


I'm not sure if there are weird types that use typmods in some way where 
this wouldn't work.  But so far I could not think of anything.


I'll look into this some more.

Re: Transparent column encryption

2023-03-22 Thread Peter Eisentraut


On 21.03.23 18:47, Andres Freund wrote:

On 2023-03-21 18:05:15 +0100, Peter Eisentraut wrote:

On 16.03.23 17:36, Andres Freund wrote:

Maybe a daft question, but why do we need a separate type and typmod for
encrypted columns? Why isn't the fact that the column is encrypted exactly one
new field, and we use the existing type/typmod fields?


The way this is implemented is that for an encrypted column, the real
atttypid and atttypmod are one of the encrypted special types
(pg_encrypted_*).  That way, most of the system doesn't need to care about
the details of encryption or whatnot, it just unpacks tuples etc. by looking
at atttypid, atttyplen, etc., and queries on encrypted data behave normally
by just looking at what operators etc. those types have.  This approach
heavily contains the number of places that need to know about this feature
at all.


I get that for the type, but why do we need the typmod duplicated as well?


Earlier patch versions didn't do that, but that got really confusing 
about which type the typmod really belonged to, since code currently 
assumes that typid+typmod makes sense.  Earlier patch versions had three 
fields (usertypid, keyid, encalg), and then I changed it to (usertypid, 
usertypmod, keyid) and instead placed the encalg into the real typmod, 
which made everything much cleaner.

Re: Transparent column encryption

2023-03-21 Thread Andres Freund

Hi,

On 2023-03-21 18:05:15 +0100, Peter Eisentraut wrote:
> On 16.03.23 17:36, Andres Freund wrote:
> > Maybe a daft question, but why do we need a separate type and typmod for
> > encrypted columns? Why isn't the fact that the column is encrypted exactly 
> > one
> > new field, and we use the existing type/typmod fields?
> 
> The way this is implemented is that for an encrypted column, the real
> atttypid and atttypmod are one of the encrypted special types
> (pg_encrypted_*).  That way, most of the system doesn't need to care about
> the details of encryption or whatnot, it just unpacks tuples etc. by looking
> at atttypid, atttyplen, etc., and queries on encrypted data behave normally
> by just looking at what operators etc. those types have.  This approach
> heavily contains the number of places that need to know about this feature
> at all.

I get that for the type, but why do we need the typmod duplicated as well?

Greetings,

Andres Freund

Re: Transparent column encryption

2023-03-21 Thread Peter Eisentraut


On 16.03.23 17:36, Andres Freund wrote:

Maybe a daft question, but why do we need a separate type and typmod for
encrypted columns? Why isn't the fact that the column is encrypted exactly one
new field, and we use the existing type/typmod fields?


The way this is implemented is that for an encrypted column, the real 
atttypid and atttypmod are one of the encrypted special types 
(pg_encrypted_*).  That way, most of the system doesn't need to care 
about the details of encryption or whatnot, it just unpacks tuples etc. 
by looking at atttypid, atttyplen, etc., and queries on encrypted data 
behave normally by just looking at what operators etc. those types have. 
 This approach heavily contains the number of places that need to know 
about this feature at all.



Do we need to decouple tuple descriptors from pg_attribute altogether?


Yes. Very clearly. The amount of memory and runtime we spent on tupledescs is
disproportionate. A second angle is that we build tupledescs way way too
frequently. The executor infers them everywhere, so not even prepared
statements protect against that.



How do we decide what goes into the tuple descriptor and what does not?  I'm
interested in addressing this, because obviously we do want the ability to
add more features in the future, but I don't know what the direction should
be.


We've had some prior discussion around this, see e.g.
https://postgr.es/m/20210819114435.6r532qbadcsyfscp%40alap3.anarazel.de


This sounds like a good plan.

Re: Transparent column encryption

2023-03-16 Thread Andres Freund

Hi,

On 2023-03-16 11:26:46 +0100, Peter Eisentraut wrote:
> On 13.03.23 22:11, Andres Freund wrote:
> > > It adds branches, and it makes tupledescs wider. In tight spots, such as
> > > printtup, that can hurt, even if the branches aren't ever entered.
> > In fact, I do see a noticable, but not huge, regression:
> 
> I tried to reproduce your measurements, but I don't have the CPU affinity
> tools like that on macOS, so the differences were lost in the noise.  (The
> absolute numbers looked very similar to yours.)
> 
> I can reproduce a regression if I keep adding more columns to pg_attribute,
> like 8 OID columns does start to show.
>
> [...]
> How do we proceed here?

Maybe a daft question, but why do we need a separate type and typmod for
encrypted columns? Why isn't the fact that the column is encrypted exactly one
new field, and we use the existing type/typmod fields?

> A lot of user-facing table management stuff like compression, statistics,
> collation, dropped columns, and probably future features like column
> reordering or periods, have to be represented in pg_attribute somehow, at
> least in the current architecture.  Do we hope that hardware keeps up with
> the pace at which we add new features?

Yea, it's not great as is. I think it's also OK to decide that the slowdown is
worth it in some cases - it just has to be a conscious decision, in the open.

> Do we need to decouple tuple descriptors from pg_attribute altogether?

Yes. Very clearly. The amount of memory and runtime we spent on tupledescs is
disproportionate. A second angle is that we build tupledescs way way too
frequently. The executor infers them everywhere, so not even prepared
statements protect against that.

> How do we decide what goes into the tuple descriptor and what does not?  I'm
> interested in addressing this, because obviously we do want the ability to
> add more features in the future, but I don't know what the direction should
> be.

We've had some prior discussion around this, see e.g.
https://postgr.es/m/20210819114435.6r532qbadcsyfscp%40alap3.anarazel.de

Greetings,

Andres Freund

Re: Transparent column encryption

2023-03-16 Thread Peter Eisentraut


On 13.03.23 22:11, Andres Freund wrote:

It adds branches, and it makes tupledescs wider. In tight spots, such as
printtup, that can hurt, even if the branches aren't ever entered.

In fact, I do see a noticable, but not huge, regression:


I tried to reproduce your measurements, but I don't have the CPU 
affinity tools like that on macOS, so the differences were lost in the 
noise.  (The absolute numbers looked very similar to yours.)


I can reproduce a regression if I keep adding more columns to 
pg_attribute, like 8 OID columns does start to show.


I investigated whether I could move some columns to the 
"variable-length" part outside the tuple descriptor, but that would 
require major surgery in heap.c and tablecmds.c (MergeAttributes() ... 
shudder), and also elsewhere, where you currently only have a tuple 
descriptor for looking up stuff.


How do we proceed here?  A lot of user-facing table management stuff 
like compression, statistics, collation, dropped columns, and probably 
future features like column reordering or periods, have to be 
represented in pg_attribute somehow, at least in the current 
architecture.  Do we hope that hardware keeps up with the pace at which 
we add new features?  Do we need to decouple tuple descriptors from 
pg_attribute altogether?  How do we decide what goes into the tuple 
descriptor and what does not?  I'm interested in addressing this, 
because obviously we do want the ability to add more features in the 
future, but I don't know what the direction should be.

Re: Transparent column encryption

2023-03-13 Thread Andres Freund

Hi,

On 2023-03-13 13:41:19 -0700, Andres Freund wrote:
> On 2023-03-13 21:22:29 +0100, Peter Eisentraut wrote:
> > On 12.03.23 01:11, Andres Freund wrote:
> > > Have you done benchmarks of some simple workloads to verify this doesn't 
> > > cause
> > > slowdowns (when not using encryption, obviously)? printtup.c is a 
> > > performance
> > > sensitive portion for simple queries, particularly when they return 
> > > multiple
> > > columns.
> >
> > The additional code isn't used when column encryption is off, so there
> > shouldn't be any impact.
>
> It adds branches, and it makes tupledescs wider. In tight spots, such as
> printtup, that can hurt, even if the branches aren't ever entered.

In fact, I do see a noticable, but not huge, regression:

$ cat /tmp/test.sql
SELECT * FROM pg_class WHERE oid = 1247;

c=1;taskset -c 10 pgbench -n -M prepared -c$c -j$c -f /tmp/test.sql -P1 -T10

with the server also pinned to core 1, and turbo boost disabled. Nothing else
is allowed to run on the core, or its hyperthread sibling. This is my setup
for comparing performance with the least noise in general, not related to this
patch.

head:  28495.858509 28823.055643 28731.074311
patch: 28298.498851 28285.426532 28489.359569

A ~1.1% loss.

pipelined 50 statements (pgbench pinned to a different otherwise unused core)
head:  1147.404506 1147.587475 1151.976547
patch: 1126.525708 1122.375337 1119.088734

A ~2.2% loss.

That might not be prohibitive, but it does seem worth analyzing.

Greetings,

Andres Freund

Re: Transparent column encryption

2023-03-13 Thread Andres Freund

Hi,

On 2023-03-13 21:22:29 +0100, Peter Eisentraut wrote:
> On 12.03.23 01:11, Andres Freund wrote:
> > Have you done benchmarks of some simple workloads to verify this doesn't 
> > cause
> > slowdowns (when not using encryption, obviously)? printtup.c is a 
> > performance
> > sensitive portion for simple queries, particularly when they return multiple
> > columns.
> 
> The additional code isn't used when column encryption is off, so there
> shouldn't be any impact.

It adds branches, and it makes tupledescs wider. In tight spots, such as
printtup, that can hurt, even if the branches aren't ever entered.

Greetings,

Andres Freund

Re: Transparent column encryption

2023-03-13 Thread Peter Eisentraut


On 12.03.23 01:11, Andres Freund wrote:

Have you done benchmarks of some simple workloads to verify this doesn't cause
slowdowns (when not using encryption, obviously)? printtup.c is a performance
sensitive portion for simple queries, particularly when they return multiple
columns.


The additional code isn't used when column encryption is off, so there 
shouldn't be any impact.

Re: Transparent column encryption

2023-03-11 Thread Andres Freund

Hi,

On 2023-03-10 08:18:34 +0100, Peter Eisentraut wrote:
> Here is an updated patch.  I have done some cosmetic polishing and fixed a
> minor Windows-related bug.
> 
> In my mind, the patch is complete.
> 
> If someone wants to do some in-depth code review, I suggest focusing on the
> following files:
> 
> * src/backend/access/common/printtup.c

Have you done benchmarks of some simple workloads to verify this doesn't cause
slowdowns (when not using encryption, obviously)? printtup.c is a performance
sensitive portion for simple queries, particularly when they return multiple
columns.

And making tupledescs even wider is likely to have some price, both due to the
increase in memory usage, and due to the lower cache density - and that's code
where we're already hurting noticeably.

Greetings,

Andres Freund

Re: Transparent column encryption

2023-03-11 Thread Mark Dilger

> On Mar 9, 2023, at 11:18 PM, Peter Eisentraut 
>  wrote:
> 
> Here is an updated patch.

Thanks, Peter.  The patch appears to be in pretty good shape, but I have a few 
comments and concerns.

CEKIsVisible() and CMKIsVisible() are obviously copied from 
TSParserIsVisible(), and the code comments weren't fully updated.  
Specifically, the phrase "hidden by another parser of the same name" should be 
updated to not mention "parser".

Why does get_cmkalg_name() return the string "unspecified" for 
PG_CMK_UNSPECIFIED, but the next function get_cmkalg_jwa_name() returns NULL 
for PG_CMK_UNSPECIFIED?  It seems they would both return NULL, or both return 
"unspecified".  If there's a reason for the divergence, could you add a code 
comment to clarify?  BTW, get_cmkalg_jwa_name() has no test coverage.

Looking further at code coverage, the new conditional in printsimple_startup() 
is never tested with (MyProcPort->column_encryption_enabled), so the block is 
never entered.  This would seem to be a consequence of backends like walsender 
not using column encryption, which is not terribly surprising, but it got me 
wondering if you had a particular client use case in mind when you added this 
block?

The new function pg_encrypted_in() appears totally untested, but I have to 
wonder if that's because we're not round-trip testing pg_dump with column 
encryption...?  The code coverage in pg_dump looks fairly decent, but some 
column encryption code is not covered.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Transparent column encryption

2023-02-22 Thread Peter Eisentraut


On 12.02.23 15:48, Mark Dilger wrote:

I should mention, src/sgml/html/libpq-exec.html needs clarification:


paramFormats[]Specifies whether parameters are text (put a zero in the array 
entry for the corresponding parameter) or binary (put a one in the array entry 
for the corresponding parameter). If the array pointer is null then all 
parameters are presumed to be text strings.


Perhaps you should edit this last sentence to say that all parameters are 
presumed to be test strings without forced encryption.


This is actually already mentioned later in that section.


When column encryption is enabled, the second-least-significant byte of this 
parameter specifies whether encryption should be forced for a parameter.


The value 0x10 has a one as its second-least-significant *nibble*, but that's a 
really strange way to describe the high-order nibble, and perhaps not what you 
mean.  Could you clarify?


Set this byte to one to force encryption.


I think setting the byte to one (0x01) means "binary unencrypted", not "force 
encryption".  Don't you mean to set this byte to 16?


For example, use the C code literal 0x10 to specify text format with forced 
encryption. If the array pointer is null then encryption is not forced for any 
parameter.
If encryption is forced for a parameter but the parameter does not correspond 
to an encrypted column on the server, then the call will fail and the parameter 
will not be sent. This can be used for additional security against a 
compromised server. (The drawback is that application code then needs to be 
kept up to date with knowledge about which columns are encrypted rather than 
letting the server specify this.)


This was me being confused.  I adjusted the text to use "half-byte".


  I think you should say something about how specifying 0x11 will behave -- in other 
words, asking for encrypted binary data.  I believe that this is will draw a "format 
must be text for encrypted parameter" error, and that the docs should clearly say so.


done

Re: Transparent column encryption

2023-02-22 Thread Peter Eisentraut

On 11.02.23 22:54, Mark Dilger wrote:

Thanks Peter. Here are some observations about the documentation in patch
version 15.

In acronyms.sgml, the CEK and CMK entries should link to documentation, perhaps
linkend="glossary-column-encryption-key" and linkend="glossary-column-master-key". These
glossary entries should in turn link to linkend="ddl-column-encryption".

In ddl.sgml, the sentence "forcing encryption of certain parameters in the client library (see
its documentation)" should link to linkend="libpq-connect-column-encryption".

Did you intend the use of "transparent data encryption" (rather than "transparent
column encryption") in datatype.sgml? If so, what's the difference?

There are all addressed in the v16 patch I just posted.

Is this feature intended to be available from ecpg? If so, can we maybe
include an example in 36.3.4. Prepared Statements showing how to pass the
encrypted values securely. If not, can we include verbiage about that
limitation, so folks don't waste time trying to figure out how to do it?

It should "just work". I will give this a try sometime, but I don't see
why it wouldn't work.

The documentation for pg_dump (and pg_dumpall) now includes a
--decrypt-encrypted-columns option, which I suppose requires cmklookup to first
be configured, and for PGCMKLOOKUP to be exported. There isn't anything in the
pg_dump docs about this, though, so maybe a link to section 5.5.3 with a
warning about not running pg_dump this way on the database server itself?

Also addressed in v16.

How does a psql user mark a parameter as having forced encryption? A libpq user can
specify this in the paramFormats array, but I don't see any syntax for doing this from
psql. None of the perl tap tests you've included appear to do this (except indirectly
when calling test_client); grep'ing for the libpq error message "parameter with
forced encryption is not to be encrypted" in the tests has no matches. Is it just
not possible? I thought you'd mentioned some syntax for this when we spoke in person,
but I don't see it now.

This has been asked about before. We just need to come up with a syntax
for it. The issue is contained inside psql.

Re: Transparent column encryption

2023-02-12 Thread Mark Dilger

> On Feb 11, 2023, at 1:54 PM, Mark Dilger  wrote:
> 
> Here are some observations 

I should mention, src/sgml/html/libpq-exec.html needs clarification:

> paramFormats[]Specifies whether parameters are text (put a zero in the array 
> entry for the corresponding parameter) or binary (put a one in the array 
> entry for the corresponding parameter). If the array pointer is null then all 
> parameters are presumed to be text strings.

Perhaps you should edit this last sentence to say that all parameters are 
presumed to be test strings without forced encryption.

> Values passed in binary format require knowledge of the internal 
> representation expected by the backend. For example, integers must be passed 
> in network byte order. Passing numeric values requires knowledge of the 
> server storage format, as implemented in 
> src/backend/utils/adt/numeric.c::numeric_send() and 
> src/backend/utils/adt/numeric.c::numeric_recv().

> When column encryption is enabled, the second-least-significant byte of this 
> parameter specifies whether encryption should be forced for a parameter.

The value 0x10 has a one as its second-least-significant *nibble*, but that's a 
really strange way to describe the high-order nibble, and perhaps not what you 
mean.  Could you clarify?

> Set this byte to one to force encryption.

I think setting the byte to one (0x01) means "binary unencrypted", not "force 
encryption".  Don't you mean to set this byte to 16? 

> For example, use the C code literal 0x10 to specify text format with forced 
> encryption. If the array pointer is null then encryption is not forced for 
> any parameter.
> If encryption is forced for a parameter but the parameter does not correspond 
> to an encrypted column on the server, then the call will fail and the 
> parameter will not be sent. This can be used for additional security against 
> a compromised server. (The drawback is that application code then needs to be 
> kept up to date with knowledge about which columns are encrypted rather than 
> letting the server specify this.)

 I think you should say something about how specifying 0x11 will behave -- in 
other words, asking for encrypted binary data.  I believe that this is will 
draw a "format must be text for encrypted parameter" error, and that the docs 
should clearly say so.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Transparent column encryption

2023-02-11 Thread Mark Dilger

> On Jan 25, 2023, at 10:44 AM, Peter Eisentraut 
>  wrote:
> 
> Here is a new patch.  Changes since v14:
> 
> - Fixed some typos (review by Justin Pryzby)
> - Fixed backward compat. psql and pg_dump (review by Justin Pryzby)
> - Doc additions (review by Jacob Champion)
> - Validate column_encryption option in libpq (review by Jacob Champion)
> - Handle column encryption in inheritance
> - Change CEKs and CMKs to live inside 
> schemas

Thanks Peter.  Here are some observations about the documentation in patch 
version 15.

In acronyms.sgml, the CEK and CMK entries should link to documentation, perhaps 
linkend="glossary-column-encryption-key" and 
linkend="glossary-column-master-key".  These glossary entries should in turn 
link to linkend="ddl-column-encryption".

In ddl.sgml, the sentence "forcing encryption of certain parameters in the 
client library (see its documentation)" should link to 
linkend="libpq-connect-column-encryption".

Did you intend the use of "transparent data encryption" (rather than 
"transparent column encryption") in datatype.sgml?  If so, what's the 
difference?

Is this feature intended to be available from ecpg?  If so, can we maybe 
include an example in 36.3.4. Prepared Statements showing how to pass the 
encrypted values securely.  If not, can we include verbiage about that 
limitation, so folks don't waste time trying to figure out how to do it?

The documentation for pg_dump (and pg_dumpall) now includes a 
--decrypt-encrypted-columns option, which I suppose requires cmklookup to first 
be configured, and for PGCMKLOOKUP to be exported.  There isn't anything in the 
pg_dump docs about this, though, so maybe a link to section 5.5.3 with a 
warning about not running pg_dump this way on the database server itself?

How does a psql user mark a parameter as having forced encryption?  A libpq 
user can specify this in the paramFormats array, but I don't see any syntax for 
doing this from psql.  None of the perl tap tests you've included appear to do 
this (except indirectly when calling test_client); grep'ing for the libpq error 
message "parameter with forced encryption is not to be encrypted" in the tests 
has no matches.  Is it just not possible?  I thought you'd mentioned some 
syntax for this when we spoke in person, but I don't see it now.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Transparent column encryption

2023-02-03 Thread Jacob Champion

On Tue, Jan 31, 2023 at 5:26 AM Peter Eisentraut
 wrote:
> See here for example:
> https://learn.microsoft.com/en-us/sql/relational-databases/security/encryption/always-encrypted-database-engine?view=sql-server-ver15

Hm. I notice they haven't implemented more than one algorithm, so I
wonder if they're going to be happy with their decision to
mix-and-match when number two arrives.

> For pg_dump, I'd like a mode that makes all values parameters of an
> INSERT statement.  But obviously not all of those will be encrypted.  So
> I think we'd want a per-parameter syntax.

Makes sense. Maybe something that defaults to encrypted with opt-out
per parameter?

UPDATE t SET name = $1 WHERE id = $2
\encbind "Jacob" plaintext(24)

> > More concretely: should psql allow you to push arbitrary text into an
> > encrypted \bind parameter, like it does now?
>
> We don't have any data type awareness like that now in psql or libpq.
> It would be quite a change to start now.

I agree. It just feels weird that a misbehaving client can "attack"
the other client implementations using it, and we don't make any
attempt to mitigate it.

--Jacob

Re: Transparent column encryption

2023-01-31 Thread Peter Eisentraut


On 30.01.23 23:30, Jacob Champion wrote:

The column encryption algorithm is set per-column -- but isn't it
tightly coupled to the CEK, since the key length has to match? From a
layperson perspective, using the same key to encrypt the same plaintext
under two different algorithms (if they happen to have the same key
length) seems like it might be cryptographically risky. Is there a
reason I should be encouraged to do that?


Not really.  I was also initially confused by this setup, but that's how
other similar systems are set up, so I thought it would be confusing to
do it differently.


Which systems let you mix and match keys and algorithms this way? I'd
like to take a look at them.


See here for example: 
https://learn.microsoft.com/en-us/sql/relational-databases/security/encryption/always-encrypted-database-engine?view=sql-server-ver15



With the loss of \gencr it looks like we also lost a potential way to
force encryption from within psql. Any plans to add that for v1?


\gencr didn't do that either.  We could do it.  The libpq API supports
it.  We just need to come up with some syntax for psql.


Do you think people would rather set encryption for all parameters at
once -- something like \encbind -- or have the ability to mix
encrypted and unencrypted parameters?


For pg_dump, I'd like a mode that makes all values parameters of an 
INSERT statement.  But obviously not all of those will be encrypted.  So 
I think we'd want a per-parameter syntax.



More concretely: should psql allow you to push arbitrary text into an
encrypted \bind parameter, like it does now?


We don't have any data type awareness like that now in psql or libpq. 
It would be quite a change to start now.  How would that deal with data 
type extensibility, is an obvious question to start with.  Don't know.

Re: Transparent column encryption

2023-01-30 Thread Jacob Champion

On Wed, Jan 25, 2023 at 11:00 AM Peter Eisentraut
 wrote:
> > When writing the paragraph on RSA-OAEP I was reminded that we didn't
> > really dig into the asymmetric/symmetric discussion. Assuming that most
> > first-time users will pick the builtin CMK encryption method, do we
> > still want to have an asymmetric scheme implemented first instead of a
> > symmetric keywrap? I'm still concerned about that public key, since it
> > can't really be made public.
>
> I had started coding that, but one problem was that the openssl CLI
> doesn't really provide any means to work with those kinds of keys.  The
> "openssl enc" command always wants to mix in a password.  Without that,
> there is no way to write a test case, and more crucially no way for
> users to set up these kinds of keys.  Unless we write our own tooling
> for this, which, you know, the patch just passed 400k in size.

Arrgh: https://github.com/openssl/openssl/issues/10605

> > The column encryption algorithm is set per-column -- but isn't it
> > tightly coupled to the CEK, since the key length has to match? From a
> > layperson perspective, using the same key to encrypt the same plaintext
> > under two different algorithms (if they happen to have the same key
> > length) seems like it might be cryptographically risky. Is there a
> > reason I should be encouraged to do that?
>
> Not really.  I was also initially confused by this setup, but that's how
> other similar systems are set up, so I thought it would be confusing to
> do it differently.

Which systems let you mix and match keys and algorithms this way? I'd
like to take a look at them.

> > With the loss of \gencr it looks like we also lost a potential way to
> > force encryption from within psql. Any plans to add that for v1?
>
> \gencr didn't do that either.  We could do it.  The libpq API supports
> it.  We just need to come up with some syntax for psql.

Do you think people would rather set encryption for all parameters at
once -- something like \encbind -- or have the ability to mix
encrypted and unencrypted parameters?

> > Are there plans to document client-side implementation requirements, to
> > ensure cross-client compatibility? Things like the "PG\x00\x01"
> > associated data are buried at the moment (or else I've missed them in
> > the docs). If you're holding off until the feature is more finalized,
> > that's fine too.
>
> This is documented in the protocol chapter, which I thought was the
> right place.  Did you want more documentation, or in a different place?

I just missed it; sorry.

> > Speaking of cross-client compatibility, I'm still disconcerted by the
> > ability to write the value "hello world" into an encrypted integer
> > column. Should clients be required to validate the text format, using
> > the attrealtypid?
>
> Well, we can ask them to, but we can't really require them, in a
> cryptographic sense.  I'm not sure what more we can do.

Right -- I just mean that clients need to pay more attention to it
now, whereas before they may have delegated correctness to the server.
The problem is documented in the context of deterministic encryption,
but I think it applies to randomized as well.

More concretely: should psql allow you to push arbitrary text into an
encrypted \bind parameter, like it does now?

> > It occurred to me when looking at the "unspecified" CMK scheme that the
> > CEK doesn't really have to be an encryption key at all. In that case it
> > can function more like a (possibly signed?) cookie for lookup, or even
> > be ignored altogether if you don't want to use a wrapping scheme
> > (similar to JWE's "direct" mode, maybe?). So now you have three ways to
> > look up or determine a column encryption key (CMK realm, CMK name, CEK
> > cookie)... is that a concept worth exploring in v1 and/or the documentation?
>
> I don't completely follow this.

Yeah, I'm not expressing it very well. My feeling is that the
organization system here -- a realm "contains" multiple CMKs, a CMK
encrypts multiple CEKs -- is so general and flexible that it may need
some suggested guardrails for people to use it sanely. I just don't
know what those guardrails should be. I was motivated by the
realization that CEKs don't even need to be keys.

Thanks,
--Jacob

Re: Transparent column encryption

2023-01-25 Thread Peter Eisentraut


On 19.01.23 21:48, Jacob Champion wrote:

I like the existing "caveats" documentation, and I've attached a sample
patch with some more caveats documented, based on some of the upthread
conversation:

- text format makes fixed-length columns leak length information too
- you only get partial protection against the Evil DBA
- RSA-OAEP public key safety

(Feel free to use/remix/discard as desired.)


I have added those in the v15 patch I just posted.


When writing the paragraph on RSA-OAEP I was reminded that we didn't
really dig into the asymmetric/symmetric discussion. Assuming that most
first-time users will pick the builtin CMK encryption method, do we
still want to have an asymmetric scheme implemented first instead of a
symmetric keywrap? I'm still concerned about that public key, since it
can't really be made public.


I had started coding that, but one problem was that the openssl CLI 
doesn't really provide any means to work with those kinds of keys.  The 
"openssl enc" command always wants to mix in a password.  Without that, 
there is no way to write a test case, and more crucially no way for 
users to set up these kinds of keys.  Unless we write our own tooling 
for this, which, you know, the patch just passed 400k in size.



For the padding caveat:


+  There is no concern if all values are of the same length (e.g., credit
+  card numbers).


I nodded along to this statement last year, and then this year I learned
that CCNs aren't fixed-length. So with a 16-byte block, you're probably
going to be able to figure out who has an American Express card.


Heh.  I have removed that parenthetical remark.


The column encryption algorithm is set per-column -- but isn't it
tightly coupled to the CEK, since the key length has to match? From a
layperson perspective, using the same key to encrypt the same plaintext
under two different algorithms (if they happen to have the same key
length) seems like it might be cryptographically risky. Is there a
reason I should be encouraged to do that?


Not really.  I was also initially confused by this setup, but that's how 
other similar systems are set up, so I thought it would be confusing to 
do it differently.



With the loss of \gencr it looks like we also lost a potential way to
force encryption from within psql. Any plans to add that for v1?


\gencr didn't do that either.  We could do it.  The libpq API supports 
it.  We just need to come up with some syntax for psql.



While testing, I forgot how the new option worked and connected with
`column_encryption=on` -- and then I accidentally sent unencrypted data
to the server, since `on` means "not enabled". :( The server errors out
after the damage is done, of course, but would it be okay to strictly
validate that option's values?


fixed in v15


Are there plans to document client-side implementation requirements, to
ensure cross-client compatibility? Things like the "PG\x00\x01"
associated data are buried at the moment (or else I've missed them in
the docs). If you're holding off until the feature is more finalized,
that's fine too.


This is documented in the protocol chapter, which I thought was the 
right place.  Did you want more documentation, or in a different place?



Speaking of cross-client compatibility, I'm still disconcerted by the
ability to write the value "hello world" into an encrypted integer
column. Should clients be required to validate the text format, using
the attrealtypid?


Well, we can ask them to, but we can't really require them, in a 
cryptographic sense.  I'm not sure what more we can do.



It occurred to me when looking at the "unspecified" CMK scheme that the
CEK doesn't really have to be an encryption key at all. In that case it
can function more like a (possibly signed?) cookie for lookup, or even
be ignored altogether if you don't want to use a wrapping scheme
(similar to JWE's "direct" mode, maybe?). So now you have three ways to
look up or determine a column encryption key (CMK realm, CMK name, CEK
cookie)... is that a concept worth exploring in v1 and/or the documentation?


I don't completely follow this.

Re: Transparent column encryption

2023-01-25 Thread Peter Eisentraut


On 12.01.23 17:32, Peter Eisentraut wrote:
Can we do anything about the attack vector wherein a malicious DBA 
simply copies the encrypted datum from one row to another?


We discussed this earlier [0].  This patch is not that feature.  We 
could get there eventually, but it would appear to be an immense amount 
of additional work.  We have to start somewhere.


I've been thinking, this could be done as a "version 2" of the currently 
proposed feature, within the same framework.  We'd extend the 
RowDescription and ParameterDescription messages to provide primary key 
information, some flags, then the client would have enough to know what 
to do.  As you wrote in your follow-up message, a challenge would be to 
handle statements that do not touch all the columns.  We'd need to work 
through this and consider all the details.

Re: Transparent column encryption

2023-01-25 Thread Peter Eisentraut


On 07.01.23 01:34, Justin Pryzby wrote:

"ON (CASE WHEN a.attrealtypid <> 0 THEN a.attrealtypid ELSE a.atttypid END = 
t.oid)\n"

This breaks interoperability with older servers:
ERROR:  column a.attrealtypid does not exist

Same in describe.c

Find attached some typos and bad indentation.  I'm sending this off now
as I've already sat on it for 2 weeks since starting to look at the
patch.


Thanks, I have integrated all that into the v15 patch I just posted.

Re: Transparent column encryption

2023-01-19 Thread Jacob Champion

On 12/31/22 06:17, Peter Eisentraut wrote:
> On 21.12.22 06:46, Peter Eisentraut wrote:
>> And another update.  The main changes are that I added an 'unspecified' 
>> CMK algorithm, which indicates that the external KMS knows what it is 
>> but the database system doesn't.  This was discussed a while ago.  I 
>> also changed some details about how the "cmklookup" works in libpq. Also 
>> added more code comments and documentation and rearranged some code.

Trying to delay a review until I had "completed it" has only led to me
not reviewing, so here's a partial one. Let me know what pieces of the
implementation and/or architecture you're hoping to get more feedback on.

I like the existing "caveats" documentation, and I've attached a sample
patch with some more caveats documented, based on some of the upthread
conversation:

- text format makes fixed-length columns leak length information too
- you only get partial protection against the Evil DBA
- RSA-OAEP public key safety

(Feel free to use/remix/discard as desired.)

When writing the paragraph on RSA-OAEP I was reminded that we didn't
really dig into the asymmetric/symmetric discussion. Assuming that most
first-time users will pick the builtin CMK encryption method, do we
still want to have an asymmetric scheme implemented first instead of a
symmetric keywrap? I'm still concerned about that public key, since it
can't really be made public. (And now that "unspecified" is available, I
think an asymmetric CMK could be easily created by users that have a
niche use case, and then we wouldn't have to commit to supporting it
forever.)

For the padding caveat:

> +  There is no concern if all values are of the same length (e.g., credit
> +  card numbers).

I nodded along to this statement last year, and then this year I learned
that CCNs aren't fixed-length. So with a 16-byte block, you're probably
going to be able to figure out who has an American Express card.

The column encryption algorithm is set per-column -- but isn't it
tightly coupled to the CEK, since the key length has to match? From a
layperson perspective, using the same key to encrypt the same plaintext
under two different algorithms (if they happen to have the same key
length) seems like it might be cryptographically risky. Is there a
reason I should be encouraged to do that?

With the loss of \gencr it looks like we also lost a potential way to
force encryption from within psql. Any plans to add that for v1?

While testing, I forgot how the new option worked and connected with
`column_encryption=on` -- and then I accidentally sent unencrypted data
to the server, since `on` means "not enabled". :( The server errors out
after the damage is done, of course, but would it be okay to strictly
validate that option's values?

Are there plans to document client-side implementation requirements, to
ensure cross-client compatibility? Things like the "PG\x00\x01"
associated data are buried at the moment (or else I've missed them in
the docs). If you're holding off until the feature is more finalized,
that's fine too.

Speaking of cross-client compatibility, I'm still disconcerted by the
ability to write the value "hello world" into an encrypted integer
column. Should clients be required to validate the text format, using
the attrealtypid?

It occurred to me when looking at the "unspecified" CMK scheme that the
CEK doesn't really have to be an encryption key at all. In that case it
can function more like a (possibly signed?) cookie for lookup, or even
be ignored altogether if you don't want to use a wrapping scheme
(similar to JWE's "direct" mode, maybe?). So now you have three ways to
look up or determine a column encryption key (CMK realm, CMK name, CEK
cookie)... is that a concept worth exploring in v1 and/or the documentation?

Thanks,
--Jacobdiff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 55f33a2f5f..06e1c077d5 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1588,7 +1588,33 @@ export PGCMKLOOKUP
 card numbers).  But if there are signficant length differences between
 valid values and that length information is security-sensitive, then
 application-specific workarounds such as padding would need to be applied.
-How to do that securely is beyond the scope of this manual.
+How to do that securely is beyond the scope of this manual.  Note that
+column encryption is applied to the text representation of the stored 
value,
+so length differences can be leaked even for fixed-length column types 
(e.g.
+bigint, whose largest decimal representation is longer
+than 16 bytes).
+   
+
+   
+Column encryption provides only partial protection against a malicious
+user with write access to the table.  Once encrypted, any modifications to 
a
+stored value on the server side will cause a decryption failure on the
+client.  However, a user with write access may still freely swap encrypted
+values between rows or columns (o

Re: Transparent column encryption

2023-01-12 Thread Peter Eisentraut


On 10.01.23 18:26, Mark Dilger wrote:

I wonder if logical replication could be made to work more easily with this 
feature.  Specifically, subscribers of encrypted columns will need the 
encrypted column encryption key (CEK) and the name of the column master key 
(CMD) as exists on the publisher, but getting access to that is not automated 
as far as I can see. It doesn't come through automatically as part of a 
subscription, and publisher's can't publish the pg_catalog tables where the 
keys are kept (because publishing system tables is not supported.)  Is it 
reasonable to make available the CEK and CMK to subscribers in an automated 
fashion, to facilitate setting up logical replication with less manual 
distribution of key information?  Is this already done, and I'm just not 
recognizing that you've done it?


This would be done as part of DDL replication.


Can we do anything about the attack vector wherein a malicious DBA simply 
copies the encrypted datum from one row to another?


We discussed this earlier [0].  This patch is not that feature.  We 
could get there eventually, but it would appear to be an immense amount 
of additional work.  We have to start somewhere.



[0]: 
https://www.postgresql.org/message-id/4fbcf5540633699fc3d81ffb59cb0ac884673a7c.ca...@vmware.com

Re: Transparent column encryption

2023-01-11 Thread vignesh C

On Sat, 31 Dec 2022 at 19:47, Peter Eisentraut
 wrote:
>
> On 21.12.22 06:46, Peter Eisentraut wrote:
> > And another update.  The main changes are that I added an 'unspecified'
> > CMK algorithm, which indicates that the external KMS knows what it is
> > but the database system doesn't.  This was discussed a while ago.  I
> > also changed some details about how the "cmklookup" works in libpq. Also
> > added more code comments and documentation and rearranged some code.
> >
> > According to my local todo list, this patch is now complete.
>
> Another update, with some merge conflicts resolved.  I also fixed up the
> remaining TODO markers in the code, which had something to do with Perl
> and Windows.  I did some more work on schema handling, e.g., CREATE
> TABLE / LIKE, views, partitioning etc. on top of encrypted columns,
> mostly tedious and repetitive, nothing interesting.  I also rewrote the
> code that extracts the underlying tables and columns corresponding to
> query parameters.  It's now much simpler and better encapsulated.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
5f6401f81cb24bd3930e0dc589fc4aa8b5424cdc ===
=== applying patch ./v14-0001-Transparent-column-encryption.patch

Hunk #1 FAILED at 1109.

1 out of 5 hunks FAILED -- saving rejects to file doc/src/sgml/protocol.sgml.rej

patching file doc/src/sgml/ref/create_table.sgml
Hunk #3 FAILED at 351.
Hunk #4 FAILED at 704.
2 out of 4 hunks FAILED -- saving rejects to file
doc/src/sgml/ref/create_table.sgml.rej

Hunk #1 FAILED at 1420.
Hunk #2 FAILED at 4022.
2 out of 2 hunks FAILED -- saving rejects to file
doc/src/sgml/ref/psql-ref.sgml.rej

[1] - http://cfbot.cputube.org/patch_41_3718.log

Regards,
Vignesh

Re: Transparent column encryption

2023-01-10 Thread Mark Dilger

> On Jan 10, 2023, at 9:26 AM, Mark Dilger  wrote:
> 
>-- Cryptographically connected to the encrypted record
>patient_id  BIGINT NOT NULL,
>patient_ssn CHAR(11),
> 
>-- The encrypted record
>patient_record TEXT ENCRYPTED WITH (column_encryption_key = cek1,
>column_encryption_salt = (patient_id, 
> patient_ssn)),

As you mention upthread, tying columns together creates problems for statements 
that only operate on a subset of columns.  Allowing schema designers a choice 
about tying the encrypted column to zero or more other columns allows them to 
choose which works best for their security needs.

The example above would make a statement like "UPDATE patient_record SET 
patient_record = $1 \bind '{some json whatever}'" raise an exception at the 
libpq client level, but maybe that's what schema designers wants it to do.  If 
not, they should omit the column_encryption_salt option in the create table 
statement; but if so, they should expect to have to specify the other columns 
as part of the update statement, possibly as part of the where clause, like

UPDATE patient_record
SET patient_record = $1
WHERE patient_id = 12345
  AND patient_ssn = '111-11-' 
\bind '{some json record}'

and have the libpq get the salt column values from the where clause (which may 
be tricky to implement), or perhaps use some new bind syntax like

UPDATE patient_record
SET patient_record = ($1:$2,$3)   -- new, wonky syntax
WHERE patient_id = $2
  AND patient_ssn = $3 
\bind '{some json record}' 12345 '111-11-'

which would be error prone, since the sql statement could specify the 
($1:$2,$3) inconsistently with the where clause, or perhaps specify the "new" 
salt columns even when not changed, like

UPDATE patient_record
SET patient_record = $1, patient_id = 12345, patient_ssn = 
"111-11-"
WHERE patient_id = 12345
  AND patient_ssn = "111-11-"
\bind '{some json record}'

which looks kind of nuts at first glance, but is grammatically consistent with 
cases where one or both of the patient_id or patient_ssn are also being 
changed, like

UPDATE patient_record
SET patient_record = $1, patient_id = 98765, patient_ssn = 
"999-99-"
WHERE patient_id = 12345
  AND patient_ssn = "111-11-"
\bind '{some json record}'

Or, of course, you can ignore these suggestions or punt them to some future 
patch that extends the current work, rather than trying to get it all done in 
the first column encryption commit.  But it seems useful to think about what 
future directions would be, to avoid coding ourselves into a corner, making 
such future work harder.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Transparent column encryption

2023-01-10 Thread Mark Dilger

> On Dec 31, 2022, at 6:17 AM, Peter Eisentraut 
>  wrote:
> 
> Another update, with some merge conflicts resolved. 

Hi Peter, thanks for the patch!

I wonder if logical replication could be made to work more easily with this 
feature.  Specifically, subscribers of encrypted columns will need the 
encrypted column encryption key (CEK) and the name of the column master key 
(CMD) as exists on the publisher, but getting access to that is not automated 
as far as I can see. It doesn't come through automatically as part of a 
subscription, and publisher's can't publish the pg_catalog tables where the 
keys are kept (because publishing system tables is not supported.)  Is it 
reasonable to make available the CEK and CMK to subscribers in an automated 
fashion, to facilitate setting up logical replication with less manual 
distribution of key information?  Is this already done, and I'm just not 
recognizing that you've done it?

Can we do anything about the attack vector wherein a malicious DBA simply 
copies the encrypted datum from one row to another?  Imagine the DBA Alice 
wants to murder a hospital patient Bob by removing the fact that Bob is deathly 
allergic to latex.  She cannot modify the Bob's encrypted and authenticated 
record, but she can easily update Bob's record with the encrypted record of a 
different patient Charlie.  Likewise, if she want's Bob to pay Charlie's bill, 
she can replace Charlie's encrypted credit card number with Bob's, and once Bob 
is dead, he won't dispute the charges.

An encrypted-and-authenticated column value should be connected with its row in 
some way that Alice cannot circumvent.  In the patch as you have it written, 
the client application can include row information in the patient record 
(specifically, the patient's name, ssn, etc) and verify when any patient record 
is retrieved that this information matches.  But that's hardly "transparent" to 
the client.  It's something all clients will have to do, and easy to forget to 
do in some code path.  Also, for encrypted fixed-width columns, it is not an 
option.  So it seems the client needs to "salt" (maybe not the right term for 
what I have in mind) the encryption with some relevant other columns, and 
that's something the libpq client would need to understand, and something the 
patch's syntax needs to support.  Something like:

CREATE TABLE patient_records (
-- Cryptographically connected to the encrypted record
patient_id  BIGINT NOT NULL,
patient_ssn CHAR(11),

-- The encrypted record
patient_record TEXT ENCRYPTED WITH (column_encryption_key = cek1,
column_encryption_salt = (patient_id, 
patient_ssn)),

-- Extra stuff, not cryptographically connected to anything
next_of_kin TEXT,
phone_number BIGINT,
...
);

I have not selected any algorithms that include such "salt"ing (again, maybe 
the wrong word) because I'm just trying to discuss the general feature, not get 
into the weeds about which cryptographic algorithm to select.

Thoughts?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Transparent column encryption

2023-01-06 Thread Justin Pryzby

"ON (CASE WHEN a.attrealtypid <> 0 THEN a.attrealtypid ELSE a.atttypid END = 
t.oid)\n"

This breaks interoperability with older servers:
ERROR:  column a.attrealtypid does not exist

Same in describe.c

Find attached some typos and bad indentation.  I'm sending this off now
as I've already sat on it for 2 weeks since starting to look at the
patch.

-- 
Justin
>From d9dcf23d25ba4452fb12c4b065ab5215e2228882 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 6 Jan 2023 18:28:59 -0600
Subject: [PATCH] f!

---
 doc/src/sgml/datatype.sgml |  4 ++--
 doc/src/sgml/ddl.sgml  |  4 ++--
 doc/src/sgml/libpq.sgml|  2 +-
 doc/src/sgml/protocol.sgml |  2 +-
 src/bin/pg_dump/pg_dump.c  |  4 ++--
 src/interfaces/libpq/fe-exec.c | 12 ++--
 6 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 56b2e1d0d1e..243e6861506 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -5369,7 +5369,7 @@ WHERE ...
 pg_encrypted_rnd (for randomized encryption) or
 pg_encrypted_det (for deterministic encryption); see .  Most of the database system treats
-this as normal types.  For example, the type pg_encrypted_det has
+these as normal types.  For example, the type pg_encrypted_det has
 an equals operator that allows lookup of encrypted values.  It is,
 however, not allowed to create a table using one of these types directly
 as a column type.
@@ -5383,7 +5383,7 @@ WHERE ...
 Clients that don't support transparent column encryption or have disabled
 it will see the encrypted values in this format.  Clients that support
 transparent data encryption will not see these types in result sets, as
-the protocol layer will translate them back to declared underlying type in
+the protocol layer will translate them back to the declared underlying type in
 the table definition.

 
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b8364a91f9a..a7624d6a60c 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1263,7 +1263,7 @@ CREATE TABLE customers (
randomized encryption, which is the default.  Randomized encryption uses a
random initialization vector for each encryption, so that even if the
plaintext of two rows is equal, the encrypted values will be different.
-   This prevents someone with direct access to the database server to make
+   This prevents someone with direct access to the database server from making
computations such as distinct counts on the encrypted values.
Deterministic encryption uses a fixed initialization vector.  This reduces
security, but it allows equality searches on encrypted values.  The
@@ -1540,7 +1540,7 @@ export PGCMKLOOKUP
 

 In general, column encryption is never a replacement for additional
-security and encryption techniques such as transmission encryption
+security and encryption techniques such as transport encryption
 (SSL/TLS), storage encryption, strong access control, and password
 security.  Column encryption only targets specific use cases and should be
 used in conjunction with additional security measures.
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index a2d413bafd5..9b7f76db7a9 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3012,7 +3012,7 @@ PGresult *PQexecParams(PGconn *conn,
 If encryption is forced for a parameter but the parameter does not
 correspond to an encrypted column on the server, then the call
 will fail and the parameter will not be sent.  This can be used
-for additional security against a comprimised server.  (The
+for additional security against a compromised server.  (The
 drawback is that application code then needs to be kept up to date
 with knowledge about which columns are encrypted rather than
 letting the server specify this.)
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 1a9b8abd7f2..caa6e3174ee 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -5807,7 +5807,7 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"

 
  If the field is encrypted, this specifies the identifier of the
- encrypt algorithm, else zero.
+ encryption algorithm, else zero.
 

   
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index f82c2496fd5..99f2583c34a 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -690,7 +690,7 @@ main(int argc, char **argv)
 	 * --rows-per-insert were specified.
 	 */
 	if (dopt.cparams.column_encryption && dopt.dump_inserts == 0)
-		pg_fatal("option --decrypt_encrypted_columns requires option --inserts, --rows-per-insert, or --column-inserts");
+		pg_fatal("option --decrypt-encrypted-columns requires

Re: Transparent column encryption

2022-11-24 Thread Jehan-Guillaume de Rorthais

On Wed, 23 Nov 2022 19:45:06 +0100
Peter Eisentraut  wrote:
> On 28.10.22 12:16, Jehan-Guillaume de Rorthais wrote:
[...]

> >* I wonder if encryption related fields in ParameterDescription and
> >  RowDescription could be optional somehow? The former might be quite
> >  large when using a lot of parameters (like, imagine a large and ugly
> >  "IN($1...$N)"). On the other hand, these messages are not sent in high
> >  frequency anyway...  
> 
> They are only used if you turn on the column_encryption protocol option. 
>   Or did you mean make them optional even then?

I meant even when column_encryption is turned on.

Regards,

Re: Transparent column encryption

2022-11-23 Thread Peter Eisentraut


On 28.10.22 12:16, Jehan-Guillaume de Rorthais wrote:

I did a review of the documentation and usability.


I have incorporated some of your feedback into the v11 patch I just posted.


# Applying patch

   The patch applied on top of f13b2088fa2 without trouble. Notice a small
   warning during compilation:
   
 colenccmds.c:134:27: warning: ‘encval’ may be used uninitialized
   
   A simple fix could be:
   
 +++ b/src/backend/commands/colenccmds.c

 @@ -119,2 +119,3
 encval = defGetString(encvalEl);
 +   *encval_p = encval;
 }
 @@ -132,4 +133,2
 *alg_p = alg;
 -   if (encval_p)
 -   *encval_p = encval;
  }


fixed


# Documentation

   * In page "create_column_encryption_key.sgml", both encryption algorithms for
 a CMK are declared as the default one:

 + 
 +  The encryption algorithm that was used to encrypt the key material 
of
 +  this column encryption key.  Supported algorithms are:
 +  
 +   
 +RSAES_OAEP_SHA_1 (default)
 +   
 +   
 +RSAES_OAEP_SHA_256 (default)
 +   
 +  
 + 
   
 As far as I understand the code, I suppose RSAES_OAEP_SHA_1 should be the

 default.


fixed


   I believe two information should be clearly shown to user somewhere in
   chapter 5.5 instead of being buried deep in documentation:
  
   * «COPY does not support column decryption», currently buried in pg_dump page

   * «When transparent column encryption is enabled, the client encoding must
 match the server encoding», currently buried in the protocol description
 page.

   * In the libpq doc of PQexecPrepared2, "paramForceColumnEncryptions" might
 deserve a little more detail about the array format, like «0 means "don't
 enforce" and anything else enforce the encryption is enabled on this
 column». By the way, maybe this array could be an array of boolean?

   * In chapter 55.2.5 (protocol-flow) is stated: «when column encryption is
 used, the plaintext is always in text format (not binary format).». Does it
 means parameter "resultFormat" in "PQexecPrepared2" should always be 0? If
 yes, is it worth keeping this argument? Moreover, this format constraint
 should probably be explained in the libpq page as well.


I will keep these suggestions around.  Some of these things will 
probably change again, so I'll make sure to update the documentation 
when I touch it again.



# Protocol

   * In the ColumnEncryptionKey message, it seems the field holding the length
 key material is redundant with the message length itself, as all other
 fields have a known size. The key material length is the message length -
 (4+4+4+2). For comparison, the AuthenticationSASLContinue message has a
 variable data size but rely only on the message length without additional
 field.


I find that weird, though.  An explicit length seems better.  Things 
like AuthenticationSASLContinue only work if they have exactly one 
variable-length data item.



   * I wonder if encryption related fields in ParameterDescription and
 RowDescription could be optional somehow? The former might be quite large
 when using a lot of parameters (like, imagine a large and ugly
 "IN($1...$N)"). On the other hand, these messages are not sent in high
 frequency anyway...


They are only used if you turn on the column_encryption protocol option. 
 Or did you mean make them optional even then?



# libpq

   Would it be possible to have an encryption-ready PQexecParams() equivalent
   of what PQprepare/describe/exec do?


I plan to do that.


# psql

   * You already mark \d in the TODO. But having some psql command to list the
 known CEK/CMK might be useful as well.


right


   * about write queries using psql, would it be possible to use psql
 parameters? Eg.:

   => \set c1 myval
   => INSERT INTO mytable VALUES (:'c1') \gencr


No, because those are resolved by psql before libpq sees them.


# Manual tests

   * The lookup error message is shown twice for some reason:

   => select * from test_tce;
   no CMK lookup found for realm ""
   
   no CMK lookup found for realm ""


 It might worth adding the column name and the CMK/CEK names related to each
 error message? Last, notice the useless empty line between messages.


I'll look into that.


   * When "DROP IF EXISTS" a missing CEK or CMK, the command raise an
 "unrecognized object type":

   => DROP COLUMN MASTER KEY IF EXISTS noexists;
   ERROR:  unrecognized object type: 10
   => DROP COLUMN ENCRYPTION KEY IF EXISTS noexists;
   ERROR:  unrecognized object type: 8


fixed


   * I was wondering what "pg_typeof" should return. It currently returns
 "pg_encrypted_*". If this is supposed to be transparent from the client
 perspective, shouldn't it re

Re: Transparent column encryption

2022-10-28 Thread Frédéric Yhuel


Hi,

Here are a few more things I noticed :

If a CEK is encrypted with cmk1 and cmk2, but cmk1 isn't found on the 
client,the following error is printed twice for the very first SELECT 
statement:


  could not open file "/path/to/cmk1.pem": No such file or directory

...and nothing is returned. The next queries in the same session would 
work  correctly (cmk2 is used for the decryption of the CEK). An INSERT 
statement si handled properly, though : one (and only one) error 
message, and line actually inserted in all cases).


For example :

  postgres=# SELECT * FROM customers ;
  could not open file "/path/to/cmk1.pem": No such file or directory

  could not open file "/path/to/cmk1.pem": No such file or directory

  postgres=# SELECT * FROM customers ;
   id | name  | creditcard_num
  +---+-
1 | toto  | 546843351354245
2 | babar | 546843351354245



  postgres=# INSERT INTO customers (id, name, creditcard_num) VALUES 
 ($1, $2, $3) \gencr '3' 'toto' '546888351354245';

  could not open file "/path/to/cmk1.pem": No such file or directory

  INSERT 0 1
  postgres=# SELECT * FROM customers ;
   id | name  | creditcard_num
  +---+-
1 | toto  | 546843351354245
2 | babar | 546843351354245
3 | toto  | 546888351354245


From the documentation of CREATE COLUMN MASTER KEY, it looks like the 
REALM is optional, but both

  CREATE COLUMN MASTER KEY cmk1;
and
  CREATE COLUMN MASTER KEY cmk1 WITH ();
returns a syntax error.


About AEAD, the documentation says :
> The “associated data” in these algorithms consists of 4 bytes: The 
ASCII letters P and G (byte values 80 and 71), followed by the algorithm 
ID as a 16-bit unsigned integer in network byte order.


My guess is that it serves no real purpose, did I misunderstand ?

Re: Transparent column encryption

2022-10-28 Thread Jehan-Guillaume de Rorthais

Hi,

I did a review of the documentation and usability.

# Applying patch

  The patch applied on top of f13b2088fa2 without trouble. Notice a small
  warning during compilation:
  
colenccmds.c:134:27: warning: ‘encval’ may be used uninitialized
  
  A simple fix could be:
  
+++ b/src/backend/commands/colenccmds.c
@@ -119,2 +119,3
encval = defGetString(encvalEl);
+   *encval_p = encval;
}
@@ -132,4 +133,2
*alg_p = alg;
-   if (encval_p)
-   *encval_p = encval;
 }

# Documentation

  * In page "create_column_encryption_key.sgml", both encryption algorithms for
a CMK are declared as the default one:

+ 
+  The encryption algorithm that was used to encrypt the key material of
+  this column encryption key.  Supported algorithms are:
+  
+   
+RSAES_OAEP_SHA_1 (default)
+   
+   
+RSAES_OAEP_SHA_256 (default)
+   
+  
+ 
  
As far as I understand the code, I suppose RSAES_OAEP_SHA_1 should be the
default. 

  I believe two information should be clearly shown to user somewhere in
  chapter 5.5 instead of being buried deep in documentation:
 
  * «COPY does not support column decryption», currently buried in pg_dump page
  * «When transparent column encryption is enabled, the client encoding must
match the server encoding», currently buried in the protocol description
page.

  * In the libpq doc of PQexecPrepared2, "paramForceColumnEncryptions" might
deserve a little more detail about the array format, like «0 means "don't
enforce" and anything else enforce the encryption is enabled on this
column». By the way, maybe this array could be an array of boolean? 

  * In chapter 55.2.5 (protocol-flow) is stated: «when column encryption is
used, the plaintext is always in text format (not binary format).». Does it
means parameter "resultFormat" in "PQexecPrepared2" should always be 0? If
yes, is it worth keeping this argument? Moreover, this format constraint
should probably be explained in the libpq page as well.

# Protocol

  * In the ColumnEncryptionKey message, it seems the field holding the length
key material is redundant with the message length itself, as all other
fields have a known size. The key material length is the message length -
(4+4+4+2). For comparison, the AuthenticationSASLContinue message has a
variable data size but rely only on the message length without additional
field.

  * I wonder if encryption related fields in ParameterDescription and
RowDescription could be optional somehow? The former might be quite large
when using a lot of parameters (like, imagine a large and ugly
"IN($1...$N)"). On the other hand, these messages are not sent in high
frequency anyway...

# libpq

  Would it be possible to have an encryption-ready PQexecParams() equivalent
  of what PQprepare/describe/exec do?

# psql

  * You already mark \d in the TODO. But having some psql command to list the
known CEK/CMK might be useful as well.

  * about write queries using psql, would it be possible to use psql
parameters? Eg.:

  => \set c1 myval
  => INSERT INTO mytable VALUES (:'c1') \gencr

# Manual tests

  * The lookup error message is shown twice for some reason:

  => select * from test_tce;
  no CMK lookup found for realm ""
  
  no CMK lookup found for realm ""

It might worth adding the column name and the CMK/CEK names related to each
error message? Last, notice the useless empty line between messages.

  * When "DROP IF EXISTS" a missing CEK or CMK, the command raise an
"unrecognized object type":

  => DROP COLUMN MASTER KEY IF EXISTS noexists;
  ERROR:  unrecognized object type: 10
  => DROP COLUMN ENCRYPTION KEY IF EXISTS noexists;
  ERROR:  unrecognized object type: 8

  * I was wondering what "pg_typeof" should return. It currently returns
"pg_encrypted_*". If this is supposed to be transparent from the client
perspective, shouldn't it return "attrealtypid" when the field is encrypted?

  * any reason to not support altering the CMK realm?

This patch is really interesting and would be a nice addition to the core.

Thanks!

Regards,

Re: Transparent column encryption

2022-10-13 Thread Mark Woodward

If memory serves me correctly, if you statically link openssl this will
work. If you are using ssl in a DLL, I believe that the DLL has its own
"c-library" and its own heap.

On Thu, Oct 13, 2022 at 9:43 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 06.10.22 17:19, Andres Freund wrote:
> >>> psql error: stderr: 'OPENSSL_Uplink(7FFC165CBD50,08): no
> OPENSSL_Applink'
> >> What in the world is that about?  What scant information I could find
> >> suggests that it has something to do with building a "release" build
> against
> >> an "debug" build of the openssl library, or vice versa.  But this patch
> >> doesn't introduce any use of openssl that we haven't seen before.
> > It looks to me that one needs to compile, in some form,
> openssl/applink.c and
> > link it to the application. No idea why that'd be required now and not
> > earlier.
>
> I have figured this out.  The problem is that on Windows you can't
> reliably pass stdio FILE * handles between the application and OpenSSL.
> To give the helpful places I found some Google juice, I'll mention them
> here:
>
> - https://github.com/edenhill/librdkafka/pull/3602
> - https://github.com/edenhill/librdkafka/issues/3554
> - https://www.mail-archive.com/openssl-users@openssl.org/msg91029.html
>
>
>
>

Re: Transparent column encryption

2022-10-13 Thread Peter Eisentraut


On 06.10.22 17:19, Andres Freund wrote:

psql error: stderr: 'OPENSSL_Uplink(7FFC165CBD50,08): no OPENSSL_Applink'

What in the world is that about?  What scant information I could find
suggests that it has something to do with building a "release" build against
an "debug" build of the openssl library, or vice versa.  But this patch
doesn't introduce any use of openssl that we haven't seen before.

It looks to me that one needs to compile, in some form, openssl/applink.c and
link it to the application. No idea why that'd be required now and not
earlier.


I have figured this out.  The problem is that on Windows you can't 
reliably pass stdio FILE * handles between the application and OpenSSL. 
To give the helpful places I found some Google juice, I'll mention them 
here:


- https://github.com/edenhill/librdkafka/pull/3602
- https://github.com/edenhill/librdkafka/issues/3554
- https://www.mail-archive.com/openssl-users@openssl.org/msg91029.html

Re: Transparent column encryption

2022-10-06 Thread Andres Freund

Hi,

On 2022-10-06 16:25:51 +0200, Peter Eisentraut wrote:
> On 02.10.22 09:16, Andres Freund wrote:
> > On 2022-09-27 15:51:25 +0200, Peter Eisentraut wrote:
> > > Updated version with meson build system support added (for added files and
> > > new tests).
> > 
> > This fails on windows: https://cirrus-ci.com/task/6151847080624128
> > 
> > https://api.cirrus-ci.com/v1/artifact/task/6151847080624128/testrun/build/testrun/column_encryption/001_column_encryption/log/regress_log_001_column_encryption
> > 
> > psql error: stderr: 'OPENSSL_Uplink(7FFC165CBD50,08): no 
> > OPENSSL_Applink'
> 
> What in the world is that about?  What scant information I could find
> suggests that it has something to do with building a "release" build against
> an "debug" build of the openssl library, or vice versa.  But this patch
> doesn't introduce any use of openssl that we haven't seen before.

It looks to me that one needs to compile, in some form, openssl/applink.c and
link it to the application. No idea why that'd be required now and not
earlier.

Greetings,

Andres Freund

Re: Transparent column encryption

2022-10-06 Thread Peter Eisentraut


On 02.10.22 09:16, Andres Freund wrote:

On 2022-09-27 15:51:25 +0200, Peter Eisentraut wrote:

Updated version with meson build system support added (for added files and
new tests).


This fails on windows: https://cirrus-ci.com/task/6151847080624128

https://api.cirrus-ci.com/v1/artifact/task/6151847080624128/testrun/build/testrun/column_encryption/001_column_encryption/log/regress_log_001_column_encryption

psql error: stderr: 'OPENSSL_Uplink(7FFC165CBD50,08): no OPENSSL_Applink'


What in the world is that about?  What scant information I could find 
suggests that it has something to do with building a "release" build 
against an "debug" build of the openssl library, or vice versa.  But 
this patch doesn't introduce any use of openssl that we haven't seen before.

Re: Transparent column encryption

2022-10-02 Thread Andres Freund

Hi,

On 2022-09-27 15:51:25 +0200, Peter Eisentraut wrote:
> Updated version with meson build system support added (for added files and
> new tests).

This fails on windows: https://cirrus-ci.com/task/6151847080624128

https://api.cirrus-ci.com/v1/artifact/task/6151847080624128/testrun/build/testrun/column_encryption/001_column_encryption/log/regress_log_001_column_encryption

psql error: stderr: 'OPENSSL_Uplink(7FFC165CBD50,08): no OPENSSL_Applink'

Greetings,

Andres Freund

Re: Transparent column encryption

2022-08-31 Thread Jacob Champion

On Tue, Aug 30, 2022 at 4:53 AM Peter Eisentraut
 wrote:
> I would be interested in learning more about such padding systems.  I
> have done a lot of reading for this development project, and I have
> never come across a cryptographic approach to hide length differences by
> padding.  Of course, padding to the block cipher's block size is already
> part of the process, but that is done out of necessity, not because you
> want to disguise the length.  Are there any other methods?  I'm
> interested to learn more.

TLS 1.3 has one example. Here is a description from GnuTLS:
https://gnutls.org/manual/html_node/On-Record-Padding.html (Note the
option to turn on constant-time padding; that may not be a good
tradeoff for us if we're focusing on offline attacks.)

Here's a recent paper that claims to formally characterize length
hiding, but it's behind a wall and I haven't read it:
https://dl.acm.org/doi/abs/10.1145/3460120.3484590

I'll try to find more when I get the chance.

--Jacob

Re: Transparent column encryption

2022-08-30 Thread Peter Eisentraut


On 27.07.22 01:19, Jacob Champion wrote:

Now, if we don't have a padding system
built into the feature, then that does put even more on the user; it's
hard to argue with that.

Right. If they can even fix it at all. Having a well-documented padding
feature would not only help mitigate that, it would conveniently hang a
big sign on the caveats that exist.


I would be interested in learning more about such padding systems.  I 
have done a lot of reading for this development project, and I have 
never come across a cryptographic approach to hide length differences by 
padding.  Of course, padding to the block cipher's block size is already 
part of the process, but that is done out of necessity, not because you 
want to disguise the length.  Are there any other methods?  I'm 
interested to learn more.

Re: Transparent column encryption

2022-08-30 Thread Peter Eisentraut


On 20.07.22 08:12, Masahiko Sawada wrote:

---
Regarding the documentation, I'd like to have a page that describes
the generic information of the transparent column encryption for users
such as what this feature actually does, what can be achieved by this
feature, CMK rotation, and its known limitations. The patch has
"Transparent Column Encryption" section in protocol.sgml but it seems
to be more internal information.


I have added more documentation in the v6 patch.


---
In datatype.sgml, it says "Thus, clients that don't support
transparent column encryption or have disabled it will see the
encrypted values as byte arrays." but I got an error rather than
encrypted values when I tried to connect to the server using by
clients that don't support the encryption:

postgres(1:6040)=# select * from tbl;
no CMK lookup found for realm ""


This has now been improved in v6.  The protocol changes need to be 
activated explicitly at connection time, so if you use a client that 
doesn't support it or activates it, you get the described behavior.



---
In single-user mode, the user cannot decrypt the encrypted value but
probably it's fine in practice.


Yes, there is nothing really to do about that.


---
Regarding the column master key rotation, would it be useful if we
provide a tool for that? For example, it takes old and new CMK as
input, re-encrypt all CEKs realted to the CMK, and registers them to
the server.


I imagine users using a variety of key management systems, so I don't 
see how a single tool would work.  But it's something we can think about 
in the future.



---
Is there any convenient way to load a large amount of test data to the
encrypted columns? I tried to use generate_series() but it seems not
to work as it generates the data on the server side:


No, that doesn't work, by design.  You'd have to write a client program 
to generate the data.



I've also tried to load the data from a file on the client by using
\copy command, but it seems not to work:

postgres(1:80556)=# copy (select generate_series(1, 1000)::text) to
'/tmp/tmp.dat';
COPY 1000
postgres(1:80556)=# \copy a from '/tmp/tmp.dat'
COPY 1000
postgres(1:80556)=# select * from a;
out out memory


This was a bug that I have fixed.


---
I got SEGV in the following two situations:


I have fixed these.

Re: Transparent column encryption

2022-07-26 Thread Jacob Champion

On 7/26/22 13:25, Robert Haas wrote:
> I certainly have no objection to being clear about such details in the
> documentation.

Cool.

> I fear the phenomenon where
> doing anything about a problem makes you responsible for the whole
> problem. If we disclaim the ability to hide the length of values,
> that's clear enough.

I don't think disclaiming responsibility absolves you of it here, in
part because choices are being made (text format) that make length
hiding even more important than it otherwise would be. A user who
already knows that encryption doesn't hide length might still reasonably
expect a fixed-length column type like bigint to be protected in all
cases. It won't be (at least not with your 16-byte example).

And sure, you can document that caveat too, but said user might then
reasonably wonder how they're supposed to actually make it safe.

> But if we start padding to try to hide the length
> of values, then people might expect it to work in all cases, and I
> don't see how it ever can.

Well, that's where I agree with you on the value of solid documentation.
But there are other things we can do as well. In general we should

- choose a default that will protect most people out of the box,
- document the heck out of the default's limitations,
- provide guardrails that warn the user when they're outgrowing those
  limitations, and
- give people a way to tune it to their own use cases.

As an example, a naive guardrail in this instance could be to simply
have the client refuse to encrypt data past the padding maximum, if
you've gone so far as to set one up. It'd suck to hit that maximum in
production and have to rewrite the column, but you did want your
encryption to hide your data, right?

Maybe that's way too complex to think about for a v1, but it'll be
easier to maintain this into the future if there's at least a plan to
create a v2. If you declare it out of scope, instead of considering it a
potential TODO, then I think it'll be a lot harder for people to improve it.

> Moreover, I think that the padding might
> need to be done in a "cryptographically intelligent" way rather than
> just, say, adding trailing blanks.

Possibly. I think that's where AEAD comes in -- if you've authenticated
your ciphertext sufficiently, padding oracles should be prohibitively
difficult(?). (But see below; I think we also have other things to worry
about in terms of authentication and oracles.)

> Now that being said, if Peter wants
> to implement something around padding that he has reason to believe
> will not create cryptographic weaknesses, I have no issue with that. I
> just don't view it as an essential part of the feature, because hiding
> such things doesn't seem like it can ever be the main point of a
> feature like this.

I think that side channel consideration has to be an essential part of
any cryptography feature. Recent history has shown "obscure" side
channels gaining power to the point of completely breaking crypto schemes.

And it's not like TLS where we have to protect an infinite stream of
arbitrary bytes; this is going to be used on small values that probably
get repeated often and have (comparatively) very little entropy.
Cryptanalysis based on length seems to me like part and parcel of the
problem space.

> I guess my view on this is that, if you're trying to hide something
> like a credit card number, most likely every value in the system is
> the same length, and then this is a non-issue.

Agreed.

> On the other hand, if
> the secret column is a person's name, then there is an issue, but
> you're not going to pad every value out the maximum length of a
> varlena, so you have to make an estimate of how long a name someone
> might reasonably have to decide how much padding to include. You also
> have to decide whether the storage cost of padding every value is
> worth it to you given the potential information leakage. Only the
> human user can make those decisions, so some amount of "putting that
> on the user" feels inevitable.

Agreed.

> Now, if we don't have a padding system
> built into the feature, then that does put even more on the user; it's
> hard to argue with that.
Right. If they can even fix it at all. Having a well-documented padding
feature would not only help mitigate that, it would conveniently hang a
big sign on the caveats that exist.

--

Speaking of oracles and side channels. Users may want to use associated
data to further lock an encrypted value to its column type, too.
Otherwise it seems like an active DBA could feed an encrypted text blob
to a client in place of, say, an int column, to see whether or not that
text blob is a number. Seems like AD is going to be important to prevent
active attacks in general.

--Jacob

Re: Transparent column encryption

2022-07-26 Thread Robert Haas

On Tue, Jul 26, 2022 at 2:27 PM Jacob Champion  wrote:
> Right. My point is, if you have a column that has exactly one
> important value that is 17 bytes long when converted to text, you're
> going to want to know that block size exactly, because the encryption
> will be effectively useless for that value. That size needs to be
> documented, and it'd be helpful to know that it's longer than, say,
> the longest text representation of our fixed-length column types.

I certainly have no objection to being clear about such details in the
documentation.

> If the goal is to provide real encryption, and not just a toy, I think
> you're going to need to put a *lot* of effort into analysis. Even if
> the result of the analysis is "we don't plan to address this in v1".
>
> Crypto is inherently a cycle of
> make-it-and-break-it-and-fix-it-and-break-it-again. If that's
> considered a "wild goose chase" and not seriously pursued at some
> level, then this implementation will probably not last long in the
> face of real abuse. (That doesn't mean you have to take my advice; I'm
> just a dude with opinions -- but you will need to have real
> cryptographers look at this, and you're going to need to think about
> how the system will evolve when it's broken.)

Well, I'm just a dude with opinions, too. I fear the phenomenon where
doing anything about a problem makes you responsible for the whole
problem. If we disclaim the ability to hide the length of values,
that's clear enough. But if we start padding to try to hide the length
of values, then people might expect it to work in all cases, and I
don't see how it ever can. Moreover, I think that the padding might
need to be done in a "cryptographically intelligent" way rather than
just, say, adding trailing blanks. Now that being said, if Peter wants
to implement something around padding that he has reason to believe
will not create cryptographic weaknesses, I have no issue with that. I
just don't view it as an essential part of the feature, because hiding
such things doesn't seem like it can ever be the main point of a
feature like this.

> > Padding values to try to further obscure
> > things may be situationally useful, but if you find yourself worrying
> > too much about that sort of thing, you likely should have picked
> > stronger medicine initially.
>
> In my experience, this entire field is the application of
> situationally useful protection. That's one of the reasons it's hard,
> and designing this sort of patch is going to be hard too.

Agreed.

> Putting that
> on the user isn't quite fair when you're the ones designing the
> system; you determine what they have to worry about when you choose
> the crypto.

I guess my view on this is that, if you're trying to hide something
like a credit card number, most likely every value in the system is
the same length, and then this is a non-issue. On the other hand, if
the secret column is a person's name, then there is an issue, but
you're not going to pad every value out the maximum length of a
varlena, so you have to make an estimate of how long a name someone
might reasonably have to decide how much padding to include. You also
have to decide whether the storage cost of padding every value is
worth it to you given the potential information leakage. Only the
human user can make those decisions, so some amount of "putting that
on the user" feels inevitable. Now, if we don't have a padding system
built into the feature, then that does put even more on the user; it's
hard to argue with that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Transparent column encryption

2022-07-26 Thread Jacob Champion

On Tue, Jul 26, 2022 at 10:52 AM Robert Haas  wrote:
> On Thu, Jul 21, 2022 at 2:30 PM Jacob Champion  
> wrote:
> > A minimum padding option would fix the leak here, right? If every
> > entry is the same length then there's no information to be gained, at
> > least in an offline analysis.
>
> Sure, but padding every text column that you have, even the ones
> containing only 'a', out to the length of Don Quixote in the original
> Spanish, is unlikely to be an appealing option.

If you are honestly trying to conceal Don Quixote, I suspect you are
already in the business of making unappealing decisions. I don't think
that's necessarily an argument against hiding the length for
real-world use cases.

> > I think some work around that is probably going to be needed for
> > serious use of this encryption, in part because of the use of text
> > format as the canonical input. If the encrypted values of 1, 10, 100,
> > and 1000 hypothetically leaked their exact lengths, then an encrypted
> > int wouldn't be very useful. So I'd want to quantify (and possibly
> > configure) exactly how much data you can encrypt in a single message
> > before the length starts being leaked, and then make sure that my
> > encrypted values stay inside that bound.
>
> I think most ciphers these days are block ciphers, so you're going to
> get output that is a multiple of the block size anyway - e.g. I think
> for AES it's 128 bits = 16 bytes. So small differences in length will
> be concealed naturally, which may be good enough for some use cases.

Right. My point is, if you have a column that has exactly one
important value that is 17 bytes long when converted to text, you're
going to want to know that block size exactly, because the encryption
will be effectively useless for that value. That size needs to be
documented, and it'd be helpful to know that it's longer than, say,
the longest text representation of our fixed-length column types.

> I'm not really convinced that it's worth putting a lot of effort into
> bolstering the security of this kind of tech above what it naturally
> gives. I think it's likely to be a wild goose chase.

If the goal is to provide real encryption, and not just a toy, I think
you're going to need to put a *lot* of effort into analysis. Even if
the result of the analysis is "we don't plan to address this in v1".

Crypto is inherently a cycle of
make-it-and-break-it-and-fix-it-and-break-it-again. If that's
considered a "wild goose chase" and not seriously pursued at some
level, then this implementation will probably not last long in the
face of real abuse. (That doesn't mean you have to take my advice; I'm
just a dude with opinions -- but you will need to have real
cryptographers look at this, and you're going to need to think about
how the system will evolve when it's broken.)

> If you have major
> worries about someone reading your disk in its entirety, use full-disk
> encryption.

This patchset was designed to protect against the evil DBA case, I
think. Full disk encryption doesn't help.

> Selective encryption is only suitable when you want to add
> a modest level of protection for individual value and are willing to
> accept that some information leakage is likely if an adversary can in
> fact read the full disk.

...but there's a known countermeasure to this particular leakage,
right? Which would make it more suitable for that case.

> Padding values to try to further obscure
> things may be situationally useful, but if you find yourself worrying
> too much about that sort of thing, you likely should have picked
> stronger medicine initially.

In my experience, this entire field is the application of
situationally useful protection. That's one of the reasons it's hard,
and designing this sort of patch is going to be hard too. Putting that
on the user isn't quite fair when you're the ones designing the
system; you determine what they have to worry about when you choose
the crypto.

--Jacob

Re: Transparent column encryption

2022-07-26 Thread Robert Haas

On Thu, Jul 21, 2022 at 2:30 PM Jacob Champion  wrote:
> On Mon, Jul 18, 2022 at 9:07 AM Robert Haas  wrote:
> > Even there, what can be accomplished with a feature that only encrypts
> > individual column values is by nature somewhat limited. If you have a
> > text column that, for one row, stores the value 'a', and for some
> > other row, stores the entire text of Don Quixote in the original
> > Spanish, it is going to be really difficult to keep an adversary who
> > can read from the disk from distinguishing those rows. If you want to
> > fix that, you're going to need to do block-level encryption or
> > something of that sort.
>
> A minimum padding option would fix the leak here, right? If every
> entry is the same length then there's no information to be gained, at
> least in an offline analysis.

Sure, but padding every text column that you have, even the ones
containing only 'a', out to the length of Don Quixote in the original
Spanish, is unlikely to be an appealing option.

> I think some work around that is probably going to be needed for
> serious use of this encryption, in part because of the use of text
> format as the canonical input. If the encrypted values of 1, 10, 100,
> and 1000 hypothetically leaked their exact lengths, then an encrypted
> int wouldn't be very useful. So I'd want to quantify (and possibly
> configure) exactly how much data you can encrypt in a single message
> before the length starts being leaked, and then make sure that my
> encrypted values stay inside that bound.

I think most ciphers these days are block ciphers, so you're going to
get output that is a multiple of the block size anyway - e.g. I think
for AES it's 128 bits = 16 bytes. So small differences in length will
be concealed naturally, which may be good enough for some use cases.

I'm not really convinced that it's worth putting a lot of effort into
bolstering the security of this kind of tech above what it naturally
gives. I think it's likely to be a wild goose chase. If you have major
worries about someone reading your disk in its entirety, use full-disk
encryption. Selective encryption is only suitable when you want to add
a modest level of protection for individual value and are willing to
accept that some information leakage is likely if an adversary can in
fact read the full disk. Padding values to try to further obscure
things may be situationally useful, but if you find yourself worrying
too much about that sort of thing, you likely should have picked
stronger medicine initially.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Transparent column encryption

2022-07-21 Thread Bruce Momjian

On Mon, Jul 18, 2022 at 12:53:23PM +0200, Peter Eisentraut wrote:
> Asymmetric keys gives you some more options for how you set up the keys at
> the beginning.  For example, you create the asymmetric key pair on the host
> where your client program that wants access to the encrypted data will run.
> You put the private key in an appropriate location for run time.  You send
> the public key to another host.  On that other host, you create the CEK,
> encrypt it with the CMK, and then upload it into the server (CREATE COLUMN
> ENCRYPTION KEY).  Then you can wipe that second host.  That way, you can be
> even more sure that the unencrypted CEK isn't left anywhere.  I'm not sure
> whether this method is very useful in practice, but it's interesting.
> 
> In any case, as I mentioned above, this particular aspect is up for
> discussion.

I caution against adding complexity without a good reason, because
historically complexity often leads to exploits and bugs, especially
with crypto.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson

Re: Transparent column encryption

2022-07-21 Thread Jacob Champion

On Mon, Jul 18, 2022 at 9:07 AM Robert Haas  wrote:
> Even there, what can be accomplished with a feature that only encrypts
> individual column values is by nature somewhat limited. If you have a
> text column that, for one row, stores the value 'a', and for some
> other row, stores the entire text of Don Quixote in the original
> Spanish, it is going to be really difficult to keep an adversary who
> can read from the disk from distinguishing those rows. If you want to
> fix that, you're going to need to do block-level encryption or
> something of that sort.

A minimum padding option would fix the leak here, right? If every
entry is the same length then there's no information to be gained, at
least in an offline analysis.

I think some work around that is probably going to be needed for
serious use of this encryption, in part because of the use of text
format as the canonical input. If the encrypted values of 1, 10, 100,
and 1000 hypothetically leaked their exact lengths, then an encrypted
int wouldn't be very useful. So I'd want to quantify (and possibly
configure) exactly how much data you can encrypt in a single message
before the length starts being leaked, and then make sure that my
encrypted values stay inside that bound.

--Jacob

Re: Transparent column encryption

2022-07-21 Thread Jacob Champion

On Mon, Jul 18, 2022 at 3:53 AM Peter Eisentraut
 wrote:
> Some other products make use of secure enclaves to do computations on
> (otherwise) encrypted values on the server.  I don't fully know how that
> works, but I suspect that asymmetric keys can play a role in that.  (I
> don't have any immediate plans for that in my patch.  It seems to be a
> dying technology at the moment.)
>
> Asymmetric keys gives you some more options for how you set up the keys
> at the beginning.  For example, you create the asymmetric key pair on
> the host where your client program that wants access to the encrypted
> data will run.  You put the private key in an appropriate location for
> run time.  You send the public key to another host.  On that other host,
> you create the CEK, encrypt it with the CMK, and then upload it into the
> server (CREATE COLUMN ENCRYPTION KEY).  Then you can wipe that second
> host.  That way, you can be even more sure that the unencrypted CEK
> isn't left anywhere.  I'm not sure whether this method is very useful in
> practice, but it's interesting.

As long as it's clear to people trying this that the "public" key
cannot actually be made public, I suppose. That needs to be documented
IMO. I like your idea of providing a symmetric option as well.

> In any case, as I mentioned above, this particular aspect is up for
> discussion.
>
> Also note that if you use a KMS (cmklookup "run" method), the actual
> algorithm doesn't even matter (depending on details of the KMS setup),
> since you just tell the KMS "decrypt this", and the KMS knows by itself
> what algorithm to use.  Maybe there should be a way to specify "unknown"
> in the ckdcmkalg field.

+1, an officially client-defined method would probably be useful.

> The short answer is, these same algorithms are used in equivalent
> products (see MS SQL Server, MongoDB).  They even reference the same
> exact draft document.
>
> Besides that, here is my analysis for why these are good choices: You
> can't use any of the counter modes, because since the encryption happens
> on the client, there is no way to coordinate to avoid nonce reuse.  So
> among mainstream modes, you are basically left with AES-CBC with a
> random IV.  In that case, even if you happen to reuse an IV, the
> possible damage is very contained.)

I think both AES-GCM-SIV and XChaCha20-Poly1305 are designed to handle
the nonce problem as well. In any case, if I were deploying this, I'd
want to know the characteristics/limits of our chosen suites (e.g. how
much data can be encrypted per key) so that I could plan rotations
correctly. Something like the table in [1]?

> > Since we're requiring "canonical" use of text format, and the docs say
> > there are no embedded or trailing nulls allowed in text values, could we
> > steal the use of a single zero byte to mean NULL? One additional
> > complication would be that the client would have to double-check that
> > we're not writing a NULL into a NOT NULL column, and complain if it
> > reads one during decryption. Another complication would be that the
> > client would need to complain if it got a plaintext NULL.
>
> You're already alluding to some of the complications.  Also consider
> that null values could arise from, say, outer joins.  So you could be in
> a situation where encrypted and unencrypted null values coexist.

(I realize I'm about to wade into the pool of what NULL means in SQL,
the subject of which I've stayed mostly, gleefully, ignorant.)

To be honest that sounds pretty useful. Any unencrypted null must have
come from the server computation; it's a clear claim by the server
that no such rows exist. (If the encrypted column is itself NOT NULL
then there's no ambiguity to begin with, I think.) That wouldn't be
transparent behavior anymore, so it may (understandably) be a non-goal
for the patch, but it really does sound useful.

And it might be a little extreme, but if I as a user decided that I
wanted in-band encrypted null, it wouldn't be particularly surprising
to me if such a column couldn't be included in an outer join. Just
like I can't join on a randomized encrypted column, or add two
encrypted NUMERICs to each other. In fact I might even want the server
to enforce NOT NULL transparently on the underlying pg_encrypted_*
column, to make sure that I didn't accidentally push an unencrypted
NULL by mistake.

> And of
> course the server doesn't know about the encrypted null values.  So how
> do you maintain semantics, like for aggregate functions, primary keys,
> anything that treats null values specially?

Could you elaborate? Any special cases seem like they'd be important
to document regardless of whether or not we support in-band null
encryption. For example, do you plan to support encrypted primary
keys, null or not? That seems like it'd be particularly difficult
during CEK rotation.

> How do clients deal with a
> mix of encrypted and unencrypted null values, how do they know which one
> is real.

That one seems s

Re: Transparent column encryption

2022-07-19 Thread Masahiko Sawada

On Tue, Jul 19, 2022 at 10:52 PM Peter Eisentraut
 wrote:
>
> On 12.07.22 20:29, Peter Eisentraut wrote:
> > Updated patch, to resolve some merge conflicts.
>
> Rebased patch, no new functionality

Thank you for working on this and updating the patch!

I've mainly looked at the documentation and tests and done some tests.
Before looking at the code in depth, I'd like to share my
comments/questions.

---
Regarding the documentation, I'd like to have a page that describes
the generic information of the transparent column encryption for users
such as what this feature actually does, what can be achieved by this
feature, CMK rotation, and its known limitations. The patch has
"Transparent Column Encryption" section in protocol.sgml but it seems
to be more internal information.

---
In datatype.sgml, it says "Thus, clients that don't support
transparent column encryption or have disabled it will see the
encrypted values as byte arrays." but I got an error rather than
encrypted values when I tried to connect to the server using by
clients that don't support the encryption:

postgres(1:6040)=# select * from tbl;
no CMK lookup found for realm ""

no CMK lookup found for realm ""
postgres(1:6040)=#

---
In single-user mode, the user cannot decrypt the encrypted value but
probably it's fine in practice.

---
Regarding the column master key rotation, would it be useful if we
provide a tool for that? For example, it takes old and new CMK as
input, re-encrypt all CEKs realted to the CMK, and registers them to
the server.

---
Is there any convenient way to load a large amount of test data to the
encrypted columns? I tried to use generate_series() but it seems not
to work as it generates the data on the server side:

postgres(1:80556)=# create table a (i text encrypted with
(column_encryption_key = cek1));
CREATE TABLE
postgres(1:80556)=# insert into a select i::text from
generate_series(1, 1000) i;
2022-07-20 15:06:38.502 JST [80556] ERROR:  column "i" is of type
pg_encrypted_rnd but expression is of type text at character 22

I've also tried to load the data from a file on the client by using
\copy command, but it seems not to work:

postgres(1:80556)=# copy (select generate_series(1, 1000)::text) to
'/tmp/tmp.dat';
COPY 1000
postgres(1:80556)=# \copy a from '/tmp/tmp.dat'
COPY 1000
postgres(1:80556)=# select * from a;
out out memory

---
I got SEGV in the following two situations:

(1) SEGV by backend
postgres(1:59931)=# create table tbl (i int encrypted with
(column_encryption_key = cek1));
CREATE TABLE
postgres(1:59931)=# insert into tbl values ($1) \gencr 1
INSERT 0 1
postgres(1:59931)=# select * from tbl;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

The backtrace is:

(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x000106830a30
postgres`pg_detoast_datum_packed(datum=0xab32c563) at
fmgr.c:1742:6
frame #1: 0x0001067d9dbf
postgres`byteaout(fcinfo=0x7ffee9c311a8) at varlena.c:392:20
frame #2: 0x00010682ed0c
postgres`FunctionCall1Coll(flinfo=0x7faeb28193a8, collation=0,
arg1=18446744072286815587) at fmgr.c:1124:11
frame #3: 0x000106830611
postgres`OutputFunctionCall(flinfo=0x7faeb28193a8,
val=18446744072286815587) at fmgr.c:1561:9
frame #4: 0x000105fed702
postgres`printtup(slot=0x7faeb2818960, self=0x7faeb3809390) at
printtup.c:519:16
frame #5: 0x000106319318
postgres`ExecutePlan(estate=0x7faeb2818520,
planstate=0x7faeb2818758, use_parallel_mode=false,
operation=CMD_SELECT, sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x7faeb3809390,
execute_once=true) at execMain.c:1667:9
frame #6: 0x000106319180
postgres`standard_ExecutorRun(queryDesc=0x7faeb280d920,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:363:3
frame #7: 0x000106318f11
postgres`ExecutorRun(queryDesc=0x7faeb280d920,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:307:3
frame #8: 0x00010661139c
postgres`PortalRunSelect(portal=0x7faeb5028920, forward=true,
count=0, dest=0x7faeb3809390) at pquery.c:924:4
frame #9: 0x000106610d3f
postgres`PortalRun(portal=0x7faeb5028920,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x7faeb3809390, altdest=0x7faeb3809390,
qc=0x7ffee9c31620) at pquery.c:768:18
frame #10: 0x00010660bb93
postgres`exec_simple_query(query_string="select * from tbl;") at
postgres.c:1246:10
frame #11: 0x00010660ac2f
postgres`PostgresMain(dbname="postgres", username="masahiko") at
postgres.c:4534:7
frame #12: 0x00010650d9c6
postgres`BackendRun(port=0x7faeb3004210) at postmaster.c:4490:2
frame #13: 0x00010650cf8a
postgres`BackendStartup(port=0x7faeb3004210) at
postmaster.c:4218:3
frame #14: 0x00010650bd57 postgres`ServerLoop at postmaster.c:1808

Re: Transparent column encryption

2022-07-18 Thread Robert Haas

On Mon, Jul 18, 2022 at 6:53 AM Peter Eisentraut
 wrote:
> I think a way to look at this is that this column encryption feature
> isn't suitable for disguising the existence or absence of data, it can
> only disguise the particular data that you know exists.

+1.

Even there, what can be accomplished with a feature that only encrypts
individual column values is by nature somewhat limited. If you have a
text column that, for one row, stores the value 'a', and for some
other row, stores the entire text of Don Quixote in the original
Spanish, it is going to be really difficult to keep an adversary who
can read from the disk from distinguishing those rows. If you want to
fix that, you're going to need to do block-level encryption or
something of that sort. And even then, you still have another version
of the problem, because now imagine you have one *table* that contains
nothing but the value 'a' and another that contains nothing but the
entire text of Don Quixote, it is going to be possible for an
adversary to tell those tables apart, even with the corresponding
files encrypted in their entirety.

But I don't think that this means that a feature like this has no
value. I think it just means that we need to clearly document how the
feature works and not over-promise.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Transparent column encryption

2022-07-18 Thread Peter Eisentraut


On 15.07.22 19:47, Jacob Champion wrote:

 The CEK key
 material is in turn encrypted by an assymmetric key called the column
 master key (CMK).


I'm not yet understanding why the CMK is asymmetric.


I'm not totally sure either.  I started to build it that way because 
other systems were doing it that way, too.  But I have been thinking 
about adding a symmetric alternative for the CMKs as well (probably AESKW).


I think there are a couple of reasons why asymmetric keys are possibly 
useful for CMKs:


Some other products make use of secure enclaves to do computations on 
(otherwise) encrypted values on the server.  I don't fully know how that 
works, but I suspect that asymmetric keys can play a role in that.  (I 
don't have any immediate plans for that in my patch.  It seems to be a 
dying technology at the moment.)


Asymmetric keys gives you some more options for how you set up the keys 
at the beginning.  For example, you create the asymmetric key pair on 
the host where your client program that wants access to the encrypted 
data will run.  You put the private key in an appropriate location for 
run time.  You send the public key to another host.  On that other host, 
you create the CEK, encrypt it with the CMK, and then upload it into the 
server (CREATE COLUMN ENCRYPTION KEY).  Then you can wipe that second 
host.  That way, you can be even more sure that the unencrypted CEK 
isn't left anywhere.  I'm not sure whether this method is very useful in 
practice, but it's interesting.


In any case, as I mentioned above, this particular aspect is up for 
discussion.


Also note that if you use a KMS (cmklookup "run" method), the actual 
algorithm doesn't even matter (depending on details of the KMS setup), 
since you just tell the KMS "decrypt this", and the KMS knows by itself 
what algorithm to use.  Maybe there should be a way to specify "unknown" 
in the ckdcmkalg field.



+#define PG_CEK_AEAD_AES_128_CBC_HMAC_SHA_256   130
+#define PG_CEK_AEAD_AES_192_CBC_HMAC_SHA_384   131
+#define PG_CEK_AEAD_AES_256_CBC_HMAC_SHA_384   132
+#define PG_CEK_AEAD_AES_256_CBC_HMAC_SHA_512   133


It looks like these ciphersuites were abandoned by the IETF. Are there
existing implementations of them that have been audited/analyzed? Are
they safe (and do we know that the claims made in the draft are
correct)? How do they compare to other constructions like AES-GCM-SIV
and XChacha20-Poly1305?


The short answer is, these same algorithms are used in equivalent 
products (see MS SQL Server, MongoDB).  They even reference the same 
exact draft document.


Besides that, here is my analysis for why these are good choices: You 
can't use any of the counter modes, because since the encryption happens 
on the client, there is no way to coordinate to avoid nonce reuse.  So 
among mainstream modes, you are basically left with AES-CBC with a 
random IV.  In that case, even if you happen to reuse an IV, the 
possible damage is very contained.


And then, if you want to use AEAD, you combine that with some MAC, and 
HMAC is just as good as any for that.


The referenced draft document doesn't really contain any additional 
cryptographic insights, it's just a guide on a particular way to put 
these two together.


So altogether I think this is a pretty solid choice.


+-- \gencr
+-- (This just tests the parameter passing; there is no encryption here.)
+CREATE TABLE test_gencr (a int, b text);
+INSERT INTO test_gencr VALUES (1, 'one') \gencr
+SELECT * FROM test_gencr WHERE a = 1 \gencr
+ a |  b
+---+-
+ 1 | one
+(1 row)
+
+INSERT INTO test_gencr VALUES ($1, $2) \gencr 2 'two'
+SELECT * FROM test_gencr WHERE a IN ($1, $2) \gencr 2 3
+ a |  b
+---+-
+ 2 | two
+(1 row)

I'd expect \gencr to error out without sending plaintext. I know that
under the hood this is just setting up a prepared statement, but if I'm
using \gencr, presumably I really do want to be encrypting my data.
Would it be a problem to always set force-column-encryption for the
parameters we're given here? Any unencrypted columns could be provided
directly.


Yeah, this needs a bit of refinement.  You don't want something named 
"encr" but it only encrypts some of the time.  We could possibly do what 
you suggest and make it set the force-encryption flag, or maybe rename 
it or add another command that just uses prepared statements and doesn't 
promise anything about encryption from its name.


This also ties in with how pg_dump will eventually work.  I think by 
default pg_dump will just dump things encrypted and set it up so that 
COPY writes it back encrypted.  But there should probably be a mode that 
dumps out plaintext and then uses one of these commands to load the 
plaintext back in.  What these psql commands need to do also depends on 
what pg_dump needs them to do.



+  
+   Null values are not encrypted by transparent column encryption; null values
+   sent by the client are visible as null values in the database.  If the fact
+   that

Re: Transparent column encryption

2022-07-15 Thread Jacob Champion

On 7/12/22 11:29, Peter Eisentraut wrote:
> 
> Updated patch, to resolve some merge conflicts.

Thank you for working on this; it's an exciting feature.

> The CEK key
> material is in turn encrypted by an assymmetric key called the column
> master key (CMK).

I'm not yet understanding why the CMK is asymmetric. Maybe you could use
the public key to add ephemeral, single-use encryption keys that no one
but the private key holder could use (after you forget them on your
side, that is). But since the entire column is encrypted with a single
CEK, you would essentially only be able to do that if you created an
entirely new column or table; do I have that right?

I'm used to public keys being safe for... publication, but if I'm
understanding correctly, it's important that the server admin doesn't
get hold of the public key for your CMK, because then they could
substitute their own CEKs transparently and undermine future encrypted
writes. That seems surprising. Am I just missing something important
about RSAES-OAEP?

> +#define PG_CEK_AEAD_AES_128_CBC_HMAC_SHA_256   130
> +#define PG_CEK_AEAD_AES_192_CBC_HMAC_SHA_384   131
> +#define PG_CEK_AEAD_AES_256_CBC_HMAC_SHA_384   132
> +#define PG_CEK_AEAD_AES_256_CBC_HMAC_SHA_512   133

It looks like these ciphersuites were abandoned by the IETF. Are there
existing implementations of them that have been audited/analyzed? Are
they safe (and do we know that the claims made in the draft are
correct)? How do they compare to other constructions like AES-GCM-SIV
and XChacha20-Poly1305?

> +-- \gencr
> +-- (This just tests the parameter passing; there is no encryption here.)
> +CREATE TABLE test_gencr (a int, b text);
> +INSERT INTO test_gencr VALUES (1, 'one') \gencr
> +SELECT * FROM test_gencr WHERE a = 1 \gencr
> + a |  b
> +---+-
> + 1 | one
> +(1 row)
> +
> +INSERT INTO test_gencr VALUES ($1, $2) \gencr 2 'two'
> +SELECT * FROM test_gencr WHERE a IN ($1, $2) \gencr 2 3
> + a |  b
> +---+-
> + 2 | two
> +(1 row)
I'd expect \gencr to error out without sending plaintext. I know that
under the hood this is just setting up a prepared statement, but if I'm
using \gencr, presumably I really do want to be encrypting my data.
Would it be a problem to always set force-column-encryption for the
parameters we're given here? Any unencrypted columns could be provided
directly.

Another idle thought I had was that it'd be nice to have some syntax for
providing a null value to \gencr (assuming I didn't overlook it in the
patch). But that brings me to...

> +  
> +   Null values are not encrypted by transparent column encryption; null 
> values
> +   sent by the client are visible as null values in the database.  If the 
> fact
> +   that a value is null needs to be hidden from the server, this information
> +   needs to be encoded into a nonnull value in the client somehow.
> +  

This is a major gap, IMO. Especially with the switch to authenticated
ciphers, because it means you can't sign your NULL values. And having
each client or user that's out there solve this with a magic in-band
value seems like a recipe for pain.

Since we're requiring "canonical" use of text format, and the docs say
there are no embedded or trailing nulls allowed in text values, could we
steal the use of a single zero byte to mean NULL? One additional
complication would be that the client would have to double-check that
we're not writing a NULL into a NOT NULL column, and complain if it
reads one during decryption. Another complication would be that the
client would need to complain if it got a plaintext NULL.

(The need for robust client-side validation of encrypted columns might
be something to expand on in the docs more generally, since before this
feature, it could probably be assumed that the server was buggy if it
sent you unparsable junk in a column.)

> +   
> +The associated data in these algorithms consists of 4
> +bytes: The ASCII letters P and G
> +(byte values 80 and 71), followed by the algorithm ID as a 16-bit 
> unsigned
> +integer in network byte order.
> +   

Is this AD intended as a placeholder for the future, or does it serve a
particular purpose?

Thanks,
--Jacob

Re: Transparent column encryption

2021-12-17 Thread Peter Eisentraut


On 17.12.21 01:41, Jacob Champion wrote:

(And I think the client should be able to enforce encryption in
the first place, before I distract you too much with other stuff.)


Yes, this is a useful point that I have added to my notes.

Re: Transparent column encryption

2021-12-16 Thread Jacob Champion

On Thu, 2021-12-09 at 11:04 +0100, Tomas Vondra wrote:
> On 12/9/21 01:12, Jacob Champion wrote:
> > 
> > The rabbit hole I led you down is one where we use the rest of the row
> > as AD, to try to freeze pieces of it in place. That might(?) have some
> > useful security properties (if the client defines its use and doesn't
> > defer to the server). But it's not what I intended to propose and I'd
> > have to think about that case some more.

So after thinking about it some more, in the case where the client is
relying on the server to return both the encrypted data and its
associated data -- and you don't trust the server -- then tying even
the entire row together doesn't help you.

I was briefly led astray by the idea that you could include a unique or
primary key column in the associated data, and then SELECT based on
that column -- but a motivated DBA could simply corrupt state so that
the row they wanted got returned regardless of the query. So the client
still has to have prior knowledge.

> > In my credit card example, I'm imagining something like (forgive the
> > contrived syntax):
> > 
> > SELECT address, :{aead(users.credit_card, 'u...@example.com')}
> >   FROM users WHERE email = 'u...@example.com';
> > 
> > UPDATE users
> >SET :{aead(users.credit_card, 'u...@example.com')} = '1234-...'
> >  WHERE email = 'u...@example.com';
> > 
> > The client explicitly links a table's column to its AD for the duration
> > of the query. This approach can't scale to
> > 
> > SELECT credit_card FROM users;
> > 
> > because in this case the AD for each row is different, but I'd argue
> > that's ideal for this particular case. The client doesn't need to (and
> > probably shouldn't) grab everyone's credit card details all at once, so
> > there's no reason to optimize for it.
> 
> Maybe, but it seems like a rather annoying limitation, as it restricts
> the client to single-row queries (or at least it looks like that to me).
> Yes, it may be fine for some use cases, but I'd bet a DBA who can modify
> data can do plenty other things - swapping "old" values, which will have
> the right AD, for example.

Resurrecting old data doesn't help the DBA read the values, right? I
view that as similar to the "increasing account balance" problem, in
that it's definitely a problem but not one we're trying to tackle here.

(And I'm not familiar with any solutions for resurrections -- other
than having data expire and tying the timestamp into the
authentication, which I think again requires AD. Revoking signed data
is one of those hard problems. Do you know a better way?)

> OK. In any case, I think we shouldn't require this capability from the
> get go - it's fine to get the simple version done first, which gives us
> privacy / protects against passive attacker. And then sometime in the
> future improve this further.

Agreed. (And I think the client should be able to enforce encryption in
the first place, before I distract you too much with other stuff.)

--Jacob

Re: Transparent column encryption

2021-12-16 Thread Peter Eisentraut


On 16.12.21 05:47, Greg Stark wrote:

In the server, the encrypted datums are stored in types called
encryptedr and encryptedd (for randomized and deterministic
encryption).  These are essentially cousins of bytea.


Does that mean someone could go in with psql and select out the data
without any keys and just get a raw bytea-like representation? That
seems like a natural and useful thing to be able to do. For example to
allow dumping a table and loading it elsewhere and transferring keys
through some other channel (perhaps only as needed).


Yes to all of that.

Re: Transparent column encryption

2021-12-15 Thread Greg Stark

> In the server, the encrypted datums are stored in types called
> encryptedr and encryptedd (for randomized and deterministic
> encryption).  These are essentially cousins of bytea.

Does that mean someone could go in with psql and select out the data
without any keys and just get a raw bytea-like representation? That
seems like a natural and useful thing to be able to do. For example to
allow dumping a table and loading it elsewhere and transferring keys
through some other channel (perhaps only as needed).

Re: Transparent column encryption

2021-12-09 Thread Tomas Vondra




On 12/9/21 01:12, Jacob Champion wrote:
> On Wed, 2021-12-08 at 02:58 +0100, Tomas Vondra wrote:
>>
>> On 12/8/21 00:26, Jacob Champion wrote:
>>> On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote:
 IMO it's impossible to solve this attack within TCE, because it requires 
 ensuring consistency at the row level, but TCE obviously works at column 
 level only.
>>>
>>> I was under the impression that clients already had to be modified to
>>> figure out how to encrypt the data? If part of that process ends up
>>> including enforcement of encryption for a specific column set, then the
>>> addition of AEAD data could hypothetically be part of that hand-
>>> waviness.
>>
>> I think "transparency" here means the client just uses the regular
>> prepared-statement API without having to explicitly encrypt/decrypt any
>> data. The problem is we can't easily tie this to other columns in the
>> table, because the client may not even know what values are in those
>> columns.
> 
> The way I originally described my request -- "I'd like to be able to
> tie an encrypted value to other column (or external) data" -- was not
> very clear.
> 
> With my proposed model -- where the DBA (and the server) are completely
> untrusted, and the DBA needs to be prevented from using the encrypted
> value -- I don't think there's a useful way for the client to use
> associated data that comes from the server. The client has to know what
> the AD should be beforehand, because otherwise the DBA can make it so
> the server returns whatever is correct.
> 

True. With untrusted server the additional data would have to come from
some other source. Say, an isolated auth system or so.

>> Imagine you do this
>>
>>   UPDATE t SET encrypted_column = $1 WHERE another_column = $2;
>>
>> but you want to ensure the encrypted value belongs to a particular row
>> (which may or may not be identified by the another_column value). How
>> would the client do that? Should it fetch the value or what?
>>
>> Similarly, what if the client just does
>>
>>   SELECT encrypted_column FROM t;
>>
>> How would it verify the values belong to the row, without having all the
>> data for the row (or just the required columns)?
> 
> So with my (hopefully more clear) model above, it wouldn't. The client
> would already have the AD, and somehow tell libpq what that data was
> for the query.
> 
> The rabbit hole I led you down is one where we use the rest of the row
> as AD, to try to freeze pieces of it in place. That might(?) have some
> useful security properties (if the client defines its use and doesn't
> defer to the server). But it's not what I intended to propose and I'd
> have to think about that case some more.
> 

OK

> In my credit card example, I'm imagining something like (forgive the
> contrived syntax):
> 
> SELECT address, :{aead(users.credit_card, 'u...@example.com')}
>   FROM users WHERE email = 'u...@example.com';
> 
> UPDATE users
>SET :{aead(users.credit_card, 'u...@example.com')} = '1234-...'
>  WHERE email = 'u...@example.com';
> 
> The client explicitly links a table's column to its AD for the duration
> of the query. This approach can't scale to
> 
> SELECT credit_card FROM users;
> 
> because in this case the AD for each row is different, but I'd argue
> that's ideal for this particular case. The client doesn't need to (and
> probably shouldn't) grab everyone's credit card details all at once, so
> there's no reason to optimize for it.
> 

Maybe, but it seems like a rather annoying limitation, as it restricts
the client to single-row queries (or at least it looks like that to me).
Yes, it may be fine for some use cases, but I'd bet a DBA who can modify
data can do plenty other things - swapping "old" values, which will have
the right AD, for example.

>>> Unless "transparent" means that the client completely defers to the
>>> server on whether to encrypt or not, and silently goes along with it if
>>> the server tells it not to encrypt?
>> I think that's probably a valid concern - a "bad DBA" could alter the
>> table definition to not contain the "ENCRYPTED" bits, and then peek at
>> the plaintext values.
>>
>> But it's not clear to me how exactly would the AEAD prevent this?
>> Wouldn't that be also specified on the server, somehow? In which case
>> the DBA could just tweak that too, no?
>>
>> In other words, this issue seems mostly orthogonal to the AEAD, and the
>> right solution would be to allow the client to define which columns have
>> to be encrypted (in which case altering the server definition would not
>> be enough).
> 
> Right, exactly. When I mentioned AEAD I had assumed that "allow the
> client to define which columns have to be encrypted" was already
> planned or in the works; I just misunderstood pieces of Peter's email.
> It's that piece where a client would probably have to add details
> around AEAD and its use.
> 
>>> That would only protect against a
>>> _completely_ passive DBA, like som

Re: Transparent column encryption

2021-12-08 Thread Jacob Champion

On Wed, 2021-12-08 at 02:58 +0100, Tomas Vondra wrote:
> 
> On 12/8/21 00:26, Jacob Champion wrote:
> > On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote:
> > > IMO it's impossible to solve this attack within TCE, because it requires 
> > > ensuring consistency at the row level, but TCE obviously works at column 
> > > level only.
> > 
> > I was under the impression that clients already had to be modified to
> > figure out how to encrypt the data? If part of that process ends up
> > including enforcement of encryption for a specific column set, then the
> > addition of AEAD data could hypothetically be part of that hand-
> > waviness.
> 
> I think "transparency" here means the client just uses the regular
> prepared-statement API without having to explicitly encrypt/decrypt any
> data. The problem is we can't easily tie this to other columns in the
> table, because the client may not even know what values are in those
> columns.

The way I originally described my request -- "I'd like to be able to
tie an encrypted value to other column (or external) data" -- was not
very clear.

With my proposed model -- where the DBA (and the server) are completely
untrusted, and the DBA needs to be prevented from using the encrypted
value -- I don't think there's a useful way for the client to use
associated data that comes from the server. The client has to know what
the AD should be beforehand, because otherwise the DBA can make it so
the server returns whatever is correct.

> Imagine you do this
> 
>   UPDATE t SET encrypted_column = $1 WHERE another_column = $2;
> 
> but you want to ensure the encrypted value belongs to a particular row
> (which may or may not be identified by the another_column value). How
> would the client do that? Should it fetch the value or what?
> 
> Similarly, what if the client just does
> 
>   SELECT encrypted_column FROM t;
> 
> How would it verify the values belong to the row, without having all the
> data for the row (or just the required columns)?

So with my (hopefully more clear) model above, it wouldn't. The client
would already have the AD, and somehow tell libpq what that data was
for the query.

The rabbit hole I led you down is one where we use the rest of the row
as AD, to try to freeze pieces of it in place. That might(?) have some
useful security properties (if the client defines its use and doesn't
defer to the server). But it's not what I intended to propose and I'd
have to think about that case some more.

In my credit card example, I'm imagining something like (forgive the
contrived syntax):

SELECT address, :{aead(users.credit_card, 'u...@example.com')}
  FROM users WHERE email = 'u...@example.com';

UPDATE users
   SET :{aead(users.credit_card, 'u...@example.com')} = '1234-...'
 WHERE email = 'u...@example.com';

The client explicitly links a table's column to its AD for the duration
of the query. This approach can't scale to

SELECT credit_card FROM users;

because in this case the AD for each row is different, but I'd argue
that's ideal for this particular case. The client doesn't need to (and
probably shouldn't) grab everyone's credit card details all at once, so
there's no reason to optimize for it.

> > Unless "transparent" means that the client completely defers to the
> > server on whether to encrypt or not, and silently goes along with it if
> > the server tells it not to encrypt?
> I think that's probably a valid concern - a "bad DBA" could alter the
> table definition to not contain the "ENCRYPTED" bits, and then peek at
> the plaintext values.
> 
> But it's not clear to me how exactly would the AEAD prevent this?
> Wouldn't that be also specified on the server, somehow? In which case
> the DBA could just tweak that too, no?
>
> In other words, this issue seems mostly orthogonal to the AEAD, and the
> right solution would be to allow the client to define which columns have
> to be encrypted (in which case altering the server definition would not
> be enough).

Right, exactly. When I mentioned AEAD I had assumed that "allow the
client to define which columns have to be encrypted" was already
planned or in the works; I just misunderstood pieces of Peter's email.
It's that piece where a client would probably have to add details
around AEAD and its use.

> > That would only protect against a
> > _completely_ passive DBA, like someone reading unencrypted backups,
> > etc. And that still has a lot of value, certainly. But it seems like
> > this prototype is very close to a system where the client can reliably
> > secure data even if the server isn't trustworthy, if that's a use case
> > you're interested in.
> 
> Right. IMHO the "passive attacker" is a perfectly fine model for use
> cases that would be fine with e.g. pgcrypto if there was no risk of
> leaking plaintext values to logs, system catalogs, etc.
> 
> If we can improve it to provide (at least some) protection against
> active attackers, that'd be a nice bonus.

I agree that resis

Re: Transparent column encryption

2021-12-07 Thread Tomas Vondra

On 12/8/21 00:26, Jacob Champion wrote:
> On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote:
>> IMO it's impossible to solve this attack within TCE, because it requires 
>> ensuring consistency at the row level, but TCE obviously works at column 
>> level only.
> 
> I was under the impression that clients already had to be modified to
> figure out how to encrypt the data? If part of that process ends up
> including enforcement of encryption for a specific column set, then the
> addition of AEAD data could hypothetically be part of that hand-
> waviness.
> 

I think "transparency" here means the client just uses the regular
prepared-statement API without having to explicitly encrypt/decrypt any
data. The problem is we can't easily tie this to other columns in the
table, because the client may not even know what values are in those
columns.

Imagine you do this

  UPDATE t SET encrypted_column = $1 WHERE another_column = $2;

but you want to ensure the encrypted value belongs to a particular row
(which may or may not be identified by the another_column value). How
would the client do that? Should it fetch the value or what?

Similarly, what if the client just does

  SELECT encrypted_column FROM t;

How would it verify the values belong to the row, without having all the
data for the row (or just the required columns)?

> Unless "transparent" means that the client completely defers to the
> server on whether to encrypt or not, and silently goes along with it if
> the server tells it not to encrypt?
I think that's probably a valid concern - a "bad DBA" could alter the
table definition to not contain the "ENCRYPTED" bits, and then peek at
the plaintext values.

But it's not clear to me how exactly would the AEAD prevent this?
Wouldn't that be also specified on the server, somehow? In which case
the DBA could just tweak that too, no?

In other words, this issue seems mostly orthogonal to the AEAD, and the
right solution would be to allow the client to define which columns have
to be encrypted (in which case altering the server definition would not
be enough).

> That would only protect against a
> _completely_ passive DBA, like someone reading unencrypted backups,
> etc. And that still has a lot of value, certainly. But it seems like
> this prototype is very close to a system where the client can reliably
> secure data even if the server isn't trustworthy, if that's a use case
> you're interested in.
> 

Right. IMHO the "passive attacker" is a perfectly fine model for use
cases that would be fine with e.g. pgcrypto if there was no risk of
leaking plaintext values to logs, system catalogs, etc.

If we can improve it to provide (at least some) protection against
active attackers, that'd be a nice bonus.

>> I believe TCE can do AEAD at the column level, which protects against 
>> attacks that flipping bits, and similar attacks. It's just a matter of 
>> how the client encrypts the data.
> 
> Right, I think authenticated encryption ciphers (without AD) will be
> important to support in practice. I think users are going to want
> *some* protection against active attacks.
> 
>> Extending it to protect the whole row seems tricky, because the client 
>> may not even know the other columns, and it's not clear to me how it'd 
>> deal with things like updates of the other columns, hint bits, dropped 
>> columns, etc.
> 
> Covering the entire row automatically probably isn't super helpful in
> practice. As you mention later:
> 
>> It's probably possible to get something like this (row-level AEAD) by 
>> encrypting enriched data, i.e. not just the card number, but {user ID, 
>> card number} or something like that, and verify that in the webapp. The 
>> problem of course is that the "user ID" is just another column in the 
>> table, and there's nothing preventing the DBA from modifying that too.
> 
> Right. That's why the client has to be able to choose AD according to
> the application. In my previous example, the victim's email address can
> be copied by the DBA, but they wouldn't be able to authenticate as that
> user and couldn't convince the client to use the plaintext on their
> behalf.
> 

Well, yeah. But I'm not sure how to make that work easily, because the
client may not have the data :-(

I was thinking about using a composite data type combining the data with
the extra bits - that'd not be all that transparent as it'd require the
client to build this manually and then also cross-check it after loading
the data. So the user would be responsible for having all the data.

But doing that automatically/transparently seems hard, because how would
you deal e.g. with SELECT queries reading data through a view or CTE?

How would you declare this, either at the client or server?

Do any other databases have this capability? How do they do it?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Transparent column encryption

2021-12-07 Thread Jacob Champion

On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote:
> IMO it's impossible to solve this attack within TCE, because it requires 
> ensuring consistency at the row level, but TCE obviously works at column 
> level only.

I was under the impression that clients already had to be modified to
figure out how to encrypt the data? If part of that process ends up
including enforcement of encryption for a specific column set, then the
addition of AEAD data could hypothetically be part of that hand-
waviness.

Unless "transparent" means that the client completely defers to the
server on whether to encrypt or not, and silently goes along with it if
the server tells it not to encrypt? That would only protect against a
_completely_ passive DBA, like someone reading unencrypted backups,
etc. And that still has a lot of value, certainly. But it seems like
this prototype is very close to a system where the client can reliably
secure data even if the server isn't trustworthy, if that's a use case
you're interested in.

> I believe TCE can do AEAD at the column level, which protects against 
> attacks that flipping bits, and similar attacks. It's just a matter of 
> how the client encrypts the data.

Right, I think authenticated encryption ciphers (without AD) will be
important to support in practice. I think users are going to want
*some* protection against active attacks.

> Extending it to protect the whole row seems tricky, because the client 
> may not even know the other columns, and it's not clear to me how it'd 
> deal with things like updates of the other columns, hint bits, dropped 
> columns, etc.

Covering the entire row automatically probably isn't super helpful in
practice. As you mention later:

> It's probably possible to get something like this (row-level AEAD) by 
> encrypting enriched data, i.e. not just the card number, but {user ID, 
> card number} or something like that, and verify that in the webapp. The 
> problem of course is that the "user ID" is just another column in the 
> table, and there's nothing preventing the DBA from modifying that too.

Right. That's why the client has to be able to choose AD according to
the application. In my previous example, the victim's email address can
be copied by the DBA, but they wouldn't be able to authenticate as that
user and couldn't convince the client to use the plaintext on their
behalf.

--Jacob

Re: Transparent column encryption

2021-12-07 Thread Tomas Vondra





On 12/7/21 19:02, Jacob Champion wrote:

On Tue, 2021-12-07 at 16:39 +0100, Peter Eisentraut wrote:

On 06.12.21 21:44, Jacob Champion wrote:

I think reusing a zero IV will potentially leak more information than
just equality, depending on the cipher in use. You may be interested in
synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
like they would match this use case exactly. (But I'm not a
cryptographer.)


I'm aware of this and plan to make use of SIV.  The current
implementation is just an example.


Sounds good.


Have you given any thought to AEAD? As a client I'd like to be able to
tie an encrypted value to other column (or external) data. For example,
AEAD could be used to prevent a DBA from copying the (encrypted) value
of my credit card column into their account's row to use it.


I don't know how that is supposed to work.  When the value is encrypted
for insertion, the client may know things like table name or column
name, so it can tie it to those.  But it doesn't know what row it will
go in, so you can't prevent the value from being copied into another
row.  You would need some permanent logical row ID for this, I think.


Sorry, my description was confusing. There's nothing preventing the DBA
from copying the value inside the database, but AEAD can make it so
that the copied value isn't useful to the DBA.

Sample case. Say I have a webapp backed by Postgres, which stores
encrypted credit card numbers. Users authenticate to the webapp which
then uses the client (which has the keys) to talk to the database.
Additionally, I assume that:

- the DBA can't access the client directly (because if they can, then
they can unencrypt the victim's info using the client's keys), and

- the DBA can't authenticate as the user/victim (because if they can,
they can just log in themselves and have the data). The webapp might
for example use federated authn with a separate provider, using an
email address as an identifier.

Now, if the client encrypts a user's credit card number using their
email address as associated data, then it doesn't matter if the DBA
copies that user's encrypted card over to their own account. The DBA
can't log in as the victim, so the client will fail to authenticate the
value because its associated data won't match.


This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
on the direction.


What kinds of attacks are you hoping to prevent (and not prevent)?


The point is to prevent admins from getting at plaintext data.  The
scenario you show is an interesting one but I think it's not meant to be
addressed by this.  If admins can alter the database to their advantage,
they could perhaps increase their account balance, create discount
codes, etc. also.


Sure, but increasing account balances and discount codes don't lead to
getting at plaintext data, right? Whereas stealing someone else's
encrypted value seems like it would be covered under your threat model,
since it lets you trick a real-world client into decrypting it for you.

Other avenues of attack might depend on how you choose to add HMAC to
the current choice of AES-CBC. My understanding of AE ciphers (with or
without associated data) is that you don't have to design that
yourself, which is nice.



IMO it's impossible to solve this attack within TCE, because it requires 
ensuring consistency at the row level, but TCE obviously works at column 
level only.


I believe TCE can do AEAD at the column level, which protects against 
attacks that flipping bits, and similar attacks. It's just a matter of 
how the client encrypts the data.


Extending it to protect the whole row seems tricky, because the client 
may not even know the other columns, and it's not clear to me how it'd 
deal with things like updates of the other columns, hint bits, dropped 
columns, etc.


It's probably possible to get something like this (row-level AEAD) by 
encrypting enriched data, i.e. not just the card number, but {user ID, 
card number} or something like that, and verify that in the webapp. The 
problem of course is that the "user ID" is just another column in the 
table, and there's nothing preventing the DBA from modifying that too.


So I think it's pointless to try extending this to row-level AEAD.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Transparent column encryption

2021-12-07 Thread Jacob Champion

On Tue, 2021-12-07 at 16:39 +0100, Peter Eisentraut wrote:
> On 06.12.21 21:44, Jacob Champion wrote:
> > I think reusing a zero IV will potentially leak more information than
> > just equality, depending on the cipher in use. You may be interested in
> > synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
> > like they would match this use case exactly. (But I'm not a
> > cryptographer.)
> 
> I'm aware of this and plan to make use of SIV.  The current 
> implementation is just an example.

Sounds good.

> > Have you given any thought to AEAD? As a client I'd like to be able to
> > tie an encrypted value to other column (or external) data. For example,
> > AEAD could be used to prevent a DBA from copying the (encrypted) value
> > of my credit card column into their account's row to use it.
> 
> I don't know how that is supposed to work.  When the value is encrypted 
> for insertion, the client may know things like table name or column 
> name, so it can tie it to those.  But it doesn't know what row it will 
> go in, so you can't prevent the value from being copied into another 
> row.  You would need some permanent logical row ID for this, I think.

Sorry, my description was confusing. There's nothing preventing the DBA
from copying the value inside the database, but AEAD can make it so
that the copied value isn't useful to the DBA.

Sample case. Say I have a webapp backed by Postgres, which stores
encrypted credit card numbers. Users authenticate to the webapp which
then uses the client (which has the keys) to talk to the database.
Additionally, I assume that:

- the DBA can't access the client directly (because if they can, then
they can unencrypt the victim's info using the client's keys), and

- the DBA can't authenticate as the user/victim (because if they can,
they can just log in themselves and have the data). The webapp might
for example use federated authn with a separate provider, using an
email address as an identifier.

Now, if the client encrypts a user's credit card number using their
email address as associated data, then it doesn't matter if the DBA
copies that user's encrypted card over to their own account. The DBA
can't log in as the victim, so the client will fail to authenticate the
value because its associated data won't match.

> > > This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
> > > on the direction.
> > 
> > What kinds of attacks are you hoping to prevent (and not prevent)?
> 
> The point is to prevent admins from getting at plaintext data.  The 
> scenario you show is an interesting one but I think it's not meant to be 
> addressed by this.  If admins can alter the database to their advantage, 
> they could perhaps increase their account balance, create discount 
> codes, etc. also.

Sure, but increasing account balances and discount codes don't lead to
getting at plaintext data, right? Whereas stealing someone else's
encrypted value seems like it would be covered under your threat model,
since it lets you trick a real-world client into decrypting it for you.

Other avenues of attack might depend on how you choose to add HMAC to
the current choice of AES-CBC. My understanding of AE ciphers (with or
without associated data) is that you don't have to design that
yourself, which is nice.

--Jacob

Re: Transparent column encryption

2021-12-07 Thread Peter Eisentraut


On 06.12.21 21:44, Jacob Champion wrote:

I think reusing a zero IV will potentially leak more information than
just equality, depending on the cipher in use. You may be interested in
synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
like they would match this use case exactly. (But I'm not a
cryptographer.)


I'm aware of this and plan to make use of SIV.  The current 
implementation is just an example.



Have you given any thought to AEAD? As a client I'd like to be able to
tie an encrypted value to other column (or external) data. For example,
AEAD could be used to prevent a DBA from copying the (encrypted) value
of my credit card column into their account's row to use it.


I don't know how that is supposed to work.  When the value is encrypted 
for insertion, the client may know things like table name or column 
name, so it can tie it to those.  But it doesn't know what row it will 
go in, so you can't prevent the value from being copied into another 
row.  You would need some permanent logical row ID for this, I think. 
For this scenario, the deterministic encryption mode is perhaps not the 
right one.



This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
on the direction.


What kinds of attacks are you hoping to prevent (and not prevent)?


The point is to prevent admins from getting at plaintext data.  The 
scenario you show is an interesting one but I think it's not meant to be 
addressed by this.  If admins can alter the database to their advantage, 
they could perhaps increase their account balance, create discount 
codes, etc. also.


If this is a problem, then perhaps a better approach would be to store 
parts of the data in a separate database with separate admins.

Re: Transparent column encryption

2021-12-07 Thread Peter Eisentraut


On 06.12.21 19:28, Robert Haas wrote:

Speaking of parameter descriptions, the trickiest part of this whole
thing appears to be how to get transparently encrypted data into the
database (as opposed to reading it out).  It is required to use
protocol-level prepared statements (i.e., extended query) for this.

Why? If the client knows the CEK, can't the client choose to send
unprepared insert or update statements with pre-encrypted blobs? That
might be a bad idea from a security perspective because the encrypted
blob might then got logged, but we sometimes log parameters, too.


The client can send something like

PQexec(conn, "INSERT INTO tbl VALUES ('ENCBLOB', 'ENCBLOB')");

and it will work.  (See the included test suite where 'ENCBLOB' is 
actually computed by pgcrypto.)  But that is not transparent encryption. 
 The client wants to send "INSERT INTO tbl VALUES ('val1', 'val2')" and 
have libpq take care of encrypting 'val1' and 'val2' before hitting the 
wire.  For that you need to use the prepared statement API so that the 
values are available separately from the statement.  And furthermore the 
client needs to know what columns the insert statements is writing to, 
so that it can get the CEK for that column.  That's what it needs the 
parameter description for.


As alluded to, workarounds exist or might be made available to do part 
of that work yourself, but that shouldn't be the normal way of using it.

Re: Transparent column encryption

2021-12-06 Thread Jacob Champion

On Fri, 2021-12-03 at 22:32 +0100, Peter Eisentraut wrote:
> This feature does support deterministic
> encryption as an alternative to the default randomized encryption, so
> in that mode you can do equality lookups, at the cost of some
> security.

> + if (enc_det)
> + memset(iv, ivlen, 0);

I think reusing a zero IV will potentially leak more information than
just equality, depending on the cipher in use. You may be interested in
synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
like they would match this use case exactly. (But I'm not a
cryptographer.)

> The encryption algorithms are mostly hardcoded right now, but there
> are facilities for picking algorithms and adding new ones that will be
> expanded.  The CMK process uses RSA-OAEP.  The CEK process uses
> AES-128-CBC right now; a more complete solution should probably
> involve some HMAC thrown in.

Have you given any thought to AEAD? As a client I'd like to be able to
tie an encrypted value to other column (or external) data. For example,
AEAD could be used to prevent a DBA from copying the (encrypted) value
of my credit card column into their account's row to use it.

> This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
> on the direction.

What kinds of attacks are you hoping to prevent (and not prevent)?

--Jacob

[1] https://datatracker.ietf.org/doc/html/rfc8452

Re: Transparent column encryption

2021-12-06 Thread Robert Haas

On Fri, Dec 3, 2021 at 4:32 PM Peter Eisentraut
 wrote:
> But it's missing the remaining 90% of the work,
> including additional DDL support, error handling, robust memory
> management, protocol versioning, forward and backward compatibility,
> pg_dump support, psql \d support, refinement of the cryptography, and
> so on.  But I think obvious solutions exist to all of those things, so
> it isn't that interesting to focus on them for now.

Right, we wouldn't want to get bogged down at this stage in little
details like, uh, everything.

> Some protocol extensions are required.  These should be guarded by
> some _pq_... setting, but this is not done in this patch yet.  As
> mentioned above, extra messages are added for sending the CMKs and
> CEKs.  In the RowDescription message, I have commandeered the format
> field to add a bit that indicates that the field is encrypted.  This
> could be made a separate field, and there should probably be
> additional fields to indicate the algorithm and CEK name, but this was
> easiest for now.  The ParameterDescription message is extended to
> contain format fields for each parameter, for the same purpose.
> Again, this could be done differently.

I think this is reasonable. I would choose to use an additional bit in
the format field as opposed to a separate field. It is worth
considering whether it makes more sense to extend the existing
ParameterDescription message conditionally on some protocol-level
option, or whether we should instead, say, add ParameterDescription2
or the moral equivalent. As I see it, the latter feels conceptually
simpler, but on the other hand, our wire protocol supposes that we
will never run out of 1-byte codes for messages, so perhaps some
prudence is needed.

> Speaking of parameter descriptions, the trickiest part of this whole
> thing appears to be how to get transparently encrypted data into the
> database (as opposed to reading it out).  It is required to use
> protocol-level prepared statements (i.e., extended query) for this.

Why? If the client knows the CEK, can't the client choose to send
unprepared insert or update statements with pre-encrypted blobs? That
might be a bad idea from a security perspective because the encrypted
blob might then got logged, but we sometimes log parameters, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

82 matches

Mail list logo