Re: Data encryption in Kudu

Franco Venturi Sat, 29 Apr 2017 20:36:37 -0700


In the last couple of days I did some reading on Oracle TDE (Transparent Data 
Encryption) and had some discussions with people at work, and that helped me 
clarify my ideas about encryption in Kudu.





- The most important thing I realized is that there are basically (at least) 
two orthogonal ways to achieve 'data encryption' in Kudu (and in Oracle as 
well): 
- client-side encryption (which Oracle calls 'TDE column encryption') 
- server-side encryption (which Oracle calls 'TDE tablespace encryption') 




- I prefer the terms 'client-side encryption' and 'server-side encryption', 
because in the case of Kudu the data is written to disk by columns, and 
therefore a term like 'column encryption' might be misleading; also the terms 
'client-side' and 'server-side' help, IMHO, in understanding some of the advantages and 
drawbacks of each approach and some high-level details of how they could be 
implemented. 




- In the next items I'll be referring to the Oracle TDE documents; two of 
them that I found especially useful are: 
- Oracle Advanced Security Transparent Data Encryption (TDE) FAQ 
(http://www.oracle.com/technetwork/database/options/advanced-security/overview/advanced-security-tde-faq-2995212.pdf)
 
- These two chapters in the Oracle Database Advanced Security Guide 
(https://docs.oracle.com/database/121/ASOAG/toc.htm) 
- Introduction to Transparent Data Encryption 
- General Considerations of Using Transparent Data Encryption 
- I also found useful the chapter about 'Transparent Data Encryption' in the 
Oracle 'Advanced Security Guide' for Oracle DB version 10.2 
(https://docs.oracle.com/cd/B19306_01/network.102/b14268/asotrans.htm#ASOAG600) 
because it has a couple of pictures that are not in the newer version. 





Some subnotes about client-side encryption: 


- the idea here is that all the encryption happens on the client and the client 
sends encrypted data to the server; the Kudu server engine never sees any 
plaintext data 
- this point has a few important consequences: 
- the encrypted column (assuming that encryption is on a column-by-column basis, 
which is probably what most users would need) is always stored as a 'byte' type 
regardless of its original type 
- for that encrypted column, each entry is going to be somewhat 
bigger than if it were stored as plaintext; this is due both to the 
'padding' required by the block cipher and possibly to a MAC for 
integrity validation (see the Oracle documents above for more details) 
- for that encrypted column, the RLE or dictionary or other encodings don't 
bring any advantage (compression or otherwise) since the data is just random 
data from the point of view of the Kudu server. For this kind of column the 
encoding would have to be just plain natural format 
- if the encrypted column is used as a key or in a 'where' clause of a 'select' 
statement, several other considerations also apply: 
- range selects on that column are not possible and they would become full 
scans over that column (same thing with Oracle) 
- 'range' comparison operators like 'greater than' or 'less than' on that 
column are not possible and they would become full scans over that column 
- 'exact' comparison operators like 'equal to' or 'not equal to' on that column 
are only possible if the encryption scheme is one-to-one (i.e. if for a given 
plaintext there's only one way it can be encrypted, which typically means the 
encryption algorithm cannot make use of a 'salt'); otherwise we go back to full 
scans of that column 
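As a minimal sketch of the one-to-one versus salted distinction above: the cipher below is a toy stream cipher built from SHA-256 purely for illustration (it is not a vetted algorithm and not something Kudu or Oracle uses), and the names `encrypt`, `decrypt` and `salted` are hypothetical. It shows that equality on ciphertexts only works without a salt, and that ciphertext order tells the server nothing about plaintext order.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def encrypt(key: bytes, plaintext: bytes, salted: bool) -> bytes:
    """Return nonce || ciphertext for a single cell value."""
    if salted:
        # Fresh random salt per value: one plaintext, many ciphertexts.
        nonce = os.urandom(16)
    else:
        # Nonce derived from the plaintext: one plaintext, one ciphertext.
        nonce = hmac.new(key, plaintext, hashlib.sha256).digest()[:16]
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt(key: bytes, blob: bytes) -> bytes:
    """Split off the 16-byte nonce and undo the XOR."""
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


key = b"\x01" * 32
value = b"alice"

# The server can match equal values only in the one-to-one case.
one_to_one_equal = encrypt(key, value, salted=False) == encrypt(key, value, salted=False)
salted_equal = encrypt(key, value, salted=True) == encrypt(key, value, salted=True)

# Sorting by ciphertext does not reproduce the plaintext order,
# which is why range predicates cannot be evaluated on the server.
numbers = list(range(50))
by_ciphertext = sorted(
    numbers, key=lambda n: encrypt(key, n.to_bytes(4, "big"), salted=False)
)
order_preserved = by_ciphertext == numbers
```

Note that the one-to-one mode trades some secrecy for searchability: anyone who can read the stored bytes can tell which rows hold the same value.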





- given these points, these are the consequences from the performance point of 
view: 
- since the overhead of the encryption happens on the client side, the 
performance of the server itself is not significantly affected by the 
client-side encryption except for the fact that it loses any possible advantage 
that the column encodings could have given (compression, etc) 
- however (and this is a big HOWEVER), any selection on an encrypted column 
that involves a range (and possibly even an exact selection if the encryption 
scheme uses a 'salt') becomes a full column scan (at the client side, since the 
server is helpless to 'understand' what the encrypted data mean); this means 
that a select on a 'big data' table of millions/billions of rows becomes 
extremely slow, because for that column all the rows have to be sent back to 
the client and the client has to decrypt them and decide which ones satisfy the 
selection criteria (and as you can imagine, there are also significant network 
implications here because all the entries need to be sent back to the client). 
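The cost described above can be sketched as follows. `OpaqueServer` is a hypothetical in-memory stand-in for a table holding one client-encrypted column (it is not the Kudu client API), and the cipher is again a toy SHA-256 stream cipher used only for illustration. The point is that a range select returning 10 rows still pulls every stored entry over the wire.

```python
import hashlib
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Salted encryption: random 16-byte nonce followed by the ciphertext."""
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


class OpaqueServer:
    """Hypothetical server that stores opaque bytes and cannot filter them."""

    def __init__(self):
        self.rows = []
        self.bytes_sent = 0

    def insert(self, blob: bytes) -> None:
        self.rows.append(blob)

    def scan_all(self):
        # The only scan the server can offer on an encrypted column.
        for blob in self.rows:
            self.bytes_sent += len(blob)
            yield blob


def client_range_select(server: OpaqueServer, key: bytes, low: int, high: int):
    """Fetch everything, decrypt on the client, then apply the range filter."""
    matches = []
    for blob in server.scan_all():
        value = int.from_bytes(decrypt(key, blob), "big")
        if low <= value <= high:
            matches.append(value)
    return matches


key = os.urandom(32)
server = OpaqueServer()
for n in range(1000):
    server.insert(encrypt(key, n.to_bytes(8, "big")))

result = client_range_select(server, key, 10, 19)
```

Here `result` holds 10 values, while `server.bytes_sent` accounts for all 1000 entries (24 bytes each in this sketch).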




- also this approach is very similar to what HDFS does for their transparent 
encryption, and I would imagine that in this case we could leverage some of the 
already existing key management infrastructure offered by HDFS. 




- from the security point of view with client-side encryption, the server has 
no knowledge of what the encrypted data actually means, i.e. an attacker on the 
server itself would not be able to decrypt the data 
- also from the security point of view, since the encryption happens at the 
client side, the data that is transferred on the network between the client and 
the server is already encrypted and there's no need (at least from this point 
of view) to add a layer of encryption between client and server 




- the practical implementation of client-side encryption would require some 
minor changes on the server code; for now I can think of the following: 
- an additional field on each column that indicates if the column is encrypted 
(the field could be a 2-byte cipher suite id as defined in RFC 5246 - with the 
value 0 meaning that the column is not encrypted) 
- if the column encryption id is not 0, the column would be internally stored 
as a byte type and the server would be expected to receive (and send back) byte 
type data for any entry belonging to that column 
- another boolean field on each column that indicates if the encryption scheme 
is one-to-one (i.e. it doesn't use a 'salt') or one-to-many 
- to avoid any problems with 'bad' clients that don't understand the 
limitations above, the server could return an 'invalid request' error if the 
client attempts to run a 'range' search on an encrypted column or an 'exact' 
search on an encrypted column where the encryption algorithm is not one-to-one 
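The server-side checks listed above could look roughly like this. All names here (`ColumnSchema`, `cipher_suite_id`, `one_to_one`, `InvalidRequest`, `validate_predicate`) are hypothetical and only mirror the two proposed per-column fields; nothing below is existing Kudu code. The value 0x0035 is the RFC 5246 id for TLS_RSA_WITH_AES_256_CBC_SHA, used here just as an example of a non-zero suite id.

```python
from dataclasses import dataclass

RANGE_OPS = {"<", "<=", ">", ">="}
EXACT_OPS = {"==", "!="}


class InvalidRequest(Exception):
    """Raised when a predicate cannot be evaluated on an encrypted column."""


@dataclass(frozen=True)
class ColumnSchema:
    name: str
    cipher_suite_id: int = 0   # 2-byte suite id; 0 means "not encrypted"
    one_to_one: bool = False   # True if the scheme does not use a salt

    @property
    def encrypted(self) -> bool:
        return self.cipher_suite_id != 0


def validate_predicate(column: ColumnSchema, op: str) -> None:
    """Reject predicates the server cannot honour on client-encrypted data."""
    if op not in RANGE_OPS | EXACT_OPS:
        raise InvalidRequest(f"unknown operator {op!r}")
    if not column.encrypted:
        return  # plaintext columns accept every operator
    if op in RANGE_OPS:
        raise InvalidRequest(f"range search on encrypted column {column.name!r}")
    if not column.one_to_one:
        raise InvalidRequest(f"exact search on salted column {column.name!r}")


def is_allowed(column: ColumnSchema, op: str) -> bool:
    """Convenience wrapper returning a boolean instead of raising."""
    try:
        validate_predicate(column, op)
    except InvalidRequest:
        return False
    return True


plain = ColumnSchema("ts")
ssn = ColumnSchema("ssn", cipher_suite_id=0x0035, one_to_one=True)
notes = ColumnSchema("notes", cipher_suite_id=0x0035, one_to_one=False)
```

With these definitions, `is_allowed(ssn, "==")` is accepted, while `is_allowed(ssn, ">=")` and `is_allowed(notes, "==")` are rejected.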




- the changes to the client code would instead be substantial - I thought of 
some of them but I don't want to make this post even longer than it is now 





Some subnotes about server-side encryption: 





- the idea here is that the encryption happens on the server and that's what I 
was initially thinking when I started this thread 
- this could be implemented by adding 'encryption codecs' right after the 
compression codecs. This would happen inside the server code when 
reading/writing a 'cfile' (and hence it is more or less the equivalent of 
Oracle tablespace encryption) 
- the server-side encryption would still be at the column level for Kudu 
because of the way Kudu writes its data to disk 
- this approach would allow for range searches using B-trees, and would not 
have any of the limitations listed above 
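A small sketch of why the encryption codec would sit after the compression codec. It uses zlib from the Python standard library and a toy SHA-256 stream cipher as a stand-in for a real cipher such as AES; `write_block` and `read_block` are hypothetical names, not Kudu's cfile functions. Compressing first keeps the compression benefit, while compressing ciphertext gains nothing because the ciphertext looks like random bytes.

```python
import hashlib
import os
import zlib


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def encrypt_block(key: bytes, data: bytes) -> bytes:
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(d ^ s for d, s in zip(data, stream))


def decrypt_block(key: bytes, stored: bytes) -> bytes:
    nonce, body = stored[:16], stored[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


def write_block(key: bytes, raw: bytes) -> bytes:
    """Proposed order on write: compress, then encrypt."""
    return encrypt_block(key, zlib.compress(raw))


def read_block(key: bytes, stored: bytes) -> bytes:
    """Reverse order on read: decrypt, then decompress."""
    return zlib.decompress(decrypt_block(key, stored))


key = os.urandom(32)
raw = b"status=OK;" * 1000           # 10000 highly repetitive bytes

compress_then_encrypt = write_block(key, raw)
encrypt_then_compress = zlib.compress(encrypt_block(key, raw))
round_trip = read_block(key, compress_then_encrypt)
```

In this sketch `compress_then_encrypt` is a small fraction of the raw size, whereas `encrypt_then_compress` ends up no smaller than the raw block.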





- from the security point of view, an attacker with full access to the server 
would probably be able to decrypt the encrypted data 
- also from a security point of view the server returns the data back in 
plaintext format; if the data transferred over the network contains sensitive 
information, it would need an extra encryption layer like TLS or something like 
that 





- as for the performance implications, if the encryption on the server side uses 
something like AES192 or AES256, there are libraries like libcrypto that take 
advantage of the hardware acceleration for AES encryption on many modern CPUs, 
and therefore I suspect the performance overhead would be limited; this is also 
consistent with what the Oracle documentation says regarding processing overhead 
in the case of tablespace encryption in TDE 





- of course in this case the major benefit would be that exact 'selects' and 
range 'selects' would work exactly like they do now (i.e. they are able to use 
B-trees and don't require a full scan of the column); another benefit is that 
RLE encoding, dictionary encoding, etc work as expected and offer all their 
benefits (compression, etc) 





- the implementation of server-side encryption would require on the server more 
changes than the client-side encryption (for instance the cfile header may 
require an additional field to store the size of the block after compression 
and before encryption) 
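To illustrate why the size after compression and the size after encryption can differ: a block cipher works on fixed-size blocks, so the compressed bytes get padded before encryption. The sketch below assumes a 16-byte block (as in AES) and PKCS#7 padding; both are assumptions for illustration, since the thread does not fix a cipher mode or padding scheme.

```python
BLOCK_SIZE = 16  # assumed cipher block size in bytes (AES uses 16)


def pkcs7_pad(data: bytes, block: int = BLOCK_SIZE) -> bytes:
    """Append n bytes of value n so the length becomes a multiple of block."""
    n = block - len(data) % block
    return data + bytes([n]) * n


def pkcs7_unpad(padded: bytes, block: int = BLOCK_SIZE) -> bytes:
    """Strip PKCS#7 padding, validating it first."""
    if not padded or len(padded) % block != 0:
        raise ValueError("padded length must be a positive multiple of the block size")
    n = padded[-1]
    if n < 1 or n > block or padded[-n:] != bytes([n]) * n:
        raise ValueError("invalid padding")
    return padded[:-n]


# Size of a block after compression versus size once padded for encryption.
compressed_30 = b"x" * 30
compressed_32 = b"x" * 32
padded_30 = pkcs7_pad(compressed_30)   # 30 bytes grow to 32
padded_32 = pkcs7_pad(compressed_32)   # 32 bytes grow to 48 (a full extra block)
```

With a self-describing padding such as PKCS#7 the original length can be recovered from the padding itself; an explicit header field would matter for schemes where that is not the case.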





- it would also require a way to have the server manage these column encryption 
keys (possibly through additional client APIs); I haven't looked yet at the way 
Oracle handles encryption/decryption keys for the tablespace encryption TDE, 
but it's on my 'to-do' list 





- finally, since these two approaches (client-side and server-side) are 
orthogonal, i.e. independent of each other, if both were implemented at some 
time, you could have cases where some (more security critical) columns are 
encrypted on the client side, while others (perhaps columns with less stringent 
security requirements, and used in 'selects') are encrypted on the server side 
(and of course other columns could not be encrypted at all). 


I think this is all for now; thanks for your patience reading through this long 
post. 

Franco 


----- Original Message -----

From: [email protected] 
To: [email protected] 
Sent: Wednesday, April 26, 2017 9:48:07 PM 
Subject: Re: Data encryption in Kudu 

David, Dan, Todd, 
thanks for your prompt replies. 

At this stage I am just exploring what it would take to implement some sort of 
data encryption in Kudu. 

After reading your comments here are some further thoughts: 

- according to the first sentence in this paragraph in the Kudu docs ( 
https://kudu.apache.org/docs/schema_design.html#compression ): 

Kudu allows per-column compression using the LZ4 , Snappy , or zlib compression 
codecs. 

it should be possible to perform per-column encryption by adding 'encryption 
codecs' right after the compression codecs. I browsed through the code quickly 
and I think this is done when reading/writing a 'cfile' (please correct me if I am 
wrong). If this is correct, this change could be 'minimally invasive' (at least 
for the 'cfile' part) and would not require a major overhaul of the Kudu 
architecture. 

- as per the key management aspect, I am not a security expert at all, so I am 
not sure what would be the best approach here - my thought here is that in most 
places Kudu is deployed together with HDFS, so it would be 'desirable' if the 
key management were consistent between the two services; on the other hand, I 
also realize that the basic premises are fundamentally different: HDFS encrypts 
everything at the client level and therefore the HDFS engine itself is almost 
completely unaware that the data it stores is actually encrypted (except for a 
special file hidden attribute, if I understand correctly), while in Kudu the 
storage engine must have both the 'public' key (when encrypting) and the 
'private' key (when decrypting) otherwise it can't take advantage of knowing 
the 'structure' of the data (for instance the Bloom filters probably wouldn't 
work with the key being encrypted). This means for instance that an attacker 
who is able to gain access to the Kudu tablet servers would probably be able to 
decrypt the data. Also one way to achieve something similar to what HDFS does 
(i.e. client-based encryption and data encrypted in-flight) could be perhaps 
using a one-time client certificate generated by the KMS server, but this would 
also require changes to the client code. 

Franco 


----- Original Message -----

From: "Todd Lipcon" <[email protected]> 
To: [email protected] 
Sent: Tuesday, April 25, 2017 3:49:50 PM 
Subject: Re: Data encryption in Kudu 

Agreed with what Dan said. 

I think there are a number of interesting design alternatives to be considered, 
so before coding it would be great to work through a design document to explore 
the alternatives. For example, we could try to apply encryption at the 'fs/' 
layer, which would cover all non-WAL data, but then we would lose the ability 
to specify encryption on a per-column basis. There are other requirements that 
need to be ironed out about whether we'd need to support separate encryption 
keys per column/table/server/etc, whether metadata also needs to be encrypted, 
etc. 

-Todd 

On Tue, Apr 25, 2017 at 10:38 AM, Dan Burkert < [email protected] > wrote: 



Hi Franco, 

I think you are right that a client-based approach wouldn't work, because we 
wouldn't want to encrypt at the level of individual cell values. That would get 
in the way of encoding, compression, predicate evaluation, etc. As you note, 
adding encryption at the block layer is probably the way to go. Key management 
is definitely the tricky issue. We do have one advantage over HDFS - because 
Kudu does logical replication, the encryption key can be scoped to a particular 
tablet server or tablet replica, it wouldn't need to be shared among all 
replicas. I haven't done enough research to know if this makes it fundamentally 
easier to do key management. I would assume at a minimum we would want to 
integrate with key providers such as an HSM. It would be good to have a thorough 
review of existing solutions in the space, such as TDE and the Hadoop KMS. Is 
this something you are interested in working on? 

- Dan 

On Tue, Apr 25, 2017 at 8:30 AM, David Alves < [email protected] > wrote: 

<blockquote>

Hi Franco 

Dan, Alexey, Todd are our security experts. 
Folks, thoughts on this? 

Best 
David 

On Mon, Apr 24, 2017 at 7:08 PM, < [email protected] > wrote: 

<blockquote>

Over the weekend I started looking at what it would take to add data encryption 
to Kudu (besides using filesystem encryption via dm-crypt or something like 
that). 

Here are a few notes - please feel free to comment on them and add suggestions: 

- reading through this mailing list, it looks like this feature has been asked 
for a couple of times last year, but from what I can tell, no one is currently 
working on it. 
- a client-based approach to encryption like the one used by HDFS wouldn't work 
(at least out of the box) because for instance encrypting the primary key at 
the client would prevent being able to have range filters for scans; it might 
work for the columns that are not part of the primary key 
- there's already code in Kudu for several compression codecs (LZ4, gzip, etc); 
I thought it would be possible to add similar code for encryption codecs (to be 
applied after the compression, of course) 
- the WAL log files and delta files should be similarly encrypted too 
- not sure what would be the best way to manage the key - I see that in HDFS 
they use a double key mechanism, where the encryption key for the data file is 
itself encrypted with the allowed user key and this whole process is managed by 
an external Key Management Service 
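The double key mechanism mentioned in the last item can be sketched like this. `ToyKms` is a hypothetical in-memory stand-in for an external Key Management Service (it is not the Hadoop KMS API), and the cipher is a toy SHA-256 stream cipher used only for illustration. The data key encrypts the data; the data key itself is stored only in wrapped (encrypted) form, and only the KMS can unwrap it.

```python
import hashlib
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(d ^ s for d, s in zip(data, stream))


def toy_decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


class ToyKms:
    """Hypothetical KMS: the master keys never leave this object."""

    def __init__(self):
        self._master_keys = {}

    def create_master_key(self, name: str) -> None:
        self._master_keys[name] = os.urandom(32)

    def generate_data_key(self, name: str):
        """Return (plaintext data key, data key wrapped with the master key)."""
        data_key = os.urandom(32)
        return data_key, toy_encrypt(self._master_keys[name], data_key)

    def unwrap_data_key(self, name: str, wrapped: bytes) -> bytes:
        return toy_decrypt(self._master_keys[name], wrapped)


kms = ToyKms()
kms.create_master_key("table-a")

# Write path: encrypt with a fresh data key, persist only its wrapped form.
data_key, wrapped_key = kms.generate_data_key("table-a")
stored = (wrapped_key, toy_encrypt(data_key, b"sensitive cell value"))
del data_key

# Read path: ask the KMS to unwrap the data key, then decrypt the data.
recovered_key = kms.unwrap_data_key("table-a", stored[0])
recovered = toy_decrypt(recovered_key, stored[1])
```

One consequence of this design is that rotating or revoking the master key in the KMS does not require re-encrypting the data itself, only re-wrapping the data keys.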

Thanks in advance for your ideas and suggestions, 
Franco 





</blockquote>



</blockquote>




-- 
Todd Lipcon 
Software Engineer, Cloudera
