Renaud Delbru created LUCENE-6966:
-------------------------------------

             Summary: Contribution: Codec for index-level encryption
                 Key: LUCENE-6966
                 URL: https://issues.apache.org/jira/browse/LUCENE-6966
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/other
            Reporter: Renaud Delbru


We would like to contribute a codec, developed as part of an engagement with a 
customer, that enables the encryption of sensitive data in the index. We think 
this could be of interest to the community.

Below is a description of the project.

h1. Introduction

In comparison with approaches where all data is encrypted (e.g., file system 
encryption, index output / directory encryption), encryption at the codec level 
enables more fine-grained control over which blocks of data are encrypted. This 
is more efficient since less data has to be encrypted. It also gives more 
flexibility, such as the ability to select which fields to encrypt.

Some of the requirements for this project were:

* The performance impact of the encryption should be reasonable.
* The user can choose which fields to encrypt.
* Key management: during the life cycle of the index, the user can provide a 
new version of the encryption key. Multiple key versions should be able to 
co-exist in one index.

h1. What is supported?

- Block tree terms index and dictionary
- Compressed stored fields format
- Compressed term vectors format
- Doc values format (prototype based on an encrypted index output) - this will 
be submitted as a separate patch
- Index upgrader: a command to upgrade all the index segments to the latest 
available key version.

h1. How is it implemented?

h2. Key Management

One index segment is encrypted with a single key version. An index can have 
multiple segments, each one encrypted using a different key version. The key 
version for a segment is stored in the segment info.
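
Below is a minimal sketch of how a per-segment key version could be recorded, 
assuming SegmentInfo attributes are used as the storage mechanism (the attribute 
name and helper class are hypothetical; the actual patch may store the version 
differently):

{code:java}
import org.apache.lucene.index.SegmentInfo;

// Hypothetical helper illustrating per-segment key version bookkeeping.
public final class KeyVersionAttribute {

  // Hypothetical attribute name; not part of the actual patch.
  public static final String KEY_VERSION_ATTR = "encryption.keyVersion";

  private KeyVersionAttribute() {}

  // Record the key version used to encrypt this segment at write time.
  public static void write(SegmentInfo si, int keyVersion) {
    si.putAttribute(KEY_VERSION_ATTR, Integer.toString(keyVersion));
  }

  // Look up the key version at read time so the matching key can be retrieved.
  public static int read(SegmentInfo si) {
    String value = si.getAttribute(KEY_VERSION_ATTR);
    if (value == null) {
      throw new IllegalStateException("Segment is missing an encryption key version");
    }
    return Integer.parseInt(value);
  }
}
{code}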

The provided codec is abstract: a subclass is responsible for providing an 
implementation of the cipher factory. The cipher factory is responsible for 
creating a cipher instance for a given key version.
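
As an illustration, the contract of the cipher factory could look roughly like 
the following sketch (the interface and method names are hypothetical, not the 
actual API of the patch):

{code:java}
import javax.crypto.Cipher;

// Hypothetical cipher factory contract; a concrete subclass of the codec
// would provide an implementation backed by its own key store.
public interface CipherFactory {

  // Create a cipher instance for the given key version, initialised either
  // for encryption (Cipher.ENCRYPT_MODE) or decryption (Cipher.DECRYPT_MODE).
  Cipher createCipher(int keyVersion, int opMode) throws Exception;

  // The most recent key version, used when writing new segments
  // or when upgrading old ones.
  int latestKeyVersion();
}
{code}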

h2. Encryption Model

The encryption model is based on AES/CBC with padding. The initialisation 
vector (IV) is reused for performance reasons, but only on a per-format and 
per-segment basis.
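
For reference, the following self-contained sketch shows the described scheme 
using the standard javax.crypto API: AES in CBC mode with PKCS5 padding and an 
explicit IV. The key and IV handling here is purely illustrative; in the codec 
the IV is reused only on a per-format and per-segment basis, as described above.

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public final class AesCbcSketch {

  // Encrypt a block of data with AES/CBC and PKCS5 padding.
  static byte[] encrypt(byte[] plain, byte[] key, byte[] iv) throws Exception {
    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    return cipher.doFinal(plain);
  }

  // Decrypt a block of data with the same key and IV.
  static byte[] decrypt(byte[] encrypted, byte[] key, byte[] iv) throws Exception {
    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    return cipher.doFinal(encrypted);
  }

  public static void main(String[] args) throws Exception {
    SecureRandom random = new SecureRandom();
    byte[] key = new byte[16]; // 128-bit AES key (placeholder)
    byte[] iv = new byte[16];  // 16-byte IV, reused per format/segment in the codec
    random.nextBytes(key);
    random.nextBytes(iv);

    byte[] encrypted = encrypt("a block of index data".getBytes(StandardCharsets.UTF_8), key, iv);
    byte[] decrypted = decrypt(encrypted, key, iv);
    System.out.println(new String(decrypted, StandardCharsets.UTF_8));
  }
}
{code}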

While IV reuse is usually considered a bad practice, the CBC mode is somewhat 
resilient to IV reuse. The only "leak" of information this could lead to is the 
ability to tell that two encrypted blocks of data start with the same prefix. 
However, it is unlikely that two data blocks in an index segment will start 
with the same data:

- Stored Fields Format: Each encrypted data block is a compressed block (~4kb) 
of one or more documents. It is unlikely that two compressed blocks start with 
the same data prefix.

- Term Vectors: Each encrypted data block is a compressed block (~4kb) of terms 
and payloads from one or more documents. It is unlikely that two compressed 
blocks start with the same data prefix.

- Term Dictionary Index: The term dictionary index is encoded and encrypted in 
one single data block.

- Term Dictionary Data: Each data block of the term dictionary encodes a set of 
suffixes. It is unlikely to have two dictionary data blocks sharing the same 
prefix within the same segment.

- DocValues: A DocValues file will be composed of multiple encrypted data 
blocks. It is unlikely to have two data blocks sharing the same prefix within 
the same segment (each one encodes a list of values associated with a field).

To the best of our knowledge, this model should be safe. However, it would be 
good if someone with security expertise in the community could review and 
validate it. 

h1. Performance

We report here a performance benchmark performed on an early prototype based on 
Lucene 4.x. The benchmark used the Wikipedia dataset, with all fields (id, 
title, body, date) encrypted. Only the block tree terms format and the 
compressed stored fields format were tested at that time.

h2. Indexing

The indexing throughput decreased slightly, by roughly 15% compared to base 
Lucene.

The merge time increased by roughly 35%.

There was no significant difference in terms of index size.

h2. Query Throughput

With respect to query throughput, we observed no significant impact on the 
following queries: term query, boolean query, phrase query, numeric range 
query.

We observed the following performance impact for queries that need to scan a 
larger portion of the term dictionary:

- prefix query: decrease of ~25%
- wildcard query (e.g., “fu*r”): decrease of ~60%
- fuzzy query (distance 1): decrease of ~40%
- fuzzy query (distance 2): decrease of ~80%

We can see that the performance decrease correlates with the size of the 
dictionary scan.

h2. Document Retrieval

We observed a decrease in performance that correlates with the size of the set 
of documents to be retrieved:

- ~20% when retrieving a medium set of documents (100)
- ~30-40% when retrieving a large set of documents (1000)

h1. Known Limitations

- Compressed stored fields do not preserve the order of fields, since 
non-encrypted and encrypted fields are stored in separate blocks.

- The current implementation of the cipher factory does not enforce the use of 
AES/CBC. We plan to add this to the final version of the patch.

- The current implementation does not change the IV per segment. We plan to add 
this to the final version of the patch.

- The current implementation of compressed stored fields decrypts a full 
compressed block even if only a small portion of it needs to be decompressed 
(high impact when storing very small documents). We plan to add this 
optimisation to the final version of the patch; it might improve the overall 
document retrieval performance.

The codec has been implemented as a contrib. Given that most of the extended 
classes were final, we had to copy most of the original code from the extended 
formats. At a later stage, we could consider opening up some of these classes 
so they can be extended properly, in order to reduce code duplication and 
simplify code maintenance.


