Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2020280234
##
parquet/src/encryption/decrypt.rs:
##
@@ -324,6 +346,14 @@ impl DecryptionPropertiesBuilder {
.insert(column_name.to_string(), decryption_key);
self
}
+
+/// Specify multiple column decryption keys
+pub fn with_column_keys(mut self, column_names: Vec<&str>, keys:
Vec>) -> Self {
+for (column_name, key) in
column_names.into_iter().zip(keys.into_iter()) {
Review Comment:
:+1: Good point, done thanks
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2769634877 Follow on PR (for full support) is: - https://github.com/apache/parquet-site/pull/110 is released I will mark that ready for review when we have released the next version of arrow-rs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2769389289 Thanks again @rok @etseidl @adamreeve and @corwinjoy 🦾 🚀 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2769423294 I also made a small PR to update the parquet status page (when this is released): - https://github.com/apache/parquet-site/pull/110 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb merged PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2769121068 Merged to resolve another conflict 🙄 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2767084736 Thanks @adamreeve @corwinjoy and @rok ! I'll plan to merge this tomorrow This PR appeared to have a conflict so I took the liberty of merging up from main  -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2020279917
##
parquet/tests/encryption/encryption.rs:
##
@@ -0,0 +1,709 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains tests for reading encrypted Parquet files with the
Arrow API
+
+use crate::encryption_util::{verify_encryption_test_data, TestKeyRetriever};
+use arrow::array::*;
+use arrow::error::Result as ArrowResult;
+use arrow_array::{Int32Array, RecordBatch};
+use arrow_schema::{DataType as ArrowDataType, DataType, Field, Schema};
+use parquet::arrow::arrow_reader::{
+ArrowReaderMetadata, ArrowReaderOptions, ParquetRecordBatchReaderBuilder,
+};
+use parquet::arrow::ArrowWriter;
+use parquet::data_type::{ByteArray, ByteArrayType};
+use parquet::encryption::decrypt::FileDecryptionProperties;
+use parquet::encryption::encrypt::FileEncryptionProperties;
+use parquet::file::metadata::ParquetMetaData;
+use parquet::file::properties::WriterProperties;
+use parquet::file::writer::SerializedFileWriter;
+use parquet::schema::parser::parse_message_type;
+use std::fs::File;
+use std::sync::Arc;
+
+#[test]
+fn test_non_uniform_encryption_plaintext_footer() {
+let test_data = arrow::util::test_util::parquet_test_data();
+let path =
format!("{test_data}/encrypt_columns_plaintext_footer.parquet.encrypted");
+let file = File::open(path).unwrap();
+
+// There is always a footer key even with a plaintext footer,
+// but this is used for signing the footer.
+let footer_key = "0123456789012345".as_bytes(); // 128bit/16
+let column_1_key = "1234567890123450".as_bytes();
+let column_2_key = "1234567890123451".as_bytes();
+
+let decryption_properties =
FileDecryptionProperties::builder(footer_key.to_vec())
+.with_column_key("double_field", column_1_key.to_vec())
+.with_column_key("float_field", column_2_key.to_vec())
+.build()
+.unwrap();
+
+verify_encryption_test_file_read(file, decryption_properties);
+}
+
+#[test]
+fn test_non_uniform_encryption_disabled_aad_storage() {
+let test_data = arrow::util::test_util::parquet_test_data();
+let path =
+
format!("{test_data}/encrypt_columns_and_footer_disable_aad_storage.parquet.encrypted");
+let file = File::open(path.clone()).unwrap();
+
+let footer_key = b"0123456789012345".to_vec(); // 128bit/16
+let column_1_key = b"1234567890123450".to_vec();
+let column_2_key = b"1234567890123451".to_vec();
+
+// Can read successfully when providing the correct AAD prefix
+let decryption_properties =
FileDecryptionProperties::builder(footer_key.clone())
+.with_column_key("double_field", column_1_key.clone())
+.with_column_key("float_field", column_2_key.clone())
+.with_aad_prefix(b"tester".to_vec())
+.build()
+.unwrap();
+
+verify_encryption_test_file_read(file, decryption_properties);
+
+// Using wrong AAD prefix should fail
+let decryption_properties =
FileDecryptionProperties::builder(footer_key.clone())
+.with_column_key("double_field", column_1_key.clone())
+.with_column_key("float_field", column_2_key.clone())
+.with_aad_prefix(b"wrong_aad_prefix".to_vec())
+.build()
+.unwrap();
+
+let file = File::open(path.clone()).unwrap();
+let options = ArrowReaderOptions::default()
+.with_file_decryption_properties(decryption_properties.clone());
+let result = ArrowReaderMetadata::load(&file, options.clone());
+assert!(result.is_err());
+assert_eq!(
+result.unwrap_err().to_string(),
+"Parquet error: Provided footer key and AAD were unable to decrypt
parquet footer"
+);
+
+// Not providing any AAD prefix should fail as it isn't stored in the file
+let decryption_properties = FileDecryptionProperties::builder(footer_key)
+.with_column_key("double_field", column_1_key)
+.with_column_key("float_field", column_2_key)
+.build()
+.unwrap();
+
+let file = File::open(path).unwrap();
+let options = ArrowReaderOptions::default()
+.with_file_decryption_properties(decryption_properties.clo
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2764799620 Your latest comments should all be addressed now thanks @alamb. I've also just added one extra test to cover being able to use enencrypted row group metadata when writing a file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2020280114
##
parquet/src/column/page_encryption.rs:
##
@@ -0,0 +1,119 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::column::page::CompressedPage;
+use crate::encryption::ciphers::BlockEncryptor;
+use crate::encryption::encrypt::{encrypt_object, FileEncryptor};
+use crate::encryption::modules::{create_module_aad, ModuleType};
+use crate::errors::ParquetError;
+use crate::errors::Result;
+use crate::format::PageHeader;
+use crate::format::PageType;
+use bytes::Bytes;
+use std::io::Write;
+use std::sync::Arc;
+
+#[derive(Debug)]
+/// Encrypts page headers and page data for columns
+pub struct PageEncryptor {
Review Comment:
Yes these are in a `pub(crate)` module so are crate private, I've changed
this to be more explicit.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2020258059
##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -727,85 +819,173 @@ pub fn get_column_writers(
) -> Result> {
let mut writers = Vec::with_capacity(arrow.fields.len());
let mut leaves = parquet.columns().iter();
+let column_factory = ArrowColumnWriterFactory::new();
for field in &arrow.fields {
-get_arrow_column_writer(field.data_type(), props, &mut leaves, &mut
writers)?;
+column_factory.get_arrow_column_writer(
+field.data_type(),
+props,
+&mut leaves,
+&mut writers,
+)?;
}
Ok(writers)
}
-/// Gets the [`ArrowColumnWriter`] for the given `data_type`
-fn get_arrow_column_writer(
-data_type: &ArrowDataType,
+/// Returns the [`ArrowColumnWriter`] for a given schema and supports columnar
encryption
+#[cfg(feature = "encryption")]
+fn get_column_writers_with_encryptor(
Review Comment:
:+1: I've made https://github.com/apache/arrow-rs/issues/7359 for this
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2762417125 I ran the performance benchmarks and I do not see any changes for parquet writing with this PR (though the numbers do vary somewhat). Nice work. Details below Details ``` ++ critcmp main encryption-basics-fork group encryption-basics-fork main - -- write_batch nested/4096 values primitive list 1.00 6.5±0.30ms 325.6 MB/sec1.03 6.7±0.26ms 316.1 MB/sec write_batch nested/4096 values primitive list non-null 1.00 8.2±0.37ms 260.8 MB/sec1.00 8.2±0.29ms 260.1 MB/sec write_batch primitive/4096 values bool 1.01 1022.2±53.98µs 1062.3 KB/sec1.00 1013.0±40.47µs 1072.0 KB/sec write_batch primitive/4096 values bool non-null 1.00 934.1±27.14µs 627.3 KB/sec1.03 960.6±34.00µs 610.0 KB/sec write_batch primitive/4096 values float with NaNs 1.01 1824.6±56.19µs30.1 MB/sec1.00 1809.3±31.57µs30.4 MB/sec write_batch primitive/4096 values primitive 1.04 2.9±0.24ms60.4 MB/sec1.00 2.8±0.07ms62.6 MB/sec write_batch primitive/4096 values primitive non-null 1.06 2.9±0.25ms60.5 MB/sec1.00 2.7±0.13ms64.0 MB/sec write_batch primitive/4096 values primitive non-null with bloom filter 1.05 22.5±2.68ms 7.7 MB/sec1.00 21.5±0.36ms 8.0 MB/sec write_batch primitive/4096 values primitive with bloom filter 1.05 22.4±3.68ms 7.8 MB/sec1.00 21.4±0.88ms 8.2 MB/sec write_batch primitive/4096 values string 1.00 1746.6±55.52µs72.3 MB/sec1.01 1757.4±46.59µs71.8 MB/sec write_batch primitive/4096 values string dictionary 1.02 4.3±0.40ms 241.6 MB/sec1.00 4.2±0.18ms 247.1 MB/sec write_batch primitive/4096 values string dictionary with bloom filter 1.02 5.8±0.33ms 177.5 MB/sec1.00 5.7±0.27ms 180.3 MB/sec write_batch primitive/4096 values string non-null 1.02 6.9±0.24ms 295.6 MB/sec1.00 6.8±0.25ms 302.8 MB/sec write_batch primitive/4096 values string non-null with bloom filter 1.00 12.9±0.34ms 159.3 MB/sec1.00 12.9±0.36ms 158.6 MB/sec write_batch primitive/4096 values string with bloom filter 1.00 11.1±0.63ms 184.1 MB/sec1.01 11.2±0.37ms 182.3 MB/sec ``` Second run ``` ++ critcmp main encryption-basics-fork group encryption-basics-fork main - -- write_batch nested/4096 values primitive list 1.01 6.9±0.29ms 309.3 MB/sec1.00 6.8±0.27ms 313.5 MB/sec write_batch nested/4096 values primitive list non-null 1.00 8.5±0.41ms 251.6 MB/sec1.04 8.8±0.51ms 241.4 MB/sec write_batch primitive/4096 values bool 1.00 1012.3±34.35µs 1072.7 KB/sec1.03 1044.5±39.21µs 1039.6 KB/sec write_batch primitive/4096 values bool non-null 1.00 945.6±33.77µs 619.6 KB/sec1.02 962.4±43.08µs 608.8 KB/sec write_batch primitive/4096 values float with NaNs 1.01 1876.1±54.31µs29.3 MB/sec1.00 1861.8±63.39µs29.5 MB/sec write_batch primitive/4096 values primitive 1.03 3.0±0.13ms58.5 MB/sec1.00 2.9±0.12ms60.4 MB/sec write_batch primitive/4096 values primitive non-null 1.04 3.0±0.75ms57.7 MB/sec1.00 2.9±0.60ms60.0 MB/sec write_batch primitive/4096 values primitive non-null with bloom filter 1.00 21.5±0.50ms 8.0 MB/sec1.00 21.5±0.65ms 8.0 MB/sec write_batch primitive/4096 values primitive with bloom filter 1.00 21.4±0.85ms 8.2 MB/sec1.00 21.4±0.96ms 8.2 MB/sec write_batch primitive/4096 values string 1.00 1790.4±68.70µs70.5 MB/sec1.02 1827.0±87.24µs69.1 MB/sec write_batch primitive/4096 values string dictionary 1.00 4.6±0.28ms 225.3 MB/sec1.06 4.8±0.30ms 213.4 MB/sec write_batch primitive/4096 values string dictionary with bloom filter 1.00 6.2±0.31ms 166.9 MB/sec
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2019023428
##
parquet/src/column/page_encryption.rs:
##
@@ -0,0 +1,119 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::column::page::CompressedPage;
+use crate::encryption::ciphers::BlockEncryptor;
+use crate::encryption::encrypt::{encrypt_object, FileEncryptor};
+use crate::encryption::modules::{create_module_aad, ModuleType};
+use crate::errors::ParquetError;
+use crate::errors::Result;
+use crate::format::PageHeader;
+use crate::format::PageType;
+use bytes::Bytes;
+use std::io::Write;
+use std::sync::Arc;
+
+#[derive(Debug)]
+/// Encrypts page headers and page data for columns
+pub struct PageEncryptor {
Review Comment:
I am pretty sure this is crate private -- could we mark it explicitly like
that?
```suggestion
pub (crate) struct PageEncryptor {
```
##
parquet/src/encryption/decrypt.rs:
##
@@ -324,6 +346,14 @@ impl DecryptionPropertiesBuilder {
.insert(column_name.to_string(), decryption_key);
self
}
+
+/// Specify multiple column decryption keys
+pub fn with_column_keys(mut self, column_names: Vec<&str>, keys:
Vec>) -> Self {
+for (column_name, key) in
column_names.into_iter().zip(keys.into_iter()) {
Review Comment:
it might be worth checking that the column_names and keys have the same
length (`zip` silently ignores stuff if one side has more than the other)
##
parquet/src/file/writer.rs:
##
@@ -523,12 +601,24 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W>
{
) -> Result,
{
self.assert_previous_writer_closed()?;
+
+let encryptor_context = self.get_page_encryptor_context();
+
Ok(match self.next_column_desc() {
Some(column) => {
let props = self.props.clone();
let (buf, on_close) = self.get_on_close();
-let page_writer = Box::new(SerializedPageWriter::new(buf));
-Some(factory(column, props, page_writer, Box::new(on_close))?)
+
+let page_writer = SerializedPageWriter::new(buf);
Review Comment:
thank you -- I think this looks very reasonable now
##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -727,85 +819,173 @@ pub fn get_column_writers(
) -> Result> {
let mut writers = Vec::with_capacity(arrow.fields.len());
let mut leaves = parquet.columns().iter();
+let column_factory = ArrowColumnWriterFactory::new();
for field in &arrow.fields {
-get_arrow_column_writer(field.data_type(), props, &mut leaves, &mut
writers)?;
+column_factory.get_arrow_column_writer(
+field.data_type(),
+props,
+&mut leaves,
+&mut writers,
+)?;
}
Ok(writers)
}
-/// Gets the [`ArrowColumnWriter`] for the given `data_type`
-fn get_arrow_column_writer(
-data_type: &ArrowDataType,
+/// Returns the [`ArrowColumnWriter`] for a given schema and supports columnar
encryption
+#[cfg(feature = "encryption")]
+fn get_column_writers_with_encryptor(
Review Comment:
I think we should file a follow on ticket for a feature request of "Support
writing encrypted files with multiple threads" or something like that
I agree it is ok as a follow on
##
parquet/src/encryption/decrypt.rs:
##
@@ -324,6 +346,14 @@ impl DecryptionPropertiesBuilder {
.insert(column_name.to_string(), decryption_key);
self
}
+
+/// Specify multiple column decryption keys
+pub fn with_column_keys(mut self, column_names: Vec<&str>, keys:
Vec>) -> Self {
+for (column_name, key) in
column_names.into_iter().zip(keys.into_iter()) {
Review Comment:
```suggestion
if column_names.len() != keys.len() {
... error
}
for (column_name, key) in
column_names.into_iter().zip(keys.into_iter()) {
```
##
parquet/src/encryption/encrypt.rs:
##
@@ -0,0 +1,368 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE f
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
etseidl commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-276525 Thanks @rok! I'll try to get a start on a review this weekend, and hopefully get it done early next week. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2749282835 BTW I wanted to thank you all for taking on implementing this (large) feature -- it is one of the last remaining major feature gaps in the Rust parquet implemention in my mind -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2017690250
##
parquet/src/file/metadata/writer.rs:
##
@@ -133,17 +190,53 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
// Write file metadata
let start_pos = self.buf.bytes_written();
-{
-let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
-file_metadata.write_to_out_protocol(&mut protocol)?;
+match self.file_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(file_encryptor) if
file_encryptor.properties().encrypt_footer() => {
+// First write FileCryptoMetadata
+let crypto_metadata =
Self::file_crypto_metadata(file_encryptor)?;
+let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+crypto_metadata.write_to_out_protocol(&mut protocol)?;
+
+// Then write encrypted footer
+let aad = create_footer_aad(file_encryptor.file_aad())?;
+let mut encryptor = file_encryptor.get_footer_encryptor()?;
+encrypt_object(&file_metadata, &mut encryptor, &mut self.buf,
&aad)?;
+}
+_ => {
+let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+file_metadata.write_to_out_protocol(&mut protocol)?;
+}
}
+
let end_pos = self.buf.bytes_written();
// Write footer
let metadata_len = (end_pos - start_pos) as u32;
self.buf.write_all(&metadata_len.to_le_bytes())?;
-self.buf.write_all(&PARQUET_MAGIC)?;
+#[cfg(feature = "encryption")]
+let magic = get_file_magic(
+self.file_encryptor
+.as_ref()
+.map(|encryptor| encryptor.properties()),
+);
+#[cfg(not(feature = "encryption"))]
+let magic = get_file_magic();
+self.buf.write_all(magic)?;
+
+// TODO: Discuss what to do here with community
+// The argument for returning unencrypted rowgroups is that any users
working with
+// the returned FIleMetaData have no easy way to decrypt it and hence
expect an
+// unencrypted structure be returned. So, the argument here would be
backward compatibility.
+// Return unencrypted row_group for use in program
+// E.g. when collecting statistics.
+// Related to this see: https://github.com/apache/arrow-rs/issues/7254
Review Comment:
After discussing this some more with @corwinjoy, we'd prefer to keep this as
is rather than making a big breaking change like returning `ParquetMetaData`
instead, and don't think this workaround is too objectionable.
If others are happy with this I'll just update this comment and remove the
TODO.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2759518123 Yes I think this is ready for review. There are a couple of comments I'd like some input on (https://github.com/apache/arrow-rs/pull/7111/files#r2015196618 and https://github.com/apache/arrow-rs/pull/7111/files#r2009421835) but otherwise it's looking good to me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2759435510 How is this PR doing, btw? Is it ready for another review? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2759435126 > But that might be quite a large and painful breaking change, and also might come with a performance hit... We would have to try it I suspect to test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2756055877 > I came up with https://github.com/apache/arrow-rs/pull/7337 which I think is easier to work with Thanks, this example helped. I took another pass at refactoring `ThriftMetadataWriter` and this is much nicer now. > Since this is a major release, we can make API changes too if needed to support modular encryption. The one thing that comes to mind is whether we want to change `AsyncArrowWriter::finish` to return `parquet::file::ParquetMetadata` instead of `parquet::format::FileMetaData`. This was suggested in #7254 although closed without comment, but that would probably avoid the problem of writers not being able to use the returned metadata to retrieve statistics due to them being encrypted (https://github.com/apache/arrow-rs/pull/7111/files#r2009421835). But that might be quite a large and painful breaking change, and also might come with a performance hit... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
corwinjoy commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2756149153 @alamb wrote: > Since this is a major release, we can make API changes too if needed to support modular encryption. > One minor API change that I would like to have considered is this related PR that I feel makes decryption usage better. https://github.com/apache/arrow-rs/pull/7342 At the suggestion of @adamreeve I have posted this on its own PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2015196618
##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -727,85 +819,173 @@ pub fn get_column_writers(
) -> Result> {
let mut writers = Vec::with_capacity(arrow.fields.len());
let mut leaves = parquet.columns().iter();
+let column_factory = ArrowColumnWriterFactory::new();
for field in &arrow.fields {
-get_arrow_column_writer(field.data_type(), props, &mut leaves, &mut
writers)?;
+column_factory.get_arrow_column_writer(
+field.data_type(),
+props,
+&mut leaves,
+&mut writers,
+)?;
}
Ok(writers)
}
-/// Gets the [`ArrowColumnWriter`] for the given `data_type`
-fn get_arrow_column_writer(
-data_type: &ArrowDataType,
+/// Returns the [`ArrowColumnWriter`] for a given schema and supports columnar
encryption
+#[cfg(feature = "encryption")]
+fn get_column_writers_with_encryptor(
Review Comment:
`get_column_writers` above is a `pub fn` to allow users to have low-level
control over column writing, eg. see docs at
https://github.com/apache/arrow-rs/blob/4e9e1570ef28c160492891539bbc9649ec069a53/parquet/src/arrow/arrow_writer/mod.rs#L606
That method won't work with encryption though.
This new method is private as `FileEncryptor` is only `pub(crate)`, but
maybe we should make this pub and only require pub types in order to support
the same use case? Or modify `get_column_writers` to take a `row_group_idx` and
`&SerializedFileWriter` if we can make breaking changes?
If going with the first approach, that could always be done later as a
non-breaking change.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2015140880 ## parquet/src/util/never.rs: ## @@ -0,0 +1,21 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// Type that can never be instantiated, to make working with features more ergonomic. +// An Option containing this type has zero size. +#[cfg(not(feature = "encryption"))] +pub type Never = core::convert::Infallible; Review Comment: Thanks, this is deleted now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2755546366 > I am going to spend some time seeing if I can get it more isolated and hopefully that might serve as an example of how to do it for writing. Update: - I came up with https://github.com/apache/arrow-rs/pull/7337 which I think is easier to work with -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2755178641 > I've tried to reduce the use of #[cfg(feature = "encryption")] within function bodies (https://github.com/apache/arrow-rs/commit/78ac7b25504ea0d23d1d612f0cb52e823c012bd5), but have been pretty consistently hampered by borrow checker issues. Thanks for trying @adamreeve - I am struggling to get the filter pushdown in https://github.com/apache/arrow-rs/pull/6921 working well given how the encryption code in the reader is injected. I am going to spend some time seeing if I can get it more isolated and hopefully that might serve as an example of how to do it for writing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2755064197 BTW I am hoping we can get this PR ready and merged for a release next week. Specifically this one. - https://github.com/apache/arrow-rs/issues/7084 Since this is a major release, we can make API changes too if needed to support modular encryption. I hope to have time to help if needed later this week or early next week -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2753181854 I've tried to reduce the use of `#[cfg(feature = "encryption")]` within function bodies (https://github.com/apache/arrow-rs/pull/7111/commits/78ac7b25504ea0d23d1d612f0cb52e823c012bd5), but have been pretty consistently hampered by borrow checker issues. The borrow checker has no insight into which fields a function accesses and assumes all of self is borrowed by an `&self` function, so it's not possible to extract logic out to functions when there's a mutable reference to any field of `self` present. For example, in `arrow_writer/mod.rs`, we currently have this code: https://github.com/apache/arrow-rs/blob/78ac7b25504ea0d23d1d612f0cb52e823c012bd5/parquet/src/arrow/arrow_writer/mod.rs#L266-L288 The contents of the `x` branch of the match statement can't be extracted into a method, because a mutable reference to `self.in_progress` is live. This is the same reason why getting `row_group_index` has to come before the reference to `self.in_progress` is taken. I ran into a similar problem in `file/metadata/writer.rs` here: https://github.com/apache/arrow-rs/blob/78ac7b25504ea0d23d1d612f0cb52e823c012bd5/parquet/src/file/metadata/writer.rs#L73 I wanted to extract a `write_offset_index` method with different implementations depending on whether encryption is enabled, but this isn't possible due to there being a mutable reference to `self.row_groups`. I could instead have different implementations of the whole `write_offset_indexes` function, but that would lead to quite a bit of code duplication. I think the current state is a bit better than it was before, but any thoughts on how to work around this and improve the code readability further would be appreciated! `ThriftMetadataWriter::finish` is probably the ugliest part of the diff at the moment, mostly because of a workaround we have to return unencrypted row group metadata so that the statistics are usable by consumers, and we'd also appreciate any thoughts on that (https://github.com/apache/arrow-rs/pull/7111/files#r2009421835). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2003517486
##
parquet/src/file/writer.rs:
##
@@ -171,7 +183,30 @@ impl SerializedFileWriter {
/// Creates new file writer.
pub fn new(buf: W, schema: TypePtr, properties: WriterPropertiesPtr) ->
Result {
let mut buf = TrackedWrite::new(buf);
-Self::start_file(&mut buf)?;
+
+#[cfg(feature = "encryption")]
+let file_encryptor = match
properties.file_encryption_properties.as_ref() {
+None => None,
+Some(encryption_props) =>
Some(Arc::new(FileEncryptor::new(encryption_props.clone())?)),
+};
+
+#[cfg(feature = "encryption")]
+if properties.file_encryption_properties.is_some() {
+properties
+.file_encryption_properties
+.clone()
+.unwrap()
+
.validate_encrypted_column_names(SchemaDescriptor::new(schema.clone()))?;
+}
Review Comment:
Do you mean:
```rust
Ok(Self {
buf,
schema: schema.clone(),
descr: Arc::new(schema_descriptor),
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2011101642
##
parquet/src/file/writer.rs:
##
@@ -699,34 +803,61 @@ impl<'a> SerializedColumnWriter<'a> {
/// `SerializedPageWriter` should not be used after calling `close()`.
pub struct SerializedPageWriter<'a, W: Write> {
sink: &'a mut TrackedWrite,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
Review Comment:
Oh I can just change the method to return both the page encryptor reference
and a mutable reference to `self.sink`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2011029754
##
parquet/src/file/writer.rs:
##
@@ -699,34 +803,61 @@ impl<'a> SerializedColumnWriter<'a> {
/// `SerializedPageWriter` should not be used after calling `close()`.
pub struct SerializedPageWriter<'a, W: Write> {
sink: &'a mut TrackedWrite,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
Review Comment:
Hmm this is looking promising but I've hit a bit of a wall due to the
compiler complaining about multiple mutable borrows of self.sink
([here](https://github.com/rok/arrow-rs/blob/9e79d75b3e15e45423b6f09225ab22fae11245e6/parquet/src/file/writer.rs#L836)),
as it doesn't know that `page_encryptor_mut` only allows mutating the page
encryptor.
I might need to put the `BlockEncryptor` into a `RefCell` or `Mutex` so that
the `PageEncryptor` methods can take `&self` instead of `&mut self`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2011211923
##
parquet/src/file/writer.rs:
##
@@ -523,12 +595,37 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W>
{
) -> Result,
{
self.assert_previous_writer_closed()?;
+
+#[cfg(feature = "encryption")]
Review Comment:
I've tidied this particular method up. It was a bit awkward due to
`get_on_close` returning a mutable reference from self so we can't call any
other methods that take `&self` after this, but I think it has ended up
cleaner. I'll continue trying to tidy up uses of `FileEncryptor` too.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2010826146
##
parquet/README.md:
##
@@ -84,7 +84,7 @@ The `parquet` crate provides the following features which may
be enabled in your
- [ ] Row record writer
- [x] Arrow record writer
- [x] Async support
- - [ ] Encrypted files
+ - [x] Encrypted files
Review Comment:
🎉
##
parquet/src/file/writer.rs:
##
@@ -699,34 +803,61 @@ impl<'a> SerializedColumnWriter<'a> {
/// `SerializedPageWriter` should not be used after calling `close()`.
pub struct SerializedPageWriter<'a, W: Write> {
sink: &'a mut TrackedWrite,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
Review Comment:
Instead of adding a special field, what if you put access to
`page_encryptor` behind a feature flag
```rust
#[cfg(feature = "encryption")]
fn page_encryptor_mut(&mut self) -> Option<&PageEncryptor> {
self.page_encryptor.as_mut()
}
#[cfg(not(feature = "encryption"))]
fn page_encryptor_mut(&mut self) -> Option<&PageEncryptor> {
None
}
```
Perhaps what you could do is then define two `PageEncryptor`s:
1. When `encryption` is enabled, what is currently in this PR
2. When `encryption` is not enabled, then define a `PageEncryptor ` that has
the same interface but function calls that do nothing
##
parquet/src/file/writer.rs:
##
@@ -523,12 +595,37 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W>
{
) -> Result,
{
self.assert_previous_writer_closed()?;
+
+#[cfg(feature = "encryption")]
Review Comment:
If at all possible, I would prefer that all the feature gated code is
isolated to functions / structures that have different implementations when the
feature is on/off
Sprinking the `cfg` in this code I think makes it significantly harder to
read and this is some of the most complicated / performance sensitive code of
the parquet crate as it is
##
parquet/src/util/never.rs:
##
@@ -0,0 +1,21 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Type that can never be instantiated, to make working with features more
ergonomic.
+// An Option containing this type has zero size.
+#[cfg(not(feature = "encryption"))]
+pub type Never = core::convert::Infallible;
Review Comment:
I left an alternate suggestion above about how to handle the feature flags
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2010846626
##
parquet/src/file/writer.rs:
##
@@ -699,34 +803,61 @@ impl<'a> SerializedColumnWriter<'a> {
/// `SerializedPageWriter` should not be used after calling `close()`.
pub struct SerializedPageWriter<'a, W: Write> {
sink: &'a mut TrackedWrite,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
Review Comment:
I tried something a bit like this in
https://github.com/apache/arrow-rs/pull/7111#discussion_r1992613117 (see
https://github.com/adamreeve/arrow-rs/commit/c94ddd07f95ff09076019e8854f4d9f1f065683a),
but ran into an issue with my trait not being dyn-compatible. But making this
a compile-time switch rather than a dynamic trait object would avoid that
problem, I'll give this a go thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2009421835
##
parquet/src/file/metadata/writer.rs:
##
@@ -133,17 +190,53 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
// Write file metadata
let start_pos = self.buf.bytes_written();
-{
-let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
-file_metadata.write_to_out_protocol(&mut protocol)?;
+match self.file_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(file_encryptor) if
file_encryptor.properties().encrypt_footer() => {
+// First write FileCryptoMetadata
+let crypto_metadata =
Self::file_crypto_metadata(file_encryptor)?;
+let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+crypto_metadata.write_to_out_protocol(&mut protocol)?;
+
+// Then write encrypted footer
+let aad = create_footer_aad(file_encryptor.file_aad())?;
+let mut encryptor = file_encryptor.get_footer_encryptor()?;
+encrypt_object(&file_metadata, &mut encryptor, &mut self.buf,
&aad)?;
+}
+_ => {
+let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+file_metadata.write_to_out_protocol(&mut protocol)?;
+}
}
+
let end_pos = self.buf.bytes_written();
// Write footer
let metadata_len = (end_pos - start_pos) as u32;
self.buf.write_all(&metadata_len.to_le_bytes())?;
-self.buf.write_all(&PARQUET_MAGIC)?;
+#[cfg(feature = "encryption")]
+let magic = get_file_magic(
+self.file_encryptor
+.as_ref()
+.map(|encryptor| encryptor.properties()),
+);
+#[cfg(not(feature = "encryption"))]
+let magic = get_file_magic();
+self.buf.write_all(magic)?;
+
+// TODO: Discuss what to do here with community
+// The argument for returning unencrypted rowgroups is that any users
working with
+// the returned FIleMetaData have no easy way to decrypt it and hence
expect an
+// unencrypted structure be returned. So, the argument here would be
backward compatibility.
+// Return unencrypted row_group for use in program
+// E.g. when collecting statistics.
+// Related to this see: https://github.com/apache/arrow-rs/issues/7254
Review Comment:
This is an issue we ran into when trying to integrate these changes into
delta-rs, which reads the row group metadata after writing files to get column
statistics. Making clients decrypt the metadata in this scenario doesn't seem
reasonable.
The suggestion to return `ParquetMetaData` instead of the Thrift
`FileMetaData` in #7254 seems relevant here, if that was done then that might
also solve this problem, depending on how it was implemented.
We'd appreciate some feedback on this area and if there's a better way to
handle this than our current solution of copying and restoring the unencrypted
metadata after the file is written.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2009250354
##
parquet/tests/encryption/encryption.rs:
##
@@ -0,0 +1,609 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains tests for reading encrypted Parquet files with the
Arrow API
+
+use crate::encryption_util::read_and_roundtrip_to_encrypted_file;
+use crate::encryption_util::verify_encryption_test_data;
+use arrow::array::*;
+use arrow::error::Result as ArrowResult;
+use arrow_array::{Int32Array, RecordBatch};
+use arrow_schema::{DataType as ArrowDataType, DataType, Field, Schema};
+use parquet::arrow::arrow_reader::{
+ArrowReaderMetadata, ArrowReaderOptions, ParquetRecordBatchReaderBuilder,
+};
+use parquet::arrow::ArrowWriter;
+use parquet::data_type::{ByteArray, ByteArrayType};
+use parquet::encryption::decrypt::FileDecryptionProperties;
+use parquet::encryption::encrypt::FileEncryptionProperties;
+use parquet::file::metadata::ParquetMetaData;
+use parquet::file::properties::WriterProperties;
+use parquet::file::writer::SerializedFileWriter;
+use parquet::schema::parser::parse_message_type;
+use std::fs::File;
+use std::sync::Arc;
+
+#[test]
+fn test_non_uniform_encryption_plaintext_footer() {
+let test_data = arrow::util::test_util::parquet_test_data();
+let path =
format!("{test_data}/encrypt_columns_plaintext_footer.parquet.encrypted");
+let file = File::open(path).unwrap();
+
+// There is always a footer key even with a plaintext footer,
+// but this is used for signing the footer.
+let footer_key = "0123456789012345".as_bytes(); // 128bit/16
+let column_1_key = "1234567890123450".as_bytes();
+let column_2_key = "1234567890123451".as_bytes();
+
+let decryption_properties =
FileDecryptionProperties::builder(footer_key.to_vec())
+.with_column_key("double_field", column_1_key.to_vec())
+.with_column_key("float_field", column_2_key.to_vec())
+.build()
+.unwrap();
+
+verify_encryption_test_file_read(file, decryption_properties);
+}
+
+#[test]
+fn test_non_uniform_encryption_disabled_aad_storage() {
+let test_data = arrow::util::test_util::parquet_test_data();
+let path =
+
format!("{test_data}/encrypt_columns_and_footer_disable_aad_storage.parquet.encrypted");
+let file = File::open(path.clone()).unwrap();
+
+let footer_key = b"0123456789012345".to_vec(); // 128bit/16
+let column_1_key = b"1234567890123450".to_vec();
+let column_2_key = b"1234567890123451".to_vec();
+
+// Can read successfully when providing the correct AAD prefix
+let decryption_properties =
FileDecryptionProperties::builder(footer_key.clone())
+.with_column_key("double_field", column_1_key.clone())
+.with_column_key("float_field", column_2_key.clone())
+.with_aad_prefix(b"tester".to_vec())
+.build()
+.unwrap();
+
+verify_encryption_test_file_read(file, decryption_properties);
+
+// Using wrong AAD prefix should fail
+let decryption_properties =
FileDecryptionProperties::builder(footer_key.clone())
+.with_column_key("double_field", column_1_key.clone())
+.with_column_key("float_field", column_2_key.clone())
+.with_aad_prefix(b"wrong_aad_prefix".to_vec())
+.build()
+.unwrap();
+
+let file = File::open(path.clone()).unwrap();
+let options = ArrowReaderOptions::default()
+.with_file_decryption_properties(decryption_properties.clone());
+let result = ArrowReaderMetadata::load(&file, options.clone());
+assert!(result.is_err());
+assert_eq!(
+result.unwrap_err().to_string(),
+"Parquet error: Provided footer key and AAD were unable to decrypt
parquet footer"
+);
+
+// Not providing any AAD prefix should fail as it isn't stored in the file
+let decryption_properties = FileDecryptionProperties::builder(footer_key)
+.with_column_key("double_field", column_1_key)
+.with_column_key("float_field", column_2_key)
+.build()
+.unwrap();
+
+let file = File::open(path).unwrap();
+let options = ArrowReaderOptions::default()
+.with_file_
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2009272762
##
parquet/tests/encryption/encryption.rs:
##
@@ -0,0 +1,609 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains tests for reading encrypted Parquet files with the
Arrow API
+
+use crate::encryption_util::read_and_roundtrip_to_encrypted_file;
+use crate::encryption_util::verify_encryption_test_data;
+use arrow::array::*;
+use arrow::error::Result as ArrowResult;
+use arrow_array::{Int32Array, RecordBatch};
+use arrow_schema::{DataType as ArrowDataType, DataType, Field, Schema};
+use parquet::arrow::arrow_reader::{
+ArrowReaderMetadata, ArrowReaderOptions, ParquetRecordBatchReaderBuilder,
+};
+use parquet::arrow::ArrowWriter;
+use parquet::data_type::{ByteArray, ByteArrayType};
+use parquet::encryption::decrypt::FileDecryptionProperties;
+use parquet::encryption::encrypt::FileEncryptionProperties;
+use parquet::file::metadata::ParquetMetaData;
+use parquet::file::properties::WriterProperties;
+use parquet::file::writer::SerializedFileWriter;
+use parquet::schema::parser::parse_message_type;
+use std::fs::File;
+use std::sync::Arc;
+
+#[test]
+fn test_non_uniform_encryption_plaintext_footer() {
+let test_data = arrow::util::test_util::parquet_test_data();
+let path =
format!("{test_data}/encrypt_columns_plaintext_footer.parquet.encrypted");
+let file = File::open(path).unwrap();
+
+// There is always a footer key even with a plaintext footer,
+// but this is used for signing the footer.
+let footer_key = "0123456789012345".as_bytes(); // 128bit/16
+let column_1_key = "1234567890123450".as_bytes();
+let column_2_key = "1234567890123451".as_bytes();
+
+let decryption_properties =
FileDecryptionProperties::builder(footer_key.to_vec())
+.with_column_key("double_field", column_1_key.to_vec())
+.with_column_key("float_field", column_2_key.to_vec())
+.build()
+.unwrap();
+
+verify_encryption_test_file_read(file, decryption_properties);
+}
+
+#[test]
+fn test_non_uniform_encryption_disabled_aad_storage() {
+let test_data = arrow::util::test_util::parquet_test_data();
+let path =
+
format!("{test_data}/encrypt_columns_and_footer_disable_aad_storage.parquet.encrypted");
+let file = File::open(path.clone()).unwrap();
+
+let footer_key = b"0123456789012345".to_vec(); // 128bit/16
+let column_1_key = b"1234567890123450".to_vec();
+let column_2_key = b"1234567890123451".to_vec();
+
+// Can read successfully when providing the correct AAD prefix
+let decryption_properties =
FileDecryptionProperties::builder(footer_key.clone())
+.with_column_key("double_field", column_1_key.clone())
+.with_column_key("float_field", column_2_key.clone())
+.with_aad_prefix(b"tester".to_vec())
+.build()
+.unwrap();
+
+verify_encryption_test_file_read(file, decryption_properties);
+
+// Using wrong AAD prefix should fail
+let decryption_properties =
FileDecryptionProperties::builder(footer_key.clone())
+.with_column_key("double_field", column_1_key.clone())
+.with_column_key("float_field", column_2_key.clone())
+.with_aad_prefix(b"wrong_aad_prefix".to_vec())
+.build()
+.unwrap();
+
+let file = File::open(path.clone()).unwrap();
+let options = ArrowReaderOptions::default()
+.with_file_decryption_properties(decryption_properties.clone());
+let result = ArrowReaderMetadata::load(&file, options.clone());
+assert!(result.is_err());
+assert_eq!(
+result.unwrap_err().to_string(),
+"Parquet error: Provided footer key and AAD were unable to decrypt
parquet footer"
+);
+
+// Not providing any AAD prefix should fail as it isn't stored in the file
+let decryption_properties = FileDecryptionProperties::builder(footer_key)
+.with_column_key("double_field", column_1_key)
+.with_column_key("float_field", column_2_key)
+.build()
+.unwrap();
+
+let file = File::open(path).unwrap();
+let options = ArrowReaderOptions::default()
+.with_file_
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2006626887
##
parquet/src/encryption/encrypt.rs:
##
@@ -0,0 +1,343 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::encryption::ciphers::{BlockEncryptor, RingGcmBlockEncryptor};
+use crate::errors::{ParquetError, Result};
+use crate::file::column_crypto_metadata::{ColumnCryptoMetaData,
EncryptionWithColumnKey};
+use crate::schema::types::{ColumnDescPtr, SchemaDescriptor};
+use crate::thrift::TSerializable;
+use ring::rand::{SecureRandom, SystemRandom};
+use std::collections::{HashMap, HashSet};
+use std::io::Write;
+use thrift::protocol::TCompactOutputProtocol;
+
+#[derive(Debug, Clone, PartialEq)]
+struct EncryptionKey {
+key: Vec,
+key_metadata: Option>,
+}
+
+impl EncryptionKey {
+fn new(key: Vec) -> EncryptionKey {
+Self {
+key,
+key_metadata: None,
+}
+}
+
+fn with_metadata(mut self, metadata: Vec) -> Self {
+self.key_metadata = Some(metadata);
+self
+}
+
+fn key(&self) -> &Vec {
+&self.key
+}
+}
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct FileEncryptionProperties {
+encrypt_footer: bool,
+footer_key: EncryptionKey,
+column_keys: HashMap,
+aad_prefix: Option>,
+store_aad_prefix: bool,
+}
+
+impl FileEncryptionProperties {
+/// Create a new builder for encryption properties
+pub fn builder(footer_key: Vec) -> EncryptionPropertiesBuilder {
+EncryptionPropertiesBuilder::new(footer_key)
+}
+
+/// Should the footer be encrypted
+pub fn encrypt_footer(&self) -> bool {
+self.encrypt_footer
+}
+
+/// Retrieval metadata of key used for encryption of footer and (possibly)
columns
+pub fn footer_key_metadata(&self) -> Option<&Vec> {
+self.footer_key.key_metadata.as_ref()
+}
+
+/// AAD prefix string uniquely identifies the file and prevents file
swapping
+pub fn aad_prefix(&self) -> Option<&Vec> {
+self.aad_prefix.as_ref()
+}
+
+/// Should the AAD prefix be stored in the file
+pub fn store_aad_prefix(&self) -> bool {
+self.store_aad_prefix && self.aad_prefix.is_some()
+}
+
+/// Checks if columns that are to be encrypted are present in schema
+#[cfg(feature = "encryption")]
Review Comment:
```suggestion
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2744283131 @mbutrovich -- I think you are interested in this feature too. Do you have some time to help review this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2744285058 I have this on my list -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
rok commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2744103847 @alamb @ggershinsky @etseidl - I think this is ripe for review, both from encryption and "arrow" standpoint. I'm travelling from Monday on and have limited bandwidth but @adamreeve will probably be available to reply. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2007606880
##
parquet/tests/arrow_reader/encryption.rs:
##
@@ -201,3 +211,410 @@ fn verify_encryption_test_file_read(file: File,
decryption_properties: FileDecry
verify_encryption_test_data(record_batches, metadata);
}
+
+fn row_group_sizes(metadata: &ParquetMetaData) -> Vec {
+metadata.row_groups().iter().map(|x| x.num_rows()).collect()
+}
+
+#[test]
+fn test_uniform_encryption_roundtrip() {
Review Comment:
Done:
https://github.com/apache/arrow-rs/pull/7111/commits/ce8d2a91e2163791f88ac9286c62cb1a41ad2c86
Not sure if this is the best way to split tests, but it does remove a lot of
`#cfg` logic.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2004524736
##
parquet/src/file/metadata/writer.rs:
##
@@ -182,6 +258,95 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
self.key_value_metadata = Some(key_value_metadata);
self
}
+
+#[cfg(feature = "encryption")]
+pub fn with_file_encryptor(mut self, file_encryptor:
Option>) -> Self {
+self.file_encryptor = file_encryptor;
+self
+}
+
+#[cfg(feature = "encryption")]
+fn file_crypto_metadata(
+file_encryptor: &Arc,
+) -> Result {
+let properties = file_encryptor.properties();
+let supply_aad_prefix = properties
+.aad_prefix()
+.map(|_| !properties.store_aad_prefix());
+let encryption_algorithm = AesGcmV1 {
+aad_prefix: properties.aad_prefix().cloned(),
Review Comment:
How about:
https://github.com/apache/arrow-rs/blob/b01b2851be51dbe32af106d5fe78cd5bafa9aaa6/parquet/src/file/metadata/reader.rs#L735-L744
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2004728402
##
parquet/src/file/metadata/writer.rs:
##
@@ -182,6 +258,95 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
self.key_value_metadata = Some(key_value_metadata);
self
}
+
+#[cfg(feature = "encryption")]
+pub fn with_file_encryptor(mut self, file_encryptor:
Option>) -> Self {
+self.file_encryptor = file_encryptor;
+self
+}
+
+#[cfg(feature = "encryption")]
+fn file_crypto_metadata(
+file_encryptor: &Arc,
+) -> Result {
+let properties = file_encryptor.properties();
+let supply_aad_prefix = properties
+.aad_prefix()
+.map(|_| !properties.store_aad_prefix());
+let encryption_algorithm = AesGcmV1 {
+aad_prefix: properties.aad_prefix().cloned(),
Review Comment:
Done!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2004683314
##
parquet/tests/arrow_reader/encryption.rs:
##
@@ -201,3 +211,410 @@ fn verify_encryption_test_file_read(file: File,
decryption_properties: FileDecry
verify_encryption_test_data(record_batches, metadata);
}
+
+fn row_group_sizes(metadata: &ParquetMetaData) -> Vec {
+metadata.row_groups().iter().map(|x| x.num_rows()).collect()
+}
+
+#[test]
+fn test_uniform_encryption_roundtrip() {
Review Comment:
I'll try adding `parquet/tests/encryption` folder and move related things
there.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2004576722
##
parquet/src/file/metadata/writer.rs:
##
@@ -182,6 +258,95 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
self.key_value_metadata = Some(key_value_metadata);
self
}
+
+#[cfg(feature = "encryption")]
+pub fn with_file_encryptor(mut self, file_encryptor:
Option>) -> Self {
+self.file_encryptor = file_encryptor;
+self
+}
+
+#[cfg(feature = "encryption")]
+fn file_crypto_metadata(
+file_encryptor: &Arc,
+) -> Result {
+let properties = file_encryptor.properties();
+let supply_aad_prefix = properties
+.aad_prefix()
+.map(|_| !properties.store_aad_prefix());
+let encryption_algorithm = AesGcmV1 {
+aad_prefix: properties.aad_prefix().cloned(),
Review Comment:
:+1: that looks right to me, although you can avoid the `clone` by matching
on a reference:
```rust
let supply_aad_prefix = match &t_file_crypto_metadata.encryption_algorithm {
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2004525536
##
parquet/src/file/writer.rs:
##
@@ -171,7 +183,30 @@ impl SerializedFileWriter {
/// Creates new file writer.
pub fn new(buf: W, schema: TypePtr, properties: WriterPropertiesPtr) ->
Result {
let mut buf = TrackedWrite::new(buf);
-Self::start_file(&mut buf)?;
+
+#[cfg(feature = "encryption")]
+let file_encryptor = match
properties.file_encryption_properties.as_ref() {
+None => None,
+Some(encryption_props) =>
Some(Arc::new(FileEncryptor::new(encryption_props.clone())?)),
+};
+
+#[cfg(feature = "encryption")]
+if properties.file_encryption_properties.is_some() {
+properties
+.file_encryption_properties
+.clone()
+.unwrap()
+
.validate_encrypted_column_names(SchemaDescriptor::new(schema.clone()))?;
+}
Review Comment:
Good point about the schema. Changed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2004521719
##
parquet/src/encryption/ciphers.rs:
##
@@ -61,3 +64,118 @@ impl BlockDecryptor for RingGcmBlockDecryptor {
Ok(result)
}
}
+
+pub trait BlockEncryptor: Debug + Send + Sync {
+fn encrypt(&mut self, plaintext: &[u8], aad: &[u8]) -> Result>;
+}
+
+#[derive(Debug, Clone)]
+struct CounterNonce {
+start: u128,
+counter: u128,
+}
+
+impl CounterNonce {
+pub fn new(rng: &SystemRandom) -> Result {
+let mut buf = [0; 16];
+rng.fill(&mut buf)?;
+
+// Since this is a random seed value, endianness doesn't matter at all,
+// and we can use whatever is platform-native.
+let start = u128::from_ne_bytes(buf) & RIGHT_TWELVE;
+let counter = start.wrapping_add(1);
+
+Ok(Self { start, counter })
+}
+
+/// One accessor for the nonce bytes to avoid potentially flipping
endianness
+#[inline]
+pub fn get_bytes(&self) -> [u8; NONCE_LEN] {
+self.counter.to_le_bytes()[0..NONCE_LEN].try_into().unwrap()
+}
+}
+
+impl NonceSequence for CounterNonce {
+fn advance(&mut self) -> Result {
+// If we've wrapped around, we've exhausted this nonce sequence
+if (self.counter & RIGHT_TWELVE) == (self.start & RIGHT_TWELVE) {
+Err(ring::error::Unspecified)
+} else {
+// Otherwise, just advance and return the new value
+let buf: [u8; NONCE_LEN] = self.get_bytes();
+self.counter = self.counter.wrapping_add(1);
+Ok(ring::aead::Nonce::assume_unique_for_key(buf))
+}
+}
+}
+
+#[derive(Debug, Clone)]
+pub(crate) struct RingGcmBlockEncryptor {
+key: LessSafeKey,
+nonce_sequence: CounterNonce,
+}
+
+impl RingGcmBlockEncryptor {
+/// Create a new `RingGcmBlockEncryptor` with a given key and random nonce.
+/// The nonce will advance appropriately with each block encryption and
+/// return an error if it wraps around.
+pub(crate) fn new(key_bytes: &[u8]) -> Result {
+let rng = SystemRandom::new();
+
+// todo support other key sizes
+let key = UnboundKey::new(&AES_128_GCM, key_bytes)
+.map_err(|e| general_err!("Error creating AES key: {}", e))?;
+let nonce = CounterNonce::new(&rng)?;
+
+Ok(Self {
+key: LessSafeKey::new(key),
+nonce_sequence: nonce,
+})
+}
+}
+
+impl BlockEncryptor for RingGcmBlockEncryptor {
+fn encrypt(&mut self, plaintext: &[u8], aad: &[u8]) -> Result> {
+// Create encrypted buffer.
+// Format is: [ciphertext size, nonce, ciphertext, authentication tag]
+let ciphertext_length = NONCE_LEN + plaintext.len() + TAG_LEN;
+let mut ciphertext = Vec::with_capacity(SIZE_LEN + ciphertext_length);
+ciphertext.extend((ciphertext_length as u32).to_le_bytes());
Review Comment:
Fair point! I've changed this to:
```rust
let ciphertext_length : u32 = (NONCE_LEN + plaintext.len() +
TAG_LEN).try_into()
.map_err(|err|
General(format!("Plaintext data too long. {:?}", err)))?;
// Not checking for overflow here because we've already checked for
it with ciphertext_length
let mut ciphertext = Vec::with_capacity(SIZE_LEN + ciphertext_length
as usize);
ciphertext.extend((ciphertext_length).to_le_bytes());
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2003758011
##
parquet/src/file/writer.rs:
##
@@ -171,7 +183,30 @@ impl SerializedFileWriter {
/// Creates new file writer.
pub fn new(buf: W, schema: TypePtr, properties: WriterPropertiesPtr) ->
Result {
let mut buf = TrackedWrite::new(buf);
-Self::start_file(&mut buf)?;
+
+#[cfg(feature = "encryption")]
+let file_encryptor = match
properties.file_encryption_properties.as_ref() {
+None => None,
+Some(encryption_props) =>
Some(Arc::new(FileEncryptor::new(encryption_props.clone())?)),
+};
+
+#[cfg(feature = "encryption")]
+if properties.file_encryption_properties.is_some() {
+properties
+.file_encryption_properties
+.clone()
+.unwrap()
+
.validate_encrypted_column_names(SchemaDescriptor::new(schema.clone()))?;
+}
Review Comment:
Oh I see you added the `Arc::new(`. Yeah that's needed, although the
`schema.clone()` should be able to just be `schema`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2003708546
##
parquet/src/file/writer.rs:
##
@@ -171,7 +183,30 @@ impl SerializedFileWriter {
/// Creates new file writer.
pub fn new(buf: W, schema: TypePtr, properties: WriterPropertiesPtr) ->
Result {
let mut buf = TrackedWrite::new(buf);
-Self::start_file(&mut buf)?;
+
+#[cfg(feature = "encryption")]
+let file_encryptor = match
properties.file_encryption_properties.as_ref() {
+None => None,
+Some(encryption_props) =>
Some(Arc::new(FileEncryptor::new(encryption_props.clone())?)),
+};
+
+#[cfg(feature = "encryption")]
+if properties.file_encryption_properties.is_some() {
+properties
+.file_encryption_properties
+.clone()
+.unwrap()
+
.validate_encrypted_column_names(SchemaDescriptor::new(schema.clone()))?;
+}
Review Comment:
Yes that bit should be able to be changed if you create the
schema_descriptor above here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2003700114
##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
#[derive(Default)]
struct ArrowPageWriter {
buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) ->
Self {
+self.page_encryptor = page_encryptor;
+self
+}
}
impl PageWriter for ArrowPageWriter {
fn write_page(&mut self, page: CompressedPage) -> Result {
let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;
Review Comment:
Yeah, that's a nice cleanup, thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2002341357
##
parquet/src/file/metadata/writer.rs:
##
@@ -182,6 +258,95 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
self.key_value_metadata = Some(key_value_metadata);
self
}
+
+#[cfg(feature = "encryption")]
+pub fn with_file_encryptor(mut self, file_encryptor:
Option>) -> Self {
+self.file_encryptor = file_encryptor;
+self
+}
+
+#[cfg(feature = "encryption")]
+fn file_crypto_metadata(
+file_encryptor: &Arc,
+) -> Result {
+let properties = file_encryptor.properties();
+let supply_aad_prefix = properties
+.aad_prefix()
+.map(|_| !properties.store_aad_prefix());
+let encryption_algorithm = AesGcmV1 {
+aad_prefix: properties.aad_prefix().cloned(),
Review Comment:
We shouldn't set `aad_prefix` here unless `properties.store_aad_prefix()` is
true. (I think this is my mistake!)
We should also add a test that we can't read a file written with
`store_aad_prefix = false` unless the AAD prefix is passed.
When reading, I guess we could check `supply_aad_prefix` as an indication
that an AAD prefix is expected and raise a more helpful error if one isn't
supplied, but that's probably out of scope for this PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2002336070
##
parquet/src/encryption/encrypt.rs:
##
@@ -0,0 +1,339 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::encryption::ciphers::{BlockEncryptor, RingGcmBlockEncryptor};
+use crate::errors::{ParquetError, Result};
+use crate::file::column_crypto_metadata::{ColumnCryptoMetaData,
EncryptionWithColumnKey};
+use crate::schema::types::{ColumnDescPtr, SchemaDescriptor};
+use crate::thrift::TSerializable;
+use ring::rand::{SecureRandom, SystemRandom};
+use std::collections::{HashMap, HashSet};
+use std::io::Write;
+use thrift::protocol::TCompactOutputProtocol;
+
+#[derive(Debug, Clone, PartialEq)]
+struct EncryptionKey {
+key: Vec,
+key_metadata: Option>,
+}
+
+impl EncryptionKey {
+fn new(key: Vec) -> EncryptionKey {
+Self {
+key,
+key_metadata: None,
+}
+}
+
+fn with_metadata(mut self, metadata: Vec) -> Self {
+self.key_metadata = Some(metadata);
+self
+}
+
+fn key(&self) -> &Vec {
+&self.key
+}
+}
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct FileEncryptionProperties {
+encrypt_footer: bool,
+footer_key: EncryptionKey,
+column_keys: HashMap,
+aad_prefix: Option>,
+store_aad_prefix: bool,
+}
+
+impl FileEncryptionProperties {
+/// Create a new builder for encryption properties
+pub fn builder(footer_key: Vec) -> EncryptionPropertiesBuilder {
+EncryptionPropertiesBuilder::new(footer_key)
+}
+
+/// Should the footer be encrypted
+pub fn encrypt_footer(&self) -> bool {
+self.encrypt_footer
+}
+
+/// Retrieval metadata of key used for encryption of footer and (possibly)
columns
+pub fn footer_key_metadata(&self) -> Option<&Vec> {
+self.footer_key.key_metadata.as_ref()
+}
+
+/// AAD prefix string uniquely identifies the file and prevents file
swapping
+pub fn aad_prefix(&self) -> Option<&Vec> {
+self.aad_prefix.as_ref()
+}
+
+/// Should the AAD prefix should be stored in the file
+pub fn store_aad_prefix(&self) -> bool {
+self.store_aad_prefix && self.aad_prefix.is_some()
+}
+
+/// Checks if columns that are to be encrypted are present in schema
+#[cfg(feature = "encryption")]
+pub(crate) fn validate_encrypted_column_names(
+&self,
+schema: SchemaDescriptor,
+) -> std::result::Result<(), ParquetError> {
+let schema_columns = schema
+.columns()
+.iter()
+.map(|c| c.path().string())
+.collect::>();
+let encryption_columns = self
+.column_keys
+.keys()
+.cloned()
+.collect::>();
+if !encryption_columns.is_subset(&schema_columns) {
+let mut columns_missing_in_schema = encryption_columns
+.difference(&schema_columns)
+.cloned()
+.collect::>();
+columns_missing_in_schema.sort();
+return Err(ParquetError::General(
+format!(
+"The following columns with encryption keys specified were
not found in the schema: {}",
+columns_missing_in_schema.join(", ")
+)
+.to_string(),
+));
+}
+Ok(())
+}
+}
+
+pub struct EncryptionPropertiesBuilder {
+footer_key: EncryptionKey,
+column_keys: HashMap,
+aad_prefix: Option>,
+encrypt_footer: bool,
+store_aad_prefix: bool,
+}
+
+impl EncryptionPropertiesBuilder {
+pub fn new(footer_key: Vec) -> EncryptionPropertiesBuilder {
+Self {
+footer_key: EncryptionKey::new(footer_key),
+column_keys: HashMap::default(),
+aad_prefix: None,
+encrypt_footer: true,
+store_aad_prefix: true,
+}
+}
+
+/// Set if the footer be encrypted
Review Comment:
```suggestion
/// Set if the footer should be encrypted. Defaults to true.
```
##
parquet/src/encryption/ciphers.rs:
##
@@ -61,3 +64,118 @@ impl BlockDecryptor
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r174296
##
parquet/src/file/writer.rs:
##
@@ -699,44 +803,96 @@ impl<'a> SerializedColumnWriter<'a> {
/// `SerializedPageWriter` should not be used after calling `close()`.
pub struct SerializedPageWriter<'a, W: Write> {
sink: &'a mut TrackedWrite,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
}
impl<'a, W: Write> SerializedPageWriter<'a, W> {
/// Creates new page writer.
pub fn new(sink: &'a mut TrackedWrite) -> Self {
-Self { sink }
+Self {
+sink,
+page_encryptor: None,
+}
+}
+
+#[cfg(feature = "encryption")]
+/// Set the encryptor to use to encrypt page data
+fn with_page_encryptor(mut self, page_encryptor: Option) ->
Self {
+self.page_encryptor = page_encryptor;
+self
}
/// Serializes page header into Thrift.
/// Returns number of bytes that have been written into the sink.
#[inline]
fn serialize_page_header(&mut self, header: parquet::PageHeader) ->
Result {
let start_pos = self.sink.bytes_written();
-{
-let mut protocol = TCompactOutputProtocol::new(&mut self.sink);
-header.write_to_out_protocol(&mut protocol)?;
+match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(page_encryptor) => {
+page_encryptor.encrypt_page_header(&header, &mut self.sink)?;
+}
+_ => {
+let mut protocol = TCompactOutputProtocol::new(&mut self.sink);
+header.write_to_out_protocol(&mut protocol)?;
+}
}
Ok(self.sink.bytes_written() - start_pos)
}
}
+trait PageModuleWriter {
+fn serialize_page(&mut self, page: &CompressedPage) -> Result>;
+}
+
+#[cfg(not(feature = "encryption"))]
+impl PageModuleWriter for SerializedPageWriter<'_, W> {
+fn serialize_page(&mut self, page: &CompressedPage) -> Result> {
+Ok(page.data().to_vec())
Review Comment:
It's not great that we need to copy all the page data here when we didn't
before. I think if we can do something like
https://github.com/apache/arrow-rs/pull/7111#discussion_r133781 then this
wouldn't be needed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r220283
##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
#[derive(Default)]
struct ArrowPageWriter {
buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) ->
Self {
+self.page_encryptor = page_encryptor;
+self
+}
}
impl PageWriter for ArrowPageWriter {
fn write_page(&mut self, page: CompressedPage) -> Result {
let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;
Review Comment:
I've pushed a commit to change this, let me know what you think:
https://github.com/apache/arrow-rs/pull/7111/commits/4032b4f70c5407a1ce98358d1440cd00771543d5
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r136575
##
parquet/src/file/writer.rs:
##
@@ -171,7 +183,30 @@ impl SerializedFileWriter {
/// Creates new file writer.
pub fn new(buf: W, schema: TypePtr, properties: WriterPropertiesPtr) ->
Result {
let mut buf = TrackedWrite::new(buf);
-Self::start_file(&mut buf)?;
+
+#[cfg(feature = "encryption")]
+let file_encryptor = match
properties.file_encryption_properties.as_ref() {
+None => None,
+Some(encryption_props) =>
Some(Arc::new(FileEncryptor::new(encryption_props.clone())?)),
+};
+
+#[cfg(feature = "encryption")]
+if properties.file_encryption_properties.is_some() {
+properties
+.file_encryption_properties
+.clone()
+.unwrap()
+
.validate_encrypted_column_names(SchemaDescriptor::new(schema.clone()))?;
+}
+
+#[cfg(feature = "encryption")]
+if let Some(ref file_encryption_properties) =
properties.file_encryption_properties {
+if !file_encryption_properties.encrypt_footer() {
+return Err(general_err!("Footer encryption is not supported
yet"));
Review Comment:
```suggestion
return Err(general_err!("Writing encrypted files with
plaintext footers is not supported yet"));
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r255941
##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
#[derive(Default)]
struct ArrowPageWriter {
buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) ->
Self {
+self.page_encryptor = page_encryptor;
+self
+}
}
impl PageWriter for ArrowPageWriter {
fn write_page(&mut self, page: CompressedPage) -> Result {
let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;
+
+let mut header = Vec::with_capacity(1024);
+match self.page_encryptor.as_ref() {
Review Comment:
I spent a bit more time on this and made some progress to demonstrate what I
was thinking, but have run into issues with my trait not being object-safe /
dyn-compatible:
https://github.com/adamreeve/arrow-rs/commit/c94ddd07f95ff09076019e8854f4d9f1f065683a
I'm not sure it's worth pursuing this further, I think the current state is
probably fine.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r169215
##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
#[derive(Default)]
struct ArrowPageWriter {
buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) ->
Self {
+self.page_encryptor = page_encryptor;
+self
+}
}
impl PageWriter for ArrowPageWriter {
fn write_page(&mut self, page: CompressedPage) -> Result {
let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;
+
+let mut header = Vec::with_capacity(1024);
+match self.page_encryptor.as_ref() {
Review Comment:
Yeah after looking at this in more detail, I don't think there's an easy way
to share this logic as `ArrowPageWriter` and `SerializedPageWriter` are too
different. I think we should get rid of the `PageModuleWriter` traits but
keeping separate `serialize_page` methods for with and without encryption
enabled would still be nice.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r145331
##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
#[derive(Default)]
struct ArrowPageWriter {
buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) ->
Self {
+self.page_encryptor = page_encryptor;
+self
+}
}
impl PageWriter for ArrowPageWriter {
fn write_page(&mut self, page: CompressedPage) -> Result {
let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;
+
+let mut header = Vec::with_capacity(1024);
+match self.page_encryptor.as_ref() {
Review Comment:
If the logic can't be shared easily then I don't think the new traits are
useful, `serialize_page` could just be a normal method under `impl
ArrowPageWriter` and `impl SerializedPageWriter`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r142005
##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
#[derive(Default)]
struct ArrowPageWriter {
buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) ->
Self {
+self.page_encryptor = page_encryptor;
+self
+}
}
impl PageWriter for ArrowPageWriter {
fn write_page(&mut self, page: CompressedPage) -> Result {
let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;
+
+let mut header = Vec::with_capacity(1024);
+match self.page_encryptor.as_ref() {
Review Comment:
I see you've added a `PageModuleWriter` trait that tidies this up a bit
although this isn't really what I had in mind, I was hoping this logic could be
reused across the `ArrowPageWriter` and `SerializedPageWriter` rather than
having a similar trait in two different places. I'll see if I can make a change
to show what I mean.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r133781
##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
#[derive(Default)]
struct ArrowPageWriter {
buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) ->
Self {
+self.page_encryptor = page_encryptor;
+self
+}
}
impl PageWriter for ArrowPageWriter {
fn write_page(&mut self, page: CompressedPage) -> Result {
let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;
Review Comment:
Ah not really, I was thinking more like having a method that could encrypt
the page, eg:
`fn encrypt_page(page: CompressedPage) -> CompressedPage`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r1999681758
##
parquet/src/encryption/encrypt.rs:
##
@@ -152,8 +152,20 @@ impl EncryptionPropertiesBuilder {
/// Set the key used for encryption of a column. Note that if no column
keys are provided but
/// footer key is all columns will be encrypted with the footer key. If
column keys are provided
/// only the columns with a key will be encrypted even if footer key is
provided.
-pub fn with_column_key(mut self, column_path: String, encryption_key:
EncryptionKey) -> Self {
-self.column_keys.insert(column_path, encryption_key);
+pub fn with_column_key(mut self, column_name: &str, encryption_key:
EncryptionKey) -> Self {
+self.column_keys.insert(column_name.to_string(), encryption_key);
+self
+}
Review Comment:
Done.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
corwinjoy commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r1999659239
##
parquet/src/encryption/encrypt.rs:
##
@@ -152,8 +152,20 @@ impl EncryptionPropertiesBuilder {
/// Set the key used for encryption of a column. Note that if no column
keys are provided but
/// footer key is all columns will be encrypted with the footer key. If
column keys are provided
/// only the columns with a key will be encrypted even if footer key is
provided.
-pub fn with_column_key(mut self, column_path: String, encryption_key:
EncryptionKey) -> Self {
-self.column_keys.insert(column_path, encryption_key);
+pub fn with_column_key(mut self, column_name: &str, encryption_key:
EncryptionKey) -> Self {
+self.column_keys.insert(column_name.to_string(), encryption_key);
+self
+}
Review Comment:
What I would propose instead is
```
pub fn with_column_key(mut self, column_name: &str, key: Vec) -> Self {
self.column_keys.insert(column_name.to_string(),
EncryptionKey::new(key));
self
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r1997653819
##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
#[derive(Default)]
struct ArrowPageWriter {
buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) ->
Self {
+self.page_encryptor = page_encryptor;
+self
+}
}
impl PageWriter for ArrowPageWriter {
fn write_page(&mut self, page: CompressedPage) -> Result {
let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;
Review Comment:
Do you mean like passing `data.len()` into `to_thrift_header` method?
##
parquet/src/encryption/encrypt.rs:
##
@@ -0,0 +1,283 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::encryption::ciphers::{BlockEncryptor, RingGcmBlockEncryptor};
+use crate::errors::{ParquetError, Result};
+use crate::file::column_crypto_metadata::{ColumnCryptoMetaData,
EncryptionWithColumnKey};
+use crate::schema::types::{ColumnDescPtr, SchemaDescriptor};
+use crate::thrift::TSerializable;
+use ring::rand::{SecureRandom, SystemRandom};
+use std::collections::{HashMap, HashSet};
+use std::io::Write;
+use thrift::protocol::TCompactOutputProtocol;
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct EncryptionKey {
+key: Vec,
+key_metadata: Option>,
+}
+
+impl EncryptionKey {
+pub fn new(key: Vec) -> EncryptionKey {
+Self {
+key,
+key_metadata: None,
+}
+}
+
+pub fn with_metadata(mut self, metadata: Vec) -> Self {
+self.key_metadata = Some(metadata);
+self
+}
+
+pub fn key(&self) -> &Vec {
+&self.key
+}
+}
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct FileEncryptionProperties {
+encrypt_footer: bool,
+footer_key: EncryptionKey,
+column_keys: HashMap,
+aad_prefix: Option>,
+store_aad_prefix: bool,
+}
+
+impl FileEncryptionProperties {
+pub fn builder(footer_key: Vec) -> EncryptionPropertiesBuilder {
+EncryptionPropertiesBuilder::new(footer_key)
+}
+
+pub fn encrypt_footer(&self) -> bool {
+self.encrypt_footer
+}
+
+pub fn footer_key_metadata(&self) -> Option<&Vec> {
+self.footer_key.key_metadata.as_ref()
+}
+
+pub fn aad_prefix(&self) -> Option<&Vec> {
+self.aad_prefix.as_ref()
+}
+
+pub fn store_aad_prefix(&self) -> bool {
+self.store_aad_prefix && self.aad_prefix.is_some()
+}
+
+/// Checks if columns that are to be encrypted are present in schema
+#[cfg(feature = "encryption")]
+pub(crate) fn encrypted_columns_in_schema(
+&self,
+schema: SchemaDescriptor,
+) -> std::result::Result<(), ParquetError> {
+let schema_columns = schema
+.columns()
+.iter()
+.map(|c| c.path().string())
+.collect::>();
+let encryption_columns = self
+.column_keys
+.keys()
+.cloned()
+.collect::>();
+if !encryption_columns.is_subset(&schema_columns) {
+let mut columns_missing_in_schema = encryption_columns
+.difference(&schema_columns)
+
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r1994171086
##
parquet/src/encryption/encrypt.rs:
##
@@ -0,0 +1,283 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::encryption::ciphers::{BlockEncryptor, RingGcmBlockEncryptor};
+use crate::errors::{ParquetError, Result};
+use crate::file::column_crypto_metadata::{ColumnCryptoMetaData,
EncryptionWithColumnKey};
+use crate::schema::types::{ColumnDescPtr, SchemaDescriptor};
+use crate::thrift::TSerializable;
+use ring::rand::{SecureRandom, SystemRandom};
+use std::collections::{HashMap, HashSet};
+use std::io::Write;
+use thrift::protocol::TCompactOutputProtocol;
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct EncryptionKey {
+key: Vec,
+key_metadata: Option>,
+}
+
+impl EncryptionKey {
+pub fn new(key: Vec) -> EncryptionKey {
+Self {
+key,
+key_metadata: None,
+}
+}
+
+pub fn with_metadata(mut self, metadata: Vec) -> Self {
+self.key_metadata = Some(metadata);
+self
+}
+
+pub fn key(&self) -> &Vec {
+&self.key
+}
+}
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct FileEncryptionProperties {
+encrypt_footer: bool,
+footer_key: EncryptionKey,
+column_keys: HashMap,
+aad_prefix: Option>,
+store_aad_prefix: bool,
+}
+
+impl FileEncryptionProperties {
+pub fn builder(footer_key: Vec) -> EncryptionPropertiesBuilder {
+EncryptionPropertiesBuilder::new(footer_key)
+}
+
+pub fn encrypt_footer(&self) -> bool {
+self.encrypt_footer
+}
+
+pub fn footer_key_metadata(&self) -> Option<&Vec> {
+self.footer_key.key_metadata.as_ref()
+}
+
+pub fn aad_prefix(&self) -> Option<&Vec> {
+self.aad_prefix.as_ref()
+}
+
+pub fn store_aad_prefix(&self) -> bool {
+self.store_aad_prefix && self.aad_prefix.is_some()
+}
+
+/// Checks if columns that are to be encrypted are present in schema
+#[cfg(feature = "encryption")]
+pub(crate) fn encrypted_columns_in_schema(
Review Comment:
Done.
##
parquet/src/file/writer.rs:
##
@@ -171,19 +183,37 @@ impl SerializedFileWriter {
/// Creates new file writer.
pub fn new(buf: W, schema: TypePtr, properties: WriterPropertiesPtr) ->
Result {
let mut buf = TrackedWrite::new(buf);
-Self::start_file(&mut buf)?;
+
+#[cfg(feature = "encryption")]
+let file_encryptor = match
properties.file_encryption_properties.as_ref() {
+None => None,
+Some(encryption_props) =>
Some(Arc::new(FileEncryptor::new(encryption_props.clone())?)),
+};
+
+#[cfg(feature = "encryption")]
+if properties.file_encryption_properties.is_some() {
+properties
+.file_encryption_properties
+.clone()
+.unwrap()
+
.encrypted_columns_in_schema(SchemaDescriptor::new(schema.clone()))?;
+}
+
+Self::start_file(&properties, &mut buf)?;
Ok(Self {
buf,
schema: schema.clone(),
descr: Arc::new(SchemaDescriptor::new(schema)),
-props: properties,
+props: properties.clone(),
Review Comment:
Removed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r1994145187
##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -3527,4 +3700,199 @@ mod tests {
.unwrap();
assert_eq!(batches.len(), 0);
}
+
+#[cfg(feature = "encryption")]
+#[test]
+fn test_uniform_encryption_roundtrip() {
Review Comment:
Done.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r1992607802
##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
#[derive(Default)]
struct ArrowPageWriter {
buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) ->
Self {
+self.page_encryptor = page_encryptor;
+self
+}
}
impl PageWriter for ArrowPageWriter {
fn write_page(&mut self, page: CompressedPage) -> Result {
let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;
Review Comment:
Rather than overwriting the compressed page size here, we should probably
encrypt the page before creating the header. That might require making the page
mutable to allow overwriting its `buf` member? I'm not sure how best to handle
this but it's a little hacky at the moment.
##
parquet/src/encryption/encrypt.rs:
##
@@ -0,0 +1,283 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::encryption::ciphers::{BlockEncryptor, RingGcmBlockEncryptor};
+use crate::errors::{ParquetError, Result};
+use crate::file::column_crypto_metadata::{ColumnCryptoMetaData,
EncryptionWithColumnKey};
+use crate::schema::types::{ColumnDescPtr, SchemaDescriptor};
+use crate::thrift::TSerializable;
+use ring::rand::{SecureRandom, SystemRandom};
+use std::collections::{HashMap, HashSet};
+use std::io::Write;
+use thrift::protocol::TCompactOutputProtocol;
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct EncryptionKey {
+key: Vec,
+key_metadata: Option>,
+}
+
+impl EncryptionKey {
+pub fn new(key: Vec) -> EncryptionKey {
+Self {
+key,
+key_metadata: None,
+}
+}
+
+pub fn with_metadata(mut self, metadata: Vec) -> Self {
+self.key_metadata = Some(metadata);
+self
+}
+
+pub fn key(&self) -> &Vec {
+&self.key
+}
+}
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct FileEncryptionProperties {
+encrypt_footer: bool,
+footer_key: EncryptionKey,
+column_keys: HashMap,
+aad_prefix: Option>,
+store_aad_prefix: bool,
+}
+
+impl FileEncryptionProperties {
+pub fn builder(footer_key: Vec) -> EncryptionPropertiesBuilder {
+EncryptionPropertiesBuilder::new(footer_key)
+}
+
+pub fn encrypt_footer(&self) -> bool {
+self.encrypt_footer
+}
+
+pub fn footer_key_metadata(&self) -> Option<&Vec> {
+self.footer_key.key_metadata.as_ref()
+}
+
+pub fn aad_prefix(&self) -> Option<&Vec> {
+self.aad_prefix.as_ref()
+}
+
+pub fn store_aad_prefix(&self) -> bool {
+self.store_aad_prefix && self.aad_prefix.is_some()
+}
+
+/// Checks if columns that are to be encrypted are present in schema
+#[cfg(feature = "encryption")]
+pub(crate) fn encrypted_columns_in_schema(
+&self,
+schema: SchemaDescriptor,
+) -> std::result::Result<(), ParquetError> {
+let schema_columns = schema
+.columns()
+.iter()
+.map(|c| c.path().string())
+.collect::>();
+let encryption_columns = self
+.column_keys
+.keys()
+.cl
Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]
alamb commented on PR #7111: URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2719134785 The read part was broken out into - https://github.com/apache/arrow-rs/pull/6637 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
