Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-04-05 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2020280234


##
parquet/src/encryption/decrypt.rs:
##
@@ -324,6 +346,14 @@ impl DecryptionPropertiesBuilder {
 .insert(column_name.to_string(), decryption_key);
 self
 }
+
+/// Specify multiple column decryption keys
+pub fn with_column_keys(mut self, column_names: Vec<&str>, keys: 
Vec>) -> Self {
+for (column_name, key) in 
column_names.into_iter().zip(keys.into_iter()) {

Review Comment:
   :+1: Good point, done thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-04-01 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2769634877

   Follow on PR (for full support) is:
   -  https://github.com/apache/parquet-site/pull/110 is released
   
   I will mark that ready for review when we have released the next version of 
arrow-rs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-04-01 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2769389289

   Thanks again @rok @etseidl @adamreeve and @corwinjoy  🦾 🚀 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-04-01 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2769423294

   I also made a small PR to update the parquet status page (when this is 
released):
   - https://github.com/apache/parquet-site/pull/110


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-04-01 Thread via GitHub


alamb merged PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-04-01 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2769121068

   Merged to resolve another conflict 🙄 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-31 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2767084736

   Thanks @adamreeve @corwinjoy and @rok ! I'll plan to merge this tomorrow 
   
   This PR appeared to have a conflict so I took the liberty of merging up from 
main
   
   ![Screenshot 2025-03-31 at 2 35 30 
PM](https://github.com/user-attachments/assets/5c5d6c31-92f7-4872-a950-8d66bd208951)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-30 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2020279917


##
parquet/tests/encryption/encryption.rs:
##
@@ -0,0 +1,709 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains tests for reading encrypted Parquet files with the 
Arrow API
+
+use crate::encryption_util::{verify_encryption_test_data, TestKeyRetriever};
+use arrow::array::*;
+use arrow::error::Result as ArrowResult;
+use arrow_array::{Int32Array, RecordBatch};
+use arrow_schema::{DataType as ArrowDataType, DataType, Field, Schema};
+use parquet::arrow::arrow_reader::{
+ArrowReaderMetadata, ArrowReaderOptions, ParquetRecordBatchReaderBuilder,
+};
+use parquet::arrow::ArrowWriter;
+use parquet::data_type::{ByteArray, ByteArrayType};
+use parquet::encryption::decrypt::FileDecryptionProperties;
+use parquet::encryption::encrypt::FileEncryptionProperties;
+use parquet::file::metadata::ParquetMetaData;
+use parquet::file::properties::WriterProperties;
+use parquet::file::writer::SerializedFileWriter;
+use parquet::schema::parser::parse_message_type;
+use std::fs::File;
+use std::sync::Arc;
+
+#[test]
+fn test_non_uniform_encryption_plaintext_footer() {
+let test_data = arrow::util::test_util::parquet_test_data();
+let path = 
format!("{test_data}/encrypt_columns_plaintext_footer.parquet.encrypted");
+let file = File::open(path).unwrap();
+
+// There is always a footer key even with a plaintext footer,
+// but this is used for signing the footer.
+let footer_key = "0123456789012345".as_bytes(); // 128bit/16
+let column_1_key = "1234567890123450".as_bytes();
+let column_2_key = "1234567890123451".as_bytes();
+
+let decryption_properties = 
FileDecryptionProperties::builder(footer_key.to_vec())
+.with_column_key("double_field", column_1_key.to_vec())
+.with_column_key("float_field", column_2_key.to_vec())
+.build()
+.unwrap();
+
+verify_encryption_test_file_read(file, decryption_properties);
+}
+
+#[test]
+fn test_non_uniform_encryption_disabled_aad_storage() {
+let test_data = arrow::util::test_util::parquet_test_data();
+let path =
+
format!("{test_data}/encrypt_columns_and_footer_disable_aad_storage.parquet.encrypted");
+let file = File::open(path.clone()).unwrap();
+
+let footer_key = b"0123456789012345".to_vec(); // 128bit/16
+let column_1_key = b"1234567890123450".to_vec();
+let column_2_key = b"1234567890123451".to_vec();
+
+// Can read successfully when providing the correct AAD prefix
+let decryption_properties = 
FileDecryptionProperties::builder(footer_key.clone())
+.with_column_key("double_field", column_1_key.clone())
+.with_column_key("float_field", column_2_key.clone())
+.with_aad_prefix(b"tester".to_vec())
+.build()
+.unwrap();
+
+verify_encryption_test_file_read(file, decryption_properties);
+
+// Using wrong AAD prefix should fail
+let decryption_properties = 
FileDecryptionProperties::builder(footer_key.clone())
+.with_column_key("double_field", column_1_key.clone())
+.with_column_key("float_field", column_2_key.clone())
+.with_aad_prefix(b"wrong_aad_prefix".to_vec())
+.build()
+.unwrap();
+
+let file = File::open(path.clone()).unwrap();
+let options = ArrowReaderOptions::default()
+.with_file_decryption_properties(decryption_properties.clone());
+let result = ArrowReaderMetadata::load(&file, options.clone());
+assert!(result.is_err());
+assert_eq!(
+result.unwrap_err().to_string(),
+"Parquet error: Provided footer key and AAD were unable to decrypt 
parquet footer"
+);
+
+// Not providing any AAD prefix should fail as it isn't stored in the file
+let decryption_properties = FileDecryptionProperties::builder(footer_key)
+.with_column_key("double_field", column_1_key)
+.with_column_key("float_field", column_2_key)
+.build()
+.unwrap();
+
+let file = File::open(path).unwrap();
+let options = ArrowReaderOptions::default()
+.with_file_decryption_properties(decryption_properties.clo

Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-30 Thread via GitHub


adamreeve commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2764799620

   Your latest comments should all be addressed now thanks @alamb. I've also 
just added one extra test to cover being able to use enencrypted row group 
metadata when writing a file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-30 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2020280114


##
parquet/src/column/page_encryption.rs:
##
@@ -0,0 +1,119 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::column::page::CompressedPage;
+use crate::encryption::ciphers::BlockEncryptor;
+use crate::encryption::encrypt::{encrypt_object, FileEncryptor};
+use crate::encryption::modules::{create_module_aad, ModuleType};
+use crate::errors::ParquetError;
+use crate::errors::Result;
+use crate::format::PageHeader;
+use crate::format::PageType;
+use bytes::Bytes;
+use std::io::Write;
+use std::sync::Arc;
+
+#[derive(Debug)]
+/// Encrypts page headers and page data for columns
+pub struct PageEncryptor {

Review Comment:
   Yes these are in a `pub(crate)` module so are crate private, I've changed 
this to be more explicit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-30 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2020258059


##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -727,85 +819,173 @@ pub fn get_column_writers(
 ) -> Result> {
 let mut writers = Vec::with_capacity(arrow.fields.len());
 let mut leaves = parquet.columns().iter();
+let column_factory = ArrowColumnWriterFactory::new();
 for field in &arrow.fields {
-get_arrow_column_writer(field.data_type(), props, &mut leaves, &mut 
writers)?;
+column_factory.get_arrow_column_writer(
+field.data_type(),
+props,
+&mut leaves,
+&mut writers,
+)?;
 }
 Ok(writers)
 }
 
-/// Gets the [`ArrowColumnWriter`] for the given `data_type`
-fn get_arrow_column_writer(
-data_type: &ArrowDataType,
+/// Returns the [`ArrowColumnWriter`] for a given schema and supports columnar 
encryption
+#[cfg(feature = "encryption")]
+fn get_column_writers_with_encryptor(

Review Comment:
   :+1: I've made https://github.com/apache/arrow-rs/issues/7359 for this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-28 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2762417125

   I ran the performance benchmarks and I do not see any changes for parquet 
writing with this PR (though the numbers do vary somewhat). Nice work. Details 
below
   
   Details
   
   
   ```
   ++ critcmp main encryption-basics-fork
   group 
encryption-basics-fork main
   - 
-- 
   write_batch nested/4096 values primitive list 
1.00  6.5±0.30ms   325.6 MB/sec1.03  6.7±0.26ms   316.1 MB/sec
   write_batch nested/4096 values primitive list non-null
1.00  8.2±0.37ms   260.8 MB/sec1.00  8.2±0.29ms   260.1 MB/sec
   write_batch primitive/4096 values bool
1.01  1022.2±53.98µs  1062.3 KB/sec1.00  1013.0±40.47µs  1072.0 KB/sec
   write_batch primitive/4096 values bool non-null   
1.00   934.1±27.14µs   627.3 KB/sec1.03   960.6±34.00µs   610.0 KB/sec
   write_batch primitive/4096 values float with NaNs 
1.01  1824.6±56.19µs30.1 MB/sec1.00  1809.3±31.57µs30.4 MB/sec
   write_batch primitive/4096 values primitive   
1.04  2.9±0.24ms60.4 MB/sec1.00  2.8±0.07ms62.6 MB/sec
   write_batch primitive/4096 values primitive non-null  
1.06  2.9±0.25ms60.5 MB/sec1.00  2.7±0.13ms64.0 MB/sec
   write_batch primitive/4096 values primitive non-null with bloom filter
1.05 22.5±2.68ms 7.7 MB/sec1.00 21.5±0.36ms 8.0 MB/sec
   write_batch primitive/4096 values primitive with bloom filter 
1.05 22.4±3.68ms 7.8 MB/sec1.00 21.4±0.88ms 8.2 MB/sec
   write_batch primitive/4096 values string  
1.00  1746.6±55.52µs72.3 MB/sec1.01  1757.4±46.59µs71.8 MB/sec
   write_batch primitive/4096 values string dictionary   
1.02  4.3±0.40ms   241.6 MB/sec1.00  4.2±0.18ms   247.1 MB/sec
   write_batch primitive/4096 values string dictionary with bloom filter 
1.02  5.8±0.33ms   177.5 MB/sec1.00  5.7±0.27ms   180.3 MB/sec
   write_batch primitive/4096 values string non-null 
1.02  6.9±0.24ms   295.6 MB/sec1.00  6.8±0.25ms   302.8 MB/sec
   write_batch primitive/4096 values string non-null with bloom filter   
1.00 12.9±0.34ms   159.3 MB/sec1.00 12.9±0.36ms   158.6 MB/sec
   write_batch primitive/4096 values string with bloom filter
1.00 11.1±0.63ms   184.1 MB/sec1.01 11.2±0.37ms   182.3 MB/sec
   ```
   
   Second run
   
   ```
   ++ critcmp main encryption-basics-fork
   group 
encryption-basics-fork main
   - 
-- 
   write_batch nested/4096 values primitive list 
1.01  6.9±0.29ms   309.3 MB/sec1.00  6.8±0.27ms   313.5 MB/sec
   write_batch nested/4096 values primitive list non-null
1.00  8.5±0.41ms   251.6 MB/sec1.04  8.8±0.51ms   241.4 MB/sec
   write_batch primitive/4096 values bool
1.00  1012.3±34.35µs  1072.7 KB/sec1.03  1044.5±39.21µs  1039.6 KB/sec
   write_batch primitive/4096 values bool non-null   
1.00   945.6±33.77µs   619.6 KB/sec1.02   962.4±43.08µs   608.8 KB/sec
   write_batch primitive/4096 values float with NaNs 
1.01  1876.1±54.31µs29.3 MB/sec1.00  1861.8±63.39µs29.5 MB/sec
   write_batch primitive/4096 values primitive   
1.03  3.0±0.13ms58.5 MB/sec1.00  2.9±0.12ms60.4 MB/sec
   write_batch primitive/4096 values primitive non-null  
1.04  3.0±0.75ms57.7 MB/sec1.00  2.9±0.60ms60.0 MB/sec
   write_batch primitive/4096 values primitive non-null with bloom filter
1.00 21.5±0.50ms 8.0 MB/sec1.00 21.5±0.65ms 8.0 MB/sec
   write_batch primitive/4096 values primitive with bloom filter 
1.00 21.4±0.85ms 8.2 MB/sec1.00 21.4±0.96ms 8.2 MB/sec
   write_batch primitive/4096 values string  
1.00  1790.4±68.70µs70.5 MB/sec1.02  1827.0±87.24µs69.1 MB/sec
   write_batch primitive/4096 values string dictionary   
1.00  4.6±0.28ms   225.3 MB/sec1.06  4.8±0.30ms   213.4 MB/sec
   write_batch primitive/4096 values string dictionary with bloom filter 
1.00  6.2±0.31ms   166.9 MB/sec

Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-28 Thread via GitHub


alamb commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2019023428


##
parquet/src/column/page_encryption.rs:
##
@@ -0,0 +1,119 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::column::page::CompressedPage;
+use crate::encryption::ciphers::BlockEncryptor;
+use crate::encryption::encrypt::{encrypt_object, FileEncryptor};
+use crate::encryption::modules::{create_module_aad, ModuleType};
+use crate::errors::ParquetError;
+use crate::errors::Result;
+use crate::format::PageHeader;
+use crate::format::PageType;
+use bytes::Bytes;
+use std::io::Write;
+use std::sync::Arc;
+
+#[derive(Debug)]
+/// Encrypts page headers and page data for columns
+pub struct PageEncryptor {

Review Comment:
   I am pretty sure this is crate private -- could we mark it explicitly like 
that?
   
   ```suggestion
   pub (crate) struct PageEncryptor {
   ```



##
parquet/src/encryption/decrypt.rs:
##
@@ -324,6 +346,14 @@ impl DecryptionPropertiesBuilder {
 .insert(column_name.to_string(), decryption_key);
 self
 }
+
+/// Specify multiple column decryption keys
+pub fn with_column_keys(mut self, column_names: Vec<&str>, keys: 
Vec>) -> Self {
+for (column_name, key) in 
column_names.into_iter().zip(keys.into_iter()) {

Review Comment:
   it might be worth checking that the column_names and keys have the same 
length (`zip` silently ignores stuff if one side has more than the other)



##
parquet/src/file/writer.rs:
##
@@ -523,12 +601,24 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W> 
{
 ) -> Result,
 {
 self.assert_previous_writer_closed()?;
+
+let encryptor_context = self.get_page_encryptor_context();
+
 Ok(match self.next_column_desc() {
 Some(column) => {
 let props = self.props.clone();
 let (buf, on_close) = self.get_on_close();
-let page_writer = Box::new(SerializedPageWriter::new(buf));
-Some(factory(column, props, page_writer, Box::new(on_close))?)
+
+let page_writer = SerializedPageWriter::new(buf);

Review Comment:
   thank you -- I think this looks very reasonable now



##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -727,85 +819,173 @@ pub fn get_column_writers(
 ) -> Result> {
 let mut writers = Vec::with_capacity(arrow.fields.len());
 let mut leaves = parquet.columns().iter();
+let column_factory = ArrowColumnWriterFactory::new();
 for field in &arrow.fields {
-get_arrow_column_writer(field.data_type(), props, &mut leaves, &mut 
writers)?;
+column_factory.get_arrow_column_writer(
+field.data_type(),
+props,
+&mut leaves,
+&mut writers,
+)?;
 }
 Ok(writers)
 }
 
-/// Gets the [`ArrowColumnWriter`] for the given `data_type`
-fn get_arrow_column_writer(
-data_type: &ArrowDataType,
+/// Returns the [`ArrowColumnWriter`] for a given schema and supports columnar 
encryption
+#[cfg(feature = "encryption")]
+fn get_column_writers_with_encryptor(

Review Comment:
   I think we should file a follow on ticket for a feature request of "Support 
writing encrypted files with multiple threads" or something like that 
   
   I agree it is ok as a follow on



##
parquet/src/encryption/decrypt.rs:
##
@@ -324,6 +346,14 @@ impl DecryptionPropertiesBuilder {
 .insert(column_name.to_string(), decryption_key);
 self
 }
+
+/// Specify multiple column decryption keys
+pub fn with_column_keys(mut self, column_names: Vec<&str>, keys: 
Vec>) -> Self {
+for (column_name, key) in 
column_names.into_iter().zip(keys.into_iter()) {

Review Comment:
   ```suggestion
   if column_names.len() != keys.len() { 
  ... error
   } 
   for (column_name, key) in 
column_names.into_iter().zip(keys.into_iter()) {
   ```



##
parquet/src/encryption/encrypt.rs:
##
@@ -0,0 +1,368 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE f

Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-28 Thread via GitHub


etseidl commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-276525

   Thanks @rok! I'll try to get a start on a review this weekend, and hopefully 
get it done early next week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-28 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2749282835

   BTW I wanted to thank you all for taking on implementing this (large) 
feature -- it is one of the last remaining major feature gaps in the Rust 
parquet implemention in my mind


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-27 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2017690250


##
parquet/src/file/metadata/writer.rs:
##
@@ -133,17 +190,53 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
 
 // Write file metadata
 let start_pos = self.buf.bytes_written();
-{
-let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
-file_metadata.write_to_out_protocol(&mut protocol)?;
+match self.file_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(file_encryptor) if 
file_encryptor.properties().encrypt_footer() => {
+// First write FileCryptoMetadata
+let crypto_metadata = 
Self::file_crypto_metadata(file_encryptor)?;
+let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+crypto_metadata.write_to_out_protocol(&mut protocol)?;
+
+// Then write encrypted footer
+let aad = create_footer_aad(file_encryptor.file_aad())?;
+let mut encryptor = file_encryptor.get_footer_encryptor()?;
+encrypt_object(&file_metadata, &mut encryptor, &mut self.buf, 
&aad)?;
+}
+_ => {
+let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+file_metadata.write_to_out_protocol(&mut protocol)?;
+}
 }
+
 let end_pos = self.buf.bytes_written();
 
 // Write footer
 let metadata_len = (end_pos - start_pos) as u32;
 
 self.buf.write_all(&metadata_len.to_le_bytes())?;
-self.buf.write_all(&PARQUET_MAGIC)?;
+#[cfg(feature = "encryption")]
+let magic = get_file_magic(
+self.file_encryptor
+.as_ref()
+.map(|encryptor| encryptor.properties()),
+);
+#[cfg(not(feature = "encryption"))]
+let magic = get_file_magic();
+self.buf.write_all(magic)?;
+
+// TODO: Discuss what to do here with community
+// The argument for returning unencrypted rowgroups is that any users 
working with
+// the returned FIleMetaData have no easy way to decrypt it and hence 
expect an
+// unencrypted structure be returned. So, the argument here would be 
backward compatibility.
+// Return unencrypted row_group for use in program
+// E.g. when collecting statistics.
+// Related to this see: https://github.com/apache/arrow-rs/issues/7254

Review Comment:
   After discussing this some more with @corwinjoy, we'd prefer to keep this as 
is rather than making a big breaking change like returning `ParquetMetaData` 
instead, and don't think this workaround is too objectionable.
   
   If others are happy with this I'll just update this comment and remove the 
TODO.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-27 Thread via GitHub


adamreeve commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2759518123

   Yes I think this is ready for review. There are a couple of comments I'd 
like some input on 
(https://github.com/apache/arrow-rs/pull/7111/files#r2015196618 and 
https://github.com/apache/arrow-rs/pull/7111/files#r2009421835) but otherwise 
it's looking good to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-27 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2759435510

   How is this PR doing, btw? Is it ready for another review?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-27 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2759435126

   > But that might be quite a large and painful breaking change, and also 
might come with a performance hit...
   
   We would have to try it I suspect to test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-26 Thread via GitHub


adamreeve commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2756055877

   > I came up with https://github.com/apache/arrow-rs/pull/7337 which I think 
is easier to work with
   
   Thanks, this example helped. I took another pass at refactoring 
`ThriftMetadataWriter` and this is much nicer now.
   
   > Since this is a major release, we can make API changes too if needed to 
support modular encryption.
   
   The one thing that comes to mind is whether we want to change 
`AsyncArrowWriter::finish` to return `parquet::file::ParquetMetadata` instead 
of `parquet::format::FileMetaData`. This was suggested in #7254 although closed 
without comment, but that would probably avoid the problem of writers not being 
able to use the returned metadata to retrieve statistics due to them being 
encrypted (https://github.com/apache/arrow-rs/pull/7111/files#r2009421835). But 
that might be quite a large and painful breaking change, and also might come 
with a performance hit...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-26 Thread via GitHub


corwinjoy commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2756149153

   @alamb wrote:
   > Since this is a major release, we can make API changes too if needed to 
support modular encryption.
   > 
   
   One minor API change that I would like to have considered is this related PR 
that I feel makes decryption usage better.
   https://github.com/apache/arrow-rs/pull/7342
   
   At the suggestion of @adamreeve I have posted this on its own PR.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-26 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2015196618


##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -727,85 +819,173 @@ pub fn get_column_writers(
 ) -> Result> {
 let mut writers = Vec::with_capacity(arrow.fields.len());
 let mut leaves = parquet.columns().iter();
+let column_factory = ArrowColumnWriterFactory::new();
 for field in &arrow.fields {
-get_arrow_column_writer(field.data_type(), props, &mut leaves, &mut 
writers)?;
+column_factory.get_arrow_column_writer(
+field.data_type(),
+props,
+&mut leaves,
+&mut writers,
+)?;
 }
 Ok(writers)
 }
 
-/// Gets the [`ArrowColumnWriter`] for the given `data_type`
-fn get_arrow_column_writer(
-data_type: &ArrowDataType,
+/// Returns the [`ArrowColumnWriter`] for a given schema and supports columnar 
encryption
+#[cfg(feature = "encryption")]
+fn get_column_writers_with_encryptor(

Review Comment:
   `get_column_writers` above is a `pub fn` to allow users to have low-level 
control over column writing, eg. see docs at 
https://github.com/apache/arrow-rs/blob/4e9e1570ef28c160492891539bbc9649ec069a53/parquet/src/arrow/arrow_writer/mod.rs#L606
   
   That method won't work with encryption though.
   
   This new method is private as `FileEncryptor` is only `pub(crate)`, but 
maybe we should make this pub and only require pub types in order to support 
the same use case? Or modify `get_column_writers` to take a `row_group_idx` and 
`&SerializedFileWriter` if we can make breaking changes?
   
   If going with the first approach, that could always be done later as a 
non-breaking change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-26 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2015140880


##
parquet/src/util/never.rs:
##
@@ -0,0 +1,21 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Type that can never be instantiated, to make working with features more 
ergonomic.
+// An Option containing this type has zero size.
+#[cfg(not(feature = "encryption"))]
+pub type Never = core::convert::Infallible;

Review Comment:
   Thanks, this is deleted now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-26 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2755546366

   > I am going to spend some time seeing if I can get it more isolated and 
hopefully that might serve as an example of how to do it for writing.
   
   Update: 
   - I came up with https://github.com/apache/arrow-rs/pull/7337 which I think 
is easier to work with


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-26 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2755178641

   > I've tried to reduce the use of #[cfg(feature = "encryption")] within 
function bodies 
(https://github.com/apache/arrow-rs/commit/78ac7b25504ea0d23d1d612f0cb52e823c012bd5),
 but have been pretty consistently hampered by borrow checker issues.
   
   Thanks for trying @adamreeve - I am struggling to get the filter pushdown in 
https://github.com/apache/arrow-rs/pull/6921 working well given how the 
encryption code in the reader is injected. I am going to spend some time seeing 
if I can get it more isolated and hopefully that might serve as an example of 
how to do it for writing. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-26 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2755064197

   BTW I am hoping we can get this PR ready and merged for a release next week. 
Specifically this one.
   
   - https://github.com/apache/arrow-rs/issues/7084
   
   Since this is a major release, we can make API changes too if needed to 
support modular encryption. 
   
   I hope to have time to help if needed later this week or early next week


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-25 Thread via GitHub


adamreeve commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2753181854

   I've tried to reduce the use of `#[cfg(feature = "encryption")]` within 
function bodies 
(https://github.com/apache/arrow-rs/pull/7111/commits/78ac7b25504ea0d23d1d612f0cb52e823c012bd5),
 but have been pretty consistently hampered by borrow checker issues. The 
borrow checker has no insight into which fields a function accesses and assumes 
all of self is borrowed by an `&self` function, so it's not possible to extract 
logic out to functions when there's a mutable reference to any field of `self` 
present.
   
   For example, in `arrow_writer/mod.rs`, we currently have this code:
   
   
https://github.com/apache/arrow-rs/blob/78ac7b25504ea0d23d1d612f0cb52e823c012bd5/parquet/src/arrow/arrow_writer/mod.rs#L266-L288
   
   The contents of the `x` branch of the match statement can't be extracted 
into a method, because a mutable reference to `self.in_progress` is live. This 
is the same reason why getting `row_group_index` has to come before the 
reference to `self.in_progress` is taken.
   
   I ran into a similar problem in `file/metadata/writer.rs` here: 
https://github.com/apache/arrow-rs/blob/78ac7b25504ea0d23d1d612f0cb52e823c012bd5/parquet/src/file/metadata/writer.rs#L73
   
   I wanted to extract a `write_offset_index` method with different 
implementations depending on whether encryption is enabled, but this isn't 
possible due to there being a mutable reference to `self.row_groups`. I could 
instead have different implementations of the whole `write_offset_indexes` 
function, but that would lead to quite a bit of code duplication.
   
   I think the current state is a bit better than it was before, but any 
thoughts on how to work around this and improve the code readability further 
would be appreciated! 
   
   `ThriftMetadataWriter::finish` is probably the ugliest part of the diff at 
the moment, mostly because of a workaround we have to return unencrypted row 
group metadata so that the statistics are usable by consumers, and we'd also 
appreciate any thoughts on that 
(https://github.com/apache/arrow-rs/pull/7111/files#r2009421835).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-25 Thread via GitHub


rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2003517486


##
parquet/src/file/writer.rs:
##
@@ -171,7 +183,30 @@ impl SerializedFileWriter {
 /// Creates new file writer.
 pub fn new(buf: W, schema: TypePtr, properties: WriterPropertiesPtr) -> 
Result {
 let mut buf = TrackedWrite::new(buf);
-Self::start_file(&mut buf)?;
+
+#[cfg(feature = "encryption")]
+let file_encryptor = match 
properties.file_encryption_properties.as_ref() {
+None => None,
+Some(encryption_props) => 
Some(Arc::new(FileEncryptor::new(encryption_props.clone())?)),
+};
+
+#[cfg(feature = "encryption")]
+if properties.file_encryption_properties.is_some() {
+properties
+.file_encryption_properties
+.clone()
+.unwrap()
+
.validate_encrypted_column_names(SchemaDescriptor::new(schema.clone()))?;
+}

Review Comment:
   Do you mean:
   ```rust
   Ok(Self {
   buf,
   schema: schema.clone(),
   descr: Arc::new(schema_descriptor),
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-25 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2011101642


##
parquet/src/file/writer.rs:
##
@@ -699,34 +803,61 @@ impl<'a> SerializedColumnWriter<'a> {
 /// `SerializedPageWriter` should not be used after calling `close()`.
 pub struct SerializedPageWriter<'a, W: Write> {
 sink: &'a mut TrackedWrite,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,

Review Comment:
   Oh I can just change the method to return both the page encryptor reference 
and a mutable reference to `self.sink`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-24 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2011029754


##
parquet/src/file/writer.rs:
##
@@ -699,34 +803,61 @@ impl<'a> SerializedColumnWriter<'a> {
 /// `SerializedPageWriter` should not be used after calling `close()`.
 pub struct SerializedPageWriter<'a, W: Write> {
 sink: &'a mut TrackedWrite,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,

Review Comment:
   Hmm this is looking promising but I've hit a bit of a wall due to the 
compiler complaining about multiple mutable borrows of self.sink 
([here](https://github.com/rok/arrow-rs/blob/9e79d75b3e15e45423b6f09225ab22fae11245e6/parquet/src/file/writer.rs#L836)),
 as it doesn't know that `page_encryptor_mut` only allows mutating the page 
encryptor.
   
   I might need to put the `BlockEncryptor` into a `RefCell` or `Mutex` so that 
the `PageEncryptor` methods can take `&self` instead of `&mut self`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-24 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2011211923


##
parquet/src/file/writer.rs:
##
@@ -523,12 +595,37 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W> 
{
 ) -> Result,
 {
 self.assert_previous_writer_closed()?;
+
+#[cfg(feature = "encryption")]

Review Comment:
   I've tidied this particular method up. It was a bit awkward due to 
`get_on_close` returning a mutable reference from self so we can't call any 
other methods that take `&self` after this, but I think it has ended up 
cleaner. I'll continue trying to tidy up uses of `FileEncryptor` too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-24 Thread via GitHub


alamb commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2010826146


##
parquet/README.md:
##
@@ -84,7 +84,7 @@ The `parquet` crate provides the following features which may 
be enabled in your
   - [ ] Row record writer
   - [x] Arrow record writer
   - [x] Async support
-  - [ ] Encrypted files
+  - [x] Encrypted files

Review Comment:
   🎉 



##
parquet/src/file/writer.rs:
##
@@ -699,34 +803,61 @@ impl<'a> SerializedColumnWriter<'a> {
 /// `SerializedPageWriter` should not be used after calling `close()`.
 pub struct SerializedPageWriter<'a, W: Write> {
 sink: &'a mut TrackedWrite,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,

Review Comment:
   Instead of adding a special field, what if you put access to 
`page_encryptor` behind a feature flag
   
   ```rust
   #[cfg(feature = "encryption")]
   fn page_encryptor_mut(&mut self) -> Option<&PageEncryptor> {
 self.page_encryptor.as_mut()
   }
   
   #[cfg(not(feature = "encryption"))]
   fn page_encryptor_mut(&mut self) -> Option<&PageEncryptor> {
None
   }
   ```
   
   Perhaps what you could do is then define two `PageEncryptor`s:
   1. When `encryption` is enabled, what is currently in this PR
   2. When `encryption` is not enabled, then define a `PageEncryptor ` that has 
the same interface but function calls that do nothing
   



##
parquet/src/file/writer.rs:
##
@@ -523,12 +595,37 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W> 
{
 ) -> Result,
 {
 self.assert_previous_writer_closed()?;
+
+#[cfg(feature = "encryption")]

Review Comment:
   If at all possible, I would prefer that all the feature gated code is 
isolated to functions / structures that have different implementations when the 
feature is on/off
   
   Sprinking the `cfg` in this code I think makes it significantly harder to 
read and this is some of the most complicated / performance sensitive code of 
the parquet crate as it is



##
parquet/src/util/never.rs:
##
@@ -0,0 +1,21 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Type that can never be instantiated, to make working with features more 
ergonomic.
+// An Option containing this type has zero size.
+#[cfg(not(feature = "encryption"))]
+pub type Never = core::convert::Infallible;

Review Comment:
   I left an alternate suggestion above about how to handle the feature flags



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-24 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2010846626


##
parquet/src/file/writer.rs:
##
@@ -699,34 +803,61 @@ impl<'a> SerializedColumnWriter<'a> {
 /// `SerializedPageWriter` should not be used after calling `close()`.
 pub struct SerializedPageWriter<'a, W: Write> {
 sink: &'a mut TrackedWrite,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,

Review Comment:
   I tried something a bit like this in 
https://github.com/apache/arrow-rs/pull/7111#discussion_r1992613117 (see 
https://github.com/adamreeve/arrow-rs/commit/c94ddd07f95ff09076019e8854f4d9f1f065683a),
 but ran into an issue with my trait not being dyn-compatible. But making this 
a compile-time switch rather than a dynamic trait object would avoid that 
problem, I'll give this a go thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-23 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2009421835


##
parquet/src/file/metadata/writer.rs:
##
@@ -133,17 +190,53 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
 
 // Write file metadata
 let start_pos = self.buf.bytes_written();
-{
-let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
-file_metadata.write_to_out_protocol(&mut protocol)?;
+match self.file_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(file_encryptor) if 
file_encryptor.properties().encrypt_footer() => {
+// First write FileCryptoMetadata
+let crypto_metadata = 
Self::file_crypto_metadata(file_encryptor)?;
+let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+crypto_metadata.write_to_out_protocol(&mut protocol)?;
+
+// Then write encrypted footer
+let aad = create_footer_aad(file_encryptor.file_aad())?;
+let mut encryptor = file_encryptor.get_footer_encryptor()?;
+encrypt_object(&file_metadata, &mut encryptor, &mut self.buf, 
&aad)?;
+}
+_ => {
+let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+file_metadata.write_to_out_protocol(&mut protocol)?;
+}
 }
+
 let end_pos = self.buf.bytes_written();
 
 // Write footer
 let metadata_len = (end_pos - start_pos) as u32;
 
 self.buf.write_all(&metadata_len.to_le_bytes())?;
-self.buf.write_all(&PARQUET_MAGIC)?;
+#[cfg(feature = "encryption")]
+let magic = get_file_magic(
+self.file_encryptor
+.as_ref()
+.map(|encryptor| encryptor.properties()),
+);
+#[cfg(not(feature = "encryption"))]
+let magic = get_file_magic();
+self.buf.write_all(magic)?;
+
+// TODO: Discuss what to do here with community
+// The argument for returning unencrypted rowgroups is that any users 
working with
+// the returned FIleMetaData have no easy way to decrypt it and hence 
expect an
+// unencrypted structure be returned. So, the argument here would be 
backward compatibility.
+// Return unencrypted row_group for use in program
+// E.g. when collecting statistics.
+// Related to this see: https://github.com/apache/arrow-rs/issues/7254

Review Comment:
   This is an issue we ran into when trying to integrate these changes into 
delta-rs, which reads the row group metadata after writing files to get column 
statistics. Making clients decrypt the metadata in this scenario doesn't seem 
reasonable.
   
   The suggestion to return `ParquetMetaData` instead of the Thrift 
`FileMetaData` in #7254 seems relevant here, if that was done then that might 
also solve this problem, depending on how it was implemented.
   
   We'd appreciate some feedback on this area and if there's a better way to 
handle this than our current solution of copying and restoring the unencrypted 
metadata after the file is written. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-23 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2009250354


##
parquet/tests/encryption/encryption.rs:
##
@@ -0,0 +1,609 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains tests for reading encrypted Parquet files with the 
Arrow API
+
+use crate::encryption_util::read_and_roundtrip_to_encrypted_file;
+use crate::encryption_util::verify_encryption_test_data;
+use arrow::array::*;
+use arrow::error::Result as ArrowResult;
+use arrow_array::{Int32Array, RecordBatch};
+use arrow_schema::{DataType as ArrowDataType, DataType, Field, Schema};
+use parquet::arrow::arrow_reader::{
+ArrowReaderMetadata, ArrowReaderOptions, ParquetRecordBatchReaderBuilder,
+};
+use parquet::arrow::ArrowWriter;
+use parquet::data_type::{ByteArray, ByteArrayType};
+use parquet::encryption::decrypt::FileDecryptionProperties;
+use parquet::encryption::encrypt::FileEncryptionProperties;
+use parquet::file::metadata::ParquetMetaData;
+use parquet::file::properties::WriterProperties;
+use parquet::file::writer::SerializedFileWriter;
+use parquet::schema::parser::parse_message_type;
+use std::fs::File;
+use std::sync::Arc;
+
+#[test]
+fn test_non_uniform_encryption_plaintext_footer() {
+let test_data = arrow::util::test_util::parquet_test_data();
+let path = 
format!("{test_data}/encrypt_columns_plaintext_footer.parquet.encrypted");
+let file = File::open(path).unwrap();
+
+// There is always a footer key even with a plaintext footer,
+// but this is used for signing the footer.
+let footer_key = "0123456789012345".as_bytes(); // 128bit/16
+let column_1_key = "1234567890123450".as_bytes();
+let column_2_key = "1234567890123451".as_bytes();
+
+let decryption_properties = 
FileDecryptionProperties::builder(footer_key.to_vec())
+.with_column_key("double_field", column_1_key.to_vec())
+.with_column_key("float_field", column_2_key.to_vec())
+.build()
+.unwrap();
+
+verify_encryption_test_file_read(file, decryption_properties);
+}
+
+#[test]
+fn test_non_uniform_encryption_disabled_aad_storage() {
+let test_data = arrow::util::test_util::parquet_test_data();
+let path =
+
format!("{test_data}/encrypt_columns_and_footer_disable_aad_storage.parquet.encrypted");
+let file = File::open(path.clone()).unwrap();
+
+let footer_key = b"0123456789012345".to_vec(); // 128bit/16
+let column_1_key = b"1234567890123450".to_vec();
+let column_2_key = b"1234567890123451".to_vec();
+
+// Can read successfully when providing the correct AAD prefix
+let decryption_properties = 
FileDecryptionProperties::builder(footer_key.clone())
+.with_column_key("double_field", column_1_key.clone())
+.with_column_key("float_field", column_2_key.clone())
+.with_aad_prefix(b"tester".to_vec())
+.build()
+.unwrap();
+
+verify_encryption_test_file_read(file, decryption_properties);
+
+// Using wrong AAD prefix should fail
+let decryption_properties = 
FileDecryptionProperties::builder(footer_key.clone())
+.with_column_key("double_field", column_1_key.clone())
+.with_column_key("float_field", column_2_key.clone())
+.with_aad_prefix(b"wrong_aad_prefix".to_vec())
+.build()
+.unwrap();
+
+let file = File::open(path.clone()).unwrap();
+let options = ArrowReaderOptions::default()
+.with_file_decryption_properties(decryption_properties.clone());
+let result = ArrowReaderMetadata::load(&file, options.clone());
+assert!(result.is_err());
+assert_eq!(
+result.unwrap_err().to_string(),
+"Parquet error: Provided footer key and AAD were unable to decrypt 
parquet footer"
+);
+
+// Not providing any AAD prefix should fail as it isn't stored in the file
+let decryption_properties = FileDecryptionProperties::builder(footer_key)
+.with_column_key("double_field", column_1_key)
+.with_column_key("float_field", column_2_key)
+.build()
+.unwrap();
+
+let file = File::open(path).unwrap();
+let options = ArrowReaderOptions::default()
+.with_file_

Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-23 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2009272762


##
parquet/tests/encryption/encryption.rs:
##
@@ -0,0 +1,609 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains tests for reading encrypted Parquet files with the 
Arrow API
+
+use crate::encryption_util::read_and_roundtrip_to_encrypted_file;
+use crate::encryption_util::verify_encryption_test_data;
+use arrow::array::*;
+use arrow::error::Result as ArrowResult;
+use arrow_array::{Int32Array, RecordBatch};
+use arrow_schema::{DataType as ArrowDataType, DataType, Field, Schema};
+use parquet::arrow::arrow_reader::{
+ArrowReaderMetadata, ArrowReaderOptions, ParquetRecordBatchReaderBuilder,
+};
+use parquet::arrow::ArrowWriter;
+use parquet::data_type::{ByteArray, ByteArrayType};
+use parquet::encryption::decrypt::FileDecryptionProperties;
+use parquet::encryption::encrypt::FileEncryptionProperties;
+use parquet::file::metadata::ParquetMetaData;
+use parquet::file::properties::WriterProperties;
+use parquet::file::writer::SerializedFileWriter;
+use parquet::schema::parser::parse_message_type;
+use std::fs::File;
+use std::sync::Arc;
+
+#[test]
+fn test_non_uniform_encryption_plaintext_footer() {
+let test_data = arrow::util::test_util::parquet_test_data();
+let path = 
format!("{test_data}/encrypt_columns_plaintext_footer.parquet.encrypted");
+let file = File::open(path).unwrap();
+
+// There is always a footer key even with a plaintext footer,
+// but this is used for signing the footer.
+let footer_key = "0123456789012345".as_bytes(); // 128bit/16
+let column_1_key = "1234567890123450".as_bytes();
+let column_2_key = "1234567890123451".as_bytes();
+
+let decryption_properties = 
FileDecryptionProperties::builder(footer_key.to_vec())
+.with_column_key("double_field", column_1_key.to_vec())
+.with_column_key("float_field", column_2_key.to_vec())
+.build()
+.unwrap();
+
+verify_encryption_test_file_read(file, decryption_properties);
+}
+
+#[test]
+fn test_non_uniform_encryption_disabled_aad_storage() {
+let test_data = arrow::util::test_util::parquet_test_data();
+let path =
+
format!("{test_data}/encrypt_columns_and_footer_disable_aad_storage.parquet.encrypted");
+let file = File::open(path.clone()).unwrap();
+
+let footer_key = b"0123456789012345".to_vec(); // 128bit/16
+let column_1_key = b"1234567890123450".to_vec();
+let column_2_key = b"1234567890123451".to_vec();
+
+// Can read successfully when providing the correct AAD prefix
+let decryption_properties = 
FileDecryptionProperties::builder(footer_key.clone())
+.with_column_key("double_field", column_1_key.clone())
+.with_column_key("float_field", column_2_key.clone())
+.with_aad_prefix(b"tester".to_vec())
+.build()
+.unwrap();
+
+verify_encryption_test_file_read(file, decryption_properties);
+
+// Using wrong AAD prefix should fail
+let decryption_properties = 
FileDecryptionProperties::builder(footer_key.clone())
+.with_column_key("double_field", column_1_key.clone())
+.with_column_key("float_field", column_2_key.clone())
+.with_aad_prefix(b"wrong_aad_prefix".to_vec())
+.build()
+.unwrap();
+
+let file = File::open(path.clone()).unwrap();
+let options = ArrowReaderOptions::default()
+.with_file_decryption_properties(decryption_properties.clone());
+let result = ArrowReaderMetadata::load(&file, options.clone());
+assert!(result.is_err());
+assert_eq!(
+result.unwrap_err().to_string(),
+"Parquet error: Provided footer key and AAD were unable to decrypt 
parquet footer"
+);
+
+// Not providing any AAD prefix should fail as it isn't stored in the file
+let decryption_properties = FileDecryptionProperties::builder(footer_key)
+.with_column_key("double_field", column_1_key)
+.with_column_key("float_field", column_2_key)
+.build()
+.unwrap();
+
+let file = File::open(path).unwrap();
+let options = ArrowReaderOptions::default()
+.with_file_

Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-22 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2006626887


##
parquet/src/encryption/encrypt.rs:
##
@@ -0,0 +1,343 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::encryption::ciphers::{BlockEncryptor, RingGcmBlockEncryptor};
+use crate::errors::{ParquetError, Result};
+use crate::file::column_crypto_metadata::{ColumnCryptoMetaData, 
EncryptionWithColumnKey};
+use crate::schema::types::{ColumnDescPtr, SchemaDescriptor};
+use crate::thrift::TSerializable;
+use ring::rand::{SecureRandom, SystemRandom};
+use std::collections::{HashMap, HashSet};
+use std::io::Write;
+use thrift::protocol::TCompactOutputProtocol;
+
+#[derive(Debug, Clone, PartialEq)]
+struct EncryptionKey {
+key: Vec,
+key_metadata: Option>,
+}
+
+impl EncryptionKey {
+fn new(key: Vec) -> EncryptionKey {
+Self {
+key,
+key_metadata: None,
+}
+}
+
+fn with_metadata(mut self, metadata: Vec) -> Self {
+self.key_metadata = Some(metadata);
+self
+}
+
+fn key(&self) -> &Vec {
+&self.key
+}
+}
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct FileEncryptionProperties {
+encrypt_footer: bool,
+footer_key: EncryptionKey,
+column_keys: HashMap,
+aad_prefix: Option>,
+store_aad_prefix: bool,
+}
+
+impl FileEncryptionProperties {
+/// Create a new builder for encryption properties
+pub fn builder(footer_key: Vec) -> EncryptionPropertiesBuilder {
+EncryptionPropertiesBuilder::new(footer_key)
+}
+
+/// Should the footer be encrypted
+pub fn encrypt_footer(&self) -> bool {
+self.encrypt_footer
+}
+
+/// Retrieval metadata of key used for encryption of footer and (possibly) 
columns
+pub fn footer_key_metadata(&self) -> Option<&Vec> {
+self.footer_key.key_metadata.as_ref()
+}
+
+/// AAD prefix string uniquely identifies the file and prevents file 
swapping
+pub fn aad_prefix(&self) -> Option<&Vec> {
+self.aad_prefix.as_ref()
+}
+
+/// Should the AAD prefix be stored in the file
+pub fn store_aad_prefix(&self) -> bool {
+self.store_aad_prefix && self.aad_prefix.is_some()
+}
+
+/// Checks if columns that are to be encrypted are present in schema
+#[cfg(feature = "encryption")]

Review Comment:
   ```suggestion
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-21 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2744283131

   @mbutrovich  -- I think you are interested in this feature too. Do you have 
some time to help review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-21 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2744285058

   I have this on my list


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-21 Thread via GitHub


rok commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2744103847

   @alamb @ggershinsky @etseidl - I think this is ripe for review, both from 
encryption and "arrow" standpoint.
   I'm travelling from Monday on and have limited bandwidth but @adamreeve will 
probably be available to reply.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-21 Thread via GitHub


rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2007606880


##
parquet/tests/arrow_reader/encryption.rs:
##
@@ -201,3 +211,410 @@ fn verify_encryption_test_file_read(file: File, 
decryption_properties: FileDecry
 
 verify_encryption_test_data(record_batches, metadata);
 }
+
+fn row_group_sizes(metadata: &ParquetMetaData) -> Vec {
+metadata.row_groups().iter().map(|x| x.num_rows()).collect()
+}
+
+#[test]
+fn test_uniform_encryption_roundtrip() {

Review Comment:
   Done: 
https://github.com/apache/arrow-rs/pull/7111/commits/ce8d2a91e2163791f88ac9286c62cb1a41ad2c86
   Not sure if this is the best way to split tests, but it does remove a lot of 
`#cfg` logic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-20 Thread via GitHub


rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2004524736


##
parquet/src/file/metadata/writer.rs:
##
@@ -182,6 +258,95 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
 self.key_value_metadata = Some(key_value_metadata);
 self
 }
+
+#[cfg(feature = "encryption")]
+pub fn with_file_encryptor(mut self, file_encryptor: 
Option>) -> Self {
+self.file_encryptor = file_encryptor;
+self
+}
+
+#[cfg(feature = "encryption")]
+fn file_crypto_metadata(
+file_encryptor: &Arc,
+) -> Result {
+let properties = file_encryptor.properties();
+let supply_aad_prefix = properties
+.aad_prefix()
+.map(|_| !properties.store_aad_prefix());
+let encryption_algorithm = AesGcmV1 {
+aad_prefix: properties.aad_prefix().cloned(),

Review Comment:
   How about: 
https://github.com/apache/arrow-rs/blob/b01b2851be51dbe32af106d5fe78cd5bafa9aaa6/parquet/src/file/metadata/reader.rs#L735-L744



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-19 Thread via GitHub


rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2004728402


##
parquet/src/file/metadata/writer.rs:
##
@@ -182,6 +258,95 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
 self.key_value_metadata = Some(key_value_metadata);
 self
 }
+
+#[cfg(feature = "encryption")]
+pub fn with_file_encryptor(mut self, file_encryptor: 
Option>) -> Self {
+self.file_encryptor = file_encryptor;
+self
+}
+
+#[cfg(feature = "encryption")]
+fn file_crypto_metadata(
+file_encryptor: &Arc,
+) -> Result {
+let properties = file_encryptor.properties();
+let supply_aad_prefix = properties
+.aad_prefix()
+.map(|_| !properties.store_aad_prefix());
+let encryption_algorithm = AesGcmV1 {
+aad_prefix: properties.aad_prefix().cloned(),

Review Comment:
   Done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-19 Thread via GitHub


rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2004683314


##
parquet/tests/arrow_reader/encryption.rs:
##
@@ -201,3 +211,410 @@ fn verify_encryption_test_file_read(file: File, 
decryption_properties: FileDecry
 
 verify_encryption_test_data(record_batches, metadata);
 }
+
+fn row_group_sizes(metadata: &ParquetMetaData) -> Vec {
+metadata.row_groups().iter().map(|x| x.num_rows()).collect()
+}
+
+#[test]
+fn test_uniform_encryption_roundtrip() {

Review Comment:
   I'll try adding `parquet/tests/encryption` folder and move related things 
there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-19 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2004576722


##
parquet/src/file/metadata/writer.rs:
##
@@ -182,6 +258,95 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
 self.key_value_metadata = Some(key_value_metadata);
 self
 }
+
+#[cfg(feature = "encryption")]
+pub fn with_file_encryptor(mut self, file_encryptor: 
Option>) -> Self {
+self.file_encryptor = file_encryptor;
+self
+}
+
+#[cfg(feature = "encryption")]
+fn file_crypto_metadata(
+file_encryptor: &Arc,
+) -> Result {
+let properties = file_encryptor.properties();
+let supply_aad_prefix = properties
+.aad_prefix()
+.map(|_| !properties.store_aad_prefix());
+let encryption_algorithm = AesGcmV1 {
+aad_prefix: properties.aad_prefix().cloned(),

Review Comment:
   :+1: that looks right to me, although you can avoid the `clone` by matching 
on a reference:
   ```rust
   let supply_aad_prefix = match &t_file_crypto_metadata.encryption_algorithm {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-19 Thread via GitHub


rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2004525536


##
parquet/src/file/writer.rs:
##
@@ -171,7 +183,30 @@ impl SerializedFileWriter {
 /// Creates new file writer.
 pub fn new(buf: W, schema: TypePtr, properties: WriterPropertiesPtr) -> 
Result {
 let mut buf = TrackedWrite::new(buf);
-Self::start_file(&mut buf)?;
+
+#[cfg(feature = "encryption")]
+let file_encryptor = match 
properties.file_encryption_properties.as_ref() {
+None => None,
+Some(encryption_props) => 
Some(Arc::new(FileEncryptor::new(encryption_props.clone())?)),
+};
+
+#[cfg(feature = "encryption")]
+if properties.file_encryption_properties.is_some() {
+properties
+.file_encryption_properties
+.clone()
+.unwrap()
+
.validate_encrypted_column_names(SchemaDescriptor::new(schema.clone()))?;
+}

Review Comment:
   Good point about the schema. Changed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-19 Thread via GitHub


rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2004521719


##
parquet/src/encryption/ciphers.rs:
##
@@ -61,3 +64,118 @@ impl BlockDecryptor for RingGcmBlockDecryptor {
 Ok(result)
 }
 }
+
+pub trait BlockEncryptor: Debug + Send + Sync {
+fn encrypt(&mut self, plaintext: &[u8], aad: &[u8]) -> Result>;
+}
+
+#[derive(Debug, Clone)]
+struct CounterNonce {
+start: u128,
+counter: u128,
+}
+
+impl CounterNonce {
+pub fn new(rng: &SystemRandom) -> Result {
+let mut buf = [0; 16];
+rng.fill(&mut buf)?;
+
+// Since this is a random seed value, endianness doesn't matter at all,
+// and we can use whatever is platform-native.
+let start = u128::from_ne_bytes(buf) & RIGHT_TWELVE;
+let counter = start.wrapping_add(1);
+
+Ok(Self { start, counter })
+}
+
+/// One accessor for the nonce bytes to avoid potentially flipping 
endianness
+#[inline]
+pub fn get_bytes(&self) -> [u8; NONCE_LEN] {
+self.counter.to_le_bytes()[0..NONCE_LEN].try_into().unwrap()
+}
+}
+
+impl NonceSequence for CounterNonce {
+fn advance(&mut self) -> Result {
+// If we've wrapped around, we've exhausted this nonce sequence
+if (self.counter & RIGHT_TWELVE) == (self.start & RIGHT_TWELVE) {
+Err(ring::error::Unspecified)
+} else {
+// Otherwise, just advance and return the new value
+let buf: [u8; NONCE_LEN] = self.get_bytes();
+self.counter = self.counter.wrapping_add(1);
+Ok(ring::aead::Nonce::assume_unique_for_key(buf))
+}
+}
+}
+
+#[derive(Debug, Clone)]
+pub(crate) struct RingGcmBlockEncryptor {
+key: LessSafeKey,
+nonce_sequence: CounterNonce,
+}
+
+impl RingGcmBlockEncryptor {
+/// Create a new `RingGcmBlockEncryptor` with a given key and random nonce.
+/// The nonce will advance appropriately with each block encryption and
+/// return an error if it wraps around.
+pub(crate) fn new(key_bytes: &[u8]) -> Result {
+let rng = SystemRandom::new();
+
+// todo support other key sizes
+let key = UnboundKey::new(&AES_128_GCM, key_bytes)
+.map_err(|e| general_err!("Error creating AES key: {}", e))?;
+let nonce = CounterNonce::new(&rng)?;
+
+Ok(Self {
+key: LessSafeKey::new(key),
+nonce_sequence: nonce,
+})
+}
+}
+
+impl BlockEncryptor for RingGcmBlockEncryptor {
+fn encrypt(&mut self, plaintext: &[u8], aad: &[u8]) -> Result> {
+// Create encrypted buffer.
+// Format is: [ciphertext size, nonce, ciphertext, authentication tag]
+let ciphertext_length = NONCE_LEN + plaintext.len() + TAG_LEN;
+let mut ciphertext = Vec::with_capacity(SIZE_LEN + ciphertext_length);
+ciphertext.extend((ciphertext_length as u32).to_le_bytes());

Review Comment:
   Fair point! I've changed this to:
   ```rust
   let ciphertext_length : u32 = (NONCE_LEN + plaintext.len() + 
TAG_LEN).try_into()
   .map_err(|err|
   General(format!("Plaintext data too long. {:?}", err)))?;
   // Not checking for overflow here because we've already checked for 
it with ciphertext_length
   let mut ciphertext = Vec::with_capacity(SIZE_LEN + ciphertext_length 
as usize);
   ciphertext.extend((ciphertext_length).to_le_bytes());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-19 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2003758011


##
parquet/src/file/writer.rs:
##
@@ -171,7 +183,30 @@ impl SerializedFileWriter {
 /// Creates new file writer.
 pub fn new(buf: W, schema: TypePtr, properties: WriterPropertiesPtr) -> 
Result {
 let mut buf = TrackedWrite::new(buf);
-Self::start_file(&mut buf)?;
+
+#[cfg(feature = "encryption")]
+let file_encryptor = match 
properties.file_encryption_properties.as_ref() {
+None => None,
+Some(encryption_props) => 
Some(Arc::new(FileEncryptor::new(encryption_props.clone())?)),
+};
+
+#[cfg(feature = "encryption")]
+if properties.file_encryption_properties.is_some() {
+properties
+.file_encryption_properties
+.clone()
+.unwrap()
+
.validate_encrypted_column_names(SchemaDescriptor::new(schema.clone()))?;
+}

Review Comment:
   Oh I see you added the `Arc::new(`. Yeah that's needed, although the 
`schema.clone()` should be able to just be `schema`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-19 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2003708546


##
parquet/src/file/writer.rs:
##
@@ -171,7 +183,30 @@ impl SerializedFileWriter {
 /// Creates new file writer.
 pub fn new(buf: W, schema: TypePtr, properties: WriterPropertiesPtr) -> 
Result {
 let mut buf = TrackedWrite::new(buf);
-Self::start_file(&mut buf)?;
+
+#[cfg(feature = "encryption")]
+let file_encryptor = match 
properties.file_encryption_properties.as_ref() {
+None => None,
+Some(encryption_props) => 
Some(Arc::new(FileEncryptor::new(encryption_props.clone())?)),
+};
+
+#[cfg(feature = "encryption")]
+if properties.file_encryption_properties.is_some() {
+properties
+.file_encryption_properties
+.clone()
+.unwrap()
+
.validate_encrypted_column_names(SchemaDescriptor::new(schema.clone()))?;
+}

Review Comment:
   Yes that bit should be able to be changed if you create the 
schema_descriptor above here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-19 Thread via GitHub


rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2003700114


##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
 #[derive(Default)]
 struct ArrowPageWriter {
 buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) -> 
Self {
+self.page_encryptor = page_encryptor;
+self
+}
 }
 
 impl PageWriter for ArrowPageWriter {
 fn write_page(&mut self, page: CompressedPage) -> Result {
 let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;

Review Comment:
   Yeah, that's a nice cleanup, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-18 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2002341357


##
parquet/src/file/metadata/writer.rs:
##
@@ -182,6 +258,95 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
 self.key_value_metadata = Some(key_value_metadata);
 self
 }
+
+#[cfg(feature = "encryption")]
+pub fn with_file_encryptor(mut self, file_encryptor: 
Option>) -> Self {
+self.file_encryptor = file_encryptor;
+self
+}
+
+#[cfg(feature = "encryption")]
+fn file_crypto_metadata(
+file_encryptor: &Arc,
+) -> Result {
+let properties = file_encryptor.properties();
+let supply_aad_prefix = properties
+.aad_prefix()
+.map(|_| !properties.store_aad_prefix());
+let encryption_algorithm = AesGcmV1 {
+aad_prefix: properties.aad_prefix().cloned(),

Review Comment:
   We shouldn't set `aad_prefix` here unless `properties.store_aad_prefix()` is 
true. (I think this is my mistake!)
   
   We should also add a test that we can't read a file written with 
`store_aad_prefix = false` unless the AAD prefix is passed.
   
   When reading, I guess we could check `supply_aad_prefix` as an indication 
that an AAD prefix is expected and raise a more helpful error if one isn't 
supplied, but that's probably out of scope for this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-18 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r2002336070


##
parquet/src/encryption/encrypt.rs:
##
@@ -0,0 +1,339 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::encryption::ciphers::{BlockEncryptor, RingGcmBlockEncryptor};
+use crate::errors::{ParquetError, Result};
+use crate::file::column_crypto_metadata::{ColumnCryptoMetaData, 
EncryptionWithColumnKey};
+use crate::schema::types::{ColumnDescPtr, SchemaDescriptor};
+use crate::thrift::TSerializable;
+use ring::rand::{SecureRandom, SystemRandom};
+use std::collections::{HashMap, HashSet};
+use std::io::Write;
+use thrift::protocol::TCompactOutputProtocol;
+
+#[derive(Debug, Clone, PartialEq)]
+struct EncryptionKey {
+key: Vec,
+key_metadata: Option>,
+}
+
+impl EncryptionKey {
+fn new(key: Vec) -> EncryptionKey {
+Self {
+key,
+key_metadata: None,
+}
+}
+
+fn with_metadata(mut self, metadata: Vec) -> Self {
+self.key_metadata = Some(metadata);
+self
+}
+
+fn key(&self) -> &Vec {
+&self.key
+}
+}
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct FileEncryptionProperties {
+encrypt_footer: bool,
+footer_key: EncryptionKey,
+column_keys: HashMap,
+aad_prefix: Option>,
+store_aad_prefix: bool,
+}
+
+impl FileEncryptionProperties {
+/// Create a new builder for encryption properties
+pub fn builder(footer_key: Vec) -> EncryptionPropertiesBuilder {
+EncryptionPropertiesBuilder::new(footer_key)
+}
+
+/// Should the footer be encrypted
+pub fn encrypt_footer(&self) -> bool {
+self.encrypt_footer
+}
+
+/// Retrieval metadata of key used for encryption of footer and (possibly) 
columns
+pub fn footer_key_metadata(&self) -> Option<&Vec> {
+self.footer_key.key_metadata.as_ref()
+}
+
+/// AAD prefix string uniquely identifies the file and prevents file 
swapping
+pub fn aad_prefix(&self) -> Option<&Vec> {
+self.aad_prefix.as_ref()
+}
+
+/// Should the AAD prefix should be stored in the file
+pub fn store_aad_prefix(&self) -> bool {
+self.store_aad_prefix && self.aad_prefix.is_some()
+}
+
+/// Checks if columns that are to be encrypted are present in schema
+#[cfg(feature = "encryption")]
+pub(crate) fn validate_encrypted_column_names(
+&self,
+schema: SchemaDescriptor,
+) -> std::result::Result<(), ParquetError> {
+let schema_columns = schema
+.columns()
+.iter()
+.map(|c| c.path().string())
+.collect::>();
+let encryption_columns = self
+.column_keys
+.keys()
+.cloned()
+.collect::>();
+if !encryption_columns.is_subset(&schema_columns) {
+let mut columns_missing_in_schema = encryption_columns
+.difference(&schema_columns)
+.cloned()
+.collect::>();
+columns_missing_in_schema.sort();
+return Err(ParquetError::General(
+format!(
+"The following columns with encryption keys specified were 
not found in the schema: {}",
+columns_missing_in_schema.join(", ")
+)
+.to_string(),
+));
+}
+Ok(())
+}
+}
+
+pub struct EncryptionPropertiesBuilder {
+footer_key: EncryptionKey,
+column_keys: HashMap,
+aad_prefix: Option>,
+encrypt_footer: bool,
+store_aad_prefix: bool,
+}
+
+impl EncryptionPropertiesBuilder {
+pub fn new(footer_key: Vec) -> EncryptionPropertiesBuilder {
+Self {
+footer_key: EncryptionKey::new(footer_key),
+column_keys: HashMap::default(),
+aad_prefix: None,
+encrypt_footer: true,
+store_aad_prefix: true,
+}
+}
+
+/// Set if the footer be encrypted

Review Comment:
   ```suggestion
   /// Set if the footer should be encrypted. Defaults to true.
   ```



##
parquet/src/encryption/ciphers.rs:
##
@@ -61,3 +64,118 @@ impl BlockDecryptor

Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-18 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r174296


##
parquet/src/file/writer.rs:
##
@@ -699,44 +803,96 @@ impl<'a> SerializedColumnWriter<'a> {
 /// `SerializedPageWriter` should not be used after calling `close()`.
 pub struct SerializedPageWriter<'a, W: Write> {
 sink: &'a mut TrackedWrite,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
 }
 
 impl<'a, W: Write> SerializedPageWriter<'a, W> {
 /// Creates new page writer.
 pub fn new(sink: &'a mut TrackedWrite) -> Self {
-Self { sink }
+Self {
+sink,
+page_encryptor: None,
+}
+}
+
+#[cfg(feature = "encryption")]
+/// Set the encryptor to use to encrypt page data
+fn with_page_encryptor(mut self, page_encryptor: Option) -> 
Self {
+self.page_encryptor = page_encryptor;
+self
 }
 
 /// Serializes page header into Thrift.
 /// Returns number of bytes that have been written into the sink.
 #[inline]
 fn serialize_page_header(&mut self, header: parquet::PageHeader) -> 
Result {
 let start_pos = self.sink.bytes_written();
-{
-let mut protocol = TCompactOutputProtocol::new(&mut self.sink);
-header.write_to_out_protocol(&mut protocol)?;
+match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(page_encryptor) => {
+page_encryptor.encrypt_page_header(&header, &mut self.sink)?;
+}
+_ => {
+let mut protocol = TCompactOutputProtocol::new(&mut self.sink);
+header.write_to_out_protocol(&mut protocol)?;
+}
 }
 Ok(self.sink.bytes_written() - start_pos)
 }
 }
 
+trait PageModuleWriter {
+fn serialize_page(&mut self, page: &CompressedPage) -> Result>;
+}
+
+#[cfg(not(feature = "encryption"))]
+impl PageModuleWriter for SerializedPageWriter<'_, W> {
+fn serialize_page(&mut self, page: &CompressedPage) -> Result> {
+Ok(page.data().to_vec())

Review Comment:
   It's not great that we need to copy all the page data here when we didn't 
before. I think if we can do something like 
https://github.com/apache/arrow-rs/pull/7111#discussion_r133781 then this 
wouldn't be needed. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-18 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r220283


##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
 #[derive(Default)]
 struct ArrowPageWriter {
 buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) -> 
Self {
+self.page_encryptor = page_encryptor;
+self
+}
 }
 
 impl PageWriter for ArrowPageWriter {
 fn write_page(&mut self, page: CompressedPage) -> Result {
 let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;

Review Comment:
   I've pushed a commit to change this, let me know what you think: 
https://github.com/apache/arrow-rs/pull/7111/commits/4032b4f70c5407a1ce98358d1440cd00771543d5



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-17 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r136575


##
parquet/src/file/writer.rs:
##
@@ -171,7 +183,30 @@ impl SerializedFileWriter {
 /// Creates new file writer.
 pub fn new(buf: W, schema: TypePtr, properties: WriterPropertiesPtr) -> 
Result {
 let mut buf = TrackedWrite::new(buf);
-Self::start_file(&mut buf)?;
+
+#[cfg(feature = "encryption")]
+let file_encryptor = match 
properties.file_encryption_properties.as_ref() {
+None => None,
+Some(encryption_props) => 
Some(Arc::new(FileEncryptor::new(encryption_props.clone())?)),
+};
+
+#[cfg(feature = "encryption")]
+if properties.file_encryption_properties.is_some() {
+properties
+.file_encryption_properties
+.clone()
+.unwrap()
+
.validate_encrypted_column_names(SchemaDescriptor::new(schema.clone()))?;
+}
+
+#[cfg(feature = "encryption")]
+if let Some(ref file_encryption_properties) = 
properties.file_encryption_properties {
+if !file_encryption_properties.encrypt_footer() {
+return Err(general_err!("Footer encryption is not supported 
yet"));

Review Comment:
   ```suggestion
   return Err(general_err!("Writing encrypted files with 
plaintext footers is not supported yet"));
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-17 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r255941


##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
 #[derive(Default)]
 struct ArrowPageWriter {
 buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) -> 
Self {
+self.page_encryptor = page_encryptor;
+self
+}
 }
 
 impl PageWriter for ArrowPageWriter {
 fn write_page(&mut self, page: CompressedPage) -> Result {
 let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;
+
+let mut header = Vec::with_capacity(1024);
+match self.page_encryptor.as_ref() {

Review Comment:
   I spent a bit more time on this and made some progress to demonstrate what I 
was thinking, but have run into issues with my trait not being object-safe / 
dyn-compatible: 
https://github.com/adamreeve/arrow-rs/commit/c94ddd07f95ff09076019e8854f4d9f1f065683a
   
   I'm not sure it's worth pursuing this further, I think the current state is 
probably fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-17 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r169215


##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
 #[derive(Default)]
 struct ArrowPageWriter {
 buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) -> 
Self {
+self.page_encryptor = page_encryptor;
+self
+}
 }
 
 impl PageWriter for ArrowPageWriter {
 fn write_page(&mut self, page: CompressedPage) -> Result {
 let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;
+
+let mut header = Vec::with_capacity(1024);
+match self.page_encryptor.as_ref() {

Review Comment:
   Yeah after looking at this in more detail, I don't think there's an easy way 
to share this logic as `ArrowPageWriter` and `SerializedPageWriter` are too 
different. I think we should get rid of the `PageModuleWriter` traits but 
keeping separate `serialize_page` methods for with and without encryption 
enabled would still be nice.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-17 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r145331


##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
 #[derive(Default)]
 struct ArrowPageWriter {
 buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) -> 
Self {
+self.page_encryptor = page_encryptor;
+self
+}
 }
 
 impl PageWriter for ArrowPageWriter {
 fn write_page(&mut self, page: CompressedPage) -> Result {
 let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;
+
+let mut header = Vec::with_capacity(1024);
+match self.page_encryptor.as_ref() {

Review Comment:
   If the logic can't be shared easily then I don't think the new traits are 
useful, `serialize_page` could just be a normal method under `impl 
ArrowPageWriter` and `impl SerializedPageWriter`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-17 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r142005


##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
 #[derive(Default)]
 struct ArrowPageWriter {
 buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) -> 
Self {
+self.page_encryptor = page_encryptor;
+self
+}
 }
 
 impl PageWriter for ArrowPageWriter {
 fn write_page(&mut self, page: CompressedPage) -> Result {
 let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;
+
+let mut header = Vec::with_capacity(1024);
+match self.page_encryptor.as_ref() {

Review Comment:
   I see you've added a `PageModuleWriter` trait that tidies this up a bit 
although this isn't really what I had in mind, I was hoping this logic could be 
reused across the `ArrowPageWriter` and `SerializedPageWriter` rather than 
having a similar trait in two different places. I'll see if I can make a change 
to show what I mean.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-17 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r133781


##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
 #[derive(Default)]
 struct ArrowPageWriter {
 buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) -> 
Self {
+self.page_encryptor = page_encryptor;
+self
+}
 }
 
 impl PageWriter for ArrowPageWriter {
 fn write_page(&mut self, page: CompressedPage) -> Result {
 let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;

Review Comment:
   Ah not really, I was thinking more like having a method that could encrypt 
the page, eg:
   `fn encrypt_page(page: CompressedPage) -> CompressedPage`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-17 Thread via GitHub


rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r1999681758


##
parquet/src/encryption/encrypt.rs:
##
@@ -152,8 +152,20 @@ impl EncryptionPropertiesBuilder {
 /// Set the key used for encryption of a column. Note that if no column 
keys are provided but
 /// footer key is all columns will be encrypted with the footer key. If 
column keys are provided
 /// only the columns with a key will be encrypted even if footer key is 
provided.
-pub fn with_column_key(mut self, column_path: String, encryption_key: 
EncryptionKey) -> Self {
-self.column_keys.insert(column_path, encryption_key);
+pub fn with_column_key(mut self, column_name: &str, encryption_key: 
EncryptionKey) -> Self {
+self.column_keys.insert(column_name.to_string(), encryption_key);
+self
+}

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-17 Thread via GitHub


corwinjoy commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r1999659239


##
parquet/src/encryption/encrypt.rs:
##
@@ -152,8 +152,20 @@ impl EncryptionPropertiesBuilder {
 /// Set the key used for encryption of a column. Note that if no column 
keys are provided but
 /// footer key is all columns will be encrypted with the footer key. If 
column keys are provided
 /// only the columns with a key will be encrypted even if footer key is 
provided.
-pub fn with_column_key(mut self, column_path: String, encryption_key: 
EncryptionKey) -> Self {
-self.column_keys.insert(column_path, encryption_key);
+pub fn with_column_key(mut self, column_name: &str, encryption_key: 
EncryptionKey) -> Self {
+self.column_keys.insert(column_name.to_string(), encryption_key);
+self
+}

Review Comment:
   What I would propose instead is
   ```
   pub fn with_column_key(mut self, column_name: &str, key: Vec) -> Self {
   self.column_keys.insert(column_name.to_string(), 
EncryptionKey::new(key));
   self
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-16 Thread via GitHub


rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r1997653819


##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
 #[derive(Default)]
 struct ArrowPageWriter {
 buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) -> 
Self {
+self.page_encryptor = page_encryptor;
+self
+}
 }
 
 impl PageWriter for ArrowPageWriter {
 fn write_page(&mut self, page: CompressedPage) -> Result {
 let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;

Review Comment:
   Do you mean like passing `data.len()` into `to_thrift_header` method?



##
parquet/src/encryption/encrypt.rs:
##
@@ -0,0 +1,283 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::encryption::ciphers::{BlockEncryptor, RingGcmBlockEncryptor};
+use crate::errors::{ParquetError, Result};
+use crate::file::column_crypto_metadata::{ColumnCryptoMetaData, 
EncryptionWithColumnKey};
+use crate::schema::types::{ColumnDescPtr, SchemaDescriptor};
+use crate::thrift::TSerializable;
+use ring::rand::{SecureRandom, SystemRandom};
+use std::collections::{HashMap, HashSet};
+use std::io::Write;
+use thrift::protocol::TCompactOutputProtocol;
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct EncryptionKey {
+key: Vec,
+key_metadata: Option>,
+}
+
+impl EncryptionKey {
+pub fn new(key: Vec) -> EncryptionKey {
+Self {
+key,
+key_metadata: None,
+}
+}
+
+pub fn with_metadata(mut self, metadata: Vec) -> Self {
+self.key_metadata = Some(metadata);
+self
+}
+
+pub fn key(&self) -> &Vec {
+&self.key
+}
+}
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct FileEncryptionProperties {
+encrypt_footer: bool,
+footer_key: EncryptionKey,
+column_keys: HashMap,
+aad_prefix: Option>,
+store_aad_prefix: bool,
+}
+
+impl FileEncryptionProperties {
+pub fn builder(footer_key: Vec) -> EncryptionPropertiesBuilder {
+EncryptionPropertiesBuilder::new(footer_key)
+}
+
+pub fn encrypt_footer(&self) -> bool {
+self.encrypt_footer
+}
+
+pub fn footer_key_metadata(&self) -> Option<&Vec> {
+self.footer_key.key_metadata.as_ref()
+}
+
+pub fn aad_prefix(&self) -> Option<&Vec> {
+self.aad_prefix.as_ref()
+}
+
+pub fn store_aad_prefix(&self) -> bool {
+self.store_aad_prefix && self.aad_prefix.is_some()
+}
+
+/// Checks if columns that are to be encrypted are present in schema
+#[cfg(feature = "encryption")]
+pub(crate) fn encrypted_columns_in_schema(
+&self,
+schema: SchemaDescriptor,
+) -> std::result::Result<(), ParquetError> {
+let schema_columns = schema
+.columns()
+.iter()
+.map(|c| c.path().string())
+.collect::>();
+let encryption_columns = self
+.column_keys
+.keys()
+.cloned()
+.collect::>();
+if !encryption_columns.is_subset(&schema_columns) {
+let mut columns_missing_in_schema = encryption_columns
+.difference(&schema_columns)
+ 

Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-13 Thread via GitHub


rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r1994171086


##
parquet/src/encryption/encrypt.rs:
##
@@ -0,0 +1,283 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::encryption::ciphers::{BlockEncryptor, RingGcmBlockEncryptor};
+use crate::errors::{ParquetError, Result};
+use crate::file::column_crypto_metadata::{ColumnCryptoMetaData, 
EncryptionWithColumnKey};
+use crate::schema::types::{ColumnDescPtr, SchemaDescriptor};
+use crate::thrift::TSerializable;
+use ring::rand::{SecureRandom, SystemRandom};
+use std::collections::{HashMap, HashSet};
+use std::io::Write;
+use thrift::protocol::TCompactOutputProtocol;
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct EncryptionKey {
+key: Vec,
+key_metadata: Option>,
+}
+
+impl EncryptionKey {
+pub fn new(key: Vec) -> EncryptionKey {
+Self {
+key,
+key_metadata: None,
+}
+}
+
+pub fn with_metadata(mut self, metadata: Vec) -> Self {
+self.key_metadata = Some(metadata);
+self
+}
+
+pub fn key(&self) -> &Vec {
+&self.key
+}
+}
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct FileEncryptionProperties {
+encrypt_footer: bool,
+footer_key: EncryptionKey,
+column_keys: HashMap,
+aad_prefix: Option>,
+store_aad_prefix: bool,
+}
+
+impl FileEncryptionProperties {
+pub fn builder(footer_key: Vec) -> EncryptionPropertiesBuilder {
+EncryptionPropertiesBuilder::new(footer_key)
+}
+
+pub fn encrypt_footer(&self) -> bool {
+self.encrypt_footer
+}
+
+pub fn footer_key_metadata(&self) -> Option<&Vec> {
+self.footer_key.key_metadata.as_ref()
+}
+
+pub fn aad_prefix(&self) -> Option<&Vec> {
+self.aad_prefix.as_ref()
+}
+
+pub fn store_aad_prefix(&self) -> bool {
+self.store_aad_prefix && self.aad_prefix.is_some()
+}
+
+/// Checks if columns that are to be encrypted are present in schema
+#[cfg(feature = "encryption")]
+pub(crate) fn encrypted_columns_in_schema(

Review Comment:
   Done.



##
parquet/src/file/writer.rs:
##
@@ -171,19 +183,37 @@ impl SerializedFileWriter {
 /// Creates new file writer.
 pub fn new(buf: W, schema: TypePtr, properties: WriterPropertiesPtr) -> 
Result {
 let mut buf = TrackedWrite::new(buf);
-Self::start_file(&mut buf)?;
+
+#[cfg(feature = "encryption")]
+let file_encryptor = match 
properties.file_encryption_properties.as_ref() {
+None => None,
+Some(encryption_props) => 
Some(Arc::new(FileEncryptor::new(encryption_props.clone())?)),
+};
+
+#[cfg(feature = "encryption")]
+if properties.file_encryption_properties.is_some() {
+properties
+.file_encryption_properties
+.clone()
+.unwrap()
+
.encrypted_columns_in_schema(SchemaDescriptor::new(schema.clone()))?;
+}
+
+Self::start_file(&properties, &mut buf)?;
 Ok(Self {
 buf,
 schema: schema.clone(),
 descr: Arc::new(SchemaDescriptor::new(schema)),
-props: properties,
+props: properties.clone(),

Review Comment:
   Removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-13 Thread via GitHub


rok commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r1994145187


##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -3527,4 +3700,199 @@ mod tests {
 .unwrap();
 assert_eq!(batches.len(), 0);
 }
+
+#[cfg(feature = "encryption")]
+#[test]
+fn test_uniform_encryption_roundtrip() {

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-12 Thread via GitHub


adamreeve commented on code in PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#discussion_r1992607802


##
parquet/src/arrow/arrow_writer/mod.rs:
##
@@ -457,20 +476,49 @@ type SharedColumnChunk = Arc>;
 #[derive(Default)]
 struct ArrowPageWriter {
 buffer: SharedColumnChunk,
+#[cfg(feature = "encryption")]
+page_encryptor: Option,
+#[cfg(not(feature = "encryption"))]
+page_encryptor: Option,
+}
+
+#[cfg(feature = "encryption")]
+impl ArrowPageWriter {
+pub fn with_encryptor(mut self, page_encryptor: Option) -> 
Self {
+self.page_encryptor = page_encryptor;
+self
+}
 }
 
 impl PageWriter for ArrowPageWriter {
 fn write_page(&mut self, page: CompressedPage) -> Result {
 let mut buf = self.buffer.try_lock().unwrap();
-let page_header = page.to_thrift_header();
-let header = {
-let mut header = Vec::with_capacity(1024);
-let mut protocol = TCompactOutputProtocol::new(&mut header);
-page_header.write_to_out_protocol(&mut protocol)?;
-Bytes::from(header)
+
+let data = match self.page_encryptor.as_ref() {
+#[cfg(feature = "encryption")]
+Some(encryptor) => {
+let encrypted_buffer = encryptor.encrypt_page(&page)?;
+Bytes::from(encrypted_buffer)
+}
+_ => page.compressed_page().buffer().clone(),
+};
+
+let mut page_header = page.to_thrift_header();
+page_header.compressed_page_size = data.len() as i32;

Review Comment:
   Rather than overwriting the compressed page size here, we should probably 
encrypt the page before creating the header. That might require making the page 
mutable to allow overwriting its `buf` member? I'm not sure how best to handle 
this but it's a little hacky at the moment.



##
parquet/src/encryption/encrypt.rs:
##
@@ -0,0 +1,283 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::encryption::ciphers::{BlockEncryptor, RingGcmBlockEncryptor};
+use crate::errors::{ParquetError, Result};
+use crate::file::column_crypto_metadata::{ColumnCryptoMetaData, 
EncryptionWithColumnKey};
+use crate::schema::types::{ColumnDescPtr, SchemaDescriptor};
+use crate::thrift::TSerializable;
+use ring::rand::{SecureRandom, SystemRandom};
+use std::collections::{HashMap, HashSet};
+use std::io::Write;
+use thrift::protocol::TCompactOutputProtocol;
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct EncryptionKey {
+key: Vec,
+key_metadata: Option>,
+}
+
+impl EncryptionKey {
+pub fn new(key: Vec) -> EncryptionKey {
+Self {
+key,
+key_metadata: None,
+}
+}
+
+pub fn with_metadata(mut self, metadata: Vec) -> Self {
+self.key_metadata = Some(metadata);
+self
+}
+
+pub fn key(&self) -> &Vec {
+&self.key
+}
+}
+
+#[derive(Debug, Clone, PartialEq)]
+pub struct FileEncryptionProperties {
+encrypt_footer: bool,
+footer_key: EncryptionKey,
+column_keys: HashMap,
+aad_prefix: Option>,
+store_aad_prefix: bool,
+}
+
+impl FileEncryptionProperties {
+pub fn builder(footer_key: Vec) -> EncryptionPropertiesBuilder {
+EncryptionPropertiesBuilder::new(footer_key)
+}
+
+pub fn encrypt_footer(&self) -> bool {
+self.encrypt_footer
+}
+
+pub fn footer_key_metadata(&self) -> Option<&Vec> {
+self.footer_key.key_metadata.as_ref()
+}
+
+pub fn aad_prefix(&self) -> Option<&Vec> {
+self.aad_prefix.as_ref()
+}
+
+pub fn store_aad_prefix(&self) -> bool {
+self.store_aad_prefix && self.aad_prefix.is_some()
+}
+
+/// Checks if columns that are to be encrypted are present in schema
+#[cfg(feature = "encryption")]
+pub(crate) fn encrypted_columns_in_schema(
+&self,
+schema: SchemaDescriptor,
+) -> std::result::Result<(), ParquetError> {
+let schema_columns = schema
+.columns()
+.iter()
+.map(|c| c.path().string())
+.collect::>();
+let encryption_columns = self
+.column_keys
+.keys()
+.cl

Re: [PR] Add Parquet Modular encryption support (write) [arrow-rs]

2025-03-12 Thread via GitHub


alamb commented on PR #7111:
URL: https://github.com/apache/arrow-rs/pull/7111#issuecomment-2719134785

   The read part was broken out into
   - https://github.com/apache/arrow-rs/pull/6637


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]