[GitHub] [arrow] kiszk commented on pull request #7938: ARROW-9701: [CI][Java] Add a job for s390x Java on TravisCI

2020-10-02 Thread GitBox


kiszk commented on pull request #7938:
URL: https://github.com/apache/arrow/pull/7938#issuecomment-703045578


   @emkornfield  @kou Is it ok to merge this PR now? Or, do we still hold off 
for a while?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-10-02 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-703043604


   Got the actions to rerun, but some are still failing. As far as I can tell, 
these failures aren't due to the changes in this PR. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-703042966


   Revision: 5089cb76106ea909aced4e0db32c0d5e9a512bd3
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-589](https://github.com/ursa-labs/crossbow/branches/all?query=actions-589)
   
   |Task|Status|
   ||--|
   
|test-r-rhub-ubuntu-gcc-release|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rhub-ubuntu-gcc-release)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-589-azure-test-r-rhub-ubuntu-gcc-release)|
   
|test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rocker-r-base-latest)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-589-azure-test-r-rocker-r-base-latest)|
   
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-589-azure-test-r-rstudio-r-base-3.6-bionic)|
   
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-589-azure-test-r-rstudio-r-base-3.6-centos6)|
   
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-589-azure-test-r-rstudio-r-base-3.6-centos8)|
   
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-589-azure-test-r-rstudio-r-base-3.6-opensuse15)|
   
|test-r-rstudio-r-base-3.6-opensuse42|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rstudio-r-base-3.6-opensuse42)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-589-azure-test-r-rstudio-r-base-3.6-opensuse42)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-703042816


   @github-actions crossbow submit test-r-r*



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on pull request #8324: ARROW-10156: [Rust] Added github action to label PRs for rust.

2020-10-02 Thread GitBox


jorgecarleitao commented on pull request #8324:
URL: https://github.com/apache/arrow/pull/8324#issuecomment-703042742


   FYI @lidavidm (`lang-java`) and @romainfrancois (`lang-R`), since both of 
you are also tagging PRs with github labels.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield closed pull request #8293: ARROW-10127: Update specification for Decimal to allow for 256-bits

2020-10-02 Thread GitBox


emkornfield closed pull request #8293:
URL: https://github.com/apache/arrow/pull/8293


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #8293: ARROW-10127: Update specification for Decimal to allow for 256-bits

2020-10-02 Thread GitBox


emkornfield commented on pull request #8293:
URL: https://github.com/apache/arrow/pull/8293#issuecomment-703042352


   Merging based on vote.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #8330: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes

2020-10-02 Thread GitBox


emkornfield commented on a change in pull request #8330:
URL: https://github.com/apache/arrow/pull/8330#discussion_r499108858



##
File path: rust/parquet/src/arrow/arrow_writer.rs
##
@@ -688,4 +688,413 @@ mod tests {
 writer.write().unwrap();
 writer.close().unwrap();
 }
+
+const SMALL_SIZE: usize = 100;
+
+fn roundtrip(filename: , expected_batch: RecordBatch) {
+let file = get_temp_file(filename, &[]);
+
+let mut writer = ArrowWriter::try_new(
+file.try_clone().unwrap(),
+expected_batch.schema(),
+None,
+)
+.unwrap();
+writer.write(_batch).unwrap();
+writer.close().unwrap();
+
+let reader = SerializedFileReader::new(file).unwrap();
+let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(reader));
+let mut record_batch_reader = 
arrow_reader.get_record_reader(1024).unwrap();
+
+let actual_batch = record_batch_reader.next_batch().unwrap().unwrap();
+
+assert_eq!(expected_batch.schema(), actual_batch.schema());
+assert_eq!(expected_batch.num_columns(), actual_batch.num_columns());
+assert_eq!(expected_batch.num_rows(), actual_batch.num_rows());
+for i in 0..expected_batch.num_columns() {
+let expected_data = expected_batch.column(i).data();
+let actual_data = actual_batch.column(i).data();
+
+assert_eq!(expected_data.data_type(), actual_data.data_type());
+assert_eq!(expected_data.len(), actual_data.len());
+assert_eq!(expected_data.null_count(), actual_data.null_count());
+assert_eq!(expected_data.offset(), actual_data.offset());
+assert_eq!(expected_data.buffers(), actual_data.buffers());
+assert_eq!(expected_data.child_data(), actual_data.child_data());
+assert_eq!(expected_data.null_bitmap(), actual_data.null_bitmap());
+}
+}
+
+fn one_column_roundtrip(filename: , values: ArrayRef, nullable: bool) {
+let schema = Schema::new(vec![Field::new(
+"col",
+values.data_type().clone(),
+nullable,
+)]);
+let expected_batch =
+RecordBatch::try_new(Arc::new(schema), vec![values]).unwrap();
+
+roundtrip(filename, expected_batch);
+}
+
+fn values_required(iter: I, filename: )
+where
+A: From> + Array + 'static,
+I: IntoIterator,
+{
+let raw_values: Vec<_> = iter.into_iter().collect();
+let values = Arc::new(A::from(raw_values));
+one_column_roundtrip(filename, values, false);
+}
+
+fn values_optional(iter: I, filename: )
+where
+A: From>> + Array + 'static,
+I: IntoIterator,
+{
+let optional_raw_values: Vec<_> = iter
+.into_iter()
+.enumerate()
+.map(|(i, v)| if i % 2 == 0 { None } else { Some(v) })
+.collect();
+let optional_values = Arc::new(A::from(optional_raw_values));
+one_column_roundtrip(filename, optional_values, true);
+}
+
+fn required_and_optional(iter: I, filename: )
+where
+A: From> + From>> + Array + 'static,
+I: IntoIterator + Clone,
+{
+values_required::(iter.clone(), filename);
+values_optional::(iter, filename);
+}
+
+#[test]
+#[should_panic(expected = "Null arrays not supported")]
+fn null_single_column() {
+let values = Arc::new(NullArray::new(SMALL_SIZE));
+one_column_roundtrip("null_single_column", values.clone(), true);
+one_column_roundtrip("null_single_column", values, false);
+}
+
+#[test]
+#[should_panic(
+expected = "Attempting to write an Arrow type that is not yet 
implemented"
+)]
+fn bool_single_column() {
+required_and_optional::(
+[true, false].iter().cycle().copied().take(SMALL_SIZE),
+"bool_single_column",
+);
+}
+
+#[test]
+fn i8_single_column() {
+required_and_optional::(0..SMALL_SIZE as i8, 
"i8_single_column");
+}
+
+#[test]
+fn i16_single_column() {
+required_and_optional::(0..SMALL_SIZE as i16, 
"i16_single_column");
+}
+
+#[test]
+fn i32_single_column() {
+required_and_optional::(0..SMALL_SIZE as i32, 
"i32_single_column");
+}
+
+#[test]
+fn i64_single_column() {
+required_and_optional::(0..SMALL_SIZE as i64, 
"i64_single_column");
+}
+
+#[test]
+fn u8_single_column() {
+required_and_optional::(0..SMALL_SIZE as u8, 
"u8_single_column");
+}
+
+#[test]
+fn u16_single_column() {
+required_and_optional::(
+0..SMALL_SIZE as u16,
+"u16_single_column",
+);
+}
+
+#[test]
+fn u32_single_column() {
+required_and_optional::(
+0..SMALL_SIZE as u32,
+"u32_single_column",
+);
+

[GitHub] [arrow] emkornfield commented on pull request #8330: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes

2020-10-02 Thread GitBox


emkornfield commented on pull request #8330:
URL: https://github.com/apache/arrow/pull/8330#issuecomment-703034687


   > I am not sure the actual data arrays have to be the same -- what needs to 
be the same is the data rows where the null_bit is 0 (aka we should probably be 
checking only for non-null rows)
   
   Small nit I'm prettry sure  null_bit = 1 is actually where values should be 
compared.  (thinking of it as a validity bitmap instead of a null bitmap 
helps).  And yes I agree they should be compared separately, as there is 
generally no requirement on the values of null fields (generally, you'll want 
to initialize to some known values to avoid security issues).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] paddyhoran commented on a change in pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray

2020-10-02 Thread GitBox


paddyhoran commented on a change in pull request #8331:
URL: https://github.com/apache/arrow/pull/8331#discussion_r499105205



##
File path: rust/arrow/src/util/pretty.rs
##
@@ -126,15 +130,56 @@ fn array_value_to_string(column: array::ArrayRef, row: 
usize) -> Result
 DataType::Time64(unit) if *unit == TimeUnit::Nanosecond => {
 make_string!(array::Time64NanosecondArray, column, row)
 }
+DataType::Dictionary(index_type, _value_type) => match **index_type {
+DataType::Int8 => dict_array_value_to_string::(column, 
row),
+DataType::Int16 => dict_array_value_to_string::(column, 
row),
+DataType::Int32 => dict_array_value_to_string::(column, 
row),
+DataType::Int64 => dict_array_value_to_string::(column, 
row),
+DataType::UInt8 => dict_array_value_to_string::(column, 
row),
+DataType::UInt16 => 
dict_array_value_to_string::(column, row),
+DataType::UInt32 => 
dict_array_value_to_string::(column, row),
+DataType::UInt64 => 
dict_array_value_to_string::(column, row),
+_ => Err(ArrowError::InvalidArgumentError(format!(
+"Unsupported index type {:?} type for {:?} in repl.",

Review comment:
   I think this (pretty printing) was part of DataFusion's cli to begin 
with but was moved into `arrow` as it was generally useful.  I don't understand 
repl in this context either.  I suggest updating it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] paddyhoran commented on a change in pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray

2020-10-02 Thread GitBox


paddyhoran commented on a change in pull request #8331:
URL: https://github.com/apache/arrow/pull/8331#discussion_r499105205



##
File path: rust/arrow/src/util/pretty.rs
##
@@ -126,15 +130,56 @@ fn array_value_to_string(column: array::ArrayRef, row: 
usize) -> Result
 DataType::Time64(unit) if *unit == TimeUnit::Nanosecond => {
 make_string!(array::Time64NanosecondArray, column, row)
 }
+DataType::Dictionary(index_type, _value_type) => match **index_type {
+DataType::Int8 => dict_array_value_to_string::(column, 
row),
+DataType::Int16 => dict_array_value_to_string::(column, 
row),
+DataType::Int32 => dict_array_value_to_string::(column, 
row),
+DataType::Int64 => dict_array_value_to_string::(column, 
row),
+DataType::UInt8 => dict_array_value_to_string::(column, 
row),
+DataType::UInt16 => 
dict_array_value_to_string::(column, row),
+DataType::UInt32 => 
dict_array_value_to_string::(column, row),
+DataType::UInt64 => 
dict_array_value_to_string::(column, row),
+_ => Err(ArrowError::InvalidArgumentError(format!(
+"Unsupported index type {:?} type for {:?} in repl.",

Review comment:
   I think this was part of DataFusion but was moved into `arrow` as it was 
generally useful.  I don't understand repl in this context either.  I suggest 
updating it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

2020-10-02 Thread GitBox


nevi-me commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-703014936


   Merged



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on a change in pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

2020-10-02 Thread GitBox


nevi-me commented on a change in pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#discussion_r499097627



##
File path: rust/arrow/src/ipc/convert.rs
##
@@ -609,10 +621,45 @@ pub(crate) fn get_fb_field_type<'a: 'b, 'b>(
 children: Some(fbb.create_vector([..])),
 }
 }
+Dictionary(_, value_type) => {
+// In this library, the dictionary "type" is a logical construct. 
Here we
+// pass through to the value type, as we've already captured the 
index
+// type in the DictionaryEncoding metadata in the parent field
+get_fb_field_type(value_type, fbb)
+}
 t => unimplemented!("Type {:?} not supported", t),
 }
 }
 
+/// Create an IPC dictionary encoding
+pub(crate) fn get_fb_dictionary<'a: 'b, 'b>(
+index_type: ,
+dict_id: i64,
+dict_is_ordered: bool,
+fbb:  FlatBufferBuilder<'a>,
+) -> WIPOffset> {
+// We assume that the dictionary index type (as an integer) has already 
been
+// validated elsewhere, and can safely assume we are dealing with signed
+// integers
+let mut index_builder = ipc::IntBuilder::new(fbb);
+index_builder.add_is_signed(true);

Review comment:
   @carols10cents we can address this in future, I'm merging this





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

2020-10-02 Thread GitBox


nevi-me commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-703013274


   > How can I best help with getting the rust-parquet-arrow-writer branch 
ready to be merged into master?
   
   The main blocker right now is computing nested definition and repetition 
levels for deeply nested arrays (see 
https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow for the CPP 
implementation). I've been working on that on and off, due to time constraints 
(and the need to spend a good chunk of time per session; which I haven't had 
much of).
   
   Perhaps I can commit my work so you can have a look at that. I'll also open 
more JIRAs for work I believe we'd need to cover.
   
   Writing more test cases will also def help us. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types

2020-10-02 Thread GitBox


nevi-me commented on pull request #8291:
URL: https://github.com/apache/arrow/pull/8291#issuecomment-703011014


   Hi @carols10cents, my apologies for the delay.
   
   > Am I at least correct that adding dictionary support will involve adding 
support for the dictionary schema as well as the dictionary data? And that 
neither of those are done yet?
   
   The Parquet format's dictionary semantics are in my understanding different 
to Arrow, in that we could have a `StringArray` in Arrow that gets saved with 
Parquet's dictionary encoding. I'm still going through the rest of the Rust 
Parquet codebase though, so I'm not speaking from any authority.
   
   My comment was more in passing, and generally around dictionary support in 
IPC, in that beyond serializing an `arrow::datatypes::Schema` to 
`arrow::ipc::Schema`, there might be little in relation between the file 
formats.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-703005258


   Revision: 5089cb76106ea909aced4e0db32c0d5e9a512bd3
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-588](https://github.com/ursa-labs/crossbow/branches/all?query=actions-588)
   
   |Task|Status|
   ||--|
   
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-588-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-588-azure-test-r-rstudio-r-base-3.6-centos6)|
   
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-588-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-588-azure-test-r-rstudio-r-base-3.6-centos8)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-703003701


   @github-actions crossbow submit test-r-rstudio-r-base-3.6-centos8 
test-r-rstudio-r-base-3.6-centos6



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-703001148


   Revision: 5d1b6eec6dfcf4ed01b83c3501b72f96e7690bd1
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-587](https://github.com/ursa-labs/crossbow/branches/all?query=actions-587)
   
   |Task|Status|
   ||--|
   
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-587-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-587-azure-test-r-rstudio-r-base-3.6-centos6)|
   
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-587-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-587-azure-test-r-rstudio-r-base-3.6-centos8)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-703000798


   @github-actions crossbow submit test-r-rstudio-r-base-3.6-centos8 
test-r-rstudio-r-base-3.6-centos6



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702989907


   Revision: 176d1e6b1a96d8819be4bbbd0815067733c2e54b
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-586](https://github.com/ursa-labs/crossbow/branches/all?query=actions-586)
   
   |Task|Status|
   ||--|
   
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-586-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-586-azure-test-r-rstudio-r-base-3.6-centos8)|
   
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-586-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-586-azure-test-r-rstudio-r-base-3.6-opensuse15)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702989412


   @github-actions crossbow submit test-r-rstudio-r-base-3.6-centos8 
test-r-rstudio-r-base-3.6-opensuse15



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702983569


   Current R Linux static build status (using shared libs for curl and 
openssl/crypto, provided by the system):
   
   * test-r-rhub-ubuntu-gcc-release (ubuntu 16.04, gcc 5.4.0): FAIL. C++ build 
with S3 succeeds but R package link error: `undefined symbol: __cpu_model`
   * test-r-rocker-r-base-latest (debian:testing, gcc 9.3.0): SUCCESS with S3
   * test-r-rstudio-r-base-3.6-bionic (ubuntu 18.04, gcc 7.5.0): SUCCESS with S3
   * test-r-rstudio-r-base-3.6-centos6 (gcc 8.3.1): FAIL because system openssl 
is stale: `Found unsuitable version "1.0.1e", but required is at least "1.0.2"`
   * test-r-rstudio-r-base-3.6-centos7 (gcc 4.8.5): SUCCESS *without S3* (gcc 
too old)
   * test-r-rstudio-r-base-3.6-centos8 (gcc 8.3.1): FAIL `(missing: 
OPENSSL_CRYPTO_LIBRARY) (found suitable version "1.1.1c", minimum required is 
"1.0.2")`
   * test-r-rstudio-r-base-3.6-opensuse15 (gcc 7.4.1): FAIL `(missing: 
OPENSSL_CRYPTO_LIBRARY) (found suitable version "1.1.0i", minimum required is 
"1.0.2")`
   * test-r-rstudio-r-base-3.6-opensuse42 (gcc 4.8): SUCCESS *without S3* (gcc 
too old)
   
   Summary: 2 success with S3, 2 successfully built without S3 because it 
detected that gcc was too old, 2 fail because they are finding openssl but not 
crypto for some reason, 1 fails because it has openssl but the version provided 
by the package manager is too old, one fails with `undefined symbol: 
__cpu_model` (a gcc 5 bug?).
   
   Among the reasons this is frustrating: 
   
   * I've installed openssl but it says it can't find it (sometimes); I've 
installed it from the system package manager but it's still too old; these are 
both easily solvable by bundling openssl like we already do in the wheels, but 
I've gotten strong pushback on that.
   * I've tried to add logic in the R build script to detect whether we can 
build aws-sdk-cpp 
(https://github.com/apache/arrow/pull/8304/files#diff-3875fa5e75833c426b36487b25892bd8R370-R398)
 but that's apparently not enough to catch these deviant (yet common) cases. It 
would be nice to move this logic to cmake, an `ARROW_S3=PLEASE_IF_YOU_CAN` 
option: is there objection to that?
   * Given how fragile this is, I'm fearing the onslaught of bug reports after 
the release.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] dianaclarke commented on pull request #8312: ARROW-9941: [Python] Better string representation for extension types

2020-10-02 Thread GitBox


dianaclarke commented on pull request #8312:
URL: https://github.com/apache/arrow/pull/8312#issuecomment-702978986


   I have it returning `"extension>"` now 
from the C++ side.
   
   No idea if it's right though.
   
   Thanks for your patience :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jhorstmann commented on pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray

2020-10-02 Thread GitBox


jhorstmann commented on pull request #8331:
URL: https://github.com/apache/arrow/pull/8331#issuecomment-702964608


   Changes look good to me, but I'm wondering now how the existing code handles 
null values in the primitive arrays. Seems to me that would print the default 
value. For `Utf8` I think it works kinda implicitly because of how the offsets 
for null values are layed out.
   
   Would it make sense to fix this in the same PR or should I open a separate 
ticket?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on a change in pull request #8324: ARROW-10156: [Rust] Added github action to label PRs for rust.

2020-10-02 Thread GitBox


kou commented on a change in pull request #8324:
URL: https://github.com/apache/arrow/pull/8324#discussion_r499054289



##
File path: .github/workflows/dev_labeler.yml
##
@@ -0,0 +1,31 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+on:
+- pull_request_target
+
+jobs:
+  assign-rust-labels:
+if: github.repository == 'apache/arrow'
+runs-on: ubuntu-latest
+steps:
+- name: Assign github labels

Review comment:
   ```suggestion
   - name: Assign GitHub labels
   ```

##
File path: .github/workflows/dev_labeler.yml
##
@@ -0,0 +1,31 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+on:
+- pull_request_target
+
+jobs:
+  assign-rust-labels:
+if: github.repository == 'apache/arrow'

Review comment:
   We don't need this restriction.
   We have this restriction in `dev_cron.yml` because cron GitHub Actions logs 
in forks hide other logs.
   `pull_request_target` doesn't run this job periodically. So we don't need 
this restriction.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702959895


   Revision: 672e9f9ecf17c7e195f60c0a95e0434e5fe1ed53
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-585](https://github.com/ursa-labs/crossbow/branches/all?query=actions-585)
   
   |Task|Status|
   ||--|
   
|test-r-rhub-ubuntu-gcc-release|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-585-azure-test-r-rhub-ubuntu-gcc-release)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-585-azure-test-r-rhub-ubuntu-gcc-release)|
   
|test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-585-azure-test-r-rocker-r-base-latest)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-585-azure-test-r-rocker-r-base-latest)|
   
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-585-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-585-azure-test-r-rstudio-r-base-3.6-bionic)|
   
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-585-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-585-azure-test-r-rstudio-r-base-3.6-centos6)|
   
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-585-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-585-azure-test-r-rstudio-r-base-3.6-centos8)|
   
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-585-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-585-azure-test-r-rstudio-r-base-3.6-opensuse15)|
   
|test-r-rstudio-r-base-3.6-opensuse42|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-585-azure-test-r-rstudio-r-base-3.6-opensuse42)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-585-azure-test-r-rstudio-r-base-3.6-opensuse42)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702959015


   @github-actions crossbow submit test-r-r*



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] lidavidm commented on pull request #8325: ARROW-10105: [C++][Python][Java][FlightRPC] Allow disabling server validation

2020-10-02 Thread GitBox


lidavidm commented on pull request #8325:
URL: https://github.com/apache/arrow/pull/8325#issuecomment-702951063


   @kszucs - are we ok to bump the gRPC versions like this?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jduo commented on pull request #8325: ARROW-10105: [C++][Python][Java][FlightRPC] Allow disabling server validation

2020-10-02 Thread GitBox


jduo commented on pull request #8325:
URL: https://github.com/apache/arrow/pull/8325#issuecomment-702948147


   > Hey James, I'll review this later today hopefully, but just to answer a 
few questions
   > 
   > * Style in C++ can be done automatically with clang-format, see the 
[developer 
guide](https://arrow.apache.org/docs/developers/cpp/development.html).
   > * For Python similarly you can use the [flake8 
config](https://arrow.apache.org/docs/developers/python.html#coding-style)
   > * The versions are scattered but as a start see thirdparty.txt: 
https://github.com/apache/arrow/blob/master/cpp/thirdparty/versions.txt though 
note you may also have to update CI scripts, Homebrew/Conda packages, etc.
   > * I'd rather avoid very recent experimental features as that will create a 
lot of churn for us as gRPC upgrades frequently. It may be worth considering in 
a few versions if it looks like the API settles down, since I agree it would be 
nice to consolidate the options + it does bring the ability to do certificate 
rotation.
   
   Hi @lidavidm .
   
   - I have made the Python edits now and am running them through CI while I 
get my environment working.
   
   - Unit tests have been added for all three languages.
   
   - I updated basically every file every file I could find that referenced a 
C++ gRPC version.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702946973


   Revision: 66fa1607e4922cb98bed417fa947d5a0b0c7e949
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-584](https://github.com/ursa-labs/crossbow/branches/all?query=actions-584)
   
   |Task|Status|
   ||--|
   
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-584-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-584-azure-test-r-rstudio-r-base-3.6-opensuse15)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702946137


   @github-actions crossbow submit test-r-rstudio-r-base-3.6-opensuse15



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702943521


   Revision: 66fa1607e4922cb98bed417fa947d5a0b0c7e949
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-583](https://github.com/ursa-labs/crossbow/branches/all?query=actions-583)
   
   |Task|Status|
   ||--|
   
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-583-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-583-azure-test-r-rstudio-r-base-3.6-bionic)|
   
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-583-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-583-azure-test-r-rstudio-r-base-3.6-centos8)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702942796


   @github-actions crossbow submit test-r-rstudio-r-base-3.6-bionic 
test-r-rstudio-r-base-3.6-centos8



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8331:
URL: https://github.com/apache/arrow/pull/8331#issuecomment-702941984


   https://issues.apache.org/jira/browse/ARROW-10162



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on a change in pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray

2020-10-02 Thread GitBox


alamb commented on a change in pull request #8331:
URL: https://github.com/apache/arrow/pull/8331#discussion_r499031446



##
File path: rust/arrow/src/util/pretty.rs
##
@@ -60,7 +64,7 @@ fn create_table(results: &[RecordBatch]) -> Result {
 let mut cells = Vec::new();
 for col in 0..batch.num_columns() {
 let column = batch.column(col);
-cells.push(Cell::new(_value_to_string(column.clone(), 
row)?));

Review comment:
   There is no reason / need to clone to column when printing each value





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on a change in pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray

2020-10-02 Thread GitBox


alamb commented on a change in pull request #8331:
URL: https://github.com/apache/arrow/pull/8331#discussion_r499030937



##
File path: rust/arrow/src/util/pretty.rs
##
@@ -80,8 +84,8 @@ macro_rules! make_string {
 }};
 }
 
-/// Get the value at the given row in an array as a string
-fn array_value_to_string(column: array::ArrayRef, row: usize) -> 
Result {
+/// Get the value at the given row in an array as a String
+pub fn array_value_to_string(column: ::ArrayRef, row: usize) -> 
Result {

Review comment:
   I made this `pub` so I could call it from datafusion/src/test/sql.rs 
which currently has a copy/pasted version of pretty printing in `array_str`: 
https://github.com/apache/arrow/blob/master/rust/datafusion/tests/sql.rs#L646





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on a change in pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray

2020-10-02 Thread GitBox


alamb commented on a change in pull request #8331:
URL: https://github.com/apache/arrow/pull/8331#discussion_r499028581



##
File path: rust/arrow/src/util/pretty.rs
##
@@ -126,15 +130,56 @@ fn array_value_to_string(column: array::ArrayRef, row: 
usize) -> Result
 DataType::Time64(unit) if *unit == TimeUnit::Nanosecond => {
 make_string!(array::Time64NanosecondArray, column, row)
 }
+DataType::Dictionary(index_type, _value_type) => match **index_type {
+DataType::Int8 => dict_array_value_to_string::(column, 
row),
+DataType::Int16 => dict_array_value_to_string::(column, 
row),
+DataType::Int32 => dict_array_value_to_string::(column, 
row),
+DataType::Int64 => dict_array_value_to_string::(column, 
row),
+DataType::UInt8 => dict_array_value_to_string::(column, 
row),
+DataType::UInt16 => 
dict_array_value_to_string::(column, row),
+DataType::UInt32 => 
dict_array_value_to_string::(column, row),
+DataType::UInt64 => 
dict_array_value_to_string::(column, row),
+_ => Err(ArrowError::InvalidArgumentError(format!(
+"Unsupported index type {:?} type for {:?} in repl.",

Review comment:
   I don't really grok why the error messages refer to `repl` in this file, 
but I have carried forward the tradition with this PR





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb opened a new pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray

2020-10-02 Thread GitBox


alamb opened a new pull request #8331:
URL: https://github.com/apache/arrow/pull/8331


   This adds some logic for handling dictionary arrays when pretty printing 
batches. Part of a larger set of work I am working on for better 
DictionaryArray handling: https://issues.apache.org/jira/browse/ARROW-10159



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702935169


   Revision: 30b888d92c12be28a956ea6042ad892706fdfb69
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-582](https://github.com/ursa-labs/crossbow/branches/all?query=actions-582)
   
   |Task|Status|
   ||--|
   
|test-r-rhub-ubuntu-gcc-release|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-582-azure-test-r-rhub-ubuntu-gcc-release)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-582-azure-test-r-rhub-ubuntu-gcc-release)|
   
|test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-582-azure-test-r-rocker-r-base-latest)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-582-azure-test-r-rocker-r-base-latest)|
   
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-582-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-582-azure-test-r-rstudio-r-base-3.6-bionic)|
   
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-582-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-582-azure-test-r-rstudio-r-base-3.6-centos6)|
   
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-582-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-582-azure-test-r-rstudio-r-base-3.6-centos8)|
   
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-582-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-582-azure-test-r-rstudio-r-base-3.6-opensuse15)|
   
|test-r-rstudio-r-base-3.6-opensuse42|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-582-azure-test-r-rstudio-r-base-3.6-opensuse42)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-582-azure-test-r-rstudio-r-base-3.6-opensuse42)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] dianaclarke commented on pull request #8312: ARROW-9941: [Python] Better string representation for extension types

2020-10-02 Thread GitBox


dianaclarke commented on pull request #8312:
URL: https://github.com/apache/arrow/pull/8312#issuecomment-702934789


   I had hoped I could just use something like the following, but it's 
segfaulting. I think I need to go back to first principals and do a few C++ 
tutorials ;)
   
   `internal::PyObject_StdStringRepr(type_instance_.obj());`



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702934485


   @github-actions crossbow submit test-r-r*



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702928784


   Revision: f63aba0e66b74a6777ae8e755c91ba80cd35e1e8
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-581](https://github.com/ursa-labs/crossbow/branches/all?query=actions-581)
   
   |Task|Status|
   ||--|
   
|test-r-rhub-ubuntu-gcc-release|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-581-azure-test-r-rhub-ubuntu-gcc-release)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-581-azure-test-r-rhub-ubuntu-gcc-release)|
   
|test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-581-azure-test-r-rocker-r-base-latest)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-581-azure-test-r-rocker-r-base-latest)|
   
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-581-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-581-azure-test-r-rstudio-r-base-3.6-bionic)|
   
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-581-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-581-azure-test-r-rstudio-r-base-3.6-centos6)|
   
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-581-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-581-azure-test-r-rstudio-r-base-3.6-centos8)|
   
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-581-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-581-azure-test-r-rstudio-r-base-3.6-opensuse15)|
   
|test-r-rstudio-r-base-3.6-opensuse42|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-581-azure-test-r-rstudio-r-base-3.6-opensuse42)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-581-azure-test-r-rstudio-r-base-3.6-opensuse42)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702928024


   @github-actions crossbow submit test-r-r*



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] carols10cents commented on a change in pull request #8330: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes

2020-10-02 Thread GitBox


carols10cents commented on a change in pull request #8330:
URL: https://github.com/apache/arrow/pull/8330#discussion_r499015678



##
File path: rust/parquet/src/arrow/arrow_writer.rs
##
@@ -688,4 +688,413 @@ mod tests {
 writer.write().unwrap();
 writer.close().unwrap();
 }
+
+const SMALL_SIZE: usize = 100;
+
+fn roundtrip(filename: , expected_batch: RecordBatch) {
+let file = get_temp_file(filename, &[]);
+
+let mut writer = ArrowWriter::try_new(
+file.try_clone().unwrap(),
+expected_batch.schema(),
+None,
+)
+.unwrap();
+writer.write(_batch).unwrap();
+writer.close().unwrap();
+
+let reader = SerializedFileReader::new(file).unwrap();
+let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(reader));
+let mut record_batch_reader = 
arrow_reader.get_record_reader(1024).unwrap();
+
+let actual_batch = record_batch_reader.next_batch().unwrap().unwrap();
+
+assert_eq!(expected_batch.schema(), actual_batch.schema());
+assert_eq!(expected_batch.num_columns(), actual_batch.num_columns());
+assert_eq!(expected_batch.num_rows(), actual_batch.num_rows());
+for i in 0..expected_batch.num_columns() {
+let expected_data = expected_batch.column(i).data();
+let actual_data = actual_batch.column(i).data();
+
+assert_eq!(expected_data.data_type(), actual_data.data_type());
+assert_eq!(expected_data.len(), actual_data.len());
+assert_eq!(expected_data.null_count(), actual_data.null_count());
+assert_eq!(expected_data.offset(), actual_data.offset());
+assert_eq!(expected_data.buffers(), actual_data.buffers());
+assert_eq!(expected_data.child_data(), actual_data.child_data());
+assert_eq!(expected_data.null_bitmap(), actual_data.null_bitmap());
+}
+}
+
+fn one_column_roundtrip(filename: , values: ArrayRef, nullable: bool) {
+let schema = Schema::new(vec![Field::new(
+"col",
+values.data_type().clone(),
+nullable,
+)]);
+let expected_batch =
+RecordBatch::try_new(Arc::new(schema), vec![values]).unwrap();
+
+roundtrip(filename, expected_batch);
+}
+
+fn values_required(iter: I, filename: )
+where
+A: From> + Array + 'static,
+I: IntoIterator,
+{
+let raw_values: Vec<_> = iter.into_iter().collect();
+let values = Arc::new(A::from(raw_values));
+one_column_roundtrip(filename, values, false);
+}
+
+fn values_optional(iter: I, filename: )
+where
+A: From>> + Array + 'static,
+I: IntoIterator,
+{
+let optional_raw_values: Vec<_> = iter
+.into_iter()
+.enumerate()
+.map(|(i, v)| if i % 2 == 0 { None } else { Some(v) })
+.collect();
+let optional_values = Arc::new(A::from(optional_raw_values));
+one_column_roundtrip(filename, optional_values, true);
+}
+
+fn required_and_optional(iter: I, filename: )
+where
+A: From> + From>> + Array + 'static,
+I: IntoIterator + Clone,
+{
+values_required::(iter.clone(), filename);
+values_optional::(iter, filename);
+}
+
+#[test]
+#[should_panic(expected = "Null arrays not supported")]
+fn null_single_column() {
+let values = Arc::new(NullArray::new(SMALL_SIZE));
+one_column_roundtrip("null_single_column", values.clone(), true);
+one_column_roundtrip("null_single_column", values, false);
+}
+
+#[test]
+#[should_panic(
+expected = "Attempting to write an Arrow type that is not yet 
implemented"
+)]
+fn bool_single_column() {
+required_and_optional::(
+[true, false].iter().cycle().copied().take(SMALL_SIZE),
+"bool_single_column",
+);
+}
+
+#[test]
+fn i8_single_column() {
+required_and_optional::(0..SMALL_SIZE as i8, 
"i8_single_column");
+}
+
+#[test]
+fn i16_single_column() {
+required_and_optional::(0..SMALL_SIZE as i16, 
"i16_single_column");
+}
+
+#[test]
+fn i32_single_column() {
+required_and_optional::(0..SMALL_SIZE as i32, 
"i32_single_column");
+}
+
+#[test]
+fn i64_single_column() {
+required_and_optional::(0..SMALL_SIZE as i64, 
"i64_single_column");
+}
+
+#[test]
+fn u8_single_column() {
+required_and_optional::(0..SMALL_SIZE as u8, 
"u8_single_column");
+}
+
+#[test]
+fn u16_single_column() {
+required_and_optional::(
+0..SMALL_SIZE as u16,
+"u16_single_column",
+);
+}
+
+#[test]
+fn u32_single_column() {
+required_and_optional::(
+0..SMALL_SIZE as u32,
+"u32_single_column",
+);
+  

[GitHub] [arrow] bkietz commented on a change in pull request #8311: ARROW-10008: [C++][Dataset] Fix filtering/row group statistics of dict columns

2020-10-02 Thread GitBox


bkietz commented on a change in pull request #8311:
URL: https://github.com/apache/arrow/pull/8311#discussion_r498985112



##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -171,6 +170,15 @@ static std::shared_ptr 
ColumnChunkStatisticsAsStructScalar(
 return nullptr;
   }
 
+  auto maybe_min = min->CastTo(field->type());
+  auto maybe_max = max->CastTo(field->type());

Review comment:
   (IE, it only changes behavior in cases where the physical type wasn't 
appropriate)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] dianaclarke commented on a change in pull request #8312: ARROW-9941: [Python] Better string representation for extension types

2020-10-02 Thread GitBox


dianaclarke commented on a change in pull request #8312:
URL: https://github.com/apache/arrow/pull/8312#discussion_r498972124



##
File path: python/pyarrow/tests/test_extension_type.py
##
@@ -95,6 +95,18 @@ def test_ext_type_basics():
 assert ty.extension_name == "arrow.py_extension_type"
 
 
+def test_ext_type_str():
+ty = IntegerType()
+assert str(ty) == "DataType(int64)"
+cpp_expected = "extension>"

Review comment:
   TODO:  this should be: `extension>`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] paddyhoran closed pull request #8327: ARROW-10160: [Rust] Improve DictionaryType documentation (clarify which type is which)

2020-10-02 Thread GitBox


paddyhoran closed pull request #8327:
URL: https://github.com/apache/arrow/pull/8327


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8312: ARROW-9941: [Python] Better string representation for extension types

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8312:
URL: https://github.com/apache/arrow/pull/8312#issuecomment-702875396


   To be clear, the 2 Windows CI failures are unrelated.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] paddyhoran closed pull request #8326: ARROW-10157: [Rust] Add an example to the take kernel

2020-10-02 Thread GitBox


paddyhoran closed pull request #8326:
URL: https://github.com/apache/arrow/pull/8326


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702866357


   Revision: 96283bc0c9d134680783e3df00926d0b972946ba
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-580](https://github.com/ursa-labs/crossbow/branches/all?query=actions-580)
   
   |Task|Status|
   ||--|
   
|test-r-rhub-ubuntu-gcc-release|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-580-azure-test-r-rhub-ubuntu-gcc-release)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-580-azure-test-r-rhub-ubuntu-gcc-release)|
   
|test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-580-azure-test-r-rocker-r-base-latest)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-580-azure-test-r-rocker-r-base-latest)|
   
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-580-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-580-azure-test-r-rstudio-r-base-3.6-bionic)|
   
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-580-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-580-azure-test-r-rstudio-r-base-3.6-centos6)|
   
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-580-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-580-azure-test-r-rstudio-r-base-3.6-centos8)|
   
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-580-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-580-azure-test-r-rstudio-r-base-3.6-opensuse15)|
   
|test-r-rstudio-r-base-3.6-opensuse42|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-580-azure-test-r-rstudio-r-base-3.6-opensuse42)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-580-azure-test-r-rstudio-r-base-3.6-opensuse42)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702865456


   @github-actions crossbow submit test-r-r*



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on pull request #8317: ARROW-10134: [Python][Dataset] Add ParquetFileFragment.num_row_groups

2020-10-02 Thread GitBox


bkietz commented on pull request #8317:
URL: https://github.com/apache/arrow/pull/8317#issuecomment-702862461


   @jorisvandenbossche just confirming: you want `f.num_row_groups` to 
potentially perform IO?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #8311: ARROW-10008: [C++][Dataset] Fix filtering/row group statistics of dict columns

2020-10-02 Thread GitBox


bkietz commented on a change in pull request #8311:
URL: https://github.com/apache/arrow/pull/8311#discussion_r498953710



##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -171,6 +170,15 @@ static std::shared_ptr 
ColumnChunkStatisticsAsStructScalar(
 return nullptr;
   }
 
+  auto maybe_min = min->CastTo(field->type());
+  auto maybe_max = max->CastTo(field->type());

Review comment:
   `StatisticsAsScalars` returns scalars whose types are the correct 
*physical type*, so even if the column was `dictionary(string)` min and max 
would be just `string` before this cast





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702857488


   Revision: 96283bc0c9d134680783e3df00926d0b972946ba
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-579](https://github.com/ursa-labs/crossbow/branches/all?query=actions-579)
   
   |Task|Status|
   ||--|
   
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-579-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-579-azure-test-r-rstudio-r-base-3.6-bionic)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702855212


   @github-actions crossbow submit test-r-rstudio-r-base-3.6-bionic



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on issue #8329: pyarrow=1.0.1 is not even usable

2020-10-02 Thread GitBox


wesm commented on issue #8329:
URL: https://github.com/apache/arrow/issues/8329#issuecomment-702850502


   There is no deception. Some individuals work on Apache Arrow as part of 
their job responsibilities, but the work that is contributed to the project is 
_volunteered_ by these organizations. They have their own priorities; they do 
not work for you. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on pull request #8330: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes

2020-10-02 Thread GitBox


alamb commented on pull request #8330:
URL: https://github.com/apache/arrow/pull/8330#issuecomment-702838970


   > I am comparing the RecordBatch's column's data before and after the round 
trip directly; I'm not sure that this is appropriate or not because for some 
data types, the null_bitmap isn't matching and I'm not sure if it's supposed to 
or not.
   
   I think the null bitmap should definitely match after the round trip (as it 
signals which rows are null, which should be preserved).
   
   I am not sure the actual data arrays have to be the same -- what needs to be 
the same is the data rows where the null_bit is 0 (aka we should probably be 
checking only for non-null rows)
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] impredicative edited a comment on issue #8329: pyarrow=1.0.1 is not even usable

2020-10-02 Thread GitBox


impredicative edited a comment on issue #8329:
URL: https://github.com/apache/arrow/issues/8329#issuecomment-702823291


   > Apache Arrow is supported by volunteers
   
   This information is intended to deceive. The notable fact is that [Ursa Lab 
has sponsors](https://ursalabs.org/blog/2020-new-sponsors/) and is [industry 
funded](https://ursalabs.org/about/). Perhaps the sponsors would like to learn 
that pyarrow is not even importable.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] impredicative edited a comment on issue #8329: pyarrow=1.0.1 is not even usable

2020-10-02 Thread GitBox


impredicative edited a comment on issue #8329:
URL: https://github.com/apache/arrow/issues/8329#issuecomment-702823291


   > Apache Arrow is supported by volunteers
   
   This information is intended to deceive. The notable fact is that [Ursa Lab 
has sponsors](https://ursalabs.org/blog/2020-new-sponsors/) and is [industry 
funded](https://ursalabs.org/about/).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] impredicative commented on issue #8329: pyarrow=1.0.1 is not even usable

2020-10-02 Thread GitBox


impredicative commented on issue #8329:
URL: https://github.com/apache/arrow/issues/8329#issuecomment-702823291


   > Apache Arrow is supported by volunteers
   
   This information is intended to deceive. The fact is that [Ursa Lab has 
sponsors](https://ursalabs.org/blog/2020-new-sponsors/) and is [industry 
funded](https://ursalabs.org/about/).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on pull request #8317: ARROW-10134: [Python][Dataset] Add ParquetFileFragment.num_row_groups

2020-10-02 Thread GitBox


jorisvandenbossche commented on pull request #8317:
URL: https://github.com/apache/arrow/pull/8317#issuecomment-702823129


   I think the main reason such a property would be interesting for dask's use 
case is to get the number of row groups in a case all statistics are not 
available / not yet parsed. So the way that this PR returns `None` in that case 
is not super useful, I think.  
   I think ideally if the number of row groups is not know (the row_groups are 
not set), it would retrieve this information from the FileMetaData.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8311: ARROW-10008: [C++][Dataset] Fix filtering/row group statistics of dict columns

2020-10-02 Thread GitBox


jorisvandenbossche commented on a change in pull request #8311:
URL: https://github.com/apache/arrow/pull/8311#discussion_r498913314



##
File path: cpp/src/arrow/dataset/file_parquet.cc
##
@@ -171,6 +170,15 @@ static std::shared_ptr 
ColumnChunkStatisticsAsStructScalar(
 return nullptr;
   }
 
+  auto maybe_min = min->CastTo(field->type());
+  auto maybe_max = max->CastTo(field->type());

Review comment:
   Does this change behaviour? For a dictionary with string values, is 
field->type() string or dictionary?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] carols10cents opened a new pull request #8330: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes

2020-10-02 Thread GitBox


carols10cents opened a new pull request #8330:
URL: https://github.com/apache/arrow/pull/8330


   Note that this PR goes to the rust-parquet-arrow-writer branch, not master.
   
   Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc
   
   These perform round-trip Arrow -> Parquet -> Arrow of a single RecordBatch 
with a single column of values of each the supported data types and some of the 
unsupported ones.
   
   Tests that currently fail are either marked with `#[should_panic]` (if the
   reason they fail is because of a panic) or `#[ignore]` (if the reason
   they fail is because the values don't match).
   
   I am comparing the RecordBatch's column's data before and after the round 
trip directly; I'm not sure that this is appropriate or not because for some 
data types, the `null_bitmap` isn't matching and I'm not sure if it's supposed 
to or not.
   
   So I would love advice on that front, and I would love to know if these 
tests are useful or not!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] carols10cents commented on pull request #8330: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes

2020-10-02 Thread GitBox


carols10cents commented on pull request #8330:
URL: https://github.com/apache/arrow/pull/8330#issuecomment-702815429


   cc @nevi-me 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702810329


   Revision: 841e2f0e3e04c903875bb4196d11a01ac3aacaac
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-578](https://github.com/ursa-labs/crossbow/branches/all?query=actions-578)
   
   |Task|Status|
   ||--|
   
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-578-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-578-azure-test-r-rstudio-r-base-3.6-bionic)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8324: ARROW-10156: [Rust] Added github action to label PRs for rust.

2020-10-02 Thread GitBox


jorgecarleitao edited a comment on pull request #8324:
URL: https://github.com/apache/arrow/pull/8324#issuecomment-702805572


   @kou, Thanks for the suggestion. I have moved the content to another place 
and added the `on` accordingly.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on pull request #8313: ARROW-4927: [Rust] Update top level README to describe current functionality

2020-10-02 Thread GitBox


jorgecarleitao commented on pull request #8313:
URL: https://github.com/apache/arrow/pull/8313#issuecomment-702809460


   @andygrove , sure! I am stuck in the last step of 
https://gitbox.apache.org/setup/
   
   > User not a member of the ASF GitHub organisation. Please make sure you are 
a part of the ASF Organisation on GitHub and have 2FA enabled. Visit 
id.apache.org and set your GitHub ID to be invited to the org.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson closed issue #8329: pyarrow=1.0.1 is not even usable

2020-10-02 Thread GitBox


nealrichardson closed issue #8329:
URL: https://github.com/apache/arrow/issues/8329


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on issue #8329: pyarrow=1.0.1 is not even usable

2020-10-02 Thread GitBox


nealrichardson commented on issue #8329:
URL: https://github.com/apache/arrow/issues/8329#issuecomment-702808106


   Thanks for your report. Apache Arrow is supported by volunteers, so while 
you wait patiently for an answer to your question, please review our [code of 
conduct](https://www.apache.org/foundation/policies/conduct.html) for 
participating in the community.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorgecarleitao commented on pull request #8324: ARROW-10156: [Rust] Added github action to label PRs for rust.

2020-10-02 Thread GitBox


jorgecarleitao commented on pull request #8324:
URL: https://github.com/apache/arrow/pull/8324#issuecomment-702805572


   @kou, Thanks for the suggestion. I have move the content to another place 
and added the `on` accordingly.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp

2020-10-02 Thread GitBox


nealrichardson commented on pull request #8304:
URL: https://github.com/apache/arrow/pull/8304#issuecomment-702805412


   @github-actions crossbow submit test-r-rstudio-r-base-3.6-bionic



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] andygrove commented on pull request #8313: ARROW-4927: [Rust] Update top level README to describe current functionality

2020-10-02 Thread GitBox


andygrove commented on pull request #8313:
URL: https://github.com/apache/arrow/pull/8313#issuecomment-702805331


   @jorgecarleitao Now that you are a committer do you want to merge this one 
to get to know the process? I usually make sure I have the latest from the 
master branch and then run `./dev/merge_pr.py`. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] impredicative opened a new issue #8329: pyarrow=1.0.1 is not even usable

2020-10-02 Thread GitBox


impredicative opened a new issue #8329:
URL: https://github.com/apache/arrow/issues/8329


   Is someone going to comment on 
https://issues.apache.org/jira/browse/ARROW-10152 ? As it stands, the software 
is not even usable, period.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8328: ARROW-10161: [Rust] [DataFusion] DRYed code in tests

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8328:
URL: https://github.com/apache/arrow/pull/8328#issuecomment-702800123


   https://issues.apache.org/jira/browse/ARROW-10161



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #8316: ARROW-10149: [Rust] Improved support for externally owned memory regions

2020-10-02 Thread GitBox


pitrou commented on a change in pull request #8316:
URL: https://github.com/apache/arrow/pull/8316#discussion_r498885291



##
File path: rust/arrow/src/bytes.rs
##
@@ -0,0 +1,185 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains an implementation of a contiguous immutable memory 
region that knows
+//! how to de-allocate itself, [`Bytes`].
+//! Note that this is a low-level functionality of this crate, and is only 
required to be used
+//! when implementing FFI.
+
+use core::slice;
+use std::{fmt::Debug, fmt::Formatter, sync::Arc};
+
+use crate::memory;
+
+/// function resposible for de-allocating `Bytes`.
+pub type DropFn = Arc;
+
+/// Mode of deallocating memory regions
+pub enum Deallocation {
+/// Native deallocation, using Rust deallocator with Arrow-specific memory 
aligment
+Native(usize),
+/// Foreign deallocation, using some other form of memory deallocation
+Foreign(DropFn),
+}
+
+impl Debug for Deallocation {
+fn fmt(, f:  Formatter) -> std::fmt::Result {
+match self {
+Deallocation::Native(capacity) => {
+write!(f, "Deallocation::Native {{ capacity: {} }}", capacity)
+}
+Deallocation::Foreign(_) => {
+write!(f, "Deallocation::Foreign {{ capacity: unknown }}")
+}
+}
+}
+}
+
+/// A continuous, fixed-size, immutable memory region that knows how to 
de-allocate itself.
+/// This structs' API is inspired by the `bytes::Bytes`, but it is not limited 
to using rust's
+/// global allocator nor u8 aligmnent.
+///
+/// In the most common case, this buffer is allocated using 
[`allocate_aligned`](memory::allocate_aligned)
+/// and deallocated accordingly [`free_aligned`](memory::free_aligned).
+/// When the region is allocated by an foreign allocator, 
[Deallocation::Foreign], this calls the
+/// foreign deallocator to deallocate the region when it is no longer needed.
+pub struct Bytes {
+/// The raw pointer to be begining of the region
+ptr: *const u8,
+
+/// The number of bytes visible to this region. This is always smaller 
than its capacity (when avaliable).
+len: usize,
+
+/// how to deallocate this region
+deallocation: Deallocation,
+}
+
+impl Bytes {
+/// Takes ownership of an allocated memory region,
+///
+/// # Arguments
+///
+/// * `ptr` - Pointer to raw parts
+/// * `len` - Length of raw parts in **bytes**
+/// * `capacity` - Total allocated memory for the pointer `ptr`, in 
**bytes**
+///
+/// # Safety
+///
+/// This function is unsafe as there is no guarantee that the given 
pointer is valid for `len`
+/// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is 
guaranteed.
+pub unsafe fn new(ptr: *const u8, len: usize, deallocation: Deallocation) 
-> Bytes {
+Bytes {
+ptr,
+len,
+deallocation,
+}
+}
+
+#[inline]
+pub fn as_slice() -> &[u8] {
+unsafe { slice::from_raw_parts(self.ptr, self.len) }
+}
+
+#[inline]
+pub fn len() -> usize {
+self.len
+}
+
+#[inline]
+pub fn is_empty() -> bool {
+self.len == 0
+}
+
+#[inline]
+pub fn raw_data() -> *const u8 {
+self.ptr
+}
+
+#[inline]
+pub fn raw_data_mut( self) -> *mut u8 {
+self.ptr as *mut u8
+}
+
+pub fn capacity() -> usize {
+match self.deallocation {
+Deallocation::Native(capacity) => capacity,
+// we cannot determine this in general,
+// and thus we state that this is externally-owned memory
+Deallocation::Foreign(_) => 0,
+}
+}
+}
+
+impl Drop for Bytes {
+#[inline]
+fn drop( self) {
+match  {
+Deallocation::Native(capacity) => {
+if !self.ptr.is_null() {
+unsafe { memory::free_aligned(self.ptr as *mut u8, 
*capacity) };
+}
+}
+Deallocation::Foreign(drop) => {
+(drop.clone())(self);
+}
+}
+}
+}
+
+impl PartialEq for Bytes {
+fn eq(, other: ) -> bool {
+

[GitHub] [arrow] jorgecarleitao opened a new pull request #8328: ARROW-10161: [Rust] [DataFusion] DRYed code in tests

2020-10-02 Thread GitBox


jorgecarleitao opened a new pull request #8328:
URL: https://github.com/apache/arrow/pull/8328


   Just a small cleanup by moving common code to a macro and using it in the 
tests.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #8312: ARROW-9941: [Python] Better string representation for extension types

2020-10-02 Thread GitBox


pitrou commented on pull request #8312:
URL: https://github.com/apache/arrow/pull/8312#issuecomment-702788372


   > `assert pa.DataType.__str__(ty) == "arrow.py_extension_type"`
   
   The problem is that this doesn't distinguish between different extension 
types with the same storage type. For example, two extension types representing 
UUIDs and IPv6 addresses, respectively, would get the same string 
representation (because both would be backed by a 16-byte fixed-size binary).
   
   I would be more useful if the string representation looked like 
`extension>`, where `IntegerType` is the 
Python class name.
   
   Optionally this could be made configurable as well.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] andygrove commented on pull request #8324: ARROW-10156: [Rust] Added github action to label PRs for rust.

2020-10-02 Thread GitBox


andygrove commented on pull request #8324:
URL: https://github.com/apache/arrow/pull/8324#issuecomment-702773859


   @jorgecarleitao I will be very happy to relinquish my role as github label 
admin (as I'm sure the other Rust committers will too). Thanks for working on 
this.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #8305: ARROW-9782: [C++][Dataset] More configurable Dataset writing

2020-10-02 Thread GitBox


pitrou commented on a change in pull request #8305:
URL: https://github.com/apache/arrow/pull/8305#discussion_r498818275



##
File path: cpp/src/arrow/dataset/file_base.cc
##
@@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
   return MakeVectorIterator(std::move(fragments));
 }
 
-struct WriteTask {
-  Status Execute();
+Status FileWriter::Write(RecordBatchReader* batches) {
+  for (std::shared_ptr batch;;) {

Review comment:
   I have a preference for `while (true)` which is much more explicit. Just 
my 2 cents :-)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #8305: ARROW-9782: [C++][Dataset] More configurable Dataset writing

2020-10-02 Thread GitBox


bkietz commented on a change in pull request #8305:
URL: https://github.com/apache/arrow/pull/8305#discussion_r498813432



##
File path: cpp/src/arrow/dataset/file_base.cc
##
@@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
   return MakeVectorIterator(std::move(fragments));
 }
 
-struct WriteTask {
-  Status Execute();
+Status FileWriter::Write(RecordBatchReader* batches) {
+  for (std::shared_ptr batch;;) {

Review comment:
   I personally prefer `for(;;)` to `while(true)` but if it'll be clearer 
then I'll rewrite it

##
File path: cpp/src/arrow/dataset/file_base.cc
##
@@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
   return MakeVectorIterator(std::move(fragments));
 }
 
-struct WriteTask {
-  Status Execute();
+Status FileWriter::Write(RecordBatchReader* batches) {
+  for (std::shared_ptr batch;;) {
+RETURN_NOT_OK(batches->ReadNext());
+if (batch == nullptr) break;
+RETURN_NOT_OK(Write(batch));
+  }
+  return Status::OK();
+}
+
+struct NextBasenameGenerator {
+  static Result Make(const std::string& 
basename_template) {
+if (basename_template.find(fs::internal::kSep) != std::string::npos) {
+  return Status::Invalid("basename_template contained '/'");
+}
+size_t token_start = basename_template.find(token());
+if (token_start == std::string::npos) {
+  return Status::Invalid("basename_template did not contain '{i}'");
+}
+return NextBasenameGenerator{basename_template, 0, token_start,
+ token_start + token().size()};
+  }
 
-  /// The basename of files written by this WriteTask. Extensions
-  /// are derived from format
-  std::string basename;
+  static const std::string& token() {
+static const std::string token = "{i}";
+return token;
+  }
 
-  /// The partitioning with which paths will be generated
-  std::shared_ptr partitioning;
+  const std::string& template_;
+  size_t i_, token_start_, token_end_;
 
-  /// The format in which fragments will be written
-  std::shared_ptr format;
+  std::string operator()() {
+return template_.substr(0, token_start_) + std::to_string(i_++) +
+   template_.substr(token_end_);
+  }
+};
 
-  /// The FileSystem and base directory into which fragments will be written
-  std::shared_ptr filesystem;
-  std::string base_dir;
+using MutexedWriter = util::Mutexed>;
 
-  /// Batches to be written
-  std::shared_ptr batches;
+struct WriterSet {
+  WriterSet(NextBasenameGenerator next_basename,
+const FileSystemDatasetWriteOptions& write_options)
+  : next_basename_(std::move(next_basename)),
+base_dir_(fs::internal::EnsureTrailingSlash(write_options.base_dir)),
+write_options_(write_options) {}
 
-  /// An Expression already satisfied by every batch to be written
-  std::shared_ptr partition_expression;
-};
+  Result> Get(const Expression& 
partition_expression,
+ const std::shared_ptr& 
schema) {
+ARROW_ASSIGN_OR_RAISE(auto part_segments,
+  
write_options_.partitioning->Format(partition_expression));
+std::string dir = base_dir_ + part_segments;
 
-Status WriteTask::Execute() {
-  std::unordered_map path_to_batches;
-
-  // TODO(bkietz) these calls to Partition() should be scattered across a 
TaskGroup
-  for (auto maybe_batch : IteratorFromReader(batches)) {
-ARROW_ASSIGN_OR_RAISE(auto batch, maybe_batch);
-ARROW_ASSIGN_OR_RAISE(auto partitioned_batches, 
partitioning->Partition(batch));
-for (auto&& partitioned_batch : partitioned_batches) {
-  AndExpression expr(std::move(partitioned_batch.partition_expression),
- partition_expression);
-  ARROW_ASSIGN_OR_RAISE(std::string path, partitioning->Format(expr));
-  path = fs::internal::EnsureLeadingSlash(path);
-  path_to_batches[path].push_back(std::move(partitioned_batch.batch));
-}
-  }
+util::Mutex::Guard writer_lock;
+
+auto set_lock = mutex_.Lock();
 
-  for (auto&& path_batches : path_to_batches) {
-auto dir = base_dir + path_batches.first;
-RETURN_NOT_OK(filesystem->CreateDir(dir, /*recursive=*/true));
+auto writer =

Review comment:
   I'll add comments inline





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #8305: ARROW-9782: [C++][Dataset] More configurable Dataset writing

2020-10-02 Thread GitBox


bkietz commented on a change in pull request #8305:
URL: https://github.com/apache/arrow/pull/8305#discussion_r498812732



##
File path: cpp/src/arrow/dataset/file_base.cc
##
@@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
   return MakeVectorIterator(std::move(fragments));
 }
 
-struct WriteTask {
-  Status Execute();
+Status FileWriter::Write(RecordBatchReader* batches) {
+  for (std::shared_ptr batch;;) {
+RETURN_NOT_OK(batches->ReadNext());
+if (batch == nullptr) break;
+RETURN_NOT_OK(Write(batch));
+  }
+  return Status::OK();
+}
+
+struct NextBasenameGenerator {
+  static Result Make(const std::string& 
basename_template) {
+if (basename_template.find(fs::internal::kSep) != std::string::npos) {
+  return Status::Invalid("basename_template contained '/'");
+}
+size_t token_start = basename_template.find(token());
+if (token_start == std::string::npos) {
+  return Status::Invalid("basename_template did not contain '{i}'");
+}
+return NextBasenameGenerator{basename_template, 0, token_start,
+ token_start + token().size()};
+  }
 
-  /// The basename of files written by this WriteTask. Extensions
-  /// are derived from format
-  std::string basename;
+  static const std::string& token() {
+static const std::string token = "{i}";
+return token;
+  }
 
-  /// The partitioning with which paths will be generated
-  std::shared_ptr partitioning;
+  const std::string& template_;
+  size_t i_, token_start_, token_end_;
 
-  /// The format in which fragments will be written
-  std::shared_ptr format;
+  std::string operator()() {
+return template_.substr(0, token_start_) + std::to_string(i_++) +
+   template_.substr(token_end_);
+  }
+};
 
-  /// The FileSystem and base directory into which fragments will be written
-  std::shared_ptr filesystem;
-  std::string base_dir;
+using MutexedWriter = util::Mutexed>;
 
-  /// Batches to be written
-  std::shared_ptr batches;
+struct WriterSet {
+  WriterSet(NextBasenameGenerator next_basename,
+const FileSystemDatasetWriteOptions& write_options)
+  : next_basename_(std::move(next_basename)),
+base_dir_(fs::internal::EnsureTrailingSlash(write_options.base_dir)),
+write_options_(write_options) {}
 
-  /// An Expression already satisfied by every batch to be written
-  std::shared_ptr partition_expression;
-};
+  Result> Get(const Expression& 
partition_expression,
+ const std::shared_ptr& 
schema) {
+ARROW_ASSIGN_OR_RAISE(auto part_segments,
+  
write_options_.partitioning->Format(partition_expression));
+std::string dir = base_dir_ + part_segments;
 
-Status WriteTask::Execute() {
-  std::unordered_map path_to_batches;
-
-  // TODO(bkietz) these calls to Partition() should be scattered across a 
TaskGroup
-  for (auto maybe_batch : IteratorFromReader(batches)) {
-ARROW_ASSIGN_OR_RAISE(auto batch, maybe_batch);
-ARROW_ASSIGN_OR_RAISE(auto partitioned_batches, 
partitioning->Partition(batch));
-for (auto&& partitioned_batch : partitioned_batches) {
-  AndExpression expr(std::move(partitioned_batch.partition_expression),
- partition_expression);
-  ARROW_ASSIGN_OR_RAISE(std::string path, partitioning->Format(expr));
-  path = fs::internal::EnsureLeadingSlash(path);
-  path_to_batches[path].push_back(std::move(partitioned_batch.batch));
-}
-  }
+util::Mutex::Guard writer_lock;
+
+auto set_lock = mutex_.Lock();
 
-  for (auto&& path_batches : path_to_batches) {
-auto dir = base_dir + path_batches.first;
-RETURN_NOT_OK(filesystem->CreateDir(dir, /*recursive=*/true));
+auto writer =
+internal::GetOrInsertGenerated(_to_writer_, dir, [&](const 
std::string&) {
+  auto writer = std::make_shared();
+  writer_lock = writer->Lock();
+  return writer;
+})->second;
 
-auto path = fs::internal::ConcatAbstractPath(dir, basename);
-ARROW_ASSIGN_OR_RAISE(auto destination, 
filesystem->OpenOutputStream(path));
+if (writer_lock) {
+  // NB: next_basename_() must be invoked with the set_lock held
+  auto path = fs::internal::ConcatAbstractPath(dir, next_basename_());
+  set_lock.Unlock();
 
-DCHECK(!path_batches.second.empty());
-ARROW_ASSIGN_OR_RAISE(auto reader,
-  
RecordBatchReader::Make(std::move(path_batches.second)));
-RETURN_NOT_OK(format->WriteFragment(reader.get(), destination.get()));
+  RETURN_NOT_OK(write_options_.filesystem->CreateDir(dir));
+
+  ARROW_ASSIGN_OR_RAISE(auto destination,
+write_options_.filesystem->OpenOutputStream(path));
+
+  ARROW_ASSIGN_OR_RAISE(**writer, write_options_.format()->MakeWriter(
+  std::move(destination), schema,
+  

[GitHub] [arrow] lidavidm commented on pull request #8325: ARROW-10105: [C++][Python][Java][FlightRPC] Allow disabling server validation

2020-10-02 Thread GitBox


lidavidm commented on pull request #8325:
URL: https://github.com/apache/arrow/pull/8325#issuecomment-702709693


   Actually, it seems the relevant changes have been there since early this 
year. It may be OK to switch fully.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] lidavidm commented on pull request #8325: ARROW-10105: [C++][Python][Java][FlightRPC] Allow disabling server validation

2020-10-02 Thread GitBox


lidavidm commented on pull request #8325:
URL: https://github.com/apache/arrow/pull/8325#issuecomment-702702219


   Hey James, I'll review this later today hopefully, but just to answer a few 
questions
   - Style in C++ can be done automatically with clang-format, see the 
[developer 
guide](https://arrow.apache.org/docs/developers/cpp/development.html). 
   - For Python similarly you can use the [flake8 
config](https://arrow.apache.org/docs/developers/python.html#coding-style)
   - The versions are scattered but as a start see thirdparty.txt: 
https://github.com/apache/arrow/blob/master/cpp/thirdparty/versions.txt though 
note you may also have to update CI scripts, Homebrew/Conda packages, etc.
   - I'd rather avoid very recent experimental features as that will create a 
lot of churn for us as gRPC upgrades frequently. It may be worth considering in 
a few versions if it looks like the API settles down, since I agree it would be 
nice to consolidate the options + it does bring the ability to do certificate 
rotation.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8327: ARROW-10160: [Rust] Improve DictionaryType documentation (clarify which type is which)

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8327:
URL: https://github.com/apache/arrow/pull/8327#issuecomment-702699403


   https://issues.apache.org/jira/browse/ARROW-10160



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #8305: ARROW-9782: [C++][Dataset] More configurable Dataset writing

2020-10-02 Thread GitBox


fsaintjacques commented on a change in pull request #8305:
URL: https://github.com/apache/arrow/pull/8305#discussion_r498758123



##
File path: cpp/src/arrow/dataset/file_base.cc
##
@@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
   return MakeVectorIterator(std::move(fragments));
 }
 
-struct WriteTask {
-  Status Execute();
+Status FileWriter::Write(RecordBatchReader* batches) {
+  for (std::shared_ptr batch;;) {
+RETURN_NOT_OK(batches->ReadNext());
+if (batch == nullptr) break;
+RETURN_NOT_OK(Write(batch));
+  }
+  return Status::OK();
+}
+
+struct NextBasenameGenerator {
+  static Result Make(const std::string& 
basename_template) {
+if (basename_template.find(fs::internal::kSep) != std::string::npos) {
+  return Status::Invalid("basename_template contained '/'");
+}
+size_t token_start = basename_template.find(token());
+if (token_start == std::string::npos) {
+  return Status::Invalid("basename_template did not contain '{i}'");
+}
+return NextBasenameGenerator{basename_template, 0, token_start,
+ token_start + token().size()};
+  }
 
-  /// The basename of files written by this WriteTask. Extensions
-  /// are derived from format
-  std::string basename;
+  static const std::string& token() {
+static const std::string token = "{i}";
+return token;
+  }
 
-  /// The partitioning with which paths will be generated
-  std::shared_ptr partitioning;
+  const std::string& template_;
+  size_t i_, token_start_, token_end_;
 
-  /// The format in which fragments will be written
-  std::shared_ptr format;
+  std::string operator()() {
+return template_.substr(0, token_start_) + std::to_string(i_++) +
+   template_.substr(token_end_);
+  }
+};
 
-  /// The FileSystem and base directory into which fragments will be written
-  std::shared_ptr filesystem;
-  std::string base_dir;
+using MutexedWriter = util::Mutexed>;
 
-  /// Batches to be written
-  std::shared_ptr batches;
+struct WriterSet {
+  WriterSet(NextBasenameGenerator next_basename,
+const FileSystemDatasetWriteOptions& write_options)
+  : next_basename_(std::move(next_basename)),
+base_dir_(fs::internal::EnsureTrailingSlash(write_options.base_dir)),
+write_options_(write_options) {}
 
-  /// An Expression already satisfied by every batch to be written
-  std::shared_ptr partition_expression;
-};
+  Result> Get(const Expression& 
partition_expression,
+ const std::shared_ptr& 
schema) {
+ARROW_ASSIGN_OR_RAISE(auto part_segments,
+  
write_options_.partitioning->Format(partition_expression));
+std::string dir = base_dir_ + part_segments;
 
-Status WriteTask::Execute() {
-  std::unordered_map path_to_batches;
-
-  // TODO(bkietz) these calls to Partition() should be scattered across a 
TaskGroup
-  for (auto maybe_batch : IteratorFromReader(batches)) {
-ARROW_ASSIGN_OR_RAISE(auto batch, maybe_batch);
-ARROW_ASSIGN_OR_RAISE(auto partitioned_batches, 
partitioning->Partition(batch));
-for (auto&& partitioned_batch : partitioned_batches) {
-  AndExpression expr(std::move(partitioned_batch.partition_expression),
- partition_expression);
-  ARROW_ASSIGN_OR_RAISE(std::string path, partitioning->Format(expr));
-  path = fs::internal::EnsureLeadingSlash(path);
-  path_to_batches[path].push_back(std::move(partitioned_batch.batch));
-}
-  }
+util::Mutex::Guard writer_lock;
+
+auto set_lock = mutex_.Lock();
 
-  for (auto&& path_batches : path_to_batches) {
-auto dir = base_dir + path_batches.first;
-RETURN_NOT_OK(filesystem->CreateDir(dir, /*recursive=*/true));
+auto writer =

Review comment:
   I don't understand the logic here about dual locking (set_lock, 
writer_lock).

##
File path: cpp/src/arrow/dataset/file_base.cc
##
@@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
   return MakeVectorIterator(std::move(fragments));
 }
 
-struct WriteTask {
-  Status Execute();
+Status FileWriter::Write(RecordBatchReader* batches) {
+  for (std::shared_ptr batch;;) {

Review comment:
   while-loop?

##
File path: cpp/src/arrow/dataset/file_base.cc
##
@@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
   return MakeVectorIterator(std::move(fragments));
 }
 
-struct WriteTask {
-  Status Execute();
+Status FileWriter::Write(RecordBatchReader* batches) {
+  for (std::shared_ptr batch;;) {
+RETURN_NOT_OK(batches->ReadNext());
+if (batch == nullptr) break;
+RETURN_NOT_OK(Write(batch));
+  }
+  return Status::OK();
+}
+
+struct NextBasenameGenerator {
+  static Result Make(const std::string& 
basename_template) {
+if (basename_template.find(fs::internal::kSep) != std::string::npos) {
+  return Status::Invalid("basename_template contained 

[GitHub] [arrow] alamb opened a new pull request #8327: ARROW-10160: [Rust] Improve DictionaryType documentation (clarify which type is which)

2020-10-02 Thread GitBox


alamb opened a new pull request #8327:
URL: https://github.com/apache/arrow/pull/8327


   I kept having to look this up, so I figured I would update the docs about it



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] malthe commented on pull request #6807: PARQUET-1404: [C++] Adding Page level Indexes

2020-10-02 Thread GitBox


malthe commented on pull request #6807:
URL: https://github.com/apache/arrow/pull/6807#issuecomment-702683387


   I added an issue https://issues.apache.org/jira/browse/ARROW-10158.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] malthe commented on issue #1426: pyarrow parquet - support for row group filters

2020-10-02 Thread GitBox


malthe commented on issue #1426:
URL: https://github.com/apache/arrow/issues/1426#issuecomment-702679313


   I opened a JIRA issue on this: 
https://issues.apache.org/jira/browse/ARROW-10158.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8326: ARROW-10157: [Rust] Add an example to the take kernel

2020-10-02 Thread GitBox


github-actions[bot] commented on pull request #8326:
URL: https://github.com/apache/arrow/pull/8326#issuecomment-702677975


   https://issues.apache.org/jira/browse/ARROW-10157



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] drusso commented on pull request #8222: ARROW-10043: [Rust][DataFusion] Implement COUNT(DISTINCT col)

2020-10-02 Thread GitBox


drusso commented on pull request #8222:
URL: https://github.com/apache/arrow/pull/8222#issuecomment-702677246


   Just a quick update here, I'll have some time today and over the weekend to 
work through the changes. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb opened a new pull request #8326: ARROW-10157: [Rust] Add an example to the take kernel

2020-10-02 Thread GitBox


alamb opened a new pull request #8326:
URL: https://github.com/apache/arrow/pull/8326


   I was playing around with this and thought I would encode some of my 
learning as a doc / example. 
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #8320: ARROW-10058: [C++] Improve repeated levels conversion without BMI2

2020-10-02 Thread GitBox


pitrou commented on pull request #8320:
URL: https://github.com/apache/arrow/pull/8320#issuecomment-702639541


   > is it ok if I take a closer look next week
   
   No problem.
   
   > but I would not expect that to be the case.
   
   Right. The emulation is probably much slower.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #8319: ARROW-10057: [C++] Add hand-written Parquet nested tests

2020-10-02 Thread GitBox


pitrou commented on a change in pull request #8319:
URL: https://github.com/apache/arrow/pull/8319#discussion_r498727490



##
File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
##
@@ -2434,6 +2435,145 @@ TEST(TestArrowReadWrite, CanonicalNestedRoundTrip) {
   CheckSimpleRoundtrip(expected, 2);
 }
 
+TEST(ArrowReadWrite, ListOfStruct) {
+  using ::arrow::field;
+
+  auto type = ::arrow::list(::arrow::struct_(
+  {field("a", ::arrow::int16(), /*nullable=*/false), field("b", 
::arrow::utf8())}));
+
+  const char* json = R"([
+  [{"a": 4, "b": "foo"}, {"a": 5}, {"a": 6, "b": "bar"}],
+  [null, {"a": 7}],
+  null,
+  []])";
+  auto array = ::arrow::ArrayFromJSON(type, json);
+  auto table = ::arrow::Table::Make(::arrow::schema({field("root", type)}), 
{array});
+  CheckSimpleRoundtrip(table, 2);
+}
+
+TEST(ArrowReadWrite, ListOfStructOfList1) {
+  using ::arrow::field;
+  using ::arrow::list;
+  using ::arrow::struct_;
+
+  auto type = list(struct_({field("a", ::arrow::int16(), /*nullable=*/false),
+field("b", list(::arrow::int64()))}));
+
+  const char* json = R"([
+  [{"a": 123, "b": [1, 2, null, 3]}, null],
+  null,
+  [],
+  [{"a": 456}, {"a": 789, "b": []}, {"a": 876, "b": [4, 5, 6]}]])";
+  auto array = ::arrow::ArrayFromJSON(type, json);
+  auto table = ::arrow::Table::Make(::arrow::schema({field("root", type)}), 
{array});
+  CheckSimpleRoundtrip(table, 2);
+}
+
+TEST(ArrowReadWrite, DISABLED_ListOfStructOfList2) {
+  using ::arrow::field;
+  using ::arrow::list;
+  using ::arrow::struct_;
+
+  auto type =
+  list(field("item",
+ struct_({field("a", ::arrow::int16(), /*nullable=*/false),
+  field("b", list(::arrow::int64()), 
/*nullable=*/false)}),
+ /*nullable=*/false));
+
+  const char* json = R"([
+  [{"a": 123, "b": [1, 2, 3]}],
+  null,
+  [],
+  [{"a": 456}, {"a": 789, "b": [null]}, {"a": 876, "b": [4, 5, 6]}]])";

Review comment:
   Ah, you may be right indeed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alamb commented on pull request #8322: ARROW-9520: [Rust] [DataFusion] Add support for aliased aggregate exprs

2020-10-02 Thread GitBox


alamb commented on pull request #8322:
URL: https://github.com/apache/arrow/pull/8322#issuecomment-702631071


   > Are squash merges accepted for this project or would you prefer that I 
rebase the PR down to a single commit?
   
   There is some sort of automation script that merges arrow PRs -- I don't 
think there is any need to rebase to a single commit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] thamht4190 commented on pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-10-02 Thread GitBox


thamht4190 commented on pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#issuecomment-702625493


   @bkietz I have the same thought with @ggershinsky. Moreover, I wrote the 
code based on Java version, so I may not have enough deep understanding to 
write out the mind & design behind. I will try to list out some points in sub 
PR's description. Is that ok?
   Anw, I will try to fix the comment already listed on this PR first.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] returnString commented on pull request #8322: ARROW-9520: [Rust] [DataFusion] Add support for aliased aggregate exprs

2020-10-02 Thread GitBox


returnString commented on pull request #8322:
URL: https://github.com/apache/arrow/pull/8322#issuecomment-702621373


   That's interesting, I've got rustfmt set to run on save whilst editing in 
vscode but I _was_ having some issues with code completion in the project too; 
I'll try and figure out what's up with that for any future contributions.
   
   Are squash merges accepted for this project or would you prefer that I 
rebase the PR down to a single commit?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] thamht4190 commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-10-02 Thread GitBox


thamht4190 commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r498709142



##
File path: cpp/src/parquet/encryption/remote_kms_client.cc
##
@@ -0,0 +1,127 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/json/object_parser.h"
+#include "arrow/json/object_writer.h"
+
+#include "parquet/encryption/key_toolkit_internal.h"
+#include "parquet/encryption/remote_kms_client.h"
+#include "parquet/exception.h"
+
+using arrow::json::ObjectParser;
+using arrow::json::ObjectWriter;
+
+namespace parquet {
+namespace encryption {
+
+constexpr const char RemoteKmsClient::kLocalWrapNoKeyVersion[];
+
+constexpr const char 
RemoteKmsClient::LocalKeyWrap::kLocalWrapKeyVersionField[];
+constexpr const char 
RemoteKmsClient::LocalKeyWrap::kLocalWrapEncryptedKeyField[];
+
+RemoteKmsClient::LocalKeyWrap::LocalKeyWrap(const std::string& 
master_key_version,
+const std::string& 
encrypted_encoded_key)
+: encrypted_encoded_key_(encrypted_encoded_key),
+  master_key_version_(master_key_version) {}
+
+std::string RemoteKmsClient::LocalKeyWrap::CreateSerialized(
+const std::string& encrypted_encoded_key) {
+  ObjectWriter json_writer;
+
+  json_writer.SetString(kLocalWrapKeyVersionField, kLocalWrapNoKeyVersion);
+  json_writer.SetString(kLocalWrapEncryptedKeyField, encrypted_encoded_key);
+
+  return json_writer.Serialize();
+}
+
+RemoteKmsClient::LocalKeyWrap RemoteKmsClient::LocalKeyWrap::Parse(
+const std::string& wrapped_key) {
+  ObjectParser json_parser;
+  if (!json_parser.Parse(wrapped_key)) {
+throw ParquetException("Failed to parse local key wrap json " + 
wrapped_key);
+  }
+  std::string master_key_version;
+  PARQUET_ASSIGN_OR_THROW(master_key_version,
+  json_parser.GetString(kLocalWrapKeyVersionField));
+
+  std::string encrypted_encoded_key;
+  PARQUET_ASSIGN_OR_THROW(encrypted_encoded_key,
+  json_parser.GetString(kLocalWrapEncryptedKeyField));
+
+  return RemoteKmsClient::LocalKeyWrap(master_key_version, 
encrypted_encoded_key);
+}
+
+void RemoteKmsClient::Initialize(const KmsConnectionConfig& 
kms_connection_config,
+ bool is_wrap_locally) {
+  kms_connection_config_ = kms_connection_config;
+  is_wrap_locally_ = is_wrap_locally;
+  if (is_wrap_locally_) {
+master_key_cache_.Clear();
+  }
+
+  is_default_token_ =
+  kms_connection_config_.key_access_token() == 
KmsClient::kKeyAccessTokenDefault;
+
+  InitializeInternal();
+}
+
+std::string RemoteKmsClient::WrapKey(const std::string& key_bytes,
+ const std::string& master_key_identifier) 
{
+  if (is_wrap_locally_) {

Review comment:
   Thanks!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #8320: ARROW-10058: [C++] Improve repeated levels conversion without BMI2

2020-10-02 Thread GitBox


emkornfield commented on pull request #8320:
URL: https://github.com/apache/arrow/pull/8320#issuecomment-702615585


   > I also notice that we call internal::GreaterThanBitmap for each 64 levels, 
which always goes through the dynamic dispatch indirection (meaning two 
function calls, I think). We could call GreaterThanBitmapImpl but that requires 
compiling a specialized version of level_conversion_inc.h for AVX2, otherwise 
we lose performance.
   
   yeah it isn't ideal, it is possible there is a better factoring in there but 
it seemed hard to do and isolate BMI2 special instructions, I guess if this 
isn't too much slower then BMI2 on intel we could potentially collapse 
everything, but I would not expect that to be the case.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #8320: ARROW-10058: [C++] Improve repeated levels conversion without BMI2

2020-10-02 Thread GitBox


emkornfield commented on pull request #8320:
URL: https://github.com/apache/arrow/pull/8320#issuecomment-702614856


   @pitrou I'm devoting most of my bandwidth to try to finish up the parquet 
read component this week, is it ok if I take a closer look next week (hopefully 
with enough time before an RC is cut?)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ggershinsky commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-10-02 Thread GitBox


ggershinsky commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r498701580



##
File path: cpp/src/parquet/encryption/remote_kms_client.cc
##
@@ -0,0 +1,127 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/json/object_parser.h"
+#include "arrow/json/object_writer.h"
+
+#include "parquet/encryption/key_toolkit_internal.h"
+#include "parquet/encryption/remote_kms_client.h"
+#include "parquet/exception.h"
+
+using arrow::json::ObjectParser;
+using arrow::json::ObjectWriter;
+
+namespace parquet {
+namespace encryption {
+
+constexpr const char RemoteKmsClient::kLocalWrapNoKeyVersion[];
+
+constexpr const char 
RemoteKmsClient::LocalKeyWrap::kLocalWrapKeyVersionField[];
+constexpr const char 
RemoteKmsClient::LocalKeyWrap::kLocalWrapEncryptedKeyField[];
+
+RemoteKmsClient::LocalKeyWrap::LocalKeyWrap(const std::string& 
master_key_version,
+const std::string& 
encrypted_encoded_key)
+: encrypted_encoded_key_(encrypted_encoded_key),
+  master_key_version_(master_key_version) {}
+
+std::string RemoteKmsClient::LocalKeyWrap::CreateSerialized(
+const std::string& encrypted_encoded_key) {
+  ObjectWriter json_writer;
+
+  json_writer.SetString(kLocalWrapKeyVersionField, kLocalWrapNoKeyVersion);
+  json_writer.SetString(kLocalWrapEncryptedKeyField, encrypted_encoded_key);
+
+  return json_writer.Serialize();
+}
+
+RemoteKmsClient::LocalKeyWrap RemoteKmsClient::LocalKeyWrap::Parse(
+const std::string& wrapped_key) {
+  ObjectParser json_parser;
+  if (!json_parser.Parse(wrapped_key)) {
+throw ParquetException("Failed to parse local key wrap json " + 
wrapped_key);
+  }
+  std::string master_key_version;
+  PARQUET_ASSIGN_OR_THROW(master_key_version,
+  json_parser.GetString(kLocalWrapKeyVersionField));
+
+  std::string encrypted_encoded_key;
+  PARQUET_ASSIGN_OR_THROW(encrypted_encoded_key,
+  json_parser.GetString(kLocalWrapEncryptedKeyField));
+
+  return RemoteKmsClient::LocalKeyWrap(master_key_version, 
encrypted_encoded_key);
+}
+
+void RemoteKmsClient::Initialize(const KmsConnectionConfig& 
kms_connection_config,
+ bool is_wrap_locally) {
+  kms_connection_config_ = kms_connection_config;
+  is_wrap_locally_ = is_wrap_locally;
+  if (is_wrap_locally_) {
+master_key_cache_.Clear();
+  }
+
+  is_default_token_ =
+  kms_connection_config_.key_access_token() == 
KmsClient::kKeyAccessTokenDefault;
+
+  InitializeInternal();
+}
+
+std::string RemoteKmsClient::WrapKey(const std::string& key_bytes,
+ const std::string& master_key_identifier) 
{
+  if (is_wrap_locally_) {

Review comment:
   @thamht4190 This is a very good point! I see how it can be confusing. 
Lets then rename (and slightly modify)  `RemoteKmsClient` to something like  
`LocalWrapKmsClient` that will be used only for KMS systems that don't support 
in-server wrapping. All other KMS clients will extend the `KmsClient` interface 
directly.
   I'll try to make this change also in the Java API (parquet-mr).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




  1   2   >