[GitHub] [arrow] kiszk commented on pull request #7938: ARROW-9701: [CI][Java] Add a job for s390x Java on TravisCI
kiszk commented on pull request #7938: URL: https://github.com/apache/arrow/pull/7938#issuecomment-703045578 @emkornfield @kou Is it ok to merge this PR now? Or, do we still hold off for a while? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-703043604 Got the actions to rerun, but some are still failing. As far as I can tell, these failures aren't due to the changes in this PR.
[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
github-actions[bot] commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-703042966 Revision: 5089cb76106ea909aced4e0db32c0d5e9a512bd3 Submitted crossbow builds: [ursa-labs/crossbow @ actions-589](https://github.com/ursa-labs/crossbow/branches/all?query=actions-589)

|Task|Status|
|---|---|
|test-r-rhub-ubuntu-gcc-release|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rhub-ubuntu-gcc-release)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-589-azure-test-r-rhub-ubuntu-gcc-release)|
|test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rocker-r-base-latest)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-589-azure-test-r-rocker-r-base-latest)|
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-589-azure-test-r-rstudio-r-base-3.6-bionic)|
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-589-azure-test-r-rstudio-r-base-3.6-centos6)|
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-589-azure-test-r-rstudio-r-base-3.6-centos8)|
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-589-azure-test-r-rstudio-r-base-3.6-opensuse15)|
|test-r-rstudio-r-base-3.6-opensuse42|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-589-azure-test-r-rstudio-r-base-3.6-opensuse42)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-589-azure-test-r-rstudio-r-base-3.6-opensuse42)|
[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
nealrichardson commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-703042816 @github-actions crossbow submit test-r-r*
[GitHub] [arrow] jorgecarleitao commented on pull request #8324: ARROW-10156: [Rust] Added github action to label PRs for rust.
jorgecarleitao commented on pull request #8324: URL: https://github.com/apache/arrow/pull/8324#issuecomment-703042742 FYI @lidavidm (`lang-java`) and @romainfrancois (`lang-R`), since both of you are also tagging PRs with github labels.
[GitHub] [arrow] emkornfield closed pull request #8293: ARROW-10127: Update specification for Decimal to allow for 256-bits
emkornfield closed pull request #8293: URL: https://github.com/apache/arrow/pull/8293
[GitHub] [arrow] emkornfield commented on pull request #8293: ARROW-10127: Update specification for Decimal to allow for 256-bits
emkornfield commented on pull request #8293: URL: https://github.com/apache/arrow/pull/8293#issuecomment-703042352 Merging based on vote.
[GitHub] [arrow] emkornfield commented on a change in pull request #8330: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes
emkornfield commented on a change in pull request #8330: URL: https://github.com/apache/arrow/pull/8330#discussion_r499108858

## File path: rust/parquet/src/arrow/arrow_writer.rs

## @@ -688,4 +688,413 @@ mod tests {
         writer.write().unwrap();
         writer.close().unwrap();
     }
+
+    const SMALL_SIZE: usize = 100;
+
+    fn roundtrip(filename: &str, expected_batch: RecordBatch) {
+        let file = get_temp_file(filename, &[]);
+
+        let mut writer = ArrowWriter::try_new(
+            file.try_clone().unwrap(),
+            expected_batch.schema(),
+            None,
+        )
+        .unwrap();
+        writer.write(&expected_batch).unwrap();
+        writer.close().unwrap();
+
+        let reader = SerializedFileReader::new(file).unwrap();
+        let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(reader));
+        let mut record_batch_reader = arrow_reader.get_record_reader(1024).unwrap();
+
+        let actual_batch = record_batch_reader.next_batch().unwrap().unwrap();
+
+        assert_eq!(expected_batch.schema(), actual_batch.schema());
+        assert_eq!(expected_batch.num_columns(), actual_batch.num_columns());
+        assert_eq!(expected_batch.num_rows(), actual_batch.num_rows());
+        for i in 0..expected_batch.num_columns() {
+            let expected_data = expected_batch.column(i).data();
+            let actual_data = actual_batch.column(i).data();
+
+            assert_eq!(expected_data.data_type(), actual_data.data_type());
+            assert_eq!(expected_data.len(), actual_data.len());
+            assert_eq!(expected_data.null_count(), actual_data.null_count());
+            assert_eq!(expected_data.offset(), actual_data.offset());
+            assert_eq!(expected_data.buffers(), actual_data.buffers());
+            assert_eq!(expected_data.child_data(), actual_data.child_data());
+            assert_eq!(expected_data.null_bitmap(), actual_data.null_bitmap());
+        }
+    }
+
+    fn one_column_roundtrip(filename: &str, values: ArrayRef, nullable: bool) {
+        let schema = Schema::new(vec![Field::new(
+            "col",
+            values.data_type().clone(),
+            nullable,
+        )]);
+        let expected_batch =
+            RecordBatch::try_new(Arc::new(schema), vec![values]).unwrap();
+
+        roundtrip(filename, expected_batch);
+    }
+
+    fn values_required<A, I>(iter: I, filename: &str)
+    where
+        A: From<Vec<I::Item>> + Array + 'static,
+        I: IntoIterator,
+    {
+        let raw_values: Vec<_> = iter.into_iter().collect();
+        let values = Arc::new(A::from(raw_values));
+        one_column_roundtrip(filename, values, false);
+    }
+
+    fn values_optional<A, I>(iter: I, filename: &str)
+    where
+        A: From<Vec<Option<I::Item>>> + Array + 'static,
+        I: IntoIterator,
+    {
+        let optional_raw_values: Vec<_> = iter
+            .into_iter()
+            .enumerate()
+            .map(|(i, v)| if i % 2 == 0 { None } else { Some(v) })
+            .collect();
+        let optional_values = Arc::new(A::from(optional_raw_values));
+        one_column_roundtrip(filename, optional_values, true);
+    }
+
+    fn required_and_optional<A, I>(iter: I, filename: &str)
+    where
+        A: From<Vec<I::Item>> + From<Vec<Option<I::Item>>> + Array + 'static,
+        I: IntoIterator + Clone,
+    {
+        values_required::<A, I>(iter.clone(), filename);
+        values_optional::<A, I>(iter, filename);
+    }
+
+    #[test]
+    #[should_panic(expected = "Null arrays not supported")]
+    fn null_single_column() {
+        let values = Arc::new(NullArray::new(SMALL_SIZE));
+        one_column_roundtrip("null_single_column", values.clone(), true);
+        one_column_roundtrip("null_single_column", values, false);
+    }
+
+    #[test]
+    #[should_panic(
+        expected = "Attempting to write an Arrow type that is not yet implemented"
+    )]
+    fn bool_single_column() {
+        required_and_optional::<BooleanArray, _>(
+            [true, false].iter().cycle().copied().take(SMALL_SIZE),
+            "bool_single_column",
+        );
+    }
+
+    #[test]
+    fn i8_single_column() {
+        required_and_optional::<Int8Array, _>(0..SMALL_SIZE as i8, "i8_single_column");
+    }
+
+    #[test]
+    fn i16_single_column() {
+        required_and_optional::<Int16Array, _>(0..SMALL_SIZE as i16, "i16_single_column");
+    }
+
+    #[test]
+    fn i32_single_column() {
+        required_and_optional::<Int32Array, _>(0..SMALL_SIZE as i32, "i32_single_column");
+    }
+
+    #[test]
+    fn i64_single_column() {
+        required_and_optional::<Int64Array, _>(0..SMALL_SIZE as i64, "i64_single_column");
+    }
+
+    #[test]
+    fn u8_single_column() {
+        required_and_optional::<UInt8Array, _>(0..SMALL_SIZE as u8, "u8_single_column");
+    }
+
+    #[test]
+    fn u16_single_column() {
+        required_and_optional::<UInt16Array, _>(
+            0..SMALL_SIZE as u16,
+            "u16_single_column",
+        );
+    }
+
+    #[test]
+    fn u32_single_column() {
+        required_and_optional::<UInt32Array, _>(
+            0..SMALL_SIZE as u32,
+            "u32_single_column",
+        );
+
[GitHub] [arrow] emkornfield commented on pull request #8330: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes
emkornfield commented on pull request #8330: URL: https://github.com/apache/arrow/pull/8330#issuecomment-703034687 > I am not sure the actual data arrays have to be the same -- what needs to be the same is the data rows where the null_bit is 0 (aka we should probably be checking only for non-null rows) Small nit: I'm pretty sure null_bit = 1 is actually where values should be compared (thinking of it as a validity bitmap instead of a null bitmap helps). And yes, I agree they should be compared separately, as there is generally no requirement on the values of null fields (generally, you'll want to initialize them to some known value to avoid security issues).
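The comparison rule discussed above can be sketched without any arrow types (a minimal illustration using plain `Option` slots in place of a validity bitmap; `equal_ignoring_nulls` is a hypothetical helper, not part of the crate):

```rust
// Sketch: compare two columns only where the validity bit is 1, i.e. where the
// slot is non-null. Bytes behind null slots are arbitrary and must be ignored.
fn equal_ignoring_nulls(expected: &[Option<i32>], actual: &[Option<i32>]) -> bool {
    expected.len() == actual.len()
        && expected.iter().zip(actual).all(|(e, a)| match (e, a) {
            (Some(x), Some(y)) => x == y, // both valid: values must match
            (None, None) => true,         // both null: underlying value ignored
            _ => false,                   // validity disagrees: not equal
        })
}

fn main() {
    assert!(equal_ignoring_nulls(&[Some(1), None, Some(3)], &[Some(1), None, Some(3)]));
    // Same validity but a different value at a valid slot: not equal.
    assert!(!equal_ignoring_nulls(&[Some(1), None], &[Some(2), None]));
}
```

Comparing whole buffers byte-for-byte would spuriously fail whenever the two writers left different garbage behind null slots, which is why the comment argues for comparing validity and values separately.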
[GitHub] [arrow] paddyhoran commented on a change in pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray
paddyhoran commented on a change in pull request #8331: URL: https://github.com/apache/arrow/pull/8331#discussion_r499105205

## File path: rust/arrow/src/util/pretty.rs

## @@ -126,15 +130,56 @@ fn array_value_to_string(column: &array::ArrayRef, row: usize) -> Result<String> {
     DataType::Time64(unit) if *unit == TimeUnit::Nanosecond => {
         make_string!(array::Time64NanosecondArray, column, row)
     }
+    DataType::Dictionary(index_type, _value_type) => match **index_type {
+        DataType::Int8 => dict_array_value_to_string::<Int8Type>(column, row),
+        DataType::Int16 => dict_array_value_to_string::<Int16Type>(column, row),
+        DataType::Int32 => dict_array_value_to_string::<Int32Type>(column, row),
+        DataType::Int64 => dict_array_value_to_string::<Int64Type>(column, row),
+        DataType::UInt8 => dict_array_value_to_string::<UInt8Type>(column, row),
+        DataType::UInt16 => dict_array_value_to_string::<UInt16Type>(column, row),
+        DataType::UInt32 => dict_array_value_to_string::<UInt32Type>(column, row),
+        DataType::UInt64 => dict_array_value_to_string::<UInt64Type>(column, row),
+        _ => Err(ArrowError::InvalidArgumentError(format!(
+            "Unsupported index type {:?} type for {:?} in repl.",

Review comment: I think this (pretty printing) was part of DataFusion's cli to begin with but was moved into `arrow` as it was generally useful. I don't understand repl in this context either. I suggest updating it.
[GitHub] [arrow] nevi-me commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
nevi-me commented on pull request #8291: URL: https://github.com/apache/arrow/pull/8291#issuecomment-703014936 Merged
[GitHub] [arrow] nevi-me commented on a change in pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
nevi-me commented on a change in pull request #8291: URL: https://github.com/apache/arrow/pull/8291#discussion_r499097627

## File path: rust/arrow/src/ipc/convert.rs

## @@ -609,10 +621,45 @@ pub(crate) fn get_fb_field_type<'a: 'b, 'b>(
             children: Some(fbb.create_vector(&children[..])),
         }
     }
+    Dictionary(_, value_type) => {
+        // In this library, the dictionary "type" is a logical construct. Here we
+        // pass through to the value type, as we've already captured the index
+        // type in the DictionaryEncoding metadata in the parent field
+        get_fb_field_type(value_type, fbb)
+    }
     t => unimplemented!("Type {:?} not supported", t),
     }
 }
+
+/// Create an IPC dictionary encoding
+pub(crate) fn get_fb_dictionary<'a: 'b, 'b>(
+    index_type: &DataType,
+    dict_id: i64,
+    dict_is_ordered: bool,
+    fbb: &mut FlatBufferBuilder<'a>,
+) -> WIPOffset<ipc::DictionaryEncoding<'b>> {
+    // We assume that the dictionary index type (as an integer) has already been
+    // validated elsewhere, and can safely assume we are dealing with signed
+    // integers
+    let mut index_builder = ipc::IntBuilder::new(fbb);
+    index_builder.add_is_signed(true);

Review comment: @carols10cents we can address this in future, I'm merging this
[GitHub] [arrow] nevi-me commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
nevi-me commented on pull request #8291: URL: https://github.com/apache/arrow/pull/8291#issuecomment-703013274 > How can I best help with getting the rust-parquet-arrow-writer branch ready to be merged into master? The main blocker right now is computing nested definition and repetition levels for deeply nested arrays (see https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow for the C++ implementation). I've been working on that on and off due to time constraints (and the need to spend a good chunk of time per session, which I haven't had much of). Perhaps I can commit my work so you can have a look at it. I'll also open more JIRAs for work I believe we'd need to cover. Writing more test cases will also definitely help us.
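For background on the blocker mentioned above, Dremel-style definition and repetition levels can be illustrated for the simplest nested case, a nullable `list<nullable int>` column, using only std types (an illustrative sketch, not the arrow or parquet crates' actual implementation; the `levels` helper and its level numbering are assumptions for this one shape):

```rust
// Definition levels for list<int>, both nullable:
//   0 = list is null, 1 = list present but empty,
//   2 = item is null,  3 = item present.
// Repetition levels: 0 starts a new list, 1 continues the current one.
fn levels(col: &[Option<Vec<Option<i32>>>]) -> (Vec<i16>, Vec<i16>) {
    let (mut def, mut rep) = (Vec::new(), Vec::new());
    for list in col {
        match list {
            None => {
                def.push(0); // null list still emits one (def, rep) pair
                rep.push(0);
            }
            Some(items) if items.is_empty() => {
                def.push(1); // empty list: defined to the list level only
                rep.push(0);
            }
            Some(items) => {
                for (i, item) in items.iter().enumerate() {
                    def.push(if item.is_some() { 3 } else { 2 });
                    rep.push(if i == 0 { 0 } else { 1 });
                }
            }
        }
    }
    (def, rep)
}

fn main() {
    // [[1, null], null, []]
    let col = vec![Some(vec![Some(1), None]), None, Some(vec![])];
    let (def, rep) = levels(&col);
    assert_eq!(def, vec![3, 2, 0, 1]);
    assert_eq!(rep, vec![0, 1, 0, 0]);
}
```

The hard part referred to in the comment is generalizing this recursion to arbitrary nesting depth (structs of lists of structs, ...), where the maximum definition level grows with each nullable or repeated layer.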
[GitHub] [arrow] nevi-me commented on pull request #8291: ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types
nevi-me commented on pull request #8291: URL: https://github.com/apache/arrow/pull/8291#issuecomment-703011014 Hi @carols10cents, my apologies for the delay. > Am I at least correct that adding dictionary support will involve adding support for the dictionary schema as well as the dictionary data? And that neither of those are done yet? The Parquet format's dictionary semantics are, in my understanding, different to Arrow's, in that we could have a `StringArray` in Arrow that gets saved with Parquet's dictionary encoding. I'm still going through the rest of the Rust Parquet codebase though, so I'm not speaking from any authority. My comment was more in passing, and generally around dictionary support in IPC, in that beyond serializing an `arrow::datatypes::Schema` to `arrow::ipc::Schema`, there might be little relation between the file formats.
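The distinction drawn above, that a plain string column can still be stored dictionary-encoded on disk, comes down to deduplicating values into a dictionary plus per-row indices. A minimal std-only sketch of that encoding step (illustrative only; `dictionary_encode` is a hypothetical helper, not the parquet crate's API):

```rust
use std::collections::HashMap;

// Dictionary-encode a string column: return the distinct values in first-seen
// order plus, for each row, the index of its value in that dictionary.
fn dictionary_encode(rows: &[&str]) -> (Vec<String>, Vec<usize>) {
    let mut dict: Vec<String> = Vec::new();
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut indices = Vec::with_capacity(rows.len());
    for &row in rows {
        let idx = *seen.entry(row).or_insert_with(|| {
            dict.push(row.to_string());
            dict.len() - 1
        });
        indices.push(idx);
    }
    (dict, indices)
}

fn main() {
    let (dict, indices) = dictionary_encode(&["a", "b", "a", "c", "b"]);
    assert_eq!(dict, vec!["a", "b", "c"]);
    assert_eq!(indices, vec![0, 1, 0, 2, 1]);
}
```

In Arrow, the dictionary is part of the logical type (`DictionaryArray`); in Parquet it is a physical encoding a writer may apply to any column, which is why the two don't map one-to-one.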
[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
github-actions[bot] commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-703005258 Revision: 5089cb76106ea909aced4e0db32c0d5e9a512bd3 Submitted crossbow builds: [ursa-labs/crossbow @ actions-588](https://github.com/ursa-labs/crossbow/branches/all?query=actions-588)

|Task|Status|
|---|---|
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-588-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-588-azure-test-r-rstudio-r-base-3.6-centos6)|
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-588-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-588-azure-test-r-rstudio-r-base-3.6-centos8)|
[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
nealrichardson commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-703003701 @github-actions crossbow submit test-r-rstudio-r-base-3.6-centos8 test-r-rstudio-r-base-3.6-centos6
[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
github-actions[bot] commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-703001148 Revision: 5d1b6eec6dfcf4ed01b83c3501b72f96e7690bd1 Submitted crossbow builds: [ursa-labs/crossbow @ actions-587](https://github.com/ursa-labs/crossbow/branches/all?query=actions-587)

|Task|Status|
|---|---|
|test-r-rstudio-r-base-3.6-centos6|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-587-azure-test-r-rstudio-r-base-3.6-centos6)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-587-azure-test-r-rstudio-r-base-3.6-centos6)|
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-587-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-587-azure-test-r-rstudio-r-base-3.6-centos8)|
[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
nealrichardson commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-703000798 @github-actions crossbow submit test-r-rstudio-r-base-3.6-centos8 test-r-rstudio-r-base-3.6-centos6
[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
github-actions[bot] commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702989907 Revision: 176d1e6b1a96d8819be4bbbd0815067733c2e54b Submitted crossbow builds: [ursa-labs/crossbow @ actions-586](https://github.com/ursa-labs/crossbow/branches/all?query=actions-586)

|Task|Status|
|---|---|
|test-r-rstudio-r-base-3.6-centos8|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-586-azure-test-r-rstudio-r-base-3.6-centos8)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-586-azure-test-r-rstudio-r-base-3.6-centos8)|
|test-r-rstudio-r-base-3.6-opensuse15|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-586-azure-test-r-rstudio-r-base-3.6-opensuse15)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-586-azure-test-r-rstudio-r-base-3.6-opensuse15)|
[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
nealrichardson commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702989412 @github-actions crossbow submit test-r-rstudio-r-base-3.6-centos8 test-r-rstudio-r-base-3.6-opensuse15
[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
nealrichardson commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702983569 Current R Linux static build status (using shared libs for curl and openssl/crypto, provided by the system):

* test-r-rhub-ubuntu-gcc-release (ubuntu 16.04, gcc 5.4.0): FAIL. C++ build with S3 succeeds but R package link error: `undefined symbol: __cpu_model`
* test-r-rocker-r-base-latest (debian:testing, gcc 9.3.0): SUCCESS with S3
* test-r-rstudio-r-base-3.6-bionic (ubuntu 18.04, gcc 7.5.0): SUCCESS with S3
* test-r-rstudio-r-base-3.6-centos6 (gcc 8.3.1): FAIL because system openssl is stale: `Found unsuitable version "1.0.1e", but required is at least "1.0.2"`
* test-r-rstudio-r-base-3.6-centos7 (gcc 4.8.5): SUCCESS *without S3* (gcc too old)
* test-r-rstudio-r-base-3.6-centos8 (gcc 8.3.1): FAIL `(missing: OPENSSL_CRYPTO_LIBRARY) (found suitable version "1.1.1c", minimum required is "1.0.2")`
* test-r-rstudio-r-base-3.6-opensuse15 (gcc 7.4.1): FAIL `(missing: OPENSSL_CRYPTO_LIBRARY) (found suitable version "1.1.0i", minimum required is "1.0.2")`
* test-r-rstudio-r-base-3.6-opensuse42 (gcc 4.8): SUCCESS *without S3* (gcc too old)

Summary: 2 succeed with S3, 2 build successfully without S3 because they detect that gcc is too old, 2 fail because they find openssl but not crypto for some reason, 1 fails because the openssl provided by its package manager is too old, and 1 fails with `undefined symbol: __cpu_model` (a gcc 5 bug?).

Among the reasons this is frustrating:

* I've installed openssl but it says it can't find it (sometimes); I've installed it from the system package manager but it's still too old. Both are easily solvable by bundling openssl like we already do in the wheels, but I've gotten strong pushback on that.
* I've tried to add logic in the R build script to detect whether we can build aws-sdk-cpp (https://github.com/apache/arrow/pull/8304/files#diff-3875fa5e75833c426b36487b25892bd8R370-R398), but that's apparently not enough to catch these deviant (yet common) cases. It would be nice to move this logic to cmake, an `ARROW_S3=PLEASE_IF_YOU_CAN` option: is there objection to that?
* Given how fragile this is, I'm fearing the onslaught of bug reports after the release.
[GitHub] [arrow] dianaclarke commented on pull request #8312: ARROW-9941: [Python] Better string representation for extension types
dianaclarke commented on pull request #8312: URL: https://github.com/apache/arrow/pull/8312#issuecomment-702978986 I have it returning `"extension>"` now from the C++ side. No idea if it's right though. Thanks for your patience :)
[GitHub] [arrow] jhorstmann commented on pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray
jhorstmann commented on pull request #8331: URL: https://github.com/apache/arrow/pull/8331#issuecomment-702964608 Changes look good to me, but I'm wondering now how the existing code handles null values in the primitive arrays. It seems to me that it would print the default value. For `Utf8` I think it works somewhat implicitly because of how the offsets for null values are laid out. Would it make sense to fix this in the same PR, or should I open a separate ticket?
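The null-handling fix being proposed amounts to checking validity before formatting a slot, so a null renders as empty rather than as the type's default value. A minimal std-only sketch (`value_to_string` here is a hypothetical stand-in for the crate's per-row formatting, using `Option` in place of a validity bitmap check):

```rust
// Render a slot: nulls become an empty string instead of the default value 0.
fn value_to_string(slot: &Option<i64>) -> String {
    match slot {
        Some(v) => v.to_string(),
        None => String::new(), // null: print nothing, not the default
    }
}

fn main() {
    assert_eq!(value_to_string(&Some(42)), "42");
    assert_eq!(value_to_string(&None), "");
}
```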
[GitHub] [arrow] kou commented on a change in pull request #8324: ARROW-10156: [Rust] Added github action to label PRs for rust.
kou commented on a change in pull request #8324: URL: https://github.com/apache/arrow/pull/8324#discussion_r499054289

## File path: .github/workflows/dev_labeler.yml

## @@ -0,0 +1,31 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+on:
+- pull_request_target
+
+jobs:
+  assign-rust-labels:
+    if: github.repository == 'apache/arrow'
+    runs-on: ubuntu-latest
+    steps:
+    - name: Assign github labels

Review comment:
```suggestion
    - name: Assign GitHub labels
```

## File path: .github/workflows/dev_labeler.yml

## @@ -0,0 +1,31 @@
+jobs:
+  assign-rust-labels:
+    if: github.repository == 'apache/arrow'

Review comment: We don't need this restriction. We have this restriction in `dev_cron.yml` because cron GitHub Actions logs in forks hide other logs. `pull_request_target` doesn't run this job periodically, so we don't need the restriction here.
[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
github-actions[bot] commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702959895 Revision: 672e9f9ecf17c7e195f60c0a95e0434e5fe1ed53 Submitted crossbow builds: [ursa-labs/crossbow @ actions-585](https://github.com/ursa-labs/crossbow/branches/all?query=actions-585) Tasks submitted (Azure): test-r-rhub-ubuntu-gcc-release, test-r-rocker-r-base-latest, test-r-rstudio-r-base-3.6-bionic, test-r-rstudio-r-base-3.6-centos6, test-r-rstudio-r-base-3.6-centos8, test-r-rstudio-r-base-3.6-opensuse15, test-r-rstudio-r-base-3.6-opensuse42
[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
nealrichardson commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702959015 @github-actions crossbow submit test-r-r*
[GitHub] [arrow] lidavidm commented on pull request #8325: ARROW-10105: [C++][Python][Java][FlightRPC] Allow disabling server validation
lidavidm commented on pull request #8325: URL: https://github.com/apache/arrow/pull/8325#issuecomment-702951063 @kszucs - are we ok to bump the gRPC versions like this?
[GitHub] [arrow] jduo commented on pull request #8325: ARROW-10105: [C++][Python][Java][FlightRPC] Allow disabling server validation
jduo commented on pull request #8325: URL: https://github.com/apache/arrow/pull/8325#issuecomment-702948147 > Hey James, I'll review this later today hopefully, but just to answer a few questions > > * Style in C++ can be done automatically with clang-format, see the [developer guide](https://arrow.apache.org/docs/developers/cpp/development.html). > * For Python similarly you can use the [flake8 config](https://arrow.apache.org/docs/developers/python.html#coding-style) > * The versions are scattered but as a start see thirdparty.txt: https://github.com/apache/arrow/blob/master/cpp/thirdparty/versions.txt though note you may also have to update CI scripts, Homebrew/Conda packages, etc. > * I'd rather avoid very recent experimental features as that will create a lot of churn for us as gRPC upgrades frequently. It may be worth considering in a few versions if it looks like the API settles down, since I agree it would be nice to consolidate the options + it does bring the ability to do certificate rotation. Hi @lidavidm. - I have made the Python edits now and am running them through CI while I get my environment working. - Unit tests have been added for all three languages. - I updated every file I could find that referenced a C++ gRPC version.
[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
github-actions[bot] commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702946973 Revision: 66fa1607e4922cb98bed417fa947d5a0b0c7e949 Submitted crossbow builds: [ursa-labs/crossbow @ actions-584](https://github.com/ursa-labs/crossbow/branches/all?query=actions-584) Tasks submitted (Azure): test-r-rstudio-r-base-3.6-opensuse15
[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
nealrichardson commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702946137 @github-actions crossbow submit test-r-rstudio-r-base-3.6-opensuse15
[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
github-actions[bot] commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702943521 Revision: 66fa1607e4922cb98bed417fa947d5a0b0c7e949 Submitted crossbow builds: [ursa-labs/crossbow @ actions-583](https://github.com/ursa-labs/crossbow/branches/all?query=actions-583) Tasks submitted (Azure): test-r-rstudio-r-base-3.6-bionic, test-r-rstudio-r-base-3.6-centos8
[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
nealrichardson commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702942796 @github-actions crossbow submit test-r-rstudio-r-base-3.6-bionic test-r-rstudio-r-base-3.6-centos8
[GitHub] [arrow] github-actions[bot] commented on pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray
github-actions[bot] commented on pull request #8331: URL: https://github.com/apache/arrow/pull/8331#issuecomment-702941984 https://issues.apache.org/jira/browse/ARROW-10162
[GitHub] [arrow] alamb commented on a change in pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray
alamb commented on a change in pull request #8331: URL: https://github.com/apache/arrow/pull/8331#discussion_r499031446
## File path: rust/arrow/src/util/pretty.rs
@@ -60,7 +64,7 @@ fn create_table(results: &[RecordBatch]) -> Result<Table> {
     let mut cells = Vec::new();
     for col in 0..batch.num_columns() {
         let column = batch.column(col);
-        cells.push(Cell::new(&array_value_to_string(column.clone(), row)?));
Review comment: There is no reason / need to clone the column when printing each value
[GitHub] [arrow] alamb commented on a change in pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray
alamb commented on a change in pull request #8331: URL: https://github.com/apache/arrow/pull/8331#discussion_r499030937
## File path: rust/arrow/src/util/pretty.rs
@@ -80,8 +84,8 @@ macro_rules! make_string { }}; }
-/// Get the value at the given row in an array as a string
-fn array_value_to_string(column: array::ArrayRef, row: usize) -> Result<String> {
+/// Get the value at the given row in an array as a String
+pub fn array_value_to_string(column: array::ArrayRef, row: usize) -> Result<String> {
Review comment: I made this `pub` so I could call it from datafusion/src/test/sql.rs which currently has a copy/pasted version of pretty printing in `array_str`: https://github.com/apache/arrow/blob/master/rust/datafusion/tests/sql.rs#L646
[GitHub] [arrow] alamb commented on a change in pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray
alamb commented on a change in pull request #8331: URL: https://github.com/apache/arrow/pull/8331#discussion_r499028581
## File path: rust/arrow/src/util/pretty.rs
@@ -126,15 +130,56 @@ fn array_value_to_string(column: array::ArrayRef, row: usize) -> Result<String> {
         DataType::Time64(unit) if *unit == TimeUnit::Nanosecond => {
             make_string!(array::Time64NanosecondArray, column, row)
         }
+        DataType::Dictionary(index_type, _value_type) => match **index_type {
+            DataType::Int8 => dict_array_value_to_string::<Int8Type>(column, row),
+            DataType::Int16 => dict_array_value_to_string::<Int16Type>(column, row),
+            DataType::Int32 => dict_array_value_to_string::<Int32Type>(column, row),
+            DataType::Int64 => dict_array_value_to_string::<Int64Type>(column, row),
+            DataType::UInt8 => dict_array_value_to_string::<UInt8Type>(column, row),
+            DataType::UInt16 => dict_array_value_to_string::<UInt16Type>(column, row),
+            DataType::UInt32 => dict_array_value_to_string::<UInt32Type>(column, row),
+            DataType::UInt64 => dict_array_value_to_string::<UInt64Type>(column, row),
+            _ => Err(ArrowError::InvalidArgumentError(format!(
+                "Unsupported index type {:?} type for {:?} in repl.",
Review comment: I don't really grok why the error messages refer to `repl` in this file, but I have carried forward the tradition with this PR
[GitHub] [arrow] alamb opened a new pull request #8331: ARROW-10162: [Rust] Add pretty print support for DictionaryArray
alamb opened a new pull request #8331: URL: https://github.com/apache/arrow/pull/8331 This adds some logic for handling dictionary arrays when pretty printing batches. Part of a larger set of work I am working on for better DictionaryArray handling: https://issues.apache.org/jira/browse/ARROW-10159
[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
github-actions[bot] commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702935169 Revision: 30b888d92c12be28a956ea6042ad892706fdfb69 Submitted crossbow builds: [ursa-labs/crossbow @ actions-582](https://github.com/ursa-labs/crossbow/branches/all?query=actions-582) Tasks submitted (Azure): test-r-rhub-ubuntu-gcc-release, test-r-rocker-r-base-latest, test-r-rstudio-r-base-3.6-bionic, test-r-rstudio-r-base-3.6-centos6, test-r-rstudio-r-base-3.6-centos8, test-r-rstudio-r-base-3.6-opensuse15, test-r-rstudio-r-base-3.6-opensuse42
[GitHub] [arrow] dianaclarke commented on pull request #8312: ARROW-9941: [Python] Better string representation for extension types
dianaclarke commented on pull request #8312: URL: https://github.com/apache/arrow/pull/8312#issuecomment-702934789 I had hoped I could just use something like the following, but it's segfaulting. I think I need to go back to first principles and do a few C++ tutorials ;) `internal::PyObject_StdStringRepr(type_instance_.obj());`
[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
nealrichardson commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702934485 @github-actions crossbow submit test-r-r*
[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
github-actions[bot] commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702928784 Revision: f63aba0e66b74a6777ae8e755c91ba80cd35e1e8 Submitted crossbow builds: [ursa-labs/crossbow @ actions-581](https://github.com/ursa-labs/crossbow/branches/all?query=actions-581) Tasks submitted (Azure): test-r-rhub-ubuntu-gcc-release, test-r-rocker-r-base-latest, test-r-rstudio-r-base-3.6-bionic, test-r-rstudio-r-base-3.6-centos6, test-r-rstudio-r-base-3.6-centos8, test-r-rstudio-r-base-3.6-opensuse15, test-r-rstudio-r-base-3.6-opensuse42
[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
nealrichardson commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702928024 @github-actions crossbow submit test-r-r*
[GitHub] [arrow] carols10cents commented on a change in pull request #8330: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes
carols10cents commented on a change in pull request #8330: URL: https://github.com/apache/arrow/pull/8330#discussion_r499015678
## File path: rust/parquet/src/arrow/arrow_writer.rs
@@ -688,4 +688,413 @@ mod tests {
         writer.write().unwrap();
         writer.close().unwrap();
     }
+
+    const SMALL_SIZE: usize = 100;
+
+    fn roundtrip(filename: &str, expected_batch: RecordBatch) {
+        let file = get_temp_file(filename, &[]);
+
+        let mut writer = ArrowWriter::try_new(
+            file.try_clone().unwrap(),
+            expected_batch.schema(),
+            None,
+        )
+        .unwrap();
+        writer.write(&expected_batch).unwrap();
+        writer.close().unwrap();
+
+        let reader = SerializedFileReader::new(file).unwrap();
+        let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(reader));
+        let mut record_batch_reader = arrow_reader.get_record_reader(1024).unwrap();
+
+        let actual_batch = record_batch_reader.next_batch().unwrap().unwrap();
+
+        assert_eq!(expected_batch.schema(), actual_batch.schema());
+        assert_eq!(expected_batch.num_columns(), actual_batch.num_columns());
+        assert_eq!(expected_batch.num_rows(), actual_batch.num_rows());
+        for i in 0..expected_batch.num_columns() {
+            let expected_data = expected_batch.column(i).data();
+            let actual_data = actual_batch.column(i).data();
+
+            assert_eq!(expected_data.data_type(), actual_data.data_type());
+            assert_eq!(expected_data.len(), actual_data.len());
+            assert_eq!(expected_data.null_count(), actual_data.null_count());
+            assert_eq!(expected_data.offset(), actual_data.offset());
+            assert_eq!(expected_data.buffers(), actual_data.buffers());
+            assert_eq!(expected_data.child_data(), actual_data.child_data());
+            assert_eq!(expected_data.null_bitmap(), actual_data.null_bitmap());
+        }
+    }
+
+    fn one_column_roundtrip(filename: &str, values: ArrayRef, nullable: bool) {
+        let schema = Schema::new(vec![Field::new(
+            "col",
+            values.data_type().clone(),
+            nullable,
+        )]);
+        let expected_batch =
+            RecordBatch::try_new(Arc::new(schema), vec![values]).unwrap();
+
+        roundtrip(filename, expected_batch);
+    }
+
+    fn values_required<A, I>(iter: I, filename: &str)
+    where
+        A: From<Vec<I::Item>> + Array + 'static,
+        I: IntoIterator,
+    {
+        let raw_values: Vec<_> = iter.into_iter().collect();
+        let values = Arc::new(A::from(raw_values));
+        one_column_roundtrip(filename, values, false);
+    }
+
+    fn values_optional<A, I>(iter: I, filename: &str)
+    where
+        A: From<Vec<Option<I::Item>>> + Array + 'static,
+        I: IntoIterator,
+    {
+        let optional_raw_values: Vec<_> = iter
+            .into_iter()
+            .enumerate()
+            .map(|(i, v)| if i % 2 == 0 { None } else { Some(v) })
+            .collect();
+        let optional_values = Arc::new(A::from(optional_raw_values));
+        one_column_roundtrip(filename, optional_values, true);
+    }
+
+    fn required_and_optional<A, I>(iter: I, filename: &str)
+    where
+        A: From<Vec<I::Item>> + From<Vec<Option<I::Item>>> + Array + 'static,
+        I: IntoIterator + Clone,
+    {
+        values_required::<A, I>(iter.clone(), filename);
+        values_optional::<A, I>(iter, filename);
+    }
+
+    #[test]
+    #[should_panic(expected = "Null arrays not supported")]
+    fn null_single_column() {
+        let values = Arc::new(NullArray::new(SMALL_SIZE));
+        one_column_roundtrip("null_single_column", values.clone(), true);
+        one_column_roundtrip("null_single_column", values, false);
+    }
+
+    #[test]
+    #[should_panic(
+        expected = "Attempting to write an Arrow type that is not yet implemented"
+    )]
+    fn bool_single_column() {
+        required_and_optional::<BooleanArray, _>(
+            [true, false].iter().cycle().copied().take(SMALL_SIZE),
+            "bool_single_column",
+        );
+    }
+
+    #[test]
+    fn i8_single_column() {
+        required_and_optional::<Int8Array, _>(0..SMALL_SIZE as i8, "i8_single_column");
+    }
+
+    #[test]
+    fn i16_single_column() {
+        required_and_optional::<Int16Array, _>(0..SMALL_SIZE as i16, "i16_single_column");
+    }
+
+    #[test]
+    fn i32_single_column() {
+        required_and_optional::<Int32Array, _>(0..SMALL_SIZE as i32, "i32_single_column");
+    }
+
+    #[test]
+    fn i64_single_column() {
+        required_and_optional::<Int64Array, _>(0..SMALL_SIZE as i64, "i64_single_column");
+    }
+
+    #[test]
+    fn u8_single_column() {
+        required_and_optional::<UInt8Array, _>(0..SMALL_SIZE as u8, "u8_single_column");
+    }
+
+    #[test]
+    fn u16_single_column() {
+        required_and_optional::<UInt16Array, _>(0..SMALL_SIZE as u16, "u16_single_column");
+    }
+
+    #[test]
+    fn u32_single_column() {
+        required_and_optional::<UInt32Array, _>(0..SMALL_SIZE as u32, "u32_single_column");
+    }
[GitHub] [arrow] bkietz commented on a change in pull request #8311: ARROW-10008: [C++][Dataset] Fix filtering/row group statistics of dict columns
bkietz commented on a change in pull request #8311: URL: https://github.com/apache/arrow/pull/8311#discussion_r498985112 ## File path: cpp/src/arrow/dataset/file_parquet.cc ## @@ -171,6 +170,15 @@ static std::shared_ptr ColumnChunkStatisticsAsStructScalar( return nullptr; } + auto maybe_min = min->CastTo(field->type()); + auto maybe_max = max->CastTo(field->type()); Review comment: (i.e., it only changes behavior in cases where the physical type wasn't appropriate)
[GitHub] [arrow] dianaclarke commented on a change in pull request #8312: ARROW-9941: [Python] Better string representation for extension types
dianaclarke commented on a change in pull request #8312: URL: https://github.com/apache/arrow/pull/8312#discussion_r498972124
## File path: python/pyarrow/tests/test_extension_type.py
@@ -95,6 +95,18 @@ def test_ext_type_basics(): assert ty.extension_name == "arrow.py_extension_type"
+def test_ext_type_str():
+    ty = IntegerType()
+    assert str(ty) == "DataType(int64)"
+    cpp_expected = "extension<arrow.py_extension_type>"
Review comment: TODO: this should be: `extension<arrow.py_extension_type<IntegerType>>`
[GitHub] [arrow] paddyhoran closed pull request #8327: ARROW-10160: [Rust] Improve DictionaryType documentation (clarify which type is which)
paddyhoran closed pull request #8327: URL: https://github.com/apache/arrow/pull/8327
[GitHub] [arrow] nealrichardson commented on pull request #8312: ARROW-9941: [Python] Better string representation for extension types
nealrichardson commented on pull request #8312: URL: https://github.com/apache/arrow/pull/8312#issuecomment-702875396 To be clear, the 2 Windows CI failures are unrelated.
[GitHub] [arrow] paddyhoran closed pull request #8326: ARROW-10157: [Rust] Add an example to the take kernel
paddyhoran closed pull request #8326: URL: https://github.com/apache/arrow/pull/8326
[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
github-actions[bot] commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702866357 Revision: 96283bc0c9d134680783e3df00926d0b972946ba Submitted crossbow builds: [ursa-labs/crossbow @ actions-580](https://github.com/ursa-labs/crossbow/branches/all?query=actions-580) Tasks submitted (Azure): test-r-rhub-ubuntu-gcc-release, test-r-rocker-r-base-latest, test-r-rstudio-r-base-3.6-bionic, test-r-rstudio-r-base-3.6-centos6, test-r-rstudio-r-base-3.6-centos8, test-r-rstudio-r-base-3.6-opensuse15, test-r-rstudio-r-base-3.6-opensuse42
[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
nealrichardson commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702865456 @github-actions crossbow submit test-r-r*
[GitHub] [arrow] bkietz commented on pull request #8317: ARROW-10134: [Python][Dataset] Add ParquetFileFragment.num_row_groups
bkietz commented on pull request #8317: URL: https://github.com/apache/arrow/pull/8317#issuecomment-702862461 @jorisvandenbossche just confirming: you want `f.num_row_groups` to potentially perform IO?
[GitHub] [arrow] bkietz commented on a change in pull request #8311: ARROW-10008: [C++][Dataset] Fix filtering/row group statistics of dict columns
bkietz commented on a change in pull request #8311: URL: https://github.com/apache/arrow/pull/8311#discussion_r498953710

## File path: cpp/src/arrow/dataset/file_parquet.cc ##

@@ -171,6 +170,15 @@ static std::shared_ptr<StructScalar> ColumnChunkStatisticsAsStructScalar(
 return nullptr;
 }
+ auto maybe_min = min->CastTo(field->type());
+ auto maybe_max = max->CastTo(field->type());

Review comment: `StatisticsAsScalars` returns scalars whose types are the correct *physical type*, so even if the column was `dictionary(string)`, min and max would be just `string` before this cast
[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
github-actions[bot] commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702857488 Revision: 96283bc0c9d134680783e3df00926d0b972946ba Submitted crossbow builds: [ursa-labs/crossbow @ actions-579](https://github.com/ursa-labs/crossbow/branches/all?query=actions-579)

|Task|Status|
|----|------|
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-579-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-579-azure-test-r-rstudio-r-base-3.6-bionic)|
[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
nealrichardson commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702855212 @github-actions crossbow submit test-r-rstudio-r-base-3.6-bionic
[GitHub] [arrow] wesm commented on issue #8329: pyarrow=1.0.1 is not even usable
wesm commented on issue #8329: URL: https://github.com/apache/arrow/issues/8329#issuecomment-702850502 There is no deception. Some individuals work on Apache Arrow as part of their job responsibilities, but the work that is contributed to the project is _volunteered_ by these organizations. They have their own priorities; they do not work for you.
[GitHub] [arrow] alamb commented on pull request #8330: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes
alamb commented on pull request #8330: URL: https://github.com/apache/arrow/pull/8330#issuecomment-702838970

> I am comparing the RecordBatch's column's data before and after the round trip directly; I'm not sure that this is appropriate or not because for some data types, the null_bitmap isn't matching and I'm not sure if it's supposed to or not.

I think the null bitmap should definitely match after the round trip (as it signals which rows are null, which should be preserved). I am not sure the actual data arrays have to be byte-identical -- what needs to be the same is the data in the rows whose validity bit is set (i.e., we should probably be comparing only the non-null rows).
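The comparison alamb suggests can be sketched concisely. The helper below is hypothetical (not part of this PR) and uses `Vec<Option<i32>>` as a stand-in for an Arrow array plus its null bitmap: null positions must match exactly, while values only need to agree where both sides are non-null.

```rust
// Hypothetical helper illustrating the roundtrip check: null-ness must be
// preserved exactly; value slots only matter at non-null positions.
fn roundtrip_matches(before: &[Option<i32>], after: &[Option<i32>]) -> bool {
    if before.len() != after.len() {
        return false;
    }
    before.iter().zip(after.iter()).all(|(b, a)| match (b, a) {
        // Both null: fine, regardless of what the underlying value slot held.
        (None, None) => true,
        // Null-ness differs: the null bitmap was not preserved.
        (None, Some(_)) | (Some(_), None) => false,
        // Both valid: the values must agree.
        (Some(b), Some(a)) => b == a,
    })
}

fn main() {
    let before = vec![Some(1), None, Some(3)];
    assert!(roundtrip_matches(&before, &[Some(1), None, Some(3)]));
    // A row whose null-ness changed fails, even if every value slot matches.
    assert!(!roundtrip_matches(&before, &[Some(1), Some(2), Some(3)]));
    println!("ok");
}
```

This keeps the test strict about the bitmap while tolerating garbage in the value slots of null rows, which is what the comment argues for.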
[GitHub] [arrow] impredicative edited a comment on issue #8329: pyarrow=1.0.1 is not even usable
impredicative edited a comment on issue #8329: URL: https://github.com/apache/arrow/issues/8329#issuecomment-702823291 > Apache Arrow is supported by volunteers This information is intended to deceive. The notable fact is that [Ursa Lab has sponsors](https://ursalabs.org/blog/2020-new-sponsors/) and is [industry funded](https://ursalabs.org/about/). Perhaps the sponsors would like to learn that pyarrow is not even importable.
[GitHub] [arrow] impredicative edited a comment on issue #8329: pyarrow=1.0.1 is not even usable
impredicative edited a comment on issue #8329: URL: https://github.com/apache/arrow/issues/8329#issuecomment-702823291 > Apache Arrow is supported by volunteers This information is intended to deceive. The notable fact is that [Ursa Lab has sponsors](https://ursalabs.org/blog/2020-new-sponsors/) and is [industry funded](https://ursalabs.org/about/).
[GitHub] [arrow] impredicative commented on issue #8329: pyarrow=1.0.1 is not even usable
impredicative commented on issue #8329: URL: https://github.com/apache/arrow/issues/8329#issuecomment-702823291 > Apache Arrow is supported by volunteers This information is intended to deceive. The fact is that [Ursa Lab has sponsors](https://ursalabs.org/blog/2020-new-sponsors/) and is [industry funded](https://ursalabs.org/about/).
[GitHub] [arrow] jorisvandenbossche commented on pull request #8317: ARROW-10134: [Python][Dataset] Add ParquetFileFragment.num_row_groups
jorisvandenbossche commented on pull request #8317: URL: https://github.com/apache/arrow/pull/8317#issuecomment-702823129 I think the main reason such a property would be interesting for dask's use case is to get the number of row groups in cases where statistics are not available / not yet parsed. So the way this PR returns `None` in that case is not super useful, I think. Ideally, if the number of row groups is not known (the row_groups are not set), it would retrieve this information from the FileMetaData.
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8311: ARROW-10008: [C++][Dataset] Fix filtering/row group statistics of dict columns
jorisvandenbossche commented on a change in pull request #8311: URL: https://github.com/apache/arrow/pull/8311#discussion_r498913314

## File path: cpp/src/arrow/dataset/file_parquet.cc ##

@@ -171,6 +170,15 @@ static std::shared_ptr<StructScalar> ColumnChunkStatisticsAsStructScalar(
 return nullptr;
 }
+ auto maybe_min = min->CastTo(field->type());
+ auto maybe_max = max->CastTo(field->type());

Review comment: Does this change behaviour? For a dictionary with string values, is `field->type()` string or dictionary?
[GitHub] [arrow] carols10cents opened a new pull request #8330: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes
carols10cents opened a new pull request #8330: URL: https://github.com/apache/arrow/pull/8330 Note that this PR goes to the rust-parquet-arrow-writer branch, not master. Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc. These perform a round trip Arrow -> Parquet -> Arrow of a single RecordBatch with a single column of values of each of the supported data types, and some of the unsupported ones. Tests that currently fail are either marked with `#[should_panic]` (if they fail because of a panic) or `#[ignore]` (if they fail because the values don't match). I am comparing the RecordBatch's column's data before and after the round trip directly; I'm not sure whether this is appropriate, because for some data types the `null_bitmap` isn't matching and I'm not sure if it's supposed to or not. So I would love advice on that front, and I would love to know if these tests are useful or not!
[GitHub] [arrow] carols10cents commented on pull request #8330: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes
carols10cents commented on pull request #8330: URL: https://github.com/apache/arrow/pull/8330#issuecomment-702815429 cc @nevi-me
[GitHub] [arrow] github-actions[bot] commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
github-actions[bot] commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702810329 Revision: 841e2f0e3e04c903875bb4196d11a01ac3aacaac Submitted crossbow builds: [ursa-labs/crossbow @ actions-578](https://github.com/ursa-labs/crossbow/branches/all?query=actions-578)

|Task|Status|
|----|------|
|test-r-rstudio-r-base-3.6-bionic|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-578-azure-test-r-rstudio-r-base-3.6-bionic)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-578-azure-test-r-rstudio-r-base-3.6-bionic)|
[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8324: ARROW-10156: [Rust] Added github action to label PRs for rust.
jorgecarleitao edited a comment on pull request #8324: URL: https://github.com/apache/arrow/pull/8324#issuecomment-702805572 @kou, Thanks for the suggestion. I have moved the content to another place and added the `on` accordingly.
[GitHub] [arrow] jorgecarleitao commented on pull request #8313: ARROW-4927: [Rust] Update top level README to describe current functionality
jorgecarleitao commented on pull request #8313: URL: https://github.com/apache/arrow/pull/8313#issuecomment-702809460 @andygrove , sure! I am stuck in the last step of https://gitbox.apache.org/setup/ > User not a member of the ASF GitHub organisation. Please make sure you are a part of the ASF Organisation on GitHub and have 2FA enabled. Visit id.apache.org and set your GitHub ID to be invited to the org.
[GitHub] [arrow] nealrichardson closed issue #8329: pyarrow=1.0.1 is not even usable
nealrichardson closed issue #8329: URL: https://github.com/apache/arrow/issues/8329
[GitHub] [arrow] nealrichardson commented on issue #8329: pyarrow=1.0.1 is not even usable
nealrichardson commented on issue #8329: URL: https://github.com/apache/arrow/issues/8329#issuecomment-702808106 Thanks for your report. Apache Arrow is supported by volunteers, so while you wait patiently for an answer to your question, please review our [code of conduct](https://www.apache.org/foundation/policies/conduct.html) for participating in the community.
[GitHub] [arrow] jorgecarleitao commented on pull request #8324: ARROW-10156: [Rust] Added github action to label PRs for rust.
jorgecarleitao commented on pull request #8324: URL: https://github.com/apache/arrow/pull/8324#issuecomment-702805572 @kou, Thanks for the suggestion. I have moved the content to another place and added the `on` accordingly.
[GitHub] [arrow] nealrichardson commented on pull request #8304: ARROW-10068: [C++] Add bundled external project for aws-sdk-cpp
nealrichardson commented on pull request #8304: URL: https://github.com/apache/arrow/pull/8304#issuecomment-702805412 @github-actions crossbow submit test-r-rstudio-r-base-3.6-bionic
[GitHub] [arrow] andygrove commented on pull request #8313: ARROW-4927: [Rust] Update top level README to describe current functionality
andygrove commented on pull request #8313: URL: https://github.com/apache/arrow/pull/8313#issuecomment-702805331 @jorgecarleitao Now that you are a committer do you want to merge this one to get to know the process? I usually make sure I have the latest from the master branch and then run `./dev/merge_pr.py`.
[GitHub] [arrow] impredicative opened a new issue #8329: pyarrow=1.0.1 is not even usable
impredicative opened a new issue #8329: URL: https://github.com/apache/arrow/issues/8329 Is someone going to comment on https://issues.apache.org/jira/browse/ARROW-10152 ? As it stands, the software is not even usable, period.
[GitHub] [arrow] github-actions[bot] commented on pull request #8328: ARROW-10161: [Rust] [DataFusion] DRYed code in tests
github-actions[bot] commented on pull request #8328: URL: https://github.com/apache/arrow/pull/8328#issuecomment-702800123 https://issues.apache.org/jira/browse/ARROW-10161
[GitHub] [arrow] pitrou commented on a change in pull request #8316: ARROW-10149: [Rust] Improved support for externally owned memory regions
pitrou commented on a change in pull request #8316: URL: https://github.com/apache/arrow/pull/8316#discussion_r498885291

## File path: rust/arrow/src/bytes.rs ##

@@ -0,0 +1,185 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains an implementation of a contiguous immutable memory region that knows
+//! how to de-allocate itself, [`Bytes`].
+//! Note that this is a low-level functionality of this crate, and is only required to be used
+//! when implementing FFI.
+
+use core::slice;
+use std::{fmt::Debug, fmt::Formatter, sync::Arc};
+
+use crate::memory;
+
+/// function responsible for de-allocating `Bytes`.
+pub type DropFn = Arc<dyn Fn(&mut Bytes)>;
+
+/// Mode of deallocating memory regions
+pub enum Deallocation {
+    /// Native deallocation, using Rust deallocator with Arrow-specific memory alignment
+    Native(usize),
+    /// Foreign deallocation, using some other form of memory deallocation
+    Foreign(DropFn),
+}
+
+impl Debug for Deallocation {
+    fn fmt(&self, f: &mut Formatter) -> std::fmt::Result {
+        match self {
+            Deallocation::Native(capacity) => {
+                write!(f, "Deallocation::Native {{ capacity: {} }}", capacity)
+            }
+            Deallocation::Foreign(_) => {
+                write!(f, "Deallocation::Foreign {{ capacity: unknown }}")
+            }
+        }
+    }
+}
+
+/// A contiguous, fixed-size, immutable memory region that knows how to de-allocate itself.
+/// This struct's API is inspired by `bytes::Bytes`, but it is not limited to using Rust's
+/// global allocator nor u8 alignment.
+///
+/// In the most common case, this buffer is allocated using [`allocate_aligned`](memory::allocate_aligned)
+/// and deallocated accordingly with [`free_aligned`](memory::free_aligned).
+/// When the region is allocated by a foreign allocator, [Deallocation::Foreign], this calls the
+/// foreign deallocator to deallocate the region when it is no longer needed.
+pub struct Bytes {
+    /// The raw pointer to the beginning of the region
+    ptr: *const u8,
+
+    /// The number of bytes visible to this region. This is always smaller than its capacity (when available).
+    len: usize,
+
+    /// how to deallocate this region
+    deallocation: Deallocation,
+}
+
+impl Bytes {
+    /// Takes ownership of an allocated memory region,
+    ///
+    /// # Arguments
+    ///
+    /// * `ptr` - Pointer to raw parts
+    /// * `len` - Length of raw parts in **bytes**
+    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
+    ///
+    /// # Safety
+    ///
+    /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
+    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
+    pub unsafe fn new(ptr: *const u8, len: usize, deallocation: Deallocation) -> Bytes {
+        Bytes {
+            ptr,
+            len,
+            deallocation,
+        }
+    }
+
+    #[inline]
+    pub fn as_slice(&self) -> &[u8] {
+        unsafe { slice::from_raw_parts(self.ptr, self.len) }
+    }
+
+    #[inline]
+    pub fn len(&self) -> usize {
+        self.len
+    }
+
+    #[inline]
+    pub fn is_empty(&self) -> bool {
+        self.len == 0
+    }
+
+    #[inline]
+    pub fn raw_data(&self) -> *const u8 {
+        self.ptr
+    }
+
+    #[inline]
+    pub fn raw_data_mut(&mut self) -> *mut u8 {
+        self.ptr as *mut u8
+    }
+
+    pub fn capacity(&self) -> usize {
+        match self.deallocation {
+            Deallocation::Native(capacity) => capacity,
+            // we cannot determine this in general,
+            // and thus we state that this is externally-owned memory
+            Deallocation::Foreign(_) => 0,
+        }
+    }
+}
+
+impl Drop for Bytes {
+    #[inline]
+    fn drop(&mut self) {
+        match &self.deallocation {
+            Deallocation::Native(capacity) => {
+                if !self.ptr.is_null() {
+                    unsafe { memory::free_aligned(self.ptr as *mut u8, *capacity) };
+                }
+            }
+            Deallocation::Foreign(drop) => {
+                (drop.clone())(self);
+            }
+        }
+    }
+}
+
+impl PartialEq for Bytes {
+    fn eq(&self, other: &Bytes) -> bool {
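The `Deallocation` pattern in the diff above can be illustrated with a self-contained sketch. The `Region` type below is hypothetical and simplified (it stores a `Vec` instead of a raw pointer, and the foreign deallocator takes no arguments), but it shows the core idea: the cleanup strategy travels with the memory region and is invoked on drop.

```rust
use std::sync::Arc;

// Simplified stand-in for the Deallocation enum from the diff: a region
// either owns Rust-allocated storage (freed automatically) or delegates
// cleanup to a caller-supplied closure.
enum Deallocation {
    Native(Vec<u8>),        // Rust-owned storage, freed by Vec's own Drop
    Foreign(Arc<dyn Fn()>), // externally owned, freed via a callback
}

struct Region {
    len: usize,
    deallocation: Deallocation,
}

impl Drop for Region {
    fn drop(&mut self) {
        // Native storage is dropped automatically; Foreign regions invoke
        // their deallocator so the external owner can release the memory.
        if let Deallocation::Foreign(free) = &self.deallocation {
            free();
        }
    }
}

fn main() {
    let native = Region { len: 4, deallocation: Deallocation::Native(vec![0; 4]) };
    assert_eq!(native.len, 4);

    use std::sync::atomic::{AtomicBool, Ordering};
    let freed = Arc::new(AtomicBool::new(false));
    let flag = freed.clone();
    {
        let _foreign = Region {
            len: 0,
            deallocation: Deallocation::Foreign(Arc::new(move || {
                flag.store(true, Ordering::SeqCst)
            })),
        };
    } // _foreign dropped here, invoking the callback
    assert!(freed.load(Ordering::SeqCst));
    println!("ok");
}
```

In the real `Bytes` type the foreign closure receives the region itself (so it can reach the raw pointer), but the ownership-of-cleanup idea is the same.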
[GitHub] [arrow] jorgecarleitao opened a new pull request #8328: ARROW-10161: [Rust] [DataFusion] DRYed code in tests
jorgecarleitao opened a new pull request #8328: URL: https://github.com/apache/arrow/pull/8328 Just a small cleanup by moving common code to a macro and using it in the tests.
[GitHub] [arrow] pitrou commented on pull request #8312: ARROW-9941: [Python] Better string representation for extension types
pitrou commented on pull request #8312: URL: https://github.com/apache/arrow/pull/8312#issuecomment-702788372 > `assert pa.DataType.__str__(ty) == "arrow.py_extension_type"` The problem is that this doesn't distinguish between different extension types with the same storage type. For example, two extension types representing UUIDs and IPv6 addresses, respectively, would get the same string representation (because both would be backed by a 16-byte fixed-size binary). It would be more useful if the string representation looked like `extension<IntegerType>`, where `IntegerType` is the Python class name. Optionally this could be made configurable as well.
[GitHub] [arrow] andygrove commented on pull request #8324: ARROW-10156: [Rust] Added github action to label PRs for rust.
andygrove commented on pull request #8324: URL: https://github.com/apache/arrow/pull/8324#issuecomment-702773859 @jorgecarleitao I will be very happy to relinquish my role as github label admin (as I'm sure the other Rust committers will too). Thanks for working on this.
[GitHub] [arrow] pitrou commented on a change in pull request #8305: ARROW-9782: [C++][Dataset] More configurable Dataset writing
pitrou commented on a change in pull request #8305: URL: https://github.com/apache/arrow/pull/8305#discussion_r498818275

## File path: cpp/src/arrow/dataset/file_base.cc ##

@@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
 return MakeVectorIterator(std::move(fragments));
 }
-struct WriteTask {
- Status Execute();
+Status FileWriter::Write(RecordBatchReader* batches) {
+ for (std::shared_ptr<RecordBatch> batch;;) {

Review comment: I have a preference for `while (true)`, which is much more explicit. Just my 2 cents :-)
[GitHub] [arrow] bkietz commented on a change in pull request #8305: ARROW-9782: [C++][Dataset] More configurable Dataset writing
bkietz commented on a change in pull request #8305: URL: https://github.com/apache/arrow/pull/8305#discussion_r498813432

## File path: cpp/src/arrow/dataset/file_base.cc ##

@@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
 return MakeVectorIterator(std::move(fragments));
 }
-struct WriteTask {
- Status Execute();
+Status FileWriter::Write(RecordBatchReader* batches) {
+ for (std::shared_ptr<RecordBatch> batch;;) {

Review comment: I personally prefer `for (;;)` to `while (true)`, but if it'll be clearer then I'll rewrite it

## File path: cpp/src/arrow/dataset/file_base.cc ##

@@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
 return MakeVectorIterator(std::move(fragments));
 }
-struct WriteTask {
- Status Execute();
+Status FileWriter::Write(RecordBatchReader* batches) {
+ for (std::shared_ptr<RecordBatch> batch;;) {
+ RETURN_NOT_OK(batches->ReadNext(&batch));
+ if (batch == nullptr) break;
+ RETURN_NOT_OK(Write(batch));
+ }
+ return Status::OK();
+}
+
+struct NextBasenameGenerator {
+ static Result<NextBasenameGenerator> Make(const std::string& basename_template) {
+ if (basename_template.find(fs::internal::kSep) != std::string::npos) {
+ return Status::Invalid("basename_template contained '/'");
+ }
+ size_t token_start = basename_template.find(token());
+ if (token_start == std::string::npos) {
+ return Status::Invalid("basename_template did not contain '{i}'");
+ }
+ return NextBasenameGenerator{basename_template, 0, token_start,
+ token_start + token().size()};
+ }
- /// The basename of files written by this WriteTask. Extensions
- /// are derived from format
- std::string basename;
+ static const std::string& token() {
+ static const std::string token = "{i}";
+ return token;
+ }
- /// The partitioning with which paths will be generated
- std::shared_ptr<Partitioning> partitioning;
+ const std::string& template_;
+ size_t i_, token_start_, token_end_;
- /// The format in which fragments will be written
- std::shared_ptr<FileFormat> format;
+ std::string operator()() {
+ return template_.substr(0, token_start_) + std::to_string(i_++) +
+ template_.substr(token_end_);
+ }
+};
- /// The FileSystem and base directory into which fragments will be written
- std::shared_ptr<fs::FileSystem> filesystem;
- std::string base_dir;
+using MutexedWriter = util::Mutexed<std::shared_ptr<FileWriter>>;
- /// Batches to be written
- std::shared_ptr<RecordBatchReader> batches;
+struct WriterSet {
+ WriterSet(NextBasenameGenerator next_basename,
+ const FileSystemDatasetWriteOptions& write_options)
+ : next_basename_(std::move(next_basename)),
+ base_dir_(fs::internal::EnsureTrailingSlash(write_options.base_dir)),
+ write_options_(write_options) {}
- /// An Expression already satisfied by every batch to be written
- std::shared_ptr<Expression> partition_expression;
-};
+ Result<std::shared_ptr<MutexedWriter>> Get(const Expression& partition_expression,
+ const std::shared_ptr<Schema>& schema) {
+ ARROW_ASSIGN_OR_RAISE(auto part_segments,
+ write_options_.partitioning->Format(partition_expression));
+ std::string dir = base_dir_ + part_segments;
-Status WriteTask::Execute() {
- std::unordered_map<std::string, RecordBatchVector> path_to_batches;
-
- // TODO(bkietz) these calls to Partition() should be scattered across a TaskGroup
- for (auto maybe_batch : IteratorFromReader(batches)) {
- ARROW_ASSIGN_OR_RAISE(auto batch, maybe_batch);
- ARROW_ASSIGN_OR_RAISE(auto partitioned_batches, partitioning->Partition(batch));
- for (auto&& partitioned_batch : partitioned_batches) {
- AndExpression expr(std::move(partitioned_batch.partition_expression),
- partition_expression);
- ARROW_ASSIGN_OR_RAISE(std::string path, partitioning->Format(expr));
- path = fs::internal::EnsureLeadingSlash(path);
- path_to_batches[path].push_back(std::move(partitioned_batch.batch));
- }
- }
+ util::Mutex::Guard writer_lock;
+
+ auto set_lock = mutex_.Lock();
- for (auto&& path_batches : path_to_batches) {
- auto dir = base_dir + path_batches.first;
- RETURN_NOT_OK(filesystem->CreateDir(dir, /*recursive=*/true));
+ auto writer =

Review comment: I'll add comments inline
[GitHub] [arrow] bkietz commented on a change in pull request #8305: ARROW-9782: [C++][Dataset] More configurable Dataset writing
bkietz commented on a change in pull request #8305: URL: https://github.com/apache/arrow/pull/8305#discussion_r498812732
## File path: cpp/src/arrow/dataset/file_base.cc
## @@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
[Quoted diff, garbled and truncated in the archive. It removes the old WriteTask struct and adds: FileWriter::Write, which drains a RecordBatchReader batch by batch; NextBasenameGenerator, whose Make validates that basename_template contains the "{i}" token and no '/', and whose call operator substitutes an incrementing counter for the token; and WriterSet, which lazily creates one mutex-guarded FileWriter per partition directory under write_options_.base_dir.]
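The NextBasenameGenerator quoted above validates the basename template and then splices an incrementing counter in place of the "{i}" token. A rough Python sketch of that logic (the error messages mirror the C++ as quoted; the name `make_basename_generator` is mine):

```python
def make_basename_generator(basename_template):
    """Validate the template, then yield successive basenames.

    Mirrors NextBasenameGenerator::Make as quoted: reject templates
    containing a path separator or lacking the '{i}' token, then
    substitute an incrementing counter for the token.
    """
    if "/" in basename_template:
        raise ValueError("basename_template contained '/'")
    token_start = basename_template.find("{i}")
    if token_start == -1:
        raise ValueError("basename_template did not contain '{i}'")
    prefix = basename_template[:token_start]
    suffix = basename_template[token_start + len("{i}"):]

    def generate():
        i = 0
        while True:
            yield prefix + str(i) + suffix
            i += 1

    return generate()
```

For example, a template of "part-{i}.parquet" yields part-0.parquet, then part-1.parquet, and so on.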
[GitHub] [arrow] lidavidm commented on pull request #8325: ARROW-10105: [C++][Python][Java][FlightRPC] Allow disabling server validation
lidavidm commented on pull request #8325: URL: https://github.com/apache/arrow/pull/8325#issuecomment-702709693 Actually, it seems the relevant changes have been there since early this year. It may be OK to switch fully. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #8325: ARROW-10105: [C++][Python][Java][FlightRPC] Allow disabling server validation
lidavidm commented on pull request #8325: URL: https://github.com/apache/arrow/pull/8325#issuecomment-702702219 Hey James, I'll review this later today hopefully, but just to answer a few questions:
- Style in C++ can be done automatically with clang-format; see the [developer guide](https://arrow.apache.org/docs/developers/cpp/development.html).
- For Python, similarly, you can use the [flake8 config](https://arrow.apache.org/docs/developers/python.html#coding-style).
- The versions are scattered, but as a start see thirdparty's versions.txt: https://github.com/apache/arrow/blob/master/cpp/thirdparty/versions.txt though note you may also have to update CI scripts, Homebrew/Conda packages, etc.
- I'd rather avoid very recent experimental features, as that will create a lot of churn for us since gRPC upgrades frequently. It may be worth considering in a few versions if it looks like the API settles down, since I agree it would be nice to consolidate the options, and it does bring the ability to do certificate rotation.
[GitHub] [arrow] github-actions[bot] commented on pull request #8327: ARROW-10160: [Rust] Improve DictionaryType documentation (clarify which type is which)
github-actions[bot] commented on pull request #8327: URL: https://github.com/apache/arrow/pull/8327#issuecomment-702699403 https://issues.apache.org/jira/browse/ARROW-10160
[GitHub] [arrow] fsaintjacques commented on a change in pull request #8305: ARROW-9782: [C++][Dataset] More configurable Dataset writing
fsaintjacques commented on a change in pull request #8305: URL: https://github.com/apache/arrow/pull/8305#discussion_r498758123
## File path: cpp/src/arrow/dataset/file_base.cc
## @@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
[Quoted diff, garbled in the archive: WriterSet::Get takes a set-level lock to find or insert the writer for a partition directory, and a per-writer lock for newly inserted writers before releasing the set lock.]
Review comment: I don't understand the logic here about dual locking (set_lock, writer_lock).
## File path: cpp/src/arrow/dataset/file_base.cc
## @@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
[Quoted diff: FileWriter::Write reads batches in a `for (std::shared_ptr batch;;)` loop until ReadNext yields null.]
Review comment: while-loop?
## File path: cpp/src/arrow/dataset/file_base.cc
## @@ -143,97 +145,205 @@ FragmentIterator FileSystemDataset::GetFragmentsImpl(
[Quoted diff, truncated in the archive: NextBasenameGenerator::Make validates that basename_template contains "{i}" and no '/'.]
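On the dual-locking question above, my reading of the quoted diff is: a set-level lock guards the directory-to-writer map, while a per-writer lock, acquired before the set lock is released, makes later requesters for the same directory wait until the creating thread has opened the output file. A hedged Python sketch of that scheme (structure and names are illustrative, not Arrow's):

```python
import threading

class WriterSet:
    """Lazily create one writer per partition directory, under two locks."""

    def __init__(self, open_writer):
        self._open_writer = open_writer    # opens the output file; called once per dir
        self._set_lock = threading.Lock()  # guards the dir -> entry map
        self._writers = {}                 # dir -> (writer_lock, [writer])

    def get(self, directory):
        created = False
        with self._set_lock:
            if directory not in self._writers:
                # Take the per-writer lock while still holding the set lock so
                # that concurrent get() calls for this directory block on it.
                writer_lock = threading.Lock()
                writer_lock.acquire()
                self._writers[directory] = (writer_lock, [None])
                created = True
            writer_lock, slot = self._writers[directory]
        if created:
            # Open the file outside the set lock, so writes to other
            # directories are not blocked by slow filesystem calls.
            slot[0] = self._open_writer(directory)
            writer_lock.release()
        else:
            with writer_lock:  # wait for the creator to finish, if needed
                pass
        return slot[0]
```

The point of the second lock is that the expensive step (creating the directory and opening the stream) happens outside the map lock, while requests for the same directory still serialize behind the creator.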
[GitHub] [arrow] alamb opened a new pull request #8327: ARROW-10160: [Rust] Improve DictionaryType documentation (clarify which type is which)
alamb opened a new pull request #8327: URL: https://github.com/apache/arrow/pull/8327 I kept having to look this up, so I figured I would update the docs about it.
[GitHub] [arrow] malthe commented on pull request #6807: PARQUET-1404: [C++] Adding Page level Indexes
malthe commented on pull request #6807: URL: https://github.com/apache/arrow/pull/6807#issuecomment-702683387 I added an issue: https://issues.apache.org/jira/browse/ARROW-10158.
[GitHub] [arrow] malthe commented on issue #1426: pyarrow parquet - support for row group filters
malthe commented on issue #1426: URL: https://github.com/apache/arrow/issues/1426#issuecomment-702679313 I opened a JIRA issue on this: https://issues.apache.org/jira/browse/ARROW-10158.
[GitHub] [arrow] github-actions[bot] commented on pull request #8326: ARROW-10157: [Rust] Add an example to the take kernel
github-actions[bot] commented on pull request #8326: URL: https://github.com/apache/arrow/pull/8326#issuecomment-702677975 https://issues.apache.org/jira/browse/ARROW-10157
[GitHub] [arrow] drusso commented on pull request #8222: ARROW-10043: [Rust][DataFusion] Implement COUNT(DISTINCT col)
drusso commented on pull request #8222: URL: https://github.com/apache/arrow/pull/8222#issuecomment-702677246 Just a quick update here: I'll have some time today and over the weekend to work through the changes.
[GitHub] [arrow] alamb opened a new pull request #8326: ARROW-10157: [Rust] Add an example to the take kernel
alamb opened a new pull request #8326: URL: https://github.com/apache/arrow/pull/8326 I was playing around with this and thought I would encode some of my learning as a doc / example.
[GitHub] [arrow] pitrou commented on pull request #8320: ARROW-10058: [C++] Improve repeated levels conversion without BMI2
pitrou commented on pull request #8320: URL: https://github.com/apache/arrow/pull/8320#issuecomment-702639541
> is it ok if I take a closer look next week

No problem.

> but I would not expect that to be the case.

Right. The emulation is probably much slower.
[GitHub] [arrow] pitrou commented on a change in pull request #8319: ARROW-10057: [C++] Add hand-written Parquet nested tests
pitrou commented on a change in pull request #8319: URL: https://github.com/apache/arrow/pull/8319#discussion_r498727490
## File path: cpp/src/parquet/arrow/arrow_reader_writer_test.cc
## @@ -2434,6 +2435,145 @@ TEST(TestArrowReadWrite, CanonicalNestedRoundTrip) {
   CheckSimpleRoundtrip(expected, 2);
 }
+
+TEST(ArrowReadWrite, ListOfStruct) {
+  using ::arrow::field;
+
+  auto type = ::arrow::list(::arrow::struct_(
+      {field("a", ::arrow::int16(), /*nullable=*/false), field("b", ::arrow::utf8())}));
+
+  const char* json = R"([
+    [{"a": 4, "b": "foo"}, {"a": 5}, {"a": 6, "b": "bar"}],
+    [null, {"a": 7}],
+    null,
+    []])";
+  auto array = ::arrow::ArrayFromJSON(type, json);
+  auto table = ::arrow::Table::Make(::arrow::schema({field("root", type)}), {array});
+  CheckSimpleRoundtrip(table, 2);
+}
+
+TEST(ArrowReadWrite, ListOfStructOfList1) {
+  using ::arrow::field;
+  using ::arrow::list;
+  using ::arrow::struct_;
+
+  auto type = list(struct_({field("a", ::arrow::int16(), /*nullable=*/false),
+                            field("b", list(::arrow::int64()))}));
+
+  const char* json = R"([
+    [{"a": 123, "b": [1, 2, null, 3]}, null],
+    null,
+    [],
+    [{"a": 456}, {"a": 789, "b": []}, {"a": 876, "b": [4, 5, 6]}]])";
+  auto array = ::arrow::ArrayFromJSON(type, json);
+  auto table = ::arrow::Table::Make(::arrow::schema({field("root", type)}), {array});
+  CheckSimpleRoundtrip(table, 2);
+}
+
+TEST(ArrowReadWrite, DISABLED_ListOfStructOfList2) {
+  using ::arrow::field;
+  using ::arrow::list;
+  using ::arrow::struct_;
+
+  auto type =
+      list(field("item",
+                 struct_({field("a", ::arrow::int16(), /*nullable=*/false),
+                          field("b", list(::arrow::int64()), /*nullable=*/false)}),
+           /*nullable=*/false));
+
+  const char* json = R"([
+    [{"a": 123, "b": [1, 2, 3]}],
+    null,
+    [],
+    [{"a": 456}, {"a": 789, "b": [null]}, {"a": 876, "b": [4, 5, 6]}]])";

Review comment: Ah, you may be right indeed.
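The hand-written cases above describe nested columns as JSON literals. The same literal parses with any JSON library, which makes the intended nesting easy to inspect: a nullable list of structs, where an omitted "b" is a null field and a null element is a null struct. A small stdlib-only illustration:

```python
import json

# The ListOfStruct test data, as quoted above.
rows = json.loads("""[
    [{"a": 4, "b": "foo"}, {"a": 5}, {"a": 6, "b": "bar"}],
    [null, {"a": 7}],
    null,
    []]""")

assert len(rows) == 4         # four list values in the column
assert rows[2] is None        # a null list
assert rows[3] == []          # an empty (but non-null) list
assert rows[1][0] is None     # a null struct inside a list
assert "b" not in rows[0][1]  # {"a": 5} leaves the optional field "b" null
```

This distinction (null list vs. empty list vs. null struct vs. null field) is exactly what the repetition/definition levels of the Parquet round trip have to preserve.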
[GitHub] [arrow] alamb commented on pull request #8322: ARROW-9520: [Rust] [DataFusion] Add support for aliased aggregate exprs
alamb commented on pull request #8322: URL: https://github.com/apache/arrow/pull/8322#issuecomment-702631071
> Are squash merges accepted for this project or would you prefer that I rebase the PR down to a single commit?

There is some sort of automation script that merges arrow PRs -- I don't think there is any need to rebase to a single commit.
[GitHub] [arrow] thamht4190 commented on pull request #8023: ARROW-9318: [C++] Parquet encryption key management
thamht4190 commented on pull request #8023: URL: https://github.com/apache/arrow/pull/8023#issuecomment-702625493 @bkietz I have the same thought as @ggershinsky. Moreover, since I wrote the code based on the Java version, I may not have a deep enough understanding to write out the reasoning and design behind it. I will try to list some points in the sub-PR's description. Is that ok? Anyway, I will first try to fix the comments already listed on this PR.
[GitHub] [arrow] returnString commented on pull request #8322: ARROW-9520: [Rust] [DataFusion] Add support for aliased aggregate exprs
returnString commented on pull request #8322: URL: https://github.com/apache/arrow/pull/8322#issuecomment-702621373 That's interesting, I've got rustfmt set to run on save whilst editing in vscode but I _was_ having some issues with code completion in the project too; I'll try and figure out what's up with that for any future contributions. Are squash merges accepted for this project or would you prefer that I rebase the PR down to a single commit?
[GitHub] [arrow] thamht4190 commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management
thamht4190 commented on a change in pull request #8023: URL: https://github.com/apache/arrow/pull/8023#discussion_r498709142
## File path: cpp/src/parquet/encryption/remote_kms_client.cc
## @@ -0,0 +1,127 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/json/object_parser.h"
+#include "arrow/json/object_writer.h"
+
+#include "parquet/encryption/key_toolkit_internal.h"
+#include "parquet/encryption/remote_kms_client.h"
+#include "parquet/exception.h"
+
+using arrow::json::ObjectParser;
+using arrow::json::ObjectWriter;
+
+namespace parquet {
+namespace encryption {
+
+constexpr const char RemoteKmsClient::kLocalWrapNoKeyVersion[];
+
+constexpr const char RemoteKmsClient::LocalKeyWrap::kLocalWrapKeyVersionField[];
+constexpr const char RemoteKmsClient::LocalKeyWrap::kLocalWrapEncryptedKeyField[];
+
+RemoteKmsClient::LocalKeyWrap::LocalKeyWrap(const std::string& master_key_version,
+                                            const std::string& encrypted_encoded_key)
+    : encrypted_encoded_key_(encrypted_encoded_key),
+      master_key_version_(master_key_version) {}
+
+std::string RemoteKmsClient::LocalKeyWrap::CreateSerialized(
+    const std::string& encrypted_encoded_key) {
+  ObjectWriter json_writer;
+
+  json_writer.SetString(kLocalWrapKeyVersionField, kLocalWrapNoKeyVersion);
+  json_writer.SetString(kLocalWrapEncryptedKeyField, encrypted_encoded_key);
+
+  return json_writer.Serialize();
+}
+
+RemoteKmsClient::LocalKeyWrap RemoteKmsClient::LocalKeyWrap::Parse(
+    const std::string& wrapped_key) {
+  ObjectParser json_parser;
+  if (!json_parser.Parse(wrapped_key)) {
+    throw ParquetException("Failed to parse local key wrap json " + wrapped_key);
+  }
+  std::string master_key_version;
+  PARQUET_ASSIGN_OR_THROW(master_key_version,
+                          json_parser.GetString(kLocalWrapKeyVersionField));
+
+  std::string encrypted_encoded_key;
+  PARQUET_ASSIGN_OR_THROW(encrypted_encoded_key,
+                          json_parser.GetString(kLocalWrapEncryptedKeyField));
+
+  return RemoteKmsClient::LocalKeyWrap(master_key_version, encrypted_encoded_key);
+}
+
+void RemoteKmsClient::Initialize(const KmsConnectionConfig& kms_connection_config,
+                                 bool is_wrap_locally) {
+  kms_connection_config_ = kms_connection_config;
+  is_wrap_locally_ = is_wrap_locally;
+  if (is_wrap_locally_) {
+    master_key_cache_.Clear();
+  }
+
+  is_default_token_ =
+      kms_connection_config_.key_access_token() == KmsClient::kKeyAccessTokenDefault;
+
+  InitializeInternal();
+}
+
+std::string RemoteKmsClient::WrapKey(const std::string& key_bytes,
+                                     const std::string& master_key_identifier) {
+  if (is_wrap_locally_) {

Review comment: Thanks!
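LocalKeyWrap, in the file quoted above, serializes a locally wrapped key as a small JSON object (a key-version marker plus the encrypted, encoded key) and parses it back. A Python sketch of that round trip; the concrete field names and the no-version marker are my placeholders, since the values of the C++ constants are not shown in the quote:

```python
import json

# Placeholder values; the quote only shows the constant names, not their values.
KEY_VERSION_FIELD = "masterKeyVersion"   # stands in for kLocalWrapKeyVersionField
ENCRYPTED_KEY_FIELD = "encryptedKey"     # stands in for kLocalWrapEncryptedKeyField
NO_KEY_VERSION = "NO_VERSION"            # stands in for kLocalWrapNoKeyVersion

def create_serialized(encrypted_encoded_key):
    """Mirror LocalKeyWrap::CreateSerialized: write both fields as JSON."""
    return json.dumps({
        KEY_VERSION_FIELD: NO_KEY_VERSION,
        ENCRYPTED_KEY_FIELD: encrypted_encoded_key,
    })

def parse(wrapped_key):
    """Mirror LocalKeyWrap::Parse: read both fields back, erroring on bad JSON."""
    try:
        obj = json.loads(wrapped_key)
    except json.JSONDecodeError:
        raise ValueError("Failed to parse local key wrap json " + wrapped_key)
    return obj[KEY_VERSION_FIELD], obj[ENCRYPTED_KEY_FIELD]
```

The round trip is the invariant worth noting: whatever CreateSerialized writes, Parse must recover the same version and key.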
[GitHub] [arrow] emkornfield commented on pull request #8320: ARROW-10058: [C++] Improve repeated levels conversion without BMI2
emkornfield commented on pull request #8320: URL: https://github.com/apache/arrow/pull/8320#issuecomment-702615585
> I also notice that we call internal::GreaterThanBitmap for each 64 levels, which always goes through the dynamic dispatch indirection (meaning two function calls, I think). We could call GreaterThanBitmapImpl but that requires compiling a specialized version of level_conversion_inc.h for AVX2, otherwise we lose performance.

Yeah, it isn't ideal; it's possible there is a better factoring in there, but it seemed hard to do while isolating the BMI2 special instructions. If this isn't too much slower than BMI2 on Intel, we could potentially collapse everything, but I would not expect that to be the case.
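For context on the dispatch discussion: GreaterThanBitmap is called on runs of 64 levels and produces a bitmask of which levels exceed a threshold, and it is that inner kernel that has the AVX2/BMI2 specializations. A scalar Python sketch of the assumed contract (bit i set iff levels[i] > rhs):

```python
def greater_than_bitmap(levels, rhs):
    """Scalar sketch: bit i of the result is set iff levels[i] > rhs.

    The real kernel processes 64 levels per call and has SIMD
    specializations; this only illustrates the assumed contract.
    """
    bitmap = 0
    for i, level in enumerate(levels[:64]):
        if level > rhs:
            bitmap |= 1 << i
    return bitmap

# Levels [0, 2, 1, 3] against rhs=1 set bits 1 and 3.
```

The dynamic-dispatch cost being discussed is per call, so at one call per 64 levels it is amortized but not free, which is why a statically specialized GreaterThanBitmapImpl is attractive.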
[GitHub] [arrow] emkornfield commented on pull request #8320: ARROW-10058: [C++] Improve repeated levels conversion without BMI2
emkornfield commented on pull request #8320: URL: https://github.com/apache/arrow/pull/8320#issuecomment-702614856 @pitrou I'm devoting most of my bandwidth to trying to finish up the parquet read component this week; is it ok if I take a closer look next week (hopefully with enough time before an RC is cut)?
[GitHub] [arrow] ggershinsky commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management
ggershinsky commented on a change in pull request #8023: URL: https://github.com/apache/arrow/pull/8023#discussion_r498701580
## File path: cpp/src/parquet/encryption/remote_kms_client.cc
[Quoted code duplicates the remote_kms_client.cc listing above; the comment is on the `if (is_wrap_locally_)` branch of RemoteKmsClient::WrapKey.]
Review comment: @thamht4190 This is a very good point! I see how it can be confusing. Let's then rename (and slightly modify) `RemoteKmsClient` to something like `LocalWrapKmsClient` that will be used only for KMS systems that don't support in-server wrapping. All other KMS clients will extend the `KmsClient` interface directly. I'll try to make this change also in the Java API (parquet-mr).