[jira] [Created] (ARROW-15947) rename_with s3 method for arrow_dplyr_query
Mark Roman Miller created ARROW-15947:
--------------------------------------

             Summary: rename_with s3 method for arrow_dplyr_query
                 Key: ARROW-15947
                 URL: https://issues.apache.org/jira/browse/ARROW-15947
             Project: Apache Arrow
          Issue Type: New Feature
          Components: R
            Reporter: Mark Roman Miller

Created a simple version of `rename_with` that applies the function to the current names of the .data argument and passes the result to `rename`.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15946) [Go] Memory leak in pqarrow.NewColumnWriter with nested structures
Min-Young Wu created ARROW-15946:
---------------------------------

             Summary: [Go] Memory leak in pqarrow.NewColumnWriter with nested structures
                 Key: ARROW-15946
                 URL: https://issues.apache.org/jira/browse/ARROW-15946
             Project: Apache Arrow
          Issue Type: Bug
          Components: Go, Parquet
            Reporter: Min-Young Wu

There seems to be a memory leak (well, using the default allocator, it would just be an accounting error?) when writing nested structures using pqarrow.FileWriter.

Repro:
{code:go}
package main

import (
	"bytes"
	"fmt"

	"github.com/apache/arrow/go/v7/arrow"
	"github.com/apache/arrow/go/v7/arrow/array"
	"github.com/apache/arrow/go/v7/arrow/memory"
	"github.com/apache/arrow/go/v7/parquet"
	"github.com/apache/arrow/go/v7/parquet/compress"
	"github.com/apache/arrow/go/v7/parquet/pqarrow"
)

func main() {
	allocator := memory.NewCheckedAllocator(memory.DefaultAllocator)
	sc := arrow.NewSchema([]arrow.Field{
		{Name: "f32", Type: arrow.PrimitiveTypes.Float32, Nullable: true},
		{Name: "i32", Type: arrow.PrimitiveTypes.Int32, Nullable: true},
		{Name: "struct_i64_f64", Type: arrow.StructOf(
			arrow.Field{Name: "i64", Type: arrow.PrimitiveTypes.Int64, Nullable: true},
			arrow.Field{Name: "f64", Type: arrow.PrimitiveTypes.Float64, Nullable: true})},
	}, nil)

	bld := array.NewRecordBuilder(allocator, sc)
	bld.Field(0).(*array.Float32Builder).Append(1.0)
	bld.Field(1).(*array.Int32Builder).Append(1)
	sbld := bld.Field(2).(*array.StructBuilder)
	sbld.Append(true)
	sbld.FieldBuilder(0).(*array.Int64Builder).Append(1)
	sbld.FieldBuilder(1).(*array.Float64Builder).Append(1.0)
	rec := bld.NewRecord()
	bld.Release()

	var buf bytes.Buffer
	wr, err := pqarrow.NewFileWriter(sc, &buf,
		parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)),
		pqarrow.NewArrowWriterProperties(pqarrow.WithAllocator(allocator)))
	if err != nil {
		panic(err)
	}

	err = wr.Write(rec)
	if err != nil {
		panic(err)
	}
	rec.Release()
	wr.Close()

	if allocator.CurrentAlloc() != 0 {
		fmt.Printf("remaining allocation size: %d\n", allocator.CurrentAlloc())
	}
}
{code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
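The "accounting error" reading hinges on how CheckedAllocator works: it keeps a running total of outstanding bytes, and any Allocate without a matching Free leaves a nonzero CurrentAlloc at shutdown. That bookkeeping can be sketched in plain Go, independent of Arrow (all names here are illustrative, not Arrow's implementation):

```go
package main

import "fmt"

// trackingAllocator mimics the accounting that a checked allocator
// performs: every Allocate must be balanced by a Free, or
// CurrentAlloc reports the outstanding (leaked) bytes.
type trackingAllocator struct {
	current int
}

func (a *trackingAllocator) Allocate(size int) []byte {
	a.current += size
	return make([]byte, size)
}

func (a *trackingAllocator) Free(b []byte) {
	a.current -= len(b)
}

func (a *trackingAllocator) CurrentAlloc() int { return a.current }

func main() {
	alloc := &trackingAllocator{}
	buf := alloc.Allocate(64)
	leaked := alloc.Allocate(32) // never freed, like the nested-struct writer path
	_ = leaked
	alloc.Free(buf)
	fmt.Printf("remaining allocation size: %d\n", alloc.CurrentAlloc()) // 32
}
```

Whether the bytes are truly lost or merely never handed back through the allocator, the checked wrapper flags them the same way, which is why the repro's final check is a reliable leak signal.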
[jira] [Created] (ARROW-15945) debug build for gandiva and arrow is not working
Chak-Pong Chung created ARROW-15945:
------------------------------------

             Summary: debug build for gandiva and arrow is not working
                 Key: ARROW-15945
                 URL: https://issues.apache.org/jira/browse/ARROW-15945
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, C++ - Gandiva, Documentation
    Affects Versions: 7.0.0, 6.0.0
            Reporter: Chak-Pong Chung

As reported to the dev mailing list, the debug build is not working. The email thread includes the conda environment with the dependencies used and a script to reproduce the problem. The bug is present on both the 7.0 and 6.0 release branches.

https://lists.apache.org/list?d...@arrow.apache.org:lte=1M:debug%20build%20error%20for%20arrow

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15944) Document dependencies for building on Arch Linux
Tobias Zagorni created ARROW-15944:
-----------------------------------

             Summary: Document dependencies for building on Arch Linux
                 Key: ARROW-15944
                 URL: https://issues.apache.org/jira/browse/ARROW-15944
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Documentation
         Environment: Arch Linux
            Reporter: Tobias Zagorni
            Assignee: Tobias Zagorni

List the command for installing the dependencies needed to build Arrow on Arch Linux in the documentation, similar to what is already documented for other distributions.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
Nicola Crane created ARROW-15943:
---------------------------------

             Summary: [C++] Filter which files to be read in as part of filesystem, filtered using a string
                 Key: ARROW-15943
                 URL: https://issues.apache.org/jira/browse/ARROW-15943
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Nicola Crane

There is a report from a user (see the Stack Overflow post [1]) who used the {{basename_template}} parameter to write files to a dataset, some with the prefix {{"summary"}} and others with the prefix {{"prediction"}}. The data is saved in partitioned directories. They want to read the data back in such that, in addition to the partition variables in their dataset, they can choose which subset (predictions vs. summaries) to read. This isn't currently possible: if they open a dataset from a list of files, they cannot read it in as partitioned data. A short-term solution is to suggest they change how their data is stored, but it could be useful to be able to pass in some sort of filter to determine which files get read in as a dataset.

[1] https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
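The behaviour being requested amounts to filtering a file listing by base-name prefix before the dataset is constructed, while still letting the directory structure drive partitioning. A plain-Go sketch of such a filter (a hypothetical helper for illustration, not an Arrow API):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// filterByPrefix keeps only the paths whose base file name starts
// with the given prefix, e.g. "summary" vs "prediction", leaving the
// partition directories untouched.
func filterByPrefix(paths []string, prefix string) []string {
	var out []string
	for _, p := range paths {
		if strings.HasPrefix(filepath.Base(p), prefix) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	files := []string{
		"year=2021/summary-part-0.parquet",
		"year=2021/prediction-part-0.parquet",
		"year=2022/summary-part-0.parquet",
	}
	// Keep only the "summary" subset; partition info stays in the paths.
	fmt.Println(filterByPrefix(files, "summary"))
}
```

Because the surviving paths still carry their `key=value` directory components, a dataset built from them could in principle still infer the partition variables, which is exactly what opening an explicit file list cannot do today.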
[jira] [Created] (ARROW-15942) [C++] RecordBatch::ValidateFull fails on nested StructArray
Rok Mihevc created ARROW-15942:
-------------------------------

             Summary: [C++] RecordBatch::ValidateFull fails on nested StructArray
                 Key: ARROW-15942
                 URL: https://issues.apache.org/jira/browse/ARROW-15942
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Rok Mihevc

ValidateFull appears to discard the outermost field of a nested schema. The following test passes:

{code:cpp}
diff --git a/cpp/src/arrow/array/array_struct_test.cc b/cpp/src/arrow/array/array_struct_test.cc
index 318c83860..6a8896ca9 100644
--- a/cpp/src/arrow/array/array_struct_test.cc
+++ b/cpp/src/arrow/array/array_struct_test.cc
@@ -15,6 +15,8 @@
 // specific language governing permissions and limitations
 // under the License.

+#include
+
 #include
 #include

@@ -696,4 +698,20 @@ TEST(TestFieldRef, GetChildren) {
   AssertArraysEqual(*a, *expected_a);
 }

+TEST(TestFieldRef, TestValidateFullRecordBatch) {
+  auto struct_array =
+      ArrayFromJSON(struct_({field("a", struct_({field("b", float64())}))}), R"([
+{"a": {"b": 6.125}},
+{"a": {"b": 0.0}},
+{"a": {"b": -1}}
+  ])");
+
+  auto schema1 = arrow::schema(
+      {field("x", struct_({field("a", struct_({field("b", float64())}))}))});
+  auto schema2 = arrow::schema({field("a", struct_({field("b", float64())}))});
+  auto record_batch1 = arrow::RecordBatch::Make(schema1, 3, {struct_array});
+  auto record_batch2 = arrow::RecordBatch::Make(schema2, 3, {struct_array});
+  ASSERT_OK(record_batch1->ValidateFull());
+  ASSERT_NOT_OK(record_batch2->ValidateFull());
+}
+
{code}

Is this expected behaviour?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
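What the reporter expects ValidateFull to enforce is a recursive match between the schema's type tree and the array's type tree, all the way down. A plain-Go sketch of that kind of check, using the wrapped and unwrapped shapes from the example (illustrative only, not Arrow's implementation):

```go
package main

import "fmt"

// node models a nested type as a kind plus children; matches checks
// that two type trees are identical at every level -- the recursive
// schema-vs-array comparison a full validation is expected to do.
type node struct {
	kind     string
	children []node
}

func matches(schema, array node) bool {
	if schema.kind != array.kind || len(schema.children) != len(array.children) {
		return false
	}
	for i := range schema.children {
		if !matches(schema.children[i], array.children[i]) {
			return false
		}
	}
	return true
}

func main() {
	// struct<a: struct<b: float64>> -- the type of struct_array above
	arrayType := node{"struct", []node{{"struct", []node{{"float64", nil}}}}}
	// schema2 declares exactly this type: should validate
	fmt.Println(matches(arrayType, arrayType)) // true
	// schema1 wraps it one level deeper under "x": should NOT validate
	wrapped := node{"struct", []node{arrayType}}
	fmt.Println(matches(wrapped, arrayType)) // false
}
```

With a check like this, record_batch2 (matching schema) would pass and record_batch1 (extra wrapping level) would fail, which is the opposite of what the test above observes.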
[jira] [Created] (ARROW-15941) [C++] Allow setting IO thread pool size with an environment variable
Antoine Pitrou created ARROW-15941:
-----------------------------------

             Summary: [C++] Allow setting IO thread pool size with an environment variable
                 Key: ARROW-15941
                 URL: https://issues.apache.org/jira/browse/ARROW-15941
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Antoine Pitrou
             Fix For: 8.0.0

See https://issues.apache.org/jira/browse/ARROW-14354 and https://github.com/apache/arrow/pull/12624#discussion_r827088337 for discussion.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15940) [Gandiva][C++] Add NEGATIVE function for decimal data type
Johnnathan Rodrigo Pego de Almeida created ARROW-15940:
-------------------------------------------------------

             Summary: [Gandiva][C++] Add NEGATIVE function for decimal data type
                 Key: ARROW-15940
                 URL: https://issues.apache.org/jira/browse/ARROW-15940
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++ - Gandiva
            Reporter: Johnnathan Rodrigo Pego de Almeida

This PR implements the NEGATIVE function for the decimal data type. The function receives a decimal128() and returns its negation as a decimal128().

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
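Arrow represents a decimal128 value as a 128-bit two's-complement integer split into a high and a low 64-bit word, so a NEGATIVE kernel reduces to a 128-bit negation (the scale and precision are unchanged). A plain-Go sketch of that core step (illustrative, not the Gandiva code):

```go
package main

import "fmt"

// decimal128 models the integer part of a 128-bit decimal as a high
// signed word and a low unsigned word, as Arrow stores it.
type decimal128 struct {
	hi int64
	lo uint64
}

// negate returns the two's-complement negation: invert all 128 bits
// and add one, propagating the carry from the low into the high word.
func negate(d decimal128) decimal128 {
	lo := ^d.lo + 1
	hi := ^d.hi
	if lo == 0 { // adding 1 overflowed the low word
		hi++
	}
	return decimal128{hi, lo}
}

func main() {
	// 5 -> -5: in 128-bit two's complement, hi = -1 and lo = 2^64 - 5
	n := negate(decimal128{0, 5})
	fmt.Println(n.hi, n.lo) // -1 18446744073709551611
}
```

One edge case worth a test in the real kernel: the minimum decimal128 value negates to itself under pure two's complement, just as math.MinInt64 does in 64 bits.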
[jira] [Created] (ARROW-15939) [Python] support pickling json.ReadOptions and json.ParseOptions
runvyang created ARROW-15939:
-----------------------------

             Summary: [Python] support pickling json.ReadOptions and json.ParseOptions
                 Key: ARROW-15939
                 URL: https://issues.apache.org/jira/browse/ARROW-15939
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: runvyang

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15938) [R][C++] Segfault in left join with empty right table when filtered on partition
Vitalie Spinu created ARROW-15938:
----------------------------------

             Summary: [R][C++] Segfault in left join with empty right table when filtered on partition
                 Key: ARROW-15938
                 URL: https://issues.apache.org/jira/browse/ARROW-15938
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Compute IR
    Affects Versions: 7.0.1
         Environment: ubuntu linux, R4.1.2
            Reporter: Vitalie Spinu

When the right table in a join is empty as a result of filtering on a partition group, the join segfaults:

{code:r}
library(arrow)
library(dplyr)
library(glue)

df <- mutate(iris, id = runif(n()))

dir <- "./tmp/iris"
dir.create(glue("{dir}/group=a/"), recursive = T, showWarnings = F)
dir.create(glue("{dir}/group=b/"), recursive = T, showWarnings = F)
write_parquet(df, glue("{dir}/group=a/part1.parquet"))
write_parquet(df, glue("{dir}/group=b/part2.parquet"))

db1 <- open_dataset(dir) %>%
  filter(group == "blabla")

open_dataset(dir) %>%
  filter(group == "b") %>%
  select(id) %>%
  left_join(db1, by = "id") %>%
  collect()
{code}

{code}
==24063== Thread 7:
==24063== Invalid read of size 1
==24063==    at 0x1FFE606D: arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(long, arrow::compute::ExecBatch*, arrow::compute::ExecBatch*, arrow::compute::ExecBatch*, arrow::compute::ExecBatch*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x1FFE68CC: arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(unsigned long, long, int const*, int const*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x1FFE84D5: arrow::compute::HashJoinBasicImpl::ProbeBatch(unsigned long, arrow::compute::ExecBatch const&) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x1FFE8CB4: arrow::compute::HashJoinBasicImpl::InputReceived(unsigned long, int, arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x200011CF: arrow::compute::HashJoinNode::InputReceived(arrow::compute::ExecNode*, arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x1FFB580E: arrow::compute::MapNode::SubmitTask(std::function (arrow::compute::ExecBatch)>, arrow::compute::ExecBatch)::{lambda()#1}::operator()() const (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x1FFB6444: arrow::internal::FnOnce::FnImpl, arrow::compute::MapNode::SubmitTask(std::function (arrow::compute::ExecBatch)>, arrow::compute::ExecBatch)::{lambda()#2}::operator()() const::{lambda()#1})> >::invoke() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x1FE2B2A0: std::thread::_State_impl > >::_M_run() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x92844BF: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29)
==24063==    by 0x6DD46DA: start_thread (pthread_create.c:463)
==24063==    by 0x710D71E: clone (clone.S:95)
==24063==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
==24063==

 *** caught segfault ***
address 0x10, cause 'memory not mapped'

Traceback:
 1: Table__from_RecordBatchReader(self)
 2: tab$read_table()
 3: do_exec_plan(x)
 4: doTryCatch(return(expr), name, parentenv, handler)
 5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 6: tryCatchList(expr, classes, parentenv, handlers)
 7: tryCatch(tab <- do_exec_plan(x), error = function(e) {     handle_csv_read_error(e, x$.data$schema)})
 8: collect.arrow_dplyr_query(.)
 9: collect(.)
10: open_dataset(dir) %>% filter(group == "b") %>% select(id) %>%     left_join(db1, by = "id") %>% collect()

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
{code}

This is arrow built from current master (ece0e23f1). It's worth noting that if the right table is filtered on a non-partitioned variable, the problem does not occur.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
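The trace places the crash in the probe phase of the hash join, where batches from the left side are looked up against a build table constructed from the (here empty) right side. A plain-Go sketch of what a probe that stays safe with an empty build table looks like (illustrative only, not Arrow's implementation):

```go
package main

import "fmt"

// leftJoinKeys sketches the probe side of a hash left join: every
// left-side key is emitted, with unmatched keys counted when the
// build (right) side has no entry. With an empty build table this
// must degrade to "all keys unmatched", never a crash.
func leftJoinKeys(left []float64, right map[float64]bool) (matched, unmatched int) {
	for _, k := range left {
		if right[k] { // lookups in an empty map are safe in Go
			matched++
		} else {
			unmatched++
		}
	}
	return
}

func main() {
	left := []float64{0.1, 0.2, 0.3}
	empty := map[float64]bool{} // right side filtered down to nothing
	m, u := leftJoinKeys(left, empty)
	fmt.Println(m, u) // 0 3
}
```

For a left join, an empty build side is a legitimate input with a well-defined result (all output rows carry nulls for the right columns), which is what the repro above should produce instead of the invalid read at address 0x10.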