[jira] [Created] (ARROW-15947) rename_with s3 method for arrow_dplyr_query

2022-03-15 Thread Mark Roman Miller (Jira)
Mark Roman Miller created ARROW-15947:
-

 Summary: rename_with s3 method for arrow_dplyr_query
 Key: ARROW-15947
 URL: https://issues.apache.org/jira/browse/ARROW-15947
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Mark Roman Miller


Created a simple version of `rename_with` that applies the function to the 
current names of the .data argument and passes the result to `rename`.
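
A minimal sketch (not the actual patch) of what such a method could look like, assuming `names()` works on an `arrow_dplyr_query` and ignoring the `.cols` argument that the dplyr generic also accepts:

{code:r}
library(dplyr)

# Hypothetical sketch: apply .fn to the current column names and delegate to
# rename(), which the arrow package already supports for arrow_dplyr_query.
rename_with.arrow_dplyr_query <- function(.data, .fn, ...) {
  old_names <- names(.data)
  new_names <- .fn(old_names)
  # rename() expects new_name = old_name pairs, so splice in a named vector
  rename(.data, !!!rlang::set_names(old_names, new_names))
}
{code}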





[jira] [Created] (ARROW-15946) [Go] Memory leak in pqarrow.NewColumnWriter with nested structures

2022-03-15 Thread Min-Young Wu (Jira)
Min-Young Wu created ARROW-15946:


 Summary: [Go] Memory leak in pqarrow.NewColumnWriter with nested 
structures
 Key: ARROW-15946
 URL: https://issues.apache.org/jira/browse/ARROW-15946
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go, Parquet
Reporter: Min-Young Wu


There seems to be a memory leak (well, using the default allocator, it would 
just be an accounting error?) when writing nested structures using 
pqarrow.FileWriter

Repro:
{code:go}
package main

import (
    "bytes"
    "fmt"

    "github.com/apache/arrow/go/v7/arrow"
    "github.com/apache/arrow/go/v7/arrow/array"
    "github.com/apache/arrow/go/v7/arrow/memory"
    "github.com/apache/arrow/go/v7/parquet"
    "github.com/apache/arrow/go/v7/parquet/compress"
    "github.com/apache/arrow/go/v7/parquet/pqarrow"
)

func main() {
    allocator := memory.NewCheckedAllocator(memory.DefaultAllocator)
    sc := arrow.NewSchema([]arrow.Field{
        {Name: "f32", Type: arrow.PrimitiveTypes.Float32, Nullable: true},
        {Name: "i32", Type: arrow.PrimitiveTypes.Int32, Nullable: true},
        {Name: "struct_i64_f64", Type: arrow.StructOf(
            arrow.Field{Name: "i64", Type: arrow.PrimitiveTypes.Int64, Nullable: true},
            arrow.Field{Name: "f64", Type: arrow.PrimitiveTypes.Float64, Nullable: true})},
    }, nil)

    bld := array.NewRecordBuilder(allocator, sc)
    bld.Field(0).(*array.Float32Builder).Append(1.0)
    bld.Field(1).(*array.Int32Builder).Append(1)
    sbld := bld.Field(2).(*array.StructBuilder)
    sbld.Append(true)
    sbld.FieldBuilder(0).(*array.Int64Builder).Append(1)
    sbld.FieldBuilder(1).(*array.Float64Builder).Append(1.0)

    rec := bld.NewRecord()
    bld.Release()

    // Write the single-record batch to an in-memory Parquet file.
    var buf bytes.Buffer
    wr, err := pqarrow.NewFileWriter(sc, &buf,
        parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)),
        pqarrow.NewArrowWriterProperties(pqarrow.WithAllocator(allocator)))
    if err != nil {
        panic(err)
    }

    err = wr.Write(rec)
    if err != nil {
        panic(err)
    }
    rec.Release()
    wr.Close()

    // With the checked allocator, anything still accounted for here was never released.
    if allocator.CurrentAlloc() != 0 {
        fmt.Printf("remaining allocation size: %d\n", allocator.CurrentAlloc())
    }
}
{code}





[jira] [Created] (ARROW-15945) debug build for gandiva and arrow is not working

2022-03-15 Thread Chak-Pong Chung (Jira)
Chak-Pong Chung created ARROW-15945:
---

 Summary: debug build for gandiva and arrow is not working
 Key: ARROW-15945
 URL: https://issues.apache.org/jira/browse/ARROW-15945
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, C++ - Gandiva, Documentation
Affects Versions: 7.0.0, 6.0.0
Reporter: Chak-Pong Chung


As reported on the dev mailing list, the debug build is not working.

The email includes the conda environment (with the dependencies used) and the script to reproduce the problem. The bug is present in both the 7.0 and 6.0 release branches.

https://lists.apache.org/list?d...@arrow.apache.org:lte=1M:debug%20build%20error%20for%20arrow





[jira] [Created] (ARROW-15944) Document dependencies for building on Arch Linux

2022-03-15 Thread Tobias Zagorni (Jira)
Tobias Zagorni created ARROW-15944:
--

 Summary: Document dependencies for building on Arch Linux
 Key: ARROW-15944
 URL: https://issues.apache.org/jira/browse/ARROW-15944
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
 Environment: Arch Linux
Reporter: Tobias Zagorni
Assignee: Tobias Zagorni


Add to the documentation the command for installing the dependencies needed to build Arrow on Arch Linux, similar to what is already documented for other distributions.





[jira] [Created] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string

2022-03-15 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15943:


 Summary: [C++] Filter which files to be read in as part of 
filesystem, filtered using a string
 Key: ARROW-15943
 URL: https://issues.apache.org/jira/browse/ARROW-15943
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Nicola Crane


There is a report from a user (see this Stack Overflow post [1]) who has used the {{basename_template}} parameter to write files to a dataset, some of which have the prefix {{"summary"}} and others the prefix {{"prediction"}}. The data is saved in partitioned directories. They want to be able to read the data back in so that, in addition to the partition variables in their dataset, they can choose which subset (predictions vs. summaries) to read.

This isn't currently possible; if they try to open a dataset with a list of 
files, they cannot read it in as partitioned data.

A short-term solution is to suggest they change the structure of how their data 
is stored, but it could be useful to be able to pass in some sort of filter to 
determine which files get read in as a dataset.
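
For illustration, a sketch of the workaround available today and its limitation (the directory name and file layout below are hypothetical): opening the dataset from an explicit, prefix-filtered list of paths works, but the hive-style partition columns encoded in the directory names are not recovered.

{code:r}
library(arrow)

# Hypothetical layout: dataset_dir/<part>=<value>/summary-*.parquet
#                      dataset_dir/<part>=<value>/prediction-*.parquet
files <- list.files("dataset_dir", recursive = TRUE, full.names = TRUE)
summary_files <- files[grepl("^summary", basename(files))]

# Only the "summary" files are opened, but the partition variables from the
# directory names are lost; this is the limitation described above.
ds <- open_dataset(summary_files)
{code}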

 

[1] https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r





[jira] [Created] (ARROW-15942) [C++] RecordBatch::ValidateFull fails on nested StructArray

2022-03-15 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-15942:
--

 Summary: [C++] RecordBatch::ValidateFull fails on nested 
StructArray
 Key: ARROW-15942
 URL: https://issues.apache.org/jira/browse/ARROW-15942
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Rok Mihevc


ValidateFull appears to discard the outermost field of a nested schema. The following example passes:

{code:bash}
diff --git a/cpp/src/arrow/array/array_struct_test.cc 
b/cpp/src/arrow/array/array_struct_test.cc
index 318c83860..6a8896ca9 100644
--- a/cpp/src/arrow/array/array_struct_test.cc
+++ b/cpp/src/arrow/array/array_struct_test.cc
@@ -15,6 +15,8 @@
 // specific language governing permissions and limitations
 // under the License.
 
+#include 
+
 #include 
 
 #include 
@@ -696,4 +698,20 @@ TEST(TestFieldRef, GetChildren) {
   AssertArraysEqual(*a, *expected_a);
 }
 
+TEST(TestFieldRef, TestValidateFullRecordBatch) {
+  auto struct_array =
+      ArrayFromJSON(struct_({field("a", struct_({field("b", float64())}))}), R"([
+    {"a": {"b": 6.125}},
+    {"a": {"b": 0.0}},
+    {"a": {"b": -1}}
+  ])");
+
+  auto schema1 =
+      arrow::schema({field("x", struct_({field("a", struct_({field("b", float64())}))}))});
+  auto schema2 = arrow::schema({field("a", struct_({field("b", float64())}))});
+  auto record_batch1 = arrow::RecordBatch::Make(schema1, 3, {struct_array});
+  auto record_batch2 = arrow::RecordBatch::Make(schema2, 3, {struct_array});
+  ASSERT_OK(record_batch1->ValidateFull());
+  ASSERT_NOT_OK(record_batch2->ValidateFull());
+}
+
{code}

Is this expected behaviour?





[jira] [Created] (ARROW-15941) [C++] Allow setting IO thread pool size with an environment variable

2022-03-15 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-15941:
--

 Summary: [C++] Allow setting IO thread pool size with an 
environment variable
 Key: ARROW-15941
 URL: https://issues.apache.org/jira/browse/ARROW-15941
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
 Fix For: 8.0.0


See https://issues.apache.org/jira/browse/ARROW-14354 and 
https://github.com/apache/arrow/pull/12624#discussion_r827088337 for discussion.
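
For context, the I/O thread pool size can currently only be changed programmatically. A rough sketch from the R bindings (assuming the exported io_thread_count()/set_io_thread_count() helpers; the environment variable name itself is still to be decided):

{code:r}
library(arrow)

# Assumes arrow's io_thread_count()/set_io_thread_count() helpers.
io_thread_count()        # inspect the current I/O thread pool size
set_io_thread_count(16)  # resize it at runtime; the proposed environment
                         # variable would set this at startup instead
{code}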





[jira] [Created] (ARROW-15940) [Gandiva][C++] Add NEGATIVE function for decimal data type

2022-03-15 Thread Johnnathan Rodrigo Pego de Almeida (Jira)
Johnnathan Rodrigo Pego de Almeida created ARROW-15940:
--

 Summary: [Gandiva][C++] Add NEGATIVE function for decimal data type
 Key: ARROW-15940
 URL: https://issues.apache.org/jira/browse/ARROW-15940
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Johnnathan Rodrigo Pego de Almeida


This PR implements the NEGATIVE function for the decimal data type.

The function receives a decimal128() and returns its negation as a decimal128().





[jira] [Created] (ARROW-15939) [Python] support pickling json.ReadOptions and json.ParseOptions

2022-03-15 Thread runvyang (Jira)
runvyang created ARROW-15939:


 Summary: [Python] support pickling json.ReadOptions and 
json.ParseOptions
 Key: ARROW-15939
 URL: https://issues.apache.org/jira/browse/ARROW-15939
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: runvyang








[jira] [Created] (ARROW-15938) [R][C++] Segfault in left join with empty right table when filtered on partition

2022-03-15 Thread Vitalie Spinu (Jira)
Vitalie Spinu created ARROW-15938:
-

 Summary: [R][C++] Segfault in left join with empty right table 
when filtered on partition
 Key: ARROW-15938
 URL: https://issues.apache.org/jira/browse/ARROW-15938
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Compute IR
Affects Versions: 7.0.1
 Environment: ubuntu linux, R4.1.2
Reporter: Vitalie Spinu


When the right table in a join is empty as a result of filtering on a partition column, the join segfaults:
{code:java}
  library(arrow)
  library(dplyr)
  library(glue)

  df <- mutate(iris, id = runif(n()))
  dir <- "./tmp/iris"
  dir.create(glue("{dir}/group=a/"), recursive = T, showWarnings = F)
  dir.create(glue("{dir}/group=b/"), recursive = T, showWarnings = F)
  write_parquet(df, glue("{dir}/group=a/part1.parquet"))
  write_parquet(df, glue("{dir}/group=b/part2.parquet"))

  # Right-hand table: filtering the partition column on a value that matches
  # nothing leaves it empty.
  db1 <- open_dataset(dir) %>%
    filter(group == "blabla")

  open_dataset(dir) %>%
    filter(group == "b") %>%
    select(id) %>%
    left_join(db1, by = "id") %>%
    collect()
  {code}
{code:java}
==24063== Thread 7:
==24063== Invalid read of size 1
==24063==    at 0x1FFE606D: 
arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(long, 
arrow::compute::ExecBatch*, arrow::compute::ExecBatch*, 
arrow::compute::ExecBatch*, arrow::compute::ExecBatch*) (in 
/home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x1FFE68CC: 
arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(unsigned long, long, 
int const*, int const*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x1FFE84D5: 
arrow::compute::HashJoinBasicImpl::ProbeBatch(unsigned long, 
arrow::compute::ExecBatch const&) (in 
/home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x1FFE8CB4: 
arrow::compute::HashJoinBasicImpl::InputReceived(unsigned long, int, 
arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x200011CF: 
arrow::compute::HashJoinNode::InputReceived(arrow::compute::ExecNode*, 
arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x1FFB580E: 
arrow::compute::MapNode::SubmitTask(std::function
 (arrow::compute::ExecBatch)>, 
arrow::compute::ExecBatch)::{lambda()#1}::operator()() const (in 
/home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x1FFB6444: arrow::internal::FnOnce::FnImpl, 
arrow::compute::MapNode::SubmitTask(std::function
 (arrow::compute::ExecBatch)>, 
arrow::compute::ExecBatch)::{lambda()#2}::operator()() const::{lambda()#1})> 
>::invoke() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x1FE2B2A0: 
std::thread::_State_impl
 > >::_M_run() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063==    by 0x92844BF: ??? (in 
/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29)
==24063==    by 0x6DD46DA: start_thread (pthread_create.c:463)
==24063==    by 0x710D71E: clone (clone.S:95)
==24063==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
==24063==  *** caught segfault ***
address 0x10, cause 'memory not mapped'

Traceback:
 1: Table__from_RecordBatchReader(self)
 2: tab$read_table()
 3: do_exec_plan(x)
 4: doTryCatch(return(expr), name, parentenv, handler)
 5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 6: tryCatchList(expr, classes, parentenv, handlers)
 7: tryCatch(tab <- do_exec_plan(x), error = function(e) {    handle_csv_read_error(e, x$.data$schema)})
 8: collect.arrow_dplyr_query(.)
 9: collect(.)
10: open_dataset(dir) %>% filter(group == "b") %>% select(id) %>%     left_join(db1, by = "id") %>% collect()

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
{code}
This is Arrow built from the current master (ece0e23f1).

It's worth noting that if the right table is filtered on a non-partitioned variable, the problem does not occur.
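
For completeness, a sketch of that contrasting case, using the same setup as the repro above (Sepal.Length is a regular, non-partition column of iris); this variant reportedly completes without a crash:

{code:r}
# Right-hand table filtered on a non-partition column instead of "group";
# the filter also matches nothing, but per the note above this does not segfault.
db2 <- open_dataset(dir) %>%
  filter(Sepal.Length < 0)

open_dataset(dir) %>%
  filter(group == "b") %>%
  select(id) %>%
  left_join(db2, by = "id") %>%
  collect()
{code}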


