[jira] [Created] (ARROW-15728) [Python] Zstd IPC test is flaky.

2022-02-17 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-15728:
---

 Summary: [Python] Zstd IPC test is flaky.
 Key: ARROW-15728
 URL: https://issues.apache.org/jira/browse/ARROW-15728
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Micah Kornfield
Assignee: Micah Kornfield


Our internal CI system shows flakes on the test at approximately a 2% rate.  By 
reducing the integer range we can make this much less flaky (zero observed 
flakes in 5000 runs).
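
For reference, a minimal sketch of this kind of zstd IPC round trip (the actual test and its parameters are not reproduced here; the column name and integer range below are illustrative):

{code:python}
# Sketch: write a table with integers drawn from a deliberately small range to an
# IPC stream with zstd compression, then read it back and compare.
import numpy as np
import pyarrow as pa

rng = np.random.default_rng()
table = pa.table({"ints": rng.integers(0, 1_000, size=10_000)})

sink = pa.BufferOutputStream()
options = pa.ipc.IpcWriteOptions(compression="zstd")
with pa.ipc.new_stream(sink, table.schema, options=options) as writer:
    writer.write_table(table)

result = pa.ipc.open_stream(sink.getvalue()).read_all()
assert result.equals(table)
{code}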



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15727) [Python] Lists of MonthDayNano Interval can't be converted to Pandas

2022-02-17 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-15727:
---

 Summary: [Python] Lists of MonthDayNano Interval can't be 
converted to Pandas
 Key: ARROW-15727
 URL: https://issues.apache.org/jira/browse/ARROW-15727
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Micah Kornfield
Assignee: Micah Kornfield






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15726) [R] Support push-down projection/filtering in datasets / dplyr

2022-02-17 Thread Weston Pace (Jira)
Weston Pace created ARROW-15726:
---

 Summary: [R] Support push-down projection/filtering in datasets / 
dplyr
 Key: ARROW-15726
 URL: https://issues.apache.org/jira/browse/ARROW-15726
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Weston Pace


The following query should read a single column from the target parquet file.

{noformat}
open_dataset("lineitem.parquet") %>%
  select(l_tax) %>%
  filter(l_tax < 0.01) %>%
  collect()
{noformat}

Furthermore, it should apply a pushdown filter to the source node allowing 
parquet row groups to potentially filter out target data.

At the moment it creates the following exec plan:

{noformat}
3:SinkNode{}
  2:ProjectNode{projection=[l_tax]}
    1:FilterNode{filter=(l_tax < 0.01)}
      0:SourceNode{}
{noformat}

There is no projection or filter in the source node.  As a result we end up 
reading much more data from disk (the entire file) than we need to (at most a 
single column).
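
For comparison, the Python dataset API already pushes both the column selection and the filter into the scan; a minimal sketch of the equivalent direct call (an illustration only, not the proposed fix):

{code:python}
# Sketch: pyarrow's dataset scanner receives the projection and filter directly,
# so only the l_tax column (and matching row groups) need to be read.
import pyarrow.dataset as ds

dataset = ds.dataset("lineitem.parquet", format="parquet")
table = dataset.to_table(columns=["l_tax"], filter=ds.field("l_tax") < 0.01)
{code}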

This _could_ be fixed via heuristics in the dplyr code. However, it may quickly get complex. For example, the projection comes after the filter, so the pushed-down projection needs to include both the columns accessed by the filter and the columns accessed by the projection. And while a projection can be pushed down through a join, it is not obvious which columns should be applied to which source node.

A more complete fix would be to call into some kind of third-party optimizer 
(e.g. Calcite) after the plan has been created by dplyr but before it is passed 
to the execution engine.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15725) [Python] Legacy dataset can't roundtrip Int64 with nulls if partitioned

2022-02-17 Thread Will Jones (Jira)
Will Jones created ARROW-15725:
--

 Summary: [Python] Legacy dataset can't roundtrip Int64 with nulls 
if partitioned
 Key: ARROW-15725
 URL: https://issues.apache.org/jira/browse/ARROW-15725
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 7.0.0, 4.0.0
Reporter: Will Jones


If there is partitioning and the column has nulls, Int64 columns may not round 
trip successfully using the legacy datasets implementation. 

Simple reproduction:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import tempfile

table = pa.table({
    'x': pa.array([None, 7753285016841556620]),
    'y': pa.array(['a', 'b'])
})

ds_dir = tempfile.mkdtemp()
pq.write_to_dataset(table, ds_dir, partition_cols=['y'])

table_after = ds.dataset(ds_dir).to_table()
print(table['x'])
print(table_after['x'])
assert table['x'] == table_after['x']
{code}

{code}
[
  [
null,
7753285016841556620
  ]
]
[
  [
null
  ],
  [
7753285016841556992
  ]
]
{code}
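
The corrupted value is consistent with a round trip through float64 (presumably the legacy path materializes the nullable int64 column as floating point at some stage); a quick check:

{code:python}
# 7753285016841556620 is not exactly representable in float64; the nearest
# representable value is exactly the corrupted value seen above.
assert int(float(7753285016841556620)) == 7753285016841556992
{code}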



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15724) reduce directory and file IO when reading partition parquet dataset

2022-02-17 Thread Yin (Jira)
Yin created ARROW-15724:
---

 Summary: reduce directory and file IO when reading partition 
parquet dataset
 Key: ARROW-15724
 URL: https://issues.apache.org/jira/browse/ARROW-15724
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Yin
 Attachments: pq.py

Hi,
It seems Arrow accesses all partition directories (and even the parquet files in 
them), including those that clearly do not match the partition key values in the 
filters. This can make a read with partition key filters many times slower than 
accessing one partition directly, especially on a network file system, and also 
on a local file system when there are lots of partitions, e.g. 1/10th of a second 
vs. seconds.

The attached Python code creates an example dataframe, saves parquet datasets with 
different hive partition structures (/y=/m=/d=, /y=/m=, or /dk=), and reads the 
datasets with and without filters to reproduce the issue. Observe the run time, 
and the directories and files accessed by the process in Process Monitor on 
Windows.
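
The attachment itself is not reproduced here; the read side is essentially of this form (the dataset path, partition key names and values below are illustrative):

{code:python}
# Sketch: read only the y=2021/m=1 partition via filters; ideally only that
# directory (and its files) would need to be touched.
import pyarrow.parquet as pq

table = pq.read_table("dataset_ymd", filters=[("y", "=", 2021), ("m", "=", 1)])
{code}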

For all three partition structures, Process Monitor shows that all directories are 
accessed regardless of use_legacy_dataset=True or False. When 
use_legacy_dataset=False, the parquet files in all directories are opened as well.
The argument validate_schema=False made a small time difference but still opens 
the partition directories; it is only supported when use_legacy_dataset=True and 
is not supported/passed in from the pandas read_parquet wrapper API.

The /y=/m= layout is faster since there is no daily partition, so there are fewer 
directories and files.


There was another related Stack Overflow question with an example:
[https://stackoverflow.com/questions/66339381/pyarrow-read-single-file-from-partitioned-parquet-dataset-is-unexpectedly-slow]
It included this comment on partition discovery:
{quote}It should get discovered automatically. pd.read_parquet calls 
pyarrow.parquet.read_table and the default partitioning behavior should be to 
discover hive-style partitions (i.e. the ones you have). The fact that you have 
to specify this means that discovery is failing. If you could create a 
reproducible example and submit it to Arrow JIRA it would be helpful.
– Pace, Feb 24 2021 at 18:55
{quote}
I wonder if there is already a related Jira for this.
I tried passing in the partitioning argument; it didn't help.
The pyarrow versions used were 1.0.1, 5, and 7.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15723) PyArrow segfault on orc write table

2022-02-17 Thread patrice (Jira)
patrice created ARROW-15723:
---

 Summary: PyArrow segfault on orc write table
 Key: ARROW-15723
 URL: https://issues.apache.org/jira/browse/ARROW-15723
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 7.0.0
Reporter: patrice


pyarrow segfaults when trying to write an ORC file from a table containing a NullArray.

{code:python}
from pyarrow import orc
import pyarrow as pa

a = pa.array([1, None, 3, None])
b = pa.array([None, None, None, None])
table = pa.table({"int64": a, "utf8": b})
orc.write_table(table, 'test.orc')
{code}

{noformat}
zsh: segmentation fault  python3
{noformat}
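
Casting the all-null column to a concrete type before writing may avoid the crash; a sketch (untested assumption, and the target type below is arbitrary):

{code:python}
# Untested sketch: replace the null-typed column with a string column of nulls
# before writing, so the ORC writer sees a concrete type.
table = table.set_column(1, "utf8", table["utf8"].cast(pa.string()))
orc.write_table(table, 'test.orc')
{code}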



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15722) [Java] Improve error message for ListVector with wrong number of children

2022-02-17 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-15722:


 Summary: [Java] Improve error message for ListVector with wrong 
number of children
 Key: ARROW-15722
 URL: https://issues.apache.org/jira/browse/ARROW-15722
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Bryan Cutler
Assignee: Bryan Cutler


If a ListVector is made without any children, the error message will say "Lists 
have only one child. Found: []".

The wording could be improved a little to let the user know what went wrong.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15721) [Website] Create subproject page for Arrow Flight

2022-02-17 Thread Ian Cook (Jira)
Ian Cook created ARROW-15721:


 Summary: [Website] Create subproject page for Arrow Flight
 Key: ARROW-15721
 URL: https://issues.apache.org/jira/browse/ARROW-15721
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Ian Cook


Currently there is not really a high-level landing page for Arrow Flight on the 
Arrow site. The closest thing is the [Arrow Flight 
RPC|https://arrow.apache.org/docs/format/Flight.html] page under 
*Specifications and Protocols* but that page does not really explain the "why" 
and quickly gets into technical details. Should we consider adding a new *Arrow 
Flight* page under {*}Subprojects{*}? We could populate the page with some 
content from the Arrow Flight and Arrow Flight SQL blog posts.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15720) [CI] Nightly dask build is failing due to wrong usage of Array.to_pandas

2022-02-17 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15720:
-

 Summary: [CI] Nightly dask build is failing due to wrong usage of 
Array.to_pandas
 Key: ARROW-15720
 URL: https://issues.apache.org/jira/browse/ARROW-15720
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Joris Van den Bossche


This failure is triggered by a change in Arrow (addition of {{types_mapper}} 
keyword to {{pa.Array.to_pandas}}), but the cause is a wrong usage of that in 
dask.
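
For context, {{types_mapper}} lets a caller map Arrow types to pandas extension dtypes; a minimal sketch of the keyword as it already works on a Table (the values here are purely illustrative):

{code:python}
# Sketch: map Arrow int64 to pandas' nullable Int64 extension dtype.
import pandas as pd
import pyarrow as pa

table = pa.table({"a": [1, None, 3]})
df = table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get)
{code}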

I already fixed that on the dask side: https://github.com/dask/dask/pull/8733

But we should still skip the test on our side (this will be needed until that PR 
is merged and released).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15719) [R] Simplify code for handling summarise() with no aggregations

2022-02-17 Thread Ian Cook (Jira)
Ian Cook created ARROW-15719:


 Summary: [R] Simplify code for handling summarise() with no 
aggregations
 Key: ARROW-15719
 URL: https://issues.apache.org/jira/browse/ARROW-15719
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Ian Cook
 Fix For: 8.0.0


Check whether ARROW-15609 enables us to remove code from 
{{{}[query-engine.R|https://github.com/apache/arrow/blob/master/r/R/query-engine.R]{}}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15718) [R] Joining two datasets crashes if use_threads=FALSE

2022-02-17 Thread Will Jones (Jira)
Will Jones created ARROW-15718:
--

 Summary: [R] Joining two datasets crashes if use_threads=FALSE
 Key: ARROW-15718
 URL: https://issues.apache.org/jira/browse/ARROW-15718
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, R
Affects Versions: 7.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 8.0.0


In ARROW-14908 we solved the case of joining a dataset to an in-memory table, 
but did not solve joining two datasets.

The previous solution was to add +1 to the thread count, because the hash join 
logic might be called by the scanner's IO thread. For joining more than 1 
dataset, we might have more than 1 IO thread, so we either need to add a larger 
arbitrary number or find a way to make the state logic more resilient to 
unexpected threads.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15717) [Docs] Add hash_one to the documentation

2022-02-17 Thread David Li (Jira)
David Li created ARROW-15717:


 Summary: [Docs] Add hash_one to the documentation
 Key: ARROW-15717
 URL: https://issues.apache.org/jira/browse/ARROW-15717
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: David Li


Follow up to ARROW-13993.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters

2022-02-17 Thread Lance Dacey (Jira)
Lance Dacey created ARROW-15716:
---

 Summary: [Dataset][Python] Parse a list of fragment paths to 
gather filters
 Key: ARROW-15716
 URL: https://issues.apache.org/jira/browse/ARROW-15716
 Project: Apache Arrow
  Issue Type: Wish
Affects Versions: 7.0.0
Reporter: Lance Dacey


Is it possible for partitioning.parse() to be updated to parse a list of paths 
instead of just a single path? 

I am passing the .paths from file_visitor to downstream tasks to process recently 
saved data, but this can break if I overwrite data with delete_matching in order 
to consolidate small files, since those paths will no longer exist.

Here is the output of my current approach to use filters instead of reading the 
paths directly:

{code:java}
# Fragments saved during write_dataset 
['dev/dataset/fragments/date_id=20210813/data-0.parquet', 
'dev/dataset/fragments/date_id=20210114/data-2.parquet', 
'dev/dataset/fragments/date_id=20210114/data-1.parquet', 
'dev/dataset/fragments/date_id=20210114/data-0.parquet']

# Run partitioning.parse() on each fragment
# (one Expression per path, e.g. (date_id == 20210813), (date_id == 20210114), ...)

# Format those expressions into a list of tuples
[('date_id', 'in', [20210114, 20210813])]

# Convert to an expression which is used as a filter in .to_table()
is_in(date_id, {value_set=int64:[
  20210114,
  20210813
], skip_nulls=false})
{code}

And here is how I am creating the filter from a list of .paths (perhaps there 
is a better way?):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet  # needed for pa.parquet._filters_to_expression

# partition_schema, paths and dataset come from the surrounding workflow
partitioning = ds.HivePartitioning(partition_schema)
expressions = []
for file in paths:
    expressions.append(partitioning.parse(file))

values = []
filters = []
for expression in expressions:
    partitions = ds._get_partition_keys(expression)
    if len(partitions.keys()) > 1:
        element = [(k, "==", v) for k, v in partitions.items()]
        if element not in filters:
            filters.append(element)
    else:
        for k, v in partitions.items():
            if v not in values:
                values.append(v)
        filters = [(k, "in", sorted(values))]

filt_exp = pa.parquet._filters_to_expression(filters)
dataset.to_table(filter=filt_exp)
{code}


My hope would be to do something like {{filt_exp = partitioning.parse(paths)}}, 
which would return a dataset expression.
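
In the meantime, the tuple round trip can perhaps be avoided by combining the parsed expressions directly (a sketch, assuming that OR-ing the per-path expressions gives the intended filter):

{code:python}
# Sketch: parse each fragment path and OR the resulting expressions together,
# then use the combined expression directly as the dataset filter.
# partitioning, paths and dataset as in the snippet above.
import functools
import operator

exprs = [partitioning.parse(path) for path in paths]
filt_exp = functools.reduce(operator.or_, exprs)
dataset.to_table(filter=filt_exp)
{code}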




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15715) [Go] ipc.Writer includes unnecessary offsets with sliced arrays

2022-02-17 Thread Chris Hoff (Jira)
Chris Hoff created ARROW-15715:
--

 Summary: [Go] ipc.Writer includes unnecessary offsets with sliced arrays
 Key: ARROW-15715
 URL: https://issues.apache.org/jira/browse/ARROW-15715
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Chris Hoff


PR incoming.

 

Sliced arrays will be serialized with unnecessary trailing offsets for values 
that were sliced off.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15714) [C++][Gandiva] Increase the protobuf recursion limit in gandiva protobuf parser

2022-02-17 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-15714:
--

 Summary: [C++][Gandiva] Increase the protobuf recursion limit in 
gandiva protobuf parser
 Key: ARROW-15714
 URL: https://issues.apache.org/jira/browse/ARROW-15714
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Projjal Chanda






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15713) [Java]: Create a sphinx plugin or java cookbook plugin to format code that appear on java cookbook

2022-02-17 Thread David Dali Susanibar Arce (Jira)
David Dali Susanibar Arce created ARROW-15713:
-

 Summary: [Java]: Create a sphinx plugin or java cookbook plugin to 
format code that appear on java cookbook
 Key: ARROW-15713
 URL: https://issues.apache.org/jira/browse/ARROW-15713
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: David Dali Susanibar Arce
Assignee: David Dali Susanibar Arce


Create a Sphinx plugin or Java cookbook plugin to format the code that appears in 
the Java cookbook.

Currently we add Java code as plain text; this needs to be enhanced to apply code 
formatting.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15712) [R] Add a {{type}} method for {{Expression}}

2022-02-17 Thread Jira
Dragoș Moldovan-Grünfeld created ARROW-15712:


 Summary: [R] Add a {{type}} method for {{Expression}} 
 Key: ARROW-15712
 URL: https://issues.apache.org/jira/browse/ARROW-15712
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Dragoș Moldovan-Grünfeld


This would allow for more consistent syntax when extracting the type of an 
expression.

A block like this:
{code:r}
if (inherits(x, "Expression")) {
  class <- x$type()$ToString()
} else {
  class <- type(x)$ToString()
}
{code}

would be simplified to 
{code:r}
class <- type(x)$ToString()
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15711) [C++][Parquet] Extension types with nanosecond timestamp resolution don't roundtrip

2022-02-17 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15711:
-

 Summary: [C++][Parquet] Extension types with nanosecond timestamp 
resolution don't roundtrip
 Key: ARROW-15711
 URL: https://issues.apache.org/jira/browse/ARROW-15711
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Parquet
Reporter: Joris Van den Bossche


Example code:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq


class MyTimestampType(pa.PyExtensionType):

    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.timestamp("ns"))

    def __reduce__(self):
        return MyTimestampType, ()


arr = MyTimestampType().wrap_array(pa.array([1000, 2000, 3000], pa.timestamp("ns")))
table = pa.table({"col": arr})
{code}

{code}
>>> table.schema
col: extension<arrow.py_extension_type<MyTimestampType>>

>>> pq.write_table(table, "test_parquet_extension_type_timestamp_ns.parquet")
>>> result = pq.read_table("test_parquet_extension_type_timestamp_ns.parquet")
>>> result.schema
col: timestamp[us]
{code}

The reason is that we only restore the extension type if the inferred storage 
type (inferred from parquet, after applying any updates based on the Arrow 
schema) exactly equals the original storage type (as stored in the Arrow schema):

https://github.com/apache/arrow/blob/afaa92e7e4289d6e4f302cc91810368794e8092b/cpp/src/parquet/arrow/schema.cc#L973-L977

And, with the default options, a timestamp with nanosecond resolution gets 
stored with microsecond resolution in Parquet, and that is something we do not 
restore when updating the read types based on the stored Arrow schema (e.g. we 
do add a timezone, but we don't change the resolution).
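
If that is correct, keeping nanosecond resolution on the Parquet side should let the extension type round trip; a sketch of that idea (an untested assumption, not verified here), reusing the table from the example above:

{code:python}
# Untested sketch: with version="2.6" Parquet can store nanosecond timestamps
# directly, so the inferred storage type should equal the original storage type
# and the extension type should be restored on read.
pq.write_table(table, "test_ns.parquet", version="2.6")
result = pq.read_table("test_ns.parquet")
print(result.schema)
{code}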

An additional issue is that _if_ you lose the extension type, the field metadata 
about the extension type is also lost. I think that if we cannot restore the 
extension type, we should at least try to keep the ARROW:extension field 
metadata as information.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)