[jira] [Created] (ARROW-15728) [Python] Zstd IPC test is flaky.
Micah Kornfield created ARROW-15728: --- Summary: [Python] Zstd IPC test is flaky. Key: ARROW-15728 URL: https://issues.apache.org/jira/browse/ARROW-15728 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Micah Kornfield Assignee: Micah Kornfield Our internal CI system shows flakes on the test at approximately a 2% rate. By reducing the integer range we can make this much less flaky (zero observed flakes in 5000 runs). -- This message was sent by Atlassian Jira (v8.20.1#820001)
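For context, a minimal sketch of the kind of zstd IPC roundtrip such a test exercises, with a deliberately narrowed integer range; the exact test, data shape, and bounds here are assumptions, not the actual test code:
{code:python}
import numpy as np
import pyarrow as pa

# Hypothetical reduced-range data; the real test and its bounds may differ.
rng = np.random.default_rng(42)
values = rng.integers(0, 10_000, size=10_000, dtype=np.int64)
table = pa.table({"ints": values})

# Round-trip through an IPC stream compressed with zstd.
sink = pa.BufferOutputStream()
options = pa.ipc.IpcWriteOptions(compression="zstd")
with pa.ipc.new_stream(sink, table.schema, options=options) as writer:
    writer.write_table(table)

result = pa.ipc.open_stream(sink.getvalue()).read_all()
assert result.equals(table)
{code}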
[jira] [Created] (ARROW-15727) [Python] Lists of MonthDayNano Interval can't be converted to Pandas
Micah Kornfield created ARROW-15727: --- Summary: [Python] Lists of MonthDayNano Interval can't be converted to Pandas Key: ARROW-15727 URL: https://issues.apache.org/jira/browse/ARROW-15727 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Micah Kornfield Assignee: Micah Kornfield -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15726) [R] Support push-down projection/filtering in datasets / dplyr
Weston Pace created ARROW-15726: --- Summary: [R] Support push-down projection/filtering in datasets / dplyr Key: ARROW-15726 URL: https://issues.apache.org/jira/browse/ARROW-15726 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Weston Pace The following query should read a single column from the target parquet file.

{noformat}
open_dataset("lineitem.parquet") %>%
  select(l_tax) %>%
  filter(l_tax < 0.01) %>%
  collect()
{noformat}

Furthermore, it should apply a pushdown filter to the source node, allowing parquet row groups to potentially filter out target data. At the moment it creates the following exec plan:

{noformat}
3:SinkNode{}
2:ProjectNode{projection=[l_tax]}
1:FilterNode{filter=(l_tax < 0.01)}
0:SourceNode{}
{noformat}

There is no projection or filter in the source node. As a result we end up reading much more data from disk (the entire file) than we need to (at most a single column). This _could_ be fixed via heuristics in the dplyr code. However, that may quickly get complex: for example, the projection comes after the filter, so you need to push down a projection that includes both the columns accessed by the filter and the columns accessed by the projection; a projection can also be pushed down through a join (which is legal), in which case you need to know which columns to apply to which source node. A more complete fix would be to call into some kind of 3rd party optimizer (e.g. Calcite) after the plan has been created by dplyr but before it is passed to the execution engine. -- This message was sent by Atlassian Jira (v8.20.1#820001)
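For comparison only (an illustration, not part of the report), the pyarrow dataset API expresses this kind of pushdown by handing the column list and filter directly to the scan, so only the requested column and matching row groups should be read:
{code:python}
import pyarrow.dataset as ds

# "lineitem.parquet" is the file from the R example above.
dataset = ds.dataset("lineitem.parquet", format="parquet")
table = dataset.to_table(columns=["l_tax"], filter=ds.field("l_tax") < 0.01)
{code}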
[jira] [Created] (ARROW-15725) [Python] Legacy dataset can't roundtrip Int64 with nulls if partitioned
Will Jones created ARROW-15725: -- Summary: [Python] Legacy dataset can't roundtrip Int64 with nulls if partitioned Key: ARROW-15725 URL: https://issues.apache.org/jira/browse/ARROW-15725 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 7.0.0, 4.0.0 Reporter: Will Jones If there is partitioning and the column has nulls, Int64 columns may not round trip successfully using the legacy datasets implementation. Simple reproduction:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import tempfile

table = pa.table({
    'x': pa.array([None, 7753285016841556620]),
    'y': pa.array(['a', 'b'])
})

ds_dir = tempfile.mkdtemp()
pq.write_to_dataset(table, ds_dir, partition_cols=['y'])
table_after = ds.dataset(ds_dir).to_table()
print(table['x'])
print(table_after['x'])
assert table['x'] == table_after['x']
{code}
{code}
[
  [
    null,
    7753285016841556620
  ]
]
[
  [
    null
  ],
  [
    7753285016841556992
  ]
]
{code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
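A possible workaround (my assumption, not verified in the report) is to write with the non-legacy dataset writer, which partitions without the pandas round trip that likely coerces a nullable int64 column to float64 and loses precision:
{code:python}
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({
    'x': pa.array([None, 7753285016841556620]),
    'y': pa.array(['a', 'b'])
})

ds_dir = tempfile.mkdtemp()
# Hive-style partitioning on 'y', mirroring partition_cols=['y'] above.
part = ds.partitioning(pa.schema([('y', pa.string())]), flavor="hive")
ds.write_dataset(table, ds_dir, format="parquet", partitioning=part)

table_after = ds.dataset(ds_dir, partitioning="hive").to_table()
print(table_after['x'])  # expected to preserve 7753285016841556620 exactly
{code}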
[jira] [Created] (ARROW-15724) reduce directory and file IO when reading partitioned parquet dataset
Yin created ARROW-15724: --- Summary: reduce directory and file IO when reading partitioned parquet dataset Key: ARROW-15724 URL: https://issues.apache.org/jira/browse/ARROW-15724 Project: Apache Arrow Issue Type: Improvement Reporter: Yin Attachments: pq.py

Hi, it seems Arrow accesses all partition directories (and even the parquet files), including those that clearly do not match the partition key values in the filters. This can make reading with partition key filters many times slower than accessing one partition directly, especially on a network file system, and also on a local file system when there are many partitions, e.g. a tenth of a second vs. several seconds.

The attached Python code (pq.py) creates an example dataframe, saves parquet datasets with different hive partition structures (/y=/m=/d=, /y=/m=, or /dk=), and reads the datasets with and without filters to reproduce the issue. Observe the run time, and the directories and files accessed by the process in Process Monitor on Windows. In all three partition structures, I saw in Process Monitor that all directories are accessed regardless of use_legacy_dataset=True or False. When use_legacy_dataset=False, the parquet files in all directories were opened. The argument validate_schema=False made a small time difference, but the partition directories are still opened; it is also only supported when use_legacy_dataset=True, and it is not supported/passed through from the pandas read_parquet wrapper API. The /y=/m= structure is faster since there is no daily partition, so there are fewer directories and files.

There was another related Stack Overflow question and example [https://stackoverflow.com/questions/66339381/pyarrow-read-single-file-from-partitioned-parquet-dataset-is-unexpectedly-slow] with a comment on the partition discovery:
{quote}It should get discovered automatically. pd.read_parquet calls pyarrow.parquet.read_table and the default partitioning behavior should be to discover hive-style partitions (i.e. the ones you have). The fact that you have to specify this means that discovery is failing. If you could create a reproducible example and submit it to Arrow JIRA it would be helpful. – Pace Feb 24 2021 at 18:55{quote}

I wonder if there is already a related Jira issue for this. I tried passing in the partitioning argument, but it didn't help. The pyarrow versions used were 1.0.1, 5, and 7. -- This message was sent by Atlassian Jira (v8.20.1#820001)
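For reference, a minimal sketch of the two access patterns being compared; the dataset path and partition values here are illustrative, not taken from the attached pq.py:
{code:python}
import pyarrow.parquet as pq

# Reading one partition directory directly: only that directory should be touched.
direct = pq.read_table("dataset/y=2021/m=1")

# Reading via partition-key filters: ideally only the matching directories would
# be accessed, but currently all partition directories (and files) are opened.
filtered = pq.read_table("dataset", filters=[("y", "=", 2021), ("m", "=", 1)])
{code}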
[jira] [Created] (ARROW-15723) PyArrow segfault on orc write table
patrice created ARROW-15723: --- Summary: PyArrow segfault on orc write table Key: ARROW-15723 URL: https://issues.apache.org/jira/browse/ARROW-15723 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 7.0.0 Reporter: patrice PyArrow segfaults when trying to write an ORC file from a table containing a null array.

{code:python}
from pyarrow import orc
import pyarrow as pa

a = pa.array([1, None, 3, None])
b = pa.array([None, None, None, None])
table = pa.table({"int64": a, "utf8": b})
orc.write_table(table, 'test.orc')
{code}

zsh: segmentation fault  python3 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15722) [Java] Improve error message for ListVector with wrong number of children
Bryan Cutler created ARROW-15722: Summary: [Java] Improve error message for ListVector with wrong number of children Key: ARROW-15722 URL: https://issues.apache.org/jira/browse/ARROW-15722 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Bryan Cutler Assignee: Bryan Cutler If a ListVector is made without any children, the error message will say "Lists have only one child. Found: []". The wording could be improved a little to let the user know what went wrong. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15721) [Website] Create subproject page for Arrow Flight
Ian Cook created ARROW-15721: Summary: [Website] Create subproject page for Arrow Flight Key: ARROW-15721 URL: https://issues.apache.org/jira/browse/ARROW-15721 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Ian Cook Currently there is not really a high-level landing page for Arrow Flight on the Arrow site. The closest thing is the [Arrow Flight RPC|https://arrow.apache.org/docs/format/Flight.html] page under *Specifications and Protocols* but that page does not really explain the "why" and quickly gets into technical details. Should we consider adding a new *Arrow Flight* page under {*}Subprojects{*}? We could populate the page with some content from the Arrow Flight and Arrow Flight SQL blog posts. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15720) [CI] Nightly dask build is failing due to wrong usage of Array.to_pandas
Joris Van den Bossche created ARROW-15720: - Summary: [CI] Nightly dask build is failing due to wrong usage of Array.to_pandas Key: ARROW-15720 URL: https://issues.apache.org/jira/browse/ARROW-15720 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Joris Van den Bossche This failure is triggered by a change in Arrow (addition of {{types_mapper}} keyword to {{pa.Array.to_pandas}}), but the cause is a wrong usage of that in dask. I already fixed that on the dask side: https://github.com/dask/dask/pull/8733 But we should still skip the test on our side (will be needed until that PR is merged + released) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15719) [R] Simplify code for handling summarise() with no aggregations
Ian Cook created ARROW-15719: Summary: [R] Simplify code for handling summarise() with no aggregations Key: ARROW-15719 URL: https://issues.apache.org/jira/browse/ARROW-15719 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Ian Cook Fix For: 8.0.0 Check whether ARROW-15609 enables us to remove code from [query-engine.R|https://github.com/apache/arrow/blob/master/r/R/query-engine.R]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15718) [R] Joining two datasets crashes if use_threads=FALSE
Will Jones created ARROW-15718: -- Summary: [R] Joining two datasets crashes if use_threads=FALSE Key: ARROW-15718 URL: https://issues.apache.org/jira/browse/ARROW-15718 Project: Apache Arrow Issue Type: Bug Components: C++, R Affects Versions: 7.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 8.0.0 In ARROW-14908 we solved the case of joining a dataset to an in memory table, but did not solve joining two datasets. The previous solution was to add +1 to the thread count, because the hash join logic might be called by the scanner's IO thread. For joining more than 1 dataset, we might have more than 1 IO thread, so we either need to add a larger arbitrary number or find a way to make the state logic more resilient to unexpected threads. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15717) [Docs] Add hash_one to the documentation
David Li created ARROW-15717: Summary: [Docs] Add hash_one to the documentation Key: ARROW-15717 URL: https://issues.apache.org/jira/browse/ARROW-15717 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: David Li Follow up to ARROW-13993. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters
Lance Dacey created ARROW-15716: --- Summary: [Dataset][Python] Parse a list of fragment paths to gather filters Key: ARROW-15716 URL: https://issues.apache.org/jira/browse/ARROW-15716 Project: Apache Arrow Issue Type: Wish Affects Versions: 7.0.0 Reporter: Lance Dacey Is it possible for partitioning.parse() to be updated to parse a list of paths instead of just a single path? I am passing the .paths from file_visitor to downstream tasks to process data which was recently saved, but I can run into problems with this if I overwrite data with delete_matching in order to consolidate small files, since the paths won't exist. Here is the output of my current approach, which uses filters instead of reading the paths directly:

{code:java}
# Fragments saved during write_dataset
['dev/dataset/fragments/date_id=20210813/data-0.parquet',
 'dev/dataset/fragments/date_id=20210114/data-2.parquet',
 'dev/dataset/fragments/date_id=20210114/data-1.parquet',
 'dev/dataset/fragments/date_id=20210114/data-0.parquet']

# Run partitioning.parse() on each fragment
[, , , ]

# Format those expressions into a list of tuples
[('date_id', 'in', [20210114, 20210813])]

# Convert to an expression which is used as a filter in .to_table()
is_in(date_id, {value_set=int64:[20210114, 20210813], skip_nulls=false})
{code}

And here is how I am creating the filter from a list of .paths (perhaps there is a better way?):

{code:python}
partitioning = ds.HivePartitioning(partition_schema)

expressions = []
for file in paths:
    expressions.append(partitioning.parse(file))

values = []
filters = []
for expression in expressions:
    partitions = ds._get_partition_keys(expression)
    if len(partitions.keys()) > 1:
        element = [(k, "==", v) for k, v in partitions.items()]
        if element not in filters:
            filters.append(element)
    else:
        for k, v in partitions.items():
            if v not in values:
                values.append(v)
        filters = [(k, "in", sorted(values))]

filt_exp = pa.parquet._filters_to_expression(filters)
dataset.to_table(filter=filt_exp)
{code}

My hope would be to do something like filt_exp = partitioning.parse(paths), which would return a dataset expression. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15715) [Go] ipc.Writer includes unnecessary offsets with sliced arrays
Chris Hoff created ARROW-15715: -- Summary: [Go] ipc.Writer includes unnecessary offsets with sliced arrays Key: ARROW-15715 URL: https://issues.apache.org/jira/browse/ARROW-15715 Project: Apache Arrow Issue Type: Bug Components: Go Reporter: Chris Hoff PR incoming. Sliced arrays will be serialized with unnecessary trailing offsets for values that were sliced off. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15714) [C++][Gandiva] Increase the protobuf recursion limit in gandiva protobuf parser
Projjal Chanda created ARROW-15714: -- Summary: [C++][Gandiva] Increase the protobuf recursion limit in gandiva protobuf parser Key: ARROW-15714 URL: https://issues.apache.org/jira/browse/ARROW-15714 Project: Apache Arrow Issue Type: Task Components: C++ - Gandiva Reporter: Projjal Chanda -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15713) [Java]: Create a sphinx plugin or java cookbook plugin to format code that appears in the java cookbook
David Dali Susanibar Arce created ARROW-15713: - Summary: [Java]: Create a sphinx plugin or java cookbook plugin to format code that appears in the java cookbook Key: ARROW-15713 URL: https://issues.apache.org/jira/browse/ARROW-15713 Project: Apache Arrow Issue Type: Sub-task Reporter: David Dali Susanibar Arce Assignee: David Dali Susanibar Arce Create a sphinx plugin or java cookbook plugin to format the code that appears in the Java cookbook. Currently we are adding Java code as plain text; this needs to be enhanced to apply code formatting. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15712) [R] Add a {{type}} method for {{Expression}}
Dragoș Moldovan-Grünfeld created ARROW-15712: Summary: [R] Add a {{type}} method for {{Expression}} Key: ARROW-15712 URL: https://issues.apache.org/jira/browse/ARROW-15712 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Dragoș Moldovan-Grünfeld This would allow for more consistent syntax when extracting the type of an expression. A block like this:

{code:r}
if (inherits(x, "Expression")) {
  class <- x$type()$ToString()
} else {
  class <- type(x)$ToString()
}
{code}

would be simplified to

{code:r}
class <- type(x)$ToString()
{code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15711) [C++][Parquet] Extension types with nanosecond timestamp resolution don't roundtrip
Joris Van den Bossche created ARROW-15711: - Summary: [C++][Parquet] Extension types with nanosecond timestamp resolution don't roundtrip Key: ARROW-15711 URL: https://issues.apache.org/jira/browse/ARROW-15711 Project: Apache Arrow Issue Type: Bug Components: C++, Parquet Reporter: Joris Van den Bossche Example code:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq


class MyTimestampType(pa.PyExtensionType):

    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.timestamp("ns"))

    def __reduce__(self):
        return MyTimestampType, ()


arr = MyTimestampType().wrap_array(pa.array([1000, 2000, 3000], pa.timestamp("ns")))
table = pa.table({"col": arr})
{code}
{code}
>>> table.schema
col: extension>

>>> pq.write_table(table, "test_parquet_extension_type_timestamp_ns.parquet")
>>> result = pq.read_table("test_parquet_extension_type_timestamp_ns.parquet")
>>> result.schema
col: timestamp[us]
{code}

The reason is that we only restore the extension type if the inferred storage type (inferred from parquet, after applying any updates based on the stored Arrow schema) exactly equals the original storage type (as stored in the Arrow schema): https://github.com/apache/arrow/blob/afaa92e7e4289d6e4f302cc91810368794e8092b/cpp/src/parquet/arrow/schema.cc#L973-L977

With the default options, a timestamp with nanosecond resolution gets stored with microsecond resolution in Parquet, and that is something we do not restore when updating the read types based on the stored Arrow schema (e.g. we do add a timezone, but we don't change the resolution).

An additional issue is that _if_ you lose the extension type, the field metadata about the extension type are also lost. I think that if we cannot restore the extension type, we should at least try to keep the ARROW:extension field metadata as information. -- This message was sent by Atlassian Jira (v8.20.1#820001)
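A possible workaround (my assumption, not something stated in the report) is to write with a Parquet format version that supports nanosecond timestamps, so the inferred storage type matches the original timestamp[ns] storage type and the extension type should be restorable. Reusing the table from the example above:
{code:python}
import pyarrow.parquet as pq

# Parquet format version 2.6 can store nanosecond timestamps natively,
# so no microsecond coercion should happen on write.
pq.write_table(table, "test_parquet_extension_type_timestamp_ns.parquet", version="2.6")
result = pq.read_table("test_parquet_extension_type_timestamp_ns.parquet")
print(result.schema)
{code}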