[jira] [Created] (ARROW-16984) [Ruby] Add support for installing Apache Arrow GLib automatically
Kouhei Sutou created ARROW-16984:

Summary: [Ruby] Add support for installing Apache Arrow GLib automatically
Key: ARROW-16984
URL: https://issues.apache.org/jira/browse/ARROW-16984
Project: Apache Arrow
Issue Type: Improvement
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
Fix For: 9.0.0

Fedora 37 or later will ship Apache Arrow GLib as {{libarrow-glib-devel}}: https://packages.fedoraproject.org/pkgs/libarrow/
[jira] [Created] (ARROW-16983) Delta byte array encoder broken due to memory leak
Matt DePero created ARROW-16983:

Summary: Delta byte array encoder broken due to memory leak
Key: ARROW-16983
URL: https://issues.apache.org/jira/browse/ARROW-16983
Project: Apache Arrow
Issue Type: Bug
Components: Go, Parquet
Reporter: Matt DePero

The `DeltaByteArrayEncoder` has a memory leak due to a bug in how `EstimatedDataEncodedSize` is calculated. `DeltaByteArrayEncoder` extends `encoder`, which calculates `EstimatedDataEncodedSize` by calling `Len()` on its `PooledBufferWriter` sink. `DeltaByteArrayEncoder`, however, never writes data to that sink; it writes to its `prefixEncoder` and `suffixEncoder` instead, causing `EstimatedDataEncodedSize` to always return zero.
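For illustration only, a minimal Python sketch of the failure pattern and one plausible fix (the actual code is Go; all names below are hypothetical stand-ins for the types described above). The subclass buffers its data where the base class's estimate never looks, so a fix could sum the child encoders' estimates instead:

{code:python}
class Encoder:
    """Base encoder: estimates encoded size from its own sink buffer."""
    def __init__(self):
        self.sink = bytearray()

    def estimated_data_encoded_size(self):
        return len(self.sink)


class DeltaByteArrayEncoder(Encoder):
    """Writes only to its prefix/suffix child encoders, never to
    self.sink, so the inherited estimate is always zero."""
    def __init__(self):
        super().__init__()
        self.prefix_encoder = Encoder()
        self.suffix_encoder = Encoder()

    # Hypothetical fix: report the sizes of the buffers that
    # actually hold the buffered data.
    def estimated_data_encoded_size(self):
        return (self.prefix_encoder.estimated_data_encoded_size()
                + self.suffix_encoder.estimated_data_encoded_size())
{code}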
[GitHub] [arrow-adbc] lidavidm merged pull request #29: Install the headers along with the driver manager
lidavidm merged PR #29:

URL: https://github.com/apache/arrow-adbc/pull/29
[jira] [Created] (ARROW-16982) Slow reading of partitioned parquet files from S3
Blaž Zupančič created ARROW-16982:

Summary: Slow reading of partitioned parquet files from S3
Key: ARROW-16982
URL: https://issues.apache.org/jira/browse/ARROW-16982
Project: Apache Arrow
Issue Type: Improvement
Components: Parquet, Python
Affects Versions: 8.0.0
Reporter: Blaž Zupančič

When reading partitioned files from S3 and using filters to select partitions, the reader sends list requests every time read_table() is called.

{code:python}
# partitioning: s3://bucket/year=x/month=y/day=z
from pyarrow import parquet

parquet.read_table('s3://bucket', filters=[('day', '=', 1)])  # lists s3 bucket
parquet.read_table('s3://bucket', filters=[('day', '=', 2)])  # lists again
{code}

This is not a problem if done once, but repeated calls to select different partitions lead to a large number of (slow and potentially expensive) S3 list requests. The current workaround is to list and filter the partition structure manually; however, this is nowhere near as convenient as using filters. If we know that the S3 prefixes have not changed, it should be possible to do the recursive list only once and load different data multiple times (using only S3 get requests). This should be possible with ParquetDataset, but the current implementation only accepts filters in the constructor, not in the read() method.
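As a sketch of the one-time-listing idea under the current pyarrow API (assuming hive-style partitioning and an unchanged bucket layout between reads), the dataset can be discovered once and then filtered repeatedly, so the recursive listing happens a single time:

{code:python}
import pyarrow.dataset as ds

# Discover the partition structure once (one recursive S3 listing).
dataset = ds.dataset("s3://bucket", format="parquet", partitioning="hive")

# Each filtered read now prunes partitions against the already-discovered
# file list and only issues S3 GET requests for the matching files.
day1 = dataset.to_table(filter=ds.field("day") == 1)
day2 = dataset.to_table(filter=ds.field("day") == 2)
{code}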
[GitHub] [arrow-adbc] lidavidm opened a new pull request, #29: Install the headers along with the driver manager
lidavidm opened a new pull request, #29:

URL: https://github.com/apache/arrow-adbc/pull/29

Right now we only install the shared library.
[jira] [Created] (ARROW-16981) [C++] Expose jemalloc statistics for logging
Rok Mihevc created ARROW-16981:

Summary: [C++] Expose jemalloc statistics for logging
Key: ARROW-16981
URL: https://issues.apache.org/jira/browse/ARROW-16981
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Rok Mihevc
Assignee: Rok Mihevc

This would enable us to log memory usage and diagnose out-of-memory issues.
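For context, a small sketch of the coarse counters pyarrow already exposes on its memory pools; the jemalloc statistics proposed here would add allocator-level detail beyond these:

{code:python}
import pyarrow as pa

pool = pa.default_memory_pool()

# Coarse, allocator-agnostic counters available today.
print(pool.backend_name)       # e.g. "jemalloc" where it is the default
print(pool.bytes_allocated())  # bytes currently allocated
print(pool.max_memory())       # high-water mark of allocated bytes
{code}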
[GitHub] [arrow-adbc] lidavidm merged pull request #27: Make AdbcConnectionNew 2-adic for consistency
lidavidm merged PR #27:

URL: https://github.com/apache/arrow-adbc/pull/27
[jira] [Created] (ARROW-16980) [Python] Results of running a substrait plan against a tpch data table written into parquet are all null
Richard Tia created ARROW-16980:

Summary: [Python] Results of running a substrait plan against a tpch data table written into parquet are all null
Key: ARROW-16980
URL: https://issues.apache.org/jira/browse/ARROW-16980
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Richard Tia
Attachments: lineitem.json

SQL:
{code:sql}
SELECT l_returnflag, l_linestatus FROM lineitem
{code}

Substrait plan type info for l_returnflag ({{fixedChar}} is an extension type):
{code:json}
{
  "fixedChar": {
    "length": 1,
    "typeVariationReference": 0,
    "nullability": "NULLABILITY_NULLABLE"
  }
}
{code}

Error:
{code}
pyarrow/table.pxi:1223: in pyarrow.lib.ChunkedArray.chunks.__get__
pyarrow/table.pxi:1241: in iterchunks
pyarrow/table.pxi:1185: in pyarrow.lib.ChunkedArray.chunk
pyarrow/public-api.pxi:200: in pyarrow.lib.pyarrow_wrap_array
E AttributeError: 'pyarrow.lib.BaseExtensionType' object has no attribute '__arrow_ext_class__'
{code}

Reproduction steps:
{code:python}
import os

import pyarrow as pa
import pyarrow.substrait as substrait
from pyarrow import json as pyarrow_json
from pyarrow.lib import tobytes

substrait_query = """ ... """  # the JSON plan shown below, as a string
data_dir = ...  # base directory for the test data (path elided in the original report)
json_file_path = os.path.join(data_dir, 'lineitem.json')
arrow_data_path_ipc = os.path.join(data_dir, 'substrait_data.arrow')
substrait_query = tobytes(substrait_query.replace("FILENAME_PLACEHOLDER", arrow_data_path_ipc))

# Save lineitem.json into an IPC arrow binary file
table = pyarrow_json.read_json(json_file_path)
with pa.ipc.RecordBatchFileWriter(arrow_data_path_ipc, schema=table.schema) as writer:
    writer.write_table(table)

# Run the substrait query plan
buf = pa._substrait._parse_json_plan(substrait_query)
reader = substrait.run_query(buf)
result = reader.read_all()
print(result.columns[0].chunks)
{code}

lineitem.json is attached.

Substrait query plan:
{code:json}
{
  "extensionUris": [],
  "extensions": [],
  "relations": [{
    "root": {
      "input": {
        "project": {
          "common": {},
          "input": {
            "read": {
              "common": {"direct": {}},
              "baseSchema": {
                "names": ["L_ORDERKEY", "L_PARTKEY", "L_SUPPKEY", "L_LINENUMBER",
                          "L_QUANTITY", "L_EXTENDEDPRICE", "L_DISCOUNT", "L_TAX",
                          "L_RETURNFLAG", "L_LINESTATUS", "L_SHIPDATE", "L_COMMITDATE",
                          "L_RECEIPTDATE", "L_SHIPINSTRUCT", "L_SHIPMODE", "L_COMMENT"],
                "struct": {
                  "types": [
                    {"i64": {"typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE"}},
                    {"i64": {"typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE"}},
                    {"i64": {"typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE"}},
                    {"i32": {"typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE"}},
                    {"decimal": {"scale": 0, "precision": 19, "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE"}},
                    {"decimal": {"scale": 0, "precision": 19, "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE"}},
                    {"decimal": {"scale": 0, "precision": 19, "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE"}},
                    {"decimal": {"scale": 0, "precision": 19, "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE"}},
                    {"fixedChar": {"length": 1, "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE"}},
                    {"fixedChar": {"length": 1, "typeVariationReference": 0, "nullability": "NULLABILITY_NULLABLE"}},
                    {"date": {
{code}
[jira] [Created] (ARROW-16979) [Java] Further Consolidate JNI compilation
Alessandro Molina created ARROW-16979:

Summary: [Java] Further Consolidate JNI compilation
Key: ARROW-16979
URL: https://issues.apache.org/jira/browse/ARROW-16979
Project: Apache Arrow
Issue Type: Improvement
Reporter: Alessandro Molina
Fix For: 10.0.0
[jira] [Created] (ARROW-16978) [C#] Intermittent Archery Failures
Raphael Taylor-Davies created ARROW-16978:

Summary: [C#] Intermittent Archery Failures
Key: ARROW-16978
URL: https://issues.apache.org/jira/browse/ARROW-16978
Project: Apache Arrow
Issue Type: Bug
Reporter: Raphael Taylor-Davies

We are seeing intermittent archery failures in arrow-rs - [here|https://github.com/apache/arrow-rs/runs/6987393626?check_suite_focus=true]:

{code}
FAILED TEST: datetime
C# producing, C# consuming
1 failures
  File "/arrow/dev/archery/archery/integration/runner.py", line 246, in _run_ipc_test_case
    run_binaries(producer, consumer, test_case)
  File "/arrow/dev/archery/archery/integration/runner.py", line 100, in run_gold
    return self._run_gold(gold_dir, consumer, test_case)
  File "/arrow/dev/archery/archery/integration/runner.py", line 322, in _run_gold
    consumer.stream_to_file(consumer_stream_path, consumer_file_path)
  File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in stream_to_file
    self.run_shell_command(cmd)
  File "/arrow/dev/archery/archery/integration/tester.py", line 49, in run_shell_command
    subprocess.check_call(cmd, shell=True)
  File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest --mode stream-to-file -a /tmp/tmpnhaxwkhj/d72f0c1c_0.14.1_datetime.gold.consumer_stream_as_file < /arrow/testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.stream' returned non-zero exit status 1.
{code}

It is possible that this has something to do with how we are running the archery tests, but I am at a loss as to how to debug the issue and would appreciate some input. I think it started around when [this|https://github.com/apache/arrow/pull/13279] was merged.
[jira] [Created] (ARROW-16977) [R] Update dataset row counting so no integer overflow on large datasets
Nicola Crane created ARROW-16977:

Summary: [R] Update dataset row counting so no integer overflow on large datasets
Key: ARROW-16977
URL: https://issues.apache.org/jira/browse/ARROW-16977
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Nicola Crane
[jira] [Created] (ARROW-16976) [R] Build linux binaries on older image (like manylinux2014)
Neal Richardson created ARROW-16976:

Summary: [R] Build linux binaries on older image (like manylinux2014)
Key: ARROW-16976
URL: https://issues.apache.org/jira/browse/ARROW-16976
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging, R
Reporter: Neal Richardson

ARROW-16752 observed that, even with newer compilers installed on centos 7, you can't use binaries built on ubuntu 18.04, because ubuntu 18.04 has glibc 2.27 while centos 7 only has 2.17. But if we built the binaries on centos 7 with devtoolset-7 or 8, all features could compile and the result would work with the older glibc. Binaries built against an older glibc are guaranteed to work with newer versions, and you can't simply upgrade glibc because that would break the system. So, for maximum compatibility, build with the oldest glibc. This is the same strategy the Python manylinux standards use (IIUC).
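To make the constraint concrete, one can list the versioned glibc symbols a built binary requires; a sketch, assuming objdump is installed, with a hypothetical library path:

{code:python}
import re
import subprocess

# Dump the dynamic symbol table and collect the GLIBC_x.y versions it
# references; a binary built on ubuntu 18.04 can require symbols up to
# glibc 2.27, which centos 7 (glibc 2.17) cannot provide.
out = subprocess.run(
    ["objdump", "-T", "libarrow.so"],  # hypothetical path to a built binary
    capture_output=True, text=True, check=True,
).stdout
versions = sorted(set(re.findall(r"GLIBC_([0-9.]+)", out)),
                  key=lambda v: tuple(map(int, v.split("."))))
print("requires glibc >=", versions[-1] if versions else "n/a")
{code}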