[jira] [Created] (ARROW-18359) PrettyPrint Improvements
Will Jones created ARROW-18359: -- Summary: PrettyPrint Improvements Key: ARROW-18359 URL: https://issues.apache.org/jira/browse/ARROW-18359 Project: Apache Arrow Issue Type: Improvement Components: C++, Python, R Reporter: Will Jones We have some pretty printing capabilities, but we may want to think at a high level about the design first. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18239) [C++][Docs] Add examples of Parquet TypedColumnWriter to user guide
Will Jones created ARROW-18239: -- Summary: [C++][Docs] Add examples of Parquet TypedColumnWriter to user guide Key: ARROW-18239 URL: https://issues.apache.org/jira/browse/ARROW-18239 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Will Jones Since this is the more performant way to write Parquet data without going through Arrow, we should show examples of it in the user guide. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18230) [Python] Pass CMake args to Python CPP
Will Jones created ARROW-18230: -- Summary: [Python] Pass CMake args to Python CPP Key: ARROW-18230 URL: https://issues.apache.org/jira/browse/ARROW-18230 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Will Jones Fix For: 11.0.0 We pass {{extra_cmake_args}} to {{_run_cmake}} (Cython build) but not to {{_run_cmake_pyarrow_cpp}} (PyArrow C++ build). We should probably be passing them to both. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18204) [R] Allow setting field metadata
Will Jones created ARROW-18204: -- Summary: [R] Allow setting field metadata Key: ARROW-18204 URL: https://issues.apache.org/jira/browse/ARROW-18204 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 10.0.0 Reporter: Will Jones Currently, you can't create a {{Field}} with metadata, which makes it hard to write tests involving field metadata. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17994) [C++] Add overflow argument is required when it shouldn't be
Will Jones created ARROW-17994: -- Summary: [C++] Add overflow argument is required when it shouldn't be Key: ARROW-17994 URL: https://issues.apache.org/jira/browse/ARROW-17994 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Will Jones Fix For: 11.0.0 If I pass a Substrait plan that contains an add function but don't provide the overflow enum argument, I get the following error: {code:none} Traceback (most recent call last): File "", line 1, in File "pyarrow/_substrait.pyx", line 140, in pyarrow._substrait.run_query File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Expected Substrait call to have an enum argument at index 0 but the argument was not an enum. /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:684 call.GetEnumArg(arg_index) /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:702 ParseEnumArg(call, 0, kOverflowParser) /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:332 FromProto(expr, ext_set, conversion_options) /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/serde.cc:156 FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), ext_set, conversion_options) /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:106 engine::DeserializePlans(substrait_buffer, consumer_factory, registry, nullptr, conversion_options_) /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:130 executor.Init(substrait_buffer, registry) {code} Yet in the spec, this argument is supposed to be optional: https://github.com/substrait-io/substrait/blob/f3f6bdc947e689e800279666ff33f118e42d2146/extensions/functions_arithmetic.yaml#L11 If I modify the plan to include the argument, it works as expected.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17963) [C++] Implement cast_dictionary for string
Will Jones created ARROW-17963: -- Summary: [C++] Implement cast_dictionary for string Key: ARROW-17963 URL: https://issues.apache.org/jira/browse/ARROW-17963 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Will Jones Fix For: 11.0.0 We can cast dictionary(string, X) to string, but not the other way around. {code:R} > Array$create(c("a", "b"))$cast(dictionary(int32(), string())) Error: NotImplemented: Unsupported cast from string to dictionary using function cast_dictionary /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/compute/function.cc:249 func.DispatchBest(_types) > Array$create(as.factor(c("a", "b")))$cast(string()) Array [ "a", "b" ] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17954) [R] Update News for 10.0.0
Will Jones created ARROW-17954: -- Summary: [R] Update News for 10.0.0 Key: ARROW-17954 URL: https://issues.apache.org/jira/browse/ARROW-17954 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Will Jones Assignee: Will Jones Fix For: 10.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17944) [Python] Accept bytes object in pyarrow.substrait.run_query
Will Jones created ARROW-17944: -- Summary: [Python] Accept bytes object in pyarrow.substrait.run_query Key: ARROW-17944 URL: https://issues.apache.org/jira/browse/ARROW-17944 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Will Jones Fix For: 11.0.0 {{pyarrow.substrait.run_query()}} only accepts a PyArrow buffer, and will segfault if something else is passed. People might try to pass a Python bytes object, which isn't unreasonable. For example, they might use the value returned by protobuf's {{SerializeToString()}} method, which is Python bytes. At the very least, we should not segfault. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17923) [C++] Consider dictionary arrays for special fragment fields
Will Jones created ARROW-17923: -- Summary: [C++] Consider dictionary arrays for special fragment fields Key: ARROW-17923 URL: https://issues.apache.org/jira/browse/ARROW-17923 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Will Jones I noticed in ARROW-15281 we made {{__filename}} a string column. In common cases, this will be inefficient if materialized. If possible, it may be better to have them be dictionary arrays. As an example, [here|https://github.com/apache/arrow/pull/12826#issuecomment-1230745059] is a user report of 10x increased memory usage caused by accidentally including these special fragment columns. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17897) [Packaging][Conan] Add back ARROW_GCS to conanfile.py
Will Jones created ARROW-17897: -- Summary: [Packaging][Conan] Add back ARROW_GCS to conanfile.py Key: ARROW-17897 URL: https://issues.apache.org/jira/browse/ARROW-17897 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Will Jones Assignee: Will Jones Fix For: 10.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17845) [CI][Conan] Re-enable Flight in Conan CI check
Will Jones created ARROW-17845: -- Summary: [CI][Conan] Re-enable Flight in Conan CI check Key: ARROW-17845 URL: https://issues.apache.org/jira/browse/ARROW-17845 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Will Jones Assignee: Will Jones -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17812) [C++][Documentation] Add Gandiva User Guide
Will Jones created ARROW-17812: -- Summary: [C++][Documentation] Add Gandiva User Guide Key: ARROW-17812 URL: https://issues.apache.org/jira/browse/ARROW-17812 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva Reporter: Will Jones Assignee: Will Jones Fix For: 10.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17788) [R][Doc] Add example of using Scanner
Will Jones created ARROW-17788: -- Summary: [R][Doc] Add example of using Scanner Key: ARROW-17788 URL: https://issues.apache.org/jira/browse/ARROW-17788 Project: Apache Arrow Issue Type: Improvement Components: Documentation, R Affects Versions: 9.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 10.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17776) [C++] Stabilize Parquet ArrowReaderProperties
Will Jones created ARROW-17776: -- Summary: [C++] Stabilize Parquet ArrowReaderProperties Key: ARROW-17776 URL: https://issues.apache.org/jira/browse/ARROW-17776 Project: Apache Arrow Issue Type: Improvement Components: C++, Parquet Affects Versions: 9.0.0 Reporter: Will Jones {{ArrowReaderProperties}} is still marked experimental, but it's pretty widely used at this point. One change we might wish to make before stabilizing its API: the {{ArrowWriterProperties}} class uses a namespaced builder class, which provides a nice syntax for creation and enforces immutability of the final properties. Perhaps we should mirror that design in the reader properties? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17441) [Python] Memory kept after del and pool.release_unused()
Will Jones created ARROW-17441: -- Summary: [Python] Memory kept after del and pool.release_unused() Key: ARROW-17441 URL: https://issues.apache.org/jira/browse/ARROW-17441 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 9.0.0 Reporter: Will Jones I was trying to reproduce another issue involving memory pools not releasing memory, but encountered this confusing behavior: if I create a table, then call {{del tab}}, and then {{pool.release_unused()}}, I still see significant memory usage. On mimalloc in particular, I see no meaningful drop in memory usage on either call. Am I missing something? {code:python} import os import psutil import time import gc process = psutil.Process(os.getpid()) import numpy as np from uuid import uuid4 import pyarrow as pa def gen_batches(n_groups=200, rows_per_group=200_000): for _ in range(n_groups): id_val = uuid4().bytes yield pa.table({ "x": np.random.random(rows_per_group), # This will compress poorly "y": np.random.random(rows_per_group), "a": pa.array(list(range(rows_per_group)), type=pa.int32()), # This compresses with delta encoding "id": pa.array([id_val] * rows_per_group), # This compresses with RLE }) def print_rss(): print(f"RSS: {process.memory_info().rss:,} bytes") print(f"memory_pool={pa.default_memory_pool().backend_name}") print_rss() print("reading table") tab = pa.concat_tables(list(gen_batches())) print_rss() print("deleting table") del tab gc.collect() print_rss() print("releasing unused memory") pa.default_memory_pool().release_unused() print_rss() print("waiting 10 seconds") time.sleep(10) print_rss() {code} {code:none} > ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool.py && \ ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool.py && \ ARROW_DEFAULT_MEMORY_POOL=system python test_pool.py memory_pool=mimalloc RSS: 44,449,792 bytes reading table RSS: 1,819,557,888 bytes deleting table RSS: 1,819,590,656 bytes releasing unused memory RSS: 1,819,852,800 bytes waiting 10 seconds RSS: 1,819,852,800 bytes memory_pool=jemalloc RSS: 45,629,440 bytes reading table RSS: 1,668,677,632 bytes deleting table RSS: 698,400,768 bytes releasing unused memory RSS: 699,023,360 bytes waiting 10 seconds RSS: 699,023,360 bytes memory_pool=system RSS: 44,875,776 bytes reading table RSS: 1,713,569,792 bytes deleting table RSS: 540,311,552 bytes releasing unused memory RSS: 540,311,552 bytes waiting 10 seconds RSS: 540,311,552 bytes {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17401) [C++] Add ReadTable method to RecordBatchFileReader
Will Jones created ARROW-17401: -- Summary: [C++] Add ReadTable method to RecordBatchFileReader Key: ARROW-17401 URL: https://issues.apache.org/jira/browse/ARROW-17401 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 9.0.0 Reporter: Will Jones For convenience, it would be helpful to add a method for reading the entire file as a table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17400) [C++] Move Parquet APIs to use Result instead of Status
Will Jones created ARROW-17400: -- Summary: [C++] Move Parquet APIs to use Result instead of Status Key: ARROW-17400 URL: https://issues.apache.org/jira/browse/ARROW-17400 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 9.0.0 Reporter: Will Jones Notably, IPC and CSV have "open file" methods that return {{Result}}, while opening a Parquet file requires passing in an out variable. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17349) [C++] Support casting field names of list and map when nested
Will Jones created ARROW-17349: -- Summary: [C++] Support casting field names of list and map when nested Key: ARROW-17349 URL: https://issues.apache.org/jira/browse/ARROW-17349 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 9.0.0 Reporter: Will Jones Fix For: 10.0.0 Different Parquet implementations use different names for the internal fields of ListType and MapType, which can sometimes cause spurious conflicts. For example, we use {{item}} as the field name for list, but Spark uses {{element}}. Fortunately, we can automatically cast between List and Map types with different field names. Unfortunately, it only works at the top level. We should get it to work at arbitrary levels of nesting. This was discovered in delta-rs: https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285 Here's a reproduction in Python: {code:Python} import pyarrow as pa import pyarrow.parquet as pq import pyarrow.dataset as ds def roundtrip_scanner(in_arr, out_type): table = pa.table({"arr": in_arr}) pq.write_table(table, "test.parquet") schema = pa.schema({"arr": out_type}) ds.dataset("test.parquet", schema=schema).to_table() # MapType ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32()) ty = pa.map_(pa.int32(), pa.int32()) arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named) roundtrip_scanner(arr_named, ty) # ListType ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False)) ty = pa.list_(pa.int32()) arr_named = pa.array([[1, 2, 4]], type=ty_named) roundtrip_scanner(arr_named, ty) # Combination MapType and ListType ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", pa.int32(), nullable=True)), nullable=False)) ty = pa.map_(pa.string(), pa.list_(pa.int32())) arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named) roundtrip_scanner(arr_named, ty) # Traceback (most recent call last): # File "", line 1, in # File "", line 5, in roundtrip_scanner # File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table # File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table # File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status # File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map> from map ('arr')> {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17343) [Docs][C++] Add missing methods to ArrayBuilders API Reference
Will Jones created ARROW-17343: -- Summary: [Docs][C++] Add missing methods to ArrayBuilders API Reference Key: ARROW-17343 URL: https://issues.apache.org/jira/browse/ARROW-17343 Project: Apache Arrow Issue Type: Improvement Components: Documentation Affects Versions: 9.0.0 Reporter: Will Jones Fix For: 10.0.0 At the very least, {{StructBuilder}} doesn't show its {{num_fields()}} and {{field_builder()}} methods. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17310) [C++] Expose SimpleRecordBatchReader publicly
Will Jones created ARROW-17310: -- Summary: [C++] Expose SimpleRecordBatchReader publicly Key: ARROW-17310 URL: https://issues.apache.org/jira/browse/ARROW-17310 Project: Apache Arrow Issue Type: New Feature Components: C++ Affects Versions: 9.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 10.0.0 It's unclear why this isn't public to begin with. Perhaps at the time, Iterator wasn't considered public, but now we are using it in public headers. https://github.com/apache/arrow/blob/916417da0a966797c453126f57b657a0449651b5/cpp/src/arrow/record_batch.cc#L359 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17298) [C++][Docs] Add Acero project example in Getting Started Section
Will Jones created ARROW-17298: -- Summary: [C++][Docs] Add Acero project example in Getting Started Section Key: ARROW-17298 URL: https://issues.apache.org/jira/browse/ARROW-17298 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Will Jones Fix For: 10.0.0 From [~westonpace]: {quote} A request I've seen a few times (and just received now) has been... Can you point me at a sample C++ starter project that links against Acero? For example, I tend to use a CMakeLists.txt that looks something like... {code} cmake_minimum_required(VERSION 3.10) set(CMAKE_MODULE_PATH "${CMAKE_SOURCE_DIR}/cmake;${CMAKE_MODULE_PATH}") set(CMAKE_CXX_FLAGS "-Wall -Wextra") # set(CMAKE_CXX_FLAGS_DEBUG "-g") set(CMAKE_CXX_FLAGS_RELEASE "-O3") # set the project name project(Experiments VERSION 1.0) # specify the C++ standard set(CMAKE_CXX_STANDARD 17) set(CMAKE_CXX_STANDARD_REQUIRED True) set(CMAKE_EXPORT_COMPILE_COMMANDS ON) if(NOT DEFINED CONDA_HOME) message(FATAL_ERROR "CONDA_HOME is a required variable") endif() include_directories(SYSTEM ${CONDA_HOME}/include) link_directories(${CONDA_HOME}/lib64) link_directories(${CONDA_HOME}/lib) function(experiment TARGET) add_executable( ${TARGET} ${TARGET}.cc ) target_link_libraries( ${TARGET} arrow arrow_dataset parquet aws-cpp-sdk-core aws-cpp-sdk-s3 glog pthread re2 utf8proc lz4 snappy z zstd aws-cpp-sdk-identity-management thrift ) if (MSVC) target_compile_options(${TARGET} PRIVATE /W4 /WX) else () target_compile_options(${TARGET} PRIVATE -Wall -Wextra -Wpedantic -Werror) endif () endfunction() experiment(arrow_16642) {code} {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17295) [C++] Build separate bundled_dependencies.so
Will Jones created ARROW-17295: -- Summary: [C++] Build separate bundled_dependencies.so Key: ARROW-17295 URL: https://issues.apache.org/jira/browse/ARROW-17295 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 8.0.1, 8.0.0 Reporter: Will Jones When building Arrow _static_ libraries with bundled dependencies, we produce {{arrow_bundled_dependencies.a}}. But when building dynamic libraries, the bundled dependencies are statically linked directly into the Arrow libraries (libarrow, libarrow_flight, etc.). This means that users can access the symbols of bundled dependencies in the static case, but not in the dynamic library case. One use case for this is passing gRPC configuration to a Flight server, which requires access to gRPC symbols. Could we change the dynamic library build to produce an {{arrow_bundled_dependencies.so}} so that the symbols are accessible? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17188) [R] Update news for 9.0.0
Will Jones created ARROW-17188: -- Summary: [R] Update news for 9.0.0 Key: ARROW-17188 URL: https://issues.apache.org/jira/browse/ARROW-17188 Project: Apache Arrow Issue Type: New Feature Components: R Affects Versions: 9.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 9.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17152) [Docs] Enable dark mode on documentation site
Will Jones created ARROW-17152: -- Summary: [Docs] Enable dark mode on documentation site Key: ARROW-17152 URL: https://issues.apache.org/jira/browse/ARROW-17152 Project: Apache Arrow Issue Type: New Feature Reporter: Will Jones Fix For: 10.0.0 Attachments: Screen Shot 2022-07-20 at 3.10.51 PM.png, Screen Shot 2022-07-20 at 3.12.18 PM.png pydata-sphinx-theme added dark mode in version 0.9.0. We will need to adapt our logo ([see docs|https://pydata-sphinx-theme.readthedocs.io/en/stable/user_guide/configuring.html?highlight=dark#different-logos-for-light-and-dark-mode]). There are also some places in the docs where we may need to adjust additional CSS. See attached screenshots. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17151) [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode
Will Jones created ARROW-17151: -- Summary: [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode Key: ARROW-17151 URL: https://issues.apache.org/jira/browse/ARROW-17151 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Will Jones Fix For: 9.0.0 pydata-sphinx-theme introduced automatic dark mode. However, there are several changes we need to make (such as providing a dark-mode Arrow logo) before we are ready for it. For the 9.0.0 release, we should instead pin to the version of pydata-sphinx-theme just before that release. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17150) [R] Allow statically linked libcurl in GCS when building libarrow DLL in RTools
Will Jones created ARROW-17150: -- Summary: [R] Allow statically linked libcurl in GCS when building libarrow DLL in RTools Key: ARROW-17150 URL: https://issues.apache.org/jira/browse/ARROW-17150 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 9.0.0 Reporter: Will Jones Fix For: 10.0.0 Neal's patch in ARROW-16510 enabled libcurl to be linked statically into the Google Cloud Storage dependency, but this only seems to work for static libraries on RTools (Windows). Development RTools environments currently use dynamic Arrow libraries instead, and we get libcurl linking errors when ARROW_GCS is on. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17149) [R] Enable GCS tests for Windows
Will Jones created ARROW-17149: -- Summary: [R] Enable GCS tests for Windows Key: ARROW-17149 URL: https://issues.apache.org/jira/browse/ARROW-17149 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Affects Versions: 9.0.0 Reporter: Will Jones Fix For: 10.0.0 In ARROW-16879, I found the GCS tests were hanging in CI, but couldn't diagnose why. We should solve that and enable the tests. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17097) [C++] GCS: report common prefixes as directories
Will Jones created ARROW-17097: -- Summary: [C++] GCS: report common prefixes as directories Key: ARROW-17097 URL: https://issues.apache.org/jira/browse/ARROW-17097 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 8.0.0, 9.0.0 Reporter: Will Jones Fix For: 10.0.0 I got confused by the behavior differences between S3 and GCS, only to realize GCS only reports special directory markers as "directories" and not the common prefixes. This can have the effect of making a directory look empty in GCS, when it in fact has many folders (see example below). We currently use the [ListObjects|https://github.com/googleapis/google-cloud-cpp/blob/e0228233949873be9cba87ae4e37554e1ff0474d/google/cloud/storage/client.h#L974] method, but perhaps it would be more appropriate to use the [ListObjectsWithPrefix|https://github.com/googleapis/google-cloud-cpp/blob/e0228233949873be9cba87ae4e37554e1ff0474d/google/cloud/storage/client.h#L1006] method. Since the prefixes are returned in the [same API call|https://cloud.google.com/storage/docs/json_api/v1/objects/list], it shouldn't add much overhead. {code:r} library(arrow) bucket <- gs_bucket("voltrondata-labs-datasets", retry_limit_seconds = 3, anonymous = TRUE) s3_bucket <- s3_bucket("voltrondata-labs-datasets", endpoint_override = "https://storage.googleapis.com") # We did not create directory markers when uploading the data # https://github.com/apache/arrow/pull/11842#discussion_r764204767 # The directory appears empty to GCSFileSystem... bucket$ls("nyc-taxi") #> character(0) # ... but S3FileSystem knows otherwise! s3_bucket$ls("nyc-taxi") #> [1] "nyc-taxi/year=2009" "nyc-taxi/year=2010" "nyc-taxi/year=2011" #> [4] "nyc-taxi/year=2012" "nyc-taxi/year=2013" "nyc-taxi/year=2014" #> [7] "nyc-taxi/year=2015" "nyc-taxi/year=2016" "nyc-taxi/year=2017" #> [10] "nyc-taxi/year=2018" "nyc-taxi/year=2019" "nyc-taxi/year=2020" #> [13] "nyc-taxi/year=2021" "nyc-taxi/year=2022" # Using GCS API, we only get files! bucket$ls("nyc-taxi", recursive = TRUE) #> [1] "nyc-taxi/year=2009/month=1/part-0.parquet" #> [2] "nyc-taxi/year=2009/month=10/part-0.parquet" #> ... #> [157] "nyc-taxi/year=2022/month=1/part-0.parquet" #> [158] "nyc-taxi/year=2022/month=2/part-0.parquet" # Using S3 API, we can get directories! s3_bucket$ls("nyc-taxi", recursive = TRUE) #> [1] "nyc-taxi/year=2009" #> [2] "nyc-taxi/year=2009/month=1" #> [3] "nyc-taxi/year=2009/month=1/part-0.parquet" #> [4] "nyc-taxi/year=2009/month=10" #> [5] "nyc-taxi/year=2009/month=10/part-0.parquet" #> [6] "nyc-taxi/year=2009/month=11" #> ... #> [329] "nyc-taxi/year=2022/month=2" #> [330] "nyc-taxi/year=2022/month=2/part-0.parquet" {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17075) [C++] HDFS tests broken by trailing slash tests
Will Jones created ARROW-17075: -- Summary: [C++] HDFS tests broken by trailing slash tests Key: ARROW-17075 URL: https://issues.apache.org/jira/browse/ARROW-17075 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Will Jones Assignee: Will Jones Fix For: 9.0.0 https://github.com/apache/arrow/pull/13577#issuecomment-1184541864 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17069) [Python][R] GCSFileSystem reports cannot resolve host on public buckets
Will Jones created ARROW-17069: -- Summary: [Python][R] GCSFileSystem reports cannot resolve host on public buckets Key: ARROW-17069 URL: https://issues.apache.org/jira/browse/ARROW-17069 Project: Apache Arrow Issue Type: Bug Components: Python, R Affects Versions: 8.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 9.0.0 GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply {{anonymous}} as the user: {code:python} import pyarrow.dataset as ds # Fails: dataset = ds.dataset("gs://anonymous@voltrondata-labs-datasets/taxi-data/?retry_limit_seconds=3") # Traceback (most recent call last): # File "", line 1, in # File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 749, in dataset # return _filesystem_dataset(source, **kwargs) # File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 441, in _filesystem_dataset # fs, paths_or_selector = _ensure_single_source(source, filesystem) # File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 417, in _ensure_single_source # raise FileNotFoundError(path) # FileNotFoundError: voltrondata-labs-datasets/taxi-data # This works fine: dataset = ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3") {code} I would expect that we could connect. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17047) [Python][Docs] Document how to get field from StructType
Will Jones created ARROW-17047: -- Summary: [Python][Docs] Document how to get field from StructType Key: ARROW-17047 URL: https://issues.apache.org/jira/browse/ARROW-17047 Project: Apache Arrow Issue Type: Improvement Components: Documentation Affects Versions: 8.0.0 Reporter: Will Jones It's not at all obvious how to get a particular field from a StructType from its API page: https://arrow.apache.org/docs/python/generated/pyarrow.StructType.html#pyarrow.StructType We should add an example: {code:python} struct_type = pa.struct({"x": pa.int32(), "y": pa.string()}) struct_type[0] # pyarrow.Field pa.schema(list(struct_type)) # x: int32 # y: string {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17045) [C++] GCS doesn't drop ending slash for files
Will Jones created ARROW-17045: -- Summary: [C++] GCS doesn't drop ending slash for files Key: ARROW-17045 URL: https://issues.apache.org/jira/browse/ARROW-17045 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 8.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 9.0.0 There is inconsistent behavior between GCS and S3 when it comes to creating files. Example: {code:python} import pyarrow.fs from pyarrow.fs import FileSelector from datetime import timedelta gcs = pyarrow.fs.GcsFileSystem( endpoint_override="localhost:9001", scheme="http", anonymous=True, retry_time_limit=timedelta(seconds=1), ) gcs.create_dir("py_test") with gcs.open_output_stream("py_test/test.txt") as out_stream: out_stream.write(b"Hello world!") with gcs.open_output_stream("py_test/test.txt/") as out_stream: out_stream.write(b"Hello world!") gcs.get_file_info(FileSelector("py_test")) # [<FileInfo for 'py_test/test.txt': type=FileType.File, size=12>, <FileInfo for 'py_test/test.txt/': type=FileType.File, size=12>] s3 = pyarrow.fs.S3FileSystem( access_key="minioadmin", secret_key="minioadmin", scheme="http", endpoint_override="localhost:9000", allow_bucket_creation=True, allow_bucket_deletion=True, ) s3.create_dir("py-test") with s3.open_output_stream("py-test/test.txt") as out_stream: out_stream.write(b"Hello world!") with s3.open_output_stream("py-test/test.txt/") as out_stream: out_stream.write(b"Hello world!") s3.get_file_info(FileSelector("py-test")) # [<FileInfo for 'py-test/test.txt': type=FileType.File, size=12>] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17020) [Python][R] GcsFileSystem can appear to hang for non-permanent errors
Will Jones created ARROW-17020: -- Summary: [Python][R] GcsFileSystem can appear to hang for non-permanent errors Key: ARROW-17020 URL: https://issues.apache.org/jira/browse/ARROW-17020 Project: Apache Arrow Issue Type: Bug Components: Python, R Affects Versions: 8.0.0 Reporter: Will Jones GcsFileSystem will attempt to retry if it gets a non-permanent error (such as being unable to connect to the server). That's fine, except: (1) the sleep call used by the retry doesn't seem to check for interrupts, and (2) the default retry timeout is 15 minutes! The following snippets will hang for 15 minutes if you run them, wait about 5 seconds, and then try a keyboard interrupt (CTRL+C): {code:bash} Rscript -e 'library(arrow); fs <- GcsFileSystem$create(endpoint_override="localhost:1234", anonymous=TRUE); fs$CreateDir("x")' python -c 'from pyarrow.fs import GcsFileSystem; fs = GcsFileSystem(endpoint_override="localhost:1234", anonymous=True); fs.create_dir("x")' {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-16936) [C++] arrow_bundled_dependencies missing Flight absl dependencies
Will Jones created ARROW-16936: -- Summary: [C++] arrow_bundled_dependencies missing Flight absl dependencies Key: ARROW-16936 URL: https://issues.apache.org/jira/browse/ARROW-16936 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 8.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 9.0.0 Attachments: absl_build_errors.txt If Flight is linked statically, it seems to miss some abseil dependencies. I created a repo to reproduce this issue: [https://github.com/wjones127/arrow-cpp-external-proj] The build fails with the linking errors attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-16914) [Docs][C++] Add example of using ExternalProject_Add to use Arrow
Will Jones created ARROW-16914: -- Summary: [Docs][C++] Add example of using ExternalProject_Add to use Arrow Key: ARROW-16914 URL: https://issues.apache.org/jira/browse/ARROW-16914 Project: Apache Arrow Issue Type: Improvement Components: C++, Documentation Affects Versions: 8.0.0 Reporter: Will Jones We've [given advice|https://stackoverflow.com/a/59939033/2048858] to use {{ExternalProject_Add}} to build Arrow from source within a user's CMake project. (Correct me if I am wrong that that is the preferred method now.) But I found it non-trivial to implement this. We should add a simple example of doing this to the User Guide. We should also mention that we don't support {{add_subdirectory}}, as that seems to be a common gotcha as well. This might overlap with ARROW-9740, but I don't quite understand what that issue is proposing. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16887) [Doc][R] Document GCSFileSystem for R package
Will Jones created ARROW-16887: -- Summary: [Doc][R] Document GCSFileSystem for R package Key: ARROW-16887 URL: https://issues.apache.org/jira/browse/ARROW-16887 Project: Apache Arrow Issue Type: Improvement Components: Documentation, R Reporter: Will Jones Fix For: 9.0.0 We should update the [cloud storage vignette|https://arrow.apache.org/docs/r/articles/fs.html] and the filesystem RD to show configuration and usage of GCSFileSystem. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16870) [C++] lld does not like --as-needed flag
Will Jones created ARROW-16870: -- Summary: [C++] lld does not like --as-needed flag Key: ARROW-16870 URL: https://issues.apache.org/jira/browse/ARROW-16870 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 8.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 9.0.0 I've been getting this annoying linking error if I try to build examples using Clang 13 on MacOS: {code:none} [build] [807/827] Linking CXX executable debug/flight-grpc-example [build] FAILED: debug/flight-grpc-example [build] : && /Library/Developer/CommandLineTools/usr/bin/c++ -Qunused-arguments -fcolor-diagnostics ... [build] ld: unknown option: --no-as-needed [build] clang: error: linker command failed with exit code 1 (use -v to see invocation) {code} Should we drop {{--as-needed}}, or should I add a carve-out for Apple? cc [~davidli] My workaround has been to comment out these lines: https://github.com/apache/arrow/blob/982ea6c4d382d1e85164f09b711e87938eaa674a/cpp/examples/arrow/CMakeLists.txt#L39-L40 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16844) [C++][Python] Implement to/from substrait for Expression
Will Jones created ARROW-16844: -- Summary: [C++][Python] Implement to/from substrait for Expression Key: ARROW-16844 URL: https://issues.apache.org/jira/browse/ARROW-16844 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Will Jones DataFusion has the ability to convert between Substrait expressions and its own internal expressions. (See: [https://github.com/datafusion-contrib/datafusion-substrait].) It would be cool if we had a similar conversion for Acero's Expression class. This might unlock allowing datafusion-python to easily use PyArrow datasets, by using Substrait as an intermediate format to pass down filters and projections from DataFusion into the scanner. (See early draft here: [https://github.com/datafusion-contrib/datafusion-python/pull/21].) One problem is that it's unclear what the type of the Python object representing the Substrait expression should be. IIUC Python doesn't have direct bindings to the Substrait protobuf. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16828) [R][Packaging] Turn on all compression libs for binaries
Will Jones created ARROW-16828: -- Summary: [R][Packaging] Turn on all compression libs for binaries Key: ARROW-16828 URL: https://issues.apache.org/jira/browse/ARROW-16828 Project: Apache Arrow Issue Type: Improvement Components: Packaging, R Affects Versions: 8.0.0 Reporter: Will Jones Fix For: 9.0.0 We notably don't ship brotli for MacOS. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16802) [Docs] Improve Acero Documentation
Will Jones created ARROW-16802: -- Summary: [Docs] Improve Acero Documentation Key: ARROW-16802 URL: https://issues.apache.org/jira/browse/ARROW-16802 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Will Jones From [~amol-]: {quote}If we want to start promoting Acero to the world, I think we should work on improving the documentation a bit first. Having a blog post that then redirects people to docs that they find hard to read/apply might actually be counterproductive, as it might create a reputation for being badly documented. At the moment the only mention of it is [https://arrow.apache.org/docs/cpp/streaming_execution.html] and it's not very easy to follow (not many explanations, just blocks of code). In comparison, if you look at the compute chapter in Python ([https://arrow.apache.org/docs/dev/python/compute.html]) it's much more talkative and explains things as it goes. {quote} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16800) [C++] arrow::RecordBatchBuilder to use Result
Will Jones created ARROW-16800: -- Summary: [C++] arrow::RecordBatchBuilder to use Result Key: ARROW-16800 URL: https://issues.apache.org/jira/browse/ARROW-16800 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 8.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 9.0.0 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16789) [Format] Mark C Stream Interface as stable
Will Jones created ARROW-16789: -- Summary: [Format] Mark C Stream Interface as stable Key: ARROW-16789 URL: https://issues.apache.org/jira/browse/ARROW-16789 Project: Apache Arrow Issue Type: Improvement Components: Format Reporter: Will Jones Assignee: Will Jones As discussed in [this dev mailing list thread|https://lists.apache.org/thread/0y604o9s3wkyty328wv8d21ol7s40q55], we may wish to mark the C stream interface stable. All feedback in the thread was positive, so I will go ahead and make a PR and call a vote. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16761) [C++][Python] Track bytes_written on FileWriter / WrittenFile
Will Jones created ARROW-16761: -- Summary: [C++][Python] Track bytes_written on FileWriter / WrittenFile Key: ARROW-16761 URL: https://issues.apache.org/jira/browse/ARROW-16761 Project: Apache Arrow Issue Type: New Feature Components: C++, Python Affects Versions: 8.0.0 Reporter: Will Jones Fix For: 9.0.0 For Apache Iceberg and Delta Lake tables, we need to be able to get the size of the files written in bytes. In Iceberg, this is the required field {{file_size_in_bytes}} ([docs|https://iceberg.apache.org/spec/#manifests]). In Delta, this is the required field {{size}} as part of the Add action. I think this could be exposed on [FileWriter|https://github.com/apache/arrow/blob/8c63788ff7d52812599a546989b7df10887cb01e/cpp/src/arrow/dataset/file_base.h#L305] and then, through that, on [WrittenFile|https://github.com/apache/arrow/blob/8c63788ff7d52812599a546989b7df10887cb01e/python/pyarrow/_dataset.pyx#L766-L769]. Below that level, though, I'm not yet sure where the count should come from. {{FileWriter}} owns its {{OutputStream}}; would {{OutputStream::Tell()}} give the correct count? -- This message was sent by Atlassian Jira (v8.20.7#820007)
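The {{Tell()}} idea above can be sanity-checked with an in-memory stream. A minimal Python sketch, using {{io.BytesIO}} as a stand-in for Arrow's {{OutputStream}}, under the assumption the writer only ever appends sequentially:

```python
import io

# Stand-in for an OutputStream owned by a FileWriter: after purely
# sequential writes, tell() reports the total bytes written so far.
stream = io.BytesIO()
stream.write(b"PAR1")        # 4 bytes of pretend magic
stream.write(b"\x00" * 100)  # 100 bytes of pretend row group data
bytes_written = stream.tell()
print(bytes_written)  # 104
```

This only holds if the writer never seeks backwards; a stream that rewrites a footer in place would need explicit byte accounting instead.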
[jira] [Created] (ARROW-16760) [Docs] Mention PYARROW_PARALLEL in Python dev docs
Will Jones created ARROW-16760: -- Summary: [Docs] Mention PYARROW_PARALLEL in Python dev docs Key: ARROW-16760 URL: https://issues.apache.org/jira/browse/ARROW-16760 Project: Apache Arrow Issue Type: Improvement Components: Documentation Affects Versions: 8.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 9.0.0 We should include {{PYARROW_PARALLEL}} in the Python developer docs. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16703) [R] Refactor map_batches() so it can stream results
Will Jones created ARROW-16703: -- Summary: [R] Refactor map_batches() so it can stream results Key: ARROW-16703 URL: https://issues.apache.org/jira/browse/ARROW-16703 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 8.0.0 Reporter: Will Jones Fix For: 9.0.0 As part of ARROW-15271, {{map_batches()}} was modified to return a {{RecordBatchReader}}, but the implementation collects all results as a list of record batches and then converts that to a reader. In theory, if we push the implementation down to C++, we should be able to make a proper streaming RBR. We won't know the schema ahead of time. We could optionally accept it, which would allow the function to be lazy. Or we could eagerly evaluate just the first batch to determine the schema. -- This message was sent by Atlassian Jira (v8.20.7#820007)
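The "eagerly evaluate just the first batch" option described in ARROW-16703 can be sketched in pure Python. Here a "batch" is modeled as a plain dict of column lists and its "schema" as its key set; these names are illustrative stand-ins, not the arrow API:

```python
from itertools import chain

def map_batches(batches, fn, schema=None):
    """Sketch: apply fn to each batch lazily.

    If no schema is given, eagerly evaluate only the first mapped
    batch and infer the schema from it; remaining batches stay lazy.
    """
    mapped = (fn(b) for b in batches)
    if schema is None:
        first = next(mapped)           # eager: exactly one batch
        schema = sorted(first.keys())
        mapped = chain([first], mapped)  # put it back at the front
    return schema, mapped

batches = iter([{"x": [1, 2]}, {"x": [3]}])
schema, reader = map_batches(batches, lambda b: {"x2": [v * 2 for v in b["x"]]})
out = list(reader)
print(schema)  # ['x2']
print(out)     # [{'x2': [2, 4]}, {'x2': [6]}]
```

Passing `schema` explicitly skips the `next()` call entirely, which is what would make the fully lazy variant possible.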
[jira] [Created] (ARROW-16702) [C++] Add compute functions for list array containment
Will Jones created ARROW-16702: -- Summary: [C++] Add compute functions for list array containment Key: ARROW-16702 URL: https://issues.apache.org/jira/browse/ARROW-16702 Project: Apache Arrow Issue Type: New Feature Components: C++ Affects Versions: 8.0.0 Reporter: Will Jones Some operations we might implement: * {{array_contains(arr, x)}} : list array {{arr}} contains scalar {{x}} * {{arrays_overlap(arr, sc)}} : list array {{arr}} contains common elements with list scalar {{sc}} (we could also implement a version with another array as the second arg) -- This message was sent by Atlassian Jira (v8.20.7#820007)
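A reference sketch of the semantics proposed above, in pure Python. Null propagation for null list slots is an assumption here (mirroring typical kernel behavior), not a settled design:

```python
def array_contains(arr, x):
    # For each list slot, does it contain scalar x? None stays null.
    return [None if lst is None else x in lst for lst in arr]

def arrays_overlap(arr, sc):
    # Does each list slot share at least one element with list scalar sc?
    s = set(sc)
    return [None if lst is None else bool(s.intersection(lst)) for lst in arr]

print(array_contains([[1, 2], [3], None], 2))       # [True, False, None]
print(arrays_overlap([[1, 2], [3], None], [2, 3]))  # [True, True, None]
```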
[jira] [Created] (ARROW-16658) [Python] Support arithmetic on arrays and scalars
Will Jones created ARROW-16658: -- Summary: [Python] Support arithmetic on arrays and scalars Key: ARROW-16658 URL: https://issues.apache.org/jira/browse/ARROW-16658 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 8.0.0 Reporter: Will Jones I was surprised to find you can't use standard arithmetic operators on PyArrow arrays and scalars. Instead, one must use the compute functions: {code:Python} import pyarrow as pa import pyarrow.compute as pc arr = pa.array([1, 2, 3]) pc.add(arr, 2) # Doesn't work: # arr + 2 # arr + pa.scalar(2) # arr + arr pc.multiply(arr, 20) # Doesn't work: # 20 * arr # pa.scalar(20) * arr {code} Is it intentional that we don't support this? -- This message was sent by Atlassian Jira (v8.20.7#820007)
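Supporting this would amount to wiring Python's operator protocol to the compute functions. A minimal sketch of that approach using a hypothetical stand-in class (so it runs without pyarrow); the real change would add dunder methods on {{pa.Array}} delegating to {{pc.add}}, {{pc.multiply}}, etc.:

```python
class ArrayOps:
    """Illustrative wrapper: dunder methods delegate to a
    compute-style binary function, element-wise."""
    def __init__(self, values):
        self.values = list(values)

    def _binop(self, other, fn):
        # Broadcast a scalar against the array, like compute kernels do.
        other_vals = other.values if isinstance(other, ArrayOps) else [other] * len(self.values)
        return ArrayOps(fn(a, b) for a, b in zip(self.values, other_vals))

    def __add__(self, other):
        return self._binop(other, lambda a, b: a + b)
    __radd__ = __add__  # so `2 + arr` also works

    def __mul__(self, other):
        return self._binop(other, lambda a, b: a * b)
    __rmul__ = __mul__  # so `20 * arr` also works

arr = ArrayOps([1, 2, 3])
print((arr + 2).values)   # [3, 4, 5]
print((20 * arr).values)  # [20, 40, 60]
```

The reflected variants (`__radd__`, `__rmul__`) are what make the `20 * arr` spelling from the snippet above work.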
[jira] [Created] (ARROW-16632) [Website] Announce Acero Engine
Will Jones created ARROW-16632: -- Summary: [Website] Announce Acero Engine Key: ARROW-16632 URL: https://issues.apache.org/jira/browse/ARROW-16632 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Will Jones Given consensus on Acero as the name for the C++ streaming execution engine, it may be time to write a blog post announcing the engine, how it's currently available in the ecosystem, and what's happening next with it. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16510) [R] Add bindings for GCS filesystem
Will Jones created ARROW-16510: -- Summary: [R] Add bindings for GCS filesystem Key: ARROW-16510 URL: https://issues.apache.org/jira/browse/ARROW-16510 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 8.0.0 Reporter: Will Jones Fix For: 9.0.0 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16509) [R][Docs] Update dataset vignette
Will Jones created ARROW-16509: -- Summary: [R][Docs] Update dataset vignette Key: ARROW-16509 URL: https://issues.apache.org/jira/browse/ARROW-16509 Project: Apache Arrow Issue Type: Improvement Components: Documentation, R Affects Versions: 8.0.0 Reporter: Will Jones Fix For: 9.0.0 Since the dataset vignette was written, we've added join, aggregation, and distinct support (and soon union/union_all support). The dataset vignette currently says we don't support those operations. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16421) [R] Permission error on Windows when deleting file in dataset
Will Jones created ARROW-16421: -- Summary: [R] Permission error on Windows when deleting file in dataset Key: ARROW-16421 URL: https://issues.apache.org/jira/browse/ARROW-16421 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 7.0.0 Reporter: Will Jones Assignee: Will Jones On Windows this fails: {code:r} library(arrow) write_dataset(iris, "test_dataset") con <- open_dataset("test_dataset") |> to_duckdb() file.remove("test_dataset/part-0.parquet") #> Warning in file.remove("test_dataset/part-0.parquet"): cannot remove file #> 'test_dataset/part-0.parquet', reason 'Permission denied' #> [1] FALSE {code} But on MacOS it does not: {code:R} library(arrow) write_dataset(iris, "test_dataset") con <- open_dataset("test_dataset") |> to_duckdb() file.remove("test_dataset/part-0.parquet") #> [1] TRUE {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16399) [R][C++] datetime locale support on Windows MINGW / R
Will Jones created ARROW-16399: -- Summary: [R][C++] datetime locale support on Windows MINGW / R Key: ARROW-16399 URL: https://issues.apache.org/jira/browse/ARROW-16399 Project: Apache Arrow Issue Type: Improvement Components: C++, R Affects Versions: 7.0.0 Reporter: Will Jones In [https://github.com/apache/arrow/pull/12536] I found that locales other than "C" and "POSIX" didn't seem to be supported in the RTools environment (an MSYS2 fork). I saw some indications this might apply to any MINGW toolchain (https://stackoverflow.com/a/4497266/2048858), but nothing very recent or definitive. Is there a way we can enable this support? -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16243) [C++][Python] Remove Parquet ReadSchemaField method
Will Jones created ARROW-16243: -- Summary: [C++][Python] Remove Parquet ReadSchemaField method Key: ARROW-16243 URL: https://issues.apache.org/jira/browse/ARROW-16243 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 9.0.0 It doesn't seem like the experimental {{ReadSchemaField()}} method does anything different from {{ReadColumn()}} at this point. We should remove it and its corresponding Python method. https://github.com/apache/arrow/blob/cedb4f8112b9c622dad88e0b6e8e0600f7e52746/cpp/src/parquet/arrow/reader.h#L143-L156 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16239) [R] $columns on Table and RB should be named
Will Jones created ARROW-16239: -- Summary: [R] $columns on Table and RB should be named Key: ARROW-16239 URL: https://issues.apache.org/jira/browse/ARROW-16239 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 9.0.0 Currently, {{$columns}} method returns columns as a list without names. It would be nice if they were named instead, similar to {{as.list}} on a {{data.frame}}. {code:R} > library(arrow) > names(record_batch(x = 1, y = 'a')$columns) NULL > names(arrow_table(x = 1, y = 'a')$columns) NULL > as.list(data.frame(x = 1, y = 'a')) $x [1] 1 $y [1] "a" {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16130) [Python][Docs] Document ParquetWriteOptions class
Will Jones created ARROW-16130: -- Summary: [Python][Docs] Document ParquetWriteOptions class Key: ARROW-16130 URL: https://issues.apache.org/jira/browse/ARROW-16130 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Python Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 The class [{{ParquetFileWriteOptions}}|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.ParquetFileFormat.html#pyarrow.dataset.ParquetFileFormat.make_write_options], returned by [{{ParquetFileFormat.make_write_options}}|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.ParquetFileFormat.html#pyarrow.dataset.ParquetFileFormat.make_write_options], is not documented in the API docs, unlike {{ParquetReadOptions}}. Most of the associated options are documented in [{{pyarrow.parquet.write_table}}|https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table] already, so they should be easy to write up. For reference, we encountered this when trying to expose these options in [the delta-rs writer|https://github.com/delta-io/delta-rs/pull/581]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16114) [Python] Document parquet.FileMetadata and statistics
Will Jones created ARROW-16114: -- Summary: [Python] Document parquet.FileMetadata and statistics Key: ARROW-16114 URL: https://issues.apache.org/jira/browse/ARROW-16114 Project: Apache Arrow Issue Type: Improvement Components: Documentation Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 {{FileMetaData}} in the parquet module (returned by {{ParquetFile.metadata}}) isn't in the API docs. We should add it to the API docs so users know what fields are available. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16085) [R] Support unifying schemas for InMemoryDatasets
Will Jones created ARROW-16085: -- Summary: [R] Support unifying schemas for InMemoryDatasets Key: ARROW-16085 URL: https://issues.apache.org/jira/browse/ARROW-16085 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 The following fails: {code:R} sub_df1 <- Table$create( x = Array$create(c(1, 2, 3)), y = Array$create(c("a", "b", "c")) ) sub_df2 <- Table$create( x = Array$create(c(4, 5)), z = Array$create(c("d", "e")) ) ds1 <- InMemoryDataset$create(sub_df1) ds2 <- InMemoryDataset$create(sub_df2) ds <- c(ds1, ds2) actual <- ds %>% collect() {code} {code} Type error: yielded batch had schema x: double y: string which did not match InMemorySource's: x: double y: string z: string /Users/willjones/Documents/arrows/arrow-quick/cpp/src/arrow/util/iterator.h:541 child_.Next() /Users/willjones/Documents/arrows/arrow-quick/cpp/src/arrow/util/iterator.h:152 value_.status() /Users/willjones/Documents/arrows/arrow-quick/cpp/src/arrow/util/iterator.h:180 maybe_element /Users/willjones/Documents/arrows/arrow-quick/cpp/src/arrow/dataset/scanner.cc:840 fragments_it.ToVector() {code} If we fixed this, we could implement a function that does for Tables what {{dplyr::bind_rows}} does for Tibbles: {code:R} concat_tables <- function(..., schema = NULL) { tables <- list2(...) dataset <- open_dataset(map(tables, InMemoryDataset$create), schema = schema) dplyr::collect(dataset, as_data_frame = FALSE) } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
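The unification the dataset layer would need can be sketched in pure Python. Schemas are modeled here as plain `field name -> type name` dicts for illustration; the real implementation would operate on Arrow {{Schema}} objects and also need to fill missing columns with nulls at scan time:

```python
def unify_schemas(schemas):
    """Merge field -> type mappings, erroring on conflicting types."""
    unified = {}
    for schema in schemas:
        for name, typ in schema.items():
            # setdefault keeps the first-seen type; a different type
            # for the same field name is a genuine conflict.
            if unified.setdefault(name, typ) != typ:
                raise TypeError(f"conflicting types for field {name!r}")
    return unified

s1 = {"x": "double", "y": "string"}
s2 = {"x": "double", "z": "string"}
unified = unify_schemas([s1, s2])
print(unified)  # {'x': 'double', 'y': 'string', 'z': 'string'}
```

This mirrors the error case in the traceback above: the yielded batch carries only {{x, y}}, while the unified source schema is {{x, y, z}}, so the missing {{z}} must be null-filled rather than rejected.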
[jira] [Created] (ARROW-16054) [Python] Use tzdata timezone database on Windows
Will Jones created ARROW-16054: -- Summary: [Python] Use tzdata timezone database on Windows Key: ARROW-16054 URL: https://issues.apache.org/jira/browse/ARROW-16054 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Affects Versions: 7.0.0 Reporter: Will Jones In ARROW-13168, we enabled setting the path of the text-based database engine at runtime. This allowed R to use the tzdb package for the timezone database, since it uses the text format. However, it doesn't seem like the tzdata Python package ships that text format. They do have [a "compact" text format|https://github.com/python/tzdata/blob/master/src/tzdata/zoneinfo/tzdata.zi], which _might_ be compatible with our vendored date library. Otherwise, we'd likely have to wait for binary format support in https://github.com/HowardHinnant/date/issues/564 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16024) [C++] Let users set Windows timezone db path with environment variable
Will Jones created ARROW-16024: -- Summary: [C++] Let users set Windows timezone db path with environment variable Key: ARROW-16024 URL: https://issues.apache.org/jira/browse/ARROW-16024 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 7.0.0 Reporter: Will Jones In ARROW-13168, we enabled a runtime option to set the location of the timezone database on Windows. For developers, the unit tests read the ARROW_TIMEZONE_DATABASE environment variable. It might be helpful to let users use that variable as well, but the question is where to put the initialization. Also, should it take precedence over the initialize method? If it did, it could override the R initialization that points to the tzdb package. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16006) [C++] Helpers for converting between rows and Arrow objects
Will Jones created ARROW-16006: -- Summary: [C++] Helpers for converting between rows and Arrow objects Key: ARROW-16006 URL: https://issues.apache.org/jira/browse/ARROW-16006 Project: Apache Arrow Issue Type: New Feature Components: C++ Affects Versions: 7.0.0 Reporter: Will Jones Assignee: Will Jones Short version: Given a way to convert a vector of rows and a schema to a RecordBatch, we can derive methods for efficiently converting a vector of rows to a Table or even an iterator of rows to a Record Batch Reader. Similarly, we could go the other way: given a way to convert a RecordBatch to a vector of rows, we can derive methods for converting from Tables or RBRs. Long version: https://docs.google.com/document/d/174tldmQLMCvOtjxGtFPeoLBefyE1x26_xntwfSzDXFA/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.1#820001)
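The "short version" above is a derivation: given only a `rows -> RecordBatch` converter, a `rows -> Table` converter follows mechanically by chunking. A Python sketch with batches and tables modeled as plain lists (illustrative stand-ins for the Arrow types):

```python
def rows_to_table(rows, rows_to_batch, batch_size=1024):
    """Derive a rows -> Table conversion from a rows -> batch one:
    chunk the input, convert each chunk, and collect the batches."""
    table = []
    for i in range(0, len(rows), batch_size):
        table.append(rows_to_batch(rows[i:i + batch_size]))
    return table

rows = [(i, str(i)) for i in range(5)]
# Using `list` as a trivial rows -> batch converter for the sketch.
table = rows_to_table(rows, rows_to_batch=list, batch_size=2)
print(len(table))  # 3 (chunks of 2, 2, 1)
```

The same pattern, applied to an iterator instead of a list, yields the RecordBatchReader variant; the inverse direction (batch -> rows, derived for Tables and readers) is symmetric.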
[jira] [Created] (ARROW-15989) [R] Implement rbind for Table and RecordBatch
Will Jones created ARROW-15989: -- Summary: [R] Implement rbind for Table and RecordBatch Key: ARROW-15989 URL: https://issues.apache.org/jira/browse/ARROW-15989 Project: Apache Arrow Issue Type: New Feature Components: R Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 In ARROW-15013 we implemented c() for Arrow arrays. We should now be able to implement rbind for Tables and RecordBatches (rbind on batches would produce a table). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15975) [C++] Document VisitArrayInline and type traits
Will Jones created ARROW-15975: -- Summary: [C++] Document VisitArrayInline and type traits Key: ARROW-15975 URL: https://issues.apache.org/jira/browse/ARROW-15975 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 7.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 8.0.0 In ARROW-15952, we documented the {{ArrayVisitor}} and {{TypeVisitor}} classes. But as I discovered in [a cookbook PR|https://github.com/apache/arrow-cookbook/pull/166], you can't subclass these abstract visitors _and_ use type traits. Now I know why most visitor implementations within Arrow don't subclass these. We should instead suggest users simply use {{VisitArrayInline}} and {{VisitTypeInline}} with their visitors, and ignore the {{ArrayVisitor}} and {{TypeVisitor}} classes and associated {{Accept()}} methods. In fact, can we deprecate (or even remove) those? Do they add anything valuable? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15952) [C++] Document Array::accept() and ArrayVisitor
Will Jones created ARROW-15952: -- Summary: [C++] Document Array::accept() and ArrayVisitor Key: ARROW-15952 URL: https://issues.apache.org/jira/browse/ARROW-15952 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 We mention in the docs that: {quote}The classes arrow::Array and its subclasses provide strongly-typed accessors with support for the visitor pattern and other affordances.{quote} But ArrayVisitor class and the {{Array::Accept()}} [method|https://github.com/apache/arrow/blob/b956ba51ea11d050745e09548e33aa61fdcbfddc/cpp/src/arrow/array/array_base.h#L136] are missing from the API docs. We should add those, and potentially also provide an example of using the visitor. Likely worth doing the same for TypeVisitor and ScalarVisitor. It would also be nice to document the performance implication of using ScalarVisitor vs ArrayVisitor. Also we use an "inline" version of the visitors; is that something we do/should expose in the API as well? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15922) [C++] Re-enable strftime locale test on Windows
Will Jones created ARROW-15922: -- Summary: [C++] Re-enable strftime locale test on Windows Key: ARROW-15922 URL: https://issues.apache.org/jira/browse/ARROW-15922 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Will Jones Assignee: Will Jones ARROW-13168 enabled timezone support on Windows, but found there was an issue with the vendored datetime library that caused invalid UTF-8 character to be emitted from strftime in certain locales. We should re-enable that test once we are able to get a fix in the date library (or some other solution is found). https://github.com/HowardHinnant/date/issues/726 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15906) [C++] S3Filesystem shouldn't create new buckets by default
Will Jones created ARROW-15906: -- Summary: [C++] S3Filesystem shouldn't create new buckets by default Key: ARROW-15906 URL: https://issues.apache.org/jira/browse/ARROW-15906 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 S3 buckets typically have a lot of governance around them (permissions, cost-tracking tags), so they should not be created unless a user explicitly asks. We should add an option to {{S3Options}} to control whether to create buckets, and default to False. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15860) [Python][Docs] Document RecordBatchReader
Will Jones created ARROW-15860: -- Summary: [Python][Docs] Document RecordBatchReader Key: ARROW-15860 URL: https://issues.apache.org/jira/browse/ARROW-15860 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Python Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 RecordBatchReader seems like a pretty important type, but it is missing from the Python API docs. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15803) [R] Empty JSON object parsed as corrupt data frame
Will Jones created ARROW-15803: -- Summary: [R] Empty JSON object parsed as corrupt data frame Key: ARROW-15803 URL: https://issues.apache.org/jira/browse/ARROW-15803 Project: Apache Arrow Issue Type: Bug Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 If you have a JSON object field that is always empty, it seems not to be handled well, whether or not a schema is provided that tells Arrow what should be in that object. {code:r} library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp json_val <- '{ "rows": [ {"empty": {} }, {"empty": {} }, {"empty": {} } ] }' # Remove newlines json_val <- gsub("\n", "", json_val) json_file <- tempfile() writeLines(json_val, json_file) schema <- schema(field("rows", list_of(struct(empty = struct(y = int32()))))) raw <- read_json_arrow(json_file, schema=schema) raw$rows$empty #> Error: Corrupt x: no names {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15758) [C++] Explore upgrading to mimalloc V2
Will Jones created ARROW-15758: -- Summary: [C++] Explore upgrading to mimalloc V2 Key: ARROW-15758 URL: https://issues.apache.org/jira/browse/ARROW-15758 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 7.0.0 Reporter: Will Jones ARROW-15730 found that mimalloc wasn't releasing memory as expected. These memory allocators tend to hold onto memory longer than users expect, which can be confusing. But there also appears to be [a bug where it doesn't reuse memory|https://github.com/microsoft/mimalloc/issues/383#issuecomment-846132613]. Both of these are addressed in v2.0.X (beta) of the library: the allocator is more aggressive about returning memory, and the bug appears to be fixed. [According to one of the maintainers|https://github.com/microsoft/mimalloc/issues/466#issuecomment-947819685], the main reason 2.0.X hasn't been declared stable is that some use cases have reported performance regressions. We could create a branch of Arrow using mimalloc v2 and run conbench benchmarks to see comparisons. If it's faster, we may consider moving forward; if not, we could provide feedback to the mimalloc maintainers, which may help development along. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15725) [Python] Legacy dataset can't roundtrip Int64 with nulls if partitioned
Will Jones created ARROW-15725: -- Summary: [Python] Legacy dataset can't roundtrip Int64 with nulls if partitioned Key: ARROW-15725 URL: https://issues.apache.org/jira/browse/ARROW-15725 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 7.0.0, 4.0.0 Reporter: Will Jones If there is partitioning and the column has nulls, Int64 columns may not round trip successfully using the legacy datasets implementation. Simple reproduction: {code:python} import pyarrow as pa import pyarrow.parquet as pq import pyarrow.dataset as ds import tempfile table = pa.table({ 'x': pa.array([None, 7753285016841556620]), 'y': pa.array(['a', 'b']) }) ds_dir = tempfile.mkdtemp() pq.write_to_dataset(table, ds_dir, partition_cols=['y']) table_after = ds.dataset(ds_dir).to_table() print(table['x']) print(table_after['x']) assert table['x'] == table_after['x'] {code} {code} [ [ null, 7753285016841556620 ] ] [ [ null ], [ 7753285016841556992 ] ] {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
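The corrupted value above is a strong hint about the cause. It is exactly what you get if the int64 passes through a float64 somewhere in the legacy path (an inference from the numbers, not a confirmed diagnosis); float64 has a 53-bit significand, and near 7.75e18 the spacing between representable values is 2**10 = 1024:

```python
x = 7753285016841556620
# Round-tripping through float64 lands on the nearest multiple of 1024,
# which is precisely the value the legacy dataset read back:
roundtripped = int(float(x))
print(roundtripped)  # 7753285016841556992

# The value sits between 2**62 and 2**63, where float64 spacing is 1024.
print(2**62 < x < 2**63)  # True
```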
[jira] [Created] (ARROW-15718) [R] Joining two datasets crashes if use_threads=FALSE
Will Jones created ARROW-15718: -- Summary: [R] Joining two datasets crashes if use_threads=FALSE Key: ARROW-15718 URL: https://issues.apache.org/jira/browse/ARROW-15718 Project: Apache Arrow Issue Type: Bug Components: C++, R Affects Versions: 7.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 8.0.0 In ARROW-14908 we solved the case of joining a dataset to an in memory table, but did not solve joining two datasets. The previous solution was to add +1 to the thread count, because the hash join logic might be called by the scanner's IO thread. For joining more than 1 dataset, we might have more than 1 IO thread, so we either need to add a larger arbitrary number or find a way to make the state logic more resilient to unexpected threads. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15667) [R] Windows build can fail if building only shared libraries
Will Jones created ARROW-15667: -- Summary: [R] Windows build can fail if building only shared libraries Key: ARROW-15667 URL: https://issues.apache.org/jira/browse/ARROW-15667 Project: Apache Arrow Issue Type: Bug Components: R Affects Versions: 7.0.0 Reporter: Will Jones Assignee: Will Jones Fix For: 8.0.0 This should only affect dev environments. I noticed that when I build with only shared libraries, the build fails because it expects arrow_bundled_dependencies, which I think we only build as part of static builds. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15627) [R] Support unify_schemas for union datasets
Will Jones created ARROW-15627: -- Summary: [R] Support unify_schemas for union datasets Key: ARROW-15627 URL: https://issues.apache.org/jira/browse/ARROW-15627 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 Also out of discussion on [https://github.com/apache/arrow/issues/12371] You can unify schemas between different parquet files, but it seems like you can't union together two (or more) datasets that have different schemas. This is odd, because we do compute the unified schema on [this line|https://github.com/apache/arrow/blob/ba0814e60a451525dd5492b68059aad8a4bdaf4f/r/R/dataset.R#L189], only to later assert all the schemas are the same. {code:R} library(arrow) library(dplyr) df1 <- arrow_table(x = array(c(1, 2, 3)), y = array(c("a", "b", "c"))) df2 <- arrow_table(x = array(c(4, 5)), z = array(c("d", "e"))) df1 %>% write_dataset("example1", format="parquet") df2 %>% write_dataset("example2", format="parquet") ds1 <- open_dataset("example1", format="parquet") ds2 <- open_dataset("example2", format="parquet") # These don't work ds <- c(ds1, ds2) # c() actually does the same thing ds <- open_dataset(list(ds1, ds2)) # This fails due to mismatch in schema ds <- open_dataset(c("example1", "example2"), format="parquet", unify_schemas = TRUE) # This does ds <- open_dataset(c("example2/part-0.parquet", "example1/part-0.parquet"), format="parquet", unify_schemas = TRUE) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15622) [R] Implement union_all for arrow_dplyr_query
Will Jones created ARROW-15622: -- Summary: [R] Implement union_all for arrow_dplyr_query Key: ARROW-15622 URL: https://issues.apache.org/jira/browse/ARROW-15622 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 GitHub issue inspiration: [https://github.com/apache/arrow/issues/12371] Basically union_all would chain the RecordBatchReaders. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15603) [C++] Clang 13 build fails on unused var
Will Jones created ARROW-15603: -- Summary: [C++] Clang 13 build fails on unused var Key: ARROW-15603 URL: https://issues.apache.org/jira/browse/ARROW-15603 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 Just a small issue. When I build with clang 13, I get the following error from an unused-variable warning:
{code:java}
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:791:13: error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable]
    int64_t n = 0;
            ^
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:799:13: error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable]
    int64_t n = 0;
            ^
{code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15512) [C++] OT logging for memory pool allocations
Will Jones created ARROW-15512: -- Summary: [C++] OT logging for memory pool allocations Key: ARROW-15512 URL: https://issues.apache.org/jira/browse/ARROW-15512 Project: Apache Arrow Issue Type: New Feature Components: C++ Affects Versions: 6.0.1 Reporter: Will Jones Fix For: 8.0.0 ARROW-3016 suggests there is a real need for tracking memory allocations with context such as traceback and sizes. That ticket covers using Linux tools like perf and uprobe to do so. Using OpenTelemetry might provide a cross-platform way to do the same, one that's in line with other tracing efforts. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15415) [C++] Cannot build debug with MSVC and vcpkg
Will Jones created ARROW-15415: -- Summary: [C++] Cannot build debug with MSVC and vcpkg Key: ARROW-15415 URL: https://issues.apache.org/jira/browse/ARROW-15415 Project: Apache Arrow Issue Type: Bug Components: C++, Documentation Affects Versions: 6.0.1 Reporter: Will Jones Assignee: Will Jones Fix For: 8.0.0 While trying to create a debug build of Arrow on Windows using vcpkg and MSVC, I encountered a few issues with the current build configuration:
# Python debug and release libraries are both passed, but our CMake scripts only expect one or the other, just as reported in ARROW-13470
# Since vcpkg upgraded gtest to 1.11.0, there is again a mismatch between the bundled gtest and the vcpkg version, so we get the same error as was found in ARROW-14393
# Thrift could not find debug static libraries, because the "d" suffix was missing: it should be {{libthriftmdd.lib}}, but it was finding {{libthriftmd.lib}}.
Additionally, the recommended {{clcache}} program from our Windows developer docs is no longer maintained, and I found its dependency {{pyuv}} doesn't install on Windows anymore. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15408) [C++] Environment variable to turn on memory allocation logging
Will Jones created ARROW-15408: -- Summary: [C++] Environment variable to turn on memory allocation logging Key: ARROW-15408 URL: https://issues.apache.org/jira/browse/ARROW-15408 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 6.0.1 Reporter: Will Jones Fix For: 8.0.0 In Python, there is a [{{log_memory_allocations}} function|https://github.com/wesm/arrow/blob/33111644be84f84ce4601889fee06c6d17f05279/python/pyarrow/memory.pxi#L63] to switch to the LoggingMemoryPool. It would be nice to be able to do this in C++, and one very convenient way would be through an environment variable, since we already support {{ARROW_DEFAULT_MEMORY_POOL}}. It should probably be named something like {{ARROW_LOG_MEMORY_ALLOCATIONS}}. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15363) [C++] Add max length option to PrettyPrintOptions
Will Jones created ARROW-15363: -- Summary: [C++] Add max length option to PrettyPrintOptions Key: ARROW-15363 URL: https://issues.apache.org/jira/browse/ARROW-15363 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 6.0.1 Reporter: Will Jones Fix For: 8.0.0 Some pretty prints, especially for chunked or nested arrays, can be very long even with reasonable window settings. We should have a way to set a target maximum output length. A half-measure was taken in ARROW-15329, which truncates the pretty-printed output, but that doesn't handle string columns very well when the string values contain delimiters. -- This message was sent by Atlassian Jira (v8.20.1#820001)
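To make the delimiter problem concrete: a raw character cut can stop inside a string value whose contents contain a closing bracket, making the output look well-formed when it isn't. A hypothetical sketch of truncating at element boundaries instead ({{truncated_repr}} is an illustrative helper, not an Arrow API):

```python
def truncated_repr(values, max_chars=40):
    # Hypothetical sketch: stop emitting at element boundaries, so a
    # delimiter inside a string value can never be mistaken for the end
    # of the array.
    parts = []
    used = 2  # account for the surrounding brackets
    for v in values:
        piece = repr(v)
        if used + len(piece) + 2 > max_chars:
            parts.append("...")
            break
        parts.append(piece)
        used += len(piece) + 2
    return "[" + ", ".join(parts) + "]"

# A value containing "]" is kept whole or dropped, never cut mid-string:
print(truncated_repr(["ab]cd", "x" * 100], max_chars=20))  # ['ab]cd', ...]
```

A character-budget cut of the same data would have ended mid-way through the run of "x" characters.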
[jira] [Created] (ARROW-15329) [Python] Add character limit to ChunkedArray repr
Will Jones created ARROW-15329: -- Summary: [Python] Add character limit to ChunkedArray repr Key: ARROW-15329 URL: https://issues.apache.org/jira/browse/ARROW-15329 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Will Jones Assignee: Will Jones Fix For: 7.0.0 Short term workaround for ARROW-14798 https://github.com/apache/arrow/pull/12091#issuecomment-1012316758 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15325) [R] Fix CRAN comment on map_batches collect
Will Jones created ARROW-15325: -- Summary: [R] Fix CRAN comment on map_batches collect Key: ARROW-15325 URL: https://issues.apache.org/jira/browse/ARROW-15325 Project: Apache Arrow Issue Type: Bug Reporter: Will Jones Fix For: 7.0.0 Got the following comment in [build {{homebrew-r-autobrew}}|https://github.com/ursacomputing/crossbow/runs/4799447427?check_suite_focus=true]:
{code}
map_batches: no visible binding for global variable 'collect'
Undefined global functions or variables:
  collect
{code}
Looks like I should have used the .data pronoun inside of map_batches, based on the "eliminating R CMD check NOTEs" section of https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15317) [R] Expose API to create Dataset from Fragments
Will Jones created ARROW-15317: -- Summary: [R] Expose API to create Dataset from Fragments Key: ARROW-15317 URL: https://issues.apache.org/jira/browse/ARROW-15317 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 6.0.1 Reporter: Will Jones Third-party packages may define dataset factories for table formats like Delta Lake and Apache Iceberg. These formats store metadata such as the schema, file lists, and file-level statistics on the side, and can construct a dataset without needing a discovery process. Python exposes enough of the API to do this successfully for [a Delta Lake dataset reader here|https://github.com/delta-io/delta-rs/blob/6a8195d6e3cbdcb0c58a14a3ffccc472dd094de0/python/deltalake/table.py#L267-L280]. I propose adding the following to the R API:
* Expose {{Fragment}} as an R6 object
* Add the {{MakeFragment}} method to the various file format objects. It's key that {{partition_expression}} is included as an argument. ([See the Python equivalent here|https://github.com/apache/arrow/blob/ab86daf3f7c8a67bee6a175a749575fd40417d27/python/pyarrow/_dataset_parquet.pyx#L209-L210])
* Add a dataset constructor that takes a list of {{Fragments}}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15276) [Docs][R] Add map_batches example from vignette to Cookbook
Will Jones created ARROW-15276: -- Summary: [Docs][R] Add map_batches example from vignette to Cookbook Key: ARROW-15276 URL: https://issues.apache.org/jira/browse/ARROW-15276 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Will Jones In ARROW-14029 we are adding an example of using `map_batches()` to sample data and compute aggregate statistics without having to load the whole dataset into memory. We should add these to the cookbook as well. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15271) [R] Refactor do_exec_plan to return a RecordBatchReader
Will Jones created ARROW-15271: -- Summary: [R] Refactor do_exec_plan to return a RecordBatchReader Key: ARROW-15271 URL: https://issues.apache.org/jira/browse/ARROW-15271 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 6.0.1 Reporter: Will Jones Right now [{{do_exec_plan}}|https://github.com/apache/arrow/blob/master/r/R/query-engine.R#L18] returns an Arrow table because {{head}}, {{tail}}, and {{arrange}} do. If ARROW-14289 is completed and similar work is done for {{arrange}}, we may be able to alter {{do_exec_plan}} to return a RBR instead. The {{map_batches()}} implementation (ARROW-14029) could benefit from this refactor. And it might make ARROW-15040 more useful. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15264) [CI][C#] Build examples in CI
Will Jones created ARROW-15264: -- Summary: [CI][C#] Build examples in CI Key: ARROW-15264 URL: https://issues.apache.org/jira/browse/ARROW-15264 Project: Apache Arrow Issue Type: Improvement Components: C#, Continuous Integration Reporter: Will Jones We should validate in CI that the C# examples always build with the latest version of Arrow. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15247) [Python] Convert array of Pandas dataframe to struct column
Will Jones created ARROW-15247: -- Summary: [Python] Convert array of Pandas dataframe to struct column Key: ARROW-15247 URL: https://issues.apache.org/jira/browse/ARROW-15247 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 6.0.1 Reporter: Will Jones Currently, converting a Pandas dataframe with a column of dataframes to Arrow fails with "Could not convert with type DataFrame: did not recognize Python value type when inferring an Arrow data type". We should be able to convert this to a list-of-structs array, similar to how [the R bindings do it|https://arrow.apache.org/docs/r/articles/arrow.html#r-to-arrow]. This could even be bi-directional, where structs could be parsed back into a column of dataframes in {{to_pandas()}}. Here is an example that currently fails:
{code:python}
import pandas as pd
import pyarrow as pa

df1 = pd.DataFrame({
    'x': [1, 2, 3],
    'y': ['a', 'b', 'c']
})
df = pd.DataFrame({
    'df': [df1] * 10
})
pa.Table.from_pandas(df)
{code}
Here's what the other direction might look like for the same data:
{code:python}
sub_tab = [{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}, {'x': 3, 'y': 'c'}]
tab = pa.table({
    'df': pa.array([sub_tab] * 10)
})
print(tab.schema)
# df: list<item: struct<x: int64, y: string>>
#   child 0, item: struct<x: int64, y: string>
#       child 0, x: int64
#       child 1, y: string
tab.to_pandas()
{code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15246) [Python] Automatic conversion of low-cardinality string array to Dictionary Array
Will Jones created ARROW-15246: -- Summary: [Python] Automatic conversion of low-cardinality string array to Dictionary Array Key: ARROW-15246 URL: https://issues.apache.org/jira/browse/ARROW-15246 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 6.0.1 Reporter: Will Jones Users who convert Pandas string arrays to Arrow arrays may be surprised to see the Arrow ones use far more memory when the cardinality is low. The solution is for them to first convert to a Pandas Categorical, but it might save some headaches if we could automatically (or possibly with an option) detect when it's appropriate to use a Dictionary type over a String type. Here's an example of what I'm talking about:
{code:python}
import pyarrow as pa
import pandas as pd

x_str = "x" * 30
df = pd.DataFrame({"col": [x_str] * 1_000_000})
%memit tab1 = pa.Table.from_pandas(df)
# peak memory: 269.44 MiB, increment: 121.62 MiB

df['col'] = df['col'].astype('category')
%memit tab2 = pa.Table.from_pandas(df)
# peak memory: 286.14 MiB, increment: 1.20 MiB
{code}
One bad consequence of inferring this automatically is that, if a sequence of Pandas DataFrames is being converted, they may end up with differing schemas. For that reason, this behavior should likely be optional. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15217) [C#] Add ToString() methods to Arrow classes
Will Jones created ARROW-15217: -- Summary: [C#] Add ToString() methods to Arrow classes Key: ARROW-15217 URL: https://issues.apache.org/jira/browse/ARROW-15217 Project: Apache Arrow Issue Type: Improvement Components: C# Reporter: Will Jones We should add {{ToString}} methods to {{RecordBatch}}, {{Schema}}, {{Field}}, {{DataType}}, {{Table}}, and {{ChunkedArray}}. The default implementation in C# is just to return the class name, which isn't very useful. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15135) [C++][R][Python] Support reading from Apache Iceberg tables
Will Jones created ARROW-15135: -- Summary: [C++][R][Python] Support reading from Apache Iceberg tables Key: ARROW-15135 URL: https://issues.apache.org/jira/browse/ARROW-15135 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Will Jones This is an umbrella issue for supporting the Apache Iceberg table format. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15109) [Python] Add more info to show_versions()
Will Jones created ARROW-15109: -- Summary: [Python] Add more info to show_versions() Key: ARROW-15109 URL: https://issues.apache.org/jira/browse/ARROW-15109 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 6.0.1 Reporter: Will Jones In the R arrow package, we have a function {{arrow_info()}} which provides information on versions and optional components. Python has {{show_versions()}}, but it's not as detailed. We can add the following to the Python function:
* List of optional components and whether they are enabled
* Which allocator is used
* SIMD level
Example R output:
{code}
Arrow package version: 6.0.1.9000

Capabilities:
dataset    TRUE
parquet    TRUE
json       TRUE
s3         TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:
Allocator  mimalloc
Current    0 bytes
Max        0 bytes

Runtime:
SIMD Level           none
Detected SIMD Level  none

Build:
C++ Library Version   7.0.0-SNAPSHOT
C++ Compiler              AppleClang
C++ Compiler Version    13.0.0.1329
Git ID  cf8d81d9fcbc43ce57b8a0d36c05f8b4273a5fa3
{code}
Example Python output (current behavior):
{code}
pyarrow version info
Package kind: not indicated
Arrow C++ library version: 7.0.0-SNAPSHOT
Arrow C++ compiler: AppleClang 13.0.0.1329
Arrow C++ compiler flags: -Qunused-arguments -fcolor-diagnostics -ggdb -O0
Arrow C++ git revision: d033ce769571a0f12e37ab165bc29d2b202b3a61
Arrow C++ git description: apache-arrow-7.0.0.dev-313-gd033ce769
{code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15102) [R] Allow creation of struct type with fields
Will Jones created ARROW-15102: -- Summary: [R] Allow creation of struct type with fields Key: ARROW-15102 URL: https://issues.apache.org/jira/browse/ARROW-15102 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 6.0.1 Reporter: Will Jones Fix For: 8.0.0 StructTypes can be created with types:
{code:R}
struct(x = int32(), y = utf8())
{code}
But they cannot be created with fields yet. This means you cannot construct a StructType with a non-nullable field (since fields are nullable by default). We should support constructing a StructType with fields, like we do for a Schema:
{code:R}
# Schema from fields
schema(field("x", int32()), field("y", utf8(), nullable = FALSE))
# Expected StructType from fields
struct(field("x", int32()), field("y", utf8(), nullable = FALSE))
{code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15089) [C++] Add compute kernel to get MapArray value for given key
Will Jones created ARROW-15089: -- Summary: [C++] Add compute kernel to get MapArray value for given key Key: ARROW-15089 URL: https://issues.apache.org/jira/browse/ARROW-15089 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 6.0.1 Reporter: Will Jones Given a "map", an obvious operation is to get the item corresponding to a key. The idea here is to create a kernel that does this for each map in the array. IIRC MapArray isn't guaranteed to have unique keys, so one version would return an array of ItemType by returning the first or last item for a given key. Yet another version could return a ListType containing all matching items. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15087) [Docs][Python] Document MapArray in Python
Will Jones created ARROW-15087: -- Summary: [Docs][Python] Document MapArray in Python Key: ARROW-15087 URL: https://issues.apache.org/jira/browse/ARROW-15087 Project: Apache Arrow Issue Type: New Feature Components: Documentation, Python Affects Versions: 6.0.1, 6.0.0 Reporter: Will Jones ARROW-6904 exposed MapArray in Python back in late 2019, but it has not been documented yet. We should add it to the API reference and to the [Python arrays user guide|https://arrow.apache.org/docs/python/data.html#arrays]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15075) [C++][Dataset] Implement Dataset for JSON format
Will Jones created ARROW-15075: -- Summary: [C++][Dataset] Implement Dataset for JSON format Key: ARROW-15075 URL: https://issues.apache.org/jira/browse/ARROW-15075 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Will Jones We already have support for reading individual files, but not yet for reading datasets. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14999) [C++] List types with different field names are not equal
Will Jones created ARROW-14999: -- Summary: [C++] List types with different field names are not equal Key: ARROW-14999 URL: https://issues.apache.org/jira/browse/ARROW-14999 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 6.0.0 Reporter: Will Jones When comparing map types, the names of the fields are ignored. This was introduced in ARROW-7173. However, for list types they are not ignored. For example,
{code:python}
In [6]: l1 = pa.list_(pa.field("val", pa.int64()))

In [7]: l2 = pa.list_(pa.int64())

In [8]: l1
Out[8]: ListType(list<val: int64>)

In [9]: l2
Out[9]: ListType(list<item: int64>)

In [10]: l1 == l2
Out[10]: False
{code}
Should we make list type comparison ignore field names too? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14730) [C++][R][Python] Support reading from Delta Lake tables
Will Jones created ARROW-14730: -- Summary: [C++][R][Python] Support reading from Delta Lake tables Key: ARROW-14730 URL: https://issues.apache.org/jira/browse/ARROW-14730 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Will Jones [Delta Lake|https://delta.io/] is a parquet table format that supports ACID transactions. It's popularized by Databricks, which uses it as the default table format in their platform. Previously, it was only readable from Spark, but now there is an effort in [delta-rs|https://github.com/delta-io/delta-rs] to make it accessible from elsewhere. There is already some integration with DataFusion (see: https://github.com/apache/arrow-datafusion/issues/525). There does already exist [a method to read Delta Lake tables into Arrow tables in Python|https://delta-io.github.io/delta-rs/python/api_reference.html#deltalake.table.DeltaTable.to_pyarrow_table] in the delta-rs Python bindings. This includes filtering by partitions. Is there a good way we could integrate this functionality with Arrow C++ Dataset and expose that in Python and R? Would that be something that should be implemented in the Arrow libraries or in delta-rs? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14597) Github actions install r-arrow with snappy compression
Dyfan Jones created ARROW-14597: --- Summary: GitHub actions install r-arrow with snappy compression Key: ARROW-14597 URL: https://issues.apache.org/jira/browse/ARROW-14597 Project: Apache Arrow Issue Type: New Feature Reporter: Dyfan Jones Hi All, I am having difficulty installing r-arrow with snappy compression on GitHub Actions. I have set the environment variable `ARROW_WITH_SNAPPY: ON` ([https://github.com/DyfanJones/noctua/blob/0079bf997737516fd3e1b61dbde7510044f79a2f/.github/workflows/R-CMD-check.yaml]). However, I get the following error in my unit tests:
{code:java}
Error: Error: NotImplemented: Support for codec 'snappy' not built
In order to read this file, you will need to reinstall arrow with additional features enabled.
Set one of these environment variables before installing:
 * LIBARROW_MINIMAL=false (for all optional features, including 'snappy')
 * ARROW_WITH_SNAPPY=ON (for just 'snappy')
See https://arrow.apache.org/docs/r/articles/install.html for detail
{code}
arrow version: 6.0.0.2 My PR [https://github.com/DyfanJones/noctua/pull/169] with the GitHub Actions issue. Any advice is much appreciated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13918) [Gandiva][Python] Add decimal support for make_literal and make_in_expression
Will Jones created ARROW-13918: -- Summary: [Gandiva][Python] Add decimal support for make_literal and make_in_expression Key: ARROW-13918 URL: https://issues.apache.org/jira/browse/ARROW-13918 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva, Python Reporter: Will Jones These are already implemented in C++, they just need to be exposed in Cython. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13917) [Gandiva] Add helper to determine valid decimal function return type
Will Jones created ARROW-13917: -- Summary: [Gandiva] Add helper to determine valid decimal function return type Key: ARROW-13917 URL: https://issues.apache.org/jira/browse/ARROW-13917 Project: Apache Arrow Issue Type: Improvement Components: C++ - Gandiva Reporter: Will Jones To evaluate a Gandiva function, you need to pass its return type. For most types, we can look up the possible return types by using the `GetRegisteredFunctionSignatures` method, but those don't include details of the precision and scale parameters of the decimal type. Specifying the precision and scale parameters of the decimal type is left up to the user, but if the user gets it wrong, they can get invalid answers. See the reproducible example at the bottom. The precision and scale of the return type depend on the input types and the implementation of the decimal operations. Given the variation of logic across different functions (add, divide, trunc, round), it would be best if we were able to provide some utility to help the user determine the precise return type. Note that return types aren't unique for a given function name and parameter types. For example, `add(date64[ms], int64)` can return either `date64[ms]` or `timestamp[ms]`, so a generic utility has to return multiple possible return types.
Example of invalid decimal results from a bad return type:
{code:python}
from decimal import Decimal

import pyarrow as pa
from pyarrow.gandiva import TreeExprBuilder, make_projector

def call_on_value(func, values, params, out_type):
    builder = TreeExprBuilder()
    param_literals = []
    for param, param_type in params:
        param_literals.append(builder.make_literal(param, param_type))

    inputs = []
    arrays = []
    for i, value in enumerate(values):
        inputs.append(builder.make_field(pa.field(str(i), value[1])))
        arrays.append(pa.array([value[0]], value[1]))
    record_batch = pa.record_batch(arrays, [str(i) for i in range(len(values))])

    func_x = builder.make_function(func, inputs + param_literals, out_type)
    expressions = [builder.make_expression(func_x, pa.field('result', out_type))]
    projector = make_projector(record_batch.schema, expressions, pa.default_memory_pool())
    return projector.evaluate(record_batch)

call_on_value(
    'round',
    [(Decimal("123.459"), pa.decimal128(28, 3))],
    [(2, pa.int32())],
    pa.decimal128(28, 3)
)
# Returns: 123.459 (not rounded!)

call_on_value(
    'round',
    [(Decimal("123.459"), pa.decimal128(28, 3))],
    [(-2, pa.int32())],
    pa.decimal128(28, 3)
)
# Returns: 0.100 ()
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12129) [Python][Gandiva] Infer return types for make_if and make_in_expression
Will Jones created ARROW-12129: -- Summary: [Python][Gandiva] Infer return types for make_if and make_in_expression Key: ARROW-12129 URL: https://issues.apache.org/jira/browse/ARROW-12129 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Will Jones On the {{TreeExprBuilder}} in {{pyarrow.gandiva}}, both the {{make_if}} and {{make_in_expression}} require the user to specify the return type. These could easily be inferred from the input values. ARROW-11342 exposes the return type of nodes as a method, so this should be easy to do once that is merged. To keep the changes backwards compatible, we can make the return_type an optional argument. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11342) [Python] [Gandiva] Expose ToString and result type information
Will Jones created ARROW-11342: -- Summary: [Python] [Gandiva] Expose ToString and result type information Key: ARROW-11342 URL: https://issues.apache.org/jira/browse/ARROW-11342 Project: Apache Arrow Issue Type: Improvement Reporter: Will Jones Assignee: Will Jones To make it easier to build and introspect the expression trees, I would like to expose the ToString() methods on Node, Expression, and Condition, as well as the methods exposing the fields and types inside.
{code:python}
import pyarrow as pa
import pyarrow.gandiva as gandiva

builder = gandiva.TreeExprBuilder()
print(builder.make_literal(1000.0, pa.float64()))
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)