[jira] [Created] (ARROW-18359) PrettyPrint Improvements

2022-11-17 Thread Will Jones (Jira)
Will Jones created ARROW-18359:
--

 Summary: PrettyPrint Improvements
 Key: ARROW-18359
 URL: https://issues.apache.org/jira/browse/ARROW-18359
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python, R
Reporter: Will Jones


We have some pretty printing capabilities, but we may want to think about the 
design at a high level first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18239) [C++][Docs] Add examples of Parquet TypedColumnWriter to user guide

2022-11-03 Thread Will Jones (Jira)
Will Jones created ARROW-18239:
--

 Summary: [C++][Docs] Add examples of Parquet TypedColumnWriter to 
user guide
 Key: ARROW-18239
 URL: https://issues.apache.org/jira/browse/ARROW-18239
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Will Jones


Since this is the more performant way to write Parquet data without going 
through Arrow, we should show examples of it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18230) [Python] Pass CMake args to Python CPP

2022-11-02 Thread Will Jones (Jira)
Will Jones created ARROW-18230:
--

 Summary: [Python] Pass CMake args to Python CPP 
 Key: ARROW-18230
 URL: https://issues.apache.org/jira/browse/ARROW-18230
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Will Jones
 Fix For: 11.0.0


We pass {{extra_cmake_args}} to {{_run_cmake}} (the Cython build) but not to 
{{_run_cmake_pyarrow_cpp}} (the PyArrow C++ build). We should probably pass it 
to both.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18204) [R] Allow setting field metadata

2022-10-31 Thread Will Jones (Jira)
Will Jones created ARROW-18204:
--

 Summary: [R] Allow setting field metadata
 Key: ARROW-18204
 URL: https://issues.apache.org/jira/browse/ARROW-18204
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 10.0.0
Reporter: Will Jones


Currently, you can't create a {{Field}} with metadata, which makes it hard to 
write tests involving field metadata.
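
For comparison, the Python bindings already expose this; a minimal sketch of 
the behavior the R bindings could mirror:

{code:python}
import pyarrow as pa

# Field-level metadata at construction time (what the R bindings currently lack):
f = pa.field("x", pa.int32(), metadata={"origin": "sensor-a"})
print(f.metadata)  # {b'origin': b'sensor-a'}
{code}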



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17994) [C++] Add overflow argument is required when it shouldn't be

2022-10-11 Thread Will Jones (Jira)
Will Jones created ARROW-17994:
--

 Summary: [C++] Add overflow argument is required when it shouldn't 
be
 Key: ARROW-17994
 URL: https://issues.apache.org/jira/browse/ARROW-17994
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Will Jones
 Fix For: 11.0.0


If I pass a Substrait plan that contains an add function, but don't provide the 
overflow argument, I get the following error:

{code:none}
Traceback (most recent call last):
  File "", line 1, in 
  File "pyarrow/_substrait.pyx", line 140, in pyarrow._substrait.run_query
  File "pyarrow/error.pxi", line 144, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected Substrait call to have an enum argument at index 0 but the argument was not an enum.
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:684  call.GetEnumArg(arg_index)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/extension_set.cc:702  ParseEnumArg(call, 0, kOverflowParser)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:332  FromProto(expr, ext_set, conversion_options)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/serde.cc:156  FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), ext_set, conversion_options)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:106  engine::DeserializePlans(substrait_buffer, consumer_factory, registry, nullptr, conversion_options_)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:130  executor.Init(substrait_buffer, registry)
{code}

Yet in the spec, this argument is supposed to be optional: 
https://github.com/substrait-io/substrait/blob/f3f6bdc947e689e800279666ff33f118e42d2146/extensions/functions_arithmetic.yaml#L11

If I modify the plan to include the argument, it works as expected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17963) [C++] Implement cast_dictionary for string

2022-10-07 Thread Will Jones (Jira)
Will Jones created ARROW-17963:
--

 Summary: [C++] Implement cast_dictionary for string
 Key: ARROW-17963
 URL: https://issues.apache.org/jira/browse/ARROW-17963
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Will Jones
 Fix For: 11.0.0


We can cast dictionary(X, string()) to string, but not the other way around.

{code:R}
> Array$create(c("a", "b"))$cast(dictionary(int32(), string()))
Error: NotImplemented: Unsupported cast from string to dictionary using function cast_dictionary
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/compute/function.cc:249  func.DispatchBest(_types)

> Array$create(as.factor(c("a", "b")))$cast(string())
Array
<string>
[
  "a",
  "b"
]
{code}
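
The same gap is visible from Python; a minimal sketch, where {{dictionary_encode()}} serves as the workaround until the cast kernel exists:

{code:python}
import pyarrow as pa

arr = pa.array(["a", "b"])
# The missing kernel: this raises ArrowNotImplementedError today.
# arr.cast(pa.dictionary(pa.int32(), pa.string()))
# Workaround: dictionary_encode() already produces the desired type.
print(arr.dictionary_encode().type)  # dictionary<values=string, indices=int32, ordered=0>
{code}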



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17954) [R] Update News for 10.0.0

2022-10-06 Thread Will Jones (Jira)
Will Jones created ARROW-17954:
--

 Summary: [R] Update News for 10.0.0
 Key: ARROW-17954
 URL: https://issues.apache.org/jira/browse/ARROW-17954
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 10.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17944) [Python] Accept bytes object in pyarrow.substrait.run_query

2022-10-05 Thread Will Jones (Jira)
Will Jones created ARROW-17944:
--

 Summary: [Python] Accept bytes object in 
pyarrow.substrait.run_query
 Key: ARROW-17944
 URL: https://issues.apache.org/jira/browse/ARROW-17944
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Will Jones
 Fix For: 11.0.0


{{pyarrow.substrait.run_query()}} only accepts a PyArrow buffer, and will 
segfault if something else is passed. People might try to pass a Python bytes 
object, which isn't unreasonable. For example, they might use the value 
returned by protobuf's {{SerializeToString()}} method, which is Python bytes. 
At the very least, we should not segfault.
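
A minimal sketch of the current workaround (wrapping the bytes in a buffer; {{plan}} here stands in for a protobuf message object):

{code:python}
import pyarrow as pa
import pyarrow.substrait as substrait

plan_bytes = plan.SerializeToString()  # Python bytes from a protobuf message
# Wrapping in a pyarrow Buffer avoids the segfault today; the proposal is for
# run_query to accept the bytes object directly.
reader = substrait.run_query(pa.py_buffer(plan_bytes))
{code}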



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17923) [C++] Consider dictionary arrays for special fragment fields

2022-10-03 Thread Will Jones (Jira)
Will Jones created ARROW-17923:
--

 Summary: [C++] Consider dictionary arrays for special fragment 
fields
 Key: ARROW-17923
 URL: https://issues.apache.org/jira/browse/ARROW-17923
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Will Jones


I noticed in ARROW-15281 we made {{__filename}} a string column. In common 
cases, this will be inefficient if materialized. If possible, it may be better 
to make these columns dictionary arrays.

As an example, 
[here|https://github.com/apache/arrow/pull/12826#issuecomment-1230745059] is a 
user report of 10x increased memory usage caused by accidentally including 
these special fragment columns.
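
A rough Python illustration of the size difference at stake (plain strings versus dictionary encoding; the numbers are ballpark):

{code:python}
import pyarrow as pa

# A fragment path repeated once per row, as __filename would be:
filenames = pa.array(["year=2009/month=1/part-0.parquet"] * 1_000_000)
print(filenames.nbytes)                      # tens of MB: offsets plus repeated bytes
print(filenames.dictionary_encode().nbytes)  # a few MB: int32 indices plus one value
{code}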



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17897) [Packaging][Conan] Add back ARROW_GCS to conanfile.py

2022-09-29 Thread Will Jones (Jira)
Will Jones created ARROW-17897:
--

 Summary: [Packaging][Conan] Add back ARROW_GCS to conanfile.py
 Key: ARROW-17897
 URL: https://issues.apache.org/jira/browse/ARROW-17897
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 10.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17845) [CI][Conan] Re-enable Flight in Conan CI check

2022-09-26 Thread Will Jones (Jira)
Will Jones created ARROW-17845:
--

 Summary: [CI][Conan] Re-enable Flight in Conan CI check
 Key: ARROW-17845
 URL: https://issues.apache.org/jira/browse/ARROW-17845
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Will Jones
Assignee: Will Jones






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17812) [C++][Documentation] Add Gandiva User Guide

2022-09-21 Thread Will Jones (Jira)
Will Jones created ARROW-17812:
--

 Summary: [C++][Documentation] Add Gandiva User Guide
 Key: ARROW-17812
 URL: https://issues.apache.org/jira/browse/ARROW-17812
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 10.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17788) [R][Doc] Add example of using Scanner

2022-09-20 Thread Will Jones (Jira)
Will Jones created ARROW-17788:
--

 Summary: [R][Doc] Add example of using Scanner
 Key: ARROW-17788
 URL: https://issues.apache.org/jira/browse/ARROW-17788
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, R
Affects Versions: 9.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 10.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17776) [C++] Stabilize Parquet ArrowReaderProperties

2022-09-19 Thread Will Jones (Jira)
Will Jones created ARROW-17776:
--

 Summary: [C++] Stabilize Parquet ArrowReaderProperties
 Key: ARROW-17776
 URL: https://issues.apache.org/jira/browse/ARROW-17776
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Parquet
Affects Versions: 9.0.0
Reporter: Will Jones


{{ArrowReaderProperties}} is still marked experimental, but it's pretty well 
used at this point.

One possible change we might wish to make before stabilizing the API: the 
{{ArrowWriterProperties}} class uses a namespaced builder class, which provides 
a nice syntax for creation and enforces immutability of the final properties. 
Perhaps we should mirror that design in the reader properties?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17441) [Python] Memory kept after del and pool.release_unused()

2022-08-16 Thread Will Jones (Jira)
Will Jones created ARROW-17441:
--

 Summary: [Python] Memory kept after del and pool.release_unused()
 Key: ARROW-17441
 URL: https://issues.apache.org/jira/browse/ARROW-17441
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 9.0.0
Reporter: Will Jones


I was trying to reproduce another issue involving memory pools not releasing 
memory, but encountered this confusing behavior: if I create a table, then call 
{{del table}}, and then {{pool.release_unused()}}, I still see significant 
memory usage. On mimalloc in particular, I see no meaningful drop in memory 
usage at either call.

Am I missing something?
{code:python}
import os
import psutil
import time
import gc
process = psutil.Process(os.getpid())
import numpy as np
from uuid import uuid4


import pyarrow as pa

def gen_batches(n_groups=200, rows_per_group=200_000):
    for _ in range(n_groups):
        id_val = uuid4().bytes
        yield pa.table({
            "x": np.random.random(rows_per_group),  # This will compress poorly
            "y": np.random.random(rows_per_group),
            "a": pa.array(list(range(rows_per_group)), type=pa.int32()),  # This compresses with delta encoding
            "id": pa.array([id_val] * rows_per_group),  # This compresses with RLE
        })

def print_rss():
    print(f"RSS: {process.memory_info().rss:,} bytes")

print(f"memory_pool={pa.default_memory_pool().backend_name}")
print_rss()
print("reading table")
tab = pa.concat_tables(list(gen_batches()))
print_rss()
print("deleting table")
del tab
gc.collect()
print_rss()
print("releasing unused memory")
pa.default_memory_pool().release_unused()
print_rss()
print("waiting 10 seconds")
time.sleep(10)
print_rss()
{code}
{code:none}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool.py
memory_pool=mimalloc
RSS: 44,449,792 bytes
reading table
RSS: 1,819,557,888 bytes
deleting table
RSS: 1,819,590,656 bytes
releasing unused memory
RSS: 1,819,852,800 bytes
waiting 10 seconds
RSS: 1,819,852,800 bytes
memory_pool=jemalloc
RSS: 45,629,440 bytes
reading table
RSS: 1,668,677,632 bytes
deleting table
RSS: 698,400,768 bytes
releasing unused memory
RSS: 699,023,360 bytes
waiting 10 seconds
RSS: 699,023,360 bytes
memory_pool=system
RSS: 44,875,776 bytes
reading table
RSS: 1,713,569,792 bytes
deleting table
RSS: 540,311,552 bytes
releasing unused memory
RSS: 540,311,552 bytes
waiting 10 seconds
RSS: 540,311,552 bytes
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17401) [C++] Add ReadTable method to RecordBatchFileReader

2022-08-12 Thread Will Jones (Jira)
Will Jones created ARROW-17401:
--

 Summary: [C++] Add ReadTable method to RecordBatchFileReader
 Key: ARROW-17401
 URL: https://issues.apache.org/jira/browse/ARROW-17401
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 9.0.0
Reporter: Will Jones


For convenience, it would be helpful to add a method for reading the entire 
file as a table.
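
The Python bindings already wrap this pattern; a short sketch of the convenience being requested on the C++ side ("data.arrow" is an illustrative path):

{code:python}
import pyarrow as pa
import pyarrow.ipc

reader = pa.ipc.open_file("data.arrow")  # RecordBatchFileReader
table = reader.read_all()                # the whole file as a single Table
{code}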



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17400) [C++] Move Parquet APIs to use Result instead of Status

2022-08-12 Thread Will Jones (Jira)
Will Jones created ARROW-17400:
--

 Summary: [C++] Move Parquet APIs to use Result instead of Status
 Key: ARROW-17400
 URL: https://issues.apache.org/jira/browse/ARROW-17400
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 9.0.0
Reporter: Will Jones


Notably, IPC and CSV have "open file" methods that return {{Result}}, while 
opening a Parquet file requires passing in an out variable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17349) [C++] Support casting field names of list and map when nested

2022-08-08 Thread Will Jones (Jira)
Will Jones created ARROW-17349:
--

 Summary: [C++] Support casting field names of list and map when 
nested
 Key: ARROW-17349
 URL: https://issues.apache.org/jira/browse/ARROW-17349
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 9.0.0
Reporter: Will Jones
 Fix For: 10.0.0


Different parquet implementations use different field names for internal fields 
of ListType and MapType, which can sometimes cause silly conflicts. For 
example, we use {{item}} as the field name for list, but Spark uses 
{{element}}. Fortunately, we can automatically cast between List and Map Types 
with different field names. Unfortunately, it only works at the top level. We 
should get it to work at arbitrary levels of nesting.

This was discovered in delta-rs: 
https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285

Here's a reproduction in Python:

{code:Python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

def roundtrip_scanner(in_arr, out_type):
    table = pa.table({"arr": in_arr})
    pq.write_table(table, "test.parquet")
    schema = pa.schema({"arr": out_type})
    ds.dataset("test.parquet", schema=schema).to_table()

# MapType
ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
ty = pa.map_(pa.int32(), pa.int32())
arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
roundtrip_scanner(arr_named, ty)

# ListType
ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False))
ty = pa.list_(pa.int32())
arr_named = pa.array([[1, 2, 4]], type=ty_named)
roundtrip_scanner(arr_named, ty)

# Combination MapType and ListType
ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", pa.int32(), nullable=True)), nullable=False))
ty = pa.map_(pa.string(), pa.list_(pa.int32()))
arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
roundtrip_scanner(arr_named, ty)
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "<stdin>", line 5, in roundtrip_scanner
#   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
#   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
#   File "pyarrow/error.pxi", line 144, in 
pyarrow.lib.pyarrow_internal_check_status
#   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
# pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map<string, list<item: int32>> from map<string, x: list<x: int32>> ('arr')
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17343) [Docs][C++] Add missing methods to ArrayBuilders API Reference

2022-08-08 Thread Will Jones (Jira)
Will Jones created ARROW-17343:
--

 Summary: [Docs][C++] Add missing methods to ArrayBuilders API 
Reference
 Key: ARROW-17343
 URL: https://issues.apache.org/jira/browse/ARROW-17343
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 9.0.0
Reporter: Will Jones
 Fix For: 10.0.0


At the very least, {{StructBuilder}} doesn't show its {{num_fields()}} and 
{{field_builder()}} methods.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17310) [C++] Expose SimpleRecordBatchReader publicly

2022-08-04 Thread Will Jones (Jira)
Will Jones created ARROW-17310:
--

 Summary: [C++] Expose SimpleRecordBatchReader publicly
 Key: ARROW-17310
 URL: https://issues.apache.org/jira/browse/ARROW-17310
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 9.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 10.0.0


It's unclear why this isn't public to begin with. Perhaps at the time, Iterator 
wasn't considered public, but now we are using it in public headers.

https://github.com/apache/arrow/blob/916417da0a966797c453126f57b657a0449651b5/cpp/src/arrow/record_batch.cc#L359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17298) [C++][Docs] Add Acero project example in Getting Started Section

2022-08-03 Thread Will Jones (Jira)
Will Jones created ARROW-17298:
--

 Summary: [C++][Docs] Add Acero project example in Getting Started 
Section
 Key: ARROW-17298
 URL: https://issues.apache.org/jira/browse/ARROW-17298
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Will Jones
 Fix For: 10.0.0


From [~westonpace]:

{quote}
A request I've seen a few times (and just received now) has been...
Can you point me at a sample C++ starter project that links against Acero?  For 
example, I tend to use a CMakeLists.txt that looks something like...
{code}
cmake_minimum_required(VERSION 3.10)

set(CMAKE_MODULE_PATH "${CMAKE_SOURCE_DIR}/cmake;${CMAKE_MODULE_PATH}")
set(CMAKE_CXX_FLAGS "-Wall -Wextra")
# set(CMAKE_CXX_FLAGS_DEBUG "-g")
set(CMAKE_CXX_FLAGS_RELEASE "-O3")

# set the project name
project(Experiments VERSION 1.0)

# specify the C++ standard
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

if(NOT DEFINED CONDA_HOME)
  message(FATAL_ERROR "CONDA_HOME is a required variable")
endif()

include_directories(SYSTEM ${CONDA_HOME}/include)
link_directories(${CONDA_HOME}/lib64)
link_directories(${CONDA_HOME}/lib)

function(experiment TARGET)
  add_executable(
    ${TARGET}
    ${TARGET}.cc
  )
  target_link_libraries(
    ${TARGET}
    arrow
    arrow_dataset
    parquet
    aws-cpp-sdk-core
    aws-cpp-sdk-s3
    glog
    pthread
    re2
    utf8proc
    lz4
    snappy
    z
    zstd
    aws-cpp-sdk-identity-management
    thrift
  )
  if (MSVC)
    target_compile_options(${TARGET} PRIVATE /W4 /WX)
  else ()
    target_compile_options(${TARGET} PRIVATE -Wall -Wextra -Wpedantic -Werror)
  endif ()
endfunction()

experiment(arrow_16642)
{code}
{quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17295) [C++] Build separate bundled_dependencies.so

2022-08-03 Thread Will Jones (Jira)
Will Jones created ARROW-17295:
--

 Summary: [C++] Build separate bundled_dependencies.so
 Key: ARROW-17295
 URL: https://issues.apache.org/jira/browse/ARROW-17295
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 8.0.1, 8.0.0
Reporter: Will Jones


When building arrow _static_ libraries with bundled dependencies, we produce 
{{{}arrow_bundled_dependencies.a{}}}. But when building dynamic libraries, the 
bundled dependencies are statically linked directly into the arrow libraries 
(libarrow, libarrow_flight, etc.). This means that users can access the symbols 
of bundled dependencies in the static case, but not in the dynamic library case.

One use case of this is being able to pass in gRPC configuration to a Flight 
server, which requires access to gRPC symbols.

Could we change the dynamic library building to build an 
{{arrow_bundled_dependencies.so}} so that the symbols are accessible?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17188) [R] Update news for 9.0.0

2022-07-22 Thread Will Jones (Jira)
Will Jones created ARROW-17188:
--

 Summary: [R] Update news for 9.0.0
 Key: ARROW-17188
 URL: https://issues.apache.org/jira/browse/ARROW-17188
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Affects Versions: 9.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 9.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17152) [Docs] Enable dark mode on documentation site

2022-07-20 Thread Will Jones (Jira)
Will Jones created ARROW-17152:
--

 Summary: [Docs] Enable dark mode on documentation site
 Key: ARROW-17152
 URL: https://issues.apache.org/jira/browse/ARROW-17152
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Will Jones
 Fix For: 10.0.0
 Attachments: Screen Shot 2022-07-20 at 3.10.51 PM.png, Screen Shot 
2022-07-20 at 3.12.18 PM.png

pydata-sphinx-theme adds dark mode in version 0.9.0. We will need to adapt our 
logo ([see 
docs|https://pydata-sphinx-theme.readthedocs.io/en/stable/user_guide/configuring.html?highlight=dark#different-logos-for-light-and-dark-mode]).
 There are also some places in the docs where we may need to adjust additional 
CSS. See the attached screenshots.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17151) [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode

2022-07-20 Thread Will Jones (Jira)
Will Jones created ARROW-17151:
--

 Summary: [Docs] Pin pydata-sphinx-theme to 0.8 to avoid dark mode
 Key: ARROW-17151
 URL: https://issues.apache.org/jira/browse/ARROW-17151
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Will Jones
 Fix For: 9.0.0


pydata-sphinx-theme introduced automatic dark mode. However, there are a number 
of changes we need to make (such as providing a dark-mode Arrow logo) before we 
are ready for it. For the 9.0.0 release, we should instead pin to the version 
of pydata-sphinx-theme just before that release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17150) [R] Allow statically linked libcurl in GCS when building libarrow DLL in RTools

2022-07-20 Thread Will Jones (Jira)
Will Jones created ARROW-17150:
--

 Summary: [R] Allow statically linked libcurl in GCS when building 
libarrow DLL in RTools
 Key: ARROW-17150
 URL: https://issues.apache.org/jira/browse/ARROW-17150
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 9.0.0
Reporter: Will Jones
 Fix For: 10.0.0


Neal's patch in ARROW-16510 enabled libcurl to be linked statically in the 
Google Cloud Storage dependency, but this only seems to work for static 
libraries on RTools (Windows). For development RTools environments, we 
currently use dynamic Arrow libraries instead, and we get linking errors for 
libcurl when ARROW_GCS is on.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17149) [R] Enable GCS tests for Windows

2022-07-20 Thread Will Jones (Jira)
Will Jones created ARROW-17149:
--

 Summary: [R] Enable GCS tests for Windows
 Key: ARROW-17149
 URL: https://issues.apache.org/jira/browse/ARROW-17149
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Affects Versions: 9.0.0
Reporter: Will Jones
 Fix For: 10.0.0


In ARROW-16879, I found the GCS tests were hanging in CI, but couldn't diagnose 
why. We should solve that and enable the tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17097) [C++] GCS: report common prefixes as directories

2022-07-15 Thread Will Jones (Jira)
Will Jones created ARROW-17097:
--

 Summary: [C++] GCS: report common prefixes as directories
 Key: ARROW-17097
 URL: https://issues.apache.org/jira/browse/ARROW-17097
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 8.0.0, 9.0.0
Reporter: Will Jones
 Fix For: 10.0.0


I got confused at the behavior differences between S3 and GCS, only to realize 
GCS only reports special directory markers as "directories" and not the common 
prefixes. This can have the effect of making a directory look empty in GCS, 
when it in fact has many folders (see example below).

We currently use the 
[ListObjects|https://github.com/googleapis/google-cloud-cpp/blob/e0228233949873be9cba87ae4e37554e1ff0474d/google/cloud/storage/client.h#L974]
 method, but perhaps it would be more appropriate to use the 
[ListObjectsWithPrefix|https://github.com/googleapis/google-cloud-cpp/blob/e0228233949873be9cba87ae4e37554e1ff0474d/google/cloud/storage/client.h#L1006] method.
 Since they are returned in the [same API 
call|https://cloud.google.com/storage/docs/json_api/v1/objects/list], it 
shouldn't add much overhead.
{code:r}
library(arrow)

bucket <- gs_bucket("voltrondata-labs-datasets", retry_limit_seconds = 3,
                    anonymous = TRUE)
s3_bucket <- s3_bucket("voltrondata-labs-datasets",
                       endpoint_override = "https://storage.googleapis.com")

# We did not create directory markers when uploading the data
# https://github.com/apache/arrow/pull/11842#discussion_r764204767

# The directory appears empty to GCSFileSystem...
bucket$ls("nyc-taxi")
#> character(0)

# ... but S3FileSystem knows otherwise!
s3_bucket$ls("nyc-taxi")
#>  [1] "nyc-taxi/year=2009" "nyc-taxi/year=2010" "nyc-taxi/year=2011"
#>  [4] "nyc-taxi/year=2012" "nyc-taxi/year=2013" "nyc-taxi/year=2014"
#>  [7] "nyc-taxi/year=2015" "nyc-taxi/year=2016" "nyc-taxi/year=2017"
#> [10] "nyc-taxi/year=2018" "nyc-taxi/year=2019" "nyc-taxi/year=2020"
#> [13] "nyc-taxi/year=2021" "nyc-taxi/year=2022"

# Using GCS API, we only get files!
bucket$ls("nyc-taxi", recursive = TRUE)
#>   [1] "nyc-taxi/year=2009/month=1/part-0.parquet" 
#>   [2] "nyc-taxi/year=2009/month=10/part-0.parquet"
#> ...
#> [157] "nyc-taxi/year=2022/month=1/part-0.parquet" 
#> [158] "nyc-taxi/year=2022/month=2/part-0.parquet"

# Using S3 API, we can get directories!
s3_bucket$ls("nyc-taxi", recursive = TRUE)
#>   [1] "nyc-taxi/year=2009"
#>   [2] "nyc-taxi/year=2009/month=1"
#>   [3] "nyc-taxi/year=2009/month=1/part-0.parquet" 
#>   [4] "nyc-taxi/year=2009/month=10"   
#>   [5] "nyc-taxi/year=2009/month=10/part-0.parquet"
#>   [6] "nyc-taxi/year=2009/month=11"   
#> ...
#> [329] "nyc-taxi/year=2022/month=2"
#> [330] "nyc-taxi/year=2022/month=2/part-0.parquet"
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17075) [C++] HDFS tests broken by trailing slash tests

2022-07-14 Thread Will Jones (Jira)
Will Jones created ARROW-17075:
--

 Summary: [C++] HDFS tests broken by trailing slash tests
 Key: ARROW-17075
 URL: https://issues.apache.org/jira/browse/ARROW-17075
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 9.0.0


https://github.com/apache/arrow/pull/13577#issuecomment-1184541864



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17069) [Python][R] GCSFileSystem reports cannot resolve host on public buckets

2022-07-13 Thread Will Jones (Jira)
Will Jones created ARROW-17069:
--

 Summary: [Python][R] GCSFileSystem reports cannot resolve host on 
public buckets
 Key: ARROW-17069
 URL: https://issues.apache.org/jira/browse/ARROW-17069
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python, R
Affects Versions: 8.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 9.0.0


GCSFileSystem will return {{Couldn't resolve host name}} if you don't supply 
{{anonymous}} as the user:
{code:python}
import pyarrow.dataset as ds

# Fails:
dataset = ds.dataset("gs://anonymous@voltrondata-labs-datasets/taxi-data/?retry_limit_seconds=3")
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 749, in dataset
#     return _filesystem_dataset(source, **kwargs)
#   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 441, in _filesystem_dataset
#     fs, paths_or_selector = _ensure_single_source(source, filesystem)
#   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", line 417, in _ensure_single_source
#     raise FileNotFoundError(path)
# FileNotFoundError: voltrondata-labs-datasets/taxi-data

# This works fine:
dataset = ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
{code}

I would expect that we could connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17047) [Python][Docs] Document how to get field from StructType

2022-07-11 Thread Will Jones (Jira)
Will Jones created ARROW-17047:
--

 Summary: [Python][Docs] Document how to get field from StructType
 Key: ARROW-17047
 URL: https://issues.apache.org/jira/browse/ARROW-17047
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 8.0.0
Reporter: Will Jones


It's not at all obvious from its API page how to get a particular field from a 
StructType:

https://arrow.apache.org/docs/python/generated/pyarrow.StructType.html#pyarrow.StructType

We should add an example:

{code:python}
import pyarrow as pa

struct_type = pa.struct({"x": pa.int32(), "y": pa.string()})
struct_type[0]
# pyarrow.Field<x: int32>
pa.schema(list(struct_type))
# x: int32
# y: string
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17045) [C++] GCS doesn't drop ending slash for files

2022-07-11 Thread Will Jones (Jira)
Will Jones created ARROW-17045:
--

 Summary: [C++] GCS doesn't drop ending slash for files
 Key: ARROW-17045
 URL: https://issues.apache.org/jira/browse/ARROW-17045
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 8.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 9.0.0


There is inconsistent behavior between GCS and S3 when it comes to creating 
files.

Example:

{code:python}
import pyarrow.fs
from pyarrow.fs import FileSelector
from datetime import timedelta

gcs = pyarrow.fs.GcsFileSystem(
    endpoint_override="localhost:9001",
    scheme="http",
    anonymous=True,
    retry_time_limit=timedelta(seconds=1),
)

gcs.create_dir("py_test")
with gcs.open_output_stream("py_test/test.txt") as out_stream:
    out_stream.write(b"Hello world!")

with gcs.open_output_stream("py_test/test.txt/") as out_stream:
    out_stream.write(b"Hello world!")

gcs.get_file_info(FileSelector("py_test"))
# [<FileInfo for 'py_test/test.txt': type=FileType.File, size=12>,
#  <FileInfo for 'py_test/test.txt/': type=FileType.File, size=12>]

s3 = pyarrow.fs.S3FileSystem(
    access_key="minioadmin",
    secret_key="minioadmin",
    scheme="http",
    endpoint_override="localhost:9000",
    allow_bucket_creation=True,
    allow_bucket_deletion=True,
)

s3.create_dir("py-test")
with s3.open_output_stream("py-test/test.txt") as out_stream:
    out_stream.write(b"Hello world!")
with s3.open_output_stream("py-test/test.txt/") as out_stream:
    out_stream.write(b"Hello world!")

s3.get_file_info(FileSelector("py-test"))
# [<FileInfo for 'py-test/test.txt': type=FileType.File, size=12>]
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17020) [Python][R] GcsFilesystem can appear to hang for non permanent errors

2022-07-08 Thread Will Jones (Jira)
Will Jones created ARROW-17020:
--

 Summary: [Python][R] GcsFilesystem can appear to hang for non 
permanent errors
 Key: ARROW-17020
 URL: https://issues.apache.org/jira/browse/ARROW-17020
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python, R
Affects Versions: 8.0.0
Reporter: Will Jones


GcsFileSystem will attempt to retry if it gets a non-permanent error (such as 
couldn't connect to server). That's fine, except: (1) the sleep call used by 
the retry doesn't seem to check for interrupts and (2) the default retry 
timeout is 15 minutes!

The following snippets will hang for 15 minutes if you run them and wait about 
5 seconds before trying to do a keyboard interrupt (CTRL+C):

{code:bash}
Rscript -e 'library(arrow); fs <- GcsFileSystem$create(endpoint_override="localhost:1234", anonymous=TRUE); fs$CreateDir("x")'

python -c 'from pyarrow.fs import GcsFileSystem; fs = GcsFileSystem(endpoint_override="localhost:1234", anonymous=True); fs.create_dir("x")'
{code}
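
Until the interrupt handling and defaults improve, a minimal workaround sketch is to shorten the retry window explicitly (the parameter already exists on the Python constructor):

{code:python}
from datetime import timedelta
from pyarrow.fs import GcsFileSystem

# Fail after ~5 seconds instead of the 15-minute default described above.
fs = GcsFileSystem(endpoint_override="localhost:1234", anonymous=True,
                   retry_time_limit=timedelta(seconds=5))
fs.create_dir("x")
{code}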


 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-16936) [C++] arrow_bundled_dependencies missing Flight absl dependencies

2022-06-29 Thread Will Jones (Jira)
Will Jones created ARROW-16936:
--

 Summary: [C++] arrow_bundled_dependencies missing Flight absl 
dependencies
 Key: ARROW-16936
 URL: https://issues.apache.org/jira/browse/ARROW-16936
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 8.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 9.0.0
 Attachments: absl_build_errors.txt

If Flight is linked statically, it seems to miss some abseil dependencies. I 
created a repo to reproduce this issue: 
[https://github.com/wjones127/arrow-cpp-external-proj]

The build fails with the linking errors attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-16914) [Docs][C++] Add example of using ExternalProject_Add to use Arrow

2022-06-27 Thread Will Jones (Jira)
Will Jones created ARROW-16914:
--

 Summary: [Docs][C++] Add example of using ExternalProject_Add to 
use Arrow
 Key: ARROW-16914
 URL: https://issues.apache.org/jira/browse/ARROW-16914
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Affects Versions: 8.0.0
Reporter: Will Jones


We've [given advice|https://stackoverflow.com/a/59939033/2048858] to use 
{{ExternalProject_Add}} to build Arrow from source within a user's CMake 
project. (Correct me if I'm wrong that this is the preferred method now.) But 
I found it non-trivial to implement.

We should add a simple example of doing this to the User Guide. We should also 
mention that we don't support {{add_subdirectory}} as that seems to be a common 
gotcha as well.

This might overlap with ARROW-9740, but I don't quite understand what that 
issue is proposing.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16887) [Doc][R] Document GCSFileSystem for R package

2022-06-22 Thread Will Jones (Jira)
Will Jones created ARROW-16887:
--

 Summary: [Doc][R] Document GCSFileSystem for R package
 Key: ARROW-16887
 URL: https://issues.apache.org/jira/browse/ARROW-16887
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, R
Reporter: Will Jones
 Fix For: 9.0.0


We should update the [cloud storage 
vignette|https://arrow.apache.org/docs/r/articles/fs.html] and the filesystem 
RD to show configuration and usage of GCSFileSystem.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16870) [C++] lld does not like --as-needed flag

2022-06-21 Thread Will Jones (Jira)
Will Jones created ARROW-16870:
--

 Summary: [C++] lld does not like --as-needed flag
 Key: ARROW-16870
 URL: https://issues.apache.org/jira/browse/ARROW-16870
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 8.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 9.0.0


I've been getting this annoying linking error if I try to build examples using 
Clang 13 on MacOS:

{code:none}
[build] [807/827] Linking CXX executable debug/flight-grpc-example
[build] FAILED: debug/flight-grpc-example
[build] : && /Library/Developer/CommandLineTools/usr/bin/c++ -Qunused-arguments 
-fcolor-diagnostics  ...
[build] ld: unknown option: --no-as-needed
[build] clang: error: linker command failed with exit code 1 (use -v to see 
invocation)
{code}

Should we drop {{--as-needed}}, or should I add a carve-out for Apple? cc 
[~davidli]

My workaround has been to comment out these lines: 
https://github.com/apache/arrow/blob/982ea6c4d382d1e85164f09b711e87938eaa674a/cpp/examples/arrow/CMakeLists.txt#L39-L40





--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16844) [C++][Python] Implement to/from substrait for Expression

2022-06-16 Thread Will Jones (Jira)
Will Jones created ARROW-16844:
--

 Summary: [C++][Python] Implement to/from substrait for Expression
 Key: ARROW-16844
 URL: https://issues.apache.org/jira/browse/ARROW-16844
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Will Jones


DataFusion has the ability to convert between Substrait expressions and its 
own internal expressions. (See: 
[https://github.com/datafusion-contrib/datafusion-substrait] .) It would be 
cool if we had a similar conversion for Acero's Expression class.

This might unlock allowing datafusion-python to easily use PyArrow datasets, by 
using Substrait as an intermediate format to pass filters and projections down 
from DataFusion into the scanner. (See an early draft here: 
[https://github.com/datafusion-contrib/datafusion-python/pull/21].)

One problem is that it's unclear what the type of the Python object 
representing the Substrait expression should be. IIUC, Python doesn't have 
direct bindings to the Substrait protobuf.
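
A purely hypothetical sketch of what the Python surface might look like (none of these methods exist today; the names are illustrative only):

{code:python}
import pyarrow.dataset as ds

expr = ds.field("x") > 5
# Hypothetical API:
buf = expr.to_substrait()                  # Expression -> serialized Substrait
expr2 = ds.Expression.from_substrait(buf)  # serialized Substrait -> Expression
{code}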

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16828) [R][Packaging] Turn on all compression libs for binaries

2022-06-14 Thread Will Jones (Jira)
Will Jones created ARROW-16828:
--

 Summary: [R][Packaging] Turn on all compression libs for binaries
 Key: ARROW-16828
 URL: https://issues.apache.org/jira/browse/ARROW-16828
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, R
Affects Versions: 8.0.0
Reporter: Will Jones
 Fix For: 9.0.0


We notably don't ship brotli for MacOS. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16802) [Docs] Improve Acero Documentation

2022-06-09 Thread Will Jones (Jira)
Will Jones created ARROW-16802:
--

 Summary: [Docs] Improve Acero Documentation
 Key: ARROW-16802
 URL: https://issues.apache.org/jira/browse/ARROW-16802
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Will Jones


From [~amol-]:
{quote}If we want to start promoting Acero to the world, I think we should work 
on improving a bit the documentation first. Having a blog post that then 
redirects people to a docs that they find hard to read/apply might actually be 
counterproductive as it might create a fame of being badly documented.
At the moment the only mention of it is 
[https://arrow.apache.org/docs/cpp/streaming_execution.html] and it's not very 
easy to follow (not many explanations, just blocks of code). In comparison, if 
you look at the compute chapter in Python ( 
[https://arrow.apache.org/docs/dev/python/compute.html] ) it's much more 
talkative and explains things as it goes.
{quote}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16800) [C++] arrow::RecordBatchBuilder to use Result

2022-06-09 Thread Will Jones (Jira)
Will Jones created ARROW-16800:
--

 Summary: [C++] arrow::RecordBatchBuilder to use Result
 Key: ARROW-16800
 URL: https://issues.apache.org/jira/browse/ARROW-16800
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 8.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 9.0.0






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16789) [Format] Mark C Stream Interface as stable

2022-06-08 Thread Will Jones (Jira)
Will Jones created ARROW-16789:
--

 Summary: [Format] Mark C Stream Interface as stable
 Key: ARROW-16789
 URL: https://issues.apache.org/jira/browse/ARROW-16789
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Will Jones
Assignee: Will Jones


As discussed in [this dev mailing list 
thread|https://lists.apache.org/thread/0y604o9s3wkyty328wv8d21ol7s40q55], we 
may wish to mark the C stream interface stable. All feedback in the thread was 
positive, so I will go ahead and make a PR and call a vote.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16761) [C++][Python] Track bytes_written on FileWriter / WrittenFile

2022-06-06 Thread Will Jones (Jira)
Will Jones created ARROW-16761:
--

 Summary: [C++][Python] Track bytes_written on FileWriter / 
WrittenFile
 Key: ARROW-16761
 URL: https://issues.apache.org/jira/browse/ARROW-16761
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Affects Versions: 8.0.0
Reporter: Will Jones
 Fix For: 9.0.0


For Apache Iceberg and Delta Lake tables, we need to be able to get the size of 
the files written in bytes. In Iceberg, this is the required field 
{{file_size_in_bytes}} ([docs|https://iceberg.apache.org/spec/#manifests]). In 
Delta, this is the required field {{size}} as part of the Add action.

I think this could be exposed on 
[FileWriter|https://github.com/apache/arrow/blob/8c63788ff7d52812599a546989b7df10887cb01e/cpp/src/arrow/dataset/file_base.h#L305]
 and then, through that, on 
[WrittenFile|https://github.com/apache/arrow/blob/8c63788ff7d52812599a546989b7df10887cb01e/python/pyarrow/_dataset.pyx#L766-L769].
 But I'm not yet sure how to track this lower in the stack. {{FileWriter}} owns 
its {{OutputStream}}; would {{OutputStream::Tell()}} give the correct count?
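
For reference, a sketch of what Python callers do today via the dataset writer's file visitor (an extra stat per file; the table and output path are illustrative):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.fs

table = pa.table({"x": [1, 2, 3]})  # illustrative
fs = pyarrow.fs.LocalFileSystem()
sizes = {}

def file_visitor(written_file):
    # WrittenFile exposes .path and .metadata but not a byte count,
    # so we stat each file after the fact.
    sizes[written_file.path] = fs.get_file_info(written_file.path).size

ds.write_dataset(table, "out", format="parquet", file_visitor=file_visitor)
{code}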



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16760) [Docs] Mention PYARROW_PARALLEL in Python dev docs

2022-06-06 Thread Will Jones (Jira)
Will Jones created ARROW-16760:
--

 Summary: [Docs] Mention PYARROW_PARALLEL in Python dev docs
 Key: ARROW-16760
 URL: https://issues.apache.org/jira/browse/ARROW-16760
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 8.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 9.0.0


We should include {{PYARROW_PARALLEL}} in the Python developer docs. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16703) [R] Refactor map_batches() so it can stream results

2022-05-31 Thread Will Jones (Jira)
Will Jones created ARROW-16703:
--

 Summary: [R] Refactor map_batches() so it can stream results
 Key: ARROW-16703
 URL: https://issues.apache.org/jira/browse/ARROW-16703
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 8.0.0
Reporter: Will Jones
 Fix For: 9.0.0


As part of ARROW-15271, {{map_batches()}} was modified to return a 
{{RecordBatchReader}}, but the implementation collects all results as a list of 
record batches and then converts that to a reader. In theory, if we push the 
implementation down to C++, we should be able to make a proper streaming RBR.

We won't know the schema ahead of time. We could optionally accept it, which 
would allow the function to be lazy. Or we could eagerly evaluate just the 
first batch to determine the schema. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16702) [C++] Add compute functions for list array containment

2022-05-31 Thread Will Jones (Jira)
Will Jones created ARROW-16702:
--

 Summary: [C++] Add compute functions for list array containment
 Key: ARROW-16702
 URL: https://issues.apache.org/jira/browse/ARROW-16702
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 8.0.0
Reporter: Will Jones


Some operations we might implement (see the usage sketch below):

* {{array_contains(arr, x)}}: list array {{arr}} contains scalar {{x}}
* {{arrays_overlap(arr, sc)}}: list array {{arr}} shares common elements with 
list scalar {{sc}} (we could also implement a version taking another array as 
the second argument)
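
A hypothetical usage sketch of the proposed kernels (neither function exists yet; the intended semantics are shown in comments):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([[1, 2], [3], []])
# Proposed semantics, illustrative only:
# pc.array_contains(arr, 3)                  -> [false, true, false]
# pc.arrays_overlap(arr, pa.scalar([2, 3]))  -> [true, true, false]
{code}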



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16658) [Python] Support arithmetic on arrays and scalars

2022-05-25 Thread Will Jones (Jira)
Will Jones created ARROW-16658:
--

 Summary: [Python] Support arithmetic on arrays and scalars
 Key: ARROW-16658
 URL: https://issues.apache.org/jira/browse/ARROW-16658
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 8.0.0
Reporter: Will Jones


I was surprised to find you can't use standard arithmetic operators on PyArrow 
arrays and scalars. Instead, one must use the compute functions:

{code:Python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1, 2, 3])
pc.add(arr, 2)
# Doesn't work:
# arr + 2
# arr + pa.scalar(2)
# arr + arr

pc.multiply(arr, 20)
# Doesn't work:
# 20 * arr
# pa.scalar(20) * arr
{code}

Is it intentional we don't support this?
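
Illustrative only: the proposal amounts to having the array dunder methods delegate to the corresponding compute kernels, roughly like this sketch (not actual pyarrow internals):

{code:python}
import pyarrow.compute as pc

# Sketch of the delegation being asked for:
def _array_add(self, other):
    return pc.add(self, other)
# ... installed as Array.__add__, so `arr + 2` evaluates as pc.add(arr, 2).
{code}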



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16632) [Website] Announce Acero Engine

2022-05-23 Thread Will Jones (Jira)
Will Jones created ARROW-16632:
--

 Summary: [Website] Announce Acero Engine
 Key: ARROW-16632
 URL: https://issues.apache.org/jira/browse/ARROW-16632
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Will Jones


Given consensus on Acero as the name for the C++ streaming execution engine, it 
may be time to write a blog post announcing the engine, how it's currently 
available in the ecosystem, and what's happening next with it.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16510) [R] Add bindings for GCS filesystem

2022-05-09 Thread Will Jones (Jira)
Will Jones created ARROW-16510:
--

 Summary: [R] Add bindings for GCS filesystem
 Key: ARROW-16510
 URL: https://issues.apache.org/jira/browse/ARROW-16510
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 8.0.0
Reporter: Will Jones
 Fix For: 9.0.0






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16509) [R][Docs] Update dataset vignette

2022-05-09 Thread Will Jones (Jira)
Will Jones created ARROW-16509:
--

 Summary: [R][Docs] Update dataset vignette
 Key: ARROW-16509
 URL: https://issues.apache.org/jira/browse/ARROW-16509
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, R
Affects Versions: 8.0.0
Reporter: Will Jones
 Fix For: 9.0.0


Since the dataset vignette was written, we've added join, aggregation, and 
distinct support (and soon union/union_all support). The dataset vignette 
currently says we don't support those operations.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16421) [R] Permission error on Windows when deleting file in dataset

2022-04-29 Thread Will Jones (Jira)
Will Jones created ARROW-16421:
--

 Summary: [R] Permission error on Windows when deleting file in 
dataset
 Key: ARROW-16421
 URL: https://issues.apache.org/jira/browse/ARROW-16421
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 7.0.0
Reporter: Will Jones
Assignee: Will Jones


On Windows this fails: 
{code:r}
library(arrow)

write_dataset(iris, "test_dataset")

con <- open_dataset("test_dataset") |> to_duckdb()

file.remove("test_dataset/part-0.parquet")
#> Warning in file.remove("test_dataset/part-0.parquet"): cannot remove file
#> 'test_dataset/part-0.parquet', reason 'Permission denied'
#> [1] FALSE
{code}

But on MacOS it does not:

{code:R}
library(arrow)

write_dataset(iris, "test_dataset")

con <- open_dataset("test_dataset") |> to_duckdb()

file.remove("test_dataset/part-0.parquet")
#> [1] TRUE
{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16399) [R][C++] datetime locale support on Windows MINGW / R

2022-04-28 Thread Will Jones (Jira)
Will Jones created ARROW-16399:
--

 Summary: [R][C++] datetime locale support on Windows MINGW / R
 Key: ARROW-16399
 URL: https://issues.apache.org/jira/browse/ARROW-16399
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, R
Affects Versions: 7.0.0
Reporter: Will Jones


In [https://github.com/apache/arrow/pull/12536] I found that locales other than 
"C" and "POSIX" didn't seem to be supported in the RTools environment (a MSYS2 
fork). I saw some indications this might apply to any MINGW toolchain 
(https://stackoverflow.com/a/4497266/2048858), but nothing very recent or 
definitive.

Is there a way we can enable this support?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16243) [C++][Python] Remove Parquet ReadSchemaField method

2022-04-19 Thread Will Jones (Jira)
Will Jones created ARROW-16243:
--

 Summary: [C++][Python] Remove Parquet ReadSchemaField method
 Key: ARROW-16243
 URL: https://issues.apache.org/jira/browse/ARROW-16243
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 9.0.0


It doesn't seem like the experimental {{ReadSchemaField()}} method does 
anything different from {{ReadColumn()}} at this point. We should remove it and 
its corresponding Python method.

https://github.com/apache/arrow/blob/cedb4f8112b9c622dad88e0b6e8e0600f7e52746/cpp/src/parquet/arrow/reader.h#L143-L156



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16239) [R] $columns on Table and RB should be named

2022-04-19 Thread Will Jones (Jira)
Will Jones created ARROW-16239:
--

 Summary: [R] $columns on Table and RB should be named
 Key: ARROW-16239
 URL: https://issues.apache.org/jira/browse/ARROW-16239
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 9.0.0


Currently, the {{$columns}} method returns columns as a list without names. It 
would be nice if they were named instead, similar to {{as.list}} on a 
{{data.frame}}.

{code:R}
> library(arrow)
> names(record_batch(x = 1, y = 'a')$columns)
NULL
> names(arrow_table(x = 1, y = 'a')$columns)
NULL
> as.list(data.frame(x = 1, y = 'a'))
$x
[1] 1

$y
[1] "a"
{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16130) [Python][Docs] Document ParquetWriteOptions class

2022-04-05 Thread Will Jones (Jira)
Will Jones created ARROW-16130:
--

 Summary: [Python][Docs] Document ParquetWriteOptions class
 Key: ARROW-16130
 URL: https://issues.apache.org/jira/browse/ARROW-16130
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Python
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 8.0.0


The 
[{{ParquetFileWriteOptions}}|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.ParquetFileFormat.html#pyarrow.dataset.ParquetFileFormat.make_write_options]
 class, returned by 
[{{ParquetFileFormat.make_write_options}}|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.ParquetFileFormat.html#pyarrow.dataset.ParquetFileFormat.make_write_options],
 is not documented in the API docs, unlike {{ParquetReadOptions}}. Most of the 
associated options are already documented in 
[{{pyarrow.parquet.write_table}}|https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table],
 so they should be easy to write up.

For reference, we encountered this when trying to expose these options in [the 
delta-rs writer|https://github.com/delta-io/delta-rs/pull/581]. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16114) [Python] Document parquet.FileMetadata and statistics

2022-04-04 Thread Will Jones (Jira)
Will Jones created ARROW-16114:
--

 Summary: [Python] Document parquet.FileMetadata and statistics
 Key: ARROW-16114
 URL: https://issues.apache.org/jira/browse/ARROW-16114
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 8.0.0


{{FileMetaData}} in the parquet module (returned by {{ParquetFile.metadata}}) 
isn't in the API docs. We should add it so users know what fields are 
available.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16085) [R] Support unifying schemas for InMemoryDatasets

2022-03-31 Thread Will Jones (Jira)
Will Jones created ARROW-16085:
--

 Summary: [R] Support unifying schemas for InMemoryDatasets
 Key: ARROW-16085
 URL: https://issues.apache.org/jira/browse/ARROW-16085
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 8.0.0


 

The following fails:

{code:R}
library(arrow)
library(dplyr)

sub_df1 <- Table$create(
  x = Array$create(c(1, 2, 3)),
  y = Array$create(c("a", "b", "c"))
)
sub_df2 <- Table$create(
  x = Array$create(c(4, 5)),
  z = Array$create(c("d", "e"))
)

ds1 <- InMemoryDataset$create(sub_df1)
ds2 <- InMemoryDataset$create(sub_df2)
ds <- c(ds1, ds2)
actual <- ds %>% collect()
{code}

{code}
Type error: yielded batch had schema x: double
y: string which did not match InMemorySource's: x: double
y: string
z: string
/Users/willjones/Documents/arrows/arrow-quick/cpp/src/arrow/util/iterator.h:541  child_.Next()
/Users/willjones/Documents/arrows/arrow-quick/cpp/src/arrow/util/iterator.h:152  value_.status()
/Users/willjones/Documents/arrows/arrow-quick/cpp/src/arrow/util/iterator.h:180  maybe_element
/Users/willjones/Documents/arrows/arrow-quick/cpp/src/arrow/dataset/scanner.cc:840  fragments_it.ToVector()
{code}

If we fixed this, we could implement a function that does for Tables what 
{{dplyr::bind_rows}} does for Tibbles:

{code:R}
concat_tables <- function(..., schema = NULL) {
  # list2() is from rlang; map() is from purrr
  tables <- rlang::list2(...)

  dataset <- open_dataset(purrr::map(tables, InMemoryDataset$create), schema = schema)

  dplyr::collect(dataset, as_data_frame = FALSE)
}
{code}
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16054) [Python] Use tzdata timezone database on Windows

2022-03-28 Thread Will Jones (Jira)
Will Jones created ARROW-16054:
--

 Summary: [Python] Use tzdata timezone database on Windows
 Key: ARROW-16054
 URL: https://issues.apache.org/jira/browse/ARROW-16054
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Affects Versions: 7.0.0
Reporter: Will Jones


In ARROW-13168, we enabled setting the path of the text-based timezone database 
at runtime. This allowed R to use the tzdb package for the timezone database, 
since it uses the text format.

However, it doesn't seem like the tzdata Python package ships that text format. 
They do have [a "compact" text 
format|https://github.com/python/tzdata/blob/master/src/tzdata/zoneinfo/tzdata.zi],
 which _might_ be compatible with our vendored date library. Otherwise, we'd 
likely have to wait for binary format support in 
https://github.com/HowardHinnant/date/issues/564



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16024) [C++] Let users set Windows timezone db path with environment variable

2022-03-24 Thread Will Jones (Jira)
Will Jones created ARROW-16024:
--

 Summary: [C++] Let users set Windows timezone db path with 
environment variable
 Key: ARROW-16024
 URL: https://issues.apache.org/jira/browse/ARROW-16024
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 7.0.0
Reporter: Will Jones


In ARROW-13168, we enabled a runtime option to set the location of the timezone 
database on Windows. For developers, the unit tests read the 
ARROW_TIMEZONE_DATABASE environment variable. It might be helpful to let users 
set that variable too, but the question is where to put the initialization. 
Also, should it take precedence over the initialize method? If it did, it could 
override the R initialization that points to the tzdb package.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16006) [C++] Helpers for converting between rows and Arrow objects

2022-03-22 Thread Will Jones (Jira)
Will Jones created ARROW-16006:
--

 Summary: [C++] Helpers for converting between rows and Arrow 
objects
 Key: ARROW-16006
 URL: https://issues.apache.org/jira/browse/ARROW-16006
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 7.0.0
Reporter: Will Jones
Assignee: Will Jones


Short version: Given a way to convert a vector of rows and a schema to a 
RecordBatch, we can derive methods for efficiently converting a vector of rows 
to a Table, or even an iterator of rows to a RecordBatchReader. Similarly, we 
could go the other way: given a way to convert a RecordBatch to a vector of 
rows, we can derive methods for converting from Tables or RecordBatchReaders.

Long version: 
https://docs.google.com/document/d/174tldmQLMCvOtjxGtFPeoLBefyE1x26_xntwfSzDXFA/edit?usp=sharing
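
For intuition, here is a quick Python sketch of the row-level entry points 
these helpers would generalize (assuming pyarrow >= 7, where 
{{from_pylist}}/{{to_pylist}} exist):

{code:python}
import pyarrow as pa

rows = [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}]

# rows -> RecordBatch; Table and RecordBatchReader conversions can build on this
batch = pa.RecordBatch.from_pylist(rows)

# RecordBatch -> rows, the other direction
assert batch.to_pylist() == rows
{code}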



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15989) [R] Implement rbind for Table and RecordBatch

2022-03-21 Thread Will Jones (Jira)
Will Jones created ARROW-15989:
--

 Summary: [R] Implement rbind for Table and RecordBatch
 Key: ARROW-15989
 URL: https://issues.apache.org/jira/browse/ARROW-15989
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 8.0.0


In ARROW-15013 we implemented c() for Arrow arrays. We should now be able to 
implement rbind for Tables and RecordBatches (rbind on batches would produce a 
table).
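
For comparison, pyarrow already exposes this for Tables; a minimal sketch of 
the behavior rbind should mirror:

{code:python}
import pyarrow as pa

t1 = pa.table({"x": [1, 2], "y": ["a", "b"]})
t2 = pa.table({"x": [3], "y": ["c"]})

combined = pa.concat_tables([t1, t2])
assert combined.num_rows == 3
{code}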



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15975) [C++] Document VisitArrayInline and type traits

2022-03-18 Thread Will Jones (Jira)
Will Jones created ARROW-15975:
--

 Summary: [C++] Document VisitArrayInline and type traits
 Key: ARROW-15975
 URL: https://issues.apache.org/jira/browse/ARROW-15975
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 7.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 8.0.0


In ARROW-15952, we documented the {{ArrayVisitor}} and {{TypeVisitor}} classes. 
But as I discovered in [a cookbook 
PR|https://github.com/apache/arrow-cookbook/pull/166], you can't subclass these 
abstract visitors _and_ use type traits. Now I know why most visitor 
implementations within of Arrow don't subclasses these.

We should instead suggest users simply use the {{VisitArrayInline}} and 
{{VisitTypeInline}} with their visitors, and ignore the {{ArrayVisitor}} and 
{{TypeVisitor}} classes and associated {{Accept()}} methods. In fact, can we 
deprecate (or even remove) those? Do they add anything valuable?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15952) [C++] Document Array::accept() and ArrayVisitor

2022-03-16 Thread Will Jones (Jira)
Will Jones created ARROW-15952:
--

 Summary: [C++] Document Array::accept() and ArrayVisitor
 Key: ARROW-15952
 URL: https://issues.apache.org/jira/browse/ARROW-15952
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 8.0.0


We mention in the docs that:
{quote}The classes arrow::Array and its subclasses provide strongly-typed 
accessors with support for the visitor pattern and other affordances.{quote}

But the {{ArrayVisitor}} class and the {{Array::Accept()}} 
[method|https://github.com/apache/arrow/blob/b956ba51ea11d050745e09548e33aa61fdcbfddc/cpp/src/arrow/array/array_base.h#L136]
 are missing from the API docs. We should add those, and potentially also 
provide an example of using the visitor.

Likely worth doing the same for TypeVisitor and ScalarVisitor. It would also be 
nice to document the performance implication of using ScalarVisitor vs 
ArrayVisitor. Also we use an "inline" version of the visitors; is that 
something we do/should expose in the API as well?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15922) [C++] Re-enable strftime locale test on Windows

2022-03-11 Thread Will Jones (Jira)
Will Jones created ARROW-15922:
--

 Summary: [C++] Re-enable strftime locale test on Windows
 Key: ARROW-15922
 URL: https://issues.apache.org/jira/browse/ARROW-15922
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Will Jones
Assignee: Will Jones


ARROW-13168 enabled timezone support on Windows, but found there was an issue 
with the vendored datetime library that caused invalid UTF-8 characters to be 
emitted from strftime in certain locales. We should re-enable that test once we 
are able to get a fix in the date library (or some other solution is found).

https://github.com/HowardHinnant/date/issues/726



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15906) [C++] S3Filesystem shouldn't create new buckets by default

2022-03-10 Thread Will Jones (Jira)
Will Jones created ARROW-15906:
--

 Summary: [C++] S3Filesystem shouldn't create new buckets by default
 Key: ARROW-15906
 URL: https://issues.apache.org/jira/browse/ARROW-15906
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 8.0.0


S3 buckets typically have a lot of governance around them (permissions, 
cost-tracking tags), so they should not be created unless a user explicitly 
asks.

We should add an option to {{S3Options}} to control whether to create buckets, 
and default to False. 
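
A sketch of what the Python-facing option might look like (the parameter name 
here is illustrative, not an existing API):

{code:python}
from pyarrow import fs

# Hypothetical parameter: refuse to create missing buckets unless asked to
s3 = fs.S3FileSystem(region="us-east-1", allow_bucket_creation=False)
s3.create_dir("brand-new-bucket")  # should raise instead of creating the bucket
{code}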



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15860) [Python][Docs] Document RecordBatchReader

2022-03-07 Thread Will Jones (Jira)
Will Jones created ARROW-15860:
--

 Summary: [Python][Docs] Document RecordBatchReader
 Key: ARROW-15860
 URL: https://issues.apache.org/jira/browse/ARROW-15860
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Python
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 8.0.0


RecordBatchReader seems like a pretty important type, but it is missing from 
the Python API docs.
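
A minimal example of the kind the docs could include (assuming a constructor 
like {{RecordBatchReader.from_batches}}):

{code:python}
import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])
reader = pa.RecordBatchReader.from_batches(batch.schema, [batch])

for b in reader:  # readers are single-pass streams of batches
    print(b.num_rows)
{code}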



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15803) [R] Empty JSON object parsed as corrupt data frame

2022-02-28 Thread Will Jones (Jira)
Will Jones created ARROW-15803:
--

 Summary: [R] Empty JSON object parsed as corrupt data frame
 Key: ARROW-15803
 URL: https://issues.apache.org/jira/browse/ARROW-15803
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 8.0.0


If you have a JSON object field that is always empty, it does not seem to be 
handled well, whether or not a schema is provided that tells Arrow what should 
be in that object.

{code:r}
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#> timestamp

json_val <- '{
  "rows": [
{"empty": {} },
{"empty": {} },
{"empty": {} }
  ]
}'
# Remove newlines
json_val <- gsub("\n", "", json_val)

json_file <- tempfile()
writeLines(json_val, json_file)

schema <- schema(field("rows", list_of(struct(empty = struct(y = int32())))))
raw <- read_json_arrow(json_file, schema=schema)
raw$rows$empty
#> Error: Corrupt x: no names
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15758) [C++] Explore upgrading to mimalloc V2

2022-02-22 Thread Will Jones (Jira)
Will Jones created ARROW-15758:
--

 Summary: [C++] Explore upgrading to mimalloc V2
 Key: ARROW-15758
 URL: https://issues.apache.org/jira/browse/ARROW-15758
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 7.0.0
Reporter: Will Jones


ARROW-15730 found that mimalloc wasn't releasing memory as expected. These 
memory allocators tend to hold onto memory longer than users expect, which can 
be confusing. But also there appears to be [a bug where it also doesn't reuse 
memory|https://github.com/microsoft/mimalloc/issues/383#issuecomment-846132613].
 Both of these are addressed in v2.0.X (beta) of the library: the allocator is 
more aggressive about returning memory, and the bug does not appear to exist there.

[According to one of the 
maintainers|https://github.com/microsoft/mimalloc/issues/466#issuecomment-947819685],
 the main reason 2.0.X hasn't been declared stable is that some use cases have 
reported performance regressions. We could create a branch of Arrow using 
mimalloc v2 and run conbench benchmarks to see comparisons. If it's faster, we 
may consider moving forward; if not, we could provide feedback to the mimalloc 
maintainers which may help development along.
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15725) [Python] Legacy dataset can't roundtrip Int64 with nulls if partitioned

2022-02-17 Thread Will Jones (Jira)
Will Jones created ARROW-15725:
--

 Summary: [Python] Legacy dataset can't roundtrip Int64 with nulls 
if partitioned
 Key: ARROW-15725
 URL: https://issues.apache.org/jira/browse/ARROW-15725
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 7.0.0, 4.0.0
Reporter: Will Jones


If there is partitioning and the column has nulls, Int64 columns may not round 
trip successfully using the legacy datasets implementation. 

Simple reproduction:

 {code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import tempfile

table = pa.table({
'x': pa.array([None, 7753285016841556620]),
'y': pa.array(['a', 'b'])
})

ds_dir = tempfile.mkdtemp()
pq.write_to_dataset(table, ds_dir, partition_cols=['y'])

table_after = ds.dataset(ds_dir).to_table()
print(table['x'])
print(table_after['x'])
assert table['x'] == table_after['x']
{code}

{code}
[
  [
null,
7753285016841556620
  ]
]
[
  [
null
  ],
  [
7753285016841556992
  ]
]
{code}
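
The corrupted value is consistent with a round trip through float64 (a guess: 
the nulls force the legacy partition path through a pandas float cast); a quick 
check:

{code:python}
import numpy as np

# float64 spacing near this magnitude is 1024, so the integer is unrepresentable
print(int(np.float64(7753285016841556620)))  # 7753285016841556992
{code}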



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15718) [R] Joining two datasets crashes if use_threads=FALSE

2022-02-17 Thread Will Jones (Jira)
Will Jones created ARROW-15718:
--

 Summary: [R] Joining two datasets crashes if use_threads=FALSE
 Key: ARROW-15718
 URL: https://issues.apache.org/jira/browse/ARROW-15718
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, R
Affects Versions: 7.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 8.0.0


In ARROW-14908 we solved the case of joining a dataset to an in-memory table, 
but did not solve joining two datasets.

The previous solution was to add +1 to the thread count, because the hash join 
logic might be called by the scanner's IO thread. For joining more than 1 
dataset, we might have more than 1 IO thread, so we either need to add a larger 
arbitrary number or find a way to make the state logic more resilient to 
unexpected threads.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15667) [R] Windows build can fail if building only shared libraries

2022-02-11 Thread Will Jones (Jira)
Will Jones created ARROW-15667:
--

 Summary: [R] Windows build can fail if building only shared 
libraries
 Key: ARROW-15667
 URL: https://issues.apache.org/jira/browse/ARROW-15667
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 7.0.0
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 8.0.0


This should only affect dev environments. I noticed that when I build only 
shared libraries, the build fails because it expects 
arrow_bundled_dependencies, which I think we only produce as part of static 
builds.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15627) [R] Support unify_schemas for union datasets

2022-02-09 Thread Will Jones (Jira)
Will Jones created ARROW-15627:
--

 Summary: [R] Support unify_schemas for union datasets
 Key: ARROW-15627
 URL: https://issues.apache.org/jira/browse/ARROW-15627
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 8.0.0


Also out of discussion on [https://github.com/apache/arrow/issues/12371]

You can unify schemas between different parquet files, but it seems like you 
can't union together two (or more) datasets that have different schemas. This 
is odd, because we do compute the unified schema on [this 
line|https://github.com/apache/arrow/blob/ba0814e60a451525dd5492b68059aad8a4bdaf4f/r/R/dataset.R#L189],
 only to later assert all the schemas are the same.

{code:R}
library(arrow)
library(dplyr)

df1 <- arrow_table(x = array(c(1, 2, 3)),
   y = array(c("a", "b", "c")))
df2 <- arrow_table(x = array(c(4, 5)),
   z = array(c("d", "e")))

df1 %>% write_dataset("example1", format="parquet")
df2 %>% write_dataset("example2", format="parquet")

ds1 <- open_dataset("example1", format="parquet")
ds2 <- open_dataset("example2", format="parquet")

# These don't work
ds <- c(ds1, ds2) # c() actually does the same thing
ds <- open_dataset(list(ds1, ds2)) # This fails due to mismatch in schema
ds <- open_dataset(c("example1", "example2"), format="parquet", unify_schemas = 
TRUE)

# This does
ds <- open_dataset(c("example2/part-0.parquet", "example1/part-0.parquet"), 
format="parquet", unify_schemas = TRUE)
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15622) [R] Implement union_all for arrow_dplyr_query

2022-02-08 Thread Will Jones (Jira)
Will Jones created ARROW-15622:
--

 Summary: [R] Implement union_all for arrow_dplyr_query
 Key: ARROW-15622
 URL: https://issues.apache.org/jira/browse/ARROW-15622
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 8.0.0


GitHub issue inspiration: [https://github.com/apache/arrow/issues/12371]

Basically union_all would chain the RecordBatchReaders.
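
In Python terms, the chaining amounts to the sketch below (assuming 
{{RecordBatchReader.from_batches}} and matching schemas):

{code:python}
import itertools
import pyarrow as pa

b1 = pa.record_batch([pa.array([1, 2])], names=["x"])
b2 = pa.record_batch([pa.array([3])], names=["x"])
r1 = pa.RecordBatchReader.from_batches(b1.schema, [b1])
r2 = pa.RecordBatchReader.from_batches(b2.schema, [b2])

# union_all = stream one reader's batches, then the other's
unioned = pa.RecordBatchReader.from_batches(b1.schema, itertools.chain(r1, r2))
print(unioned.read_all().num_rows)  # 3
{code}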



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15603) [C++] Clang 13 build fails on unused var

2022-02-07 Thread Will Jones (Jira)
Will Jones created ARROW-15603:
--

 Summary: [C++] Clang 13 build fails on unused var
 Key: ARROW-15603
 URL: https://issues.apache.org/jira/browse/ARROW-15603
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 7.0.0
Reporter: Will Jones
 Fix For: 8.0.0


Just a small issue. When I build with clang 13 I get the following error from a 
unused var warning:
{code:java}
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:791:13:
 error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable]
    int64_t n = 0;
            ^
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:799:13:
 error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable]
    int64_t n = 0;
            ^ {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15512) [C++] OT logging for memory pool allocations

2022-01-31 Thread Will Jones (Jira)
Will Jones created ARROW-15512:
--

 Summary: [C++] OT logging for memory pool allocations
 Key: ARROW-15512
 URL: https://issues.apache.org/jira/browse/ARROW-15512
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 6.0.1
Reporter: Will Jones
 Fix For: 8.0.0


ARROW-3016 suggests there is a real need for tracking memory allocations with 
context such as traceback and sizes. That ticket covers using Linux tools like 
perf and uprobe to do so.

Using OpenTelemetry might provide a cross-platform way to do the same, one 
that's in line with other tracing efforts.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15415) [C++] Cannot build debug with MSVC and vcpkg

2022-01-22 Thread Will Jones (Jira)
Will Jones created ARROW-15415:
--

 Summary: [C++] Cannot build debug with MSVC and vcpkg
 Key: ARROW-15415
 URL: https://issues.apache.org/jira/browse/ARROW-15415
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Documentation
Affects Versions: 6.0.1
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 8.0.0


While trying to create a debug build of Arrow on Windows using vcpkg and MSVC, 
I encountered a few issues with the current build configuration:

 # Python debug and release libraries are passed, but our CMake scripts only 
expect one or the other, just as reported in ARROW-13470.
 # Since vcpkg upgraded gtest to 1.11.0, there is again a mismatch between the 
bundled gtest and the vcpkg versions. So we get the same error as was found in 
ARROW-14393
 # Thrift could not find debug static libraries, because it was missing the "d" 
suffix. It should be {{libthriftmdd.lib}}, but was finding {{libthriftmd.lib}}.

Additionally, the recommended {{clcache}} program from our Windows developer 
docs is no longer maintained. I found its dependency {{pyuv}} doesn't install 
on Windows anymore, and is also no longer maintained.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15408) [C++] Environment variable to turn on memory allocation logging

2022-01-21 Thread Will Jones (Jira)
Will Jones created ARROW-15408:
--

 Summary: [C++] Environment variable to turn on memory allocation 
logging
 Key: ARROW-15408
 URL: https://issues.apache.org/jira/browse/ARROW-15408
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 6.0.1
Reporter: Will Jones
 Fix For: 8.0.0


In Python, there is a [{{log_memory_allocations}} 
function|https://github.com/wesm/arrow/blob/33111644be84f84ce4601889fee06c6d17f05279/python/pyarrow/memory.pxi#L63]
 that switches to the LoggingMemoryPool. It would be nice to be able to do this 
in C++ and one very convenient way would be through an environment variable, 
since we already support {{ARROW_DEFAULT_MEMORY_POOL}}. Should probably be 
named something like {{ARROW_LOG_MEMORY_ALLOCATIONS}}.
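
For reference, the existing Python toggle that this would be an 
environment-variable analogue of:

{code:python}
import pyarrow as pa

pa.log_memory_allocations(True)   # switch to a LoggingMemoryPool
buf = pa.allocate_buffer(64)      # this allocation is now logged
pa.log_memory_allocations(False)
{code}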



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15363) [C++] Add max length option to PrettyPrintOptions

2022-01-18 Thread Will Jones (Jira)
Will Jones created ARROW-15363:
--

 Summary: [C++] Add max length option to PrettyPrintOptions
 Key: ARROW-15363
 URL: https://issues.apache.org/jira/browse/ARROW-15363
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 6.0.1
Reporter: Will Jones
 Fix For: 8.0.0


Some pretty prints, especially for chunked or nested arrays, can be very long 
even with reasonable window settings. We should have a way to set a target 
maximum length for the output.

A half-measure was taken with ARROW-15329, which truncates the output of the 
pretty printing, but that doesn't handle string columns very well if those 
string values contain delimiters.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15329) [Python] Add character limit to ChunkedArray repr

2022-01-13 Thread Will Jones (Jira)
Will Jones created ARROW-15329:
--

 Summary: [Python] Add character limit to ChunkedArray repr
 Key: ARROW-15329
 URL: https://issues.apache.org/jira/browse/ARROW-15329
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Will Jones
Assignee: Will Jones
 Fix For: 7.0.0


Short term workaround for ARROW-14798

https://github.com/apache/arrow/pull/12091#issuecomment-1012316758



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15325) [R] Fix CRAN comment on map_batches collect

2022-01-13 Thread Will Jones (Jira)
Will Jones created ARROW-15325:
--

 Summary: [R] Fix CRAN comment on map_batches collect
 Key: ARROW-15325
 URL: https://issues.apache.org/jira/browse/ARROW-15325
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Will Jones
 Fix For: 7.0.0


Got the following comment in [build 
{{homebrew-r-autobrew}}|https://github.com/ursacomputing/crossbow/runs/4799447427?check_suite_focus=true]:

{code}
map_batches: no visible binding for global variable 'collect'
Undefined global functions or variables:
  collect
{code}

Looks like I should have used {{.data}} inside of map_batches, based on the 
"eliminating R CMD check NOTEs" section of 
https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html.




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15317) [R] Expose API to create Dataset from Fragments

2022-01-12 Thread Will Jones (Jira)
Will Jones created ARROW-15317:
--

 Summary: [R] Expose API to create Dataset from Fragments
 Key: ARROW-15317
 URL: https://issues.apache.org/jira/browse/ARROW-15317
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 6.0.1
Reporter: Will Jones


Third-party packages may define dataset factories for table formats like Delta 
Lake and Apache Iceberg. These formats store metadata like schema, file lists, 
and file-level statistics on the side, and can construct a dataset without 
needing a discovery process. Python exposes enough API to do this successfully for 
[a Delta Lake dataset reader 
here|https://github.com/delta-io/delta-rs/blob/6a8195d6e3cbdcb0c58a14a3ffccc472dd094de0/python/deltalake/table.py#L267-L280].

I propose adding the following to the R API:

 * Expose {{Fragment}} as an R6 object
 * Add the {{MakeFragment}} method to various file format objects. It's key 
that {{partition_expression}} is included as an argument. ([See Python 
equivalent 
here|https://github.com/apache/arrow/blob/ab86daf3f7c8a67bee6a175a749575fd40417d27/python/pyarrow/_dataset_parquet.pyx#L209-L210])
 * Add a dataset constructor that takes a list of {{Fragments}}
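
Roughly, the Python surface the R API could mirror looks like this sketch (the 
path, schema, and expression below are illustrative):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import fs

filesystem = fs.LocalFileSystem()
fmt = ds.ParquetFileFormat()
schema = pa.schema([("x", pa.int64()), ("y", pa.string())])

# partition_expression carries the side-loaded partition metadata
fragment = fmt.make_fragment(
    "data/part-0.parquet",
    filesystem=filesystem,
    partition_expression=ds.field("y") == "a",
)
dataset = ds.FileSystemDataset([fragment], schema, fmt, filesystem)
{code}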



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15276) [Docs][R] Add map_batches example from vignette to Cookbook

2022-01-06 Thread Will Jones (Jira)
Will Jones created ARROW-15276:
--

 Summary: [Docs][R] Add map_batches example from vignette to 
Cookbook
 Key: ARROW-15276
 URL: https://issues.apache.org/jira/browse/ARROW-15276
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Will Jones


In ARROW-14029 we are adding an example of using `map_batches()` to sample data 
and compute aggregate statistics without having to load the whole dataset into 
memory. We should add these to the cookbook as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15271) [R] Refactor do_exec_plan to return a RecordBatchReader

2022-01-06 Thread Will Jones (Jira)
Will Jones created ARROW-15271:
--

 Summary: [R] Refactor do_exec_plan to return a RecordBatchReader
 Key: ARROW-15271
 URL: https://issues.apache.org/jira/browse/ARROW-15271
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 6.0.1
Reporter: Will Jones


Right now 
[{{do_exec_plan}}|https://github.com/apache/arrow/blob/master/r/R/query-engine.R#L18]
 returns an Arrow table because {{head}}, {{tail}}, and {{arrange}} do. If 
ARROW-14289 is completed and similar work is done for {{arrange}}, we may be 
able to alter {{do_exec_plan}} to return a RBR instead.

The {{map_batches()}} implementation (ARROW-14029) could benefit from this 
refactor. And it might make ARROW-15040 more useful.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15264) [CI][C#] Build examples in CI

2022-01-05 Thread Will Jones (Jira)
Will Jones created ARROW-15264:
--

 Summary: [CI][C#] Build examples in CI
 Key: ARROW-15264
 URL: https://issues.apache.org/jira/browse/ARROW-15264
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#, Continuous Integration
Reporter: Will Jones


We should validate in CI that the C# examples always build with the latest 
version of Arrow.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15247) [Python] Convert array of Pandas dataframe to struct column

2022-01-04 Thread Will Jones (Jira)
Will Jones created ARROW-15247:
--

 Summary: [Python] Convert array of Pandas dataframe to struct 
column
 Key: ARROW-15247
 URL: https://issues.apache.org/jira/browse/ARROW-15247
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 6.0.1
Reporter: Will Jones


Currently, converting a Pandas dataframe with a column of dataframes to Arrow 
fails with "Could not convert  with type DataFrame: did not recognize 
Python value type when inferring an Arrow data type". We should be able to 
convert this to a list-of-struct array, similar to how [the R bindings do 
it|https://arrow.apache.org/docs/r/articles/arrow.html#r-to-arrow]. This could 
even be bi-directional, where structs could be parsed back into a column of 
dataframes in {{to_pandas()}}

Here is an example that currently fails:

{code:python}
import pandas as pd
import pyarrow as pa

df1 = pd.DataFrame({
'x': [1, 2, 3],
'y': ['a', 'b', 'c']
})

df = pd.DataFrame({
'df': [df1]*10
})

pa.Table.from_pandas(df)
{code}

Here's what the other direction might look like for the same data:

{code:python}
sub_tab = [{'x': 1, 'y': 'a'},
   {'x': 2, 'y': 'b'},
   {'x': 3, 'y': 'c'}]

tab = pa.table({
'df': pa.array([sub_tab]*10)
})

print(tab.schema)
# df: list<item: struct<x: int64, y: string>>
#   child 0, item: struct<x: int64, y: string>
#     child 0, x: int64
#     child 1, y: string

tab.to_pandas()
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15246) [Python] Automatic conversion of low-cardinality string array to Dictionary Array

2022-01-04 Thread Will Jones (Jira)
Will Jones created ARROW-15246:
--

 Summary: [Python] Automatic conversion of low-cardinality string 
array to Dictionary Array
 Key: ARROW-15246
 URL: https://issues.apache.org/jira/browse/ARROW-15246
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 6.0.1
Reporter: Will Jones


Users who convert Pandas string arrays to Arrow arrays may be surprised to see 
the Arrow ones use far more memory when the cardinality is low. The solution is 
for them to first convert to a Pandas Categorical, but it might save some 
headaches if we can automatically (or possibly with an option) detect when it's 
appropriate to use a Dictionary type over a String type.

Here's an example of what I'm talking about:

{code:python}
import pyarrow as pa
import pandas as pd

x_str = "x" * 30
df = pd.DataFrame({"col": [x_str] * 1_000_000})

%memit tab1 = pa.Table.from_pandas(df)
# peak memory: 269.44 MiB, increment: 121.62 MiB

df['col'] = df['col'].astype('category')
%memit tab2 = pa.Table.from_pandas(df)
# peak memory: 286.14 MiB, increment: 1.20 MiB
{code}

One bad consequence of inferring this automatically: if a sequence of Pandas 
DataFrames is being converted, they may end up with differing schemas. For that 
reason, this behavior should likely be optional.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15217) [C#] Add ToString() methods to Arrow classes

2021-12-29 Thread Will Jones (Jira)
Will Jones created ARROW-15217:
--

 Summary: [C#] Add ToString() methods to Arrow classes
 Key: ARROW-15217
 URL: https://issues.apache.org/jira/browse/ARROW-15217
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#
Reporter: Will Jones


We should add {{ToString}} methods to {{RecordBatch}}, {{Schema}}, {{Field}}, 
{{DataType}}, {{Table}}, and {{ChunkedArray}}.

The default implementation in C# is just to return the class name, which isn't 
very useful.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15135) [C++][R][Python] Support reading from Apache Iceberg tables

2021-12-16 Thread Will Jones (Jira)
Will Jones created ARROW-15135:
--

 Summary: [C++][R][Python] Support reading from Apache Iceberg 
tables
 Key: ARROW-15135
 URL: https://issues.apache.org/jira/browse/ARROW-15135
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Will Jones


This is an umbrella issue for supporting the [Apache Iceberg table 
format|https://iceberg.apache.org/].



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15109) [Python] Add more info to show_versions()

2021-12-14 Thread Will Jones (Jira)
Will Jones created ARROW-15109:
--

 Summary: [Python] Add more info to show_versions()
 Key: ARROW-15109
 URL: https://issues.apache.org/jira/browse/ARROW-15109
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 6.0.1
Reporter: Will Jones


In the R arrow package, we have a function {{arrow_info()}} which provides 
information on versions and optional components. Python has 
{{show_versions()}}, but it's not as detailed. We can add the following to the 
Python function:

 * List of optional components and whether they are enabled
 * Which allocator is used
 * SIMD level 


Example R output:

{code}
Arrow package version: 6.0.1.9000

Capabilities:

dataset    TRUE
parquet    TRUE
json       TRUE
s3         TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo        FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:

Allocator  mimalloc
Current    0 bytes
Max        0 bytes

Runtime:

SIMD Level           none
Detected SIMD Level  none

Build:

C++ Library Version   7.0.0-SNAPSHOT
C++ Compiler          AppleClang
C++ Compiler Version  13.0.0.1329
Git ID                cf8d81d9fcbc43ce57b8a0d36c05f8b4273a5fa3
{code}

Example Python output (current behavior):

{code}
pyarrow version info

Package kind: not indicated
Arrow C++ library version: 7.0.0-SNAPSHOT
Arrow C++ compiler: AppleClang 13.0.0.1329
Arrow C++ compiler flags:  -Qunused-arguments -fcolor-diagnostics -ggdb -O0
Arrow C++ git revision: d033ce769571a0f12e37ab165bc29d2b202b3a61
Arrow C++ git description: apache-arrow-7.0.0.dev-313-gd033ce769
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15102) [R] Allow creation of struct type with fields

2021-12-14 Thread Will Jones (Jira)
Will Jones created ARROW-15102:
--

 Summary: [R] Allow creation of struct type with fields
 Key: ARROW-15102
 URL: https://issues.apache.org/jira/browse/ARROW-15102
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 6.0.1
Reporter: Will Jones
 Fix For: 8.0.0


StructTypes can be created with types:

{code:R}
struct(x = int32(), y = utf8())
{code}

But they cannot be created with fields yet. This means you cannot construct a 
StructType with a non-nullable field (since fields are nullable by default). We 
should support constructing a StructType with fields, like we do for a Schema:

{code:R}
# Schema from fields
schema(field("x", int32()), field(y, utf8(), nullable=FALSE))
# Expected StructType from fields
struct(field("x", int32()), field(y, utf8(), nullable=FALSE))
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15089) [C++] Add compute kernel to get MapArray value for given key

2021-12-13 Thread Will Jones (Jira)
Will Jones created ARROW-15089:
--

 Summary: [C++] Add compute kernel to get MapArray value for given 
key
 Key: ARROW-15089
 URL: https://issues.apache.org/jira/browse/ARROW-15089
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 6.0.1
Reporter: Will Jones


Given a "map", an obvious operation is to get an item corresponding to a key. 
The idea here is to create a kernel that does this for each map in the array.

IIRC MapArray isn't guaranteed to have unique keys. So one version would return 
an array of ItemType by returning the first or last item for a given key. Yet 
another version could return a ListType containing all matching items. 
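
A plain-Python illustration of the two variants' semantics (not the kernel 
itself):

{code:python}
maps = [[("a", 1), ("b", 2)], [("a", 3), ("a", 4)]]
key = "a"

# Variant 1: first matching item per map (an array of ItemType, null if absent)
first = [next((v for k, v in m if k == key), None) for m in maps]   # [1, 3]

# Variant 2: all matching items per map (a ListType of ItemType)
every = [[v for k, v in m if k == key] for m in maps]               # [[1], [3, 4]]
{code}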



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15087) [Docs][Python] Document MapArray in Python

2021-12-13 Thread Will Jones (Jira)
Will Jones created ARROW-15087:
--

 Summary: [Docs][Python] Document MapArray in Python
 Key: ARROW-15087
 URL: https://issues.apache.org/jira/browse/ARROW-15087
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Documentation, Python
Affects Versions: 6.0.1, 6.0.0
Reporter: Will Jones


ARROW-6904 exposed MapArray in Python back in late 2019, but it has not been 
documented yet. We should add it to the API reference and to the [Python arrays 
user guide|https://arrow.apache.org/docs/python/data.html#arrays].
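
The user guide entry could start from a small construction example along these 
lines:

{code:python}
import pyarrow as pa

ty = pa.map_(pa.string(), pa.int32())
arr = pa.array([[("a", 1), ("b", 2)], []], type=ty)
print(arr.type)  # map<string, int32>
{code}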



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15075) [C++][Dataset] Implement Dataset for JSON format

2021-12-12 Thread Will Jones (Jira)
Will Jones created ARROW-15075:
--

 Summary: [C++][Dataset] Implement Dataset for JSON format
 Key: ARROW-15075
 URL: https://issues.apache.org/jira/browse/ARROW-15075
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Will Jones


We already have support for reading individual files, but not yet for reading 
datasets. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-14999) [C++] List types with different field names are not equal

2021-12-06 Thread Will Jones (Jira)
Will Jones created ARROW-14999:
--

 Summary: [C++] List types with different field names are not equal
 Key: ARROW-14999
 URL: https://issues.apache.org/jira/browse/ARROW-14999
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 6.0.0
Reporter: Will Jones


When comparing map types, the names of the fields are ignored. This was 
introduced in ARROW-7173.

However for list types, they are not ignored. For example,

{code:python}
In [6]: l1 = pa.list_(pa.field("val", pa.int64()))

In [7]: l2 = pa.list_(pa.int64())

In [8]: l1
Out[8]: ListType(list<val: int64>)

In [9]: l2
Out[9]: ListType(list<item: int64>)

In [10]: l1 == l2
Out[10]: False
{code}

Should we make list type comparison ignore field names too?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-14730) [C++][R][Python] Support reading from Delta Lake tables

2021-11-16 Thread Will Jones (Jira)
Will Jones created ARROW-14730:
--

 Summary: [C++][R][Python] Support reading from Delta Lake tables
 Key: ARROW-14730
 URL: https://issues.apache.org/jira/browse/ARROW-14730
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Will Jones


[Delta Lake|https://delta.io/] is a parquet table format that supports ACID 
transactions. It's popularized by Databricks, which uses it as the default 
table format in their platform. Previously, it has only been readable from 
Spark, but now there is an effort in [delta-rs|https://github.com/delta-io/delta-rs] 
to make it accessible from elsewhere. There is already some integration with 
DataFusion (see: https://github.com/apache/arrow-datafusion/issues/525).

There does already exist [a method to read Delta Lake tables into Arrow tables 
in 
Python|https://delta-io.github.io/delta-rs/python/api_reference.html#deltalake.table.DeltaTable.to_pyarrow_table]
 in the delta-rs Python bindings. This includes filtering by partitions.

Is there a good way we could integrate this functionality with Arrow C++ 
Dataset and expose that in Python and R? Would that be something that should be 
implemented in Arrow libraries or in delta-rs?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-14597) Github actions install r-arrow with snappy compression

2021-11-04 Thread Dyfan Jones (Jira)
Dyfan Jones created ARROW-14597:
---

 Summary: Github actions install r-arrow with snappy compression
 Key: ARROW-14597
 URL: https://issues.apache.org/jira/browse/ARROW-14597
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Dyfan Jones


Hi All,

I am having difficulty installing r-arrow with snappy compression on GitHub 
Actions. I have set the environment variable `ARROW_WITH_SNAPPY: ON` 
([https://github.com/DyfanJones/noctua/blob/0079bf997737516fd3e1b61dbde7510044f79a2f/.github/workflows/R-CMD-check.yaml]
 ). However, I get the following error in my unit tests:


{code:java}
Error: Error: NotImplemented: Support for codec 'snappy' not built
In order to read this file, you will need to reinstall arrow with 
additional features enabled.
Set one of these environment variables before installing:
 * LIBARROW_MINIMAL=false (for all optional features, including 'snappy')  
 * ARROW_WITH_SNAPPY=ON (for just 'snappy')

See https://arrow.apache.org/docs/r/articles/install.html for detail{code}

arrow version: 6.0.0.2

My PR [https://github.com/DyfanJones/noctua/pull/169] shows the GitHub Actions 
issue.

Any advice is much appreciated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13918) [Gandiva][Python] Add decimal support for make_literal and make_in_expression

2021-09-06 Thread Will Jones (Jira)
Will Jones created ARROW-13918:
--

 Summary: [Gandiva][Python] Add decimal support for make_literal 
and make_in_expression
 Key: ARROW-13918
 URL: https://issues.apache.org/jira/browse/ARROW-13918
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva, Python
Reporter: Will Jones


These are already implemented in C++; they just need to be exposed in Cython.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13917) [Gandiva] Add helper to determine valid decimal function return type

2021-09-06 Thread Will Jones (Jira)
Will Jones created ARROW-13917:
--

 Summary: [Gandiva] Add helper to determine valid decimal function 
return type
 Key: ARROW-13917
 URL: https://issues.apache.org/jira/browse/ARROW-13917
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: Will Jones


To evaluate a Gandiva function, you need to pass its return type. For most 
types, we can look up the possible return types by using the 
`GetRegisteredFunctionSignatures` method, but those don't include details of 
the precision and scale parameters of the decimal type.

Specifying the precision and scale parameters of the decimal type is left up to 
the user, but if the user gets it wrong, they can get invalid answers. See the 
reproducible example at the bottom.

The precision and scale of the return type depend on the input types and the 
implementation of the decimal operations. Given the variation of logic across 
different functions (add, divide, trunc, round), it would be best if we were 
able to provide some utility to help the user determine the precise return type.

Now, return types aren't unique for a given function name and parameter types. 
For example, `add(date64[ms], int64)` can return either `date64[ms]` or 
`timestamp[ms]`. So a generic utility has to return multiple possible return 
types.


Example of invalid decimal results from bad return type:

{code:python}
from decimal import Decimal
import pyarrow as pa
from pyarrow.gandiva import TreeExprBuilder, make_projector

def call_on_value(func, values, params, out_type):
    builder = TreeExprBuilder()

    # Literal nodes for the extra (non-column) parameters
    param_literals = []
    for param, param_type in params:
        param_literals.append(builder.make_literal(param, param_type))

    # One field and one single-element array per input value
    inputs = []
    arrays = []
    for i, value in enumerate(values):
        inputs.append(builder.make_field(pa.field(str(i), value[1])))
        arrays.append(pa.array([value[0]], value[1]))

    record_batch = pa.record_batch(arrays, [str(i) for i in range(len(values))])

    func_x = builder.make_function(func, inputs + param_literals, out_type)

    expressions = [builder.make_expression(func_x, pa.field('result', out_type))]

    projector = make_projector(record_batch.schema, expressions, pa.default_memory_pool())

    return projector.evaluate(record_batch)

call_on_value(
    'round',
    [(Decimal("123.459"), pa.decimal128(28, 3))],
    [(2, pa.int32())],
    pa.decimal128(28, 3)
)
# Returns: 123.459 (not rounded!)

call_on_value(
    'round',
    [(Decimal("123.459"), pa.decimal128(28, 3))],
    [(-2, pa.int32())],
    pa.decimal128(28, 3)
)
# Returns:  0.100 ()
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12129) [Python][Gandiva] Infer return types for make_if and make_in_expression

2021-03-28 Thread Will Jones (Jira)
Will Jones created ARROW-12129:
--

 Summary: [Python][Gandiva] Infer return types for make_if and 
make_in_expression
 Key: ARROW-12129
 URL: https://issues.apache.org/jira/browse/ARROW-12129
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Will Jones


On the {{TreeExprBuilder}} in {{pyarrow.gandiva}}, both the {{make_if}} and 
{{make_in_expression}} require the user to specify the return type. These could 
easily be inferred from the input values. ARROW-11342 exposes the return type 
of nodes as a method, so this should be easy to do once that is merged.

To keep the changes backwards compatible, we can make the return_type an 
optional argument.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11342) [Python] [Gandiva] Expose ToString and result type information

2021-01-21 Thread Will Jones (Jira)
Will Jones created ARROW-11342:
--

 Summary: [Python] [Gandiva] Expose ToString and result type 
information
 Key: ARROW-11342
 URL: https://issues.apache.org/jira/browse/ARROW-11342
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Will Jones
Assignee: Will Jones


To make it easier to build and introspect the expression trees, I would like to 
expose the ToString() methods on Node, Expression, and Condition, as well as 
the methods exposing the fields and types inside.


{code:python}
import pyarrow as pa
import pyarrow.gandiva as gandiva
builder = gandiva.TreeExprBuilder()

print(builder.make_literal(1000.0, pa.float64()))

{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

