[jira] [Created] (ARROW-16603) [Python] pyarrow.json.read_json ignores nullable=False in explicit_schema parse_options
Alenka Frim created ARROW-16603:
---
Summary: [Python] pyarrow.json.read_json ignores nullable=False in explicit_schema parse_options
Key: ARROW-16603
URL: https://issues.apache.org/jira/browse/ARROW-16603
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Alenka Frim

Reproducible example:

{code:python}
import json

import pyarrow.json as pj
import pyarrow as pa

s = {"id": "value", "nested": {"value": 1}}
with open("issue.json", "w") as write_file:
    json.dump(s, write_file, indent=4)

schema = pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("nested", pa.struct([pa.field("value", pa.int64(), nullable=False)]))
])

table = pj.read_json('issue.json', parse_options=pj.ParseOptions(explicit_schema=schema))
print(schema)
print(table.schema)
{code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
[jira] [Created] (ARROW-16602) [Dev] Use GitHub API to merge pull request
Kouhei Sutou created ARROW-16602:
Summary: [Dev] Use GitHub API to merge pull request
Key: ARROW-16602
URL: https://issues.apache.org/jira/browse/ARROW-16602
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

We use a local "git merge" to merge pull requests in {{dev/merge_arrow_pr.py}}. When a pull request is merged with a local "git merge", GitHub's Web UI shows a "Closed" mark instead of a "Merged" mark on the pull request page. This sometimes confuses new contributors: "Why was my pull request closed without merging?" See https://github.com/apache/arrow/pull/12004#issuecomment-1031619771 for an example.

If we instead use the GitHub API (https://docs.github.com/en/rest/pulls/pulls#merge-a-pull-request) to merge a pull request, GitHub's Web UI shows the "Merged" mark instead of the "Closed" mark. See https://github.com/apache/arrow/pull/13180 for an example; I used the GitHub API to merge that pull request. We also no longer need to create a branch in the local repository to merge a pull request. However, we must specify {{ARROW_GITHUB_API_TOKEN}} to run {{dev/merge_arrow_pr.py}}.
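For reference, the endpoint call could look roughly like this (a standard-library-only sketch; the repository, PR number, and token values are placeholders, and dev/merge_arrow_pr.py may structure this differently):

```python
import json
import urllib.request


def build_merge_request(repo, number, token, commit_title, merge_method="squash"):
    # PUT /repos/{owner}/{repo}/pulls/{pull_number}/merge
    url = f"https://api.github.com/repos/{repo}/pulls/{number}/merge"
    body = json.dumps({"commit_title": commit_title,
                       "merge_method": merge_method}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def merge_pull_request(repo, number, token, commit_title):
    # Sends the request; on success GitHub answers with {"merged": true, ...}.
    req = build_merge_request(repo, number, token, commit_title)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```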
[GitHub] [arrow-julia] ericphanson opened a new issue, #323: [ArrowTypes] add `@arrow_record`?
ericphanson opened a new issue, #323:
URL: https://github.com/apache/arrow-julia/issues/323

@kleinschmidt and I came up with

```julia
using Arrow: ArrowTypes

macro arrow_record(T)
    name = :(Symbol("JuliaLang.", string(parentmodule($T), '.', nameof($T))))
    return quote
        ArrowTypes.arrowname(::Type{$T}) = $name
        ArrowTypes.ArrowType(::Type{$T}) = fieldtypes($T)
        ArrowTypes.toarrow(obj::$T) = ntuple(i -> getfield(obj, i), fieldcount($T))
        ArrowTypes.JuliaType(::Val{$name}, ::Any) = $T
        ArrowTypes.fromarrow(::Type{$T}, args...) = $T(args...)
    end
end
```

as a quick way to declare that a non-parametrized concrete type should be serialized like a "record" (a la [`StructTypes.StructType`](https://juliadata.github.io/StructTypes.jl/stable/#StructTypes.Struct)). (Or maybe it should be called `@arrow_fields` or something?) In other words, we want it to just serialize out as a tuple of its fields with a metadata tag. This could be used as, e.g.

```julia
using Arrow

struct Foo
    bar::String
    x::Int
end

@arrow_record Foo

my_foo = Foo("hi", 1)
table = [(; foo_col=my_foo)]
roundtripped_table = Arrow.Table(Arrow.tobuffer(table))

using Test
@test roundtripped_table.foo_col[1] == my_foo # true
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (ARROW-16601) [C++][FlightRPC] Don't enforce static linking with static GoogleTest for arrow_flight_testing
Kouhei Sutou created ARROW-16601:
Summary: [C++][FlightRPC] Don't enforce static linking with static GoogleTest for arrow_flight_testing
Key: ARROW-16601
URL: https://issues.apache.org/jira/browse/ARROW-16601
Project: Apache Arrow
Issue Type: Improvement
Components: C++, FlightRPC
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

The link problem was fixed by ARROW-16588.
[jira] [Created] (ARROW-16600) [Java] Enable configurable scale coercion of BigDecimal
Todd Farmer created ARROW-16600:
---
Summary: [Java] Enable configurable scale coercion of BigDecimal
Key: ARROW-16600
URL: https://issues.apache.org/jira/browse/ARROW-16600
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Affects Versions: 8.0.0
Reporter: Todd Farmer
Assignee: Todd Farmer

Per ARROW-16427, JDBC drivers sometimes return ResultSets where the scale of the BigDecimal values in a single column differs by row. The existing mapping requires the scale to exactly match that of the target Arrow vector, which is created based on ResultSetMetaData (or configuration); when any row does not match exactly, an exception is thrown. To support JDBC drivers where scale may be inconsistent by row, Arrow should allow a less strict mode that coerces BigDecimals to target vectors with greater scale. The default strict behavior should be retained, but it may be useful to allow coercion to the proper target scale.
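The proposed semantics can be illustrated with Python's decimal module (a sketch of the behavior only; the actual Java change would operate on BigDecimal, and the function name and flag here are hypothetical):

```python
from decimal import Decimal


def coerce_to_scale(value, target_scale, strict=True):
    # "Scale" in the BigDecimal sense: the number of digits after the
    # decimal point.
    scale = -value.as_tuple().exponent
    if scale == target_scale:
        return value
    if strict or scale > target_scale:
        # Today's behavior: any mismatch (or any lossy narrowing) is an error.
        raise ValueError(f"scale {scale} does not match target scale {target_scale}")
    # Less-strict mode: widen losslessly to the greater target scale.
    return value.quantize(Decimal(1).scaleb(-target_scale))
```

With strict=False, rows of scale 1 and scale 2 could both land in a scale-2 vector, while narrowing (scale 3 into scale 2) would still fail.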
[jira] [Created] (ARROW-16599) [C++] Implementation of ExecuteScalarExpressionOverhead benchmarks without Arrow for comparison
Tobias Zagorni created ARROW-16599:
--
Summary: [C++] Implementation of ExecuteScalarExpressionOverhead benchmarks without Arrow for comparison
Key: ARROW-16599
URL: https://issues.apache.org/jira/browse/ARROW-16599
Project: Apache Arrow
Issue Type: Sub-task
Components: C++
Reporter: Tobias Zagorni
Assignee: Tobias Zagorni

The ExecuteScalarExpressionOverhead group of benchmarks currently gives us values we can compare across batch sizes or across expressions, but it does not show how well Arrow does compared to what is possible in general. The simple_expression (negate x) and complex_expression (x > 0 and x < 20) benchmarks, which perform an actual operation on data, can be implemented in pure C++ for comparison.

I implemented the complex_expression benchmark using technically unnecessary intermediate buffers for the > and < operator results, to match what happens in the Arrow expression. What may seem unfair is that I currently re-use the input/output/intermediate buffers across all iterations. I also tried using new and delete each time, but could not measure a difference in performance. Reusing allows the use of std::vector for slightly cleaner code; re-creating a vector each time would result in a lot of overhead initializing the vector values and is therefore not useful.
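The structure of that hand-written comparison can be sketched as follows (shown in Python for brevity; the actual baseline is pure C++). The point is the pre-allocated, reused intermediate buffers for the two comparison results, mirroring the Arrow expression's intermediates:

```python
def complex_expression_baseline(x, gt_buf, lt_buf, out):
    # Evaluate (x > 0) and (x < 20) through explicit intermediate buffers,
    # matching what the Arrow expression does internally.
    for i, v in enumerate(x):
        gt_buf[i] = v > 0
    for i, v in enumerate(x):
        lt_buf[i] = v < 20
    for i in range(len(x)):
        out[i] = gt_buf[i] and lt_buf[i]
    return out


# Buffers are allocated once and reused across iterations, as in the benchmark.
n = 3
gt_buf, lt_buf, out = [False] * n, [False] * n, [False] * n
result = complex_expression_baseline([5, 25, -1], gt_buf, lt_buf, out)
print(result)  # [True, False, False]
```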
Example output:

{code}
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:1000/real_time/threads:1    3328161 ns   3326213 ns   1277   batches_per_second=300.466k/s rows_per_second=300.466M/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:1000/real_time/threads:16    754880 ns  11940432 ns   5680   batches_per_second=1.32471M/s rows_per_second=1.32471G/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:1/real_time/threads:1       1370993 ns   1370182 ns   3047   batches_per_second=72.9398k/s rows_per_second=729.398M/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:1/real_time/threads:16       213412 ns   3377187 ns  20608   batches_per_second=468.578k/s rows_per_second=4.68578G/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:10/real_time/threads:1      1194552 ns   1192163 ns   3494   batches_per_second=8.37134k/s rows_per_second=837.134M/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:10/real_time/threads:16      193390 ns   3047981 ns  22576   batches_per_second=51.709k/s  rows_per_second=5.1709G/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:100/real_time/threads:1     1243416 ns   1240591 ns   3325   batches_per_second=804.236/s  rows_per_second=804.236M/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:100/real_time/threads:16     449956 ns   7057594 ns   9216   batches_per_second=2.22244k/s rows_per_second=2.22244G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:1000/real_time/threads:1     1153192 ns   1151060 ns   3580   batches_per_second=867.158k/s rows_per_second=867.158M/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:1000/real_time/threads:16     297876 ns   4705702 ns  15152   batches_per_second=3.3571M/s  rows_per_second=3.3571G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:1/real_time/threads:1         519083 ns    518087 ns   8027   batches_per_second=192.647k/s rows_per_second=1.92647G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:1/real_time/threads:16         70329 ns   1106796 ns  62320   batches_per_second=1.42189M/s rows_per_second=14.2189G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:10/real_time/threads:1        420460 ns    419404 ns   9878   batches_per_second=23.7835k/s rows_per_second=2.37835G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:10/real_time/threads:16        75645 ns   1189925 ns  56864   batches_per_second=132.196k/s rows_per_second=13.2196G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:100/real_time/threads:1       425360 ns    424499 ns   9404   batches_per_second=2.35095k/s rows_per_second=2.35095G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:100/real_time/threads:16     1057920 ns  16308254 ns   3984   batches_per_second=945.251/s  rows_per_second=945.251M/s
ExecuteScalarExpressionBaseline/rows_per_batch:1000/real_time/threads:1                        876620 ns    876032 ns   4787   batches_per_second=1.14075M/s rows_per_second=1.14075G/s
{code}

baseline:

{code}
ExecuteScalarExpressionBaseline/rows_per_batch:1000/real_time/threads:16                       106371 ns   1657205 ns  41536   batches_per_second=9.40109M/s rows_per_second=9
{code}
[jira] [Created] (ARROW-16598) [R] Sorting data.frame prior to writing Parquet affects file size
Michael Culshaw-Maurer created ARROW-16598:
--
Summary: [R] Sorting data.frame prior to writing Parquet affects file size
Key: ARROW-16598
URL: https://issues.apache.org/jira/browse/ARROW-16598
Project: Apache Arrow
Issue Type: Bug
Components: R
Environment: MacBook Pro (non-M1), other info in R file
Reporter: Michael Culshaw-Maurer
Attachments: arrow_parquet_bug.R

When using the arrow R package, sorting a data.frame prior to calling write_parquet() results in different file sizes, depending on how the data.frame is sorted. I have attached a reproducible example showing how a few different sorting methods can lead to 2-3 fold changes in .parquet file size. It may be that I don't know enough about Parquet internals, but at the very least I think this behavior should be documented on the arrow R package site. Most R users treat sorting as a convenience and don't expect it to lead to performance changes when writing to a file.
[jira] [Created] (ARROW-16597) [Python][FlightRPC] Active server may segfault if Python interpreter shuts down
David Li created ARROW-16597:
Summary: [Python][FlightRPC] Active server may segfault if Python interpreter shuts down
Key: ARROW-16597
URL: https://issues.apache.org/jira/browse/ARROW-16597
Project: Apache Arrow
Issue Type: Bug
Components: FlightRPC, Python
Affects Versions: 8.0.0
Reporter: David Li
Assignee: David Li

On Linux, the script below reliably segfaults for me with {{FATAL: exception not rethrown}}. Adding a {{server.shutdown()}} at the end fixes it.

The reason is that the Python interpreter exits after running the script, and other Python threads [call PyThread_exit_thread|https://github.com/python/cpython/blob/v3.10.4/Python/ceval_gil.h#L221]. But one of those Python threads is still in the middle of executing the RPC handler. PyThread_exit_thread boils down to pthread_exit, which works by throwing an exception that it expects will not be caught. gRPC, however, places a {{catch(...)}} around RPC handlers and catches this exception, and pthreads then aborts because its exception was not rethrown. We should force servers to shut down at exit to avoid this.

{code:python}
import traceback

import pyarrow as pa
import pyarrow.flight as flight

class Server(flight.FlightServerBase):
    def do_put(self, context, descriptor, reader, writer):
        raise flight.FlightCancelledError("foo", extra_info=b"bar")

print("PyArrow version:", pa.__version__)
server = Server("grpc://localhost:0")
client = flight.connect(f"grpc://localhost:{server.port}")
schema = pa.schema([])
writer, reader = client.do_put(flight.FlightDescriptor.for_command(b""), schema)
try:
    writer.done_writing()
except flight.FlightError as e:
    traceback.print_exc()
    print(e.extra_info)
except Exception:
    traceback.print_exc()
{code}
[jira] [Created] (ARROW-16596) [C++] Add option to control the cutoff between 1900 and 2000 when parsing %y
Dragoș Moldovan-Grünfeld created ARROW-16596:
Summary: [C++] Add option to control the cutoff between 1900 and 2000 when parsing %y
Key: ARROW-16596
URL: https://issues.apache.org/jira/browse/ARROW-16596
Project: Apache Arrow
Issue Type: Improvement
Components: C++, R
Affects Versions: 8.0.0
Reporter: Dragoș Moldovan-Grünfeld

When parsing a string with the year in the short format ({{%y}}) to a datetime, it would be great to have control over the cutoff point between 1900 and 2000. Currently it is implicitly set to 68:

{code:r}
library(arrow, warn.conflicts = FALSE)
a <- Array$create(c("68-05-17", "69-05-17"))
call_function("strptime", a, options = list(format = "%y-%m-%d", unit = 0L))
#> Array
#>
#> [
#>   2068-05-17 00:00:00,
#>   1969-05-17 00:00:00
#> ]
{code}

For example, lubridate names this argument {{cutoff_2000}} (e.g. for {{fast_strptime()}}). It works as follows:

{code:r}
library(lubridate, warn.conflicts = FALSE)
dates_vector <- c("68-05-17", "69-05-17", "55-05-17")
fast_strptime(dates_vector, format = "%y-%m-%d")
#> [1] "2068-05-17 UTC" "1969-05-17 UTC" "2055-05-17 UTC"
fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 50)
#> [1] "1968-05-17 UTC" "1969-05-17 UTC" "1955-05-17 UTC"
fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 70)
#> [1] "2068-05-17 UTC" "2069-05-17 UTC" "2055-05-17 UTC"
{code}

In the {{lubridate::fast_strptime()}} documentation it is described as follows:

{quote}
cutoff_2000 integer. For y format, two-digit numbers smaller or equal to cutoff_2000 are parsed as though starting with 20, otherwise parsed as though starting with 19. Available only for functions relying on lubridates internal parser.
{quote}
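The requested pivot logic is simple to state (a Python sketch of the semantics only; the name cutoff_2000 follows lubridate and is not an existing Arrow option):

```python
from datetime import datetime


def parse_two_digit_year(text, cutoff_2000=68):
    # %y pivot: two-digit years <= cutoff_2000 become 20xx, otherwise 19xx.
    # The default of 68 matches Arrow's current implicit cutoff.
    yy, mm, dd = (int(part) for part in text.split("-"))
    century = 2000 if yy <= cutoff_2000 else 1900
    return datetime(century + yy, mm, dd)


print(parse_two_digit_year("68-05-17"))                  # 2068-05-17
print(parse_two_digit_year("68-05-17", cutoff_2000=50))  # 1968-05-17
```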
[jira] [Created] (ARROW-16595) [WEBSITE] DataFusion 8.0.0 Release Blog Post
Andy Grove created ARROW-16595:
--
Summary: [WEBSITE] DataFusion 8.0.0 Release Blog Post
Key: ARROW-16595
URL: https://issues.apache.org/jira/browse/ARROW-16595
Project: Apache Arrow
Issue Type: Task
Components: Website
Reporter: Andy Grove

DataFusion 8.0.0 Release Blog Post
[jira] [Created] (ARROW-16594) [R] Consistently use "getOption" to set nightly repo
Jacob Wujciak-Jens created ARROW-16594:
--
Summary: [R] Consistently use "getOption" to set nightly repo
Key: ARROW-16594
URL: https://issues.apache.org/jira/browse/ARROW-16594
Project: Apache Arrow
Issue Type: Wish
Components: R
Reporter: Jacob Wujciak-Jens
Fix For: 9.0.0

In {{install_arrow}} the nightly repo URL is set via {{getOption("arrow.dev_repo", "https://...")}}; this should also be the case in {{winlibs.R}} and {{nixlibs.R}}.
[jira] [Created] (ARROW-16593) [CI][Python] test_plasma crashes
Antoine Pitrou created ARROW-16593:
--
Summary: [CI][Python] test_plasma crashes
Key: ARROW-16593
URL: https://issues.apache.org/jira/browse/ARROW-16593
Project: Apache Arrow
Issue Type: Bug
Components: C++ - Plasma, Continuous Integration, Python
Reporter: Antoine Pitrou

{{test_plasma}} has started crashing on various CI configurations recently: Debian Python and Ubuntu Python, but also macOS Python.

Example test log: https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=25782&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=9148
[jira] [Created] (ARROW-16592) [FlightRPC][Python] Regression in DoPut error handling
Lubo Slivka created ARROW-16592:
---
Summary: [FlightRPC][Python] Regression in DoPut error handling
Key: ARROW-16592
URL: https://issues.apache.org/jira/browse/ARROW-16592
Project: Apache Arrow
Issue Type: Bug
Reporter: Lubo Slivka

In PyArrow 8.0.0, any error raised while handling DoPut on the server results in a FlightInternalError on the client. In PyArrow 7.0.0, errors raised while handling DoPut were propagated/converted to non-internal errors.

---

Example: on 7.0.0, raising FlightCancelledError while handling DoPut on the server would propagate that error, including extra_info, all the way to the FlightClient. This is no longer the case on 8.0.0. The FlightInternalError does, however, contain extra detail derived from the cancelled error:

{code:java}
/arrow/cpp/src/arrow/flight/client.cc:363: Close() failed: IOError: . Detail: Cancelled. gRPC client debug context: {"created":"@1652777650.446052211","description":"Error received from peer ipv4:127.0.0.1:16001","file":"/opt/vcpkg/buildtrees/grpc/src/85a295989c-6cf7bf442d.clean/src/core/lib/surface/call.cc","file_line":903,"grpc_message":". Detail: Cancelled","grpc_status":1}. Client context: OK. Detail: Cancelled
{code}