[jira] [Created] (ARROW-16603) [Python] pyarrow.json.read_json ignores nullable=False in explicit_schema parse_options
Alenka Frim created ARROW-16603:
---
Summary: [Python] pyarrow.json.read_json ignores nullable=False in explicit_schema parse_options
Key: ARROW-16603
URL: https://issues.apache.org/jira/browse/ARROW-16603
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Alenka Frim

Reproducible example:

{code:python}
import json

import pyarrow.json as pj
import pyarrow as pa

s = {"id": "value", "nested": {"value": 1}}
with open("issue.json", "w") as write_file:
    json.dump(s, write_file, indent=4)

schema = pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("nested", pa.struct([pa.field("value", pa.int64(), nullable=False)]))
])

table = pj.read_json('issue.json', parse_options=pj.ParseOptions(explicit_schema=schema))
print(schema)
print(table.schema)
{code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
[jira] [Created] (ARROW-16602) [Dev] Use GitHub API to merge pull request
Kouhei Sutou created ARROW-16602:
Summary: [Dev] Use GitHub API to merge pull request
Key: ARROW-16602
URL: https://issues.apache.org/jira/browse/ARROW-16602
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

We use a local "git merge" to merge pull requests in {{dev/merge_arrow_pr.py}}. When a pull request is merged with a local "git merge", GitHub's Web UI shows a "Closed" mark instead of a "Merged" mark on the pull request page. This sometimes confuses new contributors: "Why was my pull request closed without merging?" See https://github.com/apache/arrow/pull/12004#issuecomment-1031619771 for an example.

If we instead use the GitHub API (https://docs.github.com/en/rest/pulls/pulls#merge-a-pull-request) to merge a pull request, GitHub's Web UI shows the "Merged" mark instead of the "Closed" mark. See https://github.com/apache/arrow/pull/13180 for an example; I used the GitHub API to merge that pull request. We also no longer need to create a branch in the local repository to merge a pull request. However, we must specify {{ARROW_GITHUB_API_TOKEN}} to run {{dev/merge_arrow_pr.py}}.
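For reference, the endpoint call could look roughly like this (a standard-library-only sketch; the repository, PR number, and token values are placeholders, and dev/merge_arrow_pr.py may structure this differently):

```python
import json
import urllib.request


def build_merge_request(repo, number, token, commit_title, merge_method="squash"):
    # PUT /repos/{owner}/{repo}/pulls/{pull_number}/merge
    url = f"https://api.github.com/repos/{repo}/pulls/{number}/merge"
    body = json.dumps({"commit_title": commit_title,
                       "merge_method": merge_method}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def merge_pull_request(repo, number, token, commit_title):
    # Sends the request; on success GitHub answers with {"merged": true, ...}.
    req = build_merge_request(repo, number, token, commit_title)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```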
[GitHub] [arrow-julia] ericphanson opened a new issue, #323: [ArrowTypes] add `@arrow_record`?
ericphanson opened a new issue, #323:
URL: https://github.com/apache/arrow-julia/issues/323

@kleinschmidt and I came up with

```julia
using Arrow: ArrowTypes

macro arrow_record(T)
    name = :(Symbol("JuliaLang.", string(parentmodule($T), '.', nameof($T))))
    return quote
        ArrowTypes.arrowname(::Type{$T}) = $name
        ArrowTypes.ArrowType(::Type{$T}) = fieldtypes($T)
        ArrowTypes.toarrow(obj::$T) = ntuple(i -> getfield(obj, i), fieldcount($T))
        ArrowTypes.JuliaType(::Val{$name}, ::Any) = $T
        ArrowTypes.fromarrow(::Type{$T}, args...) = $T(args...)
    end
end
```

as a quick way to declare that a non-parametrized concrete type should be serialized like a "record" (a la [`StructTypes.StructType`](https://juliadata.github.io/StructTypes.jl/stable/#StructTypes.Struct)). (Or maybe it should be called `@arrow_fields` or something?) In other words, we want it to just serialize out as a tuple of its fields with a metadata tag. This could be used as, e.g.

```julia
using Arrow

struct Foo
    bar::String
    x::Int
end

@arrow_record Foo

my_foo = Foo("hi", 1)
table = [(; foo_col=my_foo)]
roundtripped_table = Arrow.Table(Arrow.tobuffer(table))

using Test
@test roundtripped_table.foo_col[1] == my_foo # true
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (ARROW-16601) [C++][FlightRPC] Don't enforce static linking with static GoogleTest for arrow_flight_testing
Kouhei Sutou created ARROW-16601:
Summary: [C++][FlightRPC] Don't enforce static linking with static GoogleTest for arrow_flight_testing
Key: ARROW-16601
URL: https://issues.apache.org/jira/browse/ARROW-16601
Project: Apache Arrow
Issue Type: Improvement
Components: C++, FlightRPC
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

The link problem was fixed by ARROW-16588.
[jira] [Created] (ARROW-16600) [Java] Enable configurable scale coercion of BigDecimal
Todd Farmer created ARROW-16600:
---
Summary: [Java] Enable configurable scale coercion of BigDecimal
Key: ARROW-16600
URL: https://issues.apache.org/jira/browse/ARROW-16600
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Affects Versions: 8.0.0
Reporter: Todd Farmer
Assignee: Todd Farmer

Per ARROW-16427, JDBC drivers sometimes return ResultSets where the scale of the BigDecimal values in a single column differs by row. The existing mapping requires the scale to exactly match that of the target Arrow vector, which is created based on ResultSetMetaData (or configuration); when any row does not match exactly, an exception is thrown. To support JDBC drivers where scale may be inconsistent by row, Arrow should allow a less strict mode that coerces BigDecimals to target vectors with greater scale. The default strict behavior should be retained, but it may be useful to allow coercion to the proper target scale.
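The proposed semantics can be illustrated with Python's decimal module (a sketch of the behavior only; the actual Java change would operate on BigDecimal, and the function name and flag here are hypothetical):

```python
from decimal import Decimal


def coerce_to_scale(value, target_scale, strict=True):
    # "Scale" in the BigDecimal sense: the number of digits after the
    # decimal point.
    scale = -value.as_tuple().exponent
    if scale == target_scale:
        return value
    if strict or scale > target_scale:
        # Today's behavior: any mismatch (or any lossy narrowing) is an error.
        raise ValueError(f"scale {scale} does not match target scale {target_scale}")
    # Less-strict mode: widen losslessly to the greater target scale.
    return value.quantize(Decimal(1).scaleb(-target_scale))
```

With strict=False, rows of scale 1 and scale 2 could both land in a scale-2 vector, while narrowing (scale 3 into scale 2) would still fail.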
[jira] [Created] (ARROW-16599) [C++] Implementation of ExecuteScalarExpressionOverhead benchmarks without Arrow for comparison
Tobias Zagorni created ARROW-16599:
--
Summary: [C++] Implementation of ExecuteScalarExpressionOverhead benchmarks without Arrow for comparison
Key: ARROW-16599
URL: https://issues.apache.org/jira/browse/ARROW-16599
Project: Apache Arrow
Issue Type: Sub-task
Components: C++
Reporter: Tobias Zagorni
Assignee: Tobias Zagorni

The ExecuteScalarExpressionOverhead group of benchmarks currently gives us values we can compare across batch sizes or across expressions, but it does not show how well Arrow does compared to what is possible in general. The simple_expression (negate x) and complex_expression (x > 0 and x < 20) benchmarks, which perform an actual operation on data, can be implemented in pure C++ for comparison.

I implemented the complex_expression benchmark using technically unnecessary intermediate buffers for the > and < operator results, to match what happens in the Arrow expression. What may seem unfair is that I currently re-use the input/output/intermediate buffers across all iterations. I also tried using new and delete each time, but could not measure a difference in performance. Reusing allows the use of std::vector for slightly cleaner code; re-creating a vector each time would result in a lot of overhead initializing the vector values and is therefore not useful.
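The structure of that hand-written comparison can be sketched as follows (shown in Python for brevity; the actual baseline is pure C++). The point is the pre-allocated, reused intermediate buffers for the two comparison results, mirroring the Arrow expression's intermediates:

```python
def complex_expression_baseline(x, gt_buf, lt_buf, out):
    # Evaluate (x > 0) and (x < 20) through explicit intermediate buffers,
    # matching what the Arrow expression does internally.
    for i, v in enumerate(x):
        gt_buf[i] = v > 0
    for i, v in enumerate(x):
        lt_buf[i] = v < 20
    for i in range(len(x)):
        out[i] = gt_buf[i] and lt_buf[i]
    return out


# Buffers are allocated once and reused across iterations, as in the benchmark.
n = 3
gt_buf, lt_buf, out = [False] * n, [False] * n, [False] * n
result = complex_expression_baseline([5, 25, -1], gt_buf, lt_buf, out)
print(result)  # [True, False, False]
```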
Example output:

{code}
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:1000/real_time/threads:1    3328161 ns   3326213 ns   1277   batches_per_second=300.466k/s rows_per_second=300.466M/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:1000/real_time/threads:16    754880 ns  11940432 ns   5680   batches_per_second=1.32471M/s rows_per_second=1.32471G/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:1/real_time/threads:1       1370993 ns   1370182 ns   3047   batches_per_second=72.9398k/s rows_per_second=729.398M/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:1/real_time/threads:16       213412 ns   3377187 ns  20608   batches_per_second=468.578k/s rows_per_second=4.68578G/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:10/real_time/threads:1      1194552 ns   1192163 ns   3494   batches_per_second=8.37134k/s rows_per_second=837.134M/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:10/real_time/threads:16      193390 ns   3047981 ns  22576   batches_per_second=51.709k/s  rows_per_second=5.1709G/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:100/real_time/threads:1     1243416 ns   1240591 ns   3325   batches_per_second=804.236/s  rows_per_second=804.236M/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:100/real_time/threads:16     449956 ns   7057594 ns   9216   batches_per_second=2.22244k/s rows_per_second=2.22244G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:1000/real_time/threads:1     1153192 ns   1151060 ns   3580   batches_per_second=867.158k/s rows_per_second=867.158M/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:1000/real_time/threads:16     297876 ns   4705702 ns  15152   batches_per_second=3.3571M/s  rows_per_second=3.3571G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:1/real_time/threads:1         519083 ns    518087 ns   8027   batches_per_second=192.647k/s rows_per_second=1.92647G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:1/real_time/threads:16         70329 ns   1106796 ns  62320   batches_per_second=1.42189M/s rows_per_second=14.2189G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:10/real_time/threads:1        420460 ns    419404 ns   9878   batches_per_second=23.7835k/s rows_per_second=2.37835G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:10/real_time/threads:16        75645 ns   1189925 ns  56864   batches_per_second=132.196k/s rows_per_second=13.2196G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:100/real_time/threads:1       425360 ns    424499 ns   9404   batches_per_second=2.35095k/s rows_per_second=2.35095G/s
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:100/real_time/threads:16     1057920 ns  16308254 ns   3984   batches_per_second=945.251/s  rows_per_second=945.251M/s
ExecuteScalarExpressionBaseline/rows_per_batch:1000/real_time/threads:1                        876620 ns    876032 ns   4787   batches_per_second=1.14075M/s rows_per_second=1.14075G/s
{code}

baseline:

{code}
ExecuteScalarExpressionBaseline/rows_per_batch:1000/real_time/threads:16                       106371 ns   1657205 ns  41536   batches_per_second=9.40109M/s rows_per_second=9
{code}
[jira] [Created] (ARROW-16598) [R] Sorting data.frame prior to writing Parquet affects file size
Michael Culshaw-Maurer created ARROW-16598:
--
Summary: [R] Sorting data.frame prior to writing Parquet affects file size
Key: ARROW-16598
URL: https://issues.apache.org/jira/browse/ARROW-16598
Project: Apache Arrow
Issue Type: Bug
Components: R
Environment: MacBook Pro (non-M1), other info in R file
Reporter: Michael Culshaw-Maurer
Attachments: arrow_parquet_bug.R

When using the arrow R package, sorting a data.frame prior to calling write_parquet() results in different file sizes, depending on how the data.frame is sorted. I have attached a reproducible example showing how a few different sorting methods can lead to 2-3 fold changes in .parquet file size. It may be that I don't know enough about Parquet internals, but at the very least I think this behavior should be documented on the arrow R package site. Most R users treat sorting as a convenience and don't expect it to lead to performance changes when writing to a file.
[jira] [Created] (ARROW-16597) [Python][FlightRPC] Active server may segfault if Python interpreter shuts down
David Li created ARROW-16597:
Summary: [Python][FlightRPC] Active server may segfault if Python interpreter shuts down
Key: ARROW-16597
URL: https://issues.apache.org/jira/browse/ARROW-16597
Project: Apache Arrow
Issue Type: Bug
Components: FlightRPC, Python
Affects Versions: 8.0.0
Reporter: David Li
Assignee: David Li

On Linux, the script below reliably segfaults for me with {{FATAL: exception not rethrown}}. Adding a {{server.shutdown()}} at the end fixes it.

The reason is that the Python interpreter exits after running the script, and other Python threads [call PyThread_exit_thread|https://github.com/python/cpython/blob/v3.10.4/Python/ceval_gil.h#L221]. But one of those Python threads is still in the middle of executing the RPC handler. PyThread_exit_thread boils down to pthread_exit, which works by throwing an exception that it expects will not be caught. gRPC, however, places a {{catch(...)}} around RPC handlers and catches this exception, and pthreads then aborts because its exception was not rethrown. We should force servers to shut down at exit to avoid this.

{code:python}
import traceback

import pyarrow as pa
import pyarrow.flight as flight

class Server(flight.FlightServerBase):
    def do_put(self, context, descriptor, reader, writer):
        raise flight.FlightCancelledError("foo", extra_info=b"bar")

print("PyArrow version:", pa.__version__)
server = Server("grpc://localhost:0")
client = flight.connect(f"grpc://localhost:{server.port}")
schema = pa.schema([])
writer, reader = client.do_put(flight.FlightDescriptor.for_command(b""), schema)
try:
    writer.done_writing()
except flight.FlightError as e:
    traceback.print_exc()
    print(e.extra_info)
except Exception:
    traceback.print_exc()
{code}
[jira] [Created] (ARROW-16596) [C++] Add option to control the cutoff between 1900 and 2000 when parsing %y
Dragoș Moldovan-Grünfeld created ARROW-16596:
Summary: [C++] Add option to control the cutoff between 1900 and 2000 when parsing %y
Key: ARROW-16596
URL: https://issues.apache.org/jira/browse/ARROW-16596
Project: Apache Arrow
Issue Type: Improvement
Components: C++, R
Affects Versions: 8.0.0
Reporter: Dragoș Moldovan-Grünfeld

When parsing a string with the year in the short format ({{%y}}) to a datetime, it would be great to have control over the cutoff point between 1900 and 2000. Currently it is implicitly set to 68:

{code:r}
library(arrow, warn.conflicts = FALSE)
a <- Array$create(c("68-05-17", "69-05-17"))
call_function("strptime", a, options = list(format = "%y-%m-%d", unit = 0L))
#> Array
#>
#> [
#>   2068-05-17 00:00:00,
#>   1969-05-17 00:00:00
#> ]
{code}

For example, lubridate names this argument {{cutoff_2000}} (e.g. for {{fast_strptime()}}). It works as follows:

{code:r}
library(lubridate, warn.conflicts = FALSE)
dates_vector <- c("68-05-17", "69-05-17", "55-05-17")
fast_strptime(dates_vector, format = "%y-%m-%d")
#> [1] "2068-05-17 UTC" "1969-05-17 UTC" "2055-05-17 UTC"
fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 50)
#> [1] "1968-05-17 UTC" "1969-05-17 UTC" "1955-05-17 UTC"
fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 70)
#> [1] "2068-05-17 UTC" "2069-05-17 UTC" "2055-05-17 UTC"
{code}

In the {{lubridate::fast_strptime()}} documentation it is described as follows:

{quote}
cutoff_2000 integer. For y format, two-digit numbers smaller or equal to cutoff_2000 are parsed as though starting with 20, otherwise parsed as though starting with 19. Available only for functions relying on lubridates internal parser.
{quote}
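The requested pivot logic is simple to state (a Python sketch of the semantics only; the name cutoff_2000 follows lubridate and is not an existing Arrow option):

```python
from datetime import datetime


def parse_two_digit_year(text, cutoff_2000=68):
    # %y pivot: two-digit years <= cutoff_2000 become 20xx, otherwise 19xx.
    # The default of 68 matches Arrow's current implicit cutoff.
    yy, mm, dd = (int(part) for part in text.split("-"))
    century = 2000 if yy <= cutoff_2000 else 1900
    return datetime(century + yy, mm, dd)


print(parse_two_digit_year("68-05-17"))                  # 2068-05-17
print(parse_two_digit_year("68-05-17", cutoff_2000=50))  # 1968-05-17
```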
[jira] [Created] (ARROW-16595) [WEBSITE] DataFusion 8.0.0 Release Blog Post
Andy Grove created ARROW-16595:
--
Summary: [WEBSITE] DataFusion 8.0.0 Release Blog Post
Key: ARROW-16595
URL: https://issues.apache.org/jira/browse/ARROW-16595
Project: Apache Arrow
Issue Type: Task
Components: Website
Reporter: Andy Grove

DataFusion 8.0.0 Release Blog Post
[jira] [Created] (ARROW-16594) [R] Consistently use "getOption" to set nightly repo
Jacob Wujciak-Jens created ARROW-16594:
--
Summary: [R] Consistently use "getOption" to set nightly repo
Key: ARROW-16594
URL: https://issues.apache.org/jira/browse/ARROW-16594
Project: Apache Arrow
Issue Type: Wish
Components: R
Reporter: Jacob Wujciak-Jens
Fix For: 9.0.0

In {{install_arrow}} the nightly repo URL is set via {{getOption("arrow.dev_repo", "https://...")}}; this should also be the case in {{winlibs.R}} and {{nixlibs.R}}.
[jira] [Created] (ARROW-16593) [CI][Python] test_plasma crashes
Antoine Pitrou created ARROW-16593:
--
Summary: [CI][Python] test_plasma crashes
Key: ARROW-16593
URL: https://issues.apache.org/jira/browse/ARROW-16593
Project: Apache Arrow
Issue Type: Bug
Components: C++ - Plasma, Continuous Integration, Python
Reporter: Antoine Pitrou

{{test_plasma}} has started crashing on various CI configurations recently: Debian Python and Ubuntu Python, but also macOS Python.

Example test log: https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=25782&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=9148
[jira] [Created] (ARROW-16592) [FlightRPC][Python] Regression in DoPut error handling
Lubo Slivka created ARROW-16592:
---
Summary: [FlightRPC][Python] Regression in DoPut error handling
Key: ARROW-16592
URL: https://issues.apache.org/jira/browse/ARROW-16592
Project: Apache Arrow
Issue Type: Bug
Reporter: Lubo Slivka

In PyArrow 8.0.0, any error raised while handling DoPut on the server results in a FlightInternalError on the client. In PyArrow 7.0.0, errors raised while handling DoPut were propagated/converted to non-internal errors.

---

Example: on 7.0.0, raising FlightCancelledError while handling DoPut on the server would propagate that error, including extra_info, all the way to the FlightClient. This is no longer the case on 8.0.0. The FlightInternalError does, however, contain extra detail derived from the cancelled error:

{code:java}
/arrow/cpp/src/arrow/flight/client.cc:363: Close() failed: IOError: . Detail: Cancelled. gRPC client debug context: {"created":"@1652777650.446052211","description":"Error received from peer ipv4:127.0.0.1:16001","file":"/opt/vcpkg/buildtrees/grpc/src/85a295989c-6cf7bf442d.clean/src/core/lib/surface/call.cc","file_line":903,"grpc_message":". Detail: Cancelled","grpc_status":1}. Client context: OK. Detail: Cancelled
{code}