[jira] [Updated] (ARROW-10337) [C++] More liberal parsing of ISO8601 timestamps with fractional seconds

2020-10-19 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10337:
-
Summary: [C++] More liberal parsing of ISO8601 timestamps with fractional 
seconds  (was: More liberal parsing of ISO8601 timestamps with fractional 
seconds)

> [C++] More liberal parsing of ISO8601 timestamps with fractional seconds
> 
>
> Key: ARROW-10337
> URL: https://issues.apache.org/jira/browse/ARROW-10337
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Smith
>Assignee: Frank Smith
>Priority: Minor
>
> The current ISO8601 timestamp parser assumes MILLI timestamps have 3 decimal 
> places, MICRO have 6 and NANO have 9. From ParseTimestampISO8601 in 
> cpp/src/arrow/util/value_parsing.h:
> {{ // We allow the following formats for all units:}}
> {{ // - "YYYY-MM-DD"}}
> {{ // - "YYYY-MM-DD[ T]hh"}}
> {{ // - "YYYY-MM-DD[ T]hhZ"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm"}}
> {{ // - "YYYY-MM-DD[ T]hh:mmZ"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ssZ"}}
> {{ //}}
> {{ // We allow the following formats for unit==MILLI:}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.mmm"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.mmmZ"}}
> {{ //}}
> {{ // We allow the following formats for unit==MICRO:}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.uuuuuu"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.uuuuuuZ"}}
> {{ //}}
> {{ // We allow the following formats for unit==NANO:}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.nnnnnnnnn"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.nnnnnnnnnZ"}}
> {{ //}}
> I propose that we change the parser to accept 1 to 3 digits for MILLI, 1 to 6 
> digits for MICRO, and 1 to 9 digits for NANO, as follows:
> {{ // We allow the following formats for all units:}}
> {{ // - "YYYY-MM-DD"}}
> {{ // - "YYYY-MM-DD[ T]hhZ?"}}
> {{ // - "YYYY-MM-DD[ T]hh:mmZ?"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ssZ?"}}
> {{ //}}
> {{ // We allow the following formats for unit == MILLI, MICRO, or NANO:}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.s{1,3}Z?"}}
> {{ //}}
> {{ // We allow the following formats for unit == MICRO or NANO:}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.s{4,6}Z?"}}
> {{ //}}
> {{ // We allow the following formats for unit == NANO:}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.s{7,9}Z?"}}
> This will allow for parsing of timestamps when e.g. a CSV file does not write 
> timestamps with trailing zeroes.
> I am almost finished implementing this functionality, so a PR will follow 
> soon.
>  
>  
>   
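
As an illustration of the proposed rule, here is a minimal Python sketch (illustrative only; the actual parser is the C++ code in cpp/src/arrow/util/value_parsing.h, and all names below are hypothetical). A fraction of 1 to N digits is accepted for a unit with precision N and right-padded with zeros before scaling:

{code:python}
import re

# Hypothetical sketch of the proposed acceptance rule, not the real parser.
UNIT_DIGITS = {"ms": 3, "us": 6, "ns": 9}

TS_RE = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})[ T](\d{2}):(\d{2}):(\d{2})(?:\.(\d{1,9}))?Z?$"
)

def parse_fraction(ts, unit):
    """Return the fractional part scaled to `unit`, accepting 1..N digits."""
    m = TS_RE.match(ts)
    if not m:
        raise ValueError("Failed to parse: " + ts)
    frac = m.group(7) or ""
    max_digits = UNIT_DIGITS[unit]
    if len(frac) > max_digits:
        raise ValueError("Too many fractional digits for " + unit + ": " + ts)
    # Right-pad with zeros: ".5" at unit 'ms' means 500 milliseconds.
    return int(frac.ljust(max_digits, "0"))

assert parse_fraction("2020-10-19 12:34:56.5", "ms") == 500
assert parse_fraction("2020-10-19 12:34:56.123456", "us") == 123456
{code}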



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10309) [Ruby] gem install red-arrow fails

2020-10-19 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217230#comment-17217230
 ] 

Kouhei Sutou commented on ARROW-10309:
--

Really?
{{yum install -y ruby}} installs Ruby 2.0.0, not Ruby 2.6.3.

Could you show the exact command line that you used to install your Ruby 2.6.3?

> [Ruby] gem install red-arrow fails
> --
>
> Key: ARROW-10309
> URL: https://issues.apache.org/jira/browse/ARROW-10309
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Reporter: Bhargav Parsi
>Priority: Major
> Attachments: error2.txt, image-2020-10-14-14-51-27-796.png
>
>
> I am trying to install red-arrow on CentOS 
> (centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3.
> I followed the steps mentioned here: 
> [https://arrow.apache.org/install/]
> I used the steps for CentOS 6/7.
> After that I ran `gem install red-arrow`, which gives:
> !image-2020-10-14-14-51-27-796.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10351) [C++][Flight] See if reading/writing to gRPC get/put streams asynchronously helps performance

2020-10-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-10351:


 Summary: [C++][Flight] See if reading/writing to gRPC get/put 
streams asynchronously helps performance
 Key: ARROW-10351
 URL: https://issues.apache.org/jira/browse/ARROW-10351
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


We don't use any asynchronous concepts in the way that Flight is implemented 
now, i.e. IPC deconstruction/reconstruction (which may include compression!) is 
not performed concurrently with moving FlightData objects through the gRPC 
machinery, which may yield suboptimal performance. 

It might be better to apply an actor-type approach where a dedicated thread 
retrieves and prepares the next raw IPC message (within a Future) while the 
current IPC message is being processed -- that way reading/writing to/from the 
gRPC stream is not blocked on the IPC code doing its thing. 
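
A hedged sketch of that pipelining idea in Python (purely illustrative; the real work would live in the C++ Flight code, and prepare/process stand in for IPC (de)construction and gRPC I/O):

{code:python}
from concurrent.futures import ThreadPoolExecutor

def pipelined(messages, prepare, process):
    # A worker prepares the next message while the current one is processed,
    # so neither side blocks on the other doing its thing.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(prepare, messages[0])
        for nxt in messages[1:]:
            current = future.result()           # wait for the prepared message
            future = pool.submit(prepare, nxt)  # start preparing the next one
            process(current)                    # overlaps with preparation
        process(future.result())

pipelined(["m1", "m2", "m3"],
          prepare=lambda m: m.upper(),
          process=print)
{code}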



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10342) Arrow vector in Java(Scala) allocate byteBuffer error while read the bytes from Python pyarrow

2020-10-19 Thread Litchy Soong (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Litchy Soong updated ARROW-10342:
-
Description: 
I am using Arrow 1.0.1 from Scala and pyarrow 1.0.1.

The following error occurs when the Scala side decodes bytes that were encoded 
from Python.

I tried downgrading pyarrow to 0.17.0 and 0.14.1; the error still exists.

This is a loop that decodes the same data repeatedly, and I wrap the code in a 
try/catch block. The error occurs only sometimes, and after an error the 
following iterations can still work correctly, but once the first error has 
occurred, the error happens more frequently.
{quote}Error stack trace java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
 
com.intel.analytics.zoo.shaded.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:692)
 
com.intel.analytics.zoo.shaded.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:57)
 
com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:164)
 
com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:170)
 
com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:161)
 
com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:63)
{quote}
How to fix it?

  was:
I am using Arrow 1.0.1 from Scala and pyarrow 1.0.1.

The following error occurs when the Scala side decodes bytes that were encoded 
from Python.

I tried downgrading pyarrow to 0.17.0 and 0.14.1; the error still exists.

This is a loop that decodes the same data repeatedly. The error occurs only 
sometimes; after an error, the following iterations can still work correctly.
{quote}

 Error stack trace java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
 
com.intel.analytics.zoo.shaded.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:692)
 
com.intel.analytics.zoo.shaded.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:57)
 
com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:164)
 
com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:170)
 
com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:161)
com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:63)
{quote}

 How to fix it?


> Arrow vector in Java(Scala) allocate byteBuffer error while read the bytes 
> from Python pyarrow
> --
>
> Key: ARROW-10342
> URL: https://issues.apache.org/jira/browse/ARROW-10342
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java, Python
>Affects Versions: 0.14.1, 0.17.0, 1.0.1
>Reporter: Litchy Soong
>Priority: Major
>
> I am using Arrow 1.0.1 from Scala and pyarrow 1.0.1.
> The following error occurs when the Scala side decodes bytes that were 
> encoded from Python.
> I tried downgrading pyarrow to 0.17.0 and 0.14.1; the error still exists.
> This is a loop that decodes the same data repeatedly, and I wrap the code in 
> a try/catch block. The error occurs only sometimes, and after an error the 
> following iterations can still work correctly, but once the first error has 
> occurred, the error happens more frequently.
> {quote}Error stack trace java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>  
> com.intel.analytics.zoo.shaded.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:692)
>  
> com.intel.analytics.zoo.shaded.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:57)
>  
> com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:164)
>  
> com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:170)
>  
> com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:161)
>  
> com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:63)
> {quote}
> How to fix it?
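
For context, a minimal sketch of the Python producing side of such a round trip (illustrative placeholders; the report's actual pipeline lives in Analytics Zoo's shaded Arrow classes):

{code:python}
import pyarrow as pa

# Write an Arrow IPC stream to raw bytes, as a JVM-side ArrowStreamReader
# would consume it. Schema and data are placeholders.
batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
payload = sink.getvalue().to_pybytes()  # bytes handed to the Scala/Java reader
{code}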



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5409) [C++] Improvement for IsIn Kernel when right array is small

2020-10-19 Thread David Sherrier (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217214#comment-17217214
 ] 

David Sherrier commented on ARROW-5409:
---

Hey Wes, I added a benchmark (attached here) and found that, at least with my 
vector-based implementation, it only outperforms our current implementation 
when the right-side list is between 1 and 4 elements in length. Keep in mind I 
ran the benchmark on my laptop, so it is possible it would perform better on a 
more powerful machine.

Benchmark code: https://github.com/david1437/arrow/tree/ARROW-5394
Vector Implementation with benchmark: 
https://github.com/david1437/arrow/tree/ARROW-5409
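
For reference, the trade-off being benchmarked, sketched in Python (illustrative only; the real kernels are C++, and the threshold is hypothetical):

{code:python}
SMALL_THRESHOLD = 4  # per the benchmark above, the crossover is around 1-4 elements

def is_in(values, right):
    if len(right) <= SMALL_THRESHOLD:
        # Small right side: a plain linear scan beats building a hash table.
        return [any(v == r for r in right) for v in values]
    # Otherwise: hash-based membership, i.e. the current memo-table approach.
    memo = set(right)
    return [v in memo for v in values]

print(is_in([1, 2, 5], [2, 3]))          # [False, True, False] via linear scan
print(is_in([1, 2, 5], range(100)))      # hash path
{code}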

> [C++] Improvement for IsIn Kernel when right array is small
> ---
>
> Key: ARROW-5409
> URL: https://issues.apache.org/jira/browse/ARROW-5409
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Preeti Suman
>Assignee: David Sherrier
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: set_lookup_benchmark
>
>
> The core of the algorithm (in Python pseudocode) is 
> {code:python}
> for idx, elem in enumerate(array):
>   output[idx] = (elem in memo_table)
> {code}
> Often the right operand list will be very small; in this case, the hash table 
> should be replaced with a constant vector. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5409) [C++] Improvement for IsIn Kernel when right array is small

2020-10-19 Thread David Sherrier (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Sherrier updated ARROW-5409:
--
Attachment: set_lookup_benchmark

> [C++] Improvement for IsIn Kernel when right array is small
> ---
>
> Key: ARROW-5409
> URL: https://issues.apache.org/jira/browse/ARROW-5409
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Preeti Suman
>Assignee: David Sherrier
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: set_lookup_benchmark
>
>
> The core of the algorithm (in Python pseudocode) is 
> {code:python}
> for idx, elem in enumerate(array):
>   output[idx] = (elem in memo_table)
> {code}
> Often the right operand list will be very small; in this case, the hash table 
> should be replaced with a constant vector. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10270) [R] Fix CSV timestamp_parsers test on R-devel

2020-10-19 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10270:

Fix Version/s: (was: 2.0.0)
   3.0.0

> [R] Fix CSV timestamp_parsers test on R-devel
> -
>
> Key: ARROW-10270
> URL: https://issues.apache.org/jira/browse/ARROW-10270
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Apparently there is a change in the development version of R with respect to 
> timezone handling. I suspect it is this: 
> https://github.com/wch/r-source/blob/trunk/doc/NEWS.Rd#L296-L300
> It causes this failure:
> {code}
> ── 1. Failure: read_csv_arrow() can read timestamps (@test-csv.R#216)  
> ─
> `tbl` not equal to `df`.
> Component "time": 'tzone' attributes are inconsistent ('UTC' and '')
> ── 2. Failure: read_csv_arrow() can read timestamps (@test-csv.R#219)  
> ─
> `tbl` not equal to `df`.
> Component "time": 'tzone' attributes are inconsistent ('UTC' and '')
> {code}
> This needs to be fixed for the CRAN release because they check on the devel 
> version. But it doesn't need to block the 2.0 release candidate because I can 
> (at minimum) skip these tests before submitting to CRAN (FYI [~kszucs])
> I'll also add a CI job to test on R-devel. I just removed 2 R jobs so we can 
> afford to add one back.
> cc [~romainfrancois]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10350) [Rust] parquet_derive crate cannot be published to crates.io

2020-10-19 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217141#comment-17217141
 ] 

Neville Dipale commented on ARROW-10350:


I added them as part of another commit, but the pre-release tests were failing. 
I couldn't figure out what the problem was, so I reverted the changes.

I think it's fine that we don't have the crate published as part of this 
release. Users can still use it from git for now.

> [Rust] parquet_derive crate cannot be published to crates.io
> 
>
> Key: ARROW-10350
> URL: https://issues.apache.org/jira/browse/ARROW-10350
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 2.0.0
>Reporter: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> The new parquet_derive crate is missing some fields in the Cargo manifest, so 
> it cannot be published.
> {code:java}
>Uploading parquet_derive v2.0.0 
> (/home/andygrove/arrow-release/apache-arrow-2.0.0/rust/parquet_derive)
> error: api errors (status 200 OK): missing or empty metadata fields: 
> description, license. Please see 
> https://doc.rust-lang.org/cargo/reference/manifest.html for how to upload 
> metadata
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10350) [Rust] parquet_derive crate cannot be published to crates.io

2020-10-19 Thread Andy Grove (Jira)
Andy Grove created ARROW-10350:
--

 Summary: [Rust] parquet_derive crate cannot be published to 
crates.io
 Key: ARROW-10350
 URL: https://issues.apache.org/jira/browse/ARROW-10350
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 2.0.0
Reporter: Andy Grove
 Fix For: 3.0.0


The new parquet_derive crate is missing some fields in the Cargo manifest, so 
it cannot be published.
{code:java}
   Uploading parquet_derive v2.0.0 
(/home/andygrove/arrow-release/apache-arrow-2.0.0/rust/parquet_derive)
error: api errors (status 200 OK): missing or empty metadata fields: 
description, license. Please see 
https://doc.rust-lang.org/cargo/reference/manifest.html for how to upload 
metadata
 {code}
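
The fix would presumably be adding the missing fields to rust/parquet_derive/Cargo.toml; a sketch with placeholder values:

{code}
[package]
name = "parquet_derive"
version = "2.0.0"
# The two fields crates.io reported as missing (wording is a placeholder):
description = "Derive macros for writing structs with the parquet crate"
license = "Apache-2.0"
{code}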



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10349) [Python] build and publish aarch64 wheels

2020-10-19 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217022#comment-17217022
 ] 

Antoine Pitrou commented on ARROW-10349:


Hmm, I just learned that manylinux is Aarch64-compatible, nice:
https://www.python.org/dev/peps/pep-0599/

> [Python] build and publish aarch64 wheels
> -
>
> Key: ARROW-10349
> URL: https://issues.apache.org/jira/browse/ARROW-10349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
> Environment: os: Linux
> arch: aarch64
>Reporter: Jonathan Swinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The currently released source distribution for Arrow on pypi.org doesn't 
> build on Ubuntu 20.04. It may be possible to install additional build 
> dependencies to make it work, but it would be better to publish aarch64 
> (arm64) wheels to pypi.org in addition to the currently published x86_64 
> wheels for Linux.
> {{$ pip install pyarrow}}
> should just work on Linux/aarch64.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10349) [Python] build and publish aarch64 wheels

2020-10-19 Thread Jonathan Swinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217020#comment-17217020
 ] 

Jonathan Swinney commented on ARROW-10349:
--

This is partially fixed by [https://github.com/apache/arrow/pull/8491] but I 
haven't figured out how to modify the CI code to use Travis CI to do the Arm 
builds. Travis-CI.com is the only CI currently in use by Apache Arrow that 
supports Arm64 builds on Linux. Guidance on the necessary changes to the 
CI/Travis code would be appreciated.

> [Python] build and publish aarch64 wheels
> -
>
> Key: ARROW-10349
> URL: https://issues.apache.org/jira/browse/ARROW-10349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
> Environment: os: Linux
> arch: aarch64
>Reporter: Jonathan Swinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The currently released source distribution for Arrow on pypi.org doesn't 
> build on Ubuntu 20.04. It may be possible to install additional build 
> dependencies to make it work, but it would be better to publish aarch64 
> (arm64) wheels to pypi.org in addition to the currently published x86_64 
> wheels for Linux.
> {{$ pip install pyarrow}}
> should just work on Linux/aarch64.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10349) [Python] build and publish aarch64 wheels

2020-10-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10349:
---
Labels: pull-request-available  (was: )

> [Python] build and publish aarch64 wheels
> -
>
> Key: ARROW-10349
> URL: https://issues.apache.org/jira/browse/ARROW-10349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
> Environment: os: Linux
> arch: aarch64
>Reporter: Jonathan Swinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The currently released source distribution for Arrow on pypi.org doesn't 
> build on Ubuntu 20.04. It may be possible to install additional build 
> dependencies to make it work, but it would be better to publish aarch64 
> (arm64) wheels to pypi.org in addition to the currently published x86_64 
> wheels for Linux.
> {{$ pip install pyarrow}}
> should just work on Linux/aarch64.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10349) [Python] build and publish aarch64 wheels

2020-10-19 Thread Jonathan Swinney (Jira)
Jonathan Swinney created ARROW-10349:


 Summary: [Python] build and publish aarch64 wheels
 Key: ARROW-10349
 URL: https://issues.apache.org/jira/browse/ARROW-10349
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, Python
 Environment: os: Linux
arch: aarch64
Reporter: Jonathan Swinney


The currently released source distribution for Arrow on pypi.org doesn't build 
on Ubuntu 20.04. It may be possible to install additional build dependencies to 
make it work, but it would be better to publish aarch64 (arm64) wheels to 
pypi.org in addition to the currently published x86_64 wheels for Linux.

{{$ pip install pyarrow}}

should just work on Linux/aarch64.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10348) [C++] Fix crash on invalid Parquet file (OSS-Fuzz)

2020-10-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10348:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [C++] Fix crash on invalid Parquet file (OSS-Fuzz)
> --
>
> Key: ARROW-10348
> URL: https://issues.apache.org/jira/browse/ARROW-10348
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10348) [C++] Fix crash on invalid Parquet file (OSS-Fuzz)

2020-10-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10348.

Fix Version/s: (was: 3.0.0)
   2.0.0
   Resolution: Fixed

Issue resolved by pull request 8490
[https://github.com/apache/arrow/pull/8490]

> [C++] Fix crash on invalid Parquet file (OSS-Fuzz)
> --
>
> Key: ARROW-10348
> URL: https://issues.apache.org/jira/browse/ARROW-10348
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10309) [Ruby] gem install red-arrow fails

2020-10-19 Thread Bhargav Parsi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216893#comment-17216893
 ] 

Bhargav Parsi commented on ARROW-10309:
---

Yes, we use `yum install -y ruby` in our Docker container. 

> [Ruby] gem install red-arrow fails
> --
>
> Key: ARROW-10309
> URL: https://issues.apache.org/jira/browse/ARROW-10309
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Reporter: Bhargav Parsi
>Priority: Major
> Attachments: error2.txt, image-2020-10-14-14-51-27-796.png
>
>
> I am trying to install red-arrow on CentOS 
> (centos-release-7-6.1810.2.el7.centos.x86_64), using Ruby 2.6.3.
> I followed the steps mentioned here: 
> [https://arrow.apache.org/install/]
> I used the steps for CentOS 6/7.
> After that I ran `gem install red-arrow`, which gives:
> !image-2020-10-14-14-51-27-796.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10348) [C++] Fix crash on invalid Parquet file (OSS-Fuzz)

2020-10-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10348:
---
Labels: pull-request-available  (was: )

> [C++] Fix crash on invalid Parquet file (OSS-Fuzz)
> --
>
> Key: ARROW-10348
> URL: https://issues.apache.org/jira/browse/ARROW-10348
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10348) [C++] Fix crash on invalid Parquet file (OSS-Fuzz)

2020-10-19 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10348:
--

 Summary: [C++] Fix crash on invalid Parquet file (OSS-Fuzz)
 Key: ARROW-10348
 URL: https://issues.apache.org/jira/browse/ARROW-10348
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 3.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10347) [Python][Dataset] Test behaviour in case of duplicate partition field / data column

2020-10-19 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-10347:
--
Description: 
See https://www.mail-archive.com/user@arrow.apache.org/msg00680.html, and my 
answer to it (experimentation in 
https://nbviewer.jupyter.org/gist/jorisvandenbossche/9382de2eb96db5db2ef801f63a359082).
 
It seems we support the partition field also being present in the actual data, 
but it's probably good to add some explicit tests to ensure the expected 
behaviour.

> [Python][Dataset] Test behaviour in case of duplicate partition field / data 
> column
> ---
>
> Key: ARROW-10347
> URL: https://issues.apache.org/jira/browse/ARROW-10347
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> See https://www.mail-archive.com/user@arrow.apache.org/msg00680.html, and my 
> answer to it (experimentation in 
> https://nbviewer.jupyter.org/gist/jorisvandenbossche/9382de2eb96db5db2ef801f63a359082).
>  
> It seems we support the partition field also being present in the actual 
> data, but it's probably good to add some explicit tests to ensure the 
> expected behaviour.
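
A sketch of the scenario such a test would cover (hypothetical paths and column names; the expected behaviour is deliberately left unasserted, since pinning it down is the point of the ticket):

{code:python}
import pathlib
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Write a file that itself contains the partition column "year", inside a
# hive-style partition directory for the same field.
base = pathlib.Path("/tmp/dup_partition/year=2020")
base.mkdir(parents=True, exist_ok=True)
pq.write_table(pa.table({"year": [2020, 2020], "value": [1.0, 2.0]}),
               str(base / "part-0.parquet"))

# The test would assert which "year" wins (file column vs. path) and that
# the resulting schema contains the field only once.
dataset = ds.dataset("/tmp/dup_partition", partitioning="hive")
print(dataset.to_table().schema)
{code}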



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10337) More liberal parsing of ISO8601 timestamps with fractional seconds

2020-10-19 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216821#comment-17216821
 ] 

Antoine Pitrou commented on ARROW-10337:


Thanks for the report. I agree a PR would be welcome :-)

> More liberal parsing of ISO8601 timestamps with fractional seconds
> --
>
> Key: ARROW-10337
> URL: https://issues.apache.org/jira/browse/ARROW-10337
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Smith
>Assignee: Frank Smith
>Priority: Minor
>
> The current ISO8601 timestamp parser assumes MILLI timestamps have 3 decimal 
> places, MICRO have 6 and NANO have 9. From ParseTimestampISO8601 in 
> cpp/src/arrow/util/value_parsing.h:
> {{ // We allow the following formats for all units:}}
> {{ // - "YYYY-MM-DD"}}
> {{ // - "YYYY-MM-DD[ T]hh"}}
> {{ // - "YYYY-MM-DD[ T]hhZ"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm"}}
> {{ // - "YYYY-MM-DD[ T]hh:mmZ"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ssZ"}}
> {{ //}}
> {{ // We allow the following formats for unit==MILLI:}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.mmm"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.mmmZ"}}
> {{ //}}
> {{ // We allow the following formats for unit==MICRO:}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.uuuuuu"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.uuuuuuZ"}}
> {{ //}}
> {{ // We allow the following formats for unit==NANO:}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.nnnnnnnnn"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.nnnnnnnnnZ"}}
> {{ //}}
> I propose that we change the parser to accept 1 to 3 digits for MILLI, 1 to 6 
> digits for MICRO, and 1 to 9 digits for NANO, as follows:
> {{ // We allow the following formats for all units:}}
> {{ // - "YYYY-MM-DD"}}
> {{ // - "YYYY-MM-DD[ T]hhZ?"}}
> {{ // - "YYYY-MM-DD[ T]hh:mmZ?"}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ssZ?"}}
> {{ //}}
> {{ // We allow the following formats for unit == MILLI, MICRO, or NANO:}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.s{1,3}Z?"}}
> {{ //}}
> {{ // We allow the following formats for unit == MICRO or NANO:}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.s{4,6}Z?"}}
> {{ //}}
> {{ // We allow the following formats for unit == NANO:}}
> {{ // - "YYYY-MM-DD[ T]hh:mm:ss.s{7,9}Z?"}}
> This will allow for parsing of timestamps when e.g. a CSV file does not write 
> timestamps with trailing zeroes.
> I am almost finished implementing this functionality, so a PR will follow 
> soon.
>  
>  
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10347) [Python][Dataset] Test behaviour in case of duplicate partition field / data column

2020-10-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10347:
-

 Summary: [Python][Dataset] Test behaviour in case of duplicate 
partition field / data column
 Key: ARROW-10347
 URL: https://issues.apache.org/jira/browse/ARROW-10347
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10346) [Python] Default S3 region is eu-central-1 even with LANG=C

2020-10-19 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216812#comment-17216812
 ] 

Antoine Pitrou commented on ARROW-10346:


Perhaps in the Python test file instead?

> [Python] Default S3 region is eu-central-1 even with LANG=C
> ---
>
> Key: ARROW-10346
> URL: https://issues.apache.org/jira/browse/ARROW-10346
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Uwe Korn
>Priority: Minor
>
> Verifying the macOS wheels using {{LANG=C 
> dev/release/verify-release-candidate.sh wheels 2.0.0 2}} fails for me with
> {code}
> @pytest.mark.s3
> def test_s3_real_aws():
> # Exercise connection code with an AWS-backed S3 bucket.
> # This is a minimal integration check for ARROW-9261 and similar 
> issues.
> from pyarrow.fs import S3FileSystem
> fs = S3FileSystem(anonymous=True)
> >   assert fs.region == 'us-east-1'  # default region
> E   AssertionError: assert 'eu-central-1' == 'us-east-1'
> E - us-east-1
> E + eu-central-1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10346) [Python] Default S3 region is eu-central-1 even with LANG=C

2020-10-19 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216808#comment-17216808
 ] 

Uwe Korn commented on ARROW-10346:
--

{{AWS_CONFIG_FILE=/dev/null}} was sufficient for the tests to pass. Should we 
set this globally in {{dev/release/verify-release-candidate.sh}}?

> [Python] Default S3 region is eu-central-1 even with LANG=C
> ---
>
> Key: ARROW-10346
> URL: https://issues.apache.org/jira/browse/ARROW-10346
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Uwe Korn
>Priority: Minor
>
> Verifying the macOS wheels using {{LANG=C 
> dev/release/verify-release-candidate.sh wheels 2.0.0 2}} fails for me with
> {code}
> @pytest.mark.s3
> def test_s3_real_aws():
> # Exercise connection code with an AWS-backed S3 bucket.
> # This is a minimal integration check for ARROW-9261 and similar 
> issues.
> from pyarrow.fs import S3FileSystem
> fs = S3FileSystem(anonymous=True)
> >   assert fs.region == 'us-east-1'  # default region
> E   AssertionError: assert 'eu-central-1' == 'us-east-1'
> E - us-east-1
> E + eu-central-1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10346) [Python] Default S3 region is eu-central-1 even with LANG=C

2020-10-19 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216776#comment-17216776
 ] 

Antoine Pitrou commented on ARROW-10346:


Perhaps you can retry with {{AWS_CONFIG_FILE=/dev/null}} and 
{{AWS_SHARED_CREDENTIALS_FILE=/dev/null}}? (those are environment variables, 
btw)
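
A small sketch of that suggestion (these are standard AWS SDK environment variables; they must be set before the filesystem is created):

{code:python}
import os

# Point the AWS SDK at empty config so a stray local profile cannot
# override the default region.
os.environ["AWS_CONFIG_FILE"] = "/dev/null"
os.environ["AWS_SHARED_CREDENTIALS_FILE"] = "/dev/null"

from pyarrow.fs import S3FileSystem

fs = S3FileSystem(anonymous=True)
print(fs.region)  # expected default: us-east-1
{code}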

> [Python] Default S3 region is eu-central-1 even with LANG=C
> ---
>
> Key: ARROW-10346
> URL: https://issues.apache.org/jira/browse/ARROW-10346
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Uwe Korn
>Priority: Minor
>
> Verifying the macOS wheels using {{LANG=C 
> dev/release/verify-release-candidate.sh wheels 2.0.0 2}} fails for me with
> {code}
> @pytest.mark.s3
> def test_s3_real_aws():
> # Exercise connection code with an AWS-backed S3 bucket.
> # This is a minimal integration check for ARROW-9261 and similar 
> issues.
> from pyarrow.fs import S3FileSystem
> fs = S3FileSystem(anonymous=True)
> >   assert fs.region == 'us-east-1'  # default region
> E   AssertionError: assert 'eu-central-1' == 'us-east-1'
> E - us-east-1
> E + eu-central-1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9991) [C++] split kernels for strings/binary

2020-10-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9991:
--
Fix Version/s: (was: 2.0.0)
   3.0.0

> [C++] split kernels for strings/binary
> --
>
> Key: ARROW-9991
> URL: https://issues.apache.org/jira/browse/ARROW-9991
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Maarten Breddels
>Assignee: Maarten Breddels
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Similar to Python str.split and bytes.split, we'd like to have a way to 
> convert str into list[str] (and similarly for bytes).
> When the separator is given, the algorithms for both types are the same. 
> Python, however, overloads split: when given no separator, the algorithm 
> splits considering all whitespace (unicode for str, ascii for bytes) as the 
> separator.
> I'd rather not see too many overloaded kernels, e.g.
> binary_split (takes a string/binary separator and a maxsplit arg; no special 
> utf8 version needed)
> utf8_split_whitespace (similar to Python's version given no separator)
> ascii_split_whitespace (similar to Python's version given no separator, but 
> considering ascii, although this could work on any binary data)
> There can also be rsplit versions of these, or they could be an argument.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9991) [C++] split kernels for strings/binary

2020-10-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9991.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8271
[https://github.com/apache/arrow/pull/8271]

> [C++] split kernels for strings/binary
> --
>
> Key: ARROW-9991
> URL: https://issues.apache.org/jira/browse/ARROW-9991
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Maarten Breddels
>Assignee: Maarten Breddels
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Similar to Python str.split and bytes.split, we'd like to have a way to 
> convert str into list[str] (and similarly for bytes).
> When the separator is given, the algorithms for both types are the same. 
> Python, however, overloads split: when given no separator, the algorithm 
> splits considering all whitespace (unicode for str, ascii for bytes) as the 
> separator.
> I'd rather not see too many overloaded kernels, e.g.
> binary_split (takes a string/binary separator and a maxsplit arg; no special 
> utf8 version needed)
> utf8_split_whitespace (similar to Python's version given no separator)
> ascii_split_whitespace (similar to Python's version given no separator, but 
> considering ascii, although this could work on any binary data)
> There can also be rsplit versions of these, or they could be an argument.
>  
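
For reference, a short sketch of how such kernels surface in Python (assuming the compute function names split_pattern and utf8_split_whitespace; the names as merged may differ from the proposal above):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["a,b,c", "d,e", None])
# Explicit separator: one kernel covers string and binary inputs.
print(pc.split_pattern(arr, pattern=","))

# No separator: split on runs of (unicode) whitespace, like Python's str.split().
ws = pa.array(["a  b\tc", "d e"])
print(pc.utf8_split_whitespace(ws))
{code}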



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10346) [Python] Default S3 region is eu-central-1 even with LANG=C

2020-10-19 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216765#comment-17216765
 ] 

Uwe Korn commented on ARROW-10346:
--

Yes, there was a user config changing this. A bit confusing to me, as I haven't 
used S3 on this machine before; probably some other tool put it there.

Do you have an idea how to detect that and error out with an explanation?

> [Python] Default S3 region is eu-central-1 even with LANG=C
> ---
>
> Key: ARROW-10346
> URL: https://issues.apache.org/jira/browse/ARROW-10346
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Uwe Korn
>Priority: Minor
>
> Verifying the macOS wheels using {{LANG=C 
> dev/release/verify-release-candidate.sh wheels 2.0.0 2}} fails for me with
> {code}
> @pytest.mark.s3
> def test_s3_real_aws():
> # Exercise connection code with an AWS-backed S3 bucket.
> # This is a minimal integration check for ARROW-9261 and similar 
> issues.
> from pyarrow.fs import S3FileSystem
> fs = S3FileSystem(anonymous=True)
> >   assert fs.region == 'us-east-1'  # default region
> E   AssertionError: assert 'eu-central-1' == 'us-east-1'
> E - us-east-1
> E + eu-central-1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10346) [Python] Default S3 region is eu-central-1 even with LANG=C

2020-10-19 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn updated ARROW-10346:
-
Priority: Minor  (was: Major)

> [Python] Default S3 region is eu-central-1 even with LANG=C
> ---
>
> Key: ARROW-10346
> URL: https://issues.apache.org/jira/browse/ARROW-10346
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Uwe Korn
>Priority: Minor
>
> Verifying the macOS wheels using {{LANG=C 
> dev/release/verify-release-candidate.sh wheels 2.0.0 2}} fails for me with
> {code}
> @pytest.mark.s3
> def test_s3_real_aws():
> # Exercise connection code with an AWS-backed S3 bucket.
> # This is a minimal integration check for ARROW-9261 and similar 
> issues.
> from pyarrow.fs import S3FileSystem
> fs = S3FileSystem(anonymous=True)
> >   assert fs.region == 'us-east-1'  # default region
> E   AssertionError: assert 'eu-central-1' == 'us-east-1'
> E - us-east-1
> E + eu-central-1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10346) [Python] Default S3 region is eu-central-1 even with LANG=C

2020-10-19 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216721#comment-17216721
 ] 

Antoine Pitrou commented on ARROW-10346:


That must be because it's picking up your local AWS configuration. It shouldn't 
have anything to do with the locale.

> [Python] Default S3 region is eu-central-1 even with LANG=C
> ---
>
> Key: ARROW-10346
> URL: https://issues.apache.org/jira/browse/ARROW-10346
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Uwe Korn
>Priority: Major
>
> Verifying the macOS wheels using {{LANG=C 
> dev/release/verify-release-candidate.sh wheels 2.0.0 2}} fails for me with
> {code}
> @pytest.mark.s3
> def test_s3_real_aws():
> # Exercise connection code with an AWS-backed S3 bucket.
> # This is a minimal integration check for ARROW-9261 and similar 
> issues.
> from pyarrow.fs import S3FileSystem
> fs = S3FileSystem(anonymous=True)
> >   assert fs.region == 'us-east-1'  # default region
> E   AssertionError: assert 'eu-central-1' == 'us-east-1'
> E - us-east-1
> E + eu-central-1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9164) [C++] Provide APIs for adding "docstrings" to arrow::compute::Function classes that can be accessed by bindings

2020-10-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9164:
--
Fix Version/s: (was: 2.0.0)
   3.0.0

> [C++] Provide APIs for adding "docstrings" to arrow::compute::Function 
> classes that can be accessed by bindings
> ---
>
> Key: ARROW-9164
> URL: https://issues.apache.org/jira/browse/ARROW-9164
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9164) [C++] Provide APIs for adding "docstrings" to arrow::compute::Function classes that can be accessed by bindings

2020-10-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9164.
---
Fix Version/s: (was: 3.0.0)
   2.0.0
   Resolution: Fixed

Issue resolved by pull request 8457
[https://github.com/apache/arrow/pull/8457]

> [C++] Provide APIs for adding "docstrings" to arrow::compute::Function 
> classes that can be accessed by bindings
> ---
>
> Key: ARROW-9164
> URL: https://issues.apache.org/jira/browse/ARROW-9164
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10261) [Rust] [BREAKING] Lists should take Field instead of DataType

2020-10-19 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216687#comment-17216687
 ] 

Antoine Pitrou commented on ARROW-10261:


Indeed, C++ uses {{List}}.

> [Rust] [BREAKING] Lists should take Field instead of DataType
> -
>
> Key: ARROW-10261
> URL: https://issues.apache.org/jira/browse/ARROW-10261
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Integration, Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>
> There is currently no way of tracking nested field metadata on lists. For 
> example, if a list's children are nullable, there's no way of telling just by 
> looking at the Field.
> This causes problems with integration testing, and also affects Parquet 
> roundtrips.
> I propose the breaking change of [Large|FixedSize]List taking a Field instead 
> of a Box<DataType>, as this will overcome this issue and ensure that the Rust 
> implementation passes integration tests.
> CC [~andygrove] [~jorgecarleitao] [~alamb]  [~jhorstmann] ([~carols10cents] 
> as this addresses some of the roundtrip failures).
> I'm leaning towards this landing in 3.0.0, as I'd love for us to have 
> completed or made significant traction on the Arrow Parquet writer (and 
> reader), and integration testing, by then.
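
For comparison, the C++/Python model already carries a full Field for list children; a quick illustration:

{code:python}
import pyarrow as pa

# The list's child is a Field, so child nullability and metadata survive.
t = pa.list_(pa.field("item", pa.int64(), nullable=False))
print(t)              # list<item: int64 not null>
print(t.value_field)  # the child Field, including its nullability
{code}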



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10106) [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener

2020-10-19 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-10106.
--
Resolution: Fixed

Resolved by [https://github.com/apache/arrow/pull/8476].

> [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener
> ---
>
> Key: ARROW-10106
> URL: https://issues.apache.org/jira/browse/ARROW-10106
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: flight, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> OutboundStreamListener has a method isReady() that FlightProducers need to 
> poll during implementations of getStream() to avoid buffering too much data.
> An enhancement would be to allow setting a callback to run (for example, 
> notifying a CountdownLatch) so that FlightProducer implementations don't need 
> to busy wait.
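
A generic sketch of the poll-versus-callback difference (plain Python threading, not the actual Flight Java API):

{code:python}
import threading

ready = threading.Semaphore(0)

def on_is_ready():
    # Callback style: the transport signals writability instead of the
    # producer spinning on an isReady() poll.
    ready.release()

def producer():
    for chunk in range(3):
        ready.acquire()  # block until notified; no busy-waiting
        print("wrote chunk", chunk)

t = threading.Thread(target=producer)
t.start()
for _ in range(3):
    on_is_ready()        # simulate the transport signalling readiness
t.join()
{code}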



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10106) [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener

2020-10-19 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-10106:
-
Fix Version/s: 3.0.0

> [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener
> ---
>
> Key: ARROW-10106
> URL: https://issues.apache.org/jira/browse/ARROW-10106
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: flight, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> OutboundStreamListener has a method isReady() that FlightProducers need to 
> poll during implementations of getStream() to avoid buffering too much data.
> An enhancement would be to allow setting a callback to run (for example, 
> notifying a CountdownLatch) so that FlightProducer implementations don't need 
> to busy wait.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10203) [Doc] Capture guidance for endianness support in contributors guide.

2020-10-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10203.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8374
[https://github.com/apache/arrow/pull/8374]

> [Doc] Capture guidance for endianness support in contributors guide.
> 
>
> Key: ARROW-10203
> URL: https://issues.apache.org/jira/browse/ARROW-10203
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3ccak7z5t--hhhr9dy43pyhd6m-xou4qogwqvlwzsg-koxxjpt...@mail.gmail.com%3e



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10345) [C++] NaN breaks sorting

2020-10-19 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216674#comment-17216674
 ] 

Antoine Pitrou commented on ARROW-10345:


cc [~yibo]

> [C++] NaN breaks sorting
> 
>
> Key: ARROW-10345
> URL: https://issues.apache.org/jira/browse/ARROW-10345
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:python}
> >>> import numpy as np
> >>> import pyarrow.compute as pc
> >>> pc.sort_indices([3.0, 4.0, 1.0, 2.0, None])
> 
> [
>   2,
>   3,
>   0,
>   1,
>   4
> ]
> >>> pc.sort_indices([3.0, 4.0, np.nan, 1.0, 2.0, None])
> 
> [
>   0,
>   1,
>   2,
>   3,
>   4,
>   5
> ]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10345) [C++] NaN breaks sorting

2020-10-19 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10345:
--

 Summary: [C++] NaN breaks sorting
 Key: ARROW-10345
 URL: https://issues.apache.org/jira/browse/ARROW-10345
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 2.0.0
Reporter: Antoine Pitrou
 Fix For: 3.0.0


{code:python}
>>> import numpy as np
>>> import pyarrow.compute as pc
>>> pc.sort_indices([3.0, 4.0, 1.0, 2.0, None])

[
  2,
  3,
  0,
  1,
  4
]
>>> pc.sort_indices([3.0, 4.0, np.nan, 1.0, 2.0, None])

[
  0,
  1,
  2,
  3,
  4,
  5
]
{code}
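
For comparison, NumPy's argsort does define an order for NaN (it sorts after all numbers), which is one candidate behaviour for the fix:

{code:python}
import numpy as np

vals = np.array([3.0, 4.0, np.nan, 1.0, 2.0])
print(np.argsort(vals))  # [3 4 0 1 2] -- NaN is ordered last
{code}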



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10241) [C++][Compute] Add variance kernel benchmark

2020-10-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10241:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [C++][Compute] Add variance kernel benchmark
> 
>
> Key: ARROW-10241
> URL: https://issues.apache.org/jira/browse/ARROW-10241
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10241) [C++][Compute] Add variance kernel benchmark

2020-10-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10241.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8407
[https://github.com/apache/arrow/pull/8407]

> [C++][Compute] Add variance kernel benchmark
> 
>
> Key: ARROW-10241
> URL: https://issues.apache.org/jira/browse/ARROW-10241
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10343) [C++] Unable to parse strings into timestamps

2020-10-19 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216616#comment-17216616
 ] 

Antoine Pitrou commented on ARROW-10343:


Thanks for the report. This appears to work on git master:

{code:python}
>>> us_arr = pa.array([
...:   "2014-12-07 07:48:59.285332",
...:   "2014-12-07 08:01:49.758975",
...:   "2014-12-07 10:11:35.884304"])
>>> us_arr.cast(pa.timestamp('us'))

[
  2014-12-07 07:48:59.285332,
  2014-12-07 08:01:49.758975,
  2014-12-07 10:11:35.884304
]
{code}

Timezone indicators are not supported currently, except for UTC where you have 
to use "Z", not "+00":
{code:python}
>>> us_arr = pa.array([
...:   "2014-12-07 07:48:59.285332Z",
...:   "2014-12-07 08:01:49.758975Z",
...:   "2014-12-07 10:11:35.884304Z"])
>>> us_arr.cast(pa.timestamp('us'))

[
  2014-12-07 07:48:59.285332,
  2014-12-07 08:01:49.758975,
  2014-12-07 10:11:35.884304
]
{code}
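
Until "+00" is understood, a possible workaround is to normalize the suffix to "Z" before casting (a sketch; it assumes every value carries a literal "+00" suffix):

{code:python}
import pyarrow as pa

raw = ["2014-12-07 07:48:59.285332+00",
       "2014-12-07 08:01:49.758975+00"]
# Rewrite the "+00" UTC offset as "Z", which the parser does accept.
normalized = pa.array([s[:-3] + "Z" for s in raw])
print(normalized.cast(pa.timestamp('us')))
{code}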


> [C++] Unable to parse strings into timestamps
> -
>
> Key: ARROW-10343
> URL: https://issues.apache.org/jira/browse/ARROW-10343
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: macOS 10.15.7, Python 3.8.2
>Reporter: Niclas Roos
>Priority: Minor
>
> Hi,
> I'm working with parquet files generated by an AWS RDS Postgres snapshot 
> export.
> I'm trying to parse a date column stored as a string into a timestamp, but it 
> fails.
> I've managed to parse the same date format (as in the first example below) 
> when reading from a CSV, so I tried to investigate it as far as I could on my 
> own, and here are my results:
> {code:java}
> import pyarrow as pa
> import pytz
> #
> ## the format I get from the database
> us_tz_arr = pa.array([
>   "2014-12-07 07:48:59.285332+00",
>   "2014-12-07 08:01:49.758975+00",
>   "2014-12-07 10:11:35.884304+00"])
> us_tz_arr.cast(pa.timestamp('us', tz=pytz.UTC))
> -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304+00
> #
> ## tried removing the timezone
> us_arr = pa.array([
>   "2014-12-07 07:48:59.285332",
>   "2014-12-07 08:01:49.758975",
>   "2014-12-07 10:11:35.884304"])
> us_arr.cast(pa.timestamp('us'))
> -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304
> #
> ## tried removing the microseconds but keeping the timezone
> second_tz_arr = pa.array([
>   "2014-12-07 07:48:59+00",
>   "2014-12-07 08:01:49+00",
>   "2014-12-07 10:11:35+00"])
> second_tz_arr.cast(pa.timestamp('s', tz=pytz.UTC))
> -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35+00
> #
> ## removing microseconds and timezone, makes it work!
> s_arr = pa.array([
>   "2014-12-07 07:48:59",
>   "2014-12-07 08:01:49",
>   "2014-12-07 10:11:35"])
> s_arr.cast(pa.timestamp('s'))
> -> 
> [
>   2014-12-07 07:48:59,
>   2014-12-07 08:01:49,
>   2014-12-07 10:11:35
> ]{code}
>  PS. This is my first bug report, so apologies if important things are 
> missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10343) [C++] Unable to parse strings into timestamps

2020-10-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10343:
---
Summary: [C++] Unable to parse strings into timestamps  (was: Unable to 
parse strings into timestamps)

> [C++] Unable to parse strings into timestamps
> -
>
> Key: ARROW-10343
> URL: https://issues.apache.org/jira/browse/ARROW-10343
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
> Environment: macOS 10.15.7, Python 3.8.2
>Reporter: Niclas Roos
>Priority: Minor
>
> Hi,
> I'm working with parquet files generated by an AWS RDS Postgres snapshot 
> export.
> I'm trying to parse a date column stored as a string into a timestamp, but it 
> fails.
> I've managed to parse the same date format (as in the first example below) 
> when reading from a CSV, so I tried to investigate it as far as I could on my 
> own, and here are my results:
> {code:java}
> import pyarrow as pa
> import pytz
> #
> ## the format I get from the database
> us_tz_arr = pa.array([
>   "2014-12-07 07:48:59.285332+00",
>   "2014-12-07 08:01:49.758975+00",
>   "2014-12-07 10:11:35.884304+00"])
> us_tz_arr.cast(pa.timestamp('us', tz=pytz.UTC))
> -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304+00
> #
> ## tried removing the timezone
> us_arr = pa.array([
>   "2014-12-07 07:48:59.285332",
>   "2014-12-07 08:01:49.758975",
>   "2014-12-07 10:11:35.884304"])
> us_arr.cast(pa.timestamp('us'))
> -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304
> #
> ## tried removing the microseconds but keeping the timezone
> second_tz_arr = pa.array([
>   "2014-12-07 07:48:59+00",
>   "2014-12-07 08:01:49+00",
>   "2014-12-07 10:11:35+00"])
> second_tz_arr.cast(pa.timestamp('s', tz=pytz.UTC))
> -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35+00
> #
> ## removing microseconds and timezone, makes it work!
> s_arr = pa.array([
>   "2014-12-07 07:48:59",
>   "2014-12-07 08:01:49",
>   "2014-12-07 10:11:35"])
> s_arr.cast(pa.timestamp('s'))
> -> 
> [
>   2014-12-07 07:48:59,
>   2014-12-07 08:01:49,
>   2014-12-07 10:11:35
> ]{code}
>  PS. This is my first bug report, so apologies if important things are 
> missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10343) [C++] Unable to parse strings into timestamps

2020-10-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10343:
---
Component/s: C++

> [C++] Unable to parse strings into timestamps
> -
>
> Key: ARROW-10343
> URL: https://issues.apache.org/jira/browse/ARROW-10343
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: macOS 10.15.7, Python 3.8.2
>Reporter: Niclas Roos
>Priority: Minor
>
> Hi,
> I'm working with parquet files generated by an AWS RDS Postgres snapshot 
> export.
> I'm trying to parse a date column stored as a string into a timestamp, but it 
> fails.
> I've managed to parse the same date format (as in the first example below) 
> when reading from a CSV, so I tried to investigate it as far as I could on my 
> own, and here are my results:
> {code:java}
> import pyarrow as pa
> import pytz
> #
> ## the format I get from the database
> us_tz_arr = pa.array([
>   "2014-12-07 07:48:59.285332+00",
>   "2014-12-07 08:01:49.758975+00",
>   "2014-12-07 10:11:35.884304+00"])
> us_tz_arr.cast(pa.timestamp('us', tz=pytz.UTC))
> -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304+00
> #
> ## tried removing the timezone
> us_arr = pa.array([
>   "2014-12-07 07:48:59.285332",
>   "2014-12-07 08:01:49.758975",
>   "2014-12-07 10:11:35.884304"])
> us_arr.cast(pa.timestamp('us'))
> -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304
> #
> ## tried removing the microseconds but keeping the timezone
> second_tz_arr = pa.array([
>   "2014-12-07 07:48:59+00",
>   "2014-12-07 08:01:49+00",
>   "2014-12-07 10:11:35+00"])
> second_tz_arr.cast(pa.timestamp('s', tz=pytz.UTC))
> -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35+00
> #
> ## removing microseconds and timezone, makes it work!
> s_arr = pa.array([
>   "2014-12-07 07:48:59",
>   "2014-12-07 08:01:49",
>   "2014-12-07 10:11:35"])
> s_arr.cast(pa.timestamp('s'))
> -> 
> [
>   2014-12-07 07:48:59,
>   2014-12-07 08:01:49,
>   2014-12-07 10:11:35
> ]{code}
>  PS. This is my first bug report, so apologies if important things are 
> missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10344) [Python] Get all column names (or schema) from Feather file, before loading whole Feather file

2020-10-19 Thread Gert Hulselmans (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gert Hulselmans updated ARROW-10344:

Description: 
Is there a way to get all column names (or schema) from a Feather file before 
loading the full Feather file?

My Feather files are big (like 100GB) and the names of the columns are 
different per analysis and can't be hard coded.

{code:python}
import pyarrow.feather as feather

# Code here to check which columns are in the feather file.
...
my_columns = ...

# Result is pandas.DataFrame
read_df = feather.read_feather('/path/to/file', columns=my_columns)

# Result is pyarrow.Table
read_arrow = feather.read_table('/path/to/file', columns=my_columns)


{code}

  was:
Is there a way to get all column names (and e.g. number of columns and number 
of rows) from a Feather file before loading the full Feather file?

My Feather files are big (like 100GB) and the names of the columns are 
different per analysis and can't be hard coded.

{code:python}
import pyarrow.feather as feather

# Code here to check which columns are in the feather file.
...
my_columns = ...

# Result is pandas.DataFrame
read_df = feather.read_feather('/path/to/file', columns=my_columns)

# Result is pyarrow.Table
read_arrow = feather.read_table('/path/to/file', columns=my_columns)


{code}

Summary: [Python]  Get all column names (or schema) from Feather file, 
before loading whole Feather file  (was: [Python]  Get all column names from 
Feather file, before loading whole Feather file)

> [Python]  Get all column names (or schema) from Feather file, before loading 
> whole Feather file
> 
>
> Key: ARROW-10344
> URL: https://issues.apache.org/jira/browse/ARROW-10344
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Gert Hulselmans
>Priority: Major
>
> Is there a way to get all column names (or schema) from a Feather file before 
> loading the full Feather file?
> My Feather files are big (like 100GB) and the names of the columns are 
> different per analysis and can't be hard coded.
> {code:python}
> import pyarrow.feather as feather
> # Code here to check which columns are in the feather file.
> ...
> my_columns = ...
> # Result is pandas.DataFrame
> read_df = feather.read_feather('/path/to/file', columns=my_columns)
> # Result is pyarrow.Table
> read_arrow = feather.read_table('/path/to/file', columns=my_columns)
> {code}
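
Reading just the schema is possible without loading any data: Feather V2 files 
are Arrow IPC files, so the footer and schema can be inspected on their own. A 
sketch (assuming a V2 file; the path and the column selection are placeholders):
{code:python}
import pyarrow as pa
import pyarrow.ipc as ipc

# Open the file memory-mapped and read only the IPC footer/schema;
# no record batches are materialized here.
with pa.memory_map('/path/to/file', 'r') as source:
    schema = ipc.open_file(source).schema

print(schema.names)             # all column names
my_columns = schema.names[:10]  # e.g. pick a subset to actually load
{code}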



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10344) [Python] Get all column names from Feather file, before loading whole Feather file

2020-10-19 Thread Gert Hulselmans (Jira)
Gert Hulselmans created ARROW-10344:
---

 Summary: [Python]  Get all column names from Feather file, before 
loading whole Feather file
 Key: ARROW-10344
 URL: https://issues.apache.org/jira/browse/ARROW-10344
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Affects Versions: 1.0.1
Reporter: Gert Hulselmans


Is there a way to get all column names (and e.g. number of columns and number 
of rows) from a Feather file before loading the full Feather file?

My Feather files are big (like 100GB) and the names of the columns are 
different per analysis and can't be hard coded.

{code:python}
import pyarrow.feather as feather

# Code here to check which columns are in the feather file.
...
my_columns = ...

# Result is pandas.DataFrame
read_df = feather.read_feather('/path/to/file', columns=my_columns)

# Result is pyarrow.Table
read_arrow = feather.read_table('/path/to/file', columns=my_columns)


{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10159) [Rust][DataFusion] Add support for Dictionary types in data fusion

2020-10-19 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216587#comment-17216587
 ] 

Andrew Lamb commented on ARROW-10159:
-

Good call [~nevi_me] -- I have indeed completed the work I planned for 
DictionaryArray / DataFusion at this time. Thank you for the reminder. 

> [Rust][DataFusion] Add support for Dictionary types in data fusion
> --
>
> Key: ARROW-10159
> URL: https://issues.apache.org/jira/browse/ARROW-10159
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We have a system that needs to process low-cardinality string data (i.e. there 
> are only a few distinct values, but there are many millions of values).
> Using a `StringArray` is very expensive, as the same string value is copied 
> over and over again. The `DictionaryArray` was designed for exactly this 
> situation: rather than repeating each string, it stores indexes into a 
> dictionary and thus repeats only integer values. 
> Sadly, DataFusion does not support processing on `DictionaryArray` types, for 
> several reasons.
> This test (to be added to `arrow/rust/datafusion/tests/sql.rs`) shows what I 
> would like to be possible:
> {code}
> #[tokio::test]
> async fn query_on_string_dictionary() -> Result<()> {
> // ensure that data fusion can operate on dictionary types
> // Use StringDictionary (32 bit indexes = keys)
> let field_type = DataType::Dictionary(
> Box::new(DataType::Int32),
> Box::new(DataType::Utf8),
> );
> let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, 
> true)]));
> let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
> let values_builder = StringBuilder::new(10);
> let mut builder = StringDictionaryBuilder::new(
> keys_builder, values_builder
> );
> builder.append("one")?;
> builder.append_null()?;
> builder.append("three")?;
> let array = Arc::new(builder.finish());
> let data = RecordBatch::try_new(
> schema.clone(),
> vec![array],
> )?;
> let table = MemTable::new(schema, vec![vec![data]])?;
> let mut ctx = ExecutionContext::new();
> ctx.register_table("test", Box::new(table));
> // Basic SELECT
> let sql = "SELECT * FROM test";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "\"one\"\nNULL\n\"three\"".to_string();
> assert_eq!(expected, actual);
> // basic filtering
> let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "\"one\"\n\"three\"".to_string();
> assert_eq!(expected, actual);
> // filtering with constant
> let sql = "SELECT * FROM test WHERE d1 = 'three'";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "\"three\"".to_string();
> assert_eq!(expected, actual);
> // Expression evaluation
> let sql = "SELECT concat(d1, '-foo') FROM test";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string();
> assert_eq!(expected, actual);
> // aggregation
> let sql = "SELECT COUNT(d1) FROM test";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "2".to_string();
> assert_eq!(expected, actual);
> Ok(())
> }
> {code}
> However, it errors immediately:
> {code}
> ---- query_on_string_dictionary stdout ----
> thread 'query_on_string_dictionary' panicked at 'assertion failed: `(left == 
> right)`
>   left: `"\"one\"\nNULL\n\"three\""`,
>  right: `"???\nNULL\n???"`', datafusion/tests/sql.rs:989:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code}
> This ticket tracks adding proper support for Dictionary types to DataFusion. I 
> will break the work down into several smaller subtasks.
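
The key/value split described above is the same across Arrow implementations; a 
quick pyarrow sketch of what dictionary encoding does to the three test values 
(an illustration, not part of the ticket):
{code:python}
import pyarrow as pa

# Encode: the distinct strings are stored once, and each row becomes
# an integer index into that dictionary (nulls stay null).
arr = pa.array(["one", None, "three"]).dictionary_encode()
print(arr.indices)     # [0, null, 1]
print(arr.dictionary)  # ["one", "three"]
{code}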



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10159) [Rust][DataFusion] Add support for Dictionary types in data fusion

2020-10-19 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-10159.
-
Resolution: Fixed

All subtasks completed

> [Rust][DataFusion] Add support for Dictionary types in data fusion
> --
>
> Key: ARROW-10159
> URL: https://issues.apache.org/jira/browse/ARROW-10159
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We have a system that needs to process low-cardinality string data (i.e. there 
> are only a few distinct values, but there are many millions of values).
> Using a `StringArray` is very expensive, as the same string value is copied 
> over and over again. The `DictionaryArray` was designed for exactly this 
> situation: rather than repeating each string, it stores indexes into a 
> dictionary and thus repeats only integer values. 
> Sadly, DataFusion does not support processing on `DictionaryArray` types, for 
> several reasons.
> This test (to be added to `arrow/rust/datafusion/tests/sql.rs`) shows what I 
> would like to be possible:
> {code}
> #[tokio::test]
> async fn query_on_string_dictionary() -> Result<()> {
> // ensure that data fusion can operate on dictionary types
> // Use StringDictionary (32 bit indexes = keys)
> let field_type = DataType::Dictionary(
> Box::new(DataType::Int32),
> Box::new(DataType::Utf8),
> );
> let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, 
> true)]));
> let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
> let values_builder = StringBuilder::new(10);
> let mut builder = StringDictionaryBuilder::new(
> keys_builder, values_builder
> );
> builder.append("one")?;
> builder.append_null()?;
> builder.append("three")?;
> let array = Arc::new(builder.finish());
> let data = RecordBatch::try_new(
> schema.clone(),
> vec![array],
> )?;
> let table = MemTable::new(schema, vec![vec![data]])?;
> let mut ctx = ExecutionContext::new();
> ctx.register_table("test", Box::new(table));
> // Basic SELECT
> let sql = "SELECT * FROM test";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "\"one\"\nNULL\n\"three\"".to_string();
> assert_eq!(expected, actual);
> // basic filtering
> let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "\"one\"\n\"three\"".to_string();
> assert_eq!(expected, actual);
> // filtering with constant
> let sql = "SELECT * FROM test WHERE d1 = 'three'";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "\"three\"".to_string();
> assert_eq!(expected, actual);
> // Expression evaluation
> let sql = "SELECT concat(d1, '-foo') FROM test";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string();
> assert_eq!(expected, actual);
> // aggregation
> let sql = "SELECT COUNT(d1) FROM test";
> let actual = execute(&mut ctx, sql).await.join("\n");
> let expected = "2".to_string();
> assert_eq!(expected, actual);
> Ok(())
> }
> {code}
> However, it errors immediately:
> {code}
> ---- query_on_string_dictionary stdout ----
> thread 'query_on_string_dictionary' panicked at 'assertion failed: `(left == 
> right)`
>   left: `"\"one\"\nNULL\n\"three\""`,
>  right: `"???\nNULL\n???"`', datafusion/tests/sql.rs:989:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code}
> This ticket tracks adding proper support for Dictionary types to DataFusion. I 
> will break the work down into several smaller subtasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10310) [C++][Gandiva] Add single argument round() in Gandiva

2020-10-19 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-10310.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8467
[https://github.com/apache/arrow/pull/8467]

> [C++][Gandiva] Add single argument round() in Gandiva
> -
>
> Key: ARROW-10310
> URL: https://issues.apache.org/jira/browse/ARROW-10310
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Sagnik Chakraborty
>Assignee: Sagnik Chakraborty
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10315) [C++] CSV skip wrong rows

2020-10-19 Thread Maciej (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216554#comment-17216554
 ] 

Maciej commented on ARROW-10315:


Emitting nulls wouldn't work for me. I may stick with checking the file myself 
before loading it with Arrow.

> [C++] CSV skip wrong rows
> -
>
> Key: ARROW-10315
> URL: https://issues.apache.org/jira/browse/ARROW-10315
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Maciej
>Priority: Major
>
> It would be helpful to add another option to {color:#267f99}ReadOptions{color} 
> that enables skipping rows with wrong data (e.g. a data type mismatch with the 
> column type) and continuing with the next rows. The numbers of the skipped 
> rows could be reported at the end of processing.
> This way I can deal with the wrongly formatted data, or ignore it if I have a 
> high load success rate and don't care about the exceptions.
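
Until such an option exists, a workaround sketch (the file name 'data.csv' and 
column name 'ts' are placeholders): read the suspect column as string, attempt 
the real conversion value by value, and collect the offending row numbers.
{code:python}
import pyarrow as pa
import pyarrow.csv as csv

# Force the suspect column to string so the read itself cannot fail on it.
convert = csv.ConvertOptions(column_types={'ts': pa.string()})
table = csv.read_csv('data.csv', convert_options=convert)

# Try the target conversion one value at a time and record failures.
bad_rows = []
for i, value in enumerate(table.column('ts').to_pylist()):
    try:
        pa.array([value]).cast(pa.timestamp('s'))
    except pa.ArrowInvalid:
        bad_rows.append(i)
print('rows with unparseable data:', bad_rows)
{code}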



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-10314) [C++] CSV wrong row number in error message

2020-10-19 Thread Maciej (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej closed ARROW-10314.
--
Resolution: Feedback Received

> [C++] CSV wrong row number in error message
> ---
>
> Key: ARROW-10314
> URL: https://issues.apache.org/jira/browse/ARROW-10314
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Maciej
>Priority: Major
>
> When I try to read a CSV file with wrong data, I get a message like:
> {code:java}
> CSV file reader error: Invalid: In CSV column #0: CSV conversion error to 
> timestamp[s]: invalid value '1'
> {code}
> It would be very helpful to add information about the row with the wrong data, 
> e.g.
> {code:java}
> CSV file reader error: Invalid: In CSV column #0 line number #123456: CSV 
> conversion error to timestamp[s]: invalid value '1'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10314) [C++] CSV wrong row number in error message

2020-10-19 Thread Maciej (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216547#comment-17216547
 ] 

Maciej commented on ARROW-10314:


OK, thanks for the answer.

> [C++] CSV wrong row number in error message
> ---
>
> Key: ARROW-10314
> URL: https://issues.apache.org/jira/browse/ARROW-10314
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Maciej
>Priority: Major
>
> When I try to read a CSV file with wrong data, I get a message like:
> {code:java}
> CSV file reader error: Invalid: In CSV column #0: CSV conversion error to 
> timestamp[s]: invalid value '1'
> {code}
> It would be very helpful to add information about the row with the wrong data, 
> e.g.
> {code:java}
> CSV file reader error: Invalid: In CSV column #0 line number #123456: CSV 
> conversion error to timestamp[s]: invalid value '1'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-8435) [Python] A TypeError is raised while token expires during writing to S3

2020-10-19 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-8435.

Resolution: Feedback Received

> [Python] A TypeError is raised while token expires during writing to S3
> ---
>
> Key: ARROW-8435
> URL: https://issues.apache.org/jira/browse/ARROW-8435
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: Shawn Li
>Priority: Critical
>
> This issue occurs when an STS token expires *in the middle of* writing to S3. 
> An OSError: Write failed: TypeError("'NoneType' object is not 
> subscriptable",) is raised instead of a PermissionError.
>  
> OSError: Write failed: TypeError("'NoneType' object is not subscriptable",)
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1450, in write_to_dataset
>     write_table(subtable, f, **kwargs)
>   File "/usr/local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1344, in write_table
>     writer.write_table(table, row_group_size=row_group_size)
>   File "/usr/local/lib/python3.6/site-packages/pyarrow/parquet.py", line 474, in write_table
>     self.writer.write_table(table, row_group_size=row_group_size)
>   File "pyarrow/_parquet.pyx", line 1375, in pyarrow._parquet.ParquetWriter.write_table
>   File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Arrow error: IOError: The provided token has expired.. Detail: Python exception: PermissionError
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.6/site-packages/s3fs/core.py", line 1096, in _upload_chunk
>     PartNumber=part, UploadId=self.mpu['UploadId'],
> TypeError: 'NoneType' object is not subscriptable
> environment is:
>  s3fs==0.4.0
>  boto3==1.10.27
>  botocore==1.13.27
>  pyarrow==0.15.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8435) [Python] A TypeError is raised while token expires during writing to S3

2020-10-19 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216535#comment-17216535
 ] 

Joris Van den Bossche commented on ARROW-8435:
--

Given this is also tracked in the s3fs issue 
https://github.com/dask/s3fs/issues/314, going to close this one

> [Python] A TypeError is raised while token expires during writing to S3
> ---
>
> Key: ARROW-8435
> URL: https://issues.apache.org/jira/browse/ARROW-8435
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: Shawn Li
>Priority: Critical
>
> This issue occurs when an STS token expires *in the middle of* writing to S3. 
> An OSError: Write failed: TypeError("'NoneType' object is not 
> subscriptable",) is raised instead of a PermissionError.
>  
> OSError: Write failed: TypeError("'NoneType' object is not subscriptable",)
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1450, in write_to_dataset
>     write_table(subtable, f, **kwargs)
>   File "/usr/local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1344, in write_table
>     writer.write_table(table, row_group_size=row_group_size)
>   File "/usr/local/lib/python3.6/site-packages/pyarrow/parquet.py", line 474, in write_table
>     self.writer.write_table(table, row_group_size=row_group_size)
>   File "pyarrow/_parquet.pyx", line 1375, in pyarrow._parquet.ParquetWriter.write_table
>   File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Arrow error: IOError: The provided token has expired.. Detail: Python exception: PermissionError
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.6/site-packages/s3fs/core.py", line 1096, in _upload_chunk
>     PartNumber=part, UploadId=self.mpu['UploadId'],
> TypeError: 'NoneType' object is not subscriptable
> environment is:
>  s3fs==0.4.0
>  boto3==1.10.27
>  botocore==1.13.27
>  pyarrow==0.15.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9963) [Python] Recognize datetime.timezone.utc as UTC on conversion python->pyarrow

2020-10-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9963:
--
Labels: pull-request-available  (was: )

> [Python] Recognize datetime.timezone.utc as UTC on conversion python->pyarrow
> -
>
> Key: ARROW-9963
> URL: https://issues.apache.org/jira/browse/ARROW-9963
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Related to ARROW-5248, but specifically for the stdlib 
> {{datetime.timezone.utc}}, I think it would be nice to "recognize" this as 
> UTC. Currently it is converted to "+00:00", while for pytz this is not the 
> case:
> {code}
> from datetime import datetime, timezone
> import pytz
> import pyarrow as pa
> print(pa.array([datetime.now(timezone.utc)]).type)
> print(pa.array([datetime.now(pytz.utc)]).type)
> {code}
> gives
> {code}
> timestamp[us, tz=+00:00]
> timestamp[us, tz=UTC]
> {code}
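
A sketch of the normalization the ticket asks for (a hypothetical helper, not 
pyarrow's actual conversion code): treat the stdlib UTC singleton like pytz's, 
and fall back to a "+HH:MM" string for other fixed offsets.
{code:python}
from datetime import timezone

def tzinfo_to_string(tz):
    # Hypothetical helper: map datetime.timezone.utc to "UTC" instead
    # of "+00:00"; format any other fixed offset as "+HH:MM"/"-HH:MM".
    if tz is timezone.utc:
        return 'UTC'
    minutes = int(tz.utcoffset(None).total_seconds() // 60)
    sign, minutes = ('+', minutes) if minutes >= 0 else ('-', -minutes)
    return '{}{:02d}:{:02d}'.format(sign, minutes // 60, minutes % 60)

print(tzinfo_to_string(timezone.utc))  # UTC
{code}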



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10343) Unable to parse strings into timestamps

2020-10-19 Thread Niclas Roos (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niclas Roos updated ARROW-10343:

Description: 
Hi,

I'm working with parquet files generated by an AWS RDS Postgres snapshot export. 

I'm trying to parse a date column stored as a string into a timestamp, but it 
fails.

I've managed to parse the same date format (as in the first example below) when 
reading from a CSV, so I tried to investigate it as far as I could on my own; 
here are my results:
{code:java}
import pyarrow as pa
import pytz

#
## the format I get from the database
us_tz_arr = pa.array([
  "2014-12-07 07:48:59.285332+00",
  "2014-12-07 08:01:49.758975+00",
  "2014-12-07 10:11:35.884304+00"])

us_tz_arr.cast(pa.timestamp('us', tz=pytz.UTC))
-> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304+00

#
## tried removing the timezone
us_arr = pa.array([
  "2014-12-07 07:48:59.285332",
  "2014-12-07 08:01:49.758975",
  "2014-12-07 10:11:35.884304"])

us_arr.cast(pa.timestamp('us'))
-> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304

#
## tried removing the microseconds but keeping the timezone
second_tz_arr = pa.array([
  "2014-12-07 07:48:59+00",
  "2014-12-07 08:01:49+00",
  "2014-12-07 10:11:35+00"])

second_tz_arr.cast(pa.timestamp('s', tz=pytz.UTC))
-> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35+00

#
## removing microseconds and timezone, makes it work!
s_arr = pa.array([
  "2014-12-07 07:48:59",
  "2014-12-07 08:01:49",
  "2014-12-07 10:11:35"])

s_arr.cast(pa.timestamp('s'))
-> 
[
  2014-12-07 07:48:59,
  2014-12-07 08:01:49,
  2014-12-07 10:11:35
]{code}
 PS. This is my first bug report, so apologies if important things are missing.

  was:
Hi,

I'm working with parquet files generated by an AWS RDS Postgres snapshot export. 

I'm trying to parse a date column stored as a string into a timestamp, but it 
fails.

I've managed to parse the same date format (as in the first example below) when 
reading from a CSV, so I tried to investigate it as far as I could on my own; 
here are my results:
{code:java}
// code placeholder 
import pyarrow as pa
import pytz

#
## the format I get from the database
us_tz_arr = pa.array([
  "2014-12-07 07:48:59.285332+00",
  "2014-12-07 08:01:49.758975+00",
  "2014-12-07 10:11:35.884304+00"])

us_tz_arr.cast(pa.timestamp('us', tz=pytz.UTC))
-> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304+00

#
## tried removing the timezone
us_arr = pa.array([
  "2014-12-07 07:48:59.285332",
  "2014-12-07 08:01:49.758975",
  "2014-12-07 10:11:35.884304"])

us_arr.cast(pa.timestamp('us'))
-> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304

#
## tried removing the microseconds but keeping the timezone
second_tz_arr = pa.array([
  "2014-12-07 07:48:59+00",
  "2014-12-07 08:01:49+00",
  "2014-12-07 10:11:35+00"])

second_tz_arr.cast(pa.timestamp('s', tz=pytz.UTC))
-> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35+00

#
## removing microseconds and timezone, makes it work!
s_arr = pa.array([
  "2014-12-07 07:48:59",
  "2014-12-07 08:01:49",
  "2014-12-07 10:11:35"])

s_arr.cast(pa.timestamp('s'))
-> 
[
  2014-12-07 07:48:59,
  2014-12-07 08:01:49,
  2014-12-07 10:11:35
]{code}
 PS. This is my first bug report, so apologies if important things are missing.


> Unable to parse strings into timestamps
> ---
>
> Key: ARROW-10343
> URL: https://issues.apache.org/jira/browse/ARROW-10343
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
> Environment: macOS 10.15.7, Python 3.8.2
>Reporter: Niclas Roos
>Priority: Minor
>
> Hi,
> I'm working with parquet files generated by an AWS RDS Postgres snapshot 
> export. 
> I'm trying to parse a date column stored as a string into a timestamp, but it 
> fails.
> I've managed to parse the same date format (as in the first example below) 
> when reading from a CSV, so I tried to investigate it as far as I could on my 
> own; here are my results:
> {code:java}
> import pyarrow as pa
> import pytz
> #
> ## the 

[jira] [Updated] (ARROW-10343) Unable to parse strings into timestamps

2020-10-19 Thread Niclas Roos (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niclas Roos updated ARROW-10343:

Priority: Minor  (was: Major)

> Unable to parse strings into timestamps
> ---
>
> Key: ARROW-10343
> URL: https://issues.apache.org/jira/browse/ARROW-10343
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
> Environment: macOS 10.15.7, Python 3.8.2
>Reporter: Niclas Roos
>Priority: Minor
>
> Hi,
> I'm working with parquet files generated by an AWS RDS Postgres snapshot 
> export. 
> I'm trying to parse a date column stored as a string into a timestamp, but it 
> fails.
> I've managed to parse the same date format (as in the first example below) 
> when reading from a CSV, so I tried to investigate it as far as I could on my 
> own; here are my results:
> {code:java}
> // code placeholder 
> import pyarrow as pa
> import pytz
> #
> ## the format I get from the database
> us_tz_arr = pa.array([
>   "2014-12-07 07:48:59.285332+00",
>   "2014-12-07 08:01:49.758975+00",
>   "2014-12-07 10:11:35.884304+00"])
> us_tz_arr.cast(pa.timestamp('us', tz=pytz.UTC))
> -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304+00
> #
> ## tried removing the timezone
> us_arr = pa.array([
>   "2014-12-07 07:48:59.285332",
>   "2014-12-07 08:01:49.758975",
>   "2014-12-07 10:11:35.884304"])
> us_arr.cast(pa.timestamp('us'))
> -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304
> #
> ## tried removing the microseconds but keeping the timezone
> second_tz_arr = pa.array([
>   "2014-12-07 07:48:59+00",
>   "2014-12-07 08:01:49+00",
>   "2014-12-07 10:11:35+00"])
> second_tz_arr.cast(pa.timestamp('s', tz=pytz.UTC))
> -> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35+00
> #
> ## removing microseconds and timezone, makes it work!
> s_arr = pa.array([
>   "2014-12-07 07:48:59",
>   "2014-12-07 08:01:49",
>   "2014-12-07 10:11:35"])
> s_arr.cast(pa.timestamp('s'))
> -> 
> [
>   2014-12-07 07:48:59,
>   2014-12-07 08:01:49,
>   2014-12-07 10:11:35
> ]{code}
>  PS. This is my first bug report, so apologies if important things are 
> missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10343) Unable to parse strings into timestamps

2020-10-19 Thread Niclas Roos (Jira)
Niclas Roos created ARROW-10343:
---

 Summary: Unable to parse strings into timestamps
 Key: ARROW-10343
 URL: https://issues.apache.org/jira/browse/ARROW-10343
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 1.0.1
 Environment: macOS 10.15.7, Python 3.8.2
Reporter: Niclas Roos


Hi,

I'm working with parquet files generated by an AWS RDS Postgres snapshot export. 

I'm trying to parse a date column stored as a string into a timestamp, but it 
fails.

I've managed to parse the same date format (as in the first example below) when 
reading from a CSV, so I tried to investigate it as far as I could on my own; 
here are my results:
{code:java}
// code placeholder 
import pyarrow as pa
import pytz

#
## the format I get from the database
us_tz_arr = pa.array([
  "2014-12-07 07:48:59.285332+00",
  "2014-12-07 08:01:49.758975+00",
  "2014-12-07 10:11:35.884304+00"])

us_tz_arr.cast(pa.timestamp('us', tz=pytz.UTC))
-> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304+00

#
## tried removing the timezone
us_arr = pa.array([
  "2014-12-07 07:48:59.285332",
  "2014-12-07 08:01:49.758975",
  "2014-12-07 10:11:35.884304"])

us_arr.cast(pa.timestamp('us'))
-> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35.884304

#
## tried removing the microseconds but keeping the timezone
second_tz_arr = pa.array([
  "2014-12-07 07:48:59+00",
  "2014-12-07 08:01:49+00",
  "2014-12-07 10:11:35+00"])

second_tz_arr.cast(pa.timestamp('s', tz=pytz.UTC))
-> ArrowInvalid: Failed to parse string: 2014-12-07 10:11:35+00

#
## removing microseconds and timezone, makes it work!
s_arr = pa.array([
  "2014-12-07 07:48:59",
  "2014-12-07 08:01:49",
  "2014-12-07 10:11:35"])

s_arr.cast(pa.timestamp('s'))
-> 
[
  2014-12-07 07:48:59,
  2014-12-07 08:01:49,
  2014-12-07 10:11:35
]{code}
 PS. This is my first bug report, so apologies if important things are missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)