[jira] [Commented] (ARROW-10153) [Java] Adding values to VarCharVector beyond 2GB results in IndexOutOfBoundsException

2020-10-06 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209303#comment-17209303
 ] 

Micah Kornfield commented on ARROW-10153:
-

The main implications of using LargeVarCharVector by default are:
1.  It has an 8-byte instead of a 4-byte offset overhead per string value.
2.  It might not be supported in all Arrow implementations (I would need to 
double check the matrix/integration tests).

There isn't anything built into Java that will do the conversion automatically. 
 You could probably determine this yourself via accessors on the vectors (I 
think getByteCapacity, etc.).  Although you would potentially run into other 
problems trying to copy values that are close to 2GB in size from one vector 
to another (you would have a pretty high peak off-heap memory usage).
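
To make that concrete, here is a minimal Java sketch of such a guard, assuming 
you track the running data-buffer size yourself rather than relying on a 
particular accessor; the class and variable names are illustrative only:

{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;

public class OffsetLimitGuard {
  public static void main(String[] args) {
    byte[] value = "a 1MB payload would go here".getBytes(StandardCharsets.UTF_8);
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VarCharVector vector = new VarCharVector("v", allocator)) {
      vector.allocateNew();
      long totalBytes = 0;  // running size of the variable-width data buffer
      int written = 0;
      for (; written < 3000; written++) {
        // VarCharVector uses 4-byte offsets, so its data buffer can never
        // address more than Integer.MAX_VALUE (~2GB) bytes in total.
        if (totalBytes + value.length > Integer.MAX_VALUE) {
          // This is where you would switch to (copy into) a
          // LargeVarCharVector, which uses 8-byte offsets.
          break;
        }
        vector.setSafe(written, value);
        totalBytes += value.length;
      }
      vector.setValueCount(written);
    }
  }
}
{code}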

> [Java] Adding values to VarCharVector beyond 2GB results in 
> IndexOutOfBoundsException
> -
>
> Key: ARROW-10153
> URL: https://issues.apache.org/jira/browse/ARROW-10153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.0.0
>Reporter: Samarth Jain
>Priority: Major
>
> On executing the below test case, one can see that on adding the 2049th 
> string of size 1MB, it fails.  
> {code:java}
> int length = 1024 * 1024;
> StringBuilder sb = new StringBuilder(length);
> for (int i = 0; i < length; i++) {
>  sb.append("a");
> }
> byte[] str = sb.toString().getBytes();
> VarCharVector vector = new VarCharVector("v", new 
> RootAllocator(Long.MAX_VALUE));
> vector.allocateNew(3000);
> for (int i = 0; i < 3000; i++) {
>  vector.setSafe(i, str);
> }{code}
>  
> {code:java}
> Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 
> -2147483648, length: 1048576 (expected: range(0, 2147483648))
>   at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
>   at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:762)
>   at org.apache.arrow.vector.BaseVariableWidthVector.setBytes(BaseVariableWidthVector.java:1212)
>   at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1011)
> {code}
> Stepping through the code, 
> [https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/ArrowBuf.java#L425]
> returns the negative index {{-2147483648}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10203) Capture guidance for endianness support in contributors guide.

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10203:
---
Labels: pull-request-available  (was: )

> Capture guidance for endianness support in contributors guide.
> --
>
> Key: ARROW-10203
> URL: https://issues.apache.org/jira/browse/ARROW-10203
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3ccak7z5t--hhhr9dy43pyhd6m-xou4qogwqvlwzsg-koxxjpt...@mail.gmail.com%3e



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10203) Capture guidance for endianness support in contributors guide.

2020-10-06 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-10203:
---

 Summary: Capture guidance for endianness support in contributors 
guide.
 Key: ARROW-10203
 URL: https://issues.apache.org/jira/browse/ARROW-10203
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Micah Kornfield
Assignee: Micah Kornfield


https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3ccak7z5t--hhhr9dy43pyhd6m-xou4qogwqvlwzsg-koxxjpt...@mail.gmail.com%3e



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10058) [C++] Investigate performance of LevelsToBitmap without BMI2

2020-10-06 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-10058.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8320
[https://github.com/apache/arrow/pull/8320]

> [C++] Investigate performance of LevelsToBitmap without BMI2
> 
>
> Key: ARROW-10058
> URL: https://issues.apache.org/jira/browse/ARROW-10058
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
> Attachments: opt-level-conv.diff
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Currently, when some Parquet nested data involves some repetition levels, 
> converting the levels to bitmap goes through a slow scalar path unless the 
> BMI2 instruction set is available and efficient (the latter using the PEXT 
> instruction to process 16 levels at once).
> It may be possible to emulate PEXT for 5- or 6-bit masks by using a lookup 
> table, allowing to process 5-6 levels at once.
> (also, it would be good to add nested reading benchmarks for non-trivial 
> nesting; currently we only benchmark one-level struct and one-level list)
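
As an illustration of the lookup-table idea (a hedged sketch only, not the 
kernel from the actual PR): precompute PEXT results for every (mask, value) 
pair of 6-bit operands, so that a single table lookup stands in for the PEXT 
instruction on 6 levels at a time:

{code:cpp}
#include <array>
#include <cstdint>

// table[mask][value] = PEXT(value, mask) for 6-bit operands: extract the
// bits of `value` selected by `mask` and pack them toward the LSB.
using PextTable = std::array<std::array<uint8_t, 64>, 64>;

PextTable BuildPextTable() {
  PextTable table{};
  for (uint32_t mask = 0; mask < 64; ++mask) {
    for (uint32_t value = 0; value < 64; ++value) {
      uint8_t packed = 0;
      int out_bit = 0;
      for (int bit = 0; bit < 6; ++bit) {
        if (mask & (1u << bit)) {
          if (value & (1u << bit)) packed |= (1u << out_bit);
          ++out_bit;
        }
      }
      table[mask][value] = packed;
    }
  }
  return table;
}

// Usage sketch: one lookup replaces a PEXT instruction for 6 levels.
//   auto table = BuildPextTable();
//   uint8_t packed = table[mask & 63][bits & 63];
{code}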



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10197) [Gandiva][python] Execute expression on filtered data

2020-10-06 Thread Will Jones (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209275#comment-17209275
 ] 

Will Jones commented on ARROW-10197:


You may also need to edit these lines:

[https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/includes/libgandiva.pxd#L190-L196]

Some potentially relevant docs: 
[https://cython.readthedocs.io/en/latest/src/tutorial/clibraries.html]

 

> [Gandiva][python] Execute expression on filtered data
> -
>
> Key: ARROW-10197
> URL: https://issues.apache.org/jira/browse/ARROW-10197
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva, Python
>Reporter: Kirill Lykov
>Priority: Major
> Fix For: 3.0.0
>
>
> Looks like there is no way to execute an expression on filtered data in 
> python. 
>  Basically, I cannot pass {{SelectionVector}} to the projector's 
> {{evaluate}} method:
> {code:python}
> import pyarrow as pa
> import pyarrow.gandiva as gandiva
> 
> table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]),
>                               pa.array([5., 45., 36., 73., 83., 23., 76.])],
>                              ['a', 'b'])
> 
> builder = gandiva.TreeExprBuilder()
> node_a = builder.make_field(table.schema.field("a"))
> node_b = builder.make_field(table.schema.field("b"))
> fifty = builder.make_literal(50.0, pa.float64())
> eleven = builder.make_literal(11.0, pa.float64())
> 
> cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_())
> cond_2 = builder.make_function("greater_than", [node_a, node_b], pa.bool_())
> cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_())
> cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3])
> condition = builder.make_condition(cond)
> 
> filter = gandiva.make_filter(table.schema, condition)
> # filter_result has type SelectionVector
> filter_result = filter.evaluate(table.to_batches()[0], pa.default_memory_pool())
> print(filter_result)
> 
> sum_node = builder.make_function("add", [node_a, node_b], pa.float64())
> field_result = pa.field("c", pa.float64())
> expr = builder.make_expression(sum_node, field_result)
> projector = gandiva.make_projector(
>     table.schema, [expr], pa.default_memory_pool())
> 
> # Problem: there is no way to pass filter_result to the projector here
> r, = projector.evaluate(table.to_batches()[0], filter_result)
> {code}
> In C++, I see that it is possible to pass SelectionVector as second argument 
> to projector::Evaluate: 
> [https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270]
>   
>  Meanwhile, it looks like it is impossible in `gandiva.pyx`: 
> [https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10202) [CI][Windows] Use sf.net mirror for MSYS2

2020-10-06 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-10202:


 Summary: [CI][Windows] Use sf.net mirror for MSYS2
 Key: ARROW-10202
 URL: https://issues.apache.org/jira/browse/ARROW-10202
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10105) [FlightRPC] Add client option to disable certificate validation with TLS

2020-10-06 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209248#comment-17209248
 ] 

David Li commented on ARROW-10105:
--

If you would like to separate out the Java changes, those look straightforward 
and we can merge them.

For the certs - it seems gRPC wants the certs themselves and not necessarily a 
file, so can we just embed the certs into the binary (or even embed just a 
single invalid cert or something)?
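
For illustration, a minimal sketch of what embedding could look like, with a 
placeholder PEM body rather than a real certificate:

{code:cpp}
#include <string>

// Sketch: embed the PEM text in the binary instead of loading a file at
// runtime. The body below is a placeholder, not a real certificate; in
// practice the constant could be generated from a .pem file at build time.
static const char kEmbeddedTestCert[] =
    "-----BEGIN CERTIFICATE-----\n"
    "...placeholder...\n"
    "-----END CERTIFICATE-----\n";

const std::string& GetEmbeddedTestCert() {
  static const std::string cert(kEmbeddedTestCert);
  return cert;
}
{code}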

> [FlightRPC] Add client option to disable certificate validation with TLS
> 
>
> Key: ARROW-10105
> URL: https://issues.apache.org/jira/browse/ARROW-10105
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, FlightRPC, Java, Python
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Users of Flight may want to disable certificate validation if they want to 
> only use encryption. A use case might be that the Flight server uses a 
> self-signed certificate and doesn't distribute a certificate for clients to 
> use.
> This feature would be to add an explicit option to FlightClient.Builder to 
> disable certificate validation. Note that this should not happen implicitly 
> if a client uses a TLS location, but does not set a certificate. The client 
> should explicitly set this option so that they are fully aware that they are 
> making a connection with reduced security.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10105) [FlightRPC] Add client option to disable certificate validation with TLS

2020-10-06 Thread James Duong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209239#comment-17209239
 ] 

James Duong commented on ARROW-10105:
-

I managed to get CentOS 5.11 / manylinux1 to build now.

It looks like the MinGW builds are failing because the latest gRPC release 
available on MSYS2 is 1.29:
https://packages.msys2.org/package/mingw-w64-i686-grpc

1.29 does have the features needed, but they live in the 
grpc_impl::experimental namespace rather than the grpc::experimental namespace. 
We could make the namespace used a #define that's set differently for MinGW 
builds. (We could also have done this for CentOS 5, but we'd basically just 
have been delaying dealing with the upgrade challenges, so I felt the right 
thing to do was to resolve those now.)

The Ursabot builds use gRPC 1.21.4, which doesn't have the TlsCredentials 
feature at all.
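
A hedged sketch of what that #define could look like; the macro names here are 
hypothetical, and the gRPC security headers are assumed to be included 
elsewhere:

{code:cpp}
// Hypothetical macro: pick the namespace at configure time. gRPC 1.29 puts
// the TLS credentials types in grpc_impl::experimental; later releases use
// grpc::experimental.
#ifdef GRPC_USE_IMPL_NAMESPACE  // set for MinGW builds pinned to gRPC 1.29
#define GRPC_TLS_NAMESPACE grpc_impl::experimental
#else
#define GRPC_TLS_NAMESPACE grpc::experimental
#endif

// Usage sketch: GRPC_TLS_NAMESPACE::TlsCredentialsOptions options;
{code}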

> [FlightRPC] Add client option to disable certificate validation with TLS
> 
>
> Key: ARROW-10105
> URL: https://issues.apache.org/jira/browse/ARROW-10105
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, FlightRPC, Java, Python
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Users of Flight may want to disable certificate validation if they want to 
> only use encryption. A use case might be that the Flight server uses a 
> self-signed certificate and doesn't distribute a certificate for clients to 
> use.
> This feature would be to add an explicit option to FlightClient.Builder to 
> disable certificate validation. Note that this should not happen implicitly 
> if a client uses a TLS location, but does not set a certificate. The client 
> should explicitly set this option so that they are fully aware that they are 
> making a connection with reduced security.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10201) [C++][CI] Disable S3 in arm64 job on Travis CI

2020-10-06 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-10201.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8372
[https://github.com/apache/arrow/pull/8372]

> [C++][CI] Disable S3 in arm64 job on Travis CI
> --
>
> Key: ARROW-10201
> URL: https://issues.apache.org/jira/browse/ARROW-10201
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10105) [FlightRPC] Add client option to disable certificate validation with TLS

2020-10-06 Thread James Duong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209235#comment-17209235
 ] 

James Duong commented on ARROW-10105:
-

If we cannot change the environments now, would it make sense to include just 
the Java changes for Arrow 2.0 and add python/C++ to the January release?

> [FlightRPC] Add client option to disable certificate validation with TLS
> 
>
> Key: ARROW-10105
> URL: https://issues.apache.org/jira/browse/ARROW-10105
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, FlightRPC, Java, Python
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Users of Flight may want to disable certificate validation if they want to 
> only use encryption. A use case might be that the Flight server uses a 
> self-signed certificate and doesn't distribute a certificate for clients to 
> use.
> This feature would be to add an explicit option to FlightClient.Builder to 
> disable certificate validation. Note that this should not happen implicitly 
> if a client uses a TLS location, but does not set a certificate. The client 
> should explicitly set this option so that they are fully aware that they are 
> making a connection with reduced security.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10105) [FlightRPC] Add client option to disable certificate validation with TLS

2020-10-06 Thread James Duong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209233#comment-17209233
 ] 

James Duong commented on ARROW-10105:
-

Thanks [~lidavidm]. I wasn't able to open the second link, but I'm assuming 
you're pointing to the grpc-cpp reference there that is using 1.21.4:
https://github.com/ursa-labs/ursabot/blob/e958c5f95b31e98108df54cf13596c4fde944c3a/projects/arrow/docker/conda-cpp.txt#L19

I don't know what the relationship of this repo is to the Arrow repo, or 
how/when it gets updated. It would be good if we could get some insight into 
this, [~uwe] and [~kszucs].

I believe I've gotten past the RE2 linker error now on CentOS 5.11, but I'm not 
sure whether more CentOS 5 issues will crop up after that. I have not 
implemented the change to use the dummy certificate. Related to this, I'm 
planning to just copy the root PEM file that ships with gRPC, which they put in 
/usr/share/grpc. Where would a good place be to put this in the source tree? 
The install location would be /usr/share/arrow.

> [FlightRPC] Add client option to disable certificate validation with TLS
> 
>
> Key: ARROW-10105
> URL: https://issues.apache.org/jira/browse/ARROW-10105
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, FlightRPC, Java, Python
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Users of Flight may want to disable certificate validation if they want to 
> only use encryption. A use case might be that the Flight server uses a 
> self-signed certificate and doesn't distribute a certificate for clients to 
> use.
> This feature would be to add an explicit option to FlightClient.Builder to 
> disable certificate validation. Note that this should not happen implicitly 
> if a client uses a TLS location, but does not set a certificate. The client 
> should explicitly set this option so that they are fully aware that they are 
> making a connection with reduced security.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9414) [C++] apt package includes headers for S3 interface, but no support

2020-10-06 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209231#comment-17209231
 ] 

Kouhei Sutou commented on ARROW-9414:
-

Yes.
I'll do this today or tomorrow.

> [C++] apt package includes headers for S3 interface, but no support
> ---
>
> Key: ARROW-9414
> URL: https://issues.apache.org/jira/browse/ARROW-9414
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.17.1
> Environment: Ubuntu 18.04.04 LTS
>Reporter: Simon Bertron
>Assignee: Kouhei Sutou
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: test.cpp
>
>
> I believe that the apt package is built without S3 support. But s3fs.h is 
> exported in filesystem/api.h anyway. This creates undefined reference errors 
> when trying to link to the package.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10201) [C++][CI] Disable S3 in arm64 job on Travis CI

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10201:
---
Labels: pull-request-available  (was: )

> [C++][CI] Disable S3 in arm64 job on Travis CI
> --
>
> Key: ARROW-10201
> URL: https://issues.apache.org/jira/browse/ARROW-10201
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10201) [C++][CI] Disable S3 in arm64 job on Travis CI

2020-10-06 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-10201:


 Summary: [C++][CI] Disable S3 in arm64 job on Travis CI
 Key: ARROW-10201
 URL: https://issues.apache.org/jira/browse/ARROW-10201
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Continuous Integration
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10105) [FlightRPC] Add client option to disable certificate validation with TLS

2020-10-06 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209222#comment-17209222
 ] 

David Li commented on ARROW-10105:
--

[~jduong] I'm also not sure about the Ursabot builds, but from a short look, I 
think they may be using a fixed Conda environment instead of the one specified 
in the repository (e.g. the error message points to files that don't exist in 
the latest conda package). Notably, see 
[https://github.com/ursa-labs/ursabot/tree/master/projects/arrow] and 
[https://github.com/ursa-labs/ursabot/blob/05ec280304742f9795f30f589a60a5a1011d38cd/projects/arrow/docker/conda-cpp.txt]

Maybe [~uwe] or [~kszucs] could comment there.

I'm not sure about the CentOS build. Overall, I'd be hesitant to merge 
something that changes the dependencies and build drastically right before the 
release cutoff unless the release manager is OK with it.

> [FlightRPC] Add client option to disable certificate validation with TLS
> 
>
> Key: ARROW-10105
> URL: https://issues.apache.org/jira/browse/ARROW-10105
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, FlightRPC, Java, Python
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Users of Flight may want to disable certificate validation if they want to 
> only use encryption. A use case might be that the Flight server uses a 
> self-signed certificate and doesn't distribute a certificate for clients to 
> use.
> This feature would be to add an explicit option to FlightClient.Builder to 
> disable certificate validation. Note that this should not happen implicitly 
> if a client uses a TLS location, but does not set a certificate. The client 
> should explicitly set this option so that they are fully aware that they are 
> making a connection with reduced security.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10200) [Java][CI] Fix failure of Java CI on s390x

2020-10-06 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-10200.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8368
[https://github.com/apache/arrow/pull/8368]

> [Java][CI] Fix failure of Java CI on s390x
> --
>
> Key: ARROW-10200
> URL: https://issues.apache.org/jira/browse/ARROW-10200
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Java
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ARROW-9701 causes the following failure due to a missing library, 
> {{libprotobuf.so.18}}:
> {code:java}
> ...
> [ERROR] PROTOC FAILED: 
> /arrow/java/flight/flight-core/target/protoc-plugins/protoc-3.7.1-linux-s390_64.exe:
>  error while loading shared libraries: libprotobuf.so.18: cannot open shared 
> object file: No such file or directory
> [ERROR] /arrow/java/flight/flight-core/../../../format/Flight.proto [0:0]: 
> /arrow/java/flight/flight-core/target/protoc-plugins/protoc-3.7.1-linux-s390_64.exe:
>  error while loading shared libraries: libprotobuf.so.18: cannot open shared 
> object file: No such file or directory
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10200) [Java][CI] Fix failure of Java CI on s390x

2020-10-06 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-10200:
-
Component/s: Java
 Continuous Integration

> [Java][CI] Fix failure of Java CI on s390x
> --
>
> Key: ARROW-10200
> URL: https://issues.apache.org/jira/browse/ARROW-10200
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Java
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ARROW-9701 causes the following failure due to a missing library, 
> {{libprotobuf.so.18}}:
> {code:java}
> ...
> [ERROR] PROTOC FAILED: 
> /arrow/java/flight/flight-core/target/protoc-plugins/protoc-3.7.1-linux-s390_64.exe:
>  error while loading shared libraries: libprotobuf.so.18: cannot open shared 
> object file: No such file or directory
> [ERROR] /arrow/java/flight/flight-core/../../../format/Flight.proto [0:0]: 
> /arrow/java/flight/flight-core/target/protoc-plugins/protoc-3.7.1-linux-s390_64.exe:
>  error while loading shared libraries: libprotobuf.so.18: cannot open shared 
> object file: No such file or directory
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10200) [Java][CI] Fix failure of Java CI on s390x

2020-10-06 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-10200:


Assignee: Kazuaki Ishizaki

> [Java][CI] Fix failure of Java CI on s390x
> --
>
> Key: ARROW-10200
> URL: https://issues.apache.org/jira/browse/ARROW-10200
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Java
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ARROW-9701 causes the following failure due to a missing library, 
> {{libprotobuf.so.18}}:
> {code:java}
> ...
> [ERROR] PROTOC FAILED: 
> /arrow/java/flight/flight-core/target/protoc-plugins/protoc-3.7.1-linux-s390_64.exe:
>  error while loading shared libraries: libprotobuf.so.18: cannot open shared 
> object file: No such file or directory
> [ERROR] /arrow/java/flight/flight-core/../../../format/Flight.proto [0:0]: 
> /arrow/java/flight/flight-core/target/protoc-plugins/protoc-3.7.1-linux-s390_64.exe:
>  error while loading shared libraries: libprotobuf.so.18: cannot open shared 
> object file: No such file or directory
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9843) [C++] Implement Between trinary kernel

2020-10-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209182#comment-17209182
 ] 

Wes McKinney commented on ARROW-9843:
-

I would suggest only implementing two versions, {{(data: Array, left: Array, 
right: Array)}} and {{(data: Array, left: Scalar, right: Scalar)}}.

I added a missing "avoid" to the issue description. The idea is to compute 
BETWEEN as a single loop of {{(left < data[i]) && (data[i] < right)}} instead 
of doing {{AND(GREATER(data, left), LESS(data, right))}}.
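
To illustrate, a minimal sketch of that fused loop for the (data: Array, left: 
Scalar, right: Scalar) case, using plain doubles and ignoring null handling 
(the real kernel would operate on Arrow arrays and validity bitmaps):

{code:cpp}
#include <cstddef>
#include <vector>

// Fused "between": one pass, one pair of comparisons per element, and no
// intermediate boolean arrays or separate AND pass.
std::vector<bool> BetweenScalar(const std::vector<double>& data,
                                double left, double right) {
  std::vector<bool> out(data.size());
  for (std::size_t i = 0; i < data.size(); ++i) {
    out[i] = (left < data[i]) && (data[i] < right);
  }
  return out;
}
{code}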

> [C++] Implement Between trinary kernel
> --
>
> Key: ARROW-9843
> URL: https://issues.apache.org/jira/browse/ARROW-9843
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> A specialized {{between(arr, left_bound, right_bound)}} kernel would avoid 
> multiple scans and AND operation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9843) [C++] Implement Between trinary kernel

2020-10-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9843:

Description: 
A specialized {{between(arr, left_bound, right_bound)}} kernel would avoid 
multiple scans and AND operation

  was:A specialized {{between(arr, left_bound, right_bound)}} kernel would 
multiple scans and AND operation


> [C++] Implement Between trinary kernel
> --
>
> Key: ARROW-9843
> URL: https://issues.apache.org/jira/browse/ARROW-9843
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> A specialized {{between(arr, left_bound, right_bound)}} kernel would avoid 
> multiple scans and AND operation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10176) [CI] Nightly valgrind job fails

2020-10-06 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10176.

Resolution: Fixed

Issue resolved by pull request 8369
[https://github.com/apache/arrow/pull/8369]

> [CI] Nightly valgrind job fails
> ---
>
> Key: ARROW-10176
> URL: https://issues.apache.org/jira/browse/ARROW-10176
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, CI
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> https://github.com/ursa-labs/crossbow/runs/1204693039



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4960) [R] Add crossbow task for r-arrow-feedstock

2020-10-06 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn updated ARROW-4960:

Fix Version/s: 2.0.0

> [R] Add crossbow task for r-arrow-feedstock
> ---
>
> Key: ARROW-4960
> URL: https://issues.apache.org/jira/browse/ARROW-4960
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, R
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We also have an R package on conda-forge now: 
> [https://github.com/conda-forge/r-arrow-feedstock] This should be tested 
> using crossbow as we do with the other packages.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9564) [Packaging] Vendor r-arrow-feedstock conda-forge recipe

2020-10-06 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn resolved ARROW-9564.
-
Resolution: Duplicate

> [Packaging] Vendor r-arrow-feedstock conda-forge recipe
> ---
>
> Key: ARROW-9564
> URL: https://issues.apache.org/jira/browse/ARROW-9564
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Assignee: Uwe Korn
>Priority: Major
> Fix For: 3.0.0
>
>
> Since we have r-arrow on conda-forge, we should exercise it similarly to 
> arrow-cpp and pyarrow packages.
> cc [~uwe] [~npr]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4960) [R] Add crossbow task for r-arrow-feedstock

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4960:
--
Labels: pull-request-available  (was: )

> [R] Add crossbow task for r-arrow-feedstock
> ---
>
> Key: ARROW-4960
> URL: https://issues.apache.org/jira/browse/ARROW-4960
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, R
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We also have an R package on conda-forge now: 
> [https://github.com/conda-forge/r-arrow-feedstock] This should be tested 
> using crossbow as we do with the other packages.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10015) [Rust] Implement SIMD for aggregate kernel sum

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10015:
---
Labels: pull-request-available  (was: )

> [Rust] Implement SIMD for aggregate kernel sum
> --
>
> Key: ARROW-10015
> URL: https://issues.apache.org/jira/browse/ARROW-10015
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, our aggregations are made in a simple loop. However, as described 
> [here|https://rust-lang.github.io/packed_simd/perf-guide/vert-hor-ops.html], 
> horizontal operations can also be SIMDed, with reports of 2.7x speedups.
> The goal of this improvement is to support SIMD for the "sum", for primitive 
> types.
> The code to modify is in 
> [here|https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/aggregate.rs].
>  A good indication that this issue is completed is when the script
> {{cargo bench --bench aggregate_kernels && cargo bench --bench 
> aggregate_kernels --features simd}}
> yields a speed-up.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10176) [CI] Nightly valgrind job fails

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10176:
---
Labels: pull-request-available  (was: )

> [CI] Nightly valgrind job fails
> ---
>
> Key: ARROW-10176
> URL: https://issues.apache.org/jira/browse/ARROW-10176
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, CI
>Reporter: Neal Richardson
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/ursa-labs/crossbow/runs/1204693039



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10192) [C++][Python] Segfault when converting nested struct array with dictionary field to pandas series

2020-10-06 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-10192.
-
Resolution: Fixed

Issue resolved by pull request 8361
[https://github.com/apache/arrow/pull/8361]

> [C++][Python] Segfault when converting nested struct array with dictionary 
> field to pandas series
> -
>
> Key: ARROW-10192
> URL: https://issues.apache.org/jira/browse/ARROW-10192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Krisztian Szucs
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Reproducer:
> {code:python}
> def test_struct_array_with_dictionary_field_to_pandas():
> ty = pa.struct([
> pa.field('dict', pa.dictionary(pa.int64(), pa.int32())),
> ])
> data = [
> {'dict': -1859762450}
> ]
> arr = pa.array(data, type=ty)
> arr.to_pandas()
> {code}
> Raises SIGSTOP:
> {code}
> * thread #1, stop reason = signal SIGSTOP
>   * frame #0: 0x7fff6e2b733a libsystem_kernel.dylib`__pthread_kill + 10
> frame #1: 0x7fff6e373e60 libsystem_pthread.dylib`pthread_kill + 430
> frame #2: 0x7fff6e1ce93e libsystem_c.dylib`raise + 26
> frame #3: 0x7fff6e3685fd libsystem_platform.dylib`_sigtramp + 29
> frame #4: 0x00011517adfd 
> libarrow_python.200.0.0.dylib`arrow::py::ConvertStruct(options=0x7f84fc5a0230,
>  data=0x7f84fc59ef18, out_values=0x7f84fc53d140) at 
> arrow_to_pandas.cc:685:54
> frame #5: 0x00011514c642 
> libarrow_python.200.0.0.dylib`arrow::py::ObjectWriterVisitor::Visit(this=0x7ffee06a1a88,
>  type=0x7f84fc5a00e8) at arrow_to_pandas.cc:1031:12
> frame #6: 0x0001151499c4 libarrow_python.200.0.0.dylib`arrow::Status 
> arrow::VisitTypeInline(type=0x7f84fc5a00e8,
>  visitor=0x7ffee06a1a88) at visitor_inline.h:88:5
> frame #7: 0x000115149305 
> libarrow_python.200.0.0.dylib`arrow::py::ObjectWriter::CopyInto(this=0x7f84fc5a0228,
>  data=std::__1::shared_ptr::element_type @ 
> 0x7f84fc59ef18 strong=2 weak=1, rel_placement=0) at arrow_to_pand
> as.cc:1055:12
> {code}
> {code:cpp}
> frame #4: 0x00011517adfd 
> libarrow_python.200.0.0.dylib`arrow::py::ConvertStruct(options=0x7f84fc5a0230,
>  data=0x7f84fc59ef18, out_values=0x7f84fc53d140) at 
> arrow_to_pandas.cc:685:54
>682if (!arr->field(static_cast(field_idx))->IsNull(i)) {
>683  // Value exists in child array, obtain it
>684  auto array = 
> reinterpret_cast(fields_data[field_idx].obj());
> -> 685  auto ptr = reinterpret_cast char*>(PyArray_GETPTR1(array, i));
>686  field_value.reset(PyArray_GETITEM(array, ptr));
>687  RETURN_IF_PYERROR();
>688} else {
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10183) Create a ForEach library function that runs on an iterator of futures

2020-10-06 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209110#comment-17209110
 ] 

Antoine Pitrou commented on ARROW-10183:


Just for the record, we already have an {{AsCompleted}} iterator. We could have 
a variant that takes an iterator of futures rather than a vector of futures.

> Create a ForEach library function that runs on an iterator of futures
> -
>
> Key: ARROW-10183
> URL: https://issues.apache.org/jira/browse/ARROW-10183
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> This method should take in an iterator of futures and a callback and pull an 
> item off the iterator, "await" it, run the callback on it, and then fetch the 
> next item from the iterator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10183) Create a ForEach library function that runs on an iterator of futures

2020-10-06 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10183:
---
Component/s: C++

> Create a ForEach library function that runs on an iterator of futures
> -
>
> Key: ARROW-10183
> URL: https://issues.apache.org/jira/browse/ARROW-10183
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> This method should take in an iterator of futures and a callback and pull an 
> item off the iterator, "await" it, run the callback on it, and then fetch the 
> next item from the iterator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10183) Create a ForEach library function that runs on an iterator of futures

2020-10-06 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209104#comment-17209104
 ] 

Weston Pace commented on ARROW-10183:
-

This is also a bit of a mental investigation on my part to be sure this can be 
done without exploding the stack.  Since this is essentially 
iterator.next().then(iterator.next().then(iterator.next().then(...  My 
understanding is that it can, and there are numerous articles on continuations 
and avoiding stack busting while doing this kind of thing.  I have yet to 
synthesize all that knowledge and put it into practice.

> Create a ForEach library function that runs on an iterator of futures
> -
>
> Key: ARROW-10183
> URL: https://issues.apache.org/jira/browse/ARROW-10183
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Weston Pace
>Priority: Major
>
> This method should take in an iterator of futures and a callback and pull an 
> item off the iterator, "await" it, run the callback on it, and then fetch the 
> next item from the iterator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10183) Create a ForEach library function that runs on an iterator of futures

2020-10-06 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-10183:

Description: This method should take in an iterator of futures and a 
callback and pull an item off the iterator, "await" it, run the callback on it, 
and then fetch the next item from the iterator.  (was: This method should take 
in an iterator and spawn N threads to pull items off the iterator and start 
working on them.  It should return a future which will complete when all N 
threads have finished and the iterator is exhausted.)

> Create a ForEach library function that runs on an iterator of futures
> -
>
> Key: ARROW-10183
> URL: https://issues.apache.org/jira/browse/ARROW-10183
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Weston Pace
>Priority: Major
>
> This method should take in an iterator of futures and a callback and pull an 
> item off the iterator, "await" it, run the callback on it, and then fetch the 
> next item from the iterator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10183) Create a ForEach library function that runs on an iterator of futures

2020-10-06 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-10183:

Summary: Create a ForEach library function that runs on an iterator of 
futures  (was: Create an async ParallelForEach that runs on an iterator)

> Create a ForEach library function that runs on an iterator of futures
> -
>
> Key: ARROW-10183
> URL: https://issues.apache.org/jira/browse/ARROW-10183
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Weston Pace
>Priority: Major
>
> This method should take in an iterator and spawn N threads to pull items off 
> the iterator and start working on them.  It should return a future which will 
> complete when all N threads have finished and the iterator is exhausted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10183) Create an async ParallelForEach that runs on an iterator

2020-10-06 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209101#comment-17209101
 ] 

Weston Pace commented on ARROW-10183:
-

Not spawning a thread pool, it would take a thread pool as an argument.  The 
current use case I'm looking at is the CSV reader.

The current implementation is:
 * One thread (outside the thread pool, let's call it the I/O thread) reads 
from an input stream, up to X blocks in advance, and places the blocks in a 
blocking queue.
 * Another thread (calling thread, may be a thread pool thread, let's call it 
the parse thread) takes blocks off of the blocking queue (which it sees as an 
iterator) and creates thread pool tasks for conversion.  This step will block 
if I/O is slow.
 * The thread pool threads (conversion tasks) then do the conversion, possibly 
making new conversion tasks which are added to the thread pool.
 * Once the parsing thread is done reading the iterator it blocks until the 
conversion tasks have finished.

The goal is to change the parsing thread so it is no longer blocking, as it may 
be a thread pool thread, and if it is blocking it shouldn't tie up the thread.  
I can keep the dedicated I/O thread since it is outside the thread pool.

This changes the I/O thread from an iterator of Block to an iterator of 
Future<Block>.

Converting the parse thread is a little trickier.  It currently is:

{code}
iterator = StartIterator();

for each block in iterator:
    ParseBlockAndCreateConversionTasks();

WaitForConversionTasks();
{code}

The "for each" part is a little trickier with a generator that returns 
promises.  This task is aiming to replace that bit.

Now that I think it all through like this I suppose the "parallel for each" and 
"N threads" wording is not needed.  This is a natural point to allow for 
concurrency (e.g. allowing up to N parse threads).  However, the original 
implementation had only the single parse thread so I don't need to introduce it 
here.  I'll go ahead and strip that from the task and start with just a basic 
serial for each.  Even with a serial for-each there is a need for a common 
library function that, given an iterator of futures, and a function, applies 
the function to each of the items in the iterator.
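
To make the shape of that library function concrete, here is a deliberately 
simplified C++ sketch. It models the iterator of futures as a callable 
returning an empty optional when exhausted, and "awaits" by blocking on 
std::future; Arrow's real version would chain non-blocking continuations 
instead, which is exactly where the stack-depth question above comes in:

{code:cpp}
#include <cstdio>
#include <functional>
#include <future>
#include <optional>

// Serial ForEach over an "iterator of futures": pull a future, await it,
// apply the visitor, repeat. Written as a flat loop, so the stack does not
// grow with the number of items (unlike a naive next().then(next().then(...))
// chain of continuations).
template <typename T>
void ForEachFuture(std::function<std::optional<std::future<T>>()> next,
                   std::function<void(const T&)> visit) {
  while (auto fut = next()) {
    T item = fut->get();  // blocking "await"; a real version would use Then()
    visit(item);
  }
}

int main() {
  int i = 0;
  // Toy source: three already-completed futures carrying 0, 1, 2.
  auto next = [&]() -> std::optional<std::future<int>> {
    if (i >= 3) return std::nullopt;
    std::promise<int> p;
    p.set_value(i++);
    return p.get_future();
  };
  ForEachFuture<int>(next, [](const int& v) { std::printf("%d\n", v); });
  return 0;
}
{code}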

 

> Create an async ParallelForEach that runs on an iterator
> 
>
> Key: ARROW-10183
> URL: https://issues.apache.org/jira/browse/ARROW-10183
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Weston Pace
>Priority: Major
>
> This method should take in an iterator and spawn N threads to pull items off 
> the iterator and start working on them.  It should return a future which will 
> complete when all N threads have finished and the iterator is exhausted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10200) [Java][CI] Fix failure of Java CI on s390x

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10200:
---
Labels: pull-request-available  (was: )

> [Java][CI] Fix failure of Java CI on s390x
> --
>
> Key: ARROW-10200
> URL: https://issues.apache.org/jira/browse/ARROW-10200
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-9701 causes the following failure due to a missing library, 
> {{libprotobuf.so.18}}:
> {code:java}
> ...
> [ERROR] PROTOC FAILED: 
> /arrow/java/flight/flight-core/target/protoc-plugins/protoc-3.7.1-linux-s390_64.exe:
>  error while loading shared libraries: libprotobuf.so.18: cannot open shared 
> object file: No such file or directory
> [ERROR] /arrow/java/flight/flight-core/../../../format/Flight.proto [0:0]: 
> /arrow/java/flight/flight-core/target/protoc-plugins/protoc-3.7.1-linux-s390_64.exe:
>  error while loading shared libraries: libprotobuf.so.18: cannot open shared 
> object file: No such file or directory
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10200) [Java][CI] Fix failure of Java CI on s390x

2020-10-06 Thread Kazuaki Ishizaki (Jira)
Kazuaki Ishizaki created ARROW-10200:


 Summary: [Java][CI] Fix failure of Java CI on s390x
 Key: ARROW-10200
 URL: https://issues.apache.org/jira/browse/ARROW-10200
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Kazuaki Ishizaki


ARROW-9701 causes the following failure due to a missing library, 
{{libprotobuf.so.18}}:
{code:java}
...
[ERROR] PROTOC FAILED: 
/arrow/java/flight/flight-core/target/protoc-plugins/protoc-3.7.1-linux-s390_64.exe:
 error while loading shared libraries: libprotobuf.so.18: cannot open shared 
object file: No such file or directory
[ERROR] /arrow/java/flight/flight-core/../../../format/Flight.proto [0:0]: 
/arrow/java/flight/flight-core/target/protoc-plugins/protoc-3.7.1-linux-s390_64.exe:
 error while loading shared libraries: libprotobuf.so.18: cannot open shared 
object file: No such file or directory
...
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10099) [C++][Dataset] Also allow integer partition fields to be dictionary encoded

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10099:
---
Labels: dataset dataset-dask-integration pull-request-available  (was: 
dataset dataset-dask-integration)

> [C++][Dataset] Also allow integer partition fields to be dictionary encoded
> ---
>
> Key: ARROW-10099
> URL: https://issues.apache.org/jira/browse/ARROW-10099
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In ARROW-8647, we added the option to indicate that your partition field 
> columns should be dictionary encoded, but it currently only does this for the 
> string type, and not for integer types (with the reasoning that for integers, 
> dictionary encoding does not give any memory efficiency gains). 
> In dask, they have been using categorical dtypes for _all_ partition fields, 
> even if they are integers. They would like to keep doing this (apart from 
> memory efficiency, using a categorical/dictionary type also gives information 
> about all unique values of the column, without having to calculate this), so 
> it would be nice to enable this use case. 
> So I think we could either simply always dictionary encode integers as well 
> when {{max_partition_dictionary_size}} indicates partition fields should be 
> dictionary encoded, or have an additional option to indicate that integer 
> partition fields should also be encoded (if the other option indicates 
> dictionary encoding should be used).
> Based on feedback from the dask PR using the dataset API at 
> https://github.com/dask/dask/pull/6534#issuecomment-698723009
> cc [~rjzamora] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10199) [Rust][Parquet] Release Parquet at crates.io to remove debug prints

2020-10-06 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krzysztof Stanisławek updated ARROW-10199:
--
Description: 
Version of Parquet released to docs.rs & crates.io has debug prints in 
[https://github.com/apache/arrow/blob/886d87bdea78ce80e39a4b5b6fd6ca6042474c5f/rust/parquet/src/column/writer.rs#L60].
 They were pretty hard to track down, so I suggest considering a logging crate 
in the future. When is the new version going to be released? Is there some 
stable schedule I can expect?

Is it recommended to use the current snapshot straight from github instead of 
crates.io?

  was:
Version of Parquet released to docs.rs & crates.io has debug prints in 
[https://docs.rs/crate/parquet/1.0.1/source/src/column/writer.rs] (line 60). 
They were pretty hard to track down, so I suggest considering logging create in 
the future. When is the new version going to be released? Is there some stable 
schedule I can expect?

Is it recommended to use the current snapshot straight from github instead of 
crates.io?


> [Rust][Parquet] Release Parquet at crates.io to remove debug prints
> ---
>
> Key: ARROW-10199
> URL: https://issues.apache.org/jira/browse/ARROW-10199
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Krzysztof Stanisławek
>Priority: Critical
>
> Version of Parquet released to docs.rs & crates.io has debug prints in 
> [https://github.com/apache/arrow/blob/886d87bdea78ce80e39a4b5b6fd6ca6042474c5f/rust/parquet/src/column/writer.rs#L60].
>  They were pretty hard to track down, so I suggest considering a logging crate 
> in the future. When is the new version going to be released? Is there some 
> stable schedule I can expect?
> Is it recommended to use the current snapshot straight from github instead of 
> crates.io?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10199) [Rust][Parquet] Release Parquet at crates.io to remove debug prints

2020-10-06 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krzysztof Stanisławek updated ARROW-10199:
--
Description: 
Version of Parquet released to docs.rs & crates.io has debug prints in 
[https://docs.rs/crate/parquet/1.0.1/source/src/column/writer.rs] (line 60). 
They were pretty hard to track down, so I suggest considering a logging crate in 
the future. When is the new version going to be released? Is there some stable 
schedule I can expect?

Is it recommended to use the current snapshot straight from github instead of 
crates.io?

  was:
Version of Parquet released to docs.rs & crates.io has debug prints in 
[https://docs.rs/crate/parquet/1.0.1/source/src/column/writer.rs] (line 30). 
They were pretty hard to track down, so I suggest considering logging create in 
the future. When is the new version going to be released? Is there some stable 
schedule I can expect?

Is it recommended to use the current snapshot straight from github instead of 
crates.io?


> [Rust][Parquet] Release Parquet at crates.io to remove debug prints
> ---
>
> Key: ARROW-10199
> URL: https://issues.apache.org/jira/browse/ARROW-10199
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Krzysztof Stanisławek
>Priority: Critical
>
> Version of Parquet released to docs.rs & crates.io has debug prints in 
> [https://docs.rs/crate/parquet/1.0.1/source/src/column/writer.rs] (line 60). 
> They were pretty hard to track down, so I suggest considering a logging crate 
> in the future. When is the new version going to be released? Is there some 
> stable schedule I can expect?
> Is it recommended to use the current snapshot straight from github instead of 
> crates.io?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9586) [FlightRPC][Java] Allow using a per-call Arrow allocator

2020-10-06 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-9586:

Fix Version/s: (was: 2.0.0)
   3.0.0

> [FlightRPC][Java] Allow using a per-call Arrow allocator
> 
>
> Key: ARROW-9586
> URL: https://issues.apache.org/jira/browse/ARROW-9586
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We've been running into issues with Flight and gRPC leaking direct memory at 
> scale. One thing we'd like to do is have a (child) allocator per DoGet/DoPut 
> call, so we can more accurately track memory usage. We have a candidate 
> implementation that is rather messy, but can be upstreamed as part of 
> flight-grpc.
> This also requires changes to _ensure_ all Arrow resources are cleaned up 
> before we notify gRPC that the call has finished.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10199) [Rust][Parquet] Release Parquet at crates.io to remove debug prints

2020-10-06 Thread Jira
Krzysztof Stanisławek created ARROW-10199:
-

 Summary: [Rust][Parquet] Release Parquet at crates.io to remove 
debug prints
 Key: ARROW-10199
 URL: https://issues.apache.org/jira/browse/ARROW-10199
 Project: Apache Arrow
  Issue Type: Wish
  Components: Rust
Affects Versions: 1.0.1
Reporter: Krzysztof Stanisławek


Version of Parquet released to docs.rs & crates.io has debug prints in 
[https://docs.rs/crate/parquet/1.0.1/source/src/column/writer.rs] (line 30). 
They were pretty hard to track down, so I suggest considering a logging crate in 
the future. When is the new version going to be released? Is there some stable 
schedule I can expect?

Is it recommended to use the current snapshot straight from github instead of 
crates.io?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9279) [C++] Implement PrettyPrint for Scalars

2020-10-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9279:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [C++] Implement PrettyPrint for Scalars
> ---
>
> Key: ARROW-9279
> URL: https://issues.apache.org/jira/browse/ARROW-9279
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> It would be useful, especially for nested scalar objects.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10177) [CI][Gandiva] Nightly gandiva-jar-xenial fails

2020-10-06 Thread Projjal Chanda (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Projjal Chanda reassigned ARROW-10177:
--

Assignee: Sagnik Chakraborty

> [CI][Gandiva] Nightly gandiva-jar-xenial fails
> --
>
> Key: ARROW-10177
> URL: https://issues.apache.org/jira/browse/ARROW-10177
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Continuous Integration
>Reporter: Neal Richardson
>Assignee: Sagnik Chakraborty
>Priority: Major
> Fix For: 2.0.0
>
>
> The following tests FAILED:
>27 - gandiva-projector-test (Failed)
>42 - gandiva-projector-test-static (Failed)
> https://travis-ci.org/github/ursa-labs/crossbow/builds/732659880



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9414) [C++] apt package includes headers for S3 interface, but no support

2020-10-06 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208996#comment-17208996
 ] 

Neal Richardson commented on ARROW-9414:


[~kou] are you planning on doing anything with this for 2.0?

> [C++] apt package includes headers for S3 interface, but no support
> ---
>
> Key: ARROW-9414
> URL: https://issues.apache.org/jira/browse/ARROW-9414
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.17.1
> Environment: Ubuntu 18.04.04 LTS
>Reporter: Simon Bertron
>Assignee: Kouhei Sutou
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: test.cpp
>
>
> I believe that the apt package is built without S3 support. But s3fs.h is 
> exported in filesystem/api.h anyway. This creates undefined reference errors 
> when trying to link to the package.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10105) [FlightRPC] Add client option to disable certificate validation with TLS

2020-10-06 Thread James Duong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208994#comment-17208994
 ] 

James Duong edited comment on ARROW-10105 at 10/6/20, 5:43 PM:
---

[~lidavidm] Update - I was able to get gRPC 1.32 to build on CentOS 5. In 
1.32, they essentially collapsed the logic that was for manylinux1 into the 
regular Linux case, but this seems to cause build failures on CentOS 5 in 
practice.

I've reproduced the macro definitions that were used for manylinux1 prior to 
1.32 and now gRPC compiles.
What fails now is when linking libarrow_flight.so:
ImportError: /arrow/python/pyarrow/libarrow_flight.so.200: undefined symbol: 
_ZN3re23RE2C1ERKSs
When demangled this is
ImportError: /arrow/python/pyarrow/libarrow_flight.so.200: undefined symbol: 
re2::RE2::RE2(std::string const&)

gRPC 1.32 added a dependency on RE2. I've added RE2 to Flight's CMakeLists, but 
that hasn't fixed this problem.
https://github.com/apache/arrow/blob/8ce02f7d5bd8d7cb732406af26bdc3b9481b/cpp/src/arrow/flight/CMakeLists.txt#L23

I also don't understand why the Ursa builds do not seem to get gRPC 1.32.


was (Author: jduong):
Update - I was able to get gRPC 1.32 to build on CentOS 5. In 1.32, they 
essentially collapsed the logic that was for manylinux1 into the regular Linux 
case, but this seems to cause build failures on CentOS 5 in practice.

I've reproduced the macro definitions that were used for manylinux1 prior to 
1.32 and now gRPC compiles.
What fails now is when linking libarrow_flight.so:
ImportError: /arrow/python/pyarrow/libarrow_flight.so.200: undefined symbol: 
_ZN3re23RE2C1ERKSs
When demangled this is
ImportError: /arrow/python/pyarrow/libarrow_flight.so.200: undefined symbol: 
re2::RE2::RE2(std::string const&)

gRPC 1.32 added a dependency on RE2. I've added RE2 to Flight's CMakeLists, but 
that hasn't fixed this problem.
https://github.com/apache/arrow/blob/8ce02f7d5bd8d7cb732406af26bdc3b9481b/cpp/src/arrow/flight/CMakeLists.txt#L23

I also don't understand why the Ursa builds do not seem to get gRPC 1.32.

> [FlightRPC] Add client option to disable certificate validation with TLS
> 
>
> Key: ARROW-10105
> URL: https://issues.apache.org/jira/browse/ARROW-10105
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, FlightRPC, Java, Python
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Users of Flight may want to disable certificate validation if they want to 
> only use encryption. A use case might be that the Flight server uses a 
> self-signed certificate and doesn't distribute a certificate for clients to 
> use.
> This feature would be to add an explicit option to FlightClient.Builder to 
> disable certificate validation. Note that this should not happen implicitly 
> if a client uses a TLS location, but does not set a certificate. The client 
> should explicitly set this option so that they are fully aware that they are 
> making a connection with reduced security.
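As a purely hypothetical sketch of what the requested opt-in could look like 
from a Java client (the {{verifyServer(false)}} method name is an assumption 
for illustration, not a confirmed API):

{code:java}
import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.Location;
import org.apache.arrow.memory.RootAllocator;

// TLS stays enabled, but certificate validation is explicitly opted out of.
// verifyServer(false) is a hypothetical name for the proposed option.
FlightClient client = FlightClient.builder()
    .allocator(new RootAllocator(Long.MAX_VALUE))
    .location(Location.forGrpcTls("flight.example.com", 8815))
    .useTls()
    .verifyServer(false)  // must be explicit, never implied by a missing cert
    .build();
{code}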



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10105) [FlightRPC] Add client option to disable certificate validation with TLS

2020-10-06 Thread James Duong (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208994#comment-17208994
 ] 

James Duong commented on ARROW-10105:
-

Update - I was able to get gRPC 1.32 to build on CentOS 5. In 1.32, they 
essentially collapsed the logic that was for manylinux1 into the regular Linux 
case, but this seems to cause build failures on CentOS 5 in practice.

I've reproduced the macro definitions that were used for manylinux1 prior to 
1.32 and now gRPC compiles.
What fails now is when linking libarrow_flight.so:
ImportError: /arrow/python/pyarrow/libarrow_flight.so.200: undefined symbol: 
_ZN3re23RE2C1ERKSs
When demangled this is
ImportError: /arrow/python/pyarrow/libarrow_flight.so.200: undefined symbol: 
re2::RE2::RE2(std::string const&)

gRPC 1.32 added a dependency on RE2. I've added RE2 to Flight's CMakeLists, but 
that hasn't fixed this problem.
https://github.com/apache/arrow/blob/8ce02f7d5bd8d7cb732406af26bdc3b9481b/cpp/src/arrow/flight/CMakeLists.txt#L23

I also don't understand why the Ursa builds do not seem to get gRPC 1.32.

> [FlightRPC] Add client option to disable certificate validation with TLS
> 
>
> Key: ARROW-10105
> URL: https://issues.apache.org/jira/browse/ARROW-10105
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, FlightRPC, Java, Python
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Users of Flight may want to disable certificate validation if they want to 
> only use encryption. A use case might be that the Flight server uses a 
> self-signed certificate and doesn't distribute a certificate for clients to 
> use.
> This feature would be to add an explicit option to FlightClient.Builder to 
> disable certificate validation. Note that this should not happen implicitly 
> if a client uses a TLS location, but does not set a certificate. The client 
> should explicitly set this option so that they are fully aware that they are 
> making a connection with reduced security.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8672) [Java] Implement RecordBatch IPC buffer compression from ARROW-300

2020-10-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-8672:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Java] Implement RecordBatch IPC buffer compression from ARROW-300
> --
>
> Key: ARROW-8672
> URL: https://issues.apache.org/jira/browse/ARROW-8672
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Wes McKinney
>Assignee: Liya Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10193) [Python] Segfault when converting to fixed size binary array

2020-10-06 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-10193.
-
Resolution: Fixed

Issue resolved by pull request 8360
[https://github.com/apache/arrow/pull/8360]

> [Python] Segfault when converting to fixed size binary array
> 
>
> Key: ARROW-10193
> URL: https://issues.apache.org/jira/browse/ARROW-10193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Reproducer:
> {code:python}
> data = [b'\x19h\r\x9e\x00\x00\x00\x00\x01\x9b\x9fA']
> assert len(data[0]) == 12
> ty = pa.binary(12)
> arr = pa.array(data, type=ty)
> {code}
> Trace:
> {code}
> pyarrow/tests/test_convert_builtin.py::test_fixed_size_binary_length_check 
> ../src/arrow/array/builder_binary.cc:53:  Check failed: (size) == 
> (byte_width_) Appending wrong size to FixedSizeBinaryBuilder
> 0   libarrow.200.0.0.dylib  0x00010e7f9704 
> _ZN5arrow4util7CerrLog14PrintBackTraceEv + 52
> 1   libarrow.200.0.0.dylib  0x00010e7f9622 
> _ZN5arrow4util7CerrLogD2Ev + 98
> 2   libarrow.200.0.0.dylib  0x00010e7f9585 
> _ZN5arrow4util7CerrLogD1Ev + 21
> 3   libarrow.200.0.0.dylib  0x00010e7f95ac 
> _ZN5arrow4util7CerrLogD0Ev + 28
> 4   libarrow.200.0.0.dylib  0x00010e7f9492 
> _ZN5arrow4util8ArrowLogD2Ev + 82
> 5   libarrow.200.0.0.dylib  0x00010e7f94c5 
> _ZN5arrow4util8ArrowLogD1Ev + 21
> 6   libarrow.200.0.0.dylib  0x00010e303ec1 
> _ZN5arrow22FixedSizeBinaryBuilder14CheckValueSizeEx + 209
> 7   libarrow.200.0.0.dylib  0x00010e30c361 
> _ZN5arrow22FixedSizeBinaryBuilder12UnsafeAppendEN6nonstd7sv_lite17basic_string_viewIcNSt3__111char_traitsIc
>  + 49
> 8   libarrow_python.200.0.0.dylib   0x00010b4efa7d 
> _ZN5arrow2py20PyPrimitiveConverterINS_19FixedSizeBinaryTypeEvE6AppendEP7_object
>  + 813
> {code}
> The input {{const char*}} value gets implicitly cast to {{string_view}}, which 
> makes the length check fail in debug builds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9607) [C++][Gandiva] Add bitwise_and(), bitwise_or() and bitwise_not() functions for integers

2020-10-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9607:
--

Assignee: Sagnik Chakraborty

> [C++][Gandiva] Add bitwise_and(), bitwise_or() and bitwise_not() functions 
> for integers
> ---
>
> Key: ARROW-9607
> URL: https://issues.apache.org/jira/browse/ARROW-9607
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Sagnik Chakraborty
>Assignee: Sagnik Chakraborty
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9099) [C++][Gandiva] Add TRIM function for string

2020-10-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9099:
--

Assignee: Sagnik Chakraborty

> [C++][Gandiva] Add TRIM function for string
> ---
>
> Key: ARROW-9099
> URL: https://issues.apache.org/jira/browse/ARROW-9099
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Sagnik Chakraborty
>Assignee: Sagnik Chakraborty
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9795) [C++][Gandiva] Implement castTIMESTAMP(int64) in Gandiva

2020-10-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9795:
--

Assignee: Sagnik Chakraborty

> [C++][Gandiva] Implement castTIMESTAMP(int64) in Gandiva
> 
>
> Key: ARROW-9795
> URL: https://issues.apache.org/jira/browse/ARROW-9795
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Sagnik Chakraborty
>Assignee: Sagnik Chakraborty
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9501) [C++][Gandiva] Add logic in timestampdiff() when end date is last day of a month

2020-10-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9501:
--

Assignee: Sagnik Chakraborty

> [C++][Gandiva] Add logic in timestampdiff() when end date is last day of a 
> month
> 
>
> Key: ARROW-9501
> URL: https://issues.apache.org/jira/browse/ARROW-9501
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Sagnik Chakraborty
>Assignee: Sagnik Chakraborty
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> {{timestampdiff}}(*month*, _startDate_, _endDate_) returns a wrong result in 
> Gandiva when the day of _endDate_ is earlier than the day of _startDate_ and 
> _endDate_ is the last day of its month. An additional month is counted when 
> the end day is greater than or equal to the start day, but this rule does not 
> hold for dates that are the last day of the month.
> Case in point, if _startDate_ = *2020-01-31*, _endDate_ = *2020-02-29*, 
> previously {{timestampdiff}}() returned *0*, but the correct result should be 
> *1*.
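For clarity, a minimal sketch of the corrected counting rule in plain Java 
(using {{java.time}} purely for illustration; this is not Gandiva's 
implementation):

{code:java}
import java.time.LocalDate;

// An extra month is counted when the end day >= start day, or when the end
// date is the last day of its month (the case fixed by this issue).
static int monthsBetween(LocalDate start, LocalDate end) {
  int months = (end.getYear() - start.getYear()) * 12
      + (end.getMonthValue() - start.getMonthValue());
  boolean endIsLastDayOfMonth = end.getDayOfMonth() == end.lengthOfMonth();
  if (end.getDayOfMonth() < start.getDayOfMonth() && !endIsLastDayOfMonth) {
    months -= 1;  // a full calendar month has not elapsed yet
  }
  return months;
}

// monthsBetween(LocalDate.of(2020, 1, 31), LocalDate.of(2020, 2, 29)) == 1
{code}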



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9328) [C++][Gandiva] Add LTRIM, RTRIM, BTRIM functions for string

2020-10-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9328:
--

Assignee: Sagnik Chakraborty

> [C++][Gandiva] Add LTRIM, RTRIM, BTRIM functions for string
> ---
>
> Key: ARROW-9328
> URL: https://issues.apache.org/jira/browse/ARROW-9328
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Sagnik Chakraborty
>Assignee: Sagnik Chakraborty
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9641) [C++][Gandiva] Implement round() for floating point and double floating point numbers

2020-10-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9641:
--

Assignee: Sagnik Chakraborty

> [C++][Gandiva] Implement round() for floating point and double floating point 
> numbers
> -
>
> Key: ARROW-9641
> URL: https://issues.apache.org/jira/browse/ARROW-9641
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Sagnik Chakraborty
>Assignee: Sagnik Chakraborty
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9640) [C++][Gandiva] Implement round() for integers and long integers

2020-10-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9640:
--

Assignee: Sagnik Chakraborty

> [C++][Gandiva] Implement round() for integers and long integers
> ---
>
> Key: ARROW-9640
> URL: https://issues.apache.org/jira/browse/ARROW-9640
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Sagnik Chakraborty
>Assignee: Sagnik Chakraborty
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10177) [CI][Gandiva] Nightly gandiva-jar-xenial fails

2020-10-06 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208990#comment-17208990
 ] 

Neal Richardson commented on ARROW-10177:
-

FYI [~sagnikc] [~projjal] [~praveenbingo]

> [CI][Gandiva] Nightly gandiva-jar-xenial fails
> --
>
> Key: ARROW-10177
> URL: https://issues.apache.org/jira/browse/ARROW-10177
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Continuous Integration
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> The following tests FAILED:
>27 - gandiva-projector-test (Failed)
>42 - gandiva-projector-test-static (Failed)
> https://travis-ci.org/github/ursa-labs/crossbow/builds/732659880



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10050) [C++][Gandiva] Implement concat() in Gandiva for up to 10 arguments

2020-10-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-10050:
---

Assignee: Sagnik Chakraborty

> [C++][Gandiva] Implement concat() in Gandiva for up to 10 arguments
> ---
>
> Key: ARROW-10050
> URL: https://issues.apache.org/jira/browse/ARROW-10050
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Sagnik Chakraborty
>Assignee: Sagnik Chakraborty
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10120) [C++][Parquet] Create reading benchmarks for 2-level nested data

2020-10-06 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10120.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8342
[https://github.com/apache/arrow/pull/8342]

> [C++][Parquet] Create reading benchmarks for 2-level nested data
> 
>
> Key: ARROW-10120
> URL: https://issues.apache.org/jira/browse/ARROW-10120
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We already have benchmarks for reading one-level list and one-level struct. 
> It would be nice to add list-of-list, list-of-struct, struct-of-struct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9943) [C++] Arrow metadata not applied recursively when reading Parquet file

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9943:
--
Labels: pull-request-available  (was: )

> [C++] Arrow metadata not applied recursively when reading Parquet file
> --
>
> Key: ARROW-9943
> URL: https://issues.apache.org/jira/browse/ARROW-9943
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, {{ApplyOriginalMetadata}} in {{src/parquet/arrow/schema.cc}} is 
> only applied for the top-level node of each schema field. Nested metadata 
> (such as dicts-inside-lists, etc.) will not be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10197) [Gandiva][python] Execute expression on filtered data

2020-10-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10197:
-
Fix Version/s: 3.0.0

> [Gandiva][python] Execute expression on filtered data
> -
>
> Key: ARROW-10197
> URL: https://issues.apache.org/jira/browse/ARROW-10197
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva, Python
>Reporter: Kirill Lykov
>Priority: Major
> Fix For: 3.0.0
>
>
> Looks like there is no way to execute an expression on filtered data in 
> python. 
>  Basically, I cannot pass `SelectionVector` to projector's `evaluate` method
> ```python
>  import pyarrow as pa
>  import pyarrow.gandiva as gandiva
> table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]),
>                                    pa.array([5., 45., 36., 73.,
>                                              83., 23., 76.])],
>                                   ['a', 'b'])
> builder = gandiva.TreeExprBuilder()
>  node_a = builder.make_field(table.schema.field("a"))
>  node_b = builder.make_field(table.schema.field("b"))
>  fifty = builder.make_literal(50.0, pa.float64())
>  eleven = builder.make_literal(11.0, pa.float64())
> cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_())
>  cond_2 = builder.make_function("greater_than", [node_a, node_b],
>                                     pa.bool_())
>  cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_())
>  cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3])
>  condition = builder.make_condition(cond)
> filter = gandiva.make_filter(table.schema, condition)
>  filterResult = filter.evaluate(table.to_batches()[0], 
> pa.default_memory_pool()) --> filterResult has type SelectionVector
>  print(result)
> sum = builder.make_function("add", [node_a, node_b], pa.float64())
>  field_result = pa.field("c", pa.float64())
>  expr = builder.make_expression(sum, field_result)
>  projector = gandiva.make_projector(
>  table.schema, [expr], pa.default_memory_pool())
> r, = projector.evaluate(table.to_batches()[0], result) --> Here there is a 
> problem that I don't know how to use filterResult with projector
>  ```
> In C++, I see that it is possible to pass SelectionVector as second argument 
> to projector::Evaluate: 
> [https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270]
>   
>  Meanwhile, it looks like it is impossible in `gandiva.pyx`: 
> [https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10197) [Gandiva][python] Execute expression on filtered data

2020-10-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10197:
-
Priority: Major  (was: Trivial)

> [Gandiva][python] Execute expression on filtered data
> -
>
> Key: ARROW-10197
> URL: https://issues.apache.org/jira/browse/ARROW-10197
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva, Python
>Reporter: Kirill Lykov
>Priority: Major
>
> Looks like there is no way to execute an expression on filtered data in 
> python. 
>  Basically, I cannot pass `SelectionVector` to projector's `evaluate` method
> ```python
>  import pyarrow as pa
>  import pyarrow.gandiva as gandiva
> table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]),
>                                    pa.array([5., 45., 36., 73.,
>                                              83., 23., 76.])],
>                                   ['a', 'b'])
> builder = gandiva.TreeExprBuilder()
>  node_a = builder.make_field(table.schema.field("a"))
>  node_b = builder.make_field(table.schema.field("b"))
>  fifty = builder.make_literal(50.0, pa.float64())
>  eleven = builder.make_literal(11.0, pa.float64())
> cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_())
>  cond_2 = builder.make_function("greater_than", [node_a, node_b],
>                                     pa.bool_())
>  cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_())
>  cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3])
>  condition = builder.make_condition(cond)
> filter = gandiva.make_filter(table.schema, condition)
>  filterResult = filter.evaluate(table.to_batches()[0], 
> pa.default_memory_pool()) --> filterResult has type SelectionVector
>  print(result)
> sum = builder.make_function("add", [node_a, node_b], pa.float64())
>  field_result = pa.field("c", pa.float64())
>  expr = builder.make_expression(sum, field_result)
>  projector = gandiva.make_projector(
>  table.schema, [expr], pa.default_memory_pool())
> r, = projector.evaluate(table.to_batches()[0], result) --> Here there is a 
> problem that I don't know how to use filterResult with projector
>  ```
> In C++, I see that it is possible to pass SelectionVector as second argument 
> to projector::Evaluate: 
> [https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270]
>   
>  Meanwhile, it looks like it is impossible in `gandiva.pyx`: 
> [https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10198) [Dev] Python merge script doesn't close PRs if not merged on master

2020-10-06 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10198:
--

 Summary: [Dev] Python merge script doesn't close PRs if not merged 
on master
 Key: ARROW-10198
 URL: https://issues.apache.org/jira/browse/ARROW-10198
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Affects Versions: 1.0.1
Reporter: Neville Dipale


When using the merge script to merge PRs against non-master branches, the PR on 
GitHub doesn't get closed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6582) [R] Arrow to R fails with embedded nuls in strings

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6582:
--
Labels: pull-request-available  (was: )

> [R] Arrow to R fails with embedded nuls in strings
> --
>
> Key: ARROW-6582
> URL: https://issues.apache.org/jira/browse/ARROW-6582
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.14.1
> Environment: Windows 10
> R 3.4.4
>Reporter: John Cassil
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Apologies if this issue isn't categorized or documented appropriately.  
> Please be gentle! :)
> As a heavy R user that normally interacts with parquet files using SparklyR, 
> I have recently decided to try to use arrow::read_parquet() on a few parquet 
> files that were on my local machine rather than in hadoop.  I was not able to 
> proceed after several attempts due to embedded nuls.  For example:
> try({df <- read_parquet('out_2019-09_data_1.snappy.parquet') })
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) : 
>   embedded nul in string: 'INSTALL BOTH LEFT FRONT AND RIGHT FRONT  TORQUE 
> ARMS\0 ARMS'
> Is there a solution to this?
> I have also hit roadblocks with embedded nuls in the past with csvs using 
> data.table::fread(), but readr::read_delim() seems to handle them gracefully 
> with just a warning after proceeding.
> Apologies that I do not have a handy reprex. I don't know if I can even 
> recreate a parquet file with embedded nuls using arrow if it won't let me 
> read one in, and I can't share this file due to company restrictions.
> Please let me know how I can be of any more help!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5350) [Rust] Support filtering on primitive/string lists

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5350:
--
Labels: pull-request-available  (was: )

> [Rust] Support filtering on primitive/string lists
> --
>
> Key: ARROW-5350
> URL: https://issues.apache.org/jira/browse/ARROW-5350
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We currently only filter on primitive types, but not on lists and structs. 
> Add the ability to filter on nested array types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10174) [Java] Reading of Dictionary encoded struct vector fails

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10174:
---
Labels: pull-request-available  (was: )

> [Java] Reading of Dictionary encoded struct vector fails 
> -
>
> Key: ARROW-10174
> URL: https://issues.apache.org/jira/browse/ARROW-10174
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.0.1
>Reporter: Benjamin Wilhelm
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Write an index vector and a dictionary with a dictionary vector of the type 
> {{Struct}} using an {{ArrowStreamWriter}}. Reading this again fails with an 
> exception.
> Code to reproduce:
> {code:java}
> final RootAllocator allocator = new RootAllocator();
> // Create the dictionary
> final StructVector dict = StructVector.empty("Dict", allocator);
> final NullableStructWriter dictWriter = dict.getWriter();
> final IntWriter dictA = dictWriter.integer("a");
> final IntWriter dictB = dictWriter.integer("b");
> for (int i = 0; i < 3; i++) {
>   dictWriter.start();
>   dictA.writeInt(i);
>   dictB.writeInt(i);
>   dictWriter.end();
> }
> dict.setValueCount(3);
> final Dictionary dictionary = new Dictionary(dict, new DictionaryEncoding(1, 
> false, null));
> // Create the vector
> final Random random = new Random();
> final StructVector vector = StructVector.empty("Dict", allocator);
> final NullableStructWriter vectorWriter = vector.getWriter();
> final IntWriter vectorA = vectorWriter.integer("a");
> final IntWriter vectorB = vectorWriter.integer("b");
> for (int i = 0; i < 10; i++) {
>   int v = random.nextInt(3);
>   vectorWriter.start();
>   vectorA.writeInt(v);
>   vectorB.writeInt(v);
>   vectorWriter.end();
> }
> vector.setValueCount(10);
> // Encode the vector using the dictionary
> final IntVector indexVector = (IntVector) DictionaryEncoder.encode(vector, 
> dictionary);
> // Write the vector to out
> final ByteArrayOutputStream out = new ByteArrayOutputStream();
> final VectorSchemaRoot root = new 
> VectorSchemaRoot(Collections.singletonList(indexVector.getField()),
>   Collections.singletonList(indexVector));
> final ArrowStreamWriter writer = new ArrowStreamWriter(root, new 
> MapDictionaryProvider(dictionary),
>   Channels.newChannel(out));
> writer.start();
> writer.writeBatch();
> writer.end();
> // Read the vector from out
> try (final ArrowStreamReader reader = new ArrowStreamReader(new 
> ByteArrayInputStream(out.toByteArray()),
>   allocator)) {
>   reader.loadNextBatch();
>   final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
>   final FieldVector readIndexVector = readRoot.getVector(0);
>   // Get the dictionary and decode
>   final Map<Long, Dictionary> readDictionaryMap = 
> reader.getDictionaryVectors();
>   final Dictionary readDictionary = 
> readDictionaryMap.get(readIndexVector.getField().getDictionary().getId());
>   final ValueVector readVector = 
> DictionaryEncoder.decode(readIndexVector, readDictionary);
> }
> {code}
> Exception:
> {code}
> java.lang.IllegalArgumentException: not all nodes and buffers were consumed. 
> nodes: [ArrowFieldNode [length=3, nullCount=0], ArrowFieldNode [length=3, 
> nullCount=0]] buffers: [ArrowBuf[21], address:140118352739688, length:1, 
> ArrowBuf[22], address:140118352739696, length:12, ArrowBuf[23], 
> address:140118352739712, length:1, ArrowBuf[24], address:140118352739720, 
> length:12]
>   at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:63)
>   at org.apache.arrow.vector.ipc.ArrowReader.load(ArrowReader.java:241)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.loadDictionary(ArrowReader.java:232)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:129)
>   at com.knime.AppTest.testDictionaryStruct(AppTest.java:83)
> {code}
> If I see it correctly, the error happens in 
> {{DictionaryUtilities#toMessageFormat}}. When a dictionary-encoded vector is 
> encountered, the children of the memory-format field are still used (there 
> are none, because the index vector is an Int). However, the children of the 
> dictionary vector's field should be mapped to the message format and set as 
> children.
> I can create a fix and open a pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10167) [Rust] Support display of DictionaryArrays in sql.rs

2020-10-06 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-10167:
-
Component/s: Rust

> [Rust] Support display of DictionaryArrays in sql.rs
> 
>
> Key: ARROW-10167
> URL: https://issues.apache.org/jira/browse/ARROW-10167
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When I try to display DictionaryArray values, I get "???" in sql.rs
> This ticket tracks adding proper support for printing these types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10196) [C++] Add Future::DeferNotOk()

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10196:
---
Labels: pull-request-available  (was: )

> [C++] Add Future::DeferNotOk()
> --
>
> Key: ARROW-10196
> URL: https://issues.apache.org/jira/browse/ARROW-10196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Provide a static method mapping Result<Future<T>> -> Future<T>. If the Result 
> is an error, a finished future containing its Status will be constructed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10197) [Gandiva][python] Execute expression on filtered data

2020-10-06 Thread Kirill Lykov (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Lykov updated ARROW-10197:
-
Description: 
Looks like there is no way to execute an expression on filtered data in python. 
 Basically, I cannot pass `SelectionVector` to projector's `evaluate` method

```python
 import pyarrow as pa
 import pyarrow.gandiva as gandiva

table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]),
                                   pa.array([5., 45., 36., 73.,
                                             83., 23., 76.])],
                                  ['a', 'b'])

builder = gandiva.TreeExprBuilder()
 node_a = builder.make_field(table.schema.field("a"))
 node_b = builder.make_field(table.schema.field("b"))
 fifty = builder.make_literal(50.0, pa.float64())
 eleven = builder.make_literal(11.0, pa.float64())

cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_())
 cond_2 = builder.make_function("greater_than", [node_a, node_b],
                                    pa.bool_())
 cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_())
 cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3])
 condition = builder.make_condition(cond)

filter = gandiva.make_filter(table.schema, condition)


 filterResult = filter.evaluate(table.to_batches()[0], 
pa.default_memory_pool()) --> filterResult has type SelectionVector
 print(result)

sum = builder.make_function("add", [node_a, node_b], pa.float64())
 field_result = pa.field("c", pa.float64())
 expr = builder.make_expression(sum, field_result)
 projector = gandiva.make_projector(
 table.schema, [expr], pa.default_memory_pool())

r, = projector.evaluate(table.to_batches()[0], result) --> Here there is a 
problem that I don't know how to use filterResult with projector
 ```

In C++, I see that it is possible to pass SelectionVector as second argument to 
projector::Evaluate: 
[https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270]
  
 Meanwhile, it looks like it is impossible in `gandiva.pyx`: 
[https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154]

  was:
Looks like there is no way to execute an expression on filtered data in python. 
 Basically, I cannot pass `SelectionVector` to projector's `evaluate` method

```python
 import pyarrow as pa
 import pyarrow.gandiva as gandiva

table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]),
                                   pa.array([5., 45., 36., 73.,
                                             83., 23., 76.])],
                                  ['a', 'b'])

builder = gandiva.TreeExprBuilder()
 node_a = builder.make_field(table.schema.field("a"))
 node_b = builder.make_field(table.schema.field("b"))
 fifty = builder.make_literal(50.0, pa.float64())
 eleven = builder.make_literal(11.0, pa.float64())

cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_())
 cond_2 = builder.make_function("greater_than", [node_a, node_b],
                                    pa.bool_())
 cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_())
 cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3])
 condition = builder.make_condition(cond)

filter = gandiva.make_filter(table.schema, condition)
 # filterResult has type SelectionVector
 filterResult = filter.evaluate(table.to_batches()[0], pa.default_memory_pool())
 print(result)

sum = builder.make_function("add", [node_a, node_b], pa.float64())
 field_result = pa.field("c", pa.float64())
 expr = builder.make_expression(sum, field_result)
 projector = gandiva.make_projector(
table.schema, [expr], pa.default_memory_pool())

# Here there is a problem that I don't know how to use filterResult with 
projector
 r, = projector.evaluate(table.to_batches()[0], result)
 ```

In C++, I see that it is possible to pass SelectionVector as second argument to 
projector::Evaluate: 
[https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270]
  
 Meanwhile, it looks like it is impossible in `gandiva.pyx`: 
[https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154]


> [Gandiva][python] Execute expression on filtered data
> -
>
> Key: ARROW-10197
> URL: https://issues.apache.org/jira/browse/ARROW-10197
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva, Python
>Reporter: Kirill Lykov
>Priority: Trivial
>
> Looks like there is no way to execute an expression on filtered data in 
> python. 
>  Basically, I cannot pass `SelectionVector` to projector's `evaluate` method
> 

[jira] [Updated] (ARROW-10197) [Gandiva][python] Execute expression on filtered data

2020-10-06 Thread Kirill Lykov (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Lykov updated ARROW-10197:
-
Description: 
Looks like there is no way to execute an expression on filtered data in python. 
 Basically, I cannot pass `SelectionVector` to projector's `evaluate` method

```python
 import pyarrow as pa
 import pyarrow.gandiva as gandiva

table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]),
                                   pa.array([5., 45., 36., 73.,
                                             83., 23., 76.])],
                                  ['a', 'b'])

builder = gandiva.TreeExprBuilder()
 node_a = builder.make_field(table.schema.field("a"))
 node_b = builder.make_field(table.schema.field("b"))
 fifty = builder.make_literal(50.0, pa.float64())
 eleven = builder.make_literal(11.0, pa.float64())

cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_())
 cond_2 = builder.make_function("greater_than", [node_a, node_b],
                                    pa.bool_())
 cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_())
 cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3])
 condition = builder.make_condition(cond)

filter = gandiva.make_filter(table.schema, condition)
 # filterResult has type SelectionVector
 filterResult = filter.evaluate(table.to_batches()[0], pa.default_memory_pool())
 print(result)

sum = builder.make_function("add", [node_a, node_b], pa.float64())
 field_result = pa.field("c", pa.float64())
 expr = builder.make_expression(sum, field_result)
 projector = gandiva.make_projector(
table.schema, [expr], pa.default_memory_pool())

# Here there is a problem that I don't know how to use filterResult with 
projector
 r, = projector.evaluate(table.to_batches()[0], result)
 ```

In C++, I see that it is possible to pass SelectionVector as second argument to 
projector::Evaluate: 
[https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270]
  
 Meanwhile, it looks like it is impossible in `gandiva.pyx`: 
[https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154]

  was:
Looks like there is no way to execute an expression on filtered data in python. 
Basically, I cannot pass `SelectionVector` to projector's `evaluate` method

```python
import pyarrow as pa
import pyarrow.gandiva as gandiva

table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]),
                                  pa.array([5., 45., 36., 73.,
                                            83., 23., 76.])],
                                 ['a', 'b'])

builder = gandiva.TreeExprBuilder()
node_a = builder.make_field(table.schema.field("a"))
node_b = builder.make_field(table.schema.field("b"))
fifty = builder.make_literal(50.0, pa.float64())
eleven = builder.make_literal(11.0, pa.float64())

cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_())
cond_2 = builder.make_function("greater_than", [node_a, node_b],
                                   pa.bool_())
cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_())
cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3])
condition = builder.make_condition(cond)

filter = gandiva.make_filter(table.schema, condition)
# filterResult has type SelectionVector
filterResult = filter.evaluate(table.to_batches()[0], pa.default_memory_pool())
print(result)

sum = builder.make_function("add", [node_a, node_b], pa.float64())
field_result = pa.field("c", pa.float64())
expr = builder.make_expression(sum, field_result)
projector = gandiva.make_projector(
        table.schema, [expr], pa.default_memory_pool())

### Here there is a problem that I don't know how to use filterResult with 
projector
r, = projector.evaluate(table.to_batches()[0], result)
```

In C++, I see that it is possible to pass SelectionVector as second argument to 
projector::Evaluate: 
[https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270]
 
Meanwhile, it looks like it is impossible in `gandiva.pyx`: 
[https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154]


> [Gandiva][python] Execute expression on filtered data
> -
>
> Key: ARROW-10197
> URL: https://issues.apache.org/jira/browse/ARROW-10197
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva, Python
>Reporter: Kirill Lykov
>Priority: Trivial
>
> Looks like there is no way to execute an expression on filtered data in 
> python. 
>  Basically, I cannot pass `SelectionVector` to projector's `evaluate` method
> ```python
>  import pyarrow 

[jira] [Resolved] (ARROW-10191) [Rust] [Parquet] Add roundtrip tests for single column batches

2020-10-06 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10191.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8330
[https://github.com/apache/arrow/pull/8330]

> [Rust] [Parquet] Add roundtrip tests for single column batches
> --
>
> Key: ARROW-10191
> URL: https://issues.apache.org/jira/browse/ARROW-10191
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> To aid with test coverage and picking up information loss during Parquet and 
> Arrow roundtrips, we can add tests that assert that all supported Arrow 
> datatypes can be written and read correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10167) [Rust] Support display of DictionaryArrays in sql.rs

2020-10-06 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10167.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8333
[https://github.com/apache/arrow/pull/8333]

> [Rust] Support display of DictionaryArrays in sql.rs
> 
>
> Key: ARROW-10167
> URL: https://issues.apache.org/jira/browse/ARROW-10167
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When I try to display DictionaryArray values, I get "???" in sql.rs
> This ticket tracks adding proper support for printing these types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)

2020-10-06 Thread Mahmut Bulut (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208817#comment-17208817
 ] 

Mahmut Bulut commented on ARROW-10187:
--

[~andygrove] Hi Andy, I don't have a Raspberry Pi at hand. I want to check the 
compilation problems on ARM ASAP; the target_pointer_width gate might be a good 
option for it. What version of the Pi did you use?

> [Rust] Test failures on 32 bit ARM (Raspberry Pi)
> -
>
> Key: ARROW-10187
> URL: https://issues.apache.org/jira/browse/ARROW-10187
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Perhaps these failures are to be expected and perhaps we can't really support 
> 32 bit?
>  
> {code:java}
>  array::array::tests::test_primitive_array_from_vec stdout 
> thread 'array::array::tests::test_primitive_array_from_vec' panicked at 
> 'assertion failed: `(left == right)`
>   left: `144`,
>  right: `104`', arrow/src/array/array.rs:2383:9 
> array::array::tests::test_primitive_array_from_vec_option stdout 
> thread 'array::array::tests::test_primitive_array_from_vec_option' panicked 
> at 'assertion failed: `(left == right)`
>   left: `224`,
>  right: `176`', arrow/src/array/array.rs:2409:9 
> array::null::tests::test_null_array stdout 
> thread 'array::null::tests::test_null_array' panicked at 'assertion failed: 
> `(left == right)`
>   left: `64`,
>  right: `32`', arrow/src/array/null.rs:134:9 
> array::union::tests::test_dense_union_i32 stdout 
> thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion 
> failed: `(left == right)`
>   left: `1024`,
>  right: `768`', arrow/src/array/union.rs:704:9 
> memory::tests::test_allocate stdout 
> thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left 
> == right)`
>   left: `0`,
>  right: `32`', arrow/src/memory.rs:243:13
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10188) [Rust] [DataFusion] Some examples are broken

2020-10-06 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10188.

Resolution: Fixed

Issue resolved by pull request 8355
[https://github.com/apache/arrow/pull/8355]

> [Rust] [DataFusion] Some examples are broken
> 
>
> Key: ARROW-10188
> URL: https://issues.apache.org/jira/browse/ARROW-10188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The flight server example fails with "No such file or directory".
> The dataframe example produces an empty result set.
> The simple_udaf example produces no output at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)

2020-10-06 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208807#comment-17208807
 ] 

Andy Grove commented on ARROW-10187:


[~nevi_me] [~vertexclique] I'd be interested in your opinions on this one.

> [Rust] Test failures on 32 bit ARM (Raspberry Pi)
> -
>
> Key: ARROW-10187
> URL: https://issues.apache.org/jira/browse/ARROW-10187
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Perhaps these failures are to be expected and perhaps we can't really support 
> 32 bit?
>  
> {code:java}
>  array::array::tests::test_primitive_array_from_vec stdout 
> thread 'array::array::tests::test_primitive_array_from_vec' panicked at 
> 'assertion failed: `(left == right)`
>   left: `144`,
>  right: `104`', arrow/src/array/array.rs:2383:9 
> array::array::tests::test_primitive_array_from_vec_option stdout 
> thread 'array::array::tests::test_primitive_array_from_vec_option' panicked 
> at 'assertion failed: `(left == right)`
>   left: `224`,
>  right: `176`', arrow/src/array/array.rs:2409:9 
> array::null::tests::test_null_array stdout 
> thread 'array::null::tests::test_null_array' panicked at 'assertion failed: 
> `(left == right)`
>   left: `64`,
>  right: `32`', arrow/src/array/null.rs:134:9 
> array::union::tests::test_dense_union_i32 stdout 
> thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion 
> failed: `(left == right)`
>   left: `1024`,
>  right: `768`', arrow/src/array/union.rs:704:9 
> memory::tests::test_allocate stdout 
> thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left 
> == right)`
>   left: `0`,
>  right: `32`', arrow/src/memory.rs:243:13
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)

2020-10-06 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208805#comment-17208805
 ] 

Andy Grove commented on ARROW-10187:


If these tests really are specific to 64 bit platforms then we could use 
conditional compilation and only compile them when target_pointer_width == 64.

See 
[https://doc.rust-lang.org/reference/conditional-compilation.html#target_pointer_width]
 for more information.

> [Rust] Test failures on 32 bit ARM (Raspberry Pi)
> -
>
> Key: ARROW-10187
> URL: https://issues.apache.org/jira/browse/ARROW-10187
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Perhaps these failures are to be expected and perhaps we can't really support 
> 32 bit?
>  
> {code:java}
>  array::array::tests::test_primitive_array_from_vec stdout 
> thread 'array::array::tests::test_primitive_array_from_vec' panicked at 
> 'assertion failed: `(left == right)`
>   left: `144`,
>  right: `104`', arrow/src/array/array.rs:2383:9 
> array::array::tests::test_primitive_array_from_vec_option stdout 
> thread 'array::array::tests::test_primitive_array_from_vec_option' panicked 
> at 'assertion failed: `(left == right)`
>   left: `224`,
>  right: `176`', arrow/src/array/array.rs:2409:9 
> array::null::tests::test_null_array stdout 
> thread 'array::null::tests::test_null_array' panicked at 'assertion failed: 
> `(left == right)`
>   left: `64`,
>  right: `32`', arrow/src/array/null.rs:134:9 
> array::union::tests::test_dense_union_i32 stdout 
> thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion 
> failed: `(left == right)`
>   left: `1024`,
>  right: `768`', arrow/src/array/union.rs:704:9 
> memory::tests::test_allocate stdout 
> thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left 
> == right)`
>   left: `0`,
>  right: `32`', arrow/src/memory.rs:243:13
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10197) [Gandiva][python] Execute expression on filtered data

2020-10-06 Thread Kirill Lykov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208803#comment-17208803
 ] 

Kirill Lykov commented on ARROW-10197:
--

I've tried to fix it by adding

```python
def evaluate(self, RecordBatch batch, shared_ptr[CSelectionVector] selection):
    cdef vector[shared_ptr[CArray]] results
    check_status(self.projector.get().Evaluate(
        batch.sp_batch.get()[0], selection.get(), self.pool.pool, &results))

    cdef shared_ptr[CArray] result
    arrays = []
    for result in results:
        arrays.append(pyarrow_wrap_array(result))
    return arrays
```

But I get an error:

 Call with wrong number of arguments (expected 3, got 4)

which means I don't understand how this pyx is translated to Python.
I thought `self.projector.get().Evaluate` would somehow be magically translated 
to a call of this method:
[https://github.com/apache/arrow/blob/7ad49eeca5215d9b2a56b6439f1bd6ea3ea9/cpp/src/gandiva/projector.h#L106]

 

> [Gandiva][python] Execute expression on filtered data
> -
>
> Key: ARROW-10197
> URL: https://issues.apache.org/jira/browse/ARROW-10197
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva, Python
>Reporter: Kirill Lykov
>Priority: Trivial
>
> Looks like there is no way to execute an expression on filtered data in 
> python. 
> Basically, I cannot pass `SelectionVector` to projector's `evaluate` method
> ```python
> import pyarrow as pa
> import pyarrow.gandiva as gandiva
> table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]),
>                                   pa.array([5., 45., 36., 73.,
>                                             83., 23., 76.])],
>                                  ['a', 'b'])
> builder = gandiva.TreeExprBuilder()
> node_a = builder.make_field(table.schema.field("a"))
> node_b = builder.make_field(table.schema.field("b"))
> fifty = builder.make_literal(50.0, pa.float64())
> eleven = builder.make_literal(11.0, pa.float64())
> cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_())
> cond_2 = builder.make_function("greater_than", [node_a, node_b],
>                                    pa.bool_())
> cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_())
> cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3])
> condition = builder.make_condition(cond)
> filter = gandiva.make_filter(table.schema, condition)
> # filterResult has type SelectionVector
> filterResult = filter.evaluate(table.to_batches()[0], 
> pa.default_memory_pool())
> print(filterResult)
> sum = builder.make_function("add", [node_a, node_b], pa.float64())
> field_result = pa.field("c", pa.float64())
> expr = builder.make_expression(sum, field_result)
> projector = gandiva.make_projector(
>         table.schema, [expr], pa.default_memory_pool())
> ### Here there is a problem: I don't know how to use filterResult with the
> ### projector
> r, = projector.evaluate(table.to_batches()[0], filterResult)
> ```
> In C++, I see that it is possible to pass SelectionVector as second argument 
> to projector::Evaluate: 
> [https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270]
>  
> Meanwhile, it looks like it is impossible in `gandiva.pyx`: 
> [https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10197) [Gandiva][python] Execute expression on filtered data

2020-10-06 Thread Kirill Lykov (Jira)
Kirill Lykov created ARROW-10197:


 Summary: [Gandiva][python] Execute expression on filtered data
 Key: ARROW-10197
 URL: https://issues.apache.org/jira/browse/ARROW-10197
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva, Python
Reporter: Kirill Lykov


Looks like there is no way to execute an expression on filtered data in python. 
Basically, I cannot pass `SelectionVector` to projector's `evaluate` method

```python
import pyarrow as pa
import pyarrow.gandiva as gandiva

table = pa.Table.from_arrays([pa.array([1., 31., 46., 3., 57., 44., 22.]),
                                  pa.array([5., 45., 36., 73.,
                                            83., 23., 76.])],
                                 ['a', 'b'])

builder = gandiva.TreeExprBuilder()
node_a = builder.make_field(table.schema.field("a"))
node_b = builder.make_field(table.schema.field("b"))
fifty = builder.make_literal(50.0, pa.float64())
eleven = builder.make_literal(11.0, pa.float64())

cond_1 = builder.make_function("less_than", [node_a, fifty], pa.bool_())
cond_2 = builder.make_function("greater_than", [node_a, node_b],
                                   pa.bool_())
cond_3 = builder.make_function("less_than", [node_b, eleven], pa.bool_())
cond = builder.make_or([builder.make_and([cond_1, cond_2]), cond_3])
condition = builder.make_condition(cond)

filter = gandiva.make_filter(table.schema, condition)
# filterResult has type SelectionVector
filterResult = filter.evaluate(table.to_batches()[0], pa.default_memory_pool())
print(filterResult)

sum = builder.make_function("add", [node_a, node_b], pa.float64())
field_result = pa.field("c", pa.float64())
expr = builder.make_expression(sum, field_result)
projector = gandiva.make_projector(
        table.schema, [expr], pa.default_memory_pool())

### Here there is a problem: I don't know how to use filterResult with the
### projector
r, = projector.evaluate(table.to_batches()[0], filterResult)
```

In C++, I see that it is possible to pass SelectionVector as second argument to 
projector::Evaluate: 
[https://github.com/apache/arrow/blob/c5fa23ea0e15abe47b35524fa6a79c7b8c160fa0/cpp/src/gandiva/tests/filter_project_test.cc#L270]
 
Meanwhile, it looks like it is impossible in `gandiva.pyx`: 
[https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/gandiva.pyx#L154]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10196) [C++] Add Future::DeferNotOk()

2020-10-06 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-10196:


 Summary: [C++] Add Future::DeferNotOk()
 Key: ARROW-10196
 URL: https://issues.apache.org/jira/browse/ARROW-10196
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 1.0.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 2.0.0


Provide a static method mapping Result<Future<T>> -> Future<T>. If the Result 
is an error, a finished future containing its Status will be constructed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10195) [C++] Add string struct extract kernel using re2

2020-10-06 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10195:


 Summary: [C++] Add string struct extract kernel using re2
 Key: ARROW-10195
 URL: https://issues.apache.org/jira/browse/ARROW-10195
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Maarten Breddels
Assignee: Maarten Breddels


Similar to Pandas' str.extract: a way to convert a string array to a struct of 
strings using the re2 regex library (when the pattern has named capture 
groups). 
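
For reference, a minimal sketch of the pandas behaviour being mirrored (plain 
pandas, not the proposed kernel):

{code:python}
import pandas as pd

s = pd.Series(["alpha-1", "beta-2", None])
# Named capture groups become columns; in Arrow terms, fields of a struct.
print(s.str.extract(r"(?P<word>[a-z]+)-(?P<num>\d)"))
#     word  num
# 0  alpha    1
# 1   beta    2
# 2    NaN  NaN
{code}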



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10194) [Python] Array.to_numpy() with type fixed_size_list(int64(), 1) doesn't roundtrip for large integer values

2020-10-06 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10194:

Description: 
Reproducer:

{code:python}
data = [None, [9007199254740993]]
arr = pa.array(data, type=pa.list_(pa.uint64(), 1))
ndarray = arr.to_numpy(zero_copy_only=False)
restored = pa.array(ndarray, type=arr.type)
assert restored.equals(arr)
{code}

Error:

{code}
E   assert False
E+  where False = <bound method Array.equals of <pyarrow.lib.FixedSizeListArray object at 0x7fbdb0239e20>>(<pyarrow.lib.FixedSizeListArray object at 0x7fbdb0262220>\n[\n  null,\n  [\n9007199254740993\n  ]\n])
E+    where <bound method Array.equals of <pyarrow.lib.FixedSizeListArray object at 0x7fbdb0239e20>> = <pyarrow.lib.FixedSizeListArray object at 0x7fbdb0239e20>\n[\n  null,\n  [\n9007199254740992\n  ]\n].equals
{code}

The inner numpy array ({{ndarray[1]}}) has float64 dtype where the integer gets 
truncated because of the precision.

  was:
Reproducer:

{code:python}
data = [None, [9007199254740993]]
arr = pa.array(data, type=pa.list_(pa.uint64(), 1))
ndarray = arr.to_numpy(zero_copy_only=False)
restored = pa.array(ndarray, type=arr.type)
assert restored.equals(arr)
{code}

Error:

{code}
E   assert False
E+  where False = <bound method Array.equals of <pyarrow.lib.FixedSizeListArray object at 0x...>>(<pyarrow.lib.FixedSizeListArray object at 0x...>\n[\n  null,\n  [\n90071992547409944\n  ]\n])
E+    where <bound method Array.equals of <pyarrow.lib.FixedSizeListArray object at 0x...>> = <pyarrow.lib.FixedSizeListArray object at 0x...>\n[\n  null,\n  [\n90071992547409952\n  ]\n].equals
{code}

The inner numpy array ({{ndarray[1]}}) has float64 dtype where the integer gets 
truncated because of the precision.


> [Python] Array.to_numpy() with type fixed_size_list(int64(), 1) doesn't 
> roundtrip for large integer values
> --
>
> Key: ARROW-10194
> URL: https://issues.apache.org/jira/browse/ARROW-10194
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
>
> Reproducer:
> {code:python}
> data = [None, [9007199254740993]]
> arr = pa.array(data, type=pa.list_(pa.uint64(), 1))
> ndarray = arr.to_numpy(zero_copy_only=False)
> restored = pa.array(ndarray, type=arr.type)
> assert restored.equals(arr)
> {code}
> Error:
> {code}
> E   assert False
> E+  where False = <bound method Array.equals of <pyarrow.lib.FixedSizeListArray object at 
> 0x7fbdb0239e20>>(<pyarrow.lib.FixedSizeListArray object at 0x7fbdb0262220>\n[\n  null,\n  [\n9007199254740993\n  ]\n])
> E+    where <bound method Array.equals of <pyarrow.lib.FixedSizeListArray object at 0x7fbdb0239e20>> = <pyarrow.lib.FixedSizeListArray object at 0x7fbdb0239e20>\n[\n  null,\n  [\n9007199254740992\n  ]\n].equals
> {code}
> The inner numpy array ({{ndarray[1]}}) has float64 dtype where the integer 
> gets truncated because of the precision.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10194) [Python] Array.to_numpy() with type fixed_size_list(int64(), 1) doesn't roundtrip for large integer values

2020-10-06 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-10194:
---

 Summary: [Python] Array.to_numpy() with type 
fixed_size_list(int64(), 1) doesn't roundtrip for large integer values
 Key: ARROW-10194
 URL: https://issues.apache.org/jira/browse/ARROW-10194
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Krisztian Szucs


Reproducer:

{code:python}
data = [None, [9007199254740993]]
arr = pa.array(data, type=pa.list_(pa.uint64(), 1))
ndarray = arr.to_numpy(zero_copy_only=False)
restored = pa.array(ndarray, type=arr.type)
assert restored.equals(arr)
{code}

Error:

{code}
E   assert False
E+  where False = <bound method Array.equals of <pyarrow.lib.FixedSizeListArray object at 0x...>>(<pyarrow.lib.FixedSizeListArray object at 0x...>\n[\n  null,\n  [\n90071992547409944\n  ]\n])
E+    where <bound method Array.equals of <pyarrow.lib.FixedSizeListArray object at 0x...>> = <pyarrow.lib.FixedSizeListArray object at 0x...>\n[\n  null,\n  [\n90071992547409952\n  ]\n].equals
{code}

The inner numpy array ({{ndarray[1]}}) has float64 dtype where the integer gets 
truncated because of the precision.
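
A quick aside on the precision involved (plain Python, no pyarrow needed): 
around 9.0e16, adjacent float64 values are 16 apart, so this integer cannot 
survive a trip through a float64 buffer:

{code:python}
# Near 9.0e16 the float64 ulp is 2**4 = 16, so the value rounds to the
# nearest representable neighbour (ties go to the even significand).
x = 90071992547409944
assert float(x) == 90071992547409952.0
assert int(float(x)) != x
{code}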



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10193) [Python] Segfault when converting to fixed size binary array

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10193:
---
Labels: pull-request-available  (was: )

> [Python] Segfault when converting to fixed size binary array
> 
>
> Key: ARROW-10193
> URL: https://issues.apache.org/jira/browse/ARROW-10193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Reproducer:
> {code:python}
> data = [b'\x19h\r\x9e\x00\x00\x00\x00\x01\x9b\x9fA']
> assert len(data[0]) == 12
> ty = pa.binary(12)
> arr = pa.array(data, type=ty)
> {code}
> Trace:
> {code}
> pyarrow/tests/test_convert_builtin.py::test_fixed_size_binary_length_check 
> ../src/arrow/array/builder_binary.cc:53:  Check failed: (size) == 
> (byte_width_) Appending wrong size to FixedSizeBinaryBuilder
> 0   libarrow.200.0.0.dylib  0x00010e7f9704 
> _ZN5arrow4util7CerrLog14PrintBackTraceEv + 52
> 1   libarrow.200.0.0.dylib  0x00010e7f9622 
> _ZN5arrow4util7CerrLogD2Ev + 98
> 2   libarrow.200.0.0.dylib  0x00010e7f9585 
> _ZN5arrow4util7CerrLogD1Ev + 21
> 3   libarrow.200.0.0.dylib  0x00010e7f95ac 
> _ZN5arrow4util7CerrLogD0Ev + 28
> 4   libarrow.200.0.0.dylib  0x00010e7f9492 
> _ZN5arrow4util8ArrowLogD2Ev + 82
> 5   libarrow.200.0.0.dylib  0x00010e7f94c5 
> _ZN5arrow4util8ArrowLogD1Ev + 21
> 6   libarrow.200.0.0.dylib  0x00010e303ec1 
> _ZN5arrow22FixedSizeBinaryBuilder14CheckValueSizeEx + 209
> 7   libarrow.200.0.0.dylib  0x00010e30c361 
> _ZN5arrow22FixedSizeBinaryBuilder12UnsafeAppendEN6nonstd7sv_lite17basic_string_viewIcNSt3__111char_traitsIc
>  + 49
> 8   libarrow_python.200.0.0.dylib   0x00010b4efa7d 
> _ZN5arrow2py20PyPrimitiveConverterINS_19FixedSizeBinaryTypeEvE6AppendEP7_object
>  + 813
> {code}
> The input {{const char*}} value gets implicitly cast to string_view, which 
> makes the length check fail in debug builds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10192) [C++][Python] Segfault when converting nested struct array with dictionary field to pandas series

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10192:
---
Labels: pull-request-available  (was: )

> [C++][Python] Segfault when converting nested struct array with dictionary 
> field to pandas series
> -
>
> Key: ARROW-10192
> URL: https://issues.apache.org/jira/browse/ARROW-10192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Krisztian Szucs
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Reproducer:
> {code:python}
> def test_struct_array_with_dictionary_field_to_pandas():
> ty = pa.struct([
> pa.field('dict', pa.dictionary(pa.int64(), pa.int32())),
> ])
> data = [
> {'dict': -1859762450}
> ]
> arr = pa.array(data, type=ty)
> arr.to_pandas()
> {code}
> Raises SIGSTOP:
> {code}
> * thread #1, stop reason = signal SIGSTOP
>   * frame #0: 0x7fff6e2b733a libsystem_kernel.dylib`__pthread_kill + 10
> frame #1: 0x7fff6e373e60 libsystem_pthread.dylib`pthread_kill + 430
> frame #2: 0x7fff6e1ce93e libsystem_c.dylib`raise + 26
> frame #3: 0x7fff6e3685fd libsystem_platform.dylib`_sigtramp + 29
> frame #4: 0x00011517adfd 
> libarrow_python.200.0.0.dylib`arrow::py::ConvertStruct(options=0x7f84fc5a0230,
>  data=0x7f84fc59ef18, out_values=0x7f84fc53d140) at 
> arrow_to_pandas.cc:685:54
> frame #5: 0x00011514c642 
> libarrow_python.200.0.0.dylib`arrow::py::ObjectWriterVisitor::Visit(this=0x7ffee06a1a88,
>  type=0x7f84fc5a00e8) at arrow_to_pandas.cc:1031:12
> frame #6: 0x0001151499c4 libarrow_python.200.0.0.dylib`arrow::Status 
> arrow::VisitTypeInline(type=0x7f84fc5a00e8,
>  visitor=0x7ffee06a1a88) at visitor_inline.h:88:5
> frame #7: 0x000115149305 
> libarrow_python.200.0.0.dylib`arrow::py::ObjectWriter::CopyInto(this=0x7f84fc5a0228,
>  data=std::__1::shared_ptr::element_type @ 
> 0x7f84fc59ef18 strong=2 weak=1, rel_placement=0) at arrow_to_pand
> as.cc:1055:12
> {code}
> {code:cpp}
> frame #4: 0x00011517adfd 
> libarrow_python.200.0.0.dylib`arrow::py::ConvertStruct(options=0x7f84fc5a0230,
>  data=0x7f84fc59ef18, out_values=0x7f84fc53d140) at 
> arrow_to_pandas.cc:685:54
>    682      if (!arr->field(static_cast<int>(field_idx))->IsNull(i)) {
>    683        // Value exists in child array, obtain it
>    684        auto array = reinterpret_cast<PyArrayObject*>(fields_data[field_idx].obj());
> -> 685        auto ptr = reinterpret_cast<const char*>(PyArray_GETPTR1(array, i));
>    686        field_value.reset(PyArray_GETITEM(array, ptr));
>    687        RETURN_IF_PYERROR();
>    688      } else {
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10192) [C++][Python] Segfault when converting nested struct array with dictionary field to pandas series

2020-10-06 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10192:
---
Fix Version/s: 2.0.0

> [C++][Python] Segfault when converting nested struct array with dictionary 
> field to pandas series
> -
>
> Key: ARROW-10192
> URL: https://issues.apache.org/jira/browse/ARROW-10192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Krisztian Szucs
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 2.0.0
>
>
> Reproducer:
> {code:python}
> def test_struct_array_with_dictionary_field_to_pandas():
> ty = pa.struct([
> pa.field('dict', pa.dictionary(pa.int64(), pa.int32())),
> ])
> data = [
> {'dict': -1859762450}
> ]
> arr = pa.array(data, type=ty)
> arr.to_pandas()
> {code}
> Raises SIGSTOP:
> {code}
> * thread #1, stop reason = signal SIGSTOP
>   * frame #0: 0x7fff6e2b733a libsystem_kernel.dylib`__pthread_kill + 10
> frame #1: 0x7fff6e373e60 libsystem_pthread.dylib`pthread_kill + 430
> frame #2: 0x7fff6e1ce93e libsystem_c.dylib`raise + 26
> frame #3: 0x7fff6e3685fd libsystem_platform.dylib`_sigtramp + 29
> frame #4: 0x00011517adfd 
> libarrow_python.200.0.0.dylib`arrow::py::ConvertStruct(options=0x7f84fc5a0230,
>  data=0x7f84fc59ef18, out_values=0x7f84fc53d140) at 
> arrow_to_pandas.cc:685:54
> frame #5: 0x00011514c642 
> libarrow_python.200.0.0.dylib`arrow::py::ObjectWriterVisitor::Visit(this=0x7ffee06a1a88,
>  type=0x7f84fc5a00e8) at arrow_to_pandas.cc:1031:12
> frame #6: 0x0001151499c4 libarrow_python.200.0.0.dylib`arrow::Status 
> arrow::VisitTypeInline(type=0x7f84fc5a00e8,
>  visitor=0x7ffee06a1a88) at visitor_inline.h:88:5
> frame #7: 0x000115149305 
> libarrow_python.200.0.0.dylib`arrow::py::ObjectWriter::CopyInto(this=0x7f84fc5a0228,
>  data=std::__1::shared_ptr::element_type @ 
> 0x7f84fc59ef18 strong=2 weak=1, rel_placement=0) at arrow_to_pand
> as.cc:1055:12
> {code}
> {code:cpp}
> frame #4: 0x00011517adfd 
> libarrow_python.200.0.0.dylib`arrow::py::ConvertStruct(options=0x7f84fc5a0230,
>  data=0x7f84fc59ef18, out_values=0x7f84fc53d140) at 
> arrow_to_pandas.cc:685:54
>    682      if (!arr->field(static_cast<int>(field_idx))->IsNull(i)) {
>    683        // Value exists in child array, obtain it
>    684        auto array = reinterpret_cast<PyArrayObject*>(fields_data[field_idx].obj());
> -> 685        auto ptr = reinterpret_cast<const char*>(PyArray_GETPTR1(array, i));
>    686        field_value.reset(PyArray_GETITEM(array, ptr));
>    687        RETURN_IF_PYERROR();
>    688      } else {
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10193) [Python] Segfault when converting to fixed size binary array

2020-10-06 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-10193:
---

 Summary: [Python] Segfault when converting to fixed size binary 
array
 Key: ARROW-10193
 URL: https://issues.apache.org/jira/browse/ARROW-10193
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 2.0.0


Reproducer:
{code:python}
 data = [b'\x19h\r\x9e\x00\x00\x00\x00\x01\x9b\x9fA']
assert len(data[0]) == 12
ty = pa.binary(12)
arr = pa.array(data, type=ty)
{code}

Trace:
{code}
pyarrow/tests/test_convert_builtin.py::test_fixed_size_binary_length_check 
../src/arrow/array/builder_binary.cc:53:  Check failed: (size) == (byte_width_) 
Appending wrong size to FixedSizeBinaryBuilder
0   libarrow.200.0.0.dylib  0x00010e7f9704 
_ZN5arrow4util7CerrLog14PrintBackTraceEv + 52
1   libarrow.200.0.0.dylib  0x00010e7f9622 
_ZN5arrow4util7CerrLogD2Ev + 98
2   libarrow.200.0.0.dylib  0x00010e7f9585 
_ZN5arrow4util7CerrLogD1Ev + 21
3   libarrow.200.0.0.dylib  0x00010e7f95ac 
_ZN5arrow4util7CerrLogD0Ev + 28
4   libarrow.200.0.0.dylib  0x00010e7f9492 
_ZN5arrow4util8ArrowLogD2Ev + 82
5   libarrow.200.0.0.dylib  0x00010e7f94c5 
_ZN5arrow4util8ArrowLogD1Ev + 21
6   libarrow.200.0.0.dylib  0x00010e303ec1 
_ZN5arrow22FixedSizeBinaryBuilder14CheckValueSizeEx + 209
7   libarrow.200.0.0.dylib  0x00010e30c361 
_ZN5arrow22FixedSizeBinaryBuilder12UnsafeAppendEN6nonstd7sv_lite17basic_string_viewIcNSt3__111char_traitsIc
 + 49
8   libarrow_python.200.0.0.dylib   0x00010b4efa7d 
_ZN5arrow2py20PyPrimitiveConverterINS_19FixedSizeBinaryTypeEvE6AppendEP7_object 
+ 813
{code}

The input {{const char*}} value gets implicitly cast to string_view, which 
makes the length check fail in debug builds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10193) [Python] Segfault when converting to fixed size binary array

2020-10-06 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10193:

Description: 
Reproducer:
{code:python}
   data = [b'\x19h\r\x9e\x00\x00\x00\x00\x01\x9b\x9fA']
assert len(data[0]) == 12
ty = pa.binary(12)
arr = pa.array(data, type=ty)
{code}

Trace:
{code}
pyarrow/tests/test_convert_builtin.py::test_fixed_size_binary_length_check 
../src/arrow/array/builder_binary.cc:53:  Check failed: (size) == (byte_width_) 
Appending wrong size to FixedSizeBinaryBuilder
0   libarrow.200.0.0.dylib  0x00010e7f9704 
_ZN5arrow4util7CerrLog14PrintBackTraceEv + 52
1   libarrow.200.0.0.dylib  0x00010e7f9622 
_ZN5arrow4util7CerrLogD2Ev + 98
2   libarrow.200.0.0.dylib  0x00010e7f9585 
_ZN5arrow4util7CerrLogD1Ev + 21
3   libarrow.200.0.0.dylib  0x00010e7f95ac 
_ZN5arrow4util7CerrLogD0Ev + 28
4   libarrow.200.0.0.dylib  0x00010e7f9492 
_ZN5arrow4util8ArrowLogD2Ev + 82
5   libarrow.200.0.0.dylib  0x00010e7f94c5 
_ZN5arrow4util8ArrowLogD1Ev + 21
6   libarrow.200.0.0.dylib  0x00010e303ec1 
_ZN5arrow22FixedSizeBinaryBuilder14CheckValueSizeEx + 209
7   libarrow.200.0.0.dylib  0x00010e30c361 
_ZN5arrow22FixedSizeBinaryBuilder12UnsafeAppendEN6nonstd7sv_lite17basic_string_viewIcNSt3__111char_traitsIc
 + 49
8   libarrow_python.200.0.0.dylib   0x00010b4efa7d 
_ZN5arrow2py20PyPrimitiveConverterINS_19FixedSizeBinaryTypeEvE6AppendEP7_object 
+ 813
{code}

The input {{const char*}} value gets implicitly cast to string_view, which 
makes the length check fail in debug builds.

  was:
Reproducer:
{code:python}
 data = [b'\x19h\r\x9e\x00\x00\x00\x00\x01\x9b\x9fA']
assert len(data[0]) == 12
ty = pa.binary(12)
arr = pa.array(data, type=ty)
{code}

Trace:
{code}
pyarrow/tests/test_convert_builtin.py::test_fixed_size_binary_length_check 
../src/arrow/array/builder_binary.cc:53:  Check failed: (size) == (byte_width_) 
Appending wrong size to FixedSizeBinaryBuilder
0   libarrow.200.0.0.dylib  0x00010e7f9704 
_ZN5arrow4util7CerrLog14PrintBackTraceEv + 52
1   libarrow.200.0.0.dylib  0x00010e7f9622 
_ZN5arrow4util7CerrLogD2Ev + 98
2   libarrow.200.0.0.dylib  0x00010e7f9585 
_ZN5arrow4util7CerrLogD1Ev + 21
3   libarrow.200.0.0.dylib  0x00010e7f95ac 
_ZN5arrow4util7CerrLogD0Ev + 28
4   libarrow.200.0.0.dylib  0x00010e7f9492 
_ZN5arrow4util8ArrowLogD2Ev + 82
5   libarrow.200.0.0.dylib  0x00010e7f94c5 
_ZN5arrow4util8ArrowLogD1Ev + 21
6   libarrow.200.0.0.dylib  0x00010e303ec1 
_ZN5arrow22FixedSizeBinaryBuilder14CheckValueSizeEx + 209
7   libarrow.200.0.0.dylib  0x00010e30c361 
_ZN5arrow22FixedSizeBinaryBuilder12UnsafeAppendEN6nonstd7sv_lite17basic_string_viewIcNSt3__111char_traitsIc
 + 49
8   libarrow_python.200.0.0.dylib   0x00010b4efa7d 
_ZN5arrow2py20PyPrimitiveConverterINS_19FixedSizeBinaryTypeEvE6AppendEP7_object 
+ 813
{code}

The input {{const char*}} value gets implicitly cast to string_view, which 
makes the length check fail in debug builds.


> [Python] Segfault when converting to fixed size binary array
> 
>
> Key: ARROW-10193
> URL: https://issues.apache.org/jira/browse/ARROW-10193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> Reproducer:
> {code:python}
>    data = [b'\x19h\r\x9e\x00\x00\x00\x00\x01\x9b\x9fA']
> assert len(data[0]) == 12
> ty = pa.binary(12)
> arr = pa.array(data, type=ty)
> {code}
> Trace:
> {code}
> pyarrow/tests/test_convert_builtin.py::test_fixed_size_binary_length_check 
> ../src/arrow/array/builder_binary.cc:53:  Check failed: (size) == 
> (byte_width_) Appending wrong size to FixedSizeBinaryBuilder
> 0   libarrow.200.0.0.dylib  0x00010e7f9704 
> _ZN5arrow4util7CerrLog14PrintBackTraceEv + 52
> 1   libarrow.200.0.0.dylib  0x00010e7f9622 
> _ZN5arrow4util7CerrLogD2Ev + 98
> 2   libarrow.200.0.0.dylib  0x00010e7f9585 
> _ZN5arrow4util7CerrLogD1Ev + 21
> 3   libarrow.200.0.0.dylib  0x00010e7f95ac 
> _ZN5arrow4util7CerrLogD0Ev + 28
> 4   libarrow.200.0.0.dylib  0x00010e7f9492 
> _ZN5arrow4util8ArrowLogD2Ev + 82
> 5   libarrow.200.0.0.dylib  0x00010e7f94c5 
> _ZN5arrow4util8ArrowLogD1Ev + 21
> 6   libarrow.200.0.0.dylib  0x00010e303ec1 
> _ZN5arrow22FixedSizeBinaryBuilder14CheckValueSizeEx + 209
> 7   libarrow.200.0.0.dylib  0x00010e30c361 
> 

[jira] [Updated] (ARROW-10193) [Python] Segfault when converting to fixed size binary array

2020-10-06 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10193:

Description: 
Reproducer:
{code:python}
data = [b'\x19h\r\x9e\x00\x00\x00\x00\x01\x9b\x9fA']
assert len(data[0]) == 12
ty = pa.binary(12)
arr = pa.array(data, type=ty)
{code}

Trace:
{code}
pyarrow/tests/test_convert_builtin.py::test_fixed_size_binary_length_check 
../src/arrow/array/builder_binary.cc:53:  Check failed: (size) == (byte_width_) 
Appending wrong size to FixedSizeBinaryBuilder
0   libarrow.200.0.0.dylib  0x00010e7f9704 
_ZN5arrow4util7CerrLog14PrintBackTraceEv + 52
1   libarrow.200.0.0.dylib  0x00010e7f9622 
_ZN5arrow4util7CerrLogD2Ev + 98
2   libarrow.200.0.0.dylib  0x00010e7f9585 
_ZN5arrow4util7CerrLogD1Ev + 21
3   libarrow.200.0.0.dylib  0x00010e7f95ac 
_ZN5arrow4util7CerrLogD0Ev + 28
4   libarrow.200.0.0.dylib  0x00010e7f9492 
_ZN5arrow4util8ArrowLogD2Ev + 82
5   libarrow.200.0.0.dylib  0x00010e7f94c5 
_ZN5arrow4util8ArrowLogD1Ev + 21
6   libarrow.200.0.0.dylib  0x00010e303ec1 
_ZN5arrow22FixedSizeBinaryBuilder14CheckValueSizeEx + 209
7   libarrow.200.0.0.dylib  0x00010e30c361 
_ZN5arrow22FixedSizeBinaryBuilder12UnsafeAppendEN6nonstd7sv_lite17basic_string_viewIcNSt3__111char_traitsIc
 + 49
8   libarrow_python.200.0.0.dylib   0x00010b4efa7d 
_ZN5arrow2py20PyPrimitiveConverterINS_19FixedSizeBinaryTypeEvE6AppendEP7_object 
+ 813
{code}

The input {{const char*}} value gets implicitly cast to string_view, which 
makes the length check fail in debug builds.

  was:
Reproducer:
{code:python}
   data = [b'\x19h\r\x9e\x00\x00\x00\x00\x01\x9b\x9fA']
assert len(data[0]) == 12
ty = pa.binary(12)
arr = pa.array(data, type=ty)
{code}

Trace:
{code}
pyarrow/tests/test_convert_builtin.py::test_fixed_size_binary_length_check 
../src/arrow/array/builder_binary.cc:53:  Check failed: (size) == (byte_width_) 
Appending wrong size to FixedSizeBinaryBuilder
0   libarrow.200.0.0.dylib  0x00010e7f9704 
_ZN5arrow4util7CerrLog14PrintBackTraceEv + 52
1   libarrow.200.0.0.dylib  0x00010e7f9622 
_ZN5arrow4util7CerrLogD2Ev + 98
2   libarrow.200.0.0.dylib  0x00010e7f9585 
_ZN5arrow4util7CerrLogD1Ev + 21
3   libarrow.200.0.0.dylib  0x00010e7f95ac 
_ZN5arrow4util7CerrLogD0Ev + 28
4   libarrow.200.0.0.dylib  0x00010e7f9492 
_ZN5arrow4util8ArrowLogD2Ev + 82
5   libarrow.200.0.0.dylib  0x00010e7f94c5 
_ZN5arrow4util8ArrowLogD1Ev + 21
6   libarrow.200.0.0.dylib  0x00010e303ec1 
_ZN5arrow22FixedSizeBinaryBuilder14CheckValueSizeEx + 209
7   libarrow.200.0.0.dylib  0x00010e30c361 
_ZN5arrow22FixedSizeBinaryBuilder12UnsafeAppendEN6nonstd7sv_lite17basic_string_viewIcNSt3__111char_traitsIc
 + 49
8   libarrow_python.200.0.0.dylib   0x00010b4efa7d 
_ZN5arrow2py20PyPrimitiveConverterINS_19FixedSizeBinaryTypeEvE6AppendEP7_object 
+ 813
{code}

The input {{const char*}} value gets implicitly cast to string_view, which 
makes the length check fail in debug builds.


> [Python] Segfault when converting to fixed size binary array
> 
>
> Key: ARROW-10193
> URL: https://issues.apache.org/jira/browse/ARROW-10193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> Reproducer:
> {code:python}
> data = [b'\x19h\r\x9e\x00\x00\x00\x00\x01\x9b\x9fA']
> assert len(data[0]) == 12
> ty = pa.binary(12)
> arr = pa.array(data, type=ty)
> {code}
> Trace:
> {code}
> pyarrow/tests/test_convert_builtin.py::test_fixed_size_binary_length_check 
> ../src/arrow/array/builder_binary.cc:53:  Check failed: (size) == 
> (byte_width_) Appending wrong size to FixedSizeBinaryBuilder
> 0   libarrow.200.0.0.dylib  0x00010e7f9704 
> _ZN5arrow4util7CerrLog14PrintBackTraceEv + 52
> 1   libarrow.200.0.0.dylib  0x00010e7f9622 
> _ZN5arrow4util7CerrLogD2Ev + 98
> 2   libarrow.200.0.0.dylib  0x00010e7f9585 
> _ZN5arrow4util7CerrLogD1Ev + 21
> 3   libarrow.200.0.0.dylib  0x00010e7f95ac 
> _ZN5arrow4util7CerrLogD0Ev + 28
> 4   libarrow.200.0.0.dylib  0x00010e7f9492 
> _ZN5arrow4util8ArrowLogD2Ev + 82
> 5   libarrow.200.0.0.dylib  0x00010e7f94c5 
> _ZN5arrow4util8ArrowLogD1Ev + 21
> 6   libarrow.200.0.0.dylib  0x00010e303ec1 
> _ZN5arrow22FixedSizeBinaryBuilder14CheckValueSizeEx + 209
> 7   libarrow.200.0.0.dylib  0x00010e30c361 
> 

[jira] [Commented] (ARROW-10128) [Rust] Dictionary-encoding is out of spec

2020-10-06 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208710#comment-17208710
 ] 

Andrew Lamb commented on ARROW-10128:
-

:thumbsup:

> [Rust] Dictionary-encoding is out of spec
> -
>
> Key: ARROW-10128
> URL: https://issues.apache.org/jira/browse/ARROW-10128
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: Jorge Leitão
>Priority: Major
>
> According to [the 
> spec|https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout],
>  every array can be dictionary-encoded, whereby its values are represented by 
> indices into a unique set of values.
> However, none of our arrays support this encoding and the physical memory 
> layout of this encoding is not being fulfilled.
> We have a DictionaryArray, but, AFAIK, it does not respect the physical 
> memory layout set out by the spec.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10128) [Rust] Dictionary-encoding is out of spec

2020-10-06 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-10128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208693#comment-17208693
 ] 

Jorge Leitão commented on ARROW-10128:
--

I am sorry, this is a non-issue. I looked at it more carefully and realized I 
had misinterpreted our implementation.

> [Rust] Dictionary-encoding is out of spec
> -
>
> Key: ARROW-10128
> URL: https://issues.apache.org/jira/browse/ARROW-10128
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: Jorge Leitão
>Priority: Major
>
> According to [the 
> spec|https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout],
>  every array can be dictionary-encoded, whereby its values are represented by 
> indices into a unique set of values.
> However, none of our arrays support this encoding and the physical memory 
> layout of this encoding is not being fulfilled.
> We have a DictionaryArray, but, AFAIK, it does not respect the physical 
> memory layout set out by the spec.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-10128) [Rust] Dictionary-encoding is out of spec

2020-10-06 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão closed ARROW-10128.

Resolution: Invalid

> [Rust] Dictionary-encoding is out of spec
> -
>
> Key: ARROW-10128
> URL: https://issues.apache.org/jira/browse/ARROW-10128
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: Jorge Leitão
>Priority: Major
>
> According to [the 
> spec|https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout],
>  every array can be dictionary-encoded, whereby its values are represented by 
> indices into a unique set of values.
> However, none of our arrays support this encoding and the physical memory 
> layout of this encoding is not being fulfilled.
> We have a DictionaryArray, but, AFAIK, it does not respect the physical 
> memory layout set out by the spec.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10192) [C++][Python] Segfault when converting nested struct array with dictionary field to pandas series

2020-10-06 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-10192:
--

Assignee: Antoine Pitrou

> [C++][Python] Segfault when converting nested struct array with dictionary 
> field to pandas series
> -
>
> Key: ARROW-10192
> URL: https://issues.apache.org/jira/browse/ARROW-10192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Krisztian Szucs
>Assignee: Antoine Pitrou
>Priority: Major
>
> Reproducer:
> {code:python}
> def test_struct_array_with_dictionary_field_to_pandas():
> ty = pa.struct([
> pa.field('dict', pa.dictionary(pa.int64(), pa.int32())),
> ])
> data = [
> {'dict': -1859762450}
> ]
> arr = pa.array(data, type=ty)
> arr.to_pandas()
> {code}
> Raises SIGSTOP:
> {code}
> * thread #1, stop reason = signal SIGSTOP
>   * frame #0: 0x7fff6e2b733a libsystem_kernel.dylib`__pthread_kill + 10
> frame #1: 0x7fff6e373e60 libsystem_pthread.dylib`pthread_kill + 430
> frame #2: 0x7fff6e1ce93e libsystem_c.dylib`raise + 26
> frame #3: 0x7fff6e3685fd libsystem_platform.dylib`_sigtramp + 29
> frame #4: 0x00011517adfd 
> libarrow_python.200.0.0.dylib`arrow::py::ConvertStruct(options=0x7f84fc5a0230,
>  data=0x7f84fc59ef18, out_values=0x7f84fc53d140) at 
> arrow_to_pandas.cc:685:54
> frame #5: 0x00011514c642 
> libarrow_python.200.0.0.dylib`arrow::py::ObjectWriterVisitor::Visit(this=0x7ffee06a1a88,
>  type=0x7f84fc5a00e8) at arrow_to_pandas.cc:1031:12
> frame #6: 0x0001151499c4 libarrow_python.200.0.0.dylib`arrow::Status 
> arrow::VisitTypeInline(type=0x7f84fc5a00e8,
>  visitor=0x7ffee06a1a88) at visitor_inline.h:88:5
> frame #7: 0x000115149305 
> libarrow_python.200.0.0.dylib`arrow::py::ObjectWriter::CopyInto(this=0x7f84fc5a0228,
>  data=std::__1::shared_ptr::element_type @ 
> 0x7f84fc59ef18 strong=2 weak=1, rel_placement=0) at arrow_to_pand
> as.cc:1055:12
> {code}
> {code:cpp}
> frame #4: 0x00011517adfd 
> libarrow_python.200.0.0.dylib`arrow::py::ConvertStruct(options=0x7f84fc5a0230,
>  data=0x7f84fc59ef18, out_values=0x7f84fc53d140) at 
> arrow_to_pandas.cc:685:54
>    682      if (!arr->field(static_cast<int>(field_idx))->IsNull(i)) {
>    683        // Value exists in child array, obtain it
>    684        auto array = reinterpret_cast<PyArrayObject*>(fields_data[field_idx].obj());
> -> 685        auto ptr = reinterpret_cast<const char*>(PyArray_GETPTR1(array, i));
>    686        field_value.reset(PyArray_GETITEM(array, ptr));
>    687        RETURN_IF_PYERROR();
>    688      } else {
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8464) [Rust] [DataFusion] Add better and faster support for dictionary types

2020-10-06 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-8464:
---
Summary: [Rust] [DataFusion] Add better and faster support for dictionary 
types  (was: [Rust] [DataFusion] Add support for dictionary types)

> [Rust] [DataFusion] Add better and faster support for dictionary types
> --
>
> Key: ARROW-8464
> URL: https://issues.apache.org/jira/browse/ARROW-8464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> Usecases: Efficiently process large columns of low cardinality Strings
>  
>  * BatchIterator should accept both DictionaryBatch and RecordBatch
>  * The Type Coercion optimizer rule should inject expressions for converting 
> dictionary value types to index types (for equality expressions, and 
> IN(values, ...))
>  * Physical expressions would look up the index for dictionary values referenced 
> in the query so that at runtime, only indices are compared per batch



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8464) [Rust] [DataFusion] Add support for dictionary types

2020-10-06 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208691#comment-17208691
 ] 

Andrew Lamb commented on ARROW-8464:


FYI [~andygrove] -- I am doing some part of this in ARROW-10159 -- however, the 
initial implementation effectively converts DictionaryArray --> PrimitiveArray 
/ StringArray and then uses the existing processing.

To support the truly efficient processing use case, I think significant work 
will be needed to add appropriate dictionary support to the arrow compute 
kernels.
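
A sketch of that convert-then-compute fallback, illustrated with pyarrow 
rather than the Rust code path (assuming, as current pyarrow does, that the 
cast kernel can unpack a dictionary to its value type):

{code:python}
import pyarrow as pa

# Dictionary-encode a low-cardinality string column, then materialize it
# back to a plain StringArray so the ordinary kernels can process it.
dict_arr = pa.array(["foo", "bar", "foo"]).dictionary_encode()
plain = dict_arr.cast(pa.string())
assert plain.equals(pa.array(["foo", "bar", "foo"]))
{code}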

> [Rust] [DataFusion] Add support for dictionary types
> 
>
> Key: ARROW-8464
> URL: https://issues.apache.org/jira/browse/ARROW-8464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> Usecases: Efficiently process large columns of low cardinality Strings
>  
>  * BatchIterator should accept both DictionaryBatch and RecordBatch
>  * The Type Coercion optimizer rule should inject expressions for converting 
> dictionary value types to index types (for equality expressions, and 
> IN(values, ...))
>  * Physical expressions would look up the index for dictionary values referenced 
> in the query so that at runtime, only indices are compared per batch



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10128) [Rust] Dictionary-encoding is out of spec

2020-10-06 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208689#comment-17208689
 ] 

Andrew Lamb commented on ARROW-10128:
-

Thanks [~jorgecarleitao] -- I wonder what you mean by "none of our arrays 
support this encoding"? Do you have any more specifics on where the 
implementation differs from the spec?

The `DictionaryArray` seems to support indices / values according to my 
understanding of the spec. I would say that the DictionaryArray type is not 
particularly easy to work with (as the Rust type only encodes the type of the 
index array, not (also) the type of the dictionary values).
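
For cross-reference, the layout the spec describes is easy to inspect from 
Python; a small sketch using pyarrow (not the Rust crate under discussion):

{code:python}
import pyarrow as pa

# A dictionary-encoded array is stored as integer indices plus a
# separate dictionary holding the unique values.
arr = pa.array(["foo", "bar", "foo", "foo"]).dictionary_encode()
print(arr.indices)     # index values: [0, 1, 0, 0]
print(arr.dictionary)  # unique values: ["foo", "bar"]
{code}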


> [Rust] Dictionary-encoding is out of spec
> -
>
> Key: ARROW-10128
> URL: https://issues.apache.org/jira/browse/ARROW-10128
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: Jorge Leitão
>Priority: Major
>
> According to [the 
> spec|https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout],
>  every array can be dictionary-encoded, whereby its values are represented by 
> indices into a unique set of values.
> However, none of our arrays support this encoding and the physical memory 
> layout of this encoding is not being fulfilled.
> We have a DictionaryArray, but, AFAIK, it does not respect the physical 
> memory layout set out by the spec.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10168) [Rust] [Parquet] Extend arrow schema conversion to projected fields

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10168:
---
Labels: pull-request-available  (was: )

> [Rust] [Parquet] Extend arrow schema conversion to projected fields
> ---
>
> Key: ARROW-10168
> URL: https://issues.apache.org/jira/browse/ARROW-10168
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When writing Arrow data to Parquet, we serialise the schema's IPC 
> representation. This schema is then read back by the Parquet reader, and used 
> to preserve the array type information from the original Arrow data.
> We however do not rely on the above mechanism when reading projected columns 
> from a Parquet file; i.e. if we have a file with 3 columns but only read 
> 2 of them, we do not yet rely on the serialised arrow schema, and can thus 
> lose type information.
> This behaviour was deliberately left out, as the function 
> *parquet_to_arrow_schema_by_columns* does not check for the existence of an 
> arrow schema in the metadata.
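
For comparison, the projected read described above, sketched with pyarrow 
(which does consult the embedded arrow schema; the file name and values here 
are illustrative only):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "a": pa.array([1, 2], type=pa.int64()),
    "b": pa.array(["x", "y"]),
    "c": pa.array([True, False]),
})
pq.write_table(table, "three_cols.parquet")
# Read back only two of the three columns; the types must be preserved.
projected = pq.read_table("three_cols.parquet", columns=["a", "b"])
assert projected.schema.types == [pa.int64(), pa.string()]
{code}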



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10191) [Rust] [Parquet] Add roundtrip tests for single column batches

2020-10-06 Thread Carol Nichols (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208685#comment-17208685
 ] 

Carol Nichols commented on ARROW-10191:
---

Hi, I have a Jira account now!

> [Rust] [Parquet] Add roundtrip tests for single column batches
> --
>
> Key: ARROW-10191
> URL: https://issues.apache.org/jira/browse/ARROW-10191
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> To aid with test coverage and picking up information loss during Parquet and 
> Arrow roundtrips, we can add tests that assert that all supported Arrow 
> datatypes can be written and read correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10192) [C++][Python] Segfault when converting nested struct array with dictionary field to pandas series

2020-10-06 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-10192:
---

 Summary: [C++][Python] Segfault when converting nested struct 
array with dictionary field to pandas series
 Key: ARROW-10192
 URL: https://issues.apache.org/jira/browse/ARROW-10192
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Krisztian Szucs


Reproducer:

{code:python}
def test_struct_array_with_dictionary_field_to_pandas():
ty = pa.struct([
pa.field('dict', pa.dictionary(pa.int64(), pa.int32())),
])
data = [
{'dict': -1859762450}
]
arr = pa.array(data, type=ty)
arr.to_pandas()
{code}

Raises SIGSTOP:
{code}
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x7fff6e2b733a libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x7fff6e373e60 libsystem_pthread.dylib`pthread_kill + 430
frame #2: 0x7fff6e1ce93e libsystem_c.dylib`raise + 26
frame #3: 0x7fff6e3685fd libsystem_platform.dylib`_sigtramp + 29
frame #4: 0x00011517adfd 
libarrow_python.200.0.0.dylib`arrow::py::ConvertStruct(options=0x7f84fc5a0230,
 data=0x7f84fc59ef18, out_values=0x7f84fc53d140) at 
arrow_to_pandas.cc:685:54
frame #5: 0x00011514c642 
libarrow_python.200.0.0.dylib`arrow::py::ObjectWriterVisitor::Visit(this=0x7ffee06a1a88,
 type=0x7f84fc5a00e8) at arrow_to_pandas.cc:1031:12
frame #6: 0x0001151499c4 libarrow_python.200.0.0.dylib`arrow::Status 
arrow::VisitTypeInline(type=0x7f84fc5a00e8, 
visitor=0x7ffee06a1a88) at visitor_inline.h:88:5
frame #7: 0x000115149305 
libarrow_python.200.0.0.dylib`arrow::py::ObjectWriter::CopyInto(this=0x7f84fc5a0228,
 data=std::__1::shared_ptr::element_type @ 
0x7f84fc59ef18 strong=2 weak=1, rel_placement=0) at arrow_to_pand
as.cc:1055:12
{code}

{code:cpp}
frame #4: 0x00011517adfd 
libarrow_python.200.0.0.dylib`arrow::py::ConvertStruct(options=0x7f84fc5a0230,
 data=0x7f84fc59ef18, out_values=0x7f84fc53d140) at 
arrow_to_pandas.cc:685:54
   682if (!arr->field(static_cast(field_idx))->IsNull(i)) {
   683  // Value exists in child array, obtain it
   684  auto array = 
reinterpret_cast(fields_data[field_idx].obj());
-> 685  auto ptr = reinterpret_cast(PyArray_GETPTR1(array, i));
   686  field_value.reset(PyArray_GETITEM(array, ptr));
   687  RETURN_IF_PYERROR();
   688} else {
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10191) [Rust] [Parquet] Add roundtrip tests for single column batches

2020-10-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10191:
---
Labels: pull-request-available  (was: )

> [Rust] [Parquet] Add roundtrip tests for single column batches
> --
>
> Key: ARROW-10191
> URL: https://issues.apache.org/jira/browse/ARROW-10191
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> To aid with test coverage and picking up information loss during Parquet and 
> Arrow roundtrips, we can add tests that assert that all supported Arrow 
> datatypes can be written and read correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10191) [Rust] [Parquet] Add roundtrip tests for single column batches

2020-10-06 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10191:
--

 Summary: [Rust] [Parquet] Add roundtrip tests for single column 
batches
 Key: ARROW-10191
 URL: https://issues.apache.org/jira/browse/ARROW-10191
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


To aid with test coverage and picking up information loss during Parquet and 
Arrow roundtrips, we can add tests that assert that all supported Arrow 
datatypes can be written and read correctly.
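
For reference, the same roundtrip assertion sketched with pyarrow (the Rust 
test will differ in API but not in spirit):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Write a single-column table and assert nothing is lost on the way back.
table = pa.table({"col": pa.array([1, 2, None], type=pa.int64())})
pq.write_table(table, "roundtrip.parquet")
assert pq.read_table("roundtrip.parquet").equals(table)
{code}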



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10165) [Rust] [DataFusion] Allow DataFusion to cast all type combinations supported by Arrow cast kernel

2020-10-06 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-10165:

Summary: [Rust] [DataFusion] Allow DataFusion to cast all type combinations 
supported by Arrow cast kernel  (was: [Rust] [DataFusion] Add DictionaryArray 
coercion support)

> [Rust] [DataFusion] Allow DataFusion to cast all type combinations supported 
> by Arrow cast kernel
> -
>
> Key: ARROW-10165
> URL: https://issues.apache.org/jira/browse/ARROW-10165
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When the DataFusion planner inserts casts, today it relies on special logic 
> to determine the valid casts. 
> The actual arrow cast kernels support a much wider range of data types, and 
> thus DataFusion is artificially limiting the casts it supports for no 
> particularly good reason I can see.
> This ticket tracks the work to remove the extra cast checking in the 
> datafusion planner and instead simply rely on a runtime check by the arrow 
> cast compute kernel.
> The potential downside of this approach is that errors may be generated 
> later in the execution process (rather than in the planner), possibly with a 
> less specific error message; the upside is that there is less code and we 
> get several conversions immediately (like timestamp predicate casting).
> I also plan to add DictionaryArray support to the casting kernels, and I would 
> like to avoid having to replicate part of that logic in DataFusion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10163) [Rust] [DataFusion] Add DictionaryArray coercion support

2020-10-06 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-10163:

Summary: [Rust] [DataFusion] Add DictionaryArray coercion support  (was: 
[Rust] [DataFusion] Allow DataFusion to cast all type combinations supported by 
Arrow cast kernel)

> [Rust] [DataFusion] Add DictionaryArray coercion support
> 
>
> Key: ARROW-10163
> URL: https://issues.apache.org/jira/browse/ARROW-10163
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>
> --- 
> There is code in the datafusion physical planner that coerces arguments to 
> compatible types for some expressions (e.g. for equals: 
> https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/expressions.rs#L1153)
> This code needs to be modified to understand dictionary types (so that, for 
> example, we can express a predicate like col1 = "foo", where col1 is a 
> DictionaryArray).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

