date:20200825



 [ 
https://issues.apache.org/jira/browse/ARROW-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-7226:


Assignee: Apache Arrow JIRA Bot  (was: Andrew Wieteska)

> [JSON][Python] Json loader fails on example in documentation.
> -
>
> Key: ARROW-7226
> URL: https://issues.apache.org/jira/browse/ARROW-7226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Rinke Hoekstra
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I was just trying this with the example found in the pyarrow docs at 
> [http://arrow.apache.org/docs/python/json.html]
> The documented example does not work. Is this related to this issue, or is it 
> another matter?
> It says to load the following JSON file:
> {{{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}}}
>  {{{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}}}
> I fixed this to make it valid JSON (It is valid [JSON 
> Lines|[http://jsonlines.org/]], but that's another issue):
> {{[{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}},}}
>  {{{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}]}}
> Then reading the JSON from a file called `my_data.json`:
> {{from pyarrow import json}}
>  {{table = json.read_json("my_data.json")}}
> Gives the following error:
> {code:java}
> ---}}
>  ArrowInvalid Traceback (most recent call last)
>   in ()
>  1 from pyarrow import json
>  > 2 table = json.read_json('test.json')
> ~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/_json.pyx
>  in pyarrow._json.read_json()
> ~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> ArrowInvalid: JSON parse error: A column changed from object to array
> {code}
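
The error above is consistent with the reader expecting line-delimited JSON (one object per line), which is the form the documentation example uses, rather than a single top-level array. The following is a minimal sketch, using only the Python standard library, of turning the array-style file back into JSON Lines. The helper name `array_to_json_lines` is hypothetical and is not part of pyarrow.

```python
import json


def array_to_json_lines(text: str) -> str:
    """Convert a JSON document holding a top-level array of objects
    into JSON Lines text (one object per line)."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    return "\n".join(json.dumps(record) for record in records) + "\n"


# The "fixed" array-style input from the report.
array_style = (
    '[{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}},\n'
    ' {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}]'
)
json_lines = array_to_json_lines(array_style)
print(json_lines, end="")
```

Writing `json_lines` to `my_data.json` would give a file in the shape the documentation example describes.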



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Assigned] (ARROW-7226) [JSON][Python] Json loader fails on example in documentation.



 [ 
https://issues.apache.org/jira/browse/ARROW-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-7226:


Assignee: Andrew Wieteska  (was: Apache Arrow JIRA Bot)

> [JSON][Python] Json loader fails on example in documentation.
> -
>
> Key: ARROW-7226
> URL: https://issues.apache.org/jira/browse/ARROW-7226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Rinke Hoekstra
>Assignee: Andrew Wieteska
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>

[jira] [Updated] (ARROW-7226) [JSON][Python] Json loader fails on example in documentation.



 [ 
https://issues.apache.org/jira/browse/ARROW-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7226:
--
Labels: pull-request-available  (was: )

> [JSON][Python] Json loader fails on example in documentation.
> -
>
> Key: ARROW-7226
> URL: https://issues.apache.org/jira/browse/ARROW-7226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Rinke Hoekstra
>Assignee: Andrew Wieteska
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>

[jira] [Assigned] (ARROW-7226) [JSON][Python] Json loader fails on example in documentation.

2020-08-25 Thread Andrew Wieteska (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wieteska reassigned ARROW-7226:
--

Assignee: Andrew Wieteska

> [JSON][Python] Json loader fails on example in documentation.
> -
>
> Key: ARROW-7226
> URL: https://issues.apache.org/jira/browse/ARROW-7226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Rinke Hoekstra
>Assignee: Andrew Wieteska
>Priority: Major
>

[jira] [Resolved] (ARROW-9849) [Rust] [DataFusion] Make UDFs not need a Field

2020-08-25 Thread Andy Grove (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9849.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8045
[https://github.com/apache/arrow/pull/8045]

> [Rust] [DataFusion] Make UDFs not need a Field
> --
>
> Key: ARROW-9849
> URL: https://issues.apache.org/jira/browse/ARROW-9849
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/arrow/pull/7967] shows that it is possible to not 
> require users to pass a `Field` to UDF declarations and instead just pass a 
> `DataType`.
> Let's deprecate `Field` in them, and instead just use `DataType`.




[jira] [Resolved] (ARROW-9464) [Rust] [DataFusion] Physical plan refactor to support optimization rules and more efficient use of threads

2020-08-25 Thread Andy Grove (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9464.
---
Resolution: Fixed

Issue resolved by pull request 8034
[https://github.com/apache/arrow/pull/8034]

> [Rust] [DataFusion] Physical plan refactor to support optimization rules and 
> more efficient use of threads
> --
>
> Key: ARROW-9464
> URL: https://issues.apache.org/jira/browse/ARROW-9464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I would like to propose a refactor of the physical/execution planning based 
> on the experience I have had in implementing distributed execution in 
> Ballista.
> This will likely need subtasks but here is an overview of the changes I am 
> proposing.
> h3. *Introduce physical plan optimization rule to insert "shuffle" operators*
> We should extend the ExecutionPlan trait so that each operator can specify 
> its input and output partitioning needs, and then have an optimization rule 
> that can insert any repartitioning or reordering steps required.
> For example, these are the methods to be added to ExecutionPlan. This design 
> is based on Apache Spark.
>  
> {code:java}
> /// Specifies how data is partitioned across different nodes in the cluster
> fn output_partitioning() -> Partitioning {
>     Partitioning::UnknownPartitioning(0)
> }
> /// Specifies the data distribution requirements of all the children for this operator
> fn required_child_distribution() -> Distribution {
>     Distribution::UnspecifiedDistribution
> }
> /// Specifies how data is ordered in each partition
> fn output_ordering() -> Option<Vec<SortOrder>> {
>     None
> }
> /// Specifies the ordering requirements of all the children for this operator
> fn required_child_ordering() -> Option<Vec<Vec<SortOrder>>> {
>     None
> }
>  {code}
> A good example of applying this rule would be in the case of hash aggregates 
> where we perform a partial aggregate in parallel across partitions and then 
> coalesce the results and apply a final hash aggregate.
> Another example would be a SortMergeExec specifying the sort order required 
> for its children.
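
To make the hash aggregate case concrete, here is a minimal sketch in Python of a bottom-up rule that inserts a merge step wherever a parent requires a single partition but its child produces several. Everything in it (the `Plan` class, its fields, the `Merge` operator name) is hypothetical and only illustrates the idea; it is not DataFusion's `ExecutionPlan` API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Plan:
    """Hypothetical physical operator; not DataFusion's ExecutionPlan."""
    name: str
    children: List["Plan"] = field(default_factory=list)
    output_partitions: int = 1               # partitions this operator produces
    requires_single_partition: bool = False  # requirement placed on children


def insert_merges(plan: Plan) -> Plan:
    """Bottom-up rule: wrap any child that yields several partitions in a
    Merge operator when its parent requires a single partition."""
    new_children = []
    for child in plan.children:
        child = insert_merges(child)
        if plan.requires_single_partition and child.output_partitions > 1:
            child = Plan("Merge", [child], output_partitions=1)
        new_children.append(child)
    return Plan(plan.name, new_children, plan.output_partitions,
                plan.requires_single_partition)


def render(plan: Plan) -> str:
    """Render the plan tree as nested operator names."""
    inner = ", ".join(render(c) for c in plan.children)
    return f"{plan.name}({inner})" if inner else plan.name


scan = Plan("Scan", output_partitions=4)
partial = Plan("PartialHashAggregate", [scan], output_partitions=4)
final = Plan("FinalHashAggregate", [partial], requires_single_partition=True)
print(render(insert_merges(final)))
```

The partial aggregate keeps running per partition, and the rule only adds the coalescing step in front of the final aggregate, which is the operator that declared the requirement.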
>  
>  




[jira] [Resolved] (ARROW-9816) [C++] Escape quotes in config.h

2020-08-25 Thread Kouhei Sutou (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-9816.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8016
[https://github.com/apache/arrow/pull/8016]

> [C++] Escape quotes in config.h
> ---
>
> Key: ARROW-9816
> URL: https://issues.apache.org/jira/browse/ARROW-9816
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Lawrence Chan
>Assignee: Lawrence Chan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently the config.h file is generated without the `ESCAPE_QUOTES` option, 
> which causes quotes in e.g. CXXFLAGS to break config.h parsing.
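
As an illustration of the failure mode, the sketch below renders a C `#define` line from a flags string, with and without backslash-escaping the embedded quotes, which is roughly the effect of CMake's `ESCAPE_QUOTES` option for `configure_file`. The macro name `EXAMPLE_CXX_FLAGS`, the flags value, and both helper functions are made up for the example and are not taken from Arrow's build files.

```python
def render_define(name: str, value: str, escape_quotes: bool) -> str:
    """Render a C `#define NAME "value"` line, optionally backslash-escaping
    double quotes inside the value."""
    if escape_quotes:
        value = value.replace('"', '\\"')
    return f'#define {name} "{value}"'


def unescaped_quote_count(line: str) -> int:
    """Count double quotes that are not preceded by a backslash."""
    return sum(
        1 for i, ch in enumerate(line)
        if ch == '"' and (i == 0 or line[i - 1] != "\\")
    )


flags = '-DGREETING="hello" -O2'  # hypothetical CXXFLAGS value
broken = render_define("EXAMPLE_CXX_FLAGS", flags, escape_quotes=False)
fixed = render_define("EXAMPLE_CXX_FLAGS", flags, escape_quotes=True)
print(broken)  # the string literal ends early at the embedded quote
print(fixed)   # a single, well-formed string literal
```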




[jira] [Assigned] (ARROW-9816) [C++] Escape quotes in config.h

2020-08-25 Thread Kouhei Sutou (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-9816:
---

Assignee: Lawrence Chan

> [C++] Escape quotes in config.h
> ---
>
> Key: ARROW-9816
> URL: https://issues.apache.org/jira/browse/ARROW-9816
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Lawrence Chan
>Assignee: Lawrence Chan
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>

[jira] [Resolved] (ARROW-9813) [C++] Disable semantic interposition

2020-08-25 Thread Kouhei Sutou (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-9813.
-
Resolution: Fixed

Issue resolved by pull request 8048
[https://github.com/apache/arrow/pull/8048]

> [C++] Disable semantic interposition
> 
>
> Key: ARROW-9813
> URL: https://issues.apache.org/jira/browse/ARROW-9813
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> On gcc, semantic interposition is enabled by default. It can be beneficial to 
> disable it when building Arrow libraries (and it's most certainly harmless 
> anyway).
> See 
> https://stackoverflow.com/questions/35745543/new-option-in-gcc-5-3-fno-semantic-interposition
>  for more background on this.




[jira] [Comment Edited] (ARROW-9820) [C++] Plugin Architecture for Filesystem and File IO

2020-08-25 Thread Lawrence Chan (Jira)

[
https://issues.apache.org/jira/browse/ARROW-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184756#comment-17184756
]

Lawrence Chan edited comment on ARROW-9820 at 8/25/20, 10:17 PM:
-

- Language-agnostic - once a storage driver is written/built, _any_ arrow
library can load it (assuming we've finished implementing the plugin API). So
rather than needing to add support to each language, I just need to write the
wrapper once, and then users can use that filesystem in C++, python, go, rust,
whatever.
- Application-agnostic - if users want to use my storage driver in a downstream
application, I can distribute a plugin and arrow can load the plugin at runtime
without needing to do a special build of that application with my filesystem
code. This greatly simplifies the ability for users to add storage
functionality without recompiling the entire world that uses arrow. You might
argue that this could be achieved by linking arrow as a shared library, but
there are use cases where static linking is desirable, or use cases where I
don't control the arrow shared library but the users can obtain my plugin.
- Maintainer-friendly and Sysadmin-friendly - if I maintain a storage driver
plugin, I can version control it entirely independently, distribute it
separately from the arrow library, and have a simpler build system that doesn't
necessarily need to integrate with the arrow cmake machinery. Otherwise
somehow cmake needs to know about the extra filesystem implementation and needs
to do something to embed it at compile-time.
- There are also some functions in the C++ library that have hardcoded string
comparisons to e.g. "hdfs". These are not the hardest ones to solve, because
we could switch it to a lookup from a global mapping that the user can register
a factory function with, but I figured I would mention them anyway.
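
The "global mapping" idea in the last bullet can be sketched as a scheme-to-factory registry. This is a rough Python illustration under stated assumptions: `register_filesystem`, `filesystem_from_uri`, and `MyStoreFileSystem` are hypothetical names and do not correspond to any existing Arrow API.

```python
from typing import Callable, Dict

# Hypothetical global mapping from URI scheme to a factory callable.
_FACTORIES: Dict[str, Callable[[str], object]] = {}


def register_filesystem(scheme: str, factory: Callable[[str], object]) -> None:
    """Register a factory for a URI scheme; reject duplicate registrations."""
    if scheme in _FACTORIES:
        raise ValueError(f"scheme already registered: {scheme}")
    _FACTORIES[scheme] = factory


def filesystem_from_uri(uri: str) -> object:
    """Look up the factory by scheme instead of comparing hardcoded strings."""
    scheme, sep, _ = uri.partition("://")
    if not sep or scheme not in _FACTORIES:
        raise KeyError(f"no filesystem registered for: {uri}")
    return _FACTORIES[scheme](uri)


class MyStoreFileSystem:
    """Stand-in for a third-party storage driver."""

    def __init__(self, uri: str) -> None:
        self.uri = uri


register_filesystem("mystore", MyStoreFileSystem)
fs = filesystem_from_uri("mystore://bucket/path")
print(type(fs).__name__, fs.uri)
```

A runtime plugin would call the registration function when it is loaded, so the core library never needs to know the scheme in advance.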

If you are wondering about the concrete hurdle that prompted this, it's that
the pyarrow bits are seemingly half wrappers to the C++ lib and half
implemented in python, with what I _think_ are manually-written Cython wrappers
around the pieces that need to be visible in python. For my storage library, I
don't really want to mess with forking pyarrow and writing Cython wrappers and
rebuilding pyarrow, and I'd like to just do it once in C/C++ and have it work
in pyarrow automatically.

I understand the hesitation here, but I think the scary bits can be done
safely, and I think this will open the doors to a more organized and
community-driven collection of storage drivers without cluttering the arrow
codebase. For some related prior art, this feels to me like a tiny lower-level
version of CSI plugins. If we wanted to support the whole universe of drivers
from within the arrow codebase, it would get pretty bloated.

was (Author: llchan):
- Language-agnostic - once a storage driver is written/built, _any_ arrow
library can load it (assuming we've finished implementing the plugin API). So
rather than needing to add support to each language, I just need to write the
wrapper once, and then users can use that filesystem in C++, python, go, rust,
whatever.
- Application-agnostic - if users want to use my storage driver in a downstream
application, I can distribute a plugin and arrow can load the plugin at runtime
without needing to do a special build of that application with my filesystem
code. This greatly simplifies the ability for users to add storage
functionality without recompiling the entire world that uses arrow. You might
argue that this could be achieved by linking arrow as a shared library, but
there are use cases where static linking is desirable, or use cases where I
don't control the arrow shared library but the users can obtain my plugin.
- Maintainer-friendly - if I maintain a storage driver plugin, I can version
control it entirely independently, distribute it separately from the arrow
library, and have a simpler build system that doesn't necessarily need to
integrate with the arrow cmake machinery. Otherwise somehow cmake needs to
know about the extra filesystem implementation and needs to do something to
embed it at compile-time.
- There are also some functions in the C++ library that have hardcoded string
comparisons to e.g. "hdfs". These are not the hardest ones to solve, because
we could switch it to a lookup from a global mapping that the user can register
a factory function with, but I figured I would mention them anyway.

[jira] [Commented] (ARROW-9820) [C++] Plugin Architecture for Filesystem and File IO

2020-08-25 Thread Lawrence Chan (Jira)



[ 
https://issues.apache.org/jira/browse/ARROW-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184756#comment-17184756
 ] 

Lawrence Chan commented on ARROW-9820:
--

- Language-agnostic - once a storage driver is written/built, _any_ arrow 
library can load it (assuming we've finished implementing the plugin API). So 
rather than needing to add support to each language, I just need to write the 
wrapper once, and then users can use that filesystem in C++, python, go, rust, 
whatever.
- Application-agnostic - if users want to use my storage driver in a downstream 
application, I can distribute a plugin and arrow can load the plugin at runtime 
without needing to do a special build of that application with my filesystem 
code.  This greatly simplifies the ability for users to add storage 
functionality without recompiling the entire world that uses arrow.  You might 
argue that this could be achieved by linking arrow as a shared library, but 
there are use cases where static linking is desirable, or use cases where I 
don't control the arrow shared library but the users can obtain my plugin.
- Maintainer-friendly - if I maintain a storage driver plugin, I can version 
control it entirely independently, distribute it separately from the arrow 
library, and have a simpler build system that doesn't necessarily need to 
integrate with the arrow cmake machinery.  Otherwise somehow cmake needs to 
know about the extra filesystem implementation and needs to do something to 
embed it at compile-time.
- There are also some functions in the C++ library that have hardcoded string 
comparisons to e.g. "hdfs".  These are not the hardest ones to solve, because 
we could switch it to a lookup from a global mapping that the user can register 
a factory function with, but I figured I would mention them anyway.

If you are wondering about the concrete hurdle that prompted this, it's that 
the pyarrow bits are seemingly half wrappers to the C++ lib and half 
implemented in python, with what I _think_ are manually-written Cython wrappers 
around the pieces that need to be visible in python.  For my storage library, I 
don't really want to mess with forking pyarrow and writing Cython wrappers and 
rebuilding pyarrow, and I'd like to just do it once in C/C++ and have it work 
in pyarrow automatically.

I understand the hesitation here, but I think the scary bits can be done 
safely, and I think this will open the doors to a more organized and 
community-driven collection of storage drivers without cluttering the arrow 
codebase.  For some related prior art, this feels to me like a tiny lower-level 
version of CSI plugins.  If we wanted to support the whole universe of drivers 
from within the arrow codebase, it would get pretty bloated.

> [C++] Plugin Architecture for Filesystem and File IO
> 
>
> Key: ARROW-9820
> URL: https://issues.apache.org/jira/browse/ARROW-9820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Lawrence Chan
>Priority: Minor
>
> Adding a new custom filesystem with corresponding file i/o streams is quite a 
> process at the moment.  Looks like HDFS and S3FS are basically hardcoded in 
> many places.  It would be useful to develop a plugin system to allow users to 
> interface with other data stores without maintaining a permanent fork with 
> hardcoded changes.
> We can either do runtime plugins or compile-time plugins.  Runtime is more 
> user-friendly, but with C++, ABI compatibility is fairly delicate.  So we 
> would either want to use a C ABI or accept a you're-on-your-own situation 
> where the user is expected to be very careful with versioning and compiler 
> flags.
> With compile-time plugins, maybe there's a way to have the cmake machinery 
> build third party code and also register those new URI schemes automatically.




[jira] [Updated] (ARROW-9860) Arrow Flight JavaScript Client or Example

2020-08-25 Thread Alex Monahan (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-9860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Monahan updated ARROW-9860:

Component/s: Python

> Arrow Flight JavaScript Client or Example
> -
>
> Key: ARROW-9860
> URL: https://issues.apache.org/jira/browse/ARROW-9860
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: JavaScript, Python
>Reporter: Alex Monahan
>Priority: Major
>
> Is it possible to use Apache Arrow Flight to send data from a Python Web 
> Server to a JavaScript browser client? If it is possible, is there a code 
> example to use to get started? 
>  
> If this is not possible, what is the fastest way to send data from a Python 
> Web Server to Apache Arrow in the browser today? Would it be faster to send a 
> Parquet file and unpack it client-side, or send Arrow directly/with gzip/ 
> etc.?




[jira] [Commented] (ARROW-9820) [C++] Plugin Architecture for Filesystem and File IO

2020-08-25 Thread Wes McKinney (Jira)



[ 
https://issues.apache.org/jira/browse/ARROW-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184736#comment-17184736
 ] 

Wes McKinney commented on ARROW-9820:
-

What would not be solved by creating an implementation of 
{{arrow::fs::FileSystem}}?

> [C++] Plugin Architecture for Filesystem and File IO
> 
>
> Key: ARROW-9820
> URL: https://issues.apache.org/jira/browse/ARROW-9820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Lawrence Chan
>Priority: Minor
>

[jira] [Created] (ARROW-9860) Arrow Flight JavaScript Client or Example

2020-08-25 Thread Alex Monahan (Jira)

Alex Monahan created ARROW-9860:
---

 Summary: Arrow Flight JavaScript Client or Example
 Key: ARROW-9860
 URL: https://issues.apache.org/jira/browse/ARROW-9860
 Project: Apache Arrow
  Issue Type: Wish
  Components: JavaScript
Reporter: Alex Monahan


Is it possible to use Apache Arrow Flight to send data from a Python Web Server 
to a JavaScript browser client? If it is possible, is there a code example to 
use to get started? 

 

If this is not possible, what is the fastest way to send data from a Python Web 
Server to Apache Arrow in the browser today? Would it be faster to send a 
Parquet file and unpack it client-side, or send Arrow directly/with gzip/ etc.?




[jira] [Created] (ARROW-9859) [C++] S3 FileSystemFromUri with special char in secret key fails

Neal Richardson created ARROW-9859:
--

 Summary: [C++] S3 FileSystemFromUri with special char in secret 
key fails
 Key: ARROW-9859
 URL: https://issues.apache.org/jira/browse/ARROW-9859
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Documentation, Python
Reporter: Neal Richardson
 Fix For: 2.0.0


S3 secret access keys can contain special characters like {{/}}. When they do:

1) FileSystemFromUri will fail to parse the URI unless you URL-encode them 
(e.g. replace / with %2F)
2) When you do escape the special characters, requests that require 
authorization fail with the message "The request signature we calculated does 
not match the signature you provided. Check your key and signing method." This 
may suggest that there's some extra URL encoding/decoding that needs to happen 
inside.
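
For point 1, the URL-encoding step can be sketched with the Python standard library. The credentials and bucket name below are made up, and the sketch covers only the encoding itself; it does not address the signature mismatch described in point 2.

```python
from urllib.parse import quote, unquote, urlsplit

access_key = "AKIAEXAMPLE"   # made-up credentials for illustration
secret_key = "abc/def+ghi"

# safe="" forces "/" to be percent-encoded as well.
uri = "s3://{}:{}@my-bucket/data".format(
    quote(access_key, safe=""), quote(secret_key, safe="")
)
print(uri)

# Whoever consumes the URI has to decode the password again
# before using it to sign requests.
parts = urlsplit(uri)
decoded_secret = unquote(parts.password)
print(decoded_secret == secret_key)
```

If the secret were used in its still-encoded form for signing, the computed signature would differ from the one the service expects, which would be one possible explanation for the error in point 2.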

I was only able to work around this by generating a new access key that 
happened not to have special characters.




[jira] [Created] (ARROW-9858) [C++][Python][Docs] User guide for S3FileSystem

Neal Richardson created ARROW-9858:
--

 Summary: [C++][Python][Docs] User guide for S3FileSystem
 Key: ARROW-9858
 URL: https://issues.apache.org/jira/browse/ARROW-9858
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Documentation, Python
Reporter: Neal Richardson
 Fix For: 2.0.0


https://arrow.apache.org/docs/python/filesystems.html is pretty thin

https://arrow.apache.org/docs/python/api/filesystems.html doesn't mention S3

and in general there are some tricks to getting FileSystemFromUri to work




[jira] [Updated] (ARROW-9857) Failed to install Arrow 0.14.1

2020-08-25 Thread SHOBHIT SHUKLA (Jira)



 [ 
https://issues.apache.org/jira/browse/ARROW-9857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SHOBHIT SHUKLA updated ARROW-9857:
--
Description: 
We are seeing an issue installing the arrow R package on RHEL machines, using 
the command below:
*R -e "install.packages(\"remotes\",repos = 
\"http://cran.r-project.org\");remotes::install_github(\"apache/arrow\", subdir 
= \"r\", ref = \"apache-arrow-0.14.1\")"*


*Error logs :*

The downloaded source packages are in
'/tmp/Rtmpycmy4e/downloaded_packages'
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Running `R CMD build`...
* checking for file 
'/tmp/Rtmpycmy4e/remotesa23015b2/apache-arrow-8d09de2/r/DESCRIPTION' ... OK
* preparing 'arrow':
* checking DESCRIPTION meta-information ... OK
* cleaning src
* running 'cleanup'
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building 'arrow_0.14.1.tar.gz'
* installing *source* package 'arrow' ...
** using staged installation
$ pkg-config --cflags --silence-errors arrow parquet
PKGCONFIG_CFLAGS = "-DNDEBUG  "
$ pkg-config --libs arrow parquet
PKGCONFIG_LIBS = "-lparquet -larrow  "

Found pkg-config cflags and libs!
PKG_CFLAGS=-DNDEBUG   -DARROW_R_WITH_ARROW
PKG_LIBS=-lparquet -larrow  
** libs
g++ -std=gnu++11 -I"/opt/ibm/conda/R/lib64/R/include" -DNDEBUG -DNDEBUG   
-DARROW_R_WITH_ARROW -I"/opt/ibm/conda/R/lib64/R/library/Rcpp/include" 
-I/usr/local/include -fvisibility=hidden -fpic  -g -O2  -c array.cpp -o array.o
In file included from 
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/macros/macros.h:134:0,
 from 
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/r/headers.h:69,
 from 
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/RcppCommon.h:29,
 from ./arrow_types.h:24,
 from array.cpp:18:
./arrow_types.h:188:26: error: 'Type' is not a member of 'arrow::ipc::Message'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
  ^
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/macros/module.h:76:106: 
note: in definition of macro 'RCPP_EXPOSED_ENUM_AS'
 #define RCPP_EXPOSED_ENUM_AS(CLASS)   namespace Rcpp{ namespace traits{ 
template<> struct r_type_traits< CLASS >{ typedef r_type_enum_tag r_category ; 
} ; }}

  ^
./arrow_types.h:188:1: note: in expansion of macro 'RCPP_EXPOSED_ENUM_NODECL'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
 ^
./arrow_types.h:188:26: error: 'Type' is not a member of 'arrow::ipc::Message'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
  ^
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/macros/module.h:76:106: 
note: in definition of macro 'RCPP_EXPOSED_ENUM_AS'
 #define RCPP_EXPOSED_ENUM_AS(CLASS)   namespace Rcpp{ namespace traits{ 
template<> struct r_type_traits< CLASS >{ typedef r_type_enum_tag r_category ; 
} ; }}

  ^
./arrow_types.h:188:1: note: in expansion of macro 'RCPP_EXPOSED_ENUM_NODECL'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
 ^
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/macros/module.h:76:112:
 error: template argument 1 is invalid
 #define RCPP_EXPOSED_ENUM_AS(CLASS)   namespace Rcpp{ namespace traits{ 
template<> struct r_type_traits< CLASS >{ typedef r_type_enum_tag r_category ; 
} ; }}

^
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/macros/module.h:80:3: note: 
in expansion of macro 'RCPP_EXPOSED_ENUM_AS'
   RCPP_EXPOSED_ENUM_AS(CLASS)  \
   ^
./arrow_types.h:188:1: note: in expansion of macro 'RCPP_EXPOSED_ENUM_NODECL'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
 ^
./arrow_types.h:188:26: error: 'Type' is not a member of 
'arrow::ipc::Message'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
  ^
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/macros/module.h:77:109: 
note: in definition of macro 'RCPP_EXPOSED_ENUM_WRAP'
 #define RCPP_EXPOSED_ENUM_WRAP(CLASS) namespace Rcpp{ namespace traits{ 
template<> struct wrap_type_traits< CLASS >{typedef wrap_type_enum_tag 
wrap_category ; } ; }}

 ^
./arrow_types.h:188:1: note: in expansion of macro 'RCPP_EXPOSED_ENUM_NODECL'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
 ^
./arrow_types.h:188:26: error: 'Type' is not a member of 'arrow::ipc::Message'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)

[jira] [Created] (ARROW-9857) Failed to install Arrow 0.14.1

2020-08-25 Thread SHOBHIT SHUKLA (Jira)

SHOBHIT SHUKLA created ARROW-9857:
-

 Summary: Failed to install Arrow 0.14.1
 Key: ARROW-9857
 URL: https://issues.apache.org/jira/browse/ARROW-9857
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.14.1
Reporter: SHOBHIT SHUKLA


We are seeing an issue installing the arrow R package on RHEL machines. We are 
using the command below to install arrow:
R -e "install.packages(\"remotes\",repos = 
\"http://cran.r-project.org\");remotes::install_github(\"apache/arrow\", subdir 
= \"r\", ref = \"apache-arrow-0.14.1\")"


*Error logs :*

The downloaded source packages are in
'/tmp/Rtmpycmy4e/downloaded_packages'
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Running `R CMD build`...
* checking for file 
'/tmp/Rtmpycmy4e/remotesa23015b2/apache-arrow-8d09de2/r/DESCRIPTION' ... OK
* preparing 'arrow':
* checking DESCRIPTION meta-information ... OK
* cleaning src
* running 'cleanup'
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building 'arrow_0.14.1.tar.gz'
* installing *source* package 'arrow' ...
** using staged installation
$ pkg-config --cflags --silence-errors arrow parquet
PKGCONFIG_CFLAGS = "-DNDEBUG  "
$ pkg-config --libs arrow parquet
PKGCONFIG_LIBS = "-lparquet -larrow  "

Found pkg-config cflags and libs!
PKG_CFLAGS=-DNDEBUG   -DARROW_R_WITH_ARROW
PKG_LIBS=-lparquet -larrow  
** libs
g++ -std=gnu++11 -I"/opt/ibm/conda/R/lib64/R/include" -DNDEBUG -DNDEBUG   
-DARROW_R_WITH_ARROW -I"/opt/ibm/conda/R/lib64/R/library/Rcpp/include" 
-I/usr/local/include -fvisibility=hidden -fpic  -g -O2  -c array.cpp -o array.o
In file included from 
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/macros/macros.h:134:0,
 from 
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/r/headers.h:69,
 from 
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/RcppCommon.h:29,
 from ./arrow_types.h:24,
 from array.cpp:18:
./arrow_types.h:188:26: error: 'Type' is not a member of 'arrow::ipc::Message'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
  ^
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/macros/module.h:76:106: 
note: in definition of macro 'RCPP_EXPOSED_ENUM_AS'
 #define RCPP_EXPOSED_ENUM_AS(CLASS)   namespace Rcpp{ namespace traits{ 
template<> struct r_type_traits< CLASS >{ typedef r_type_enum_tag r_category ; 
} ; }}

  ^
./arrow_types.h:188:1: note: in expansion of macro 'RCPP_EXPOSED_ENUM_NODECL'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
 ^
./arrow_types.h:188:26: error: 'Type' is not a member of 'arrow::ipc::Message'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
  ^
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/macros/module.h:76:106: 
note: in definition of macro 'RCPP_EXPOSED_ENUM_AS'
 #define RCPP_EXPOSED_ENUM_AS(CLASS)   namespace Rcpp{ namespace traits{ 
template<> struct r_type_traits< CLASS >{ typedef r_type_enum_tag r_category ; 
} ; }}

  ^
./arrow_types.h:188:1: note: in expansion of macro 'RCPP_EXPOSED_ENUM_NODECL'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
 ^
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/macros/module.h:76:112:
 error: template argument 1 is invalid
 #define RCPP_EXPOSED_ENUM_AS(CLASS)   namespace Rcpp{ namespace traits{ 
template<> struct r_type_traits< CLASS >{ typedef r_type_enum_tag r_category ; 
} ; }}

^
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/macros/module.h:80:3: note: 
in expansion of macro 'RCPP_EXPOSED_ENUM_AS'
   RCPP_EXPOSED_ENUM_AS(CLASS)  \
   ^
./arrow_types.h:188:1: note: in expansion of macro 'RCPP_EXPOSED_ENUM_NODECL'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
 ^
./arrow_types.h:188:26: error: 'Type' is not a member of 'arrow::ipc::Message'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
  ^
/opt/ibm/conda/R/lib64/R/library/Rcpp/include/Rcpp/macros/module.h:77:109: 
note: in definition of macro 'RCPP_EXPOSED_ENUM_WRAP'
 #define RCPP_EXPOSED_ENUM_WRAP(CLASS) namespace Rcpp{ namespace traits{ 
template<> struct wrap_type_traits< CLASS >{typedef wrap_type_enum_tag 
wrap_category ; } ; }}

 ^
./arrow_types.h:188:1: note: in expansion of macro 'RCPP_EXPOSED_ENUM_NODECL'
 RCPP_EXPOSED_ENUM_NODECL(arrow::ipc::Message::Type)
 ^

[jira] [Commented] (ARROW-9820) [C++] Plugin Architecture for Filesystem and File IO

2020-08-25 Thread Lawrence Chan (Jira)



[ 
https://issues.apache.org/jira/browse/ARROW-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184682#comment-17184682
 ] 

Lawrence Chan commented on ARROW-9820:
--

I agree lifetimes with C-based plugins require some care to get correct, but I 
think it is something we can design to be relatively safe for the end user.  I 
have some work in progress that I can push up to a PR draft and it may be 
easier to discuss with some code in hand.  The general gist of it is that 
anything allocated by the plugin will be immediately wrapped in safer C++ 
owning objects that will handle destruction.  There will also be ABI versioning 
so that we have an upgrade path for future backwards-incompatible changes that 
are safe from dangerous ABI mismatches.  I think some of this will be more 
clear once I get that PR pushed up.
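
As a rough illustration of the two ideas above (wrapping anything the plugin allocates in an owning object, and checking an ABI version before use), here is a minimal sketch. It is written in Python only for brevity; the actual design would be C and C++. Every name in it (`PluginTable`, `OwnedHandle`, `HOST_ABI_MAJOR`, `load_plugin`) is hypothetical and not part of Arrow.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical ABI major version the host was built against.
HOST_ABI_MAJOR = 1


@dataclass(frozen=True)
class PluginTable:
    """Stand-in for a C struct of function pointers exported by a plugin."""
    abi_major: int
    open_file: Callable[[str], int]    # returns an opaque handle
    close_file: Callable[[int], None]  # releases that handle


class OwnedHandle:
    """Owns one plugin-allocated handle and releases it exactly once."""

    def __init__(self, table: PluginTable, path: str):
        self._table = table
        self._handle = table.open_file(path)
        self._closed = False

    def close(self) -> None:
        if not self._closed:
            self._table.close_file(self._handle)
            self._closed = True

    def __enter__(self) -> "OwnedHandle":
        return self

    def __exit__(self, *exc) -> None:
        self.close()


def load_plugin(table: PluginTable) -> PluginTable:
    """Reject plugins built against an incompatible ABI major version."""
    if table.abi_major != HOST_ABI_MAJOR:
        raise RuntimeError(
            f"plugin ABI {table.abi_major} != host ABI {HOST_ABI_MAJOR}")
    return table


# Toy in-memory "plugin" used only to exercise the wrapper.
released = []
toy = PluginTable(abi_major=1,
                  open_file=lambda path: len(path),
                  close_file=released.append)

with OwnedHandle(load_plugin(toy), "demo://file") as h:
    pass
h.close()  # a second close is a no-op
```

The point of the wrapper is that callers never see the raw handle, so they cannot forget to release it or release it twice.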

For context about our use case: we have an in-house data storage system that 
can read/write files via a userspace library, and it has a fair amount of 
overlap with arrow::fs stuff in spirit.  I wrote OutputStream + 
RandomAccessFile subclasses and got the I/O working fine, but once I started 
looking at the pyarrow bindings and the dataset stuff I realized the other 
required changes would need to be hardcoded in a way that will be very 
difficult for me to maintain down the road, so I started thinking about 
pluggable storage drivers.

> [C++] Plugin Architecture for Filesystem and File IO
> 
>
> Key: ARROW-9820
> URL: https://issues.apache.org/jira/browse/ARROW-9820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Lawrence Chan
>Priority: Minor
>
> Adding a new custom filesystem with corresponding file i/o streams is quite a 
> process at the moment.  Looks like HDFS and S3FS are basically hardcoded in 
> many places.  It would be useful to develop a plugin system to allow users to 
> interface with other data stores without maintaining a permanent fork with 
> hardcoded changes.
> We can either do runtime plugins or compile-time plugins.  Runtime is more 
> user-friendly, but with C++, ABI compatibility is fairly delicate.  So we 
> would either want to use a C ABI or accept a you're-on-your-own situation 
> where the user is expected to be very careful with versioning and compiler 
> flags.
> With compile-time plugins, maybe there's a way to have the cmake machinery 
> build third party code and also register those new URI schemes automatically.




[jira] [Resolved] (ARROW-9855) [R] Fix bad merge/Rcpp conflict



 [ 
https://issues.apache.org/jira/browse/ARROW-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-9855.

Resolution: Fixed

Issue resolved by pull request 8053
[https://github.com/apache/arrow/pull/8053]

> [R] Fix bad merge/Rcpp conflict
> ---
>
> Key: ARROW-9855
> URL: https://issues.apache.org/jira/browse/ARROW-9855
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ARROW-8001 merged after the switch to cpp11 but was based on master before 
> it, so that brought some generated code that still referenced Rcpp




[jira] [Created] (ARROW-9856) [R] Add bindings for string compute functions

Neal Richardson created ARROW-9856:
--

 Summary: [R] Add bindings for string compute functions
 Key: ARROW-9856
 URL: https://issues.apache.org/jira/browse/ARROW-9856
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson


See https://arrow.apache.org/docs/cpp/compute.html#string-predicates and below. 
Since R's base string functions, as well as stringr/stringi, aren't generics 
that we can define methods for, this will probably make most sense within the 
context of a dplyr expression where we have more control over the evaluation.

This will require enabling utf8proc in the builds; there's already an 
rtools-package for it.




[jira] [Assigned] (ARROW-9855) [R] Fix bad merge/Rcpp conflict



 [ 
https://issues.apache.org/jira/browse/ARROW-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9855:


Assignee: Apache Arrow JIRA Bot  (was: Neal Richardson)

> [R] Fix bad merge/Rcpp conflict
> ---
>
> Key: ARROW-9855
> URL: https://issues.apache.org/jira/browse/ARROW-9855
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-8001 merged after the switch to cpp11 but was based on master before 
> it, so that brought some generated code that still referenced Rcpp




[jira] [Assigned] (ARROW-9855) [R] Fix bad merge/Rcpp conflict



 [ 
https://issues.apache.org/jira/browse/ARROW-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9855:


Assignee: Neal Richardson  (was: Apache Arrow JIRA Bot)

> [R] Fix bad merge/Rcpp conflict
> ---
>
> Key: ARROW-9855
> URL: https://issues.apache.org/jira/browse/ARROW-9855
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-8001 merged after the switch to cpp11 but was based on master before 
> it, so that brought some generated code that still referenced Rcpp




[jira] [Updated] (ARROW-9855) [R] Fix bad merge/Rcpp conflict



 [ 
https://issues.apache.org/jira/browse/ARROW-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9855:
--
Labels: pull-request-available  (was: )

> [R] Fix bad merge/Rcpp conflict
> ---
>
> Key: ARROW-9855
> URL: https://issues.apache.org/jira/browse/ARROW-9855
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-8001 merged after the switch to cpp11 but was based on master before 
> it, so that brought some generated code that still referenced Rcpp




[jira] [Created] (ARROW-9855) [R] Fix bad merge/Rcpp conflict

Neal Richardson created ARROW-9855:
--

 Summary: [R] Fix bad merge/Rcpp conflict
 Key: ARROW-9855
 URL: https://issues.apache.org/jira/browse/ARROW-9855
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 2.0.0


ARROW-8001 merged after the switch to cpp11 but was based on master before it, 
so that brought some generated code that still referenced Rcpp




[jira] [Created] (ARROW-9854) [R] Support reading/writing data to/from S3

Neal Richardson created ARROW-9854:
--

 Summary: [R] Support reading/writing data to/from S3
 Key: ARROW-9854
 URL: https://issues.apache.org/jira/browse/ARROW-9854
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 2.0.0


Current S3 support is limited to (1) being able to instantiate an S3FileSystem 
object, primarily from a URI, and (2) ability to open_dataset from an S3 URI. 
Before widely declaring that we support S3 in R, we should be able to:

* download dataset (i.e. copy files/directory recursively)
* read_parquet/feather/etc. from S3 (use FileSystem->OpenInputFile(path))
* write_$FORMAT via FileSystem->OpenOutputStream(path)
* write_dataset
* for linux, an argument to install_arrow to help, assuming you've installed 
aws-sdk-cpp already (turn on ARROW_S3, AWSSDK_SOURCE=SYSTEM)
* testing with minio on CI
* set up a real test bucket and user for e2e testing
* update docs and vignettes




[jira] [Updated] (ARROW-3757) [R] R bindings for Flight RPC client



 [ 
https://issues.apache.org/jira/browse/ARROW-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-3757:
---
Fix Version/s: 2.0.0

> [R] R bindings for Flight RPC client
> 
>
> Key: ARROW-3757
> URL: https://issues.apache.org/jira/browse/ARROW-3757
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: FlightRPC, R
>Reporter: Wes McKinney
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>





[jira] [Assigned] (ARROW-9761) [C++] Add experimental pull-based iterator structures to C interface implementation



 [ 
https://issues.apache.org/jira/browse/ARROW-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9761:


Assignee: Antoine Pitrou  (was: Apache Arrow JIRA Bot)

> [C++] Add experimental pull-based iterator structures to C interface 
> implementation
> ---
>
> Key: ARROW-9761
> URL: https://issues.apache.org/jira/browse/ARROW-9761
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The purpose of this would be to validate some initial use cases / workflows 
> prior to potentially formalizing the interface in the C ABI




[jira] [Assigned] (ARROW-9761) [C++] Add experimental pull-based iterator structures to C interface implementation



 [ 
https://issues.apache.org/jira/browse/ARROW-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9761:


Assignee: Apache Arrow JIRA Bot  (was: Antoine Pitrou)

> [C++] Add experimental pull-based iterator structures to C interface 
> implementation
> ---
>
> Key: ARROW-9761
> URL: https://issues.apache.org/jira/browse/ARROW-9761
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The purpose of this would be to validate some initial use cases / workflows 
> prior to potentially formalizing the interface in the C ABI




[jira] [Updated] (ARROW-9761) [C++] Add experimental pull-based iterator structures to C interface implementation



 [ 
https://issues.apache.org/jira/browse/ARROW-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9761:
--
Labels: pull-request-available  (was: )

> [C++] Add experimental pull-based iterator structures to C interface 
> implementation
> ---
>
> Key: ARROW-9761
> URL: https://issues.apache.org/jira/browse/ARROW-9761
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The purpose of this would be to validate some initial use cases / workflows 
> prior to potentially formalizing the interface in the C ABI




[jira] [Updated] (ARROW-9853) [RUST] Implement "take" kernel for dictionary arrays



 [ 
https://issues.apache.org/jira/browse/ARROW-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9853:
--
Labels: pull-request-available  (was: )

> [RUST] Implement "take" kernel for dictionary arrays
> 
>
> Key: ARROW-9853
> URL: https://issues.apache.org/jira/browse/ARROW-9853
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>





[jira] [Resolved] (ARROW-8001) [R][Dataset] Bindings for dataset writing



 [ 
https://issues.apache.org/jira/browse/ARROW-8001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-8001.

Resolution: Fixed

Issue resolved by pull request 8041
[https://github.com/apache/arrow/pull/8041]

> [R][Dataset] Bindings for dataset writing
> -
>
> Key: ARROW-8001
> URL: https://issues.apache.org/jira/browse/ARROW-8001
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This was started in ARROW-8002 but there's more to implement and test




[jira] [Created] (ARROW-9853) [RUST] Implement "take" kernel for dictionary arrays

2020-08-25 Thread Jira

Jörn Horstmann created ARROW-9853:
-

 Summary: [RUST] Implement "take" kernel for dictionary arrays
 Key: ARROW-9853
 URL: https://issues.apache.org/jira/browse/ARROW-9853
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Affects Versions: 1.0.0
Reporter: Jörn Horstmann







[jira] [Assigned] (ARROW-9761) [C++] Add experimental pull-based iterator structures to C interface implementation



 [ 
https://issues.apache.org/jira/browse/ARROW-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9761:
-

Assignee: Antoine Pitrou

> [C++] Add experimental pull-based iterator structures to C interface 
> implementation
> ---
>
> Key: ARROW-9761
> URL: https://issues.apache.org/jira/browse/ARROW-9761
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 2.0.0
>
>
> The purpose of this would be to validate some initial use cases / workflows 
> prior to potentially formalizing the interface in the C ABI




[jira] [Commented] (ARROW-8040) [Python][Packaging] Add Parquet encryption / OpenSSL to Python wheels

2020-08-25 Thread Itamar Turner-Trauring (Jira)



[ 
https://issues.apache.org/jira/browse/ARROW-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184135#comment-17184135
 ] 

Itamar Turner-Trauring commented on ARROW-8040:
---

I would like to get this working (doing this on behalf of a client). The 
packaging side seems relatively simple: just adding the right flags to the 
build scripts, and maybe making sure OpenSSL is compiled in statically.

However, it doesn't seem like there are Python bindings for the encryption? Or 
at least, it's not clear to me how to use Parquet encryption from Python... So 
does that need to be done separately? Or is there an example I can look at?

Thanks!

> [Python][Packaging] Add Parquet encryption / OpenSSL to Python wheels
> -
>
> Key: ARROW-8040
> URL: https://issues.apache.org/jira/browse/ARROW-8040
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>





[jira] [Commented] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Pandas and Parquet

2020-08-25 Thread Mayur Srivastava (Jira)



[ 
https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184108#comment-17184108
 ] 

Mayur Srivastava commented on ARROW-9812:
-

Thanks [~jorisvandenbossche]

When ARROW-1644 is done, we can start using it for non-pandas use cases.

 

 

> [Python] Map data types doesn't work from Arrow to Pandas and Parquet
> -
>
> Key: ARROW-9812
> URL: https://issues.apache.org/jira/browse/ARROW-9812
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mayur Srivastava
>Priority: Major
>
> Hi,
> I'm having problems using 'map' data type in Arrow/parquet/pandas.
> I'm able to convert a pandas data frame to Arrow with a map data type.
> But, Arrow to Pandas doesn't work.
> When I write Arrow to Parquet, it seems to work, but I'm not sure if the data 
> type is written correctly.
> When I read back Parquet to Arrow, it fails saying "reading list of structs" 
> is not supported. It seems that map is stored as list of structs.
> There are two problems here:
>  # Map data type doesn't work from Arrow -> Pandas.
>  # Map data type doesn't get written to or read from Arrow -> Parquet.
> Questions:
> 1. Am I doing something wrong? Is there a way to get these to work? 
> 2. If these are unsupported features, will this be fixed in a future version? 
> Do you have plans or an ETA?
> The following code example (followed by output) should demonstrate the issues:
> I'm using Arrow 1.0.0 and Pandas 1.0.5.
> Thanks!
> Mayur
> {code:java}
> $ cat arrowtest.py
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> import traceback as tb
> import io
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df1 = pd.DataFrame({'a': [[('b', '2')]]})
> print(f'df1')
> print(f'{df1}')
> print(f'Pandas -> Arrow')
> try:
>     t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a', 
>         pa.map_(pa.string(), pa.string()))]))
>     print('PASSED')
>     print(t1)
> except:
>     print(f'FAILED')
>     tb.print_exc()
> print(f'Arrow -> Pandas')
> try:
>     t1.to_pandas()
>     print('PASSED')
> except:
>     print(f'FAILED')
>     tb.print_exc()
> print(f'Arrow -> Parquet')
> fh = io.BytesIO()
> try:
>     pq.write_table(t1, fh)
>     print('PASSED')
> except:
>     print('FAILED')
>     tb.print_exc()
> 
> print(f'Parquet -> Arrow')
> try:
>     t2 = pq.read_table(source=fh)
>     print('PASSED')
>     print(t2)
> except:
>     print('FAILED')
>     tb.print_exc()
> {code}
> {code:java}
> $ python3.6 arrowtest.py
> PyArrow Version = 1.0.0 
> Pandas Version = 1.0.5 
> df1 
> a 0 [(b, 2)] 
>  
> Pandas -> Arrow 
> PASSED 
> pyarrow.Table 
> a: map
>  child 0, entries: struct not null
>  child 0, key: string not null
>  child 1, value: string 
>  
> Arrow -> Pandas 
> FAILED 
> Traceback (most recent call last):
> File "arrowtest.py", line 26, in  t1.to_pandas() 
> File "pyarrow/array.pxi", line 715, in 
> pyarrow.lib._PandasConvertible.to_pandas 
> File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas File 
> "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in 
> table_to_blockmanager blocks = _table_to_blocks(options, table, categories, 
> ext_columns_dtypes) 
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 
> 1115, in _table_to_blocks list(extension_columns.keys())) 
> File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks File 
> "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status 
> pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for 
> Arrow data of type map is known. 
>  
> Arrow -> Parquet 
> PASSED 
>  
> Parquet -> Arrow 
> FAILED 
> Traceback (most recent call last): File "arrowtest.py", line 43, in  
> t2 = pq.read_table(source=fh) 
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in 
> read_table use_pandas_metadata=use_pandas_metadata) 
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1474, in 
> read use_threads=use_threads 
> File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table 
> File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table 
> File "pyarrow/error.pxi", line 122, in 
> pyarrow.lib.pyarrow_internal_check_status 
> File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status 
> pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet 
> files not yet supported: key_value: list null, value: string> not null> not null
> {code}
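
The "lists of structs" wording in the second error matches how a map column is laid out: each row is a list of key/value entries (the `entries` struct with `key` and `value` children shown in the schema output above). The sketch below uses plain Python, with no pyarrow, to illustrate that layout as an offsets array plus flat key and value arrays. The helper names `encode_maps` and `decode_maps` are made up for this illustration and are not pyarrow APIs.

```python
from typing import Dict, List, Tuple


def encode_maps(rows: List[Dict[str, str]]) -> Tuple[List[int], List[str], List[str]]:
    """Flatten per-row maps into (offsets, keys, values)."""
    offsets, keys, values = [0], [], []
    for row in rows:
        for k, v in row.items():
            keys.append(k)
            values.append(v)
        # Row i owns the entries in keys/values[offsets[i]:offsets[i + 1]].
        offsets.append(len(keys))
    return offsets, keys, values


def decode_maps(offsets, keys, values):
    """Rebuild each row as a list of (key, value) entries."""
    return [list(zip(keys[a:b], values[a:b]))
            for a, b in zip(offsets, offsets[1:])]


offsets, keys, values = encode_maps([{"b": "2"}, {}, {"x": "1", "y": "0"}])
rows = decode_maps(offsets, keys, values)
```

Seen this way, a reader that cannot yet handle a list of structs cannot handle a map either, which is consistent with the Parquet read failure reported here.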




[jira] [Assigned] (ARROW-9852) [C++] Fix crash on invalid IPC input (OSS-Fuzz)



 [ 
https://issues.apache.org/jira/browse/ARROW-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9852:


Assignee: Apache Arrow JIRA Bot  (was: Antoine Pitrou)

> [C++] Fix crash on invalid IPC input (OSS-Fuzz)
> ---
>
> Key: ARROW-9852
> URL: https://issues.apache.org/jira/browse/ARROW-9852
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>





[jira] [Assigned] (ARROW-9852) [C++] Fix crash on invalid IPC input (OSS-Fuzz)



 [ 
https://issues.apache.org/jira/browse/ARROW-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9852:


Assignee: Antoine Pitrou  (was: Apache Arrow JIRA Bot)

> [C++] Fix crash on invalid IPC input (OSS-Fuzz)
> ---
>
> Key: ARROW-9852
> URL: https://issues.apache.org/jira/browse/ARROW-9852
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>





[jira] [Updated] (ARROW-9852) [C++] Fix crash on invalid IPC input (OSS-Fuzz)



 [ 
https://issues.apache.org/jira/browse/ARROW-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9852:
--
Labels: pull-request-available  (was: )

> [C++] Fix crash on invalid IPC input (OSS-Fuzz)
> ---
>
> Key: ARROW-9852
> URL: https://issues.apache.org/jira/browse/ARROW-9852
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>





[jira] [Created] (ARROW-9852) [C++] Fix crash on invalid IPC input (OSS-Fuzz)

Antoine Pitrou created ARROW-9852:
-

 Summary: [C++] Fix crash on invalid IPC input (OSS-Fuzz)
 Key: ARROW-9852
 URL: https://issues.apache.org/jira/browse/ARROW-9852
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou







[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays

2020-08-25 Thread Andrew Lamb (Jira)



[ 
https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184086#comment-17184086
 ] 

Andrew Lamb commented on ARROW-9275:


In general, I think the notion of implementing async Parquet and Arrow APIs 
that don't rely on tokio or other executors is a good idea. 

I think in order to make the crate as widely useful as possible, it should also 
retain a synchronous API for use with the rust standard library.

One pattern I have seen is using an `async` crate option that adds the 
appropriate async options (and possibly additional dependencies). For example, 
https://docs.rs/bzip2/0.4.1/bzip2/#async-io



> [Rust] – Async Sans IO: R/W into/to Arrow Arrays
> 
>
> Key: ARROW-9275
> URL: https://issues.apache.org/jira/browse/ARROW-9275
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Mahmut Bulut
>Assignee: Mahmut Bulut
>Priority: Major
>
> This issue can be considered epic-level, since it spans other Arrow 
> projects.
> *Drill down*
> Currently, traits like `ParquetReader` only offer a synchronous interface, 
> which uses a BufReader with a constant 8KB buffer. Over the network, this 
> becomes a problem, which can be solved easily with differential buffers. In 
> addition to this shortcoming, an executor engine is needed to sit between 
> async trait methods and sync trait methods and schedule the calls, so that 
> requests to external IO become asynchronous. On-disk IO is acceptable with 
> the approach we currently have, since no reliable evented IO exists for 
> on-disk IO on major platforms.
> All things considered, abstractions that expose asynchronous IO without 
> depending on any particular executor need to be provided.
>  
> *Design Suggestions & Considerations*
> The design should apply and consider:
>  * Sans IO, (for more information about Sans approach please see 
> [https://sans-io.readthedocs.io/] ) 
>  * Not including any executor specific data, at all.
>  * Tests should work with any executor with little to no modification.
>  * Buffers are adjusted accordingly and use differential buffers to optimize 
> network trips.
>  * Sync IO shouldn't be touched. At all costs. If we try to unify Sync IO 
> traits or we do overlapping implementation, that will make our life harder in 
> the future. Sans IO should be compartmentalized.
>  
> *Notes*
> If the Sans IO approach is not taken, the project will:
>  * use an extreme number of dependencies.
>  * be incompatible with other Rust code.
>  * break currently working code that uses array ingestion.
>  * make integration tests harder.
>  * be really hard to adapt (in user projects) once completion-based APIs 
> stabilize in the future.
>  * Note: this suggestion is not about the Flight format or any Flight-related 
> information at the moment. It is purely about making on-disk and remote IO 
> (provider backends like AWS etc.) async.
>  
> *Open points*
> A couple of open points:
>  * Identifying traits that are going to be asyncized.
>  * Designing internal routines.
>  * package name to expose.
>  * Gather traits into the designated packages in all file formats.
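
To make the sans-IO idea in this issue concrete, here is a minimal sketch. It is in Python rather than Rust, and it is not a proposal for the actual crate API. A hypothetical `FrameDecoder` parses length-prefixed frames from whatever bytes it is handed and performs no IO itself, so the same parser can be driven by blocking code or by any async executor. The 4-byte big-endian length prefix is an assumption chosen for illustration and has nothing to do with the Parquet or Arrow formats.

```python
import asyncio
import struct
from typing import List


class FrameDecoder:
    """Sans-IO parser: accepts bytes, returns complete frames, does no IO."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def receive_data(self, data: bytes) -> List[bytes]:
        self._buf += data
        frames = []
        while len(self._buf) >= 4:
            (size,) = struct.unpack_from(">I", self._buf, 0)
            if len(self._buf) < 4 + size:
                break  # incomplete frame: wait for more bytes
            frames.append(bytes(self._buf[4:4 + size]))
            del self._buf[:4 + size]
        return frames


def encode(payload: bytes) -> bytes:
    """Prefix a payload with its length (illustrative wire format)."""
    return struct.pack(">I", len(payload)) + payload


wire = encode(b"arrow") + encode(b"parquet")

# Synchronous driver: feed arbitrary chunks, e.g. from a blocking read loop.
sync_dec = FrameDecoder()
sync_frames = []
for i in range(0, len(wire), 3):
    sync_frames += sync_dec.receive_data(wire[i:i + 3])


# Asynchronous driver: the same parser fed from an asyncio stream.
async def read_frames(reader: asyncio.StreamReader) -> List[bytes]:
    dec, out = FrameDecoder(), []
    while chunk := await reader.read(5):
        out += dec.receive_data(chunk)
    return out


async def main() -> List[bytes]:
    reader = asyncio.StreamReader()
    reader.feed_data(wire)
    reader.feed_eof()
    return await read_frames(reader)


async_frames = asyncio.run(main())
```

Because the parser never touches a socket or a file, the sync path stays untouched and tests can run against it with any executor, or with none.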




[jira] [Updated] (ARROW-9851) [C++] Valgrind errors due to unrecognized instructions



 [ 
https://issues.apache.org/jira/browse/ARROW-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9851:
--
Labels: pull-request-available  (was: )

> [C++] Valgrind errors due to unrecognized instructions
> --
>
> Key: ARROW-9851
> URL: https://issues.apache.org/jira/browse/ARROW-9851
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Valgrind seems to barf on AVX512 instructions:
> https://github.com/ursa-labs/crossbow/runs/1025065792




[jira] [Commented] (ARROW-9851) [C++] Valgrind errors due to unrecognized instructions



[ 
https://issues.apache.org/jira/browse/ARROW-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184010#comment-17184010
 ] 

Antoine Pitrou commented on ARROW-9851:
---

AVX512 support is still not merged in Valgrind mainline: 
https://bugs.kde.org/show_bug.cgi?id=383010

> [C++] Valgrind errors due to unrecognized instructions
> --
>
> Key: ARROW-9851
> URL: https://issues.apache.org/jira/browse/ARROW-9851
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> Valgrind seems to barf on AVX512 instructions:
> https://github.com/ursa-labs/crossbow/runs/1025065792




[jira] [Created] (ARROW-9851) [C++] Valgrind errors due to unrecognized instructions

Antoine Pitrou created ARROW-9851:
-

 Summary: [C++] Valgrind errors due to unrecognized instructions
 Key: ARROW-9851
 URL: https://issues.apache.org/jira/browse/ARROW-9851
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


Valgrind seems to barf on AVX512 instructions:

https://github.com/ursa-labs/crossbow/runs/1025065792




[jira] [Updated] (ARROW-9781) [C++] Fix uninitialized value warnings



 [ 
https://issues.apache.org/jira/browse/ARROW-9781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9781:
--
Description: The nightly valgrind build shows warnings due to uninitialized 
values: [https://github.com/ursa-labs/crossbow/runs/996955686]  (was: The 
nightly valgrind build has failures due to uninitialized values: 
https://github.com/ursa-labs/crossbow/runs/996955686)

> [C++] Fix uninitialized value warnings
> --
>
> Key: ARROW-9781
> URL: https://issues.apache.org/jira/browse/ARROW-9781
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The nightly valgrind build shows warnings due to uninitialized values: 
> [https://github.com/ursa-labs/crossbow/runs/996955686]




[jira] [Updated] (ARROW-9781) [C++] Fix uninitialized value warnings



 [ 
https://issues.apache.org/jira/browse/ARROW-9781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9781:
--
Summary: [C++] Fix uninitialized value warnings  (was: [C++] Fix valgrind 
uninitialized value warnings)

> [C++] Fix uninitialized value warnings
> --
>
> Key: ARROW-9781
> URL: https://issues.apache.org/jira/browse/ARROW-9781
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The nightly valgrind build has failures due to uninitialized values: 
> https://github.com/ursa-labs/crossbow/runs/996955686




[jira] [Assigned] (ARROW-9813) [C++] Disable semantic interposition



 [ 
https://issues.apache.org/jira/browse/ARROW-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9813:


Assignee: Antoine Pitrou  (was: Apache Arrow JIRA Bot)

> [C++] Disable semantic interposition
> 
>
> Key: ARROW-9813
> URL: https://issues.apache.org/jira/browse/ARROW-9813
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> On gcc, semantic interposition is enabled by default. It can be beneficial to 
> disable it when building Arrow libraries (and it's most certainly harmless 
> anyway).
> See 
> https://stackoverflow.com/questions/35745543/new-option-in-gcc-5-3-fno-semantic-interposition
>  for more background on this.




[jira] [Assigned] (ARROW-9813) [C++] Disable semantic interposition



 [ 
https://issues.apache.org/jira/browse/ARROW-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9813:


Assignee: Apache Arrow JIRA Bot  (was: Antoine Pitrou)

> [C++] Disable semantic interposition
> 
>
> Key: ARROW-9813
> URL: https://issues.apache.org/jira/browse/ARROW-9813
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Apache Arrow JIRA Bot
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> On gcc, semantic interposition is enabled by default. It can be beneficial to 
> disable it when building Arrow libraries (and it's most certainly harmless 
> anyway).
> See 
> https://stackoverflow.com/questions/35745543/new-option-in-gcc-5-3-fno-semantic-interposition
>  for more background on this.




[jira] [Updated] (ARROW-9813) [C++] Disable semantic interposition



 [ 
https://issues.apache.org/jira/browse/ARROW-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9813:
--
Labels: pull-request-available  (was: )

> [C++] Disable semantic interposition
> 
>
> Key: ARROW-9813
> URL: https://issues.apache.org/jira/browse/ARROW-9813
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> On gcc, semantic interposition is enabled by default. It can be beneficial to 
> disable it when building Arrow libraries (and it's most certainly harmless 
> anyway).
> See 
> https://stackoverflow.com/questions/35745543/new-option-in-gcc-5-3-fno-semantic-interposition
>  for more background on this.




[jira] [Assigned] (ARROW-9813) [C++] Disable semantic interposition



 [ 
https://issues.apache.org/jira/browse/ARROW-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9813:
-

Assignee: Antoine Pitrou

> [C++] Disable semantic interposition
> 
>
> Key: ARROW-9813
> URL: https://issues.apache.org/jira/browse/ARROW-9813
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
> Fix For: 2.0.0
>
>
> On gcc, semantic interposition is enabled by default. It can be beneficial to 
> disable it when building Arrow libraries (and it's most certainly harmless 
> anyway).
> See 
> https://stackoverflow.com/questions/35745543/new-option-in-gcc-5-3-fno-semantic-interposition
>  for more background on this.




[jira] [Resolved] (ARROW-9702) [C++] Move bpacking simd to runtime path



 [ 
https://issues.apache.org/jira/browse/ARROW-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9702.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7940
[https://github.com/apache/arrow/pull/7940]

> [C++] Move bpacking simd to runtime path
> 
>
> Key: ARROW-9702
> URL: https://issues.apache.org/jira/browse/ARROW-9702
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Currently the unpack32 function has AVX512 SIMD code that is selected 
> statically at compile time; it should be reworked to use a runtime dispatch 
> path. It can also be implemented with AVX2.
>  
> The unpack32 API is used by PlainDecodingBoolean.
>  
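
A minimal sketch of what "moving to a runtime path" can look like, assuming a 
function-pointer dispatch that is resolved once on first use. All names here 
(UnpackFn, UnpackScalar, UnpackAvx2, CpuHasAvx2, Unpack) are hypothetical and 
are not Arrow's bpacking API. The "AVX2" variant is only a stand-in that 
forwards to the scalar code, and the LSB-first bit order is an assumption made 
for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical signature: unpack num_values values of bit_width bits each.
using UnpackFn = int (*)(const uint8_t* in, uint32_t* out, int num_values,
                         int bit_width);

// Portable fallback. Assumes values are packed LSB-first within each byte.
int UnpackScalar(const uint8_t* in, uint32_t* out, int num_values,
                 int bit_width) {
  assert(bit_width >= 1 && bit_width <= 32);
  uint64_t bit_pos = 0;
  for (int i = 0; i < num_values; ++i) {
    uint32_t v = 0;
    for (int b = 0; b < bit_width; ++b, ++bit_pos) {
      const uint32_t bit = (in[bit_pos / 8] >> (bit_pos % 8)) & 1u;
      v |= bit << b;
    }
    out[i] = v;
  }
  return num_values;
}

// Stand-in for a SIMD variant. A real one would live in its own translation
// unit built with the matching compiler flags and use intrinsics; here it
// simply forwards to the scalar code so the sketch stays portable.
int UnpackAvx2(const uint8_t* in, uint32_t* out, int num_values,
               int bit_width) {
  return UnpackScalar(in, out, num_values, bit_width);
}

// CPU detection through a GCC/Clang builtin on x86; false everywhere else.
bool CpuHasAvx2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx2") != 0;
#else
  return false;
#endif
}

UnpackFn SelectUnpack() { return CpuHasAvx2() ? UnpackAvx2 : UnpackScalar; }

// Public entry point: the implementation is chosen at run time, not at
// compile time, so one binary works on CPUs with and without AVX2.
int Unpack(const uint8_t* in, uint32_t* out, int num_values, int bit_width) {
  static const UnpackFn impl = SelectUnpack();  // resolved once, on first call
  return impl(in, out, num_values, bit_width);
}
```

Callers only ever use Unpack(); adding another variant (for example AVX512) 
would then only touch SelectUnpack().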




[jira] [Updated] (ARROW-9844) [Go][CI] Add Travis CI job for Go on s390x



 [ 
https://issues.apache.org/jira/browse/ARROW-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9844:
--
Labels: pull-request-available  (was: )

> [Go][CI] Add Travis CI job for Go on s390x
> --
>
> Key: ARROW-9844
> URL: https://issues.apache.org/jira/browse/ARROW-9844
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Go
>Reporter: Vivian Kong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As suggested in [https://github.com/apache/arrow/pull/8011], add a Travis CI 
> job for Go on s390x.




[jira] [Comment Edited] (ARROW-9820) [C++] Plugin Architecture for Filesystem and File IO



[ 
https://issues.apache.org/jira/browse/ARROW-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183975#comment-17183975
 ] 

Antoine Pitrou edited comment on ARROW-9820 at 8/25/20, 12:21 PM:
--

Thanks for posting this. I agree it would be a good idea to allow adding custom 
filesystem implementations.

Some more comments:
 1) Arrow C++ is one specific library implementing the Arrow format. Other 
Arrow implementations don't necessarily provide the same facilities. That said, 
the ones that bind around Arrow C++ (e.g. PyArrow) generally expose the 
facilities that are in Arrow C++.
 2) If using C rather than C++, how would we handle lifetime and ownership 
issues? That sounds like a can of worms. Arrow C++ is using C++ for a reason... 
(if someone OTOH wants to write a C Arrow implementation, nobody will object :))
 3) runtime vs. compile-time: people shouldn't have to recompile Arrow C++ to 
add a new filesystem type. If that's what you mean by "runtime", then let's do 
that. OTOH, it doesn't have to be a "zero configuration" thing (i.e. it's ok to 
have to call a registration function).
 4) filesystem API stability: we can change the API assuming there are *good* 
reasons to change it. But that's orthogonal to this issue, and you should open 
separate JIRAs for that.

Given all this, perhaps you could tell us a bit more about what kind of plugin 
API you're expecting or able to work with.
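
To make point 3 more concrete, here is a minimal sketch of a runtime registry 
where a plugin calls a registration function to attach a factory to a URI 
scheme. Every name in it (FileSystem, FileSystemFactory, FileSystemRegistry, 
InMemoryFileSystem, MakeInMemory) is hypothetical; this is not Arrow's 
filesystem API, only an illustration of the "call a registration function" 
idea.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical minimal filesystem interface.
struct FileSystem {
  virtual ~FileSystem() = default;
  virtual std::string type_name() const = 0;
};

// A factory builds a filesystem from the full URI.
using FileSystemFactory =
    std::function<std::shared_ptr<FileSystem>(const std::string& uri)>;

class FileSystemRegistry {
 public:
  // Registers a factory for a scheme; returns false if the scheme is taken.
  bool Register(const std::string& scheme, FileSystemFactory factory) {
    assert(static_cast<bool>(factory));  // an empty factory is a caller bug
    return factories_.emplace(scheme, std::move(factory)).second;
  }

  // Returns nullptr when the URI has no "://" or the scheme is unknown.
  std::shared_ptr<FileSystem> FromUri(const std::string& uri) const {
    const auto pos = uri.find("://");
    if (pos == std::string::npos) return nullptr;
    const auto it = factories_.find(uri.substr(0, pos));
    if (it == factories_.end()) return nullptr;
    return it->second(uri);
  }

 private:
  std::map<std::string, FileSystemFactory> factories_;
};

// Example of what a third-party plugin could provide.
struct InMemoryFileSystem : FileSystem {
  std::string type_name() const override { return "inmemory"; }
};

std::shared_ptr<FileSystem> MakeInMemory(const std::string& /*uri*/) {
  return std::make_shared<InMemoryFileSystem>();
}
```

A plugin would call something like registry.Register("mem", MakeInMemory) at 
load time, after which registry.FromUri("mem://...") resolves to it without 
recompiling the library. Ownership is carried by std::shared_ptr here, which 
is exactly the part that gets harder across a plain C boundary.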


was (Author: pitrou):
Thanks for posting this. I agree it would be a good idea to allow adding custom 
filesystem implementations.

Some more comments:
1) Arrow C++ is one specific library implementing the Arrow format. Other Arrow 
implementations don't necessarily provide. That said, the ones that bind around 
Arrow C++ (e.g. PyArrow) generally expose the facilities that in Arrow C++.
2) If using C rather than C++ , how would we handle lifetime and ownership 
issues? That sounds like a can of worms. Arrow C++ is using C++ for a reason... 
(if someone OTOH wants to write a C Arrow implementation, nobody will object 
:-))
3) runtime vs. compile-time: people shouldn't have to recompile Arrow C++ to 
add a new filesystem type. If that's what you mean by "runtime", then let's do 
that. OTOH, it doesn't have to be a "zero configuration" thing (i.e. it's ok to 
have to call a registration function).
4) filesystem API stability: we can change the API assuming there are *good* 
reasons to change it. But that's orthogonal to this issue, and you should open 
separate JIRAs for that.

Given all this, perhaps you could tell us a bit more about what kind of plugin 
API you're expecting or able to work with.

> [C++] Plugin Architecture for Filesystem and File IO
> 
>
> Key: ARROW-9820
> URL: https://issues.apache.org/jira/browse/ARROW-9820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Lawrence Chan
>Priority: Minor
>
> Adding a new custom filesystem with corresponding file i/o streams is quite a 
> process at the moment.  Looks like HDFS and S3FS are basically hardcoded in 
> many places.  It would be useful to develop a plugin system to allow users to 
> interface with other data stores without maintaining a permanent fork with 
> hardcoded changes.
> We can either do runtime plugins or compile-time plugins.  Runtime is more 
> user-friendly, but with C++, ABI compatibility is fairly delicate.  So we 
> would either want to use a C ABI or accept a you're-on-your-own situation 
> where the user is expected to be very careful with versioning and compiler 
> flags.
> With compile-time plugins, maybe there's a way to have the cmake machinery 
> build third party code and also register those new URI schemes automatically.




[jira] [Commented] (ARROW-9820) [C++] Plugin Architecture for Filesystem and File IO



[ 
https://issues.apache.org/jira/browse/ARROW-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183975#comment-17183975
 ] 

Antoine Pitrou commented on ARROW-9820:
---

Thanks for posting this. I agree it would be a good idea to allow adding custom 
filesystem implementations.

Some more comments:
1) Arrow C++ is one specific library implementing the Arrow format. Other Arrow 
implementations don't necessarily provide the same facilities. That said, the 
ones that bind around Arrow C++ (e.g. PyArrow) generally expose the facilities 
that are in Arrow C++.
2) If using C rather than C++, how would we handle lifetime and ownership 
issues? That sounds like a can of worms. Arrow C++ is using C++ for a reason... 
(if someone OTOH wants to write a C Arrow implementation, nobody will object 
:-))
3) runtime vs. compile-time: people shouldn't have to recompile Arrow C++ to 
add a new filesystem type. If that's what you mean by "runtime", then let's do 
that. OTOH, it doesn't have to be a "zero configuration" thing (i.e. it's ok to 
have to call a registration function).
4) filesystem API stability: we can change the API assuming there are *good* 
reasons to change it. But that's orthogonal to this issue, and you should open 
separate JIRAs for that.

Given all this, perhaps you could tell us a bit more about what kind of plugin 
API you're expecting or able to work with.

> [C++] Plugin Architecture for Filesystem and File IO
> 
>
> Key: ARROW-9820
> URL: https://issues.apache.org/jira/browse/ARROW-9820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Lawrence Chan
>Priority: Minor
>
> Adding a new custom filesystem with corresponding file i/o streams is quite a 
> process at the moment.  Looks like HDFS and S3FS are basically hardcoded in 
> many places.  It would be useful to develop a plugin system to allow users to 
> interface with other data stores without maintaining a permanent fork with 
> hardcoded changes.
> We can either do runtime plugins or compile-time plugins.  Runtime is more 
> user-friendly, but with C++, ABI compatibility is fairly delicate.  So we 
> would either want to use a C ABI or accept a you're-on-your-own situation 
> where the user is expected to be very careful with versioning and compiler 
> flags.
> With compile-time plugins, maybe there's a way to have the cmake machinery 
> build third party code and also register those new URI schemes automatically.




[jira] [Comment Edited] (ARROW-9820) [C++] Plugin Architecture for Filesystem and File IO



[ 
https://issues.apache.org/jira/browse/ARROW-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183975#comment-17183975
 ] 

Antoine Pitrou edited comment on ARROW-9820 at 8/25/20, 12:20 PM:
--

Thanks for posting this. I agree it would be a good idea to allow adding custom 
filesystem implementations.

Some more comments:
1) Arrow C++ is one specific library implementing the Arrow format. Other Arrow 
implementations don't necessarily provide the same facilities. That said, the 
ones that bind around Arrow C++ (e.g. PyArrow) generally expose the facilities 
that are in Arrow C++.
2) If using C rather than C++ , how would we handle lifetime and ownership 
issues? That sounds like a can of worms. Arrow C++ is using C++ for a reason... 
(if someone OTOH wants to write a C Arrow implementation, nobody will object 
:-))
3) runtime vs. compile-time: people shouldn't have to recompile Arrow C++ to 
add a new filesystem type. If that's what you mean by "runtime", then let's do 
that. OTOH, it doesn't have to be a "zero configuration" thing (i.e. it's ok to 
have to call a registration function).
4) filesystem API stability: we can change the API assuming there are *good* 
reasons to change it. But that's orthogonal to this issue, and you should open 
separate JIRAs for that.

Given all this, perhaps you could tell us a bit more about what kind of plugin 
API you're expecting or able to work with.


was (Author: pitrou):
Thanks for posting this. I agree it would be a good idea to allow adding custom 
filesystem implementations.

Some more comments:
1) Arrow C++ is one specific library implementing the Arrow format. Other Arrow 
implementations don't necessarily provide. That said, the ones that bind around 
Arrow C++ (e.g. PyArrow) generally expose the facilities that in Arrow C++.
2) If using C rather than C++, how would we handle lifetime and ownership 
issues? That sounds like a can of worms. Arrow C++ is using C++ for a reason... 
(if someone OTOH wants to write a C Arrow implementation, nobody will object 
:-))
3) runtime vs. compile-time: people shouldn't have to recompile Arrow C++ to 
add a new filesystem type. If that's what you mean by "runtime", then let's do 
that. OTOH, it doesn't have to be a "zero configuration" thing (i.e. it's ok to 
have to call a registration function).
4) filesystem API stability: we can change the API assuming there are *good* 
reasons to change it. But that's orthogonal to this issue, and you should open 
separate JIRAs for that.

Given all this, perhaps you could tell us a bit more about what kind of plugin 
API you're expecting or able to work with.

> [C++] Plugin Architecture for Filesystem and File IO
> 
>
> Key: ARROW-9820
> URL: https://issues.apache.org/jira/browse/ARROW-9820
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Lawrence Chan
>Priority: Minor
>
> Adding a new custom filesystem with corresponding file i/o streams is quite a 
> process at the moment.  Looks like HDFS and S3FS are basically hardcoded in 
> many places.  It would be useful to develop a plugin system to allow users to 
> interface with other data stores without maintaining a permanent fork with 
> hardcoded changes.
> We can either do runtime plugins or compile-time plugins.  Runtime is more 
> user-friendly, but with C++, ABI compatibility is fairly delicate.  So we 
> would either want to use a C ABI or accept a you're-on-your-own situation 
> where the user is expected to be very careful with versioning and compiler 
> flags.
> With compile-time plugins, maybe there's a way to have the cmake machinery 
> build third party code and also register those new URI schemes automatically.




[jira] [Assigned] (ARROW-9849) [Rust] [DataFusion] Make UDFs not need a Field



 [ 
https://issues.apache.org/jira/browse/ARROW-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9849:


Assignee: Jorge  (was: Apache Arrow JIRA Bot)

> [Rust] [DataFusion] Make UDFs not need a Field
> --
>
> Key: ARROW-9849
> URL: https://issues.apache.org/jira/browse/ARROW-9849
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/arrow/pull/7967] shows that it is possible to not 
> require users to pass a `Field` to UDF declarations and instead just pass a 
> `DataType`.
> Let's deprecate Field from them, and instead just use `DataType`.




[jira] [Assigned] (ARROW-9849) [Rust] [DataFusion] Make UDFs not need a Field



 [ 
https://issues.apache.org/jira/browse/ARROW-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9849:


Assignee: Apache Arrow JIRA Bot  (was: Jorge)

> [Rust] [DataFusion] Make UDFs not need a Field
> --
>
> Key: ARROW-9849
> URL: https://issues.apache.org/jira/browse/ARROW-9849
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Apache Arrow JIRA Bot
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/arrow/pull/7967] shows that it is possible to not 
> require users to pass a `Field` to UDF declarations and instead just pass a 
> `DataType`.
> Let's deprecate Field from them, and instead just use `DataType`.




[jira] [Updated] (ARROW-9849) [Rust] [DataFusion] Make UDFs not need a Field



 [ 
https://issues.apache.org/jira/browse/ARROW-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9849:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Make UDFs not need a Field
> --
>
> Key: ARROW-9849
> URL: https://issues.apache.org/jira/browse/ARROW-9849
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/arrow/pull/7967] shows that it is possible to not 
> require users to pass a `Field` to UDF declarations and instead just pass a 
> `DataType`.
> Let's deprecate Field from them, and instead just use `DataType`.




[jira] [Resolved] (ARROW-9699) [C++][Compute] Improve mode kernel performance for small integer types



 [ 
https://issues.apache.org/jira/browse/ARROW-9699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9699.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7963
[https://github.com/apache/arrow/pull/7963]

> [C++][Compute] Improve mode kernel performance for small integer types
> --
>
> Key: ARROW-9699
> URL: https://issues.apache.org/jira/browse/ARROW-9699
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The mode kernel uses a hash table to count distinct values. For small integer 
> types (bool, int8, uint8), counting directly with a value-indexed array can 
> be more efficient. This card is to evaluate the approach and upstream a patch 
> if workable.
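
The value-indexed counting idea from the description can be sketched as 
follows for uint8 input. This is an illustration only, not the kernel from the 
pull request: the function name ModeUInt8 is hypothetical, and breaking ties 
in favour of the smallest value is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: returns {mode, occurrence count} for uint8 input.
// Ties are broken in favour of the smallest value (an assumption).
std::pair<uint8_t, std::size_t> ModeUInt8(const std::vector<uint8_t>& values) {
  assert(!values.empty());  // the mode of an empty input is left undefined
  std::array<std::size_t, 256> counts{};  // one slot per possible uint8 value
  for (uint8_t v : values) {
    ++counts[v];  // direct indexing replaces the hash table lookup
  }
  std::size_t best = 0;
  for (std::size_t i = 1; i < counts.size(); ++i) {
    // Strict '>' keeps the smallest value when counts are equal.
    if (counts[i] > counts[best]) best = i;
  }
  return {static_cast<uint8_t>(best), counts[best]};
}
```

The table has a fixed 256 entries regardless of input length, so the counting 
loop does no hashing and no allocation. The same layout would work for int8 
by offsetting the index, and for bool with a two-entry table.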



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (ARROW-9850) [Go] Defer should not be used in the loop