[jira] [Resolved] (ARROW-6807) [Java][FlightRPC] Expose gRPC service

2019-10-09 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-6807.
--
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5597
[https://github.com/apache/arrow/pull/5597]

> [Java][FlightRPC] Expose gRPC service 
> --
>
> Key: ARROW-6807
> URL: https://issues.apache.org/jira/browse/ARROW-6807
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: FlightRPC, Java
>Reporter: Rohit Gupta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> Have a utility class that exposes the flight service & client so that 
> multiple services can be plugged into the same endpoint. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4815) [Rust] [DataFusion] Add support for * in SQL projection

2019-10-09 Thread Davis Silverman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948211#comment-16948211
 ] 

Davis Silverman commented on ARROW-4815:


Are you working on this now? I have made a bit of progress on it: I made the 
types and have been fixing up errors. You say it may be a bit of work, so maybe 
it's harder than it seems, but I think I have been doing pretty well so far.

 

Here is my branch so far: 
[https://github.com/sinistersnare/arrow/commit/ba465f88b9ef9310aec15723756ddfa95a226119]

 

If you think there is a better issue for me, let me know; I would be happy 
to help. I just looked at the '1.0.0 release' supertask and picked this 
one.

> [Rust] [DataFusion] Add support for * in SQL projection
> ---
>
> Key: ARROW-4815
> URL: https://issues.apache.org/jira/browse/ARROW-4815
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Affects Versions: 0.12.0
>Reporter: Andy Grove
>Priority: Major
>
> Currently column names must always be provided and there is no support for 
> SELECT * FROM table
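The missing piece is essentially wildcard expansion at projection-planning time. A minimal sketch in Python (the function and names are illustrative only, not DataFusion's API; in DataFusion this would happen in Rust while translating the SQL projection into a logical plan):

```python
# Illustrative sketch: replace "*" in a projection list with every
# column from the table schema, in schema order.
def expand_projection(projection, schema_columns):
    cols = []
    for item in projection:
        if item == "*":
            cols.extend(schema_columns)  # substitute all columns
        else:
            cols.append(item)
    return cols

# SELECT * FROM table with columns (id, name) selects both columns.
assert expand_projection(["*"], ["id", "name"]) == ["id", "name"]
```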





[jira] [Created] (ARROW-6844) List columns read broken with 0.15.0

2019-10-09 Thread Benoit Rostykus (Jira)
Benoit Rostykus created ARROW-6844:
--

 Summary: List columns read broken with 0.15.0
 Key: ARROW-6844
 URL: https://issues.apache.org/jira/browse/ARROW-6844
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.15.0
Reporter: Benoit Rostykus


Columns of type `array<T>` (such as `array<int64>`, `array<string>`...) are 
no longer readable using `pyarrow == 0.15.0` (but were 
with `pyarrow == 0.14.1`) when the original writer of the parquet file is 
`parquet-mr 1.9.1`.

```
import pyarrow.parquet as pq

pf = pq.ParquetFile('sample.gz.parquet')

print(pf.read(columns=['profile_ids']))
```
with 0.14.1:
```
pyarrow.Table
profile_ids: list<element: int64>
 child 0, element: int64

...
```
with 0.15.0:

```

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File 
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
 line 253, in read
 use_threads=use_threads)
 File "pyarrow/_parquet.pyx", line 1131, in 
pyarrow._parquet.ParquetReader.read_all
 File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column data for field 0 with type list<element: int64> 
is inconsistent with schema list<item: int64>

```

I've tested parquet files coming from multiple tables (with various schemas) 
created with `parquet-mr`; I couldn't read any `array<T>` column 
anymore.

 

I _think_ the bug was introduced with [this 
commit|https://github.com/apache/arrow/commit/06fd2da5e8e71b660e6eea4b7702ca175e31f3f5].

I think the root of the issue is that `parquet-mr` writes the 
inner struct name as `"element"` by default (see 
[here|https://github.com/apache/parquet-mr/blob/b4198be200e7e2df82bc9a18d54c8cd16aa156ac/parquet-column/src/main/java/org/apache/parquet/schema/ConversionPatterns.java#L33]),
 whereas `parquet-cpp` (or `pyarrow`?) assumes `"item"` (see for example [this 
test|https://github.com/apache/arrow/blob/c805b5fadb548925c915e0e130d6ed03c95d1398/python/pyarrow/tests/test_schema.py#L74]).
 The round-trip tests, which write and read in pyarrow only, obviously won't catch this.
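To illustrate the mismatch, here is a toy sketch (plain tuples, not pyarrow objects; `normalize_list_child` is a hypothetical helper) of renaming the parquet-mr child name to the one pyarrow expects:

```python
# Toy schema model: a list field is (name, (child_name, child_type)).
# parquet-mr names the child "element"; pyarrow expects "item", so a
# strict name comparison rejects the file. Renaming would reconcile them.
def normalize_list_child(list_field):
    name, (child_name, child_type) = list_field
    if child_name == "element":
        child_name = "item"
    return (name, (child_name, child_type))

assert normalize_list_child(("profile_ids", ("element", "int64"))) == (
    "profile_ids", ("item", "int64"))
```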

 

 





[jira] [Assigned] (ARROW-6839) [Java] access File Footer custom_metadata

2019-10-09 Thread Ji Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Liu reassigned ARROW-6839:
-

Assignee: Ji Liu

> [Java] access File Footer custom_metadata
> -
>
> Key: ARROW-6839
> URL: https://issues.apache.org/jira/browse/ARROW-6839
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: John Muehlhausen
>Assignee: Ji Liu
>Priority: Minor
>
> Access custom_metadata from ARROW-6836





[jira] [Closed] (ARROW-4795) [Rust] [DataFusion] Add example for RESTful SQL query service

2019-10-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-4795.
-
Fix Version/s: (was: 1.0.0)
   Resolution: Won't Fix

Closing this since it is not really part of the scope of Arrow/DataFusion.

> [Rust] [DataFusion] Add example for RESTful SQL query service
> -
>
> Key: ARROW-4795
> URL: https://issues.apache.org/jira/browse/ARROW-4795
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Affects Versions: 0.12.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> I built a simple RESTful server using Rocket that allows SQL queries to be 
> executed using DataFusion. This might be a nice thing to include in the 
> project as an example.





[jira] [Resolved] (ARROW-6843) [Website] Disable deploy on pull request

2019-10-09 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-6843.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 33
[https://github.com/apache/arrow-site/pull/33]

> [Website] Disable deploy on pull request
> 
>
> Key: ARROW-6843
> URL: https://issues.apache.org/jira/browse/ARROW-6843
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-6843) [Website] Disable deploy on pull request

2019-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6843:
--
Labels: pull-request-available  (was: )

> [Website] Disable deploy on pull request
> 
>
> Key: ARROW-6843
> URL: https://issues.apache.org/jira/browse/ARROW-6843
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
>






[jira] [Created] (ARROW-6843) [Website] Disable deploy on pull request

2019-10-09 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-6843:
---

 Summary: [Website] Disable deploy on pull request
 Key: ARROW-6843
 URL: https://issues.apache.org/jira/browse/ARROW-6843
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Resolved] (ARROW-6842) [Website] Jekyll error building website

2019-10-09 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-6842.
-
Resolution: Fixed

Issue resolved by pull request 32
[https://github.com/apache/arrow-site/pull/32]

> [Website] Jekyll error building website
> ---
>
> Key: ARROW-6842
> URL: https://issues.apache.org/jira/browse/ARROW-6842
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I'm getting the following error locally on a fresh checkout and {{bundle 
> install --path vendor/bundle}}
> {code}
> $ bundle exec jekyll serve
> Configuration file: /home/wesm/code/arrow-site/_config.yml
> Source: /home/wesm/code/arrow-site
>Destination: build
>  Incremental build: disabled. Enable with --incremental
>   Generating... 
> jekyll 3.8.4 | Error:  wrong number of arguments (given 2, expected 1)
> {code}
> Never seen this so not sure how to debug





[jira] [Updated] (ARROW-6842) [Website] Jekyll error building website

2019-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6842:
--
Labels: pull-request-available  (was: )

> [Website] Jekyll error building website
> ---
>
> Key: ARROW-6842
> URL: https://issues.apache.org/jira/browse/ARROW-6842
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> I'm getting the following error locally on a fresh checkout and {{bundle 
> install --path vendor/bundle}}
> {code}
> $ bundle exec jekyll serve
> Configuration file: /home/wesm/code/arrow-site/_config.yml
> Source: /home/wesm/code/arrow-site
>Destination: build
>  Incremental build: disabled. Enable with --incremental
>   Generating... 
> jekyll 3.8.4 | Error:  wrong number of arguments (given 2, expected 1)
> {code}
> Never seen this so not sure how to debug





[jira] [Commented] (ARROW-6842) [Website] Jekyll error building website

2019-10-09 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948088#comment-16948088
 ] 

Kouhei Sutou commented on ARROW-6842:
-

https://github.com/apache/arrow-site/pull/32 is a workaround.

> [Website] Jekyll error building website
> ---
>
> Key: ARROW-6842
> URL: https://issues.apache.org/jira/browse/ARROW-6842
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm getting the following error locally on a fresh checkout and {{bundle 
> install --path vendor/bundle}}
> {code}
> $ bundle exec jekyll serve
> Configuration file: /home/wesm/code/arrow-site/_config.yml
> Source: /home/wesm/code/arrow-site
>Destination: build
>  Incremental build: disabled. Enable with --incremental
>   Generating... 
> jekyll 3.8.4 | Error:  wrong number of arguments (given 2, expected 1)
> {code}
> Never seen this so not sure how to debug





[jira] [Assigned] (ARROW-6842) [Website] Jekyll error building website

2019-10-09 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-6842:
---

Assignee: Kouhei Sutou

> [Website] Jekyll error building website
> ---
>
> Key: ARROW-6842
> URL: https://issues.apache.org/jira/browse/ARROW-6842
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Kouhei Sutou
>Priority: Major
> Fix For: 1.0.0
>
>
> I'm getting the following error locally on a fresh checkout and {{bundle 
> install --path vendor/bundle}}
> {code}
> $ bundle exec jekyll serve
> Configuration file: /home/wesm/code/arrow-site/_config.yml
> Source: /home/wesm/code/arrow-site
>Destination: build
>  Incremental build: disabled. Enable with --incremental
>   Generating... 
> jekyll 3.8.4 | Error:  wrong number of arguments (given 2, expected 1)
> {code}
> Never seen this so not sure how to debug





[jira] [Commented] (ARROW-6842) [Website] Jekyll error building website

2019-10-09 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948086#comment-16948086
 ] 

Kouhei Sutou commented on ARROW-6842:
-

I sent a pull request: https://github.com/envygeeks/jekyll-assets/pull/620


> [Website] Jekyll error building website
> ---
>
> Key: ARROW-6842
> URL: https://issues.apache.org/jira/browse/ARROW-6842
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I'm getting the following error locally on a fresh checkout and {{bundle 
> install --path vendor/bundle}}
> {code}
> $ bundle exec jekyll serve
> Configuration file: /home/wesm/code/arrow-site/_config.yml
> Source: /home/wesm/code/arrow-site
>Destination: build
>  Incremental build: disabled. Enable with --incremental
>   Generating... 
> jekyll 3.8.4 | Error:  wrong number of arguments (given 2, expected 1)
> {code}
> Never seen this so not sure how to debug





[jira] [Commented] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

2019-10-09 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948083#comment-16948083
 ] 

Andy Grove commented on ARROW-6659:
---

Yes, basically what I am suggesting is that the HashAggregateExec never merges 
and just aggregates each partition.

When we create the physical plan from the logical plan, we should have the 
logic there to know that if we are aggregating a partitioned data source then 
we need to merge and then aggregate again ... so yes, moving logic from 
HashAggregateExec to create_physical_plan, as you said.

Currently the logical plan doesn't know about partitioning though ... so we'd 
need to add that info.
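The two-phase shape described above can be sketched in a few lines of Python (the names `hash_aggregate` and `merge_exec` mirror the plan nodes but are illustrative, not DataFusion code):

```python
from collections import Counter

def hash_aggregate(rows):
    # Partial aggregate over one partition: sum values per key.
    c = Counter()
    for key, val in rows:
        c[key] += val
    return c

def merge_exec(partials):
    # Merge the per-partition results, then aggregate again.
    merged = Counter()
    for p in partials:
        merged.update(p)
    return merged

partitions = [[("a", 1), ("b", 2)], [("a", 3)]]
partials = [hash_aggregate(p) for p in partitions]
assert merge_exec(partials) == {"a": 4, "b": 2}
```

A distributed variant would only swap in a different step between the partial and final aggregates, which is exactly what making the steps explicit in the physical plan enables.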

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> HashAggregateExec current creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
> - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>- MergeExec
>   - DistributedExec
>  - HashAggregate // aggregate per partition{code}





[jira] [Resolved] (ARROW-6782) [C++] Build minimal core Arrow libraries without any Boost headers

2019-10-09 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6782.
-
Resolution: Fixed

Issue resolved by pull request 5611
[https://github.com/apache/arrow/pull/5611]

> [C++] Build minimal core Arrow libraries without any Boost headers
> --
>
> Key: ARROW-6782
> URL: https://issues.apache.org/jira/browse/ARROW-6782
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We have a couple of places where these are used. It would be good to be able 
> to build without any Boost headers available





[jira] [Assigned] (ARROW-6832) [R] Implement Codec::IsAvailable

2019-10-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6832:
--

Assignee: Neal Richardson

> [R] Implement Codec::IsAvailable
> 
>
> Key: ARROW-6832
> URL: https://issues.apache.org/jira/browse/ARROW-6832
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> New in ARROW-6631





[jira] [Updated] (ARROW-6832) [R] Implement Codec::IsAvailable

2019-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6832:
--
Labels: pull-request-available  (was: )

> [R] Implement Codec::IsAvailable
> 
>
> Key: ARROW-6832
> URL: https://issues.apache.org/jira/browse/ARROW-6832
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> New in ARROW-6631





[jira] [Commented] (ARROW-6842) [Website] Jekyll error building website

2019-10-09 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948045#comment-16948045
 ] 

Wes McKinney commented on ARROW-6842:
-

Updating bundler doesn't seem to have helped

> [Website] Jekyll error building website
> ---
>
> Key: ARROW-6842
> URL: https://issues.apache.org/jira/browse/ARROW-6842
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I'm getting the following error locally on a fresh checkout and {{bundle 
> install --path vendor/bundle}}
> {code}
> $ bundle exec jekyll serve
> Configuration file: /home/wesm/code/arrow-site/_config.yml
> Source: /home/wesm/code/arrow-site
>Destination: build
>  Incremental build: disabled. Enable with --incremental
>   Generating... 
> jekyll 3.8.4 | Error:  wrong number of arguments (given 2, expected 1)
> {code}
> Never seen this so not sure how to debug





[jira] [Commented] (ARROW-6842) [Website] Jekyll error building website

2019-10-09 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948044#comment-16948044
 ] 

Wes McKinney commented on ARROW-6842:
-

Here's the diff between a working setup and a non-working setup

https://www.diffchecker.com/rlvpokIR

I'll see if updating bundler makes the problem go away

> [Website] Jekyll error building website
> ---
>
> Key: ARROW-6842
> URL: https://issues.apache.org/jira/browse/ARROW-6842
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I'm getting the following error locally on a fresh checkout and {{bundle 
> install --path vendor/bundle}}
> {code}
> $ bundle exec jekyll serve
> Configuration file: /home/wesm/code/arrow-site/_config.yml
> Source: /home/wesm/code/arrow-site
>Destination: build
>  Incremental build: disabled. Enable with --incremental
>   Generating... 
> jekyll 3.8.4 | Error:  wrong number of arguments (given 2, expected 1)
> {code}
> Never seen this so not sure how to debug





[jira] [Commented] (ARROW-6842) [Website] Jekyll error building website

2019-10-09 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948033#comment-16948033
 ] 

Wes McKinney commented on ARROW-6842:
-

[~kou] in case you know

> [Website] Jekyll error building website
> ---
>
> Key: ARROW-6842
> URL: https://issues.apache.org/jira/browse/ARROW-6842
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I'm getting the following error locally on a fresh checkout and {{bundle 
> install --path vendor/bundle}}
> {code}
> $ bundle exec jekyll serve
> Configuration file: /home/wesm/code/arrow-site/_config.yml
> Source: /home/wesm/code/arrow-site
>Destination: build
>  Incremental build: disabled. Enable with --incremental
>   Generating... 
> jekyll 3.8.4 | Error:  wrong number of arguments (given 2, expected 1)
> {code}
> Never seen this so not sure how to debug





[jira] [Created] (ARROW-6842) [Website] Jekyll error building website

2019-10-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6842:
---

 Summary: [Website] Jekyll error building website
 Key: ARROW-6842
 URL: https://issues.apache.org/jira/browse/ARROW-6842
 Project: Apache Arrow
  Issue Type: Bug
  Components: Website
Reporter: Wes McKinney
 Fix For: 1.0.0


I'm getting the following error locally on a fresh checkout and {{bundle 
install --path vendor/bundle}}

{code}
$ bundle exec jekyll serve
Configuration file: /home/wesm/code/arrow-site/_config.yml
Source: /home/wesm/code/arrow-site
   Destination: build
 Incremental build: disabled. Enable with --incremental
  Generating... 
jekyll 3.8.4 | Error:  wrong number of arguments (given 2, expected 1)
{code}

Never seen this so not sure how to debug





[jira] [Resolved] (ARROW-6822) [Website] merge_pr.py is published

2019-10-09 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-6822.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

> [Website] merge_pr.py is published
> --
>
> Key: ARROW-6822
> URL: https://issues.apache.org/jira/browse/ARROW-6822
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We can download merge_pr.py at https://arrow.apache.org/merge_pr.py





[jira] [Created] (ARROW-6841) [C++] Upgrade to LLVM 8

2019-10-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6841:
---

 Summary: [C++] Upgrade to LLVM 8
 Key: ARROW-6841
 URL: https://issues.apache.org/jira/browse/ARROW-6841
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Now that LLVM 9 has been released, LLVM 8 has been promoted to stable according 
to http://apt.llvm.org/





[jira] [Commented] (ARROW-6514) [Developer][C++][CMake] LLVM tools are restricted to the exact version 7.0

2019-10-09 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948018#comment-16948018
 ] 

Wes McKinney commented on ARROW-6514:
-

AFAIK minor revisions of LLVM are supposed to be bugfix-only, so this might be 
okay

> [Developer][C++][CMake] LLVM tools are restricted to the exact version 7.0
> --
>
> Key: ARROW-6514
> URL: https://issues.apache.org/jira/browse/ARROW-6514
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Priority: Major
>
> I have LLVM 7.1 installed locally, and FindClangTools couldn't locate it 
> because ARROW_LLVM_VERSION is [hardcoded to 
> 7.0|https://github.com/apache/arrow/blob/3f2a33f902983c0d395e0480e8a8df40ed5da29c/cpp/CMakeLists.txt#L91-L99]
>  and clang tools is [restricted to the minor 
> version|https://github.com/apache/arrow/blob/3f2a33f902983c0d395e0480e8a8df40ed5da29c/cpp/cmake_modules/FindClangTools.cmake#L78].
> If it makes sense to restrict clang tools location down to the minor version, 
> then we need to pass the located LLVM's version instead of the hardcoded one.





[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-09 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948016#comment-16948016
 ] 

Neal Richardson commented on ARROW-6830:


That's the intent, though in this example I don't know what {{data_rbfr}} is. 
If it's a file, you can use {{mmap_open()}} to memory map it.

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a col_select = ... argument)
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need manually, i.e. like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  





[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-09 Thread John Muehlhausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947997#comment-16947997
 ] 

John Muehlhausen commented on ARROW-6830:
-

Not sure how the R integration works, but if the 30gigs are memory-mapped but 
you only access certain columns, the other columns won't actually consume any 
memory.

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a col_select = ... argument)
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need manually, i.e. like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  





[jira] [Updated] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths

2019-10-09 Thread John Muehlhausen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Muehlhausen updated ARROW-5916:

Priority: Minor  (was: Blocker)

> [C++] Allow RecordBatch.length to be less than array lengths
> 
>
> Key: ARROW-5916
> URL: https://issues.apache.org/jira/browse/ARROW-5916
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: John Muehlhausen
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: test.arrow_ipc
>
>
> 0.13 ignored RecordBatch.length.  0.14 requires that RecordBatch.length and 
> array length be equal.  As per 
> [https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E]
>  , we discussed changing this so that RecordBatch.length can be [0,array 
> length].
>  If RecordBatch.length is less than array length, the reader should ignore 
> the portion of the array(s) beyond RecordBatch.length.  This will allow 
> partially populated batches to be read in scenarios identified in the above 
> discussion.
> {code:c++}
>   Status GetFieldMetadata(int field_index, ArrayData* out) {
> auto nodes = metadata_->nodes();
> // pop off a field
> if (field_index >= static_cast<int>(nodes->size())) {
>   return Status::Invalid("Ran out of field metadata, likely malformed");
> }
> const flatbuf::FieldNode* node = nodes->Get(field_index);
> *//out->length = node->length();*
> *out->length = metadata_->length();*
> out->null_count = node->null_count();
> out->offset = 0;
> return Status::OK();
>   }
> {code}
> Attached is a test IPC File containing a batch with length 1, array length 3.
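For illustration, a reader honoring a smaller RecordBatch.length would simply truncate each column to that length (a toy Python sketch, not the C++ API):

```python
# Toy sketch: a reader that allows batch_length < array length slices
# every column to batch_length and ignores the rest.
def read_batch(batch_length, columns):
    return {name: values[:batch_length] for name, values in columns.items()}

# A batch of length 1 over arrays of length 3, as in the attached file:
assert read_batch(1, {"f0": [7, 8, 9]}) == {"f0": [7]}
```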





[jira] [Updated] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths

2019-10-09 Thread John Muehlhausen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Muehlhausen updated ARROW-5916:

Priority: Blocker  (was: Minor)

> [C++] Allow RecordBatch.length to be less than array lengths
> 
>
> Key: ARROW-5916
> URL: https://issues.apache.org/jira/browse/ARROW-5916
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: John Muehlhausen
>Priority: Blocker
> Fix For: 1.0.0
>
> Attachments: test.arrow_ipc
>
>
> 0.13 ignored RecordBatch.length.  0.14 requires that RecordBatch.length and 
> array length be equal.  As per 
> [https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E]
>  , we discussed changing this so that RecordBatch.length can be [0,array 
> length].
>  If RecordBatch.length is less than array length, the reader should ignore 
> the portion of the array(s) beyond RecordBatch.length.  This will allow 
> partially populated batches to be read in scenarios identified in the above 
> discussion.
> {code:c++}
>   Status GetFieldMetadata(int field_index, ArrayData* out) {
> auto nodes = metadata_->nodes();
> // pop off a field
> if (field_index >= static_cast<int>(nodes->size())) {
>   return Status::Invalid("Ran out of field metadata, likely malformed");
> }
> const flatbuf::FieldNode* node = nodes->Get(field_index);
> *//out->length = node->length();*
> *out->length = metadata_->length();*
> out->null_count = node->null_count();
> out->offset = 0;
> return Status::OK();
>   }
> {code}
> Attached is a test IPC File containing a batch with length 1, array length 3.
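The proposed semantics can be illustrated with a minimal sketch (plain Python with illustrative names, not the Arrow API): a reader honors the batch-level length and simply ignores array entries beyond it.

```python
# Arrays of physical length 3, but the batch declares a length of 1:
# a reader should expose only the first batch_length entries of each array.
batch_length = 1
arrays = {"a": [10, 20, 30], "b": [1.0, 2.0, 3.0]}

visible = {name: values[:batch_length] for name, values in arrays.items()}
assert visible == {"a": [10], "b": [1.0]}
```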





[jira] [Created] (ARROW-6840) [C++/Python] retrieve fd of open memory mapped file and Open() memory mapped file by fd

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6840:
---

 Summary: [C++/Python] retrieve fd of open memory mapped file and 
Open() memory mapped file by fd
 Key: ARROW-6840
 URL: https://issues.apache.org/jira/browse/ARROW-6840
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: John Muehlhausen


We want to retrieve the file descriptor of a memory mapped file for the purpose 
of transferring it across process boundaries.  On the receiving end, we want to 
be able to map a file based on the file descriptor rather than the path.

This helps with race conditions when the path may have been unlinked.


cf 
[https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b3f28fe283490e379998453a@%3Cdev.arrow.apache.org%3E]
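As a point of reference for why descriptor-based access helps, POSIX semantics already allow mapping a file through its descriptor after the path has been unlinked; a minimal stdlib-only Python sketch (not the Arrow API) shows the race-condition benefit:

```python
import mmap
import os
import tempfile

# Write a small file, then open it and drop the path.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"arrow")

fd = os.open(path, os.O_RDONLY)
os.unlink(path)  # the path is gone, but the descriptor stays valid (POSIX)

# Map the file via the descriptor alone -- no path needed.
m = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
assert m[:] == b"arrow"
m.close()
os.close(fd)
```

An Arrow-level API that exposes the underlying descriptor (and accepts one in `Open()`) would let a receiving process do the same thing after the descriptor is passed across the process boundary.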





[jira] [Created] (ARROW-6839) [Java] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6839:
---

 Summary: [Java] access File Footer custom_metadata
 Key: ARROW-6839
 URL: https://issues.apache.org/jira/browse/ARROW-6839
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: John Muehlhausen


Access custom_metadata from ARROW-6836





[jira] [Created] (ARROW-6838) [JS] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6838:
---

 Summary: [JS] access File Footer custom_metadata
 Key: ARROW-6838
 URL: https://issues.apache.org/jira/browse/ARROW-6838
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Reporter: John Muehlhausen


Access custom_metadata from ARROW-6836





[jira] [Updated] (ARROW-6837) [C++/Python] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Muehlhausen updated ARROW-6837:

Priority: Minor  (was: Major)

> [C++/Python] access File Footer custom_metadata
> ---
>
> Key: ARROW-6837
> URL: https://issues.apache.org/jira/browse/ARROW-6837
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: John Muehlhausen
>Priority: Minor
>
> Access custom_metadata from ARROW-6836





[jira] [Created] (ARROW-6837) [C++/Python] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6837:
---

 Summary: [C++/Python] access File Footer custom_metadata
 Key: ARROW-6837
 URL: https://issues.apache.org/jira/browse/ARROW-6837
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Reporter: John Muehlhausen


Access custom_metadata from ARROW-6836





[jira] [Created] (ARROW-6836) [Format] add a custom_metadata:[KeyValue] field to the Footer table in File.fbs

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6836:
---

 Summary: [Format] add a custom_metadata:[KeyValue] field to the 
Footer table in File.fbs
 Key: ARROW-6836
 URL: https://issues.apache.org/jira/browse/ARROW-6836
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Format
Reporter: John Muehlhausen


add a custom_metadata:[KeyValue] field to the Footer table in File.fbs

Use case:

If a file is expanded with additional record batches and the custom_metadata 
changes, Schema is no longer an appropriate place to record that change, since 
the two copies of Schema (at the beginning and end of the file) would then be 
ambiguous.

cf 
https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E
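Concretely, the proposal amounts to appending one field to the Footer table. A sketch of what the change might look like (the existing fields are reproduced from memory and the new field's name and placement are assumptions):

```
// Sketch of the proposed addition to format/File.fbs
table Footer {
  version: org.apache.arrow.flatbuf.MetadataVersion;
  schema: org.apache.arrow.flatbuf.Schema;
  dictionaries: [ Block ];
  recordBatches: [ Block ];
  custom_metadata: [ KeyValue ];  // new field proposed here
}
```

Appending a field at the end of a FlatBuffers table is a backward-compatible schema evolution: old readers simply ignore the new field, and new readers see it as absent in old files.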





[jira] [Commented] (ARROW-501) [C++] Implement concurrent / buffering InputStream for streaming data use cases

2019-10-09 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947962#comment-16947962
 ] 

Wes McKinney commented on ARROW-501:


We still don't have a {{ReadaheadInputStream}} (that implements the 
{{InputStream}} interface) but we can add one when it is actually needed. 

> [C++] Implement concurrent / buffering InputStream for streaming data use 
> cases
> ---
>
> Key: ARROW-501
> URL: https://issues.apache.org/jira/browse/ARROW-501
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: csv, filesystem, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Related to ARROW-500, when processing an input data stream, we may wish to 
> continue buffering input (up to a maximum buffer size) in between 
> synchronous Read calls
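The kind of buffering described, prefetching on a background thread between synchronous Read calls with a bounded buffer, can be sketched in a few lines (illustrative Python, not the Arrow C++ API):

```python
import io
import queue
import threading

class ReadaheadStream:
    """Wrap a raw stream; a background thread keeps up to max_buffered
    chunks prefetched between synchronous read() calls."""

    def __init__(self, raw, chunk_size=4, max_buffered=2):
        self._raw = raw
        self._chunk_size = chunk_size
        self._chunks = queue.Queue(maxsize=max_buffered)
        threading.Thread(target=self._fill, daemon=True).start()

    def _fill(self):
        while True:
            chunk = self._raw.read(self._chunk_size)
            self._chunks.put(chunk)  # blocks once the buffer is full
            if not chunk:            # empty chunk signals EOF
                return

    def read(self):
        """Return the next prefetched chunk (b'' at end of stream)."""
        return self._chunks.get()

s = ReadaheadStream(io.BytesIO(b"abcdefgh"), chunk_size=4)
assert s.read() == b"abcd"
assert s.read() == b"efgh"
assert s.read() == b""  # EOF
```

The bounded queue provides backpressure: once `max_buffered` chunks are waiting, the prefetch thread blocks, so buffering never exceeds the configured maximum.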





[jira] [Created] (ARROW-6835) [Archery][CMake] Restore ARROW_LINT_ONLY

2019-10-09 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6835:
-

 Summary: [Archery][CMake] Restore ARROW_LINT_ONLY  
 Key: ARROW-6835
 URL: https://issues.apache.org/jira/browse/ARROW-6835
 Project: Apache Arrow
  Issue Type: Bug
  Components: Archery
Reporter: Francois Saint-Jacques


Developers use this to speed up CMake build generation and to reduce the set of 
required installed toolchains (notably libraries). It was yanked because 
ARROW_LINT_ONLY effectively exits early and doesn't generate 
`compile_commands.json`.

Restore this option, but ensure that archery toggles it according to whether 
iwyu or clang-tidy is used.





[jira] [Assigned] (ARROW-5808) [GLib][Ruby] Dockerize (add to docker-compose) current GLib + Ruby Travis CI entry

2019-10-09 Thread Yosuke Shiro (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yosuke Shiro reassigned ARROW-5808:
---

Assignee: Yosuke Shiro

> [GLib][Ruby] Dockerize (add to docker-compose) current GLib + Ruby Travis CI 
> entry
> --
>
> Key: ARROW-5808
> URL: https://issues.apache.org/jira/browse/ARROW-5808
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib, Ruby
>Reporter: Wes McKinney
>Assignee: Yosuke Shiro
>Priority: Major
> Fix For: 1.0.0
>
>
> Add to docker-compose and use in Travis CI
> https://github.com/apache/arrow/blob/master/.travis.yml#L265





[jira] [Commented] (ARROW-6834) [C++] Appveyor build failing on master

2019-10-09 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947947#comment-16947947
 ] 

Uwe Korn commented on ARROW-6834:
-

I fixed a lot of copy commands in that release: 
[https://github.com/conda-forge/gtest-feedstock/pull/40/files#diff-d04c86b6bb20341f5f7c53165501a393]
 I'd be happy to have a second pair of eyes on that.

> [C++] Appveyor build failing on master
> --
>
> Key: ARROW-6834
> URL: https://issues.apache.org/jira/browse/ARROW-6834
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Not sure what introduced this
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27992011/job/cj247lfl0s48xrsl
> {code}
> LINK: command "C:\PROGRA~2\MI0E91~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-public-api-test.dir\public_api_test.cc.obj 
> /out:release\arrow-public-api-test.exe 
> /implib:release\arrow-public-api-test.lib 
> /pdb:release\arrow-public-api-test.pdb /version:0.0 /machine:x64 
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console 
> release\arrow_testing.lib release\arrow.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\double-conversion.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libcrypto.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libssl.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\brotlienc-static.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\brotlidec-static.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\brotlicommon-static.lib 
> C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-config.lib 
> C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-transfer.lib 
> C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-s3.lib 
> C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-core.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\double-conversion.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libboost_filesystem.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libboost_system.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libcrypto.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\aws-c-event-stream.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\aws-c-common.lib BCrypt.lib 
> Kernel32.lib Ws2_32.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\aws-checksums.lib 
> mimalloc_ep\src\mimalloc_ep\lib\mimalloc-1.0\mimalloc-static-release.lib 
> Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST 
> /MANIFESTFILE:release\arrow-public-api-test.exe.manifest" failed (exit code 
> 1120) with the following output:
> public_api_test.cc.obj : error LNK2019: unresolved external symbol 
> "__declspec(dllimport) public: static void __cdecl 
> testing::Test::SetUpTestSuite(void)" 
> (__imp_?SetUpTestSuite@Test@testing@@SAXXZ) referenced in function "public: 
> static void (__cdecl*__cdecl testing::internal::SuiteApiResolver<testing::Test>::GetSetUpCaseOrSuite(char const *,int))(void)" 
> (?GetSetUpCaseOrSuite@?$SuiteApiResolver@VTest@testing@@@internal@testing@@SAP6AXXZPEBDH@Z)
> public_api_test.cc.obj : error LNK2019: unresolved external symbol 
> "__declspec(dllimport) public: static void __cdecl 
> testing::Test::TearDownTestSuite(void)" 
> (__imp_?TearDownTestSuite@Test@testing@@SAXXZ) referenced in function 
> "public: static void (__cdecl*__cdecl 
> testing::internal::SuiteApiResolver<testing::Test>::GetTearDownCaseOrSuite(char const *,int))(void)" 
> (?GetTearDownCaseOrSuite@?$SuiteApiResolver@VTest@testing@@@internal@testing@@SAP6AXXZPEBDH@Z)
> release\arrow-public-api-test.exe : fatal error LNK1120: 2 unresolved 
> externals
> [205/515] Building CXX object 
> src\arrow\CMakeFiles\arrow-array-test.dir\array_test.cc.obj
> [206/515] Building CXX object 
> src\arrow\CMakeFiles\arrow-array-test.dir\array_dict_test.cc.obj
> ninja: build stopped: subcommand failed.
> (arrow) C:\projects\arrow\cpp\build>goto scriptexit 
> {code}





[jira] [Updated] (ARROW-6834) [C++] Appveyor build failing on master

2019-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6834:
--
Labels: pull-request-available  (was: )

> [C++] Appveyor build failing on master
> --
>
> Key: ARROW-6834
> URL: https://issues.apache.org/jira/browse/ARROW-6834
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Not sure what introduced this
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27992011/job/cj247lfl0s48xrsl
> {code}
> LINK: command "C:\PROGRA~2\MI0E91~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-public-api-test.dir\public_api_test.cc.obj 
> /out:release\arrow-public-api-test.exe 
> /implib:release\arrow-public-api-test.lib 
> /pdb:release\arrow-public-api-test.pdb /version:0.0 /machine:x64 
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console 
> release\arrow_testing.lib release\arrow.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\double-conversion.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libcrypto.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libssl.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\brotlienc-static.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\brotlidec-static.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\brotlicommon-static.lib 
> C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-config.lib 
> C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-transfer.lib 
> C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-s3.lib 
> C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-core.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\double-conversion.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libboost_filesystem.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libboost_system.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libcrypto.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\aws-c-event-stream.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\aws-c-common.lib BCrypt.lib 
> Kernel32.lib Ws2_32.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\aws-checksums.lib 
> mimalloc_ep\src\mimalloc_ep\lib\mimalloc-1.0\mimalloc-static-release.lib 
> Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST 
> /MANIFESTFILE:release\arrow-public-api-test.exe.manifest" failed (exit code 
> 1120) with the following output:
> public_api_test.cc.obj : error LNK2019: unresolved external symbol 
> "__declspec(dllimport) public: static void __cdecl 
> testing::Test::SetUpTestSuite(void)" 
> (__imp_?SetUpTestSuite@Test@testing@@SAXXZ) referenced in function "public: 
> static void (__cdecl*__cdecl testing::internal::SuiteApiResolver<testing::Test>::GetSetUpCaseOrSuite(char const *,int))(void)" 
> (?GetSetUpCaseOrSuite@?$SuiteApiResolver@VTest@testing@@@internal@testing@@SAP6AXXZPEBDH@Z)
> public_api_test.cc.obj : error LNK2019: unresolved external symbol 
> "__declspec(dllimport) public: static void __cdecl 
> testing::Test::TearDownTestSuite(void)" 
> (__imp_?TearDownTestSuite@Test@testing@@SAXXZ) referenced in function 
> "public: static void (__cdecl*__cdecl 
> testing::internal::SuiteApiResolver<testing::Test>::GetTearDownCaseOrSuite(char const *,int))(void)" 
> (?GetTearDownCaseOrSuite@?$SuiteApiResolver@VTest@testing@@@internal@testing@@SAP6AXXZPEBDH@Z)
> release\arrow-public-api-test.exe : fatal error LNK1120: 2 unresolved 
> externals
> [205/515] Building CXX object 
> src\arrow\CMakeFiles\arrow-array-test.dir\array_test.cc.obj
> [206/515] Building CXX object 
> src\arrow\CMakeFiles\arrow-array-test.dir\array_dict_test.cc.obj
> ninja: build stopped: subcommand failed.
> (arrow) C:\projects\arrow\cpp\build>goto scriptexit 
> {code}





[jira] [Commented] (ARROW-6834) [C++] Appveyor build failing on master

2019-10-09 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947946#comment-16947946
 ] 

Wes McKinney commented on ARROW-6834:
-

googletest 1.10 was posted to conda-forge 4 hours ago, so this is the root cause.

https://anaconda.org/conda-forge/gtest

> [C++] Appveyor build failing on master
> --
>
> Key: ARROW-6834
> URL: https://issues.apache.org/jira/browse/ARROW-6834
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Not sure what introduced this
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27992011/job/cj247lfl0s48xrsl
> {code}
> LINK: command "C:\PROGRA~2\MI0E91~1.0\VC\bin\amd64\link.exe /nologo 
> src\arrow\CMakeFiles\arrow-public-api-test.dir\public_api_test.cc.obj 
> /out:release\arrow-public-api-test.exe 
> /implib:release\arrow-public-api-test.lib 
> /pdb:release\arrow-public-api-test.pdb /version:0.0 /machine:x64 
> /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console 
> release\arrow_testing.lib release\arrow.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\double-conversion.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libcrypto.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libssl.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\brotlienc-static.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\brotlidec-static.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\brotlicommon-static.lib 
> C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-config.lib 
> C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-transfer.lib 
> C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-s3.lib 
> C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-core.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\double-conversion.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libboost_filesystem.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libboost_system.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gtest.lib 
> googletest_ep-prefix\src\googletest_ep\lib\gmock.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\libcrypto.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\aws-c-event-stream.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\aws-c-common.lib BCrypt.lib 
> Kernel32.lib Ws2_32.lib 
> C:\Miniconda36-x64\envs\arrow\Library\lib\aws-checksums.lib 
> mimalloc_ep\src\mimalloc_ep\lib\mimalloc-1.0\mimalloc-static-release.lib 
> Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib 
> ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST 
> /MANIFESTFILE:release\arrow-public-api-test.exe.manifest" failed (exit code 
> 1120) with the following output:
> public_api_test.cc.obj : error LNK2019: unresolved external symbol 
> "__declspec(dllimport) public: static void __cdecl 
> testing::Test::SetUpTestSuite(void)" 
> (__imp_?SetUpTestSuite@Test@testing@@SAXXZ) referenced in function "public: 
> static void (__cdecl*__cdecl testing::internal::SuiteApiResolver<testing::Test>::GetSetUpCaseOrSuite(char const *,int))(void)" 
> (?GetSetUpCaseOrSuite@?$SuiteApiResolver@VTest@testing@@@internal@testing@@SAP6AXXZPEBDH@Z)
> public_api_test.cc.obj : error LNK2019: unresolved external symbol 
> "__declspec(dllimport) public: static void __cdecl 
> testing::Test::TearDownTestSuite(void)" 
> (__imp_?TearDownTestSuite@Test@testing@@SAXXZ) referenced in function 
> "public: static void (__cdecl*__cdecl 
> testing::internal::SuiteApiResolver<testing::Test>::GetTearDownCaseOrSuite(char const *,int))(void)" 
> (?GetTearDownCaseOrSuite@?$SuiteApiResolver@VTest@testing@@@internal@testing@@SAP6AXXZPEBDH@Z)
> release\arrow-public-api-test.exe : fatal error LNK1120: 2 unresolved 
> externals
> [205/515] Building CXX object 
> src\arrow\CMakeFiles\arrow-array-test.dir\array_test.cc.obj
> [206/515] Building CXX object 
> src\arrow\CMakeFiles\arrow-array-test.dir\array_dict_test.cc.obj
> ninja: build stopped: subcommand failed.
> (arrow) C:\projects\arrow\cpp\build>goto scriptexit 
> {code}





[jira] [Created] (ARROW-6834) [C++] Appveyor build failing on master

2019-10-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6834:
---

 Summary: [C++] Appveyor build failing on master
 Key: ARROW-6834
 URL: https://issues.apache.org/jira/browse/ARROW-6834
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Not sure what introduced this

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27992011/job/cj247lfl0s48xrsl

{code}
LINK: command "C:\PROGRA~2\MI0E91~1.0\VC\bin\amd64\link.exe /nologo 
src\arrow\CMakeFiles\arrow-public-api-test.dir\public_api_test.cc.obj 
/out:release\arrow-public-api-test.exe 
/implib:release\arrow-public-api-test.lib 
/pdb:release\arrow-public-api-test.pdb /version:0.0 /machine:x64 
/NODEFAULTLIB:LIBCMT /INCREMENTAL:NO /subsystem:console 
release\arrow_testing.lib release\arrow.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\double-conversion.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\libcrypto.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\libssl.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\brotlienc-static.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\brotlidec-static.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\brotlicommon-static.lib 
C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-config.lib 
C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-transfer.lib 
C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-s3.lib 
C:\Miniconda36-x64\envs\arrow\Library\bin\aws-cpp-sdk-core.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\double-conversion.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\libboost_filesystem.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\libboost_system.lib 
googletest_ep-prefix\src\googletest_ep\lib\gtest_main.lib 
googletest_ep-prefix\src\googletest_ep\lib\gtest.lib 
googletest_ep-prefix\src\googletest_ep\lib\gmock.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\libcrypto.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\aws-c-event-stream.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\aws-c-common.lib BCrypt.lib 
Kernel32.lib Ws2_32.lib 
C:\Miniconda36-x64\envs\arrow\Library\lib\aws-checksums.lib 
mimalloc_ep\src\mimalloc_ep\lib\mimalloc-1.0\mimalloc-static-release.lib 
Ws2_32.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib 
oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST 
/MANIFESTFILE:release\arrow-public-api-test.exe.manifest" failed (exit code 
1120) with the following output:
public_api_test.cc.obj : error LNK2019: unresolved external symbol 
"__declspec(dllimport) public: static void __cdecl 
testing::Test::SetUpTestSuite(void)" 
(__imp_?SetUpTestSuite@Test@testing@@SAXXZ) referenced in function "public: 
static void (__cdecl*__cdecl testing::internal::SuiteApiResolver<testing::Test>::GetSetUpCaseOrSuite(char const *,int))(void)" 
(?GetSetUpCaseOrSuite@?$SuiteApiResolver@VTest@testing@@@internal@testing@@SAP6AXXZPEBDH@Z)
public_api_test.cc.obj : error LNK2019: unresolved external symbol 
"__declspec(dllimport) public: static void __cdecl 
testing::Test::TearDownTestSuite(void)" 
(__imp_?TearDownTestSuite@Test@testing@@SAXXZ) referenced in function "public: 
static void (__cdecl*__cdecl testing::internal::SuiteApiResolver<testing::Test>::GetTearDownCaseOrSuite(char const *,int))(void)" 
(?GetTearDownCaseOrSuite@?$SuiteApiResolver@VTest@testing@@@internal@testing@@SAP6AXXZPEBDH@Z)
release\arrow-public-api-test.exe : fatal error LNK1120: 2 unresolved externals
[205/515] Building CXX object 
src\arrow\CMakeFiles\arrow-array-test.dir\array_test.cc.obj
[206/515] Building CXX object 
src\arrow\CMakeFiles\arrow-array-test.dir\array_dict_test.cc.obj
ninja: build stopped: subcommand failed.
(arrow) C:\projects\arrow\cpp\build>goto scriptexit 
{code}





[jira] [Updated] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6830:
---
Summary: [R] Select Subset of Columns in read_arrow  (was: Question / 
Feature Request- Select Subset of Columns in read_arrow)

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at all to use *read_arrow* to filter out columns? (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually, i.e. like the following:
> {code:java}
> for (i in 0:(data_rbfr$num_record_batches - 1)) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  





[jira] [Commented] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow

2019-10-09 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947927#comment-16947927
 ] 

Neal Richardson commented on ARROW-6830:


Looking at [https://github.com/apache/arrow/blob/master/r/R/read-table.R], 
{{read_arrow}} returns a data frame while {{read_table}} keeps the data in an 
Arrow Table. Tables have a {{$select()}} method (which is [how 
{{read_csv_arrow}} implements 
{{col_select}}|https://github.com/apache/arrow/blob/master/r/R/csv.R#L124]), 
and you can more naturally access that through the usual {{[}} method. So IIUC, 
what you're trying to do is:
{code:r}
tab <- read_table(data_rbfr)
as.data.frame(tab[, 6])
{code}
and of course you could reference that column by name instead of position.

If you wanted to add {{col_select}} to {{read_arrow()}}, I'd recommend 
following the model of {{read_csv_arrow}}, which sounds pretty straightforward. 
Happy to review a pull request if you submit it.

> Question / Feature Request- Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at all to use *read_arrow* to filter out columns? (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually, i.e. like the following:
> {code:java}
> for (i in 0:(data_rbfr$num_record_batches - 1)) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  





[jira] [Updated] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6830:
---
Component/s: (was: C++)

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at all to use *read_arrow* to filter out columns? (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually, i.e. like the following:
> {code:java}
> for (i in 0:(data_rbfr$num_record_batches - 1)) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  





[jira] [Updated] (ARROW-6833) [R][CI] Add crossbow job for full R autobrew macOS build

2019-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6833:
--
Labels: pull-request-available  (was: )

> [R][CI] Add crossbow job for full R autobrew macOS build
> 
>
> Key: ARROW-6833
> URL: https://issues.apache.org/jira/browse/ARROW-6833
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>
> I have a separate nightly job that runs this on multiple R versions, but it 
> would be nice to be able to have crossbow check this on a PR. As it turns 
> out, the ARROW_S3 feature doesn't work with autobrew in practice--aws-sdk-cpp 
> doesn't seem to ship static libs via Homebrew, so the autobrew packaging 
> doesn't work, even though the formula builds and {{brew audit}} is clean.





[jira] [Created] (ARROW-6833) [R][CI] Add crossbow job for full R autobrew macOS build

2019-10-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6833:
--

 Summary: [R][CI] Add crossbow job for full R autobrew macOS build
 Key: ARROW-6833
 URL: https://issues.apache.org/jira/browse/ARROW-6833
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Neal Richardson
Assignee: Neal Richardson


I have a separate nightly job that runs this on multiple R versions, but it 
would be nice to be able to have crossbow check this on a PR. As it turns out, 
the ARROW_S3 feature doesn't work with autobrew in practice--aws-sdk-cpp 
doesn't seem to ship static libs via Homebrew, so the autobrew packaging 
doesn't work, even though the formula builds and {{brew audit}} is clean.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6832) [R] Implement Codec::IsAvailable

2019-10-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6832:
--

 Summary: [R] Implement Codec::IsAvailable
 Key: ARROW-6832
 URL: https://issues.apache.org/jira/browse/ARROW-6832
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 1.0.0


New in ARROW-6631



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

2019-10-09 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947901#comment-16947901
 ] 

Kyle McCarthy edited comment on ARROW-6659 at 10/9/19 6:03 PM:
---

Do you have a specific solution in mind? I was thinking that this could be done 
by moving some of the logic out from the partitions method in the 
HashAggregateExec to the create_physical_plan on the ExecutionContext. I was 
also thinking that it could probably work with generics.


was (Author: kylemccarthy):
Do you have a specific solution in mind? I was thinking that this could be done 
by pulling some of the logic out from the partitions method in the 
HashAggregateExec, but it probably could also work with generics.

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> HashAggregateExec currently creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
> - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>- MergeExec
>   - DistributedExec
>  - HashAggregate // aggregate per partition{code}
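
The two-phase structure in the quoted plan can be sketched in plain Python (a hypothetical illustration, not DataFusion code): each partition produces a partial (sum, count) state, an explicit merge step combines the states, and a final step produces the result. Making the merge an explicit plan node is what lets a distributed implementation substitute its own merge.

```python
from collections import defaultdict

def partial_aggregate(partition):
    """HashAggregate per partition: group key -> [sum, count] state."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in partition:
        acc[key][0] += value
        acc[key][1] += 1
    return dict(acc)

def merge(partials):
    """The MergeExec step: combine per-partition partial states."""
    merged = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for key, (s, c) in part.items():
            merged[key][0] += s
            merged[key][1] += c
    return dict(merged)

def final_aggregate(merged):
    """Final HashAggregate: turn (sum, count) states into averages."""
    return {key: s / c for key, (s, c) in merged.items()}

partitions = [
    [("a", 1.0), ("b", 2.0)],
    [("a", 3.0), ("b", 4.0), ("b", 6.0)],
]
result = final_aggregate(merge(partial_aggregate(p) for p in partitions))
```

A distributed executor would replace only `merge` (shipping states across the network), leaving the partial and final aggregates untouched.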



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

2019-10-09 Thread Kyle McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947901#comment-16947901
 ] 

Kyle McCarthy commented on ARROW-6659:
--

Do you have a specific solution in mind? I was thinking that this could be done 
by pulling some of the logic out from the partitions method in the 
HashAggregateExec, but it probably could also work with generics.

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -
>
> Key: ARROW-6659
> URL: https://issues.apache.org/jira/browse/ARROW-6659
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> HashAggregateExec currently creates one HashPartition per input partition for 
> the initial aggregate per partition, and then explicitly calls MergeExec and 
> then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For 
> example, it is not possible to provide a different MergeExec implementation 
> that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the 
> physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
> - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to 
> support distributed execution:
> {code:java}
>  - HashAggregate // final aggregate
>- MergeExec
>   - DistributedExec
>  - HashAggregate // aggregate per partition{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6831) [R] Update R macOS/Windows builds for change in cmake compression defaults

2019-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6831:
--
Labels: pull-request-available  (was: )

> [R] Update R macOS/Windows builds for change in cmake compression defaults
> --
>
> Key: ARROW-6831
> URL: https://issues.apache.org/jira/browse/ARROW-6831
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>
> ARROW-6631 changed the defaults for including compression libraries but did 
> not update these build scripts. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6831) [R] Update R macOS/Windows builds for change in cmake compression defaults

2019-10-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6831:
--

 Summary: [R] Update R macOS/Windows builds for change in cmake 
compression defaults
 Key: ARROW-6831
 URL: https://issues.apache.org/jira/browse/ARROW-6831
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson


ARROW-6631 changed the defaults for including compression libraries but did 
not update these build scripts. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6377) [C++] Extending STL API to support row-wise conversion

2019-10-09 Thread Omer Ozarslan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947781#comment-16947781
 ] 

Omer Ozarslan commented on ARROW-6377:
--

||Arrow Type||C++ Type||
|NA
 BOOL
 UINT8
 INT8
 UINT16
 INT16
 UINT32
 INT32
 UINT64
 INT64
 HALF_FLOAT
 FLOAT
 DOUBLE
 STRING
 BINARY
 FIXED_SIZE_BINARY
 DATE32
 DATE64
 TIMESTAMP
 TIME32
 TIME64
 INTERVAL
 DECIMAL
 LIST
 STRUCT
 UNION
 DICTIONARY
 MAP
 EXTENSION
 FIXED_SIZE_LIST
 DURATION
 LARGE_STRING
 LARGE_BINARY
 LARGE_LIST| |

> [C++] Extending STL API to support row-wise conversion
> --
>
> Key: ARROW-6377
> URL: https://issues.apache.org/jira/browse/ARROW-6377
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Omer Ozarslan
>Priority: Major
> Fix For: 1.0.0
>
>
> Using array builders is currently the recommended way in the documentation for 
> converting row-wise data to Arrow tables. However, array builders 
> have a low-level interface to support various use cases in the library. They 
> require additional boilerplate due to type erasure, although some of this 
> boilerplate could be avoided at compile time if the schema is already known 
> and fixed (also discussed in ARROW-4067).
> In another part of the library, the STL API provides a nice abstraction over 
> builders by inferring data types and builders from the values provided, reducing 
> the boilerplate significantly. It currently handles automatic conversion of tuples 
> with a limited set of native types: numeric types, string and 
> vector (+ nullable variations of these in case ARROW-6326 is merged). It also 
> allows passing references in tuple values (implemented recently in 
> ARROW-6284).
> As a more concrete example, this is the code which can be used to convert 
> {{row_data}} provided in examples:
>   
> {code:cpp}
> arrow::Status VectorToColumnarTableSTL(const std::vector<data_row>& rows,
>std::shared_ptr<arrow::Table>* table) {
> auto rng = rows | ranges::views::transform([](const data_row& row) {
>return std::tuple<int64_t, double, const std::vector<double>&>(
>row.id, row.cost, row.cost_components);
>});
> return arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rng,
>{"id", "cost", "cost_components"},
>table);
> }
> {code}
> So, it allows more concise code for consumers of the API compared to using 
> builders directly.
> There is no direct support by the library for other types (binary, struct, 
> union etc. types or converting iterable objects other than vectors to lists). 
> Users are provided a way to specialize their own data structures. One 
> limitation of implicit inference is that it is hard (or even impossible) to 
> infer the exact type to use in some cases. For example, should a 
> {{std::string_view}} value be inferred as string, binary, large binary or 
> list? This ambiguity can be avoided by providing some way for the user to 
> explicitly state the correct type for storing a column. For example, a user can 
> return a so-called {{BinaryCell}} class to return binary values.
> Proposed changes:
>  * Implementing cell "adapters": Cells are non-owning references for each 
> type. It is the user's responsibility to keep the pointed-to values alive. 
> (Can scalars be used in this context?)
>  ** BinaryCell
>  ** StringCell
>  ** ListCell (for adapting any Range)
>  ** StructCell
>  ** ...
>  * Primitive types don't need such adapters since their values are trivial to 
> cast (e.g. just use int8_t(value) to use Int8Type).
>  * Adding benchmarks comparing with builder performance. There is likely 
> to be some performance penalty due to hindering compiler optimizations. Yet, 
> this is acceptable in exchange for more concise code, IMHO. For fine-grained 
> control over performance, it will still be possible to use builders directly.
> I have implemented something similar to BinaryCell for my use case. If above 
> changes sound reasonable, I will go ahead and start implementing other cells 
> to submit.
>  
>  
>  
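
As a rough illustration of what the STL helper automates, here is a plain-Python sketch (hypothetical data, not the Arrow API) of the row-tuple-to-column transposition that {{TableFromTupleRange}} performs, using the column names from the example above:

```python
# Hypothetical rows matching the data_row example: (id, cost, cost_components).
rows = [
    (1, 1.5, [0.5, 1.0]),
    (2, 3.0, [1.0, 2.0]),
]

# Transpose row tuples into named columns -- the essence of what
# arrow::stl::TableFromTupleRange does, with each field's builder and
# Arrow type inferred from the tuple element's C++ type.
names = ["id", "cost", "cost_components"]
columns = dict(zip(names, map(list, zip(*rows))))
```

The C++ version does this at compile time from the tuple's types, which is exactly why ambiguous cases like `std::string_view` need an explicit cell adapter.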



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (ARROW-6377) [C++] Extending STL API to support row-wise conversion

2019-10-09 Thread Omer Ozarslan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omer Ozarslan updated ARROW-6377:
-
Comment: was deleted

(was: ||Arrow Type||C++ Type||
|NA
 BOOL
 UINT8
 INT8
 UINT16
 INT16
 UINT32
 INT32
 UINT64
 INT64
 HALF_FLOAT
 FLOAT
 DOUBLE
 STRING
 BINARY
 FIXED_SIZE_BINARY
 DATE32
 DATE64
 TIMESTAMP
 TIME32
 TIME64
 INTERVAL
 DECIMAL
 LIST
 STRUCT
 UNION
 DICTIONARY
 MAP
 EXTENSION
 FIXED_SIZE_LIST
 DURATION
 LARGE_STRING
 LARGE_BINARY
 LARGE_LIST| |)

> [C++] Extending STL API to support row-wise conversion
> --
>
> Key: ARROW-6377
> URL: https://issues.apache.org/jira/browse/ARROW-6377
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Omer Ozarslan
>Priority: Major
> Fix For: 1.0.0
>
>
> Using array builders is currently the recommended way in the documentation for 
> converting row-wise data to Arrow tables. However, array builders 
> have a low-level interface to support various use cases in the library. They 
> require additional boilerplate due to type erasure, although some of this 
> boilerplate could be avoided at compile time if the schema is already known 
> and fixed (also discussed in ARROW-4067).
> In another part of the library, the STL API provides a nice abstraction over 
> builders by inferring data types and builders from the values provided, reducing 
> the boilerplate significantly. It currently handles automatic conversion of tuples 
> with a limited set of native types: numeric types, string and 
> vector (+ nullable variations of these in case ARROW-6326 is merged). It also 
> allows passing references in tuple values (implemented recently in 
> ARROW-6284).
> As a more concrete example, this is the code which can be used to convert 
> {{row_data}} provided in examples:
>   
> {code:cpp}
> arrow::Status VectorToColumnarTableSTL(const std::vector<data_row>& rows,
>std::shared_ptr<arrow::Table>* table) {
> auto rng = rows | ranges::views::transform([](const data_row& row) {
>return std::tuple<int64_t, double, const std::vector<double>&>(
>row.id, row.cost, row.cost_components);
>});
> return arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rng,
>{"id", "cost", "cost_components"},
>table);
> }
> {code}
> So, it allows more concise code for consumers of the API compared to using 
> builders directly.
> There is no direct support by the library for other types (binary, struct, 
> union etc. types or converting iterable objects other than vectors to lists). 
> Users are provided a way to specialize their own data structures. One 
> limitation of implicit inference is that it is hard (or even impossible) to 
> infer the exact type to use in some cases. For example, should a 
> {{std::string_view}} value be inferred as string, binary, large binary or 
> list? This ambiguity can be avoided by providing some way for the user to 
> explicitly state the correct type for storing a column. For example, a user can 
> return a so-called {{BinaryCell}} class to return binary values.
> Proposed changes:
>  * Implementing cell "adapters": Cells are non-owning references for each 
> type. It is the user's responsibility to keep the pointed-to values alive. 
> (Can scalars be used in this context?)
>  ** BinaryCell
>  ** StringCell
>  ** ListCell (for adapting any Range)
>  ** StructCell
>  ** ...
>  * Primitive types don't need such adapters since their values are trivial to 
> cast (e.g. just use int8_t(value) to use Int8Type).
>  * Adding benchmarks comparing with builder performance. There is likely 
> to be some performance penalty due to hindering compiler optimizations. Yet, 
> this is acceptable in exchange for more concise code, IMHO. For fine-grained 
> control over performance, it will still be possible to use builders directly.
> I have implemented something similar to BinaryCell for my use case. If above 
> changes sound reasonable, I will go ahead and start implementing other cells 
> to submit.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6782) [C++] Build minimal core Arrow libraries without any Boost headers

2019-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6782:
--
Labels: pull-request-available  (was: )

> [C++] Build minimal core Arrow libraries without any Boost headers
> --
>
> Key: ARROW-6782
> URL: https://issues.apache.org/jira/browse/ARROW-6782
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We have a couple of places where these are used. It would be good to be able 
> to build without any Boost headers available



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6782) [C++] Build minimal core Arrow libraries without any Boost headers

2019-10-09 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6782:
---

Assignee: Wes McKinney

> [C++] Build minimal core Arrow libraries without any Boost headers
> --
>
> Key: ARROW-6782
> URL: https://issues.apache.org/jira/browse/ARROW-6782
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We have a couple of places where these are used. It would be good to be able 
> to build without any Boost headers available



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow

2019-10-09 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-6830:
-
Description: 
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible to use *read_arrow* to filter out columns?  (similar to 
how *read_feather* has a col_select = ... argument)

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually.  ie like the following:
{code:java}
for(i in 0:data_rbfr$num_record_batches) {
rbn <- data_rbfr$get_batch(i)
  
  if (i == 0) 
  {
merged <- as.data.frame(rbn$column(5)$as_vector())
  }
  else 
  {
dfn <- as.data.frame(rbn$column(5)$as_vector())
merged <- rbind(merged,dfn)
  }

  print(paste(i, nrow(merged)))
} {code}
 

 

  was:
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible to use *read_arrow* to filter out columns?  (similar to 
how *read_feather* has a col_select = ... argument)

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually.  ie like the following:


{code:java}
for(i in 0:data_rbfr$num_record_batches) {
  rbn <- data_rbfr$get_batch(i)

  if (i == 0) 
  {
    merged <- as.data.frame(rbn$column(5)$as_vector())
  }
  else 
  {
    dfn <- as.data.frame(rbn$column(5)$as_vector())
    merged <- rbind(merged,dfn)
  }

  print(paste(i, nrow(merged)))
}
{code}

 

 


> Question / Feature Request- Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a col_select = ... argument)
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow

2019-10-09 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-6830:
-
Description: 
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible to use *read_arrow* to filter out columns?  (similar to 
how *read_feather* has a col_select = ... argument)

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually.  ie like the following:


{code:java}
for(i in 0:data_rbfr$num_record_batches) {
  rbn <- data_rbfr$get_batch(i)

  if (i == 0) 
  {
    merged <- as.data.frame(rbn$column(5)$as_vector())
  }
  else 
  {
    dfn <- as.data.frame(rbn$column(5)$as_vector())
    merged <- rbind(merged,dfn)
  }

  print(paste(i, nrow(merged)))
}
{code}

 

 

  was:
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible to use *read_arrow* to filter out columns?  (similar to 
how *read_feather* has a col_select = ... argument)

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually.  ie like the following:

{code:java}
data_rbfr <- arrow::RecordBatchFileReader("arrowfile")

for(i in 0:data_rbfr$num_record_batches) {
  rbn <- data_rbfr$get_batch(i)
  if (i == 0) 
  {
    merged <- as.data.frame(rbn$column(5)$as_vector())
  }
  else 
  {
    dfn <- as.data.frame(rbn$column(5)$as_vector())
    merged <- rbind(merged,dfn)
  }
}
{code}

 


> Question / Feature Request- Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a col_select = ... argument)
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
>   rbn <- data_rbfr$get_batch(i)
> 
>   if (i == 0) 
>   {
>     merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
>     dfn <- as.data.frame(rbn$column(5)$as_vector())
>     merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> }
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6274) [Rust] [DataFusion] Add support for writing results to CSV

2019-10-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-6274.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5577
[https://github.com/apache/arrow/pull/5577]

> [Rust] [DataFusion] Add support for writing results to CSV
> --
>
> Key: ARROW-6274
> URL: https://issues.apache.org/jira/browse/ARROW-6274
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Hengruo Zhang
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> There is currently no simple way to write query results to CSV. It would be 
> good to have convenience methods either in ExecutionContext or separate 
> utility methods to enable results to be written in CSV format to stdout or to 
> a file.
> There is sample code in unit tests for this and the approach is to iterate 
> over each row in a batch and then iterate over each column and downcast it to 
> an appropriate type (based on the schema associated with the batch) and then 
> pull out the value for the row.
> See 
> [https://github.com/apache/arrow/blob/master/rust/datafusion/tests/sql.rs#L425-L497]
>  for example code in a test
>  
>  
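
The approach described above (iterate over each row, pull out each column's value, write CSV) can be sketched with Python's standard csv module -- a hypothetical illustration of the row-wise iteration, not the proposed Rust API:

```python
import csv
import io

# Hypothetical columnar "batch": column name -> list of values,
# standing in for a RecordBatch and its schema.
batch = {"city": ["NYC", "SF"], "pop": [8400000, 880000]}

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(batch.keys())  # header row comes from the schema

# Iterate row-wise across the columns, as the referenced unit tests do
# (there, each column is downcast to its concrete array type first).
for row in zip(*batch.values()):
    writer.writerow(row)

output = buf.getvalue()
```

A convenience method on ExecutionContext would essentially wrap this loop over every batch in the result set, writing to stdout or a file.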



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow

2019-10-09 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-6830:
-
Description: 
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible to use *read_arrow* to filter out columns?  (similar to 
how *read_feather* has a col_select = ... argument)

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually.  ie like the following:

{code:java}
data_rbfr <- arrow::RecordBatchFileReader("arrowfile")

for(i in 0:data_rbfr$num_record_batches) {
  rbn <- data_rbfr$get_batch(i)
  if (i == 0) 
  {
    merged <- as.data.frame(rbn$column(5)$as_vector())
  }
  else 
  {
    dfn <- as.data.frame(rbn$column(5)$as_vector())
    merged <- rbind(merged,dfn)
  }
}
{code}

 

  was:
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible to use *read_arrow* to filter out columns?  (similar to 
how *read_feather* has a col_select = ... argument)

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually.  ie like the following:

data_rbfr <- arrow::RecordBatchFileReader("arrowfile")

FOREACH BATCH:
 batch <- data_rbfr$get_batch(i) 
col4 <- batch$column(4)
 col5 <- batch$column(7)

 


> Question / Feature Request- Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a col_select = ... argument)
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {code:java}
> data_rbfr <- arrow::RecordBatchFileReader("arrowfile")
> 
> for(i in 0:data_rbfr$num_record_batches) {
>   rbn <- data_rbfr$get_batch(i)
>   if (i == 0) 
>   {
>     merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
>     dfn <- as.data.frame(rbn$column(5)$as_vector())
>     merged <- rbind(merged,dfn)
>   }
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6631) [C++] Do not build with any compression library dependencies by default

2019-10-09 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6631.
-
Resolution: Fixed

Issue resolved by pull request 5607
[https://github.com/apache/arrow/pull/5607]

> [C++] Do not build with any compression library dependencies by default
> ---
>
> Key: ARROW-6631
> URL: https://issues.apache.org/jira/browse/ARROW-6631
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Numerous packaging scripts will have to be updated if we decide to do this. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow

2019-10-09 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-6830:


 Summary: Question / Feature Request- Select Subset of Columns in 
read_arrow
 Key: ARROW-6830
 URL: https://issues.apache.org/jira/browse/ARROW-6830
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, R
Reporter: Anthony Abate


*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible to use *read_arrow* to filter out columns?  (similar to 
how *read_feather* has a col_select = ... argument)

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually.  ie like the following:

data_rbfr <- arrow::RecordBatchFileReader("arrowfile")

FOREACH BATCH:
 batch <- data_rbfr$get_batch(i) 
col4 <- batch$column(4)
 col5 <- batch$column(7)
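
For illustration, the per-batch column-selection workaround looks like this in plain Python (hypothetical stand-in data, not the arrow R or C++ API); a built-in col_select-style option would avoid materialising the unwanted columns at all:

```python
# Hypothetical stand-in for a file of record batches: each batch is a
# list of columns, each column a list of values.
batches = [
    [[1, 2], ["a", "b"]],
    [[3], ["c"]],
]

# Select only column 0 from each batch and concatenate -- the same
# shape as the R loop above that rbinds one column per batch.
merged = []
for batch in batches:
    merged.extend(batch[0])
```

With ~1000 columns and 12,000 batches, reading only the requested columns per batch is the difference between touching a few columns' buffers and decoding the whole 30 GB file.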

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6829) [Docs] Migrate integration test docs to Sphinx, fix instructions after ARROW-6466

2019-10-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6829:
---

 Summary: [Docs] Migrate integration test docs to Sphinx, fix 
instructions after ARROW-6466
 Key: ARROW-6829
 URL: https://issues.apache.org/jira/browse/ARROW-6829
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Wes McKinney
 Fix For: 1.0.0


Follow up to ARROW-6466.

Also, the readme uses out of date archery flags

https://github.com/apache/arrow/blob/master/integration/README.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6823) [C++][Python][R] Support metadata in the feather format?

2019-10-09 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947688#comment-16947688
 ] 

Wes McKinney commented on ARROW-6823:
-

I see. It's probably time for me to take care of the FeatherV2 implementation. 
It's probably a day of work or less -- I think it is doable for 1.0.0

> [C++][Python][R] Support metadata in the feather format?
> 
>
> Key: ARROW-6823
> URL: https://issues.apache.org/jira/browse/ARROW-6823
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: feather
>
> This might need to wait / could be enabled by the feather v2 (ARROW-5510), 
> but thought to open a specific issue about it: do we want to support saving 
> metadata in feather files?
> With Parquet files, you can have file-level metadata (which we currently use 
> to eg store the pandas_metadata). I think it would be useful to have a 
> similar mechanism for Feather files.
> A use case where this came up is in GeoPandas where we would like to store 
> the Coordinate Reference System identifier of the geometry data inside the 
> file, to avoid needing a sidecar file just for that.
> In a v2 world (using the IPC format), I suppose this could be the metadata of 
> the Schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6826) [Archery] Default build should be minimal

2019-10-09 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947686#comment-16947686
 ] 

Ben Kietzman commented on ARROW-6826:
-

Related: currently the flag {{--with-parquet}} wraps the cmake flag 
{{ARROW_PARQUET}}. This is much more ergonomic than setting the cmake flag 
using {{--cmake-extras="-DARROW_PARQUET=ON"}}, but the mapping can only be 
discovered by reading archery's code or 
{{$CMAKE_BUILD_DIR/cmake_options.json}}. Additionally, there are many 
Arrow-specific options in {{DefineOptions.cmake}} which aren't exposed through a 
convenience flag (for example {{ARROW_USE_ASAN}}). I think it would improve 
usability if all options present in {{DefineOptions.cmake}} were present as 
convenience flags for {{archery build}} and the mapping explained through 
{{archery build --help-explain-cmake-options}} or so.

> [Archery] Default build should be minimal
> -
>
> Key: ARROW-6826
> URL: https://issues.apache.org/jira/browse/ARROW-6826
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Archery
>Reporter: Francois Saint-Jacques
>Priority: Minor
>
> Follow-up of https://github.com/apache/arrow/pull/5600/files#r332655141





[jira] [Created] (ARROW-6827) [Archery] lint sub-command should provide a --fail-fast option

2019-10-09 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6827:
-

 Summary: [Archery] lint sub-command should provide a --fail-fast 
option
 Key: ARROW-6827
 URL: https://issues.apache.org/jira/browse/ARROW-6827
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Archery
Reporter: Francois Saint-Jacques








[jira] [Updated] (ARROW-6828) [Archery] Benchmark diff should provide a TUI friendly output

2019-10-09 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6828:
--
Description: 
The goal is to provide a format similar to the ursabot markdown comment:

- Filter to only changed results (ignore small deltas)
- Colors on success/failure
- Tabulated output

> [Archery] Benchmark diff should provide a TUI friendly output
> -
>
> Key: ARROW-6828
> URL: https://issues.apache.org/jira/browse/ARROW-6828
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Archery
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> The goal is to provide a format similar to the ursabot markdown comment:
>  
> - Filter to only changed results (ignore small deltas)
> - Colors on success/failure
> - Tabulated output





[jira] [Created] (ARROW-6828) [Archery] Benchmark diff should provide a TUI friendly output

2019-10-09 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6828:
-

 Summary: [Archery] Benchmark diff should provide a TUI friendly 
output
 Key: ARROW-6828
 URL: https://issues.apache.org/jira/browse/ARROW-6828
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Archery
Reporter: Francois Saint-Jacques








[jira] [Resolved] (ARROW-6466) [Developer] Refactor integration/integration_test.py into a proper Python package

2019-10-09 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6466.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5600
[https://github.com/apache/arrow/pull/5600]

> [Developer] Refactor integration/integration_test.py into a proper Python 
> package
> -
>
> Key: ARROW-6466
> URL: https://issues.apache.org/jira/browse/ARROW-6466
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> This could also facilitate writing unit tests for the integration tests.
> Maybe this could be a part of archery?





[jira] [Created] (ARROW-6826) [Archery] Default build should be minimal

2019-10-09 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6826:
-

 Summary: [Archery] Default build should be minimal
 Key: ARROW-6826
 URL: https://issues.apache.org/jira/browse/ARROW-6826
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Archery
Reporter: Francois Saint-Jacques


Follow-up of https://github.com/apache/arrow/pull/5600/files#r332655141





[jira] [Created] (ARROW-6825) [C++] Rework CSV reader IO around readahead iterator

2019-10-09 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6825:
-

 Summary: [C++] Rework CSV reader IO around readahead iterator
 Key: ARROW-6825
 URL: https://issues.apache.org/jira/browse/ARROW-6825
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Following ARROW-6764, we should try to remove the custom ReadaheadSpooler and 
use the generic readahead iteration facility instead. This will require 
reworking the blocking / chunking logic to mimic what is done in the JSON 
reader.
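The generic readahead facility can be pictured in miniature: a background thread fills a bounded queue ahead of the consumer, so reading the next block overlaps with processing the current one. A conceptual Python sketch (not Arrow's actual C++ iterator API):

```python
import threading
import queue

_DONE = object()  # sentinel marking exhaustion of the underlying iterator

def readahead(iterator, buffer_size=4):
    """Yield items from `iterator`, prefetched by a background thread
    into a queue holding at most `buffer_size` pending items."""
    q = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterator:
            q.put(item)  # blocks when the buffer is full
        q.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item
```

While the consumer handles one block, the producer can already be fetching the next, which is the overlap the readahead iterator provides to the CSV and JSON readers.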





[jira] [Resolved] (ARROW-501) [C++] Implement concurrent / buffering InputStream for streaming data use cases

2019-10-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-501.
--
Resolution: Duplicate

The readahead iterator and the input stream iterator added in ARROW-6764 should 
address this need.

> [C++] Implement concurrent / buffering InputStream for streaming data use 
> cases
> ---
>
> Key: ARROW-501
> URL: https://issues.apache.org/jira/browse/ARROW-501
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: csv, filesystem, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Related to ARROW-500, when processing an input data stream, we may wish to 
> continue buffering input (up to a maximum buffer size) in between 
> synchronous Read calls





[jira] [Resolved] (ARROW-6778) [C++] Support DurationType in Cast kernel

2019-10-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-6778.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5578
[https://github.com/apache/arrow/pull/5578]

> [C++] Support DurationType in Cast kernel
> -
>
> Key: ARROW-6778
> URL: https://issues.apache.org/jira/browse/ARROW-6778
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently, duration is not yet supported in basic cast operations (using the 
> python binding from ARROW-5855, currently from my branch, not yet merged):
> {code}
> In [25]: arr = pa.array([1, 2])
> In [26]: arr.cast(pa.duration('s'))  
> ...
> ArrowNotImplementedError: No cast implemented from int64 to duration[s]
> In [27]: arr = pa.array([1, 2], pa.duration('s'))  
> In [28]: arr.cast(pa.duration('ms'))
> ...
> ArrowNotImplementedError: No cast implemented from duration[s] to duration[ms]
> {code}





[jira] [Resolved] (ARROW-6764) [C++] Add readahead iterator

2019-10-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-6764.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5573
[https://github.com/apache/arrow/pull/5573]

> [C++] Add readahead iterator
> 
>
> Key: ARROW-6764
> URL: https://issues.apache.org/jira/browse/ARROW-6764
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> This could replace the current ad-hoc ReadaheadSpooler, at least for JSON.
> CSV currently uses non-zero padding, but it could switch to the same strategy 
> as JSON (i.e. keep track of partial / completion blocks).





[jira] [Resolved] (ARROW-6824) [Plasma] Support batched create and seal requests for small objects

2019-10-09 Thread Philipp Moritz (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Moritz resolved ARROW-6824.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5596
[https://github.com/apache/arrow/pull/5596]

> [Plasma] Support batched create and seal requests for small objects
> ---
>
> Key: ARROW-6824
> URL: https://issues.apache.org/jira/browse/ARROW-6824
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Affects Versions: 0.15.0
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently the plasma create API supports creating and sealing a single object 
> – this makes sense for large objects because their creation throughput is 
> limited by the memory throughput of the client when the data is filled into 
> the buffer. However sometimes we want to create lots of small objects in 
> which case the throughput is limited by the number of IPCs to the store we 
> can do when creating new objects. This can be fixed by offering a version of 
> CreateAndSeal that allows us to create multiple objects at the same time.





[jira] [Updated] (ARROW-6824) [Plasma] Support batched create and seal requests for small objects

2019-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6824:
--
Labels: pull-request-available  (was: )

> [Plasma] Support batched create and seal requests for small objects
> ---
>
> Key: ARROW-6824
> URL: https://issues.apache.org/jira/browse/ARROW-6824
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Affects Versions: 0.15.0
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
>
> Currently the plasma create API supports creating and sealing a single object 
> – this makes sense for large objects because their creation throughput is 
> limited by the memory throughput of the client when the data is filled into 
> the buffer. However sometimes we want to create lots of small objects in 
> which case the throughput is limited by the number of IPCs to the store we 
> can do when creating new objects. This can be fixed by offering a version of 
> CreateAndSeal that allows us to create multiple objects at the same time.





[jira] [Created] (ARROW-6824) [Plasma] Support batched create and seal requests for small objects

2019-10-09 Thread Philipp Moritz (Jira)
Philipp Moritz created ARROW-6824:
-

 Summary: [Plasma] Support batched create and seal requests for 
small objects
 Key: ARROW-6824
 URL: https://issues.apache.org/jira/browse/ARROW-6824
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma
Affects Versions: 0.15.0
Reporter: Philipp Moritz


Currently the plasma create API supports creating and sealing a single object – 
this makes sense for large objects because their creation throughput is limited 
by the memory throughput of the client when the data is filled into the buffer. 
However sometimes we want to create lots of small objects in which case the 
throughput is limited by the number of IPCs to the store we can do when 
creating new objects. This can be fixed by offering a version of CreateAndSeal 
that allows us to create multiple objects at the same time.
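The throughput argument can be made concrete with a toy model (pure Python, not the Plasma API): if each round trip to the store has a fixed cost, batching k create-and-seal requests into one message divides the IPC count by k.

```python
import math

def ipcs_unbatched(n_objects):
    """One create-and-seal round trip per object."""
    return n_objects

def ipcs_batched(n_objects, batch_size):
    """One round trip per batch of create-and-seal requests."""
    return math.ceil(n_objects / batch_size)
```

For many small objects the IPC count, not memory bandwidth, dominates, which is why the batched API pays off there but matters little for large objects.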





[jira] [Assigned] (ARROW-6689) [Rust] [DataFusion] Query execution enhancements for 1.0.0 release

2019-10-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-6689:
-

Assignee: Andy Grove

> [Rust] [DataFusion] Query execution enhancements for 1.0.0 release
> --
>
> Key: ARROW-6689
> URL: https://issues.apache.org/jira/browse/ARROW-6689
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> There are a number of optimizations that can be made to the new query execution, 
> and this is a top-level story to track them all.





[jira] [Closed] (ARROW-5227) [Rust] [DataFusion] Re-implement query execution with an extensible physical query plan

2019-10-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove closed ARROW-5227.
-
Resolution: Fixed

> [Rust] [DataFusion] Re-implement query execution with an extensible physical 
> query plan
> ---
>
> Key: ARROW-5227
> URL: https://issues.apache.org/jira/browse/ARROW-5227
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
>  This story (maybe it should have been an epic with hindsight) is to 
> re-implement query execution in DataFusion using a physical plan that 
> supports partitions and parallel execution.
> This will replace the current query execution which happens directly from the 
> logical plan.
> The new physical plan is based on traits and is therefore extensible by other 
> projects that use Arrow. For example, another project could add physical 
> plans for distributed compute.
> See design doc at 
> [https://docs.google.com/document/d/1ATZGIs8ry_kJeoTgmJjLrg6Ssb5VE7lNzWuz_4p6EWk/edit?usp=sharing]
>  for more info





[jira] [Commented] (ARROW-6666) [Rust] [DataFusion] Implement string literal expression

2019-10-09 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947432#comment-16947432
 ] 

Andy Grove commented on ARROW-6666:
---

I realize now that we didn't ever support this. This will basically involve 
a different code path than how we deal with primitive types. I have moved 
this out of the ARROW-5227 story and can take a look at this soon, and maybe 
just show you some sample code if you're still interested in taking this on.

> [Rust] [DataFusion] Implement string literal expression
> ---
>
> Key: ARROW-6666
> URL: https://issues.apache.org/jira/browse/ARROW-6666
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner
> Fix For: 1.0.0
>
>
> Implement string literal expression in the new physical query plan. It is 
> already implemented in the code that executed directly from the logical plan 
> so it should largely be a copy and paste exercise.
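The shape of such an expression is simple; a conceptual Python sketch (DataFusion's actual implementation is in Rust against Arrow arrays): a literal expression ignores the batch's column data and produces one copy of its value per row.

```python
class StringLiteralExpr:
    """Physical expression yielding a constant string for every row."""

    def __init__(self, value):
        self.value = value

    def evaluate(self, num_rows):
        # A real implementation would build an Arrow StringArray;
        # here a plain list stands in for the output column.
        return [self.value] * num_rows
```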





[jira] [Updated] (ARROW-6666) [Rust] [DataFusion] Implement string literal expression

2019-10-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-6666:
--
Parent Issue: ARROW-6689  (was: ARROW-5227)

> [Rust] [DataFusion] Implement string literal expression
> ---
>
> Key: ARROW-6666
> URL: https://issues.apache.org/jira/browse/ARROW-6666
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner
> Fix For: 1.0.0
>
>
> Implement string literal expression in the new physical query plan. It is 
> already implemented in the code that executed directly from the logical plan 
> so it should largely be a copy and paste exercise.





[jira] [Updated] (ARROW-6689) [Rust] [DataFusion] Query execution enhancements for 1.0.0 release

2019-10-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-6689:
--
Summary: [Rust] [DataFusion] Query execution enhancements for 1.0.0 release 
 (was: [Rust] [DataFusion] Optimize query execution)

> [Rust] [DataFusion] Query execution enhancements for 1.0.0 release
> --
>
> Key: ARROW-6689
> URL: https://issues.apache.org/jira/browse/ARROW-6689
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> There are a number of optimizations that can be made to the new query execution, 
> and this is a top-level story to track them all.


