[jira] [Updated] (ARROW-6869) [C++] Dictionary "delta" building logic in builder_dict.h produces invalid arrays

2019-10-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6869:
---
Priority: Blocker  (was: Major)

> [C++] Dictionary "delta" building logic in builder_dict.h produces invalid 
> arrays
> -
>
> Key: ARROW-6869
> URL: https://issues.apache.org/jira/browse/ARROW-6869
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Looking at the unit tests for the dictionary delta logic -- the arrays that 
> are produced by subsequent invocations of {{Finish}} yield DictionaryArray 
> instances with partial dictionaries. I think this is misleading (I was 
> surprised to find this while working on ARROW-6861). We should develop a 
> different approach to computing dictionary delta. 
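To illustrate the concern, here is a minimal sketch in plain Python. It is not Arrow's builder code; the class `DeltaDictBuilder` and its methods are hypothetical. It mimics a builder whose `finish` returns only the dictionary values added since the previous call, while the indices still refer to the cumulative dictionary. Decoding the second result against its own partial dictionary fails, which is the sense in which such an array is not valid on its own.

```python
class DeltaDictBuilder:
    """Toy dictionary encoder; a hypothetical stand-in, not Arrow's API."""

    def __init__(self):
        self._memo = {}      # value -> index in the cumulative dictionary
        self._values = []    # cumulative dictionary values
        self._indices = []   # indices appended since the last finish()
        self._emitted = 0    # number of dictionary values already emitted

    def append(self, value):
        if value not in self._memo:
            self._memo[value] = len(self._values)
            self._values.append(value)
        self._indices.append(self._memo[value])

    def finish(self):
        """Return (indices, delta); delta holds only the new dictionary values."""
        delta = self._values[self._emitted:]
        indices, self._indices = self._indices, []
        self._emitted = len(self._values)
        return indices, delta


def decode(indices, dictionary):
    return [dictionary[i] for i in indices]


builder = DeltaDictBuilder()
for v in ["a", "b", "a"]:
    builder.append(v)
first_indices, first_delta = builder.finish()

for v in ["b", "c"]:
    builder.append(v)
second_indices, second_delta = builder.finish()

# Decoding works once the deltas are concatenated into the full dictionary.
full_dictionary = first_delta + second_delta
decoded = decode(second_indices, full_dictionary)

# The second result is not self-contained: its indices point past the end
# of its own partial dictionary.
try:
    decode(second_indices, second_delta)
    partial_ok = True
except IndexError:
    partial_ok = False
```

A consumer that receives only the second pair has no way to resolve its indices unless it also kept the earlier dictionary values.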



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6859) [CI][Nightly] Disable docker layer caching for CircleCI tasks

2019-10-12 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6859.

Resolution: Fixed

Issue resolved by pull request 5617
[https://github.com/apache/arrow/pull/5617]

> [CI][Nightly] Disable docker layer caching for CircleCI tasks
> -
>
> Key: ARROW-6859
> URL: https://issues.apache.org/jira/browse/ARROW-6859
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> CircleCI builds are failing because the layer caching is not available for 
> free plans.





[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-11 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949791#comment-16949791
 ] 

Neal Richardson commented on ARROW-6793:


You don't need devtools/remotes if you want to install the current version. 
Just install it from CRAN.

> [R] Arrow C++ binary packaging for Linux
> 
>
> Key: ARROW-6793
> URL: https://issues.apache.org/jira/browse/ARROW-6793
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> Our current installation experience on Linux isn't ideal. Unless you've 
> already installed the Arrow C++ library, when you install the R package, you 
> get a shell that tells you to install the C++ library. That was a useful 
> approach to allow us to get the package on CRAN, which makes it easy for 
> macOS and Windows users to install, but it doesn't improve the installation 
> experience for Linux users. This is an impediment to adoption of arrow not 
> only by users but also by package maintainers who might want to depend on 
> arrow. 
> macOS and Windows have a better experience because at installation time, the 
> configure scripts download and statically link a prebuilt C++ library. CRAN 
> bundles the whole thing up and delivers that as a binary R package. 
> Python wheels do a similar thing: they're binaries that contain all external 
> dependencies. And there are pyarrow wheels for Linux. This suggests that we 
> could do something similar for R: build a generic Linux binary of the C++ 
> library and download it in the R package configure script at install time.
> I experimented with using the Arrow C++ binaries included in the Python 
> wheels in R. See discussion at the end of ARROW-5956. This worked on macOS 
> (not useful for R, but it proved the concept) and almost worked on Linux, but 
> it turned out that the "manylinux2010" standard is too archaic to work with 
> contemporary Rcpp. 
> Proposal: do a similar workflow to what the manylinux2010 pyarrow build does, 
> just with slightly more modern compiler/settings. Publish that C++ binary 
> package to bintray. Then download it in the R configure script if a 
> local/system package isn't found.
> Once we have a basic version working, test against various distros on 
> [R-hub|https://builder.r-hub.io/advanced] to make sure we're solid everywhere 
> and/or ensure the current fallback behavior when we encounter a distro that 
> this doesn't work for. If necessary, we can make multiple flavors of this C++ 
> binary for debian, centos, etc.
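The lookup order in the proposal can be sketched as follows. This is plain Python for illustration only, not the R package's configure script; `choose_cpp_library`, its parameters, and the returned labels are hypothetical.

```python
def choose_cpp_library(system_lib_found, prebuilt_binary_available):
    """Pick where the Arrow C++ library comes from at install time.

    Hypothetical helper mirroring the proposal: prefer a local/system
    package, otherwise download a prebuilt binary, otherwise keep the
    current fallback (a shell package that tells the user to install
    the C++ library).
    """
    if system_lib_found:
        return "use-system-library"
    if prebuilt_binary_available:
        return "download-prebuilt-binary"
    return "fallback-shell-package"


strategies = [
    choose_cpp_library(True, True),
    choose_cpp_library(False, True),
    choose_cpp_library(False, False),
]
```

Keeping the fallback as the last branch means a distro with no working binary behaves exactly as it does today.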





[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-11 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949770#comment-16949770
 ] 

Neal Richardson commented on ARROW-6793:


The binaries available on the install page _are_ in sync with tagged versions 
on GitHub, but you seem to be installing the head of the master branch (what 
you get if you do install_github without specifying a tag). If you want to use 
the built binary libraries for an official release version of the C++ library, 
you need to use the corresponding R package. You can get that from CRAN; it 
isn't lagging. In the output you pasted above, you were installing from a CRAN 
snapshot, "https://mran.microsoft.com/snapshot/2019-09-19/". That's your lag.



[jira] [Resolved] (ARROW-5502) [R] file readers should mmap

2019-10-10 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-5502.

Resolution: Not A Problem

Closing this. All file readers now memory-map files by default, and if you want 
to turn that off, you can invoke the appropriate R6 classes directly.

> [R] file readers should mmap
> 
>
> Key: ARROW-5502
> URL: https://issues.apache.org/jira/browse/ARROW-5502
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> Arrow is supposed to let you work with datasets bigger than memory. Memory 
> mapping is a big part of that. It should be the default way that files are 
> read in the `read_*` functions. To disable memory mapping, we could use a 
> global `option()`, or a function argument, but that might clutter the 
> interface. Or we could not give a choice and only fall back to not memory 
> mapping if the platform/file system doesn't support it.
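As a language-neutral illustration of why memory mapping helps with files larger than memory, here is a small sketch using Python's standard `mmap` module. It does not use the arrow R package; the file contents and the helper name `read_slice` are made up for the example. Only the requested byte range is accessed, rather than reading the whole file into memory first.

```python
import mmap
import os
import tempfile


def read_slice(path, start, stop):
    """Return bytes [start, stop) of a file through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[start:stop]


# Write a small stand-in file; a real dataset would be far larger.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 1000)
    path = tmp.name

try:
    chunk = read_slice(path, 5000, 5010)
finally:
    os.remove(path)
```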





[jira] [Resolved] (ARROW-6833) [R][CI] Add crossbow job for full R autobrew macOS build

2019-10-10 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6833.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5613
[https://github.com/apache/arrow/pull/5613]

> [R][CI] Add crossbow job for full R autobrew macOS build
> 
>
> Key: ARROW-6833
> URL: https://issues.apache.org/jira/browse/ARROW-6833
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> I have a separate nightly job that runs this on multiple R versions, but it 
> would be nice to be able to have crossbow check this on a PR. As it turns 
> out, the ARROW_S3 feature doesn't work with autobrew in practice--aws-sdk-cpp 
> doesn't seem to ship static libs via Homebrew, so the autobrew packaging 
> doesn't work, even though the formula builds and {{brew audit}} is clean.





[jira] [Resolved] (ARROW-6831) [R] Update R macOS/Windows builds for change in cmake compression defaults

2019-10-10 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6831.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5612
[https://github.com/apache/arrow/pull/5612]

> [R] Update R macOS/Windows builds for change in cmake compression defaults
> --
>
> Key: ARROW-6831
> URL: https://issues.apache.org/jira/browse/ARROW-6831
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> ARROW-6631 changed the defaults for including compressions but did not update 
> these build scripts. 





[jira] [Resolved] (ARROW-6832) [R] Implement Codec::IsAvailable

2019-10-10 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6832.

Resolution: Fixed

Issue resolved by pull request 5615
[https://github.com/apache/arrow/pull/5615]

> [R] Implement Codec::IsAvailable
> 
>
> Key: ARROW-6832
> URL: https://issues.apache.org/jira/browse/ARROW-6832
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> New in ARROW-6631





[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-10 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948779#comment-16948779
 ] 

Neal Richardson commented on ARROW-6830:


{{nrow(tab)}} doesn't necessarily have any relationship to memory usage. If you 
want to see what memory the C++ library is using, you can look at 
{{default_memory_pool()$bytes_allocated()}}.

Re: "a better way so I can do multiple columns in a single pass", see the 
examples on [https://arrow.apache.org/docs/r/reference/RecordBatch.html] for 
using {{[}}.

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible to use *read_arrow* to filter out columns? (Similar to 
> how *read_feather* has a *col_select* argument.)
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually, i.e., like the following:
> {code:r}
> # Batch and column indices are 0-based, so valid batch indices run
> # from 0 to num_record_batches - 1.
> for (i in seq_len(data_rbfr$num_record_batches) - 1) {
>   rbn <- data_rbfr$get_batch(i)
>   dfn <- as.data.frame(rbn$column(5)$as_vector())
>   if (i == 0) {
>     merged <- dfn
>   } else {
>     merged <- rbind(merged, dfn)
>   }
>   print(paste(i, nrow(merged)))
> }
> {code}
>  
>  





[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-10 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948774#comment-16948774
 ] 

Neal Richardson commented on ARROW-6793:


You're welcome to use 
[https://github.com/apache/arrow/blob/master/r/Dockerfile]. Though for the 
record, this ticket is about something different.



[jira] [Commented] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-10 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948734#comment-16948734
 ] 

Neal Richardson commented on ARROW-6793:


> trying URL 
> 'https://mran.microsoft.com/snapshot/2019-09-19/src/contrib/arrow_0.14.1.1.tar.gz'

That's not the new version of arrow. For development purposes you should be 
installing from the git repository, not CRAN (and definitely not an old 
snapshot of CRAN).



[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-10 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948727#comment-16948727
 ] 

Neal Richardson commented on ARROW-6830:


[https://github.com/apache/arrow/blob/master/r/R/read-table.R] is pretty simple 
(and note that if you give it a string file name, it will invoke 
RecordBatchFileReader). There are no additional arguments that would let you 
push computation down to record batches contained within the file (though I 
thought we were talking about selecting columns). We are working on a [C++ 
Datasets 
API|https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?pli=1#heading=h.22aikbvt54fv]
 that will do that and much more. 

If you want to do some of that in R now, RecordBatchFileReader sounds like a 
reasonable place to start. It memory maps by default, and as you've seen you 
can iterate over the batches. You can filter each record batch separately 
(using {{[}} methods or lower level if you prefer) and collect them all into a 
data.frame.
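The iterate, filter, and collect pattern described above might look like the following. This is a plain-Python sketch with dicts of lists standing in for record batches; it makes no Arrow calls, and `select_columns` and the sample data are hypothetical. It gathers the per-batch pieces in a list and combines them once at the end, which avoids re-copying the accumulated result on every iteration.

```python
def select_columns(batches, wanted):
    """Keep only the `wanted` columns from each batch, then combine once."""
    pieces = []
    for batch in batches:  # each batch: dict of column name -> list of values
        pieces.append({name: batch[name] for name in wanted})

    # Single concatenation pass per column after the loop.
    combined = {name: [] for name in wanted}
    for piece in pieces:
        for name in wanted:
            combined[name].extend(piece[name])
    return combined


batches = [
    {"id": [1, 2], "price": [9.5, 3.0], "note": ["x", "y"]},
    {"id": [3], "price": [7.25], "note": ["z"]},
]
result = select_columns(batches, ["id", "price"])
```

Selecting several columns per batch in one pass also answers the "one column at a time" concern, since each batch is visited only once.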



[jira] [Assigned] (ARROW-6832) [R] Implement Codec::IsAvailable

2019-10-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6832:
--

Assignee: Neal Richardson

> [R] Implement Codec::IsAvailable
> 
>
> Key: ARROW-6832
> URL: https://issues.apache.org/jira/browse/ARROW-6832
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> New in ARROW-6631





[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-09 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948016#comment-16948016
 ] 

Neal Richardson commented on ARROW-6830:


That's the intent, though in this example I don't know what {{data_rbfr}} is. 
If it's a file, you can use {{mmap_open()}} to memory map it.



[jira] [Updated] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6830:
---
Summary: [R] Select Subset of Columns in read_arrow  (was: Question / 
Feature Request- Select Subset of Columns in read_arrow)

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible to use *read_arrow* to filter out columns? (Similar to 
> how *read_feather* has a *col_select* argument.)
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually, i.e., like the following:
> {code:r}
> # Batch and column indices are 0-based, so valid batch indices run
> # from 0 to num_record_batches - 1.
> for (i in seq_len(data_rbfr$num_record_batches) - 1) {
>   rbn <- data_rbfr$get_batch(i)
>   dfn <- as.data.frame(rbn$column(5)$as_vector())
>   if (i == 0) {
>     merged <- dfn
>   } else {
>     merged <- rbind(merged, dfn)
>   }
>   print(paste(i, nrow(merged)))
> }
> {code}
>  
>  





[jira] [Commented] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow

2019-10-09 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947927#comment-16947927
 ] 

Neal Richardson commented on ARROW-6830:


Looking at [https://github.com/apache/arrow/blob/master/r/R/read-table.R], 
{{read_arrow}} returns a data frame while {{read_table}} keeps the data in an 
Arrow Table. Tables have a {{$select()}} method (which is [how 
{{read_csv_arrow}} implements 
{{col_select}}|https://github.com/apache/arrow/blob/master/r/R/csv.R#L124]), 
and you can more naturally access that through the usual {{[}} method. So IIUC 
what you're trying to do, 
{code:r}
tab <- read_table(data_rbfr)
as.data.frame(tab[, 6])
{code}
and of course you could reference that column by name instead of position.

If you wanted to add {{col_select}} to {{read_arrow()}}, I'd recommend 
following the model of {{read_csv_arrow}}, which sounds pretty straightforward. 
Happy to review a pull request if you submit it.

> Question / Feature Request- Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible to use *read_arrow* to filter out columns? (Similar to 
> how *read_feather* has a *col_select* argument.)
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually, i.e., like the following:
> {code:r}
> # Batch and column indices are 0-based, so valid batch indices run
> # from 0 to num_record_batches - 1.
> for (i in seq_len(data_rbfr$num_record_batches) - 1) {
>   rbn <- data_rbfr$get_batch(i)
>   dfn <- as.data.frame(rbn$column(5)$as_vector())
>   if (i == 0) {
>     merged <- dfn
>   } else {
>     merged <- rbind(merged, dfn)
>   }
>   print(paste(i, nrow(merged)))
> }
> {code}
>  
>  





[jira] [Updated] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6830:
---
Component/s: (was: C++)



[jira] [Created] (ARROW-6833) [R][CI] Add crossbow job for full R autobrew macOS build

2019-10-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6833:
--

 Summary: [R][CI] Add crossbow job for full R autobrew macOS build
 Key: ARROW-6833
 URL: https://issues.apache.org/jira/browse/ARROW-6833
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Neal Richardson
Assignee: Neal Richardson


I have a separate nightly job that runs this on multiple R versions, but it 
would be nice to be able to have crossbow check this on a PR. As it turns out, 
the ARROW_S3 feature doesn't work with autobrew in practice--aws-sdk-cpp 
doesn't seem to ship static libs via Homebrew, so the autobrew packaging 
doesn't work, even though the formula builds and {{brew audit}} is clean.





[jira] [Created] (ARROW-6832) [R] Implement Codec::IsAvailable

2019-10-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6832:
--

 Summary: [R] Implement Codec::IsAvailable
 Key: ARROW-6832
 URL: https://issues.apache.org/jira/browse/ARROW-6832
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 1.0.0


New in ARROW-6631





[jira] [Created] (ARROW-6831) [R] Update R macOS/Windows builds for change in cmake compression defaults

2019-10-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6831:
--

 Summary: [R] Update R macOS/Windows builds for change in cmake 
compression defaults
 Key: ARROW-6831
 URL: https://issues.apache.org/jira/browse/ARROW-6831
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson


ARROW-6631 changed the defaults for including compressions but did not update 
these build scripts. 





[jira] [Commented] (ARROW-3750) [R] Pass various wrapped Arrow objects created in Python into R with zero copy via reticulate

2019-10-08 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947241#comment-16947241
 ] 

Neal Richardson commented on ARROW-3750:


See also ARROW-5956 for possibly relevant discussion

> [R] Pass various wrapped Arrow objects created in Python into R with zero 
> copy via reticulate
> -
>
> Key: ARROW-3750
> URL: https://issues.apache.org/jira/browse/ARROW-3750
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> A user may wish to use some functionality available only in pyarrow using 
> reticulate; it would be useful to be able to construct an R wrapper object to 
> the C++ object inside the corresponding Python type, e.g. {{pyarrow.Table}}. 
> This probably will require some new functions to return the memory address of 
> the shared_ptr/unique_ptr inside the Cython types so that a function on the R 
> side can copy the smart pointer and create the corresponding R wrapper type
> cc [~pitrou]





[jira] [Updated] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths

2019-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5916:
---
Component/s: C++

> [C++] Allow RecordBatch.length to be less than array lengths
> 
>
> Key: ARROW-5916
> URL: https://issues.apache.org/jira/browse/ARROW-5916
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: John Muehlhausen
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: test.arrow_ipc
>
>
> 0.13 ignored RecordBatch.length.  0.14 requires that RecordBatch.length and 
> array length be equal.  As per 
> [https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E]
>  , we discussed changing this so that RecordBatch.length can be [0,array 
> length].
>  If RecordBatch.length is less than array length, the reader should ignore 
> the portion of the array(s) beyond RecordBatch.length.  This will allow 
> partially populated batches to be read in scenarios identified in the above 
> discussion.
> {code:c++}
>   Status GetFieldMetadata(int field_index, ArrayData* out) {
> auto nodes = metadata_->nodes();
> // pop off a field
> if (field_index >= static_cast<int>(nodes->size())) {
>   return Status::Invalid("Ran out of field metadata, likely malformed");
> }
> const flatbuf::FieldNode* node = nodes->Get(field_index);
> *//out->length = node->length();*
> *out->length = metadata_->length();*
> out->null_count = node->null_count();
> out->offset = 0;
> return Status::OK();
>   }
> {code}
> Attached is a test IPC File containing a batch with length 1, array length 3.
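
The proposed reader behavior can be sketched roughly as follows: accept any batch length in [0, array length] and ignore array values past that length. This is a Python illustration only, with plain lists standing in for arrays; `visible_rows` is a hypothetical name and is not the C++ reader code.

```python
def visible_rows(batch_length, arrays):
    """Return each array cut down to the first batch_length values."""
    for values in arrays:
        # The batch length may be anything from 0 up to the array length.
        if not 0 <= batch_length <= len(values):
            raise ValueError(
                f"batch length {batch_length} outside [0, {len(values)}]"
            )
    # Everything beyond batch_length is ignored by the reader.
    return [values[:batch_length] for values in arrays]


# Same shape as the attached test file: batch length 1, array length 3.
arrays = [[10, 20, 30], ["x", "y", "z"]]
trimmed = visible_rows(1, arrays)
```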





[jira] [Updated] (ARROW-5489) [C++] Normalize kernels and ChunkedArray behavior

2019-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5489:
---
Component/s: C++

> [C++] Normalize kernels and ChunkedArray behavior
> -
>
> Key: ARROW-5489
> URL: https://issues.apache.org/jira/browse/ARROW-5489
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> Some kernels (the wrappers, e.g. Unique) support ChunkedArray inputs, and 
> some don't. We should normalize this usage.





[jira] [Updated] (ARROW-6103) [Java] Do we really want to use the maven release plugin?

2019-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6103:
---
Component/s: Java
 Developer Tools

> [Java] Do we really want to use the maven release plugin?
> -
>
> Key: ARROW-6103
> URL: https://issues.apache.org/jira/browse/ARROW-6103
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools, Java
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> For reference .. I'm filing this issue to track investigation work around 
> this ..
> {code:java}
> The biggest problem for the Git commit is our Java package
> requires "apache-arrow-${VERSION}" tag on
> https://github.com/apache/arrow . (Right?)
> I think that "mvn release:perform" in
> dev/release/01-perform.sh does so but I don't know the
> details of "mvn release:perform"...{code}





[jira] [Updated] (ARROW-6741) [Release] Update changelog.py to use APACHE_ prefixed JIRA_USERNAME and JIRA_PASSWORD environment variables

2019-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6741:
---
Component/s: Developer Tools

> [Release] Update changelog.py to use APACHE_ prefixed JIRA_USERNAME and 
> JIRA_PASSWORD environment variables
> ---
>
> Key: ARROW-6741
> URL: https://issues.apache.org/jira/browse/ARROW-6741
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Merge script has recently changed to use APACHE_JIRA_USERNAME and 
> APACHE_JIRA_PASSWORD, changelog.py should use the same environment variables.
> Also we could use crossbow.py changelog command which implements
> the same functionality.
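
A rough sketch of what reading the prefixed variables could look like is below. Only the two variable names come from the issue; the `jira_credentials` helper and its error handling are assumptions made for illustration and are not taken from changelog.py or the merge script.

```python
import os


def jira_credentials(environ=None):
    """Read the APACHE_-prefixed Jira credentials from the environment."""
    env = os.environ if environ is None else environ
    names = ("APACHE_JIRA_USERNAME", "APACHE_JIRA_PASSWORD")
    # Fail early with a clear message instead of sending empty credentials.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["APACHE_JIRA_USERNAME"], env["APACHE_JIRA_PASSWORD"]


# Hypothetical values, passed in explicitly so the sketch does not depend on
# the real environment.
example_env = {"APACHE_JIRA_USERNAME": "someone", "APACHE_JIRA_PASSWORD": "secret"}
username, password = jira_credentials(example_env)
```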





[jira] [Updated] (ARROW-6657) [Rust] [DataFusion] Implement COUNT aggregate expression

2019-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6657:
---
Component/s: Rust - DataFusion

> [Rust] [DataFusion] Implement COUNT aggregate expression
> 
>
> Key: ARROW-6657
> URL: https://issues.apache.org/jira/browse/ARROW-6657
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Implement COUNT aggregate expressions. See the SUM implementation for 
> inspiration.
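
For orientation, a COUNT aggregate can be thought of as an accumulator that is fed one batch of values at a time and counts the non-null ones, which is how SQL COUNT(expr) treats nulls. The Python sketch below only illustrates that idea; the class and method names are hypothetical and do not mirror the DataFusion (Rust) traits.

```python
class CountAccumulator:
    """Counts non-null values across any number of batches."""

    def __init__(self):
        self.count = 0

    def update(self, values):
        # None stands in for a null slot; nulls are not counted.
        self.count += sum(1 for v in values if v is not None)

    def result(self):
        return self.count


acc = CountAccumulator()
acc.update([1, None, 3])
acc.update([None, 5])
total = acc.result()
```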





[jira] [Updated] (ARROW-6658) [Rust] [DataFusion] Implement AVG aggregate expression

2019-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6658:
---
Component/s: Rust - DataFusion

> [Rust] [DataFusion] Implement AVG aggregate expression
> --
>
> Key: ARROW-6658
> URL: https://issues.apache.org/jira/browse/ARROW-6658
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Implement AVG aggregate expression. See COUNT and SUM for inspiration.
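
One way to see why COUNT and SUM are useful inspiration: AVG can be kept as a running sum plus a running count of non-null values, divided at the end. The Python sketch below illustrates that under stated assumptions; the names are hypothetical and the real implementation would be in Rust against the DataFusion aggregate interfaces.

```python
class AvgAccumulator:
    """Keeps a running sum and count of non-null values; divides at the end."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, values):
        for v in values:
            # None stands in for a null slot; nulls affect neither sum nor count.
            if v is not None:
                self.total += v
                self.count += 1

    def result(self):
        # With no non-null input there is nothing to average.
        return self.total / self.count if self.count else None


avg = AvgAccumulator()
avg.update([1, None, 3])
avg.update([None, 5])
mean = avg.result()
```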





[jira] [Commented] (ARROW-6819) arrow::read_parquet ignores as_data_frame when sparklyr package is attached

2019-10-08 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947079#comment-16947079
 ] 

Neal Richardson commented on ARROW-6819:


A {{tibble}} is a {{data.frame}}, just with additional attributes that affect 
how it prints. These additional attributes are added to the data.frame that 
{{arrow}} returns, but you only notice them when you're using the tibble 
package. When you load {{sparklyr}}, the tibble namespace is loaded, so the 
tibble print method is found and used.

This behavior is noted in the new package vignette: 
[https://github.com/apache/arrow/blob/master/r/vignettes/arrow.Rmd#L15]

{{as_data_frame}} governs whether you get an R data.frame/tibble or whether the 
data is kept in Arrow memory as an arrow::Table 
[https://github.com/apache/arrow/blob/master/r/R/parquet.R#L27-L28]
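
The print behavior described above can be mimicked with a small sketch: the object carries a class vector, and the first class with a registered print method wins. This is a Python analogy of R's S3 dispatch, not R code; `pick_print_method` is hypothetical, and which class name the tibble print method is registered under is an assumption made for illustration.

```python
def pick_print_method(class_vector, registered):
    """Return the first class in class_vector that has a registered print method."""
    for cls in class_vector:
        if cls in registered:
            return cls
    return "default"


# Class vector of a tibble; it includes "data.frame", so it is a data.frame.
result_classes = ["tbl_df", "tbl", "data.frame"]
is_data_frame = "data.frame" in result_classes

# Before the tibble namespace is loaded, only the data.frame method is known.
registered = {"data.frame"}
before = pick_print_method(result_classes, registered)

# Loading sparklyr loads tibble, which registers its own print method.
registered.add("tbl_df")
after = pick_print_method(result_classes, registered)
```

The object itself is identical in both cases; only the set of registered methods changes.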

> arrow::read_parquet ignores as_data_frame when sparklyr package is attached
> ---
>
> Key: ARROW-6819
> URL: https://issues.apache.org/jira/browse/ARROW-6819
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.15.0
> Environment: R version 3.6.1 (2019-07-05) on x86_64, darwin15.6.0 
> (Mac OS 10.13.4)
>Reporter: Ryan Patrick Kyle
>Priority: Major
>
> I am currently using v0.15.0 of the arrow package, installed from source 
> using CRAN. I also have v1.0.4 of the sparklyr package installed. While 
> attempting to read in Parquet data with both packages attached, the 
> read_parquet function appears to ignore the as_data_frame argument (which 
> defaults to TRUE).
> [https://github.com/apache/arrow/blob/3d55122c56a508894823a1b79bca71f519fdd52f/r/R/parquet.R#L35-L47]
> I am not certain, but I suspect the issue may be in the way 
> Table__to_dataframe coerces Arrow Table objects into tibbles, since this 
> statement appears also to produce a tibble (I expected a data.frame to be 
> returned):
> {{arrow:::Table__to_dataframe(tab, use_threads=FALSE)}}
>  
> A reproducible example follows.
>  
> {{# This does work as expected, returns data.frame}}
> {{library(arrow)}}
> {{temp <- tempfile()}}
>  
> {{download.file("https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet?raw=true",
>  temp)}}
> {{read_parquet(temp, as_data_frame=TRUE)}}
> {{# This does not work as expected, returns tibble}}
> {{library(sparklyr)}}
> {{read_parquet(temp, as_data_frame=TRUE)}}





[jira] [Closed] (ARROW-6819) arrow::read_parquet ignores as_data_frame when sparklyr package is attached

2019-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson closed ARROW-6819.
--
  Assignee: Neal Richardson
Resolution: Not A Problem

> arrow::read_parquet ignores as_data_frame when sparklyr package is attached
> ---
>
> Key: ARROW-6819
> URL: https://issues.apache.org/jira/browse/ARROW-6819
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.15.0
> Environment: R version 3.6.1 (2019-07-05) on x86_64, darwin15.6.0 
> (Mac OS 10.13.4)
>Reporter: Ryan Patrick Kyle
>Assignee: Neal Richardson
>Priority: Major
>
> I am currently using v0.15.0 of the arrow package, installed from source 
> using CRAN. I also have v1.0.4 of the sparklyr package installed. While 
> attempting to read in Parquet data with both packages attached, the 
> read_parquet function appears to ignore the as_data_frame argument (which 
> defaults to TRUE).
> [https://github.com/apache/arrow/blob/3d55122c56a508894823a1b79bca71f519fdd52f/r/R/parquet.R#L35-L47]
> I am not certain, but I suspect the issue may be in the way 
> Table__to_dataframe coerces Arrow Table objects into tibbles, since this 
> statement appears also to produce a tibble (I expected a data.frame to be 
> returned):
> {{arrow:::Table__to_dataframe(tab, use_threads=FALSE)}}
>  
> A reproducible example follows.
>  
> {{# This does work as expected, returns data.frame}}
> {{library(arrow)}}
> {{temp <- tempfile()}}
>  
> {{download.file("https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet?raw=true",
>  temp)}}
> {{read_parquet(temp, as_data_frame=TRUE)}}
> {{# This does not work as expected, returns tibble}}
> {{library(sparklyr)}}
> {{read_parquet(temp, as_data_frame=TRUE)}}





[jira] [Resolved] (ARROW-6811) [R] Assorted post-0.15 release cleanups

2019-10-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6811.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5601
[https://github.com/apache/arrow/pull/5601]

> [R] Assorted post-0.15 release cleanups
> ---
>
> Key: ARROW-6811
> URL: https://issues.apache.org/jira/browse/ARROW-6811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-6811) [R] Assorted post-0.15 release cleanups

2019-10-07 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6811:
--

 Summary: [R] Assorted post-0.15 release cleanups
 Key: ARROW-6811
 URL: https://issues.apache.org/jira/browse/ARROW-6811
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson








[jira] [Created] (ARROW-6810) [Website] Add docs for R package 0.15 release

2019-10-07 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6810:
--

 Summary: [Website] Add docs for R package 0.15 release
 Key: ARROW-6810
 URL: https://issues.apache.org/jira/browse/ARROW-6810
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Website
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0








[jira] [Commented] (ARROW-6766) [Python] libarrow_python..dylib does not exist

2019-10-04 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944823#comment-16944823
 ] 

Neal Richardson commented on ARROW-6766:


I don't use conda but I believe Uwe does so he's probably better positioned to 
advise.

> [Python] libarrow_python..dylib does not exist
> --
>
> Key: ARROW-6766
> URL: https://issues.apache.org/jira/browse/ARROW-6766
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.15.0
>Reporter: Tarek Allam
>Priority: Major
>
> After following the instructions found on the developer guides for Python, 
> I was able to build fine by using:
> {{# Assuming immediately prior one has run:}}
>  {{# $ git clone g...@github.com:apache/arrow.git}}
>  # $ conda create -y -n pyarrow-dev -c conda-forge 
>  #   --file arrow/ci/conda_env_unix.yml 
>  #   --file arrow/ci/conda_env_cpp.yml 
>  #   --file arrow/ci/conda_env_python.yml 
>  #   compilers 
>  {{#   python=3.7}}
>  {{# $ conda activate pyarrow-dev}}
>  {{# $ brew update && brew bundle --file=arrow/cpp/Brewfile}}
>  {{export ARROW_HOME=$(pwd)/arrow/dist}}
>  {{export LD_LIBRARY_PATH=$(pwd)/arrow/dist/lib:$LD_LIBRARY_PATH}}
>  {{export CC=`which clang`}}
>  {{export CXX=`which clang++`}}
>  {{mkdir arrow/cpp/build}}
>  {{pushd arrow/cpp/build}}
>      cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>      -DCMAKE_INSTALL_LIBDIR=lib \
>      -DARROW_FLIGHT=OFF \
>      -DARROW_GANDIVA=OFF \
>      -DARROW_ORC=ON \
>      -DARROW_PARQUET=ON \
>      -DARROW_PYTHON=ON \
>      -DARROW_PLASMA=ON \
>      -DARROW_BUILD_TESTS=ON \
>      ..
>  {{make -j4}}
>  {{make install}}
>  {{popd}}
> But when I run:
> {{pushd arrow/python}}
>  {{export PYARROW_WITH_FLIGHT=0}}
>  {{export PYARROW_WITH_GANDIVA=0}}
>  {{export PYARROW_WITH_ORC=1}}
>  {{export PYARROW_WITH_PARQUET=1}}
>  {{python setup.py build_ext --inplace}}
>  {{popd}}
> I get the following errors:
> {{-- Build output directory: 
> /Users/tallamjr/Github/arrow/python/build/temp.macosx-10.9-x86_64-3.7/release}}
>  {{-- Found the Arrow core library: 
> /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow.dylib}}
>  {{-- Found the Arrow Python library: 
> /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow_python.dylib}}
>  {{CMake Error: File 
> /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow..dylib does not 
> exist.}}
>  {{...}}
>  {{CMake Error: File 
> /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow..dylib does not exist.}}
>  {{CMake Error at CMakeLists.txt:230 (configure_file):}}
>  {{ configure_file Problem configuring file}}
>  {{Call Stack (most recent call first):}}
>  {{ CMakeLists.txt:315 (bundle_arrow_lib)}}
>  {{CMake Error: File 
> /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow_python..dylib does not 
> exist.}}
>  {{CMake Error at CMakeLists.txt:226 (configure_file):}}
>  {{ configure_file Problem configuring file}}
>  {{Call Stack (most recent call first):}}
>  {{ CMakeLists.txt:320 (bundle_arrow_lib)}}
>  {{CMake Error: File 
> /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow_python..dylib does not 
> exist.}}
>  {{CMake Error at CMakeLists.txt:230 (configure_file):}}
>  {{ configure_file Problem configuring file}}
>  {{Call Stack (most recent call first):}}
>  {{ CMakeLists.txt:320 (bundle_arrow_lib)}}
>  
> What is quite strange is that the libraries do seem to be there, but their
>  names have an additional component, as in `libarrow.15.dylib`, e.g.:
> {{$ ls -l libarrow_python.15.dylib && echo $PWD}}
>  {{lrwxr-xr-x 1 tallamjr staff 28 Oct 2 14:02 libarrow_python.15.dylib ->}}
>  {{libarrow_python.15.0.0.dylib}}
>  {{/Users/tallamjr/github/arrow/dist/lib}}
> I am not exactly sure what the issue is, but it appears that the version is
>  not captured in a variable that CMake uses. I have run the same setup on
>  `master` (`7d18c1c`) and on `apache-arrow-0.14.0` (`a591d76`), and both seem
>  to produce the same errors.
> Apologies if this is not quite the right format for JIRA issues or if this is
>  not the correct platform for this; I'm very new to the project and to
>  contributing to Apache in general. Thanks
>  
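
The doubled dot in `libarrow_python..dylib` is consistent with the reporter's guess that a version component is empty when the file name is assembled. The Python sketch below only illustrates that string-building effect; `bundled_library_name` is a hypothetical helper and is not how the pyarrow CMakeLists.txt is written, and the actual cause is not confirmed in this thread.

```python
def bundled_library_name(base, so_version):
    """Build a versioned macOS library file name from its parts."""
    return f"lib{base}.{so_version}.dylib"


# With the version filled in, the name matches the symlink shown by ls above.
with_version = bundled_library_name("arrow_python", "15")

# With an empty version, the two dots end up next to each other.
without_version = bundled_library_name("arrow_python", "")
```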





[jira] [Commented] (ARROW-6439) [R] Implement S3 file-system interface in R

2019-10-04 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944737#comment-16944737
 ] 

Neal Richardson commented on ARROW-6439:


This is ready to start now.

> [R] Implement S3 file-system interface in R
> ---
>
> Key: ARROW-6439
> URL: https://issues.apache.org/jira/browse/ARROW-6439
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6439) [R] Implement S3 file-system interface in R

2019-10-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6439:
--

Assignee: Romain Francois

> [R] Implement S3 file-system interface in R
> ---
>
> Key: ARROW-6439
> URL: https://issues.apache.org/jira/browse/ARROW-6439
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6437) [R] Add AWS SDK to system dependencies for macOS

2019-10-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6437:
---
Summary: [R] Add AWS SDK to system dependencies for macOS  (was: [R] Add 
AWS SDK to system dependencies for macOS and Windows)

> [R] Add AWS SDK to system dependencies for macOS
> 
>
> Key: ARROW-6437
> URL: https://issues.apache.org/jira/browse/ARROW-6437
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The Arrow C++ library now has an S3 filesystem implementation (ARROW-453), 
> and in order to take advantage of that from R, we need to add the 
> {{aws-sdk-cpp}} dependency to the macOS and Windows toolchains. 
> There is no PKGBUILD for this at https://github.com/msys2/MINGW-packages, but 
> https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=aws-sdk-cpp-git has 
> one. 
> For macOS, there already is a formula at 
> https://github.com/Homebrew/homebrew-core/blob/master/Formula/aws-sdk-cpp.rb, 
> maybe that's sufficient?
> Once that is in place, we can enable {{ARROW_S3=ON}} in cmake and build with 
> it 
> (https://github.com/apache/arrow/pull/5167/files#diff-b048bf4c1679dce1028fd897a7c43b93R177)
> cc [~jeroenooms]





[jira] [Created] (ARROW-6793) [R] Arrow C++ binary packaging for Linux

2019-10-04 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6793:
--

 Summary: [R] Arrow C++ binary packaging for Linux
 Key: ARROW-6793
 URL: https://issues.apache.org/jira/browse/ARROW-6793
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0


Our current installation experience on Linux isn't ideal. Unless you've already 
installed the Arrow C++ library, when you install the R package, you get a 
shell that tells you to install the C++ library. That was a useful approach to 
allow us to get the package on CRAN, which makes it easy for macOS and Windows 
users to install, but it doesn't improve the installation experience for Linux 
users. This is an impediment to adoption of arrow not only by users but also by 
package maintainers who might want to depend on arrow. 

macOS and Windows have a better experience because at installation time, the 
configure scripts download and statically link a prebuilt C++ library. CRAN 
bundles the whole thing up and delivers that as a binary R package. 

Python wheels do a similar thing: they're binaries that contain all external 
dependencies. And there are pyarrow wheels for Linux. This suggests that we 
could do something similar for R: build a generic Linux binary of the C++ 
library and download it in the R package configure script at install time.

I experimented with using the Arrow C++ binaries included in the Python wheels 
in R. See discussion at the end of ARROW-5956. This worked on macOS (not useful 
for R, but it proved the concept) and almost worked on Linux, but it turned out 
that the "manylinux2010" standard is too archaic to work with contemporary 
Rcpp. 

Proposal: do a similar workflow to what the manylinux2010 pyarrow build does, 
just with slightly more modern compiler/settings. Publish that C++ binary 
package to bintray. Then download it in the R configure script if a 
local/system package isn't found.

Once we have a basic version working, test against various distros on 
[R-hub|https://builder.r-hub.io/advanced] to make sure we're solid everywhere 
and/or ensure the current fallback behavior when we encounter a distro that 
this doesn't work for. If necessary, we can make multiple flavors of this C++ 
binary for debian, centos, etc.
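
The install-time decision proposed above could look roughly like the sketch below: prefer a local/system library, otherwise download a prebuilt binary when one exists for the distro, otherwise keep the current fallback. This is Python for illustration only (the real logic would live in the R package configure script); the function name, the return labels, and the distro set are all hypothetical.

```python
def choose_cpp_library(system_lib_found, distro, prebuilt_distros):
    """Decide where the Arrow C++ library comes from at install time."""
    if system_lib_found:
        # A local/system package always wins.
        return "use-system"
    if distro in prebuilt_distros:
        # A prebuilt binary is published for this distro, so download it.
        return "download-prebuilt"
    # Unknown distro: keep today's behavior (shell package plus instructions).
    return "fallback-shell"


# Hypothetical set of distros for which a binary has been published.
prebuilt = {"ubuntu", "debian", "centos"}
decision = choose_cpp_library(False, "debian", prebuilt)
```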





[jira] [Closed] (ARROW-6596) [R] Getting "Cannot call io___MemoryMappedFile__Open()" error while reading a parquet file

2019-10-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson closed ARROW-6596.
--
  Assignee: Wes McKinney
Resolution: Information Provided

> [R] Getting "Cannot call io___MemoryMappedFile__Open()" error while reading a 
> parquet file
> --
>
> Key: ARROW-6596
> URL: https://issues.apache.org/jira/browse/ARROW-6596
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.14.1
> Environment: ubuntu 18.04
>Reporter: Addhyan
>Assignee: Wes McKinney
>Priority: Major
>  Labels: Docker, R, arrow, parquet
>
> I am using r/Dockerfile to get all the R dependencies and following it 
> backwards to get arrow/r to work on Linux (either Ubuntu or Debian), but it 
> continuously gives me this error:
> Error in io___MemoryMappedFile__Open(fs::path_abs(path), mode) : 
>   Cannot call io___MemoryMappedFile__Open()
> I have installed all the required cpp libraries as mentioned here: 
> [https://arrow.apache.org/install/] under "Ubuntu 18.04 LTS or later".  I 
> have also tried to use 
> [cpp/Dockerfile|https://github.com/apache/arrow/blob/master/cpp/Dockerfile] 
> and then followed backwards without any luck. The error is consistent and 
> doesn't go away. 
> I am trying to build a docker image with dockerfile containing everything 
> that arrow needs, all the cpp libraries etc. 





[jira] [Updated] (ARROW-6681) [C#] Record Batches in reverse order?

2019-10-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6681:
---
Summary: [C#] Record Batches in reverse order?  (was: [C# -> R] - Record 
Batches in reverse order?)

> [C#] Record Batches in reverse order?
> -
>
> Key: ARROW-6681
> URL: https://issues.apache.org/jira/browse/ARROW-6681
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Are 'RecordBatches' in C# being written in reverse order?
> I made a simple test which creates a single row per record batch of 0 to 99 
> and attempted to read this in R. To my surprise batch(0) in R had the value 
> 99, not 0.
> This may not seem like a big deal, however when dealing with 'huge' files, 
> it's more efficient to use Record Batches / index lookup than attempting to 
> load the entire file into memory.
> Having the order consistent across the different languages / APIs only 
> makes sense - for now I can work around this by reversing the order before 
> writing.
>  
> https://github.com/apache/arrow/issues/5475
>  
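
The reported behavior and the workaround can be modeled in a few lines. In this Python sketch, `reversing_writer` is a hypothetical stand-in for a writer that stores batches in reverse order; it is not the C# writer, and the sketch makes no claim about why the order is reversed.

```python
def reversing_writer(batches):
    """Stand-in for a writer that stores batches in reverse order."""
    return list(reversed(batches))


# One single-row batch per value 0..99, as in the test described above.
batches = [[i] for i in range(100)]

# Reported behavior: the first stored batch holds 99 instead of 0.
stored = reversing_writer(batches)

# Workaround: reverse before writing so the stored order comes out right.
stored_with_workaround = reversing_writer(list(reversed(batches)))
```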





[jira] [Updated] (ARROW-6681) [C#] Record Batches in reverse order?

2019-10-04 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6681:
---
Component/s: (was: R)

> [C#] Record Batches in reverse order?
> -
>
> Key: ARROW-6681
> URL: https://issues.apache.org/jira/browse/ARROW-6681
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Are 'RecordBatches' in C# being written in reverse order?
> I made a simple test which creates a single row per record batch of 0 to 99 
> and attempted to read this in R. To my surprise batch(0) in R had the value 
> 99, not 0.
> This may not seem like a big deal, however when dealing with 'huge' files, 
> it's more efficient to use Record Batches / index lookup than attempting to 
> load the entire file into memory.
> Having the order consistent across the different languages / APIs only 
> makes sense - for now I can work around this by reversing the order before 
> writing.
>  
> https://github.com/apache/arrow/issues/5475
>  





[jira] [Commented] (ARROW-6437) [R] Add AWS SDK to system dependencies for macOS and Windows

2019-10-04 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944640#comment-16944640
 ] 

Neal Richardson commented on ARROW-6437:


Following up: it appears that aws-sdk-cpp doesn't work on mingw. See 
[https://github.com/r-windows/rtools-packages/pull/37]

So S3 support will have to be flagged based on whether the C++ library is built 
with ARROW_S3=ON, and that won't be on for Windows.

I'll try to update the homebrew/autobrew formula though so we can have macOS 
support.

> [R] Add AWS SDK to system dependencies for macOS and Windows
> 
>
> Key: ARROW-6437
> URL: https://issues.apache.org/jira/browse/ARROW-6437
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> The Arrow C++ library now has an S3 filesystem implementation (ARROW-453), 
> and in order to take advantage of that from R, we need to add the 
> {{aws-sdk-cpp}} dependency to the macOS and Windows toolchains. 
> There is no PKGBUILD for this at https://github.com/msys2/MINGW-packages, but 
> https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=aws-sdk-cpp-git has 
> one. 
> For macOS, there already is a formula at 
> https://github.com/Homebrew/homebrew-core/blob/master/Formula/aws-sdk-cpp.rb, 
> maybe that's sufficient?
> Once that is in place, we can enable {{ARROW_S3=ON}} in cmake and build with 
> it 
> (https://github.com/apache/arrow/pull/5167/files#diff-b048bf4c1679dce1028fd897a7c43b93R177)
> cc [~jeroenooms]





[jira] [Created] (ARROW-6792) [R] Explore roxygen2 R6 class documentation

2019-10-04 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6792:
--

 Summary: [R] Explore roxygen2 R6 class documentation
 Key: ARROW-6792
 URL: https://issues.apache.org/jira/browse/ARROW-6792
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 1.0.0


roxygen2 version 7.0 adds support for documenting R6 classes, rather than the 
ad hoc approach we've had to take without it: 
[https://github.com/r-lib/roxygen2/blob/master/vignettes/rd.Rmd#L203]

Try it out and see how we like it, and consider refactoring the docs to use it 
everywhere.





[jira] [Commented] (ARROW-6791) Memory Leak

2019-10-04 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944616#comment-16944616
 ] 

Neal Richardson commented on ARROW-6791:


Duplicate of ARROW-5086?

> Memory Leak 
> 
>
> Key: ARROW-6791
> URL: https://issues.apache.org/jira/browse/ARROW-6791
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.14.1
> Environment: Ubuntu 18.04, 32GB ram, conda-forge installation
>Reporter: George Prichard
>Priority: Major
>
> Memory leak with large string columns crashes the program. This only seems to 
> affect 0.14.x  - it works fine for me in 0.13.0. It might be related to 
> earlier similar issues? e.g. [https://github.com/apache/arrow/issues/2624]
> Below is a reprex which works in earlier versions, but crashes on read 
> (writing is fine) in this one. The real-life version of the data is full of 
> URLs as the strings. 
> Weirdly it crashes my 32GB Ubuntu 18.04, but runs (if very slowly for the 
> read) on my 16GB Macbook. 
> Thanks so much for the excellent tools! 
>  
>  
> {code:java}
> import pandas as pd
> n_rows = int(1e6)
> n_cols = 10
> col_length = 100
> df = pd.DataFrame()
> for i in range(n_cols):
>     df[f'col_{i}'] = pd.util.testing.rands_array(col_length, n_rows)
> print('Generated df', df.shape)
> filename = 'tmp.parquet'
> print('Writing parquet')
> df.to_parquet(filename)
> print('Reading parquet')
> pd.read_parquet(filename)
> {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3808) [R] Implement [.arrow::Array

2019-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-3808:
---
Fix Version/s: (was: 0.15.0)
   1.0.0

> [R] Implement [.arrow::Array
> 
>
> Key: ARROW-3808
> URL: https://issues.apache.org/jira/browse/ARROW-3808
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Romain Francois
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-3808) [R] Implement [.arrow::Array

2019-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-3808.

Fix Version/s: (was: 1.0.0)
   0.15.0
   Resolution: Fixed

Issue resolved by pull request 5531
[https://github.com/apache/arrow/pull/5531]

> [R] Implement [.arrow::Array
> 
>
> Key: ARROW-3808
> URL: https://issues.apache.org/jira/browse/ARROW-3808
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Romain Francois
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1900) [C++] Add kernel functions for determining value range (maximum and minimum) of integer arrays

2019-10-03 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943794#comment-16943794
 ] 

Neal Richardson commented on ARROW-1900:


This would have been helpful in ARROW-3808, not as an optimization but because 
I literally wanted the min and max of an integer array.

> [C++] Add kernel functions for determining value range (maximum and minimum) 
> of integer arrays
> --
>
> Key: ARROW-1900
> URL: https://issues.apache.org/jira/browse/ARROW-1900
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
> Fix For: 1.0.0
>
>
> These functions can be useful internally for determining when a "small range" 
> alternative to a hash table can be used for integer arrays. The maximum and 
> minimum is determined in a single scan.
> We already have infrastructure for aggregate kernels, so this would be an 
> easy addition.
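
To make the "small range" idea concrete, here is a minimal Python sketch of a single-scan min/max and the decision it could feed. It is only an illustration, not the proposed C++ kernel: the function names and the `max_range` threshold are hypothetical, and nulls are modeled as `None`.

```python
def min_max(values):
    """Return (min, max) of the non-null ints in `values` using one pass.

    Returns (None, None) when there are no non-null values.
    """
    lo = hi = None
    for v in values:
        if v is None:          # skip nulls
            continue
        if lo is None:         # first non-null value seen
            lo = hi = v
        elif v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi


def use_small_range(values, max_range=1 << 16):
    """Decide whether a dense lookup table could replace a hash table.

    `max_range` is an assumed, illustrative threshold.
    """
    lo, hi = min_max(values)
    if lo is None:
        return False
    return hi - lo + 1 <= max_range
```

When the range is small, each value `v` can be mapped to slot `v - lo` of a dense array, which avoids hashing altogether.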



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6784) [C++][R] Move filter, take, select C++ code from Rcpp to C++ library

2019-10-03 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6784:
--

 Summary: [C++][R] Move filter, take, select C++ code from Rcpp to 
C++ library
 Key: ARROW-6784
 URL: https://issues.apache.org/jira/browse/ARROW-6784
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson
 Fix For: 1.0.0


Followup to ARROW-3808 and some other previous work. Of particular interest:
 * Filter and Take methods for ChunkedArray, in r/src/compute.cpp
 * Methods for that and some other things that apply Array and ChunkedArray 
methods across the columns of a RecordBatch or Table, respectively
 * RecordBatch__select and Table__select to take columns
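
For readers unfamiliar with why Take on a ChunkedArray needs its own method, a rough Python sketch follows. It is not the code in r/src/compute.cpp; `chunked_take` is a hypothetical name and the chunks are modeled as plain lists. The point is only that each global index has to be resolved to a (chunk, local offset) pair.

```python
from bisect import bisect_right
from itertools import accumulate


def chunked_take(chunks, indices):
    """Gather elements at global `indices` from a list of chunks."""
    lengths = [len(c) for c in chunks]
    # ends[k] is the global index one past the last element of chunk k
    ends = list(accumulate(lengths))
    total = ends[-1] if ends else 0
    out = []
    for i in indices:
        if i < 0 or i >= total:
            raise IndexError("index {} out of bounds for length {}".format(i, total))
        k = bisect_right(ends, i)      # first chunk whose end is greater than i
        start = ends[k] - lengths[k]   # global index of that chunk's first element
        out.append(chunks[k][i - start])
    return out
```

Empty chunks are skipped naturally because their end offset equals the previous chunk's end, so `bisect_right` never lands on them.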



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6773) [C++] Filter kernel returns invalid data when filtering with an Array slice

2019-10-02 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6773:
--

Assignee: Neal Richardson  (was: Ben Kietzman)

> [C++] Filter kernel returns invalid data when filtering with an Array slice
> ---
>
> Key: ARROW-6773
> URL: https://issues.apache.org/jira/browse/ARROW-6773
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> See ARROW-3808. This failing test reproduces the issue:
> {code:java}
> --- a/cpp/src/arrow/compute/kernels/filter_test.cc
> +++ b/cpp/src/arrow/compute/kernels/filter_test.cc
> @@ -151,6 +151,12 @@ TYPED_TEST(TestFilterKernelWithNumeric, FilterNumeric) {
>this->AssertFilter("[7, 8, 9]", "[null, 1, 0]", "[null, 8]");
>this->AssertFilter("[7, 8, 9]", "[1, null, 1]", "[7, null, 9]");
>  
> +  this->AssertFilterArrays(
> +ArrayFromJSON(this->type_singleton(), "[7, 8, 9]"),
> +ArrayFromJSON(boolean(), "[0, 1, 1, 1, 0, 1]")->Slice(3, 3),
> +ArrayFromJSON(this->type_singleton(), "[7, 9]")
> +  );
> +
> {code}
> {code:java}
> arrow/cpp/src/arrow/testing/gtest_util.cc:82: Failure
> Failed
> @@ -2, +2 @@
> +0
> [  FAILED  ] TestFilterKernelWithNumeric/9.FilterNumeric, where TypeParam = 
> arrow::DoubleType (0 ms)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6773) [C++] Filter kernel returns invalid data when filtering with an Array slice

2019-10-02 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6773:
--

 Summary: [C++] Filter kernel returns invalid data when filtering 
with an Array slice
 Key: ARROW-6773
 URL: https://issues.apache.org/jira/browse/ARROW-6773
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Neal Richardson
Assignee: Ben Kietzman
 Fix For: 1.0.0


See ARROW-3808. This failing test reproduces the issue:
{code:java}
--- a/cpp/src/arrow/compute/kernels/filter_test.cc
+++ b/cpp/src/arrow/compute/kernels/filter_test.cc
@@ -151,6 +151,12 @@ TYPED_TEST(TestFilterKernelWithNumeric, FilterNumeric) {
   this->AssertFilter("[7, 8, 9]", "[null, 1, 0]", "[null, 8]");
   this->AssertFilter("[7, 8, 9]", "[1, null, 1]", "[7, null, 9]");
 
+  this->AssertFilterArrays(
+ArrayFromJSON(this->type_singleton(), "[7, 8, 9]"),
+ArrayFromJSON(boolean(), "[0, 1, 1, 1, 0, 1]")->Slice(3, 3),
+ArrayFromJSON(this->type_singleton(), "[7, 9]")
+  );
+
{code}
{code:java}
arrow/cpp/src/arrow/testing/gtest_util.cc:82: Failure
Failed

@@ -2, +2 @@
+0

[  FAILED  ] TestFilterKernelWithNumeric/9.FilterNumeric, where TypeParam = 
arrow::DoubleType (0 ms)
{code}
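
As an illustration of what the test above expects, here is a small Python model of filtering with a sliced mask. It assumes a slice is a view (shared buffer plus offset and length); the `BoolSlice` class and `filter_values` function are hypothetical, and the issue does not state whether a dropped offset is the actual cause of the failure.

```python
class BoolSlice:
    """A view over a shared list of True/False/None with an offset and length."""

    def __init__(self, buffer, offset=0, length=None):
        self.buffer = buffer
        self.offset = offset
        self.length = len(buffer) - offset if length is None else length

    def slice(self, offset, length):
        # Slicing only adjusts the view; the underlying buffer is shared.
        return BoolSlice(self.buffer, self.offset + offset, length)

    def __getitem__(self, i):
        # Logical index i lives at physical position offset + i.
        return self.buffer[self.offset + i]


def filter_values(values, mask):
    """Keep values where the mask is True; a null mask slot yields a null."""
    if len(values) != mask.length:
        raise ValueError("values and mask must have the same length")
    out = []
    for i, v in enumerate(values):
        keep = mask[i]
        if keep is None:
            out.append(None)
        elif keep:
            out.append(v)
    return out
```

With the mask `[0, 1, 1, 1, 0, 1]` sliced at `(3, 3)`, the logical mask is `[1, 0, 1]`, so `[7, 8, 9]` should filter to `[7, 9]`. A reader that ignored the offset would see `[0, 1, 1]` instead and return `[8, 9]`.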
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6714) [R] Fix untested RecordBatchWriter case

2019-09-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6714.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5518
[https://github.com/apache/arrow/pull/5518]

> [R] Fix untested RecordBatchWriter case
> ---
>
> Key: ARROW-6714
> URL: https://issues.apache.org/jira/browse/ARROW-6714
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Passing a data.frame to RecordBatchWriter$write() would trigger a segfault



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6701) [C++][R] Lint failing on R cpp code

2019-09-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6701.

Resolution: Fixed

Issue resolved by pull request 5514
[https://github.com/apache/arrow/pull/5514]

> [C++][R] Lint failing on R cpp code
> ---
>
> Key: ARROW-6701
> URL: https://issues.apache.org/jira/browse/ARROW-6701
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration, R
>Reporter: Micah Kornfield
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> [See as an example 
> https://travis-ci.org/apache/arrow/jobs/589772132#L695|https://travis-ci.org/apache/arrow/jobs/589772132#L695]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails

2019-09-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6429:
---
Fix Version/s: (was: 0.15.0)
   1.0.0

> [CI][Crossbow] Nightly spark integration job fails
> --
>
> Key: ARROW-6429
> URL: https://issues.apache.org/jira/browse/ARROW-6429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: nightly, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and 
> create followup Jira to unskip, or delete job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6532) [R] Write parquet files with compression

2019-09-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6532:
---
Fix Version/s: (was: 0.15.0)
   1.0.0

> [R] Write parquet files with compression
> 
>
> Key: ARROW-6532
> URL: https://issues.apache.org/jira/browse/ARROW-6532
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Followup to ARROW-6360. See ARROW-6216 for the C++ side. `write_parquet()` 
> should be able to write compressed files, including with a specified 
> compression level.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6716) [CI] [Rust] New 1.40.0 nightly causing builds to fail

2019-09-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6716:
---
Fix Version/s: (was: 0.15.0)
   1.0.0

> [CI] [Rust] New 1.40.0 nightly causing builds to fail
> -
>
> Key: ARROW-6716
> URL: https://issues.apache.org/jira/browse/ARROW-6716
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: CI, Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> So much for pinning the nightly version ... that doesn't work when there is a 
> new major version of a nightly apparently.
> Travis is now using:
> {code:java}
> rustc 1.40.0-nightly (37538aa13 2019-09-25) {code}
> Despite rust-toolchain containing:
> {code:java}
> nightly-2019-07-30 {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6606) [C++] Construct tree structure from std::vector

2019-09-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6606:
---
Fix Version/s: (was: 0.15.0)
   1.0.0

> [C++] Construct tree structure from std::vector
> --
>
> Key: ARROW-6606
> URL: https://issues.apache.org/jira/browse/ARROW-6606
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> This will be used by FileSystemDataSource for pushdown predicate pruning of 
> branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6532) [R] Write parquet files with compression

2019-09-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6532.

Fix Version/s: (was: 1.0.0)
   0.15.0
   Resolution: Fixed

Issue resolved by pull request 5451
[https://github.com/apache/arrow/pull/5451]

> [R] Write parquet files with compression
> 
>
> Key: ARROW-6532
> URL: https://issues.apache.org/jira/browse/ARROW-6532
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Followup to ARROW-6360. See ARROW-6216 for the C++ side. `write_parquet()` 
> should be able to write compressed files, including with a specified 
> compression level.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-3808) [R] Implement [.arrow::Array

2019-09-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-3808:
--

Assignee: Neal Richardson

> [R] Implement [.arrow::Array
> 
>
> Key: ARROW-3808
> URL: https://issues.apache.org/jira/browse/ARROW-3808
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Romain Francois
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6682) [C#] Arrow R/C++ hangs reading binary file generated by C#

2019-09-26 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6682:
---
Component/s: (was: R)
 (was: C++)
 C#

> [C#] Arrow R/C++ hangs reading binary file generated by C#
> --
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Assignee: Eric Erhardt
>Priority: Major
>  Labels: pull-request-available
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip, arrow.benchmark.r, 
> script.runner.ps1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6682) [C#] Arrow R/C++ hangs reading binary file generated by C#

2019-09-26 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6682:
---
Summary: [C#] Arrow R/C++ hangs reading binary file generated by C#  (was: 
[C++][R] Arrow Hangs reading binary file generated by C#)

> [C#] Arrow R/C++ hangs reading binary file generated by C#
> --
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Assignee: Eric Erhardt
>Priority: Major
>  Labels: pull-request-available
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip, arrow.benchmark.r, 
> script.runner.ps1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6714) [R] Fix untested RecordBatchWriter case

2019-09-26 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6714:
--

 Summary: [R] Fix untested RecordBatchWriter case
 Key: ARROW-6714
 URL: https://issues.apache.org/jira/browse/ARROW-6714
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson


Passing a data.frame to RecordBatchWriter$write() would trigger a segfault



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6701) [C++][R] Lint failing on R cpp code

2019-09-26 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6701:
--

Assignee: Neal Richardson

> [C++][R] Lint failing on R cpp code
> ---
>
> Key: ARROW-6701
> URL: https://issues.apache.org/jira/browse/ARROW-6701
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration, R
>Reporter: Micah Kornfield
>Assignee: Neal Richardson
>Priority: Blocker
> Fix For: 1.0.0
>
>
> [See as an example 
> https://travis-ci.org/apache/arrow/jobs/589772132#L695|https://travis-ci.org/apache/arrow/jobs/589772132#L695]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6701) [C++][R] Lint failing on R cpp code

2019-09-26 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938719#comment-16938719
 ] 

Neal Richardson commented on ARROW-6701:


¯\_(ツ)_/¯ I'll try to reproduce and fix locally

> [C++][R] Lint failing on R cpp code
> ---
>
> Key: ARROW-6701
> URL: https://issues.apache.org/jira/browse/ARROW-6701
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration, R
>Reporter: Micah Kornfield
>Priority: Blocker
> Fix For: 1.0.0
>
>
> [See as an example 
> https://travis-ci.org/apache/arrow/jobs/589772132#L695|https://travis-ci.org/apache/arrow/jobs/589772132#L695]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6701) [C++][R] Lint failing on R cpp code

2019-09-26 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938715#comment-16938715
 ] 

Neal Richardson commented on ARROW-6701:


This seems to be passing on master now; was it already fixed? 
[https://travis-ci.org/apache/arrow/jobs/589914595]

> [C++][R] Lint failing on R cpp code
> ---
>
> Key: ARROW-6701
> URL: https://issues.apache.org/jira/browse/ARROW-6701
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration, R
>Reporter: Micah Kornfield
>Priority: Blocker
> Fix For: 1.0.0
>
>
> [See as an example 
> https://travis-ci.org/apache/arrow/jobs/589772132#L695|https://travis-ci.org/apache/arrow/jobs/589772132#L695]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6706) [Developer Tools] Cannot merge PRs from authors with "Á" (U+00C1) in their name

2019-09-26 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938691#comment-16938691
 ] 

Neal Richardson commented on ARROW-6706:


Can we fail explicitly if someone uses Python 2 and tell them to use Python 3?
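
A minimal sketch of such a guard, assuming it would run at the top of the script before any other work. The function name and message wording are hypothetical and not taken from dev/merge_arrow_pr.py; the code avoids f-strings so that it still parses under Python 2.

```python
import sys


def require_python3(version_info=sys.version_info):
    """Exit with a clear message when run under Python 2."""
    if version_info[0] < 3:
        raise SystemExit(
            "This script requires Python 3; you are running Python {}.{}".format(
                version_info[0], version_info[1]
            )
        )


require_python3()
```

Failing early like this would surface the real problem instead of a UnicodeEncodeError partway through a merge.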

> [Developer Tools] Cannot merge PRs from authors with "Á" (U+00C1) in their 
> name
> ---
>
> Key: ARROW-6706
> URL: https://issues.apache.org/jira/browse/ARROW-6706
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> I tried merging a PR from Ádám Lippai ([https://github.com/alippai]) and the 
> merge script failed with:
>  
> {code:java}
> ./dev/merge_arrow_pr.py 
> ARROW_HOME = /home/andy/git/andygrove/arrow/dev
> PROJECT_NAME = arrow
> Which pull request would you like to merge? (e.g. 34): 5499
> Env APACHE_JIRA_USERNAME not set, please enter your JIRA username:andygrove
> Env APACHE_JIRA_PASSWORD not set, please enter your JIRA password:*
> === Pull Request #5499 ===
> title     ARROW-6705: [Rust] [DataFusion] README has invalid github URL
> source    alippai/patch-1
> target    master
> url       https://api.github.com/repos/apache/arrow/pulls/5499
> === JIRA ARROW-6705 ===
> Summary     [Rust] [DataFusion] README has invalid github URL
> Assignee    NOT ASSIGNED!!!
> Components  Rust
> Status      Open
> URL         https://issues.apache.org/jira/browse/ARROW-6705
> Proceed with merging pull request #5499? (y/n): y
> Switched to branch 'PR_TOOL_MERGE_PR_5499_MASTER'
> Automatic merge went well; stopped before committing as requested
> Traceback (most recent call last):
>   File "./dev/merge_arrow_pr.py", line 571, in <module>
>     cli()
>   File "./dev/merge_arrow_pr.py", line 556, in cli
>     pr.merge()
>   File "./dev/merge_arrow_pr.py", line 354, in merge
>     print("Author {}: {}".format(i + 1, author))
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 
> 0: ordinal not in range(128)
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938182#comment-16938182
 ] 

Neal Richardson commented on ARROW-6682:


You can try {{options(arrow.use_threads=FALSE)}}

cf. 
[https://github.com/apache/arrow/blob/a89c803ad86ad7e22890fe3b8314bb41eec7d5af/r/R/arrow-package.R#L42-L44]

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Assignee: Eric Erhardt
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip, arrow.benchmark.r, 
> script.runner.ps1
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937939#comment-16937939
 ] 

Neal Richardson commented on ARROW-6682:


{{sessionInfo()}} will show the relevant versions. The nightly package build 
includes the date in the version string.

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937856#comment-16937856
 ] 

Neal Richardson commented on ARROW-6682:


I'm 5 for 5 so far.

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937852#comment-16937852
 ] 

Neal Richardson commented on ARROW-6682:


Works on my (virtual) machine:

{code:r}
> system.time(tab <- 
> read_arrow("Generated_4000Batch_50Columns_100Rows_PerBatch.arrow"))
   user  system elapsed 
   1.10    3.96    8.69 
> dim(tab)
[1] 40 50
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252  
 
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C 
 
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

other attached packages:
[1] arrow_0.14.1.9000

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 bit_1.1-14   compiler_3.6.0   magrittr_1.5 
assertthat_0.2.1 R6_2.4.0
 [7] tools_3.6.0  glue_1.3.1   Rcpp_1.0.1   bit64_0.9-7  
rlang_0.3.4  purrr_0.3.2 
> 
{code}

Would you mind installing a nightly build of the R package and see if whatever 
you're seeing has been resolved in the master branch?

{code:r}
install.packages("arrow", repos="https://dl.bintray.com/ursalabs/arrow-r")
{code}

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)

2019-09-25 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937827#comment-16937827
 ] 

Neal Richardson commented on ARROW-6682:


Hi, would you mind sharing the code you're running that's hanging, as well as 
additional system information?

All source code is available at https://github.com/apache/arrow. 

> Arrow Hangs on Large # of Columns (30+)
> ---
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Blocker
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)

2019-09-25 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6682:
---
Priority: Major  (was: Blocker)

> Arrow Hangs on Large # of Columns (30+)
> ---
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6622) [C++][R] SubTreeFileSystem path error on Windows

2019-09-25 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6622.

Fix Version/s: (was: 1.0.0)
   0.15.0
   Resolution: Fixed

Issue resolved by pull request 5445
[https://github.com/apache/arrow/pull/5445]

> [C++][R] SubTreeFileSystem path error on Windows
> 
>
> Key: ARROW-6622
> URL: https://issues.apache.org/jira/browse/ARROW-6622
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: filesystem, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> On ARROW-6438, we got this error on Windows testing out the subtree:
> {code}
> > test_check("arrow")
>   -- 1. Error: SubTreeFilesystem (@test-filesystem.R#86)  
> 
>   Unknown error: Underlying filesystem returned path 
> 'C:/Users/appveyor/AppData/Local/Temp/1/RtmpqWFbxi/working_dir/Rtmp2Dfa6d/file2904934312d/DESCRIPTION',
>  which is not a subpath of 
> 'C:/Users/appveyor/AppData/Local/Temp/1\RtmpqWFbxi/working_dir\Rtmp2Dfa6d\file2904934312d/'
>   1: st_fs$GetTargetStats(c("DESCRIPTION", "test", "nope", "DESC.txt")) at 
> testthat/test-filesystem.R:86
>   2: map(fs___FileSystem__GetTargetStats_Paths(self, x), shared_ptr, class = 
> FileStats)
>   3: fs___FileSystem__GetTargetStats_Paths(self, x)
>   
>   == testthat results  
> ===
>   [ OK: 992 | SKIPPED: 2 | WARNINGS: 0 | FAILED: 1 ]
> {code}
> Notice the mixture of forward slashes and backslashes in the paths so that 
> they don't match up. 
> I'm not sure which layer is doing the wrong thing.
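
The failure in the log above comes down to a prefix comparison between two spellings of the same Windows path. The following is a minimal sketch of that comparison, not Arrow's actual implementation: the helper names `NormalizeSeparators` and `IsSubpath` are hypothetical, and it assumes that converting every backslash to a forward slash is an acceptable normalization on Windows.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: replace every backslash with a forward slash.
std::string NormalizeSeparators(std::string path) {
  std::replace(path.begin(), path.end(), '\\', '/');
  return path;
}

// Hypothetical helper: true if `path` starts with `base`.
// This is a plain string prefix check, so it is sensitive to separator style.
bool IsSubpath(const std::string& base, const std::string& path) {
  return path.size() >= base.size() &&
         path.compare(0, base.size(), base) == 0;
}
```

With a base such as `C:/Temp/1\Rtmp/` and a returned path such as `C:/Temp/1/Rtmp/DESCRIPTION`, `IsSubpath` fails on the raw strings but succeeds once both sides have passed through `NormalizeSeparators`. Which layer should do that normalization is the open question in the report.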





[jira] [Resolved] (ARROW-6649) [R] print() methods for Table, RecordBatch, etc.

2019-09-24 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6649.

Fix Version/s: (was: 1.0.0)
   0.15.0
   Resolution: Fixed

Issue resolved by pull request 5492
[https://github.com/apache/arrow/pull/5492]

> [R] print() methods for Table, RecordBatch, etc.
> 
>
> Key: ARROW-6649
> URL: https://issues.apache.org/jira/browse/ARROW-6649
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Inspired by tibble: show schema, head of data, etc.





[jira] [Commented] (ARROW-6679) [RELEASE] autobrew license in LICENSE.txt is not acceptable

2019-09-24 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937274#comment-16937274
 ] 

Neal Richardson commented on ARROW-6679:


Sorry, I thought this was dealt with adequately in 
https://github.com/apache/arrow/pull/5095 (see discussion). What are the 
options for resolution? Jeroen adds a license file to 
https://github.com/jeroen/autobrew, or we remove the file?

> [RELEASE] autobrew license in LICENSE.txt is not acceptable
> ---
>
> Key: ARROW-6679
> URL: https://issues.apache.org/jira/browse/ARROW-6679
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Wes McKinney
>Priority: Blocker
> Fix For: 0.15.0
>
>
> {code}
> This project includes code from the autobrew project.
> * r/tools/autobrew and dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb
>   are based on code from the autobrew project.
> Copyright: Copyright (c) 2017 - 2019, Jeroen Ooms.
> All rights reserved.
> Homepage: https://github.com/jeroen/autobrew
> {code}
> This code needs to be made available under a Category A license
> https://apache.org/legal/resolved.html#category-a





[jira] [Resolved] (ARROW-6629) [Doc][C++] Document the FileSystem API

2019-09-24 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6629.

Fix Version/s: (was: 1.0.0)
   0.15.0
   Resolution: Fixed

Issue resolved by pull request 5487
[https://github.com/apache/arrow/pull/5487]

> [Doc][C++] Document the FileSystem API
> --
>
> Key: ARROW-6629
> URL: https://issues.apache.org/jira/browse/ARROW-6629
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In ARROW-6622, I was looking for a place in the docs to add a note about path 
> normalization, and I couldn't find filesystem docs at all. 





[jira] [Assigned] (ARROW-6649) [R] print() methods for Table, RecordBatch, etc.

2019-09-24 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6649:
--

Assignee: Neal Richardson

> [R] print() methods for Table, RecordBatch, etc.
> 
>
> Key: ARROW-6649
> URL: https://issues.apache.org/jira/browse/ARROW-6649
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Inspired by tibble: show schema, head of data, etc.





[jira] [Updated] (ARROW-6675) [JS] Add scanReverse function

2019-09-24 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6675:
---
Component/s: JavaScript

> [JS] Add scanReverse function
> -
>
> Key: ARROW-6675
> URL: https://issues.apache.org/jira/browse/ARROW-6675
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: JavaScript
>Reporter: Malcolm MacLachlan
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> * Add scanReverse function to dataFrame and filteredDataframe
> * Update tests





[jira] [Updated] (ARROW-6675) [JS] Add scanReverse function

2019-09-24 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6675:
---
Summary: [JS] Add scanReverse function  (was: Add scanReverse function)

> [JS] Add scanReverse function
> -
>
> Key: ARROW-6675
> URL: https://issues.apache.org/jira/browse/ARROW-6675
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Malcolm MacLachlan
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> * Add scanReverse function to dataFrame and filteredDataframe
> * Update tests





[jira] [Assigned] (ARROW-6532) [R] Write parquet files with compression

2019-09-23 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6532:
--

Assignee: Romain François  (was: Neal Richardson)

> [R] Write parquet files with compression
> 
>
> Key: ARROW-6532
> URL: https://issues.apache.org/jira/browse/ARROW-6532
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain François
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Followup to ARROW-6360. See ARROW-6216 for the C++ side. `write_parquet()` 
> should be able to write compressed files, including with a specified 
> compression level.





[jira] [Assigned] (ARROW-6532) [R] Write parquet files with compression

2019-09-23 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6532:
--

Assignee: Neal Richardson

> [R] Write parquet files with compression
> 
>
> Key: ARROW-6532
> URL: https://issues.apache.org/jira/browse/ARROW-6532
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Followup to ARROW-6360. See ARROW-6216 for the C++ side. `write_parquet()` 
> should be able to write compressed files, including with a specified 
> compression level.





[jira] [Resolved] (ARROW-3817) [R] $ method for RecordBatch

2019-09-23 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-3817.

Fix Version/s: (was: 1.0.0)
   0.15.0
   Resolution: Fixed

Issue resolved by pull request 5459
[https://github.com/apache/arrow/pull/5459]

> [R] $ method for RecordBatch
> 
>
> Key: ARROW-3817
> URL: https://issues.apache.org/jira/browse/ARROW-3817
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Romain François
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-6670) [CI][R] Fix fix for R nightly jobs

2019-09-23 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6670.

Resolution: Fixed

Issue resolved by pull request 5479
[https://github.com/apache/arrow/pull/5479]

> [CI][R] Fix fix for R nightly jobs
> --
>
> Key: ARROW-6670
> URL: https://issues.apache.org/jira/browse/ARROW-6670
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-6670) [CI][R] Fix fix for R nightly jobs

2019-09-23 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6670:
--

 Summary: [CI][R] Fix fix for R nightly jobs
 Key: ARROW-6670
 URL: https://issues.apache.org/jira/browse/ARROW-6670
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.15.0








[jira] [Resolved] (ARROW-6651) [R] Fix R conda job

2019-09-21 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6651.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5461
[https://github.com/apache/arrow/pull/5461]

> [R] Fix R conda job
> ---
>
> Key: ARROW-6651
> URL: https://issues.apache.org/jira/browse/ARROW-6651
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> ARROW-6214 touched the build scripts it uses and now the nightly job is 
> failing.





[jira] [Created] (ARROW-6651) [R] Fix R conda job

2019-09-21 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6651:
--

 Summary: [R] Fix R conda job
 Key: ARROW-6651
 URL: https://issues.apache.org/jira/browse/ARROW-6651
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Neal Richardson
Assignee: Neal Richardson


ARROW-6214 touched the build scripts it uses and now the nightly job is failing.





[jira] [Commented] (ARROW-6576) [R] Fix sparklyr integration tests

2019-09-20 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934847#comment-16934847
 ] 

Neal Richardson commented on ARROW-6576:


Closing; this is out of our hands now. See 
https://github.com/rstudio/sparklyr/pull/2133.

> [R] Fix sparklyr integration tests
> --
>
> Key: ARROW-6576
> URL: https://issues.apache.org/jira/browse/ARROW-6576
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 0.15.0
>
>
> Ticket just for our tracking purposes; the work will be done in the sparklyr 
> repository.
> ARROW-5505 moved a function that sparklyr uses, so the code there should 
> adapt.





[jira] [Resolved] (ARROW-6576) [R] Fix sparklyr integration tests

2019-09-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6576.

Resolution: Fixed

> [R] Fix sparklyr integration tests
> --
>
> Key: ARROW-6576
> URL: https://issues.apache.org/jira/browse/ARROW-6576
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 0.15.0
>
>
> Ticket just for our tracking purposes; the work will be done in the sparklyr 
> repository.
> ARROW-5505 moved a function that sparklyr uses, so the code there should 
> adapt.





[jira] [Created] (ARROW-6649) [R] print() methods for Table, RecordBatch, etc.

2019-09-20 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6649:
--

 Summary: [R] print() methods for Table, RecordBatch, etc.
 Key: ARROW-6649
 URL: https://issues.apache.org/jira/browse/ARROW-6649
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 1.0.0


Inspired by tibble: show schema, head of data, etc.





[jira] [Assigned] (ARROW-3817) [R] $ method for RecordBatch

2019-09-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-3817:
--

Assignee: Neal Richardson

> [R] $ method for RecordBatch
> 
>
> Key: ARROW-3817
> URL: https://issues.apache.org/jira/browse/ARROW-3817
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Romain François
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (ARROW-6640) [C++] Error when BufferedInputStream Peek more than bytes buffered

2019-09-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6640:
---
Component/s: C++

> [C++] Error when BufferedInputStream Peek more than bytes buffered
> --
>
> Key: ARROW-6640
> URL: https://issues.apache.org/jira/browse/ARROW-6640
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Zherui Cao
>Assignee: Zherui Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> An example:
> BufferedInputStream::Peek(10) is called when only 8 buffered bytes remain 
> (buffer_pos is currently 2).
> It will increase the buffer size by 2. In the meantime, buffer_pos will 
> be reset to 0, but it should remain 2.
> Resetting buffer_pos will cause problems.
>  
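
The scenario in the description can be reproduced with a toy model. This is a rough sketch under stated assumptions, not Arrow's `BufferedInputStream`: the class `ToyBufferedReader` and all of its members are hypothetical, and a `std::string` stands in for both the underlying stream and the buffer. It shows the behavior the report asks for, where `Peek` grows the buffer by the shortfall and leaves the read position alone.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Toy stand-in for a buffered stream; not Arrow's BufferedInputStream.
class ToyBufferedReader {
 public:
  ToyBufferedReader(std::string source, std::size_t buffer_size)
      : source_(std::move(source)) {
    Fill(buffer_size);  // initial buffer fill
  }

  // Consume up to n bytes, advancing the read position.
  std::string Read(std::size_t n) {
    std::string out = Peek(n);
    buffer_pos_ += out.size();
    return out;
  }

  // Return up to n bytes without consuming them.
  std::string Peek(std::size_t n) {
    const std::size_t available = buffer_.size() - buffer_pos_;
    if (n > available) {
      // Grow the buffer by the shortfall only. buffer_pos_ is deliberately
      // left unchanged; resetting it to 0 here is the bug described above.
      Fill(n - available);
    }
    return buffer_.substr(buffer_pos_, n);
  }

  std::size_t buffer_pos() const { return buffer_pos_; }

 private:
  // Append up to n more bytes from the source to the end of the buffer.
  void Fill(std::size_t n) {
    buffer_.append(source_, source_pos_, n);
    source_pos_ += std::min(n, source_.size() - source_pos_);
  }

  std::string source_;
  std::string buffer_;
  std::size_t source_pos_ = 0;
  std::size_t buffer_pos_ = 0;
};
```

With a 10-byte buffer and 2 bytes already read, `Peek(10)` finds 8 bytes available, appends 2 more, and returns the 10 bytes starting at position 2. If the position were reset to 0 instead, the same call would return the 2 bytes that had already been consumed.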





[jira] [Resolved] (ARROW-5216) [CI] Add Appveyor badge to README

2019-09-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-5216.

Fix Version/s: (was: 1.0.0)
   0.15.0
   Resolution: Fixed

Issue resolved by pull request 5453
[https://github.com/apache/arrow/pull/5453]

> [CI] Add Appveyor badge to README
> -
>
> Key: ARROW-5216
> URL: https://issues.apache.org/jira/browse/ARROW-5216
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I was trying to see what was running in appveyor and couldn't find it. 
> Krisztián helped me to find 
> [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow], but it 
> would be nice to add the badge to the README next to the Travis-CI one for a 
> quick link to it (as well as showing off build status).
> I was just going to add it myself, but unlike Travis, you can't guess the 
> Appveyor badge URL from the project name because they have a hash in them; 
> only someone with sufficient privileges on the project in Appveyor can get to 
> the settings panel to find the URL.





[jira] [Resolved] (ARROW-6539) [R] Provide mechanism to write out old format

2019-09-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6539.

Resolution: Fixed

Issue resolved by pull request 5420
[https://github.com/apache/arrow/pull/5420]

> [R] Provide mechanism to write out old format
> -
>
> Key: ARROW-6539
> URL: https://issues.apache.org/jira/browse/ARROW-6539
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain François
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> See ARROW-6474. {{sparklyr}} will have the same issue so we should make sure 
> this is supported in R.





[jira] [Updated] (ARROW-6640) [C++] Error when BufferedInputStream Peek more than bytes buffered

2019-09-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6640:
---
Summary: [C++] Error when BufferedInputStream Peek more than bytes buffered 
 (was: [C++]Error when BufferedInputStream Peek more than bytes buffered)

> [C++] Error when BufferedInputStream Peek more than bytes buffered
> --
>
> Key: ARROW-6640
> URL: https://issues.apache.org/jira/browse/ARROW-6640
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Zherui Cao
>Assignee: Zherui Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> An example:
> BufferedInputStream::Peek(10) is called when only 8 buffered bytes remain 
> (buffer_pos is currently 2).
> It will increase the buffer size by 2. In the meantime, buffer_pos will 
> be reset to 0, but it should remain 2.
> Resetting buffer_pos will cause problems.
>  





[jira] [Assigned] (ARROW-6540) [R] Add Validate() methods

2019-09-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6540:
--

Assignee: Romain François

> [R] Add Validate() methods
> --
>
> Key: ARROW-6540
> URL: https://issues.apache.org/jira/browse/ARROW-6540
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain François
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> See ARROW-6174 and ARROW-6177



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6540) [R] Add Validate() methods

2019-09-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6540.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5452
[https://github.com/apache/arrow/pull/5452]

> [R] Add Validate() methods
> --
>
> Key: ARROW-6540
> URL: https://issues.apache.org/jira/browse/ARROW-6540
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> See ARROW-6174 and ARROW-6177



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6533) [R] Compression codec should take a "level"

2019-09-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6533.

Fix Version/s: (was: 1.0.0)
   0.15.0
   Resolution: Fixed

Issue resolved by pull request 5450
[https://github.com/apache/arrow/pull/5450]

> [R] Compression codec should take a "level"
> ---
>
> Key: ARROW-6533
> URL: https://issues.apache.org/jira/browse/ARROW-6533
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See ARROW-6216 for the C++ side.





[jira] [Assigned] (ARROW-6533) [R] Compression codec should take a "level"

2019-09-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6533:
--

Assignee: Romain François

> [R] Compression codec should take a "level"
> ---
>
> Key: ARROW-6533
> URL: https://issues.apache.org/jira/browse/ARROW-6533
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain François
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See ARROW-6216 for the C++ side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6542) [R] Add View() method to array types

2019-09-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6542.

Fix Version/s: (was: 1.0.0)
   0.15.0
   Resolution: Fixed

Issue resolved by pull request 5435
[https://github.com/apache/arrow/pull/5435]

> [R] Add View() method to array types
> 
>
> Key: ARROW-6542
> URL: https://issues.apache.org/jira/browse/ARROW-6542
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain François
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> See ARROW-6048





[jira] [Resolved] (ARROW-6544) [R] Documentation/polishing for 0.15 release

2019-09-20 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6544.

Resolution: Fixed

Issue resolved by pull request 5444
[https://github.com/apache/arrow/pull/5444]

> [R] Documentation/polishing for 0.15 release
> 
>
> Key: ARROW-6544
> URL: https://issues.apache.org/jira/browse/ARROW-6544
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6630) [Doc][C++] Document the file readers (CSV, JSON, Parquet, etc.)

2019-09-19 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6630:
--

 Summary: [Doc][C++] Document the file readers (CSV, JSON, Parquet, 
etc.)
 Key: ARROW-6630
 URL: https://issues.apache.org/jira/browse/ARROW-6630
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: Neal Richardson
 Fix For: 1.0.0







