[jira] [Commented] (ARROW-5502) [R] file readers should mmap

2019-10-11 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949488#comment-16949488
 ] 

Wes McKinney commented on ARROW-5502:
-

Note that we stopped memory mapping by default in {{pyarrow.parquet}}.

> [R] file readers should mmap
> 
>
> Key: ARROW-5502
> URL: https://issues.apache.org/jira/browse/ARROW-5502
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> Arrow is supposed to let you work with datasets bigger than memory. Memory 
> mapping is a big part of that. It should be the default way that files are 
> read in the `read_*` functions. To disable memory mapping, we could use a 
> global `option()`, or a function argument, but that might clutter the 
> interface. Or we could not give a choice and only fall back to not memory 
> mapping if the platform/file system doesn't support it.

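The fallback proposed in the issue description above (memory map by default, read normally only when mapping is not supported) can be sketched roughly as follows. This is a minimal illustration using Python's standard `mmap` module, not the Arrow R or C++ API; the helper name `open_buffer` and the sample file contents are hypothetical.

```python
import mmap
import os
import tempfile


def open_buffer(path):
    """Return (buffer, is_mapped) for the file at `path` (hypothetical helper).

    Tries a read-only memory map first and falls back to an ordinary
    read when the platform or file system refuses to map the file.
    """
    with open(path, "rb") as fh:
        try:
            # Length 0 maps the whole file. Python raises ValueError for an
            # empty file and OSError when the mapping itself fails.
            return mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ), True
        except (ValueError, OSError):
            fh.seek(0)
            return fh.read(), False


with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.csv")
    with open(data_path, "wb") as fh:
        fh.write(b"sepal,petal\n5.1,1.4\n")

    buf, mapped = open_buffer(data_path)
    first_line = bytes(buf[:11])  # copies only the bytes that are requested
    if mapped:
        buf.close()  # release the mapping before the file is deleted

    # An empty file cannot be memory mapped, so the fallback path is taken.
    empty_path = os.path.join(tmp, "empty.csv")
    open(empty_path, "wb").close()
    empty_buf, empty_mapped = open_buffer(empty_path)
```

With this shape the caller never has to choose: the same call works whether or not mapping succeeded, which is the "no option, just fall back" variant mentioned in the description.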


--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5502) [R] file readers should mmap

2019-06-11 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861119#comment-16861119
 ] 

Wes McKinney commented on ARROW-5502:
-

By default, the Parquet C++ library reads from disk only the serialized column 
data that needs to be deserialized. Using memory mapping does indeed avoid 
memory allocation.
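As a rough illustration of that point, the sketch below maps a file and touches only one byte range of it, without reading the rest into a separate buffer. It uses Python's standard `mmap` module rather than the Parquet C++ reader, and the two-chunk file layout is a made-up stand-in for serialized column data, not the Parquet format.

```python
import mmap
import os
import tempfile

# Hypothetical layout: two fixed-size "column chunks" stored back to back.
CHUNK_SIZE = 16
col_a = bytes(range(CHUNK_SIZE))
col_b = b"B" * CHUNK_SIZE

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "columns.bin")
    with open(path, "wb") as fh:
        fh.write(col_a + col_b)

    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing a memoryview of the mapping does not copy any data;
            # the operating system pages in only what is actually accessed.
            with memoryview(mm) as view:
                with view[CHUNK_SIZE:2 * CHUNK_SIZE] as chunk_b:
                    # Copy out just the selected chunk for later use.
                    selected = bytes(chunk_b)
```

The memoryviews are released before the mapping is closed, because Python refuses to close a mapping that still has exported buffers.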

Note that for high-latency file sources (like Amazon S3), where memory 
mapping is not possible, many data warehousing systems have found it more 
efficient to read an entire Parquet row group into memory at a time and discard 
the unused columns. As a matter of performance optimization, we will likely be 
forced to add some reader options to parquet-cpp around this issue.

> [R] file readers should mmap
> 
>
> Key: ARROW-5502
> URL: https://issues.apache.org/jira/browse/ARROW-5502
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 0.14.0
>
>
> Arrow is supposed to let you work with datasets bigger than memory. Memory 
> mapping is a big part of that. It should be the default way that files are 
> read in the `read_*` functions. To disable memory mapping, we could use a 
> global `option()`, or a function argument, but that might clutter the 
> interface. Or we could not give a choice and only fall back to not memory 
> mapping if the platform/file system doesn't support it.





[jira] [Commented] (ARROW-5502) [R] file readers should mmap

2019-06-11 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861100#comment-16861100
 ] 

Neal Richardson commented on ARROW-5502:


Memory mapping would make loading data into memory (before it is copied to R) 
lazy, and it will be necessary for calls like {{read_parquet(f, col_select)}} 
to avoid reading all columns into Arrow before copying to R.

Yes, I believed it was already possible, but that's not a friendly enough 
interface for package users, IMO.





[jira] [Commented] (ARROW-5502) [R] file readers should mmap

2019-06-11 Thread Romain François (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861084#comment-16861084
 ] 

Romain François commented on ARROW-5502:


You can memory map right now, although at this point data is copied to R 
vectors rather than borrowed from the memory-mapped file. We'll need to use 
ALTREP to go further.

 

The file argument of most reading functions may be an instance of 
arrow::io::MemoryMappedFile, which you get by using the mmap_open() function in 
R: 
{code}
library(arrow, warn.conflicts = FALSE)
library(tibble)
tf <- tempfile()
write.csv(iris, tf, row.names = FALSE, quote = FALSE)
f <- mmap_open(tf)
f
#> arrow::io::MemoryMappedFile
tab <- read_csv_arrow(f)
as_tibble(tab)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>
#>  1          5.1         3.5          1.4         0.2 setosa
#>  2          4.9         3            1.4         0.2 setosa
#>  3          4.7         3.2          1.3         0.2 setosa
#>  4          4.6         3.1          1.5         0.2 setosa
#>  5          5           3.6          1.4         0.2 setosa
#>  6          5.4         3.9          1.7         0.4 setosa
#>  7          4.6         3.4          1.4         0.3 setosa
#>  8          5           3.4          1.5         0.2 setosa
#>  9          4.4         2.9          1.4         0.2 setosa
#> 10          4.9         3.1          1.5         0.1 setosa
#> # … with 140 more rows
{code}
Created on 2019-06-11 by the [reprex package|https://reprex.tidyverse.org/] 
(v0.3.0.9000)


