[jira] [Commented] (ARROW-6278) [R] Handle raw vector from read_parquet

2019-08-16 Thread Brendan Hogan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909374#comment-16909374
 ] 

Brendan Hogan commented on ARROW-6278:
--

Sure, feel free to recast this ticket as adding HDFS support.  That would be
interesting to try out.

I will add that the real value of any of this Parquet access will only be
unlocked once Arrow properly supports nested fields, i.e. ARROW-1644.  That
said, I am happy to put in a plug for HDFS support in the meantime.  Thanks.

 

> [R] Handle raw vector from read_parquet 
> 
>
> Key: ARROW-6278
> URL: https://issues.apache.org/jira/browse/ARROW-6278
> Project: Apache Arrow
> Issue Type: New Feature
> Components: R
> Reporter: Brendan Hogan
> Priority: Major
>
> {{read_parquet}} currently handles a path to a local file or an Arrow input 
> stream.  Would it be possible to add support for a raw vector containing the 
> contents of a parquet file?
> Apologies if there is already a way to do this.  I have tried populating a 
> buffer and passing that as input, but that is unsupported as well.  An 
> example of how to work using an input stream would be useful as well.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6278) [R] Handle raw vector from read_parquet

2019-08-16 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909350#comment-16909350
 ] 

Neal Richardson commented on ARROW-6278:


I know there is some support for HDFS in the Arrow C++ library but I don't see 
any R bindings to it yet. Would you mind if I rewrote this ticket to be for 
adding HDFS support to the R package?

It looks like François's suggestion unblocks you for now. You may also consider
syncing the files from HDFS to your local file system and passing the file path
to {{read_parquet}}; if the files are large, that will be much more
memory-efficient than holding the whole file in a raw vector.



[jira] [Commented] (ARROW-6278) [R] Handle raw vector from read_parquet

2019-08-16 Thread Brendan Hogan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909303#comment-16909303
 ] 

Brendan Hogan commented on ARROW-6278:
--

Fair question.  I have Parquet files in HDFS.  I can, of course, open a Spark
session and use {{spark_read_parquet}}, but I am exploring options for
lighter-weight read access.  I can grab the data into a raw vector via WebHDFS
(e.g. [https://mitre.github.io/webhdfs/]).  Hence my interest in calling
{{read_parquet}} on that raw vector.  I'm open to other suggestions here.
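
For illustration only, the WebHDFS route described above might look roughly like the sketch below. It assumes the {{httr}} and {{arrow}} packages are installed, that the cluster allows unauthenticated WebHDFS reads, and that the installed arrow release provides {{BufferReader$create()}}. The helper names {{webhdfs_open_url}} and {{read_parquet_webhdfs}} and the host and path in the usage note are hypothetical and do not come from this thread.

```r
# Build the WebHDFS REST URL that opens a file for reading (op=OPEN).
# Both arguments are placeholders supplied by the caller; paths containing
# special characters would need URL-encoding first.
webhdfs_open_url <- function(namenode_url, hdfs_path) {
  paste0(namenode_url, "/webhdfs/v1", hdfs_path, "?op=OPEN")
}

# Fetch the file's bytes into a raw vector, then read them as Parquet by
# wrapping the raw vector in an in-memory input stream.
# Note: the whole file is held in memory while it is read.
read_parquet_webhdfs <- function(namenode_url, hdfs_path) {
  resp <- httr::GET(webhdfs_open_url(namenode_url, hdfs_path))
  httr::stop_for_status(resp)                    # fail loudly on HTTP errors
  bytes <- httr::content(resp, as = "raw")       # response body as a raw vector
  arrow::read_parquet(arrow::BufferReader$create(bytes))
}
```

A call would then look like {{read_parquet_webhdfs("http://namenode.example:9870", "/data/part-0.parquet")}}, where both arguments stand in for real cluster details.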



[jira] [Commented] (ARROW-6278) [R] Handle raw vector from read_parquet

2019-08-16 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909289#comment-16909289
 ] 

Neal Richardson commented on ARROW-6278:


Thanks. Out of curiosity, why are you trying to do this? Why do you have a 
Parquet file in memory as a raw vector? I'm wondering if there's a better 
solution to your actual problem than extending {{read_parquet}} to read raw 
vectors.



[jira] [Commented] (ARROW-6278) [R] Handle raw vector from read_parquet

2019-08-16 Thread Brendan Hogan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909288#comment-16909288
 ] 

Brendan Hogan commented on ARROW-6278:
--

[~fsaintjacques], yes BufferReader appears to work fine.  Thank you.
{code:java}
> test_br <- BufferReader(test_raw)
> test_df <- read_parquet(test_br)
{code}
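
As a self-contained way to try this pattern, the sketch below writes a small Parquet file to a temporary location, reads all of its bytes into a raw vector, and reads that vector back through a {{BufferReader}}. It assumes a more recent arrow release than the one used in this thread, where {{write_parquet}} is available and the constructor is spelled {{BufferReader$create()}} rather than {{BufferReader()}}. The sample data frame is made up for the example.

```r
library(arrow)

# Write a small data frame to a temporary Parquet file; its bytes stand in
# for a file fetched from somewhere else.
tmp <- tempfile(fileext = ".parquet")
write_parquet(data.frame(x = 1:3, y = c("a", "b", "c")), tmp)

# Read the *entire* file into a raw vector. Using file.size() rather than a
# fixed n avoids truncating files larger than the guess.
test_raw <- readBin(tmp, what = "raw", n = file.size(tmp))

# Wrap the raw vector in an in-memory input stream and read from it.
test_br <- BufferReader$create(test_raw)
test_df <- as.data.frame(read_parquet(test_br))
```

Sizing the read with {{file.size()}} matters because a truncated raw vector is not a complete Parquet file and is likely to fail to parse.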
 



[jira] [Commented] (ARROW-6278) [R] Handle raw vector from read_parquet

2019-08-16 Thread Brendan Hogan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909286#comment-16909286
 ] 

Brendan Hogan commented on ARROW-6278:
--

[~npr], here is an example of what I'm trying to do:
{code:java}
> test_raw <- readBin(system.file("v0.7.1.parquet", package="arrow"),
+                     what = "raw", n = 5000)
> test_df <- read_parquet(test_raw)
Error in UseMethod("parquet_file_reader") :
  no applicable method for 'parquet_file_reader' applied to an object of class "raw"
{code}



[jira] [Commented] (ARROW-6278) [R] Handle raw vector from read_parquet

2019-08-16 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909248#comment-16909248
 ] 

Francois Saint-Jacques commented on ARROW-6278:
---

There's the BufferReader in C++

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/memory.h#L131-L168

which seems to be referenced/reachable in the R bindings:

https://github.com/apache/arrow/blob/master/r/src/io.cpp#L137-L141



[jira] [Commented] (ARROW-6278) [R] Handle raw vector from read_parquet

2019-08-16 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909246#comment-16909246
 ] 

Neal Richardson commented on ARROW-6278:


Could you give an example of the code you have that you'd expect to work?
