[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets

2016-10-17 Thread Frederick Reiss (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583076#comment-15583076 ]

Frederick Reiss commented on ARROW-288:
---

[~bryanc], [~holdenk], and [~yinxusen] are looking into this item on our end. 
We should have an update around the end of this week.

> Implement Arrow adapter for Spark Datasets
> --
>
> Key: ARROW-288
> URL: https://issues.apache.org/jira/browse/ARROW-288
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java - Vectors
>Reporter: Wes McKinney
>
> It would be valuable for applications that use Arrow to be able to 
> * Convert between Spark DataFrames/Datasets and Java Arrow vectors
> * Send / Receive Arrow record batches / Arrow file format RPCs to / from 
> Spark 
> * Allow PySpark to use Arrow for messaging in UDF evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets

2016-09-23 Thread Frederick Reiss (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15517881#comment-15517881 ]

Frederick Reiss commented on ARROW-288:
---

Apologies for my delay in replying here; it's been a very hectic week.

Along the lines of what [~ja...@japila.pl] says above, I think it would be good 
to break this overall task into smaller, bite-size chunks.

One top-level question that we'll need to answer before we can break things 
down properly: Should we use Arrow's Java APIs or Arrow's C++ APIs to perform 
the conversion?

If we use the Java APIs to convert the data, then the "collect Dataset to 
Arrow" path will go roughly like this:
# Determine that the Spark Dataset can indeed be expressed in Arrow format.
# Obtain low-level access to the internal columnar representation of the 
Dataset.
# Convert Spark's columnar representation to Arrow using the Arrow Java APIs.
# Ship the Arrow buffer over the Py4j socket to the Python process as an array 
of bytes.
# Cast the array of bytes to a Python Arrow array.
All these steps will be contingent on Spark accepting a dependency on Arrow's 
Java API. This last point might be a bit tricky, given that the API doesn't 
have any users right now. At the least, we would need to break out some 
testing/documentation activities to create greater confidence in the robustness 
of the Java APIs.

If we use Arrow's C++ API to do the conversion, the flow would go as follows:
# Determine that the Spark Dataset can be expressed in Arrow format
# Obtain low-level access to the internal columnar representation of the Dataset
# Ship chunks of column values over the Py4j socket to the Python process as 
arrays of primitive types
# Insert the column values into an Arrow buffer on the Python side, using C++ 
APIs
Note that the last step here could potentially be implemented against Pandas 
dataframes instead of Arrow as a short-term expedient.

A third possibility is to use Parquet as an intermediate format:
# Determine that the Spark Dataset can be expressed in Arrow format.
# Write the Dataset to a Parquet file in a location that the Python process can 
access.
# Read the Parquet file back into an Arrow buffer in the Python process using 
C++ APIs.
This approach would involve a lot less code, but it would of course require 
creating and deleting temporary files.





[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets

2016-09-22 Thread Jacek Laskowski (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513039#comment-15513039 ]

Jacek Laskowski commented on ARROW-288:
---

I've scheduled a [Spark/Scala 
meetup|http://www.meetup.com/WarsawScala/events/234156519/] for next week and 
found this issue as one we could help with. We have no experience with Arrow, 
but we're quite comfortable with Spark SQL's Datasets.

Could you [~wesmckinn] or [~julienledem] describe the small steps needed 
for the task? They could also just be subtasks of the "umbrella" task. Thanks.



[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets

2016-09-16 Thread Julien Le Dem (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497799#comment-15497799 ]

Julien Le Dem commented on ARROW-288:
-

[~freiss] Hi Frederick, I'll let Wes confirm, but I believe that this JIRA is 
not immediately planned. Help is always welcome, and we're happy to assist with 
reviews and answering questions.



[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets

2016-09-16 Thread Frederick Reiss (JIRA)

[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497618#comment-15497618 ]

Frederick Reiss commented on ARROW-288:
---

[~wesmckinn], are you planning to do this yourself? Would you like some help 
with this task?
