[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets

2016-10-13 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573017#comment-15573017
 ] 

Wes McKinney commented on ARROW-288:


hi [~freiss] and [~jlaskowski] -- we've made big progress on the C++ side 
toward full interoperability with the Arrow Java library. We still need to do 
some integration testing, but it would be great to start exploring the 
technical plan for making this happen. I was just talking with [~rxin] about 
this the other day, so there may be someone on the Spark side who could help 
with this effort, too. 

The first step is to convert a Spark Dataset into one or more Arrow record 
batches, including metadata conversion, and then to convert back. The Java <-> 
C++ data movement itself is a comparatively minor task, since it only involves 
sending a serialized byte buffer through the existing protocol. We can test 
this out in Python using the Arrow <-> pandas bridge, which has already been 
completed. 
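To illustrate the "serialized byte buffer" idea, here is a minimal sketch of 
length-prefixed framing, a common way to move an opaque serialized record 
batch between processes. This is plain Python with stdlib only; the function 
names and the framing layout are hypothetical, not the actual Arrow protocol:

```python
import struct

def frame_message(payload: bytes) -> bytes:
    """Length-prefix a serialized record batch so the receiver
    knows how many bytes to read before the next message starts."""
    return struct.pack("<i", len(payload)) + payload

def unframe_message(buf: bytes) -> bytes:
    """Recover the payload from a framed message."""
    (length,) = struct.unpack_from("<i", buf, 0)
    return buf[4:4 + length]

# The batch contents are opaque to the transport layer.
batch_bytes = b"\x00serialized-record-batch\x00"
assert unframe_message(frame_message(batch_bytes)) == batch_bytes
```

The point is that once both sides agree on the serialized format, the 
transport is just bytes, which is why the Java <-> C++ movement is the easy 
part.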

Let me know if anyone will have the bandwidth to work on this and we can 
coordinate. thanks!

> Implement Arrow adapter for Spark Datasets
> --
>
> Key: ARROW-288
> URL: https://issues.apache.org/jira/browse/ARROW-288
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java - Vectors
>Reporter: Wes McKinney
>
> It would be valuable for applications that use Arrow to be able to 
> * Convert between Spark DataFrames/Datasets and Java Arrow vectors
> * Send / Receive Arrow record batches / Arrow file format RPCs to / from 
> Spark 
> * Allow PySpark to use Arrow for messaging in UDF evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets

2016-09-23 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15517881#comment-15517881
 ] 

Frederick Reiss commented on ARROW-288:
---

Apologies for my delay in replying here; it's been a very hectic week.

Along the lines of what [~ja...@japila.pl] says above, I think it would be good 
to break this overall task into smaller, bite-size chunks.

One top-level question that we'll need to answer before we can break things 
down properly: Should we use Arrow's Java APIs or Arrow's C++ APIs to perform 
the conversion?

If we use the Java APIs to convert the data, then the "collect Dataset to 
Arrow" operation will go roughly like this:
# Determine that the Spark Dataset can indeed be expressed in Arrow format.
# Obtain low-level access to the internal columnar representation of the 
Dataset.
# Convert Spark's columnar representation to Arrow using the Arrow Java APIs.
# Ship the Arrow buffer over the Py4j socket to the Python process as an array 
of bytes.
# Cast the array of bytes to a Python Arrow array.

All these steps will be contingent on Spark accepting a dependency on Arrow's 
Java API. This last point might be a bit tricky, given that the API doesn't 
have any users right now. At the least, we would need to break out some 
testing/documentation activities to create greater confidence in the 
robustness of the Java APIs.
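Step 1 of the flow above amounts to checking every column of the Dataset's 
schema against a type-mapping table. A minimal sketch, with an entirely 
hypothetical mapping from Spark SQL type names to Arrow type names:

```python
# Hypothetical Spark SQL -> Arrow type mapping; the real check would
# consult the actual type systems of both projects.
SPARK_TO_ARROW = {
    "IntegerType": "int32",
    "LongType": "int64",
    "DoubleType": "float64",
    "StringType": "utf8",
    "BooleanType": "bool",
}

def arrow_convertible(schema_fields):
    """Return True if every (name, spark_type) field has an Arrow
    equivalent, i.e. the Dataset is expressible in Arrow format."""
    return all(t in SPARK_TO_ARROW for _, t in schema_fields)

fields = [("id", "LongType"), ("name", "StringType")]
assert arrow_convertible(fields)
assert not arrow_convertible(fields + [("udt", "UserDefinedType")])
```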

If we use Arrow's C++ API to do the conversion, the flow would go as follows:
# Determine that the Spark Dataset can be expressed in Arrow format.
# Obtain low-level access to the internal columnar representation of the 
Dataset.
# Ship chunks of column values over the Py4j socket to the Python process as 
arrays of primitive types.
# Insert the column values into an Arrow buffer on the Python side, using C++ 
APIs.

Note that the last step here could potentially be implemented against Pandas 
dataframes instead of Arrow as a short-term expedient.
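Step 3 of this flow, shipping chunks of primitive column values as raw bytes, 
can be sketched with the stdlib `array` module. The function names are 
hypothetical; the real implementation would sit on the Py4j socket rather 
than pass byte strings around in-process:

```python
from array import array

def column_to_bytes(values, typecode="d"):
    """Pack a chunk of primitive column values (here float64, typecode
    'd') into a raw byte string, as it might travel over the socket."""
    return array(typecode, values).tobytes()

def bytes_to_column(raw, typecode="d"):
    """Rebuild the primitive values on the Python side, ready to be
    inserted into an Arrow (or pandas) buffer."""
    a = array(typecode)
    a.frombytes(raw)
    return a.tolist()

chunk = [1.5, 2.5, 3.5]
assert bytes_to_column(column_to_bytes(chunk)) == chunk
```

One design consequence: because only primitive arrays cross the socket, the 
Arrow-building logic lives entirely on the Python/C++ side, so Spark itself 
takes no Arrow dependency in this variant.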

A third possibility is to use Parquet as an intermediate format:
# Determine that the Spark Dataset can be expressed in Arrow format.
# Write the Dataset to a Parquet file in a location that the Python process can 
access.
# Read the Parquet file back into an Arrow buffer in the Python process using 
C++ APIs.

This approach would involve a lot less code, but it would of course require 
creating and deleting temporary files.
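The temporary-file lifecycle this third approach requires can be sketched as 
follows. CSV stands in for Parquet here purely to keep the example 
self-contained (real Parquet I/O needs the Arrow/Parquet C++ libraries); the 
handoff pattern, write to a shared location, read back, delete, is the same:

```python
import csv
import os
import tempfile

def write_handoff(rows, header):
    """Write rows to a temp file that the Python process can read back.
    CSV is a stand-in for Parquet in this sketch."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(header)
        w.writerows(rows)
    return path

def read_handoff(path):
    """Read the file back and delete it -- the cleanup step that makes
    this approach operationally fiddly."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    os.remove(path)
    return rows

p = write_handoff([["1", "a"], ["2", "b"]], ["id", "name"])
assert read_handoff(p) == [["id", "name"], ["1", "a"], ["2", "b"]]
assert not os.path.exists(p)
```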





[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets

2016-09-22 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513039#comment-15513039
 ] 

Jacek Laskowski commented on ARROW-288:
---

I've scheduled a [Spark/Scala 
meetup|http://www.meetup.com/WarsawScala/events/234156519/] next week and found 
an issue that we could help with. We've got no experience with Arrow, but 
we're quite comfortable with Spark SQL's Datasets.

Could you [~wesmckinn] or [~julienledem] describe the small steps needed for 
the task? They could also just be subtasks of the "umbrella" task. Thanks.



[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets

2016-09-17 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15499623#comment-15499623
 ] 

Wes McKinney commented on ARROW-288:


Yes, we'd happily accept help with this. There is lots of work to do both in 
Arrow (e.g. integration testing / conforming the Java and C++ implementations) 
and Spark.



[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets

2016-09-16 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497799#comment-15497799
 ] 

Julien Le Dem commented on ARROW-288:
-

[~freiss] Hi Frederick, I'll let Wes confirm, but I believe this JIRA is not 
immediately planned. Help is always welcome, and we're happy to assist with 
reviews and to answer questions.



[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets

2016-09-16 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497618#comment-15497618
 ] 

Frederick Reiss commented on ARROW-288:
---

[~wesmckinn], are you planning to do this yourself? Would you like some help 
with this task?
