[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets
[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573017#comment-15573017 ]

Wes McKinney commented on ARROW-288:
---

hi [~freiss] and [~jlaskowski] -- we've made good progress on the C++ side toward full interoperability with the Arrow Java library. We still need to do some integration testing, but it would be great to start exploring the technical plan for making this happen. I was just talking with [~rxin] about this the other day, so there may be someone on the Spark side who could help with this effort, too.

The first step is to convert a Spark Dataset into one or more Arrow record batches, including metadata conversion, and then convert back. The Java <-> C++ data movement itself is a comparatively minor task, because it amounts to sending a serialized byte buffer through the existing protocol. We can test this out in Python using the Arrow <-> pandas bridge, which has already been completed.

Let me know if anyone has the bandwidth to work on this and we can coordinate. Thanks!

> Implement Arrow adapter for Spark Datasets
> ------------------------------------------
>
>                 Key: ARROW-288
>                 URL: https://issues.apache.org/jira/browse/ARROW-288
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Java - Vectors
>            Reporter: Wes McKinney
>
> It would be valuable for applications that use Arrow to be able to
> * Convert between Spark DataFrames/Datasets and Java Arrow vectors
> * Send / Receive Arrow record batches / Arrow file format RPCs to / from Spark
> * Allow PySpark to use Arrow for messaging in UDF evaluation

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets
[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15517881#comment-15517881 ]

Frederick Reiss commented on ARROW-288:
---

Apologies for my delay in replying here; it's been a very hectic week. Along the lines of what [~ja...@japila.pl] says above, I think it would be good to break this overall task into smaller, bite-size chunks.

One top-level question that we'll need to answer before we can break things down properly: should we use Arrow's Java APIs or Arrow's C++ APIs to perform the conversion?

If we use the Java APIs to convert the data, then "collect Dataset to Arrow" will go roughly like this:
# Determine that the Spark Dataset can indeed be expressed in Arrow format.
# Obtain low-level access to the internal columnar representation of the Dataset.
# Convert Spark's columnar representation to Arrow using the Arrow Java APIs.
# Ship the Arrow buffer over the Py4j socket to the Python process as an array of bytes.
# Cast the array of bytes to a Python Arrow array.

All of these steps will be contingent on Spark accepting a dependency on Arrow's Java API. That might be a bit tricky, given that the API doesn't have any users right now. At the least, we would need to break out some testing/documentation activities to create greater confidence in the robustness of the Java APIs.

If we use Arrow's C++ API to do the conversion, the flow would go as follows:
# Determine that the Spark Dataset can be expressed in Arrow format.
# Obtain low-level access to the internal columnar representation of the Dataset.
# Ship chunks of column values over the Py4j socket to the Python process as arrays of primitive types.
# Insert the column values into an Arrow buffer on the Python side, using the C++ APIs.

Note that the last step here could potentially be implemented against pandas DataFrames instead of Arrow as a short-term expedient.
A third possibility is to use Parquet as an intermediate format:
# Determine that the Spark Dataset can be expressed in Arrow format.
# Write the Dataset to a Parquet file in a location that the Python process can access.
# Read the Parquet file back into an Arrow buffer in the Python process using the C++ APIs.

This approach would involve a lot less code, but it would of course require creating and deleting temporary files.
[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets
[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513039#comment-15513039 ]

Jacek Laskowski commented on ARROW-288:
---

I've scheduled a [Spark/Scala meetup|http://www.meetup.com/WarsawScala/events/234156519/] for next week and found this issue, which we could perhaps help with. We have no experience with Arrow, but we're quite comfortable with Spark SQL's Datasets. Could you [~wesmckinn] or [~julienledem] describe the very small steps needed for the task? They could also just be subtasks of the "umbrella" task. Thanks.
[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets
[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15499623#comment-15499623 ]

Wes McKinney commented on ARROW-288:
---

Yes, we'd happily accept help with this. There is lots of work to do both in Arrow (e.g. integration testing / conforming the Java and C++ implementations) and in Spark.
[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets
[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497799#comment-15497799 ]

Julien Le Dem commented on ARROW-288:
---

[~freiss] Hi Frederick, I'll let Wes confirm, but I believe this JIRA is not immediately planned. Help is always welcome. We're happy to assist with reviews and answering questions.
[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets
[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497618#comment-15497618 ]

Frederick Reiss commented on ARROW-288:
---

[~wesmckinn], are you planning to do this yourself? Would you like some help with this task?