[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets
[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583076#comment-15583076 ]

Frederick Reiss commented on ARROW-288:
---
[~bryanc], [~holdenk], and [~yinxusen] are looking into this item on our end. We should have an update around the end of this week.

> Implement Arrow adapter for Spark Datasets
> ------------------------------------------
>
>          Key: ARROW-288
>          URL: https://issues.apache.org/jira/browse/ARROW-288
>      Project: Apache Arrow
>   Issue Type: Bug
>   Components: C++, Java - Vectors
>     Reporter: Wes McKinney
>
> It would be valuable for applications that use Arrow to be able to
> * Convert between Spark DataFrames/Datasets and Java Arrow vectors
> * Send / receive Arrow record batches / Arrow file format RPCs to / from Spark
> * Allow PySpark to use Arrow for messaging in UDF evaluation

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets
[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15517881#comment-15517881 ]

Frederick Reiss commented on ARROW-288:
---
Apologies for my delay in replying here; it's been a very hectic week.

Along the lines of what [~ja...@japila.pl] says above, I think it would be good to break this overall task into smaller, bite-size chunks. One top-level question that we'll need to answer before we can break things down properly: should we use Arrow's Java APIs or Arrow's C++ APIs to perform the conversion?

If we use the Java APIs to convert the data, then the "collect Dataset to Arrow" operation will go roughly like this:
1. Determine that the Spark Dataset can indeed be expressed in Arrow format.
2. Obtain low-level access to the internal columnar representation of the Dataset.
3. Convert Spark's columnar representation to Arrow using the Arrow Java APIs.
4. Ship the Arrow buffer over the Py4j socket to the Python process as an array of bytes.
5. Cast the array of bytes to a Python Arrow array.

All these steps will be contingent on Spark accepting a dependency on Arrow's Java API. This last point might be a bit tricky, given that the API doesn't have any users right now. At the least, we would need to break out some testing/documentation activities to create greater confidence in the robustness of the Java APIs.

If we use Arrow's C++ API to do the conversion, the flow would go as follows:
1. Determine that the Spark Dataset can be expressed in Arrow format.
2. Obtain low-level access to the internal columnar representation of the Dataset.
3. Ship chunks of column values over the Py4j socket to the Python process as arrays of primitive types.
4. Insert the column values into an Arrow buffer on the Python side, using C++ APIs.

Note that the last step here could potentially be implemented against Pandas dataframes instead of Arrow as a short-term expedient.
A third possibility is to use Parquet as an intermediate format:
1. Determine that the Spark Dataset can be expressed in Arrow format.
2. Write the Dataset to a Parquet file in a location that the Python process can access.
3. Read the Parquet file back into an Arrow buffer in the Python process using C++ APIs.

This approach would involve a lot less code, but it would of course require creating and deleting temporary files.
[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets
[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513039#comment-15513039 ]

Jacek Laskowski commented on ARROW-288:
---
I've scheduled a [Spark/Scala meetup|http://www.meetup.com/WarsawScala/events/234156519/] for next week and found this issue, which we could perhaps help with. We've got no experience with Arrow, but we're quite comfortable with Spark SQL's Datasets. Could you, [~wesmckinn] or [~julienledem], describe the very small steps needed for the task? They could also just be subtasks of the "umbrella" task. Thanks.
[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets
[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497799#comment-15497799 ]

Julien Le Dem commented on ARROW-288:
---
[~freiss] Hi Frederick, I'll let Wes confirm, but I believe that this JIRA is not immediately planned. Help is always welcome. We're happy to assist with reviews and answering questions.
[jira] [Commented] (ARROW-288) Implement Arrow adapter for Spark Datasets
[ https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15497618#comment-15497618 ]

Frederick Reiss commented on ARROW-288:
---
[~wesmckinn], are you planning to do this yourself? Would you like some help with this task?