[
https://issues.apache.org/jira/browse/ARROW-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15517881#comment-15517881
]
Frederick Reiss edited comment on ARROW-288 at 9/23/16 11:37 PM:
-
Apologies for my delay in replying here; it's been a very hectic week.
Along the lines of what [~ja...@japila.pl] says above, I think it would be good
to break this overall task into smaller, bite-size chunks.
One top-level question that we'll need to answer before we can break things
down properly: Should we use Arrow's Java APIs or Arrow's C++ APIs to perform
the conversion?
If we use the Java APIs to convert the data, then the "collect Dataset to
Arrow" flow will go roughly like this:
# Determine that the Spark Dataset can indeed be expressed in Arrow format.
# Obtain low-level access to the internal columnar representation of the
Dataset.
# Convert Spark's columnar representation to Arrow using the Arrow Java APIs.
# Ship the Arrow buffer over the Py4j socket to the Python process as an array
of bytes.
# Cast the array of bytes to a Python Arrow array.
All these steps will be contingent on Spark accepting a dependency on Arrow's
Java API. This last point might be a bit tricky, given that the API doesn't
have any users right now. At the least, we would need to break out some
testing/documentation activities to create greater confidence in the robustness
of the Java APIs.
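To make steps 4 and 5 of the Java-based flow concrete, here is a minimal sketch of the "ship bytes, then cast" idea using only the Python standard library. It is not Arrow's actual wire format or API: the `array` object merely stands in for a buffer produced on the JVM side, and the sketch assumes both sides agree on native byte order and on a fixed-width int64 column.

```python
from array import array

# Sender side (stand-in for the JVM): lay out an int64 column as one
# contiguous block of bytes, in the host's native byte order.
column = array("q", [10, 20, 30, 40])
payload = column.tobytes()          # this is what would cross the Py4j socket

# Receiver side (the Python process): reinterpret the received bytes as
# int64 values. memoryview.cast creates a typed view over the same bytes
# rather than converting the values one by one.
view = memoryview(payload).cast("q")

print(len(payload), list(view))
```

The point of the sketch is that the receiving side does no per-value conversion work; whether the real Arrow buffers can be handed over this cheaply depends on how the Python Arrow bindings accept external memory, which would need to be checked.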
If we use Arrow's C++ API to do the conversion, the flow would go as follows:
# Determine that the Spark Dataset can be expressed in Arrow format.
# Obtain low-level access to the internal columnar representation of the Dataset.
# Ship chunks of column values over the Py4j socket to the Python process as
arrays of primitive types.
# Insert the column values into an Arrow buffer on the Python side, using C++
APIs.
Note that the last step here could potentially be implemented against Pandas
dataframes instead of Arrow as a short-term expedient.
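As a rough illustration of the last two steps of the C++-based flow, the sketch below accumulates chunks of primitive values into a contiguous values buffer plus a validity bitmap (least-significant bit first, 1 meaning "valid", which is the convention Arrow uses). `Int64ColumnBuilder` is a hypothetical pure-Python stand-in, not an Arrow class; in the real flow this work would be done by Arrow's C++ builders behind the Python bindings.

```python
from array import array


class Int64ColumnBuilder:
    """Hypothetical stand-in for an Arrow array builder (not a real Arrow API)."""

    def __init__(self):
        self.values = array("q")     # contiguous int64 values buffer
        self.validity = bytearray()  # one bit per slot, LSB first, 1 = valid
        self.length = 0
        self.null_count = 0

    def append_chunk(self, chunk):
        """Append one chunk of values received from the JVM; None marks a null."""
        for value in chunk:
            if self.length % 8 == 0:
                self.validity.append(0)   # start a new bitmap byte
            if value is None:
                self.values.append(0)     # placeholder in the null slot
                self.null_count += 1
            else:
                self.values.append(value)
                self.validity[self.length // 8] |= 1 << (self.length % 8)
            self.length += 1

    def is_valid(self, index):
        return bool(self.validity[index // 8] & (1 << (index % 8)))


builder = Int64ColumnBuilder()
for chunk in ([1, 2, None], [4, None, 6, 7, 8, 9]):   # two chunks off the socket
    builder.append_chunk(chunk)

print(builder.length, builder.null_count, list(builder.values))
```

Compared with the Java-based flow, every value crosses the socket as a plain primitive and is copied once more on the Python side, so the chunk size would likely matter for throughput.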
A third possibility is to use Parquet as an intermediate format:
# Determine that the Spark Dataset can be expressed in Arrow format.
# Write the Dataset to a Parquet file in a location that the Python process can
access.
# Read the Parquet file back into an Arrow buffer in the Python process using
C++ APIs.
This approach would involve a lot less code, but it would of course require
creating and deleting temporary files.
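The temporary-file handling this approach implies could look roughly like the sketch below. `exchange_via_temp_file`, `write_fn`, and `read_fn` are hypothetical names: in the real flow the writer would be Spark's Parquet output and the reader would be the C++ Parquet-to-Arrow path, whereas here JSON stand-ins are used so that the example runs with the standard library alone.

```python
import json
import os
import tempfile


def exchange_via_temp_file(write_fn, read_fn, suffix=".parquet"):
    """Write to a scratch file, read it back, and always clean up afterwards."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "dataset" + suffix)
        write_fn(path)            # producer side (Spark, in the real flow)
        result = read_fn(path)    # consumer side (the Python process)
    # The directory and the file inside it are deleted when the block exits.
    return result, path


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]


def write_rows(path):
    # Stand-in for "write the Dataset to a Parquet file".
    with open(path, "w") as handle:
        json.dump(rows, handle)


def read_rows(path):
    # Stand-in for "read the Parquet file back into an Arrow buffer".
    with open(path) as handle:
        return json.load(handle)


result, used_path = exchange_via_temp_file(write_rows, read_rows)
print(result == rows, os.path.exists(used_path))
```

Tying cleanup to a context manager keeps the temporary files from leaking if the read fails, though it assumes both processes can see the same filesystem location.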
> Implement Arrow adapter for Spark Datasets
> --
>
> Key: ARROW-288
> URL: https://issues.apache.org/jira/browse/ARROW-288
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Java - Vectors
>